WWDC2003 Session 502

Transcript

Kind: captions
Language: en
[Applause]
Good morning, and welcome to Session 502, Power Macintosh G5 System Architecture Overview. My name is Mark Tozer-Vilchez; I am the desktop hardware evangelist for Apple Computer. So, what do you guys think of the Power Macintosh G5? Awesome. Great. It was a great, exciting presentation yesterday, and today we're going to follow that up with a little more in-depth technical information, both about the CPU as well as the system architecture. So without further ado, I'd like to introduce Mr. Peter Sandon, the IBM senior PowerPC processor architect.

Thanks, Mark. Good morning.
As Mark said, I want to describe to you this morning the IBM PowerPC 970 microprocessor. Steve covered it pretty well yesterday, but he left me a few details to fit in, so I'm going to do that. What I'd like to do is provide some details that perhaps you'll find useful in your work with the G5. I'm also going to put in other details that perhaps you won't use directly, but will take advantage of indirectly as you use and work with the G5 processor.

Last fall I gave a high-level overview of the 970 at the Microprocessor Forum, and I'm going to start with several slides from that presentation to give the high-level overview; that's the first two bullets. Secondly, I'll go into details on the several aspects mentioned here. So let me start with some key aspects of the 970.
First, its design was derived from the high-performance POWER4 microprocessor, which is used in IBM's high-end server systems, so the 970 is also a high-performance design. It runs at 2 GHz. It executes instructions with multiple dispatch at a time and multiple issue, and it also executes instructions out of order to a degree that you haven't seen in previous PowerPC processors. The 970 is a full implementation of the 64-bit PowerPC architecture, but it is compatible with, in fact runs natively, 32-bit code. The G5 includes the vector enhancements called the Velocity Engine. It also includes a prefetch engine to reduce memory latency. And finally, the high-speed bus that Steve mentioned yesterday, to off-chip memory and I/O, runs at up to 1 GHz, corresponding to 8 GB per second of peak bandwidth.
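(As a practical aside: the Velocity Engine is usually reached from C through the AltiVec intrinsics in <altivec.h>. A minimal sketch, assuming GCC or a compatible compiler with AltiVec enabled, 16-byte-aligned arrays, and a length that's a multiple of four floats; the function name and layout are illustrative, not from the session:

    #include <altivec.h>

    /* Add two float arrays four elements at a time on the vector unit.
       Assumes 16-byte-aligned pointers and n a multiple of 4. */
    void vadd(float *dst, const float *a, const float *b, int n)
    {
        int i;
        for (i = 0; i < n; i += 4) {
            vector float va = vec_ld(0, a + i);   /* aligned 16-byte load   */
            vector float vb = vec_ld(0, b + i);
            vec_st(vec_add(va, vb), 0, dst + i);  /* 4-wide add, then store */
        }
    }

Each vec_add operates on four single-precision values at once, which is where the vector unit's throughput comes from.)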
This is a block diagram of the 970, showing its major components. I'm going to use this block diagram as a map as we go along and discuss the different components. I won't go through all the text surrounding it here, but I'll cover it as we go along.

Let me start with the instruction pipeline, shown on the left side of the block diagram. The L1 instruction cache is a 64 KB L1, from which eight instructions per cycle can be fetched by the instruction fetch unit, and up to five instructions are fed into the instruction decode unit and then on to dispatch. So, as a group, up to five instructions can be dispatched, up to ten instructions per cycle can be issued to the execution units, and in all, over 200 instructions can be in flight at any one time.

The data pipe, shown on the right side of the diagram, starts with the 32 KB L1 D-cache. The two load/store units below that L1 D-cache move data between the cache and the three register files shown there: the FPR, GPR, and vector register files. The two L1 caches are backed, as shown at the top, by a half-megabyte L2 cache, which in turn is backed by main memory via the BIU.

Continuing on the same diagram, in the middle are shown the memory management arrays that support virtual memory. This is a 64-bit implementation, so effective addresses are 64 bits wide; real addresses are 42 bits wide, for a four-terabyte memory range. Finally, down at the bottom are the computational execution units: the dual fixed-point, the dual floating-point, and the dual vector units, along with the two load/store units that I just mentioned and the branch and condition register units, which aren't shown in the diagram. Those comprise the ten execution units of the processor.
What I want to do now is repeat what I just said, but in a little more detail in certain areas, starting with instruction processing. This is a pipeline diagram showing how instructions move through the processor. Each block here represents a stage where an instruction spends a cycle. Instructions move through the pipeline starting at the top, where they're fetched from the cache, move down through decode and dispatch, then to issue and execution, and finally, at the bottom, come out and complete. The lower part of the diagram shows the individual pipelines of the individual execution units; the upper part just represents the movement of the instructions through fetch and decode.

So I'm going to start at the top, with instruction fetch. Again, instructions are fetched from the 64 KB L1 cache. The instruction fetch address register, the IFAR, shown there, holds the effective address of the next instructions to be fetched. So you can think of the IFAR as, roughly, the program counter, although of course instructions are fetched well ahead of being executed in this deep pipeline. All the caches in the 970 are organized as 128-byte cache lines, but the instruction cache lines are further subdivided into four 32-byte sectors. So it's a sector each cycle that gets fetched from the instruction cache, and therefore, for maximizing performance, it's important to align your branch targets on these 32-byte boundaries to maximize the fetch bandwidth.
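(A hedged, practical aside: there is no portable way to request 32-byte branch-target alignment from C source itself, but GCC exposes it through code-alignment flags; exact spellings and defaults vary by compiler release. A sketch:

    /* Built with something like:
         gcc -O2 -falign-loops=32 -falign-jumps=32 -falign-functions=32 ...
       so that loop tops, branch targets, and function entries land on
       the 32-byte fetch-sector boundaries described above. */
    float dot(const float *a, const float *b, int n)
    {
        float s = 0.0f;
        int i;
        for (i = 0; i < n; i++)   /* the loop top is the branch target */
            s += a[i] * b[i];
        return s;
    }

The flags only ask the compiler to pad code out to those boundaries; whether that pays off depends on how hot the loop is.)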
Once the instructions are fetched, they're put into the 32-instruction fetch buffer shown at the bottom, and then up to five instructions per cycle are removed from the fetch buffer to send off to decode and dispatch. The goal of this part of the hardware is to keep the pipeline below the fetch buffer busy. What could prevent that, for example, is a miss in the I-cache. When the IFAR address is not found in the L1 I-cache, a request goes to the L2 cache; the data is brought back if it's found in the L2 cache, and the fetch stream continues, but that stream is stopped for 12 cycles while that happens. So when that L1 cache miss occurs, the fetch hardware not only goes after the missed cache line, it goes after the next sequential cache line as well, and brings it back into one of the four prefetch buffers shown at the top of the diagram, so that the next time an L1 cache miss occurs, if that address is found in the prefetch buffer, there are only three cycles of fetching missed.

Similarly, in the table at the bottom are shown the latencies when a branch is predicted taken: the branch prediction logic updates the IFAR register with the new address, and there's a two-cycle bubble in the fetch stream. Of course, the point of the fetch buffer is that as you're feeding it, it's starting to fill up, so that when you get these two- or three-cycle bubbles in the fetch stream, you're still able to maintain the stream of instructions down into decode.
Branch processing occurs in two places in the 970: first, branches are predicted as they're fetched from the cache, and second, they are resolved when they get down to the branch execution unit. Why do branch processing, Steve asked yesterday? Because in a deeply pipelined design like this, we're always fetching well ahead of executing. If you had to wait until you execute, that is, until you know the conditions of whether a branch will be taken or not, you would miss opportunities to keep the pipeline full. So what you want to do is predict branches early, and predict them accurately, to avoid those delays.

As instructions are fetched from the cache, they are scanned for branches, and up to two branches per cycle are predicted. There are two branch mechanisms. One predicts the direction a conditional branch will take, and that mechanism uses three branch history tables, which implement two different algorithms for predicting, a local and a global, and a means of selecting between them. The second mechanism is for predicting branches to registers: there's a count cache that's used to predict branch-to-count targets, and a link stack to predict branch-to-link targets. Each of those data structures holds previously seen branch target addresses for later predictions.
So, predictions are made as instructions are fetched. The branch then works its way through decode and dispatch, and finally gets to the execution unit, and now it resolves; that is, now it knows whether the condition was true or false, whether the branch should have been taken or not. If it was predicted correctly, life goes on, life is good. If it was predicted incorrectly, what the branch execution unit does is update the IFAR with the correct branch target address, and it flushes, of course, all the instructions that were behind that branch, because they no longer belong to the correct stream. The delay in that case, to fill the pipe and get it going again, is 12 cycles. So it's that 12-cycle branch penalty that one wants to avoid. This prediction mechanism, over a wide range of applications, tends to be accurate in the mid-90-percent range, so perhaps one out of 20 times a branch will be mispredicted, and very few times will you pay the penalty.

The last bullet here simply points out that this dynamic branch prediction facility can be overridden by software, using an extended branch conditional instruction in the 970 which allows the compiler or the programmer to statically specify that a branch should always be predicted taken, or always not taken.
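(A hedged aside on how you'd reach that from C: with GCC, __builtin_expect is the usual way to tell the compiler which way a branch is likely to go, so it can lay out the likely path as the fall-through and, where the target supports it, emit a static hint. A minimal sketch; the unlikely macro and the xmalloc wrapper are illustrative, not from the session:

    #include <stdlib.h>

    /* Mark a condition as almost never true. */
    #define unlikely(x) __builtin_expect(!!(x), 0)

    void *xmalloc(size_t n)
    {
        void *p = malloc(n);
        if (unlikely(p == NULL))  /* compiler treats this as the cold path */
            abort();
        return p;
    }

As the speaker notes, the dynamic predictor is right about 19 times out of 20 anyway, so hints like this mainly help branches the hardware sees too rarely to learn.)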
Instruction decode is a multi-stage process. Here I'm just going to mention one aspect of instruction decode, as it's different from most previous PowerPCs, and it is as follows. The PowerPC architecture is a RISC-type architecture, and therefore each instruction in general corresponds to one simple operation. However, there are exceptions to that. For instance, the load-with-update instruction corresponds to two simple operations: a load of one register and an update of a separate index register. What the 970 does is crack, as we say, that instruction into two internal ops, and those internal ops then flow through the pipeline. Furthermore, there are more complex instructions, like load multiple, that correspond to a sequence of several operations; those are translated into a microcoded sequence, which then flows through the pipeline.

And finally, in terms of fetch and decode, we get to dispatch. This corresponds to the transition from the fetch and decode stages to the execution stages. It also corresponds to the transition between in-order processing and out-of-order processing of instructions. When instructions reach the dispatch stage, they can be dispatched as a group of up to five instructions, if all of their hardware resources are available. Most instruction types can dispatch out of any of the first four dispatch slots; the fifth dispatch slot is reserved for branch instructions.
Once dispatched, an instruction will take a place in one of these issue queues; the boxes there in the issue queues show how many entries each queue can hold. Once in the issue queue, an instruction can issue, to be executed, if all of its operands are available. So if one instruction is waiting on operands from a cache miss, for example, other instructions behind it can continue to be processed, and it's this massive opportunity for out-of-order execution of instructions that allows the G5 to keep processing even in the presence of the pipeline and memory delays that you run into in the normal course of processing. Finally, once instructions execute, they wait until all of the instructions in their dispatch group are finished, and they complete together, in order.
Just briefly on virtual memory: one of the main features of the memory management unit is its support of address translation for virtual memory. Now, virtual memory is something that makes the programmer's job easier, its programming model is easier, and it makes OS implementations easier, but it actually involves some complexity in the hardware to support it. Briefly, a segmented, paged virtual memory system like this one requires a two-step address translation process. First, an effective address, what you program in, is mapped to a virtual address using a segment table; and second, a virtual address is mapped to a real address, what the hardware understands, using a page table. What's needed to support this two-step process, and then the lookup in the cache, is some sort of hardware optimization to make it efficient.

What's implemented here is the usual TLB, the translation lookaside buffer, which caches page table entries, but also a segment lookaside buffer, new to the 64-bit processor, which caches the segment table entries; this replaces the segment registers of the 32-bit processors. Still, that two-stage translation could be costly, except that we've implemented another level of caching of address translations, called an ERAT. That's the effective-to-real address translation table: it caches the most recent effective-to-real translations, the result of the two-stage process, in a small, fast cache. So what the diagram shows, then, is that the effective address in the IFAR accesses the L1 cache, the L1 directory, and the ERAT all at the same time, and if all goes well, like it usually does, and those all hit, you get the instructions out on the next cycle. Similarly, there's a D-ERAT to go with data cache accesses.

For data processing, just a couple of points to make.
One is on the registers. What the programmer sees is a set of 32 general-purpose registers, a set of 32 floating-point registers, and a set of 32 vector registers; those are the architected registers. What's implemented in the hardware to support those is more registers, for two reasons: there's out-of-order execution, and there are multiple execution units. To handle out-of-order execution, we need a place to put the results that we've executed out of order, until they become the official result and go into the architected register. We call those rename registers, and since there is so much capability for out-of-order execution, there are more rename registers than architected registers. So the 970 has 32 GPRs architected plus 48 renames, for a total of eighty registers, all 64 bits wide. The FPRs, similarly: 32 architected, 48 renames. The vector registers, similarly: 32 architected, 48 renames.

In addition, we've got multiple execution units, and to keep up the supply of data operands to those units, we've duplicated those register files. So there are two exact copies of the 80 GPRs, two exact copies of the 80 FPRs, and so forth; the 32 architected registers we've implemented as 160 registers for each of the register files.

The latencies at the bottom just show load-to-use delays. When you do a load of an operand and then you want to use it, you can issue the load, and then you have to wait some number of cycles to issue the dependent operation. In the case of the fixed-point unit, for example, it's three cycles; floating point is five; and the other values are shown there.
there is a data prefetch facility that
in Hardware initiates data stream
prefetching so the idea is that this
preset Hardware monitors the activity of
the l1 data cache when it sees to mrs.
to two adjacent cache lines it says oh
there's a there's a pattern I'll go
after I'll prefetch the third cache line
in the sequence if a Tennessee is a hit
to that third cache line it'll go after
the fourth line and pre such it into the
l1 and so forth so it's demand paste
which means it'll keep touching ahead
for as long as the data stream is is
accessed cache lines are brought into
the l1 and further ahead they're brought
into the l2 using this mechanism so in
addition to this Hardware initiated
prefetch software can also initiate a
data stream prefetch using an extended
version of the DCB touch instruction the
970 supports this extension of the DCP
touch which allows it to touch not just
one cache line and bring it in but to
start this prefetch mechanism to keep
stretching ahead and and a third
mechanism mechanism for prefetch is the
implementation of the data stream touch
instruction used associated with the
vector extensions the computation units
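(A hedged sketch of what those software prefetch hooks look like from C: GCC's generic __builtin_prefetch lowers to a dcbt-style touch on PowerPC, and the AltiVec intrinsic vec_dst starts one of the data streams just described. The control-word values and the prefetch distance below are illustrative choices, not recommendations from the session:

    #include <altivec.h>

    float sum_with_prefetch(const float *data, int n)
    {
        float acc = 0.0f;
        int i;

        /* Start data stream 0: the control word packs block size
           (in 16-byte units), block count, and byte stride. */
        vec_dst(data, (8 << 24) | (8 << 16) | 128, 0);

        for (i = 0; i < n; i++) {
            __builtin_prefetch(data + i + 32);  /* touch about a cache line
                                                   ahead; a hint, so it cannot
                                                   fault past the end */
            acc += data[i];
        }
        vec_dss(0);  /* stop stream 0 */
        return acc;
    }

Prefetch hints are safe in that sense, but badly tuned streams can evict useful data, so this is something to measure rather than assume.)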
On the computation units at the bottom of the block diagram, I just want to cover what gets executed where. There are two fixed-point units that are nearly symmetrical: they both execute the usual arithmetic, logical, and shift-and-rotate type instructions, and they both also execute multiplies, so you can have two multiplies going at the same time. The difference is that one unit executes the fixed-point divides, while the other unit executes the SPR move instructions.

The two floating-point units are symmetric. They both execute IEEE single- and double-precision operations; they both support the IEEE formats for denorms, NaNs, infinities, and so forth; and they both support precise exceptions. They also both support the optional floating-point instructions for square root, select, reciprocal estimate, and reciprocal square root estimate. They do not support a non-IEEE mode.

Of the two vector units, the first is the vector permute unit, which executes the permute instruction as well as the merge, splat, and pack and unpack instructions; and the second is the vector ALU unit, which has three subunits, one of which executes the vector floating-point instructions, the other two of which execute the vector fixed-point instructions.
the top the l2 and bus interface which
will segue us into the next segment the
memory subsystem has a few subcomponents
itself the cache interface unit shown at
the top takes four types of requests
from the core one from the fetch unit
for I cache misses one from each of the
load/store units for D cache misses and
a fourth one for the TLB hardware table
Walker and the prefetch hardware what
the CIU does is simply direct those
requests to the right place for instance
a an l1 I cache miss will be directed to
the l2 cache
where it'll be looked up if the data is
found it will be returned if the data is
not found the l2 cache controller will
forward it on to the biu and in on to
memory the non cacheable unit on the
left side simply handles all of the
other activity not associated with the
l2 cache that goes off to the bus so
This high-bandwidth processor bus is what we call the elastic interface. It consists of two unidirectional buses, each four bytes wide, point-to-point; it's not a shared bus. It's source-synchronous: the clocks are sent with the data. And I put in this point about initialization alignment: at power-on reset, there's a procedure that the processor and system controller go through to deskew all of the bits on a bus and then to center the clock within the eye of those data bits. My reason for pointing this out is to say that there's a lot of work involved, on both the processor and the system controller side, to get a bus to run at one gigahertz.

The logical interface here supports pipelined, out-of-order transactions. The address and control information shares the same bus as the data. There are three types of command packets, read, write, and control; each of those consists of two 4-byte beats on the bus that contain the 42-bit real address, transaction type, size, other control information, and the tag. Data packets come in sizes from two beats to 32 beats: to send one byte on the bus requires a two-beat packet, and for 128 bytes, the 32-beat packet matches the cache line size of 128 bytes.
On the right, the diagram shows a little more detail about what I called a 4-byte-wide bus. The bus actually consists of three segments: first, the address/data segment, which is actually 35 bits, that's the 32 bits of data plus some control bits; second, a transfer-handshake signal; and third, two signals for snoop responses. The outgoing, with respect to the processor, and incoming buses are shown here; those are the three segments per direction.

Again, just to show an example of a read transaction: the transaction is initiated by the processor by putting a read command packet, up in the upper left corner, out on the address/data out bus, and, to give the end of the story first, out on the other side, to the right, is the data coming back from the memory controller. What's happening in between, without giving a lot of detail, is that there's handshaking going on to acknowledge transfer of information and also to support memory coherency. Again, this is a point-to-point bus, so one processor can't see directly what the other processor is doing; in order to maintain memory coherency, the system controller has to get involved and reflect commands back to all the processors so they can snoop and stay coherent, and that's some of the handshaking you see here. This looks like not very good utilization of the bus; that's because I just isolated the read transaction. Normally, all of this activity would be interleaved with all the other activity on the bus. The other point the numbering shows is the way the bus is managed: there are fixed delays between activity and responses to that activity, and this is how we correlate the handshaking with the original transaction, because things are happening out of order and the snoop responses and the handshakes are not tagged or validated in any way.
Okay, so let me just go over one more time what I've said. This G5 processor is a high-performance processor. It achieves its high performance by running at 2 GHz, by its superscalar completion of five instructions per cycle, and by its out-of-order execution of instructions. It's an implementation that supports both 64-bit and 32-bit applications and operating systems. I've mentioned the width of the pipeline: we can fetch eight, dispatch five, and issue ten instructions every cycle. Also, the branch prediction scheme is highly accurate across a range of applications, so that we avoid that branch penalty I mentioned. We get high computational throughput by using two fixed-point, two floating-point, and two vector units, as well as two load/store units to keep everything busy with data, and also this data prefetch engine, which keeps the effective latency to memory low by keeping things as close in to the processor as possible. And finally, the high-speed bus, which I just mentioned, on the two-gigahertz processor will run at one gigahertz, for eight gigabytes per second of bandwidth to off-chip memory and I/O.

So that's all I have to say. I'd like to thank Mark, and Jesse Stein from IBM, for helping me prepare this presentation, and I'd like to thank you for your attention and your interest in the G5.
Thank you, Peter. And you thought he was only going to answer the branch processing question Steve had. So, to point you to some more information: if you want to get more documents, specifically from the IBM PowerPC page, there are a couple of URLs here available for you, and there are several documents posted there. Later in the presentation I'll give you some more pointers to other references on the Apple site. So, to continue our journey from where IBM handed off the PowerPC 970, the G5 processor, to Apple, and what we then did with the system architecture, I'd like to introduce to you Keith Cox, principal engineer, systems architecture.

Thank you, Mark.
So, Peter told you a little bit about the G5 processor itself. I'm here to tell you more about the system we wrapped around it, and about our vision of bringing that performance out and turning it into real-world performance for your users and your applications.

This is the general block diagram of the Power Mac G5. The thing I want you to get from this is that we started over with this system. We did not take the Power Mac G4 architecture and say, okay, how do we tweak it, we've got to get a little faster. What we said was: we're getting a really cool processor from IBM, it's going to really chew up instructions, it's going to really need data, and we really need to keep this sucker fed. So we started from the ground up; we opened up all the pipes. What I want you to get from my presentation is that not only is this the next-generation PowerPC architecture, but in addition to that, we've added high-bandwidth buses everywhere, we've greatly improved the memory system, we've increased the PCI buses and the I/O system, and on top of that we've added an advanced thermal management system, because we know users like their systems to be quiet; they don't like them whining and roaring like jet airplanes or anything.

So, this is the general block diagram of the Power Mac G5. It's actually very similar, just in blocks, to a G4 block diagram, but there are some important differences to note. The first is that the processor bus is not shared in a multiprocessor system. That's a key difference when you get to MP, with the kind of performance that we have and the kind of bandwidths that we need to be able to deliver to the user. Another important difference is that the system controller's connection to the I/O system is no longer a PCI bus; it's actually a HyperTransport bus that has up to 3.2 gigabytes per second of bandwidth and connects to high-bandwidth devices below the system controller. That's all new.
So if we compare the G4 and the G5 processors: you've just heard from Peter about how the G5 can keep a million things in flight, or at least two hundred and some odd. It runs at two gigahertz and can complete five instructions at a time. It just has a huge appetite; it's a big leap over the G4. The system, similarly, we believe is a big leap over the G4: the frontside bus has six times the bandwidth of a G4 system, and if you've got a multiprocessor system, it actually has twelve times the bandwidth of a G4 system; the memory system is more than two times faster; and the PCI system is seven times the bandwidth. So we've really tried to open up the inside of the system. Let's dig down into all of that in a little more detail.

The frontside bus is eight gigabytes per second. We quote it as double-data-rate 64-bit; as Peter was just showing you, that's not quite correct. It's actually a pretty complicated bus to describe, and so that's what we put in the marketing fluff to describe it, because we really want our users to understand the basic gist of it, which is that it's effectively 64 bits wide of data and 8 gigabytes a second of bandwidth. In reality, that's two 4-gigabyte-per-second channels: 4 gigabytes a second going up and 4 gigabytes a second coming down, on each processor. There's a little bit of overhead for the packet headers and that sort of stuff, so the real achievable bandwidth number is a little smaller than that, but it is close to the 8 gigabytes per second total on that interface. Then if you have two processors, we've got two interfaces, so that's a total of sixteen gigabytes a second: four up, four down, times two processors, to get the full bandwidth.
In order to deal with that, you really need a really high-bandwidth system controller. This was a ground-up redesign at Apple, really intended to achieve these real levels of performance, to be able to deliver these kinds of bandwidths. In addition to moving 16 gigabytes a second of data, there's all the coherency protocol that Peter was just describing: when one processor requests something, you've got to check the other processor, because it may have it modified in its cache. So, you know, Apple's always delivered cache-coherent systems, and we do that here. The G5 implements something called cache intervention as well, which says that if processor one wants a line that processor two has modified in its cache, the system controller actually delivers the data coming out of processor two straight across and back up to processor one, without having to go through the memory system. This does two things. One, it doesn't chew up your valuable memory bandwidth if it doesn't need to. The other thing is that it takes full advantage of the high bandwidth of the processor interfaces to deliver things fast to the other processor, while not really interfering with that processor. You know, yes, it takes a few beats of the bandwidth for processor two to deliver the data, but it had to do that anyway; it had it modified, it owns that data, so it cost it nothing else, and yet we got the lower latency and higher throughput by doing that.
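(A hedged aside for code you write: one place this coherency machinery becomes visible to software is false sharing. With the 970's 128-byte cache lines, two threads updating different variables that happen to share a line will bounce that line between the processors through exactly the path just described. A minimal sketch of padding per-processor data onto its own line; the struct and names are illustrative:

    /* Keep each CPU's hot counter on its own 128-byte cache line
       (the 970's line size) so updates by one processor don't
       invalidate the line the other processor is writing. */
    struct percpu_counter {
        volatile long count;
        char pad[128 - sizeof(long)];
    } __attribute__((aligned(128)));

    static struct percpu_counter counters[2];  /* one per processor */

Without the padding and alignment, both counters would likely land on one line, and every increment on one CPU would force a coherency transaction with the other.)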
In addition, one of the points you're going to hear throughout my talk is that all these links are point-to-point. We're connecting endpoints directly, to get the highest efficiency possible and the lowest latency possible, and really just to make the data scream through the system without bottlenecking at any single point. So, you just heard how the G5 processors can talk directly to each other without interacting with any of the rest of the system. Similarly, the AGP bus has its own direct port into memory, the I/O system through HyperTransport has a port into memory, and each processor has its own individual read and write queues into memory. If you look inside the system controller, if you could open it up, there are actually direct point-to-point links between all the interfaces as well. So we've really tried to avoid the bottlenecks of some system controller designs, where things really get choked up.
If we move on to the memory system, the first thing we did was double the width. I mean, that's the obvious thing: you need more bandwidth, you go wider, you get more bandwidth. In addition, we pushed it up to 400 megatransfers per second, PC3200 DRAM, or whatever label you want to apply. This gives us a total bandwidth of 6.4 gigabytes a second (128 bits, or 16 bytes, times 400 megatransfers per second). That's pretty much state of the art; that's the best you can do with current memory technology without going really extremely wide, which starts to impact your costs in a very negative manner. Going 128 bits wide, you do have to put in two DIMMs, because each DIMM is 64 bits and it takes two to get 128 bits, so you have to install them in pairs.

But one thing you'll see in the Power Mac G5 system that you don't see anywhere else is the depth of our memory system: it's two DIMMs wide by four DIMMs deep, at four hundred megatransfers per second. That, as far as I am aware, is not done anywhere else in the industry. It's actually a great challenge to get four hundred megatransfers per second on four DIMMs that are all connected together to the same memory interface, and that's one of the places where Apple put a lot of engineering: getting the memory speed, the memory width, and the memory depth, so that we can have large memory systems and the customers can get to the eight gigabytes of memory, in eight DIMMs, that we support.
if we move on to the AGP system it's
pretty much a standard AGP 8x AGP 3.0
all buzzword compliance or respect
compliant interface AGP pro is move for
us in our case we support up to 70 watt
ATP cards the AGP prospec has different
levels and it's those different levels
you can start growing your heat sink
into the slot space of the PCI cards so
technically at a 70 watt card the card
vendors allowed to take up two of your
PCI slots it's just heat sink to cool
that so something to be aware of I don't
know that there's much more to say about
that if we move on to the i/o system
If we move on to the I/O system: coming out of the system controller is the last major bus, which is the HyperTransport bus coming down to the PCI-X bridge. HyperTransport describes that as a 16-bit bus, but it's really two 16-bit point-to-point interfaces, one in each direction, similar to the processor bus. So you've got 16 bits up and 16 bits down, running at eight hundred megatransfers per second. And in our implementation, connected to that, you've got a PCI-X bridge with two completely independent PCI-X buses. The PCI-X spec says that if you have one slot, you can run it at 133 MHz; if you have two slots, you can only run it at 100 MHz. So that's what we did: we needed three slots, so we have two buses. This is the bandwidth we get; it's seven times the 64-bit PCI bandwidth of what we've had in our previous systems.
so one thing you might be aware of is on
the two slot bus if you plug in two
cards and one of them flow and one of
them's fast the bus has to run at the
speed of the slowest card so it can
handle the transactions and understand
what's going on so as a configuration
issue maybe if you're designing cards
and documenting how to install them you
should be aware that if you've got two
cards that are fast and one that's slow
you might actually wanna put the slow
card in the single slot because it only
slows down one slot as opposed to
slowing down the other two another thing
to do with pci x the pc i expec dropped
support for five volts pci cards that's
really just a requirement to get the
interface to run at the speeds that it
runs at so what happens is there are
five volt cards they're mostly very old
cards there's not not new 5 volt cards
being designed that I'm aware of her
haven't been aware for a couple of years
most cards nowadays are 3.3 volt
universal cards as they're called those
cards can exist on a 5 volt bus but only
signal at 3.3 volt levels and then of
course standard 3.3 volt pci cards also
signal 3.3 volt levels those two flavors
the 3.3 volt and the universal cards are
fully compatible with pci x so the bus
controller figures out that I've got a
pci card instead of a pci x card and
it's capable or it only runs at 33
megahertz say and it slows down the
clocks on the bus to support that card
likewise there pci-x cards that only run
it 100 megahertz so even if you plug
them into the 133 slots they
won't run 133 because they've reported
the speed that they're capable of if we
If we move on to the I/O controller, it also hangs off HyperTransport: coming out the far side of the PCI-X bridge is another HyperTransport interface. This one's only eight bits wide; it's not 16 bits because it doesn't need to be, is the basic answer. The 8-bit HyperTransport has 1.6 gigabytes a second of bandwidth for I/O. You know, historically, the I/O controllers have had about a hundred megabytes a second, so this is only 16 times that; it was sufficient. We did move the gigabit Ethernet interface and the FireWire interface down into the I/O controller, which works just fine, because it now has plenty of bandwidth to do that. If any of you remember the G4 block diagram, those two functions were in the north bridge, or the system controller, in the G4 system, simply because they couldn't get enough bandwidth off the PCI bus to exist there.

In addition, we've gone to Serial ATA, which is higher bandwidth, well, it's actually roughly equivalent to Ultra ATA/100, but the thing is, now you've got two of them, and the disks are completely independent, as opposed to an Ultra ATA master/slave, where the drives really interact horribly: if you're accessing stuff off one versus the other, you have to wait for one before you get to the other. Here, the drive interfaces are completely independent, so the drives can be run simultaneously at full bandwidth without beating on each other.
A note about the USB 2 controller. I've seen lots of comments and confusion out in the technical community, as well as the user community, about when somebody says USB 2.0: is it really 480 megabits per second, or is it just labeled USB 2.0, and which label do they have, high speed or full speed? Some have been playing games with names, saying they're USB 2.0 when they really still only run at 12 megabits a second. Just to be clear, this implementation is the full 480-megabit-per-second USB 2.0.

Also, we added optical digital audio I/O; we have customers that really like that. And there's analog audio I/O, in and out, as usual. This machine supports Bluetooth, and it also supports AirPort Extreme. Since, as you can see, this enclosure is basically a metal box, it's kind of hard to get an antenna out of that, so there are actually ports on the rear with small, installable antennas that stick out, which either come with the machine or with the Bluetooth or AirPort option when you buy it.

In addition, we put some new ports on the front of the machine: along with the headphone port, we've added a USB port and a FireWire 400 port. That's really for connecting those digital-hub type devices, you know, when you bring in your iPod or your digital camera, something that you plug in and out all the time; it's really just for convenience. And I'm glad to hear that you guys like it, because there was quite a bit of debate about that. It's hard to do, believe me. It sounds simple, but the FCC gets involved, and they like things not to interfere with radio stations and such.
So anyhow, now I'm going to talk a little bit about thermal management in the system. This is one of the places where we really put a lot of thought and a lot of effort, and really wanted to do a good job. Thermal management in some sense is about cooling, but it's really about noise. It's really about: you walk into an office, or, much more important, you walk into that bedroom or office in your home where you've got your computer, and if it's roaring away, it's just a horribly noisy, annoying thing. We implemented sleep a few years ago as one way to help solve that problem, because when you put your machine to sleep, it goes virtually silent. For this machine, we wanted it to be virtually silent while running. Now, that's a challenge, because you've got two of these G5 processors, which have just huge amounts of processing power, and it takes electrical power to do that, which generates heat. We've also got PCI cards, and in some people's systems those can take huge amounts of power if they're doing video processing and that sort of stuff. So managing all this with a least-common-denominator type solution just would not work; the thing would roar like an airplane, and we knew that wasn't acceptable.

So what we've done is break the machine into separate, discrete thermal zones. You can kind of see them coming into the picture on the left there. Let's start at the bottom: that's the power supply, actually hiding under there. It's pretty much hidden from the user, you can't see it below this edge, but there's actually a wall right here in the bottom of the box. The power supply takes in cool air from the front and exhausts hot air out the back. That means it's not preheated by the CPUs, nor does it preheat anything else. The power supply manages itself, and the fact that it's getting cool air means that it does not have to run its fans very fast to keep all of its parts within specification, which has been a challenge for us in the past.
If we go up to the top of the box, that's where the optical drive is and that's where the hard drives are. That zone has its own separate thermal chamber as well: air comes in the front, goes through the box, and comes out the back. In this particular case, we have a temperature sensor mounted up in the corner of the box that constantly monitors exhaust temperature. If the machine moves into a hot room, we need to move a little more air to keep those drives cool. If all of a sudden you're hitting your hard drive hard, it's going to be putting off a lot of power and heating up the air; we see it get hotter, and we turn up the cooling to keep that drive cool. We maintain that zone within spec, but only to the amount you're using it and only to the amount required by your environment. So if you're in a cold room, your machine's quieter; if you're in a hot room, it has to move the air a little faster to keep the machine cool. But, as I say, it's absolutely the minimum required to maintain the machine in its operating state.
If you go down into the next zone, right here you can actually see the kind of dip in the plastic chamber. This guides the airflow over the PCI cards, so rather than all the air running up over the top and out the back as fast as it can, it actually runs through the cards, between the cards, and keeps them cool individually. Given the huge variety of placement options and power configuration options there are in PCI cards, there's no way we can predict, you know, that the card in slot 2 is going to be hot while another card is cool, and we can't put a temperature sensor anywhere to determine how to cool that zone. So instead, we went to actually monitoring the power consumed by all your cards. So if you have a graphics card in there, and that's it, and it's consuming very little power, the fan is going to run at its minimum speed, which is quiet; it's really quiet, you can't hear it. If you have a high-performance NVIDIA card or ATI card that actually is pretty high power, but you're not gaming right now, you're not using that power, it's not being consumed, and the fan still runs at low speed. If you're gaming, yeah, the card starts to get hot, but we just start turning up the fan and keeping it cool, just to the absolute level required to cool the machine. We've got lots of airflow to work with; we don't have to work incredibly hard to cool most of these cards, so until you get to a full PCI configuration, that fan runs relatively slowly.
The most complex zone in the system, of course, is the one that handles the G5. If you notice, there are actually two fans in the front of the box and two fans in the rear of the box: these two right here, and two right back here at the back of the box. Now, I've been watching the web, and people are saying, you know, with nine fans the thing is going to roar. Well, it's actually the exact opposite. As I've been explaining about the other fans, we only cool to the minimum possible, and since we don't preheat any air going from one device to another, everything's getting cold air, so it just takes much less air to cool it. The CPUs have this same philosophy, and the push-pull nature of those fans actually lets us run them slower as well, because the heat sink has a resistance to airflow. As we push air through it, if we didn't have something pulling on the other side, then we'd have to push harder; you would have to run the fan faster. The fan pushing against that pressure is actually what makes it make noise, or a good portion of that noise is the back pressure the fan feels. So by putting the two fans in the push-pull configuration, for a given amount of airflow we're actually much quieter than we would be with a single fan.

In addition, they're paired top and bottom to match up with the CPUs. You can see the lines in the animation: this fan and this fan cool this CPU, for the most part. I mean, there's some cross-coupling, and we call it one zone, but the two pairs of fans are controlled separately. So say you have a multiprocessor machine, and you have one thread that just eats up one CPU while the other CPU is sitting idle: we don't have to turn up the fans on both CPUs; we just turn up the fans on the CPU that's getting hot. In addition, the fans are controlled by the actual temperature of the CPU, so we're sensing the temperature that's important to keep within specification, and once again cooling only the amount required by the CPU.
This brings up another trick that we've got in our back pocket. On PowerBooks, for a few years now, you've seen the options to run them faster or slower, or with automatic switching. Today, in the Power Mac G5, we've added that technology to the G5. What the two-gigahertz machine actually does is this: when you need it, it runs at two gigahertz; when you don't need it, it runs at 1.3 gigahertz, or two-thirds of its full horsepower. Now, in reality, most of the time, for what you're doing, that's plenty. I mean, a 1.3 GHz G5 is a screamer. But you know, there are people running Photoshop and Final Cut Pro renders and all sorts of high-end applications that really chew on the processor; that performance is fully there for them, and it's fully there for you whenever you need it, to run your compiles or whatever. But the thing is, when we can drop that performance by that one-third, we can save roughly sixty percent of the power consumed by the processor itself, and when we put on all our dynamic scaling, we actually get down to about one-sixth of the maximum processor power. So when your machine is sitting there idle in the Finder, it's consuming one-sixth the power, and you have that power available to you any time you need it: it switches in milliseconds and speeds back up. There's actually not even a processing latency hit to speed up and slow down; we continue execution as we go from 1.3 to 2 GHz, and then back down from 2 to 1.3. You just slowly get faster, and slowly get slower if you don't need it.

So that allows us to save a whole lot of power and lets us run the fans incredibly slowly on the processor. In fact, when the machine is idling, we may end up with the fans spinning a little bit, just because, but we don't need to turn them at all. That's the efficiency of the cooling system that we put into the G5 machine: we do not actually have to spin the fans to cool these CPUs when they're sitting there idle in the Finder.
So if you leave your machine and get up to go to the bathroom, it's not doing anything; or you're just sitting there staring at your mail, it's not doing anything, other than giving you time to read and process. Actually, clicking next in your little mail program doesn't take any real horsepower either. So a lot of the stuff that you do, editing source code, for example, doesn't take a lot of the CPU, and when you're in that mode, we're down at one-sixth the power and the fans are hardly spinning at all, if at all. We think that's really important, and it's of real value to our users. One of the messages I just want to get across is that although there are nine fans in there, that's so we can spin them slowly. If you only have one fan, something's probably going to be hot, just because the air has been heated by everything else as it winds its way through the box past all the different heat sources, and you've got to run that fan fast all the time. By putting in all the different fans, controlling the different zones independently, and only as much as they need, we can manage to keep all the fans running slowly as much of the time as possible and keep the whole system quiet.
So I guess, in summary, I'd just like to point out that the real goal of the G5 architecture was to take the G5 processor and wrap a system around it that could allow you guys to deliver the applications, and the performance, to your users that really scream and really make them want to buy more computers. At least, that's my own personal take. But anyhow, what we did was open up all the pipes in the system: we've got the high-bandwidth interfaces from the processors to the system controller, the system controller that connects everything together, and then a high-bandwidth memory system, AGP interface, and I/O system to boot, to deliver everything to everybody that needs it. Thank you very much.
Thank you. In terms of the reference tech notes that we posted, which went live yesterday, there are two important ones here: a tuning-for-G5 practical guide and the PowerPC G5 performance primer. Now, about the presentation that you saw today, regarding the G5 processor from Peter and the system architecture from Keith: I don't want you to leave here thinking, great, Apple delivered this super-fast system, so my application is going to run fast. Yes, it will, but there's a whole lot more performance that you can achieve out of this architecture, and that was the goal. The PowerPC G5 has a lot more to offer than what you see here. We've provided a lot of resources, at the developers conference, online, and, following the developers conference, in the form of developer kitchens that we will have, to help you understand how to unlock that performance in your applications.

There are several sessions that will cover how you do that, so I want to show you just a few of those here on the roadmap. Wednesday, there's a session on performance optimization tools in depth; that's session 506. It's highly recommended: if you are not profiling and analyzing the performance of your application, looking at where your function calls are spending the most time, you are leaving a lot of performance on the table, and you need to be at that session to understand how to optimize your applications for the G5 processor. Session 507, Mac OS X High Performance Math Libraries: our math performance group has worked extensively to tune these libraries specifically for the G5 processor. These are libraries that come in Mac OS X, that will be available in Panther as well as in Jaguar, and that will give you high-performance access to arithmetic functions. Session 304, GCC in Depth, will talk about how, using the compiler, you can set flags appropriate for the G5 processor to, again, unlock that performance.
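(As a hedged example of the kind of thing that session covers, and with the caveat that exact flag spellings vary by compiler release, -mcpu=970 being the later FSF GCC spelling, a G5-oriented compile line might look something like:

    gcc -O3 -mcpu=970 -mtune=970 -maltivec -falign-loops=32 -c mycode.c

Check the session and your compiler's documentation for the flags that apply to the toolchain you're actually using.)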
And then finally, throughout this whole week, and today until midnight, we are holding a G5 optimization lab on the first floor, in the California room. There are 40 systems set up to enable you developers to bring in and work with our engineers on your source code, to understand exactly how to use the tools to profile your application for performance, and what changes you need to make to unlock that performance. Again, the main goal of this lab is not to sit down, take a test drive, and see how fast these dual-processor systems work; it's really to sit down with an engineer and work on your code. Later in the week, there is an ADC compatibility lab at the very end of the labs on the first floor, where I'll have a system if you want to take a look at the insides and just kind of get a feel for the system itself; I'll have a system there for you. But again, the lab itself is really there for you to work on source code and work with our engineers: we have engineers from IBM, and we have engineers from several of Apple's engineering groups. So please take advantage of that. Again, the hours will be today, all day through midnight, and Wednesday, Thursday, and Friday, 9 a.m. to 6 p.m.
Okay, who to contact: if you have questions, or want to follow up on any of the information that you saw today, please contact me via email at tozer@apple.com, and hopefully you'll be hearing from me shortly after the developers conference about kitchens specifically designed to help you, again, optimize your applications for the G5. Thank you very much. We'll start our Q&A.