WWDC2003 Session 502

Transcript

[Applause]

Good morning. Welcome to Session 502, Power Macintosh G5 System Architecture Overview. My name is Mark Tozer-Vilchez; I am the desktop hardware evangelist for Apple Computer. So what do you guys think of the Power Macintosh G5? Awesome? Great. Steve gave a great, exciting presentation yesterday. Today we're going to follow that up with a little more in-depth technical information, both about the CPU and about the system architecture. So without further ado, I'd like to introduce Mr. Peter Sandon, the IBM senior PowerPC processor architect.

Thanks, Mark. Good morning. As Mark said, I want to describe to you this morning the IBM PowerPC 970 microprocessor. Steve covered it pretty well yesterday, but he left me a few details to fill in, so I'm going to do that. What I'd like to do is provide some details that perhaps you'll find useful in your work with the G5. Other details I'm going to put in are ones that perhaps you may not use directly, but will take advantage of indirectly as you use and work with the G5 processor.

Last fall I gave a high-level overview of the 970 at the Microprocessor Forum, and I'm going to start with several slides from that presentation to give the high-level overview; that's the first two bullets. Secondly, I'll go into details on the several aspects mentioned here.

So let me start with some key aspects of the 970. First, its design was derived from the high-performance POWER4 microprocessor, which is used in IBM's high-end server systems, so the 970 is also a high-performance design. It runs at 2 GHz. It executes instructions with multiple dispatch at a time and multiple issue. It also executes instructions out of order, to a degree that you haven't seen in previous PowerPC processors. The 970 is a full implementation of the 64-bit PowerPC architecture, but it is compatible with, and in fact runs natively, 32-bit code. The G5 includes the vector enhancements called the Velocity Engine. It also includes a prefetch engine to reduce memory
latency. And finally, the high-speed bus that Steve mentioned yesterday, to off-chip memory and I/O, runs at up to 1 GHz, corresponding to 8 gigabytes per second of peak bandwidth.

So this is a block diagram of the 970 showing its major components. I'm going to use this block diagram as a map as we go along and discuss the different components. All the text surrounding it I won't go through here, but I'll cover it as we go along.

Let me start with the instruction pipeline, shown on the left side of the block diagram. The L1 I-cache is a 64 KB L1 from which eight instructions per cycle can be fetched by the instruction fetch unit, and up to five instructions are fed into the instruction decode unit and then on to dispatch. So as a group, up to five instructions can be dispatched, up to ten instructions per cycle can be issued to the execution units, and in all, over 200 instructions can be in flight at any one time.

The data pipe, shown on the right side of the diagram, starts with the 32 KB L1 D-cache. The two load/store units below that L1 D-cache move data between the cache and the three register files shown there: the FPR, GPR, and vector register files. The two L1 caches are backed, as shown at the top, by a half-meg L2 cache, which in turn is backed by main memory via the BIU.

Continuing with the same diagram, in the middle are shown the memory management arrays that support virtual memory. This is a 64-bit implementation, so effective addresses are 64 bits wide. Real addresses are 42 bits wide, for a 4-terabyte memory range.

Finally, down at the bottom are the computational execution units: the dual fixed-point, the dual floating-point, and the dual vector units. Along with the two load/store units that I just mentioned and the branch and condition register units, which aren't shown in the diagram, those comprise the 10 execution units of the processor.

So what I want to do now is repeat what I just said, but in a little more detail in certain areas, and
starting with instruction processing.

So this is a pipeline diagram showing how instructions move through the processor. Each block here represents a stage where an instruction spends a cycle. Instructions move through the pipeline starting at the top, where they're fetched from the cache, move down through decode and dispatch, then to issue and execution, and finally at the bottom come out and complete. The lower part of the diagram shows the individual pipelines of the individual execution units. The upper part just represents the movement of the instructions through fetch and decode.

So I'm going to start at the top with instruction fetch. Again, instructions are fetched from the 64 KB L1 cache. The instruction fetch address register, the IFAR shown there, holds the effective address of the next instructions to be fetched. So the IFAR, as it says, you can think of roughly as the program counter, although of course instructions are fetched well ahead of being executed in this deep pipeline.

All the caches in the 970 are organized as 128-byte cache lines, but the instruction cache lines are further subdivided into four 32-byte sectors. So it's a sector each cycle that gets fetched from the instruction cache, and therefore, for maximizing performance, it's important to align your branch targets on these 32-byte boundaries to maximize the fetch bandwidth.

Once the instructions are fetched, they're put into the 32-instruction fetch buffer shown at the bottom, and then up to five instructions per cycle are removed from the fetch buffer to send off to decode and dispatch. So the goal of this part of the hardware is to keep that pipeline busy below the fetch buffer. What could prevent that, for example, is a miss in the I-cache. When the IFAR address is not found in the L1 I-cache, a request goes to the L2 cache, the data is brought back if it's found in the L2 cache, and the fetch stream continues. But that stream is stopped for 12 cycles while that happens. So when that L1 cache miss
occurs, not only does the fetch hardware go after the missed cache line, but it goes after the next sequential cache line as well and brings it back into one of the four prefetch buffers shown at the top of the diagram, so that the next time an L1 cache miss occurs, if that address is found in the prefetch buffer, there are only three cycles of fetching missed.

Similarly, in that table at the bottom are shown these latencies. When a branch is predicted as taken, the branch prediction logic updates the IFAR register with the new address, and there's a two-cycle bubble in the fetch stream. Of course, the point of the fetch buffer is that as you're feeding it, it's starting to fill up, so that when you get these two- or three-cycle bubbles in the fetch stream, you're still able to maintain the stream of instructions down into decode.

Branch processing occurs in two places in the 970. First, branches are predicted as they're fetched from the cache, and second, they are resolved when they get down to the branch execution unit. Why do branch processing, Steve asked yesterday? Because, particularly in a deeply pipelined design like this, we're always fetching well ahead of executing. So if you had to wait until you execute, that is, until you know the conditions of whether a branch will be taken or not, you would miss opportunities to keep the pipeline full. So what you want to do is predict branches early, and predict them accurately, to avoid those delays.

So as instructions are fetched from the cache, they are scanned for branches, and up to two branches per cycle are predicted. There are two branch mechanisms. One is to predict the direction a conditional branch will take, and that mechanism uses three branch history tables, which implement two different algorithms for predicting, a local and a global, and a means of selecting between them. The second mechanism is for predicting branches to registers. So there's a count cache that's used to predict branch-to-count
branch targets and a link stack to predict branch-to-link targets. Each of those data structures holds previously seen branch target addresses for later predictions.

So predictions are made as instructions are fetched. The branch now works its way through decode and dispatch, it finally gets to the execution unit, and now it resolves. That is, now it knows whether the condition was true or false, whether the branch should have been taken or not. If it predicted correctly, life goes on; life is good. If it predicted incorrectly, what the branch execution unit does is update the IFAR with the correct branch target address, and it flushes, of course, all the instructions that were behind that branch, because they now no longer belong to the correct stream. The delay in that case to fill the pipe and get it going again is 12 cycles, so it's that 12-cycle branch penalty that one wants to avoid.

This prediction mechanism, over a wide range of applications, tends to be accurate in the mid-90-percent range. So perhaps one out of 20 times a branch will be mispredicted, and so very few times will you pay the penalty.

The last bullet here simply points out that this dynamic branch prediction facility can be overridden by software using an extended branch conditional instruction in the 970, which allows the compiler or the programmer to statically predict that a branch should always be predicted taken or always not taken.

Instruction decode is a multi-stage process. Here I'm just going to mention one aspect of instruction decode, as it's different from most previous PowerPCs, and it's as follows. The PowerPC architecture is a RISC-type architecture, and therefore each instruction in general corresponds to one simple operation. However, there are exceptions to that. For instance, the load with update instruction corresponds to two simple operations: a load of one register and an update of a separate index register. What the 970 does is it cracks, as we say, that
instruction into two internal ops, and those internal ops then flow through the pipeline. Furthermore, there are more complex instructions, like load multiple, that correspond to a sequence of several operations. Those are translated into a microcoded sequence, which then flows through the pipeline.

And finally, in terms of fetch and decode, we get to dispatch. This corresponds to the transition from the fetch and decode stages to the execution stages. It also corresponds to the transition between in-order processing and out-of-order processing of instructions. So when instructions reach the dispatch stage, they can be dispatched as a group of up to five instructions if all of their hardware resources are available. Most instruction types can dispatch out of any of the first four dispatch slots there, and the fifth dispatch slot is reserved for branch instructions.

Once dispatched, an instruction will take a place in one of these issue queues. The boxes there in the issue queues show how many entries each issue queue can hold. Once in the issue queue, an instruction can issue to be executed if all of its operands are available. And so if one instruction is waiting on operands from a cache miss, for example, other instructions behind it can continue to be processed. It's this massive opportunity for out-of-order execution of instructions that allows the G5 to keep processing even in the presence of the pipeline and memory delays that you run into in the normal course of processing. Finally, once instructions execute, they wait until all of the instructions in their dispatch group are finished, and they complete together, in order.

Just briefly on virtual memory. One of the main features of the memory management unit is the support of address translation for virtual memory. Now, virtual memory is something that makes the programmer's job easier. Its programming model is easier, and it makes OS implementations easier, but it actually involves
some complexities in the hardware to support it. So briefly, a segmented, paged virtual memory system like this one requires a two-step address translation process. First, an effective address, what you program in, is mapped to a virtual address using a segment table. Second, a virtual address is mapped to a real address, what the hardware understands, using a page table. What's needed to support this two-step process, and then the lookup in the cache, is some sort of hardware optimization to make this efficient.

So what's implemented here is the usual TLB, the translation lookaside buffer, which caches page table entries, but also a segment lookaside buffer, new to the 64-bit processor, which caches the segment table entries. This replaces the segment registers of the 32-bit processors. Still, that two-stage translation could be costly, except that we've implemented another level of caching of address translations called an ERAT. That's the effective-to-real address translation table. It caches the most recent effective-to-real translations, the result of the two-stage process, in a small, fast cache.

So what the diagram shows, then, is that the effective address in the IFAR accesses the L1 cache, the L1 directory, and the ERAT all at the same time. If all goes well, like it usually does, and those all hit, you get the instructions out on the next cycle. Similarly, there's a D-ERAT to go with data cache accesses.

For data processing, there are just a couple of points to make. One is on the registers. What the programmer sees is a set of 32 general-purpose registers, a set of 32 floating-point registers, and a set of 32 vector registers. Those are the architected registers. What's implemented in the hardware to support those is more registers, for two reasons: there's out-of-order execution, and there are multiple execution units. To handle out-of-order execution, we need a place to put the results that we've executed out of order until they become the official result and go
into the architected register. We call those rename registers, and since there is so much capability for out-of-order execution, there are more rename registers than architected registers. So the 970 has 32 GPRs architected plus 48 renames, for a total of 80 registers, all 64 bits wide. The FPRs, similarly: 32 architected, 48 renames. The vector registers, similarly: 32 architected, 48 renames.

In addition, we've got multiple execution units, and to keep up the supply of data operands to those units, we've duplicated those register files. So there are two exact copies of the 80 GPRs, two exact copies of the FPRs, and so forth. So the 32 architected registers we've implemented as 160 registers for each of the register files.

The latencies at the bottom just show load-to-use delays. When you do a load of an operand and then you want to use it, you can issue the load, and then you have to wait some number of cycles to issue the dependent operation. In the case of the fixed-point unit, for example, it's three; floating point is five; and the other values are shown there.

The second thing I want to say about the data side is that there is a data prefetch facility that, in hardware, initiates data stream prefetching. The idea is that this prefetch hardware monitors the activity of the L1 data cache. When it sees two misses
to two adjacent cache lines, it says, "Oh, there's a pattern. I'll prefetch the third cache line in the sequence." If it then sees a hit to that third cache line, it'll go after the fourth line and prefetch it into the L1, and so forth. So it's demand paced, which means it'll keep fetching ahead for as long as the data stream is accessed. Cache lines are brought into the L1, and further ahead they're brought into the L2, using this mechanism.

In addition to this hardware-initiated prefetch, software can also initiate a data stream prefetch using an extended version of the dcbt (data cache block touch) instruction. The 970 supports this extension of dcbt, which allows it to touch not just one cache line and bring it in, but to start this prefetch mechanism to keep fetching ahead. And a third mechanism for prefetch is the implementation of the data stream touch instruction associated with the vector extensions.

For the computation units at the bottom of the block diagram, I just want to cover what gets executed where. There are two fixed-point units that are nearly symmetrical. They both execute the usual arithmetic, logical, shift, and rotate type instructions. They both also execute multiplies, so you can have two multiplies going at the same time. The difference is that one unit executes the fixed-point divides, while the other unit executes the SPR move instructions.

The two floating-point units are symmetric. They both execute IEEE single- and double-precision operations. They both support the IEEE formats for denorms, not-a-numbers, infinities, and so forth. They both support precise exceptions. They also both support the optional floating-point instructions for square root, select, reciprocal estimate, and reciprocal square root estimate. They do not support a non-IEEE mode.

In the two vector units, the first is the vector permute unit, which executes the permute instruction as well as the merge, splat, pack, and unpack instructions, and the
vector ALU unit, which has three subunits: one which executes the vector floating-point instructions, and the other two which execute the vector fixed-point instructions.

And finally, at the top, the L2 and bus interface, which will segue us into the next segment. The memory subsystem has a few subcomponents itself. The cache interface unit, shown at the top, takes four types of requests from the core: one from the fetch unit for I-cache misses, one from each of the load/store units for D-cache misses, and a fourth one for the TLB hardware table walker and the prefetch hardware. What the CIU does is simply direct those requests to the right place. For instance, an L1 I-cache miss will be directed to the L2 cache, where it'll be looked up. If the data is found, it will be returned. If the data is not found, the L2 cache controller will forward it on to the BIU and then on to memory. The non-cacheable unit on the left side simply handles all of the other activity, not associated with the L2 cache, that goes off to the bus.

So this high-bandwidth processor bus is what we call the elastic interface. It consists of two unidirectional buses, each four bytes wide. It's point-to-point; it's not a shared bus. It's source synchronous: the clocks are sent with the data. And I put in this point about initialization alignment. At power-on reset, there's a procedure that the processor and system controller go through to deskew all of the bits on a bus and then to center the clock within the eye of those data bits. My reason for pointing this out is to say that there's a lot of work involved, on both the processor and the system controller side, to get a bus to run at 1 GHz.

The logical interface here supports pipelined, out-of-order transactions. The address and control information shares the same bus as the data. There are three types of command packets: read, write, and control. Each of those consists of two 4-byte beats on the bus that contain the 42-bit real address, transaction type,
size, other control information, and the tag. Data packets come in sizes from two 4-byte beats to 32 beats. To send one byte on the bus requires a two-beat packet. For 128 bytes, the 32-beat packet is the cache line size, 128 bytes.

On the right, the diagram shows a little more detail about what I called a 4-byte-wide bus. The bus actually consists of three segments. First, the address/data segment, which is actually 35 bits: the 32 bits of data plus some control bits. Second, there's a single transfer-handshake signal, and third, two signals for snoop responses. And so the outgoing, with respect to the processor, and ingoing buses are shown here, with those three segments per direction.

Again, just to show an example of a read transaction: the transaction is initiated by the processor by putting a read command packet, up in the upper left corner, out on the address/data out bus. I'll give the end of the story first: over on the other side, to the right, is the data coming back from the memory controller. What's happening in between, without giving a lot of detail, is that there's handshaking going on to acknowledge transfer of information and also to support memory coherency. Again, this is a point-to-point bus, so one processor can't see directly what the other processor is doing. In order to maintain memory coherency, the system controller has to get involved and reflect commands back to all the processors so they can snoop and stay coherent, and that's what you see in some of that handshaking.

This looks like not very good utilization of the bus. That's because I just isolated the read transaction. Normally all of this activity would be interleaved with all the other activity on the bus. The other point, which the numbering shows, is that the way the bus is managed, there are fixed delays between activity and responses to activity. This is the way we correlate the handshaking with the original transaction, because things are happening out of order and the snoop
responses and the handshakes are not tagged or validated in any way.

OK, so let me just go over one more time what I've said. This G5 processor is a high-performance processor. It achieves its high performance by running at 2 GHz, by its superscalar completion of five instructions per cycle, and by its out-of-order execution of instructions. It's an implementation that supports both 64-bit and 32-bit applications and operating systems. I've mentioned the width of the pipeline: we can fetch eight, dispatch five, and issue 10 instructions every cycle. Also, the branch prediction scheme is highly accurate across a range of applications, so that we avoid that branch penalty that I mentioned. We get high computational throughput by using two fixed-point, two floating-point, and two vector units, as well as two load/store units to keep everything busy with data, and also this data prefetch engine, which keeps the effective latency to memory low by keeping things as close in to the processor as possible. And finally, the high-speed bus which I just mentioned, on the 2 GHz processor, will run at 1 GHz for an 8-gigabyte-per-second bandwidth to off-chip memory and I/O.

So that's all I have to say. I'd like to thank Mark, and Jesse Stein from IBM, for helping me prepare this presentation, and I'd like to thank you for your attention and your interest in the G5.

Thank you, Peter. And you thought he was only going to answer the branch processing question Steve asked. So, to point you to some more information if you want to get some more documents, specifically from the IBM PowerPC page, there are a couple of URLs here available for you. There are several documents posted there. Later on in the presentation I'll give you some more pointers to other references on the Apple site. So, to continue our journey from where IBM handed off the PowerPC 970, the G5 processor, to Apple, and what we then did with the system architecture, I'd like to introduce to you Keith Cox, principal
engineer, systems architecture.

Thank you, Mark. So Peter told you a little bit about the G5 processor itself. I'm here to tell you more about the system we wrapped around it, and our vision of bringing that performance out and turning it into real-world performance for your users and your applications.

So this is the general block diagram of the Power Mac G5. The thing I want you to get from this is that we started over with this system. We did not take the Power Mac G4 architecture and say, "OK, how do we tweak it? We've got to get a little faster." What we said was, "We're getting a really cool processor from IBM. It's going to really chew up instructions. It's going to really need data. We really need to keep this sucker fed." So we started from the ground up. We opened up all the pipes.

So what I want you to get from my presentation is that not only is this the next-generation PowerPC architecture, but in addition to that, we've added high-bandwidth buses everywhere. We've improved the memory system greatly. We've increased the PCI buses and the I/O system. And on top of that, we've added an advanced thermal management system, because we know users like their systems to be quiet. They don't like them to be loud and roaring like jet airplanes or anything.

So this is the general block diagram of the Power Mac G5. It's actually very similar, just in blocks, to a G4 block diagram, but there are some important differences to note. The first is that the processor bus is not shared in a multiprocessor system. That's a key difference when you get to MP, with the kind of performance that we have and the kind of bandwidth that we need to be able to deliver to the user. Another important difference is that the system controller's connection to the I/O system is no longer a PCI bus. It's actually a HyperTransport bus that has up to 3.2 gigabytes a second of bandwidth and connects to high-bandwidth devices down below the system controller. That's all new.

So if we compare the G4 and the G5 processors,
you've just heard from Peter about how the G5 can keep a million things in flight, or at least two hundred and some odd. It runs at 2 GHz and can complete five instructions at a time. It just has a huge appetite. It's a big leap over the G4. The system, similarly, we believe is a big leap over the G4. The front-side bus has six times the bandwidth of a G4 system. If you've got a multiprocessor system, it actually has twelve times the bandwidth of a G4 system. The memory system is more than two times faster, and the PCI system has seven times the bandwidth. So we've really tried to open up the inside of the system. Let's dig down in a little more detail on all of that.

The front-side bus is 8 gigabytes per second. We quote it as double data rate, 64-bit. As Peter was just showing you, that's not quite correct. It's actually a pretty complicated bus to describe, and so that's what we put in the marketing fluff to describe it, because we really want our users to understand the basic gist of it, which is that it's effectively 64 bits wide of data and it's 8 gigabytes a second of bandwidth. In reality, that's two 4-gigabyte-per-second channels: 4 gigabytes a second going up and 4 gigabytes a second coming down on each processor. There's a little bit of overhead for the packet headers and that sort of stuff, so the real achievable bandwidth number is a little smaller than that, but it is close to the 8 gigabytes per second total on that interface. Then if you have two processors, we've got two interfaces, so that's a total of 16 gigabytes a second: four up, four down, times two processors, to get the full bandwidth.

In order to deal with that, you really need a really high-bandwidth system controller. This was a ground-up redesign at Apple that was really intended to achieve these real levels of performance and to be able to deliver these kinds of bandwidth. In addition to moving 16 gigabytes a second of data, there's all the coherency protocol that Peter was just describing,
where one processor requests something and you've got to check the other processors, since one may have it modified in its cache. Apple has always delivered cache-coherent systems, and we do that here. The G5 implements something called cache intervention as well, which says that if processor one wants a line and processor two has it modified in its cache, the system controller actually delivers the data coming out of processor two straight across and back up to processor one without having to go through the memory system. This does two things. One, it doesn't chew up your valuable memory bandwidth if you don't need to. The other thing is that it takes full advantage of the high bandwidth of the processor interfaces to deliver things fast to the other processor while not really interfering with the other processor. Yes, it takes a few beats of the bandwidth for processor two to deliver the data, but it had to do that anyway. It had it modified, it owns that data, and so it cost it nothing else, and yet we got the lower latency and higher throughput by doing that.

In addition, one of the points you're going to hear throughout my talk is that all these links are point-to-point. We're connecting endpoints directly to get the highest efficiency possible and the lowest latency possible, and really just make the data scream through the system without bottlenecking at any single point. So you just heard how the G5 processors can talk directly to each other without interacting with any of the rest of the system. In reality, the AGP bus has its own direct port into memory, the I/O system through HyperTransport has a port into memory, and each processor has its own individual read and write queues into memory. If you look inside the system controller, if you could open it up, there are actually direct point-to-point links between all the interfaces as well. So we've really tried to avoid the bottlenecks of some system controller designs, where things really get choked up.

If we
move on to the memory system, the first thing we did was double the width. That's the obvious thing: you need more bandwidth, you go wider, you get more bandwidth. In addition, we pushed it up to 400 megatransfers per second, or PC3200 DRAM, or whatever label you want to apply. This gives us a total bandwidth of 6.4 gigabytes a second. That's pretty much state of the art. That's the best you can do with current memory technology without going really extremely wide, which starts to impact your costs in a very negative manner.

Going 128 bits wide, you do have to put two DIMMs wide, because each DIMM is 64 bits; it takes two wide to get 128 bits. So you have to install them in pairs. But one thing you'll see in the Power Mac G5 system that you don't see anywhere else is the depth of our memory system: it's two DIMMs wide by four DIMMs deep at 400 megatransfers per second. That, as far as I am aware, is not done anywhere else in the industry. It's actually a great challenge to get 400 megatransfers per second on four DIMMs that are all connected together to the same memory interface. That's one of the places where Apple put in a lot of engineering, to get the memory speed, the memory width, and the memory depth, so that we can have large memory systems and customers can get to 8 gigabytes of memory in the eight DIMMs that we support.

If we move on to the AGP system, it's pretty much a standard AGP 8x, AGP 3.0, all buzzword-compliant, or rather spec-compliant, interface. AGP Pro is new for us. In our case, we support up to 70-watt AGP cards. The AGP Pro spec has different levels, and at those different levels you can start growing your heat sink into the slot space of the PCI cards. So technically, with a 70-watt card, the card vendor is allowed to take up two of your PCI slots with just the heat sink to cool it. That's something to be aware of. I don't know that there's much more to say about that.

If we move on to the I/O
system, coming out of the system controller is the last major bus, which is the HyperTransport bus coming down to the PCI-X bridge. HyperTransport describes that bus as a 16-bit bus. It's really two 16-bit point-to-point interfaces, one in each direction, similar to the processor bus. So you've got 16 bits up and 16 bits down, running at 800 megatransfers per second in our implementation.

Connected to that, you've got a PCI-X bridge with two completely independent PCI-X buses. The PCI-X spec says that if you have one slot, you can run it at 133 MHz. If you have two slots, you can only run it at 100 MHz. So that's what we did. We needed three slots, so we had two buses. This is the bandwidth we get: it's seven times the 64-bit PCI bandwidth that we've had in our previous systems.

One thing you might want to be aware of is that on the two-slot bus, if you plug in two cards and one of them is slow and one of them is fast, the bus has to run at the speed of the slowest card so it can handle the transactions and understand what's going on. So as a configuration issue, if you're designing cards and documenting how to install them, you should be aware that if you've got two cards that are fast and one that's slow, you might actually want to put the slow card in the single slot, because it only slows down one slot as opposed to slowing down the other two.

Another thing to do with PCI-X: the PCI-X spec dropped support for 5-volt PCI cards. That's really just a requirement to get the interface to run at the speeds that it runs at. There are 5-volt cards, but they're mostly very old cards. There are no new 5-volt cards being designed that I'm aware of, or haven't been for a couple of years. Most cards nowadays are 3.3-volt universal cards, as they're called. Those cards can exist on a 5-volt bus but only signal at 3.3-volt levels. And then, of course, standard 3.3-volt PCI cards also signal at 3.3-volt levels. Those two flavors, the 3.3-volt and the universal cards, are fully
compatible with PCI-X. The bus controller figures out that it's got a PCI card instead of a PCI-X card, and that it only runs at 33 megahertz, say, and it slows down the clocks on the bus to support that card. Likewise, there are PCI-X cards that only run at 100 megahertz, so even if you plug them into the 133-megahertz slot they won't run at 133, because they've reported the speed that they're capable of. If we move on to the I/O system, it also hangs off HyperTransport. Coming out the far side of the PCI-X bridge is another HyperTransport interface. This one's only 8 bits wide; it's not 16 bits because it doesn't need to be, is the basic answer. The 8-bit HyperTransport has 1.6 gigabytes a second of bandwidth for I/O. Historically the I/O controllers had about a hundred megabytes a second, so it's 16 times that, and it was sufficient. We did move the gigabit Ethernet interface and the FireWire interface down into the I/O controller, which works just fine because it now has plenty of bandwidth to do that. If any of you remember the G4 block diagram, those two functions were in the north bridge, or the system controller, in the G4 system, simply because they couldn't get enough bandwidth off the PCI bus to exist there. In addition, we've gone to Serial ATA, which is higher bandwidth, or it's actually roughly equivalent to Ultra ATA/100, but the thing is, now you've got two of them and the disks are completely independent, as opposed to an Ultra ATA master/slave setup, where the drives really interact horribly: if you're accessing stuff off one versus the other, you have to wait for one before you get to the other. Here the drive interfaces are completely independent, so the drives can be run simultaneously at full bandwidth without beating on each other. A note about the USB 2 controller: I've seen lots of comments and confusion out in the technical community, as well as the user community, about when somebody says USB 2.0, is it really 480 megabit 
per second, or is it just USB 2.0? Which label did they have, high speed or full speed, one of those? They're playing games with names and saying they're USB 2.0 when they really still only run at 12 megabits a second. Just to be clear, this implementation is the full 480 megabit per second USB 2.0. Also, we added optical digital audio I/O; we have customers that really like that. Analog audio in and out, as usual. This machine supports Bluetooth and it also supports AirPort Extreme. Since, as you can see, this enclosure is basically a metal box, it's kind of hard to get an antenna out of that, so there are actually ports on the rear with small installable antennas that stick out, which either come with the machine or with the Bluetooth or AirPort option when you buy it. In addition, we put some new ports on the front of the machine. In addition to the headphone port, we've added a USB port and a FireWire 400 port. That's really for connecting those digital-hub type devices, you know, when you bring your iPod or your digital camera, something that you plug in and out all the time. It's really just for convenience, and I'm glad to hear that you guys like it, because there was quite a bit of debate about that. It's hard to do, believe me. It sounds simple, but the FCC gets involved, and they like things not to interfere with radio stations and such. So anyhow, now I'm going to talk a little bit about thermal management in the system. This is one of the places where we really put in a lot of thought and a lot of effort and really wanted to do a good job. Thermal management in some sense is about cooling, but it's really about noise. It's really about when you walk into an office or, much more important, you walk into that bedroom or office in your home where you've got your computer, and if it's roaring away it's just a horribly noisy, annoying thing. We implemented sleep a few years ago as one way to help solve that problem, because when you put your machine to 
sleep it goes virtually silent. For this machine we wanted it to be virtually silent while running. Now that's a challenge, because you've got two of these G5 processors, which have just huge amounts of processing power, and it takes electrical power to do that, which generates heat. We've also got PCI cards that, in some people's systems, can take huge amounts of power if they're doing video processing and that sort of stuff. So managing all this with a least-common-denominator type of solution just would not work; the thing would roar like an airplane, and we knew that wasn't acceptable. So what we've done is we've broken the machine into separate, discrete thermal zones. You can kind of see them coming into the picture on the left there. Let's start at the bottom. That's the power supply, actually hiding under there. It's pretty much hidden from the user, you can't see it below this edge, but there's actually a wall right here in the bottom of the box. The power supply takes in cool air from the front and exhausts hot air out the back. That means it's not preheated by the CPUs, nor does it preheat anything else. The power supply manages itself, and the fact that it's getting cool air means that it does not have to run its fans very fast to keep all its parts within specification, which has been a challenge for us in the past. If we go up to the top of the box, that's where the optical drive is, that's where the hard drives are. That zone has its own separate thermal chamber as well: air comes in the front, goes through the box, and comes out the back. In this particular case we have a temperature sensor mounted up in the corner of the box that monitors exhaust temperature constantly. If the machine moves into a hot room, we need to move a little more air to keep those drives cool. If all of a sudden you're hitting your hard drive hard, it's going to be putting off a lot of power, heating up the air; we see it get hotter, and we turn up the cooling to keep that drive cool. We maintain that 
zone within spec, but only to the amount you're using it and only to the amount required by your environment. So if you're in a cold room, your machine's quieter; if you're in a hot room, it has to move the air a little faster to keep the machine cool. But as I say, it's absolutely the minimum required to maintain the machine in its operating state. If you go down into the next zone, right here, you can actually see a kind of dip in the plastic chamber. This guides the airflow over the PCI cards, so rather than all the air running up over the top and out the back as fast as it can, it actually runs through the cards, between the cards, and keeps them cool individually. Given the huge variety of placement options and power configuration options there are with PCI cards, there's no way we can predict that, say, the card in slot 2 is going to be hot while another card is cool, and we can't put a temperature sensor anywhere to determine how to cool that zone. So instead we went to actually monitoring the power consumed by all your cards. If you have a graphics card in there and that's it, and it's consuming very little power, the fan's going to run at its minimum speed, which is quiet. It's really quiet; you can't hear it. If you have a high-performance NVIDIA card or ATI card that actually is pretty high power, but you're not gaming right now, you're not using that power, it's not being consumed, and the fan still runs at low speed. If you're gaming, yeah, the card starts to get hot, but we just start turning up the fan and keeping it cool, just to the absolute level required to cool the machine. We've got lots of airflow to work with, so we don't have to work incredibly hard to cool most of these cards. Until you get to a full PCI configuration, that fan runs relatively slow. The most complex zone in the system, of course, is the one that handles the G5. If you notice, there are actually two fans in the front of the box and two fans in the rear of the box, these two right here and two right back here 
at the back of the box. Now, I've been watching the web, and people are saying, you know, with nine fans the thing's going to roar. Well, it's actually the exact opposite. As I've been explaining about the other fans, we only cool to the minimum possible, and since we don't preheat any air going from one device to another, everything's getting cold air, so it just takes much less air to cool it. The CPUs have this same philosophy, and the push-pull nature of those fans actually lets us run them slower as well. The heat sink has a resistance to airflow, so as we push air through it, if we didn't have something pulling on the other side, we'd have to push harder, meaning we'd have to run the fan faster. The fan pushing against that pressure is actually what makes it make noise, or a good portion of that noise is actually the back pressure the fan feels. So by putting the two fans in the push-pull configuration, for a given amount of airflow we're actually much quieter than we would be with a single fan. In addition, they're paired top and bottom to match up with the CPUs. You can see the lines in the animation: this fan and this fan cool this CPU, for the most part. There's some cross-coupling, and we call it one zone, but the two pairs of fans are controlled separately. So say you have a multiprocessor machine and you have one thread that just eats the CPU while the other CPU is sitting idle; we don't have to turn up the fans on both CPUs, we just turn up the fans on the CPU that's getting hot. In addition, the fans are controlled by the actual temperature of the CPU, so we're sensing the temperature that's important to keep within specification, and we're once again cooling only the amount required by the CPU. This brings up another trick that we've got in our back pocket. On PowerBooks, for a few years now, you've seen the options to run them faster or slower or with automatic switching. Today, in the Power Mac G5, we've added that 
technology to the G5. What the 2 gigahertz machine actually does is, when you need it, it runs at 2 gigahertz; when you don't need it, it runs at 1.3 gigahertz, or two-thirds of its full horsepower. Now in reality, most of the time, for what you're doing, that's plenty. I mean, a 1.3 gigahertz G5 is a screamer. But there are people running Photoshop and Final Cut Pro renders and all sorts of high-end applications that really chew on the processor. That performance is fully there for them, and it's fully there for you whenever you need it to run your compiles or whatever. But the thing is, when we can drop that performance by that one third, we can save roughly sixty percent of the power consumed by the processor itself, and when we put on all our dynamic scaling, we actually get down to about one sixth of the maximum processor power. So when your machine's sitting there idle in the Finder, it's consuming one sixth of the power that you have available to you. Anytime you need it, it switches in milliseconds and speeds back up, and there's actually not even a processing latency hit to speed up and slow down. We continue execution as we go from 1.3 to 2 gigahertz and then back down from 2 to 1.3; you just slowly get faster, and slowly get slower if you don't need it. That allows us to save a whole lot of power and lets us run the fans incredibly slowly on the processor. In fact, when the machine is idling, we may end up with the fans spinning a little bit just because, but we don't need to turn them at all. That's the efficiency of the cooling system that we put into the G5 machine: we do not actually have to spin the fans to cool these CPUs when they're sitting there idle in the Finder. So if you leave your machine and you get up to go to the bathroom, it's not doing anything. Or if you're just sitting there staring at your mail, it's not doing anything; it takes you time to read and process. Actually clicking next in your little mail program 
doesn't take any real horsepower either. So a lot of the stuff that you do, editing source code for example, doesn't take a lot of the CPU. When you're in that mode, we're down at a sixth of the power and the fans are hardly spinning at all, if at all. We think that's really important, and it's of real value to our users. One of the messages I want to get across is that although there are nine fans in there, that's so we can spin them slowly. If you only have one fan, something's probably going to be hot just because it's been heated by everything else as the air winds its way through the box and all the different heat sources, and you've got to run that fan fast all the time. By putting in all the different fans and controlling the different zones independently, and only as much as they need, we can keep all the fans running slowly as much of the time as possible and keep the whole system quiet. So I guess in summary I'd just like to point out that the real goal of the G5 architecture was to take the G5 processor and wrap a system around it that could allow you guys to deliver the applications and the performance to your users that really screams and really makes them want to buy more computers. At least that's my own personal take. But anyhow, what we did was we just opened up all the pipes in the system. We've got the high-bandwidth interfaces from the processor to the system controller, the system controller that connects everything together, and then a high-bandwidth memory system, AGP interface, and I/O system to boot, to deliver everything to everybody that they need. Thank you very much. Thank you. In terms of reference, the tech notes that we posted went live yesterday. There are two important ones here: Tuning for G5: A Practical Guide, and the PowerPC G5 Performance Primer. Now, the presentation that you saw today regarding the G5 processor from Peter and the system architecture from Keith, I don't want you to leave here thinking, great, Apple delivered this superfast 
system, so my application is going to run fast. And yes, it will, but there's a whole lot more performance that you can achieve out of this architecture, and that was the goal. The PowerPC G5 has a lot more to offer than what you see here. We've provided a lot of resources at the developers conference, online, and following the developers conference in terms of developer kitchens that we will have, to help you understand how to unlock that performance in your applications. There are several sessions that will cover how you do that, so I want to show you just a few of those here on the roadmap. Wednesday there's a session entitled CHUD Performance Optimization Tools in Depth; that's session 506. Highly recommended: if you are not profiling, analyzing the performance of your application, looking at where your function calls are spending the most time, you are leaving a lot of performance on the table. You need to be at this session to understand how to optimize your applications for the G5 processor. Session 507, Mac OS X High Performance Math Libraries: our math performance group has worked extensively to tune these libraries specifically for the G5 processor. These are libraries that come in Mac OS X, that will be available in Panther as well as in Jaguar, and that will give you high-performance access to arithmetic functions. Session 304, GCC in Depth, will talk about how, using the compiler, you can set flags appropriate for the G5 processor to again unlock that performance. And then finally, throughout this whole week, and today until midnight, we are holding a G5 optimization lab on the first floor in the California room. There are 40 systems set up to enable you developers to bring your source code and work with our engineers on it, to understand exactly how to use the tools to profile your application for performance and what changes you need to make to unlock that performance. Again, the main goal of this lab is not to sit down, take a test drive, and see how 
fast these dual-processor systems work; it's really to sit down with an engineer and work on your code. Later in the week there is an ADC compatibility lab at the very end of the labs on the first floor, where I'll have a system if you want to take a look at the insides and just kind of get a feel for the system itself. But again, the lab itself is really geared for you to work on source code with our engineers. We have engineers from IBM, and we have engineers from several of Apple's engineering groups, so please take advantage of that. Again, the hours will be today all day through midnight, and Wednesday, Thursday, and Friday 9 a.m. to 6 p.m. OK, who to contact: if you have questions or want to follow up on any of the information that you saw today, please contact me via email, tozer at apple com, and hopefully you'll be hearing from me shortly after the developers conference on kitchens specifically designed to help you, again, optimize your applications for the G5. Thank you very much. We'll start our Q&A.
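
The bandwidth figures quoted in the system architecture part of this session follow from bus width times transfer rate, and the PCI-X slot advice follows from a shared bus running at the speed of its slowest member. What follows is a minimal Python sketch of that arithmetic, not anything shown in the session. The function names are hypothetical, and two inputs are assumptions the speaker did not state: that each PCI-X bus is 64 bits wide, and that the earlier 64-bit PCI bus ran at 33 megahertz.

```python
def peak_bandwidth_mb_s(width_bits, mega_transfers_per_s):
    """Peak bandwidth in MB/s: bytes per transfer times millions of transfers per second."""
    return width_bits / 8 * mega_transfers_per_s


def shared_bus_mhz(slot_limit_mhz, card_speeds_mhz):
    """A shared bus runs no faster than its slot limit or its slowest card."""
    return min([slot_limit_mhz, *card_speeds_mhz])


# Memory: 128 bits wide at 400 megatransfers per second (figures from the talk).
memory = peak_bandwidth_mb_s(128, 400)  # 6400.0 MB/s, i.e. 6.4 GB/s

# PCI-X: one single-slot bus at 133 MHz plus one two-slot bus at 100 MHz.
# ASSUMPTION: both buses are 64 bits wide.
pcix_total = peak_bandwidth_mb_s(64, 133) + peak_bandwidth_mb_s(64, 100)

# ASSUMPTION: the previous 64-bit PCI bus ran at 33 MHz.
old_pci = peak_bandwidth_mb_s(64, 33)

# Under these assumptions the ratio comes out close to the "seven times" quoted.
pcix_vs_old_pci = pcix_total / old_pci

# Two fast cards and one slow card: compare the two possible placements.
slow_on_shared_bus = shared_bus_mhz(100, [100, 33])  # drags the fast neighbor down
slow_in_single_slot = shared_bus_mhz(133, [33])      # only the slow card is affected

if __name__ == "__main__":
    print(f"memory: {memory} MB/s")
    print(f"PCI-X total vs. old PCI: {pcix_vs_old_pci:.2f}x")
    print(f"two-slot bus with a slow card: {slow_on_shared_bus} MHz")
    print(f"single-slot bus with the slow card: {slow_in_single_slot} MHz")
```

These are peak numbers only; sustained throughput on a real bus is lower because of protocol overhead and arbitration.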