---
title: WWDC2003 Session 507
framework: wwdc
role: article
path: wwdc/wwdc2003-507
---

# WWDC2003 Session 507

## Transcript

Good morning. I'm Mark [unclear], the desktop hardware evangelist in Apple Worldwide Developer Relations. This is session 507, Mac OS X High Performance Libraries. If you're not already familiar with the vector numerics group from the different sessions this year and in previous years, they have been doing a lot of work optimizing performance in the libraries that we ship in our operating system, as well as working with several of our developers and our own application groups. If you've used iTunes, a lot of the performance we achieve in that application is a result of this group. Today we offer some more new information. Yesterday the group also introduced vImage, the new vector image processing library. Today I'd like to introduce Steve Peters, who will go over these new libraries and in particular how the 970, the G5 processor, takes advantage of them. Thank you.

Thank you, Mark. Welcome, good morning. It's a wonderful time to be doing mathematics on the Macintosh, and I'd like to tell you a little bit about why I think so. One more time... there we go.

By way of introduction, I wanted to lay out what I think our charter is in the numerics and vectorization group. First of all, we want your apps to achieve optimum floating-point performance on our platform. We're in a great position to help you with that today. We also want to make sure that your apps deliver robust, high-quality numerics. It's been a tradition at Apple since the SANE days back on the 68K to deliver really the best in numerical algorithms and numerical science. We continue to keep the bar high and aim to set it even higher. We want your apps to readily port to industry-standard APIs in our Mac OS X frameworks. This is a leverage point: we take advantage of open source software in building these frameworks, and we pay a great deal of attention to optimization in them. And finally, we want your apps to enjoy easy access to the
Apple value-added features such as AltiVec in our platform. Where we can transparently give you access to those features, we do, and you'll see a little later how that works out.

So, what you'll learn today. We'd like to inform you about the floating-point performance work we're delivering for G5 in Panther. We'll show you how to leverage that work in your development. We'll walk you through a regimen that I've been using to squeeze out the last bit of floating-point performance in the libraries; you might find that helpful in your applications. And we'll recommend some learning resources.

This is where in many presentations you'll see the obligatory layer cake diagram with one piece cut out: the technology framework placement. Being a bit of a contrarian, I decided, well, we actually touch more than one place on that layer cake. Math performance depends on the silicon, the floating-point cores. It depends on the frameworks that we deliver to you. And it relies ever more on the performance tools that help you optimize your apps, and that's the CHUD toolkit.

So let's review what's been done over the last year since we last met at WWDC. What did we deliver in Jaguar? libm, our standard C math library. libm is centered on the following design principles. First, that we should be standards conforming: you should see the C99 floating-point function entry points on our platform just the way you do on everybody else's, and that's certainly true for doubles in Jaguar. Our algorithms are numerically robust. In libm it's really the central point that our core approximations should deliver correctly rounded results; the functions that depend on those should be very close, with errors only in the last place. We bring you best-of-breed algorithms. These are modern algorithms, nothing from Numerical Recipes here, folks, and matched to the hardware.

How do you code to this stuff? If you're writing in C,
it couldn't be simpler: include math.h. If you have special needs (and you know who you are) to touch the floating-point environment, you can include fenv.h, and otherwise compile and go. libm is part of the libSystem umbrella that's automatically linked on the cc line, so you need say no more.

From time to time we get questions and phone calls that say, "Gee, is it possible for me to go even faster than what you've delivered in libm?" The answer is yes, but you have to make some compromises, and it's been our position that we're not making those compromises in the libm we deliver. But we can clue you in about what you need to do. libm emphasizes numerical robustness; if you're willing to compromise on that, you might be able to go a little faster. libm uses the standard C99 APIs, which turn out to data-starve our FPUs. Typically a call like sin presents the math library with a single argument, and our side of the contract is to return a single value. That ends up just leaving bubbles all the way through the floating-point pipes. If you can provide more data through an API, it's quite possible to produce results fast and furiously out the other end. libm is obliged to handle all the special cases, NaNs, infinities, plus and minus zeros, as needed, and libm is obliged to preserve, protect and defend the state of the floating-point flags. So there's a fair amount of detail work that goes on in the library. If you're willing to compromise on that stuff, you can go faster.

So here's a homely little example. A very respected developer came to us and said, "Hey, your hypotenuse function is way slow. I can write a one-liner that goes much faster than what you guys do." So here you see the Pythagorean algorithm, a little kiddie Pythagorean algorithm: return the square root of x squared plus y squared. And sure enough, you run this function on a triangle whose one leg is length three, the other leg is length four, and the hypotenuse is five. Well, let's sort of scale this thing up. I don't know, is
this the size of the universe as we understand it these days? In any case, the result of applying this function to these arguments is not 5e160; it's infinity. So there's some intermediate overflow here, right? This developer made some compromises. Probably over the range of arguments that he was interested in, this function worked just fine. But libm takes on all comers, and we're obliged by standards to do something different. And here's... it's just like my clicker at home after the kids have gotten to it. There we go. So this is the source for hypot. All right, there's a lot more going on here than that simple one-liner. Those with sharp eyes, unlike mine, may notice there are some comparisons up front, a division halfway down, finally that square root, and a rescaling and some work on the environment. So those are the kind of details, and the lengths that libm goes to, to meet standards and robustness requirements. If you can compromise on those, you might be able to go faster. So that's the end of a long aside.

Also delivered in Jaguar: vecLib. We view vecLib as a one-stop shopping place for math performance. We deliver digital signal processing: 1D and 2D, real and complex FFTs. The BLAS, levels one, two and three; these are based on the open source software ATLAS, Automatically Tuned Linear Algebra Software. It actually is a code generation system. Back at Cupertino before every release I grind through ATLAS; it generates code for linear algebra matched to the processor. We get extraordinary performance from this, and for certain of the entry points it's actually SMP aware, so you can go multiprocessor transparently. If the processor's there, it goes ahead and uses it. It's a wonderful, wonderful thing. We also delivered LAPACK for solving linear systems and eigenvalue problems, again open source software. We delivered tuned 4 by 4, 8 by 8, 16 by 16 and 32 by 32 matrix multiplies: matrix-matrix, matrix-vector, vector-matrix.
These are for folks who know that they're dealing with very specific sizes and want to go really fast. These are basically completely unrolled loops, and on the single-precision side they hit AltiVec very, very hard. And so that's the next point: wherever AltiVec is appropriate and available to use, vecLib tries to go to AltiVec.

How do you code to vecLib using C? Include vecLib/vecLib.h; that's the framework header. You might want to take a peek in there and see all the other header files that get included: vBLAS.h, clapack.h, vBigNum.h, etc. From there you can fan out and find interfaces that you'd like to code to, and then just add -framework vecLib to the cc line. Well, we announced this week that there's an umbrella framework called Accelerate that will collect all of our high-performance math, and so in Panther you could just as well, and it's probably the right migration path, link against -framework Accelerate. You'll get the same stuff.

So again that same question comes up every once in a while: can I go faster than the subroutine that I'm calling in vecLib? Well, if you've chosen the appropriate entry point, we think the answer is no, probably not. We squeeze these things pretty hard and will continue to do so.

So let's look at Panther. What's new in Panther for libm? Well, clearly we've tuned for the 970. We've opened up all the core algorithms, recast them to exploit the two FPUs and the two LSUs, and paid very careful attention to the way the instructions are issued to the machine so that we achieve maximum parallelism. And we watch out for load/store issues; you'll see one of those come up a little later. And at long last we have hardware square root. This should be the end of the "your square root is way too slow" complaints. That's transparent: if you're on a G5 and you call square root as a subroutine, you'll get the hardware square root. On other platforms we'll dive into the polynomial approximations. There's also an opportunity to inline square root; there are compiler flags now,
if you're compiling and you know you're on G5, you can inline the square root to great, great advantage.

So it turns out, having done all this work, we went back and said, well, now how is this going to go on the older models, the G4s? Are we going to have to maintain two copies of the library? Turns out not, at least for this code. Doing these kinds of things to tune for the 970 also helped us a little bit on G4: a few percent, not a whole lot, but we're faster these days on G4 as well. Your mileage may vary. It's worth experimenting before you contemplate having multiple modules or multiple plugins; go look and see if the things you do for G5 don't also help you on G4.

So I talked a little bit on the last slide about careful construction of issue groups, dispatch groups, on the 970. I wanted to put up one slide and simply remark that the issue is this: we've got a four-issue machine, five if we've got a branch. If we're going to try to get, let's say, floating-point multiply-adds issued simultaneously on the two FPUs, we need to make sure that each instruction is fed into the issue queue appropriate to the unit, and that means placing the instruction appropriately as the machine forms dispatch groups. CHUD is a champ in showing us where we need to line up our instructions a little bit differently. The simulators, the simg4 and simg5 tools, are also essential to do that. That's probably enough said about that; there's a trace later on where we can return to this.

So how did we do when we opened up our core algorithms, did this work, and moved on to the G5? Here are your favorite libm functions. Here are the cycles, the number of machine cycles that invocations of these functions for typical arguments took on the 7455, the G4, the high-end G4 model. Here's what we've done on the 970. Here's what our competition publishes for the P4. And in general I think you'll see we're dominating the competition. Even through clock scaling, we're going to
go faster than the fastest P4 you can buy. The only question is around square root, where it's neck and neck. If you're able to inline, you can get the first square root out in about 40 cycles and subsequent ones in 35 cycles, and you can do that on two FPUs. If you need to call the subroutine library, we pick up a little dynamic linking overhead and can only go essentially on one; it's one of those data-starved situations, and we only go through one FPU, in 52 cycles. So let's call that a draw, maybe, unless you inline two at a time, and then it's not so much of a draw.

All right, so turning to the other big component of our mathematics efforts, vecLib. What's new in Panther? Double-precision math performance. We've all been waiting for real good double-precision engines; we've now got two, and vecLib takes great advantage of those. Our DSP routines have been tuned in double precision for the 970. BLAS has been tuned and has really, I think, just phenomenal performance on the 970. The scheme we use, for example, for matrix multiply derives from ATLAS. Given a very large matrix, say a thousand by a thousand elements, it gets broken down into much smaller pieces, 64 by 64 in our case. That data gets moved into cache for each of the operands, A times B resulting in C, and then we cut loose with what's called the matmul kernel. That guy takes the data, which is lying in cache, runs it through the floating-point units, develops the output argument, and repeats. In that matmul kernel we're able to sustain about 3.4 floating-point ops per clock cycle. That's eighty-four percent of the peak available on the machine, and it measures out at just about 6.7 gigaflops at two gigahertz. When we look at the larger problem, with reassembling the 64 by 64 block results and the overhead of pulling all the stuff into the cache, we get DGEMM performance on a thousand by thousand matrix that runs from memory, from RAM (a thousand by a
thousand matrix has to sit in RAM, after all) at 2.4 flops per clock. It's about sixty percent of peak and measures out at about 4.8 gigaflops on a single 2.0 gigahertz processor. We've also added double-precision 4 by 4 and 8 by 8 special-size matrix routines, all tuned for the 970. They achieve similar, if not in some cases slightly better, performance than the DGEMM numbers. We've completely unrolled all those loops, and the matrices are small enough that they sit in the cache all the time. It just goes like crazy. And finally, we've goosed LAPACK a bit. If you're a fan of the singular value decomposition, you'll be happy to know we go really fast on SVDs now, and expect in Panther that LAPACK will be thread-safe. Our [unclear] guys are happy about that. I love it.

Okay, so the results are in: FFTs are really fast on this box. These are some slides giving the performance of 1D and 2D, real and complex FFTs going to the vector unit. Smaller numbers here, in microseconds, are faster, so we want to be underneath the competition here. And one after another... I'll have to get my six-year-old out here. Here we go: complex 2D. All right, it's nice to be on the G5.

All right, how about our linear algebra performance? The industry standard is LINPACK 1000, which is the solution of a linear system, a thousand by a thousand. We use the ATLAS-based techniques; those are blocked in the way I told you before, moving 64 by 64 chunks in and out of the cache and going like mad with the kernel. Let's just focus on the first number, the one that probably most folks are familiar with. So our DLP, that's double-precision LINPACK 1000, the supercomputer benchmark: 2.64 gigaflops. We think that's PC industry leading. There we go.

So here's a pitch following on from yesterday's vImage session. There's a new umbrella framework in Panther called Accelerate: one-stop shopping for all your math and image processing needs. -framework Accelerate gets you all our digital signal
processing stuff, the linear algebra I've described, a vector version of selected entry points in libm (that's single precision), some large-number arithmetic support, and, new, vImage, the image processing library. Moving forward, coding to -framework Accelerate is the right way to go. Additional math stuff is going to end up in there, so it's the way of the future.

Okay, so actually I want to go back for a moment, and it's about how we leverage, right? Use Accelerate and leverage Apple's work. Here's the point of leverage for you folks: by selecting the right API, you get the advantage of our efforts getting these things fast.

So that's the segue into: what if you've got code that you'd like to go faster that really doesn't fit into any of the work we've done? How do you go about doing that? First, profile. You really need to know where your code is spending the time, and where to spend your effort. It's actually a smart business decision. Profiling on the 970 is very interesting. This is a machine that keeps hundreds of instructions in flight and has long and deep pipelines. Very often I've looked at a trace and had the aha experience: this wasn't where I expected the time to be spent; let's arrange to tune for that. It's remarkably often interesting. For a rough cut, you can use the command-line tool called sample, but for the real action you want to use Shark. That's part of the CHUD toolkit. It will zero in on your hot spots and let you make really rapid progress. It's been key in our development. And so I'd like to introduce Eric Miller, who will come up and talk a little bit about the CHUD tools, if I can work this.

So anyway, I am Eric Miller with the architecture and performance group. Just a quick thing about the CHUD tools: there was a big session yesterday, I believe it was number 506... there we go, microphone. Can you hear me now? Quick thing about the CHUD tools: there was a session yesterday, I believe the number was 506, is that correct? And so that'll be on
the DVD: a very lengthy demonstration with Shark. I think you should take advantage of that and use Shark as much as you can. But there are a couple of other CHUD tools. It's a suite, so there are probably nine tools to go through. We leverage the low-level performance monitor counters inside the hardware and in the operating system; we put the ones in the operating system in there just for this purpose. You can find the problems and improve your code with Shark and MONster and all. The best thing about the CHUD tools is they're free. We have an FTP site. I urge you, when you get your developer tools CD and there's a CHUD package, install that package and immediately update, because we drop an update every several days while we're in our beta period, and once we have a gold master it'll probably be on a semi-weekly basis. There'll be fixes and improvements.

So we're introducing the CHUD tools 3.0. Shark was formerly called Shikari in versions 2.0 and 2.5, and as Steve mentioned, it does instruction-level profiling. It can also help you tune for the instruction dispatch grouping that the 970 uses. MONster is a spreadsheet for performance events: things like cache misses, instructions completed, instructions dispatched, and what kind of utilization you're getting out of the various floating-point, vector and integer units in the processor. Saturn can actually take advantage of the PMCs, and it does call graph visualization with additional information about what types of performance events are occurring in a particular function in your code. We have several tracing tools, and Steve will probably talk to some of this stuff. amber is the most important tool because it actually collects a trace from your application of every single instruction that's executed on the processor. You take this trace and you use that as input to acid, simg4 and, once it comes out, simg5. You can also take advantage of the CHUD framework, which allows you to
actually instrument your applications to directly monitor performance events or control the profiling tools from inside your code, so you can bracket important pieces of code.

So I've mentioned performance counters many times already; just a couple of slides on what they are. They're a set of special-purpose registers that exist in the processor and memory controller, and in the OS we have virtual performance counters. These can be accessed by software, obviously, so we've created the CHUD tools to do that for you automatically. There are actually user versions of some of the performance monitor counters that you can access from user code, but in general, to set them up and drive them requires supervisor-level code like the kernel, so we put stuff in there. As I mentioned earlier, they count performance events, and page faults are one of the operating system metrics. There's quite a number of virtual memory counters available in the operating system performance counters that you might want to take advantage of for a very high-level look at what's going on with your disk access, virtual memory and physical memory access in the operating system.

So Steve has alluded to Shark; that's the icon in the upper right there. The nice thing about Shark is you can profile over time, and with the CHUD tools the profiling interval can be as small as 50 microseconds per sample or as large as a second. Or you can use events: you can take a sample every so many cycles, or every so many instructions completed. You capture everything with Shark from the kernel up through your application, so you can know exactly where time is being spent and in what frameworks time is being spent. And as I mentioned, the overhead is very low: it's about 40 to 50 microseconds per sample that you can expect us to spend taking data from the performance counters. The other nice thing about Shark is there's automated analysis. Not only will it show you where
you're spending time, it'll try to explain to you why time is being spent there and give you opportunities to try to alleviate that bottleneck. Steve also mentioned, again, that we use static analysis to construct the theoretical dispatch groups on the 970, and this is all demonstrated quite nicely in the 506 session, which unfortunately was yesterday. You can save and review all the sessions, which of course should always have been there, but it's new in 3.0; we didn't have save and review for our tools before, except as text output. And also, neatly, there's a command-line version, so you can script your applications. If you have a large suite of test programs, or your application is a command-line tool, you can actually script or launch your tool with it: you can say shark, and then the rest of the statement on the command line would be your tool's command-line name and arguments. Shark will launch your app, instrument before and after your application runs, and give you the normal output, which you can review in the graphical user interface.

So there are a couple of pictures of it. On the left you can see the heavy tree trace, in this particular case with square root on top, and you can see some of the types of output you can get. There's some source code in the right picture, and some of the Shark commentary that comes in the form of these little exclamation points in the column toward the right of the window.

I'll just briefly touch on MONster. Again, same thing: timed intervals, event counts. All the CHUD tools have a hotkey, even the command-line tools. If you're on the console, you can hit Option-Escape to launch Shark's profiler and Option-Escape to toggle it off again. MONster uses Control-Escape, so they can both be on the system at the same time. The hotkey is kind of neat: even in a command-line tool it'll just sit there waiting for you to start profiling, and you don't have to go into the shell and
type something. You can just do it from anywhere, wherever your app is; you can go ahead and start it. So the big thing about MONster is shortcuts. You can take these performance monitor event counts and combine them together with a simple four-function calculator notation, so you can take cache misses and cycles and compute cache misses per cycle or cycles per cache miss, whichever way you want to do that ratio. You can also use the memory controller counters and compute bandwidth by collecting all the transactions and multiplying by a certain value that represents bytes per transaction, for reads and writes and that sort of thing. There are several shortcuts that are predefined for each CPU, but you can also make your own ratios, proportions and percentages to print out, and they come out in the tabular columns of MONster, and then you can chart that data. Same thing: save and review sessions, command-line version. There's a picture. Some shortcuts have been highlighted, those purple columns on the left, and then you graph them by pressing the Draw Chart button. These are percentages of load/store instructions with regard to all the instructions that were collected in the trace.

Saturn records your function call history and instruments all your code by using the GCC instrumentation flags; very similarly, CodeWarrior has some instrumentation flags that will put a prologue and epilogue in every function. Once you have those prologues and epilogues, the data can be collected. You can see a typical call tree in the top half of the screen, and then you can get the picture of the call tree: the call depth is vertical, and the time the call was executing is horizontal. So if you have long, spiky calls, you want to try to alleviate those issues. You can collect call counts, that's great, but you can also collect PMC event counts and see those things and what kind of duration they had. So
and we have these instruction tracing tools I mentioned. acid is kind of nice as a quick pass. It's sort of simg4 or simg5 lite: you can just collect trace statistics. So you use amber and collect an instruction trace that's accurate, then you run acid on it. You can get these pieces of data out of acid very readily, in maybe one or two screens in the terminal, whereas simg4 and simg5 are very cycle-accurate simulators for the respective processors, and it takes some learning, which Steve will get into, to understand their output.

Using the CHUD framework, you can instrument your source code, start and stop the other graphical user interface tools, and also directly read and write the performance counters in your code. There's also the HTML reference guide, which is generated every time we do a build of the framework; the HTML is updated with any new things we put in the prologues.

So, a quick example. You always initialize CHUD whenever you use this stuff. You acquire remote access, and you tell Shark that you want to use remote access. Then you start Shark and give it a label, then your important function executes, then you stop it and release the remote access so another thread or another client can use it. Shark will automatically profile your important function, and your important function only, and you'll get the results in the GUI. A slightly longer example, where you actually set the counters explicitly: clear them, start them, then your important work happens, then you stop the counters, and then you take the results. It returns arrays of double-precision floating-point values; if there are six counters, there will be six entries, 0 through 5, in the output arrays. You take those out of the arrays and then do whatever you want to do to present them to yourself, maybe log them or chart them. This thing is a little finicky.

So how do you get CHUD? The easiest way is it's on the developer tools CD, but updates will come directly from the
web. There's an updater that will run automatically the first time you install CHUD, and thereafter there are preferences: you can check the status of the CHUD package for new updates hourly, daily, weekly or monthly. We do have internal guys that do check hourly, and they get upset when a couple of hours go by and there are no new updates. So the best way to get in contact with our team is to use chud-tools-feedback at group.apple.com. And with that I will turn it back over to Steve.

Excellent. Thank you, Eric. I think I'm going to go with the keyboard. Yes, much better. Let me just remark: Eric has covered the astonishing capabilities of CHUD, but don't think you have to do anything really special. The most you need to do, really, is learn to use the hotkey. If you haven't been down to the performance lab, the G5 performance lab, with your app, please do come down. If you find me there, I'm likely to be a pest and hover over your shoulder, wait till your app is running and grinding the CPU, and say, "Hey, can I start CHUD?" It's basically: start the app, hit the hotkey, wait a few seconds, and then look at the sample. People have been dropping their jaws at what they see and how simple it is. [unclear] I'll take a question. [inaudible] Good question: you don't even need that, and we can show you why. So people are quite astonished to see how quickly it happens and the almost immediate insight they get: "Gee, there's my high runner right at the top of the window; now let's click into that and see which instructions are slowing me down." It's really a great thing.

So with some background on CHUD behind us, I want to talk a little bit about this regimen I go through to really torque down and get some performance out of floating-point-intensive code. The first concern often is memory. The machine has an enormous amount of bandwidth if you can use it effectively. You need to load data early so that it's available early to the
out-of-order execution cores. If the data ain't there, the cores can't do the instruction, so get the data there early. In examples you'll see that I load polynomial coefficients and literal constants very early on in subroutines, even speculatively: even if they may not be used in a particular branch of the code, I'll often load them early just to have them available in case we drop through and go. So load early, load often. Harness the two LSUs to drive those two FPUs. If you can load the data sequentially, there are hardware-initiated prefetch streams that are really effective at getting the data into the machine. dst and vec_dst, the Motorola entry points, are bad eggs: they're a big help on G4, but they're execution-synchronizing on G5 and probably ought to be avoided. You're better off, if you want to prefetch data, using the dcbt class of instructions. You need to be aware that the cache line size on the 970 is 128 bytes; the loops that enclose the dcbt instructions need to be cognizant of that.

So did I say that the 970 has two FPUs? I think so. Use them. At each cycle, on each FPU, we can do an fmadd instruction that counts as two floating-point ops, so you can net four flops per cycle on our CPU, and that is achievable. There's none of this "four out of five" or "six out of seven can be scheduled"; you can get all of these. I've seen it. The data has to be in registers, but you can get that stuff. There are 32 floating-point registers and 48 additional renames. The machine will execute things out of order and take advantage of the renames very effectively. Typical latency for a floating-point instruction, that is, the time from when you start an instruction until you can use its result in a subsequent operation, is six cycles. Throughput is one; these things are fully pipelined, so you can throw them into the pipe one after the other. The key exceptions are square root and division; that should come as no surprise to anyone. Since there are two FPUs, a simple strategy for making sure
that you're getting both FPUs fully utilized is to think that you've got a 12-cycle pipe. That is, start a result and don't plan to use it until 12 cycles later, with 12 intervening operations. That often means you need to think a little bit more about parallelizing or software pipelining your algorithms to get that kind of distance between uses.

So here's a little piece on choice of algorithm. Sometimes you need to just pop up a couple of levels and think: is there some way to recast the algorithm I have to be more effective on floating point? The example is "it takes two hands to matrix multiply." You're trying to form the matrix C as the product of A and B. In high school, or junior high school nowadays, you learn how to take the product of two n-by-n matrices by using two hands. To form the output element C(i,j), you take the ith row in one hand and the jth column in the other hand and you form a dot product. 2n fetches and n multiplies later, you have a single element of the output, C(i,j). That turns out to be not a really efficient use of the floating-point unit or the register set, unless you rethink the algorithm along these lines, and this is what ATLAS actually does. Instead of two hands with one finger each, use four fingers on each hand to form a 4-by-4 output block in matrix C. You grab four elements across, say, the first four rows of A and four elements in the first four columns of B, and form all 16 possible pairwise products. Continue down n rows and n columns and (I'm just going to look at this so I remember it exactly right) you're accumulating 16 simultaneous inner products with 8n fetches and 16n operations. It's actually a factor of four reduction in memory bandwidth. You've used nearly all the registers possible to keep the floating-point units bubble free, and basically that's the trick that ATLAS uses in our matmul kernel to drive the machine at, in the kernel, eighty-four percent of peak. So the take-home message is: think
parallel if you can, but small parallel. You know, four-by-fours: it's manageable.

So here's a little case study I thought we'd go through that gets a little bit more into the tracing tool. The code is the arm of the libm sine function for arguments that are smallish, between pi over 4 and minus pi over 4. Just to set some landmarks for orientation: you'll see that there's an absolute value function taken at the top, a comparison here to decide if we're in the right arm, some manipulation of the floating-point environment, some arithmetic, and then what looks like a polynomial approximation, formation of the final result, some more adjustments of the floating-point environment, and out we go.

On the G4 series we had a tool called simg4, and if we look at this segment of code run on a smallish argument to sine, we see this picture. This is actually a very good picture, simg4-wise. At the top (it isn't going again, this is really tough, here we go), for our landmark, there is that fabs instruction, and we read across to see this fabs instruction issued at roughly cycle 1200 [unclear]. It spent two cycles in instruction fetch, one in dispatch, a couple in execution, and then retired. Next line: store word with update, at an additional time. The object here is to fall off the cliff: you're retiring instructions as fast as you can. So this looks pretty good. This code's really quite good on G4.

Now take just that code, unchanged, bring it to the G5, the 970, and use the simg5 trace. Well, first of all, it's a more complicated machine and the letters have changed. We spend some time in fetch, five cycles in dispatch, some time in the mapper, and we finally hit execution unit 5 for that fabs, which is now on the other side of the screen. We finish that operation, what, six cycles later, and we hang around and wait for the rest of that completion group to all finish up, and then we complete. This doesn't look too bad either. There again, the object is to fall off the cliff. Well, until we get down to
the bottom here. This is trouble. There's a key for these letters, and if we were to look up the key we'd see that we had essentially a load-store reject: somebody is trying to load from an address that was recently stored. Well, how recently? It turns out that store occurred way up top. This is a machine that's putting hundreds of instructions in flight, and these dependencies can stretch out over really large lengths of code. So beware, be forewarned, and watch out for these kinds of things.

By adjusting this manipulation of the floating-point environment we can end up doing quite a bit better. You'll notice first of all that this is shorter, the fall off the cliff is much more precipitous, and we've eliminated that nastiness down in here. So that's the kind of information you can gain from the simg4/simg5 class of tools. I think it's an adjunct to CHUD. I think CHUD with Shark is the first place to look, but when you want to eke out the last little bit of performance, this is the tool that I turn to.

So here's the final version of that code. Just for landmarks, there's that fabs again. We still have the compares, but it turns out we've manipulated the environment by bringing it to register rather than storing. I've adopted this style where I split across a single line in the C code the places where I think I can gain parallelism. So here's the loading, early and often, of the polynomial coefficients. Here are operations that I believe go in parallel on the two FPUs, and then out.

OK, so a quick summary of the regimen I like to use. Start with CHUD, look at Shark, and go back to your code. Pay attention to load-store issues. Think in terms of dual FPUs; even just organizing the layout of your code can help you think about when you can take advantage of things going in parallel. Use as many registers as possible. Let the hardware-initiated prefetch streams help you get data into the machine early and often. And when directed by the simg4/simg5 kind of tools,
look at dispatch group formation, just to make sure that you're not crowding instructions that you think ought to be going to separate FPUs one on top of the other in a single issue queue. That was the slide that came much earlier in the talk.

So, time to wrap up. You can review these sessions on the DVD. You can contact myself or Ali for questions. OK, Ali does a little skunkworks operation every once in a while, and he can tell you about that. For more information, there are two really fine technotes now on the web at the developer site that cover in detail many of the things I spoke about today. They also include first-time usage of the CHUD tools and compiler options that can be a big help too. And then Technote 2087 is a quick comparison to remind you of the differences between G4 and G5. There's some other interesting documentation at the... somebody has HTML here, we'll have to resolve this. And finally, I'm a big fan of the fellow who writes for Ars Technica describing the PowerPC G5s. It's a really lovely introduction to the machine and a good place to start, you know, on a quiet evening with your laptop.
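As a rough illustration of the 4-by-4 blocking idea from the matrix multiply discussion in the talk, here is a minimal sketch in C. It is not the ATLAS kernel. The function names `matmul_naive` and `matmul_4x4` are hypothetical, the matrices are assumed to be square and row-major, and `n` is assumed to be a multiple of 4. Whether the 16 accumulators actually stay in registers depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* "Two hands" method: one dot product per output element,
   2n loads and n multiply-adds for each C(i,j). */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* 4x4 blocked sketch: each step of the k loop loads 4 values of A and
   4 values of B (8 loads) and feeds 16 independent multiply-adds. */
void matmul_4x4(size_t n, const double *a, const double *b, double *c)
{
    assert(n % 4 == 0); /* assumption: no edge handling in this sketch */

    for (size_t i = 0; i < n; i += 4) {
        for (size_t j = 0; j < n; j += 4) {
            double acc[4][4] = {{0.0}}; /* 16 simultaneous inner products */

            for (size_t k = 0; k < n; k++) {
                double av[4], bv[4];
                for (size_t r = 0; r < 4; r++)
                    av[r] = a[(i + r) * n + k]; /* column k of 4 rows of A */
                for (size_t s = 0; s < 4; s++)
                    bv[s] = b[k * n + (j + s)]; /* row k of 4 columns of B */

                /* No accumulator depends on another, so the multiply-adds
                   can be interleaved to cover the FPU latency. */
                for (size_t r = 0; r < 4; r++)
                    for (size_t s = 0; s < 4; s++)
                        acc[r][s] += av[r] * bv[s];
            }

            for (size_t r = 0; r < 4; r++)
                for (size_t s = 0; s < 4; s++)
                    c[(i + r) * n + (j + s)] = acc[r][s];
        }
    }
}
```

Per 4-by-4 block, the naive loop performs 16 × 2n = 32n loads while the blocked loop performs 8n, which matches the factor-of-four reduction in memory traffic mentioned in the talk.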