---
title: WWDC2003 Session 506
framework: wwdc
role: article
path: wwdc/wwdc2003-506
---

# WWDC2003 Session 506

## Transcript

Good afternoon. My name is Mark Tozer-Vilchez. I'm the desktop hardware evangelist at Apple Computer in Worldwide Developer Relations. Welcome to Session 506, CHUD Performance Optimization Tools in Depth.

Now, optimization means a lot of things to a lot of different people. It can be anything from trying to get your application to launch faster, to accessing the network faster, to getting higher frame rates. Bottom line is, it's about speed. It's about performance, about getting your application to run faster than it does today, or faster than maybe your competitor's application does. There's a common denominator: you're looking to increase performance, and you need to know where that performance can be increased. In order to do that, you need tools that allow you to understand where that work can be done. Apple has created a set of tools for developers, shipped freely with the developer tools, that you'll hear about today. I'd like to introduce up here a member of the team that's involved in creating these tools, Mr.
Sanjay Patel of the Architecture and Performance Group.

Thanks, Mark. So my name is Sanjay Patel. I'm in the Architecture and Performance Group. We're going to start off today by talking a little bit about the G5 from a programmer's perspective: some issues you may run into as you're moving code from G3 and G4 over to the new systems.

So to start off with, the 970 (the PowerPC 970 is the chip's official name) is a very superscalar, very wide, and very deep machine. It's based on IBM's POWER4 architecture. It's a true 64-bit chip of the PowerPC AS architecture. It has the full AltiVec instruction set, the full 162 instructions, all implemented in hardware. It also has a high-bandwidth point-to-point interface, so this is a little different than a bus: what we actually have is direct connections between the processor and the memory controller. And we also supply automated hardware prefetch engines. What these guys do is they'll start detecting patterns of memory accesses and prefetching those memory accesses into local caches for you.

So here's a picture of the die. You might have seen this in the keynote or at the 970 presentation yesterday. What we have here is two load/store units, two independent fixed-point units, two IEEE-compliant independent floating-point units, the full set of the four AltiVec subunits, the ALU as well as the permute unit, a branch unit, and a unit to handle condition register logical operations.

So here's another view from the top. There you see instructions would be in the L1 cache. They'll go into fetch and sit around in some queues, then they'll go to dispatch, where up to four instructions plus one branch can be dispatched on every clock. So this is really a wide machine. There you can see they get fed into 10 issue queues, and on to the 12 execution units. So again, this is probably the widest machine you've dealt with.

OK, how does this all compare against the G4? Well, to keep all these units flowing, the core can actually have over 200 instructions
in flight, versus a little over 30 in the G4, if you count the completion buffers as well as the various queues. The pipeline stages have been expanded, so we're at 16 stages for a simple instruction versus 7 in the G4. As I mentioned, we have two load/store units versus one, as well as two floating-point units versus one in the G4. There are two general-purpose fixed-point units, where in the G4 there were three dedicated simple units and one complex unit. Vector is similar: there's the ALU, which includes floating-point, complex integer, and simple integer, and the permute unit.

When we talk about the caches, there are quite a few differences here. First and foremost for programmers, the cache line size has changed: it's 128 bytes where it used to be 32 bytes. The L1 data cache is the same size, but it's a two-way associative, write-through design versus eight-way and write-back on the G4. The instruction cache has been doubled, so it's 64K now, although it's a direct-mapped design versus eight-way associative on the G4. The L2 cache is also doubled, so now we're at a full half a meg, eight-way associative in both G4 and G5. The replacement algorithm is LRU versus random on the G4. There is no L3 cache on the G5, whereas on the G4 you had up to 2 megabytes. That's partially made up for by the fact that processor bandwidth is just tremendously higher on the G5: it's up to 3.5 gigabytes a second effective in each direction, that's to and from memory simultaneously, versus a 1.3 gigabytes per second bus for the G4. On the other side of the memory controller, we've doubled the width of the DDR interface as well as increased the clock frequency, so more than twice the bandwidth is available from the DDR chips: 6.4 gigs versus 2.7 gigs on the G4.

OK, so what does this all mean from a programmer's perspective? Well, there are going to be some things you're going to look out for as you're porting your code and optimizing it on this chip. And so the first thing you'll notice is that there are
more pipeline stages here, which means instruction latencies have grown from the G4. OK, so how do you work around that in your code? Well, you should do more in parallel, right? So manually unroll important loops, or try to use compiler flags such as -funroll-loops with GCC. You can also schedule your code using -mtune=970 with the new GCC 3.3.

Now similarly, because the pipeline is longer, branch mispredictions are going to cost more; it's just going to take longer to recover from a mispredict. There are several solutions you can use here. If you're coding in C, GCC offers __builtin_expect. That's for a very highly predictable branch, such as maybe exception code that you expect not to be taken very much; you can use this __builtin_expect macro. If you're coding in assembly, we have these new ++ and -- suffixes for all branches, so either highly taken or highly not taken. Another solution is to just not do branches, right? In floating point you have fsel, which is enabled with -ffast-math, and what that allows you to do is a conditional move in floating-point registers. In the vector domain you have vsel, a very similar operation; use it with masks. In the integer domain you have the carry bit, which can be used for min and max types of operations. You can also use masks to avoid branches when you're doing what would effectively be integer conditional moves. And then feedback-directed optimization is something most programmers don't try, but this can be very effective on the G5, because if you have a representative run of your program, you can let the compiler annotate that run and then mark all branches as highly not taken or highly taken. This can be very effective for improving performance on this long pipeline.

So as I said, the data cache is quite different than it was on the G4, and the most important thing here is that it's 128 bytes per line. What can you do to work around that? Well, that's either a win or a loss for you, depending on your code, right? If you have a lot of
locality, well, you're probably going to incur one miss where you would have had four misses on the G4 system. So you must design your algorithms and your data structures to move through memory sequentially, as contiguously as possible. That's also going to trigger the hardware prefetcher, and this is very powerful because it will amortize all the latency out to memory.

So that's the next topic. Because it is a point-to-point interface to the memory controller, effective latency may be higher than what you've seen on a G4 system, and that's because, to maintain coherency, you have to go to the memory controller and then bounce back to another processor. What can you do to avoid any of those penalties? Well, software prefetching: there are several instructions. And of course the hardware prefetcher is the best solution, because it's self-paced and it'll be synchronous with your code: as you miss, the hardware detects those misses, detects the pattern, and prefetches lines for you. You can also batch your loads. So if you need to access several pieces of data and you know you're going to need them in advance, try to group those loads together, because the bus can support several misses
simultaneously.

So the data stream touch instruction, DST, from the AltiVec instruction set is execution serializing on the G5, because it's mapped onto the existing hardware prefetcher mechanism. So what can you do to avoid DST? Well, first of all, you can probably remove it, right? It is just a hint, so there's no guarantee that DST is going to be effective for you in the first place. The preferred solution is to rely on the hardware prefetcher, so assuming you have contiguous memory accesses, that's going to work for you automatically. Now if you have non-contiguous accesses, we recommend that you replace a single DST with several dcbt instructions, that's data cache block touch, so you would issue one of those for each line.

Legacy code that uses dcbz, which is the zeroing of a cache block, or dcba, the allocation of a cache block, is going to perform very poorly on the G5. Why is that? Well, dcbz is effectively emulated to only work on 32 bytes, and we had to do this to ensure backwards compatibility with existing code. dcba is not implemented on the G5, so that's just going to be an illegal instruction: you'll end up in an exception handler and then bounce back to your code. So this is just going to be tremendously bad for any performance-critical code, and the only reason you would have used these instructions is because it's performance critical. So the solution is: get those dcbz's and dcba's out of the code. Again, dcba is just a performance hint, so that shouldn't affect any kind of functionality. If you do need to zero a cache line, we would recommend that you use memset or bzero rather than trying to roll your own zeroing functions. But if you do need a dcbz function or an allocate of a cache line, we have this new mnemonic called dcbzl, and that's going to effectively zero out whatever the native cache line length is for any system, whether it's G3, G4, or G5.

So here's an example of how to use dcbzl. Now, those of you who have used dcbz will say, well, the original definition of dcbz was
simply to zero out the native cache line length, so what have we changed? Well, the reason we have to have this new mnemonic is because most programmers ignored that warning. They coded for 32 bytes, and now they're going to get bitten. So what we'd much rather have you do is code based on line size: effectively, stride through memory based on the line size, which you can get from the operating system, whatever the current line size is on the system. And of course if you're just doing a memset operation, we'd much prefer that you use memset or bzero.

So synchronization primitives (locks, syncs, isyncs) are going to be more costly on this chip than they were on the G4. That's for two reasons: one, the longer pipeline, and two, the longer latencies to memory. So that's a tough one, but what you have to do is just make sure all your locking is absolutely necessary, minimize the lock hold time so you're not contending for locks as much, and of course ensure that each lock is in its own cache line so you don't have fighting between processors.

So scheduling is crucial for this chip, and it's going to require recompiling or even hand scheduling for optimal performance. What we recommend is that you use GCC 3.3, which has a pipeline model and scheduling model for the 970. The other thing you can do for really performance-critical code is understand dispatch group formation using Shark. And for those of you who don't know what Shark is, we'll get to that in just a minute.

OK, so in summary: this is a very parallel core. You have basically two of each unit (LSUs, FPUs, FXUs), lots of renames, lots of instructions in flight. So if you have very synchronous code, it's simply not going to take advantage of this core. What you want to do is of course unroll and schedule. You can also use AltiVec to calculate up to a theoretical peak of 32 gigaflops on a 2 gigahertz system. Also, the 970 has full-precision hardware square root, so you don't need to make calls to any libm functions for
square root anymore. If you're using GCC, we offer this flag, -mpowerpc-gpopt. We also have native long long support: because this is a 64-bit chip, it can natively do doubleword-length operations in leaf functions using -mpowerpc64.

OK, so again, the system and the chip are all designed for high bandwidth. There's incredible bandwidth to the L1 cache, between the caches, and out to memory: 32, 64, and effectively three and a half gigabytes per second in each direction on the bus. Take advantage of that by using streaming, using software streaming and hardware streaming prefetch. So again, the optimal cache control instruction, rather than a DST, is dcbt to prefetch. If you have a DST that covered a lot of ground, multiple cache lines, then issue multiple dcbt's in its place. Don't use dcbz, because that's emulated. Use the dcbzl instruction, but be careful if you're using it: make sure you account for cache line size. And again, dcba and dcbi are illegal, so those just need to be removed from code.

OK, so we've talked a lot of theory here. How do you actually get down and dirty with your code and figure out what's going on? Well, that's where CHUD comes in. So I'd like to introduce Nathan Slingerland.

OK, thank you, Sanjay. So hopefully a lot of you were introduced to the CHUD tools last year at WWDC, at least the version 2 tools, and this year we're happy to give you version 3 of the tools, and we have a lot of enhancements and improvements. But basically the CHUD tools are a suite of low-level performance analysis tools written by Apple's Architecture and Performance Group. They give you access to the performance counters in the processor, memory controller, and operating system, and using these counters with CHUD you can find problems in your code and improve your code. And of course it's free: it's on the Developer Tools CD, and it's also on the website.

So in 3.0 we have several classes of tools. Profiling tools, so tools to find out where things are happening. These
include Shark, the successor to Shikari if you've used that, and we'll get to all the great new features it has. MONster is a spreadsheet for performance events, and it has a lot of great new features too. Saturn is a new tool for visualizing function calling behavior. Then for tracing: if you've ever done AltiVec or very processor-critical code, sometimes it's useful to see how things are actually happening on the processor. So we have amber to take an instruction-level trace of a particular program, and then acid is a program to analyze this trace, or SimG4, a cycle-accurate simulator for the PowerPC 7400, and soon SimG5, so you'll be able to simulate for the PowerPC 970. And of course we provide the CHUD framework. This is the API we use in all our tools, and you can use it to make your own tools, or call into the CHUD tools and have them do what you need.

OK, so the performance counters. These are in the processor, memory controller, and operating system, as I mentioned. What these do is count interesting low-level performance events, things such as cache misses or instruction stall cycles, where otherwise you'd have to use a simulator to find out what's happening at this level. Page faults in the operating system: you can find out when those happen. And the CHUD tools let you configure these counters, tell them what to count, and record the counts, and then you can use the tools to look at what the result is.

OK, so the first tool we're going to talk about is Shark. Shark is a system-wide profiling tool, so you can use it to profile a process, a thread, or the entire system if you want to look at that. The most basic usage is just a time profile, which will show you where the hotspots are in the system, where the system is spending its time. You can also use any of the performance counter events, so you can get an event profile to see where hardware events relate to your source code, for example where cache misses might be coming from in your code. We
capture everything: drivers, kernel, applications. So if you're a driver or kernel extension writer, you can also use CHUD to see the call stacks and find out where things are coming from. And of course it's very low overhead; this is all handled inside of our own kernel extension. Once you've gathered the session that you're interested in, taking the samples that you want to look at, we provide automated analysis: we annotate your source code and the disassembly of your source code, and give you optimization tips about how to improve your code. There's also static analysis. You can use this to just look for, for example, dcbz or dcba instructions in your code, if you want to make sure you catch every instance of that. There's a command-line version, so this is scriptable, and you can also telnet into a machine and use it from the command line. And of course you can save sessions and give them to your colleagues, pass them around, whatever you'd like to do.

So MONster is a more direct interface to the counters. It lets you look directly at the results of the counters. You can configure them using MONster, collect the PMC data based on timed intervals, a hot key, or event counts (every 10,000 cache misses, for example), and then you can view the result in a spreadsheet or a chart. In addition to just the raw performance counts, there is a built-in shortcut language. This is an infix computational language in which you can program your own metrics that you're interested in, or you can use the built-in ones, things like memory bandwidth over the memory bus, or cycles per instruction, a variety of things. There's a command-line version of MONster provided as well for your scripting and remote sessions, and you can also save and review these sessions.

OK, so the best way to see how to use these tools is with a demonstration. What we're going to look at is a program called the Noble
Ape Simulation. This is written by Tom Barbalet, and he's simulating apes on a tropical island. These apes can think, and he's simulating the biological environment, so the food and the other animals on the island, as well as the cognitive processes of the apes. I mean, obviously simple cognitive processes such as desire and fear and those kinds of things. This is open source, and for more information please check out his website at nobleape.com. So let's switch to the demo machine, and you can see Noble Ape in action.

OK, so this is the map window here. This shows us the island, and each red dot represents an ape running around the island doing its thing. We can select one ape at a time; that's the ape with the red box around him there. And for this ape we can see his brain, what's happening in his brain, in the brain window. And of course any good performance study requires a performance metric, and our metric is ape thoughts per second. So for the original code we have this metric at around 1,200, 1,300 or so.

OK, so the first thing we'd like to do: we're going to launch Shark and see what's happening in the system. This is the main Shark window, and it's really pared down and simple, just to let you start your work. By default we come up with the Time Profile, so this would be the most common thing you'd use it for. We provide a bunch of other built-in shortcuts and configurations, and of course you can create your own using any of the performance counters, but for now we'll use Time Profile. There's a start button here for starting sampling, but there's also a global hot key, so that Shark doesn't have to be in the foreground; it can be in the background and you can start it. So we'll use that hot key. We'll take a five or ten second sample and see what's happening.

All right, so here's the profile. What we've done here, this is just listing the functions that
it sampled inside Noble Ape, from most sampled to least sampled. When you're optimizing, you want to work on what's running most of the time, and then you're going to get the most benefit out of optimizing that code. So we see that Noble Ape is fifty percent of the system. This is the process pop-up, and like top, it lists what was running in the system. It's kind of strange, fifty percent of the time, even though we know that we're CPU bound. Well, if we go to the thread pop-up here, you can see that in fact it's single threaded. This is Noble Ape running on a Power Mac G5, which has dual 2 gigahertz processors. So the next step: we want to thread this thing, since we want to take advantage of both processors.

So this is the heavy view that we're looking at. There's a heavy profile view and a tree view. In the heavy view we can open up these disclosure triangles and see how we got to this heavy function. We started in main, called flat cycle, called control cycle, called cycle troop, called cycle troop brain scalar, this important function. Now, we know our code, and we know that we can't really split the processing between simulation cycles. The way this app works, it does a simulation cycle, and within any simulation cycle it's processing a bunch of apes. Well, the simulation cycles themselves are not independent; they depend on one another. But we know that the apes are independent, they're independent-thinking apes, so we can parallelize it at that level: we can process the apes in parallel for each simulation cycle. So that's what we did. We threaded that to split up the number of apes: we have 64 apes, split evenly between two threads. You'll notice that the brain rate was originally around 1,200; when we do this, we get almost 2,500, well, 2,400 or so, not quite. So that's pretty good. We've got a nice speedup just from threading, from taking advantage of
that second processor. So let's profile again and see what's going on.

OK, so now in the process pop-up we can see that Noble Ape is taking up a much more significant amount of the system; that's a good thing. And we can see from the thread pop-up that we have the main thread, that's the 9.2 percent, which spun up two computational threads, each at about forty percent of the time. So the next thing we'd like to do is actually optimize this function. This function is important to us: it's almost ninety percent of the time that we spend in here. If we double-click on it, Shark will present us with our source, highlighted where we sampled. This tells us where in our source code we're spending time. So inside of this cycle troop brain scalar, it's this for loop: we can see that this for loop actually represents about ninety-four percent of the time in this function.

OK, so I should probably talk about a couple of these other things here. The scroll bar at the side is like an overview; you can easily jump to the hot spots in your code. It's colored accordingly: a brighter yellow means more samples. At the top we have a source file list, since sometimes you can have more than one source file contributing to a particular function, with header files and the like. And this function pop-up is like what you have in Project Builder; you can easily jump to different functions. And then we have the edit button. What this allows you to do is jump into Project Builder at the same selected line, so you can easily go to where you want to edit and change something once you know where the problem is.

OK, so let's go back to Shark. What Shark does is provide us with advice, and what this little !
button is, is advice for us; this is calling something out. So there's some advice here, but we'll focus just on the first one: this loop contains 8-bit integer computation. Obviously, if you're spending a lot of time in this 8-bit integer computation, it might be a good idea to use AltiVec to really improve this code. So that was what we did; that was our next step. So let's go try that out. Right, that's a nice speedup. Well, we're not done yet. That's good. So let's profile again and see where we're spending time now.

All right, so we'll double-click again on this. We see the vector function shows up at the top of the profile, and this is the vector code. A lot of you, if you've used Shikari, saw the assembly view. If you double-click on any source line, it's going to jump to the assembly view that you're familiar with, and if you double-click on the assembly it will jump back. It's going to highlight the instruction or instructions that correspond to that source line. So this can give you an idea of how good the codegen is for your compiler: what kind of instructions, and how many, it's generating for each source line.

OK, so if you've seen this before: the columns here. We have samples, that's how many times we sampled each instruction address; the address, obviously; the instructions themselves, and you can switch between various views of the address; and cycles, which is the latency and throughput of a particular instruction. These are for the 970; the CPU model down at the lower right-hand corner there tells you that. The comment column has various things about this code, and then of course the source file and line.

Now, one of the nice things is that we give you the ability to visualize dispatch groups. So if we go to this option and turn this on, and if you remember the diagram that Sanjay showed earlier with the dispatch, we can see
here how, usually, between four and five instructions dispatch in each cycle. So this can give you a good idea of how things are actually behaving on the machine. The other thing we provide too is this functional unit and dispatch slot utilization graph. So, do you want to talk a little bit to that?

Right. So on the 970, as a programmer, the key bottleneck you'll have to face is maximizing dispatch group width, because that's one of the narrower points in the core: it's four instructions wide plus a branch. So what we've got for you here in dispatch slot utilization, over here on the right, is the average group size. You can see how effectively your code is taking advantage of this really wide dispatch machine. And dispatch defines where instructions are issued, to which functional units. So here you see a map of the 12 functional units that I talked about, and you can see that the units are symmetrical, like the two LSUs here. If there's a big imbalance between these, say one of the LSUs is being used a lot and one's not, well, that's something that you could probably correct with scheduling or reordering your code, because what you want to do is balance the execution units. You don't want half the chip doing all the work and the other half sitting idle. All of that is defined by dispatch groups, so that's why we put dispatch group modeling into Shark.

Right, and this is dynamic: you select a few instructions and it'll tell you where they got mapped to. The charts update, and the numbers update with it. Great. So this can obviously help you on the Power Mac G5, tuning your code. But let's go back to the source view. If we look a little closer at our vector code, you remember we vectorized this inner loop that we thought was taking up ninety-four percent of the time, right? Well, now that
loop is still important, but it's taking up a smaller portion of the time in this function. We can also see that at the top and bottom of this function we're spending more time, relatively speaking, in the scalar code, the two loops that we didn't touch. This code is very similar to the other loop, it's almost exactly the same, and indeed Shark will point this out: it's saying, yeah, vectorize this loop too, this is important now. So that was the next step, to vectorize the rest of this, to vectorize the entire function, as well as a few other optimizations. So let's try that out. We're starting around 10,000 or so, so there's another forty to fifty percent we can eke out by vectorizing the rest of that function. You can see some of the gorillas have gone off into the water. You can bring them back to life by dragging them back to land. A little bit suicidal. Yeah, they just like the beach. All right, so that's kind of a 15x, 14x speedup. That's pretty decent. We hope you can all do that well in your code too. So the next step... OK, so no, don't do that yet; we have a few more things to show.

There are a couple of things we didn't really talk about. Shark allows you to manage sampling sessions. We've taken about four sampling sessions here. You can either look at these in parallel, in a multi-window mode, or you can deal with one window at a time. The multi-window mode is nice because you can put them side by side. There's also a session drawer, so you can quickly switch between them in the single-window mode. And of course, as we mentioned, you can save sessions. There's also ancillary information included: whenever you take a session, it records what kind of machine it was and gives you some space to write notes to yourself about what's happening. So this is archival; you can keep it around and remember what happened. There's also an extensive user
guide included, so it's just online here. Please read it; there's lots more information, and there are features covered in there.

OK, so one other thing we wanted to look at: we want to use the MONster tool and look at some of these performance counters in more depth. This is the main MONster window, and this is the spreadsheet. On the left-hand side we can see the various performance counters that are on this system, and on the right is the spreadsheet itself. The shortcut pop-up is similar to the sampling config selection in Shark, the same thing, but what we can do is edit these shortcuts. So if we go to the shortcut tab, we're going to look at the memory bandwidth one. It's taking a few of the U3 counters, the memory controller counters on this Power Mac G5, and it's going to calculate the number of megabytes transferred over the memory bus. The way it does this is it counts the number of beats, and each beat is 16 bytes on the data bus, so it can multiply out and figure out, for every 10 millisecond sample, how many megabytes that would be.

So we have a session; let's open that, it's on the desktop. If we pop open the run pop-up, we see that for this session there are four runs taken. The first was just the original scalar code, the next was the threaded scalar code, then the vectorized threaded code, and finally the optimized vector threaded code. What we wanted to look at is how this memory bandwidth changed as we changed the code, so we can use the draw chart so we can see this. We took 100 samples total, so 25 samples for each type of test. Originally we can see that, obviously, the memory bandwidth varies over time a little bit; we're around 250 megabytes per second for the original single-threaded code. We threaded it, and we're using closer to 400 megabytes per second. Then AltiVec can use a lot more of this
bandwidth, around 1,250 megabytes per second on average, and then finally the optimized code, 1,700 megabytes per second on average. So you can see how, as we optimized the code, we were able to take better and better advantage of the massive bandwidth that's available on the Power Mac G5. Okay, so let's go back to the slides, I think.

Right. I guess I'm going to talk to you. So you might wonder, how does this compare against G4? We started out with the regular scalar code, and on the G4 you get about 1,200 ape thoughts per second. We were getting closer to 1,300 on the G5. Well, you say, the G5 is running at a much higher frequency, but we're barely getting a little more than ten percent faster performance here. So what is the bottleneck? Well, the initial bottleneck was all integer performance. With the longer-latency, longer-pipeline instructions, you're just not going to get the full frequency increase as your performance increase. When we went to threaded, we started to expose the better bandwidth on the bus, because we have two processors and they each have independent point-to-point connections to the memory controller. Now you can see we're over twenty percent or so faster than the G4.

And then we really break this open when we go to vector. The G4 does well, right, we get a two and a half or so X speedup from using vector, but on G5 we don't have any bandwidth limit yet, so we get a full 4x improvement from going to vector. And then with vector optimized, you can see again the G4 does pretty well, a nice speedup, and the G5 gets the sixty percent speedup. And if you look back at that MONster chart, you can see we were getting peak bandwidth of 2.5 gigabytes a second on the bus. So we're still not done yet, but we didn't have more time to optimize before this demo. But clearly there are a lot of resources there, and if you start with basic code, well, you might get a decent speedup over a G4, but if you
put a little effort into it, you can get a very big speedup if you take advantage of AltiVec and take advantage of all the bandwidth that's available to you. Right.

Okay, so a third tool we'll talk a little bit about is Saturn. Shark is the other profiling tool that we've talked about, and it provides a statistical profile, right: it's periodically interrupting the system, recording where you are, and then going on, and afterwards we say, well, wherever we got the most samples, that's where the most time was spent. Saturn is going to instrument every function in your source code to give you an exact profile, and it allows you to visualize the call tree. It uses GCC to instrument each function at entry and exit, and it records the function call history to a trace. With this we can get call counts, so how many times each function was called, performance monitor counts, it can use those as well, and the execution time of each function.

So if we look at this, at the top we have the familiar call tree view that says where we spent time in each function and its descendants, anything like that. At the bottom we have something that's viewing the same data but in a different way: it's plotting call stack depth vertically versus time on the horizontal axis. What you can use this for is, if you see a very sharp, narrow spike, that means you're spending a lot of time in calling overhead, right, you're going through many, many function calls and not getting a lot of work done, if it's not a wide stack. So, okay, that's Saturn.

And of course the CHUD framework. You can use the CHUD framework to instrument your source code. You can use this to start and stop MONster or Shark. You can also write your own performance tools. A lot of the functionality, almost all of it, that's in Shark and MONster is exposed in this framework,
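The exact-profile idea described for Saturn (instrument every function at entry and exit, record the call history to a trace, then derive call counts, per-function time, and call-stack depth over time) can be sketched in a few lines. This is only an illustration under stated assumptions, not how Saturn itself is implemented: the talk says Saturn uses GCC to insert the instrumentation, whereas this sketch uses a Python decorator, and `CallTracer`, `instrument`, and the toy functions `top`, `middle`, and `leaf` are hypothetical names invented for the example.

```python
import functools
import time
from collections import defaultdict


class CallTracer:
    """Hypothetical helper: records an entry/exit trace for instrumented functions."""

    def __init__(self):
        self.depth = 0                        # current call-stack depth
        self.trace = []                       # (event, name, depth, timestamp)
        self.call_counts = defaultdict(int)   # how many times each function ran
        self.total_time = defaultdict(float)  # inclusive seconds per function

    def instrument(self, func):
        """Wrap func so every call logs an 'enter' and an 'exit' event."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            name = func.__name__
            self.call_counts[name] += 1
            self.depth += 1
            start = time.perf_counter()
            self.trace.append(("enter", name, self.depth, start))
            try:
                return func(*args, **kwargs)
            finally:
                end = time.perf_counter()
                self.trace.append(("exit", name, self.depth, end))
                self.total_time[name] += end - start
                self.depth -= 1
        return wrapper

    def max_depth(self):
        """Deepest call-stack depth seen anywhere in the trace."""
        return max((depth for _, _, depth, _ in self.trace), default=0)


tracer = CallTracer()


@tracer.instrument
def leaf(x):
    return x * 2


@tracer.instrument
def middle(x):
    return leaf(x) + leaf(x + 1)


@tracer.instrument
def top(n):
    return sum(middle(i) for i in range(n))


result = top(3)

# Exact call counts, unlike a statistical (sampled) profile.
for name, count in sorted(tracer.call_counts.items()):
    print(f"{name}: {count} calls, {tracer.total_time[name]:.6f} s inclusive")
```

Plotting the `depth` field of `tracer.trace` against its timestamp gives the depth-versus-time view described above: many short, deep excursions with little time spent at the bottom suggest that call overhead dominates the useful work.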
so you can set up, start, and stop the PMCs and collect information about the hardware, a lot of things that otherwise you'd have to go through I/O Kit for, which might be a lot of extra code to get at. And of course an extensive HTML reference guide is provided.

Okay, so here's an example of using the framework to remotely control either Shark or MONster. What you do is you pick the profile that you're interested in and then place either of those tools in remote mode. That means it allows other tools that want to connect and control the start and stop of the counters to do so. So first we initialize, then acquire remote access, to make sure that another tool is actually waiting for us to do something, and this will block if the other tool isn't currently waiting. Start the remote perf monitor, and for this function you can give it a label that's going to appear in the tool. Do whatever it is you're interested in, whatever the code of interest is. Stop the remote perf monitor, and of course release, to be a good citizen.

So that's one way to use the framework to instrument. The other way is just more direct: you can set up the counters directly and read them directly. Initialize, then acquire the sampling facility. There's only one set of performance counters in the system, right, because there's one set of hardware, it's one physical device, so you have to acquire the sampling facility. The kernel extension that we have manages access to this. Set up the counter event, clear the counter, start the counters, do whatever it is you're interested in, stop the counters, and then process the result.

Okay, so we also provide some lower-level tools. If you've ever done, as I mentioned, AltiVec programming, or any kind of really intense tuning, you'd like to know what's happening on the processor core: why is it slower than you expect, what's happening? So with amber, this is a command-line tool to
record an instruction trace to disk, so for all the threads in a given process, it records that to disk. Then you can run that trace file through acid. This is a trace analyzer that gives you some interesting trace statistics: you can plot the memory footprint that this trace walks through and point out problematic instruction sequences. And then you can also run this trace through SimG4, for the 7400 processor, or eventually, when this is available soon, SimG5, the PowerPC 970 simulator, to know exactly what's happening. Okay, so at this time we'll turn it back over to Mark, I think, for session wrap-up.

Thank you. So to give you a little bit of a road map of other sessions that are going to be valuable to you: the Tuning Software with Performance Tools session, 305, this afternoon at five o'clock in Presidio, and then the Mac OS X High Performance Libraries, again another set of tools. Really, libraries are not tools, but avenues here for you to be able to eke performance out of the operating system.

Again, optimization, just to go back to my introductory statements, should not always be an afterthought. Optimization should begin when you first start writing your code. It should be part of the process of how you want your code to be written, so you're not going back after the application is written and thinking, well, maybe I should thread my application, I should utilize threading. With an application like Noble Ape, we can add threading because it's not a lot of code, but if you're talking about a much larger application, a word processor or a graphics or image editing application, then you're looking at a whole redesign, possibly, and then it becomes more frustrating. So again, optimization should be something that is both at the beginning of your project and a follow-up once you finish your project: how do you get more performance out of it?

The other thing I wanted to mention is that the G5 PowerPC
processor is a very unique architecture, much different from the G4, as Sanjay pointed out in his presentation, and for that reason we want to make sure that you have as many resources available as possible to understand what those differences are and how to take advantage of them. All week we have been running a G5 optimization lab on the first floor in the California room. I urge you to visit and talk to the many engineers that have made themselves available, spending countless hours; Monday and Tuesday we were there until midnight. You'll be able to talk to Sanjay, Nathan, and several other engineers, both from Apple and IBM, throughout the week as well. We'll have follow-on kitchens available to you, as developers in the developer program, at Cupertino following the developers conference.
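As a footnote to the MONster memory-bandwidth shortcut described in the demo, the arithmetic it was said to perform (count the beats on the data bus, treat each beat as 16 bytes, and convert each 10 millisecond sample into megabytes per second) can be sketched as follows. This is a minimal sketch, not the tool's actual shortcut definition: the function name `bandwidth_mb_per_s` is hypothetical, the beat counts are made up for illustration, and taking one megabyte as 10^6 bytes is an assumption the talk does not settle.

```python
BYTES_PER_BEAT = 16          # from the talk: each beat on the data bus is 16 bytes
SAMPLES_PER_SECOND = 100     # from the talk: one sample every 10 milliseconds
BYTES_PER_MEGABYTE = 1e6     # assumption: decimal megabytes


def bandwidth_mb_per_s(beats_in_sample):
    """Convert a beat count from one 10 ms sample into megabytes per second."""
    bytes_in_sample = beats_in_sample * BYTES_PER_BEAT
    return bytes_in_sample * SAMPLES_PER_SECOND / BYTES_PER_MEGABYTE


# Made-up beat counts chosen to land near the averages quoted in the demo.
for label, beats in [("scalar", 156_250), ("optimized vector", 1_062_500)]:
    print(f"{label}: {bandwidth_mb_per_s(beats):.0f} MB/s")
```

Under these assumptions, sustaining the quoted 1,700 megabytes per second would correspond to roughly a million beats in every 10 millisecond sample.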