WWDC2003 Session 103
Transcript
Good morning, and welcome. As we were setting up for the conference, I had in the database that this was a Keithley secret session number one. I had a number of people ask me if I had a second secret session, and, well, no, I wasn't so lucky. The reason why this was secret is because we really wanted to make a dramatic statement about some new features in our acceleration libraries. What we'll be talking about today will be techniques that you can use to accelerate image processing using vImage as part of the Accelerate library. So to do that, let me bring up Robert Murley.

Thank you, Craig. I'm very happy to be here this morning to introduce to you a major new technology from Apple, our vector-accelerated image processing library called vImage. Before I jump into that demonstration, though, there's another message that I would like to make right here at the outset. All of the vector libraries, starting with Panther, are going to be contained in a new high-performance, state-of-the-art computing framework called Accelerate. Accelerate contains not only vImage, the major subject of this talk today, but also all of the vector libraries that have been available in previous OS versions in the framework called vecLib: the digital signal processing library, the Basic Linear Algebra Subroutines, LAPACK, the math libraries, and the big number library. So I wanted to make sure that that was clear before we went on.

Now we're getting into the main part of the demonstration. What you'll learn about vImage, well, that may be a little optimistic; what I'm going to try to convey about vImage, anyway: functionality; the data structures and API, or at least some examples of them, since it's much too extensive to go over completely; and some of the features that are not included in the first two subjects. Then I'm going to bring up one of my colleagues in the architecture, excuse me, in the Vector and Numerics Group, a gentleman who will talk about
implementation techniques and performance. And then finally, although it's not on this slide, there will be a section at the end by Eric Miller of the Architecture and Performance Group on an overview of the CHUD tools, the performance tools.

So what is in the functionality of the vImage library? It can be broadly grouped into the main topics listed here. I'm going to talk a little bit about each one of them, so I won't belabor it at this point. At this time I'd like to jump into the first demo of the morning. What I want to show you is an image processing function called inversion, a very simple technique where, for each pixel of an image, in fact for each color component of each pixel, the complement of that value is taken. What you see here is a picture of a bed-and-breakfast in Portugal that I happened to stay at earlier this year. I'm going to perform an inverse operation on it, and what you see is essentially the photographic negative. Now, the functions in vImage run quite a gamut in complexity, from quite simple to quite complex. This one is way over on the simple side, very easy to do.

The first major category was convolution. I want to say a word about area processes. In image processing, area processes are processes that use both the source pixel and other pixels nearby to generate the destination pixel. Both convolution and morphology, the first two examples I'm going to give you, are examples of area processes. Convolution in particular creates an output pixel by taking a weighted sum of pixels near the input, or source, pixel. That weighting, and therefore the effect of the process, is determined by a matrix called the convolution kernel. So I want to give you an example of how convolution works. What I have here is a fairly blurry and kind of low-intensity image of downtown Lisbon, and I want to emphasize that this is probably poor equipment and not the photographer's skill that was involved here, but
anyway, I'm going to operate on this with a five-by-five sharpening kernel, and the result is this. A lot of the blurriness is gone; features are much sharper than they were. We'll go back: that's the blurry one, and that's the sharpened one. You can see the kernel there is five by five; it's not that complex. All of convolution operates the same way: it's all matrix multiplication of areas of pixels, and the effect simply depends on what the matrix is. Another example: on this edifice I'm going to use a rather extreme edge detection process and produce an embossed image. This is, as you can see, an extremely simple convolution kernel, three by three, to get a pretty dramatic effect.

The second major topic is morphology. Morphology in general adjusts the shape of objects in the image to conform more closely to the shape of a probe, and the probe is defined, also in a matrix, called a morphology matrix. In practice it can be used to pick out small or large objects in an image and lighten them or darken them. It can make them larger or smaller. It can be used to alter their shape, to remove fine details while preserving larger objects, and so forth. To demonstrate this, I have here a couple of very simple images, a small circle and a large circle. I'm going to do a morphology process on these two images using a probe in the shape of a right triangle, and it's a fairly large matrix, about the size of the smaller circle. The result of that operation is this. As you can see, both circles take on a more triangular characteristic, and in fact in the smaller circle the center circle completely disappears because of the size of the kernel. That was called a dilate operation, by the way. I'm now going to do what's called an erode operation. I'm going to take a circular filter about the same size as the kernel here, operate on
these two images, and I get this result. You can see that in the case of the smaller circle all the circularity is gone; it has turned completely into a triangle. The larger circle has taken on some triangular structure and lost a lot of its circularity. So there are a couple of examples of morphology in action.

The class of functions in geometry is pretty much self-explanatory. They perform some sort of geometric operation on the image: transform it, make it larger or smaller, reflect it, whatever. For an example of geometry, we're going to take this picture of a jay, and I'm going to translate it and make it bigger, but only in the longitudinal direction, and that results in this image. Secondly, I'm going to go back to the original picture here and do a shearing operation off to the right side, and that results in this image. And is it my imagination, or is that bird getting more irritated with each picture? Maybe I've just been looking at them a little bit too long.

Histogram operations are those that use an intensity distribution histogram of the image to perform some function. The example I'm going to use is histogram equalization, a process whereby an image with a poor, non-uniform intensity distribution is modified so that intensities are distributed more evenly. I'm going to go back to this bed-and-breakfast in Portugal and perform the equalization operation on it, and it results in this. What you can see is a great deal more detail visible in this image than in the original. Notice in particular the weather stains below the window sills on the second floor; they were virtually undetectable in the original image. So the equalization operation brings out a lot of detail that was absent in the original. This is probably a lot closer to the way that building really looked. And here are the before-and-after histograms of the intensity distribution. I've added all three color channels into each bar to simplify it, although in
practice the operation is done on each color component separately. You can see that in the before image there's a lot of white with some starkly contrasted black, and in the after image it's much more uniform, with a lot of different grades.

I could go on quite a long time about functionality, but we do have a limited amount of time, so I want to proceed to some examples of data structures and API. First I want to talk about the data types and layouts that we support in our initial incarnation of vImage. There are two different data types supported. One is an 8-bit integer per color component, or per channel; I'll use those terms interchangeably. The second is a 32-bit floating-point value per color component or channel.

We also support two different data layouts. One is the planar layout, whereby each channel is in its own array. If I'm using an RGB image as an example, that simply means that the reds, the greens, and the blues are all in their separate buffers, and if you're calling an image process to perform some function on this image, you would need to call it three times for all three color components. The advantage is that if you don't want to do the process on all three channels, you can do it on only one or two, as you wish.

The second layout is what we call the ARGB interleaved layout, where all the color channels are interleaved into a single buffer. We support at the current time a four-channel interleaved layout, which can be either four 8-bit integers or four 32-bit floating-point values. The advantage here is that it takes only a single call to perform an operation on all of these different color components. You can have an alpha channel as your first component, as is indicated here, with red, green, and blue as examples for the other three channels, or you can have four color channels without an alpha. Which case applies is given to vImage in a flags word that's passed to each function. So if
you specify that the first channel is in fact an alpha channel, then it will simply be copied to the destination unchanged, unless we're dealing with an alpha compositing function; that's a separate issue. If you indicate that it's a color channel, then the same process will be performed on it as on the other three channels. We do supply data conversion utilities to go between the different data layouts and different data types.

Now I'd like to go on to what is probably the single most important data structure, almost the only public data structure we have in vImage: the image buffer. As you can see, it's a very simple data structure with only four elements. There's a pointer to the start of the data, which would be the upper left-hand corner of the image; a height in number of pixels; a width in number of pixels; and then a rowBytes, which is the number of bytes from one row to another, or the stride from one row to the next. Pictorially, if the name of the vImage buffer is image, then we have image.data at the upper left-hand corner, and we have the height and the width. If you imagine that the white space to the right of the image is extra memory that's not used in the image but is just sitting there at the end of the row, then you can see that the rowBytes parameter includes that length in the stride. That comes in handy if you want to do 16-byte alignment on each row, for example. That's not a requirement of vImage, but it certainly may be helpful in your own work.

Before I go on, I want to make a distinction between the full image buffer, as shown here, and what we call a region of interest. The region of interest is that portion of the image which is going to be modified by the operation you're performing. In many cases, perhaps in most cases, the region of interest and the full image buffer will be the same, but they don't have to be. Here is an example where they are not the same. We have a subset of the image using the very same
data structure, the vImage buffer data structure. You adjust the data pointer, the height, and the width; the rowBytes remains the same; and you simply pass those parameters in. There is no copying required. That's one of the beauties of the simplicity and flexibility of this data structure: you can have the whole image or a piece of the image, and the same data structure is used. This therefore allows you to do, for example, tiling, if you want to do that to take advantage of caching, although we will also do that for you if you wish. It has quite a number of other advantages as well.

Here's an example of a very simple call to vImage, image equalization, where there are only three parameters. It couldn't get much simpler: the vImage buffer for the source, the vImage buffer for the destination, and then the flags word; the information in the flags word varies with each function. Notice that you don't have to specify what the data layout or the data type is, because that's implicit in the name of the function, in this case Planar8. Every function has four different variants: planar 8-bit, planar float, interleaved 8-bit, and interleaved float.

There are some functions that do require us to know both the full image buffer and the region of interest, and these are the functions that I referred to earlier as area processes. The components are all shown here. You have the full image buffer; the source ROI, which may or may not be smaller than the full image buffer; a convolution kernel, a matrix shown by the yellow rectangle; and then the destination buffer, the result. I'd like to go into the relationship between these things a little bit more, so this is a further discussion of buffers and regions of interest. I think we all know what the full image buffer is. In a call to an area function, morphology or convolution, the region of interest is not specified
by a second vImage buffer but rather simply by x and y offsets from the beginning of the full buffer. As you can see here, you indicate the upper left-hand corner of the region of interest by an x and y offset from the upper left-hand corner of the full buffer. The rowBytes is the same in both cases. You also pass a vImage buffer indicating the destination, which has a height, a width, and an independent rowBytes. Notice that we have not yet specified the height and width of the source region of interest, and that's for a simple reason: it has to be the same height and width as the destination, so we simply take it from there.

This is an example of one of these function calls, convolution. You have the source and destination image buffers, the offsets to the region of interest, and then some other information defining the kernel and a few other things that we need to know. This is probably one of the more complicated calls that you're going to run into.

We have three computational cases that we need to worry about when we're doing these calculations. To explain this, just keep in mind the four different elements that I'm talking about here: the full image buffer; the region of interest; the convolution kernel, which is simply a matrix; and then, in this image, the source pixel, shown by the tiny red rectangle there. If we are going to calculate the destination pixel from that source pixel, we need to do a matrix multiplication of the pixels in the region around the source pixel, as shown there. The first case is very simple, because the entire matrix is contained within the region of interest, so there's no issue about where the data comes from. The second case is a little bit more complicated. What happens if the computational matrix extends out beyond the region of interest? This is exactly why we need to know what the full image buffer is, because if it still remains in the full image buffer,
then we can use that data without further concern. The third case is the more complex case. What if the computational matrix goes even beyond the full image buffer? In that case we have to do something to substitute for the pixels that are missing. So we have an edge case problem, and we supply you in this instance with three different options to deal with these edge cases: background color, edge extend, and copy in place.

To demonstrate these three, I'm going to start with this as an original image. All the lines between the different colors are clean and smooth, and the edges are clean. I'm going to do a blurring operation on it, and the first time I do this I'm going to specify that the color to use for the edges is black: if we don't have a pixel in the computation, we'll use a black pixel. The result comes out like this. You can see that the colors merge together at the edges, and on the outside of the image it just fades off into black gradually. The other extreme is a background color of white. It ends up looking like this, with a white background, and you can see quite a difference there. So that was background color, the first option that we give you.

The second option we give you is edge extend, which means that we take the pixels at the outside border of the image and just extend them out, copying them out as far as we need to perform the operation. The result of that blurring operation is this, and as you would expect, you really don't see any change when you get to the edge of the image. It just continues on as it does in the middle.

The third case is copy in place. What we are saying there is that if we don't have all the data we need to do the computation at any point, then we won't do it; we'll just copy the source pixel to the destination pixel and be done with it. This is what that looks like. You have to concentrate on the edges of the image, and you can
see that towards the edges there is no blurring effect. Once the computational matrix goes off the edge, we just do a copy from the source. So those are the various options that we give you to handle the edge cases.

A couple of features that I haven't yet mentioned, or maybe I have. All of the Apple vector-accelerated libraries are optimized for all Apple processors. If you are, for example, running on a G3, then a form of any given routine that is not vectorized, but still highly optimized for scalar, will be chosen. If you're running on a G4 or G5, then an appropriately optimized vectorized version will be chosen. This is all done transparently to you, the caller. vImage in particular is multiprocessor safe. I should also mention that it's interrupt safe, if you take some precautions to make it interrupt safe. There are a lot of routines in vImage that call malloc to allocate memory. However, if you don't want it to do that, we give you the option to supply your own memory. The calls that need memory also have an auxiliary call that returns to you the minimum buffer size that we will need to do the operation, so you can call that, allocate your own memory, and then there will be no system calls during the course of the operation. vImage is a standard part of Panther. The data structures are unencoded. It's simple and flexible, and unlike a competitor or two I could name but won't, there are no license fees.

Okay, so that completes my portion of the talk. I'd like to bring up my colleague Ian Ollmann, who will talk about implementation techniques and performance.

Thank you. I wanted to touch on two subjects: mostly, what you can do to use vImage effectively in your apps to get the best possible performance out of it, and then, just for your own curiosity, some of the things we did to tune the functions that you get through the vImage sub-framework under the Accelerate framework. So, a couple of things
you can focus on, which I'll touch on. There are some memory alignment things that you can do. We don't require that you do anything in particular, but some things help, so I'll mention them. I'll briefly talk about tiling, and then also some multiprocessing and real-time considerations.

As far as alignment goes, on all of our PowerPC processors we keep our data available in the caches, and the data are blocked together in cache lines. On the G4 and G3 the cache lines are 32 bytes long and aligned to 32 bytes; on the G5 they are 128 bytes long and similarly aligned to 128 bytes. So if you just arbitrarily configure your buffer to fit in memory without worrying about how it's aligned, you'll probably end up with a set of pixels that are just sitting in a set of cache lines (I've drawn four here at the top), and there will be leftover space there. In certain cases, though, we find that we get a small performance loss for arbitrarily aligned pixels, so it's often worth your while to allocate your buffers in such a way that each pixel row starts at an aligned address. This is not required, obviously, and we understand that in many cases in your image processing work you need to operate on arbitrarily defined regions. Let's say the user is drawing a box on your screen; you don't have any way to control the alignment. So all of our code works just fine. It just works a little bit better if you're aligned.

One of the pitfalls you can run into with rigorously aligned pixel rows is a situation where each pixel row has a width which is an integer power of two. Small integer powers of two are not bad. Large ones can get you into a situation where the processor has difficulty distinguishing between pixels that are in one row and those in the row immediately above or below it, and that can cause some small delays when storing data and then trying to reload data elsewhere. The processor may decide it can't tell the difference, and things might go a little more slowly.
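The alignment advice above can be sketched in C. This is only an illustration under stated assumptions, not vImage code: `ImageBuffer` is a hypothetical stand-in that mirrors the four fields described in the talk (data pointer, height, width, row stride), and `padded_row_bytes` and `image_buffer_alloc` are made-up helper names. The 4096-byte threshold for a "large" power of two is likewise an illustrative choice; one way to sidestep the power-of-two pitfall is to add a little padding to each row's stride.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the four-field image buffer described in the
   talk (data pointer, height, width, row stride). Not the real vImage type. */
typedef struct {
    void  *data;     /* upper-left pixel of the image */
    size_t height;   /* number of pixel rows */
    size_t width;    /* number of pixels per row */
    size_t rowBytes; /* bytes from the start of one row to the next */
} ImageBuffer;

/* Round n up to the next multiple of align; align must be a power of two. */
static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

static int is_power_of_two(size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

/* Pick a row stride so that every row starts on an aligned address.
   If the stride lands on a large power of two (threshold assumed here to be
   4096 bytes), add one extra alignment unit of padding. */
static size_t padded_row_bytes(size_t width, size_t bytesPerPixel, size_t align) {
    size_t rowBytes = round_up(width * bytesPerPixel, align);
    if (rowBytes >= 4096 && is_power_of_two(rowBytes)) {
        rowBytes += align;
    }
    return rowBytes;
}

/* Allocate an aligned, padded buffer. Returns 0 on success, -1 on failure.
   The caller releases buf->data with free(). */
static int image_buffer_alloc(ImageBuffer *buf, size_t width, size_t height,
                              size_t bytesPerPixel, size_t align) {
    if (width == 0 || height == 0) {
        return -1;
    }
    size_t rowBytes = padded_row_bytes(width, bytesPerPixel, align);
    /* C11 aligned_alloc: the size is a multiple of align because rowBytes is. */
    void *p = aligned_alloc(align, rowBytes * height);
    if (p == NULL) {
        return -1;
    }
    buf->data = p;
    buf->height = height;
    buf->width = width;
    buf->rowBytes = rowBytes;
    return 0;
}
```

Calling `image_buffer_alloc(&buf, 1024, 768, 4, 128)` would give a 128-byte-aligned buffer (the G5 cache-line size mentioned above) whose stride is 4224 bytes rather than 4096; passing 32 instead would match the G3 and G4 cache lines.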
So if you find that you're allocating buffers that are, for example, 4096 bytes wide, which might happen if you had a 1024 x 768 full-screen image at 32 bits, then what you need to do is perhaps add a little padding on the end of each row. You can do that very easily with our API; we support that. Just make sure your rowBytes field is a little bit wider than the width.

Tiling, of course, is a commonly used technique in image processing. The basic approach is that you divide up your image into smaller segments which are cache sized, and this allows you to operate on segments and keep them in the caches while you're working on them. So, for example, if you had several chained filters you want to do in series, then rather than apply one to the whole image, then do the next filter on the whole image, then the third filter on the whole image, you could pick a small subset of the image and do all three to that. That means that for the second and third filters you'd be very likely to have the pixels already in the caches, so you're less likely to pay any penalty for going out to RAM to get them.

So, a few tips on how to do that. We found that tiling is only helpful some of the time, not all the time, so don't waste your time if it isn't. We found it very easy to find out whether tiling is going to work for you: just push a small image through your code as it is, unoptimized, then push a big one through, and take a look at how many pixels per second you're able to calculate in each case. If there's a big difference, then maybe tiling will pay off for you, and it's worth your time to go through with it. In our experience, tile sizes that fit in the L1 cache, which are about 16K to 32K, work best. Wide is better than tall or square, and it can be very wide: we found cases where only 16 pixels high but 1024 wide is the optimal case. We also do some tiling in some of our functions. If you're going to do your own tiling, in certain cases we imagine, although we
haven't found any examples of it, that these two things could interact adversely. So we've provided you with a flag you can pass, kvImageDoNotTile, which basically tells us not to tile because you're going to do it yourself.

Another thing you can do is take advantage of our planar data format. Originally we were thinking of providing only planar, but we had so many requests for ARGB that it's a feature. However, there are many drawbacks to ARGB, and if you use planar data formats you can get around them. First of all, with ARGB you may not wish to operate on the alpha channel, so using ARGB formats in that case is twenty-five or thirty-three percent more work compared to just operating on the three color channels. Going with planar allows you to do just the work that you need to do, skip over the other stuff, and touch less memory as well. Another nice thing about planar is that it's a limited form of tiling, in the sense that you've now split up your image into three or four smaller parts. In certain cases this may allow you to stay entirely in the cache rather than half in and half out, so you can push through several filters with just red, for example, and then move on to just green, and do pretty well. One of the problems with geometric tiling, which is what I presented on the previous slide, is that if you've got something with a kernel matrix that needs to be applied, where for each pixel you need to look at all the pixels around it, that can make tiling a little bit tricky. With this approach, however, you've got the entire image for each color channel, so that solves that problem quite nicely. So it works pretty well.

And then finally, a bit of an implementation detail: a lot of our ARGB code will take the ARGB interleaved format, bring it into planar, do the work, convert it back, and then give you the result. All of that happens in register, so it's pretty fast, but it's nicer not to have to do it at all. So if you use planar data
you probably will get somewhat better performance, and we often see that the difference is about thirty percent.

So, multiprocessing and real-time issues, for when you have real-time constraints or want to get the most use out of multiple processors. A couple of things. All of our functions are single threaded. We don't go to any effort to try to use two processors for you. We made that decision because we felt that you knew better where your data was, and your timing, and you could do that better than we could. So we're all single threaded, but we've made it as easy as possible for you. We're MP safe, and you can call us reentrantly. If you're calling multiple functions over the same piece of data, then you need to do your own locking schemes to make sure you don't have any conflicts there.

For real-time needs, we've gone through a considerable amount of effort to make sure that we don't call anything that's going to block on a lock or do anything else that's going to give you latency problems. Some of our functions do require additional memory to do their work, so they take a temp buffer. You can just pass NULL for the temp buffer if you don't want to worry about it; we'll call malloc and it'll get taken care of for you. But if you don't want to see malloc, then you need to allocate your temp buffer ahead of time and pass it to us, and we'll just use it.

Finally, for MP, you can get better speed through tiling. Just one small comment about that. The one issue you're going to face is if you tile vertically, which is not what's shown here. I mean if you tile so that one processor is working on, say, the left half of the image and one is working on the right half of the image. Then at the border in between you may have a lot of crosstalk between the processors: one modifies one pixel row, the other one might have to read it, and there are some communication issues. So you're better off if you split the image top half and bottom half. And of course, if you're going to do that multiprocessing, you need to
be aware that for functions that use kernels, where we need to look at multiple pixels around the ones that you're interested in, if one CPU is busy manipulating those pixels you could get [unclear]. So that's just something to watch out for.

Then I'd like to spend a few minutes talking about the optimization techniques we used, in the hope that you may find them useful in your own code. Essentially, we developed over the years of working with AltiVec a general theory about the best way to go about pushing data through the processor, and essentially this means pouring data down the processor's throat as fast as possible. We'll talk about that in a minute. We also don't spend a lot of time guessing about what's wrong. In fact, we don't even use standard profiler tools of the kind you would have found ten years ago. We try to get as much information as possible about what's going on and just fix the problem that we see. We do our own tiling and such.

So, as far as data-rich programming goes, current processors are extremely parallel machines. The G5 is much more so, as I'm sure you've gathered from the earlier talk. Even on the G4 you can have a number of instructions in process: ten scalar floating-point ops using fused multiply-add, forty vector floating-point ops, all these things running in parallel. So if you don't have that much data independence, if you aren't processing that many independent streams of data concurrently, then you're just wasting processor cycles. What we try to do is make sure that we actually have that much independence going on at all times and push that much data through the processor concurrently.

In practice, in order to achieve that, there are the simple things. You can unroll loops. We're not doing that to get rid of loop overhead; we're doing it to make sure that we have 8 or 12 or 50 or however many parallel calculations going on concurrently to keep the processor full. We identify and eliminate compiler aliasing. If you have pointers pointing to buffers, the compiler
might not know how these overlap, and it might decide to keep to the load/store order, load, do operation, store, load, do operation, store, in strict order, and that will kill your parallelism too. So we look for those and get rid of them. We move all the loads up to the top, do the work, and put all the stores down below, that kind of thing. You want to eliminate LSU bottlenecks. A lot of code spends all its time loading data in and out of registers, so we look for ways to merge many small operations into big ones, and that way we can spend most of our time actually doing work. If you have certain instructions that take a lot of time, six, eight, ten cycles to get through, then we try to find enough work to keep us busy while we wait for that to happen. We avoid branching like the plague, so we use a lot of select and other kinds of things to make sure that our code flies in a straight line. As I mentioned earlier, we try to keep all the execution units busy at the same time, so if we're busy doing something in the floating-point unit, this might be a good time to also be loading data for the next loop. So we schedule things pretty aggressively. And finally, we prefetch our data, just to make sure it's in the cache when we need it, so we don't have to take a long stall waiting for data to appear out of RAM.

As far as our tiling goes, we only did it for some functions, because we found that only some functions benefited. Generally what we did was run the experiment I suggested earlier: run a small image and a big one and see whether there was some improvement for smaller images. We also took a look at different tile shapes. Here you see a graph where I've taken a 3 by 3 kernel and a 21 by 21 kernel for the same function and looked at how much time it takes for different tile widths. The tiles are all the same size; as we widen them, we shrink them vertically at the same time. So you can see that, you know, there's some
advantage to a particular tile width. In this case a width of 1024 or 2048 is probably the optimal case, so that's what we chose. And then of course we tune things per processor, so we actually end up running this experiment several times to make sure that the tile sizes we pick for G3 are optimal for G3 and the ones we pick for G5 are optimal there. Finally, just to stress: like everything else in the Accelerate framework, we vectorize. Our intent is to use the Velocity Engine across the board, everywhere we can, so you'll see that in the final product we're going to have vector code pretty much everywhere. The only exception is going to be histogram, which is a class of functions that just don't work very well with the vector unit. Typical speedups we see over scalar code are four to ten times. If you haven't tried vectorization, I suggest you do. That doesn't mean that our scalar code is sludge; we make sure that runs as fast as possible too. And in a couple of cases, such as our resampling filters, we use the extra speed to deliver a lot better image quality, so hopefully you'll like that. This is a beta release, so I haven't quite finished every bit of vectorization I'd like to do, but certainly we're working hard at it. Finally, experimentally driven optimization: we never guess, or if we find we are guessing, we try to figure out how to run the right experiment to find out what's actually going on. So obviously, always profile. I'm sure you've heard that before. You can use tools like gprof or Sampler, but those only give you function-level information and only tell you which function is performing slowly. They won't tell you why, or what part of it, or what instruction in particular is getting a stall. So actually most of our work is done using CHUD, or Shark, which they're going to talk about later on today; I'll invite Eric to give us a short overview. We also use CPU simulators like SimG4, and so these things can be used to actually
narrow in and directly tell whether we're running into cache misses or paging or any of a number of other problems where historically it has been very hard to tell what exactly is going on, and you're just kind of guessing. But we don't; we zero in on a problem and solve that, and that lets us very efficiently get to high-performance code. And then finally, if you aren't already, I urge you to inspect your compiler output for functions that really make a difference, since we are almost always surprised by some of the mistakes we make. So with that I'll introduce Eric Miller from the architecture performance group, who's going to come up here and tell you a little bit about CHUD, which is the tool that we use to tune our code. Good morning, I am Eric Miller with the architecture performance group. As Ian said, the CHUD tools are one of his favorite toys, and I'm glad he put them on the list, although I would like to see him even reverse the order and put them above gprof and Sampler, but that's just me. So what are the CHUD tools? Well, they are a suite of performance analysis tools. There are several that are interesting; probably the most interesting we'll get to in a minute. The idea behind them is that they give you low-level access to the performance monitor hardware counters in the processor and the memory controller, and then we have implemented some software versions in the operating system that behave exactly the same as the hardware performance monitor counters. The idea is to help you find problems and improve your code performance, and the best part is they're freely available on the web, and they're also on the developer tools CD. One of the neat things about the CD this year is that you'll be able to install the tools and immediately we have a CHUD updater, which is very similar to Software Update. CHUD updater will automatically go out and check the FTP site for new versions of the tools, and then you can download them at your convenience. We recommend that you do this because the version
that's on the CD is probably a week and a half old, and we made quite a number of improvements in those eight days. We generally will put out a release every week, at least during the beta period, and probably slightly reduce the frequency later once we have a gold master. So there are three main tools. The first tool is a profiling tool called Shark, which Ian alluded to. It is an instruction-level profiler. It can do many things that we'll get to in a minute, and not least, as Ian mentioned, you can inspect your compiler output: Shark can produce the disassembly from your source code very readily. In fact, it's one of its key features. MONster is a spreadsheet for performance events, and by that I mean you can collect information about many things the processor is capable of measuring internally, like cache misses and instruction counts, cycles executed, these sorts of things, and you can look at them in a nice tabular spreadsheet form. Saturn is a call graph visualizer, as it says. The idea there is it's kind of like using gprof: it goes through and actually instruments all your application code and then produces the results of how often the functions get called, but you can also have auxiliary information with regard to performance monitor counts. We also have several tracing tools. Amber, when you run it, can collect every single instruction that is executed on behalf of your application on the processor and put that into a file. Then those files will be consumed by acid, which is a tool that we wrote in our group, and by SimG4, which is produced by Motorola, and SimG5, which is produced by IBM. Those are cycle-accurate CPU simulators, and of course Ian and his team used the SimG4 product quite readily. The other thing you can do with the CHUD tools is instrument your applications, and along with that, I'm running out of dots on the slide, you can also create your own application performance analysis tools
using the CHUD framework, because that's the exact framework that we developed in order to create Shark, MONster and Saturn. So I mentioned performance counters several times. What are they? Well, they're a series of dedicated special-purpose registers, actually in the processor and in the memory controller, in the G4 and G5 systems. What we can do with those is set them up to count and record what we call performance events: things like the number of L1 cache misses or L2 cache misses or L3 cache misses, or instruction counts, instruction misses, execution stalls, page faults in the operating system. There are a plethora of events. In fact, on G4 you have something on the order of maybe 200 events you can measure; on G5 there are literally thousands of events that can be measured. So we use the CHUD tools, and in particular the CHUD framework, to configure and control all the PMCs. I'm not going to do any demos this morning because we're pretty short on time, but I just wanted to mention Shark, because all you do to use Shark is push the start button and it will profile the entire system. It defaults to a time profile, and what that will give you is, when you select your application from the list of profiled threads or processes, you'll see where in your application, in relation to your source code, you spent your time; it will highlight it for you. If you do an event profile, suppose you selected CPU cycles, Shark can tell you exactly how many cycles were spent in your code for a particular line of code. And Shark captures every single thread on the system: any drivers or kernel extensions, the kernel itself, and all the applications that are running at any given time. The best thing about Shark is it's very low overhead, as are all the CHUD tools. You can actually set the time profile down to a minimum of about 50 microseconds per time sample, which is a couple of orders of magnitude smaller than you can use
with Sampler. It also gives you an automated analysis, which will show up as this column of exclamation points beside your code. So we annotate your source code; you click on these annotations and it'll tell you things like: this loop has a non-changing variable and it's serialized, so you may want to move that variable out of the loop; or this loop is a good candidate for AltiVec or parallelization because there aren't any data dependencies. We do static analysis, and this can lead to the surprises that Ian mentioned from the compiler, because it will actually show you the disassembly that the compiler generated on your behalf and annotate that as to how many stalls you'll have and how many delays might be involved from other aspects. And new features this year: well, let me say this, Shark was formerly called Shikari in the CHUD tools from last year, so it's been renamed Shark, with a lot of new features. One of the features is that you can now save and review all the sessions that you collect for later analysis, and there's also a command-line version that you can use to instrument with scripts and things. So we use this command-line version of Shark whenever we have an old Unix scientific application that just runs in the command line and has a launch script: we can just script Shark to begin, then run our command-line application as normal, and then script Shark to end. Here's a little photograph of Shark. What you can see in the left-hand picture would be the result of a time profile. In this particular picture we were running a test, and it turns out that the square root function was forty-two percent of the time. At the bottom of that left-hand picture you can see there's a little process menu, and that lists all the processes that were running on the system when you did the trace. You can choose from any of those, and normally you would choose your own. This is a screenshot from last year's demo, which was called
Flurry, which is the screensaver. In the lower right window you can see a piece of source code that's annotated, and the bright yellow lines show you where your samples were hitting in your code, and it's 24.5 percent about midway in that image there. Then you can see the exclamation points on the right that I mentioned. In this particular case it was telling you that using floating-point division could be quite costly, and so you should probably try to remove that and do a multiplication. If you were to double-click on that hot line there, it would then show you the assembly that that line refers to, and it would have some more detailed annotations there. The next tool is MONster, which is the most direct way to configure and set up the performance monitor counters. With the CHUD tools in general there are timed intervals, so you can select a certain number of milliseconds or microseconds, or seconds for that matter, that you would like to collect per sample in the hardware. You can also collect data based on other events: you can set it up to collect a sample every so many cycles, or every so many instructions completed, or every so many cache misses. There's also a third way that actually is related to both, which is called a hotkey. All the CHUD tools have a global hotkey: in the case of Shark it's Option-Escape, in the case of MONster it's Command-Escape. If you use those keys, you don't actually have to have the application in front of your other application. So if you have your application running full screen, then you can just use this hotkey to activate Shark or MONster or Saturn and do the collection without having to bring it to the front and disturb your process and affect your sampling. One of the main things about MONster is that it is a big spreadsheet of event collections over time, per sample, and that's kind of nice, but a lot of times you'll be interested in
combining results. For example, you can collect a lot of information from the memory controller about transactions, reads and writes, and you know the amount of time because they're sampled over time. For example, you can take those transactions and apply a calculation to them; these we call shortcuts. So you can say every read is 16 bytes, so I take the number of reads times 16 bytes as the number of bytes, divide by the time, and I have the bandwidth. You can set up these calculations in MONster and have additional columns in your spreadsheet. These calculations are just standard infix mathematical notation with parentheses, and it's basically a four-function calculator. There's a table, and you can also draw charts; Shark is also capable of drawing charts. Also new in this version of MONster, you can save and review the sessions, and the nice thing about this is you can review sessions on a system that you don't have in front of you. So if you had a G5 at your disposal, you could collect data with Shark or MONster on your G5, then take it back to your laptop or your desktop G4 or even your iMac and review those results and print off the charts and those sorts of things. There's also a scriptable command-line version of MONster, which is new this year. Here's a screenshot of MONster. On the left of the leftmost image there's a column where you could click on those entries, which will highlight those columns in the data, and when you highlight columns of data you can then just press the draw chart button and it would result in a chart. There are many options for charting: there are bar charts, various colorizations, line charts with markers, logarithmic scales, direct scales, samples over time, and samples as a single x-axis, just per-sample plots. You can see in this particular case that what's been highlighted are some of the shortcuts. A load/store session was done, so all the load instructions were collected, all the store
instructions were collected, and all the regular instructions were collected, then percentages of each were calculated along with that. Each sample is listed horizontally in the table there, and vertically is each of these shortcuts, so then you just highlight those columns of shortcuts and we plot the percentages, which is what you see in the second picture. There is quite an extensive set of sampling controls to configure the performance monitor counters in both Shark and MONster. So the last thing is a new tool we call Saturn, which, like it says, records your function call history. The way we do this is by instrumenting the functions at entry and exit. With GCC there's a compiler flag that you throw and do a build, and it'll inject the Saturn entry and exit prologue and epilogue functions in every function in your application. Now, to be completely thorough you have to go through and recompile all of the frameworks and libraries, and that's similar to gprof, which is really not that fun to do, so most of the time we like to focus just on actual application code. But the nice thing about Saturn is that once you have this function call history you can visualize that call tree. Here in this image you can see that the call tree for CSE under main has been highlighted, and you see the red dashes in that stack of bars there; that's where that function is called and run. What you would want to use Saturn for is, in particular with C++, where you have a lot of call depth: if things are very skinny and tall, you're spending a lot of time calling functions and not doing any work, so you want to try to avoid that. You want to have a nice flat profile. You can also collect call counts, PMC event counts and execution times by using the performance monitor counters with the instrumented functions that are injected at entry and exit of each of your functions. So as I mentioned on the first slide, we've got the
instruction tracing and simulation. Amber is the instruction tracing mechanism, and the resultant files are in a format that's called TT6. These TT6 files are consumed by the other programs mentioned on this slide. acid is our internal trace analyzer, and actually the acid trace analyzer is the parent of the code coach and the parts in Shark that explain why you have bottlenecks and what you might do to change them; these come out of acid. It can also do a couple of things on its own: it will give you the memory footprint of your application as a gnuplot file, and you can find instruction sequences that may be an issue and then try to remove those through the informational notes that it gives you. SimG4 is a cycle-accurate simulator for the PPC 7400, which is an old Max processor from early G4 systems, and SimG5 will be available in the near future, and that will be a cycle-accurate simulator for the new PPC 970. These can be quite handy in tracing particularly complicated performance issues, although the output of SimG4 and SimG5 requires a terminal window that maybe would require a 50-inch monitor. Lastly, the CHUD framework is available to, like I said, instrument your source code. One of the things you can do with instrumentation is make one function call to start and stop MONster or Shark sampling, so you can sort of put a caliper around your interesting code. Suppose you find a piece of code that Shark says is a hot spot and you want to get more detail than just a trace. You can add code: it's CHUD start remote performance monitor and CHUD stop remote performance monitor. What happens is you set it, you just click a key in MONster or Shark, and it will be in remote mode, waiting for messages from your application and your application only, so you can just collect the data for your interesting code. You can directly read and report on the PMCs by writing small
pieces of code, either instrumented in your application or written as a separate standalone application. As I mentioned, you can write your own performance tools and do all the things that need to be done in order to create a performance tool like Shark, which is control the performance monitor counters and collect the information about the system hardware, which can be handy in a lot of ways: you can know that you're on a G5, you can know that you're on a G3, you can know the bus speed of the system, the amount of memory in the system, the number of processors. You can also modify some of that information. There is also an HTML reference document online that describes all the various functions in the CHUD framework. Here's a small example of code with the CHUD framework. I mentioned that you can instrument your code to start and stop Shark or MONster, so you just have to include the chud.h header file, initialize, and then acquire the remote access. Start the remote performance monitor with a label that will show up in your output in Shark or MONster, so you know which instrumentation it was. Then you run through your important code, stop the monitor, and release remote access. Secondly, a slightly more complex mode: I mentioned you can write your own performance monitoring tool. You initialize, acquire the sampling facility, turn on some special filters, maybe mark your process as the only one to be counted, and then you set the events. In particular, you say both CPUs, processor performance monitor counter number one, event number one, which happens to be cycles, and event number two, which happens to be instructions. Clear the counters, start the counters, your important function executes, stop the counters. Then you collect these results, and you can perform a calculation and get cycles per instruction in your own application. For more information about that stuff, you can get your own download at this web address, developer.apple.com/tools/debuggers.html, and then you can always
contact myself and my colleagues on the CHUD tools development team at this email address. We try to be pretty responsive, and that's probably the best way to get your feature requests and complaints into our queue. Let's see what's next. Oh, I guess I'm done, so let me bring up Mr. Keithley. That'd be great. [Applause] So, the roadmap: a couple more sessions today, obviously one specializing in CHUD itself, but we should move on to Q&A pretty quickly; we're into that time right now. Here's some contact info and our reference library information.
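
As an illustration of the optimization techniques described in this session (unrolling to keep several independent calculations in flight, removing pointer aliasing, grouping loads ahead of stores, and using selects instead of branches), here is a minimal sketch in C. It is not code from vImage. The names `scale_clamp` and `clamp_select` and all of their parameters are hypothetical, and whether a given compiler actually turns the ternaries into select or min/max instructions is an assumption that depends on the compiler and the target.

```c
#include <stddef.h>

/* Hypothetical helper: clamp v to [lo, hi] without if/else blocks.
   Compilers often lower these ternaries to select or min/max
   instructions, but that is not guaranteed. */
static inline float clamp_select(float v, float lo, float hi)
{
    v = (v < lo) ? lo : v;
    v = (v > hi) ? hi : v;
    return v;
}

/* Hypothetical example: dst[i] = clamp(src[i] * gain, lo, hi).
   restrict promises the compiler that dst and src do not overlap,
   so it is free to reorder the loads and stores below. */
void scale_clamp(float *restrict dst, const float *restrict src,
                 size_t n, float gain, float lo, float hi)
{
    size_t i = 0;

    /* Unrolled by four: the four calculations are independent of
       each other, so they can proceed in parallel. */
    for (; i + 4 <= n; i += 4) {
        /* All loads at the top. */
        float a = src[i];
        float b = src[i + 1];
        float c = src[i + 2];
        float d = src[i + 3];

        /* The work in the middle, with no branches. */
        a = clamp_select(a * gain, lo, hi);
        b = clamp_select(b * gain, lo, hi);
        c = clamp_select(c * gain, lo, hi);
        d = clamp_select(d * gain, lo, hi);

        /* All stores at the bottom. */
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }

    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; ++i)
        dst[i] = clamp_select(src[i] * gain, lo, hi);
}
```

Calling `scale_clamp(dst, src, n, 2.0f, 0.0f, 1.0f)` doubles each input and clamps the result to the range 0 to 1. The unroll factor of four is arbitrary here; as the talk notes, the right number depends on how many independent operations the target processor can keep in flight, which is best found by experiment.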