WWDC2004 Session 307
Transcript
Kind: captions Language: en ladies and gentlemen please welcome performance engineer nathan slingerland right yeah good afternoon welcome to God shark so here's what we want to talk about today obviously we're here to to talk about the big shark so we want to introduce shark to you if you haven't used it yet all right for those of you haven't had a chance to run it we want to go over the basics show you how simple it is to use and for those of you who are longtime shark experts we want to want to show you all the great new features we've gotten shark for so and then of course if you've seen you may some of you may have seen the demo on Monday in the devtools keynote we want to go a lot more in depth with with what we did there to optimize celestia using shark so let's go so what is shark well it's been called apples best kept secret but not anymore starting this year we you know in this year's WWDC we've had a big push for this and we want you all to know that this is a great way to find and fix your performance problems so what is shark well it's a really easy to use profiling tool there's a GUI version as well as the command line version the command line version is very useful for scripting performance regression testing so you can you can archive your performance analysis for later review it see how you've made progress as you continually optimize your application and public beta is available on the web today of shark for and chud for and off this website off the developer tools website and it runs on Panther and tiger so that means yes you can use it on your development machines right now right and of course we support basically any kind of executable Makos CFM compatible with all the compilers that you can use in Mac OS 10 X 10 debug format stabs format so whatever you can throw at it shark and show you what's going on all right so so let's see how easy it is right what do you do to get started with shark it's just one click there's just this main window with one button start and stop I mean this is the typical way that you you start profiling and stop it with shark and you're on your way by default we selected time profile but if we also supply a rich set of other preset configurations you can also create your own configurations for doing customized performance analysis and by default we're going to profile the entire system but some profiling configs for up to make more sense to target a specific process so for example Malik tracing which we'll talk more about later you want to pick a target process or static analysis you want to pick a particular executable file sent to look at but for the most part you don't even need to go here you can just click the start button and go so once you've selected your your configuration or just left it at the default probably you're going to do a time profile right so what is that this is the most common workflow basically your time is it's just it's just like optimizing application performance you want to spend your time where you can affect application performance the most so that's what ever functions take the most wall clock time or whatever pieces of your code are taking the most wall clock time so we're using for time profile this is a statistical form of sampling and we're taking by default a thousand samples per second and each sample contains you know back-trace so how you got to a particular location in your code which process which thread and Sarah so all this information and we collect that and when we combine it we can create a profile so this is going to show you over the sampling period where you were spending time in a given process or thread or overall in the system and we're capturing everything so this includes the driver any drivers any kernel code that you're this executing applications so no matter what kind of developer you are this can help you profile your code and we're able to do this because we have also have a kernel extension right so we're sitting in the kernel capturing these these samples and this gives us very low overhead we really don't perturb the code on the system hardly at all so once you've got a profile you've hit start you hit stop and you run whatever it is you're interested in you're presented with this by default the heavy view and this is really the bottom line this is showing you starting from the top the most sampled to the least sampled function in the in the selected process by default we select the heaviest process on the system so a lot most of the time this is your code this is your application that you're trying to look at so that should you were the heavy view showed you where the hotspots are the tree view is going to show you where the hot code paths are so how did you get to those important functions right so that's just this tree button here at the bottom and new in shark for heavy entry simultaneously so you can you can use one to help you navigate the other so those are two ways we use very commonly to look at high-level problems in our code the third way is the enhanced chart view so this chart view has been there a while but it's been greatly enhanced in shark for basically what we're looking at is call stack depth that's on the y-axis versus time and the x axis and this can give you a really good idea of a couple things first of all I can show you patterns in your in your code right so patterns in the execution of your code and this can be good to look for repetitive behavior and we'll talk about more about that later in relation to malloc tracing and function tracing which we'll get to it also is a good way to look for you know if you feel thin narrow spikes in this call in this chart view that usually indicates that you're spending a lot of time and function call overhead right because you had to call many level money function levels deep but you did not actually doing all that much work once you get there so this can be you know we see this a lot in object-oriented modern you know layered code where we just we have everything hidden in layers and so sometimes I can hurt performance so for the colors here blue is the user space if you may see there are a few maroon colored stacks those are colonel and the yellow call stack is what's selected so that's what's being shown at the the right hand side there is the current call stack lightness doesn't mean anything except for layers so it's just indicating as you see these gradients in color each each color change is indicating a new function right vertically so those are the three main tools right that shark is providing for looking at your application but you know a lot of you probably are working on some pretty sophisticated code bases and you have some complex function call behavior and you really would like to you know you want to understand the performance of that application so shark for has introduced a lot of new features that are trying to help with high level analysis and understanding these more complex application architectures the first is is this color by library and what this is going to do is just going to color each different module that your program is using different frameworks or libraries a different color and it remembers the color inconsistent between the different views and this is an easy really quick visual way to see the difference between the modules or visually to see where you're spending time in what modules okay so another common problem you probably will see in your code is this is a tree view and you may see that you you have this important intermediate function this interior function for example be in this example and it's you'd like you see it in two different spots in the tree all right and you'd really like to know how much time total is being spent in B and its children right because we tend to subdivide work so in this example we have be calling c1 c2 and c3 and not a whole lot of time is being spent in and be itself so what we've added is a canonical view and if you select that we're going to see not just you know in the tree of you normally we see main calls flu calls bar well in this case we're going to see all the symbols all the interior symbols every symbol that shows up and it's total contribution right so in this case B and what it calls total up to 100 at the time where you're being spent so this is a good way to understand how much time you're spending in a function and its children so that kind of raises this question right traditional profiling tools tend to look at performance just like the same way that the program is broken down into functions right and that works pretty well when you have a few functions where you spend a lot of time we're in a few heavy function but if you have just a lot of functions in your code you've subdivided the work a lot and you're reusing code everywhere or your you spending a lot of time in system libraries that really isn't the best presentation and that's what data mining allows you to do it's going to allow you to customize how you're looking at a profile and by customizing it you can help ease your understanding of it right so shark provides a couple of basic data mining operations the first is exclude similar library and what this is going to do it's going to hide a similar library that you select and so we're not just taking the sample doubt it's important to remember this we're not just removing these samples right so if you hide the function foo any samples that landed in foo or just being assigned to whatever called it right so if you have a piece of code that call square root for example and you hide you exclude square root it's going to just assign the time to that function your function right and really that's a lot easier for a lot of us to understand is time spent in our code right because we can't do a lot about system libraries typically so along those same lines is flattened library we you know we are familiar with the pieces of app kit or core foundation or the other system libraries that we call right those function entry points but a lot of times you'll see interior functions inside of those libraries that you really don't know much about and you really don't care much about right about the implementation interior to that so flattened library is going to do that it's going to hide everything inside of a library except for the entry point so you're going to just see all the time it show up in the entry point which is what you understand right because that's what you called from your code I finally focused symbol if you typically navigate very deep call trees or if you just you you know you have a certain symbol that you only want to look at and what it calls that's what focus does it's going to just start the display of the the various call path from that symbol so these are different ways to really to customize how you're looking at the profile and internally in Apple we've found this very helpful in looking at performance analysis for tiger so those are the detailed data mining operations there are more macro level off up operations that are also eat very useful flatten recursion so you've probably seen this if you've ever looked at you know a piece of a piece of code you've seen a recursive call may be in you know NS view or some other even your own code and really that's not the most informative way to look at a piece of function right that's recursive function you'll actually like to see all those recursive calls collapsed into one and see all the time together and what all the recursive calls actually do collapse together so the flattened recursion check box is in the this advanced settings drawer that we've been looking at and you check this check box and all the recursive calls anywhere in the profile just get collapse to one so we talked about flatten a single library this will this flattened system libraries will slide in all of the system libraries that occur in the profile this is a really quick way to hide all that interior code that you don't really you don't really want to see and so another very useful and is exclude entries without source info usually when you're working on an application in your system you've turned on debugging information because you want to see symbols and shark and you want to get all the source information well the good news is that that's almost the only thing in the system with that information in it so if you exclude everything without that information you're going to only see your code it's going to perform that exclude operation on any symbol that doesn't have source information associated with it so this again this will just focus in on what you know on your source code so you know I encourage you to play with all these different officer up operations and you'll see very quickly what they do and how they can help you finally exclude supervisor you know a lot of us are very familiar with the colonel and what is doing inside and you know how did I get these for these red samples and Mike in my profile this is just a quick way to get rid of those samples and you know keep them from being a distraction while you're looking at your code okay so those are the high-level optimization tools that shark provides once you you've kind of gotten those out of the way you should you should be able to at least understand your code at a level where you want to optimize a specific function right so that's where we're shark shines for function level optimization right the most basic thing that shark does for you as it shows you your source code if you double-click on any symbol on those profile views that we talked about heavier tree it's going to show the source code and what it's doing here it's coloring the source according to how many samples landed on a particular source line so the brighter the yellow the more samples landed there that's a really quick visual way to see where you're spending time in your source and you also note that along the in the scrollbar it the coloring in the background of the scrollbar matches the coloring in the background of the table so this is a really very easy way to navigate you know quickly between hot spots in your code and finally you know to make the the profile and optimization cycle complete there's an edit button and this is going to show your source code inside of Xcode it's going to bring you back to Xcode so the typical optimization cycle right you you profile the shark you go to the code browser you see a problem you go you click Edit go back to Xcode recompile and so on so it's this iterative process where you continually go back and optimize again and again until you're happy with the code another great thing that shark will do for you is it will provide you with optimization tips so if you see one of these exclamation points and you're in the code browser and your code that means shark has a tip for you hook so it has an idea about what might be wrong or why you might be spending extra time on that particular line of source it so if you click on the on the ! it'll bring up this help balloon and finally in shark for we've added this ability to look at both source and assembly side by side right so now rather than having to jump back and forth or try and figure out how the you know what the compiler generated for a particular line of source code when you navigate around in the source viewer you'll see very quickly okay you know the highlighted line if you select here we have a line of source selected on the left and there are maybe three or four instructions there that are selected on the right and those are the instructions that go with that source wine so this is a real quick way to see what instructions were being generated okay well that's great but I don't know any PowerPC assembly well here you go right there's this online help you can you know it will follow along as you navigate through the assembly and it'll tell you in English you know here's what this instruction does and you know assembly languages just like any other programming languages has these simple operations and you can read about it here in this online help guide right so that's time profiling those are the tools that we use the time profiling right when you but you'll see that we also use these same tools over and over but with more sophisticated types of profiling which shark also provides some shark for we've added mallet racing and what this is doing is every time you an allocation happens in a target program it's going to take a sample it's going to remember the call stack and also how much memory was allocated so it's not shown here but there's also you can look at this in the profile view and say okay this function over the sampling run time allocated this much memory and that's useful for tracking down you know overall how much memory was allocated in various function but another interesting thing to do is to use the chart view to see patterns in the program behavior because this is not a statistical profile this is an exact trace of the malloc function and most interesting program behavior does some sort of allocation you can see patterns in your in your code in the execution of your code so here what we're looking at this is the dock as we mouse over and it kind of magnifies different icons or probably this we're looking at three icons here right and so you can see this repeated behavior sometimes in like in the case of the doc well maybe we expect that we expect repetition but this can also be a good clue is when you're doing extra operations that aren't me if necessary right so mallet racing is is a it's more general than just looking at where memory with allocated and a more general form of Malik tracing its function tracing so rather than just tracing the allocation functions in your code you can trace any functions in your code and again this is an exact trace so there are some predefined presets for 500 api's typical Network api's locking api's but you can also add your own functions right so if you do if you have a set of functions that you want to watch in your code and you want to measure exact function count and then the amount of time that the function took this is a great way to do it you can you just attach to your function or attach to your application and select the functions that you want to watch and this is in you know this is measuring each function calls time very accurately with the PowerPC DEKRA manner but without having to write any instrumentation code in your application the finally event profiling so if you are you long time shark users this you know we've had this for a while but this is also a very good way to correlate hardware and operating system events so things like cache misses and the hardware or virtual memory faults in the operating system with your code all right so here in this example we have l1 and l2 data cache misses but we're profiling a you know event profiling along with our normal time profile and so we see two columns have been added to the profile each of those counts l1 and l2 cache misses and also in the chart view we can see there's a chart above so you can track the l2 cache miss over time how much how is that changing over time in your application so and there are literally hundreds and hundreds of these performance events available finally in shark for java profiling so everything that you could do with you know CFM and ma co applications is also available for java now time and allocation tracing function tracing and it's going to show if you double-click on the Java function and source codes available it's going to show you the Java source in a code browser so oh you also you know those are the kind of the built-in configurations that we've been talking about you can create your own right using this configuration editor and you know one of the hard things that was there was in shark 3 was finding where performance counter events lived right so which pnc which performance monitor counter there's an event that counts cache misses live on well you'd have to either go to the processor user guide and look it up or you you would you know you just kind of blindly search around well now there's a text search for the performance event so you can type in cache miss or cash and it will come up with a list of which PMC's have an event that would that substring in it okay and in shark for we also added network profiling this is using rendezvous you can set up you know your your ex serve in your closet or you know you have a HP see cluster and you want to monitor what's going on in that machine that doesn't have a monitor and a mouse in front of it in the keyboard well this will let you monitor one or more of those machines that you set that up on so you don't have to sit in front of this machine you just go out it works very similar to X codes distributed build you know pick them as target machines and hit start and stop on your client machine and so let's let's have a demo with that can we switch to the demo machine please great alright so over here I have a machine this is Sanjay's machine it's running some code on it but all on shark here right and I'll go to the the network profiling window I'll say control network profiling of shared computers and there we can say you know somebody has shared this w WC 6 ok let's use that and it's it's running something over there let's hit start so again sharks very simple just hit the start button stop and now we're processing on the other machine here and this is the this is a profile taken from this machine so we can see oh that's what's the left yet well I wonder what that is maybe yeah Honda can tell us so Sanjay Patel that's can we have you come up here and explain it thanks so for those of you here on Monday you probably saw our celestia demo we'd like to show you a few more details about how we optimize that code so if we go back to slides for just a second hello ok they're back ok so Celestia is this open source program we found on the web and it's really cool it's a 3d space simulation of all the large bodies in the universe for more information you can check them out on this web link they actually use real data from Pasadena down at JPL to calculate where the planets and the stars are how they're orbiting so let's go back to the demo machine and so this is celeste use main window you can see it's hovering around the earth right now and we wanted to find out how this thing performs on Mac os10 so the first thing you have to do if you're going to try to optimize performance is put down a good metric in our case that would be how long it takes using a timer as you can see in the corner to run our art script and so right now we're running in timing so we can bring up shark and you can see as Nathan said shark is pretty simple there's a start button in the middle here we can choose our configuration but by default most of us are going to use the time profile let's just hit the start button so right now we're actually taking a thousand samples per second of whatever is running on the system beyond just Celestia if you hit stop shark is now processing the samples it took and it's going to show us by default what we call the heavy view so this is a list in order of importance of these functions that actually ran on this system now at the bottom here you can see so let's just taking up less than half of the processor cycles and if you pop that open you notice that the first thing is the mock colonel so new for shark for we actually filter out the idol thread so you won't pop there by default thanks for old shark fan and you might wonder why is Celeste you're getting less than half the CPU cycles on this system if you pop open the thread pops up you'll see that it's a single threaded app so we're running on a dual g5 it can't possibly take all of the seats you cycles on the machine now another way to look at this code is as Nathan mentioned we have a 3 view so this is the traditional top-down view of code so you start at start or main and you work your way down to the leaves of the function this is great for understanding the structure of your code how did you get there and new for shark floor we can do both heavy entry simultaneously so you can get a really good feel for the behavior of your program and of course it's Nathan mentioned we have the new chart view so this is interesting we have a dual processor system so we get two charts and on the x-axis we have time for each CPU now we know that celestia is a single threaded program so let's just look at it's one thread in one chart and new for sharp for we have this zoom slider this is a big upgrade compared to previous zooming and you can go back and forth the nice part is if you click on one of these call stacks you'll get the back traits for that point in time so you can see chronologically how your program behaves over time and you'll notice that there are some of these maroon or red samples those are calls into supervisor space and one thing I noticed was for the Celestia profile we had a lot of these calls into the supervisor space it turns out we're trying to draw about a thousand times per second and for those of us who are familiar with graphics applications you don't really need to try to draw that often because you can't so we limited it to 30 frames per second and that's what i call warp 2 we can go to that you can see things have sped up immensely by not doing all that work of trying to draw too often we can also go to this writing model which would be work three so it turns out you can do drawing in parallel with some of the computations for orbits and we can take another sample here and see how r s profile is changed so i'll put on my hardware hat for a second and say the threading is probably the most important thing you guys can do if you have any serious computation going on in your programs because no matter what architecture we're talking about we're going to end up being multi-core multiprocessor multi-threaded multi something so if you're not taking advantage of that parallel ISM in Hardware you're going to lose out but now we can see if we compare the first session to our second session you'll notice that it's very different on the first session we have this call to log it's at the top but now that we've limited drawing and done threading we have a totally different profile so this brings up an important point you shouldn't try to micro optimize right off the bat you should look for some high-level problems to figure out what your code is actually doing before you start diving into function level tuning but now we have a new profile and you'll see that cosine from libsystem is actually at the top of the chart and there's not much we can do about cosine but we should figure out who's calling cosine so if you open up a disclosure triangle you can see our main color is this orbit function now new and shark for we have what we call data mining operations and as Nathan mentioned if we want to hide say whole library for example libsystem we can just exclude and now all of the samples that used to be in co-signer attributed to its callers so you see our orbit function is now at the top of the chart and of course if you double-click you get your source code and it's highlighted according to hotspot you can see in this function almost all the time is going into one line calling cosine and of course it's a for loop what we should do is open up Xcode and see what we can do about that so you just hit the edit button okay and you can see we're calling cosine in a loop and we're actually just iterating over a whole array of data and getting the cosine of each value now new and tiger in the accelerate framework there just happens to be a cosine function that does that it just iterates over an entire array and if we jump over to that line you can see i just made a call here it's called V V cosine app that's available on tiger and what it does is it just takes an entire array operates cosine across the whole path and gives you a return array so we made that optimization we call that warped for and you'll see we took about 20 seconds to run through the demo before while we were threaded in limiting drawing using the accelerate framework we doubled the performance but there is a downside I kind of cheated that cosine function is single precision and the inputs and outputs in Celestia were double precision so I didn't want to cause any kind of astronomical catastrophe or global warming or anything so we decided we'll do our own cosine approximation function in double precision and I called that warp 5 and it end up being a little slower than just calling it a V force oops lost something there and then we also went to warp 6 which was a known unrolling of Maya code and that's important just unrolling one loop in this program bought us another thirty percent and what we'll end up with is performance that's on par with with the accelerate framework but it has a higher precision because i'm doing a higher precision estimate of cosine so let's take one more profile and figure out what our code is doing after we've made these optimizations so this is performance tuning of course it's an iterative process again you can see the profile has changed cosines no longer there we have our approximation function on top and what I want to do is take a look at this function called bigfix it's a C++ constructor and you look here and you say whoa shark is angry about something it's got a lot of what we call the call-outs there and if you you click on one of those you get some friendly advice or not so friendly advice depending on how bad your code was in this case we're doing type conversions from floating point in so this is one of the big bottle necks of for a g5 or even a g4 processor and what we want to do is try to optimize that but let's take a look at what the compiler generates underneath this code so we can now look at source and assembly side by side and of course we're trying to sell those big monitors this looks really great on 30 inch now the really nice part here is you can see Nathan scrolling around and for each line of source we're actually in sync with the as imbue so that's going to pop as you scroll around in the source so you can see exactly which lines of Adam correspond to which lines of source okay now if we just want to look at azum and don't really understand Adam then we have this new handy PPC Help button as Nathan showed and the cool part here is as you click on your azum the right page of the manual pops open with you so lots of information right at your fingertips you don't have to go fumbling around for manuals that's a pretty big improvement for someone like me and of course if you're really tuning for g5 or powerpc 970 we have this extra help here if we click this show dispatch groups your source will now be highlighted according to dispatch groups for the g5 so for those of you familiar with the g5 architecture you can actually tune down to the hardware the other nice feature is this 970 details drawer and if you highlight code you can see dynamically we're updating the charts at the bottom so you can see where every as of instruction is going so on the Left we have the functional units and on the right we have the dispatch slots so for those of you who really want to tune this is great stuff okay so let's go back to Celestia yeah so over a period of about two weeks or so with a lot of interruptions doing real work I spent I used shark to try to optimize this code so i ended up writing some altivec code i did some scheduling for the g5 i did some g five specific optimization and the some results of that is what i call warp nine and if we run through that see that we finish in about five seconds so that was pretty good now this is open source I'm going to return this all of course to the authors and post it on the web so you guys can have a crack at it too if you like it's a really cool program if we go back to our slides now here's the result of a couple of weeks of tuning effort we started out and if you run this demo as is it'll take over 20 25 minutes on a g5 and when we were done of course we got 25 seconds so around 300 times the performance so that was pretty nice so let's just summarize what we've talked about today the things we really want to say are as you can see hopefully that it's a really quick process to start using shark you don't have to do any kind of hardcore profiling you can take your optimized binary and simply profile that in order to see your source all you have to do is add the dash G flex they don't strip out your debugging symbols and of course we provide a lot of high-level views so if you're used to trees you can see it in tree view you can use the heavy view you can look at your code over time is a chart and of course once you've identified functions that are hot spots you want to source you want to tune the source code or tune the assembly and let shark give the optimization tips and of course now with new new and shark for we have the advanced profiling with so things like tracing malik's tracing specific functions or tracing specific system events like cache misses or page faults and of course hopefully you can tell shark we've tried really hard to make easy there's no magic here you just push the start button and that should get you going but there is a lot of machinery underneath here that makes it powerful enough or even the most expert users who want to tune all the way down to the metal so it's great for what we call high level or elder its algorithm 'ok analysis as well as low level tuning or hacking and we're compatible with codewarrior GCC xlc XLF snag whatever you've got we should be able to support you and the best part sharks available for free and it's available right now [Applause]