WWDC2004 Session 307
Transcript
Kind: captions
Language: en
Ladies and gentlemen, please welcome performance engineer Nathan Slingerland.

Right, yeah — good afternoon, and welcome. So here's what we want to talk about today. Obviously we're here to talk about Shark. We want to introduce Shark to you if you haven't used it yet: for those of you who haven't had a chance to run it, we want to go over the basics and show you how simple it is to use. And for those of you who are longtime Shark experts, we want to show you all the great new features we've got in Shark 4. Then, of course — some of you may have seen the demo on Monday in the dev tools keynote — we want to go a lot more in depth into what we did there to optimize Celestia using Shark. So let's go.

So what is Shark? Well, it's been called Apple's best kept secret, but not anymore. Starting with this year's WWDC we've had a big push for this, and we want you all to know that this is a great way to find and fix your performance problems. So what is Shark? Well, it's a really easy to use profiling tool. There's a GUI version as well as a command line version; the command line version is very useful for scripting performance regression testing, so you can archive your performance analysis for later review and see how you've made progress as you continually optimize your application. A public beta of Shark 4 and CHUD 4 is available on the web today, off the developer tools website, and it runs on Panther and Tiger — so that means yes, you can use it on your development machines right now. And of course we support basically any kind of executable — Mach-O, CFM — and we're compatible with all the compilers you can use on Mac OS X, and with the Stabs debug format. So whatever you can throw at it, Shark can show you what's going on.

All right, so let's see how easy it is. What do you do
to get started with Shark? It's just one click. There's just this main window with one button: Start — and Stop. This is the typical way that you start and stop profiling with Shark, and you're on your way. By default we've selected Time Profile, but we also supply a rich set of other preset configurations, and you can create your own configurations for doing customized performance analysis. By default we're going to profile the entire system, but some profiling configs make more sense targeted at a specific process — for example malloc tracing, which we'll talk more about later, where you want to pick a target process, or static analysis, where you want to pick a particular executable file to look at. But for the most part you don't even need to go here; you can just click the Start button and go.

So once you've selected your configuration, or just left it at the default, you're probably going to do a time profile. So what is that? This is the most common workflow. Basically, when you're optimizing application performance, you want to spend your time where you can affect application performance the most — that's whatever functions, or whatever pieces of your code, are taking the most wall-clock time. For a time profile we're using a statistical form of sampling: we're taking, by default, a thousand samples per second, and each sample contains a backtrace — how you got to a particular location in your code — plus which process, which thread, and so on. We collect all this information, and when we combine it we can create a profile. This is going to show you, over the sampling period, where you were spending time in a given process or thread, or overall in the system. And we're capturing everything — this includes any drivers, any kernel code, the executing applications — so no matter what kind of developer you are, this can help you profile your code.
We're able to do this because we also have a kernel extension: we're sitting in the kernel capturing these samples, and this gives us very low overhead — we really don't perturb the code on the system hardly at all.

So once you've got a profile — you've hit Start, you've hit Stop, and you've run whatever it is you're interested in — you're presented with, by default, the heavy view. This is really the bottom line: it's showing you, starting from the top, the most-sampled to the least-sampled function in the selected process. By default we select the heaviest process on the system, so most of the time this is your code, the application you're trying to look at. Where the heavy view shows you where the hot spots are, the tree view is going to show you where the hot code paths are — how did you get to those important functions? That's just this Tree button here at the bottom, and new in Shark 4, you can see heavy and tree simultaneously, so you can use one to help you navigate the other.
Those are the two ways we very commonly use to look at high-level problems in our code. The third way is the enhanced chart view. This chart view has been there a while, but it's been greatly enhanced in Shark 4. Basically what we're looking at is call stack depth on the y-axis versus time on the x-axis, and this can give you a really good idea of a couple of things. First of all, it can show you patterns in the execution of your code, which is good for spotting repetitive behavior — we'll talk more about that later in relation to malloc tracing and function tracing. It's also a good way to spot tall, narrow spikes in the chart, which usually indicate that you're spending a lot of time in function call overhead: you had to call many function levels deep, but you're not actually doing all that much work once you get there. We see this a lot in modern object-oriented, layered code, where everything is hidden in layers, and sometimes that can hurt performance. As for the colors here: blue is user space; you may see a few maroon-colored stacks — those are kernel; and the yellow call stack is what's selected — that's what's being shown at the right-hand side there, the current call stack. Lightness doesn't mean anything except for layers: as you see these gradients in color, each color change vertically is indicating a new function.
So those are the three main tools that Shark provides for looking at your application. But a lot of you are probably working on some pretty sophisticated code bases, with some complex function call behavior, and you really want to understand the performance of that application. So Shark 4 has introduced a lot of new features to help with high-level analysis and understanding these more complex application architectures. The first is color by library. What this does is color each different module that your program is using — different frameworks or libraries — a different color, and it keeps the colors consistent between the different views. This is a really quick visual way to see where you're spending time in which modules.

Okay, so another common problem you'll probably see in your code: this is a tree view, and you may have an important intermediate function — this interior function B, in this example — that shows up in two different spots in the tree. You'd really like to know how much total time is being spent in B and its children, because we tend to subdivide work: in this example we have B calling C1, C2, and C3, and not a whole lot of time is being spent in B itself. So what we've added is a canonical view. If you select that, we're going to see not just — as in the tree view normally — main calls foo calls bar, but all the interior symbols, every symbol that shows up, with its total contribution. In this case B and what it calls total up to 100% of the time being spent. This is a good way to understand how much time you're spending in a function and its children.
Now, traditional profiling tools tend to look at performance the same way the program is broken down — into functions. That works pretty well when you have a few heavy functions where you spend a lot of time. But if you have just a lot of functions in your code — you've subdivided the work a lot and you're reusing code everywhere — or you're spending a lot of time in system libraries, that really isn't the best presentation. That's what data mining addresses: it allows you to customize how you're looking at a profile, and by customizing it you can ease your understanding of it.
So Shark provides a couple of basic data mining operations. The first is exclude symbol or library, and what this is going to do is hide the symbol or library that you select. We're not just taking the samples out — it's important to remember this — we're not just removing those samples. If you hide the function foo, any samples that landed in foo are just assigned to whatever called it. So if you have a piece of code that calls square root, for example, and you exclude square root, it's going to assign that time to your function. And really, that's a lot easier for most of us to understand as time spent in our code, because typically we can't do a lot about system libraries.

Along those same lines is flatten library. We're familiar with the pieces of AppKit or Core Foundation or the other system libraries that we call — those function entry points — but a lot of times you'll see interior functions inside those libraries that you really don't know much about, and you really don't care about the implementation interior to them. Flatten library is going to hide everything inside a library except for the entry points, so all the time will just show up in the entry point — which is what you understand, because that's what you called from your code.
And finally, focus symbol. If you typically navigate very deep call trees, or if you have a certain symbol and you only want to look at it and what it calls, that's what focus does: it starts the display of the various call paths from that symbol. So these are different ways to customize how you're looking at the profile, and internally at Apple we've found this very helpful in our performance analysis for Tiger.

So those are the detailed data mining operations. There are also more macro-level operations that are very useful. Flatten recursion: you've probably seen this if you've ever looked at a piece of code with a recursive call — maybe in NSView or some other code, or even your own. That's really not the most informative way to look at a recursive function; you'd actually like to see all those recursive calls collapsed into one, with all the time, and everything the recursive calls actually do, collapsed together. The flatten recursion checkbox is in this advanced settings drawer we've been looking at: you check it, and all the recursive calls anywhere in the profile get collapsed to one.

We talked about flattening a single library — flatten system libraries will flatten all of the system libraries that occur in the profile. This is a really quick way to hide all that interior code you don't really want to see.

Another very useful one is exclude entries without source info. Usually when you're working on an application on your system, you've turned on debugging information, because you want to see symbols in Shark and you want to get all the source information. Well, the good news is that yours is almost the only thing on the system with that information in it. So if you exclude everything without that information, you're going to see only your code: it performs that exclude operation on any symbol that doesn't have source information associated with it. Again, this just focuses in on your source code. So I encourage you to play with all these different data mining operations, and you'll see very quickly what they do and how they can help you.
Finally, exclude supervisor. Not all of us are familiar with the kernel and what it's doing inside — "how did I get these red samples in my profile?" This is just a quick way to get rid of those samples and keep them from being a distraction while you're looking at your code.

Okay, so those are the high-level optimization tools that Shark provides. Once you've gotten those out of the way, you should be able to understand your code at a level where you want to optimize a specific function. And that's where Shark shines: function-level optimization. The most basic thing Shark does for you is show you your source code. If you double-click on any symbol in those profile views we talked about — heavy or tree — it's going to show the source code, and it colors the source according to how many samples landed on a particular source line: the brighter the yellow, the more samples landed there. That's a really quick visual way to see where you're spending time in your source. You'll also note that the coloring in the background of the scrollbar matches the coloring in the background of the table, so this is a very easy way to navigate quickly between hot spots in your code.

And finally, to make the profile-and-optimize cycle complete, there's an Edit button, and this is going to show your source code inside of Xcode — it brings you back to Xcode. So the typical optimization cycle: you profile with Shark, you go to the code browser, you see a problem, you click Edit, go back to Xcode, recompile, and so on. It's this iterative process where you continually go back and optimize again and again until you're happy with the code.

Another great thing Shark will do for you is provide optimization tips. If you see one of these exclamation points in the code browser, in your code, that means Shark has a tip for you — it has an idea about what might be wrong, or why you might be spending extra time on that particular line of source. If you click on the exclamation point, it brings up this help balloon.
And finally, in Shark 4 we've added the ability to look at both source and assembly side by side. Now, rather than having to jump back and forth, or try to figure out what the compiler generated for a particular line of source code, you'll see it very quickly as you navigate around in the source viewer. If you select here — we have a line of source selected on the left, and there are maybe three or four instructions selected on the right — those are the instructions that go with that source line. So this is a real quick way to see what instructions were being generated. Okay, that's great, but "I don't know any PowerPC assembly." Well, here you go: there's this online help that follows along as you navigate through the assembly, and it tells you in English what each instruction does. Assembly language, just like any other programming language, has these simple operations, and you can read about them here in this online help guide.
So that's time profiling — those are the tools we use for time profiling. But you'll see that we also use these same tools over and over with the more sophisticated types of profiling that Shark provides. In Shark 4 we've added malloc tracing. What this does is, every time an allocation happens in a target program, it takes a sample: it remembers the call stack and also how much memory was allocated. It's not shown here, but you can also look at this in the profile view and say, okay, this function, over the sampling run, allocated this much memory — and that's useful for tracking down overall how much memory was allocated in various functions. But another interesting thing to do is to use the chart view to see patterns in the program's behavior. Because this is not a statistical profile — it's an exact trace of the malloc function — and most interesting program behavior does some sort of allocation, you can see patterns in the execution of your code. Here, what we're looking at is the Dock: as we mouse over, it magnifies different icons — we're probably looking at three icons here — and you can see this repeated behavior.
Sometimes — as in the case of the Dock — maybe we expect that repetition, but this can also be a good clue to when you're doing extra operations that aren't really necessary. So malloc tracing is more general than just looking at where memory was allocated, and the more general form of malloc tracing is function tracing. Rather than just tracing the allocation functions, you can trace any functions in your code, and again, this is an exact trace. There are some predefined presets for common sets of APIs — typical network APIs, locking APIs — but you can also add your own functions. So if you have a set of functions you want to watch in your code, and you want to measure the exact function count and the amount of time each function took, this is a great way to do it: you just attach to your application and select the functions you want to watch. This measures each function call's time very accurately, with the PowerPC decrementer, but without you having to write any instrumentation code in your application.
Then, finally, event profiling. For you long-time Shark users — we've had this for a while — this is a very good way to correlate hardware and operating system events, things like cache misses in the hardware or virtual memory faults in the operating system, with your code. So here in this example we have L1 and L2 data cache misses: we're doing event profiling along with our normal time profile, and you see two columns have been added to the profile, counting L1 and L2 cache misses. Also in the chart view there's a chart above, so you can track the L2 cache misses over time — how is that changing over time in your application? And there are literally hundreds and hundreds of these performance events available.

Finally, in Shark 4: Java profiling. Everything you could do with CFM and Mach-O applications is also available for Java now — time profiling, allocation tracing, function tracing — and if you double-click on a Java function and source code is available, it shows you the Java source in the code browser.

Oh, also — those are the built-in configurations we've been talking about, but you can create your own using this configuration editor. One of the hard things in Shark 3 was finding where performance counter events lived: which PMC does an event that counts cache misses live on? You'd have to either go to the processor user guide and look it up, or just kind of blindly search around. Well, now there's a text search for the performance events: you can type in "cache miss" or "cache," and it will come up with a list of which PMCs have an event with that substring in it.
Okay, and in Shark 4 we've also added network profiling, using Rendezvous. Say you've set up your Xserve in your closet, or you have an HPC cluster, and you want to monitor what's going on on a machine that doesn't have a monitor, mouse, and keyboard in front of it — this lets you monitor one or more of those machines that you've set this up on, so you don't have to sit in front of the machine. It works very similarly to Xcode's distributed build: you pick the target machines and hit Start and Stop on your client machine. So let's have a demo of that — can we switch to the demo machine, please? Great.
All right, so over here I have a machine — this is Sanjay's machine — running some code. All I have here is Shark. I'll go to the network profiling window, I'll say "control network profiling of shared computers," and there we can see somebody has shared this "WWDC 6" — okay, let's use that. It's running something over there, so let's hit Start — again, Shark's very simple, you just hit the Start button — and Stop. Now we're processing on the other machine, and this is a profile taken from that machine. So we can see — what's that at the top? I wonder what that is. Maybe Sanjay can tell us. Sanjay Patel — can we have you come up here and explain it?

Thanks. So for those of you here on Monday, you probably saw our Celestia demo. We'd like to show you a few more details about how we optimized that code, so if we go back to slides for just a second —
Hello. Okay, they're back. Okay, so Celestia is this open source program we found on the web, and it's really cool — it's a 3D space simulation of all the large bodies in the universe; for more information you can check them out at this web link. They actually use real data from Pasadena, down at JPL, to calculate where the planets and the stars are and how they're orbiting. So let's go back to the demo machine.

This is Celestia's main window — you can see it's hovering around the Earth right now — and we wanted to find out how this thing performs on Mac OS X. The first thing you have to do, if you're going to try to optimize performance, is put down a good metric; in our case that would be how long it takes, using a timer as you can see in the corner, to run our demo script. So right now we're running the timing, and we can bring up Shark. As Nathan said, Shark is pretty simple: there's a Start button in the middle here, we can choose our configuration, but by default most of us are going to use the time profile — let's just hit the Start button. So right now we're actually taking a thousand samples per second of whatever is running on the system, beyond just Celestia. If you hit Stop, Shark processes the samples it took and shows us, by default, what we call the heavy view. This is a list, in order of importance, of the functions that actually ran on this system. Now, at the bottom here you can see Celestia is taking up less than half of the processor cycles, and if you pop that open you notice that the first thing is the Mach kernel. New for Shark 4, we actually filter out the idle thread, so it won't show up there by default — thanks, for you old Shark fans. And you might wonder why Celestia is getting less than half the CPU cycles on this system: if you pop open the thread popup you'll see that it's a single-threaded app, and we're running on a dual G5, so it can't possibly take all of the CPU cycles on the machine.
Another way to look at this code, as Nathan mentioned, is the tree view. This is the traditional top-down view of code: you start at start, or main, and work your way down to the leaves of the functions. This is great for understanding the structure of your code — how did you get there? — and new for Shark 4, we can show both heavy and tree simultaneously, so you can get a really good feel for the behavior of your program. And of course, as Nathan mentioned, we have the new chart view. This is interesting: we have a dual-processor system, so we get two charts, and on the x-axis we have time for each CPU. Now, we know that Celestia is a single-threaded program, so let's just look at its one thread in one chart. New for Shark 4 we have this zoom slider — a big upgrade compared to previous zooming — and you can go back and forth. The nice part is, if you click on one of these call stacks, you get the backtrace for that point in time, so you can see chronologically how your program behaves over time. And you'll notice there are some of these maroon or red samples: those are calls into supervisor space.
One thing I noticed in the Celestia profile was that we had a lot of these calls into supervisor space. It turns out we were trying to draw about a thousand times per second, and for those of us familiar with graphics applications, you don't really need to try to draw that often — because you can't. So we limited it to 30 frames per second, and that's what I call warp 2. We can go to that, and you can see things have sped up immensely just by not doing all the work of trying to draw too often. We can also go to this threading model, which would be warp 3: it turns out you can do drawing in parallel with some of the computations for orbits, and we can take another sample here and see how our profile has changed.

I'll put on my hardware hat for a second and say that threading is probably the most important thing you guys can do if you have any serious computation going on in your programs, because no matter what architecture we're talking about, we're going to end up being multi-core, multiprocessor, multithreaded — multi-something. So if you're not taking advantage of that parallelism in hardware, you're going to lose out.
But now, if we compare the first session to our second session, you'll notice it's very different: in the first session we had this call to log at the top, but now that we've limited drawing and done threading, we have a totally different profile. This brings up an important point: you shouldn't try to micro-optimize right off the bat. You should look for the high-level problems, to figure out what your code is actually doing, before you start diving into function-level tuning. But now we have a new profile, and you'll see that cosine, from libSystem, is actually at the top of the chart. There's not much we can do about cosine itself, but we should figure out who's calling cosine. If you open up a disclosure triangle, you can see our main caller is this orbit function. Now, new in Shark 4 we have what we call data mining operations, and as Nathan mentioned, if we want to hide a whole library — for example libSystem — we can just exclude it, and now all of the samples that used to be in cosine are attributed to its callers. So you see our orbit function is now at the top of the chart, and of course if you double-click you get your source code, highlighted according to hot spots. You can see in this function almost all the time is going into one line calling cosine — and of course it's in a for loop. What we should do is open up Xcode and see what we can do about that, so you just hit the Edit button.
Okay, and you can see we're calling cosine in a loop: we're just iterating over a whole array of data and getting the cosine of each value. Now, new in Tiger, in the Accelerate framework, there just happens to be a cosine function that does exactly that — it iterates over an entire array. If we jump over to that line, you can see I just made a call here — it's called vvcosf, that's available on Tiger — and what it does is take an entire array, run cosine across the whole thing, and give you back a result array. So we made that optimization — we call that warp 4 — and you'll see we took about 20 seconds to run through the demo before, while we were threaded and limiting drawing; using the Accelerate framework we doubled the performance.

But there is a downside — I kind of cheated. That cosine function is single precision, and the inputs and outputs in Celestia were double precision, and I didn't want to cause any kind of astronomical catastrophe, or global warming or anything. So we decided we'd do our own cosine approximation function in double precision, and I called that warp 5; it ended up being a little slower than just calling the vForce one. Oops, lost something there. And then we also went to warp 6, which was a manual unrolling of my code — and that's important: just unrolling one loop in this program bought us another thirty percent. What we end up with is performance that's on par with the Accelerate framework, but with higher precision, because I'm doing a higher-precision estimate of cosine.
So let's take one more profile and figure out what our code is doing after we've made these optimizations. This is performance tuning — of course it's an iterative process. Again, you can see the profile has changed: cosine is no longer there, and we have our approximation function on top. What I want to do is take a look at this function called BigFix — it's a C++ constructor — and you look here and you say, whoa, Shark is angry about something. It's got a lot of what we call callouts there, and if you click on one of those you get some friendly advice — or not-so-friendly advice, depending on how bad your code was. In this case we're doing type conversions from floating point to int, and this is one of the big bottlenecks for a G5, or even a G4 processor. We want to try to optimize that, but let's take a look at what the compiler generates underneath this code.

So we can now look at source and assembly side by side — and of course we're trying to sell those big monitors; this looks really great on a 30-inch. The really nice part here is that, as Nathan scrolls around, for each line of source we're actually in sync with the assembly view — it follows as you scroll around in the source, so you can see exactly which lines of assembly correspond to which lines of source. Okay, and if we're just looking at assembly and don't really understand it, we have this new handy PPC Help button, as Nathan showed, and the cool part is that as you click on your assembly, the right page of the manual pops open with you. Lots of information right at your fingertips — you don't have to go fumbling around for manuals. That's a pretty big improvement for someone like me.

And of course, if you're really tuning for the G5, the PowerPC 970, we have this extra help here. If we click Show Dispatch Groups, your source will now be highlighted according to dispatch groups for the G5, so for those of you familiar with the G5 architecture, you can actually tune down to the hardware. The other nice feature is this 970 details drawer: if you highlight code, you can see we're dynamically updating the charts at the bottom, so you can see where every assembly instruction is going — on the left we have the functional units, and on the right we have the dispatch slots. For those of you who really want to tune, this is great stuff.
Okay, so let's go back to Celestia. So, over a period of about two weeks or so — with a lot of interruptions doing real work — I used Shark to try to optimize this code. I ended up writing some AltiVec code, I did some scheduling for the G5, I did some G5-specific optimization, and the sum result of that is what I call warp 9. If we run through that, you'll see we finish in about five seconds — so that was pretty good. Now, this is open source; I'm going to return all of this, of course, to the authors and post it on the web, so you guys can have a crack at it too if you like — it's a really cool program. If we go back to our slides now: here's the result of a couple of weeks of tuning effort. We started out — if you run this demo as-is, it'll take over 20, 25 minutes on a G5 — and when we were done, of course, we finished in seconds, so around 300 times the performance. So that was pretty nice.
So let's just summarize what we've talked about today. The things we really want to say: as you can hopefully see, it's a really quick process to start using Shark. You don't have to do any kind of hardcore profiling setup — you can take your optimized binary and simply profile that. In order to see your source, all you have to do is add the -g flag and not strip out your debugging symbols. And of course we provide a lot of high-level views: if you're used to trees you can see it in the tree view, you can use the heavy view, you can look at your code over time as a chart. Once you've identified functions that are hot spots, you want to tune the source code or tune the assembly, and let Shark give you the optimization tips. And now, new in Shark 4, we have the advanced profiling — things like tracing mallocs, tracing specific functions, or tracing specific system events like cache misses or page faults. Hopefully you can tell we've tried really hard to make Shark easy: there's no magic here, you just push the Start button, and that should get you going. But there is a lot of machinery underneath that makes it powerful enough for even the most expert users who want to tune all the way down to the metal. So it's great for what we call high-level, or algorithmic, analysis as well as low-level tuning, or hacking. And we're compatible with CodeWarrior, GCC, XLC, XLF — whatever you've got, we should be able to support you. And the best part: Shark is available for free, and it's available right now.
[Applause]