WWDC2003 Session 103

Transcript

Kind: captions
Language: en
Good morning and welcome. As we were setting up for the conference, I had it in the database that this was listed as Secret Session Number One. I had a number of people ask me if I had a second secret session, and, well, no, I wasn't so lucky. The reason this was secret is that we really wanted to make a dramatic statement about some new features in our acceleration libraries. What we'll be talking about today are techniques that you can use to accelerate image processing using vImage, part of the Accelerate library. So to do that, let me bring up Robert Merly.

Thank you, Craig. I'm very happy to be here this morning to introduce to you a major new technology from Apple: our vector-accelerated image processing library, called vImage. Before I jump into that demonstration, though, there's another point that I would like to make right here at the outset.
All of the vector libraries, starting with Panther, are going to be contained in a new high-performance, state-of-the-art computing framework called Accelerate. Accelerate contains not only vImage, the major subject of this talk today, but also all of the vector libraries that have been available previously in OS versions in the framework called vecLib: the digital signal processing library, the basic linear algebra subroutines, LAPACK, the math libraries, and the big-number library. So I wanted to make sure that that was clear before we went on.
Now, today we're getting into the main part of the demonstration. What you'll learn about vImage? Well, that may be a little optimistic. What I'm going to try to convey about vImage, anyway: the functionality, and the structures and API, at least some examples of them, since it's much too extensive to go over completely, plus some of the features that are not included. Those are the first two subjects. Then I'm going to bring up one of my colleagues in the Vector and Numerics group, a gentleman who will talk about implementation techniques and performance. And then finally, although it's not on this slide, there will be a section at the end by Eric Miller of the Architecture and Performance group, with an overview of the CHUD tools, the performance tools. So what is in the functionality of the vImage library? It can be broadly grouped into the main topics listed here. I'm going to talk a little bit about each one of them, so I won't belabor it at this point.
But at this time I'd like to jump into the first demo of the morning. What I want to show you is an image processing function called inversion, a very simple technique where each pixel of an image, in fact each color component of each pixel, is replaced by the complement of its value. So what you see here is a picture of a bed-and-breakfast in Portugal that I happened to stay at earlier this year, and what I'm going to do is perform an inversion operation on it. What you see is essentially the photographic negative. Now, the functions comprised in vImage run quite a gamut in complexity, from quite simple to quite complex; this one is way over on the simple side, very easy to do.
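The inversion just demonstrated can be sketched in a few lines of C. This is a portable illustration of what the operation does, not the vImage implementation; the function name is invented, and the planar 8-bit layout with a rowBytes stride follows the descriptions given later in the talk.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of inversion (photographic negative) over an 8-bit planar
 * channel; rowBytes may exceed width, so padding bytes are untouched. */
static void invert_planar8(uint8_t *data, size_t height, size_t width,
                           size_t rowBytes)
{
    for (size_t y = 0; y < height; y++) {
        uint8_t *row = data + y * rowBytes;
        for (size_t x = 0; x < width; x++)
            row[x] = 255 - row[x];   /* complement each component */
    }
}
```

For an interleaved image the same loop would simply run over width * 4 components per row.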
The first subject, the first major category, is convolution. I want to say a word about area processes in image processing. Area processes are processes that use both a source pixel and other pixels nearby to generate the destination pixel. Both convolution and morphology, the first two examples I'm going to give you, are examples of area processes. Convolution in particular creates an output pixel by taking a weighted sum of pixels nearby the input or source pixel, and that weighting, and therefore the effect of the process, is determined by a matrix called the convolution kernel. So I want to give
you an example of how convolution works. What I have here is a fairly blurry and kind of low-intensity image of downtown Lisbon, and I want to emphasize that it was probably poor equipment and not the photographer's skill that was at fault here. Anyway, I'm going to operate on this with a five-by-five sharpening kernel, and the result is this: a lot of the blurriness is gone, and features are much sharper than they were. We'll go back; that's the blurry one, and the sharpened one. You can see the kernel there is five by five; it's not that complex. So all of convolution operates the same way: it's all matrix multiplication of areas of pixels, and the effect simply depends on what the matrix is. For another example, on this edifice I'm going to use a rather extreme edge detection process to produce an embossed image. This is, as you can see, an extremely simple convolution kernel, three by three, to get a pretty dramatic effect.
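The matrix multiplication just described can be made concrete: each destination pixel is a weighted sum of its neighborhood, scaled by a divisor so an integer kernel can express fractional weights. This is an illustrative portable sketch, not the library's code, and it skips the edge pixels; the edge options vImage actually provides are covered later in the talk.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of convolution over the interior of an 8-bit planar image:
 * each destination pixel is the kernel-weighted sum of its
 * ksize x ksize neighborhood, divided by `divisor`. */
static void convolve_interior(const uint8_t *src, uint8_t *dst,
                              size_t height, size_t width, size_t rowBytes,
                              const int16_t *kernel, size_t ksize,
                              int32_t divisor)
{
    size_t half = ksize / 2;
    for (size_t y = half; y + half < height; y++) {
        for (size_t x = half; x + half < width; x++) {
            int32_t sum = 0;
            for (size_t ky = 0; ky < ksize; ky++)
                for (size_t kx = 0; kx < ksize; kx++)
                    sum += kernel[ky * ksize + kx] *
                           src[(y + ky - half) * rowBytes + (x + kx - half)];
            sum /= divisor;
            if (sum < 0) sum = 0;        /* clamp to the pixel range */
            if (sum > 255) sum = 255;
            dst[y * rowBytes + x] = (uint8_t)sum;
        }
    }
}
```

With an identity kernel the image is copied; a kernel of all ones with divisor 9 is a box blur; the sharpening and emboss kernels from the demos drop in the same way.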
Second, I'd like to talk about the second major topic: morphology. Morphology in general adjusts the shape of objects in the image to conform more closely to the shape of a probe, and the probe is defined, also in a matrix, called the morphology kernel. In practice it can be used to pick out small or large objects in an image and lighten them or darken them; it can make them larger or smaller; it can be used to alter their shape, to remove fine details while preserving larger objects,
and so forth. So to demonstrate this, I have here a couple of very simple images: a small circle and a large circle. I'm going to do a morphology process on these two images using a probe in the shape of a right triangle; it's a fairly large matrix, about the size of the smaller circle. The result of that operation is this. As you can see, both circles take on a more triangular characteristic, and in fact the smaller circle, the center circle, completely disappears because of the size of the kernel. That, by the way, was called a dilate operation. Secondly, I'm going to do what's called an erode operation: I'm going to take a similar probe, about the same size as the kernel here, operate on these two images, and I get this result. You can see that in the case of the smaller circle, all the circularity is gone; it's turned completely into a triangle. And the larger circle has taken on some triangular structure and lost a lot of its circularity. So there are a couple of examples of morphology in action.
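Dilate and erode can be sketched as neighborhood maximum and minimum over the probe. A real morphology kernel can be an arbitrary shape with per-element values; this portable sketch uses a flat square probe and interior pixels only, purely to illustrate the idea, and is not the vImage code.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of grayscale morphology with a flat square probe: dilate takes
 * the maximum over the neighborhood (grows bright objects), erode takes
 * the minimum (shrinks them toward the probe's shape). */
static void morph_interior(const uint8_t *src, uint8_t *dst,
                           size_t height, size_t width, size_t rowBytes,
                           size_t psize, int dilate)
{
    size_t half = psize / 2;
    for (size_t y = half; y + half < height; y++) {
        for (size_t x = half; x + half < width; x++) {
            uint8_t best = src[y * rowBytes + x];
            for (size_t py = 0; py < psize; py++)
                for (size_t px = 0; px < psize; px++) {
                    uint8_t v = src[(y + py - half) * rowBytes
                                    + (x + px - half)];
                    if (dilate ? v > best : v < best) best = v;
                }
            dst[y * rowBytes + x] = best;
        }
    }
}
```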
Geometry functions are pretty much self-explanatory: they perform some sort of geometric operation on the image, transform it, make it larger or smaller, reflect it, whatever. For an example of geometry, we're going to take this picture of a jay, and I'm going to translate it and make it bigger, but only in the longitudinal direction, and that results in this image. Secondly, I'm going to go back to the original picture here and do a shearing operation off to the right side, and that results in this image. And is it my imagination, or is that bird getting more irritated with each picture? Maybe I've just been looking at them a little bit too long.
Histogram operations are those that use an intensity-distribution histogram of the image to perform some function. The example I'm going to use is histogram equalization, a process whereby an image with a poor, non-uniform intensity distribution is modified so that intensities are distributed more evenly. So I'm going to go back to this bed-and-breakfast in Portugal and perform this equalization operation on it, and it results in this. Now, what you can see is a great deal more detail visible here in this image than in the original. Notice in particular the weather stains below the window sills on the second floor; they were virtually undetectable in the original image. So the equalization operation brings out a lot of detail that was absent in the original. This is probably a lot closer to the way that building really looked.
Here's an example, or rather the before-and-after histograms of the intensity distribution. I've added all three color channels into each bar to simplify it, although in practice the operation is done on each color component separately. But you can see that in the before image there's a lot of white with some starkly contrasted black, and in the after image it's much more uniform, with a lot of different grays.
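Histogram equalization as just demonstrated can be sketched per channel like this: build the histogram, turn it into a cumulative distribution, and remap every intensity through the scaled CDF. This is a portable illustration, not the vImage code.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of histogram equalization on an 8-bit planar channel. */
static void equalize_planar8(uint8_t *data, size_t height, size_t width,
                             size_t rowBytes)
{
    size_t hist[256] = {0}, total = height * width;
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            hist[data[y * rowBytes + x]]++;       /* intensity histogram */
    uint8_t map[256];
    size_t cum = 0;
    for (int i = 0; i < 256; i++) {
        cum += hist[i];
        map[i] = (uint8_t)((cum * 255) / total);  /* CDF scaled to 0..255 */
    }
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)        /* remap in place */
            data[y * rowBytes + x] = map[data[y * rowBytes + x]];
}
```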
So I could go on quite a long time, actually, about functionality, but we do have a limited amount of time, so I want to proceed on to some examples of data structures and API. First I want to talk about the data types and layouts that we support in our initial incarnation of vImage. There are two different data types supported. One is an 8-bit integer per color component, or per channel; I'll use those terms interchangeably. The second is a 32-bit floating-point value per color component or channel. We also support two different data layouts. One is the planar layout, whereby each channel is in its own array. If I'm using an RGB image as an example, that simply means that the reds, the greens, and the blues are all in their own separate buffers, and if you're calling an image process to perform some function on this image, you would need to call it three times for all three color components. The added benefit is that if you don't want to do the process on all three channels, you can do it on only one or two, as you wish.
The second layout is what we call the ARGB interleaved layout, where all the color channels are interleaved into a single buffer. We support at the current time a four-channel interleaved layout, which can be either four 8-bit integers or four 32-bit floating-point values. The advantage here is that it only takes a single call to perform an operation on all of these different color components. You can have either an alpha channel as your first component, as is indicated here with red, green, and blue as examples for the other three channels, or you can have four color channels without an alpha. Whatever the case may be is given to vImage in a flags word that's passed to each function. So if you specify that the first channel is in fact an alpha channel, then it will simply be copied to the destination unchanged, unless we're dealing with an alpha-compositing function; that's a separate issue. If you indicate that it's a color channel, then the same process will be performed on it as on the other three channels. We do supply data-conversion utilities to go between the different data layouts and different
data types. And now I'd like to go on to what is probably the single most important data structure, almost the only public data structure we have in vImage: the image buffer. As you can see, it's a very simple data structure, only four elements. There's a pointer to the start of the data, which would be the upper left-hand corner of the image; a height in number of pixels; a width in number of pixels; and then a rowBytes, which is the number of bytes from one row to another, the stride from one row to the other. Pictorially, if the name of the vImage buffer is image, then we have image.data at the upper left-hand corner, then the height and the width. And if you imagine that the white space to the right of the image is extra memory that's not used in the image but just sitting there at the end of the row, then you can see that the rowBytes parameter includes that length in the stride. That comes in handy if you want to do sixteen-byte alignment on each row, for example; that's not a requirement of vImage, but it certainly may be helpful in your own work.
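The four-field structure as described can be written down like this. The field names follow the talk's description, but treat the exact types as approximate rather than the real header's. The init helper also shows the sixteen-byte row alignment trick just mentioned, with rowBytes rounded up past the pixel width.

```c
#include <stddef.h>
#include <stdlib.h>

/* The image buffer as described in the talk: four fields. */
typedef struct {
    void   *data;     /* pointer to the top-left pixel */
    size_t  height;   /* height in pixels */
    size_t  width;    /* width in pixels */
    size_t  rowBytes; /* bytes from one row to the next (the stride) */
} ImageBuffer;

/* Allocate a planar 8-bit buffer whose rows start on 16-byte multiples,
 * so rowBytes may exceed width, as the talk suggests. */
static int buffer_init(ImageBuffer *buf, size_t height, size_t width)
{
    buf->rowBytes = (width + 15) & ~(size_t)15;  /* round up to 16 */
    buf->data     = malloc(buf->rowBytes * height);
    buf->height   = height;
    buf->width    = width;
    return buf->data != NULL;
}
```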
Okay, before I show that, I want to make a distinction at this point between the full image buffer, as shown here, and what we call a region of interest. The region of interest is that portion of the image which is going to be modified by the operation you're performing. In many cases, perhaps in most cases, the region of interest and the full image buffer will be the same, but they don't have to be, and here is an example where they are not the same. We have a subset of the image using the very same data structure, the vImage buffer data structure: you adjust the data pointer, the height, and the width; the rowBytes remains the same; and you simply pass those parameters in. There is no copying required from your own buffers. That's one of the beauties of the simplicity and flexibility of this data structure: you can have the whole image or a piece of the image, and the same data structure is used. This therefore allows you to do, for example, tiling; you may want to do that to take advantage of caching, although we will also do that for you if you wish. And it has quite a number of other advantages as well.
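Carving out a region of interest with the same structure and no pixel copying, as just described, looks roughly like this. This is a portable sketch with invented names for an 8-bit planar buffer, not the library's API.

```c
#include <stddef.h>
#include <stdint.h>

/* Same four fields as the image buffer described above. */
typedef struct {
    uint8_t *data;
    size_t   height, width, rowBytes;
} Buf;

/* A sub-image is the same structure: move the data pointer to the new
 * top-left corner and shrink height/width; rowBytes stays the same, so
 * the ROI's rows stride through the full image's memory unchanged. */
static Buf subregion(Buf full, size_t x, size_t y, size_t w, size_t h)
{
    Buf roi = full;
    roi.data   = full.data + y * full.rowBytes + x;
    roi.width  = w;
    roi.height = h;
    return roi;
}
```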
So here's an example of equalization, an example of a very simple call to the vImage equalization function, where there are only three parameters; it couldn't get much more simple: the vImage buffer for the source, the vImage buffer for the destination, and then the flags word. The information in the flags word varies with each function. You'll notice that you don't have to specify what the data layout or the data type is, because that's implicit in the name of the function, in this case Planar8. So every function has four different variants: planar 8-bit, planar float, interleaved 8-bit, and interleaved float.
There are some functions that do require vImage to know both the full image buffer and the region of interest, and these are the functions I mentioned earlier, the ones I referred to as area processes. The components are all shown here: you have the full image buffer; the source ROI, which may or may not be smaller than the full image buffer; a convolution kernel, a matrix, shown by the yellow rectangle; and then the destination buffer, the result. I'd like to go into the relationship between these things a little bit more, so this is a further discussion of buffers and regions of interest. All right, I think we all know what the full image buffer is. In a call to an area function, morphology or convolution, the region of interest is not specified by a second vImage buffer, but rather simply by x and y offsets from the beginning of the full buffer. So as you can see here, you would indicate the upper left-hand corner of the region of interest by an x and y offset from the upper left-hand corner of the full buffer; the rowBytes is the same in both cases. You also pass a vImage buffer indicating the destination, which has a height and a width and an independent rowBytes. And notice that we have not specified as yet the region-of-interest height and width, and that's for a simple reason: it has to be the same height and width as the destination, so we simply take it from there.
This is an example of one of these function calls, convolution. You have the source and destination image buffers, the offsets to the region of interest, and then some information defining the kernel and a few other things that we need to know. So this is probably one of the more complicated calls that you're going to run into. We have three computational cases that we need to worry about when we're doing these calculations. The first one is fairly simple, and to explain this, just keep in mind the four different elements I'm talking about here: the full image buffer, the region of interest, the convolution kernel, which is simply a matrix, and then, in this image, a source pixel, shown by the tiny red rectangle there. If we are going to calculate the destination pixel from that source pixel, we need to do a matrix multiplication of the pixels in the region around the source pixel, as shown there.
The first case is very simple, because the entire matrix is contained within the region of interest, so there's no issue about where the data comes from. The second case is a little bit more complicated: what happens if the computational matrix extends out beyond the region of interest? This is exactly why we need to know what the full image buffer is, because if it still remains within the full image buffer, then we can use that data without further concern. The third case is the more complex one: what if the computational matrix goes even beyond the full image buffer? In that case we have to do something to substitute for the pixels that are missing. So we have an edge-case problem, and we supply you in this instance with three different options to deal with these edge cases: background color, edge extend, and copy in place. To demonstrate these three:
I'm going to start with this as an original image; all the lines between the different colors are clean and smooth, and the edges are clean. I'm going to do a blurring operation on it, and the first time I do this, I'm going to specify that for the edges, the color to use is black: if we don't have a pixel in the computation, we'll use a black pixel. The result of that comes out like this. You can see that the colors merge together on the edges, and on the outside of the image it just fades off into black gradually. The other extreme of that is a background color of white; it ends up looking like this, with a white background, and you can see quite a difference there. So that was the background color, the first option that we give you. The second option we give you is edge extend, which means that we take the pixels at the outside border of the image and just extend them out, copy them out, as far as we need to to perform the operation. The result of that blurring operation is this, and as you would expect, you really don't see any change when you get to the edge of the image; it just continues on as it does in the middle. The third case is copy in place, and what we are saying there is that if we don't have all the data we need to do the computation at any point, then we won't do it; we'll just copy the source pixel to the destination pixel and be done with it. This is what that looks like; you have to concentrate on the edges of the image, and you can see that towards the edges there is no blurring effect: once the computational matrix goes off the edge, we just do a copy from the source. So those are the various options that we give you to handle the edge cases.
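The three edge options just demonstrated can be sketched for a three-by-three box blur. The mode names here are invented for the sketch (vImage selects these behaviors via its flags word), but the behaviors mirror the demos: substitute a background pixel, clamp to the border, or pass the source pixel through wherever the kernel doesn't fully fit.

```c
#include <stddef.h>
#include <stdint.h>

enum { EDGE_BACKGROUND, EDGE_EXTEND, EDGE_COPY_IN_PLACE };

/* Read a pixel, substituting the background color or clamping to the
 * border when (y, x) falls outside the image. */
static uint8_t fetch(const uint8_t *src, long height, long width,
                     long rowBytes, long y, long x, int mode, uint8_t bg)
{
    if (y >= 0 && y < height && x >= 0 && x < width)
        return src[y * rowBytes + x];
    if (mode == EDGE_BACKGROUND)
        return bg;
    if (y < 0) y = 0;
    if (y >= height) y = height - 1;     /* edge extend: clamp */
    if (x < 0) x = 0;
    if (x >= width)  x = width - 1;
    return src[y * rowBytes + x];
}

static void blur3_edged(const uint8_t *src, uint8_t *dst,
                        long height, long width, long rowBytes,
                        int mode, uint8_t bg)
{
    for (long y = 0; y < height; y++)
        for (long x = 0; x < width; x++) {
            int fits = y > 0 && y < height - 1 && x > 0 && x < width - 1;
            if (!fits && mode == EDGE_COPY_IN_PLACE) {
                dst[y * rowBytes + x] = src[y * rowBytes + x];
                continue;                /* kernel hangs off: copy source */
            }
            int sum = 0;
            for (long ky = -1; ky <= 1; ky++)
                for (long kx = -1; kx <= 1; kx++)
                    sum += fetch(src, height, width, rowBytes,
                                 y + ky, x + kx, mode, bg);
            dst[y * rowBytes + x] = (uint8_t)(sum / 9);
        }
}
```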
A couple of features that I haven't yet mentioned, or maybe I have: all of the Apple libraries, the vector-accelerated libraries, are optimized for all Apple processors. So if you are, for example, running on a G3, the whole system is a G3, then a form of any given routine that is not vectorized, but still highly optimized for scalar, will be chosen. If you're running on a G4 or a G5, then an appropriately optimized vectorized version will be chosen. This is all done transparently to you, the caller.
Our library, vImage in particular here, is multiprocessor-safe. I should also mention that it's interrupt-safe, if you take some precautions to make it interrupt-safe. There are a lot of routines in vImage that do call malloc to allocate memory; however, if you don't want them to do that, we give you the option to supply your own memory. The calls that need memory also have an auxiliary call that returns to you the minimum buffer size that we will need to do the operation, so you can call that, allocate your own memory, and then there will be no system calls during the course of the operation.
vImage is a standard part of Panther. The data structures are open; it's simple and flexible; and unlike a competitor or two I could name, but won't, there are no license fees.
Okay, so that completes my portion of the talk. I'd like to bring up my colleague Ian Ollmann, who will talk about implementation techniques and performance.

Thank you. I wanted to touch on two subjects: mostly, what you can do to use vImage most effectively in your apps, to get the best possible performance out of it; and then, just for your own curiosity, some of the things we did to tune the functions that you get through the vImage sub-framework under the Accelerate framework.
So, a couple of things you can focus on. There are some memory-alignment things you can do; we don't require that you do anything in particular, but some things help, so I'll mention them. I'll briefly talk about tiling, and then also some multiprocessing and real-time considerations. Insofar as alignment goes: on all of our PowerPC processors, obviously, we keep our data available in the caches, and the data is blocked together in cache lines. On the G4 and G3 the cache lines are 32 bytes long, and they're aligned to 32 bytes; on the G5 they're 128 bytes, and similarly aligned to 128 bytes. So if you just arbitrarily configure your buffer to fit in memory, without worrying about how it's aligned, you'll probably end up with a set of pixels that are just sitting in a set of cache lines, as I've drawn here at the top, with just leftover space there. In certain cases, though, we find that we get a small performance loss for arbitrarily aligned pixels, so it's often worth your while to allocate your buffers in a way such that each pixel row starts at an aligned address. This is not required, obviously, and certainly in many cases in your image-processing work we understand that you need to operate on arbitrarily defined regions; let's say the user is drawing a box on your screen, so you don't have any way to control the alignment. All of our code works just fine;
it just works a little bit better if you're aligned. One of the pitfalls you can run into with rigorously aligned pixel rows is that you can get into a situation where each pixel row has a width which is an integer power of two. Small integer powers of two are not bad; large ones can get you into a situation where the processor has difficulty distinguishing between pixels that are in one row and in the immediate row right above it or right below it, and that can cause some small delays when storing data and then trying to reload data elsewhere: the processor may decide it can't tell the difference, and things might go a little more slowly. So if you find that you're allocating buffers that are, for example, 4096 bytes wide, which might happen if you had a 1024 x 768, you know, full-screen image at 32 bits, then what you need to do is perhaps add a little padding on the end of each row. You can do that very easily with our API; we support that. Just make sure your rowBytes field is a little bit wider than the width.
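Those two pieces of advice, align each row but pad past large powers of two, can be combined in one small helper. The 16-byte granule and the 2048-byte threshold here are illustrative choices for the sketch, not numbers from the talk.

```c
#include <stddef.h>

/* Pick a rowBytes value: round the row width up to an alignment granule,
 * then add one extra granule if the result is a large power of two, so
 * adjacent pixel rows don't alias in the processor's store queues. */
static size_t choose_row_bytes(size_t widthBytes)
{
    const size_t line = 16;                      /* alignment granule */
    size_t rb = (widthBytes + line - 1) & ~(line - 1);
    int pow2 = (rb & (rb - 1)) == 0;             /* exact power of two? */
    if (pow2 && rb >= 2048)
        rb += line;                              /* break the power of two */
    return rb;
}
```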
Tiling, of course, is a commonly used technique in image processing. The basic approach is that you divide up your image into smaller segments which are cache-sized, and this allows you to operate on the segments and keep them in the caches while you're working on them. So for example, if you had several filters you wanted to do in series, instead of applying one to the whole image, then the next filter to the whole image, then the third filter to the whole image, you could pick a small subset of the image and do all three to that. That means that for the second and third filters you'd be very likely to have the pixels already in the caches, so you're less likely to pay any penalty for going out to memory to get them. A few tips on how to do that: we found that tiling is only helpful some of the time, not all the time, so don't waste your time
if it doesn't help. We found it very easy to find out whether tiling is going to work for you: just push a small image through your code as it is, unoptimized, and then push a big one through, and take a look at how many pixels per second you're able to calculate in each case. If there's a big difference, then maybe tiling will pay off for you, and it's worth the time to go through it. In our experience we found that tile sizes that will fit in the L1 cache, which are about 16K to 32K, work best. Wide is better than tall or square, and it can be very wide; we found cases where a tile only 16 pixels high but 1024 wide was the optimal case. We also do some tiling in some of our functions, so if you're going to do your own tiling, in certain cases we imagine, although we haven't found any examples of it, that these two things could interact adversely. So we've provided you with a flag you can pass, kvImageDoNotTile, which basically tells us not to tile; you're going to do it yourself.
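The tiling idea, running every filter over one cache-sized band of rows before moving on to the next band, can be sketched like this. The per-pixel filter chain here is a stand-in for real image filters, purely to show the loop structure.

```c
#include <stddef.h>

typedef unsigned char (*Filter)(unsigned char);

/* Example per-pixel filters standing in for real image operations. */
static unsigned char f_inc(unsigned char v) { return (unsigned char)(v + 1); }
static unsigned char f_dbl(unsigned char v) { return (unsigned char)(v * 2); }

/* Apply the whole filter chain to one band of rows at a time, so the
 * band stays cache-resident across all filters instead of streaming
 * the full image through memory once per filter. */
static void apply_tiled(unsigned char *data, size_t height, size_t width,
                        size_t rowBytes, size_t tileRows,
                        const Filter *filters, size_t nFilters)
{
    for (size_t y0 = 0; y0 < height; y0 += tileRows) {
        size_t y1 = y0 + tileRows < height ? y0 + tileRows : height;
        for (size_t f = 0; f < nFilters; f++)   /* all filters per tile */
            for (size_t y = y0; y < y1; y++)
                for (size_t x = 0; x < width; x++)
                    data[y * rowBytes + x] =
                        filters[f](data[y * rowBytes + x]);
    }
}
```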
Another thing you can do is take advantage of our planar data format. Originally we were thinking of providing only planar, but we had so many requests for ARGB that it's a feature. However, there are many drawbacks to ARGB, and if you use planar data formats you can get around them. First of all, with ARGB you may not wish to operate on the alpha channel, so it's twenty-five or thirty-three percent more work using ARGB formats in that case, compared to just operating on the three color channels. Going with planar would allow you to do just the work that you need to do and skip over the other stuff, and touch less memory as well. Another nice thing about planar is that it's a kind of limited form of tiling, in the sense that you've now split up your image into three or four smaller parts, so in certain cases this may allow a channel to exist entirely in the cache, rather than half in and half out. That would allow you to push through several filters on just red, for example, and then move on to just green, and do pretty well. One of the problems
with geometric tiling, which is what I presented in the previous slide, is that if you've got something with a kernel matrix that needs to be applied, and the work for each pixel needs to look at all the pixels around it, that can make tiling a little bit tricky. With planar, however, you've got the entire image for each color channel, so that solves the problem quite nicely; it works pretty well. And then, finally, a bit of an implementation detail: a lot of our ARGB code will take the ARGB interleaved format, convert it into planar, do the work, convert it back, and then give you the result. All that happens in register, so it's pretty fast, but it's nicer not to have to do it at all. So if you use planar data you probably will get somewhat better performance; we often see the difference is about thirty percent.
So, multiprocessing and real-time issues, for those of you who have real-time constraints or want to get the most use out of multiple processors. A couple of things. All of our functions are single-threaded; we don't go to any effort to try to use two processors for you. We made that decision because we felt that you knew better where your data was and what your timing was, and you could do that better than we could. So we're all single-threaded, but we've made it as easy as possible for you: we're MP-safe, and you can call us reentrantly. If you're calling multiple functions over the same piece of data, then you need to do your own locking schemes to make sure you don't have any conflicts there. For real-time needs, we've gone through a considerable amount of effort to make sure that we don't call anything that's going to block on a lock or do anything else that's going to give you latency problems. Some of our functions do require additional memory to do their work; they take a temp buffer. You can just pass NULL for the temp buffer if you don't want to worry about it; we'll call malloc and it will get taken care of for you. But if you don't want to see malloc, then you need to allocate your temp buffer ahead of time and pass it to us, and we'll just use it. Finally, you know, for MP you can do better speed through tiling. Just one small comment about that: the one issue you're going to face is if you tile vertically, as shown here.
I mean, if you tile so that one processor is working on, say, the left half of the image and one is working on the right half of the image, then at the border in between you may have a lot of crosstalk between the processors: one might modify a pixel row, the other one might have to read it, and there are some communication issues. So you're better off dividing the image top half, bottom half. And of course, if you're going to do that multiprocessing, you need to be aware, for functions that use kernels, where we need to look at multiple pixels around the one in particular that you're interested in, that if one CPU is busy manipulating those pixels, you could get contention between the processors. So that's just something to watch out for.
Then I'd like to spend a few minutes talking about the optimization techniques we used, in the hope that you may find them useful in your own code. Essentially, working at Apple, we've developed over the years a general theory about the best way to go about pushing data through the processor, and essentially this means pouring data down the processor's throat as fast as possible; we'll talk about that in a minute. We also don't spend a lot of time guessing about what's wrong; in fact, we don't even use standard profiler tools of the kind you would find ten years ago. We try to get as much information as possible about what's going on and just fix the problems that we see; we do our own measurement and such.
As far as data-rich programming goes: current processors are extremely parallel machines, and the G5 is much more so, as I'm sure you've gathered from the earlier talks. Even on the G4 you can have ten scalar floating-point ops in flight using our fused multiply-add, and forty vector floating-point ops, all these things running in parallel. So if you don't have that much data independence, if you aren't processing that many independent streams of data concurrently, then you're just wasting processor cycles. So what we try to do is make sure that we actually have that much independence going on at all times, and flush that much data through the processor concurrently.
So in practice, in order to achieve that, you know the simple things: you can unroll loops. We're not doing that to get rid of loop overhead; we're doing it to make sure that we have eight or twelve or fifty or however many parallel calculations going on concurrently, to keep the processor full. We identify and eliminate compiler aliasing: if you have pointers pointing into buffers, the compiler might not know whether these overlap, and it might decide to keep the load/store order strict, load, do operations, store; load, do operations, store, in strict order, and that will kill your parallelism too. So we look for those and get rid of them; we move all the loads up to the top, do the work, and put all the stores down below, that kind of thing. You want to eliminate load/store-unit bottlenecks: a lot of code spends all its time loading data in and out of registers, so we look for ways to merge many small operations into a few big ones, and that way we can spend most of our time actually doing work. If you have certain instructions that take a long time, six, eight, ten cycles to get through, then we try to find enough work to keep us busy while we wait for that to happen. We avoid branching like the plague, so we use a lot of select and other kinds of things to make sure that our code flies in a straight line. As I mentioned earlier, we try to keep all the execution units busy at the same time, so if we're busy doing something in the floating-point unit, this might be a good time to also be loading data for the next loop; we schedule things pretty aggressively. And finally, we prefetch our data, just to make sure it's in the cache when we need it, so we don't have to take a long stall waiting for data to appear out of RAM.
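Two of those techniques can be shown in miniature: the sum below keeps four independent accumulator chains in flight (unrolling for parallelism, not for loop overhead), and the clamp is branch-free, select-style arithmetic (assuming arithmetic right shift on signed ints, as on PowerPC and common compilers). These illustrate the ideas, not vImage's code.

```c
#include <stddef.h>

/* Unrolled sum with four independent accumulators, so the additions can
 * overlap in the processor instead of forming one serial chain. */
static int sum_unrolled(const int *a, size_t n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];       /* leftover elements */
    return s0 + s1 + s2 + s3;
}

/* Branch-free clamp of a pixel-range intermediate into 0..255. */
static int clamp_0_255(int v)
{
    v &= ~(v >> 31);                     /* negative values become 0 */
    v |= (255 - v) >> 31;                /* values over 255 become -1 */
    return v & 255;                      /* ... then mask down to 255 */
}
```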
so far as our tiling goes we only did
for some functions because we only found
only some functions benefited generally
what we did was we took a look at the
first experiment I suggested earlier run
a small image and the big one and see
whether there was some improvement for
doing smaller images we also took a look
at different tile shapes so here I you
see a graph where I've taken a three by
three kernel and 21 by 21 kernel for the
same function and looked at how much
time it takes for different 2 tile width
so the titles are all the same size is
just rewind them should shrink them
vertically at the same time so you can
see that
you know there's some advantage to a
particular tile width in this case
a thousand twenty four or two thousand
forty eight is probably the optimal case
so that's what we chose and then of
course since we tune things per processor we
actually end up running this experiment
several times to make sure that the tile
sizes we pick for the g3 are optimal for the g3 and
the ones we pick for the g5 are optimal for the g5
there finally just as with
everything else in the Accelerate framework
we vectorize so our intent is to use the
velocity engine across the board
everywhere we can so you'll see that in
the final product we're going to have
AltiVec pretty much everywhere the
only exception is going to be histograms
which are a class of functions that just
don't work very well with the vector
unit typical speedups we see over scalar
code for that are four to ten times if
you haven't tried vectorization I
suggest you do that doesn't mean that
our scalar code is a slouch we make
sure that runs as fast as possible too
and in a couple of cases such as the
resampling filters we use the extra
speed to deliver a lot better image
quality so hopefully you'll like that
this is a beta release so I'm not
quite finished with every bit of
vectorization I'd like to do but certainly
we're working hard at it finally
experimentally driven optimization we
never guess or if we find we are
guessing we try to figure out how to run
the right experiment to find out what's
actually going on so obviously always
profile I'm sure you've heard that
before you can use tools like gprof and
Sampler but those only give you function
level information and only tell you
which function is performing slowly they
won't tell you why or what part of it or
what instruction in particular is
getting a stall so actually most of our
work is done using CHUD and Shark which
they're going to talk about later on
today and I'll invite Eric to give us a short
overview and we also use CPU
simulators like SimG4 and so these
things can be used to actually narrow in
and directly tell whether we're running
into cache misses
or paging or any number of
other problems which historically have
been very hard to diagnose you're just
kind of guessing what's going on but with
these we zero in on a problem and solve it and that
lets us very efficiently get to high
performance code and then finally if
you aren't already I urge you to inspect
your compiler output for the functions that
really make a difference since we are
almost always surprised by some of the
mistakes we make so with that I'll
introduce Eric Miller from the
architecture performance group who's going to come up
here and tell you a little bit about
CHUD which is the tool that we use to tune
our code good morning i am eric miller
with the architecture performance group
as Ian said the CHUD tools are one of
his favorite toys and i'm glad you put
them on the list although i would maybe
even reverse the order and put them above
gprof and Sampler but that's just me so
what are the CHUD tools well they are a suite
of performance analysis tools there are
several that are interesting probably
the most interesting we'll get to in a
minute but the idea behind them is that
they give you low level access to the
performance monitor Hardware counters in
the processor and the memory controller
and then we have implemented some
software versions in the operating
system that behave exactly the same as
the hardware performance monitor
counters the idea is to help you find
problems and improve your code
performance and the best part is they're
freely available on the web and they're
also on a developer tool CD one of the
neat things about the CD this year is
that you'll be able to install the tools
and immediately we have a CHUD updater
which is very similar to Software
Update the CHUD updater will automatically
go out and check the ftp site for new
versions of the tools and then you can
download them at your convenience we
recommend that you do this because the
version that's on the CD is probably a
week and a half old and we made quite a
number of improvements in those eight
days
so we generally will put out a release
every week at least during the beta
period and will probably reduce
the frequency slightly later once we have a gold
master so there are three main tools the
first tool is a profiling tool called
shark which Ian alluded to it is an
instruction level profiler it can do
many things that we'll get to in a
minute not least of which as Ian mentioned
is that you can inspect your compiler
output shark can produce the disassembly
from your source code very readily in
fact it's one of its key features
monster is a spreadsheet for performance
events and by that I mean you can
collect information about many things
the processor is capable of measuring
internally like cache misses and
instruction counts and cycles executed
those sorts of things and you can look at
them in a nice tabular spreadsheet form
Saturn is a call graph visualizer as it
says the idea there is
it's kind of like using gprof it goes
through and actually instruments all
your application code and then produces
the results of how often the
functions get called but you can also
have auxiliary information with regard
to performance monitor counts we also
have several tracing tools amber which
when you run it can collect
every single instruction that is
executed on behalf of your application
on the processor and put that into a
file then those files will be consumed
by acid which is a tool that we wrote in
our group and by SimG4 which is
produced by Motorola and SimG5 which
is produced by IBM those are cycle
accurate CPU simulators and of course Ian
and his team used the SimG4 product
quite readily the other thing you
can do with the CHUD tools is
instrument your applications and along
with that I'm running out of dots on
the slide you can also
create your own application performance
analysis tools using the CHUD framework
because
that's the exact framework that we
developed in order to create shark
monster and Saturn so I mentioned
performance counters several times what
are they well they're a series of
dedicated special purpose registers
in the processor and in the
memory controller in the g4 and g5
systems so what we can
do with those is set them up
to count and record what we call
performance events things like the
number of l1 cache misses or l2 cache
misses or l3 cache misses or instruction
counts instruction misses execution stalls
page faults in the operating system
there are a plethora of events
in fact on g4 you have something on the
order of maybe 200 events you can
measure on g5 there are literally
thousands of events that can be measured
so we use the CHUD tools and in
particular the CHUD framework to
configure and control all the PMCs so
I'm not going to do any demos this
morning because we're pretty short on
time but I just wanted to mention shark
because all you do to use shark is push
the start button and it will profile the
entire system it defaults to a time
profile and what that will give you is
in your application when you select it
from the list of profiled threads or
processes you'll see where in your
application in relation to your source
code it will highlight it for you and show
you this is where you spent your time if
you do an event profile suppose you
selected CPU cycles then shark can tell
you exactly how many cycles were spent
in your code for a particular line of
code and shark captures every single
thread on the system you know
any drivers or kernel extensions the
kernel itself and all the applications
that are running at any given time the
best thing about shark is it's very low
overhead as are all the CHUD tools you
can actually set the time profile down
to a minimum of about 50 microseconds per
time sample which is a couple of orders of
magnitude smaller than you can use with
sampler
and it also gives you an automated
analysis which will show up as this
column of exclamation points beside your
code so we annotate your source code you
click on these annotations and it'll
tell you things like this loop has a
non-changing variable and it's
serialized so you may want to move that
variable out of the loop or this loop
is a good candidate for AltiVec or
parallelization because there aren't any
data dependencies we do static analysis
and this can catch the surprises that
Ian mentioned from the compiler
because it will show you the
disassembly that the compiler generated
on your behalf and annotate it as to
how many stalls you'll have and how many
delays might be involved from other
aspects as for new features this year
well let me say this shark was formerly
called Shikari in the CHUD tools from
last year so it's been renamed shark
with a lot of new features one of the
features is that you can now save and
review all the sessions that you collect
for later analysis and there's also a
command line version that you can use to
drive with scripts and
things so we use this command line
version of shark whenever we have an old
Unix scientific application that just
runs in the command line and it has a
launch script we can just script shark
to begin then run our command
line application as normal and then
script shark to end here's a little
screenshot of shark and what you can see
in the left-hand picture would be the
result of actually a time profile and in
this particular picture we were running
a test and it turns out that the square
root function was forty-two percent of
the time at the bottom of
that left hand picture you can see
there's a little process menu and that
lists all the processes that were
running on the system when you did the
trace you can choose from any of those
and normally you would choose your own
this is a screenshot from last year's
demo which was called flurry which is
the screensaver
in the lower right window you can see
a piece of source code that's
annotated and the bright yellow lines
show you where your samples were
hitting in your code and it's
twenty four point five percent about
midway in that image there then you can
see the exclamation points on the right
that I mentioned and in this
particular case it was telling you that
using floating point division
could be quite costly and so you should
probably try to remove that and do a
multiplication if you were to double
click on that hot line there it would then
show you the assembly that that
line refers to and it would have some
more detailed annotations there the
next tool is monster which is the most
direct way to configure and set up the
performance monitor counters in the
CHUD tools in general there are timed
intervals so you can select a certain
number of milliseconds or microseconds
or seconds for that matter that you
would like to collect per sample in
the hardware you can also collect data
based on other events you can set it up
to collect a sample every so many
cycles or every so many instructions
completed or every so many cache misses
there's also a third way that's
actually related to both which is
called a hotkey all the CHUD tools have
a global hotkey in the case of shark
it's option escape in the case of
monster it's command escape
and if you use those keys you
don't actually have to have the
tool in front of
your other application so if you have
your application running in its full
screen then you can just use this hot
key to activate shark or monster or
Saturn and do the collection without
having to bring it to the front and
disturb your process and affect your
sampling one of the main things about
monster is that it is a big spreadsheet
of event collections over time per
sample and that's kind of nice but a lot
of times you'd be interested in combining
results for example you can collect a
lot of information from the memory
controller about transactions reads and
writes and you know the amount of time
because they're sampled over time per
sample then you can take those
transactions and apply a calculation to
them which we call shortcuts so you can say
every read is 16 bytes so I take
the number of reads times 16 bytes as the
number of bytes divided by the time and I
have the bandwidth so you can set up
these calculations in monster and have
additional columns in your spreadsheet
and these calculations are just standard
infix mathematical notation with
parentheses and it's basically a four
function calculator there's a table and
you can also draw charts and shark is
also capable of drawing charts then
also new in this version of monster
you can save and review the sessions and
the nice thing about this is you can
review sessions on a system that you
don't have in front of you so if you had
a g5 at your disposal you could
collect data with shark or
monster on your g5 then take it back to
your laptop or your desktop g4 or even
your iMac and review those results and
print off the charts and those sorts of
things and there's also a scriptable
command line version of monster which is
new this year here's a screenshot of
monster in the leftmost image there's a
column where you can click on
entries and it will highlight those
columns in the data and when you
highlight columns of data you can then
just press the draw chart button and
it results in a chart and there's
many options for charting there are bar
charts various colorizations line charts
with markers logarithmic scales linear
scales samples over time and samples as
a single x-axis just per sample plots
you can see in this particular case that
what's been highlighted are some of the
shortcuts a load/store session
was done so all the load instructions
were collected all the store
instructions were collected and all the
regular instructions were collected then
percentages of each were calculated
along with that for every
sample each sample is listed
horizontally in the table there
and vertically is each of these
shortcuts
so then you just highlight those columns
of shortcuts and we plot the percentages
which is what you see in the second
picture there is quite an extensive set
of sampling controls to configure the
performance monitor counters in both
shark and monster so the last thing is a
new tool we call Saturn which
like it says records your function
call history and the way we do this is
by instrumenting the functions at entry
and exit with GCC there's a compiler
flag that you throw and do a build and
it'll inject the Saturn entry and
exit prologue and epilogue functions
into every function in your application
now to be completely thorough you have
to go through and recompile all of the
frameworks and libraries and that's
similar to gprof
which is really not that fun to do so
most of the time we like to focus just on
actual application code but the nice
thing about Saturn is that once you have
this function call history you can
visualize that call tree and here in
this image you can see that the call
tree for CSE under main has been
highlighted and you see the red dashes
in that stack of bars there
that's where that function is called and
run so what you would want to use Saturn
for is in particular with C++ you have a
lot of call depth and if
things are very skinny and tall you're
spending a lot of time calling functions
not doing any work so you want to try to
avoid that you want a nice flat
profile you can also collect call counts
PMC event counts and execution times by
using the performance monitor counters
with your instrumented functions
that are injected at entry and exit of
each of your functions so as I mentioned
on the first slide we've got the
instruction tracing and simulation amber
is the instruction tracing mechanism and
the resultant files are in a format
that's called TT6 these TT6 files are
consumed by the other programs mentioned
on this slide acid is
our internal trace analyzer and actually
the acid trace analyzer is the parent of
the code coach and
the parts in shark that explain why you
have bottlenecks and what you might do
to change them and these come out of
acid it can also do a couple of things
on its own such as show the memory footprint of
your application it will give you a look at
a nice plot file you can find
instruction sequences that may be an
issue and then try and remove those
through the informational notes that it
gives you SimG4 is a cycle accurate
simulator for the PPC 7400 which is an older
processor from early g4 systems and
SimG5 will be available
in the near future and that will be a
cycle accurate simulator for the new PPC
970 these can be quite handy in tracing
particularly complicated performance
issues although the output of SimG4
and SimG5 requires a terminal window maybe
a 50 inch
monitor would work lastly the CHUD
framework is available to like I said
instrument your source code one of the
things you can do with instrumentation
is do one function call to start
and stop monster or shark sampling so
you can sort of put a caliper
around your interesting code
suppose you find a piece of code that
shark says is a hot spot and you want to get
more detail than just a trace
you can add calls to chud
start remote performance monitor
and chud stop remote performance monitor
and what happens is you just
click a key in monster or shark and it
will be in remote mode waiting
for messages from your application and
your application only so you can just
collect the data for your
interesting code you can directly read
and report on the PMCs by writing small
pieces of code either instrumented in
your application or as a separate
standalone application as I mentioned
you can write your own performance tools
and do all the things that need to be
done in order to create a performance
tool like shark which is control the
performance monitor counters
collect the information about the system
hardware which can be handy in a lot of
ways you can know that you're on a g5
or that you're on a g3 you
can know the bus speed of the system the
amount of memory in the system the number
of processors you can also modify some of
that information and there also is an
HTML reference document online that
describes all the various functions in
the CHUD framework here's a small
example of code with the CHUD framework
and this as I mentioned instruments
your code to start and stop shark or
monster so you just have to include
the chud.h header file initialize and
then acquire remote access start the
remote performance monitor with a label
that will show up in your output in
shark or monster so you know which
instrumentation it was then you run
through your important code stop the
monitor and release remote access secondly
a slightly more complex mode I
mentioned you can write your own
performance monitoring tool you
initialize acquire the sampling facility
you turn on some special filters maybe
mark your process as the only one to be
counted and then you set the events in
particular you say on both cpus program
performance monitor counter number one
with the event which happens to be
cycles and counter number two with the
event which happens
to be instructions clear the counters
start the counters your
important function executes stop the
counters then you collect these results
and you can perform a calculation and
get cycles per instruction in your own
application for more information
about that stuff you can get your own
download at this web address developer
apple com tools debugger stat HTML
then you can always contact myself and
my colleagues on the chudd tools
development team at this add your email
address and we tried to be pretty
responsive and that's probably the best
way to get your feature requests and
complaints into our cue let's see what's
next oh I guess I'm done so let me bring
up mr. Keithley that'd be great
[Applause]
so the roadmap a couple more sessions
today obviously one specializing
in CHUD itself but we should move on to
Q&A pretty quickly we're into that time
right now here's some contact info and our
reference library information