WWDC2003 Session 506

Transcript

Kind: captions
Language: en
Good afternoon. My name is Mark Tozer-Vilchez. I'm the Desktop Hardware Evangelist at Apple Computer in Worldwide Developer Relations. Welcome to Session 506, CHUD Performance Optimization Tools in Depth.

Now, optimization means a lot of things to a lot of different people. It can be anything from trying to get your application to launch faster, to accessing the network faster, to getting higher frame rates. The bottom line is, it's about speed. It's about performance, about getting your application to run faster than it currently does today, or faster than maybe your competitor's application does. There's a common denominator: you're looking to increase performance, and you need to know where that performance can be increased. In order to do that, you need tools that allow you to understand where that work can be done. Apple has created a set of tools for developers, shipped freely with the developer tools, that you'll hear about today, and I'd like to introduce a member of the team that's involved in creating these tools, Mr. Sanjay Patel of the Architecture and Performance Group.
Thanks, Mark. So, my name is Sanjay Patel; I'm in the Architecture and Performance Group. We're going to start off today by talking a little bit about the G5 from a programmer's perspective: some issues you may run into as you're moving code from the G3 and G4 over to the new systems.

To start off with, the 970 (the PowerPC 970 is the chip's official name) is a very superscalar, very wide, and very deep machine. It's based on IBM's POWER4 architecture. It's a true 64-bit implementation of the PowerPC architecture. It has the full AltiVec instruction set, all 162 instructions, all implemented in hardware. It also has a high-bandwidth point-to-point interface, so it's a little different from a bus: what we actually have is direct connections between the processor and the memory controller. And we also supply an automated hardware prefetch engine. What these engines do is detect patterns of memory accesses and prefetch those accesses into the local caches for you.

So here's a picture of the die. You might have seen this in the keynote or at the 970 presentation yesterday, but what we have here is two load/store units, two independent fixed-point units, two independent IEEE-compliant floating-point units, the full set of the four AltiVec subunits (the ALU as well as the permute unit), a branch unit, and a unit to handle condition register logical operations.

So here's another view from the top. You see instructions begin in the L1 cache; they go into fetch and sit around in some queues; then they go to dispatch, where up to four instructions plus one branch can be dispatched on every clock. So this is a really wide machine. There you can see they get fed into 10 issue queues to the 12 execution units. So again, this is just the widest machine you've probably dealt with.

OK, how does this all compare against the G4? Well, to keep all these units flowing, the core can actually have over 200 instructions in flight, versus a little over 30 in the G4, if you count the completion buffers as well as the various queues. The pipeline stages have been expanded, so we're at 16 stages for a simple instruction versus 7 in the G4. As I mentioned, we have two load/store units versus one, as well as two floating-point units versus one in the G4. There are two general-purpose fixed-point units, where in the G4 there were three dedicated simple units and one complex unit. Vector is similar: there's the ALU, which includes vector floating point, complex integer, and simple integer, and the permute unit.

Let's talk about the caches; there are quite a few differences here. First and foremost for programmers, the cache line size has changed: it's 128 bytes where it used to be 32 bytes. The L1 data cache is the same size, but it's a two-way associative, write-through design, versus eight-way and write-back on the G4. The instruction cache has been doubled, so it's 64K now, although it's a direct-mapped design versus eight-way associative on the G4. The L2 cache is also doubled, so now we're at a full half megabyte; it's eight-way associative in both G4 and G5, and the replacement algorithm is LRU versus random on the G4. There is no L3 cache on the G5, whereas on the G4 you had up to 2 megabytes. That's partially made up for by the fact that processor bandwidth is just tremendously higher on the G5: it's up to 3.5 gigabytes per second effective in each direction, out to and in from memory simultaneously, versus a 1.3 gigabytes per second bus for the G4. On the other side of the memory controller, we've doubled the width of the DDR interface as well as increased the clock frequency, so more than twice the bandwidth is available from the DDR chips: 6.4 gigabytes per second versus 2.7 gigabytes per second on the G4.

OK, so what does all this mean from a programmer's perspective? What are some things you're going to look out for as you're porting your code and optimizing it on this chip? The first thing you'll notice is that there are more pipeline stages here, which means instruction latencies have grown from the G4. So how do you work around that in your code? Well, you should do more in parallel: manually unroll important loops, or try to use compiler flags such as -funroll-loops with GCC. You can also schedule your code using -mtune=970 with the new GCC 3.3.
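To make that concrete, here's a minimal sketch in C (my example, not code from the session): a sum loop manually unrolled four ways, so the core has independent operations in flight to hide the longer latencies.

    /* Manual 4-way unroll: four independent accumulators give the 970
       independent work to issue in parallel. Note that reassociating
       float adds changes rounding slightly (fine under -ffast-math). */
    float sum4(const float *a, long n)
    {
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
        long i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i + 0];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)   /* remainder */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }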
Now similarly, because the pipeline is longer, branch mispredictions are going to cost more; it's just going to take longer to recover from a mispredict. There are several solutions you can use here. If you're coding in C, GCC offers __builtin_expect: for a very highly predictable branch, such as maybe exception code that you expect not to be taken very much, you can use this built-in. If you're coding in assembly, we have the new ++ and -- suffixes for all branches, so you can mark a branch as either highly taken or highly not taken.
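As a quick illustration (my example, not the session's), __builtin_expect is a real GCC built-in, and a common pattern is to wrap it in likely/unlikely macros:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hint the compiler about which way a branch almost always goes. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    void process(int *buf, long n)
    {
        long i;
        if (unlikely(buf == NULL)) {   /* rare error path */
            fprintf(stderr, "no buffer\n");
            abort();
        }
        for (i = 0; i < n; i++)
            buf[i] *= 2;
    }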
Another solution is to just not do branches at all. In floating point you have fsel, which is enabled with fast math, and what that allows you to do is a conditional move in the floating-point registers. In the vector domain you have vsel, a very similar operation; use it with masks. In the integer domain you have the carry bit, so this can be used for min- and max-type operations. You can also use masks to avoid branches when you're doing integer work, which effectively gives you conditional moves.
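Here's a small sketch of the integer mask idea (my example): computing min and max without a branch by building an all-ones or all-zeros mask from a comparison.

    /* Branchless min/max: (a < b) is 0 or 1, so -(a < b) is an
       all-zeros or all-ones mask that selects between a and b. */
    static inline int imin(int a, int b)
    {
        int mask = -(a < b);   /* 0xFFFFFFFF if a < b, else 0 */
        return (a & mask) | (b & ~mask);
    }

    static inline int imax(int a, int b)
    {
        int mask = -(a > b);
        return (a & mask) | (b & ~mask);
    }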
Then there's feedback-directed optimization. This is something most programmers don't try, but it can be very effective on the G5: if you can do a representative run of your program, you let the compiler annotate that run and then mark all branches as highly taken or highly not taken. This can be very effective for improving branch performance on this long pipeline.

As I said, the data cache is quite different than it was on the G4, and the most important thing here is that it's a 128-byte line. What can you do to work around that? Well, it's either a win or a loss for you, depending on your code. If you have a lot of locality, you're probably going to incur one miss where you would have had four misses on a G4 system. So you should design your algorithms and your data structures to move through memory as sequentially and contiguously as possible. That's also going to trigger the hardware prefetcher, and this is very powerful, because it will amortize all the latency out to memory.

So that's the next topic. Because it is a point-to-point interface to the memory controller, effective latency may be higher than what you've seen on a G4 system, and that's because, to maintain coherency, you have to go to the memory controller and then bounce back to another processor. What can you do to avoid those penalties? Well, software prefetching: there are several instructions for that. And of course the hardware prefetcher is the best solution, because it's self-paced and it'll be synchronous with your code: as you miss, the hardware detects those misses, detects the pattern, and prefetches lines for you. You can also batch your loads: if you need to access several pieces of data and you know you're going to need them in advance, try to group those loads together, because the bus can support several misses simultaneously.

Now, the data stream touch (dst) instruction from the AltiVec instruction set is execution-serializing on the G5, because it's mapped onto the existing hardware prefetcher mechanism. So what can you do to avoid dst? Well, first of all, you can probably just remove it. It is only a hint, so there's no guarantee that dst was being effective for you in the first place. The preferred solution is to rely on the hardware prefetcher: assuming you have contiguous memory accesses, that's going to work for you automatically. Now, if you have non-contiguous accesses, we recommend that you replace a single dst with several dcbt (data cache block touch) instructions, issuing one of those for each line.
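As a sketch of that replacement (my example, with a made-up strided access pattern): GCC's __builtin_prefetch compiles to a cache-touch instruction such as dcbt on PowerPC, so you can touch each upcoming line yourself.

    /* Non-contiguous walk: one element per row. The hardware prefetcher
       won't catch a large stride like this, so touch the upcoming lines
       explicitly. dcbt is only a hint and never faults, so prefetching
       slightly past the end is harmless. */
    float sum_column(const float *base, long rows, long row_stride_floats)
    {
        float s = 0.0f;
        long r;
        for (r = 0; r < rows; r++) {
            if (r + 4 < rows)   /* touch a few rows ahead */
                __builtin_prefetch(base + (r + 4) * row_stride_floats);
            s += base[r * row_stride_floats];
        }
        return s;
    }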
Legacy code that uses dcbz, which is the zeroing of a cache block, or dcba, the allocation of a cache block, is going to perform very poorly on the G5. Why is that? Well, dcbz is emulated to only work on 32 bytes, and we had to do this to ensure backwards compatibility with existing code. dcba is not implemented on the G5, so that's just going to be an illegal instruction: you'll end up in an exception handler and then bounce back to your code. This is going to be tremendously bad for any performance-critical code, and the only reason you would have used these instructions is because the code is performance-critical. So the solution is: get those dcbz's and dcba's out of the code. Again, dcba is just a performance hint, so removing it shouldn't affect any kind of functionality. If you do need to zero a cache line, we recommend you use memset or bzero rather than trying to roll your own zeroing functions. But if you really do need a dcbz-style function, an allocate-and-zero of a cache line, we have a new mnemonic called dcbzl, and that's going to zero out whatever the native cache line length is on any system, whether it's G3, G4, or G5.

So here's an example of how to use dcbzl. Now, for those of you who have used dcbz, you'll say, well, the original definition of dcbz was simply to zero out the native cache line length, so what have we changed? The reason we have to have this new mnemonic is that most programmers ignored that warning: they coded for 32 bytes, and now they're going to get bitten. What we would much rather have you do is code based on line size: effectively, stride through memory based on whatever the current line size is on the system, which you can get from the operating system. And of course, if you're just doing a memory-fill operation, we'd much prefer that you use memset or bzero.
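A sketch of querying the line size on Mac OS X (my example; hw.cachelinesize is the sysctl name, and I'm assuming the value comes back as an int here), so your stride isn't hard-coded to 32 or 128 bytes:

    #include <stddef.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/sysctl.h>

    /* Ask the OS for the native cache line size instead of assuming it. */
    static size_t cache_line_size(void)
    {
        int line = 0;
        size_t len = sizeof(line);
        if (sysctlbyname("hw.cachelinesize", &line, &len, NULL, 0) != 0 || line <= 0)
            return 32;   /* conservative fallback */
        return (size_t)line;
    }

    /* For a plain fill, memset/bzero is the recommended path. */
    void zero_buffer(void *p, size_t n)
    {
        memset(p, 0, n);
    }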
So, synchronization primitives, locks, syncs, isyncs: they are going to be more costly on this chip than they were on the G4, for two reasons: one, the longer pipeline, and two, the longer latencies to memory. This is a tough one, but what you have to do is make sure all your locking is absolutely necessary, minimize the lock hold time so you're not contending for locks as much, and of course ensure that each lock is in its own cache line so you don't have fighting between processors.
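A minimal sketch of that last point (my example): padding each lock out to its own 128-byte line so two processors never contend for the same line.

    #include <pthread.h>

    /* One lock per 128-byte cache line: the padding keeps unrelated
       locks (or a lock and hot data) from sharing a line and causing
       coherency ping-pong between processors. Assumes the mutex fits
       within a single line. */
    typedef struct {
        pthread_mutex_t mutex;
        char pad[128 - sizeof(pthread_mutex_t)];
    } __attribute__((aligned(128))) padded_lock_t;

    padded_lock_t locks[4];   /* each element lands on its own line */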
Scheduling is crucial for this chip, and it's going to require recompiling, or even hand scheduling, for optimal performance. What we recommend is that you use GCC 3.3, which has a pipeline model and scheduling model for the 970. The other thing you can do, for really performance-critical code, is understand dispatch group formation using Shark; for those of you who don't know what Shark is, we'll get to that in just a minute.

OK, so in summary: this is a very parallel core. You have basically two of each unit, LSUs, FPUs, FXUs, lots of renames, lots of instructions in flight, so if you have very serial code, it's simply not going to take advantage of this core. What you want to do is, of course, unroll and schedule. You can also use AltiVec to get up to a theoretical peak of 32 gigaflops on a dual 2 gigahertz system. Also, the 970 has full-precision hardware square root, so you don't need to make calls to any libm functions for square root anymore; if you're using GCC, we offer the -mpowerpc-gpopt flag for this. We also have native long long support, because this is a 64-bit chip: it can natively do doubleword arithmetic in leaf functions using -mpowerpc64.

OK, so again, the system and the chip are all designed for high bandwidth. There's incredible bandwidth to the L1 cache and between the caches (32 and 64 gigabytes per second), and effectively three and a half gigabytes per second in each direction on the bus. Take advantage of that by streaming, using software prefetching and the hardware streaming prefetcher. Again, the optimal cache control instruction, rather than a dst, is dcbt to prefetch; if you have a dst that covered a lot of ground, multiple cache lines, then issue multiple dcbts in its place. Don't use dcbz, because that's emulated; use the dcbzl instruction, but be careful if you're using it, and make sure you account for the cache line size. And again, dcba and dcbi are illegal, so those just need to be removed from your code.

OK, so we've talked a lot of theory here. How do you actually get down and dirty with your code and figure out what's going on? Well, that's where CHUD comes in, so I'd like to introduce Nathan Slingerland.
Thank you, Sanjay. So, hopefully a lot of you were introduced to the CHUD tools last year at WWDC, at least the version 2 tools, and this year we're happy to give you version 3 of the tools; we have a lot of enhancements and improvements in it. Basically, the CHUD tools are a suite of low-level performance analysis tools written by Apple's Architecture and Performance Group, and they give you access to the performance counters in the processor, memory controller, and operating system. Using these counters, with CHUD, you can find problems in your code and improve your code. And of course it's free: it's on the Developer Tools CD, and it's also on the website.

In 3.0 we have several classes of tools. Profiling tools, tools to find out where things are happening: these include Shark, the successor to Shikari if you've used that, and we'll get to all the great new features it has; MONster, a spreadsheet for performance events, which has a lot of great new features too; and Saturn, a new tool for visualizing function calling behavior. For tracing, if you've ever done AltiVec or very processor-critical code, sometimes it's useful to see how things are actually happening on the processor, so we have amber, to take an instruction-level trace of a particular program, and then acid, a program to analyze this trace. There's simg4, a PowerPC 7400 cycle-accurate simulator, and soon simg5, so you'll be able to simulate the PowerPC 970. And of course we provide the CHUD framework: this is the API we use in all our tools, and you can use it to make your own tools, or call into the CHUD tools and have them do what you need.
OK, so the performance counters. These are in the processor, memory controller, and operating system, as I mentioned, and what they do is count interesting low-level performance events, things such as cache misses or instruction stall cycles, that you'd otherwise have to use a simulator to find out about. Page faults in the operating system: you can find out when those happen. The CHUD tools let you configure these counters, tell them what to count, record the counts, and then you can use the tools to look at the results.

OK, so the first tool we're going to talk about is Shark. Shark is a system-wide profiling tool, so you can use it to profile a process, a thread, or the entire system if you want to look at that. The most basic usage is just a time profile: this will show you where the hotspots are in the system, where the system is spending its time. You can also use any of the performance counter events, so you can get an event profile to see where hardware events relate to your source code, for example where cache misses might be coming from in your code. We capture everything: drivers, kernel, applications. So if you're a driver or kernel extension writer, you can also use Shark to see the call stacks and find out where things are coming from. And of course it's very low overhead; this is all handled inside of our own kernel extension. Once you've gathered the session that you're interested in, taken the samples that you want to look at, we provide automated analysis: we annotate your source code and the disassembly of your code, and give you optimization tips about how to improve it. There's also static analysis: you can use this to look for, say, dcba instructions in your code, if you want to make sure you catch every instance of those. There's a command-line version, so this is scriptable, and you can also telnet into a machine and use it from the command line. And of course you can save sessions, and you can give them to your colleagues, pass them around, whatever you'd like to do.
So, MONster is a more direct interface to the counters; it lets you look directly at the results of the counters. You can configure them using MONster, collect the PMC data based on timed intervals, a hotkey, or event counts (every 10,000 cache misses, for example), and then you can view the results in a spreadsheet or a chart. In addition to the raw performance counts, there's a built-in shortcut language: this is an infix computational language in which you can program your own metrics that you're interested in, or you can use the built-in ones, things like memory bandwidth over the memory bus, or cycles per instruction, a variety of things. There's a command-line version of MONster provided as well, for your scripting and remote sessions, and you can save and review these sessions as well.
OK, so the best way to see how to use these tools is with a demonstration. What we're going to look at is a program called the Noble Ape simulation. This is written by Tom Barbalet, and he's simulating apes on a tropical island, and these apes can think. He's simulating the biological environment, so the food and the other animals on the island, as well as the cognitive processes of the apes; obviously simple cognitive processes, such as desire and fear and those kinds of things. This is open source; for more information please check out his website at nobleape.com.

So let's switch to the demo machine, and you can see Noble Ape in action. OK, so this is the map window here; this shows the island, and each red dot represents an ape running around the island doing its thing. We can select one ape at a time, that's the ape with the red box around him there, and for that ape we can see what's happening in his brain, in the brain window. And of course, any good performance study requires a performance metric, and our metric is ape thoughts per second. So this is the original code, and we have this metric at around 1200, 1300 or so.

OK, so the first thing we'd like to do is launch Shark, and we'll see what's happening in the system. OK, so this is the main Shark window, and it's really pared down and simple, just to let you start your work. By default we come up with the time profile, since this would be the most common thing you'd use. We provide a bunch of other built-in shortcuts and configurations, and of course you can create your own using any of the performance counters, but for now we'll use time profile. There's a start button here for starting sampling, but there's also a global hotkey, so that Shark doesn't have to be in the foreground; it can be in the background and you can start it. So we'll use that hotkey, and we'll take a five or ten second sample and see what's happening. All right.
So here's the profile. What we've done here is just list the functions that were sampled inside Noble Ape, from most samples to least samples. When you're optimizing, you want to work on what's running most of the time, because then you're going to get the most benefit out of optimizing that code. So we see that Noble Ape is fifty percent of the system. This is the process pop-up, and like top, it lists what was running in the system. It's kind of strange that it's only fifty percent of the time, even though we know that we're CPU bound. Well, if we go to the thread pop-up here, you can see that in fact it's single threaded, and this is running Noble Ape on a Power Mac G5 with dual two gigahertz processors.

So, all right, the next step is we want to thread this thing, since we want to take advantage of both processors. This is the heavy view that we're looking at; there's a heavy profile view and a tree view. In the heavy view we can open up these disclosure triangles and see how we got to this heavy function: we started in main, which called flat cycle, which called control cycle, then cycle a troop, and then cycle troop brain scalar, this important function. Now, we know our code, and we know that we can't really split the processing between simulation cycles. The way this app works, it does a simulation cycle, and within any simulation cycle it's processing a bunch of ape simulation cycles. The simulation cycles themselves are not independent; they depend on one another. But we know that the apes are independent; they're independent thinking apes. So we can parallelize at that level: we can process the apes in parallel within each simulation cycle. So that's what we did: we threaded it to split up the number of apes; we have 64 apes, so we split them between two threads evenly.
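A minimal sketch of that split (my example; the session doesn't show its threading code, and names like ape_cycle and NUM_APES are hypothetical): two pthreads each process half of the apes for one simulation cycle.

    #include <pthread.h>

    #define NUM_APES 64              /* hypothetical: the demo's population */

    extern void ape_cycle(int ape);  /* hypothetical per-ape simulation step */

    typedef struct { int first, last; } range_t;

    static void *worker(void *arg)
    {
        range_t *r = (range_t *)arg;
        int i;
        for (i = r->first; i < r->last; i++)
            ape_cycle(i);            /* apes are independent, so this is safe */
        return NULL;
    }

    /* Run one simulation cycle with the apes split across two threads. */
    void simulation_cycle_threaded(void)
    {
        pthread_t t1, t2;
        range_t a = { 0, NUM_APES / 2 };
        range_t b = { NUM_APES / 2, NUM_APES };
        pthread_create(&t1, NULL, worker, &a);
        pthread_create(&t2, NULL, worker, &b);
        pthread_join(t1, NULL);      /* cycles depend on each other, so wait */
        pthread_join(t2, NULL);
    }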
You'll notice that the brain rate was originally around 1200; when we do this, we get to around 2400, almost 2x, well, not quite. So that's pretty good: we've got a nice speedup just from threading, from taking advantage of that second processor. So let's profile again and see what's going on. OK, right, so now in the process pop-up we can see that Noble Ape is taking up a much more significant amount of the system; that's a good thing. And we can see from the thread pop-up that we have the main thread, that's the 9.2 percent, which spun up two computational threads, each about forty percent of the time.

So the next thing we'd like to do is actually optimize this function. This function is important to us: it's almost ninety percent of the time we spend in here. If we double-click on it, Shark will present us with our source, highlighted where we sampled, so this tells us where in our source code we're spending time. So inside of this cycle troop brain scalar, it's this for loop: we can see that this for loop actually represents about ninety-four percent of the time in this function.

OK, so I should probably talk about a couple of these other things here. The scroll bar at the side is like an overview: you can easily jump to the hotspots in your code, and it's colored accordingly; a brighter yellow means more samples. At the top we have a source file list: sometimes you can have more than one source file contributing to a particular function, header files and things like that. And this function pop-up is like what you have in Project Builder: you can easily jump to different functions. Then we have the Edit button; what this allows you to do is jump into Project Builder at the same selected line, so you can easily go to where you want to edit and change something once you know where the problem is.
OK, so let's go back to Shark. What Shark does is provide us with advice: this little exclamation-point button is advice for us, calling something out. There's some advice here, but we'll focus just on the first one: this loop contains 8-bit integer computation. Obviously, if you're spending a lot of time in 8-bit integer computation, it might be a good idea to use AltiVec to really improve this code. So that was our next step; let's go try that out. Right, that's a nice speedup, but we're not done yet. So let's profile again and see where we're spending time now.

All right, so we'll double-click again on this; we see the vector function shows up at the top of the profile, and this is the vector code. A lot of you, if you've used Shikari, saw the assembly view: if you double-click on any source line, it's going to jump to the assembly view that you're familiar with, and if you double-click in the assembly, it will jump back and highlight the line, the instruction or instructions, that correspond to that source line. So this can give you an idea of how good the codegen from your compiler is: how many, and what kind of, instructions it is generating for each source line.

OK, so if you've seen this before, the columns here: we have samples, that's how many times we sampled each instruction; the address; the instructions themselves, and you can switch between various views of the address; cycles, which is the latency and throughput of a particular instruction, and these are for the 970 (the CPU model down in the right-hand corner tells you that); the comment column, with various things about this code; and of course the source file.

Now, one of the nice things is that we give you the ability to visualize dispatch groups. If we go to this option and turn it on, and if you remember the diagram that Sanjay showed earlier with dispatch, we can see here how usually between four and five instructions get dispatched in each group, one group per cycle. So this can give you a good idea of how things are actually behaving on the machine.
The other thing we provide is this functional unit and dispatch slot utilization graph. Sanjay, do you want to talk a little bit to that? Right, so on the 970, as a programmer, the key bottleneck you'll have to face is maximizing dispatch group width, because that's one of the narrower points in the core: it's four instructions wide, plus a branch. So what we've got for you here in dispatch slot utilization, over here on the right, is the average group size: you can see how effectively your code is taking advantage of this really wide dispatch machine. And dispatch defines where instructions are issued, to which functional units. So here you see a map of the 12 functional units that I talked about, and you can see the units that are symmetrical, like the two LSUs here. If there's a big imbalance between, say, these two LSUs, where one is doing a lot of work and one's not, well, that's something that you could probably correct with scheduling or reordering your code, because what you want to do is balance the execution units: you don't want half the chip doing all the work and the other half sitting idle. All of that is defined by dispatch groups, and that's why we put dispatch group modeling into Shark. And this is dynamic: you can select a few instructions and it'll tell you where they got mapped; the charts update and the numbers update with it. Great.
So this can obviously help you tune your code on the Power Mac G5. But let's go back to the source view and look a little closer at our vector code. You remember we vectorized this inner loop that we thought was taking up ninety-four percent of the time. Well, now that loop is still important, but it's taking up a smaller portion of the time in this function. We can also see that toward the top and bottom of this function we're spending more time, relatively speaking, in the scalar code: the two loops that we didn't touch. This code is very similar to the other loop, it's almost exactly the same, and Shark will point this out: it's saying, vectorize these loops too, this is important now. So that was the next step: to vectorize the rest of this, the entire function, as well as a few other optimizations. So let's try that out. We're starting at around 10,000 or so, so there's another forty to fifty percent we can eke out by vectorizing the rest of that function. You can see some of the gorillas have gone off into the water; you can bring them back to life by dragging them back to land. A little bit suicidal, yeah; they just like the beach. All right, so that's on the order of a 14x or 15x speedup, which is pretty decent.
We hope you can all do that well in your code too. OK, so we have a few more things to show; there are a couple of things we didn't really talk about. Shark allows you to manage sampling sessions, and we've taken about four sampling sessions here. You can either look at these in parallel, in a multi-window mode, or deal with one window at a time. The multi-window mode is nice because you can put them side by side; there's also a session drawer so you can quickly switch between them in the single-window mode. And of course, as we mentioned, you can save sessions. There's also ancillary information included: whenever you take a session, it records what kind of machine it was on, and gives you some space to write notes to yourself about what's happening. So this is archival; you can keep it around and remember what happened. There's also an extensive user guide included, right online here; please read it, there's lots more information, and there are features covered in there.

OK, so one other thing we wanted to look at: we want to use the MONster tool to look at some of these performance counters in depth. So this is the main MONster window, and this is the spreadsheet. On the left-hand side we can see the various performance counters that are on this system, and on the right is the spreadsheet itself. The shortcut pop-up is similar to the sampling config selection in Shark, same thing, but what we can do is edit these shortcuts. So if we go to the shortcut tab, we're going to look at memory bandwidth. This shortcut takes a few of the U3 counters, the memory controller counters on this Power Mac G5, and it's going to calculate the number of megabytes transferred over the memory bus. The way it does this is it counts the number of beats; each beat is 16 bytes on the data bus, so it can multiply that out and figure out, for every 10 millisecond sample, how many megabytes that would be.
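As a worked sketch of that arithmetic (my example; the beat count and this helper are hypothetical, only the 16-bytes-per-beat and 10 ms figures come from the session):

    /* Convert a beat count from a memory-controller counter into bandwidth.
       Each beat moves 16 bytes; samples here are 10 ms apart. */
    double mb_per_second(unsigned long long beats)
    {
        const double bytes = (double)beats * 16.0;    /* 16 B per beat */
        const double seconds = 0.010;                 /* 10 ms sample */
        return bytes / (1024.0 * 1024.0) / seconds;   /* MB/s */
    }

    /* e.g. 163,840 beats in one 10 ms sample:
       163840 * 16 B = 2.5 MB per sample, which works out to 250 MB/s */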
So we have a session saved; let's open that up. If we pop open the run pop-up, you can see that in this session there were four runs taken: the first was just the original scalar code, the next was the threaded scalar code, then the vectorized threaded code, and finally the optimized vectorized threaded code. What we want to look at is how this memory bandwidth changed as we changed the code, so we can use the chart to see this. We took 100 samples total, so 25 samples for each type of test. Originally, you can see that the memory bandwidth varies over time a little bit; we're around 250 megabytes per second for the original single-threaded code. When we threaded it, we're using closer to 400 megabytes per second. Then AltiVec can use a lot more of this bandwidth, around 1250 megabytes per second on average, and finally the optimized code is at seventeen hundred megabytes per second on average. So you can see how, as we optimized the code, we were able to take better and better advantage of this massive bandwidth that's available on the Power Mac G5. OK.
So let's go back to the slides. You might wonder how this compares against the G4. So, we started out with the regular scalar code: on the G4 you get about 1200 ape thoughts per second, and we were getting closer to 1300 on the G5. Well, you say, the G5 is running at a much higher frequency, but we're barely getting a little more than ten percent faster performance here; so what is the bottleneck? Well, the initial bottleneck was all integer performance, and with the longer-latency, longer-pipeline instructions, you're just not going to see the full frequency increase as a performance increase. When we went to threaded code, we started to expose the better bandwidth on the bus, because we have two processors and they each have independent point-to-point connections to the memory controller; now you can see we're over twenty percent or so faster than the G4. And then we really break this open when we go to vector. The G4 does well, we get a two-and-a-half or so X speedup from using vector, but on the G5 we don't hit any bandwidth limit yet, so we get a full 4x improvement from going to vector. And then with vector optimized, you can see again the G4 does pretty well, a nice speedup, and the G5 gets another sixty percent speedup. And if you look back at that MONster chart, you can see we were getting peak bandwidth of 2.5 gigabytes per second on the bus, so we're still not done yet; we just didn't have more time to optimize before this demo. But clearly there are a lot of resources there, and if you start with basic code, you might get a decent speedup over a G4; but if you put a little effort into it, you can get a very big speedup if you take advantage of AltiVec and take advantage of all the bandwidth that's available to you.
OK, so the third tool we'll talk a little bit about is Saturn. Shark, the profiling tool that we've talked about, provides a statistical profile: it's periodically interrupting the system, recording where you are, and then going on, and afterwards we say, wherever we got the most samples, that's where the most time was spent. Saturn, instead, is going to instrument every function in your source code to give you an exact profile, and this allows you to visualize the call tree. It uses GCC to instrument each function at entry and exit, and it records the function call history to a trace. With this we can get call counts, so how many times each function was called; performance monitor counts, it can use those as well; and the execution time of each function.
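The GCC mechanism behind this kind of entry/exit instrumentation is the -finstrument-functions flag (Saturn ships its own recording library; these printf stubs are just my sketch of how the hooks work):

    #include <stdio.h>

    /* Build with: cc -finstrument-functions foo.c hooks.c
       GCC then inserts a call to these two hooks on every function
       entry and exit, which is how an exact call trace gets built.
       The attribute keeps the hooks themselves from being instrumented. */
    void __cyg_profile_func_enter(void *fn, void *callsite)
        __attribute__((no_instrument_function));
    void __cyg_profile_func_exit(void *fn, void *callsite)
        __attribute__((no_instrument_function));

    void __cyg_profile_func_enter(void *fn, void *callsite)
    {
        printf("enter %p (called from %p)\n", fn, callsite);
    }

    void __cyg_profile_func_exit(void *fn, void *callsite)
    {
        printf("exit  %p\n", fn);
    }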
So if we look at this: at the top we have the familiar call tree view, which shows how and where we spent time in each function and its descendants, things like that. At the bottom we have something that's viewing the same data in a different way: it's plotting call stack depth vertically versus time on the horizontal axis. What you can use this for: if you see a very sharp, narrow spike, that means you're spending a lot of time in calling overhead; you're going through many, many function calls and not getting a lot of work done, if it's not a wide stack. OK, so that's Saturn.

And of course there's the CHUD framework. You can use the CHUD framework to instrument your source code; you can use it to start and stop MONster or Shark; you can also write your own performance tools. A lot of the functionality, almost all of it, that's in Shark and MONster is exposed in this framework, so you can set up, start, and stop the PMCs, and collect information about the hardware; a lot of things that you'd otherwise have to go through IOKit for, which might be a lot of extra code to get at. And of course an extensive HTML reference guide is provided.
OK, so here's an example of using the framework to remotely control either Shark or MONster. What you do is pick the profile that you're interested in, and then place either of those tools in remote mode; that means allowing other tools that want to connect and control the start and stop of the counters to do so. So, first we initialize, then acquire remote access; make sure that the other tool is actually waiting for us to do something, and this will block if the other tool isn't currently waiting. Start the remote perf monitor; for this function you can give it a label that's going to appear in the tool. Do whatever it is you're interested in, whatever the code of interest is. Stop the remote perf monitor, and of course release, to be a good citizen.
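In code, that sequence looks roughly like this (a sketch only: the function names follow the remote-control calls as described here, but check the CHUD framework headers for the exact include path, signatures, and error handling):

    #include <CHUD/CHUD.h>   /* CHUD framework; header name may differ */

    void profile_code_of_interest(void)
    {
        chudInitialize();
        chudAcquireRemoteAccess();      /* blocks until Shark or MONster
                                           is listening in remote mode */

        chudStartRemotePerfMonitor("my hot loop");  /* label shown in the tool */
        /* ... code of interest ... */
        chudStopRemotePerfMonitor();

        chudReleaseRemoteAccess();      /* be a good citizen */
    }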
So that's one way to use the framework, to instrument. The other way is more direct: you can set up the counters directly and read them directly. Initialize, then acquire the sampling facility. There's only one set of performance counters in the system, right, it's one set of hardware, one physical device, so you have to acquire the sampling facility, and the kernel extension that we have manages access to it. Set up the counter events, clear the counters, start the counters, do whatever it is you're interested in, stop the counters, and then process the results.
OK, so we also provide some lower-level tools. As I mentioned, if you've ever done AltiVec programming, or any kind of really intense tuning, you'd like to know what's happening on the processor core: why is it slower than you expect, what's happening? So, amber is a command-line tool to record an instruction trace to disk, for all the threads in a given process. Then you can run that trace file through acid; this is a trace analyzer that gives you some interesting trace statistics. You can plot the memory footprint that the trace walks through, and it can point out problematic instruction sequences. And then you can also run this trace through simg4, for the 7400 processor, or eventually, when it's available soon, simg5, the PowerPC 970 simulator, to know exactly what's happening. OK, so at this time we'll turn it back over to Mark for the session wrap-up. Thank you.
To give you a little bit of a road map of other sessions that are going to be valuable to you: there's the Tuning Software with Performance Tools session, 305, this afternoon at five o'clock in Presidio, and then the Mac OS X High Performance Libraries session. Those are really libraries, not tools, but they're avenues for you to be able to eke performance out of the operating system.

Again, to go back to my introductory statements: optimization should not be an afterthought. Optimization should begin when you first start writing your code; it should be part of the process of how you want your code to be written, so that you're not going back after the application is written and thinking, well, maybe I should thread my application, I should utilize threading. An application like Noble Ape, we can add threading to because it's not a lot of code, but if you're talking about a much larger application, a word processor or a graphics editing application, then you're looking at possibly a whole redesign, and then it becomes more frustrating. So again, optimization should be something that happens both at the beginning of your project as well as at the end: once you finish your project, how do you get more performance out of it?

The other thing I wanted to mention is that the G5 PowerPC processor is a very unique architecture, much different from the G4, as Sanjay pointed out in his presentation, and for that reason we want to make sure that you have as many resources available to you to understand what those differences are and how to take advantage of them. All week we have been running a G5 optimization lab on the first floor in the California room; I urge you to visit and talk to the many engineers who have made themselves available there, spending countless hours; Monday and Tuesday we were there until midnight. You'll be able to talk to Sanjay, Nathan, and several other engineers, both from Apple and IBM, throughout the week. As well, we'll have follow-on kitchens available to you as developers in the developer program at Cupertino, following the developers conference.