WWDC2003 Session 507

Transcript

Kind: captions
Language: en
Good morning. I'm Mark, the desktop hardware evangelist here in Worldwide Developer Relations. This is session 507, Mac OS X High Performance Libraries. The Vector and Numerics group—if you're not familiar with them already through the different sessions this year, as well as previous years—has been doing a lot of work optimizing performance in the libraries that we ship in our operating system, as well as working with several of our developers and our own application groups. If you've used iTunes, a lot of the performance that you see us achieve in that application is a result of this group. Today we have some more new information. Yesterday the group also introduced the new vImage library, for vector image processing. Today I'd like to introduce Steve Peters, who will go over these new libraries, and in particular help you understand how the 970, the G5 processor, takes advantage of them. Thank you.

Thank you, Mark.
Welcome, and good morning. It's a wonderful time to be doing mathematics on the Macintosh, and I'd like to tell you a little bit about why I think so. One more time—there we go. So, by way of introduction, I wanted to lay out what I think our charter is in the Numerics and Vectorization group. First of all, we want your apps to achieve optimum floating-point performance on our platform, and we're in a great position to help you with that today. We also want to make sure that your apps deliver robust, high-quality numerics. It's been a tradition at Apple since the early days, back on the 680x0, to deliver really the best in numerical algorithms and numerical science; we continue to keep the bar high and aim to set it even higher. We want your apps to readily port to industry-standard APIs in our Mac OS X frameworks. This is a leverage point: we take advantage of open source software in building these frameworks, and we pay a great deal of attention to optimization in them. And finally, we want your apps to enjoy easy access to the Apple value-added features in our platform, such as AltiVec. Where we can transparently give you access to those features, we do, and you'll see a little later how that works out.
So, what you'll learn today: we'd like to inform you about the floating-point performance work we're delivering for the G5 in Panther; we'll show you how to leverage that work in your development; we'll walk you through a regimen that I've been using to squeeze out the last bit of floating-point performance in the libraries, which you might find helpful in your applications; and we'll recommend some learning resources. This is where, in many presentations, you'll see the obligatory layer-cake diagram with sort of one piece cut out—it's the technology framework assessment. Being a bit of a contrarian, I decided, well, you know, we actually touch that layer cake in more than one place: math performance depends on the silicon, the floating-point cores; it depends on the frameworks that we deliver to you; and it relies ever more on the performance tools that help you optimize your apps—and that's in the CHUD toolkit.
So let's review what's been done over the last year, since we last met at WWDC. What did we deliver in Jaguar? Libm, our standard C math library. Libm is centered on the following design principles. First, we should be standards-conforming: you should see the C99 floating-point function entry points on our platform just the way you do on everybody else's, and that's certainly true for doubles in Jaguar. Our algorithms are numerically robust in Libm—it's really the central point that our core approximations should deliver correctly rounded results, and the functions that depend on those should be very close, with errors in the last place. We bring you best-of-breed algorithms: these are modern algorithms—nothing from Numerical Recipes here, folks—and matched to the hardware. How do you code to this stuff? If you're writing in C, it couldn't be simpler: include math.h. If you have special needs—and you know who you are—to touch the floating-point environment, you can include fenv.h. Otherwise, compile and go. Libm is part of the libSystem umbrella that's automatically linked on the cc line, so you need say no more.
So, from time to time we get questions and phone calls: gee, is it possible for me to go even faster than what you've delivered in Libm? The answer is yes, but you have to make some compromises, and it's been our position that we're not making those compromises in our deliverable. But we can clue you in about what you'd need to do. Libm emphasizes numerical robustness; if you're willing to compromise on that, you might be able to go a little faster. Libm uses the standard C99 APIs, which turn out to data-starve our FPUs: typically a call like sin presents the math library with a single argument, and your half of the contract is a single return value. That ends up just leaving bubbles all the way through the floating-point pipes. If you can provide more data through an API, it's quite possible to produce results faster and more furiously out the other end. Libm is obliged to handle all the special cases—NaNs, infinities, plus and minus zeros—as needed, and Libm is obliged to preserve, protect, and defend the state of the floating-point flags. So there's a fair amount of detail work that goes on in the library. If you're willing to compromise on that stuff, you can go faster.
So here's a homely little example. A very respected developer came to us and said: hey, your hypotenuse function is way slow; I can write a one-liner that goes much faster than what you guys do. So here you see the Pythagorean algorithm—a little kiddie Pythagorean algorithm: return the square root of x squared plus y squared. And sure enough, you run this function on a triangle whose one leg is length three and whose other leg is length four, and the hypotenuse is five. Well, let's scale this thing up—I don't know, is this the size of the universe as we understand it these days? In any case, the result of applying this function to these arguments is not five times ten to the 160th—it's infinity. There's some intermediate overflow here, right? This developer made some compromises: probably, over the range of arguments he was interested in, using this function worked just fine. But Libm takes on all comers, and we're obliged by standards to do something different. And here's—it's just like my clicker at home after the kids have gotten to it; there we go. So this is the purpose-built version. Alright, there's a lot more going on here than that simple one-liner. Those with sharp eyes, unlike mine, may notice—let's see—some comparisons up front, a division halfway down, finally the square root, a rescaling, and some work on the environment. Those are the kinds of details that live in Libm, and the lengths that Libm goes to to meet standards and robustness requirements. If you can compromise on those, you might be able to go faster. So that's the end of a long aside.
Also delivered in Jaguar: vecLib. We've viewed vecLib as a one-stop shopping place for math performance. We deliver digital signal processing—one- and two-dimensional, the real and complex FFTs. The BLAS, levels one, two, and three; these are based on the open source software ATLAS, the Automatically Tuned Linear Algebra Software. It's actually a code-generation system: back in Cupertino, before every release, I grind through ATLAS and it generates code for linear algebra matched to the processor. We get extraordinary performance from this, and for certain entry points it's actually SMP-aware, so you can go multiprocessor transparently—if the processor's there, it goes ahead and uses it. It's a wonderful, wonderful thing. We also delivered LAPACK, for solving linear systems and eigenvalue problems—again, open source software. We delivered tuned 4x4, 8x8, 16x16, and 32x32 matrix multiplies—matrix-matrix, matrix-vector, vector-matrix. These are for folks who know they're dealing with very specific sizes and want to go really fast; they're basically completely unrolled loops, and on the single-precision side they hit AltiVec very, very hard. And that's the next point: wherever AltiVec is appropriate and available, vecLib tries to go to AltiVec. How do you code to vecLib using C? Include vecLib/vecLib.h—that's the framework header. You might want to take a peek in there and see all the other header files that get included—vBLAS.h, clapack.h, vBigNum.h, and so on—and from there you can fan out and find interfaces you'd like to code to. Then just add -framework vecLib to the cc line. Well, we announced this week that there's an umbrella framework called Accelerate that will collect all of our high-performance math, so in Panther you could just as well—and it's probably the right migration path—link against -framework Accelerate. You'll get the same stuff.
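A minimal sketch of what those link lines look like (the source file name here is just a placeholder):

```shell
# Jaguar: link the vecLib framework directly
cc myapp.c -o myapp -framework vecLib

# Panther and later: the Accelerate umbrella pulls in the same libraries
cc myapp.c -o myapp -framework Accelerate
```

Nothing else changes in the source; the same includes and entry points resolve through either framework.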
So again, that same question comes up every once in a while: can I go faster than the subroutine I'm calling in vecLib? Well, if you've chosen the appropriate entry point, we think the answer is no, probably not. We squeeze these things pretty hard, and we'll continue to do so. So let's look at Panther. What's new in Panther for Libm? Well, clearly, we've tuned for the 970. We've opened up all the core algorithms and recast them to exploit the two FPUs and the two LSUs, paid very careful attention to the way the instructions are issued to the machine so that we achieve maximum parallelism, and we watch out for load/store issues—you'll see one of those come up a little later. And at long last we have hardware square root. This should be the end of the "your square root is way too slow" complaints. It's transparent: if you're on a G5 and you call square root as a subroutine, you'll get the hardware square root; on other platforms we'll dive into the polynomial approximations. There's also an opportunity to inline square root—there are compiler flags now—so if you're compiling and you know you're on a G5, you can inline the square root, to great advantage. It turns out, having done all this work, we went back and said: well, now, how is this going to go on the older models, the G4s? Are we going to have to maintain two copies of the library? Turns out not—at least for this code. Doing these kinds of things to tune for the 970 also helped us a little bit on the G4: a few percent, not a whole lot, but we're faster these days on G4 as well. Your mileage may vary; it's worth experimenting before you contemplate having multiple modules, multiple plugins. Go look and see if the things you do for G5 don't also help you on G4.
So, I talked a little bit in the last slide about careful construction of issue groups—dispatch groups—on the 970. I wanted to put up one slide and simply remark that the issue is this: we've got a four-issue machine—five, if we've got a branch. We need to be very careful that if we're going to try to get, let's say, floating-point multiply-adds issued simultaneously on the two FPUs, each instruction must be fed into the issue queue appropriate to its unit, and that means placing the instruction appropriately as the machine forms dispatch groups. CHUD is a champ at showing us where we need to line up our instructions a little bit differently; the timers and the SimG4 and SimG5 simulators in the CHUD tools are also essential for doing that. That's probably enough said about that—there's a trace later on where we can return to it. So how did we do?
When we opened up our core algorithms, did this work, and moved on to the G5: here are your favorite Libm functions, and here are the cycles—the number of machine cycles that invocations of these functions, for typical arguments, took on the 7455, the high-end G4 model. Here's what we've done on the 970, and here's what our competition publishes for the P4. In general, I think you'll see we're dominating the competition; even through clock scaling, we're going to go faster than the fastest P4 you can buy. The only question is around square root, where, you know, it's neck and neck. If you're able to inline, you can get the first square root out in about 40 cycles and subsequent ones in 35 cycles, and you can do that on two FPUs. If you need to call the subroutine library, we pick up a little dynamic-linking overhead and can only go, essentially, on one FPU—it's one of those data-starved situations—and we only go through one FPU, in 52 cycles. So, you know, let's call that a draw, maybe—unless you can inline two at a time, and then it's not so much of a draw. Alright.
So, turning to the other big component of our mathematics efforts, vecLib: what's new in Panther? Double-precision math performance, right? We've all been waiting for really good double-precision engines; we've now got two, and vecLib takes great advantage of them. Our DSP routines have been tuned in double precision for the 970. The BLAS have been tuned, and have really, I think, just phenomenal performance on the 970. The scheme we use, for example, for matrix multiply derives from ATLAS: given a very large matrix—say a thousand by a thousand elements—it gets broken down into much smaller pieces, 64 by 64 in our case. That data gets moved into cache for each of the operands—A times B resulting in C—and then we cut loose with what's called the matmul kernel, that guy, which takes the data lying in cache, runs it through the floating-point units, develops the output argument, and repeats. In that matmul kernel we're able to sustain about 3.4 floating-point ops per clock cycle—that's eighty-four percent of the peak available on the machine—and it measures out at just about 6.7 gigaflops at two gigahertz. When we look at the larger problem—reassembling the 64-by-64 block results, the overhead of pulling all that stuff into cache—we get DGEMM performance on a thousand-by-thousand matrix that runs from memory, from RAM (a thousand-by-a-thousand matrix has to sit in RAM, after all), of 2.4 flops per clock. That's about sixty percent of peak, and it measures out at about 4.8 gigaflops on a single 2.0 gigahertz processor. We've also added double-precision 4x4 and 8x8 special-size matrix routines, all tuned for the 970; they achieve similar, if not in some cases slightly better, performance than the DGEMM numbers. We've completely unrolled all those loops; the matrices are small enough that they sit in the cache all the time, and it just goes like crazy. And finally, we've goosed LAPACK a bit: if you're a fan of the singular value decomposition, you'll be happy to know we go really fast on SVDs now. And expect in Panther that LAPACK will be thread-safe—believe me, our MATLAB guys are happy about that. I love it. Okay.
So, the results are in: FFTs are really fast on this box. These are some slides giving the performance of the one- and two-dimensional real and complex FFTs going to the vector unit. Smaller numbers here—in microseconds—are faster, so we want to be underneath the competition, and one after another... I have to get my six-year-old out here... there we go, complex 2D. Alright, it's nice to be on the G5. Alright, how about our linear algebra performance? The industry standard is the LINPACK 1000 benchmark, which is the solution of a thousand-by-thousand linear system. We use the ATLAS-based techniques—those are blocked in the way I told you before, moving 64x64 chunks in and out of the cache and going like mad with the kernel. Let's just focus on the first number; it's the one that probably most folks use as the pointer and are familiar with. Our DLP—double-precision LINPACK 1000, the supercomputer benchmark—2.64 gigaflops. We think that's PC-industry leading. There we go.
So here's a pitch, following on from yesterday's vImage session: there's a new umbrella framework in Panther called Accelerate—one-stop shopping for all your math and image processing needs. -framework Accelerate gets you all our digital signal processing stuff, the linear algebra I've described, a vector version of selected entry points in Libm (that's single precision), some large-number arithmetic support, and—new—the vImage image processing library. Moving forward, code to -framework Accelerate; it's the right thing to do, and additional math stuff is going to end up in there. So it's the way of the future. Okay—so, actually, stepping back from it: it's how you leverage, right? Use Accelerate and leverage Apple's work. Here's the point of leverage for you folks: by selecting the right API, you get the advantage of, you know, our efforts at getting these things fast.
So, that's the segue into: well, what if you've got code that you'd like to go faster that really doesn't fit into any of the work we've done? How do you go about doing that? First, profile. You really need to know where your code is spending its time, and where to spend your effort—it's actually a smart business decision. Profiling on the 970 is very interesting: this is a machine that keeps hundreds of instructions in flight and has long, deep pipelines. Very often I've looked at a trace and had the aha experience—this wasn't where I expected the time to be spent; let's arrange to tune for that. It's remarkably often surprising. For a rough cut you can use the command-line tool called sample, but for the real action you want to use Shark. That's part of the CHUD toolkit; it will zero in on your hot spots and let you make really rapid progress. It's been key in our development. And so I'd like to introduce Eric Miller, who will come up and talk a little bit about the CHUD tools—if I can work this...
So anyway, I'm Eric Miller, with the Architecture and Performance group. A quick thing about the CHUD tools—there we go, microphone, can we hear me now?—a quick thing about the CHUD tools: there was a session yesterday, I believe the number was 506, is that correct? So that'll be on the DVD: a very lengthy demonstration with Shark that I think you should take advantage of, and use Shark as much as you can. But there are a couple of other CHUD tools—it's a suite; there are probably nine tools to go through. We leverage the low-level performance-monitor counters inside the hardware and in the operating system—we put the ones in the operating system in there just for this purpose—so you can find the problems and improve your code with Shark and MONster and all. The best thing about the CHUD tools is they're free. We have an FTP site, and I urge you: when you get your developer tools CD, there's a CHUD package—install that package and immediately update, because we drop an update every several days while we're in our beta period, and once we have a gold master it'll probably be on a semi-weekly basis that there'll be fixes and improvements. So: we're introducing the CHUD tools 3.0. Shark was formerly called Shikari, in versions 2.0 and 2.5, and, as Steve mentioned, it does instruction-level profiling; it can also help you tune for that instruction dispatch grouping that the 970 uses. MONster is a spreadsheet for performance events: things like cache misses, instructions completed, instructions dispatched—what kind of utilization you're getting out of the various floating-point, vector, and integer units in the processor. Saturn can actually take advantage of the PMCs to do call-graph visualization, with additional information about what types of performance events are being utilized in a particular function in your code. We have several tracing tools—and Steve will probably talk to some of this stuff—amber is the most important, because it actually collects a trace from your application of every single instruction that's executed on the processor. You take this trace and use it as input to acid, SimG4, and, once it comes out, SimG5. You can also take advantage of the CHUD framework, which allows you to instrument your applications to directly monitor performance events, or control the profiling tools from inside your code, so you can bracket important pieces of code.
So, I've mentioned performance counters many times already; just a couple of slides on what they are. They're a set of special-purpose registers that exist in the processor, in the memory controller, and often in the OS, where we have virtual performance counters. These can be accessed by software, obviously, so we've created the CHUD tools to do that for you automatically. There are actually user versions of some of the performance-monitor counters that you can access from user code, but in general, setting them up and driving them requires a supervisor application, like the kernel—so we put stuff in there, as I mentioned earlier. Performance events like page faults are one of the operating-system metrics you can measure; there's quite a number of virtual-memory counters available among the operating system's performance counters that you might want to take advantage of, for a very high-level look at what's going on with your disk access, virtual memory, and physical memory access in the operating system.
So, Steve has alluded to Shark—that's the icon in the upper right there. The nice thing about Shark is you can profile over time, and with the CHUD tools the sampling interval can be as small as 50 microseconds per sample or as large as a second. Or you can use events: you can profile every so many cycles, or every so many instructions completed, you can take a sample. You capture everything with Shark, from the kernel up through your application, so you can know exactly where time is being spent, and in what frameworks time is being spent. And as I mentioned, the overhead is very low: about 40 to 50 microseconds per sample is what you can expect us to expend taking data from the performance counters. The other nice thing about Shark is the automated analysis: not only will it show you where you're spending time, it'll try to explain to you why time is being spent there, and give you opportunities to try to alleviate that bottleneck. Steve also mentioned that we use static analysis to construct the theoretical dispatch groups on the 970—and this is all demonstrated quite nicely in the 506 session, which unfortunately was yesterday. You can save and review all the sessions—which of course should always have been there, but it's new in 3.0; we didn't have save and review for our tools before, except as text output. Also, neatly, there's a command-line version, so you can script it: if you have a large suite of test programs, or your application is a command-line tool, you can actually script or launch your tool with it. You can say shark, and then the rest of the statement on the command line would be your tool's command-line name and arguments; Shark will launch your app, instrument before and after your application runs, and give you the normal output, which you can review in the graphical user interface. So, there's a couple of pictures of it. On the left you can see the heavy-tree trace—in this particular case square root is on top—and you can see some of the types of output you can get. There's some source code in the right picture, and some of the Shark commentary, which comes in the form of these little exclamation points in the column toward the right of the window.
I'll just briefly touch on MONster again: same thing—timed intervals, event counts. All the CHUD tools have a hotkey, even the command-line tools: if you're on the console you can hit option-escape to launch Shark's profiler, and option-escape to toggle it off again; MONster uses control-escape, so they can both be on the system at the same time. This is kind of neat: even with a command-line tool, it'll just sit there waiting for you to start profiling, and you don't have to go into the shell and type something—you can do it from anywhere, wherever your app is, and go ahead and start it. So, the big thing about MONster is shortcuts. You can take these performance-monitor event counts and combine them together with a simple four-function-calculator notation. So you can take cache misses and cycles and compute cache misses per cycle, or cycles per cache miss—whichever way you want to do that ratio. You can also use the memory-controller counters and compute bandwidth, by collecting all the transactions and multiplying by a value that represents bytes per transaction, for reads and writes and that sort of thing. There are several shortcuts that are predefined for each CPU, but you can also make your own ratios and proportions and percentages to print out; they come out in the tabular columns of MONster, and then you can chart that data. Same thing: save and review of sessions, and a command-line version. There's a picture: some shortcuts have been highlighted—those purple columns on the left—and then you chart them by pressing the draw-a-chart button. These are percentages of load/store instructions with regard to all the instructions that were collected in the trace.
Saturn records your function-call history and instruments all your code by using the GCC instrumentation flags—very similar to CodeWarrior's instrumentation flags—which put a prologue and epilogue in every function. Once you have those prologues and epilogues, the data can be collected: you can see a typical call tree in the top half of the screen, and then you can get a picture of the call tree, where call depth is vertical and the time the call was executing is horizontal. So if you have long, spiky calls, you want to try to alleviate those issues. You can collect call counts—that's great—but you can also collect PMC event counts and see those things, and what kind of duration they had.
And we have these instruction-tracing tools I mentioned. acid is kind of nice as a quick pass—it's sort of a light version of SimG4; you just collect trace statistics. So use amber to collect an instruction trace that's accurate, then you run acid on it, and you can get these pieces of data out of acid very readily, in maybe one or two screens in the terminal. Whereas SimG4 and SimG5 are cycle-accurate simulators for the respective processors, and those take some learning—which Steve will get into—to understand their output. Using the CHUD framework, you can instrument your source code, start and stop the graphical-user-interface tools, and also directly read and write the performance counters in your code. There's also the HTML reference guide, which is generated every time we do a build of the framework; the HTML is updated with any new things we put in the prologues.
So, a quick example. You always call chudInitialize first, along with whatever setup you require. You acquire remote access, and you tell Shark that you want to use remote access. Then you start Shark and give it a label; then your important function executes; then you stop it and release the remote access, so another thread or another client can use it. Shark will then automatically profile your important function—and your important function only—and you'll get the results in the GUI. There's a slightly longer example where you actually set the counters explicitly, clear them, and start them; then your important work happens; then you stop the counters and read back the results, which are arrays of double-precision floating-point values. If there are six counters, there will be six entries for each CPU in the output arrays; you take those out of the arrays and then do whatever you want to present them to yourself—maybe log them, or Shark them. This thing is a little finicky.
So, how do you get CHUD? The easiest way: it's on the developer tools CD, but updates will come directly from the web. There's an updater that runs automatically the first time you install CHUD, and thereafter there are preferences where you can have it check the status of the CHUD package for new updates hourly, daily, weekly, or monthly. We do have internal guys who check hourly, and they get upset when a couple of hours go by and there are no new updates. The best way to get in contact with our team is to use the CHUD tools feedback address at apple.com. And with that, I will turn it back over to Steve.
Excellent—thank you, Eric. I think I'm going to go with the keyboard; yes, much better. To me again: let me just remark that Eric has covered these astonishing capabilities of CHUD, but don't think you have to do anything really special. The most you need to do, really, is learn to use the hotkey. If you haven't been down to the performance lab with your app, please do come down. If you find me there, I'm likely to be a pest and hover over your shoulder, wait till your app is running and grinding the CPU, and say: hey, can I start CHUD? And it's basically start the app, hit the hotkey, wait a few seconds, and then look at the sample. People have been dropping jaws at what they see and how simple it is. I'll take a question. [Question from the audience, inaudible.] Beautiful—good question. You don't even need that, and we can show you why. People are quite astonished to see how quickly it happens, and at the sort of almost immediate insight they get: gee, there's my high runner right at the top of the window; now let's click into that and see, you know, which instructions are slowing me down here. It's really a great thing.
behind us and I want to talk a little
bit about this regimen I go through to
really torque down and get some
performance out of floating-point
incentive code first concern often is
memory the machine has an enormous
amount of bandwidth if you can use it
effectively you need to load data early
so that it's available early to the
out-of-order execution course if the
data ain't there the the course can't
can't do the instruction so get the data
there early so in examples you'll see
that I a load polynomial coefficients
literal constants very early on in
subroutines even speculatively even if
they may not be used in a particular
branch of the code I'll often load them
early just to have them available in the
case we drop through and go so load
early load often harness the to LSU's to
drive those 2's views if you can load
the data sequentially there are hardware
initiated prefetch streams that are
really effective at getting the data
into the machine DFT and rec DST the
moto entry point are bad eggs they're
big help on g4 their execution
synchronizing on g5 it probably ought to
be avoided you're better off if you want
to prefetch data using the DCB TL ECB VL
class instructions you need to be aware
that the cache line size on the 970 is
128 128 bytes the loops that are
enclosed the DC ptl's need to be
cognizant of that so did I say that the
970 has to fdu cooler and I think so use
them at each cycle on each FPU we can do
F met instruction that counts as two
floating point ops so you can net for
float it for flops per cycle on our CPU
and that is achievable there's none of
this four out of five six out of seven
can be scheduled you can get all these
I've seen it the data has to be in
register but you can get that stuff
There are 32 floating-point registers and 48 additional renames; the machine will execute things out of order and take advantage of the renames very effectively. Typical latency for a floating-point instruction, that is, the time from when you start an instruction until you can use its result in a subsequent operation, is six cycles. Throughput is one: these things are fully pipelined, you can throw them into the pipe one after the other. The key exceptions are square root and division, which should come as no surprise to anyone. Since there are two FPUs, a simple strategy for making sure that you're getting both FPUs fully utilized is to pretend you've got a 12-cycle pipe: that is, start a result and don't plan to use it until 12 cycles later, with 12 intervening operations. That often means you need to think a little bit more about parallelizing or software-pipelining your algorithms to get that kind of distance between uses.
So here's a little piece on choice of algorithm. Sometimes you need to sort of just pop up a couple of levels and think: is there some way to recast the algorithm I have to be more effective on floating point? The example is: it takes two hands to matrix multiply. You're trying to form the matrix C as the product of A and B. In high school, or just junior high school nowadays, you learn how to take the product of two n-by-n matrices by using two hands to form the output element Cij: you take the i-th row in one hand and the j-th column in the other hand, and you form a dot product. 2n fetches and n multiplies later, you have a single element of the output, Cij. That turns out to be not a really efficient use of the floating-point units or of the register set, unless you rethink the algorithm along something like these lines, and this is what ATLAS actually does. Instead of two hands, one finger, use four fingers on each hand to form a four-by-four output block in matrix C: you grab four elements down, say, the first four rows of A, four elements across the first four columns of B, and form all 16 possible pairwise products; continue down the n rows and n columns. And, let me just look at this so I remember it exactly right: you're accumulating 16 simultaneous inner products with 8n fetches and 16n operations. That's actually a factor-of-four reduction in memory bandwidth, you've used nearly all the registers, and it's possible to keep the floating-point units bubble-free. Basically, that's the trick that ATLAS uses in our matmul kernel to drive the machine, in the kernel, at eighty-four percent of peak. So the take-home message is: think parallel if you can, but small parallel, you know, four-by-fours; it's manageable.
So here's a little case study I thought we'd go through that gets a little bit more into the tracing tool. The code is the arm of the libm sine function for arguments that are smallish, between pi over 4 and minus pi over 4. Just to set some landmarks for orientation: you'll see that there's an absolute value taken at the top, a comparison here to decide if we're in the right arm, some manipulation of the floating-point environment, some arithmetic, and then what looks like a polynomial approximation and formation of the final result, some more adjustments of the floating-point environment, and out we go.
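For flavor, here's a toy version of what such a small-argument arm might look like. This is emphatically not Apple's libm code, just a degree-7 Taylor sketch (function name mine), good to a few parts in 10^7 on [-pi/4, pi/4]:

```c
/* Toy small-argument sine: |x| <= pi/4, degree-7 Taylor polynomial.
 * Worst-case error is about |x|^9/9! ~ 3e-7 at x = pi/4. */
double sin_small(double x) {
    const double c3 = -1.0 / 6.0,     /* coefficients loaded   */
                 c5 =  1.0 / 120.0,   /* "early and often" in  */
                 c7 = -1.0 / 5040.0;  /* the style shown later */
    double x2 = x * x;
    /* Horner form: x + x^3*(c3 + x^2*(c5 + x^2*c7)) */
    return x + x * x2 * (c3 + x2 * (c5 + x2 * c7));
}
```

A production routine uses minimax rather than Taylor coefficients and takes care with rounding and the floating-point environment, which is precisely what the traces below are probing.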
So on the G4 series we had a tool called SimG4, and if we look at this segment of code run on a smallish argument to sine, we see this picture. This is actually a very good picture, SimG4-wise. At the top, the scaling isn't quite right; again, this is really tough to see. Here we go: for our landmark, there is that fabs instruction, and we read across to see this fabs instruction issued roughly at cycle 1200: it spent two cycles getting instruction-fetched, one in dispatch, a couple in execution, and then retired. Next line, store word with update, additional time. The object here is to fall off the cliff, retiring instructions as fast as you can, so this looks pretty good; this code is really quite good on G4. Taking just that code, unchanged, bringing it to the G5, the 970, and using the SimG5 trace: well, first of all, it's a more complicated machine, and the letters have changed. We spend some time in fetch, five cycles in dispatch, some time in the mapper; we finally hit execution unit 5 for that fabs, which is now on the other side of the screen; we finish that operation some six cycles later, and we hang around and wait for the rest of that completion group to all finish up, and then we complete. This doesn't look too bad either; again, the object is to fall off the cliff. Well, until we get down to the bottom here: this is trouble. There's a key for these letters, and if we were to look up the key we'd see that we had, essentially, a load-store reject: somebody is trying to load from an address that was recently stored. Well, how recently? Turns out that store occurred way up top. This is a machine that's putting hundreds of instructions in flight; these dependencies can stretch out over really large lengths of code, so beware, be forewarned, and watch out for these kinds of things. By adjusting that manipulation of the floating-point environment, we can end up doing quite a bit better. You'll notice, first of all, this is shorter, the fall off the cliff is much more precipitous, and we've eliminated that nastiness down in here. So that's the kind of information you can gain from the SimG4 and SimG5 class of tools. I think of it as an adjunct: CHUD with Shark is the first place to look, but when you want to eke out the last little bit of performance, this is the tool that I turn to. So here's the final version of that code. Just for landmarks: there's that fabs again, we still have the compares, but it turns out we've manipulated the environment by bringing it to register rather than storing. I've adopted this style where I split across a single line in the C code in places where I think I can gain parallelism. So here's loading, early and often, of the polynomial coefficients; here are operations that I believe go in parallel on the two FPUs; and then out.
OK, so, a quick summary of the regimen I like to use: start with CHUD, look at Shark, go back to your code. Pay attention to load/store issues. Think in terms of the two FPUs; even just organizing the layout of your code can help you see when you can take advantage of things going in parallel. Use as many registers as possible. Let the hardware-initiated prefetch streams help you get data into the machine early and often. And, when directed by the SimG4 and SimG5 kind of tools, look at dispatch group formation, just to make sure that you're not crowding instructions that you think ought to be going to separate FPUs one on top of the other in a single issue queue; that was the slide that came much earlier in the talk.
So, time to wrap up. You can review these sessions on the DVD, and contact myself or Ollie with questions; Ollie does a little skunkworks operation every once in a while, and he can tell you about that. For more information, there are two really fine technotes now on the web at the developer site that cover in detail many of the things I spoke about today, including sort of first-time usage of the CHUD tools, plus compiler options that can be a big help too. And then Technote 2087 is a quick comparison to remind you of the differences between G4 and G5. There's some other interesting documentation at the... somebody did HTML here, we'll have to resolve this. And finally, I'm a big fan of the fellow who writes for Ars Technica describing the PowerPC 970: a really lovely introduction to the machine, and a good place to start a quiet evening with your laptop.