WWDC2004 Session 407
Transcript
Kind: captions
Language: en
Good afternoon. This is Session 407, the Accelerate framework. The Accelerate framework was introduced at last year's Developers Conference. This year we'll rehash a little of what was introduced last year, but also give you part two. If you know anything about Ali's team, you know these are the guys who, in your calculus class, were the ones finishing the word problems or the homework before the class was even over. They love math, so I hope you enjoy this session. With that, I'd like to introduce Dr. Ali Sazegari.

Hi. What I'd like to talk to you about today is the Accelerate framework, what we have done, and what we plan on doing for Tiger. The talk is in three parts. I'm going to give you a general overview and some snippets of results that we have. After that I'm going to pass it on to my colleague Ian Ollmann, who is going to talk about the image processing library, which was introduced last year in the Panther OS. After that I'll pass it to my other colleague Steve, who is going to talk more about the numerics and the linear algebra results that we have. So let's get started.
As you know, we have had this particular configuration: the Accelerate framework, which is a collection of all the computational underpinnings of Mac OS X. In Panther, we've had the vecLib section of it for a while; last year we introduced the vImage section. vecLib has the signal processing, the linear algebra, the matrix computations, the BLAS, the large-number computations, and the math libraries that take hardware vectors, 128-bit vectors. We added image processing, and I'm happy to tell you that a lot of people are using our image processing, inside Apple and outside, and we're going to talk a little more about that.

One of the additions to the new operating system is vForce. We've had a lot of calls from people who want to take an array of elements, pass it to the elementary functions, and get the elementary-function results back, not just pass things in as hardware vectors or one scalar at a time. vForce is our new library for that, and Steve will talk about it in depth later on.
What is delivered in Mac OS X Tiger? Basically, the Accelerate framework is one-stop shopping for computational performance. Digital signal processing: we have expanded that, and in Tiger there will be about 340 new functions in the vDSP sub-framework. Digital image processing: we have expanded that also, with added performance in some of the core routines such as convolution. The BLAS, levels one, two, and three, if you're familiar with them: these are the Basic Linear Algebra Subroutines, the building blocks of the computations that people do for LAPACK. The entire LAPACK, single and double, real and complex, for all of the routines; basically this is exactly the API that people who use LAPACK are used to. And vForce, the array elementary functions, which we'll talk a little more about today.
vForce is new in Tiger. There is also vMathLib, the counterpart of the regular libm: where libm runs on the scalar units, this one runs on the vector engine.

I'm going to touch on some of the performance improvements we have right now in Tiger, on the CDs that you've received. First I'd like to talk about vForce performance. These are the elementary functions, in single and double precision. They're highly accurate, and they operate on arrays instead of single elements or 128-bit hardware vectors. Monotonicity is observed over the entire range of definition. That's pretty important, because there are competitors with functionality such as this, but they have cut corners, and developers have to worry about pitfalls: where to call, what not to call, which elements to call. Here you're free to call anything: basically, if it's in the floating-point domain, it will work; it will not trip you up and will not give you wrong results. I have a small table here showing the benefits of vForce.
I'm quite proud of this particular piece of work that our group has done, and Steve will talk about it a little further. Square root is 3.1 times, over three times, faster than the current one; exponential is over six times faster; and sine is 11 times faster. Square root was already pretty fast on G5s, but we have it even faster here. The reason these things are faster is that we are able to fill in the bubbles in the computational structure of the algorithms: regular elementary functions just don't have enough data to work on, so you end up with a lot of empty cycles going by. This approach lets the pipelines fill up completely and gives stellar performance.

The next thing I'd like to talk to you about is the LAPACK performance, LINPACK. A lot of people know about these results; I just have a little bit of it here. DLP 1000 is the double-precision LINPACK, a 1,000 by 1,000 matrix: we're at about 5 gigaflops for double precision, and single precision is over 7.5 gigaflops, and this is on a 2.5 GHz PowerPC.
Next, BLAS performance. The quintessential BLAS performance benchmark is really DGEMM, the double-precision general matrix-matrix multiply. It's an enhanced matrix multiply: a scalar times a matrix times a matrix, plus a scalar times a matrix. A lot of people like to look at it to judge the prowess of an implementation. What I have here compares us with the Opteron, because I get asked how we compare with the competition. Higher numbers are better on this one: double precision, size 5500. If you multiply matrices of size 5500, you get 12.18 gigaflops on a PowerPC and just over 7 gigaflops on the Opteron. Now, the Opteron we had on hand was a 2.0 GHz machine, and we were unable to get hold of a live 2.4 GHz machine to run this, so we simply added 20% to its results; generally, when frequency goes up that much, performance doesn't go up that much, but that's what we did. So it's 12.8 versus 8.55. And just for fun, what would the SGEMM performance be? The SGEMM performance on our machine is 23 gigaflops. As some of you know, I've been in computation for a while now, and 20 gigaflops used to require many millions of dollars to achieve; now 23 gigaflops is just a pittance: you can buy yourself a PowerPC at 2.5 GHz and get it.

vDSP performance: we continue to have a stellar collection of FFTs for our users, single and double precision, real and complex, 1D and 2D, in place and out of place, radix 2, 3, and 5. We have them hand-tuned for the vector engine, and we also have them hand-tuned for the dual scalar units. I'm comparing with 3.2 GHz Xeons this time around. We're not looking at gigaflops; we're looking at timing, real microseconds, because signal processing people don't care very much about floating-point throughput. They're real-time folks; they like to know exactly what the timing is. Single precision, 1024-point complex, which is always etched into my mind, is 4.56 microseconds versus 6.13 on a 3.2 GHz Xeon. These are one-processor numbers, because a floating-point FFT doesn't do enough work to dole it out across processors. Single precision, 1024-point real, is 2.3 microseconds versus 4.27 microseconds.
So you would think this is fast enough; why would you want to make it any faster? Just one example, the quintessential example I always like to give, is iTunes. iTunes uses our FFT at a rate of 1.2 million times per hour for your music; it's that real FFT that gets used. The more we shave off of that, the faster your decoding and encoding will go, and the more we shave off of computational kernels like FFTs and IMDCTs, the better your battery life will be. So it's pretty darn important to make sure this always runs extremely fast.

The image processing library: I'm very, very proud of this particular set of work. We set out and worked on it for a year for Panther and delivered it, and it's used in a lot of applications we have in-house and outside. I just have a couple of little things here. We have planar and chunky (kind of a funny word), that is, interleaved RGB formats; native support for 8-bit and floating-point samples; it can be used in real time; and it's multi-threaded, so if you have large images you can do better. I have a small table here for performance, a gigabyte-scale image blur, comparing against IPP, the Intel Integrated Performance Primitives, which some of you might be familiar with. The 8-gigabyte image blur is 5.5 times faster; the 8-gigabyte image emboss is 2.2 times faster.
Also delivered in Mac OS X, let's not forget, are the underpinnings of regular computation: libm. We have standards-conforming APIs for IEEE 754 and C99, single and double. New in Tiger is our long-lost 128-bit long double friend, which is going to make an appearance again, and we have a really stellar, very accurate implementation of it. All of these are numerically robust and highly accurate; they worry about the environmental controls and never mess anything up, and we take a lot of care to make sure we conform to all of the existing standards, with best-of-breed algorithms. Coding to libm in C is straightforward: you just call the compiler; you don't even have to say -lm. Using the Accelerate framework in C is also straightforward: all you need to do is put in -framework Accelerate.

So what I've basically done in the last few minutes is give you a small sampling of what we have in image processing, signal processing, the BLAS, vForce, and LAPACK, and we're going to go into some of the details of this work as we go along. Now I'd like to pass this on to my colleague Ian Ollmann, who is going to talk more about the image processing.
All right, thank you. vImage was introduced last year at WWDC and shipped with Panther, and since then we've gotten a lot of feedback on it. We've taken your suggestions to heart, and so we've got more improvements for it.

Now, the image functionality remains much as introduced previously, with some new features added on. We still have native support for 8-bit and floating-point samples. These can be arranged either in a planar format, that is, one channel per array, or a chunky format, which merely interleaves several channels. If you're doing 8-bit work on images, we throw in saturated clipping, so functions that can overflow at the ends of the range don't give you the white-goes-to-black or black-goes-to-white problem.
We've put a lot of effort into rethinking the design to make sure you can use these things in real time: you know we don't arbitrarily call malloc, we give you the opportunity to provide us with the temporary buffers so you won't block on that, that kind of thing.
We're also reentrant, so you can call us in a multi-threaded environment, and of course it's high performance, accelerated for AltiVec. We provide a variety of image filters. We have convolutions; morphology functions that allow you to do edge detection or fill in holes, that kind of thing: min, max, dilate, erode. We do histogram operations with color balancing; alpha compositing, with some new functionality there; geometrical transforms, scales, rotates, shears, affine warps, so you can distort the image in lots of different ways. We also do some color space conversions and data type conversions.

Just to go over what you can do with convolution: depending on the kernel you provide the convolution filter, you can do all sorts of different operations. There are sharpens; I can do an emboss, which is essentially a first derivative over the image; and you can do averages and various other things. We've gone over and looked at the performance for Tiger, for G5 and for future processors, and we've done a lot of work to get the performance up. Right now, on your CD you'll find that the performance for the planar 8-bit case is substantially improved over what it was, and as the months go by we're going to push that forward on other things. So we've done a lot of work just to get the brute-force computation up, but we've also improved the algorithm a bit: it's a lot smarter about zeros in your convolution kernel. Currently most people pass in a kernel that is 90% zeros, and they just kind of expect the library not to actually do work for the zeros. But it turns out, if you go look at all the high-performance convolutions out there, they do actually do work for the zeros. We've changed that around so we don't, so in many cases now, in a comparative study between our library and the other ones, you're going to see a very substantial improvement in ours over the others.
Just to give you an idea, here's an example: a somewhat blurry image of Lisbon. You can apply a standard sharpening kernel and it looks a little bit sharper; I don't know if it shows up well on this display. You can see the kernel there that we used, which accentuates the pixel in question over its neighbors; that's how you get the effect. And here's the kind of performance you can expect on that kind of thing. Here we have a competitive graph against Xeon. It's a little hard to read: it's a 3.2 GHz Xeon that we're working against, and we're looking at the Intel performance primitives library. Intel has already gone through and multi-threaded all of this for you, so both of these are dual-processor results.
Intel is the blue bar along the bottom; we normalized its performance to 1. The speed of the G5, as you can see for the dense kernel, is the red line above it, so we're usually between 1 and 3 times faster than Intel for a dense kernel, and for a sparse kernel like emboss, which is mostly zeros, we're up to 8 times faster.

We also do morphology operations, different shape-changing operations, that kind of thing. Here's an example: we've got a nice picture, except, oh, it looks like there's a power line up in the top left corner. Wouldn't it be nice if we could remove that? Well, there are lots of ways, but we'll just use morphology for this example. We can apply a max filter: max will go around and look at all the pixels around the pixel in question and take the maximum value. The power line is a dark feature in the image, so as we apply the max filter it just goes away. But you'll notice that some of the white highlights got bigger, so we can apply a min filter and kind of subtract them back out, and then you have something that looks like your original image back, except now the power line is gone. So you can use these for interesting effects in addition to just shape changing, that kind of thing.
So here's performance on that. We've got a new algorithm for max which works substantially better. Here you can see the 3.2 GHz dual-processor Xeon results again, normalized to 1, the red line across the bottom, and as the kernel size gets larger you can see our performance relative to Xeon gets better and better; we're up to four times faster for really large filters.

We do alpha compositing. We can support either premultiplied images or non-premultiplied images, and we have functions to premultiply and unpremultiply data. We've now added a few new functions for Tiger: you can mix non-premultiplied data into a premultiplied layer, which allows you to do multiple stacks as you go along, and we added compositing with a scalar fade value, which allows you to sort of fade the whole image without going through and writing over the alpha channel. So those will be available.

We also have new type conversion features. This was actually surprising, at least to us: the number one requested feature. It seems everybody has their own data format that they like to use, so we've got a lot of conversions to get that in and out of what vImage likes to use. You can now handle 24-bit, 8-bit-per-channel color; also the older ARGB1555 and RGB565 16-bit-per-pixel formats. We also do 16-bit-per-channel integer support, in signed and unsigned flavors, and we've introduced OpenEXR-compliant 16-bit floating-point conversion functions in case you need to work with video cards that use those. We've also added a few other things that allow you to insert channels into interleaved images or permute channels around, say if you need to swap an ARGB image to RGBA or something like that. Those will be fully vectorized, and they pretty much operate at bandwidth-limited rates. We've also added color space transforms.
We originally didn't put these in because we thought we would leave them up to ColorSync, but now ColorSync wants to use our code, so you'll have them in there. We have matrix multiplications, with saturated clipping for 8-bit, of course, to prevent overflow, and we allow you to put in an optional pre- and post-bias. Mathematically the pre- and post-bias are the same, but it's a little easier to use that way, so we put that feature in. And again, like the convolution, this one only does work for nonzero elements, so you can safely pass it a rather sparse matrix and we'll just do the work we need to.

We're also introducing a whole set of gamma correction functions. These come in a variety of flavors: you can get a generic power curve, and we also provide a few specialty gammas like sRGB, which isn't exactly a generic power curve. These are available in two different formats. They're generally geared toward floating point, but you can get them in either full 24-bit or 12-bit precision variants; the smaller precision is obviously appropriate for data that was 8-bit integer data to begin with. We also have a few functions that do a simultaneous 8-bit conversion, with clipping, while they're doing the gamma correction, and we're also providing interpolated lookup tables for cases where your gamma curve is not nicely described by a power function.

Now I'd like to invite Steve Peters up to talk about the numeric improvements for Tiger.
All right, thank you. All right, I'm going to take some time this afternoon to present the credentials of our math libraries; perhaps some of you have not used them before and would like to know a bit about the motivation. I'll also spend some time on performance. Excellent. So job number one for us is conformance: to make porting your applications and building your applications correspond to the experience you've learned on other platforms, learned in the classroom, learned from reading the standards (who does that anymore?). At the base, we're delivering platforms based on G3, G4, and G5 chips, all of which have IEEE 754-compliant floating-point arithmetic, both single and double. When we move up one level to the elementary functions, the basic math libraries, these are also compliant, compliant with the C99 standard: all the required C99 APIs are present, for complex and long double as well, as we come into the Tiger world. We build our linear algebra, the BLAS, the Basic Linear Algebra Subroutines, from ATLAS, the widely respected open-source package that is Automatically Tuned Linear Algebra Software. We offer the full panoply of APIs in float, double, complex, and complex double, and similarly for the gold standard of numerical computing, LAPACK: all routines, float, double, complex, complex double, with entry points for both C and Fortran.
After conformance, we're really concerned with performance, and the flagship of performance at Apple now is the marvelous G5 CPU, the IBM PowerPC 970, which offers dual floating-point cores (to my recollection, the first in Apple's line) and has given us really stellar performance. On each 970 CPU we find two floating-point cores capable of doing double-precision IEEE or single-precision IEEE arithmetic, and on any machine cycle both of those units can be pressed into action: we can start a floating-point instruction down each pipe, on both pipes, in a single cycle. All the basic arithmetic operations, add, multiply, subtract, and divide, are present. We also get hardware square root in the PowerPC 970; that's a real boon to us. And there's another class of instructions that has been present in G4, and now as well in G5, called the fused multiply-add. Fused multiply-add takes three operands, multiplies the first two together, and adds the result to the third, all in the course of one instruction. This ends up being a key operation. It's fundamental to linear algebra: the dot product is essentially multiply-and-accumulate, multiply-and-accumulate, multiply-and-accumulate. It's fundamental to the FFT in much the same way. And if you're doing a function evaluation by, say, polynomial approximation, you'll probably want to use Horner's rule; if you think a little bit about the way Horner's rule works out, it's essentially a fused multiply-add win. And at the bottom line, we get to count two floating-point operations per fused multiply-add, so on a machine with two floating-point cores we get four flops per cycle. So let's see, four flops per cycle (I always have to do this in my head), two CPUs in the dual G5, so that's eight flops across the two CPUs, and we clock them at two gigahertz, so we top out at 16 gigaflops' worth of double-precision floating-point operations on a two-gigahertz dual G5. And now that we're using 2.5s, I have to update my thinking: it's a 20-gigaflop theoretical peak.

So how do you get to this performance, this great double-precision performance? If you've got an existing Apple Mac OS X binary, perhaps built for G4, just bring it across. The scheduling in the CPU is really smart: as the instruction stream comes along and we start seeing floating-point instructions, they get dispatched off to the dual pipes, and they will finish faster than if they were sent to a single pipe. So part of the answer is that you don't have to do anything, and you should see some important performance gains in existing binary apps. Second, if you're able to recompile your app, say it's an open-source application or code you've developed, recompile with GCC, set the proper options that I'll point to in a tech note later, and let it schedule instructions in an even more optimal way for the G5, and you can see yet more gains. It's also possible, by paying special attention to algorithmic details, to get even further gains. For example, if you're computing a rational function approximation, you may be able to arrange the calculation so that the numerator is computed simultaneously with the denominator on the two pipes; at the end you just weld them together with the divide. This level of attention we've paid already to libm, our BLAS, our LAPACK, and the vForce library.
Both our G4 and G5 platforms offer the AltiVec single-instruction, multiple-data processor. This is a four-way parallel single-precision engine. It doesn't do double precision, not at all; Ian keeps telling us it will never do double. It's a single-precision engine with a huge appetite for floating point; it really just rips through floating-point calculations. All the basic operations are present, as well as a vector fused multiply-add. So now we get two flops counted for the fused multiply-add on four operands strung across the 128-bit vector, which gives us eight flops per cycle. Now let's see if I can do the math in my head: for a 2.5 GHz dual G5, I think that tops out at forty gigaflops. Thank you. Forty gigaflops, tops.

All right, so how do you get to this performance? Well, sorry, you've got to do a little bit of work: you're going to have to learn a little bit about vector programming. There's help that we've announced this week, but you still want to get in there with your code, understand where there's inherent parallelism in your algorithms, work those over with the SIMD instruction set, and pass them through the compiler. Our advice is always: profile first. Before you dig in, find out where the 10% of the code is where you're spending 90% of your time, and go look at those places. Shark is a wonderful tool for figuring out these cases; I hope you've seen Shark or plan to see a Shark talk sometime this week, they're playing in a theater near you, I'm sure. Auto-vectorization is an option; this slide was actually written before the announcement that GCC 3.5 will be offering some auto-vectorization features. Check those out; they may be a real boon to getting better use of the SIMD unit on the G4s and G5s.
There's also a third-party application called VAST that can analyze, I think, Fortran codes to discover inherent parallelism and emit the proper AltiVec code. We've gone through at Apple and paid this kind of attention, algorithmic attention, recasting algorithms, for our vForce library, our single-precision BLAS, our single-precision FFTs and digital signal processing algorithms, and heavily in vImage.

When you come to our platform as a developer and you come to that final step, how do I access these wonderful libraries? Link, load, and go: we try to make that as straightforward as possible. The library APIs generally will internally dispatch for the correct platform, so we won't go off and try to execute code that's appropriate for a G5 on a machine that's a G3, for example. Generally the rule is: if the API uses a hardware SIMD vector type, an AltiVec vector type, you're expected as a consumer of that API to know that you're on G4 or G5; otherwise we'll take care of that for you. libm links by default; it's part of libSystem, so you don't need to say anything about that. For our long double and complex APIs, please add -lmx to your link line. And for vForce, the BLAS, LAPACK, vDSP, and vImage, the one-stop shopping place is the Accelerate framework: just add -framework Accelerate to your compile and link lines. I know that's a popular flag, so I'll let you copy that down. All right.
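The link recipes just described look like this on the command line (the source file name is illustrative):

```shell
# libm is part of libSystem: no -lm needed
cc myapp.c -o myapp

# long double and complex APIs
cc myapp.c -o myapp -lmx

# vForce, BLAS, LAPACK, vDSP, and vImage
cc myapp.c -o myapp -framework Accelerate
```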
Well, what's new for math in Tiger? What have we been working on? Ali hit the highlights of the vForce library: basically, we've been told people don't want to do one square root at a time, they'd really like to do 768 at a time, and sure enough there are advantages to be had when you can do many of these things at once. We also took a BLAS update, an update to ATLAS 3.6. This helped us in a couple of places. We of course do additional Mac OS X-specific tune-ups to that open-source drop, and our compiler technology improved (thank you, compiler team) to give us some nice, and somewhat unexpected, gains. And because of the faster underlying BLAS and some improved compilation, our LAPACK is going faster too.

Now, Ali always likes me to lead with the strongest graph, so I can give you a couple of performance numbers here. These are some numbers I collected for the 2.5 GHz dual-processor G5. It's a set of numbers that you'll see quite a bit in the computational linear algebra community.
It measures matrix multiply, DGEMM, and then the three decompositions: LU and the symmetric decompositions, the LL-transpose Cholesky and the U-transpose-U variant. For matrix multiply we used various matrix sizes ranging from 500 up to roughly, I think, 9000, and we got our first plateau at a bit over eleven gigaflops, then a sort of interesting jump around size 5000, as we push up beyond 12 and into the 13-gigaflop range. The decompositions are a little bit less jumpy, less of a step function, but they do look asymptotic, hitting an asymptote at around 10 gigaflops.

Well, what's the competition up to these days? Let's just look at matrix multiply again. In yellow is the dual 2.5 GHz G5, topping out at or above 12 gigaflops. On the bottom in blue is Opteron, a 2.0 GHz Opteron; they get to about seven gigaflops in the 2.0 model. For purposes of comparison, we know they've got a 2.4 GHz part out there, and if it were allowed to scale perfectly it would hit that dashed white line and come in just a bit over 8 gigaflops; we expect to see that when we measure those machines. The dual 3.2 GHz Xeon is the green; it gets up a little bit above 10, probably touches 11 in a couple of places. So the 2.5 GHz G5 seems to dominate the matrix multiply game quite handily.
This slide is a bit busier, but again the colors should be the guide: yellow is the G5, green is the Xeon, and Opteron is in blue, and again we've scaled Opteron by 20% for the dashed white line. The G5 seems to dominate again.

This looks a little bit out of place, but I mentioned we did long double; I think Ali mentioned it too, and we'll also have the tgmath.h type-generic math functions. That's good to know.

So I want to come back to this vForce business. As Ali alluded to, the elementary functions in libm, square root, cosine, sine, arcsine, take a single operand, do a fairly heavy amount of computation, and produce a single result. It turns out that leaves bubbles in modern RISC pipelines, so we say these C99 APIs are data-starved. We're also required by IEEE 754 to keep very careful control over the rounding modes and exceptions that might be generated in the course of such a computation, and that adds a fair amount of overhead: there are instructions that have to synchronize the pipe to get that right, and we pay a pretty good price for it. So the idea in vForce is: let's pass many operands through a single call and see if we can get some advantage there. If we had 768 values in a vector x and we wanted to compute the single-precision floating-point sine of those things, we could call vvsinf, passing x, 768, and a place to stuff the answers, y. Or we might have 117 numbers we want the arctangent of; there's a call for that. We're going to insist on the IEEE default rounding modes, and we're not going to set any exception flags. This is close-to-the-metal, high-performance, go-as-fast-as-you-can stuff: we don't expect any big problems, and if there are any, well, we'll deal with them in some manner other than the IEEE approach.

We also get some mileage here because, given multiple operands, we can pack them together into hardware vectors on the single-precision side and send them through the AltiVec engine; this is a very good thing. Similarly, on the G5 we make sure to utilize the two pipes as effectively as possible. We do a lot of software pipelining, that is, arranging to fill all the available cycles on all the floating-point pipes; we unroll loops like crazy; and we've also taken some algorithmic approaches that favor calculation over table lookup and try to avoid branches like the plague. It makes these things go very, very fast, and as we pointed out, we have gains in square root of over 3x, exponential of nearly 7x, and sine of almost 12x: 3, 6, 12.

So, some caveats, right? This was close-to-the-metal programming. Generally the results are as accurate as libm, but they're not bitwise identical, so don't expect to call and compare for equality on a list of arguments. We handle almost all the edge cases according to C99 for the special functions; the exceptions are a few places around signed zeros, what happens when plus or minus zero is passed to one of these routines. We make no alignment requirements, although you will get best performance if you can 16-byte align your data; storage returned by malloc on Mac OS X is 16-byte aligned by default. This stuff is tuned for the G5, I mean, that's the performance flagship here, but the good news is it runs quite nicely on G4 and G3, and of course we dispatch internally to the appropriate routine, so you don't need to worry about where you're running. The vForce routines just do the right thing.
So, one final change of gears: to come back to the elementary functions themselves, where we've done a bit of tune-up work. Here is a selected sample of probably the most used and most loved elementary functions in our library, and we report the number of G5 cycles on a random selection of arguments over a wide range, averaged over the number of iterations: square root takes about 35 cycles per element, sine 52, and so forth. If you look at what the competition publishes for the performance of x87, these are essentially hardware implementations of these transcendental functions: their square root runs about 38; their exponential, depending on how you want to count, runs no less than 150 cycles to do the two-to-the-x part, and there's a bit of massaging to get e-to-the-x from there; logarithms are a winner for them; and otherwise we get all the wins, shown in yellow. Now, those are just raw x87 numbers. When you actually package these things into a library that takes account of the rounding requirements and error flags, such as in GNU/Linux, the performance falls off a bit more. These G5 numbers are already in compliance with IEEE, so there's nothing further to pay; that is libm. Linux on Intel, on the competitor's hardware, goes quite a bit slower. So for raw elementary function performance, I think the G5 wins. But then, I work on that stuff.

There are some notes in our technical library: Technote 2086, tuning for the G5, and Technote 2087, a quick look at the G4 and G5; if you're familiar with programming for the G4, that will get you bumped up to the G5 in a hurry, and I see some note-takers finishing up on that. There's also some really nice documentation in the developer reference library for the Accelerate framework and some of its individual components, vImage and vDSP, and a piece that Ian mainly maintains on the Velocity Engine, which is a wonderful, gentle introduction to SIMD programming. Is there such a thing, Bob? I don't know. That's a good point. Okay.