---
title: WWDC2004 Session 407
framework: wwdc
role: article
path: wwdc/wwdc2004-407
---

# WWDC2004 Session 407

## Transcript

Good afternoon. This is Session 407, the Accelerate framework. The Accelerate framework was introduced at last year's Developers Conference. This year we'll rehash a little of what was introduced last year, but also give you part two. If you know anything about Ali's team, you know these are the guys who, in your calculus class, were finishing the word problems or the homework before the class was even over. They love math, so I hope you enjoy this session. With that, I'd like to introduce Dr. Ali Sazegari.

Hi. What I'd like to talk to you about is the Accelerate framework, what we have done, and what we plan on doing for Tiger. The talk is in three parts. I'm going to give you a general overview and some snippets of results that we have. After that I'll pass it on to my colleague Ian Ollmann, who will talk about the image processing library, which was introduced last year in the Panther OS. After that I'll pass it to my other colleague Steve, who will talk more about the numerics and the linear algebra results that we have. So let's get started.

As you know, we have had this particular configuration: the Accelerate framework, which is a collection of all the computational underpinnings of Mac OS X. In Panther, we've had the vecLib section of it for a while; last year we introduced vImage. The vecLib section had the signal processing, the linear algebra, the matrix computations, the BLAS, the large-number computations, and the math libraries that take 128-bit hardware vectors. We added image processing. I'm happy to tell you a lot of people are using our image processing, inside Apple and outside, and we're going to talk a little more about that.

One of the additions to the new operating system is vForce. We've had a lot of calls from people who wanted to take an array of elements, pass it to the elementary functions, and get the elementary function results, not just pass it in hardware vectors or
just one scalar at a time. vForce is our new library, which we will talk about in depth later on; Steve will talk about that.

What is delivered in Mac OS X Tiger? Basically, the Accelerate framework is one-stop shopping for computational performance.

Digital signal processing: we have expanded that in Tiger; we will have about 340 new functions in the vDSP subframework. Digital image processing: we have expanded that also, with added performance in some of the core routines such as convolution. The BLAS, levels 1, 2, and 3: if you're familiar with them, these are the Basic Linear Algebra Subroutines, the building blocks of computation that people use for LAPACK. The entire LAPACK: single and double, real and complex, for all of the routines; basically this is the exact API that people who use LAPACK are used to. vForce, the array elementary functions: we'll talk a little more about that today, and that's new in Tiger. vMathLib, the counterpart of regular libm: libm runs on the scalar units, and this one runs on the vector engine.

I'm going to just touch on some of the performance improvements and the performance that we have right now in Tiger, in the CDs that you've had.

First I'd like to talk about vForce performance. These are the elementary functions. We have single and double precision. They're highly accurate. They operate on arrays instead of single elements or 128-bit hardware vectors. Monotonicity is observed over the entire range of definition. That's pretty important, because there are competitors that do have functionality such as this, but they have cut corners, and developers have to worry about pitfalls: where to call, what not to call, or what elements to call with. Here you're free to call anything. Basically, if it's in the floating-point domain, it will work; it will not trip you up and will not give you wrong results. I have a small table here showing you what the benefits of vForce are.
I'm quite proud of this particular piece of work that our group has done, and Steve will talk about it a little further. Square root is about 3.1x, over three times faster than the current one; exponential is over six times faster; and sine is 11 times faster. Square root was already pretty fast on G5s, but we have it even faster with this. The reason these things are faster is that we are able to plug the bubbles in the computational structure of the algorithms. Regular elementary functions just don't have enough data to go through, and you end up having a lot of empty cycles going by. This allows us to fill the pipeline completely and get stellar performance.

The next thing I'd like to talk to you about is LAPACK performance: LINPACK. A lot of people know about these results; I just have a little of it here. The DLP 1000 is the double-precision LINPACK on a 1,000 by 1,000 matrix, and we're at about 5 gigaflops for double precision; single precision is over seven and a half gigaflops. This is on a 2.5 GHz PowerPC.

BLAS performance: the quintessential BLAS performance benchmark is really DGEMM, the double-precision general matrix-matrix multiply, which is an enhanced matrix multiply: a scalar times a matrix times a matrix, plus a scalar times a matrix. A lot of people like to look at that to see the prowess of the implementation. What we have here is a comparison with Opteron, because I get asked how we compare with the competition. Higher numbers are better on this one. Double precision, size 5500: if you multiply 5500-size matrices, you get 12.18 gigaflops on a PowerPC and over 7 gigaflops for the Opteron. Now, the Opteron that we had on hand was a 2.0 GHz machine, so we just added 20% to it. We were unable to get hold of a live 2.4 GHz machine to run that stuff. We gave them exactly 20% more, although generally, when frequency goes up that much,
performance doesn't go up that much. That's what we did, so it's 12.8 versus 8.55. What I have here, just for fun, is what the SGEMM performance would be, and the SGEMM performance on our machine is 23 gigaflops. As some of you know, I've been in computation for a while now, and 20 gigaflops used to require many millions of dollars to achieve. But 23 gigaflops is just a pittance now: you can buy yourself a PowerPC at 2.5 GHz and get that.

vDSP performance: our FFT. We continue to have a stellar collection of FFTs for our users: single and double precision, real and complex, 1D and 2D, in place and out of place, and radix 2, 3, and 5. We have them hand-tuned for the vector engine, and we also have them hand-tuned for the dual scalar units. I'm comparing them with 3.2 GHz Xeons. This time around we're not looking at gigaflops; we're looking at timing, real microseconds, because signal processing people really don't care very much about floating-point throughput. Because they're real-time folks, they like to find out exactly what the timing is. Single-precision 1024 complex, which is always etched into my mind, is 4.56 microseconds versus 6.13 on a 3.2 GHz Xeon. These are one processor only, because the floating-point FFT doesn't do enough work to dole out to multiple processors. Single-precision 1024 real is 2.3 microseconds versus 4.27 microseconds.

So you would think this is fast enough; why would you want to make it any faster? The quintessential example that I always like to give is iTunes. iTunes uses our FFT to the tune of 1.2 million times per hour for your music. It's that real FFT that gets used. The more we shave off of that, the faster your decoding and encoding will go, and the more we shave off of computational things like FFTs and IMDCTs, the better your battery life will be. So this is pretty darn
important, to make sure that it always runs extremely fast.

The image processing library: I'm very, very proud of this particular set of work. We set out and worked on this for a year for Panther and delivered it, and it's used in a lot of applications, in-house and outside. I just have a couple of things here. We have planar and chunky (kind of a funny word), that is, ARGB interleaved, formats. There is native support for 8-bit and floating-point samples. It can be used in real time. It's multithreaded, so if you have large images you can do better. I have a small table here for performance. What we have is an 8-gigabyte image blur, and I'm comparing that to the IPP, the Intel Integrated Performance Primitives, which some of you might be familiar with. The 8-gigabyte image blur is five and a half times faster; the 8-gigabyte image emboss is 2.2 times faster.

Delivered in Mac OS X also, let's not forget the underpinnings of regular computation: libm. We have standards-conforming APIs for IEEE 754 and C99, single and double. New in Tiger is our long-lost friend, the 128-bit long double, which is going to make an appearance again, and we have a really stellar implementation of it, very accurate computationally. All of these are numerically robust and highly accurate, they take care of the environmental controls and never mess anything up, and we take a lot of care to make sure that we conform to all of the existing standards. Best-of-breed algorithms, basically.

Coding to libm in C is straightforward: you just call the compiler; you don't have to say -lm. Using the Accelerate framework in C is also straightforward: all you need to do is put in -framework Accelerate.

So what I've done in the last few minutes is give you a small sampling of what we have in image processing, signal processing, BLAS, vForce, and LAPACK, and we're going to go into some of the details of this work
as we go along. Now I'd like to pass this on to my colleague Ian Ollmann, who is going to talk more about the image processing.

All right, thank you. vImage was introduced last year at WWDC and shipped with Panther. Since then we've gotten a lot of feedback on it, we've taken your suggestions to heart, and so we've got more improvements for it now.

The vImage functionality remains much as introduced previously, with some new features added on. We still have native support for 8-bit and floating-point samples. These can be arranged either in a planar format, which is one channel per array, or a chunky format, which interleaves several channels. If you're doing 8-bit work on images, then we throw in saturated clipping, usually at the ends of functions that can overflow, so you don't get the white-goes-to-black or black-goes-to-white problem. We've put a lot of effort into thinking the design over to make sure you can use these things in real time: we don't arbitrarily call malloc, we give you the opportunity to provide us with the temporary buffers so you won't block on that, that kind of thing. We're also reentrant, so you can call us in a multithreaded environment. And of course it's high performance, accelerated for AltiVec.

We provide a variety of image filters. We have convolutions. We have morphology functions that allow you to do edge detection or fill in holes, that kind of thing: min, max, dilate, erode. We do histogram operations with color balancing, and alpha compositing, with some new functionality there. For geometrical transforms we do scales, rotates, shears, affine warps; you can distort the image in lots of different ways. We also do some color space conversions and data type conversions.

Just to go over what you can do with convolution: depending on what kernel you provide to the convolution filter, you can do all sorts of different operations. You can do blurs or sharpens, you can do an emboss, which is essentially a first derivative over the image, you can do averaging, and various
other things. We've gone over and looked at the performance for Tiger, for G5, and for future processors, and we've done a lot of work to get the performance up. Right now, on your CD, you'll find that the performance for the planar 8-bit case is substantially improved over what it was, and as the months go by in the near future we're going to push that forward to other cases. So we've done a lot of work just on the brute-force computation part of it.

We've also improved the algorithm a bit: it's a lot smarter about zeros in your convolution kernel. Currently, most people pass in a kernel that is 90% zeros, and they just kind of expect the library not to actually do work for the zeros. But it turns out that if you go look at the high-performance convolutions out there, they do actually do work for the zeros. We've changed things around so we don't, so in many cases, in a comparative study between our library and the others, you're going to see a very substantial improvement in ours over the others.

Just to give you an idea, here's an example of a somewhat blurry image of Lisbon. You can apply a standard sharpening kernel and it looks a little bit sharper; I don't know if it shows up well on this display. You can see the kernel we used there, which accentuates the pixel in question over its neighbors. That's how you get the effect.

Here's the kind of performance you can expect on that kind of thing. Here we have a competitive graph against Xeon. It's a little hard to read. It's a 3.2 GHz Xeon that we're working on, and we're looking at the Intel Performance Primitives library. Intel has already gone through and multithreaded all this for you, so both of these are dual-processor results. Intel is the blue bar along the bottom; we normalized its performance to 1. The speed of the G5 with the dense kernel is the red line above it. So we're usually between 1 and 3 times faster than Intel for a dense kernel, and then
for a sparse kernel like emboss, which is mostly zeros, we're up to 8 times faster.

We also do morphology operations: different shape-changing operations, that kind of thing. Here is an example where we've got a nice picture, except that it looks like there's a power line up in the top left corner. Wouldn't it be nice if we could remove that? There are lots of ways, but we'll just use morphology for this example. We can apply a max filter. Max goes around, looks at all the pixels around the pixel in question, and takes the maximum value. The power line is a dark feature, so as we apply the max filter it just goes away. But you notice that some of the white highlights got bigger, so we can apply a min filter and subtract them back out. You end up with something that looks like your original image again, except that now the power line is gone. So you can do these kinds of interesting effects in addition to just shape changing.

Here's performance on that. We've got a new algorithm for max which works substantially better. Here you can see the dual-processor 3.2 GHz Xeon results again, normalized to 1, as the red line across the bottom. As the kernel size gets larger, you can see our performance relative to Xeon gets better and better, and we're up to four times faster for really large filters.

We do alpha compositing. We can support either premultiplied or non-premultiplied images, and we have functions to premultiply and unpremultiply data. We've now added a few new functions for Tiger. You can mix a non-premultiplied image into a premultiplied layer, which allows you to composite multiple stacks as you go along. We also added compositing with a scalar fade value, which allows you to fade in the whole image without going through and writing over the alpha channel. So those will be available.

We also have new type conversion features. This was actually, surprisingly at
least to us, the number one requested feature. It seems that everybody has their own data format that they like to use, so we've got a lot of conversions to get data in and out of what vImage likes to use. You can now handle 24-bit color, 8 bits per channel, and also the older ARGB1555 and RGB565 16-bit-per-pixel formats. We also support 16-bit-per-channel integer data in signed and unsigned flavors, and we've introduced OpenEXR-compliant 16-bit floating-point conversion functions, in case you need to work with video cards that use those. We've also added a few other things that allow you to insert channels into interleaved images or permute channels, say if you need to swap an ARGB image around to RGBA or something like that. Those will be fully vectorized, and they pretty much operate at bandwidth-limited rates.

We've also added color space transforms. We originally didn't put these in because we thought we would leave them up to ColorSync, but now ColorSync wants to use our code, so you have them in there. We have matrix multiplication, with saturated clipping for 8-bit, of course, to prevent overflow. We allow you to put in an optional pre- and post-bias. Mathematically the pre- and post-biases are the same, but it's a little easier to use that way, so we put that feature in. And again, like the convolution, this one only does work for nonzero elements, so you can safely pass it a rather sparse matrix and we'll just do the work that we need to.

We're also introducing a whole set of gamma correction functions. These come in a variety of flavors. You can get a generic power curve, and we also provide a few specialty gammas like sRGB, which isn't exactly a generic power curve. These are available in two different formats. They're generally geared to floating point, but you can get them in either full 24-bit or 12-bit precision variants, and the lower precision is obviously appropriate for data that was 8-bit
integer data to begin with. We also have a few functions that do simultaneous 8-bit conversion with clipping while they're doing the gamma correction, and we're also providing interpolated lookup tables for cases where your gamma curve is not nicely described by a power function.

So I'd like to invite Steve Peters up to talk about the numeric improvements for Tiger.

All right, thank you. I'm going to take some time this afternoon to present the credentials of our math libraries (perhaps some of you have not used them before and would like to know a bit about the motivation) and also spend some time on performance. Hey, excellent.

Job number one for us is conformance: to make porting your applications and building your applications correspond to the experience you've gained on other platforms, learned in the classroom, or learned from reading the standards (who does that anymore?). At the base, we're delivering platforms based on G3, G4, and G5 chips, all of which have IEEE 754-compliant floating-point arithmetic, both single and double. When we move up one level to the elementary functions, the basic math libraries, these are also compliant, with the C99 standard. All the required C99 APIs are present, for complex and long double as well as we come into the Tiger world. (You know, I'm going to have to use these.) We build our linear algebra, the BLAS, the Basic Linear Algebra Subroutines, from ATLAS, the widely respected open-source package; that's Automatically Tuned Linear Algebra Software. We offer the full panoply of APIs in float, double, complex, and double complex, and similarly for the gold standard of numerical computing, LAPACK: all routines, in float, double, complex, and double complex, with entry points for both C and Fortran.

After conformance, we're really concerned with performance. The flagship of performance at Apple now is the marvelous G5 CPU, the PowerPC 970, which offers dual floating-point cores (to my recollection, the first in Apple's line) and has given us
really stellar performance. On each 970 CPU we find two floating-point cores capable of IEEE double precision and IEEE single precision. On any machine cycle, both of those units can be pressed into action: we can start a floating-point instruction down each pipe, on both pipes, in a single cycle. All the basic arithmetic operations, add, multiply, subtract, and divide, are present. We also get hardware square root in the PowerPC 970, which is a real boon to us, and another class of instructions that has been present in G4 and now in G5 as well, called the fused multiply-add. Fused multiply-add takes three operands, multiplies the first two together, and adds the product to the third, all in the course of one instruction.

This ends up being a key operation. It is fundamental to linear algebra: the dot product is essentially multiply and accumulate, multiply and accumulate. It's fundamental to the FFT in much the same way. If you're doing function evaluation by, say, polynomial approximation, you'll probably want to use Horner's rule, and if you think a little about the way Horner's rule works out, it's essentially a fused multiply-add win. And at the bottom line, we get to count two floating-point operations per fused multiply-add.

So on a machine with two floating-point cores we get four flops per cycle. Let's see, I always have to do this in my head: four flops per cycle, two CPUs in the dual G5, so that's eight flops per cycle across the two CPUs, and we clock them at two gigahertz, so we top out at 16 gigaflops of double-precision floating-point operations on a 2 GHz dual G5. And now that we're using 2.5s, I have to update my thinking: it's 20 gigaflops, theoretical peak.

So how do you get to this great double-precision performance? If you've got an existing Mac OS X binary, perhaps built for G4, just bring it across. The scheduling in the CPU
is really smart. As the instruction stream comes along and we start seeing floating-point instructions, they get dispatched to the dual floating-point units, and they will finish faster than if they were sent to a single pipe. So part of the answer is that you don't have to do anything, and you should see some improved performance in existing binary apps. Second, if you're able to recompile your app, say it's open source or application code you've developed, recompile with GCC, set the proper options (which I'll point to in a tech note later), and let it schedule instructions in an even more optimal way for the G5, and you can see yet more gains. It's also possible, by paying special attention to algorithmic details, to get even further gains. For example, if you're computing a rational function approximation, you may be able to arrange the calculation so that the numerator is computed simultaneously with the denominator on the two pipes; at the end you just weld them together with the divide. This level of attention we've already paid to our libm library, our BLAS, our LAPACK, and the vForce library.

Both our G4 and G5 platforms offer the AltiVec single-instruction, multiple-data processor. This is a 4-way parallel single-precision engine. It doesn't do double precision, not at all; Ian keeps telling us it will never do double. It's a single-precision engine with a huge appetite for floating point; it really just rips through floating-point calculations. All the basic operations are present, as well as a vector fused multiply-add. So now we get two flops counted for the fused multiply-add on four operands strung across the 128-bit vector, which gives us eight flops per cycle. Now let's see, can I do the math in my head? For a 2.5 GHz G5 I think that tops out at forty gigaflops. Thank you. Forty gigaflops, tops.

All right, so how do you get to this performance? Well, sorry, you've got to do a little bit of work. You're going to have to learn a little about vector programming. There's an out that we've announced this week, but it
helps to get in there with your code, understand where there's inherent parallelism in your algorithms, work those over with the SIMD instruction set, and pass them through the compiler. Our advice is always to profile first. Before you dig in, find out where the 10% of the code is in which you're spending 90% of your time, and go look at that. Shark is a wonderful tool for figuring out these cases. I hope you've seen Shark or plan to see a Shark talk sometime this week; they're playing in a theater near you, I'm sure.

Autovectorization is an option. This slide was actually written before the announcement was made that GCC 3.5 will be offering some autovectorization features. Check those out; it may be a real boon to getting better use of the SIMD unit on the G4s and G5s. There's also a third-party application called VAST that can analyze, I think, Fortran codes to discover inherent parallelism and emit the proper AltiVec code. At Apple, we've gone through and paid this kind of algorithmic attention, recasting algorithms, for our vForce library, our single-precision BLAS, our single-precision FFTs and digital signal processing algorithms, and heavily in vImage.

When you come to our platform as a developer and get to that final step, how do I access these wonderful libraries (link, load, and go), we try to make that as straightforward as possible. The library APIs generally dispatch internally for the correct platform, so we won't go off and try to execute code that's appropriate for a G5 on a machine that's a G3, for example. Generally the rule is that if the API uses a hardware SIMD vector type, an AltiVec SIMD vector type, you're expected, as a consumer of that API, to know that you're on a G4 or G5. Otherwise we'll take care of that for you. libm links by default; it's part of libSystem, so you don't need to say anything about that. For our long double and complex APIs, please add -lmx to your link line. And for vForce, the BLAS, LAPACK, vDSP, and vImage, the one-stop shopping place is the Accelerate framework: just add -framework Accelerate to your compile and link lines. I know that's a popular flag, so I'll let you copy that down.

All right, what's new for math in Tiger? What have we been working on? Ali hit the highlights of the vForce library. Basically, we've been told people don't want to do one square root at a time; they'd really like to do 768 at a time. And sure enough, there are advantages to be had when you can do many of these things at once. We also took a BLAS update, an update to ATLAS 3.6, which helped us in a couple of places; we of course do additional Mac OS X-specific tune-ups to that open-source drop. And our compiler technology improved (thank you, compiler team) to give us some nice and somewhat unexpected gains. Because of the faster underlying BLAS and some improved compilation, our LAPACK is going faster too.

Now, Ali always likes me to lead with the strongest graph, so I can give you a couple of performance numbers here. These are some numbers I collected for the dual-processor 2.5 GHz G5. It's a set of numbers that you'll see quite a bit in the computational linear algebra community. It measures matrix multiply, DGEMM, and then three decompositions: LU, and the symmetric decompositions, the LL-transpose Cholesky and the U-transpose-U. For matrix multiply we use various matrix sizes, ranging from 500 up to roughly, I think, 9000. We get our first plateau at a bit over eleven gigaflops, and then there's an interesting jump around size 5000 as we push up beyond 12 and into the 13-gigaflop range. The decompositions are a little less jumpy, less of a step function, but generally look like they're hitting an asymptote at around 10 gigaflops.

Well, what's the competition up to these days? Let's just look at matrix multiply again. In
yellow is the dual 2.5 GHz G5, topping out at or above 12 gigaflops. On the bottom, in blue, is Opteron, a 2.0 GHz Opteron, and they get to about seven gigaflops in the 2.0 model. For the purposes of comparison, we know that they've got a 2.4 GHz part out there, and if they were allowed to scale perfectly they'd hit that dashed white line and come in just a bit over 8 gigaflops. We expect to see that when we measure those machines. The dual 3.2 GHz Xeon is in green; it gets up a little above 10 and probably touches 11 in a couple of places. So the 2.5 GHz G5 seems to dominate in the matrix multiply game quite handily.

This slide is a bit busier, but again the colors should be the guide: yellow is G5, green is Xeon, and Opteron is in blue, and again we've scaled Opteron by 20% for the dashed white line. G5 seems to dominate again.

This looks a little out of place. I mentioned we did long double, I think Ali mentioned we did long double, and we'll also have the type-generic math functions. That's good to know.

So I want to come back to this vForce business. As Ali alluded to, the elementary functions in libm, square root, cosine, sine, arcsine, are passed a single operand, do a fairly heavy amount of computation, and burp out a single result. It turns out that leaves bubbles in modern RISC pipelines, so we say these C99 APIs are data starved. We're also required by IEEE 754 to have very careful control over the rounding modes and exceptions that might be generated in the course of such a computation, and that adds a fair amount of overhead. There are instructions that have to synchronize the pipe to get that stuff right, and we pay a pretty good price for that.

So the idea in vForce is: let's pass many operands through a single call; maybe we can get some advantage there. If we had 768 values in a vector x and we wanted to compute the single-precision floating-point sine of those things, we could call vvsinf, passing x, 768, and a place to stuff the answers, y. Or we might have 117 numbers and want the arctangent of them; there's a call for that. We're going to insist on the IEEE default rounding modes, and we're not going to set any exception flags. So this is for close-to-the-metal high performance: go as fast as you can. We don't expect any big problems, and if there are any, we'll deal with them in some manner other than the IEEE approach.

We also get some mileage here because, given multiple operands, we can pack them together into hardware vectors on the single-precision side and send them through the AltiVec engine. This is a very good thing. Similarly, on the G5 we can make sure to utilize the two pipes as effectively as possible. We do a lot of software pipelining, that is, arranging to fill all the available cycles on all the floating-point pipes. We unroll loops like crazy. We have also taken some algorithmic approaches that favor calculation over table lookup, and we try to avoid branches like the plague. It makes these things go very, very fast. As we pointed out, we have gains in square root of 3x, exp of nearly 7x, and sine was almost 12x: three, six, twelve.

So, some caveats, as with any close-to-the-metal programming. Generally the results are as accurate as libm, but they're not bitwise identical, so don't expect to call both and compare for equality on a list of arguments. We handle almost all the edge cases according to the C99 annex for the special functions; the exceptions are a few places around signed zeros, that is, what happens when plus or minus zero is passed to one of these routines. We make no alignment requirements, although you will get the best performance if you can 16-byte align your data. Storage returned by malloc on Mac OS X is 16-byte aligned by default. This stuff is tuned for the G5 (that's the performance flagship here), but the good news is that it runs quite nicely on G4 and G3, and of course we dispatch internally to the
appropriate routine. You don't need to worry about where you're running vForce routines; they just do the right thing.

One final change of gears here is to come back to the elementary functions themselves, where we've done a bit of tune-up work. Here is a selected sample of probably the most used and most loved elementary functions in our library. We report the number of G5 cycles on a random selection over a wide range of arguments, averaged over the number of iterations: square root takes about 35 cycles per element, sine 52, and so forth. If you look at what the competition publishes for the performance of the x87, these are essentially hardware implementations of these transcendental functions. Their square root runs about 38. Their exponential, depending on how you want to count, runs no less than 150 cycles to do what is actually the 2-to-the-x part, and there's a bit of massaging to get e to the x. Their logarithm is a winner; otherwise we get all the wins, in yellow.

Now, those are just raw x87 numbers. When you actually package these things into a library that takes account of rounding requirements and error flags, such as in GNU/Linux, the performance falls off a bit more. These G5 numbers are already in compliance with IEEE, so there's nothing further to say. That is libm on GNU/Linux on Intel, on the competitor's hardware, going quite a bit slower. So for raw elementary function performance I think G5 wins, but I work on that stuff.

There are some notes in our technical library: Tech Note 2086, tuning for the G5, and Tech Note 2087, a quick look at the G4 and G5. If you're familiar with programming for G4, that will get you bumped up to G5 in a hurry. I see some note takers finishing up on that. There is also some really nice documentation in the developer reference library for the Accelerate framework and some of its individual components, vImage and vDSP, and a piece that Ian mainly maintains on the
Velocity Engine that's sort of a wonderful, gentle introduction to SIMD programming. Is there such a thing, Bob? I don't know. That's a good point. Okay.
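
As a worked check of the theoretical-peak arithmetic in Steve Peters's part of the talk, here is a minimal sketch that multiplies out flops per fused multiply-add, lanes, floating-point units, CPUs, and clock rate. The function name `peak_gflops` and its parameters are hypothetical. The AltiVec line assumes one vector multiply-add unit per CPU and assumes the forty-gigaflop figure counts both CPUs of a dual machine; the talk does not spell out either point.

```python
def peak_gflops(flops_per_fma, lanes, units_per_cpu, cpus, clock_ghz):
    """Theoretical peak in gigaflops: flops issued per cycle times clock in GHz."""
    flops_per_cycle = flops_per_fma * lanes * units_per_cpu * cpus
    return flops_per_cycle * clock_ghz


# Scalar double precision: a fused multiply-add counts as 2 flops, one value
# per instruction, two floating-point units per CPU, two CPUs in a dual G5.
scalar_2_0 = peak_gflops(2, 1, 2, 2, 2.0)
scalar_2_5 = peak_gflops(2, 1, 2, 2, 2.5)

# AltiVec single precision: 4 lanes in a 128-bit vector.
# Assumption: one vector multiply-add unit per CPU, two CPUs.
altivec_2_5 = peak_gflops(2, 4, 1, 2, 2.5)

print("dual 2.0 GHz, scalar double:", scalar_2_0, "GFLOPS")
print("dual 2.5 GHz, scalar double:", scalar_2_5, "GFLOPS")
print("dual 2.5 GHz, AltiVec single:", altivec_2_5, "GFLOPS")
```

Under those assumptions the three results match the 16, 20, and 40 gigaflop figures quoted in the talk. These are upper bounds that assume a fused multiply-add issues on every unit in every cycle, which is why the measured DGEMM numbers sit well below them.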

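
The max-then-min trick Ian Ollmann describes for removing the dark power line can be sketched in a few lines of plain Python. This is only an illustration of the idea, not vImage's implementation or API: the helper names `rank_filter` and `remove_thin_dark_line`, the square 3x3 window, and the policy of ignoring neighbors that fall outside the image are all assumptions made for the example.

```python
def rank_filter(image, radius, pick):
    """Apply `pick` (max or min) over a square window around every pixel.

    Neighbors that fall outside the image are ignored (an assumed edge policy).
    """
    height, width = len(image), len(image[0])
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def remove_thin_dark_line(image, radius=1):
    """Max filter erases thin dark features; min filter shrinks highlights back."""
    grown = rank_filter(image, radius, max)
    return rank_filter(grown, radius, min)


# A 7x7 gray image with one bright highlight and a one-pixel-thick dark line.
image = [[100] * 7 for _ in range(7)]
image[2][3] = 250          # white highlight
for col in range(7):
    image[5][col] = 10     # dark "power line"

grown = rank_filter(image, 1, max)     # line vanishes, highlight grows to 3x3
cleaned = remove_thin_dark_line(image)  # highlight shrinks back to one pixel

for row in cleaned:
    print(row)
```

The window has to be wider than the feature being removed: a one-pixel line disappears with a radius of 1, while a thicker line would need a larger radius, at the cost of also erasing other dark details of that size.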