Transcript
>> Good morning, everyone.
My name is Steve Canon.
I'm a Senior Engineer in the Vector Numerics Group.
And today, I'm going to be talking to you
about the Accelerate framework for iPhone OS.
I'm going to start off with a little
bit of an overview of the ARM architecture,
and then we're going to go through a very high-level
overview of what's in the Accelerate framework,
and we're going to have a few examples
of how to use it effectively.
What I really want you to take away from this
is first, what's in the Accelerate framework.
It's new to iOS 4.
So if you haven't developed for the Mac
before, you're probably not familiar with it.
I want you to learn where to look for references
and documentation on the APIs that are in there,
and I also want to give you some
pointers on using Accelerate effectively.
So let's start with a little bit about the ARM architecture.
Apple has shipped two versions of the ARM architecture.
The first one we shipped was ARMv6.
This was used in the original iPhone, the iPhone 3G,
and the first- and second-generation iPod touches.
ARMv6 has a general purpose integer unit with 16
registers, and it also has hardware floating point support,
that's called VFP, and it has 32 single precision
registers and 16 double precision registers.
So it's got hardware support for
both single and double precision.
Now, in the newer products, we're
using an architecture called ARMv7.
This is used in the iPhone 3GS.
It's used in the third generation iPod touch, it's
used in the iPad, and it's also used in the iPhone 4.
ARMv7, like ARMv6, has an integer unit with 16 registers.
Unlike ARMv6, it's capable of
executing two instructions simultaneously.
It also has legacy support for the VFP instructions,
which are the hardware floating point model on ARMv6.
But it also has a new SIMD unit, which is called NEON.
SIMD stands for Single Instruction, Multiple Data.
The NEON unit has sixteen 128-bit registers.
And all single precision floating point on the
ARMv7 architecture executes on the NEON unit.
NEON does not have double precision support.
So double precision happens on the Legacy VFP unit.
A NEON register, as I said, is 128 bits wide.
But the operations that the NEON instructions
perform don't treat it as a 128-bit block.
Because these are SIMD registers, the instructions
treat each register as though it were several smaller fields,
and perform the same operation on each of those fields.
So there are instructions that operate as though
it were sixteen 8-bit fields, as you see here,
there are instructions that operate as though it were eight
16-bit integer fields, there are instructions that treat it
as four 32-bit integers, or as four 32-bit floating point
numbers, that is, single precision floating point numbers,
and there are even a few instructions
that treat it as two 64-bit integers.
Not a lot, but there are a few.
So this is an example of a NEON instruction.
This is the vadd.i16 instruction, it's a 16-bit integer add,
and so what this instruction does is
it takes these two 128-bit registers,
treats them as though they contain eight 16-bit integers,
and it simultaneously adds all eight pairs
of integers to give you eight results.
Another example, a floating point example
here, this is a vmul.f32 instruction,
and this takes the two 128-bit registers, treats
them as though they contain four 32-bit floats each,
and it simultaneously multiplies them
to give you four floating point results.
So that's sort of a couple of examples from NEON.
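If you want to try the same idea from C, those operations are also exposed as compiler intrinsics. Here's a rough sketch using the intrinsics from arm_neon.h when building for ARMv7; the function names are just illustrative, not from the slides.

    #include <arm_neon.h>   // NEON intrinsics; only available when targeting ARMv7
    #include <stdint.h>

    // Eight 16-bit integer adds at once -- the vadd.i16 example in C.
    void add8_i16(const int16_t *a, const int16_t *b, int16_t *c)
    {
        int16x8_t va = vld1q_s16(a);       // load eight 16-bit integers from a
        int16x8_t vb = vld1q_s16(b);       // load eight 16-bit integers from b
        vst1q_s16(c, vaddq_s16(va, vb));   // add all eight pairs, store eight results
    }

    // Four single precision multiplies at once -- the vmul.f32 example in C.
    void mul4_f32(const float *a, const float *b, float *c)
    {
        float32x4_t va = vld1q_f32(a);     // load four floats from a
        float32x4_t vb = vld1q_f32(b);     // load four floats from b
        vst1q_f32(c, vmulq_f32(va, vb));   // multiply all four pairs, store four results
    }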
There are several different types
of load and store instructions.
NEON actually has a really rich set of loads and stores.
It has good aligned load support, good
unaligned store support, it has some
strided loads, it's got all kinds of stuff.
It has single precision floating point arithmetic,
all the basic operations, conversions,
and it also has fast reciprocal and
square root estimate instructions.
As I said before, it does not have
double precision arithmetic.
And it also has a fairly complete set of integer operations.
It's got 8, 16, 32 bit, signed and unsigned, integer
operations, and it's got all your favorite things.
Multiply-accumulate, multiply-add,
subtract, reverse subtract, wide multiplies.
And then it also has a few more specialized
things that are useful in fixed point arithmetic,
and sort of digital signal processing applications.
NEON, however, does not have double precision.
So on ARMv7, all double precision happens on the VFP
unit, and that is much slower than the NEON unit.
Another thing about NEON is that although
NEON instructions typically consume more power
than do either scalar instructions
or general purpose instructions,
they tend to consume less energy
overall if your code is written well.
The reason for that is that the energy consumed is
actually the amount of power consumed over time.
So if you have your code using
general purpose instructions, and you graph the power consumed
versus time for it, you typically see a
graph like this, where initially, the processor is idle,
it ramps up to do some work, it consumes power
at a relatively steady rate for the period
of time it takes to do the work, and then it ramps back down.
Now, if you use NEON effectively, if you take good
advantage of it, you write good code, what happens is,
you consume much more power, or sometimes only a
little bit more power, but for much less time.
And so, by using NEON efficiently,
you're able to use less energy overall
which makes your battery last longer,
which is something we all like.
So that's a little overview of NEON. And now,
we're going to dive into the Accelerate framework.
The Accelerate framework was introduced
at the iOS 4 launch event
as having 2,000 APIs for hardware
accelerated math functions.
Now, what kind of math functions are we talking about?
The Accelerate framework in Mac OS X is an umbrella
framework that consists of many libraries.
Not all of those libraries are
going to be available in iOS 4.
We're bringing three of them to iOS 4 right now.
Those are the vDSP, LAPACK, and BLAS libraries.
Let's start with vDSP.
It stands for the Digital Signal Processing Library,
and it provides a lot of basic operations
and more complicated operations.
We're going to start with one of the absolute
simplest functions in the vDSP library just
to give you a feel for it, the dot product.
You may be familiar with dot product already.
If you're not, I'll remind you.
The dot product is the sum of the pairwise products
of corresponding elements from two vectors.
So you have two vectors, your inputs, I've called them
A and B here, and it gives you a scalar as an output.
And the way you get it is by multiplying the first element
of one vector by the first element of the other vector,
second element of one vector by the second element
of the other, third element by the third element.
Take all those products, sum them together,
that's the dot product of two vectors.
So suppose we need to use this in our program.
The dot product is a really basic operation, gets used
all the time, it's used heavily in audio processing,
it's used heavily in lighting calculations
for games, gets used all over the place.
So say, you're writing a program,
and you need to compute it.
You might just write a simple for loop to compute it.
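Something along these lines, say (a minimal sketch; the exact code from the slide isn't reproduced here):

    #include <stddef.h>

    // Plain C99 dot product: multiply corresponding elements, sum the products.
    float dot_product(const float *a, const float *b, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;
    }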
This is a simple piece of C99 code to compute a dot
product, we just iterate over the length of the two vectors,
we take the products, and we add them together.
What could be easier?
This is fine, but I think we can do better.
So we can use Accelerate to compute the dot product.
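The call looks roughly like this (a sketch; the wrapper function and argument names are mine):

    #include <Accelerate/Accelerate.h>

    float dot_product(const float *a, const float *b, vDSP_Length n)
    {
        float result;
        // Strides of 1 mean "use every element"; n is the number of element pairs.
        vDSP_dotpr(a, 1, b, 1, &result, n);
        return result;
    }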
Here, I have included the Accelerate headers
so that I get the definitions we need,
and I've called this function vDSP_dotpr, which stands for dot
product, and it takes pointers to the two arrays, A and B,
then there are these two arguments that are 1,
which we're not really going to worry
about right now, but we'll talk about them later.
It takes a pointer for the return value, the dot
product, and it takes the length as an argument.
Now, I haven't really saved a lot of code by doing this.
My code is about as long as it was before.
It's not a lot simpler.
Maybe it's less likely that I introduced a bug because
I'm using a library function instead of writing it myself.
But why should you do this?
The first thing that most people would
probably think of is performance.
So let's take a look at the execution
time on the ARMv7 architecture.
What I've got here is the execution time
for a length 1024 Dot Product on ARMv7.
This is something that would typically
be used in audio processing applications.
And the vDSP_dotpr function is about eight times faster
than that simple for loop that we started out with.
So that's great.
What else did we get?
This is my favorite.
We used a lot less energy.
This is again, that same 1024 element dot product on
the ARMv7 architecture, and the vDSP_dotpr is consuming
about one quarter-- actually, a little less than
one quarter-- of the energy that that for loop would.
So that means that hypothetically, if your
application was doing nothing but Dot Products,
you could do more than four times
as much work on the same battery.
So that's fantastic.
Now, what about on older processors?
What about the ARMv6 architecture?
As I mentioned, it has a very different
floating point unit than the ARMv7 does.
And code that performs well on one
may not perform well on the other.
Here's the graph where we're looking
at the performance on ARMv6.
And that same code that we have, just calling the Accelerate
framework dotpr function, gives us great performance
on the ARMv6 architecture, as well as on the ARMv7.
So you don't need to write a lot
of architecture specific code.
Now, there may be some graphics programmers out there
in the audience who are saying, "This guy is crazy.
Why would you want to do a 1024 element dot product?"
I want to do a three element dot product.
So I'll tell you right up front that that's pretty much
the worst use case possible for the Accelerate framework,
because you're only going to do three elements' worth of work,
and you are going to pay for a
function call, all this overhead.
The good news is that you still
get pretty decent performance,
more than twice as fast as that simple for loop would be.
Now, I would say, you shouldn't use it in this case.
But if you do, the performance would be good.
Let's take a look at what makes the dotpr function faster.
This is that simple for loop that we
started out with, and this is disassembly
of the compiled code that resulted from it.
Now, I'm going to highlight the instructions that
are doing, sort of the critical work in the loop,
the actual floating point operations
that we're interested in.
There they are.
One thing you'll notice right away is that the compiler
didn't do a great job of structuring this loop.
So it's using a lot of overhead for the computation.
Now, let's take a look at what it is doing.
These first two instructions, these flds instructions,
these load the floating point value from each array,
and then we multiply those two
values together with the fmuls--
the vmul.f32 instruction, and then we add that product
to a running sum with the vadd.f32 instruction.
I mentioned the overhead.
There's another thing that makes this loop slow, which
is that the multiply comes immediately after the loads,
but it can't execute until the data
that's being loaded is in register.
So in the best case scenario, where those loads are
coming from the L1 cache, it's still going to take
about three cycles before that data is loaded into
register, and the vmul instruction can execute.
So your processor is doing nothing at all for three cycles.
Now three cycles isn't a lot, but if this is a loop that
gets called a lot, your processor is doing nothing at all
for three cycles every iteration of the loop.
That's wasting energy.
The same thing happens on the add instruction.
Again, we're waiting for the result of the
multiply before we can add it to the accumulator.
So for a couple of cycles, the
processor is doing no useful work.
It's just spending energy.
Let's take a look at the vDSP_dotpr
function and see what we did to do better.
This is sort of the critical inner loop
of the vDSP_dotpr function.
Not all cases fall through this.
If you have a very short vector,
you won't end up in this loop.
If your vectors are misaligned, you won't end up here.
But in general, most of the time, this is the
core loop that's doing the bulk of the work.
Again, I'm going to highlight the
instructions that are doing useful work here.
First thing to notice is that we have a lot less overhead.
Whereas that simple for loop we had was computing
one term of the dot product every iteration,
and had six instructions' worth of overhead.
Here, we have only two instructions of
overhead, but we're computing 16 terms
of the dot product every pass through this loop.
Now, how are we doing 16 at once?
This loop actually has four separate chains
of computation interleaved with each other.
I'm going to highlight one of them.
This is going to compute four terms of the
dot product using the NEON vector instructions
that we were talking about before.
So first, we load four floats simultaneously from A, we
load four floats from B, we multiply those four pairs
of floating point numbers together to
get four pairwise products, and then--
this actually happens on the next
iteration through the loop--
we add those four products to the four
running sums that we're maintaining.
Now, you could do that yourself.
You can write that code-- it's not terribly hard.
It takes some work.
But if you use Accelerate, it's really easy.
You get all the performance benefits of that, but
you just have to write that one function call.
I think that's great.
What else is in the vDSP library?
Besides dot products, we have lots
of other basic operations on arrays.
You can add, subtract, multiply arrays of floating
point data, there are conversions between types,
there's an accumulation function like the dot product,
there's also things that sum arrays,
there's lots of other stuff.
You'll have to look through the
manual to learn about all of them.
There's also more complicated functions
that are not so easy to write yourself.
There's convolution and correlation functions, and
there's also a nice set of fast Fourier transforms.
There's both complex to complex, and
real to complex Fourier transforms,
and there's support for any power-of-two
size and some non-power-of-two sizes.
The data types supported by vDSP include both single
and double precision, real and complex data,
and there's also support for strided data access.
Those are the two 1s that we saw
in the vDSP_dotpr signature before.
What strided data access does is let you operate not on
every element of an array, but on every nth element of an array.
So if I wanted to operate on the 0th
element, the 2nd element, the 4th element,
the 6th element, I would use a stride of 2.
If I wanted to operate on every third
element, I would have a stride of 3.
So sometimes, you need to do this because your data
comes in in a format where it's laid out that way.
Strided data access is there as a utility so
that you can still use the Accelerate functions
without having to deinterleave the data yourself first.
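As a hypothetical example of what that looks like in a call, suppose a buffer holds interleaved stereo samples and you only want the left channel; the names here are made up.

    #include <Accelerate/Accelerate.h>

    // Dot product over just the left channel (every 2nd float) of an
    // interleaved stereo buffer, against a contiguous array of coefficients.
    float left_channel_dot(const float *stereo, const float *coeffs,
                           vDSP_Length frameCount)
    {
        float result;
        vDSP_dotpr(stereo, 2,     // stride 2: every other element
                   coeffs, 1,     // stride 1: every element
                   &result, frameCount);
        return result;
    }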
Alright. So now, let's take a look at the FFT.
Here, I've got the basic setup for calling an FFT.
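That setup looks something like the sketch below; the variable names are mine and error checking is omitted.

    #include <Accelerate/Accelerate.h>
    #include <stdlib.h>

    void fft_example(void)
    {
        const vDSP_Length n     = 1024;   // transform length
        const vDSP_Length log2n = 10;     // log2(1024)

        // Split-complex input and output: separate real and imaginary arrays.
        DSPSplitComplex input, output;
        input.realp  = malloc(n * sizeof(float));
        input.imagp  = malloc(n * sizeof(float));
        output.realp = malloc(n * sizeof(float));
        output.imagp = malloc(n * sizeof(float));

        // Create the opaque FFT setup once; it can be reused for many FFTs.
        FFTSetup setup = vDSP_create_fftsetup(log2n, kFFTRadix2);

        /* ... the transforms shown below go here ... */
    }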
The first thing is, I'm including the Accelerate header,
and I'm also including standard lib
because I'm going to use malloc.
The FFTs in vDSP actually take the length as
the logarithm base 2 of the length.
So I'm going to perform a 1024 element FFT.
The log base 2 of 1024 is 10,
so I'm just constructing a variable
to hold the length here, and also the log of the length.
The FFTs take their input and output
in these DSP split complex objects.
They have a pointer to the real part
and a pointer to the imaginary part.
By laying them out this way, we can make things a
little bit more efficient in the body of the FFT,
and I'm just allocating space for 1024 real parts
in the input, 1024 imaginary parts in the input,
1024 real parts in the output, and 1024 imaginary
parts in the output.
This last line is creating this
opaque structure called an FFT setup.
The FFT setup contains weights and other data that are
needed by the FFT routines to do their computation.
You have to create a setup before you can do an FFT,
but you really only want to create a setup once.
Creating a setup is expensive and slow.
And so, what you want to do is make one
setup and then use that to perform many FFTs.
You can reuse it over and over again for lots of FFTs.
Now, I have that setup.
Suppose I had put some useful data into my input array,
something that I want to take a Fourier transform of,
I'm going to call this function, vDSP_fft_zop.
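Continuing the sketch above, that call looks roughly like this:

    /* Inside fft_example, after filling input.realp and input.imagp: */
    vDSP_fft_zop(setup, &input, 1,      // input, with stride 1
                 &output, 1,            // output, with stride 1
                 log2n, FFT_FORWARD);   // length as a log2, forward direction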
That name says it's complex to complex, that's the
Z for complex, and OP means "out of place".
So it's not going to overwrite its input.
It's going to store the result of
the FFT in the output structure.
So I pass it the setup that I created, I pass it
the input, that 1 is the stride for the input,
I pass it the output, again with stride 1, I
give it the length as a log base 2, and this flag,
FFT_FORWARD, tells it to perform a forward FFT.
That's all you have to do.
Now, suppose I then want to do the inverse
transform and get back to where I started.
Here, I'm using the fft_zip function instead of zop.
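Continuing the same sketch, that call is roughly:

    /* Still inside fft_example: inverse transform, in place, overwriting 'output'. */
    vDSP_fft_zip(setup, &output, 1, log2n, FFT_INVERSE);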
Zip is a complex to complex in-place FFT, so it
overwrites its input with the result of the transform.
So what this is going to do, it takes that same
setup, the call looks almost the same as zop,
except it only has one DSP split complex structure, so
it's going to take the output of the forward transform
that we did and overwrite it with the inverse transform of
the forward transform, and we give it that FFT_INVERSE flag
to tell it that it's doing an inverse Fourier transform.
Now, you might think that the inverse transform of the
forward transform for a discrete FFT is the original data.
And it almost is, except that you have to rescale.
Everything gets scaled by a factor of N when
you do a forward followed by an inverse FFT.
Here, I'm using another vDSP function, the
vDSP_vsmul function, to undo that scaling.
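In the sketch, that rescaling looks roughly like this:

    /* Still inside fft_example: undo the factor of n from forward plus inverse. */
    float scale = 1.0f / n;
    vDSP_vsmul(output.realp, 1, &scale, output.realp, 1, n);   // scale the real parts
    vDSP_vsmul(output.imagp, 1, &scale, output.imagp, 1, n);   // scale the imaginary parts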
I create a scale factor of 1 over N, and vsmul
stands for Vector Scalar Multiply.
And so, what it's going to do is it's going to take
both that-- the first line, takes that real pointer,
and scales all of the real results by 1 over N.
The second one takes the imaginary pointer,
scales all the imaginary results by 1 over N.
So now, up to floating point rounding, we've
gotten back the data that we started with.
Usually, we would do some useful work with the data at
some point here, rather than just going forward and back.
But I leave that up to you to come up with cool
things you can do in your app with the FFT.
That's not my department.
This last line, vDSP_destroy_fftsetup,
just cleans up the data in that opaque
structure that was allocated by vDSP_create_fftsetup.
So it's basically free for an FFT setup.
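In the sketch, that's just:

    /* Once fft_example is completely done doing FFTs: */
    vDSP_destroy_fftsetup(setup);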
So that's an example of the FFTs.
What about performance?
Now, I can't really compare the
FFTs to a simple for loop in C.
They're a lot more complicated.
And even if I did compare them to a simple C
program, it would be so much slower, the C program,
that it wouldn't even really show up on the graph.
So instead, I'm going to compare to FFTW.
FFTW stands for the Fastest Fourier Transform in the West.
It's pretty much the best commercial portable FFT library.
You can build it for PowerPC, you can build it for Intel,
you can build it on ARM, you can build it on all kinds
of platforms, and it gives you really quite
good performance on all of those platforms.
If you want to learn more about it, you can
go to fftw.org, you can download it there.
It's a really nice library.
So how do we compare?
Here, I'm graphing the execution time
for 1024 element single-precision FFT.
So this is something that typically would be used in
audio processing, but also on a lot of other things.
A lot of audio processing both on the Mac and
the phone uses exactly this case very frequently.
And smaller bars are better here.
We're looking at the actual time it takes to execute.
vDSP_fft_zop is five times as fast as FFTW.
FFTW is an excellent library, but we're better.
[ Applause ]
I'm not holding my breath for them to rename it though.
Maybe the Faster Fourier Transform in the East.
So that's the FFTs.
Let's talk a little bit about LAPACK which
stands for the Linear Algebra Package.
Now, LAPACK is a very venerable library.
Its roots go way back in time.
Its roots in fact, predate the C language even.
It's been around the long time in one form or another.
It provides high-level linear algebra operations.
It can solve linear systems.
If you have a system of linear equations, four
equations and four unknowns, LAPACK can solve it.
It can also solve it even if you have six equations and four
unknowns, or two equations and eight unknowns, or whatever.
It's got all kinds of solvers.
It provides matrix factorizations.
It lets you compute eigenvalues and eigenvectors.
So this provides a lot of abstract linear algebra
operations that you'd like to be able to do on matrices.
LAPACK supports a bunch of different data types.
It supports both single and double
precision, it supports real and complex data,
it has lots of support for special matrix types.
You can do operations on symmetric
matrices, triangular matrices,
banded matrices, Hermitian matrices, lots of stuff.
There's one gotcha though.
Data in the LAPACK library, because its roots go
way back in time, is laid out in column-major order.
Anytime you're building matrices that
are going to interact with LAPACK,
you're going to need to lay them out this way.
And what does that mean?
Those of us who speak western languages are
used to reading left to right, top to bottom.
So we might read this matrix as
1 minus 1, 1 minus 1, 1, 2, 4, 8.
That's not the order that's used by LAPACK.
Column-major order means that first, the
first column is stored contiguously in memory.
Then, the next column, the next column, and the next column.
So each column is contiguous.
Whereas, if you're reading across a
row, the access is going to be strided.
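As a small concrete example (my own numbers, not the matrix from the slide), a 2 by 3 matrix with rows 1 2 3 and 4 5 6 would be stored like this:

    // Column-major storage of the 2x3 matrix
    //   [ 1 2 3 ]
    //   [ 4 5 6 ]
    // Each column is contiguous; element (row i, column j) is M[i + j*2],
    // where 2 is the number of rows (the "leading dimension").
    float M[6] = { 1, 4,   2, 5,   3, 6 };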
Let's look at the example of using LAPACK
to solve the system of linear equations.
This is one of the most basic things
that LAPACK has support for.
It also has support for lots of
much more complicated stuff.
So here, I've got a matrix A and a vector B,
and I want to solve the equation Ax equals B.
So I'm looking for a vector x that satisfies this.
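Put together, it looks roughly like the sketch below. The transcript doesn't give the exact system size or right-hand side, so this assumes a 4 by 4 system and made-up values for B; the structure of the call is what matters.

    #include <Accelerate/Accelerate.h>
    #include <stdio.h>

    int main(void)
    {
        // 4x4 matrix, filled using the A[j][i] trick described below so the
        // data ends up in the column-major order that LAPACK expects.
        float A[4][4];
        for (int i = 0; i < 4; ++i) {
            for (int j = 0; j < 4; ++j) {
                if (j < i)        A[j][i] = -1.0f;   // lower triangle
                else if (j == i)  A[j][i] =  1.0f;   // diagonal
                else if (j == 3)  A[j][i] =  1.0f;   // rightmost column
                else              A[j][i] =  0.0f;   // interior upper triangle
            }
        }

        float b[4] = { 1.0f, 2.0f, 3.0f, 4.0f };     // hypothetical right-hand side

        __CLPK_integer n = 4, nrhs = 1, info = 0;
        __CLPK_integer ipiv[4];                      // pivot workspace

        // Solve Ax = b; on return, b has been overwritten with the solution x.
        sgesv_(&n, &nrhs, &A[0][0], &n, ipiv, b, &n, &info);

        if (info == 0)
            printf("x = [ %g %g %g %g ]\n", b[0], b[1], b[2], b[3]);
        else
            printf("could not solve the system (info = %d)\n", (int)info);
        return 0;
    }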
First thing, I'm going to set up my matrix A.
I've got two nested for loops here, I'm iterating
first over the rows, and then over the columns.
And here, I'm setting up the lower triangle.
Those are all the minus one elements of that matrix.
Then I'm going to set up the diagonal elements, these
are all one, and then set up the rightmost column,
all ones as well, and then we have these
inner upper triangle elements that are zeros.
So I've set up my matrix here.
You'll notice that for the matrix access
that I have here, I'm using A[j][i].
Normally-- if you took a linear algebra class, you're
probably expecting A[i][j] when you're describing the ith row
and the jth column of a matrix; they're transposed
explicitly so that we get that column-major ordering.
This is one easy way to get the column-major access.
You can't always use this trick.
Sometimes, you need to actually
explicitly write out your accesses.
But when you can use this, it works nicely.
So now, let's look at actually solving that system.
We've got our right hand side vector here B, that's the
vector that we're trying to-- we're trying to solve for.
And then we make a call to the sgesv function.
sgesv stands for Single Precision, General Solve.
So our data here is floats, that's single precision.
General is what we use for just a normal rectangular matrix.
It's not anything fancy.
It's not symmetric, it's not banded.
If you're just working with a matrix and you don't know
anything special about it, you're always going to be looking
at these routines that have the GE prefix.
Now, what we passed to the SGESV routine is the size
of the system and the number of right hand sides.
That's the number of vectors B that we're solving for.
That's nrhs.
We give it the matrix A.
The next n parameter here is the
leading dimension of the matrix.
Most of the time, this is just going to
be the dimension of your matrix for you.
If you're working with the matrix that you built yourself,
it's almost always going to be
just the dimension of the matrix.
In special circumstances, it may be something else, I leave
you to look at the LAPACK documentation to learn about that.
We're also going to give it our
vector B that we're solving for.
We give it this info parameter.
That is a value that it uses as a flag, to
indicate whether it was able to solve the system.
Not all systems have solutions.
So if it can't solve it, info will
return with a nonzero value.
There's also this one value, ipiv, that's
a temporary workspace that's needed by it.
On return, it contains some of the
factorization information on the matrix.
So that's all we do, and then I'm
going to print out the result.
You'll note that the result has overwritten
the right hand side B that we started with.
This is something that happens throughout LAPACK.
Almost all the LAPACK routines overwrite
some of their inputs to hold the results.
This means that if you need to keep the
inputs around, you're usually going to need
to make a copy of them when you're working with LAPACK.
That's something important to be aware of.
So I have compiled it here, I'm going to run it.
You'll note that I'm linking against the Accelerate framework.
You have to link against the Accelerate
framework in order to take advantage of it.
Most of the time, you're going to be doing this in Xcode,
so you just need to go to the Add Frameworks menu item,
select the Accelerate framework and bring it in.
Here, I'm compiling on the command line, so
I used this "-framework Accelerate" flag.
So I'm going to build this and run
it, and it solves our linear system.
This happens to be a very simple system.
It has a very simple solution.
But you can use it for all kinds of more complicated stuff.
So that's a little overview of LAPACK.
Let's talk about BLAS.
BLAS stands for the Basic Linear Algebra Subroutines.
It's sort of the low-level underpinnings beneath LAPACK.
But you can also call those routines yourself.
As I mentioned, it's low level.
It provides low-level linear algebra operations.
LAPACK gave you high-level abstract operations on
matrices: solve this system, factor this matrix.
BLAS is low-level stuff that's really the
basic arithmetic operations on matrices.
It gives you vector-vector operations like dot
products, scalar products, and vector sums.
It gives you matrix-vector operations, like
if you want to multiply a vector by a matrix,
or if you want to take the outer product
of two vectors to update a matrix.
It gives you that.
And it also gives you operations on
two matrices like the matrix multiply.
BLAS, like LAPACK, supports lots of different data types,
supports both single and double precision, supports real
and complex data, and it has support
for multiple data layouts.
It's got-- unlike LAPACK, it supports
both row and column-major order.
This is good news because it means
that if you're not working with LAPACK,
if you're only going to use the BLAS routines, you can lay
out your data in row-major order, which is the natural order
for most C programmers, and just use it easily that way.
If you're going to work with LAPACK as well though,
you're probably going to want to lay your data
out in column-major order still so you
don't have to explicitly transpose it
when you're passing data to the LAPACK routines.
Like LAPACK, it has support for dense matrices, banded
matrices, triangular matrices, lots of different matrix types.
Another nice thing about the BLAS routines is that they can
operate on your matrix data as though it were transposed,
or as though it were a conjugate transpose.
So oftentimes you don't need to explicitly transpose
the data yourself when you're working with them.
This is really useful, and if you're using the
BLAS, I encourage you to take advantage of this
because transposition is reasonably expensive.
And in general, we can do it more efficiently if we do
it as part of whatever other operations you're doing.
This is a really, really basic example of the BLAS.
This is matrix multiply, and it's multiplying two 2 by
2 matrices, A and B together here, to get a matrix C.
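The call looks roughly like this (a sketch; the matrix values here are my own):

    #include <Accelerate/Accelerate.h>

    float A[4] = { 1.0f, 2.0f, 3.0f, 4.0f };   // 2x2, row-major
    float B[4] = { 5.0f, 6.0f, 7.0f, 8.0f };   // 2x2, row-major
    float C[4];                                 // 2x2 result

    void multiply(void)
    {
        cblas_sgemm(CblasRowMajor,       // the matrices are laid out row-major
                    CblasNoTrans,        // don't transpose A
                    CblasNoTrans,        // don't transpose B
                    2, 2, 2,             // the dimensions: M, N, K
                    1.0f,                // scale factor applied to A times B
                    A, 2,                // A and its leading dimension
                    B, 2,                // B and its leading dimension
                    0.0f,                // scale factor applied to the existing C
                    C, 2);               // C and its leading dimension
    }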
This is a 2 by 2 example.
You can do a 1000 by 1000, whatever you want.
It's pretty easy to use, it's got a clean C interface,
it's a little nicer to work with than LAPACK.
Here, I'm passing it the CblasRowMajor flag that tells
it-- that my matrices are laid out in a row-major order.
The CblasNoTrans flag is indicating that I don't want
it to do a transposition as part of the multiplication.
If I wanted to implicitly transpose one of the
matrices, I would pass it a CblasTrans flag.
The sea of 2's in the call are
the dimensions of the matrices.
The first three are the dimensions of A, B, and C.
The floating point 1 value, that's a
scale factor to apply to the product.
So I'm saying multiply A times B, scale that by 1,
the 0 before the C is a scale factor to apply to C.
So here, I-- well, it's actually going
to compute A times B plus a scaled C,
and I'm scaling C by zero, so the result
I get is just A times B.
It's basic matrix multiply.
That's a very basic example of what's available in the BLAS.
Let's talk a little bit about how
to use Accelerate effectively.
First off, why should you use Accelerate?
We saw already that Accelerate gives
you great performance, assuming you use it properly.
Another thing that's really nice about
it is it gives you functionality.
Dot product is really easy to write.
You can write your own dot product
in, you know, a couple of seconds.
But if you want to compute the eigenvalues of a matrix,
or you want to compute a 1024 element
FFT, those are not easy routines to write.
You know, if you write them a lot, maybe
you can write them off the top of your head.
But most people are going to need to go look on
Wikipedia or dig out an old reference book from college,
something like that, find out how to implement
it, have to spend a couple of days debugging it.
You really don't want to do that.
You want to spend time writing a great app.
So take advantage of Accelerate, all these routines
are written for you already, you can spend your time
on adding features, doing other cool stuff
that users of your app will appreciate.
It's also easy to use.
This goes back to what I was just talking about.
You don't need to know a lot about
what's going on behind the scenes.
If you do know a lot, that's great.
It will make it even easier to-- for
you to find what you're looking for.
But if you don't know a lot about these abstract
mathematical operations that you're performing,
you can still take advantage of Accelerate
and get good performance, good energy usage,
and you're also going to get architecture independence.
Normally, if you want to get great performance and great
energy usage, you have to write architecture specific code.
This goes back to what I was talking
about with how the ARMv6 architecture,
the ARMv7 architecture are very different.
If you want to support the iPhone 3G, which a lot of people
have, then-- and you still want to get great performance,
you're going to need to have two separate
code paths if you write this yourself.
If you use Accelerate, we already
have those two separate code paths.
We're going to give you good performance on both platforms,
and you're going to get-- it's
going to work in the simulator too.
You don't have to write a separate set of fallback
code for your ARM assembly to use in the simulator.
Those are some reasons to use Accelerate.
There are some design trade-offs that we make in
working on Accelerate that you should be aware
of when you use the library, to make
sure you can take good advantage of it.
We always try to make the common cases fast.
It's because they're the common cases, they're
where most of the work is going to be done.
We want those to be as fast as possible.
So we spent a lot of effort making the common cases fast.
But because we want to provide general functionality,
we want to make sure you can use the
Accelerate framework for all your stuff.
You don't need to worry about writing an FFT.
Some-- we also support lots of
cases that aren't the common case.
Now, they'll usually be fast.
We make them fast too.
But the uncommon cases are usually the uncommon
cases because there's something weird about them.
And the fact that there's something
weird about them often means
that they just can't be as fast as the common case can.
Strided access will never be as fast as
unstrided access on a vector processor.
It's just the way it is.
So when you're writing your application,
when you're laying out your algorithms,
you want to use the simple common case as often as you can.
If you have to use the other cases
that are supported there, it'll work,
but try to structure things so
that you're using the common case.
Continuing with that, you should be aware of
the size of the buffers that you're using.
If you're making lots and lots of calls with tiny
buffers, then you're paying a lot of call overhead.
There are registers that need to
be saved and restored on every call.
That takes time.
Calling a function takes time.
You want to spend as little time as possible doing that.
You want your processor spending as
much time as possible doing real work.
So this means that you don't want to make
a thousand calls with length 3 vectors.
It's much better to make three calls
with vectors of length a thousand.
So really try to structure algorithms so that you can use--
so you're not making tons and tons of
tiny, tiny calls into the framework.
At the same time, you don't want your
data, your buffers to be too large.
If they're too large, then it's much
harder for the algorithms to benefit
from the cache that's available on the processor.
Usually, when you call one vDSP function, you're probably
going to call another vDSP function with the output,
or your own code that takes the output and does something.
You'd really like all of the output
to be in the cache on the processor
so that it's available for whatever comes next.
It doesn't need to fetch it from memory.
So it's good to aim for a working set.
What I mean by a working set is sort of the amount of data
that's live at any point in time in your algorithm.
What you're really computing with at any moment.
You want to aim for, you know, about 32
kilobytes, about the size of L1 cache.
That's a good working set to aim for.
What this means is that, you know, if you're working with
single precision data or integers, vectors, you know,
length a thousand to four thousand, that kind
of range, that's a really nice range to aim for.
That's where you're going to get the best performance.
I keep coming back to this, but I'm going to say it again.
Use contiguous arrays if you can.
We have the strided access.
If you can avoid it, don't use it.
So how do you not use it?
A lot of times, you can't do-- you can't change
a data structure that you're getting as input.
Some other guy wrote it, you can't
go make him change all of his code,
it's coming from the internet,
you can't change that, who knows.
You're getting data in, it's strided,
you have to deal with it.
However, the fact that the input to one of these routines
is strided doesn't mean the output has to be strided.
So what you want to try to do is, on the very first call to a
vDSP routine or a LAPACK or BLAS routine, let the input come
in strided, but produce a contiguous stride 1 output.
That way, all the subsequent operations
that you do can be faster.
If you have to produce a strided output at the
end of your computation,
keep everything stride 1 right up until that last step,
and then use a strided output for the final result.
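A hypothetical sketch of that pattern, with made-up names: the incoming buffer holds interleaved stereo samples, so the left channel is strided; the first call reads it with a stride of 2 but writes a contiguous result, and everything after that can run at stride 1.

    #include <Accelerate/Accelerate.h>

    void scale_left_channel(const float *interleaved, float *leftScaled,
                            vDSP_Length frameCount)
    {
        float gain = 0.5f;
        vDSP_vsmul(interleaved, 2,   // strided input: every 2nd sample
                   &gain,
                   leftScaled, 1,    // contiguous, stride 1 output
                   frameCount);
        /* ...subsequent calls on leftScaled all use stride 1... */
    }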
Another thing to be aware of is just the architecture.
Assuming that you're supporting the ARMv6 products, make
sure to build your binary for both ARMv6 and ARMv7.
This will let your compiled code get the
best performance on both architectures,
and you should also prefer single precision floating point,
floats, to double precision, doubles, whenever you can.
Float is a little bit faster than double on ARMv6,
mostly because you can fit more data in cache.
But on ARMv7, on the current architecture,
float is enormously faster than double.
The reason for that is that float executes on
the NEON unit, which is much faster than VFP,
and double executes on the legacy VFP unit.
Also, because it executes on the NEON
unit, float can benefit from vectorization.
When you call the Accelerate routines, we can take
advantage of vectorization on single precision data.
There's not really many opportunities,
not really any at all,
for vectorization on the ARMv6 architecture
with either float or double data.
And the VFP unit, which is what's used for double on
ARMv7, also doesn't really allow you to vectorize.
So if you can use single precision, you're going to get the
best performance that way on ARMv7, both in your own code
and when you call the Accelerate framework.
You can't always use it.
Sometimes, your algorithm is so numerically
sensitive that you're just stuck with double.
That's how it is.
But a lot of times, you can use it.
So check, see if you can use float.
If you can, do, and you'll get better performance.
If you want more information about Accelerate, you can
talk to our technology evangelist, Paul Danbold,
and you can also post on the Apple
Developer Forums if you have questions.
We have some documentation available on the vDSP library.
LAPACK and BLAS are industry standard libraries.
Probably the best documentation for them is the
freely available documentation at netlib.org.
There's also an excellent book on
LAPACK, the LAPACK User's Guide.
There are links to purchase a copy, if you want one,
on netlib.org.
And the headers for the system
are also a great place to look.
That's where all these functions are defined.
You can start poking around in there, see what
looks interesting, what you might like to use.