WWDC2013 Session 713

Transcript

>> Good afternoon.
Welcome to the Accelerate
Framework Session.
My name's Jeff Belcher.
I'm an engineer in the
Vector and Numerics Group.
Today I want to start off
with a pretty common scenario.
Imagine you've got a great
idea for an application,
and that application
has a computationally
intensive component.
You look around and you
find an open source solution
to the problem, you bring
it into your application,
you test it, and you find it's too slow,
or maybe it's a battery drain.
At this point you're forced to
spend the next several hours
or maybe days, profiling
and optimizing that code
to get the performance to
where you need it to be.
We don't think that's right.
The goal of the Accelerate
Framework is
to solve this problem.
The Accelerate Framework is
a collection of functions
of commonly used computationally
intensive operations.
The Accelerate Framework is
designed to be high performance
and deliver great
energy savings for all
of these APIs that
are available.
When you adopt the Accelerate
Framework you're going
to get great performance and
amazing energy characteristics
from the smallest
iPhone all the way
up through the biggest Mac Pro
without changing a single
line of code on your end.
Let's dive into the details
of the Accelerate Framework
and see how it can help you
make a really great app.
So what is the Accelerate
Framework?
When you think Accelerate
Framework there's a few things
that I want you to remember.
First, easy access to
a lot of functionality.
There's more than
2,000 APIs available
on the Accelerate Framework.
Throughout the rest of
the talk we'll break this
down into four easy-to-remember
categories
and show you what
exactly is available.
Think accurate.
We spent a lot of time testing
so that you don't have to.
The big one is fast
with low energy usage.
You guys really pushed
the limits
of the hardware available today
with your great applications.
When you use the Accelerate
Framework you're going
to get great performance,
and that's going to come
with amazing energy
characteristics.
The best part for you is it
works great on both OS X and iOS
and it's optimized for all
generations of hardware,
so when new hardware
comes out you're not going
to have to revisit your code.
So I mentioned that there's
a lot of functionality
and the Accelerate Framework
is geared toward commonly used
computationally intensive
operations,
but what exactly is available?
We break it down into
these four categories.
First we've got image
processing, with vImage,
we've got digital signal
processing in vDSP,
transcendental math functions
in vForce and vMathLib,
and finally, linear
algebra in LAPACK and BLAS.
At the end of this talk
there's a few points
that I want you to
come away with.
The first of these is how the
Accelerate Framework can help
you create a really
great application.
I'm going to show
you some examples
of real world performance
and energy savings
that you can expect
when you utilize the
Accelerate Framework.
I want you to have an
idea of areas of your code
that are likely to benefit
from the Accelerate Framework,
and finally, how to use
the Accelerate Framework.
So this is going to
range from linking
against the Accelerate Framework
up through some tips and tricks
that can really allow
you to get the most
out of the Accelerate Framework.
I want to move now to why the
Accelerate Framework is fast.
Understanding why the Accelerate
Framework is fast can help
in understanding when and why
to use the Accelerate Framework.
One of the big reasons the
Accelerate Framework is fast is
we utilize SIMD instructions.
This is Single Instruction
Multiple Data.
For those of you
unfamiliar, if we're trying
to for example add 2 arrays
together, there are instructions
on current hardware
that allow us
to add multiple elements
simultaneously.
For those of you more
familiar with SIMD operations,
on Intel this means
we're taking advantage
of SSE, AVX, and now AVX2.
On ARM we're taking
advantage of NEON.
Utilizing SIMD instructions
in certain situations can
have significant energy
and performance savings.
We also spend a lot of time
matching the microarchitecture
for the complete
Apple hardware lineup.
This includes optimizations
like instruction selection
and instruction scheduling,
as well as software
pipelining and loop unrolling.
So I bring these up because
it requires a certain amount
of data before optimizations
like loop unrolling
become beneficial,
so it helps to understand
that this is sometimes
happening behind the scenes
in the Accelerate Framework.
The last reason the
Accelerate Framework is fast is
because it's multithreaded
using GCD.
When it's appropriate we're
going to take advantage
of all the cores available.
So I wanted to talk
about why it's fast
so that you have an
understanding of where some
of the tips for successful use
of the Accelerate
Framework come from.
The first tip is
preparation of your data.
When you prepare your
data there's a few things
that I want you to remember.
The first is if you can
make your data contiguous.
This means that if
you're creating an array,
you want to make that array
such that the elements
are contiguous.
If you're allocating or
have control over the layout
of that buffer in memory,
and if you can align the beginning
of that buffer to
a 16-byte boundary,
that's going to be ideal.
With the Accelerate
Framework we always strive
to deliver the greatest
performance,
but if you can meet
these recommendations,
in certain situations we
can algorithmically exploit
that to give you
slightly more performance.
The next tip is to
understand the problem size.
Any function call has a
cost associated with it.
The Accelerate Framework
is not immune to this.
On the previous slide
we also saw
that in certain situations
optimizations
like loop unrolling are used.
What this means for you is
that when you're
using really small --
when you're using the
Accelerate Framework
with really small datasets,
it may not deliver
the best performance.
There's not a problem size
that I can say don't use
the Accelerate Framework
for something that's
small; it's going to depend
on the operation
you're performing.
For example, if you're scaling a
vector it might be on the order
of 100 elements; whereas
if you have a more complicated
operation for example,
Matrix Multiply, it could
be as small as 8 elements.
The best thing you can
do here is to experiment.
The Accelerate Framework
is always going
to deliver the correct
functionality;
it's just that for these smaller
problem sizes it may not be the
best performance.
The last tip for successful
use is to do setup once
and destroy once at the end.
There's a handful of operations
in the Accelerate Framework
that require a setup structure.
Creating this setup
structure can be costly
and time-consuming.
These setup structures
are designed
to be used multiple times, so if
you find yourself in a situation
where you need to do these
setups, create the setup,
do all of the computation that
you want to do with that setup,
and then destroy
once at the end.
Throughout the rest of the talk
we'll see some examples of this
and it will become more clear.
Now I want to move on to using
the Accelerate Framework.
For those of you brand new
to the Accelerate Framework,
including it is just like
including any other framework.
Here we have a typical Xcode
project, and we're just going
to navigate to the build phases.
In the build phases we're
going to find the Link Binary
with Libraries
section and we're going
to find the Plus button.
This brings up the list
of available frameworks.
The Accelerate Framework's
right at the top,
we'll just select
it and click Add.
And then we can be sure that the
Accelerate Framework is included
in our project because
it's going to show
up in that Link Binary
with Libraries section.
The only other step to using
the Accelerate Framework is
to include the headers.
This is Accelerate/Accelerate.h.
That's all it takes
to use the Accelerate Framework.
Linking from the Command
line is just as easy.
In your link step simply
include -framework Accelerate.
So now I want to dive into the
details of what's available
in the Accelerate Framework.
I mentioned there's
over 2,000 APIs
and we've got these four
categories so we'll start
to step through these now.
And we'll begin with
image processing.
For image processing
we have vImage,
our vectorized image
processing library.
There's a lot of
functionality in vImage,
and rather than just list it
I put together a short video
to show you some of the
features that are available.
We've got alpha blending
and alpha compositing,
dilation, erosion.
You can create Sobel filters
to perform edge detection,
various types of convolutions
to perform blur, deblur,
or multi-kernel convolutions,
max filters, min filters,
color transformations,
warps and Shears.
So this is just some of
what you'll find in vImage.
We also have some great
additions and improvements
in both iOS 7 and OS X.
First we have improved
conversion support.
Conversions are operations
like converting between planar
and chunky data or changing
between pixel component types,
so an 8-bit image format
to a 16-bit image format
or a floating point image
format, just to name a few.
We also introduced vImage
buffer creation utilities,
so in the tips I talked
about how important it is
to create a buffer,
getting the alignment right
and getting everything
contiguous, so to take some
of the guesswork out
of that for vImage,
we introduced the utilities
where you can just specify
the size of the image,
and this function will create
the appropriately sized buffer
to deliver the maximum
performance.
We also introduced
resampling of 16-bit images,
so all the operations like Warp
and Shear that were available
for 8-bit and floating point
image formats are now available
for 16-bit image
formats as well.
The last addition is streamlined
core graphics interoperability.
This is a big one, and I
want to dive into the details
of this with an example.
So we get this question a lot.
How do I use vImage
with my CGImage ref?
To solve this problem
we introduced two new
utility functions.
To go from CGImage
ref to vImage buffer,
we introduced a utility function,
vImageBuffer_InitWithCGImage,
and for the reverse direction,
we introduced
vImageCreateCGImageFromBuffer.
Let's take a look at
an example of this,
and see just how
easy it is to use.
So here we're going
to look at how to go
from a CGImage ref
to a vImage buffer.
As always, we're going to begin
by including the
Accelerate Framework header
and then we're going to
open a CGImage ref.
I'm not going to go through
the details of this here.
There's a lot of documentation
and examples of this,
but assume after this line that
we have our CGImage ref open.
The first step that we're going
to do then is specify
the image format.
This image format describes
the format of the vImage buffer
that we want to create.
We've introduced the
vImage_CGImageFormat structure.
You'll find several elements
in here; for example,
bits per component,
bits per pixel,
information about the color
and bitmap info to name a few.
This descriptor is
describing an ARGB 8-bit image.
We see that the first entry
in this structure is
bits per component of 8,
so each component in the
picture is going to be 8 bits.
The bits per pixel is 32,
so there's going
to be 4 components.
Color space, we pass null.
When we pass null this
means that we're going
to get a default
RGB color space,
so we have 3 color components.
And then in the bitmap info,
we have kCGImageAlphaFirst.
This means we have a
single alpha component
and it's the first component.
So this describes our
8-bit ARGB image format.
With this format we're going to
call vImageBuffer_InitWithCGImage.
The first argument is the
input buffer that we want
to create from our CGImage ref.
The second argument
is the reference
to that format description
that we just created.
The third argument is
unused in this case.
This is information
about background color.
In certain conversions when
alpha channels are involved,
it may be necessary
to provide information
about a background color.
The next argument is the input image --
this is our CGImage ref
that we want to convert
to the vImage buffer, and
finally any additional flags.
In this case we don't have any
so we pass kvImageNoFlags.
Upon successful return
of this function,
we've allocated a
new vImage buffer.
It contains the image format,
the image and the format
that we've described,
and we're free
to at this point
release the CGImage ref.
The reverse is just as easy,
going from a vImage
buffer to a CGImage ref.
So we've done our
image processing,
and we have our vImage
buffer out buffer.
We haven't changed the
format so we're going
to use our same format specifier
that we created before.
To create the CGImage
ref we're going to call
vImageCreateCGImageFromBuffer.
The first argument is going
to be the output vImage buffer
that we just finished
processing,
that same format type, because
we haven't changed the format.
The next two arguments are a user
callback function and user data.
For this particular
conversion we don't need
that so we're just
going to pass null.
And then we pass
any additional flags.
Again, in this case
there are none,
so we pass kvImageNoFlags.
And then finally a
reference to a vImage_Error
to capture the error state.
Upon successful return
of this function,
we're going to return
the CGImage ref,
out image in this case.
This is going to be a freshly
allocated CGImage ref containing
the image information,
and we are free
to release the vImage buffer.
All of this is built
around a really powerful API
that we're introducing now
called vImageConvert_AnyToAny.
What vImageConvert_AnyToAny
does is it converts
between the image format
specifiers that we just saw,
so you'll create two of these
format types, one for the source
and one for the destination
type,
and you'll create a converter.
Once you've created this
converter, you can then convert
as many images as you
want from that source type
to that destination type.
So this is one of those cases
where you want to create
that converter once and use
it as many times as you can.
vImageConvert_AnyToAny
is really fast,
and I want to show
you an example of this
with a real world application.
I want to show you that with
software jpeg encode performance
running on the iPhone 5.
What I have here is a graph.
On the y-axis I've got
megapixels per second,
so this is the rate at
which we can perform
that software jpeg encode.
On the x-axis I have
various image format types.
For the sake of this example,
think of this software
jpeg encode
as happening in two steps.
Step one is to convert from
our input image format type,
so those that we
see on the x-axis;
to the image format type
that the encode step consumes,
and the second step is to
perform the actual encode.
What we're interested here
is step one, so converting
from the input image format type
to the format type
consumed by the encode.
Let's take a look at the
performance the original way.
We see a few things here.
First we see a lot
of variability.
For example, if you start
from an 8-bit RGBA image,
your encode performance is
going to be almost twice as fast
as if you start from a
floating point RGBA image.
The reason that this
is happening is
because step one is so variable.
So what we wanted to do
is change just step one.
We replace step one now with
vImageConvert_AnyToAny,
and let's look at
the performance.
We see everything
gets a lot faster now.
We also see that the
performance is quite consistent.
So our 8-bit RGBA image is
now only a few percent faster
than our floating
point RGBA image.
The reason that this happens is
because we reduced the amount
of time that we spent
in step one,
converting from the input image
format to the other format,
to a very small percent
of the overall operation.
This type of result is what you
can expect in your applications.
This is a real world
application.
vImage is delivering
great performance
and consistent results.
I want to stay on the
topic of conversion
for a little bit longer.
I want to talk about an example
of scaling a premultiplied
image.
A lot of people will have an
image format and they'll have it
in a vImage buffer and
they'll want to scale it.
They'll look through
vImage and see
that the only way you can scale
an image is in a non-premultiplied
image format.
So the way that you need to do
this is three steps in vImage.
I'm not going to go into the
details of each of these steps,
but in step one, we're going
to unpremultiply the data.
In step two, we're going
to perform the scale.
And then in step
three we're going
to premultiply the
results of that output.
A lot of people see this as
three times the amount of work,
and they get afraid
and they go off
and they implement
their own scale.
I want to show you how much time
we spent in each of these steps.
What I have here is the
percentage of time in each
of those three same
steps as we saw them.
At the top we see
unpremultiply, a little over 1%,
at the bottom we see the
premultiply, a little over 1/2%.
The vast majority of time is
spent in the actual operation.
What I want you to take away
from this is: don't shy away from
the conversions, they're fast.
If your image isn't in the right
format, use the conversions.
It's going to be worthwhile
getting the image into the right format.
Now I want to talk about
some performance of vImage
as compared to some of the
other options, and I want to do
that by comparing to OpenCV.
OpenCV is a third party open
source computer vision library.
It has an image processing
module.
That image processing
module has a lot
of the same functionality
that vImage has.
There's a couple points
that I want to compare.
The first is execution time.
Everybody wants their
applications to run fast.
The second is energy consumed.
We're increasingly reliant on
our batteries so it's important
that we get that
performance while being aware
of the energy consumption.
To begin we'll look at the
execution time and we'll do
that by looking at the
speedup of vImage over OpenCV.
So on this graph I've
got numbers where numbers
above 1 means vImage is going
to be that many times faster
than OpenCV, and for numbers
below 1 it means OpenCV is going
to be faster.
I've got a handful of operations
here, and we see that vImage is
between 1.6 and over 20
times faster than OpenCV,
so these are some really
great performance results.
But as I mentioned, it's not
just all about performance.
We're concerned also with energy
consumption and battery life.
I want to explain this
relationship between performance
and energy consumption and
battery life a little bit,
and there's a few points.
First, fast code tends to
decrease energy consumption,
therefore, fast code tends
to increase battery life.
Let's look at why
this tends to happen.
What I have here is a typical
energy consumption profile.
So we're measuring the
instantaneous power.
Energy is the area
underneath that power curve.
So on the x-axis I've got time.
In the beginning, on the y-axis
I've got our instantaneous
power measurement.
In the beginning we're
running at some idle state
and using a very
small amount of power.
At time t0 our application
begins
and we increase the amount of
power that we're consuming.
The application runs
through time t1
and we return back
to some idle state.
The amount of battery that we're
using, the energy consumption,
is the area underneath
this curve.
Let's look at how an
optimized routine compares
to an unoptimized routine.
So here in blue I've got
an optimized routine --
much faster.
In certain situations it's
going to take more power to make
that routine run faster, but
the important part here is
that the energy consumption
is the area underneath,
and we can see that the
optimized routine is using
significantly less energy.
So now let's look at that
same vImage OpenCV comparison
for the energy numbers.
So I've got the vImage energy
savings over OpenCV here.
So again, numbers above
1 means vImage is using
that many times less
energy than OpenCV,
and for numbers below 1 it means
OpenCV is using less energy.
This ranges from .75 up through
almost 7 times less energy.
So we're delivering
really great performance,
and we're also delivering
really great energy savings.
This is what you can
expect in your applications.
We love to get feedback about
use of the Accelerate Framework
and we found this tweet I wanted
to share with you: "Using vImage
from the Accelerate Framework
to dynamically prerender
my spreads,
it's the only way
to make it fast."
Now I want to move on
to the next big category
of operations available on
the Accelerate Framework
and that is digital
signal processing.
You'll find digital
signal processing in vDSP.
This is our Vectorized Digital
Signal Processing library.
In vDSP you'll find basic
operations on arrays: additions,
subtractions, multiplies,
conversions, accumulations.
You'll also find discrete
Fourier transforms,
discrete cosine transforms,
as well as convolutions
and correlations.
In both iOS 7 and OS X 10.9,
we've introduced some great
new features and functionality.
The first of these is a
multi-channel IIR filter.
This is an infinite
impulse response filter.
So whereas before if you
needed to perform an IIR filter
on multiple channels, maybe you
have a surround sound system
that you want to
filter, you'd have to do
that with individual
calls into an IIR filter.
Now with this new
multi-channel you can do
that with a single function
call, and we've been able
to give you some great
performance and energy savings
by doing that operation
in a single function.
We've also improved
power of 2 support
for the discrete
Fourier transform
and the discrete
cosine transform.
I want to talk about
this with an example.
So before we essentially
had two entry points
for the same operation based
on the number of points
that you wanted to evaluate.
So if you had a power of 2,
you would call into the FFT.
If you had a non-power of 2
you would call into the DFT.
Starting in OS X 10.9 and iOS 7,
the DFT supports
certain powers of 2.
When the DFT supports the
number of points that you want
to compute, we recommend
that you use the DFT.
So this brings up another
question: How can I be sure
that my number of
points is supported?
If you can't find it in the
documentation for some reason,
you can always programmatically
check.
The DFT is one of the routines
that requires a setup structure,
and that setup creation
is designed to return 0
if the number of
points isn't supported.
You can always be sure
that you're using
the correct routine.
Let's look at an
example of the DFT.
Again, we'll start by including
the Accelerate Framework,
then we're going to create
and prepare our data.
In this case we've got 4
buffers, 2 input buffers,
one for the real numbers and
one for the imaginary numbers,
2 output buffers --
again, one for the real
and one for the imaginary.
We want to align
these if possible.
Then we're going to perform a
DFT setup, and we're going to do
that with vDSP_DFT_zop_CreateSetup.
Takes a few arguments.
The first argument
is information
about any previous setups
that may have occurred.
We don't have one in this case
so we'll pass zero or null.
The next is the number of points
that we want to compute, 1024,
and then information that
describes the DFT that we want
to perform, in this
case the forward DFT.
Once we've created a setup,
we're going to execute our DFT.
We do that with
vDSP_DFT_Execute, which
takes that setup structure
that we just created
and the 4 buffers that
we had set up before.
Again, we want to do this
as many times as we can
with that same setup structure.
We can use it over
and over again.
Once we've done all the
computation, one time at the end
we want to clean up our
setup with vDSP_DFT_DestroySetup.
So I want to do another
comparison now vDSP versus FFTW.
FFTW is called Fastest
Fourier Transform in the West.
This is another third party
freely available library,
supports one
and multidimensional
transformations,
both real and complex data.
It's parallel.
It's a good freely
available library.
It's a fair comparison.
I'm going to show
again the vDSP speedup
over FFTW on the iPhone 5.
So again, numbers above 1
means vDSP is going to be
that many times faster than FFTW
and for numbers below 1,
FFTW is going
to be faster than vDSP.
Across the x-axis I have
several number of points
that we're going to execute.
Let's take a look at the
performance that we get.
We see that vDSP is between
1.8 and about 2.5 times faster
than FFTW for all of these
number of points that we looked
at -- some really great
performance results.
It's one thing to look
at benchmarks, though.
It's another thing to
look at the performance
that you can expect
from a real application.
So imagine you need to encode an
audio signal using AAC Enhanced
Low Delay.
This is a process
that's done in FaceTime.
The DFT is one of many of
the vDSP routines in use,
but it's the only one that
we're looking at here.
And we're going to look at this
by looking at the percentage
of time that we spend
in the DFT.
So what I've got here is the
percentage of time for the DFT
at 54%, and everything
else in the operation at 47%.
This is when we're
linking against FFTW.
The only thing we change
is we link against vDSP
so that we get the
DFT out of vDSP.
And let's look at
how this changes.
When the DFT is replaced
with the DFT out of vDSP,
the time spent goes to 30%.
This translates to significant
performance and energy savings.
This is what you can
expect in your applications.
A little bit more details
about what vDSP supports.
It supports single and
double precision, both real
and complex values,
as well as strided
and non-strided data accesses.
So again, we love
to get feedback.
Another tweet about using vDSP.
Want to do FFT on iOS?
Use the Accelerate Framework.
Highly recommended.
Thank you.
So now I want to move on to
transcendental math functions.
And for that, I'm going
to turn it over to Luke.
>> Luke: Hello, everyone.
My name's Luke Chang.
I'm here to talk
about math functions.
In our group, we support
math at every data level.
For scalar data, we have
libm; it takes a scalar input,
returns a scalar output.
If you're writing vector
code, we have vMathLib.
It takes a SIMD vector as input
and then returns a
SIMD vector as output.
And if you want to handle a lot
of data, we have vForce.
It takes arrays as input and
then returns arrays as output.
We're going to talk
about them one by one.
First, libm.
It's the standard C math
library; it has a collection
of functions like
exponentials, logarithms,
trigonometry, power functions.
You're probably very familiar
with it, so I'm going to talk
about what we added
this year for libm.
What we added is an
extension to the C11 standard,
so we prefixed the function
name with double underscores.
They are available on both
iOS 7 and OS X 10.9.
They are a power of 10 function,
trigonometry in terms of pi,
and sine and cosine pairs.
First, power of 10, why
do we add power of 10?
It's a very common operation
in decimal calculation,
so if you're writing audio apps,
you need quite a lot of it.
Without a specific power of 10
function you have 2 options --
one is to use pow with the
constant 10 as the base.
However, this is inefficient,
because pow is designed
to handle generic inputs.
If you know your base is a
constant, there are a lot
of optimizations that we can
do to make it go faster.
The other way is to use exp.
You can prescale your input
by ln(10) to do power of 10.
But it has its own problem.
It's not accurate.
There's rounding error
in the multiplication.
For example, if you
want to calculate 10^5,
using this method, you will
not exactly get 100,000.
There's a small error
at the end.
That's why we added
__exp10, so you can do power
of 10 faster and more accurately.
Next is trigonometry
function in terms of pi.
Basically it's the same
regular trigonometry function
with your input scale by pi.
It is faster because we can do
argument reduction faster.
It's much easier to reduce
the argument by a multiple of 2
than a multiple of 2 pi.
It's also more accurate when
you're dealing with degrees.
For example, if you want to
calculate cosine of 90 degrees,
90 degrees translates
into 1/2 pi.
With the regular trigonometry
function you would have
to say cos(pi x 0.5), and
you will not get 0 back;
you will get a very
small number,
because pi is not exactly
representable.
But if you use __cospi(0.5),
you will get exactly 0 back.
There's no error.
Next, sine/cosine pairs.
A lot of times when
you calculate sine,
you'll also need cosine
for the same value.
For example, if you want to do
a polar-to-Cartesian conversion
you will need cosine for the
x-axis and sine for the y-axis.
Because we do them simultaneously,
there is only one
argument reduction.
You would otherwise have to do the
argument reduction twice,
so this saves time.
And what's even better is
that the compiler recognizes
that we have __sincos,
so it will optimize your
code into calling __sincos,
without you even knowing it.
Of course, if you want to call
__sincos yourself, you can.
We also added C11
support for CMPLX.
This macro is used to
define a complex number.
Without this, you're more likely
to do the real part
+ imaginary part x I.
But in that expression, there's
addition and a multiplication
in it, so sometimes you will
not get what you expect --
like this example:
0.0 + infinity x I.
Using CMPLX allows you
to specify the real part
and the imaginary part of
the complex number directly,
so you don't have to worry
about multiplication.
We also have CMPLXF and CMPLXL
for float and long double.
So that's the new
addition to libm.
vMathLib is a SIMD
vector math library.
It is designed to take
a SIMD vector as input
and then return a SIMD vector.
Similar to libm,
it has a collection
of math functions.
We prefix the function names
with a single v, so we have
vexpf, vlogf, vsinf, et cetera.
You want to use vMathLib
when you're writing
your own vector code.
The Accelerate Framework provides
a wide range of functionality,
but sometimes you have
your own special algorithm
that you write, and
you want it to be fast,
so you write it in vector code.
What if you need
the sine, for example?
You could use Libm and then
use a for loop to iterate
through each element
of your SIMD vector.
But obviously you're not
going to take full advantage
of the vector unit, so we
can replace it with vMathLib.
Instead of including math.h,
you include the Accelerate
header, Accelerate.h. Instead
of the for loop you make one
function call to vsinf.
It will take your SIMD vector
and then return the
result SIMD vector.
Then you can go on
with your vector code.
The code looks simpler,
cleaner, and it's also faster.
So that's vMathLib.
You use it when you write
your own vector code.
Next, vForce.
vForce is our vectorized
math library, designed
to handle a lot of data.
It works on arrays, so we
prefix the function names
with a double v: vvexpf,
vvlogf, vvsinf, et cetera.
Let's say you want to write
a signal generator app
and you want to generate
a sine wave, for example.
You can do it with Libm,
again, write a for loop,
go through each element
in your buffer --
you could do better
by using vForce.
Here's how.
Instead of using a for loop,
you make one function
call to vvsinf.
You pass in the output
buffer, the input buffer,
and a pointer to the length.
The generated sine will be
ready in the output buffer right
after this function call.
Again, the code looks
simpler, cleaner,
and most importantly, it's faster.
Let's look at the performance
measured on the iPhone 5.
As you can see, vForce
is more than twice as fast
as using a for loop.
In the same amount of
time it can generate more
than twice the results
of the for loop.
That's not all.
It also has great
energy performance.
It uses a lot less energy
than a for loop --
only about 60% of the energy
when you use vForce
compared to a for loop.
So your app will last longer,
you will not drain the battery,
and we did not cherry-pick
just vvsinf
to show you the performance.
There is performance
improvement across the board.
The graph doesn't even
fit into the screen.
For truncf, vForce is
more than 5 times faster
than using a for loop.
All other functions are
at least twice as fast
as using a for loop.
A few words about vForce.
vForce supports single
and double precision
floating point numbers.
It handles edge cases correctly,
so if you have infinities
or NaNs in your input, you
don't have to worry about them.
vForce will handle the
edge cases correctly.
vForce requires minimal
data alignment.
We only require native
data alignment:
a single precision floating point
number is 4-byte aligned, and a
double precision floating point
number is 8-byte aligned.
It supports in-place
operation, so you don't have
to create a temporary buffer.
That minimizes
memory movement.
We get this question a lot.
Like Jeff mentioned before,
how much data is enough
so that using vForce or any
other Accelerate function
is beneficial?
Well, for vForce, I can give
a rule of thumb; that is,
if you have more than 16
elements in your array,
consider using vForce.
Of course, the actual crossover
point may vary for each function
in vForce, but if you
have more than 16,
you're probably good to go.
So that's vForce.
I'm going to hand the
presentation back to Jeff.
He'll talk about linear algebra,
my favorite section
of the presentation.
[Applause]
>> Jeff: Thanks, Luke.
So for linear algebra we've got
the industry standard LAPACK
and BLAS libraries.
LAPACK is the Linear
Algebra Package,
and BLAS is the Basic Linear
Algebra Subprograms.
Let's begin with LAPACK.
In LAPACK you'll find high level
linear algebra functionality.
This includes things
like solving systems
of linear equations, performing
matrix factorizations,
as well as computing eigen
values and eigen vectors.
One of the great ways to tell
how you're doing with LAPACK
and BLAS is to look at
the LINPACK benchmark.
So as I mentioned these
are industry standard.
They've been around a
long time, and people came
up with LINPACK benchmark
to see how they're doing.
LINPACK benchmark is essentially
answering the question,
how fast can you solve a
system of linear equations?
There's a couple variations
of the LINPACK benchmark.
The one that we're going to
look at here is using a matrix
of 1,000 x 1,000 elements.
Let's look at the performance.
So this is the LINPACK
performance of Brand A.
Two years ago we
did this comparison
and we compared Brand A.
We looked around at all
the published benchmarks
that we could find, and
they were at 40 megaflops.
In 2 years there's
been a lot of time;
improvements have been
made, and the performance
for Brand A has come
up to 788 megaflops,
just under a gigaflop
-- pretty good.
Let's look at the performance
of the LINPACK benchmark using
the Accelerate Framework.
1200 megaflops --
this is 1.2 gigaflops.
This is pretty good.
There's just one thing.
We've had 2 years, too.
This is the performance
running on the iPhone 4S.
Let's look at the performance
of the Accelerate Framework
running on the iPhone 5.
It's quite a bit better.
Thank you.
Well, LINPACK benchmark using
the Accelerate Framework
on the iPhone 5 is
at 3,400 megaflops.
That's 3.4 gigaflops.
This is a phone that
fits in your pocket
and runs on a battery.
This is really impressive.
As I said, the LINPACK
benchmark's been
around for awhile, and so
we wanted to do a comparison
to an older machine for fun.
And so we're going
to compare the iPad
with the Retina display
to a Power Mac G5.
For those of you that have
been around for awhile,
you might remember some of the
bake-offs with the Power Mac G5,
so we're having a
triumphant return.
This is a 10-year old machine,
and if any of you remember
this machine, it's returning
with all fans blazing.
I think there's 7 case
fans, when you turn it
on you know it's in the room.
When you run LINPACK benchmark,
sounds like you're driving
down the highway with
your head out the window.
Let's look at the performance.
LINPACK benchmark on Power
Mac G5 is 3,643 megaflops.
Let's see how the iPad compares.
Just edges it out at
3,686 megaflops --
pretty impressive
for a little tablet.
Thank you.
Let's look at an example
of how to use LAPACK.
As always, we'll begin
by including the
Accelerate Framework header,
and then we're going to
create and prepare our data,
so we'll create two matrices, A
and B, which describe the system
that we want to solve.
In this case, we're going to
use a system solve that's going
to perform pivoting, so we need
a vector to contain information
about the pivots that
we're going to perform,
and then we're going to
perform this all with DGESV.
There's a couple things
I want to point out.
So as I mentioned, LAPACK
is industry standard;
it's been around for a while.
It's originally written
in FORTRAN and maintained
in FORTRAN, so the entry
points look like this:
it's going to be DGESV
followed by an underscore.
It also means that all the
values are going to be passed
by reference, which is
something to be aware of.
It's pretty easy to get
tripped up with this.
But to perform the system solve,
we simply pass in the size
of the matrix in N, the
number of right-hand sides
which is the number of systems
that we're going to solve,
the matrix, the leading
dimension of the matrix,
and then the pivot
vector that we created,
and the right-hand sides, B.
Info will capture any errors
that happen in this operation.
It's pretty easy
to solve a system
of linear equations
with LAPACK.
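To pin down what DGESV computes, here is a self-contained 2-by-2 illustration (solve2x2 is a hypothetical helper using Cramer's rule; DGESV itself uses LU factorization with partial pivoting, overwrites B with the solution, and reports errors through info):

```c
#include <math.h>

/* Solve A*x = b for a 2x2 system.  The matrix is stored
   column-major, as LAPACK expects: a[0], a[1] are column one. */
int solve2x2(const double a[4], const double b[2], double x[2]) {
    double det = a[0] * a[3] - a[2] * a[1];
    if (det == 0.0)
        return 1;             /* singular -- like a nonzero info */
    x[0] = (b[0] * a[3] - a[2] * b[1]) / det;
    x[1] = (a[0] * b[1] - b[0] * a[1]) / det;
    return 0;
}
```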
Next is BLAS.
So LAPACK provides the higher
level linear algebra operations.
It's built heavily on BLAS,
the lower level linear
algebra operations.
All of BLAS is available through
the Accelerate Framework.
It's typically broken down
into three categories:
vector operations -- dot
product, scalar product,
vector sums; matrix-vector
operations -- matrix-vector
product, outer product;
and matrix-matrix operations,
like matrix multiply.
Let's look at an example
of how to use BLAS
in the Accelerate Framework.
We'll begin by including the
Accelerate Framework header.
As always we'll create
and prepare our data,
so we'll align these
buffers if we can.
In this case we have two
operand matrices, A and B,
and the result matrix C.
And then we're going to
call into cblas_dgemm.
BLAS supports both
row and column major,
so the first argument
specifies whether we're
row or column major.
The next 2 arguments specify if
we want to perform a transpose
on the 2 operand matrices.
It's important with BLAS
and LAPACK to understand
that these transposes
don't actually happen;
the operation is
organized so
that the transposes
are implied.
And then the last
several parameters
are information about the sizes
of the matrices, the
matrices themselves,
their leading dimensions,
and any scalar values
which will scale the
operands or the result matrix.
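The operation itself is C = alpha*A*B + beta*C. A naive row-major reference (naive_dgemm is a hypothetical name, with no transpose or leading-dimension arguments) computes the same numbers cblas_dgemm produces, just much more slowly:

```c
/* C = alpha*A*B + beta*C for row-major A (m x k), B (k x n),
   C (m x n) -- the core of what dgemm computes. */
void naive_dgemm(int m, int n, int k, double alpha,
                 const double *A, const double *B,
                 double beta, double *C) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int p = 0; p < k; p++)
                sum += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
}
```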
Just to cover some of the data
types and details supported
by both BLAS and LAPACK,
they both support single
and double precision values,
both real and complex,
and multiple data
formats for your matrices,
so dense matrices, banded
matrices, triangular matrices.
As we saw before, transposes as
well as conjugate transposes --
and again, these
disappear in the operation.
They aren't explicit transposes.
And then finally,
BLAS supports both row
and column major while LAPACK
only supports column major.
Another tweet I wanted
to share with you:
"Playing with the Accelerate
Framework today, having a BLAST."
So in summary, there's
a lot of functionality
in the Accelerate Framework.
You'll find image
processing in vImage,
digital signal processing
in vDSP,
transcendental math functions
in vForce and vMathLib
and linear algebra
in LAPACK and BLAS.
When you think Accelerate
Framework, think easy access
to all this functionality,
over 2,000 APIs.
Accurate, we tested so
that you don't have to.
You're going to get
great performance
with low energy usage.
It's going to work great on OS X
and iOS, and it's going to work
on the complete Apple
hardware lineup,
everything that's available now
and everything that's to come.
Just a recap of the
tips to be successful
with the Accelerate Framework.
When you're preparing your data,
if you can make the
buffers contiguous
and you can align the
beginning of those buffers
to a 16-byte boundary, we can
in some cases get you
slightly more performance.
Again, the Accelerate
Framework is always going
to give you the best
performance possible
even when you can't meet
these recommendations.
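One way to get that 16-byte alignment (alloc_aligned_floats is a hypothetical helper; posix_memalign is available on both OS X and iOS):

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate a float buffer whose first element sits on a
   16-byte boundary. */
float *alloc_aligned_floats(size_t count) {
    void *p = NULL;
    if (posix_memalign(&p, 16, count * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}
```

The buffer is released with free(), as usual.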
Understand the problem size.
For small problem sets,
the Accelerate Framework
might not be able
to deliver the best performance.
It's always going to deliver
the functionality, though.
Finally, do set up
and destroy once.
If you find yourself
creating a setup structure,
use that setup structure
as many times as possible.
The Accelerate Framework is
for you guys, and so I want
to leave you with this.
If you need a feature,
please request it.
The best way to do that
is by filing a bug.
And one more tweet:
"The discrete cosine transform
was my feature request
that made it into the
Accelerate Framework.
I feel so special."
So we do listen.
Please request.
And then lastly, thanks, Apple,
for making the Accelerate
Framework.
Thank you, guys, for
making it a success.
[Applause]
Just a little more information
here, if you guys need to get
in touch with us,
contact Paul or George.
There's some documentation
available online, and as always,
check the Apple developer
forums.
That's all we got,
thank you, guys.
[Silence]