WWDC2014 Session 703

Transcript

Good morning everyone.
My name is Geoff.
I'm an Engineer in the
Vector and Numerics Group
where we maintain the
Accelerate Framework.
The Accelerate Framework
is a collection of routines
which deliver a huge
range of functionality.
All this functionality is
going to be extremely fast
and be very energy efficient.
Today I want to introduce some
new features and functionality
to the Accelerate Framework which are designed
to really simplify the way
that you access this
high-performance functionality.
So what are you going to find
in the Accelerate Framework?
We break this into
four broad categories.
The first is image processing.
Here you're going
to find conversions
between various pixel formats,
warp, shears, convolution,
etc. We've got digital signal
processing, FFTs, DFTs, biquads,
various vector operations.
Vector math functionality, so a
lot of things that you're going
to find in math.h for
example, operating on vectors,
so sine of a vector,
cosine of a vector,
etc. And then finally,
linear algebra.
Solving systems of linear
equations, eigenvalues,
matrix-matrix operations, a lot of
functionality in here as well.
The Accelerate Framework
brings a lot more
than just functionality
to the table.
First, it's extremely
high performance.
When we say this,
there's two main metrics
that we pay a lot
of attention to.
The first is speed.
It's going to be extremely fast.
There's two key tools that
we use to achieve this.
The first is short vector units.
So on Intel we're taking
advantage of SSE and AVX
and on ARM we're taking
advantage of NEON.
Also in some situations we're
utilizing multiple cores.
We're going to do
this automatically.
So we're really going
to take advantage of all
of the processing
that's available for you.
The other metric that we
spent a lot of time looking
at is energy efficiency.
So we're increasingly relying
on our portable devices.
It's important that we
keep an eye on this.
Generally, when we improve
speed and performance,
energy efficiency
improves as well.
So when you adopt the Accelerate
Framework, you're going
to be fast and energy efficient.
The Accelerate Framework is
available on both OS X and iOS.
And it's optimized for all
generations of hardware.
So when you adopt the
Accelerate Framework,
you're going to write once.
You're going to get code
that runs extremely fast
and is energy efficient no
matter where it ends up running.
So it's really convenient
for you.
Today I want to talk about the
new features and functionality
designed to make it easier to get
to this high performance.
We've got some great
new features in vImage
which really round out what
you can do with vImage.
And then I want to
spend the rest
of the time introducing
two new pieces of work.
The first is designed to
really simplify the way
that you access high-performance
linear algebra.
We're calling this
LinearAlgebra.
It's a part of the
Accelerated framework.
The other piece is
not actually a part
of the Accelerate Framework.
It's a collection of vector
programming primitives.
It's found in simd.h. And
for those of you that want
to roll your own
high-performance vector
implementations, there's
going to be some great,
great tools in here
to help you do that.
So now let's jump
right into vImage.
This is our high-performance
image processing library.
It's got a huge range
of functionality.
I want to show you
some of the things
that you can do with
a short video.
You can perform alpha
blending, dilation, erosion.
You can create Sobel
filters to do edge detection.
Convolutions for
blur and de-blur.
You can create multi-kernel
convolutions.
There's min and max filters.
Various color transformations.
And warps and shears.
This is just some of what
you can do with vImage.
Really you can do almost any
of your image processing needs
with the tools that are
available in vImage.
I want to move now
into some work
that we introduced last year.
And this is about getting
your image into a format
that vImage can consume.
Specifically, if you're
coming from a CGImageRef.
So until last year this
was a difficult task.
If you didn't know exactly
what the pixel format
of your CGImageRef was
for whatever reason,
it could be difficult to
get it into the 8 bit ARGB
or whatever format
that you saw in vImage
that you wanted to work with.
So last year we introduced
a single routine
that allows this to happen.
I'm just going to move
through this at a high level
to make you aware of it.
For further details,
please see last year's talk.
But all you do now is
you create a structure
which describes the pixel format
that you're trying to get to.
And then you're going to
make a single function call,
vImageBuffer_InitWithCGImage.
This takes an uninitialized
vImage buffer.
It takes the structure
describing the format
and the CGImage.
At the end of this,
it's going to return
in a fully initialized
vImage buffer,
and you can do whatever
you need to do.
The round trip is just as
easy, single function call.
So now we've performed all the
operations on the vImage buffer.
We've stayed in the same
format so we can use
that same structure
describing the pixel format.
And this is going to
return a CGImageRef.
So some really great
interoperability with CGImageRef.
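Putting the two calls together, here's a hedged sketch of that round trip in C. The 8-bit ARGB format and the ProcessImage wrapper are illustrative choices, and error handling is mostly elided.

    #include <Accelerate/Accelerate.h>

    CGImageRef ProcessImage(CGImageRef srcImage)
    {
        // Describe the pixel format we want to work in (8-bit ARGB here).
        vImage_CGImageFormat format = {
            .bitsPerComponent = 8,
            .bitsPerPixel     = 32,
            .colorSpace       = NULL,    // NULL: use the default RGB colorspace
            .bitmapInfo       = (CGBitmapInfo)kCGImageAlphaFirst
        };

        vImage_Buffer buffer;            // uninitialized; vImage fills it in
        vImage_Error  err = vImageBuffer_InitWithCGImage(&buffer, &format, NULL,
                                                         srcImage, kvImageNoFlags);
        if (err != kvImageNoError) return NULL;

        // ... perform whatever vImage operations you need on 'buffer' ...

        // Round trip: same format structure, back to a CGImageRef.
        CGImageRef result = vImageCreateCGImageFromBuffer(&buffer, &format,
                                                          NULL, NULL,
                                                          kvImageNoFlags, &err);
        free(buffer.data);               // we own the buffer's storage
        return result;
    }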
It's really easy to
get your image data in
and out of vImage this way.
Last year we also introduced
some high level entry points
to some really amazing
conversion support.
And this is through
vImageConvert_AnyToAny.
It does exactly what it
sounds like it's going to do.
It allows you to convert
between nearly any pixel format
and any other pixel format.
Again, just at a high level
for further details,
see last year's talk.
But the way that it works is
you're going to create two
of these structures
describing the pixel formats.
One for the source format,
one for the destination type.
Then you create a converter.
And then with that converter
you can convert images.
You can convert between
the two image formats.
You can convert as many as you
want with a single converter.
So, this allows you to convert
between nearly any pixel format.
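As a hedged sketch, the AnyToAny flow looks roughly like this in C: two format descriptions, one converter, then as many conversions as you like. The formats shown and the src and dest vImage_Buffers are hypothetical placeholders.

    #include <Accelerate/Accelerate.h>

    vImage_CGImageFormat srcFormat  = { .bitsPerComponent = 8, .bitsPerPixel = 32,
                                        .bitmapInfo = (CGBitmapInfo)kCGImageAlphaFirst };
    vImage_CGImageFormat destFormat = { .bitsPerComponent = 8, .bitsPerPixel = 32,
                                        .bitmapInfo = (CGBitmapInfo)kCGImageAlphaNoneSkipFirst };

    vImage_Error err;
    vImageConverterRef converter =
        vImageConverter_CreateWithCGImageFormat(&srcFormat, &destFormat,
                                                NULL, kvImageNoFlags, &err);

    // Reuse the same converter for every image in this pair of formats.
    err = vImageConvert_AnyToAny(converter, &src, &dest, NULL, kvImageNoFlags);

    vImageConverter_Release(converter);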
To the power user, this
means you can get almost any
of your image formats
into a format
that vImage can consume
very easily, very efficiently,
and it's going to
run extremely fast.
You guys had some really
great things to say
about these two new features.
One Twitter user said,
"functions that convert
vImage objects
to CGImage objects
and back," thumbs up.
Another Twitter user said,
"vImageConvert AnyToAny
is magical.
Threaded and vectorized
conversion
between nearly any
two pixel formats."
We really appreciate
the feedback.
We're very happy that
you guys are using this
and find it useful.
Please keep the feedback coming.
So with that I want to introduce
video support to vImage.
This is new in both
iOS 8.0 and OS X 10.10.
And I'm going to start with
the high level functionality
from a CVPixelBufferRef.
So this is a single video frame.
And we're introducing
the same interoperability
and the same ease of use that
we saw with core graphics.
So now if you want to get your
CVPixelBufferRef into a format
that vImage can operate on,
it's a single function call.
You're going to use
that same structure
which describes the format
which you're trying to get to.
And then you're going
to call
vImageBuffer_InitWithCVPixelBuffer.
It takes an uninitialized
vImageBuffer.
It takes the structure
describing the format,
and the CVPixelBuffer.
There's some additional
arguments for the power user
which we'll see a little
bit more about in a second.
At the end of this you've got
a freshly initialized vImage
buffer, and you can perform
any operation you want.
The round trip is just as easy.
So going from a vImage buffer back
to a CVPixelBufferRef is
vImageBuffer_CopyToCVPixelBuffer.
Takes the vImageBuffer that
you just finished working with,
the buffer format
describing the pixel format.
And then the CVPixelBuffer that
you're trying to copy back to.
So there's some great
interoperability now
with Core Video as well.
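Here's a hedged sketch of that Core Video round trip in C. The argument order follows my reading of the vImage_CVUtilities.h header, so double-check it there; pixelBuffer is a hypothetical CVPixelBufferRef you already have.

    #include <Accelerate/Accelerate.h>
    #include <CoreVideo/CoreVideo.h>

    // The format we want to operate in (8-bit ARGB, as an example).
    vImage_CGImageFormat format = {
        .bitsPerComponent = 8,
        .bitsPerPixel     = 32,
        .bitmapInfo       = (CGBitmapInfo)kCGImageAlphaFirst
    };

    vImage_Buffer buffer;                // uninitialized; vImage fills it in
    vImage_Error  err = vImageBuffer_InitWithCVPixelBuffer(&buffer, &format,
                                                           pixelBuffer,
                                                           NULL,  // derive the CV format
                                                           NULL,  // background color
                                                           kvImageNoFlags);

    // ... run any vImage operations on 'buffer' ...

    // Round trip back into the same CVPixelBufferRef.
    err = vImageBuffer_CopyToCVPixelBuffer(&buffer, &format, pixelBuffer,
                                           NULL, NULL, kvImageNoFlags);
    free(buffer.data);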
To support the high level
functionality that we just saw,
there's a lot going
on behind the scenes.
All of this is exposed
to you as well.
So, the lower level interfaces.
There's forty-one new video
conversions which are supported.
You can, through some of the
other arguments that we saw,
do things like manage
the chroma siting,
work with transfer functions,
and conversion matrices.
There's a lot that
you can do with this.
Another one that is really
neat if you've worked
with video formats
before is RGB colorspaces.
So there are some subtleties, and
it's just a little bit
tricky and complicated
to get an RGB colorspace.
And vImage makes this really
simple and easy to do.
vImageConvert_AnyToAny is
extended to support all
of the video formats now.
And there's two great
new convenience routines
which allow you to create
converters to convert back
and forth between core
graphics and core video.
So now with video support
in vImage we've got
great interoperability
with both Core Graphics and Core
Video, really fast conversions
for both image and
video pixel formats.
And really fast operations
once you're in vImage.
I want to show you some
typical performance.
So what I have here is
performance from VideoToolbox.
This is available in
CoreMedia, and what I've got
on this graph is showing the
speed, in megapixels per second,
on the Y axis to convert
from the BGRA 8-bit pixel format
to the pixel format
shown on the X axis.
The gray bar is OS X 10.9.
This is before VideoToolbox
had adopted vImage.
And then the blue
bar is OS X 10.10
after VideoToolbox
adopted vImage.
We see a few things here.
First we see some really great
performance improvements.
So vImage conversions are
going to be really fast.
In some cases we're up
to five times faster.
The other thing that we see
all the way at the right,
the v210 format wasn't
even supported before.
vImage supports a
wide range of formats,
and it made it really
easy for them
to produce new features
once they adopted the vImage
video support.
So this is what you can
expect out of vImage.
Great performance.
Simple, easy to use,
good interoperability
with core graphics
and core video.
Now I want to move
on to LinearAlgebra.
This is a new sub-framework
in the Accelerate Framework.
It is designed to
provide simple access
to high-performance
linear algebra.
I want to begin with
a motivating example.
How do you solve a system
of linear equations?
Let's look at how you do this
with LAPACK, also available
in the Accelerate Framework.
And this is saying, if we've
got a system of equations
with the matrix A and,
on the right-hand side,
the matrix B,
how do we solve AX = B?
So with LAPACK, it's going
to look something like this.
It's not terribly
straightforward.
The naming convention in
LAPACK uses short names,
so you're going to
have to figure
out that sgesv means solve
system of linear equations.
Once you're there, the
argument names are not going
to be much better.
You're passing by
reference here.
All of the argument
types are __CLPK_integer.
So there's going to be
a lot of explicit casts.
Additionally, there's going to
be a lot of memory management
that you need to do
explicitly, workspaces,
or in this case a pivot vector
that you need to create.
So there's a lot to just
finding the right routine
and then using it correctly.
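For reference, a hedged sketch of that LAPACK call in C: solving AX = B for a square n-by-n system in column-major storage, with B overwritten by the solution. A and B are hypothetical float arrays you have already filled in.

    #include <Accelerate/Accelerate.h>
    #include <stdlib.h>

    __CLPK_integer n    = 100;            // explicit integer types and casts everywhere
    __CLPK_integer nrhs = 1;              // number of right-hand sides
    __CLPK_integer lda  = n, ldb = n, info = 0;

    // Explicit memory management: LAPACK wants us to supply the pivot vector.
    __CLPK_integer *ipiv = malloc(n * sizeof *ipiv);

    // Everything is passed by reference, even the scalar dimensions.
    sgesv_(&n, &nrhs, A, &lda, ipiv, B, &ldb, &info);

    free(ipiv);                           // info == 0 means success; B now holds X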
We think it should be
much simpler than this.
Let's look at how
you solve the system
of linear equations
with LinearAlgebra.
It's going to be really simple.
It's simply going
to be la_solve.
All of the details are
going to be managed
for you behind the scenes.
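As a hedged one-line sketch, following the talk's Objective-C-with-ARC convention (and assuming a and b are la_object_t matrices you already created):

    la_object_t x = la_solve(a, b);   // pivoting, workspaces, etc. handled for you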
So with that let's dive into
what exactly you're going to get
out of the LinearAlgebra
sub-framework.
So it's new in both
iOS 8.0 and OS X 10.10.
It is designed to be simple
with good performance.
It has got single and
double precision support.
So it's not going to be mixed,
much like BLAS and LAPACK.
It has support
for Objective-C,
so the object is going to be a
native Objective-C object.
What are you going to
find in LinearAlgebra?
There's a huge range
of functionality.
We've got element-wise
operations, add, subtract.
Matrix products.
This could be inner products,
outer products, matrix-matrix products.
Transposes.
There's support for
norms and normalizations.
Support for solving systems
of linear equations.
And then two pieces
which are unique
to the LinearAlgebra
sub-framework,
and those are slice and splat.
And we'll see about those
in further detail
a little bit later.
Well let's begin with a
new LinearAlgebra object.
The LinearAlgebra object is a
reference counted opaque object.
As I said it's an Objective-C
Object in Objective-C.
It still works in C though.
It manages a lot
of things for you.
So in that initial
LAPACK example we saw
that for each argument you're
tracking a pointer, the row
and column dimensions,
leading dimension or a stride.
There's a lot of things
for each argument.
It means you have
a lot of arguments.
There's a lot going on.
Here the object is going to
keep track of the data buffer.
It's going to keep
track of the dimensions
of each of these objects.
Errors and warnings
are attached directly
to the object making
it really convenient.
And then finally scalar type.
So with BLAS and LAPACK you've
got all the APIs duplicated,
one for single and
one for double.
We can collapse all that down
to half the number of APIs.
Memory management for these
LinearAlgebra objects.
Again these are reference
counted objects.
There's a lot of documentation
about reference counted objects.
There's nothing new here.
Just very briefly from C,
you're going to use
la_release and la_retain.
You do not ever free these.
From Objective-C,
they take the standard
release/retain messages.
And then finally,
Objective-C with ARC,
which is what we recommend.
Just lets you write
exactly what you want
with no explicit
memory management.
From here on out, all the
examples that I show are going
to be Objective-C using
ARC, so there's going
to be no memory management.
So how do you get
your data into one
of these LinearAlgebra objects?
In this example, we're
going to allocate a buffer.
It's going to be some number of
rows by some number of columns,
and we know the row stride
and number of elements.
We're going to fill that
as a row major matrix.
Then to get that matrix into
the LinearAlgebra domain,
we're going to just call
la_matrix_from_float_buffer or
la_matrix_from_double_buffer.
It takes the pointer,
the dimensions
of the matrix, the row stride.
And then hints which we'll see
on the next slide a little
bit more details about those.
And then attributes, which
are attached to objects.
These attributes allow you
to do things like enable
additional debug logging.
In this particular case,
the data is copied out of A
so the user retains
all rights to A.
In this case they
need to free it.
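A hedged sketch of that creation call, following the talk's Objective-C-with-ARC convention; the dimensions are illustrative, and <Accelerate/Accelerate.h> and <stdlib.h> are assumed to be included.

    size_t rows = 100, cols = 80, rowStride = cols;
    float *A = malloc(rows * rowStride * sizeof *A);
    // ... fill A as a row-major matrix ...

    la_object_t matrix = la_matrix_from_float_buffer(A, rows, cols, rowStride,
                                                     LA_NO_HINT,
                                                     LA_DEFAULT_ATTRIBUTES);

    // The data was copied out of A, so we still own it and must free it.
    free(A);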
So hints, when you're
passing data to LinearAlgebra,
there's some information
that can be beneficial
to the framework to deliver
the maximum performance.
So hints are designed to allow
for this, to allow for you
to give us details and
insights about the buffer
so that we can use the right
routines behind the scenes.
So for example, if you know that
your matrix is diagonal or
triangular, we can leverage that.
These are hints, so if you pass
the wrong hint it's not going
to give you a wrong result.
It may just add additional
overhead.
If you don't know,
just use LA_NO_HINT.
The next piece I want to talk
about is Lazy Evaluation.
I want to do that with a fairly
large example for a slide.
So it's not important that you
understand exactly what's going
on in all of this code.
I just want to walk
through it at a high level
so that you can understand
what's going
on behind the scenes.
LinearAlgebra uses
an evaluation graph.
When you create an object,
evaluation is not
necessarily going to occur.
It's going to be added
into this evaluation graph.
So at the start of
this function,
we've got two evaluation graphs
with a single node
in each of them.
And as we step through
this code we're going
to create additional objects.
So in this case we
create a transpose.
We add that to our
evaluation graph.
Then we take the sum of
the odd elements of x
and the even elements of x.
Again, we just add that
to the evaluation graph.
And we continue.
This time the product of At
and x2, all scaled by 3.2.
All of this is just added
to this evaluation graph.
At no point has any
evaluation occurred
or any temporary data
structures been allocated.
So no computation is going to
occur until you trigger it.
This allows us to
not perform a lot
of frivolous memory
allocations and computations.
And right now we don't
trigger a computation
until you explicitly
ask for data back.
This is going to happen
with la_matrix_to_float_buffer
or la_matrix_to_double_buffer,
or la_vector_to_float_buffer
or la_vector_to_double_buffer.
So again, creating these objects
is going to be lightweight.
We're going to do a lot
of work behind the scenes
to make this run extremely fast.
And we're only going
to compute the data
that you request at the end.
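A hedged sketch of that pattern, following the talk's Objective-C-with-ARC convention. A and B here are hypothetical n-by-n la_object_t matrices; nothing below does any arithmetic until the final buffer request.

    la_object_t At  = la_transpose(A);                 // just adds a node to the graph
    la_object_t sum = la_sum(At, B);                   // still no evaluation
    la_object_t y   = la_scale_with_float(sum, 3.2f);  // still no evaluation

    // Evaluation of the whole graph is triggered only by asking for data back:
    float *buffer = malloc(n * n * sizeof *buffer);
    la_status_t status = la_matrix_to_float_buffer(buffer, n, y);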
I want to show you some
performance results
for the routine that
we just saw.
Before I do that, I want
to introduce Netlib BLAS.
This is an open source
implementation of BLAS.
As I said, if you weren't aware
that BLAS was available
in the Accelerate Framework,
this is probably
the implementation
that you would find
yourself using.
So now let's look at the
performance of that routine
that we were looking at before.
On the X axis we've got
various matrix sizes.
On the Y axis we've
got gigaflops,
so higher is going to be better.
Here's the performance of
the LinearAlgebra Framework.
We can see it's pretty good.
Let's compare it to
the Accelerate BLAS,
an extremely high
performance benchmark here.
What we see here is
LinearAlgebra is getting most
of the performance that
the Accelerate Framework
can deliver.
Much simpler to get all of the
performance from LinearAlgebra.
There is a discrepancy
on the small end.
There are fixed costs
associated with these objects
which are magnified
for smaller matrices.
But overall, you're getting
most of the performance
with a really simple clean API.
I just want to put this
performance comparison
into perspective.
What if you had used
that open-source Netlib
implementation of BLAS?
Your performance
would look like this.
So you can see, you're
getting a lot
of the possible performance
from LinearAlgebra.
Next I want to talk
about error handling.
So what I've got here
is just a sequence
of operations with
LinearAlgebra.
After each operation we're
checking the error status.
We don't recommend
doing it this way.
What we recommend is
checking the error once
at the end.
So errors are going to be
attached to and propagated
through these evaluation graphs.
So if we have an error
in the first statement,
that error is going to be
attached to the object AB.
Sum is going to see that
there is an error there
and just propagate it through.
Additionally with Lazy
Evaluation, there's a class
of errors that may
not be triggered
until computation time.
So it's always best to check
the status as late as possible.
In this case we're
trying to write back
to the buffer before we
even check the status.
The way that we recommend
checking the status is:
if the status is
zero, or LA_SUCCESS,
then everything went well.
In this case, you've
got data in your buffer.
If it's greater than zero, there
was some warning, you're going
to have data there but you
may not have full accuracy.
And then finally less than
zero some hard error occurred.
In this case there's going
to be no data in that buffer.
This might be something
like a dimension mismatch
or something we just
can't recover from.
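A hedged sketch of the recommended pattern: run the whole computation, write back, and check the status once at the end. Here 'result' is a hypothetical rows-by-cols la_object_t produced by some chain of operations.

    float *buffer = malloc(rows * cols * sizeof *buffer);
    la_status_t status = la_matrix_to_float_buffer(buffer, cols, result);

    if (status == LA_SUCCESS) {
        // buffer holds valid data
    } else if (status > 0) {
        // warning: data is there, but may not have full accuracy
    } else {
        // hard error (e.g. a dimension mismatch): no data was written
    }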
So this sort of begs the
question, how do we debug this
if we've got all this late
error checking, Lazy Evaluation?
The best way to do this
is to enable debug logging
with LA_ATTRIBUTE_ENABLE_LOGGING.
When you do this and you
encounter an error or warning,
you're going to get a message
like this to standard error.
This is going to help you
determine what the error was
and where it occurred,
which really helps you
to quickly narrow down where
the problem is coming from.
I want to talk a little bit
about the details of the solve.
So if you're familiar with
linear algebra, if you've worked
with LAPACK before, you know
there's a lot of options here.
So I just want to talk
about what our solve
is doing at this point.
So if A is a square,
non-singular matrix,
it's going to compute
the solution to Ax = b.
If A is square and
it's singular,
it's going to produce an error.
So right now, it's pretty
straightforward to do,
and this is what you're
going to get out of it.
The next piece which is unique
to LinearAlgebra is slicing.
So slicing is lightweight
access to partial objects.
I say lightweight
access, so there's going
to be no buffer allocation
and no copy.
Things that you can do with
slices are for example,
taking the odd elements
of a vector.
We shouldn't have to
allocate a temporary buffer
and copy those odd
elements out into
that buffer if we don't need to.
And when I say that there's
no allocation and no copy,
don't confuse this
with Lazy Evaluation,
this is at evaluation time.
We're going to do
everything that we can just
to access that data in place.
There's three pieces
of information
that you need to create a slice.
That is offset, stride
and dimension.
And let's look at an example.
Let's say we wanted to slice
and get some of the elements
out of an existing vector.
The first argument is
going to be the offset.
This is a zero-based offset.
So if you start at the 8th
element, it's going to be 7.
The stride is the direction
and number of elements
that we're going to move.
In this case it's
negative 2, so we're going
to move back two elements.
And then finally,
the dimension is 3.
So we're going to have this
view of a three element vector,
which is really elements
out of some larger vector.
Again, no copy, no allocation
here, just a lightweight access
of elements in some
larger object.
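As a hedged sketch (Objective-C with ARC, and v being some existing la_object_t vector), the slice just described looks like this:

    // Start at the 8th element (zero-based offset 7), step backwards by 2,
    // and view 3 elements. No allocation, no copy.
    la_object_t slice = la_vector_slice(v, 7, -2, 3);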
One of the ways that
you might use this is
to create a tiling engine.
Let's just look at
a simple example.
You want to sum two
matrices together.
One of the ways you can do this
is with this simple nested loop.
And you would put your
slices inside the loop.
And you're slicing
the two operands.
A and B in this case here.
And you're creating
a partial result C.
Just using that C and then
getting the next partial sum.
So you can do it this way.
And it's going to work.
But we can actually do a
lot of this work for you.
So instead what we
recommend doing is hoisting
that sum out of the loop.
With Lazy Evaluation,
nothing is going to happen here.
And instead to just put
the slice on the result.
So our picture has
changed a little bit.
It looks like something
different is happening here.
But behind the scenes, you're
actually getting what you saw
on the previous slide.
So you're getting
exactly what you want.
We're doing all the work
for you behind the scenes.
So it's really easy to
work with these slices.
And the rule of thumb
is to put them as close
to the result as possible.
The next piece is a splat.
A splat is a way to work
with scalar values
alongside vectors and matrices.
So let's say you want to add 2
to every element of a vector.
The way that you're going
to do this is you're going
to call la_sum with
your vector object.
And then you're going to
splat the scalar value 2.
So it's really easy to
do certain operations now
with scalars on matrices
and vectors.
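A hedged sketch of that splat, again assuming an existing la_object_t vector v and the names from my reading of the LinearAlgebra headers:

    // Add 2 to every element of v by splatting the scalar.
    la_object_t two    = la_splat_from_float(2.0f, LA_DEFAULT_ATTRIBUTES);
    la_object_t result = la_sum(v, two);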
So that's a high level
summary of LinearAlgebra.
It's got a really
simple, easy-to-use API.
It's got some great modern
language and runtime features.
And it's going to deliver
really good performance.
With that I want to turn it over
to Steve to talk about LINPACK.
>> Thanks Geoff.
So I'm Steve Canon.
I'm a Senior Engineer in the
Vector and Numerics Group.
I work with Geoff.
And I'm going to talk about
our other new feature shortly,
but before I do that, I thought
we'd have a little bit of fun
and talk about LINPACK
real quickly.
So LINPACK is a benchmark
that originated
in the high-performance
computing community.
And what it really measures
is how fast are you able
to solve a system
of linear equations?
Now this might seem like kind
of an arbitrary benchmark.
But it turns out that
lots of computations
that we do every day boil down
to solving linear problems.
So this is really an important
thing to be able to do quickly.
Now when you talk about LINPACK,
it's important to keep in mind
that LINPACK is measuring
the speed
of both hardware and software.
You can't have great performance
on LINPACK without good hardware
and without good software
that takes advantage
of the hardware that you have.
The past few years, we've
shown you a shootout
between Accelerate
running on iOS devices
and what we like
to call Brand A.
Last year we showed
you a chart that looked
like this comparing Accelerate
running on the iPhone 5
against the best LINPACK
score that we were able
to find anywhere for
any Brand A device.
So this is performance
in gigaflops.
This is double precision.
See that Accelerate on
the iPhone 5 gives you
about 3 1/2 gigaflops on LINPACK
which is a really
impressive number.
It's great.
Now the past few years we
showed you a chart like this,
and then the next year Brand
A hardware would have improved
enough to make the
comparison more interesting.
And then we could
blow you away again
with how much faster
Accelerate was.
But since last year, Brand
A hardware hasn't changed
that much.
And so the great software
primitives that we give you
in Accelerate, well this
is still on the iPhone 5,
and you can see, it's not that
interesting of a comparison.
So this year we thought
we'd do something different.
We're going to find
some new competition.
Instead of comparing current
iOS hardware against Brand A,
we're going to compare
Accelerate running on the iPhone
against Accelerate running
on some other device.
What should we pick?
We chose to look at
the 2010 MacBook Air.
Now this was a sweet laptop.
I had one of these.
It's fantastic.
This was like the first
one that we shipped
with the current hardware
design on the outside.
It's a really nice machine.
It's just a few years old.
You can see it's more than twice
as fast as the iPhone 5 was.
So, how do you think
the iPhone 5s stacks up?
Well, you should have some clue.
I probably wouldn't be showing
you the graph if it wasn't
at least going to be close.
But on the other hand, this
is a pretty sweet laptop
from just a few years ago.
And we're going to compare it
against the phone that fits
in your pocket like so.
I don't know.
Who thinks that the
iPhone 5s is faster?
Who thinks that the
MacBook Air is faster?
Ok. So let's see what happens.
The iPhone 5s would give
you 10.4 gigaflops double
precision LINPACK.
And we have other
iOS devices too.
On the iPad Air, we give you
14.6 double precision gigaflops.
And you don't need
to be an expert
in high-performance computing
in memory hierarchies,
in vectorization, in
multithreading to get this.
You just use the simple
primitives that we give you
for matrix operations, and you
get this kind of performance.
So I think this is really cool.
With that, I'm going to move
on to our last new feature
for the day, which
is called SIMD.
Now SIMD traditionally is a name
used to talk about hardware,
and it stands for single
instruction multiple data.
And that's not exactly what
we're talking about here.
This is a new library
that we're introducing
in iOS 8.0 and OS X Yosemite.
And it has three
primary purposes.
The first one is to support 2D,
3D and 4D vector
math and geometry.
The second purpose for SIMD is
to provide a lot of the features
of Metal in C, C++ and
Objective-C running on the CPU.
So it's going to make it
easier to prototype code.
Maybe you want to run the CPU
before you deal with GPU stuff.
Maybe you want to move code
between the CPU and the GPU.
Makes it a little bit
easier to do that.
And finally, SIMD library
provides an abstraction
over the actual hardware SIMD
and the types and intrinsics
that you often use to program
against it to make it easier
to write your own vector
code when you need to.
So I think the most
interesting thing
about this is the vector math
and geometry, and I'm going
to dive right into that.
There are already a
couple of vector math
and geometry libraries
on the platform.
There's all the features in
Accelerate, which can do just
about anything you want.
There's GLKit, SpriteKit,
SceneKit,
the physics library
that goes with them.
So if we're going to
do a whole new one,
we had better get
some things right.
So, my wish list of what a
library like this should look
like is kind of like this.
First off, we should have
inline implementations
of everything we possibly can.
Because when you're doing,
you know, a 4D dot product
or something, there's
not a lot of arithmetic.
It's just four multiplies
and three adds.
So having to actually make
an external function call,
a jump, is not what you want
to do when you're only going to
do seven arithmetic operations.
And because of this,
essentially everything
in SIMD is header inlines.
So it just gets inserted
into your code.
We give you a really
nice performance.
Next, we should have
concise functions
that don't have a lot
of extra parameters.
If you want to do a dot product,
3D dot product using
BLAS, it looks like this.
You've got all these
extra parameters.
We don't think you
should need to write this.
If you're going to use it using
GLK, which is a great library;
I love GLK, but the
compiler should know that x
and y are three-dimensional
vectors.
You shouldn't need to tell it
that in every function you call.
With SIMD, you just write
this: vector_dot(x, y).
Functions overloaded to support
all the different vector types
that we have.
It just works.
It inserts the correct
implementation into your code.
You get great performance.
If you're writing C++, then
we have even shorter names
under the SIMD namespace.
And these look just like Metal.
So you can take Metal code,
add the using namespace SIMD,
and a lot of it will just
work using SIMD headers.
This is really convenient when
you're writing your own code.
The last feature that
I think is important is
that arithmetic should
use operators.
So if you want to average two
vectors, rather than needing
to write this, you
should just be able
to write 0.5 * (x + y).
Now you have the
average of two vectors.
This is a lot easier to write.
It's a lot easier to read.
It makes your code more natural.
Alright, so let's dive
into what's actually available
here and what we're doing.
First, the basic types.
We have a lot of vector
types available in SIMD.
But the ones that you're
going to use most often
when you're doing vector math
and geometry are the 2, 3,
and 4 dimensional float vectors,
which are just vector_float2,
vector_float3, and
vector_float4.
If you're writing C++ code,
again we have the
names that match Metal.
They're in the SIMD namespace.
You can just say
float2, float3, float4.
And these are based on a clang
feature called extended vectors.
And that gives us a lot
of functionality for free
that made writing this
library really pleasant.
So first off, arithmetic on
vectors pretty much just works.
You can use all your
favorite arithmetic operators
on vectors and on scalars.
Everything is nice.
It makes your code easy to read.
And I'm going to show
you another example
of that right now.
So, a pretty basic function
for a graphics library
is a vector reflect.
So we take a vector x, and
we take a unit vector n.
That unit vector
determines the plane.
And we're going to reflect
x through that plane.
This is a really common
operation in graphics.
And there's a simple
mathematical expression
that gives the result.
Now, before we might
have had to have a lot
of verbose function calls
to compute this expression.
But with SIMD, it's
really simple.
We just write x minus
twice the dot product of x
and the normal vector,
times the normal vector.
This is just as simple
as the mathematics is.
It makes your code, again,
really easy to write,
really easy to read.
I think it's much nicer.
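Here's a hedged sketch of that reflection as a small C function using the simd.h names; n is assumed to be unit length.

    #include <simd/simd.h>

    static vector_float3 reflect_through_plane(vector_float3 x, vector_float3 n)
    {
        // x minus twice the dot product of x and n, times n.
        return x - 2.0f * vector_dot(x, n) * n;
    }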
There are a bunch of other
features that we get
with these vectors
without needing
to call any functions
or do anything.
We get access to vector elements
and subvectors really easily.
Array subscripting just works.
If you want to pull out
the second element of a vector,
you just subscript just like
you would if it were an array.
Named subvectors just work.
So if you have a vector of 4
floats, you can get the low half
of it, the first two elements
by just using the name
of the vector dot low.
The high half is just dot high.
You can get the even elements.
You can get the odd elements.
And I should point out
that these subvectors
and elements, they're lvalues.
So you can assign to them as
well as reading from them.
And this is real useful
when you're writing
your own vector code,
especially if you're doing
perspective coordinates
or something like that.
A lot of times you need
to just set some value
in the fourth coordinate
for example.
This is really nice.
If you go totally hog wild with
this, it will make it harder
for the compiler to
generate great code for you.
But used sparingly, this is
really a powerful feature.
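A short hedged sketch of that element and subvector access (the written accessors are the clang extended-vector names .lo, .hi, .even and .odd):

    vector_float4 v = { 1.0f, 2.0f, 3.0f, 4.0f };

    float second       = v[1];      // array subscripting just works
    vector_float2 low  = v.lo;      // first two elements
    vector_float2 high = v.hi;      // last two elements
    vector_float2 odd  = v.odd;     // elements 1 and 3

    v[3] = 1.0f;                    // elements and subvectors are lvalues too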
So that's about what you get
for free with the types.
Now we also give you
lots of functions
that give you the
operations that you want.
We have three headers that
have tons of stuff that comes
up all the time for
math and geometry.
Math, common and geometry.
In C and Objective-C, those
functions look like this.
Notice the math functions look
just like the math functions
that you use for scalars.
They're overloaded, so now they
work for floats, for doubles,
for vectors of floats, for all
our other floating point vector
types, just works.
You want the square
root of a vector?
Just call square root.
Everything is there.
The common functions
you may be familiar
with if you've written
shader code before
or if you've done a lot
of graphics programming.
These are operations
that are really useful
when you're dealing with
coordinates or colors.
If you haven't done
a lot of that before,
they may be new to you.
But don't worry about that.
They're easy to understand and
there's a lot of documentation
for them in the headers.
And then there's the
geometry functions as well.
Now in C++ and Metal, again we
have shorter names available
in C++.
These are under the
SIMD namespace.
And these exactly match
the Metal functionality.
So again, this makes it really
easy to move code between C,
C++ and Objective-C and
Metal when you need to.
Now I want to call out that some
of these functions
come in two variants.
There's a precise version.
And there's a fast version.
Now precise is the default
because if you don't know
which one you need,
it's better to be safe
and give you the most
accurate one we have.
But, there is also
a fast version.
If you compile with ffast-math,
then you get the
fast ones by default.
The fast ones just may
not be totally accurate
to the last bit.
We give you about half the
bits in a floating point number
with the fast variants.
Now even if you compile
with ffast-math,
you can still call the precise
ones individually when you need
to by just introducing
precise into the name.
And similarly vice-versa.
If you don't have
ffast-math specified,
you can always call
the fast variant.
And in C++ we do
this with namespaces.
There's a sub-namespace
called fast
and a sub-namespace
called precise that you use
so that you can just override
the defaults really easily.
Now last, when we talk about
vector math and geometry,
wouldn't really be complete
if we didn't have matrices.
So we have a set
of matrix types,
which are matrix_floatNxM.
This could be 2, 3, or 4, and
they don't need to be square.
You can have a 4 x 2
matrix or a 2 x 3 matrix.
I want to point out that N
is the number of columns.
M is the number of rows.
If you're a mathematician this
may be a little strange to you.
But 2 x 3 matrix has two columns
and three rows instead
of vice-versa.
But if you come from
a graphics background,
this is very natural.
This follows the precedent
that Metal and open CL and DX
and GLSL and all of these
libraries have always used.
So that's why we do it.
There are lots of operations
available on matrices as well.
You don't get the operators
for free in C and Objective-C.
Sorry. So you do have to
make some function calls.
But we have a nice set of
functions to create matrices.
We have a nice set of
functions to operate on matrices,
and on matrices and vectors.
This is just sort of
the broad overview.
We have some other
stuff as well.
In C++ you get operator
overloading.
So you can add and
subtract, multiply by scalars,
multiply matrices and vectors.
We have some nice
constructors that make it easier
to create these objects.
It's really nice to work with.
Really easy to write
your vector code.
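A hedged sketch of the matrix types and a couple of the C functions (in C++ the simd namespace gives you operators instead); the identity-like values here are just illustrative.

    #include <simd/simd.h>

    vector_float4 c0 = { 1, 0, 0, 0 };
    vector_float4 c1 = { 0, 1, 0, 0 };
    vector_float4 c2 = { 0, 0, 1, 0 };
    vector_float4 c3 = { 0, 0, 0, 1 };

    matrix_float4x4 M = { .columns = { c0, c1, c2, c3 } };  // stored as columns

    vector_float4   v  = { 1, 2, 3, 1 };
    vector_float4   Mv = matrix_multiply(M, v);   // matrix * vector
    matrix_float4x4 MM = matrix_multiply(M, M);   // matrix * matrix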
So that's, that's sort of the
vector math and geometry story.
And now I want to
talk a little bit
about writing your own SIMD
code using the library.
So we also have lots
of other types.
I mentioned this
at the beginning.
The vector float types are
just a few of them.
We also have vectors of doubles.
Vectors of signed and
unsigned integers.
We've got 8 bit, 16 bit,
32 bit and 64 bit integers.
We support longer vector
types, 8, 16 and 32 elements.
This is really useful to write
just a little bit of code
and have the compiler
effectively unroll your loops
for you.
We also have unaligned
vector support.
All of the normal vector
types are aligned by default,
which is great when
you're doing geometry
because you're not
usually getting the data
from somewhere else.
You're, you know, we
just want to align it.
We want to give you the
best performance you can.
However, when you're
writing your own vector code,
usually you're operating
on data buffers
that came in from somewhere.
And those buffers
may not be aligned.
So we also provide unaligned
types for you to work with.
And I'll show you an example
of that a little bit later.
Now, just like the floating
point vectors I showed you,
you get lots of operators
for free.
You get the normal
arithmetic operators.
These just work.
You also get the
bitwise operators.
Those just work on vectors.
They work with vectors
and scalars
so you can shift every
element right by 3,
by just writing: vector >> 3.
We also have a big set of
conversion functions for you.
These let you convert from
one vector type to another.
I want to point out
that you should use the
conversion functions.
Don't cast vectors, because it
almost surely doesn't do what
you want it to do.
When you cast vectors,
it reinterprets the data
in the vector as the other type.
This means that you can't
even cast, say, a vector
of 4 16-bit integers into a
vector of 4 32-bit integers
because they have
different sizes.
So rather than casting them,
call the conversion functions,
which will convert one vector
type to another vector type
for you, give you
the right behavior.
You also get comparisons.
So comparisons just
work on vectors.
It's a little bit strange though
because I can't really
talk meaningfully
about one vector being less
than another vector, right.
That doesn't make
sense geometrically.
So comparisons produce
a vector of results.
Where each lane of the result
vector is minus 1, that's all 1s
if the comparison is
true in that lane.
And it's zeros if the comparison
is false in that lane.
I'll show you an example.
Here's a vector of 4 floats.
Compare it against
another vector of 4 floats,
we'll see if x is less than y.
So, in the first lane,
zero is not less than zero,
the comparison is false.
We'll get a result of zero.
Now 1 is less than 3.14159,
so the result is all 1s.
So 2 is not less than minus
infinity, 3 is less than 42.
Now I just went through
this, but it's going to turn
out this doesn't matter a
lot to you most of the time
because almost always when you
do a comparison, you're going
to consume the result of
that comparison with one
of three operations: vector_any,
vector_all, and vector_bitselect.
vector_any is true if
the comparison is true
in any lane of the vector.
vector_all is true if it's true
in every lane of the vector.
And bitselect lets you
select between the elements
of two vectors based on some
result of the comparison.
So most of the time, these
give you the functionalities
that you really want
from comparisons anyway.
You don't need to worry about
the nitty-gritty details
of what the type
of the result is.
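A hedged sketch of consuming a comparison with those three operations, using the values from the example above:

    #include <simd/simd.h>
    #include <math.h>

    vector_float4 x = { 0.0f, 1.0f, 2.0f, 3.0f };
    vector_float4 y = { 0.0f, 3.14159f, -INFINITY, 42.0f };

    vector_int4 lt = x < y;                 // per lane: 0 (false) or -1, all 1s (true)

    if (vector_any(lt)) { /* at least one lane compared less-than */ }
    if (vector_all(lt)) { /* every lane compared less-than */ }

    // Pick y's element where the comparison is true, x's where it is false.
    vector_float4 m = vector_bitselect(x, y, lt);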
So now I'm going to show
you an example of using this
to write your own vector code.
I'm going to choose an
example that's something
that we normally don't really
think about vectorizing.
It's not that hard to
vectorize, but it's something
that you know, is
outside the realm of sort
of floating point computations
that we normally think of.
We're going to look
at string copy.
So here's a simple
scalar implementation
of string copy that's
sort of right out of K&R.
And all we do is we iterate
through the bytes of the source
and we copy them
to the destination.
And when we reach a byte
that's zero, we stop copying.
That's it.
Complete implementation
right there.
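That scalar loop, essentially straight out of K&R, looks like this:

    void simple_strcpy(char *dst, const char *src)
    {
        // Copy bytes until we have copied the terminating zero.
        while ((*dst++ = *src++))
            ;
    }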
Now, as I said, this isn't
too hard to vectorize.
Here's sort of a typical
SSE intrinsic implementation
string copy.
I haven't pulled out all
the crazy stops here,
but this is sort of a
reasonable implementation.
And this is fine.
It wasn't too hard to write.
It's a little bit ugly.
I find it kind of
a pain to read.
The big problem with this is
that this works for 32-bit
and 64-bit Intel,
but we might want
to now run our code on ARM.
We would have to either write
a whole new implementation,
or just fall back
on the scalar code.
So we want to give you the tools
to write fast implementations
that you can run on
all of our platforms.
Here's what a SIMD
implementation
of string copy looks like.
First off, it's a
little bit shorter
than the SSE intrinsic version.
And I think it's a
little bit cleaner.
I'm going to walk
you through it.
The first part here we
just go byte by byte
until the source has
16 byte alignment.
That's going to enable
us to use aligned loads
from the source from
that point on.
I'm not going to get too
much into the nitty-gritty
of the details of why
it's important to do this.
But when you're dealing with
implicit length data objects
like strings, you do really need
to align your source buffers.
Having aligned our source
buffer, now I'm just going
to cast the source and
destination pointers
into pointers to vectors.
And you notice I used two
different vector types here.
Remember I aligned
the source buffer.
So it's a vector_char16.
That's aligned; it
has 16-byte alignment.
The destination vector is
not necessarily aligned.
There's no guarantee that
by aligning the source
that the destination is aligned.
So instead, I'm going to
use this packed_char16 type
which is an unaligned vector
type for the destination.
So now that I've set up
my types, the actual meat
of the copy is really
just these two lines.
All we do is load the vector
from the source,
compare it to zero.
If any lane is equal to zero,
then we stop the copy, right.
So if no lane of the
vector is zero, we continue.
But as soon as a lane
is zero, we're done.
So, then in the copy, we just
copy that vector from the source
to the destination and
advance both the pointers
to the next vector.
Really simple.
And then finally, if we did
find a zero, if we found the end
of the string in the
next 16 bytes, well,
let's just copy it
byte by byte from there
until we reach the end.
This is a really
simple implementation.
It's not the best implementation
that's possible to write,
but it was really
easy and it's going
to give us a nice
performance win.
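Here's a hedged sketch of that SIMD version, reconstructed from the walkthrough above; vector_char16 is the aligned vector type, packed_char16 the unaligned one.

    #include <simd/simd.h>
    #include <stdint.h>

    void vector_strcpy(char *dst, const char *src)
    {
        // Go byte by byte until the source has 16-byte alignment, so we can
        // use aligned loads from that point on.
        while ((uintptr_t)src % 16) {
            if ((*dst++ = *src++) == 0) return;
        }

        // The source is now aligned; the destination may not be, so it uses
        // the unaligned packed type.
        const vector_char16 *vsrc = (const vector_char16 *)src;
        packed_char16       *vdst = (packed_char16 *)dst;

        // Copy a vector at a time until the loaded vector contains a zero byte.
        while (!vector_any(*vsrc == 0))
            *vdst++ = *vsrc++;

        // The terminating zero is somewhere in the next 16 bytes: finish by bytes.
        src = (const char *)vsrc;
        dst = (char *)vdst;
        while ((*dst++ = *src++))
            ;
    }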
So let's look at that.
We're going to look at
the performance measured
in bytes per nanosecond for
a variety of string lengths.
Now this bytes per nanoseconds
is how fast we're copying
so that the more data we copy,
the better off we're doing.
Higher is better on this graph.
We start with that
scalar code we had.
And you can see we get up to
about half a byte per
nanosecond, which is,
that's still 500
megabytes a second.
We're moving a lot of data.
But we're going to
do a lot better.
Let's look at that
SIMD implementation
that you recall is just
a few lines of code.
It's almost ten times faster.
And, as I said, it's
possible to do better
if you really pull
out the stops.
Here's the performance that we
get from libc on the platform,
which is also a vectorized
implementation,
and it does some
really clever things
with fetching and alignment
to get more performance.
But you notice, we're getting
most of the performance of libc.
We got nearly a 10X win.
We're within 80 percent
of the performance
of libc for long vectors.
And we got that with
just a few lines of code
that were really easy to write.
The libc implementation
is in assembly.
We wrote four lines
of C basically
to get the performance
we see here.
And that's really what
our message for today is,
that Accelerate has always given
you really fast vector code.
And what we're doing now is
trying to make it even simpler
for you to get at that.
To make it so that more
developers can easily take
advantage of the performance
that the hardware offers.
Now I want to note
that LinearAlgebra
and SIMD are both
brand new libraries.
They do have some rough edges.
I'm sure you'll try to do things
that we haven't thought of.
But that also means that you
can tell us what use cases are
really important to you, and
we'll take that into account.
You can have an enormous
impact on the way
that these libraries develop.
If you want more information
about Accelerate or SIMD,
we have two great contacts,
Paul Danbold and George Warner.
There is a bunch of
documentation available
for vImage and vDSP online.
I would also recommend looking
at the headers in Accelerate
if you need documentation.
vImage, vDSP, LinearAlgebra
and SIMD all have lots and lots
of documentation in the headers.
It's a fantastic resource
if you want to know more
about how things work.
The developer forums
are a pretty good place
to ask for help with things.
If you're going to
file any comments
in the developer forums,
the place to file
them is under Core OS.
That's where you're most likely
to get the attention
that you want.
And the Bug Reporter is also
a great way to report issues,
make feature requests.
You don't need to
use this only when there's a bug
in the conventional sense.
You can say hey, it would
be great if, you know,
I could do this thing that's
a little bit different
from what you're doing.
Or make an entirely
new feature request.
Or say I tried to do this, and the
performance wasn't quite as good
as I thought it should be.
Those are absolutely
bugs, and those are things
that we want to look at.
So file bugs early and often.
We love to get them, and we love
to get feature requests
from you guys.
A ton of the stuff
that we've done
in the past few years
has been motivated
by feature requests we got
from external developers.
There are some related sessions
that are worth checking out.
If you're here, you're almost
certainly going to be interested
in the Metal sessions
that are tomorrow morning.
Those are a great
thing to check out.
Thanks a lot for coming by guys.