Transcript
>> Thank you.
And welcome to the Accelerate
Session.
My name is Eric Bainville.
I'm with the CoreOS Vector and
Numerics Group.
Our group is maintaining the
Accelerate framework.
And this is our agenda for
today.
So first I will introduce
Accelerate, what's inside, how
to use it, and through a few
examples I'll show you why you
want to use that.
And then we focus on new
improvements and additions we
have this year, first with
Compression, lossless data
compression library, then BNNS,
Basic Neural Network
Subroutines.
And after that my colleague
Steve will come on stage and
tell us what's new in simd.
And finally, our very own
Jonathan Hogg will come on stage
and introduce the sparse matrix package.
It's the first time a commercial OS ships with a sparse matrix solver package.
But let's start with Accelerate.
Accelerate is a low-level system
framework dedicated to high-performance primitives on the CPU.
Actually, it's everywhere when
you have a significant workload
to run on the CPU.
Inside Accelerate there's a lot
of libraries, like vImage for
image processing, the Swiss Army knife of image processing.
We also have vDSP inside for DFTs, FFTs, and signal processing, and vForce for vector functions.
And then we have a whole lot of
linear algebra libraries for
dense and sparse matrices.
So that's BLAS and LAPACK for
dense vectors and matrices.
And Sparse BLAS and Sparse
Solvers for sparse vectors and
matrices.
Last year we also introduced
BNNS, Basic Neural Network
Subroutines, which is also
[inaudible] for neural networks.
This is used by, for example,
Core ML and also Vision and NLP
frameworks.
Slightly outside Accelerate but
still maintained by us, we have
simd, which is a set of headers and functions allowing you to
directly talk to the CPU vector
units.
And finally, Compression, a
lossless data compression
library.
How do you use Accelerate?
Well, you import Accelerate in Swift, or you include the Accelerate header in C, and then link with the Accelerate framework in your project.
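In code that is a single line; in Swift it is simply import Accelerate, and a minimal C sketch looks like this:

    // Pulls in every Accelerate sub-library: vImage, vDSP, BLAS, LAPACK, BNNS, and so on.
    #include <Accelerate/Accelerate.h>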
Now, I will show you a few
examples about why you want to
use Accelerate.
Let's start with that one.
So let's say you have this
vector X of floating point
values and you want to scale
them.
So you have this scale.
And so you multiply each value
and store it in the Y vector.
So it's a very simple loop.
That's perfectly fine.
But actually, there is a
function for you in Accelerate
doing the same thing, it's
called vsmul.
So that's one single line
replacing your code.
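The slide itself is not captured in the transcript, but a minimal sketch of the two versions in C, with illustrative names for the arrays and the length, might look like this:

    #include <Accelerate/Accelerate.h>

    // The hand-written version: y[i] = scale * x[i].
    void scale_loop(const float *x, float *y, float scale, vDSP_Length n) {
        for (vDSP_Length i = 0; i < n; i++) {
            y[i] = scale * x[i];
        }
    }

    // The same operation as a single vDSP call, tuned for every supported CPU.
    void scale_vdsp(const float *x, float *y, float scale, vDSP_Length n) {
        vDSP_vsmul(x, 1, &scale, y, 1, n);
    }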
And the good thing is that we
optimize it for you on all the
supported hardware.
And of course, we maintain it
for you so you don't have your
loop to maintain again.
And, of course, it's faster, as
Accelerate says.
So that's a reference speed and
energy consumption for the loop.
And this is what you get with
Accelerate.
Six times faster.
[ Applause ]
And six times less energy
consumed, which is very
important.
Let me give you another one.
So this time we still have this
array X, and we want to clip the
values between the low and high
bound and store that in Y.
Again, you could go with this
code.
That's perfectly fine.
That's good.
But we have a function for you
in vDSP; it's called vclip.
Does the same thing.
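Again, the slide is not in the transcript; a rough C sketch of the call, with illustrative names, would be:

    #include <Accelerate/Accelerate.h>

    // Clip every element of x into [low, high] and store the result in y.
    void clip_vdsp(const float *x, float *y, float low, float high, vDSP_Length n) {
        vDSP_vclip(x, 1, &low, &high, y, 1, n);
    }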
And, again, we maintain it for
you.
We optimize it for you.
And, of course, it's faster.
That's a reference for the loop.
And this is what you get with
Accelerate, eight times faster
and also eight times less energy
consumed.
Another one, matrices.
So let's say you have two
matrices A and B, and you want
to compute the product of A and
B and add the result into the C
matrix.
It doesn't sound simple.
But actually, it's very simple.
That's just three lines of code.
You can see that.
And we have, of course, a
function for you in Accelerate
doing that; it's called sgemm, from the CBLAS interface.
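As a sketch, a multiply-and-accumulate C = A*B + C in C might look like this; the row-major layout and the dimension names m, n, and k are illustrative:

    #include <Accelerate/Accelerate.h>

    // C = A*B + C, where A is m x k, B is k x n, and C is m x n, all row-major.
    // alpha = beta = 1 gives the "multiply and add into C" operation described above.
    void multiply_add(const float *A, const float *B, float *C, int m, int n, int k) {
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0f, A, k,
                          B, n,
                    1.0f, C, n);
    }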
And really, really, you never want to write code computing matrix-vector products or matrix-matrix products or anything related to matrices.
You don't want to write that code ever.
You want to call BLAS, LAPACK,
or anything in Accelerate.
Why? Well, first because we
maintain that for you, and it
will be optimized and
multithreaded on all the
architectures we support.
And this time -- this is a
reference for the loop -- this
is what you get with Accelerate.
I'm not sure you can see it.
Yeah. That's 100 times faster
and 26 times more energy
efficient.
That's your battery here.
Okay. One more example, this
time in vImage.
So let's say you have a 32-bit
per pixel image with four
components per pixel.
That's alpha, red, green, and
blue.
And you want to apply a 4 by 4
transformation matrix to every
pixel in the image.
You could write the code.
We could put it here, but that would be too long.
But really, you don't want to
write this code.
We have a function for you in
vImage, MatrixMultiply -- that's
one of the most used functions
in vImage -- doing exactly that
and optimized to be as fast as it can on all the architectures we support.
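The exact call from the slide isn't shown here; as an illustration, applying a 4 by 4 matrix to an ARGB8888 image with vImage looks roughly like this, where the identity matrix and the divisor of 256 are just example values:

    #include <Accelerate/Accelerate.h>

    // Apply a 4x4 matrix to every ARGB8888 pixel of src, writing to dest.
    // The matrix entries are int16 values scaled by the divisor, so 256 on
    // the diagonal with a divisor of 256 is the identity transform.
    vImage_Error apply_matrix(const vImage_Buffer *src, const vImage_Buffer *dest) {
        const int16_t matrix[16] = {
            256,   0,   0,   0,    // alpha
              0, 256,   0,   0,    // red
              0,   0, 256,   0,    // green
              0,   0,   0, 256     // blue
        };
        return vImageMatrixMultiply_ARGB8888(src, dest, matrix, 256,
                                             NULL, NULL, kvImageNoFlags);
    }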
Last one, this is the convolution layer.
That's the workhorse of neural networks, the convolutional neural networks.
So this layer takes an input stack -- that's the thing on the left -- that's a stack of images.
And it will produce an output image stack, the blue thing.
And each pixel in the output is the result of a 3D convolution over the entire input stack.
And we do that for every pixel of every output image.
So at the end, that's a
six-dimensional loop.
And you really don't want to
write this loop.
And even when the dimensions are
small, you're multiplying all of
them together.
So that's millions or even billions of floating point operations here.
Of course, we have a function
for you this time in BNNS doing
that.
And when you run a Core ML
model, you will spend, like, 80%
of the time in this function.
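The transcript doesn't show the call; a rough sketch of creating a convolution filter with the BNNS C API, where all the sizes and the weights pointer are illustrative, might look like this:

    #include <Accelerate/Accelerate.h>

    // A 3x3 convolution from a 64x64x32 input stack to a 64x64x64 output stack.
    BNNSFilter make_convolution(const float *weights) {
        BNNSImageStackDescriptor in_desc = {
            .width = 64, .height = 64, .channels = 32,
            .row_stride = 64, .image_stride = 64 * 64,
            .data_type = BNNSDataTypeFloat32
        };
        BNNSImageStackDescriptor out_desc = {
            .width = 64, .height = 64, .channels = 64,
            .row_stride = 64, .image_stride = 64 * 64,
            .data_type = BNNSDataTypeFloat32
        };
        BNNSConvolutionLayerParameters conv = {
            .k_width = 3, .k_height = 3,
            .in_channels = 32, .out_channels = 64,
            .x_stride = 1, .y_stride = 1,
            .x_padding = 1, .y_padding = 1,
            .weights = { .data = weights, .data_type = BNNSDataTypeFloat32 },
            .activation = { .function = BNNSActivationFunctionRectifiedLinear }
        };
        // Returns a filter; run it with BNNSFilterApply(filter, input, output)
        // and release it with BNNSFilterDestroy(filter).
        return BNNSFilterCreateConvolutionLayer(&in_desc, &out_desc, &conv, NULL);
    }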
All right.
So that was a few examples.
I could continue almost forever
because we have more than 2,800 APIs in Accelerate.
So usually there will be a function for you inside.
And every time you use an
Accelerate function, the
benefits are that's less code
for you to write; we maintain it
for you; of course, it will be
faster and more energy
efficient; and it will be
optimal -- as close to optimal
as possible on all the
architectures we support,
including the new ones.
So when we release new hardware,
you will get your code running
as fast as it can from day one.
Okay. So that was it for
Accelerate.
Now let's focus on Compression.
Compression is a lossless data
compression library.
It's a very simple API with a
few select compressors inside.
So they are represented on this
little graph here.
On the x-axis you see the
relative compression ratio
compared to ZLIB.
ZLIB is in the center.
And on the y-axis is the compression speed.
Note that this axis is exponential.
So inside the compression
library we have a selection of
compressors, as I said.
We offer LZMA for better
compression, optimized versions
of LZ4 for fast compression.
Of course we have ZLIB with the
optimized ZLIB decoder, and our
very own LZFSE, which compresses
slightly more than ZLIB but much
faster.
And last year we open sourced
LZFSE; it's on GitHub.
Okay. The API now.
There are two APIs in Compression.
The first one is a buffer API.
So that's when you have the
entire data to compress.
And so you have a buffer with
the data to compress, you just
call one function, provide the
output buffer, and you will get the output in one single call.
That's good for encode and
decode.
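A minimal sketch of the buffer API in C; LZFSE is just one of the algorithms, and the buffer names are illustrative:

    #include <compression.h>

    // Compress src_size bytes from src into dst, returning the compressed size,
    // or 0 on failure. Passing NULL for the scratch buffer lets the library
    // allocate its own temporary storage.
    size_t compress_lzfse(const uint8_t *src, size_t src_size,
                          uint8_t *dst, size_t dst_capacity) {
        return compression_encode_buffer(dst, dst_capacity,
                                         src, src_size,
                                         NULL, COMPRESSION_LZFSE);
    }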
And if the data is huge or you
get it in small pieces, you want
to use a stream API.
In that case you will create a
stream object and send data
inside and get smaller data
outside.
And you will call that
repeatedly until the entire
stream is processed.
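As a sketch, one step of the stream API in C might look like this, assuming the stream was set up earlier with compression_stream_init and will be torn down with compression_stream_destroy:

    #include <compression.h>
    #include <stdbool.h>

    // Push one chunk of input through an encode stream and report how many
    // output bytes were produced. Pass COMPRESSION_STREAM_FINALIZE with the
    // last chunk so the encoder can flush; the caller keeps calling until it
    // sees COMPRESSION_STATUS_END.
    size_t stream_one_chunk(compression_stream *stream,
                            const uint8_t *chunk, size_t chunk_size,
                            uint8_t *out, size_t out_capacity, bool last_chunk) {
        stream->src_ptr  = chunk;
        stream->src_size = chunk_size;
        stream->dst_ptr  = out;
        stream->dst_size = out_capacity;

        compression_status status = compression_stream_process(stream,
                last_chunk ? COMPRESSION_STREAM_FINALIZE : 0);
        if (status == COMPRESSION_STATUS_ERROR) { return 0; }

        return out_capacity - stream->dst_size;   // output bytes ready in `out`
    }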
And what's new this year is we added a compression tool for the command line.
So you can compress with
Compression from the command
line.
Okay. That's for Compression.
Now let's switch to BNNS, Basic
Neural Network Subroutines.
As I said, this is the engine running on the CPU supporting all the neural network and machine learning libraries.
So you use BNNS almost all the time.
When you type on the keyboard or
when you run face recognition,
all these applications use BNNS.
And BNNS provides lower-level
functions for this.
So that's convolutional layers,
pooling layers.
We also have fully connected
layers.
And this year, we added separate activation layers doing just the activation function.
And these layers can also do efficient data type conversions.
Speaking of which, data types.
So when you train a machine
learning model, what you get is
megabytes or hundreds of
megabytes of data, usually
32-bit floating point data for
your model.
That's the convolution weights,
etc. It turns out you can
convert these guys into smaller
types, like 16-bit floating
point or even 8-bit integer
signed or unsigned and still get
the same precision for your
model.
But, of course, when you convert
32-bit floating point to 8-bit,
your model is four times
smaller.
And that's something you ship
with your app.
So you want to consider that.
This year we optimized BNNS to
support these types.
So this is what we support now,
optimized in the convolutional
layers.
The green stuff is new.
See, we added fp16 storage for
input and weights and also int8.
And this is what we support for
the fully connected layers.
So we're still accumulating to
32 bits, but we can take 16-bit
or even 8-bit inputs and
weights.
Now, for the activation
functions, this is what we had
last year.
And this year we added a few
more, including the most
requested, Softmax.
So now we have an optimized
Softmax in BNNS.
And if you set the activation
function to identity, then you
can change the input and output
types to different combinations.
This is what we support now.
And you will get optimized type
conversion from BNNS.
Last but not least, performance.
So we worked a lot with the Core
ML team and the Vision and NLP
teams to optimize what they
really use a lot.
So that includes convolutions with padding, stride 1 and stride 2 convolutions, and smaller kernels.
Because the new neural networks
have a lot of layers with
smaller convolutions -- that's 3
by 3 and 1 by 1.
So we optimized these cases.
And especially for the 3 by 3
case, we have Winograd
convolutions, which can be up to
four times faster than the
reference implementation.
And that's it for BNNS.
So let me invite Steve on stage,
and he will tell us everything
about simd.
Thank you.
[ Applause ]
>> Thanks very much, Eric.
Thank you, everyone.
As Eric said, my name's Steve.
And today I'm going to talk to
you a little bit about the simd
module.
I don't think I'll get to cover
everything, but we'll cover
some.
So the simd module lives outside
of Accelerate.
It's a small collection of headers that you include.
And it's a module you import in
Swift.
And there's sort of three main
use cases that are going to
drive you to use simd.
The first is if you're doing 2
by 2, 3 by 3, 4 by 4 vector and
matrix arithmetic, the sort of
thing that comes up all the time
when doing graphics or geometry
operations.
It also provides a bigger set of vector types, both integer vectors and floating point vectors, in lengths up to 64 bytes, for doing sort of general vector programming.
It lets you target all the
architectures we support on all
the platforms we support pretty
easily.
And you can write one piece of
code that gets you good vector
code gen for all of those
architectures.
The last reason you'll use simd
is that it's a great set of
types and operations for
interoperating between all of
the various things that do 3 by
3, 4 by 4 operations on the
platform.
So things like SceneKit,
SpriteKit, ARKit, Vision -- all
those different things you have
lots of matrices and vectors
flying around.
And the simd types are a great
set of things to use with that.
I should say that SpriteKit in
particular -- or SceneKit in
particular added a bunch of new
stuff this year.
So check out their session.
There's some nice stuff for
working with simd there.
I'm going to show you a few
examples of what you can do.
So let's say you want to
multiply a three-dimensional
matrix by a vector.
You can do that using BLAS, which Eric talked about earlier; it looks like this.
And this is fine, but BLAS takes
all the parameters just as raw
pointers.
So we have to tell it these are
the dimensions of the matrix,
these are the dimensions of the
vector, this is how the memory's
laid out.
There's a lot of other
information we have to pass,
which both makes the code a
little harder to write and it
makes it harder to read.
If you're not fluent with BLAS already, it's not really obvious that this last line here is doing a matrix-vector product.
So we'd like to have something
simpler than that.
We could also write this with
GLKit.
GLKit makes it considerably
nicer.
We have dedicated types for a three-dimensional matrix and for a three-dimensional vector; we call this GLKMatrix3MultiplyVector3 function.
It's pretty clear that's a
multiplication.
But we can make this even nicer
using simd.
This is what it looks like in
simd.
Okay? Totally explicit that it's a diagonal matrix.
And multiply a matrix by a
vector is just using the
multiplication operator.
It's really nice, it's really
simple, and it's also really
fast.
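The slide itself is in Swift, where the product is literally m * v; a rough equivalent in C with the same simd types, using illustrative values, is:

    #include <simd/simd.h>

    // Build a 3x3 matrix from its columns and multiply it by a vector.
    // In Swift this is just m * v; in C we call simd_mul.
    simd_float3 scale_point(simd_float3 v) {
        simd_float3x3 m = simd_matrix((simd_float3){2.0f, 0.0f, 0.0f},
                                      (simd_float3){0.0f, 2.0f, 0.0f},
                                      (simd_float3){0.0f, 0.0f, 2.0f});
        return simd_mul(m, v);
    }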
All of simd for the most part is
implemented as header inlines.
So this is just going to give me
three vector multiplications on
my CPU.
It's really fast, there's no
call overhead, there's no
parameter checking, there's no
nothing.
I get just nice simple code gen
from this.
So that's a Swift example.
The next example that I'm going
to show you will be in C.
Here I'm going to show you an
example of how you can use simd
to write vector code.
So let's say that we want to
compute a logistic curve with a
given midpoint and maximum
slope.
This is a function that comes up
all the time in sort of
computational mathematics,
especially machine learning.
So this is a useful function to
be able to optimize.
We care a lot about this.
And what I've put in the comment
in the body of the function here
is sort of a typical scalar
implementation, just
mathematically what this looks
like.
But we want to compute this on
16 floating point values
simultaneously because we can
get better efficiency that way.
So that's what this simd_float16 type in the function is --
that's just a vector of 16
floats.
And we're going to see if we can
implement the body of this
function.
So vector code is complicated.
I just break this apart into
three pieces so we can write
each one of them individually.
Let's start with this first
section, the linear section.
So we're just subtracting a
scalar from a vector.
We're going to subtract it from
every lane of the vector, and
then we're going to multiply by
a scalar.
What does that look like in
simd?
We just subtract and we
multiply.
It's really, really simple.
This works in C, it works in
Swift, it works in C++, it works
in Objective-C.
It's really nice.
It looks a lot like shader
programming if you've done any
of that.
And this last piece down here
where we take a reciprocal,
again, the same thing, we just
write code that looks like the
math.
It looks just like the scalar code, it looks just like the math we're doing.
What about this middle section?
That's a little bit more
complicated.
What we want to do here is for
every element in the vector, we
want to compute the exponential
function of that and put that in
the corresponding element of the result vector.
We can do that with a for loop.
And this is fine, this works.
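Putting the three pieces together, a sketch of the whole function in C, with the per-lane loop for the exponential as just described and illustrative parameter names, looks like this:

    #include <simd/simd.h>
    #include <math.h>

    // Logistic curve y = 1 / (1 + exp(-k * (x - x0))), on 16 floats at once.
    simd_float16 vector_logistic(simd_float16 x, float x0, float k) {
        // Linear part: the scalars broadcast across all 16 lanes.
        simd_float16 t = k * (x - x0);

        // Exponential part: one lane at a time, as a plain for loop.
        simd_float16 e;
        for (int i = 0; i < 16; i++) {
            e[i] = expf(-t[i]);
        }

        // Reciprocal part, written just like the scalar math.
        return 1.0f / (1.0f + e);
    }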
But we have a nice new feature
for you this year, which is that
basically all of the math
functions just work on vectors
of floats and doubles.
So any length of vector, float,
and double.
You can call exp, you can call sine, you can call cosine, whatever.
The math functions, they just
work on them.
It's a really nice convenience
feature when you're writing this
kind of code.
And this is going to target the
NEON extensions when I write for
ARM.
And when I compile for Intel,
it's going to target AVX and
SSE.
So I'm going to get fast code on
all the platforms.
This is really nice.
We have one other big new
feature this year that a lot of
people have asked for, which is
quaternions.
I'm going to give you a real
quick introduction to them.
Just that quaternions extend the
complex numbers in the same way
the complex numbers extend the
reals.
So complex number, you might
remember from school, has a real
part and an imaginary part.
Quaternions have a real part and
they have three imaginary parts;
sometimes you call that the
vector part.
My mom always said that if
having one square root of minus
one is good, having three square
roots of minus one must be
great.
She was a smart lady.
You should listen to her.
There's a lot of fascinating
mathematical structure about the
quaternions -- we don't really
care about that.
We're interested in one thing.
Quaternions have a notion of
length that's just like the
complex numbers.
You sum the squares of the
components and you take the
square root, that's the length
of a quaternion.
Quaternions of length one, we
call those unit quaternions, and
they have this really nice
property, which is that you can
use them to represent rotations
in three-dimensional space.
That's all we care about.
Forget about all the other math
that I just mentioned.
So I'll show you a quick code
example of that.
Say we have a vector that's a
vector in the XY plane.
And let's build a quaternion.
This is a quaternion that
represents a rotation around the
y-axis.
Quaternions act on vectors --
that's the simd act function.
It's not multiplication.
When you multiply a vector by a
matrix -- when you rotate a
vector by a matrix, you just
multiply it.
With quaternions it's called an
action.
And we don't care too much about
the details of that, but that's
why this is the act function.
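Roughly, in C, and with an illustrative vector and angle, that example looks like this:

    #include <simd/simd.h>
    #include <math.h>

    // A vector in the xy plane, rotated 90 degrees about the y axis.
    simd_float3 rotate_example(void) {
        simd_float3 v = simd_make_float3(1.0f, 1.0f, 0.0f);
        simd_quatf  q = simd_quaternion((float)M_PI_2,
                                        simd_make_float3(0.0f, 1.0f, 0.0f));
        return simd_act(q, v);   // rotate v by the quaternion q
    }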
So this is a nice way to
represent rotations.
There's a lot of other ways to
represent rotations.
Why do you want to use this one
and when do you want to use this
one?
You know, you might use matrices
instead, you might use Euler
angles or yaw/pitch/roll, you
can use axis and angle
representation.
There's lots of choices.
Quaternions are an especially
good choice for a certain class
of operations that I'm going to
tell you about.
The first nice thing about
quaternions is they take less
memory than matrices do.
This is nice and it's great, but
that's not really why we want to
use them.
The better thing about quaternions is not raw speed; they're not especially fast to do a rotation with, and you can see here they're about a third the speed of using matrices to actually compute a rotation.
But when you want to do a sequence of operations, when you want to combine rotations or interpolate rotations, anything like that, they're the most natural setting to do those operations in.
So when we want to multiply two
rotations together, you can see
quaternions are about 30% faster
than vectors and matrices.
But they also let us do things
that are hard to do with
matrices.
So let's say we want to
interpolate between two
different rotated coordinate
frames.
That's a little subtle with
matrices, but it's really
natural with quaternions.
And the reason it's natural has
to do with this sphere that I've
drawn over on the side of the
screen here.
You might say, "Well, why are
you drawing rotations on a
sphere?"
There's a good reason for that.
Quaternions you can think of as
points on the four-dimensional
projective sphere.
And that sounds complicated and
mathematical, and it is, but
it's the natural space of
rotations for three-dimensional
space.
So when I do operations on the
surface of this sphere, that
corresponds exactly to your
natural intuition for what
should happen with rotations.
So for instance, if we want to
interpolate between them, we
just interpolate along a great
circle on the sphere -- that's
this simd slerp function that
stands for spherical linear
interpolation.
And that's really easy to use
and it does exactly what you
want.
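A short sketch in C, with two illustrative orientations about the y axis:

    #include <simd/simd.h>
    #include <math.h>

    // Interpolate halfway between two orientations along the great circle.
    simd_quatf halfway_example(void) {
        simd_float3 y_axis = simd_make_float3(0.0f, 1.0f, 0.0f);
        simd_quatf q0 = simd_quaternion(0.0f,          y_axis);
        simd_quatf q1 = simd_quaternion((float)M_PI_2, y_axis);
        return simd_slerp(q0, q1, 0.5f);   // t runs from 0 to 1
    }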
If we have a whole sequence of
points that we want to
interpolate between, we could
call slerp repeatedly to
interpolate between them, but
that will have a noticeable jump
a little bit in the rotation
when you get to corners.
Instead we can use the simd
spline function to get a
completely smooth interpolation
using a whole sequence of
rotated coordinate frames.
There's a ton of other
operations like this in the simd
headers that you can use to do
these types of operations.
If you want to work with rotations, the API is really pretty complete.
I encourage you to check it out.
And as I said, this was one of
the most requested features from
developers.
So I encourage you to file bugs
to say, "Hey, can you guys add
this other thing as well?"
We're really responsive to that.
We love to get feature requests
from our users.
With that, I'm going to turn you
over to Jonathan Hogg, who's
going to tell you all about
really, really big matrices.
[ Applause ]
>> Hi. So you've heard from
Steve about simd, which is
really great for very, very
small matrices.
And I'm going to tell you in a
little while about very big
matrices.
But first I'm going to tell you
about BLAS and LAPACK.
These are libraries for handling
dense matrices.
So you can get up to 30,000 or 40,000 rows or columns here on a MacBook.
What do we have?
So BLAS stands for Basic Linear
Algebra Subroutines, and these
do basic operations on matrices
and vectors.
We have BLAS 1, which does vector-vector operations.
And then we move through BLAS 2 for matrix-vector, to BLAS 3 for matrix-matrix operations.
And you've already seen from
Eric that we can be 100 times
faster on the matrix matrix
multiply than your simple loop.
If you want to do things more
complicated than this, then we
have LAPACK.
These do your matrix factorizations, your linear solves, finding eigenvalues, eigenvectors, SVDs -- pretty much everything you want to do.
That's all I'm going to tell you
about dense matrices because we
want to actually talk about
sparse matrices.
What is a sparse matrix?
So James Wilkinson was one of
the founding fathers of
computational linear algebra.
This was his definition, "Sparse
matrix is any one where
exploiting zeros is useful to
us."
Let's see what a sparse matrix actually looks like.
So here are two sparse matrices.
One's actually the Cholesky
factorization of the other.
And each pixel here represents
multiple non-zeros.
Where they are white, all of the
entries behind that pixel are
zero.
Where there's blue, that means
at least one non-zero's present.
So you can see these matrices
are mostly empty.
In fact, if you were to store
this as a dense matrix, it's
about 30,000 by 30,000.
So it takes us 6.5 gigabytes.
If we store it as a sparse
matrix, we require 260 times
less storage -- only 26
megabytes.
If we wanted to multiply this
matrix by a vector, we'd require
almost 200 times fewer floating
point operations.
But if we want to do something
more complicated like factorize
this matrix, things get better,
at least in the floating point.
We require 2,000 times fewer
floating point operations to
factorize this matrix.
And the dense matrix is still
the same size, the factor's
filled in a bit.
It's slightly less sparse than
it was, so we're only 30 times
better on the storage.
Now, to drive that point home,
we've set up a little race.
We decided we'd run the sparse
solver on a Watch and we put it
up against our best dense matrix
solver from LAPACK on a MacBook Air.
And this is what happened.
Now, this is running at five
times real time.
And you can see that the Watch
is finished in only 16 seconds.
You got to remember while
watching this that the floating
point throughput on the MacBook
Air is about 50 times that
available on the Watch.
[ Laughter ]
Now, there's actually two phases
to this factorization in the
sparse world.
First we find where the positions of the non-zero entries are going to be -- that's the symbolic phase.
Then we have a numeric phase
which calculates the actual
values.
These times are just for the
numeric phase, but the symbolic
phase only takes about two
seconds on the Watch.
So rather than 16 seconds, you
could say it takes about 18
seconds.
But if you're doing more than
one factorization on the same
pattern, you can skip that
symbolic phase on the second and
third factorization.
Even if you include that, we're still more than 10 times faster than the MacBook doing the dense computation.
Hopefully I've now convinced you
that it's worth using sparse
matrices.
So let's tell you how to
actually define one.
So here is a very, very small
sparse matrix.
You'll notice it's missing
entries -- those are zero.
We are going to store this using
a standard format called
compressed sparse column, which
uses these three arrays.
And we'll start off with the rowIndices array.
So let's put the row numbers
onto our matrix.
And we'll just copy those up
into the rowIndices array.
And let's do something similar
for the values.
You can see we got a one-to-one
correspondence here.
That first entry is in row zero
and has value two.
So let's just put the positions of those entries onto the rowIndices and values arrays.
And the trick to compressed
sparse column is that all these
values have to occur in order of
increasing column.
All entries in column zero
actually occur before those from
column one.
And then we get to the trick to
this format, which is we're only
going to store the position of
the first entry in each column
and one additional piece of
information, the total number of
entries in the matrix.
That means that we know how long
that last column is.
If you're already using a sparse solver, you probably already have your data in either this format or a coordinate format, and we provide a converter.
To use it in Accelerate, we need
to wrap it in some metadata,
just telling it how many rows,
how many columns.
And we're going to say this one
is an ordinary sparse matrix.
I've got an example of an
unordinary sparse matrix in a
couple of slides.
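The matrix from the slide isn't reproduced in the transcript, so here is a hypothetical 3 by 3 example in C showing the same three arrays and the metadata wrapper:

    #include <Accelerate/Accelerate.h>

    // A hypothetical sparse matrix in compressed sparse column form:
    //
    //     [  2.0    .    1.0 ]
    //     [ -0.2   3.2    .  ]
    //     [   .   -2.5   0.5 ]
    //
    // Entries are listed column by column; columnStarts records where each
    // column begins, plus the total number of entries at the end.
    static int    rowIndices[]   = { 0, 1,      1, 2,      0, 2 };
    static double values[]       = { 2.0, -0.2, 3.2, -2.5, 1.0, 0.5 };
    static long   columnStarts[] = { 0, 2, 4, 6 };

    SparseMatrix_Double make_matrix(void) {
        SparseMatrix_Double A = {
            .structure = {
                .rowCount     = 3,
                .columnCount  = 3,
                .columnStarts = columnStarts,
                .rowIndices   = rowIndices,
                .attributes   = { .kind = SparseOrdinary },
                .blockSize    = 1
            },
            .data = values
        };
        return A;
    }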
Now I've got my sparse matrix,
what can I do with it?
So you can do pretty much
anything you'd expect.
You can multiply a dense vector
or a dense matrix by it; you can
add two sparse matrices or
sparse vectors together; you can
permute the rows or columns; or
you can find various useful
matrix norms.
All that functionality is
provided by the Sparse BLAS,
which we introduced a few years
ago.
So what's new this time?
The ability to solve sparse
systems, that is, given a matrix
equation A times X equals B
where we know the matrix A and the right-hand side B, find that vector of unknowns X.
So we got two approaches to this
for you.
The first is matrix
factorization.
This is exactly what happens in
LAPACK.
It's simple, it's accurate, it's
easy to use.
But mathematicians being
mathematicians, they came up
with a more complicated way of
doing things.
That's iterative methods.
And I'll tell you a bit more
about those later.
So now a matrix factorization.
For those of you who haven't
come across this before, we want
to take our green matrix on the left here and factorize it into the product of two triangular matrices on the right.
That's because we know how to
solve a system with a triangular
matrix very well.
If we're not square, we have to
do things slightly differently;
we have to pick a rectangular
and orthogonal factor here.
And this is your QR
factorization if you've heard of
that before.
So let's see how to actually do
this.
Here is a sparse matrix
equation.
And let's define that matrix
just to begin with.
So this is going to be very
similar to what I just showed
you, except this matrix is
special.
It's symmetric.
That means that the lower triangle is just the mirror reflection of the upper triangle.
So we can take advantage of that
and let's only store those lower
triangle entries.
Wrap it in that metadata and
this time we're going to specify
that this is not ordinary, this
is a symmetric matrix.
And we're going to tell it that
we're passing the lower
triangle.
We could pass the upper triangle
if we wanted, we've chosen the
lower triangle here.
So we got our matrix.
Next let's look at that
right-hand side.
So this is a dense vector.
Let's just have a simple array.
Wrap it in a little bit of
metadata, telling us how long it
is, and that's how easy this is.
Which gets us to the interesting
part, how do we actually find
that vector X?
So let's define some storage to
put the answer in.
This is exactly the same as that dense vector B we just saw, except we don't have to supply any values.
Then we're going to call
SparseFactor.
And I know this matrix is
positive definite; therefore, I
can tell it use a Cholesky
factorization.
I've got a flowchart for you in
a couple of slides which tells
you how to pick a factorization
to use, but here we're using
Cholesky.
That gets us an L times L transpose factorization, which we then feed into SparseSolve, passing that right-hand side and the storage we specified.
And we get our answer.
We can put that back into the
equation.
And we can see this is correct.
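The values on the slide aren't captured in the transcript, so here is a small hypothetical example in C that follows the same steps: define the lower triangle of a symmetric matrix, wrap the right-hand side and the solution storage, factor, and solve:

    #include <Accelerate/Accelerate.h>

    // Solve A*x = b for a symmetric positive definite matrix, storing only
    // the lower triangle (the numeric values here are illustrative).
    void solve_spd_example(void) {
        int    rowIndices[]   = { 0, 1, 2,       1, 2,      2 };
        double values[]       = { 4.0, 1.0, 1.0, 3.0, 0.5,  2.0 };
        long   columnStarts[] = { 0, 3, 5, 6 };

        SparseMatrix_Double A = {
            .structure = {
                .rowCount = 3, .columnCount = 3,
                .columnStarts = columnStarts, .rowIndices = rowIndices,
                .attributes = { .kind = SparseSymmetric,
                                .triangle = SparseLowerTriangle },
                .blockSize = 1
            },
            .data = values
        };

        // Right-hand side b, and storage for the unknowns x.
        double bValues[3] = { 1.0, 2.0, 3.0 };
        double xValues[3];
        DenseVector_Double b = { .count = 3, .data = bValues };
        DenseVector_Double x = { .count = 3, .data = xValues };

        // Cholesky factorization A = L * L^T, then the solve, then cleanup.
        SparseOpaqueFactorization_Double LLt =
            SparseFactor(SparseFactorizationCholesky, A);
        SparseSolve(LLt, b, x);
        SparseCleanup(LLt);
    }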
So what if A is not square?
Well, this is where we have to
use that QR factorization
[inaudible] I mentioned before.
But we got two different cases
here.
It's not entirely simple.
We can be overdetermined, that is, we have more rows than columns.
Unless you've picked a very
special system here, that
probably means there's no
exactly correct answer.
In fact, you're in this sort of
situation.
Put a straight line through
these four points.
Clearly that's impossible, but
if you remember back to your
school days, you probably did
some least squares fitting.
We pick a line which minimizes the sum of the squares of the errors.
That's exactly what we do in this case.
Remember we want to solve A times X equals B.
So the error is A times X minus B.
Let's take the 2-norm of that, which is effectively the sum of the squares of the errors in this example, and minimize that.
If we're underdetermined, that
is, we have more columns than
rows, we're in a slightly
different situation.
It's equivalent to saying put a
line through this point.
Obviously, there's an infinite
family of lines which goes
through that point.
So how do we pick one to return
to you?
We give you the solution with
minimum norm.
Let's look at that on a code
slide.
Here it's very, very similar to
that Cholesky factorization we
saw before.
In fact, the only difference is
that we say to use a QR
factorization rather than
Cholesky.
And this will automatically pick
whether to do the least squares
or the minimum norm depending on
the dimensions of your matrix.
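So as a sketch, with A, b, and x set up as before, the change is just the factorization type:

    #include <Accelerate/Accelerate.h>

    // QR instead of Cholesky; the least squares or minimum norm behaviour is
    // picked automatically from the shape of A.
    void solve_rectangular(SparseMatrix_Double A,
                           DenseVector_Double b, DenseVector_Double x) {
        SparseOpaqueFactorization_Double QR = SparseFactor(SparseFactorizationQR, A);
        SparseSolve(QR, b, x);
        SparseCleanup(QR);
    }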
And I told you that we had a
flowchart for you on how to
decide which factorization to
use.
So the first question you have
to ask is: Is your matrix
symmetric?
If it isn't, you have to use the
QR factorization.
But if it is, we have another
question for you: Is your matrix
positive definite?
Now, if you don't know the
answer or if you're sure it
isn't, you can do a symmetric
indefinite factorization, LDL
transpose.
But if you know that extra bit of information, that you've got a positive definite matrix, you can use the Cholesky factorization, L times L transpose.
And that's all we have to tell
you on matrix factorizations.
Now, I said there was this other
technique, iterative methods.
So what is an iterative method?
Well, we pick a starting point,
our best guess at the solution
before we start.
And this can be zero if you
don't have any idea what the
actual answer is going to look
like.
And we want to get within some
small radius of our actual
solution.
And the way we do that is we
iterate through a series of
points which converge to that
solution.
Now, there's a couple of caveats
with using these.
Typically they're only going to
be faster than that matrix
factorization approach if you've
got a really, really, really big
sparse matrix.
And further, to actually get to
that performance, you need to
know a bit mathematically about
your problem.
You need something called a
preconditioner, which is a very
approximate solution.
And if you check the literature
for your field, you'll probably
find quite a number have been
derived by mathematicians.
What does this actually look
like to use?
So here's that matrix equation
we had earlier.
This time I'm going to use
iterative method to solve it.
In fact, I'm going to use
conjugate gradients.
So this is positive definite.
So we just specify to use the conjugate gradient method.
And you'll notice there are some brackets after this.
That's actually because this is a factory function which produces the method, and you can specify method-specific parameters in those brackets.
The other thing I'm going to do is I'm going to use a diagonal preconditioner.
This matrix is diagonally
dominant.
That means that the entries down
the diagonal are very large
compared to those off the
diagonal; therefore, I know this
diagonal preconditioner will
work very well.
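As a sketch, with A, b, and x set up as before, the iterative solve is a single call; the diagonal preconditioner here is one of the built-in options:

    #include <Accelerate/Accelerate.h>

    // Conjugate gradient solve of A*x = b with a diagonal preconditioner,
    // for a symmetric positive definite A.
    void solve_cg(SparseMatrix_Double A, DenseVector_Double b, DenseVector_Double x) {
        SparseSolve(SparseConjugateGradient(), A, b, x, SparsePreconditionerDiagonal);
    }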
And indeed, if we look at the output of the algorithm, you can see that this error, A times X minus B, is decreasing each iteration, and we get to machine precision in four iterations.
That's because it's a 4 by 4 matrix.
Mathematically, we should converge in at most N iterations, where N is the size of the matrix.
So this is behaving as expected.
But if you've got a much larger matrix, you probably don't want to run that many iterations, which is why you get an approximate solution.
And you can see you get the same answer as before.
Now, let's say we want to solve
the least squares problem
instead.
We offer a least squares solver, which is iterative.
We don't offer an underdetermined system solver, however.
In that case you can just pick a [inaudible] of zeros and call the least squares solver instead.
And to use this, we use the
method LSMR and a slightly
different preconditioner which
is, again, problem-specific.
And you can see that we get there in three iterations this time because this is a 4 by 3 matrix.
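A sketch of the corresponding call; SparsePreconditionerDiagScaling is shown here as an example of the built-in preconditioners, not necessarily the one used on the slide:

    #include <Accelerate/Accelerate.h>

    // Iterative least squares solve, minimizing the norm of A*x - b, using LSMR.
    void solve_lsmr(SparseMatrix_Double A, DenseVector_Double b, DenseVector_Double x) {
        SparseSolve(SparseLSMR(), A, b, x, SparsePreconditionerDiagScaling);
    }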
But there are some very cool
things about this particular way
of doing things.
The first is that I don't
actually need my matrix
explicitly.
As long as I have a function
which performs the mathematical
operation A times X or A
transpose times X, that is, you
have an operator, I can
substitute a block of code in
place of this actual matrix
argument.
The second is you're not
restricted to using our
preconditioners.
You can write your own and just
provide a function pointer in
this argument.
Now, you're probably saying,
"How do I know which iterative
method to use?"
I've got another one of those
flowcharts for you.
This time our first question is
not whether you are symmetric
but whether you are square.
If you're not square, you're going to have to do a least squares solve.
However, if you are, we go
straight to that question are
you positive definite?
If you're not, we have GMRES --
that will handle pretty much any
square matrix.
But if you know that extra bit
of information, you can, of
course, use the famous conjugate
gradient method.
Now, that's everything I have to
tell you today about sparse
matrices.
So we've got one thing we want
to point out, and that is that
you can now use Accelerate on
the Watch.
We have provided it in the SDK.
Now, the framework has always
been there.
So it's even better: using today's SDK, you can back deploy to previous watchOS versions.
So that means that you get
everything that we've told you
about today on the Watch.
So let's just summarize what
that is.
By using Accelerate your code
will run faster.
It will be more energy
efficient.
It will run across all our
devices.
And at the end of the day you
have less code to maintain.
You get everything we've told you about today -- the sparse solver library, the new compression tool, changes to BNNS, improvements to simd, and more, plus increased performance across the framework.
So if you want some more
information, including some
extensive sample code we've
developed for the sparse solver,
it's all available here.
And you may be interested in reviewing some of these sessions, which have already happened, or going to the Metal session this afternoon.
Thank you for your time.
[ Applause ]