WWDC2017 Session 608

Transcript

>> Good afternoon everyone,
welcome to our talk on using
Metal 2 for Compute.
My name is Anna Tikhonova.
I'm an engineer on the GPU
Software Team, so let's begin.
The Metal 2 ecosystem is so much more than the Metal API and the language.
We also have the GPU Tools and
we have the MetalKit and Metal
Performance Shaders frameworks.
You might know Metal as this
great technology for developing
high-end games and graphics.
But it can also be used for
Compute processing.
In fact, the Compute side of
Metal is so powerful and
flexible that the Metal
Performance Shaders framework is
built completely on top of
Compute.
And in this session, we'll talk
about what's new in the Metal
Performance Shaders framework.
We introduced the Metal Performance Shaders framework, or MPS, in 2015.
And the videos of our past
sessions are available on our
developer website.
MPS uses the compute power of the GPU to bring you GPU-accelerated primitives for image processing, linear algebra, and machine learning.
The framework is optimized for
iOS and we're happy to announce
that this year we're also
bringing MPS to the Mac.
[ Applause ]
Thank you.
The entire feature set is
available in both iOS and macOS.
So let's begin with a quick
update on our image processing
support.
So here's a list of all of the
primitives for image processing
that we had available in iOS 10.
So there's Convolution, Gaussian
Blur, Lanczos Resampling, just
to name a few.
They're all now available in
macOS.
And this year we're bringing you
four new image processing
primitives.
The Image Keypoints primitive is often used in computer vision algorithms such as image stabilization. And Bilinear Rescale, Image Statistics, and the Element-wise Arithmetic Operators are commonly used to pre-process images, for example, in machine learning. And the arithmetic filters also support broadcasting operations, which, for example, allow you to add a 2D image and a 1D image.
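As a quick illustration, here is a minimal sketch of one of the new element-wise arithmetic filters with broadcasting. It assumes a Metal device, a command buffer, and three textures (imageTexture, rowTexture, resultTexture) already exist, and the MPSImageAdd encode call follows the iOS 11 API as I recall it.

```swift
// A hedged sketch: adding a 1D image (a single row) to a 2D image with
// broadcasting, using the new MPSImageAdd arithmetic filter.
let add = MPSImageAdd(device: device)
add.encode(commandBuffer: commandBuffer,
           primaryTexture: imageTexture,     // the full 2D image
           secondaryTexture: rowTexture,     // the 1D image, broadcast across rows
           destinationTexture: resultTexture)
```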
So that's it for our very quick
update on image processing.
And now let's talk about the new
Linear Algebra operations.
We now support Matrix Multiplication, Matrix-Vector Multiplication, Triangular Matrix Factorization, and Linear Solvers.
To support Linear Algebra
operations, we now have multiple
new data representations.
First, we have the MPSVector
object which interprets the data
in a metal buffer as a
one-dimensional array.
And we have an MPSMatrix object
which interprets the data in a
metal buffer as a rectangular
array.
And MPS matrices are in row-major order.
And you can think of both
MPSVectors and MPSMatrices as
wrappers around user data
buffers.
And we also support a temporary variant of MPSMatrix. Like MPSTemporaryImages, MPSTemporaryMatrices are allocated from a Metal heap associated with a command buffer.
And they are called temporary
because their lifespan is
limited to the lifetime of the
command buffer.
And we recommend you use
temporary images and matrices
for most of your intermediate
storage.
Both MPSVector and MPSMatrix
support a number of input types.
We support single-precision and half-precision floating-point input types, and 16-bit and 8-bit signed integer input types.
And now let's take a look at how
we can create an MPSVector of
size N.
So if you don't already have a
Metal buffer, you need to create
one.
And then you need to create a
descriptor for your vector.
And note here that you specify
the length of the vector.
That's because the vector can be
made from a portion of the
original Metal buffer.
And the corresponding offset can be set in the kernel that will use this vector.
And then the last step is
creating a vector from the
buffer with a descriptor.
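Putting those steps together, here is a minimal sketch in Swift. The size N and the use of 32-bit floats are just example choices.

```swift
import Metal
import MetalPerformanceShaders

let N = 1024
let device = MTLCreateSystemDefaultDevice()!

// Step 1: create a Metal buffer large enough to hold N floats (or reuse one you already have).
let buffer = device.makeBuffer(length: N * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

// Step 2: create a descriptor; the length is explicit because the vector
// may be made from just a portion of the buffer.
let vectorDescriptor = MPSVectorDescriptor(length: N, dataType: .float32)

// Step 3: create the vector as a wrapper around the buffer.
let vector = MPSVector(buffer: buffer, descriptor: vectorDescriptor)
```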
And now let's take a look at how
you can create an MPSMatrix with
M rows and N columns.
So it's very similar to the way you would create an MPSVector, but there are just a few things we want to mention.
We provide a convenient API that
you can use to find the
recommended bytes per row value
for sizing your Metal buffers.
And if you choose to use the
API, this is how you would
create a metal buffer with this
recommended value.
And using this API is completely
optional, but recommended for
better performance.
And then the rest is simple.
You create a descriptor for your
matrix, and then you create a
matrix with a descriptor.
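And here is the matching sketch for an M-by-N matrix. The rowBytes query and the descriptor initializer are written from memory, so treat the exact parameter labels as assumptions.

```swift
import Metal
import MetalPerformanceShaders

let M = 256, N = 512
let device = MTLCreateSystemDefaultDevice()!

// Optional but recommended: ask MPS for the preferred row stride, in bytes,
// for a row of N float32 values.
let rowBytes = MPSMatrixDescriptor.rowBytes(fromColumns: N, dataType: .float32)

// Size the Metal buffer using that recommended row stride.
let buffer = device.makeBuffer(length: M * rowBytes, options: .storageModeShared)!

// Describe the matrix (rows, columns, row stride, element type) and wrap the buffer.
let matrixDescriptor = MPSMatrixDescriptor(rows: M, columns: N,
                                           rowBytes: rowBytes, dataType: .float32)
let matrix = MPSMatrix(buffer: buffer, descriptor: matrixDescriptor)
```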
Now that we've talked about the data representations, let's talk about the primitives.
So for Matrix-Matrix and
Matrix-Vector multiplication our
API is modeled after the
standard BLAS GEMM and GEMV
interfaces.
And for triangular matrix factorization and linear solvers, our API is modeled after the standard LAPACK decomposition and solve interfaces.
So if you're familiar with those
interfaces our API will look
very familiar to you as well.
And now let's take a look at a
very simple code example.
So we'll be doing matrix
multiplication and computing
just C = A times B.
So first we need to create our
matrices A, B and C.
But I know you know how to do
this, I showed you in a previous
slide, so let's move on.
Now we want to run matrix
multiplication on the GPU.
So first we do our usual Metal setup: getting a device, a command queue, and a command buffer.
And then we need to create our
matrix multiplication kernel.
And note here that you specify
the size of the result.
That's because this kernel can
operate on subregions of
matrices.
And then we encode this kernel
to the GPU and tell it to start
doing the work.
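Here is a minimal sketch of those steps, assuming matrixA (M x K), matrixB (K x N), and matrixC (M x N) were created as shown earlier, along with the device and the inner dimension K.

```swift
// Usual Metal setup: a command queue and a command buffer.
let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!

// The kernel is told the size of the result explicitly because it can also
// operate on subregions of larger matrices.
let matMul = MPSMatrixMultiplication(device: device,
                                     transposeLeft: false, transposeRight: false,
                                     resultRows: M, resultColumns: N,
                                     interiorColumns: K,
                                     alpha: 1.0, beta: 0.0)

// Encode C = A * B and tell the GPU to start doing the work.
matMul.encode(commandBuffer: commandBuffer,
              leftMatrix: matrixA, rightMatrix: matrixB, resultMatrix: matrixC)
commandBuffer.commit()
```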
And we already have sample code for matrix multiplication available on our developer website, and the sample code for triangular matrix factorization and solving a system of linear equations is coming very soon.
So that's it for the linear algebra operations.
Let's now move on to the next
topic, which is Accelerating
Machine Learning Primitives on
the GPU.
There are a number of
machine-learning related talks
at WWDC this year.
And we are a part of the
machine-learning community.
And this slide shows the overall
architecture.
So as an application developer,
you can add machine learning
functionality to your
applications by using high-level
domain-specific frameworks such as the Vision framework and the Natural Language Processing framework, which rely on the Core ML framework.
And the Core ML framework is powered by the Accelerate framework's BNNS primitives on the CPU, and by the Metal Performance Shaders framework on the GPU.
But if you're writing an
application that uses Metal,
then you can use the MPS
framework directly and I will
show you how in this session.
So let's start with what are we
talking about here?
What is deep learning?
What is Machine learning?
So imagine that this is you.
And when you see an image, you
know immediately what's depicted
on it.
It's a panda.
But now think about all of the
images on your iPhone.
Or all of those pictures in your
family albums.
Or all of the images on the
internet.
No human can possibly classify this many images. But deep-learning algorithms are designed specifically to do that.
They can be used for sifting
through large amounts of data
and answering questions such as
what is in this image?
Deep learning algorithms have
two phases.
Training and inference.
So let's talk about training
first.
And let's actually use an
example.
Let's train a system to classify
images.
So to train a system to classify images, for example, if you want it to recognize animals like cats, you need to feed this system a large number of labeled images of cats, and then rabbits, and all the other animals that you want your system to be able to recognize.
And this training step is a
one-time computationally
expensive and labor-intensive
step.
And it's usually done offline.
But the result of this training phase is a set of trained parameters, which are required for the next phase, the inference phase. This is when your system is presented with a new image that it has never seen before and it needs to classify it: this is a cat.
We provide GPU acceleration for the second phase, the inference phase.
Specifically, last year we
talked about the building blocks
for building convolutional
neural networks on the GPU for
inference.
So before we move on to any of the new features for machine learning that we brought you this year, we are going to review some of the core information that was covered in last year's presentation.
Such as, what are convolutional
neural networks?
And once we do that then we can
talk about the new primitives
that we've added for
convolutional neural networks
this year, and then we'll
introduce the new, easy-to-use
neural network graph API.
And our last topic will be
recurrent neural networks.
So let's go into our recap.
So what are convolutional neural
networks?
Convolutional neural networks
are biologically inspired and
designed to resemble the visual
cortex.
So let's think about how our
brain processes visual inputs.
The first hierarchy of neurons
that receive information in the
visual cortex is sensitive to
specific edges and blobs of
color.
While the brain regions further down the visual pipeline respond to more complex structures such as faces of our friends or kinds of animals like cats.
So in a similar way, CNNs are
organized into a hierarchy of
layers where high-level features
are derived from low-level
features.
So the first few layers in your
network respond to low-level
features like edges and blobs of
color.
While subsequent layers respond
to progressively more complex
features such as faces.
And I keep saying features.
So think of a feature as a filter that filters your input data for that particular feature.
And here's a list of all of the
CNN primitives that we had
available in iOS 10.
And in this recap I will be just
talking about the core
convolution layer.
The core building block of a
CNN.
And the rest of these primitives, such as Pooling, Fully-Connected, and SoftMax, are covered in great detail in our documentation, so you can find information on those there.
So let's talk about the core
building block.
So the function of this core
convolution layer is to
recognize features in the input
data and it's called a
convolution layer because it
performs a convolution on its
input.
So let's recall how regular
convolution works.
You have your inputs, your
outputs and the filter.
And to convolve a filter with the input data you need to multiply each value in your filter with the corresponding value in the input data and combine that information to compute a single output value.
And you do the same for the rest
of the output pixels.
And now the convolution layer is
a generalization of regular
convolution.
It allows you to have multiple
filters.
So you have as many filters as you have output channels, or 16 in this case.
And these are the filters which
are going to be filtering input
data for particular features.
Now imagine that you're working
with RGB data.
So you actually have three
channels in your input.
And just because of how CNNs work, this means you need three sets of 16 filters.
One set for each input channel.
And then these filters are
applied to the input data
separately.
And then the final step combines
all of this information to
compute a single output pixel.
So that's it for our recap of
the convolution layer.
Now let's talk about the new
primitives we've added for
convolutional neural networks.
So as you can see, we've added
quite a few.
And I'll be talking about the
ones highlighted in yellow, but
the rest of them like L2Norm
Pooling, Resampling,
Up-sampling, they will all be
covered in our documentation.
So let's talk about updates to
our core convolution layer.
We used to support only single
precision floating-point weight
types.
And now, to help you reduce the memory footprint and improve the performance of your networks, we also support half-precision floating-point, 8-bit integer, and binary weight types.
We used to support only standard
convolution and now we also
support binary and XNOR
convolution.
Dilated convolution.
Sub-pixel convolution and
convolution transpose
operations.
And many of these are
orthogonal, so you can even have
dilated sub-pixel convolution if
you want.
So let's go through them
one-by-one.
Binary and XNOR convolution
perform the same exact operation
as regular convolution but they
do so with improved performance
and great space savings.
So in regular convolution, you
may have floating point inputs
and floating point weights.
What binary convolution allows you to do is use your full-sized input with binary weights.
And for XNOR convolution the
first thing that happens is that
your input is first converted to
binary so that both your inputs
and the weights are binary.
In regular convolution, the input has to be multiplied with the weights. For XNOR convolution, this operation becomes a simple XNOR operation.
And now let's talk about dilated
convolution.
So we already know how regular
convolution works.
You need to apply a filter to
the input data to compute a
single output value.
But say you're working on an
algorithm that requires global
integration of a wider context
of your input data.
So instead of a 3 by 3 kernel,
you may be using a 5 by 5 kernel
to look out further.
But that's a lot more
computationally expensive.
What you can do instead is use dilated convolution, which allows you to use dilation factors to introduce gaps into your convolution kernel, so that you're still using just a 3 by 3 kernel, but you can look out further.
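As a rough illustration, here is how a dilation rate might be set on a convolution descriptor. The dilationRateX/dilationRateY properties and the descriptor initializer are my recollection of the iOS 11 API, so treat the exact names as assumptions.

```swift
// A hedged sketch: a 3 by 3 convolution with a dilation rate of 2, so the kernel
// samples a wider context without adding more taps.
let convDesc = MPSCNNConvolutionDescriptor(kernelWidth: 3, kernelHeight: 3,
                                           inputFeatureChannels: 64,
                                           outputFeatureChannels: 64)
convDesc.dilationRateX = 2
convDesc.dilationRateY = 2
// The descriptor is then used to build the convolution kernel (or returned
// from a data source) in the usual way.
```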
And now let's talk about the sub-pixel convolution and convolution transpose primitives, which are very commonly used for image upscaling.
And let's think about how
upscaling usually works.
So you have your input data and you want to upscale it by a factor of 2, so you have some missing pixels to compute. And usually upscaling is a fixed operation with a constant filter. So, for example, how would a box filter help you upscale this image? The box filter takes the known pixels and copies the known data into the missing locations to get you the upscaled result.
For sub-pixel convolution, your
filters are not constant.
Your filters are learned from
the data.
They are your trained parameters
that you get from the training
step where the system was
trained to do this task; to do
image upscaling.
So for 2x upscaling you get 4
filters.
For 4x upscaling you get 16
filters and so on.
So for our 2x upscaling we get
our 4 filters and we apply them
to the input data.
And then the output of that
operation is reshuffled to get
your final full-resolution
image.
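In code, this might look roughly like the following. MPSCNNSubPixelConvolutionDescriptor and its subPixelScaleFactor property are the iOS 11 names as I recall them, so treat them as assumptions.

```swift
// A hedged sketch: a sub-pixel convolution descriptor configured for 2x upscaling.
// The learned filters come from training, supplied through a data source as usual.
let subPixelDesc = MPSCNNSubPixelConvolutionDescriptor(kernelWidth: 5, kernelHeight: 5,
                                                       inputFeatureChannels: 32,
                                                       outputFeatureChannels: 3)
subPixelDesc.subPixelScaleFactor = 2   // the reshuffle produces a 2x larger output
```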
And now let's talk about how the
convolution transpose primitive
can be used to upscale images.
So we have our inputs and we
still have to compute our
missing data.
So the way that this primitive
computes the missing data is
that it applies a kind of
convolution pass to this
intermediate result with gaps to
compute each output pixel.
So that's how you get your
upscaled output.
And now we're going to show you
how you can use these new
convolution primitives to
implement a real-world network.
So we took this colorization
network that takes black and
white images as input and
produces colorized images.
And this particular network uses the dilated convolution primitive to integrate a wider global context more quickly.
And it uses the convolution
transpose primitive to upscale
the results of the network.
And now let's look at this
colorization network in action.
So in this demo we have a
collection of black and white
images like this image of a
lion.
And as soon as I tap on this
image, the colorization network
will run right here live on the
device, and we'll see a
colorized image.
And let's try another example
for this beautiful snowy
mountain.
And now we see it in color.
And this beautiful lovely image
of a dad and a daughter playing
guitar.
And now you can see them playing
guitar in color.
And I really like this one, the
brown bear walking in the
forest.
So I think this network does
just a really wonderful job.
Okay. So that's it for the live
demo.
[ Applause ]
Thank you so much.
So we've added all of these new
convolution CNN primitives, but
that's not all.
We also went back and improved
the performance of some of the
core CNN kernels that were
available to you in iOS 10.
So this chart will show the
performance of the Inception-v3
network, which is a commonly
used network for image
recognition.
So it shows the performance of
this network in iOS 11.
And as you can see, we're
bringing you at least 20 percent
performance improvement across
different iOS hardware.
And now let's talk about the new
neural network graph API.
Neural networks are commonly described using a graph abstraction, like this visualization of the Inception-v3 network. And we now allow you to do just that using the new graph API.
So let's zoom in on one of these
inception modules.
You have filter nodes which
describe the operations that you
can perform on your data.
Such as convolution, pooling,
etcetera.
And you have image nodes which
describe how the data flows
between these different
operations.
So why did we add this new graph
API?
Well because it's easy to use.
You get this compact
representation of your entire
network and you can save it to
disk and restore it, and that
works across platforms.
You only need to initialize the
graph once and then you can
reuse it for multiple input
images.
And you can execute the entire
graph on the GPU with a single
call.
There are no intermediate images
for you to manage, you just need
to take care of your input and
output.
Internally we use Metal heaps to
make sure that the memory
footprint of all your
intermediate images is as small
as possible.
For example, for the Inception-v3 network this means 5x memory savings and 10x fewer allocations, which I think is pretty impressive.
So as I said, the graph does all
the groundwork for you.
It takes care of creating
intermediate images.
It takes care of sizing them.
It even sizes your outputs. It takes care of the padding policies. It takes care of centering.
So in short, it's a lot less
code for you to write and a lot
fewer bugs for you to write as
well.
And when I say less code, I mean
a lot less code.
So last year we released this
Metal recognition sample that
uses the Inception-v3 network
for image recognition.
And we took that sample and
converted it to use the new
graph API and found that we had
to write four times less code.
And that's pretty much the same number of lines as the Python code you would have to write in the open-source TensorFlow framework to implement the same network.
And we just want to mention that we will be releasing this updated example as sample code.
And now, having all this information about your entire network allows us to deliver the best performance across different GPUs. We make it easy for you to parallelize between the CPU and the GPU.
So as the GPU is executing the graph for one input image, the CPU can already prepare to execute the graph for a different input image.
We can also fuse graph nodes
together like the convolution
and neuron nodes.
And we can execute graph nodes
concurrently.
So if we look at this inception
module again, you can see that
there are multiple rows of these
nodes that can be executed
completely independently of each
other.
And of course the outputs of these independent executions need to be concatenated via concatenation nodes.
And the graph is smart enough to
optimize those away as well.
And now let's take a look at how
you can use the new graph API.
So this is the code for creating
a convolution node using the
graph API.
So it takes an image as source
and it also has weights.
So let's talk about weights for
a minute.
Neural networks keep growing
larger and larger in size.
And if you have many convolution
nodes in your networks, that
means that the overall size of
the weights for your entire
network could be quite
considerable.
And to help with that we've
added a convolution data source
protocol that you can implement
and it provides just in time
loading and purging of weights
data.
So the idea is that the weights
for your entire network do not
have to be loaded in memory all
at the same time.
They also do not have to be
loaded in advance.
To help minimize the memory
footprint, when we initialize
the graph and we process a
particular convolution layer,
we'll load the weights for that
convolution layer and then we
purge them before we move on to
the next convolution layer.
What you have to do is implement this initialization method, which just records where the data is but doesn't actually load it. And then when the graph calls the load function, that alerts you that the weights need to be loaded.
And then when the purge function
is called by the graph then you
can release the weights.
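Here is a minimal sketch of what such a data source might look like. The protocol methods follow the iOS 11 MPSCNNConvolutionDataSource API as I recall it, and the file-based loading is a hypothetical placeholder.

```swift
import MetalPerformanceShaders

// A hedged sketch of a minimal convolution data source providing just-in-time
// loading and purging of one layer's weights.
class ConvDataSource: NSObject, MPSCNNConvolutionDataSource {
    private let convDescriptor: MPSCNNConvolutionDescriptor
    private let weightsURL: URL        // where the trained weights live (hypothetical)
    private var weightsData: Data?     // loaded lazily, only while the graph needs it

    init(descriptor: MPSCNNConvolutionDescriptor, weightsURL: URL) {
        // The initializer just records where the data is; it does not load it.
        self.convDescriptor = descriptor
        self.weightsURL = weightsURL
        super.init()
    }

    func dataType() -> MPSDataType { return .float32 }
    func descriptor() -> MPSCNNConvolutionDescriptor { return convDescriptor }
    func label() -> String? { return "conv" }

    // Called by the graph when this layer's weights are actually needed.
    func load() -> Bool {
        weightsData = try? Data(contentsOf: weightsURL)
        return weightsData != nil
    }

    // Called by the graph once the layer has been processed.
    func purge() {
        weightsData = nil
    }

    func weights() -> UnsafeMutableRawPointer {
        // Sketch only: real code must keep this memory valid while MPS uses it.
        return UnsafeMutableRawPointer(mutating: (weightsData! as NSData).bytes)
    }

    func biasTerms() -> UnsafeMutablePointer<Float>? { return nil }
}
```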
And now let's build a graph.
So here we're implementing this
makeGraph function.
And on the left you can see all
the nodes that make up our
network that we need to build.
So then we create the nodes.
So we create the convolution
node.
The pooling node.
And then the rest of the nodes.
So we have the nodes.
How do we connect them into a
graph?
So we just take the result image
of one node and pass it as a
source image to the next node.
And then we have our graph.
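Here is a rough sketch of that, using the iOS 11 node classes as I recall them; the two weight data sources stand for MPSCNNConvolutionDataSource implementations like the sketch above.

```swift
// A hedged sketch of describing a small network with the graph API.
func makeGraph(device: MTLDevice,
               convWeights: MPSCNNConvolutionDataSource,
               fcWeights: MPSCNNConvolutionDataSource) -> MPSNNGraph? {
    // The image node standing in for the network's input.
    let input = MPSNNImageNode(handle: nil)

    // Filter nodes: each takes the result image of the previous node as its source.
    let conv    = MPSCNNConvolutionNode(source: input, weights: convWeights)
    let pool    = MPSCNNPoolingMaxNode(source: conv.resultImage, filterSize: 2)
    let fc      = MPSCNNFullyConnectedNode(source: pool.resultImage, weights: fcWeights)
    let softMax = MPSCNNSoftMaxNode(source: fc.resultImage)

    // The graph is created from the node that produces the final result image.
    return MPSNNGraph(device: device, resultImage: softMax.resultImage)
}
```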
And now let's run it on the GPU.
So first we do our usual Metal
setup.
We initialize the graph.
We take care of our input data
and then we encode the graph to
the GPU.
And the output image will be populated with data when the command buffer completes.
And then we have an option to
wait for the GPU to finish.
But we don't want you to do
that.
When this happens the CPU is
waiting for the GPU to finish
before it can start encoding the
next run of the graph.
And this introduces bubbles into
your pipeline, which can
adversely affect performance.
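For reference, the synchronous path just described looks roughly like this, assuming graph, inputImage, device, and commandQueue already exist.

```swift
let commandBuffer = commandQueue.makeCommandBuffer()!

// Encode the entire graph with a single call; the returned MPSImage will be
// populated once the command buffer completes.
let outputImage = graph.encode(to: commandBuffer, sourceImages: [inputImage])
commandBuffer.commit()

// Waiting like this stalls the CPU and creates the pipeline bubbles described above.
commandBuffer.waitUntilCompleted()
_ = outputImage   // the results can be read from outputImage at this point
```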
So what we want you to do
instead is to use the new
asynchronous executeAsync API.
So with this API your Metal
setup is even smaller.
So you just need to get the
Metal device.
Then you still need to
initialize your graph.
Prepare the input data and then you call executeAsync.
It returns immediately and then
the output image will be ready
when this code inside the
closure executes.
But in the meantime, you don't have to wait for the GPU to finish, you can already proceed with encoding a new GPU task.
And this way the CPU and the GPU
are executing concurrently.
There are no bubbles in your
pipeline and they're both
utilized to full capacity.
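And here is a rough sketch of that asynchronous path, assuming the same graph and inputImage as above; the executeAsync signature is my recollection of the iOS 11 API.

```swift
// The call returns immediately; the closure runs once the GPU has finished.
_ = graph.executeAsync(withSourceImages: [inputImage]) { outputImage, error in
    if let error = error {
        print("Graph execution failed: \(error)")
        return
    }
    // Read the network's results from outputImage here.
    _ = outputImage
}
// Meanwhile the CPU is free to start encoding the next piece of GPU work.
```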
Okay. And now I will do a live
demo that demonstrates the
performance difference between
the synchronous and asynchronous
APIs.
And this demo will be using the
Inception-v3 network for image
recognition.
All right.
So I will be starting with
synchronous API and here we're
detecting a water bottle.
And we're getting about 50 milliseconds per image on average.
And now I will switch to the
asynchronous API.
And now we're getting about 36
milliseconds per image on
average.
So that's a pretty good performance improvement.
All right.
So that's it for the live demo.
[ Applause ]
Thank you.
Okay. Now that we've talked
about the new neural network
graph API and I showed you how
easy it is to use and what great
performance you can achieve with
it, let's now switch gears and
talk about recurrent neural
networks.
So what are recurrent neural
networks?
So one disadvantage of CNNs is
their inability to remember
anything that happened in the
past.
They can take one image as input
and generate a single output
such as the set of probabilities
of what is depicted in the
image.
RNNs on the other hand have
memory.
And they're good at operating on
sequences.
So they can take one input such
as a set of probabilities of
what is depicted in the image
and generate a sequence of
outputs.
So a sequence of words that make
up a caption for this image.
They can also take a sequence of
inputs such as a sentence in
English and generate a sequence
of outputs such as the same
sentence translated to a
different language like Russian
or Finnish.
And we support a number of different variants of RNNs.
The single gate RNN, the long
short-term memory RNN or LSTM,
and multiple variants of LSTMs.
The GRU and the MGU.
So let's talk about the simplest
kind of RNN, the single gate
RNN.
The single gate RNN has a recurrent unit which enables the previous output of the RNN to affect the output of subsequent iterations of the same RNN.
But single gate RNNs are not powerful enough to carry important information across many iterations, because the current output of the single gate RNN is also its current state. There's nothing else there.
The solution to this is the long
short-term memory RNN or LSTM.
It's built from single gate RNNs
and it has an internal memory
cell.
And a certain combination of
gates control how the
information flows inside LSTM.
And what is stored and not
stored in the memory cell.
So let's take a look at the
architecture of LSTM in more
detail.
As I said, the most important entity inside the LSTM is the memory cell, which is updated in every iteration of the LSTM. So you can think of each iteration of the LSTM as a transition between the old and the new memory.
And now let's talk about the
gates.
So first there is a forget gate
which decides what to keep and
what not to keep from old
memory.
And then there are the input and cell gates, whose combined contribution determines what from the current input will affect the new memory. And then the contributions of all three of these gates are combined to update the memory cell.
And finally, there is the output gate, which determines what from the previous output, the current input, and the new memory will affect the output of the LSTM.
So now that you know what LSTM
is made up of, let's take a look
at how you can create one using
our framework.
So first you create a descriptor
for the LSTM.
And then you need to initialize
the gates.
So what controls the gates? What controls how they operate are the trained parameters, the ones that come from the training step where you trained the system to do a particular task.
And there are multiple gates for
you to initialize as you can
see, but we're only showing two
initializations just to be
brief.
And as you can see, we're also
using a data source provider.
The same one I showed you before
to initialize the weights.
And the next step is to create
our LSTM layer and now we want
to run it on the GPU.
So we need to create our arrays
that will hold the input and
output for the sequence of the
LSTM executions.
And then we encode the sequence
to the GPU.
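Put together, the steps above might look roughly like this. The class and parameter names (MPSLSTMDescriptor, MPSRNNMatrixInferenceLayer, the gate weight properties, and encodeSequence) are my recollection of the iOS 11 API, and inputGateWeights, forgetGateWeights, device, commandBuffer, inputMatrices, and outputMatrices are assumed to exist from the earlier steps, so treat this purely as a sketch.

```swift
// A hedged sketch of creating and encoding an LSTM layer.
let lstmDescriptor = MPSLSTMDescriptor.createLSTMDescriptor(withInputFeatureChannels: 512,
                                                            outputFeatureChannels: 512)

// Each gate is driven by trained parameters supplied through a data source;
// only two of the gate initializations are shown here for brevity.
lstmDescriptor.inputGateInputWeights = inputGateWeights
lstmDescriptor.forgetGateInputWeights = forgetGateWeights
// ... and likewise for the cell and output gates.

// Create the LSTM layer itself.
let lstm = MPSRNNMatrixInferenceLayer(device: device, rnnDescriptor: lstmDescriptor)

// Encode the whole input sequence in one call; inputMatrices and outputMatrices
// are arrays of MPSMatrix objects holding the per-step inputs and outputs.
lstm.encodeSequence(commandBuffer: commandBuffer,
                    sourceMatrices: inputMatrices,
                    destinationMatrices: outputMatrices,
                    recurrentInputState: nil,
                    recurrentOutputStates: nil)
commandBuffer.commit()
```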
And here we're showing you a matrix-based RNN, but we just want to mention that we also support RNNs that operate on MPSImages via convolutions.
And now let's take a look at an
actual example.
So we'll use image captioning as
an example of using LSTM.
So as you recall, I told you
that deep learning algorithms
have two phases.
The training phase and the
inference phase.
So to train a system to caption
images you need to feed it a
large number of images with
human-generated captions.
So what does this system have? Like, what is it made out of? So this system has a CNN and an RNN working together to generate captions.
The CNN is used to figure out
what's depicted in the image and
then the RNN is used to generate
the actual caption.
And the output of that process
is the trained parameters which
are required for the next step,
the inference step.
So in the inference phase, the
trained parameters control both
the operation of the CNN layers
and the operation of the RNN
gates.
And then each image is processed by both the CNN and the RNN to generate a caption.
So we already know a good
network for figuring out what's
depicted in the image.
It's the Inception-v3 network,
so we'll use that.
And we just talked about LSTMs,
so let's use that to generate
our caption.
And the caption generation process also has two phases.
So first we have the LSTM
initialization phase.
So we run our Inception-v3
network and we actually run all
of the layers except the very
last SoftMax layer.
And the output of that is a
feature vector which has
information about what is
depicted in the image.
And then we take that feature
vector and convert it to a
compact representation that's
required by LSTM.
And then run that through LSTM
to initialize it.
And then once we have our
initialized LSTM, then we're
ready for the next phase.
Our actual caption generation
phase.
And we start this process by
passing in a special sentence
start ID token to our LSTM.
And the output of that operation is a sequence of words, the words that are connected to what is depicted in the image.
And then we pass those words to
a SoftMax layer which computes
probabilities for these words.
And we pick the three best ones.
And these three best words are
also our one-word partial
captions for a particular image.
So we take those words and pass them to the next iteration of the LSTM, whose function is now to come up with the three best two-word captions for our image, and so on.
We execute for N iterations until we reach a stopping condition, which is when we either reach the maximum number of words that we want in our caption, or when the probabilities of the newly generated captions drop to 0.
So I know this is still pretty
abstract.
So let's look at an actual output of the LSTM over several iterations for a particular image.
So in this image we have, you
know, our surfer riding a wave.
And we want to compute the top
three captions for this image.
And in the first iteration of
LSTM we generate three best
words.
So -- which are our best
one-word captions for this
image.
So "man", "a", and "the".
And the word "a" has the highest
probability.
So we take these three words and
we pass them to the next
iteration of LSTM.
And in this iteration, for every
one of these three starter
words, LSTM generates three new
words that have the highest
probability of following each
one of these starter words.
Right? So we have three new
words to follow the word "man".
Three new words to follow the
word "a", and three new words to
follow the word "the".
And now as you can see, each one of these two-word captions also has a probability.
And because the word "a" had
such a high probability in the
first iteration, then the
captions that start with the
word "a" in the second iteration
also end up having the highest
probability.
Why? Because the probability of
a two-word caption is just a
product of the probabilities of
the words that make up that
caption.
So that's how we get these three
best ones.
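Just to make the arithmetic concrete, here is a toy snippet (not an MPS API) that scores candidate captions by the product of their word probabilities and keeps the best three.

```swift
// Toy illustration of the scoring described above.
struct Candidate {
    let words: [String]
    let wordProbabilities: [Double]
    var score: Double { return wordProbabilities.reduce(1.0, *) }   // product of word probabilities
}

func bestThree(_ candidates: [Candidate]) -> [Candidate] {
    return Array(candidates.sorted { $0.score > $1.score }.prefix(3))
}
```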
And then we take them and we
move on to the next iteration.
And in the next iteration we
just add one more word to our
captions so that we have
three-word captions.
And then we compute the
probabilities of those captions
and pick the best three.
And we move on to the next
iteration where we just end up
adding one more word to our
caption.
So we have four-word captions.
And then we compute the
probabilities of all these
captions and pick the best
three.
And so on -- I think you get the
idea.
Let's just skip to the end.
So in the end, we get our three
top captions for this particular
image.
And the best one is "a man riding a wave on top of a surfboard", which I think is pretty close.
So -- [applause] and let's now
do a demo.
[ Applause ]
So now we'll do a demo of this
-- of the captioning network.
So we have a collection of
images here, and as soon as I
tap on an image then the CNN
will run to determine what is
depicted in the image.
And then the RNN will run to
generate the actual caption.
So let's try it out.
>> A man riding a wave on top of
a surfboard.
>> So we already know this.
Now let's try another image.
>> An old truck is parked in the
field.
>> So the network actually knows
that it's an old truck and that
it's parked and not moving,
which I think is pretty
impressive.
Now let's try one more.
>> A black and white dog laying
in the grass.
>> So the network knows that
it's a black and white dog and
that it's laying in the grass,
not running.
Not walking.
Not sitting.
Laying in the grass.
So pretty cool.
[ Applause ]
Thank you.
And on this note, let's go to
the summary.
So in this session we talked
about all of the new primitives
that we added to the MPS
framework this year.
We've expanded our support for
image processing primitives and
for convolutional neural
networks.
And we've added support for
linear algebra and recurrent
neural networks.
The framework is optimized for iOS, as I told you, and now these primitives are also all available on the Mac.
We also talked about the new
neural network graph API and we
showed you how easy it is to
use, to build and execute your
networks on the GPU.
And that it makes it possible
for us to deliver the best
performance for your networks
across the different GPUs.
And we would love to see you use all of this new functionality to create really great apps and tell us about them.
So please check out the related Metal 2 sessions and the sessions on the Core ML, Accelerate, and Vision frameworks.
And for more information about this session and for links to sample code, please check out this link on our developer website. Thank you so much for coming, and have a great WWDC.
[ Applause ]