WWDC2018 Session 609

Transcript

[ Music ]
[ Applause ]
>> Good afternoon, everyone.
Welcome to our talk on Metal for
Accelerating Machine Learning.
My name is Anna Tikhonova, and
I'm an Engineer on the GPU
Software Team.
The Metal Performance Shaders
Framework is built on top of
Metal.
And it provides GPU-accelerated primitives that are optimized for both iOS and macOS.
We provide primitives for image
processing, linear algebra, and
machine learning.
We talked extensively about
inference in our past WWDC
sessions, so I just want to
highlight them here.
And this year, we've also added
support for training on both iOS
and macOS.
[ Applause ]
Thank you.
We've also added support for
accelerated ray tracing on our
platform, and we had an entire
session on this topic earlier
this week.
So, it was titled Metal for Ray
Tracing Acceleration.
And the video for the session
will be available online
shortly.
In this session, I will talk
primarily about machine
learning, specifically training.
So, I mentioned training and
inference.
Deep learning algorithms consist
of these two phases.
The first phase is the training
phase.
So, let's use an example where we want to train a model to categorize images into classes like cats, dogs, giraffes, etcetera.
So, in order to train a model to
recognize cats, we need to feed
it a large number of labeled
images of cats.
And the same for rabbits and all of the other animals you want your model to be able to recognize.
Training is a computationally
expensive and time-consuming,
iterative process.
The result of training is a set of trained parameters.
The trained parameters are
required for the next phase, the
inference phase.
This is when your model is presented with a new image, one it has never seen before, and it needs to classify it based on the trained parameters.
This is a cat.
We now provide GPU acceleration
for both the training and
inference phases.
But before I talk about
training, let me tell you about
some enhancements to CNN
Inference we've added this year.
We now have support for FP16
accumulation for the convolution
and convolution transpose
primitives.
This new feature is available on
devices with an Apple A11 Bionic
GPU.
We find that using FP16
accumulation for inference
workloads is more than
sufficient in terms of precision
for many commonly used neural
networks.
FP16 accumulation offers better performance and significant power benefits.
So, please take advantage of it
in your inference workloads.
And this is an example of how
you can enable FP16 accumulation
for a convolution primitive.
You just need to set the
accumulatorPrecisionOption
property.
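As a minimal Swift sketch, assuming you already have an MPSCNNConvolution kernel created elsewhere, that looks like this:

```swift
import MetalPerformanceShaders

// Minimal sketch: enable FP16 accumulation on an existing convolution kernel.
// `convolution` is assumed to have been created elsewhere, e.g. from a data source.
func enableFP16Accumulation(for convolution: MPSCNNConvolution) {
    // .float16 trades a small amount of precision for performance and power savings.
    convolution.accumulatorPrecisionOption = .float16
}
```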
And now, let's talk in depth
about the main topic of this
session, training neural
networks.
And let's start with training
convolutional neural networks.
So, here we have a simple, handwritten digit recognition network that takes an image of a handwritten digit as input, and assigns it to one of 10 classes, from 0 to 9.
In this example, we are
correctly classifying this image
as an image of a digit 7.
For inference, we initialize our
network with trained parameters.
So, in this example, the trained parameters are the weights for the convolution and fully connected primitives.
The goal of a training algorithm
is to compute these trained
parameters, so the network can
use them to map inputs to
correct outputs during
inference.
When we start the training
process, we don't have any
weights.
We need to compute them.
So, the first step is to initialize the weights with small random numbers.
And now we are ready to train
this network.
So, let's take a look at all the
steps involved in a training
process.
Training is an iterative
process, and each iteration of
training can be divided into
four steps.
The first step is the forward
pass.
This is when we take an input
and pass it to our network to
produce an output.
It's very similar to inference.
And next, we need to compute
loss.
So, loss intuitively measures
the difference between the
network's output and the ground
truth.
The objective of a training
algorithm is to minimize loss.
Our next step is the gradient pass. This is when we propagate this difference between the network's output and the ground truth backward through the network, to compute gradients. And the final step is to use these gradients to update the weights.
The idea is that as training
continues, our network is
becoming better trained, so it's
better able to map inputs to
correct outputs, which in turn
helps to minimize loss.
So, this is an overview and now
let's take a look at each one of
these steps in more detail.
The forward pass involves propagating the input forward through the network to compute an output.
As you can see, during this first iteration of training, our network is not doing so well.
It's outputting a result that's
clearly wrong.
So, why is it doing so badly?
Well, this is expected.
We just initialized our weights
with some random numbers.
We haven't trained the network
to do any better yet.
So, now, we need some way to quantify how well or how badly our network is currently doing.
So, we can use this information
to improve our weights, so that
hopefully after more iterations
of training, the network can
produce a better result.
But in order to know how well
we're doing, we need to know
what the right answers are.
So, the ground truth, which I
will call labels from now on, is
an input to the network along
with the input image.
So, in this case, it's a vector
of 10 values, where we have 1
for the correct class, Class 7,
and zeros for all the other
classes.
The output of the network is a vector of 10 probabilities, one per class.
So, as you can see, in this first iteration of training, the network is producing a very low probability for the correct class, Class 7, and it's assigning the highest probability to Class 9, which is why the network is returning 9 as the answer.
So, now we take all of this
information and we pass it to
our loss primitive.
And as I mentioned previously,
loss measures the difference
between the network's output and
the ground truth.
And the objective of a training
algorithm is to minimize loss.
And now, we also need the second
half of the graph.
The second half of the graph contains gradient primitives for each corresponding forward primitive.
The gradient primitives compute
gradients that are needed to
update weights.
So, the loss primitive computes the first gradient, which is the derivative of the chosen loss function with respect to its inputs. And then we take this gradient and back-propagate it through the network, starting with the first gradient primitive in the backward direction. In this case, that's the SoftMax gradient primitive.
And we use the Chain Rule to do
that.
So, the Chain Rule allows us to back-propagate gradients backward through the network.
And we're computing these
gradients so that we can update
weights.
So, we're computing very small deltas to apply to the weights in each iteration of training.
And then we can use these
updated weights in the next
iteration of training, to
ideally produce a lower loss
value, which is what we're
trying to minimize.
In practice, in each iteration of training, we're not going to operate on a single image.
operate on a single image.
We're going to operate on a
group or a batch of images.
For example, a batch of size 32
or 64.
And we need a corresponding
batch of labels, for loss
computation.
So, in this case, we have a batch of labels, with 1 for the correct class and zeros everywhere else.
And in each iteration of training, we're going to use a different batch of images and a corresponding batch of labels.
So, let's now run through
several iterations of training.
For the first batch of images, we're doing a forward pass, computing loss, doing a gradient pass, and updating weights.
So, what happens with the second batch? Exactly the same process: the forward pass, then we compute loss, do the gradient pass, and update weights.
And as we go through iterations
of training, we want the loss
for our network to decrease and
the accuracy of the network to
increase.
And we continue training until
the loss falls below a
particular threshold and the
accuracy of our network reaches
a desired level.
And then we know that the
network is fully trained and now
we can use the computed trained
parameters for inference.
And now, let's take a look at
the steps necessary to train a
neural network using the Metal
Performance Shaders Framework.
Neural networks are often
described using graph
abstraction.
So, in MPS, we enable you to
describe your networks as a
graph.
So, the first step is to create
a training graph.
Then we need to prepare our
inputs.
We need to specify weights.
And then we execute the graph.
So, we run the forward pass, compute loss, do the gradient pass, and update weights.
And training is an iterative
process.
It can take many iterations to
train a network.
So, we'll also need to know when
we can stop training.
So, let's now discuss each one
of these topics in more detail.
Let's start with creating a
training graph.
So, as I said, in MPS, we enable
you to describe your networks as
a graph using a neural network
graph API.
So, here we again have a
visualization of our
handwritten, digit recognition
network.
But in this visualization, you can also see the image nodes. They're the small white nodes. The image nodes are for your data: your inputs, your outputs, and all of the intermediate results.
They describe how data moves
between different operations.
And then, operations on the data, like convolution and pooling, are described with the filter nodes.
We support all of the nodes
necessary to create commonly
used neural networks.
And now, let's take a look at
how easy it is to use the neural
network graph API.
So, here's an example of how you
can create an MPSImageNode using
the neural network graph API.
And this is how you would create
a convolution node using the
graph API.
And now, for every forward node,
we support a corresponding
gradient node for training.
It takes a single line of code
to create a gradient node from
the forward node.
So, here is an example of how
you can create a gradient
convolution node from the
convolution node.
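As a hedged Swift sketch of those calls: MyWeights stands for any type conforming to the convolution data source protocol, and incomingGradientImage is a placeholder for the gradient image arriving from the next node in the backward pass.

```swift
import MetalPerformanceShaders

// An image node for the input data, and a convolution filter node that reads it.
let images = MPSNNImageNode(handle: nil)
let conv = MPSCNNConvolutionNode(source: images, weights: MyWeights())

// One line of code creates the corresponding gradient node for training.
// `incomingGradientImage` is a placeholder for the gradient image flowing
// backward from the next node in the backward pass.
let convGrad = conv.gradientFilter(withSource: incomingGradientImage)
```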
And now, let's build an entire
graph.
So, here we have a very small
network.
We have a convolution node
followed by a pooling node,
followed by another convolution
node.
So, how do we connect these
nodes into a graph?
So, that's easy.
We take the result image of one node and pass it as the source image to the subsequent node.
And here we have an entire
connected graph.
And now, let's build a training
graph.
So first, we need to add a loss
node to the graph.
And now, let's add some gradient
nodes.
So, as I said, it takes a single
line of code to create a
gradient node from its
corresponding forward node.
And then we connect them as
previously, and now you have a
complete training graph.
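Putting it all together, here is a hedged Swift sketch of this small training graph. The device, the MyWeights data sources, and the pooling parameters are assumptions for illustration, not code from the session.

```swift
import MetalPerformanceShaders

// Hedged sketch of the conv -> pool -> conv network with a loss node and
// the derived gradient nodes. Parameters are illustrative.
let input = MPSNNImageNode(handle: nil)

let conv1 = MPSCNNConvolutionNode(source: input, weights: MyWeights())
let pool  = MPSCNNPoolingMaxNode(source: conv1.resultImage, filterSize: 2)
let conv2 = MPSCNNConvolutionNode(source: pool.resultImage, weights: MyWeights())

// The loss node; the label batch is supplied at encode time as a state.
let lossDesc = MPSCNNLossDescriptor(type: .softMaxCrossEntropy,
                                    reductionType: .mean)
let loss = MPSCNNLossNode(source: conv2.resultImage, lossDescriptor: lossDesc)

// One line per forward node creates its gradient node; connect them in reverse.
let conv2Grad = conv2.gradientFilter(withSource: loss.resultImage)
let poolGrad  = pool.gradientFilter(withSource: conv2Grad.resultImage)
let conv1Grad = conv1.gradientFilter(withSource: poolGrad.resultImage)

// Wrap everything in an executable graph object (the initializer is failable).
let graph = MPSNNGraph(device: device,
                       resultImage: conv1Grad.resultImage,
                       resultImageIsNeeded: false)!
```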
So, as you can see from this
example, the graph API is very
simple to use.
The graph does a lot for you
automatically.
It manages all the intermediate
results, and even the output
image.
It minimizes the memory
footprint of your networks, by
aliasing memory for all your
intermediate images, using Metal
heaps.
It can also fuse graph nodes. For example, it can fuse batch normalization and neuron nodes. And it can optimize away nodes. For example, it optimizes away concatenation nodes.
The graph also automatically
handles padding and manages
state objects for you, which we
will discuss later in this
session.
So, please take advantage of the
graph API.
So, now that we know how to
create a training graph, let's
now take a look at the inputs we
need to pass to the graph.
And for this, let's take a look
at the encode call we will use
to encode the graph to the GPU.
So, as I already mentioned,
we're not going to send in one
image at a time for training.
We're going to operate on groups
or batches of images.
So, one of the inputs to the
graph, is a batch of images.
And as you recall, for every
batch of images, we also need a
corresponding batch of labels
for loss computation.
The labels for loss computation
are passed to the graph as
states.
So, the encode call also takes a batch of states as input.
And now, let's talk about these
batches and states.
What are they?
Let's start with batches.
So, batches are just arrays of
images or states.
We've added them this year
specifically to support
training.
There are two new MPS types for
you to use: MPSImageBatch and
MPSStateBatch.
So, here's an example of how you
can create a single image, using
our API.
So, here we're creating one from
an existing Metal texture.
And this is an example of how
you can create a batch of
images, using our API and append
a new image to the batch, so you
can pass it to the graph.
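A minimal Swift sketch of both steps, assuming an existing Metal texture named texture:

```swift
import MetalPerformanceShaders

// Create a single MPSImage from an existing Metal texture (`texture` assumed).
let image = MPSImage(texture: texture, featureChannels: 3)

// MPSImageBatch is just an array of MPSImage; append images to build the batch.
var trainBatch = MPSImageBatch()
trainBatch.append(image)
```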
And now, what are state objects?
An MPS state is an opaque data
container.
In training, it's frequently used to capture the state of a forward node when it's executed, so that this state can later be used by the gradient node.
So, the graph manages all of the
state objects.
So, as a developer, you
generally don't need to worry
about states.
But it's nice to know how they
work.
So, let's use a specific
example.
So, let's go back to our
handwritten digit recognition
network.
And take a look specifically at the dropout and dropout gradient nodes. The forward dropout node drops, or zeroes out, values in its input with a certain probability. The dropout state object captures information about the forward dropout operation, so it can later be used by the dropout gradient node, because the gradient node needs to zero out values in its input gradient at the exact same locations as were zeroed out by the forward operation.
So, as I said, you don't
generally need to worry about
states, because the graph
manages them.
But the labels for loss computation are an exception: they are passed as states to the graph, and they require user input, because they're your ground truth, or correct results. So, you need to create a batch of labels for loss computation and pass this batch as input to the graph.
So, this is an example of how
you would create a single label
for loss computation.
You first need to create a loss
data descriptor which describes
how the label's data is laid out
in memory.
And then you need to create an
MPSCNNLossLabel object, with
this descriptor.
And then you create a batch of
these for training, and when the
GPU's done running the graph,
your batch of labels will
contain the per image loss
values for the batch.
And you can inspect these
values, or you can compute a
single value across the batch
and inspect that value.
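Concretely, a hedged Swift sketch of creating one label and building the batch; labelsData is assumed to be a Data value holding the 10 floats of the one-hot ground-truth vector.

```swift
import MetalPerformanceShaders

// Hedged sketch: one label holding a 10-element one-hot ground-truth vector.
let labelDesc = MPSCNNLossDataDescriptor(
    data: labelsData,
    layout: .HeightxWidthxFeatureChannels,
    size: MTLSize(width: 1, height: 1, depth: 10))!

let lossLabel = MPSCNNLossLabels(device: device, labelsDescriptor: labelDesc)

// Collect one label per image to form the batch passed to the encode call.
var labelsBatch = MPSStateBatch()
labelsBatch.append(lossLabel)
```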
So, now that we have a training
graph and we talked about how to
provide inputs to our graph,
let's talk about how to provide
weights to the graph nodes that
require weights.
The only way to provide weights to convolution, fully connected, batch normalization, and instance normalization nodes is through data source provider protocols.
This is an example of how to
create a convolution node, with
a data source provider.
You need to implement a class
that conforms to the protocol.
We call it MyWeights in this
example.
Data source providers are very
useful in many ways.
For example, if you have many
convolution nodes in your
network, the overall size of the
weights for the network can be
quite considerable.
And we do not want the weights
for all of your convolution
nodes to be in memory all at the
same time.
We want to keep the memory
footprints of your networks as
low as possible.
And data source providers come
into play here because they
provide just in time loading and
purging of weights data.
So, we load the weights for one convolution kernel when we process it, and then we purge them before moving on to the next convolution.
So, here's an implementation of MyWeights. You need to provide a load method that is responsible for pulling the weights into memory and making them ready; the graph will call this load function for you. And then when the purge method is called, you can release the weights.
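Here is a hedged sketch of such a data source in Swift. The file path, descriptor parameters, and label are illustrative assumptions, not values from the session.

```swift
import MetalPerformanceShaders

// Hedged sketch of a convolution data source provider.
class MyWeights: NSObject, MPSCNNConvolutionDataSource {
    // Hypothetical location of this layer's weights on disk.
    private let weightsURL = URL(fileURLWithPath: "conv1.weights")
    private var weightsData: NSData?

    func dataType() -> MPSDataType { return .float32 }

    func descriptor() -> MPSCNNConvolutionDescriptor {
        // Illustrative 5x5 convolution, 1 input and 32 output feature channels.
        return MPSCNNConvolutionDescriptor(kernelWidth: 5, kernelHeight: 5,
                                           inputFeatureChannels: 1,
                                           outputFeatureChannels: 32)
    }

    // The graph calls load just before this node's weights are needed.
    func load() -> Bool {
        weightsData = NSData(contentsOf: weightsURL)
        return weightsData != nil
    }

    // The graph calls purge when it is done with this node's weights.
    func purge() { weightsData = nil }

    func weights() -> UnsafeMutableRawPointer {
        return UnsafeMutableRawPointer(mutating: weightsData!.bytes)
    }

    func biasTerms() -> UnsafeMutablePointer<Float>? { return nil }
    func label() -> String? { return "conv1" }
    func copy(with zone: NSZone? = nil) -> Any { return self }
}
```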
Data source providers are also
essential for training, and we
will discuss this later in this
session.
So, now that we have a training
graph and we've prepared our
inputs and specified weights,
we're ready to execute the graph
on the GPU.
To execute the training graph on the GPU, we first need to do the usual Metal setup.
We need to initialize our
training graph.
So, we have prepared our inputs.
And now, let's train a network
on the GPU.
Training is an iterative
process.
So, we want to set up a training
loop.
And we usually want to execute our graph over a number of epochs. The number of epochs is the number of times we want to iterate over our entire data set. And we want there to be multiple iterations in each epoch. So, the number of iterations per epoch is the total number of images in your data set divided by the batch size, like 32 or 64.
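As a sketch, the loop structure looks like this. trainingImageCount, numberOfEpochs, and nextBatch() are hypothetical placeholders for your data pipeline, and encodeTrainingIteration is the per-batch function sketched in the double-buffering discussion below.

```swift
// Illustrative training-loop skeleton; the names are placeholders.
let batchSize = 64
let iterationsPerEpoch = trainingImageCount / batchSize

for _ in 0..<numberOfEpochs {
    for _ in 0..<iterationsPerEpoch {
        // Fetch the next batch of images and labels, then encode one run of the graph.
        let (images, labels) = nextBatch()
        encodeTrainingIteration(images: images, labels: labels)
    }
}
```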
And now, let's take a look at
each training iteration.
In each training iteration, we
encode a batch of images for
training.
But we don't want the CPU to
wait for the GPU to finish
running one run of the graph,
with one batch of images before
the CPU can start encoding
commands to the command buffer
for the next run of the graph.
We want the CPU and the GPU to
work concurrently.
And for this, we're going to use
double buffering.
So, in this setup, we're going
to create a counting semaphore
with an initial value of 2.
It's because we want only two
encodes to be in flight at the
same time.
And then when we enter the training iteration function, we're going to call wait on the semaphore, which decrements the count. So, if the count has already been decremented to zero, we wait. Otherwise, we continue.
And then we encode our graph,
and the encode call returns
immediately.
And a user-specified callback is called when the GPU is done running the graph. In this callback we signal the semaphore, so now we know the GPU is done running the graph, and the CPU can continue encoding more work to the GPU, work that was previously waiting on the semaphore.
So, why are we using double
buffering?
Why not encode more runs of the
graph, to the GPU concurrently?
Well, it takes a lot less time
to encode commands to the
command buffer, than to run the
graph.
So, we don't want to encode too
many runs of the graph
concurrently to minimize memory
footprint.
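Here is a hedged Swift sketch of one double-buffered training iteration. graph and commandQueue are assumed to have been created during setup.

```swift
import MetalPerformanceShaders

// At most two encoded runs of the graph in flight at the same time.
let doubleBufferingSemaphore = DispatchSemaphore(value: 2)

func encodeTrainingIteration(images: MPSImageBatch, labels: MPSStateBatch) {
    doubleBufferingSemaphore.wait()          // blocks if two runs are already queued

    let commandBuffer = commandQueue.makeCommandBuffer()!
    _ = graph.encodeBatch(to: commandBuffer,
                          sourceImages: [images],
                          sourceStates: [labels])

    // When the GPU finishes this run, let the CPU encode another batch.
    commandBuffer.addCompletedHandler { _ in
        doubleBufferingSemaphore.signal()
    }
    commandBuffer.commit()
}
```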
Okay, we've talked about
executing the graph.
When we execute the graph, we do
the forward pass, we compute
loss, we do the gradient pass,
and the graph will also update
weights.
So now, let's talk about weight
updates.
As I mentioned, data source
providers are essential for
training.
All of your weight updates need
to happen through an optional
update method on a data source
provider.
The graph will call the update
method automatically.
So, what does the weight update
step actually involve?
Let's take a look.
So, recall that during the gradient pass we're computing gradients so that we can apply small deltas to the weights in each iteration of training.
How these deltas are applied to the weights is described by an optimizer. It's just a function that takes the old weights and the computed gradients as input, and produces updated weights as output.
You will use an optimizer in the
update method of your data
source provider.
And we support a number of
different variants of the weight
update step on the GPU,
including Adam, Stochastic
Gradient Descent, and RMSProp.
And you can even define your own custom weight update step if you prefer.
So now, let's take a look at how
to use an optimizer in MPS.
So, recall that your data source
provider has an init method.
This is where you want to create
your optimizer because you only
want to create it once.
And now, let's take a look at
the implementation of our update
method.
The update method receives the
source state and gradient state
as inputs.
So, the source state contains the old weights, the gradient state contains the computed gradients, and now we can encode our optimizer with this data. The last step is to return the source state, which now has the updated weights. So, pretty simple.
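Here is a hedged sketch of such an update method using the built-in SGD optimizer. The learning rate is illustrative, and weightsAndBiasesState is a persistent state for this layer's weights, assumed to have been created once alongside the optimizer in the data source's init.

```swift
import MetalPerformanceShaders

// Created once, e.g. in the data source's init (`device` assumed to exist).
let optimizer = MPSNNOptimizerStochasticGradientDescent(device: device,
                                                        learningRate: 0.01)

// Called automatically by the graph after the gradient pass.
func update(with commandBuffer: MTLCommandBuffer,
            gradientState: MPSCNNConvolutionGradientState,
            sourceState: MPSCNNConvolutionWeightsAndBiasesState)
    -> MPSCNNConvolutionWeightsAndBiasesState? {
    // Encodes newWeights = oldWeights - learningRate * gradients on the GPU.
    optimizer.encode(commandBuffer: commandBuffer,
                     convolutionGradientState: gradientState,
                     convolutionSourceState: sourceState,
                     inputMomentumVectors: nil,
                     resultState: weightsAndBiasesState)
    return weightsAndBiasesState   // now holds the updated weights
}
```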
And now we have just one more
step to discuss.
So, as I said, training is an
iterative process.
It can take many iterations to
train a network.
And you will need to know when
to stop training.
So, let's now discuss how you can make this decision in the context of your training loop.
So, here we again have our training loop, where we're training a neural network for a number of epochs.
To check whether you can stop
training, you need to have a
test set of images.
So, a test set of images
contains images that are not
used for training.
They're only used to evaluate
the accuracy of your network.
So, after each epoch, you can optionally wait for the GPU to finish running the graph, and then you can use the current trained parameters to initialize an inference network.
And then you can run this inference network on your test set, and you can optionally stop training when the accuracy of your network on this test set reaches a particular level.
So, now that we've discussed all
of the steps that are necessary
to train a neural network in
MPS, it's time for a demo.
So, as was already mentioned in
the platform State of the Union,
the Metal Performance Shaders
Framework powers Core ML, Create
ML, and Turi Create.
Turi Create is an easy-to-use, flexible, high-performance toolset for creating Core ML models for tasks such as image classification, object detection, recommendations, and more.
For more information on Turi Create, please refer to the session video, A Guide to Turi Create.
We've prepared a demo where we'll be training an object detection network in Turi Create, powered by MPS.
As was mentioned in the platform
State of the Union, this is nine
times faster than without MPS.
An object detection network
draws bounding boxes around
recognized objects.
So, in this demo, I will be
using a MacBook Pro, with a
connected external GPU.
I will be running Turi Create on
the MacBook Pro and I will use
an external GPU to train the
network with MPS.
This is a great example of how
you can use an external GPU to
enhance the computational power
of a MacBook Pro.
The external GPU we're using is
an AMD Vega GPU.
So, in this demo setup, I've
already imported Turi Create,
and preloaded the object
detection network, and a
training data set.
So, now let's train this network
for 10 iterations.
And now, the entire object
detection network, all the
primitives, the optimizer, the
weight update step, everything
is running on the external GPU.
Okay, so we're already done with 10 iterations of training. It would take more than 10 iterations to train this network, and we're not going to do this on stage.
But what I'm going to do right
now, is to load a network that
we pretrained in advance, run it
on a test set of images and
visualize some of the results.
So, let's take a look.
Okay, so here we have a banana that's correctly classified as a banana, and we have a bounding box. And now we have a perfect breakfast of a cup of coffee, a croissant, and a very mean-looking egg.
Okay, so that's it for the Turi
Create demo.
[ Applause ]
Thank you very much.
And now, let's switch gears and
talk about training recurrent
neural networks.
But first, let's recap what recurrent neural networks are.
One of the disadvantages of
convolutional neural networks is
their inability to remember
anything that happened
previously.
They can take one input, such as
an image, and generate a single
output, such as a set of
probabilities of what is
depicted in the image.
RNNs on the other hand, have
memory.
And they're good at operating on
sequences of inputs and outputs.
For example, they can take one set of probabilities of what is depicted in an image, which is the output of a CNN, and generate a sequence of outputs, which is a sequence of words that make up a caption for this image.
They can also take a sequence of inputs, such as a sequence of words that make up a sentence, and generate a sequence of outputs, which is the same sentence but translated to a different language, for example, to Russian or Finnish.
We support a number of different variants of RNNs.
The most commonly used one is
the Long Short-Term Memory RNN,
or LSTM for short.
In last year's WWDC session, we talked extensively about the gates inside an LSTM and walked through an LSTM inference example.
So, please refer to that session
for more information on LSTM
inference.
This year, we've added support
for training, for all of these
variants of RNNs.
And in this session, I'm going
to talk about training LSTMs.
So, let's use a specific
example.
So, here we have an activity classifier network which takes motion sensor data as input, for example, readings from sensors like an accelerometer or a gyroscope.
And then the network uses this
data to identify a physical
activity performed by the user.
So, for example, we want to know
if a user is cycling, skiing, or
walking.
As you can see, this network is
set up in an interesting way.
So, it contains a series of CNN primitives, followed by an LSTM primitive, followed by more CNN primitives.
So, why is it set up this way?
Let's take a look.
So, even though our input is sensor data, it's represented by a batch of 1D images with six feature channels, one feature channel per axis of the accelerometer and gyroscope readings.
And each 1D image has 2,000 pixels, and you can think of them as samples in time, because the activity we're trying to identify occurs over time.
And then we pass these images through a 1D convolution primitive, which compresses these 2,000 samples to just 20 samples. But it expands the number of feature channels, so we're not losing any features in the data.
And then, this new representation of the data is passed to the LSTM primitive as a sequence of length 20, and we run the LSTM for 20 iterations.
So, our LSTM is operating on a sequence of length 20 instead of 2,000, so it's operating on a higher-level feature representation of the data. And then we have additional CNN primitives that find higher-level features in the data.
And the last primitive in this
network is the SoftMax primitive
which generates probabilities
for the different activity
classes, which is the output of
the network.
And now, let's take a look at
how to train this network.
So, we again need a loss
primitive, which takes the
output of the network and the
labels as input.
And then we need the second half
of the graph.
So, in the second half of the
graph, we again have gradient
primitives for the corresponding
forward primitives, including
the LSTM primitive.
And now, for training, we do the
forward pass through the
network, then we compute loss,
and we do the gradient pass to
compute gradients that will be
used to update weights.
So, this is a very similar setup to the one we have for CNN training.
And the last step is, of course, to update the weights. As you know, the LSTM also has weights, so they need to be updated as well.
And now, let's take a look at
how to train this network in
MPS.
But first, let's take a look at how we can create an LSTM layer for training using our framework.
So, first, you need to create an LSTM layer descriptor, and we initialize the descriptor with initial training parameters using data source providers. These initial training parameters are your small random numbers or some checkpoint values. The descriptor setup for training is exactly the same as it is for inference.
And we discussed the layer descriptor setup in last year's WWDC session in a lot more detail.
So, I want to refer you to the
session for more information on
LSTM layer descriptor setup.
Once you have the descriptor, the next step is to create an LSTM training layer with this descriptor. MPS will populate the training weights using the data sources specified in the descriptor.
And we also need to have some
matrices to hold the computed
gradients.
You will use the
createWeightGradientMatrices API
on the training layer to create
these matrices.
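Here is a hedged Swift sketch of that setup. The channel counts are illustrative, configuring the gates with data source providers is omitted (it matches the inference setup), and the exact signatures are written from memory, so treat them as approximate.

```swift
import MetalPerformanceShaders

// Hedged sketch; 200 feature channels are illustrative.
let lstmDesc = MPSLSTMDescriptor.createLSTMDescriptor(withInputFeatureChannels: 200,
                                                      outputFeatureChannels: 200)
// ... configure the gates with data source providers, as for inference ...

// MPS populates `trainingWeights` with the layer's weight matrices.
let trainingWeights = NSMutableArray()
let lstm = MPSRNNMatrixTrainingLayer(device: device,
                                     rnnDescriptor: lstmDesc,
                                     trainableWeights: trainingWeights)

// Matrices that will hold the weight gradients computed in the backward pass.
let weightGradients = NSMutableArray()
lstm.createWeightGradientMatrices(weightGradients, dataType: .float32)
```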
And then, the training weights will be used in the forward and gradient passes and will be passed to an optimizer, along with the computed gradients, to update the weights.
And now we need to prepare some
inputs and outputs for training
our LSTM.
So, here's an example of how you
can create the matrices to hold
the input and output sequences
for both the forward and
gradient passes.
You will need 20 matrices for
each one of those.
And here is how you would
initialize these matrices with
data.
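A hedged Swift sketch of the matrix setup: one matrix per time step, 20 each for the input and output sequences. device, batchSize, and the feature count are assumptions.

```swift
import MetalPerformanceShaders

// One matrix per time step; the feature count is illustrative.
let featureChannels = 200
let rowBytes = featureChannels * MemoryLayout<Float>.stride
let matrixDesc = MPSMatrixDescriptor.matrixDescriptor(rows: batchSize,
                                                      columns: featureChannels,
                                                      rowBytes: rowBytes,
                                                      dataType: .float32)

var inputSequence = [MPSMatrix]()
for _ in 0..<20 {
    inputSequence.append(MPSMatrix(device: device, descriptor: matrixDesc))
}
// Fill each matrix's underlying buffer with the samples for that time step.
```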
And now, we are ready to train
our activity classifier network
in MPS.
So, in this code example, I will
be highlighting only the LSTM
filter in the interest of time.
So, in the forward pass, we run a sequence of 20 matrices forward through the LSTM training layer. And then in the backward pass, we run a sequence of 20 matrices backward through the LSTM layer to compute gradients.
And now, you have the training
weights, and you have the
computed gradients, and you can
pass them to an optimizer to
update weights.
So, there's just one more thing
I'd like to mention.
Convolutional neural networks operate on images, and LSTMs operate on matrices. And we provide convenience kernels in the framework to make it easy to convert between images and matrices. So, in order to copy an image to a matrix, you need to use the MPSImageCopyToMatrix kernel.
So, this is how you can create
one, and this is how you can
encode one on a batch of images.
Here, each row in the destination matrix will contain one source image.
And to copy from a matrix to an image, you need to use the MPSMatrixCopyToImage kernel.
This is how you can create one
and this is how you encode one
to the GPU.
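A hedged Swift sketch of both copy kernels. device, commandBuffer, imageBatch, and matrix are assumed to exist, and the batch-encode signatures are written from memory, so treat them as approximate.

```swift
import MetalPerformanceShaders

// Copy a batch of images into a matrix: each source image lands in one row.
let imageToMatrix = MPSImageCopyToMatrix(device: device,
                                         dataLayout: .HeightxWidthxFeatureChannels)
imageToMatrix.encodeBatch(commandBuffer: commandBuffer,
                          sourceImages: imageBatch,
                          destinationMatrix: matrix)

// And the reverse: each matrix row becomes one destination image.
let matrixToImage = MPSMatrixCopyToImage(device: device,
                                         dataLayout: .HeightxWidthxFeatureChannels)
matrixToImage.encodeBatch(commandBuffer: commandBuffer,
                          sourceMatrix: matrix,
                          destinationImages: imageBatch)
```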
So, we just showed you how to
train CNNs and RNNs using MPS.
We also showed you a demo of
Turi Create which is now powered
by MPS.
And now it's time for one more
demo.
We have been working with Google
to add support to the Metal
Performance Shaders Framework to
TensorFlow to accelerate machine
learning on macOS, and we would
love to show you a demo of that
in action.
Specifically, we want to show
you a demo of training the
InceptionV3 Object
Classification Network, using
TensorFlow, powered by MPS.
So, for this demo, I will again
be using a MacBook Pro with an
attached external GPU.
So, I will be running TensorFlow
on this MacBook Pro, and I will
use an external GPU to train a
network using MPS.
So, in this demo setup, I've
already imported TensorFlow and
preloaded the InceptionV3
Network and a training data set.
So, now, let's train this
network for 30 iterations.
So, you can see how fast this is
going.
Again, the entire network, all
of the primitives, the
optimizer, and the weight update
step, everything is running on
the external GPU.
And we're already done.
And as you can see, the training
rate is approximately 100 images
per second.
So, as was stated in the
platform State of the Union,
training the InceptionV3
Network, in TensorFlow powered
by MPS, is up to 20 times faster
than without MPS.
So, this is it for the
TensorFlow demo.
Thank you very much.
And now, let's summarize this
session.
This year, we've added FP16 accumulation for the convolution and convolution transpose primitives to improve the performance of CNN inference.
We've also added GPU-accelerated primitives for training neural networks.
These primitives are optimized
for both iOS and macOS.
We've also added the neural
network graph API for training.
It makes it very easy to train
neural networks on the GPU and
enables us to provide the best
performance across different
GPUs.
For more information on this
session and links to related
resources, please go to our
developer website.
We have the Metal for Machine
Learning Lab tomorrow at 9 a.m.
So, we would love to talk to
you.
So, please come talk to us.
And thank you for coming and
have a great WWDC.