Transcript
[ Applause ]
>> PHILIP BENNETT: Good morning,
and welcome to Metal Performance
Optimization Techniques.
I'm Phil Bennett of the GPU
Software Performance Group,
and I will be joined shortly by
our special guest Serhat Tekin
from the GPU Software
Developer Technologies Group
and he will be giving a demo
of a great new tool you can use
to profile your Metal apps.
I'm sure you're going
to love it.
So, Metal at WWDC:
the story so far.
In What's New in Metal Part 1,
we covered great new features
that have been added to Metal
as of iOS 9 and OS X El Capitan.
In What's New in Metal Part 2,
we introduced two new
frameworks, MetalKit
and Metal Performance Shaders.
These make developing
Metal apps even easier.
In this, our final session,
we will be reviewing what tools
are available for debugging
and profiling your Metal
apps, and we're going
to explore some best practices
for getting optimal performance
from your Metal apps.
So let's take a look
at the tools.
Now, if you have been doing any
Metal app development in iOS,
you are likely to be
familiar with Xcode
and its suite of Metal tools.
Now, we are going
to take a quick look
at the frame debugger.
So what we have here is a
capture of a single frame
from a Metal app,
and on the left,
we have the frame navigator
which shows all of the states
and Draw calls present
in the frame.
These are grouped by render
encoder, command buffer,
and if you have been
using debug labels,
they will be grouped
by debug groups also.
Next we have the render
attachment viewer,
which shows all of the
color attachments associated
with the current render pass
in addition to any depth
and stencil attachments, and it
shows this wire frame highlight
of the current Draw call,
which makes navigating
your frame very convenient.
Next we have the
resource inspector
where you can inspect all of
the resources used by your app,
from buffers to textures
and render attachments.
You can view all
the different formats,
you can view individual
mipmap levels, cube maps,
2D arrays; it's fully featured.
And then we have the state
inspector, which allows you
to inspect properties of all of
the Metal objects in your app.
Moving on, we have
the GPU report,
which gives you a frames
per second measurement
of the current frame and gives
you timings for CPU and GPU.
In addition, it also shows
the most expensive render
and compute encoders in your
frame, to help you narrow
down which shaders and
which Draw calls are the
most expensive.
And finally, we have the
shader profiler and editor.
And this is a wonderful
tool for both debugging
and profiling your shaders as it
allows you to tweak your shaders
and recompile them on the
fly, thus saving you having
to recompile your app.
It's really useful.
And as you are probably
aware by now,
all of these great
tools are now available
for debugging your Metal
apps on OS X El Capitan.
So Instruments is a
great companion to Xcode
as it allows you to profile
your app's performance
across the entire system,
and now we are enabling you
to profile Metal performance
in a similar manner with this,
the Metal System
Trace instrument.
It's a brand-new tool for iOS 9.
It allows you to
profile your Metal apps
across your CPU and GPU.
Let's take a look here.
We can start by profiling Metal
API usage in the application,
down to the driver,
right onto the GPU
where we can see the
individual processing phases,
vertex, fragment, and
optionally compute,
and then onto the
actual display hardware.
Now, here to give
us a demonstration
of this great new tool,
please welcome Serhat
Tekin to the stage.
[ Applause ]
>> SERHAT TEKIN: Thank you,
Philip, and hello, everyone.
I have something really
cool to show you today,
and it's brand new,
it's our latest addition
to our Metal development
tools, Metal System Trace.
Metal System Trace is
a performance analysis
and tracing tool for your
Metal iOS apps and is available
as part of Instruments.
It lets you get a system-wide
overview of your application
over time, while also giving
you an in-depth look at the
graphics down to the
microsecond level.
It's important that
I stress this.
This is available for the first
time ever on our platform.
This is all thanks
to Xcode 7 and iOS 9.
So without further ado, let's
go ahead and give it a shot.
So I'm going to launch
Instruments,
and we are at the
template chooser.
You will notice that we have
a new template icon here,
the Metal icon, for Metal
System Trace.
I will go ahead and choose that.
Those of you familiar
with Instruments will realize
I just created a new document
with four instruments
in it, as you can see
on the left-hand side
of the timeline here.
I will give you a quick tour of
these instruments and the data
that they present
on the timeline.
So let's go ahead and select
my Metal app on the iPad
as my target app
and start recording.
All right.
Now, Metal System
Trace is set to record
in a mode
called Windowed Mode.
It's essentially capturing
the trace into a ring buffer.
This lets you record
indefinitely.
And the important point here
is that when you see a problem
that you want to investigate,
you can stop recording.
At that point, Instruments
will gather all
of the trace data collected,
process it for a while,
and we will end up with a
timeline that looks like this.
So there is quite a lot of stuff
going on here, so I will zoom
in to get a better look.
I can do that by holding
down the Option key
and selecting an area of
interest in the timeline
that I want to zoom into.
I can navigate the timeline
using the trackpad gestures,
two-finger swipe to
scroll and pinch to zoom.
And you can see that
I get more detail
on the timeline as
I zoom further in.
So what are we looking at here?
Essentially what we have
here is an in-depth look
of your Metal application's
graphics workload over time
across all of the layers
of the graphics stack.
The different colors that we go
through in the timeline
represent different workloads
for individual frames.
And the tracks themselves
are fairly intuitive.
Each box you see here represents
an item's trace-relative start
time, end time, and
how long it took.
Starting from the top
and working our way down,
we have your application's
usage of the Metal framework.
Next, we have the graphics
driver processing your command
buffers, and if you have any
shader compilation activity
midframe, it also
shows up in the track.
This is followed by
the GPU hardware track,
which shows your Render
and Compute commands
executing on the GPU.
And finally we have the
display surfaces track.
Essentially, this is your frame
getting displayed on the device.
All right.
So another thing you can
see here is these labels.
Now, note that these two labels
here, shadow buffer and G-buffer
and lighting, are labels I
assigned myself to my encoders
in my Metal code using the
encoder's Label property.
These labels propagate their
way down the pipeline along
with the workload they
are associated with,
which makes it very easy
to track your scene's
rendering passes here
in Metal System Trace.
I highly recommend
taking advantage of this.
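To illustrate, here is a minimal sketch of labeling in Swift; the label strings and the surrounding setup are just placeholders:

    import Metal

    // Naming the command buffer and encoder so the labels show up in
    // Metal System Trace and the Xcode frame debugger.
    func encodeShadowPass(into commandBuffer: MTLCommandBuffer,
                          pass: MTLRenderPassDescriptor) {
        commandBuffer.label = "Frame Command Buffer"
        guard let encoder =
            commandBuffer.makeRenderCommandEncoder(descriptor: pass) else { return }
        encoder.label = "Shadow Buffer"  // propagates down the pipeline
        // ... encode the shadow Draw calls here ...
        encoder.endEncoding()
    }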
And if anything is too
small to fit its label,
you can always go hover over
the ruler and see a tool tip
that displays both the label and
the duration at the same time.
The order of the tracks
here basically maps
to the same order your Metal
commands would work their way
down the graphics pipeline.
So let us go ahead and
follow this command buffer
down the pipe.
So at the top track I can
see my application's use
of Metal command
buffers and encoders,
specifically what I see
here is the creation time
and submission time for
both my command buffers
and render and compute encoders.
At the top I have
my command buffer,
and at the bottom I have my
relevant encoders created
by this command buffer
directly nested underneath.
Now, note this arrow here
at the submission time
of the command buffer
going to the next track.
Dependencies between
different levels
of the pipeline are
represented by these arrows
in Metal System Trace.
So, for instance, when this
command buffer is submitted,
its next stop is going to be
the graphics driver,
if I zoom in there
and get a better look.
Look at how little time
we are taking here.
It's really, really
fast, and we are still
on the CPU side, barely
consuming anything.
Similarly, I can go and follow
the arrows once the encoders are
done processing.
The encoders are going to get
submitted to the GPU track.
Following the arrows
the same way,
I can see my encoders
getting processed on my GPU.
This GPU track is separated
into three different lanes,
one for vertex processing,
one for fragment,
and one for compute.
So, for instance, here I can see
the rendering work
for my shadow buffer pass going
through its vertex
processing phase and moving
on to the fragment phase,
which happens to overlap
with my G-buffer and
lighting phase as well,
something that is desirable.
A quick note here: the
vertex, fragment, and compute
processing costs include more
than just the shader processing time.
For instance, we
are running on iOS,
and it's a tile-based
deferred architecture,
so the vertex processing
cost is going
to include the tiling
cost as well.
It's something to keep in mind.
Finally, once my frame is done
rendering, the surface is going
to end up on the
display, which is shown
in the track at the bottom.
Essentially, it's showing me
what time my frame was swapped
onto the display and how
long it stayed there.
Underneath that, we
have the VSync track,
which shows us the
VSync intervals separated
by these spikes that correspond
to individual VSync events.
Finally, at the bottom,
we have our detail view.
The detail view is similar
to what you would see
in other instruments.
It offers contextual
detail based on the
instrument you have selected.
For instance right now,
I have the Metal application
instrument selected,
so I can go ahead and expand
this to see all of my frames
and all of the command
buffers and encoders along
with the hierarchy involved.
This view is useful if you want
to see, say, precise timings:
if I go to the encoder list,
I can see precise creation
and submission timings,
or what process something
originated from.
It's very useful.
Cool! So this timeline look
at the graphics pipeline is
an incredibly powerful tool.
It's available for the first
time with iOS 9 and Metal.
So how do you use this to
help you solve your problems?
Or what does a problem app look like?
Let me go ahead and
open a different trace
to show you that.
In a couple of minutes, Philip
will go into a lot more detail
than I will about
Metal performance
and how you can use this
tool for that purpose.
But I'm going to give
you a quick overview
of the tool's workflow and
a quick couple of tips.
First and foremost, you
need to be concerned
about your CPU and
GPU parallelism.
You can see that this
trace I opened,
appropriately labeled
Problem Run,
is already sparser than
the last trace we took.
This is because we have
a number of sync points
where the CPU is actually
waiting on the GPU.
You need to make sure
you eliminate these.
Also, another useful thing
to look for is the pattern
that you see on the timeline.
These frames are all part of the
same scene, so they are going
to have really high
temporal locality.
Any divergence you
see might point
at a problem you
should investigate.
Another important thing is
the display surfaces track.
So ideally, if your frame rate
target is 60 frames per second,
these surfaces should
be staying on display
for a single VSync interval.
So we should be seeing
surfaces getting swapped
at every VSync interval.
This particular frame, for
instance, stayed on for three,
so we are running at 20 fps.
Another thing that's pretty useful
is the shader compilation track,
which directly shows you if the
shader compiler is kicking
in at any time during
your trace.
One thing that you want
to particularly avoid
is submitting work
to the shader compiler
midframe because it's going
to waste CPU cycles you
can use on other things.
Phil will explain this in more
detail in a couple of minutes.
Finally, you should aim to
profile early and often.
A workflow like this will
help you figure out problems
as they occur and make
it easier to fix them.
And Xcode helps you with that by
offering a profile launch option
for your build products.
It's going to automatically
build a release version
of your app, install it
on the device,
and start an Instruments run
with a template of your choice.
All right.
So that was our first look
at Metal System Trace,
available for all of your
Metal-capable iOS devices
out there.
Please give it a try.
We are looking forward to
your feedback and suggestions.
Now, I will hand the
stage back to Phil,
who will demonstrate a couple
of key Metal performance issues
and how you can use our
tools to identify these.
Thank you.
[ Applause ]
>> PHILIP BENNETT:
Thank you, Serhat,
that was very informative.
Now, we are going to cover the
aforementioned Metal performance
best practices, and we
are going to use the tools
to see how we can diagnose
and hopefully follow
these best practices.
So let me introduce our sample
app, or rather a system trace
of our sample app, and
immediately we can see
that there are several
performance issues.
To begin with, there
is no parallelism
between the CPU and the GPU.
These are incredibly
powerful devices,
and the only way you are going
to obtain the maximum
performance is
by having them run
independently,
whereas here they seem to
be waiting on each other.
So we can see there
is a massive stall
between processing
frames on the CPU.
It's a whopping
22 milliseconds.
We shouldn't have any stalls.
What's going on there?
And if we look at the actual
active period of the CPU,
it exceeds our frame deadline.
We were hoping for
60 frames per second.
So we had to get everything
done within 16 milliseconds.
And we have blown past that.
And things don't look much
better on the GPU side, either.
There is a lengthy stall, in
proportion to the one on the CPU,
because the CPU has been
spending all its time doing
nothing of note and
hasn't been able to queue
up work for the next frame.
Furthermore, the active GPU
period overshoots the frame
deadline, and we are shooting
for 60 frames per second,
but it looks like we
are only getting 20.
So what can we do about this?
Well, let's go back to basics.
Let's first examine one
of the key principles
of Metal design and performance.
And that's creating
your expensive objects
and state up front.
Now, in a legacy app, typically
what would happen would be
during content loading, the app
would compile all of its shaders
from source, and that could be
dozens or even hundreds of them,
and this is a rather
time-consuming operation.
Now, this is only half of the
shader compilation story,
because the shaders
themselves need to be compiled
into a GPU pipeline
state in combination
with the various state used.
So what some apps
might attempt to do is
to do something known
as prewarming.
Now, normally the device
compilation would occur
when the shaders and states
were first used in a Draw call.
That's bad news.
Imagine you have a racing game
and suddenly you turn a corner
and it draws in a
lot of new objects
and the frame rate drops.
That's really bad.
So what prewarming does is you
issue a load of dummy Draw calls
with various combinations of
graphics states and shaders
in the hope that the driver
will compile the relevant GPU
pipeline state.
So when the time comes
to actually draw using this
combination state and shaders,
everything is ready to go and
you don't get a frame rate drop.
Now, in the actual
rendering loop,
there would typically be
your setting of states,
and if you actually
get around to any,
maybe you will do some
Draw calls as well.
So the Metal approach is to
move the expensive stuff ahead
of time.
Shaders can be compiled
from source offline.
That's already saving
a chunk of work.
We move state definition
ahead of time.
You define your state.
The GPU pipeline
state is compiled
into these state objects.
So when you come to actually do
the Draw calls, there is none
of that device compilation
nonsense, and there is no need
for shader prewarming anymore.
It's a thing of the past.
That leaves the rendering
loop free for Draw calls.
Loads of Draw calls.
So fundamentally,
Metal facilitates
upfront state definition
by decoupling expensive state
validation and compilation
from the Draw commands, thus
allowing you to pull this
out of the rendering loop
and keep the rendering loop
for actual Draw calls.
Now, the expensive-to-create
state is encapsulated
in these immutable state
objects, and the intention is
that you will create these
once and reuse them many times.
Now, getting back
to our sample app,
here we see there is some shader
compilation going on midframe,
and we are wasting about
a millisecond here.
That's no good at all.
And if we look at
Xcode's frame debugger,
look at all of this
happening in a single frame.
Look at all of these objects.
We don't want any of this.
All that you should be
seeing is this, the creation
of the command buffer for
the frame and the acquisition
of the drawable and its texture.
All of the rest is
completely superfluous.
So let's cover these
expensive objects
and when you should create them.
And we are going to begin
with shader libraries.
These are your library
of compiled shaders.
Now, what you really want to do
is compile all of them offline.
You can use Xcode:
any Metal source files
in your project will
automatically be compiled
into the default library.
Now, your app may have its
own custom content pipeline,
and you might not necessarily
want to use this approach.
So for that, we provide
command-line tools,
which you can integrate
into your pipeline.
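As a rough sketch, and with file names that are purely hypothetical, the offline compile with the command-line tools looks something like this (exact flags can vary by toolchain version):

    xcrun -sdk iphoneos metal -c Shaders.metal -o Shaders.air
    xcrun -sdk iphoneos metallib Shaders.air -o Shaders.metallib

The resulting .metallib can then be loaded at runtime, for instance with the device's makeLibrary(filepath:) method.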
If you absolutely cannot
avoid compiling your shaders
from source at runtime, the
best you can do is create
them asynchronously.
So you create the library,
and in the meantime, your app,
or rather, the calling threads,
can get on with doing
something else,
and once the shader
library has been created,
your app will be
asynchronously notified.
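Here is a minimal sketch of that in Swift, assuming a device and a string holding your shader source:

    import Metal

    // Asynchronous library creation: the calling thread is not blocked
    // while the shader source compiles.
    func buildLibrary(on device: MTLDevice, source: String) {
        device.makeLibrary(source: source, options: nil) { library, error in
            guard let library = library else {
                print("Shader compilation failed: \(String(describing: error))")
                return
            }
            // Called back asynchronously; stash `library` for later use.
            _ = library
        }
    }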
Now, one of the first
objects you will be creating
in your app will be the
device and command queue.
And these represent the GPU
you will be using and its queue
of ordered command buffers.
Now, as we said, you want
to create these during
app initialization,
and because they are
expensive to create,
you want to reuse them
throughout the lifetime
of your app.
And, of course, you want
to create one per GPU used.
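A minimal sketch of that one-time setup in Swift:

    import Metal

    // Created once, during app initialization, and reused for the
    // lifetime of the app: one device and one command queue per GPU.
    guard let device = MTLCreateSystemDefaultDevice(),
          let commandQueue = device.makeCommandQueue() else {
        fatalError("Metal is not supported on this device")
    }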
Now, next is the
interesting stuff, the render
and compute pipeline state,
which encapsulates all
of the programmable
GPU pipeline states,
so it takes all the descriptors,
your vertex formats, your shaders,
render buffer formats,
and compiles it
down to the actual
raw pipeline state.
Now, as this is an
expensive operation,
you should be creating
these pipeline objects
when you load your
content, and you should aim
to reuse them as
often as you can.
Now, as with the libraries,
you can also create these
asynchronously using
these methods.
So once created, your
app will be notified
by a completion handler.
One point to mention is that
unless you actually need it,
you shouldn't obtain
the reflection data
as this is an expensive
operation.
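Here is a sketch of asynchronous pipeline state creation in Swift; the function names and pixel format are placeholders for your own content:

    import Metal

    // Built at content-load time, then reused for many Draw calls.
    func buildPipeline(device: MTLDevice, library: MTLLibrary) {
        let descriptor = MTLRenderPipelineDescriptor()
        descriptor.vertexFunction = library.makeFunction(name: "vertexMain")
        descriptor.fragmentFunction = library.makeFunction(name: "fragmentMain")
        descriptor.colorAttachments[0].pixelFormat = .bgra8Unorm

        // No reflection data is requested, and the completion handler
        // fires once the GPU pipeline state has been compiled.
        device.makeRenderPipelineState(descriptor: descriptor) { state, error in
            // Cache `state` somewhere and reuse it every frame.
            _ = state
        }
    }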
So next we have the depth
stencil and sampler states.
These are the fixed-function
GPU pipeline states,
and you should be creating these
when you load your content along
with the other pipeline states.
Now, you may end up with many,
many depth stencil
and sampler states, but you
needn't worry about this,
because the Metal
implementation will internally
hash the states and avoid
creating loads of duplicates,
so don't worry about that.
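A sketch of creating these fixed-function states at load time, assuming a device:

    import Metal

    // Depth/stencil and sampler states, created alongside the
    // pipeline states when content is loaded.
    func buildFixedFunctionState(device: MTLDevice) {
        let depthDescriptor = MTLDepthStencilDescriptor()
        depthDescriptor.depthCompareFunction = .lessEqual
        depthDescriptor.isDepthWriteEnabled = true
        let depthState = device.makeDepthStencilState(descriptor: depthDescriptor)

        let samplerDescriptor = MTLSamplerDescriptor()
        samplerDescriptor.minFilter = .linear
        samplerDescriptor.magFilter = .linear
        let samplerState = device.makeSamplerState(descriptor: samplerDescriptor)

        _ = (depthState, samplerState)  // keep these around and reuse them
    }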
Now, next we have the actual
data consumed by the GPU.
You have got your
textures and your buffers.
And you should, once
again, be creating these
when you load your content, and
reuse them as often as possible,
because there is an overhead
associated with both allocating
and deallocating
these resources.
And even for dynamic resources,
you might not be able
to fully initialize them
ahead of time, but you should
at least create the
underlying storage.
And we are going to be
covering more on that very soon.
So to briefly recap.
So the most expensive states
obviously should be created
ahead of time, so these
are the shader libraries
that you aim to build offline.
The device and the command
queue, which are created
when you initialize
your app, the render
and compute pipeline states,
created when you
load your content,
as are the fixed-function
pipeline states,
the depth stencil
and sampler states,
and then finally the
textures and buffers
that are used by your app.
So we went ahead and we
applied this best practice
to our example app, which you
may remember looked like this.
We had some shader compilation
occurring midframe every frame,
and now we have got none.
So already we have saved about
a millisecond of CPU time.
This is a good start,
but we will see
if we can do better soon.
So in summary, create your
expensive state and objects
up front and aim to reuse them.
Especially, compile your shader
source offline, and you want
to keep the rendering loop
for what it's intended for.
It's for Draw calls.
Get rid of all of
the object creation.
Now, what about the resources
you can't entirely create
up front?
We are talking about
these dynamic resources,
so what do we do about them?
How can we efficiently
create and manage them?
Now, by dynamic resources,
we are talking
about resources which, once
created, may be modified many,
many times by the CPU.
And a good example of this
is buffer shader constants,
and also any dynamic vertex and
index buffers you might have
for things like particles
generated on the CPU,
in addition to dynamic textures,
perhaps your app has some
textures which it modifies
on the CPU between frames.
So ideally given the choice,
you would put these resources
somewhere which is efficient
for both the CPU and
the GPU to access.
And you do this with the
shared storage mode option
when you create your resource.
And this creates
resources in memory shared
by both the CPU and the GPU.
Now, this is actually the
default storage mode on iOS,
iOS devices having a unified
memory architecture,
so the same memory is shared
between the CPU and GPU.
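For instance, here is a sketch of creating a shared constants buffer, with the size chosen arbitrarily for illustration:

    import Metal

    // A CPU-writable buffer in shared memory (the default on iOS).
    func makeConstantsBuffer(device: MTLDevice) -> MTLBuffer? {
        let length = MemoryLayout<Float>.stride * 16  // e.g. one 4x4 matrix
        let buffer = device.makeBuffer(length: length,
                                       options: .storageModeShared)
        // The CPU can write through buffer?.contents() at any time; it is
        // entirely up to the app not to touch data the GPU is still reading.
        return buffer
    }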
Now, the thing about these
shared resources is the CPU has
completely unsynchronized
access to them.
It can modify the data as freely
as it wants through a pointer.
And in fact, it's quite easy
for the CPU to stomp all
over the data which
is in use by the GPU,
which tends to be
pretty catastrophic.
So we want to avoid that.
But how can we achieve this?
Well, the brute force approach
would be to have a single buffer
for the resource, where we
have, say, a buffer of constants
which are updated on the CPU
and consumed later by the GPU.
Now, if the CPU wants to
modify any of the data
in the constants
buffer, it has to wait
until the GPU is
finished with it.
And the only way it can
know that is if it waits
for the command buffer in which
the resource is referenced
to finish processing on the GPU.
And for that, in this case
we use Wait Until Completed.
So we wait around, rather
the CPU waits around,
until the GPU is
finished processing
and then it can go ahead
and modify the buffer,
which is consumed by the
GPU in the next frame.
Now, this is really bad because
not only is the CPU stalled
but the GPU is stalled as well
because the CPU hasn't had time
to queue up work
for the next frame.
This is what is happening
in the example app.
The CPU is waiting around for
the GPU to finish on each frame.
You are introducing a massive
stall period, and, yes,
there is no parallelism
between the CPU and the GPU.
So clearly we need
a better approach,
and you might be tempted to just
create new buffers every frame
as you need them.
But as we learned in
the previous section,
that's not a particularly
good idea
because there is an
overhead associated
with creating each buffer.
And if you have many buffers,
large buffers, this will add up,
so you really don't
want to be doing this.
What you should do instead
is employ a buffer scheme.
Here we have a triple
buffering scheme,
where we have three buffers,
which are updated on the CPU
and then consumed by the GPU.
Why three?
Typically we suggest
that you limit the number
of command buffers in flight
to three, and effectively,
you have one buffer
per command buffer.
And by employing a
semaphore to prevent the CPU
from getting too far ahead
of the GPU, we can ensure
that it's safe to update
the buffers on the CPU
when the GPU wraps
around, when it goes back
to reading the first buffer.
Rather than bore you with
a lot of sample code,
I will point you straight at a
great example we already have.
That is the Metal
Uniform Streaming example,
which shows you exactly
how to do this.
So I recommend you check it out
afterward if you are interested.
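Still, here is a minimal sketch of the idea, with the buffer size and the actual encoding left as placeholders:

    import Metal
    import Dispatch

    // A triple-buffering sketch: the semaphore keeps the CPU at most
    // three frames ahead of the GPU, so by the time the CPU wraps
    // around to a buffer, the GPU is guaranteed to be done with it.
    let maxBuffersInFlight = 3
    let inFlight = DispatchSemaphore(value: maxBuffersInFlight)
    let device = MTLCreateSystemDefaultDevice()!
    let queue = device.makeCommandQueue()!
    let buffers = (0..<maxBuffersInFlight).map { _ in
        device.makeBuffer(length: 4096, options: .storageModeShared)!
    }
    var frameIndex = 0

    func renderFrame() {
        inFlight.wait()                            // don't outrun the GPU
        let constants = buffers[frameIndex]
        frameIndex = (frameIndex + 1) % maxBuffersInFlight
        // ... write this frame's constants into `constants` ...
        let commandBuffer = queue.makeCommandBuffer()!
        // ... encode passes that read `constants` ...
        commandBuffer.addCompletedHandler { _ in
            inFlight.signal()                      // GPU done with the buffer
        }
        commandBuffer.commit()
    }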
Getting back to our example app,
you may remember we had these
very performance-crippling
waits between each
frame on the CPU.
Now, after employing a buffering
scheme to update dynamic data,
we managed to greatly reduce
the gap between processing
on both the CPU and the GPU.
We still have some sort
of synchronization issue,
but we are going to look
into that very shortly.
So we are making good
progress already.
And in summary, you
want to buffer
up your dynamic shared resources
because it's the most
efficient way of updating these
between frames, and you enforce
safety via the limited number of
buffers in flight that I mentioned.
Now, I'm going to
talk about something,
or rather the one thing, you
don't actually want to do
up front, and that relates
to when you acquire your
app's drawable surface.
Now, the drawable surface is
your app's window on the world,
it's what your app renders
its visible content into,
which is either displayed
directly on the display
or it may be part of a
composition pipeline.
Now, you retrieve the
drawables from the Metal layer
of Core Animation, but there
is only a limited number
of these drawables because
they are actually quite big,
and we don't want to keep loads
of them around nor do we want
to be allocating them
whenever we need them.
So these drawables are
maintained in a very
limited pool, and drawables
are relinquished
at display intervals once
they have been displayed
by the hardware.
And each stage of the display
pipeline may actually be holding
onto a drawable at any point
from your app, to the GPU,
to Core Animation if you
have any compositing,
to the actual display hardware.
Now, your app grabs a
drawable surface typically
by calling the
nextDrawable method.
If you are using MetalKit,
this will be performed
when you call
currentRenderPassDescriptor.
Now, the method will only return
once a drawable is available,
and if there happens to be a
drawable available at the time,
it will return immediately.
Great, you can go on and
continue with the frame.
However, if there are
none available, your app,
or rather the calling
thread, will be blocked
until at least the next display
interval waiting for a drawable.
This can be a long time.
At 60 frames per second,
we are talking 16 milliseconds.
So that's very bad news.
So is this what our
example app was doing?
Is this the explanation for
these huge gaps in execution?
Well, let's see what Xcode says.
So we go to the frame
navigator, and we take
a look here.
And Xcode seems to
have a problem
with our shadow buffer encoder.
See a little warning there.
So if we take a closer look,
we see that indeed we are
actually calling the
nextDrawable method earlier
than we should.
Xcode offers
some very sage advice:
we should only call it when
we actually need the drawable.
So how does this fit in
with our example app?
Well, we have several passes
here in our example app,
and we were acquiring the
drawable right at the start
of each frame before
the shadow pass.
This is far too early, because
right up until the last pass,
we are drawing everything
off screen,
and we don't need a drawable
right up until we come
to render the UI pass.
So the best place to acquire the
next drawable is naturally right
before the UI pass.
So we went ahead and we made
the change, we moved our call
to next drawable
later, and let's see
if that solved our problem.
Well, as you can
already see, yes, it did!
We removed our second
synchronization point,
and now we don't have any
stalls between frame
processing on the CPU.
That's a massive improvement.
So the advice is very simple:
only acquire the drawable
when you actually need it.
This is before the render pass
in which it's actually used.
This will ensure that
you hide any long latency
that would occur if there
weren't any drawables available.
So your app can continue
to do useful work,
and by the time it
actually needs a drawable,
one is likely to be available.
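A sketch of that frame structure, with the pass encoding elided:

    import Metal
    import QuartzCore

    // Offscreen passes are encoded first; the drawable is only acquired
    // right before the final UI pass that actually renders into it.
    func renderFrame(layer: CAMetalLayer, queue: MTLCommandQueue) {
        let commandBuffer = queue.makeCommandBuffer()!

        // ... encode the shadow, G-buffer, and lighting passes here;
        // none of them needs the drawable ...

        // Only now do we (potentially) block waiting for a drawable.
        guard let drawable = layer.nextDrawable() else { return }
        let uiPass = MTLRenderPassDescriptor()
        uiPass.colorAttachments[0].texture = drawable.texture
        uiPass.colorAttachments[0].loadAction = .load
        // ... encode the UI pass with `uiPass` ...

        commandBuffer.present(drawable)
        commandBuffer.commit()
    }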
So at this point we are
doing pretty well so far.
But there is still
room for improvement.
So why don't we look
at the efficiency
of the GPU side? Rather than
diving to a very low level, say,
trying to optimize our shaders
or changing texture formats,
why don't we see
if there is any general
advice we can apply?
As it so happens, there is.
That relates to how we use
Render Command Encoders.
Now, a Render Command
Encoder is what is used
to generate Draw commands
for a single rendering pass.
And a single rendering pass
operates on a fixed set
of color attachments, and
depth and stencil attachments.
Once you begin the pass, you
cannot change these attachments.
However, you can change
the actions acting on them,
such as the depth stencil
state, color masking
and blending, for instance.
And this is valuable
to remember.
Now, the way in which we use
our render encoders is particularly
important on the iOS device
GPUs due to the interesting way
in which they are architected.
They are tile-based
deferred renderers.
So each Render Command Encoder
results in two GPU passes.
First you have the vertex
phase, which transforms all
of the geometry in your encoder,
and then performs clipping
and culling, and then bins
all of the geometry
into screen space tiles.
This is followed by the fragment
phase, which processes all
of the objects tile
by tile to determine
which objects are visible,
and then only the visible
pixels are actually processed.
And all of the fragment
processing occurs
in these fast on-chip
tile buffers.
Now, typically at the end
of a render pass you only need
to store out the color buffer.
You would just discard
the depth buffer.
And sometimes you may have,
say, multiple color attachments,
but you only need to
store one of them.
By not storing the
tile data in each pass,
you are saving quite
a bit of bandwidth.
You are avoiding writing
out entire frame buffers.
This is important for
performance, as is not having
to load in data for each tile.
So what can Xcode tell us?
Can it give us -- or rather,
I mentioned that each
encoder corresponds
to a vertex pass
and a fragment pass.
And this applies
even for empty encoders,
and this is quite important.
Here we have actually
two G-buffer encoders,
and the first one doesn't
seem to be drawing anything.
I guess that just slipped
in there by mistake,
but this actually has quite an
impact on performance if we look
at the system trace of the app.
Just that empty encoder consumed
2.8 milliseconds on the GPU,
and presumably it was
just writing a clear color
out to however many
attachments we had, three color
and two depth and stencil.
And our total GPU
processing time
for this particular
frame is 22 milliseconds.
Now, if we remove
the empty encoder,
which is done very easily
because it shouldn't be there
in the first place, we go down
to 19, so that's a very nice win
for doing very little at all.
So watch out for
these empty encoders.
If you are not going
to do any drawing
in a pass, don't start encoding.
So let's look a bit deeper now.
Let's have a look at the render
passes in our example app
and see what we have got.
So we have got a shadow pass,
which renders into
a depth buffer.
We have a G-buffer
pass, which renders
into three color attachments and
a depth and stencil attachment,
and then we have these
three lighting passes,
which use the render attachment
data from the G-buffer pass,
either sampling through the
texture units or loading
the frame buffer content.
When the lighting
passes use this data,
they perform
lighting and output
to a single accumulation target
which is used several
times over.
And finally you have
a user interface pass
onto which user interface
elements are drawn
and presented to the screen.
So is this the most
efficient setup of encoders?
Once again we summon
Xcode's frame debugger to see
if it has anything to say.
And once again, yes it does.
It has taken issue with
our sunlight encoder.
So let's take a closer look.
We are inefficiently using
our command encoders.
And Xcode is kind
enough to tell us
which ones we could
actually combine.
So let's go ahead and
merge a couple of passes.
Rather than merge just two,
we can actually merge three,
which all operate on the
same color attachment.
So let's go ahead and do that.
So we have six passes
here, and now we are going
to merge them down to four.
So what impact did that have
on performance, GPU side?
Let's go back
to the system trace.
Here we can see we have
gone from 21 milliseconds,
six passes, down to 18, by
not having to load and store
all of that attachment data.
So that's quite a nice win.
But could we go any further?
Let's return to our app.
So we have four passes,
and is it actually possible
to combine both the G-buffer and
the lighting pass to avoid having
to store out five attachments
and keep everything on chip?
Well, it in fact is.
We can do that with clever
use of programmable blending.
So I'm not going to go
into too much detail there,
but what we did was we combined
these two encoders down to one.
So now we are left with
three render encoders
and we are having to
load and store far,
far less attachment data,
and that's a massive win
in terms of bandwidth.
So let's see what
impact that had.
Actually not a lot.
That was very unexpected.
We have only chopped
off about a millisecond.
That's not great.
I was hoping for more than that.
So once again, can
Xcode save us?
We turn to Xcode's
frame debugger.
And we take a closer look at
the load and store bandwidth
for the G-buffer encoder.
Now, it turns out that we
are actually still loading
and storing quite a lot
of data, and the reason
for that is quite simple.
It looks like here we have
mistakenly set our load
and store actions for each
attachment incorrectly.
We only wanted to be storing
the first color attachment,
and we want to discard the
remaining color attachments
in addition to the depth
and stencil attachments,
and we certainly don't
want to be loading them in.
So if we make the very simple
change, we change our load
and store actions to
something more appropriate,
we have reduced our load
bandwidth down to zero
and we have massively
reduced the amounts
of attachment data
we're storing.
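In code, the corrected actions look something like this sketch, for a G-buffer pass with three color attachments plus depth and stencil:

    import Metal

    // Store only color attachment 0; everything else stays on chip.
    func configureGBufferPass(_ pass: MTLRenderPassDescriptor) {
        for i in 0..<3 {
            pass.colorAttachments[i].loadAction = .clear        // never .load
            pass.colorAttachments[i].storeAction = (i == 0) ? .store : .dontCare
        }
        pass.depthAttachment.loadAction = .clear
        pass.depthAttachment.storeAction = .dontCare            // discard
        pass.stencilAttachment.loadAction = .clear
        pass.stencilAttachment.storeAction = .dontCare          // discard
    }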
So now, what impact
did that have?
So before, with our
three passes,
we were taking 17
milliseconds on the GPU.
Now, we are down to 14.
That's more like it.
So to summarize, don't
waste your render encoders.
Try to do as much useful
work as possible in them,
and definitely do
not start encoding
if you are not going
to draw anything.
And if you can, and with the
help of Xcode, merge encoders
which are rendering to
the same attachments.
This will get you big wins.
Now, we are doing pretty
well on the GPU side now.
In fact, we are actually
within our frame budget.
But is there anything we
can do on the CPU side?
If you remember, I think we were
actually still slightly beyond
our frame budget.
What about multithreading?
How could multithreading
help us?
What does Metal allow us to
do in terms of multithreading?
Fortunately for us, Metal was
designed with multithreading
in mind and has a very efficient
threadsafe and scalable means
of multithreading
your rendering.
It allows you to encode multiple
command buffers simultaneously
on different threads, and your
app has control over the order
in which these are executed.
Let's take a look at
a possible scenario
where we might attempt
some multithreading.
But before that, I
would like to stress
that before you even
go ahead and try
to multithread your rendering,
you should actively
pursue the best possible
single-threaded performance.
So make sure there is
nothing terribly inefficient
in there before you start
trying to multithread things.
Okay. So we have an example here
where we have two render passes,
and we are actually taking so
long to encode these two passes
on the CPU that we are actually
missing our frame deadline.
So how can we improve this?
Well, we can go ahead and
we can encode the two passes
in parallel.
And not only have we managed to
reduce the CPU time per frame,
the side effect is that the
first render pass can be
submitted to the GPU quicker.
So how would this look in
terms of Metal objects?
How does it come together?
Well, we start with our Metal
device and the command queue
as usual, and now for
this example we are going
to have three threads.
And for each thread, you
need a command buffer.
Now, two of the threads each
have a Render Command Encoder
operating
on separate passes,
and on our third thread we
might have multiple encoders
executing serially.
So it goes to show
the approaches
to multithreading can
be quite flexible,
and once they have all
finished their encoding,
the command buffers are
submitted to the command queue.
So how would you set this up?
It's quite simple.
You create one command buffer
per thread and you go ahead
and initialize render
passes as usual,
and now the important
point here is the order
in which the command buffers
will be submitted to the GPU.
Chances are this is
important to you.
So you enforce it by
calling the Enqueue method
on the command buffers,
and that reserves a place
in the command queue so when
the buffers are eventually
committed, they will be
executed in the order
that they were enqueued.
This is an important
point to remember.
Then we create the
render encoders for each thread,
and we go ahead and encode our
draws on the separate threads
and then commit the
command buffers.
It's really very simple to do.
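Here is a sketch of that setup for two threads, with the actual encoding elided:

    import Metal
    import Dispatch

    // Two command buffers encoded on separate threads; the enqueue calls
    // fix the GPU execution order before any encoding begins.
    func encodeInParallel(queue: MTLCommandQueue) {
        let shadowCommands = queue.makeCommandBuffer()!
        let mainCommands = queue.makeCommandBuffer()!
        shadowCommands.enqueue()   // will execute first
        mainCommands.enqueue()     // will execute second

        let group = DispatchGroup()
        DispatchQueue.global().async(group: group) {
            // ... create an encoder on shadowCommands, encode, end ...
            shadowCommands.commit()
        }
        DispatchQueue.global().async(group: group) {
            // ... create an encoder on mainCommands, encode, end ...
            mainCommands.commit()
        }
        group.wait()   // only if the CPU needs to synchronize here
    }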
Now, what about another scenario
which could potentially
benefit from multithreading?
So here again we
have two passes,
but one of them is significantly
longer than the other.
Could we split that up somehow?
Yes, we can.
Here, we will break it up
into two separate chunks.
We have three threads here.
One is working on the
first render pass,
and we have two dedicated to
working on chunks of the second.
And, again, here by employing
multithreading we are
within our frame deadline,
and we have got a bit of time
to spare on the CPU as well
for doing whatever
else we fancy doing.
It need not necessarily
be more Metal work.
So what would
this look like?
So once again, we have the
device and the command queue.
And for this example, we are
going to be using three threads.
But here we only want
one command buffer.
Next, we have the special form
of the Render Command Encoder,
the Parallel Render
Command Encoder.
Now, this allows you to split
work for a single encoder
over multiple threads, and this
is particularly important to use
on iOS because it ensures
that the threaded
workloads are later combined
into a single pass on the GPU.
So there is no loading and
storing between passes.
It is very important that
you use this if you are going
to split up a single pass
across multiple threads.
So from the Parallel
Render Command Encoder,
we create our three
subordinate command encoders,
and each will encode to
the command buffer. Now,
because we are multithreading,
they may finish encoding
at indeterminate times,
not necessarily in any
particular order.
Then the command buffer
is submitted to the queue.
Now, it's entirely feasible
that you could even have
parallel Parallel Render
Command Encoders.
The multithreading possibilities
are not quite endless,
but very flexible.
Or, like we saw earlier,
you could have a fourth thread
which is executing
encoders serially.
So how do we set this up?
Well, we begin by creating one
command buffer per Parallel
Render Command Encoder.
So no matter how many
threads you are using,
you only want one
command buffer.
We then proceed to initialize
the render pass as usual,
and then we create our
actual parallel encoder.
Now, here is the important bit.
When we create our
subordinate encoders,
the order in which they are
created determines the order
in which they will be
submitted to the GPU.
This is something to bear
in mind when you split
up your workload for encoding
over multiple threads.
Then we go ahead and we encode
our draws on separate threads,
and then finish encoding for
each subordinate encoder.
Now, the second important
point is all
of the subordinate encoders must
have finished encoding before we
end encoding on the
parallel encoder.
And how you implement
this is up to you.
Then finally, the command buffer
is committed to the queue.
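Put together, a sketch of the whole setup, with the per-thread draw encoding elided:

    import Metal

    // One command buffer, one parallel encoder, and subordinate encoders
    // whose creation order fixes their GPU submission order.
    func encodeSplitPass(commandBuffer: MTLCommandBuffer,
                         pass: MTLRenderPassDescriptor) {
        let parallel =
            commandBuffer.makeParallelRenderCommandEncoder(descriptor: pass)!
        let first = parallel.makeRenderCommandEncoder()!   // submitted first
        let second = parallel.makeRenderCommandEncoder()!  // submitted second

        // ... encode draws into `first` and `second` on separate threads,
        // each thread calling endEncoding() on its own encoder ...
        first.endEncoding()
        second.endEncoding()

        // Every subordinate encoder must have finished before this call.
        parallel.endEncoding()
        commandBuffer.commit()
    }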
So we went ahead and we
decided to multithread our app.
Look what turned up.
So previously, we had
serial encoding of passes.
This was taking 25
milliseconds of CPU time.
Now, we pursued an approach
where we encode the shadow pass
on one thread, and the G-buffer
pass and UI pass on another,
and now we are down
to 15 milliseconds.
That's quite a nifty
improvement,
and we have got a bit of time
left over on the CPU as well.
So as far as multithreading
goes,
if you find that you are still
CPU bound and you have done all
of the investigations you can,
and determined you haven't
got anything silly going
on in your app, and that
you could actually benefit
from multithreading, you
can encode render passes
simultaneously on
multiple threads.
But should you decide to
split up a single pass
across multiple threads,
you want to use the Parallel
Render Command Encoder to do so.
Now, what did we
learn in this session?
Well, we introduced the
Metal System Trace tool,
and it was great.
It offers new insight into
your app's Metal performance.
And you want to use this
in conjunction with Xcode
to profile early and often.
And as we have seen,
you should also try
to follow the best
practices set out,
so you want to create the
expensive state up front
and reuse it as often
as possible.
We want to buffer
dynamic resources
so we can efficiently
modify them between frames
without causing stalls.
We want to make sure we
are acquiring our drawable
at the correct point in time.
Usually at the last
possible moment.
We want to make sure we are
efficiently using our Render
Command Encoders.
We don't have any
empty encoders,
and we have coalesced any
encoders which are writing
to the same attachment
down to one.
And then if we find we are
still CPU bound as we were
in this case, we might consider
the approaches Metal offers
for multithreading
our rendering.
So how did we do?
Well, now look at our app!
We don't have any runtime
shader compilation.
Furthermore, our GPU workload
is within the frame deadline.
It's great.
As is the CPU workload.
And there are no gaps between
processing of frames on the CPU.
And we even got quite fancy and
decided to do multithreading.
We have a lot of time left
over there to do other things.
And we managed to
meet our target,
which in this case was
60 frames per second.
So well done us!
So now, the talk is
over, and if you would
like any more information
on anything mentioned
in this session, you can
visit our developer portal,
you can also sign up for
the developer forums,
and should you have any detailed
questions or general inquiries,
you can direct them to Allan
Schaffer, who is our Graphics
and Games Technologies
Evangelist.
So thank you very much
for attending this talk.
And we hope you found
it interesting,
and enjoy the rest of WWDC!
Thank you very much!
[ Applause ]