Transcript
[ Applause ]
>> Welcome.
We introduced a host of new technologies with Metal 2 to allow you to make better, faster, and more efficient applications.
My name is Michal and together
with my colleague Richard we'll
explore three main themes today.
With Metal 2 we are continuing
our direction of moving the
expensive things to happen less
frequently and making sure that
the frequent things are really,
really cheap.
Over the years we introduced
precompiled shaders, render
state objects, Metal Heap last
year all to make sure that you
can move the costly operations
outside of your main application
loop.
We gave you 10 times more draw calls by switching from OpenGL to Metal.
And this year we are introducing
our new binding API that gives
you some more.
And so we will talk about it a
bit further.
We are also putting the GPU more in the driving seat with GPU-driven pipelines.
And you will be able to create
new, novel algorithms, new
rendering techniques, and whole
unique experiences utilizing
Metal 2 on modern GPUs.
Well, speaking of the
experiences, we have a lot of
new features in Metal and we
have three other sessions that I
would love you to attend.
VR is coming to Mac this year
and with the new iMacs we are
giving you really powerful GPUs.
The external GPU is coming to
MacBook Pro to give you the same
power.
And this all enables your users
and your content creators to
experience VR in ways not
possible before.
Tomorrow's session will show you how to use our direct display technology to get your content to the HMD quickly and with low latency.
You'll learn about the new Metal API additions for VR and our new tools additions.
Machine learning is quickly
becoming a key feature of our
devices in many, many
applications.
And with Metal 2 you can use Metal Performance Shaders to utilize the power of the GPU for machine learning on both desktop and mobile devices.
And you're probably staring at
that picture behind me and
thinking, "How's that done?"
Well, we have a session for you on Thursday where you will learn about this, and about the machine learning primitives -- and the image processing primitives -- we have in our Metal Performance Shaders.
Lastly, our tools have seen the
biggest advancement yet with
Metal 2.
You'll be able to debug your applications quicker. You can drill down to problems more easily, and we are exposing, for example, GPU performance counters to make sure you can find your hotspots and your application's hot paths more quickly.
So I hope I got you excited
about the few days ahead and
let's get back to the present
with the content of today's
session.
So we'll start with argument
buffers, probably our biggest
core framework addition this
year.
Argument buffers provide an efficient new way of configuring which buffers, textures, and samplers your application can use, freeing up a considerable amount of CPU resources and actually enabling completely new use cases for the GPU at the same time.
Then we'll talk about Raster Order Groups, a new fragment shader synchronization primitive that allows you to precisely control the order in which fragment shaders access common memory, enabling new use cases such as programmable blending on macOS, voxelization, or order-independent transparency.
And then we'll switch to the topic of display and talk about the new ProMotion displays on iPads and how to best drive them using Metal.
And we'll also give you a recap
of our best practices of getting
your content from your render
targets to the glass as quickly
as possible and with the least
amount of latency.
And finally we'll finish with a
survey of all the other Metal
features that we added to align
iOS and macOS platforms into one
big, common Metal ecosystem.
So the argument buffers.
Let's look at what they are and
how they work.
And I will need an example for that, so let's think of a simple material that anyone who has written any sort of 3D rendering program would know. In your material you have a bunch of numerical constants, a bunch of textures -- probably more than two nowadays -- and a sampler.
And this is what you need to
send to the GPU to be able to
render your primitive.
Now the texture objects are
interesting because they contain
both texture properties such as
width, height, pixel format
perhaps, and then a pointer to a
blob of memory which contains
all the pretty pixels.
Well, unfortunately we are not
really interested in those
pixels in this presentation.
So off it goes and we'll only be
talking about boring texture
states.
So with the traditional argument model we allow you to put all the constants into a Metal buffer, and we created this indirection so that it's easy for you to use; it also gives the GPU unfiltered, direct access to all the data.
However, when it comes to things like textures or samplers, you still need to go through quite a bit of API, and in your rendering loop you'll set the buffer, set all the textures and samplers, and only after that can you finally draw.
And even though Metal is really optimized, this is quite a few API calls, and if you multiply that by the number of objects you need to render every frame, and by the fact that you need to do all this work every frame, it actually at some point limits the number of objects you can put on the screen.
With argument buffers we decided
that we would like to extend
this very convenient indirection
that we have for constants to
everything.
So you can actually put texture state, samplers, and pointers to other buffers into an argument buffer, and this really simplifies your rendering pipeline, because suddenly the only thing you need to do is set the buffer and draw.
And you probably figured out that with so few API calls you can put more objects on the screen -- and as you'll see later, you can actually do even better with argument buffers.
So we've done a bunch of
benchmarks and run argument
buffers on our devices.
And this is, for example, what you get on iPhone 7. While with the traditional model, quite unsurprisingly, the cost of your draw call scales with the number of resources you use in a draw call, with argument buffers the cost stays pretty low and almost flat.
So this already shows that, for example, with a very simple shader with just two resources -- a texture and a buffer, or two textures -- you're getting seven times the performance. With eight textures or eight resources, however you want to mix it up, you are getting an 18-times performance improvement on iPhone 7, and it gets even better with 16 resources, obviously.
So I already talked about the performance, and I hinted at new use cases, which we'll talk about in a minute.
And the last point -- the last
benefit of argument buffers I
would like to bring up is the
ease of use.
And it comes from the fact that
argument buffers are ultimately
an extension of buffers.
So you can, for example, go ahead and prepare them ahead of time, let's say when your game is loading, and then not have to worry about them anymore during your rendering loop, further improving your performance.
Or you can mix them with the traditional binding model, even within a single draw call, which means that your adoption can be as simple as using our new tools to figure out the most expensive loop in your application, optimizing that, and then maybe returning to the rest in a year when you have time.
And lastly, the argument buffers
are supported across all Metal
devices.
So once you take this adoption
step and you get all the
performance you can keep using
it on all Metal devices.
The ease of use actually translates really well to the shaders. And since we will be looking at shaders quite a bit during this session, this is an example of the material I gave you at the beginning. As you can see, the textures and the sampler are part of the structure, and the main thing to take away is that your argument buffer is just a structure in a shader: you can use all the language features at your disposal -- nested structures to organize your data, arrays, or pointers. It just really works.
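For illustration, a minimal sketch of such a material in the Metal Shading Language might look like this; the struct layout and names are invented for the example, not the exact code from the slide:

    #include <metal_stdlib>
    using namespace metal;

    struct RasterizerData {
        float4 position [[position]];
        float2 uv;
    };

    // An argument buffer is declared as a plain struct: texture and
    // sampler state sit right next to the numerical constants.
    struct Material {
        texture2d<float> diffuse;
        texture2d<float> normalMap;
        sampler          smp;
        float4           tint;
    };

    fragment float4 shadeMaterial(RasterizerData in      [[stage_in]],
                                  constant Material &mat [[buffer(0)]])
    {
        float4 base = mat.diffuse.sample(mat.smp, in.uv);
        return base * mat.tint;
    }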
So let's now look at the three
main new features of argument
buffers, the first one being
dynamic indexing.
And a great example of it is crowd rendering.
If you've played some of the recent open-world games, you've seen that games try to render large crowds full of unique, varying characters in order to create these beautiful, immersive worlds.
Well, actually that's quite a
costly thing to do if you need
to create so many draw calls.
With argument buffers, we already said that we could put all the properties required for, let's say, a character into a single argument buffer, bind it, and save all that performance on the CPU -- but actually we can do better.
We can, for example, create an array of argument buffers where each element represents a single character.
And then it suddenly becomes very, very simple, because all you need to do is set this big buffer -- that's one API call -- and issue a single instanced draw call, let's say with 1,000 instances because I would like 1,000 characters on screen. That's the second API call.
In a vertex shader you use
instance ID to pick the right
element from the array, get the
character, put it somewhere
where it needs to be in the
world, give it the right pose,
if it's for example mid-walk
cycle, and then in the fragment
shader again you use the
instance ID and pick the right
materials, the right hair color
to finalize the look.
So we are suddenly getting from
tens, hundreds, maybe thousands
of draw calls to a single one.
And it's faster on the CPU.
It's faster on the GPU.
And this is how simple it looks in a shader. Pretty much, your argument buffer becomes an array of structures. You pick the right element using the instance ID, reference within it, and you can, for example, take a pointer and pass it to your helper methods or whatever you need to do to process the data.
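As a rough sketch of that pattern (the Character fields are invented, and the instance ID is assumed to be forwarded from the vertex shader as a flat varying):

    #include <metal_stdlib>
    using namespace metal;

    struct CrowdVaryings {
        float4 position   [[position]];
        float2 uv;
        uint   instanceID [[flat]];   // forwarded by the vertex shader
    };

    struct Character {
        texture2d<float> albedo;
        sampler          smp;
        float4           hairColor;
    };

    // One bound buffer holds the whole crowd; each fragment thread picks
    // its own element with the instance ID.
    fragment float4 shadeCrowd(CrowdVaryings in              [[stage_in]],
                               device const Character *crowd [[buffer(0)]])
    {
        device const Character &c = crowd[in.instanceID];
        return c.albedo.sample(c.smp, in.uv) * c.hairColor;
    }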
The second great feature of
argument buffers is the ability
of the GPU to set resources.
And we actually created an
example for this.
We created a particle simulation running completely on the GPU. I'll tell you how we did that, and we'll see it in action later.
So we created an array of argument buffers where each element is a single particle -- and I guess you've already spotted a trend here.
Our simulation kernel then handles and simulates one particle per thread, but we actually want to go further: we want it to be able to create the particles in the kernel as well, on the GPU.
So in order to do that, and to give the particles the right materials, we also have an argument buffer with all the different materials we would like our particles to have. Every time you do an action in our little demo, the simulation kernel looks into the environment and sees what the most appropriate material is.
And let's say if you are in the
forest, we pick moss as the
right, appropriate material for
a rock and copy it to the
particle itself.
If you're on the rocks we pick
the rock material.
On the hill we pick grass.
So this way everything stays on the GPU, and it actually looks just as simple in the shader as I describe it. If you want to modify data on the GPU, you bind it as a device buffer and start assigning values as you are used to -- but this time around you can also copy textures or copy whole structures, and it's really this simple.
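A hedged sketch of what such a kernel could look like -- the particle and material layouts are invented, and copying texture state on the GPU like this needs a tier two device:

    #include <metal_stdlib>
    using namespace metal;

    struct Material {
        texture2d<float> albedo;
        sampler          smp;
        float4           tint;
    };

    struct Particle {
        float3   position;
        float3   velocity;
        Material material;   // nested argument buffer data
    };

    kernel void simulateParticles(device Particle *particles       [[buffer(0)]],
                                  device const Material *materials [[buffer(1)]],
                                  constant uint &terrainType       [[buffer(2)]],
                                  uint tid [[thread_position_in_grid]])
    {
        // One particle per thread.
        particles[tid].position += particles[tid].velocity;

        // Assigning the struct copies its texture and sampler state too;
        // the GPU is setting resources with a plain assignment.
        particles[tid].material = materials[terrainType];
    }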
And the last great feature I would like to mention is the ability of an argument buffer to reference another argument buffer. This way you can go ahead and create reusable, complex object hierarchies, just as you are used to from C++, Swift, or Objective-C.
Let's say, in the example of our renderer, you have a ton of objects but probably very few materials. What you can do is reference a material from each object and save some memory, or you can build your scene graph as a binary tree where you point to the objects and the tree nodes as you need them, just as you would on the CPU.
And you can share this data with
the CPU as well.
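A sketch of such a hierarchy in the shading language (the types are invented for the example):

    #include <metal_stdlib>
    using namespace metal;

    struct Material {
        texture2d<float> albedo;
        sampler          smp;
    };

    // An argument buffer can point at other argument buffers, so many
    // objects can share one material instead of duplicating it.
    struct Object {
        constant Material *material;
        float4x4           transform;
    };

    // A scene graph built as a binary tree, just like on the CPU.
    struct SceneNode {
        constant SceneNode *left;
        constant SceneNode *right;
        constant Object    *object;   // set on leaf nodes
    };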
So these are the main features.
And let's look at the support
matrix.
We have two tiers.
Tier one is supported across all Metal devices: you get the CPU performance improvements and the new shading language syntax. But because of the limitations of the GPUs, this tier is not able to utilize the GPU-driven use cases that I mentioned earlier.
With tier two, however, you are getting all of this -- all the new use cases -- and we are also really increasing the number of resources you can access: your shaders can access half a million textures and buffers for these new algorithms.
While tier one is supported on
all Metal devices, tier two is
something you need to query for.
But don't worry, the support is
really wide.
All the Macs with discrete GPUs are tier two, and so are the new MacBook Pros and the latest MacBook.
So you can go ahead and have
fun.
Now let's look at the demo I
promised you.
We will be showing three videos with three different features: a real-time rendered terrain with a material that changes dynamically, vegetation placed on the terrain by the GPU to make it interesting, and all these nice particles that I mentioned before.
So, as you see, we are painting height onto the terrain. We can change and sculpt the terrain, and the material actually follows.
And this is a great thing about argument buffers, because they allowed us to create one big argument buffer with all the possible materials as layers, and when we are rendering the terrain in a pixel shader, we look at things like terrain height, slope, and the amount of sun that reaches a certain pixel, and based on these properties and some others, we decide which materials are the best and most appropriate for that given pixel.
And this is all happening in real time, whereas previously we would have had to split the terrain into small pieces offline, analyze which pieces need which textures in order to make it as optimal as possible, and only then render it.
So we are going from a pre-processing step, which is heavy and prevents real-time modification, to something that is real time, without preprocessing, and completely dynamic.
And we added vegetation on it
and as you see the vegetation is
also context sensitive.
You see the palm trees on the
sand.
You see the little tiny apple
trees on the hills.
And while the vegetation itself
is fairly traditional instance
rendering, the power of the
argument buffers here is that it
allows us to share the same
terrain material with all the
same properties and the same
terrain analysis function
between two completely separate
pieces of code.
While terrain rendering uses all this data to render pixels, the compute kernel that places the geometry -- the vegetation -- actually analyzes the same materials to figure out the best type of tree to place in a given spot.
And this is very easy to maintain, because when we make a change, we just add new layers or change our analysis function and nothing else in our code changes, whereas previously we would have had to juggle maybe 70 textures between two completely separate code bases in order to keep them in sync.
Lastly, we have the particles.
I hope you can see that they
nicely get the material of the
terrain there.
Now what I did not mention is that this is all rendered with, again, a single draw call. We are rendering 16,000 particles here with a single draw call, with absolutely no involvement from the CPU.
And not only do the particles have unique materials, they actually have unique shapes, because argument buffers allow you to change your vertex buffer per draw call. If you tried to do that without argument buffers, you would have to create a complicated control handover between the GPU that simulates and the CPU that tries to come up with the best set of draw calls to represent all this variety. With argument buffers, this became just very, very simple.
Okay, so enough pretty pictures.
And let's wrap my portion of the
session with a look at some APIs
and some best practices.
As I mentioned before, argument
buffers are an extension of
Metal buffers and that means all
of our API related to buffers
just works.
You can go ahead and take an argument buffer and copy it somewhere else; you can blit it between CPU and GPU.
And while argument buffers look
like structures on the GPU for
shaders, on the CPU you will use
MTLArgumentEncoder objects to
fill up the content.
This abstraction allows Metal to create the most optimal memory representation for any given argument buffer on the specific GPU you are actually running on, so you get the best performance.
It also frees you, as the developer, from all these details and worries about, for example, how each GPU represents a texture, or where it lives in memory. All of this changes from platform to platform, and we hide it behind a simple interface so that you can write very simple and effective applications.
So I hope you're not worried
about the encoder that I
mentioned.
It's really, really simple to
use.
For example, if you want to create an argument encoder for this argument buffer, all you need to do is get the Metal function that uses the argument buffer and ask that function for the encoder -- and that's about it. That's all you need to do.
You get an object, and you start using a familiar set-texture and fill-constant API that is very, very similar to how you've been using Metal command encoders.
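In Swift, that flow might look like this sketch, reusing the shadeMaterial function from the earlier example; the library, device, textures, and sampler are assumed to exist already, and the argument indices match the order of the struct members:

    import Metal

    // Ask the function that consumes the argument buffer for an encoder.
    let function = library.makeFunction(name: "shadeMaterial")!
    let argumentEncoder = function.makeArgumentEncoder(bufferIndex: 0)

    // Allocate a buffer of the right size and point the encoder at it.
    let argumentBuffer = device.makeBuffer(length: argumentEncoder.encodedLength)!
    argumentEncoder.setArgumentBuffer(argumentBuffer, offset: 0)

    // Fill it with the familiar set-texture / set-sampler style calls.
    argumentEncoder.setTexture(diffuseTexture, index: 0)
    argumentEncoder.setTexture(normalTexture, index: 1)
    argumentEncoder.setSamplerState(linearSampler, index: 2)
    var tint = SIMD4<Float>(1, 1, 1, 1)
    argumentEncoder.constantData(at: 3)
        .copyMemory(from: &tint, byteCount: MemoryLayout<SIMD4<Float>>.stride)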
So this also plays into what I
said about ease of use and
transition.
There are multiple other ways of creating the encoder -- you can go more explicit with a descriptor -- but that's something to look into in the documentation if you need it. We advise you to get your argument encoders from the shader functions.
Now with all those interactions -- the GPU being able to step in and modify the argument buffers, dynamic indexing, half a million textures, all that in the mix -- it's not really possible for Metal to figure out which textures or buffers you actually intend to use in your rendering. But luckily, you as a developer have a pretty good idea about that.
So we ask you with argument
buffers to be quite explicit
about it.
If you are using Heaps -- and you absolutely should use Heaps to get the best performance out of your platform and the best way of organizing your data -- the only thing you need to do is tell Metal that you intend to use a Heap, or multiple Heaps; it's up to you. This makes sure that the textures are available for you in the rendering loop.
If you want to do something more specific -- let's say you would like to write to a render target from inside a shader, or read from a device buffer -- you use a more specific API and tell Metal that you intend to use that resource in a specific way. And again, it's as simple as this. You don't need to do anything else.
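A quick sketch of that residency declaration in Swift, assuming a render encoder and resources created earlier:

    // Everything allocated on the heap becomes resident with one call...
    renderEncoder.useHeap(textureHeap)
    // ...and a specific buffer the shader will read and write is declared
    // with its intended usage.
    renderEncoder.useResource(particleBuffer, usage: [.read, .write])

    // After that, binding is just one buffer and a draw.
    renderEncoder.setFragmentBuffer(argumentBuffer, offset: 0, index: 0)
    renderEncoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: 6)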
So let's wrap up with a couple of best practices. If you know Metal, they are very, very similar to what we tell you about using Metal buffers.
The best way to organize your data is by usage pattern. You probably have a ton of properties that do not change per frame; put them into one argument buffer and share it between all the objects, and you will save memory this way.
On the other hand, you will probably have a lot of properties that do change for every object, and you need to manage them every frame. For these, I think the best way is to put them into separate argument buffers, so that you can double-buffer them -- or whatever your management scheme is -- and you don't need to copy all the other data along with them.
And then you will likely have a ton of argument buffers that just don't change at all -- say the materials, or maybe some other properties. Just create these at the initialization of your application and keep using them.
Similar to Metal buffers, think about your data locality and how you actually use your argument buffers. If, for example, you have three textures that are accessed in a shader one after another, the best thing you can do is put those textures close to each other in the argument buffer, so that you maximize the use of GPU caches.
And as I mentioned at the beginning, the traditional argument model is not going anywhere; you should take advantage of it and mix it with argument buffers whenever that's more convenient.
So let's say you need to change a single texture for every object, for example a cube reflection; it would probably be an overhead to create an argument buffer just for that and upload it every frame. Just use the traditional model for this.
That's it about argument
buffers.
I really hope you will adopt our
new API and get some creative
use cases out of it.
And please welcome Richard, who
will talk about the Raster Order
Groups.
[ Applause ]
>> Thank you.
Hello. So thank you Michal.
So I'm going to take you through
the rest of the day's
presentation, starting with
Raster Order Groups.
So this is a new feature that gives you control over the GPU's thread scheduling to run fragment shader threads in order. This allows overlapping fragment shader threads to communicate through memory, where before that wasn't really possible in most cases.
So this opens up a whole new set of graphics algorithms that were not practically achievable with just write-only access to your framebuffers or unordered access to device memory.
For example, one of the key applications for this is order-independent transparency. We've already talked a lot today about how to reduce the CPU usage of your Metal application, and this feature lets you build an algorithm that blends back to front without having to pay the CPU cost of triangle-level sorting.
There have also been lots of investigations into advanced techniques such as dual-layer G-buffers, which can substantially improve post-processing results, or using the GPU rasterizer to voxelize triangle meshes. For both of these, unordered access to memory has been a really large barrier to efficient implementations.
But probably the simplest and
most common application for this
feature is just implementing
custom blend equations.
iOS hardware could always do
this pretty natively, but this
is not something that desktop
hardware has traditionally been
able to do.
So I'm going to use custom
blending as an example
application to introduce this
feature.
Okay, so pretty typical case of
triangle blending; one triangle
over another.
Pretty much all modern GPU APIs
guarantee that blending happens
in draw call order.
It provides this nice,
convenient illusion of serial
execution.
But of course what's really
going on behind the scenes is
GPU hardware's highly parallel.
It's going to be running
multiple threads concurrently.
And only this fixed-function
blend step at the end is going
to be delayed until everything
gets put back in order again.
There's this implicit wait that
happens before that blend step.
Things change, however, if we need to put things in order not at the end of our fragment shader but right in the middle, because in this case triangle one wants to write something to memory that triangle two's threads want to read.
If we want triangle two to be
able to build upon and consume
triangle one's data we need to
get that ordering back.
And so that's pretty much what
Raster Order Groups provides.
So I'm going to jump over to a
shader code example.
So if I want to implement custom blending, an initial attempt that does not work would be to replace my classic graphics framebuffer with a read/write texture and perform all of my rendering and blending directly on this texture.
But of course, if the threads that I'm blending over have yet to execute, or are concurrently executing, this whole read/modify/write sequence is going to create a race condition.
So how do we use Raster Order
Groups to fix this?
It's really, really easy.
All I have to do is add a new
attribute to the memory that has
conflicting accesses.
At this point the compiler and the hardware cooperate to implicitly take the entire range of the fragment shader that accesses that memory -- from the very first to the very last access -- and turn it into a critical section behind the scenes.
You can also apply this
attribute to normal device
memory pointers, not just
textures.
So with that we get the thread schedule that we want. Thread one will proceed and write to memory, and thread two is going to stop and wait until thread one's write is complete, giving us basically race-free access to this memory.
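Putting the pieces together, a sketch of the custom blend shader with the attribute applied (the blend math here is just an example):

    #include <metal_stdlib>
    using namespace metal;

    struct BlendVaryings {
        float4 position [[position]];
        float4 color;
    };

    // Marking the read/write texture with raster_order_group turns the
    // span from its first to last access into an ordered critical section
    // between overlapping fragment threads.
    fragment void customBlend(
        BlendVaryings in [[stage_in]],
        texture2d<float, access::read_write>
            framebuffer [[texture(0), raster_order_group(0)]])
    {
        uint2 coord = uint2(in.position.xy);
        float4 dst = framebuffer.read(coord);   // waits for earlier overlaps
        float4 src = in.color;
        framebuffer.write(src * src.a + dst * (1.0 - src.a), coord);
    }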
Oh, and there's one other really important topic, and that's which threads synchronize with each other.
So of course GPU hardware's
going to be running not just
two, but tens of thousands of
threads at the same time and in
fact it's probably executing
every single thread from both of
these triangles simultaneously.
So of all of these tens of thousands of threads, which ones synchronize with each other?
So I've highlighted one pixel
here because that's the answer
to this question.
This feature only synchronizes against other threads that your current fragment shader thread overlaps with: the threads targeting the same framebuffer xy location, the same multi-sample location, and the same render target index.
And it specifically does not provide any guarantee that you can safely access memory written by neighboring pixels.
If you do need these kinds of area -- or region-of-influence -- algorithms, then you will need to go back to using full API barriers between draw calls or render passes. But those come at a much higher performance cost, and they do not work in the case where you have triangle overlap within a single draw call.
But for these common algorithms that need overlap-only synchronization, Raster Order Groups can get the job done at a substantially lower performance cost.
So this one is actually pretty easy, and that's really all I've got to say about it. Raster Order Groups let you efficiently wait for overlapping -- and only overlapping -- threads to finish their access to memory, which enables a collection of GPU algorithms that were previously just too inefficient to use practically on GPU hardware.
This mid-shader thread synchronization is a feature of the latest GPU hardware, so it is something you do need to check for at run time.
In particular it's supported on
the newest AMD Vega GPUs
announced this week as well as
the past couple years' worth of
Intel GPUs.
And that brings us on to our
second feature and that is the
new iPad Pro's ProMotion
Display.
So ProMotion, this is a
particularly great feature for
graphics and game developers and
so I really want to show you
what you can do with it.
This is the first of a sequence
of timeline diagrams I'm going
to show you, showing us when the
GPU starts and finishes
producing a frame, and then when
that same frame finally gets
onto the glass for the user to
see.
The first and most obvious thing
that ProMotion does is we can
now render at 120 frames per
second.
This feels absolutely fantastic
for anything that has really
high speed animations, for
anything that's latency critical
such as tracking user touch or
pencil input.
And it does have some catches.
You of course only get half as
much CPU and GPU time available
per frame so you really have to
pay a lot of attention to
optimization and it does
increase overall system power
consumption.
But if you've got the right content, where this matters, there's a real payoff for the user experience.
But ProMotion goes a lot farther
than 120 frames per second
rendering.
It also provides much more
flexibility regarding when to
swap the next image onto the
glass.
We're not limited to just 120 or
30 or 60 frames per second.
ProMotion behaves much more
gracefully as your application's
performance moves up and down
compared to a fixed frame rate
display.
For example, here I have a timeline diagram of a title that is just doing too much GPU work to target 60 frames per second. It's producing frames about every 21 milliseconds, or about 48 frames per second.
The GPU is perfectly happy to do
that, but on the display side we
can only refresh once every 16
milliseconds and so we end up
with this beating pattern.
There's this stuttering that the
user feels where some frames are
on the glass a lot longer than
others.
And it's not nice at all.
And so pretty much universally
what applications do in this
case is they all have to
artificially constrain the frame
rate all the way down to 30
frames per second.
They're basically trading away
their peak frame rate in order
to get some level of
consistency.
ProMotion does much better here.
So if I just take the same
application, move it to a
ProMotion display, it does this
to our timeline.
We now have a refresh point
every four milliseconds rather
than every 16.
Our timeline gets pulled in,
even with the GPU doing exactly
the same work as before.
The display can now present at
an entirely consistent 48 frames
per second.
The user is now getting both the
best possible frame rate and
perfect consistency from frame
to frame.
This tradeoff that we had to
make is completely gone.
A second example: this time an application wanted to make 60 frames per second, but one frame just ran a bit long and we missed our deadline.
On a fixed frame rate display we
end up on the display side with
a pattern that looks very
similar to what we saw before.
ProMotion can fix this too.
So frame one's time on the
glass, rather than it being
extended by 16 milliseconds, is
now only extended by four.
The degree of stutter that the
user experiences is tremendously
reduced and then frame two and
three, their latency gets pulled
right back into where they were
before.
The system recovers right back
onto the timeline right away,
latency is improved, and your
application can proceed on.
We've just gotten right back to
where we wanted to be.
So put it all together, it just
makes animation just feel that
much more robust and solid no
matter what's going on.
So how do you actually go about
taking advantage of this?
For normal UIKit animation, such
as scrolling through lists or
views, iOS will do this entirely
for you out of the box.
It will render at 120 frames per second when appropriate.
It will use the flexible display
times when appropriate.
Metal applications though tend
to be much more aware of their
timing and so for those we've
made this an opt in feature.
Opting in is done really easily, just by adding a new entry to your application bundle's Info.plist.
Once you do this, the timing behavior of our three Metal presentation APIs changes a little bit.
And so I'm going to walk you
through those three APIs and how
they change now.
So the first of our Metal presentation APIs is just present. It says: present immediately; schedule my image to be put on the glass at the very next available refresh point after the GPU finishes.
On fixed frame rate hardware
that's 16 milliseconds and on
iPad Pro that's now four
milliseconds.
This is the easiest API to use because it takes no arguments, so it's the API that most of the people in this room are already using.
It's also the API that gives you
the lowest latency access to the
display.
It works identically on both our
fixed frame rate and ProMotion
hardware, but once you opt in it
starts working with much, much
better granularity.
The second of our Metal
presentation APIs is present
with minimum duration.
So this one says, whenever this
image lands on the glass, keep
it there for a certain fixed
amount of time.
So if my image lands on the
glass here, it's going to stay
for 33 milliseconds.
And if my start time shifts so
does the end time.
This is the API you'd use if you
want perfect consistency in
frame rate from frame to frame.
This is particularly useful for running at 30 frames per second on 60 hertz displays, although it's sometimes useful on ProMotion as well.
But our third presentation variant is the most interesting by far.
It's present at a specific time
and it does exactly what it
sounds like.
If the GPU's done well before
the designated time, the display
will wait.
If the GPU runs over your
deadline the display will pick
it up at the very next available
point afterwards.
This is the key API to use if you want to build fully custom animation and timing loops. Present-at-time, combined with a ProMotion display, basically lets you leave behind the concept of a fixed frame rate entirely and render your content exactly for the time the user is going to see it.
If you want to keep your Metal view perfectly in sync with something else happening on the system, such as audio, or if you want to provide the appearance of zero latency and be able to forward-project your animation to exactly when the user is going to see your content, this is what lets you do that.
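A quick sketch of the three variants in Swift -- you would issue one of these per frame, with a command buffer and a drawable from a CAMetalLayer; the duration and the projectedFrameLatency value are illustrative:

    import Metal
    import QuartzCore

    // 1. Present immediately, at the next available refresh point.
    commandBuffer.present(drawable)

    // 2. Keep the image on the glass for at least a 30th of a second.
    commandBuffer.present(drawable, afterMinimumDuration: 1.0 / 30.0)

    // 3. Present at a specific host time you computed yourself.
    let targetTime = CACurrentMediaTime() + projectedFrameLatency
    commandBuffer.present(drawable, atTime: targetTime)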
Now of course the trick is implementing that projected next display time; that function is yours to write.
To make that work you do need
some feedback from the system to
help you determine what your
actual performance is.
And so we've added that as well.
So a Metal drawable object is a
transient object that tracks the
lifetime of one image you've
rendered all the way through the
display system.
It can now be queried for the specific time that frame lands on the glass, and you can also get a callback when that happens.
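A sketch of that feedback loop in Swift; updateFramePacing is a stand-in for your own pacing logic:

    // Register for the callback before presenting.
    drawable.addPresentedHandler { presented in
        // presentedTime is the host time this frame actually hit the
        // glass (it stays 0 if the frame was never shown).
        updateFramePacing(with: presented.presentedTime)
    }
    commandBuffer.present(drawable)
    commandBuffer.commit()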
So now you can know when your images are landing on the glass and when they're being removed, and you have the key signal to know whether or not you are making the timing you intended, so you can adjust for future frames.
So that's the story of ProMotion and what you need to do to make use of it on these new iPad Pros.
It's incredibly easy to get more consistent and higher frame rates with almost no code changes at all in most applications.
From there it gives you a menu
of options to decide what
display time model is going to
best benefit your particular
app.
A really, really fast-paced twitch arcade game, or something tracking touch or pencil input, probably wants to go for 120 frames per second.
A really high end rendering
title might want to stick with
30 or 60 frames per second or
somewhere in between and just
enjoy the consistency benefits.
And applications that want to
really take control of their
timing loop have entirely new
capabilities here as well.
But regardless of what your app
actually is, ProMotion gives you
this powerful new tool to
support its specific animation
needs.
So that's ProMotion.
So moving on, I have a different display topic to talk about, and that is a feature we're calling Direct to Display.
So the story of what happens between your GPU finishing rendering your content and the display is actually a little bit more complicated. Your image can take one of two paths to the display: GPU composition or direct to display.
The first of those is your typical user interface scenario, where I've got a collection of views or layers or windows and the like, and the system is going to take all of these and composite them together.
It's going to scale any content
to fit the display.
It's going to perform color space conversion. It's going to apply any Core Image filters or blending. And it's going to produce the one final, combined image that the user sees.
This is a really, really critical abstraction for full-featured user interfaces. But it's also all done on the GPU, and it takes some time and memory there. And if we're basically building a full-screen application, it's a little bit of overkill.
And so that's where direct
display mode comes in.
If none of these operations are actually required, we can point the display hardware directly at the memory you just rendered to, with no middleman at all.
So how do you enable this?
It turns out there is no single
turn it on API for direct to
display.
This mode is really an omission
of anything that requires the
GPU compositer to intervene.
When the compositer takes a look
at the set-up of your scene and
says there's nothing it needs to
do here it will just step out of
the way.
So how can you set up your scene
to get the compositer to step
out of the way?
This is pretty straightforward; an intuitive feel for "does my content need any kind of nontrivial processing?" is a pretty good start.
More specifically, you do want your layer to be opaque; we don't want to be blending over anything. We don't want to apply anything that requires Core Animation or the window server to modify our pixels -- no rounded corners on our view, no masking, no filters, or the like.
We do want to be full-screen.
If your content does not
actually match the aspect ratio
of the display it is okay to put
a full-screen, opaque, black
background layer to sort of give
a black bar kind of effect.
But in the end we want to
basically obscure everything.
We do want to pick render
resolutions that match the
native panel.
This is actually a little bit tricky, because on both macOS and iOS we ship hardware with virtual desktop or resolution modes that are larger than the actual physical panel. And the last thing we want to do is spend time rendering too many pixels, only to have to spend time on the GPU scaling it all back down again.
And finally, you want to pick a color space and pixel format that the display hardware is happy to read from directly. There's an infinite number of combinations here, so I want to help out by giving you a little white list of some particularly common and efficient combinations.
So right at the top is our good old friend, sRGB 8888.
This is pretty much the
universal pixel format that most
applications use and all
hardware is happy to read.
And so for most people that's
all they need.
But we've been shipping wide
color gamut P3 displays on both
our macOS and iOS hardware and
if your application does want to
start making use of this ability
to represent more colors, you
need to pay a bit more
attention.
The concepts are the same between iOS and macOS, although the details differ a little bit.
In both cases we do want to render to a ten-bit pixel format. But note that while rendering P3 content onto a P3 display is fine, if you render P3 content onto an sRGB display, the GPU compositor might have to get involved to crush the color space back down to fit the display. So P3 is not something you want to use universally, all the time; you do want to take a look at the current display and make this conditional.
Finally, for completeness, I'm also going to list RGBA Float16, which is sort of the universal wide-gamut, high-dynamic-range pixel format. It's also necessary for macOS's extended dynamic range feature. It is worth noting, though, that it requires GPU compositing in all cases.
So as I mentioned, you do want to be a little bit conditional if you write an application that's wide-color aware.
Fortunately, both UIKit and
AppKit provide really convenient
APIs to check that.
So the last step is: how do you know if you're actually on the direct to display path? This is a screenshot of our Metal System Trace tool in Instruments. Metal System Trace is a developer tool that gives you a live timeline of the CPU, the GPU, and the display: a real-world version of the diagrams I've been showing you in this presentation.
So in this case, I want to highlight the three frames that I've rendered. The colored time intervals are my own application's rendering, and the gray time intervals are some other process using the GPU. I can get more details down at the bottom of the window, where I can see it's coming from backboardd, our iOS composition process. So this is the case where my application is going down the GPU compositing path.
Going back and revisiting some of our best practices can remove that from the picture, and now I can rerun my Metal System Trace and see that I have a timeline where I've got the GPU completely and entirely to myself.
So that's it for direct to
display.
Our system compositors can make a lot of magic happen behind the scenes to make full-featured user interfaces possible, but that can come at a performance cost, because they use the GPU to do it. By being a little bit aware of what you're asking the compositor to do -- and more importantly, what you're not asking it to do -- it can get out of the way without using the GPU, returning some of that time to you.
Direct to display is supported on iOS and tvOS, and always has been; its support is new to macOS High Sierra for Metal applications.
So with that I want to touch on
our last topic of the day and
that's everything else.
There's a lot more that we've added to the core framework and shading language for Metal 2.
And so I'm not going to dive
deep into any of these things,
but I do want to give you a
survey.
So right off the bat we've added
some new APIs to be able to
query how much GPU memory's
being allocated for each buffer,
for each texture, for each Heap.
This actually takes into account
things that just generally
happen behind the scenes, like
alignment and various padding.
So this can give you a more
accurate view of how much GPU
memory you're actually using.
We also have a roll-up query on
the Metal device, which is the
entire GPU memory usage for your
entire process.
And this is particularly notable
because that also counts all of
the memory that the driver needs
to allocate that's not otherwise
visible to you; things like
memory to put shader code in or
command buffers or anything
else.
So this can tell you where you're at -- everything all in -- compared to your memory usage target.
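In Swift, these are simple properties; a minimal sketch:

    // Per resource: the actual GPU allocation, padding and alignment included.
    let textureBytes = texture.allocatedSize
    let bufferBytes  = vertexBuffer.allocatedSize
    let heapBytes    = heap.usedSize   // bytes currently in use within the heap

    // Per-process roll-up: everything, driver-side allocations included.
    print("GPU memory in use: \(device.currentAllocatedSize) bytes")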
We have a couple of compute-oriented additions. The first is a set of shading language functions that allow you to transfer data directly between threads in a SIMD group.
If you're not familiar: GPU hardware typically gangs individual vertex, fragment, and compute shader threads into SIMD groups and executes them together for greater efficiency. These are also called wavefronts or warps.
Within a group these threads do
have some ability to directly
communicate without having to
load and store through memory.
They can read values directly
out of one thread's register and
write them to another thread's
register.
And that's what these new
standard library functions
allow.
So in this case, broadcast means I can read a value directly out of thread zero's registers and write it directly into the registers of the 16 other threads that happen to be part of the group.
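A minimal kernel sketch of that broadcast, for hardware that supports the SIMD-group functions:

    #include <metal_stdlib>
    using namespace metal;

    kernel void broadcastFromLaneZero(device float *data [[buffer(0)]],
                                      uint tid [[thread_position_in_grid]])
    {
        float v = data[tid];
        // Every thread in the SIMD group receives lane zero's register
        // value directly; no loads or stores through memory are involved.
        data[tid] = simd_broadcast(v, 0);
    }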
Our second compute addition gives you more flexibility in how big your thread groups are. Say I have an image here that I want to run a pretty classic image processing kernel over, but I've written my compute kernel to use four-by-four threadgroups everywhere.
Well, this leads to some problems, because if my image is not a nice multiple of my threadgroup size, I've got a bunch of stray threads on the side. It means that when I actually write my code I have to be defensive -- am I out of bounds? -- and handle it in some special way. It's doable but annoying. It also means we're just wasting GPU cycles.
So non-uniform threadgroup sizes let you declare exactly what dimensions you want to run your kernel over, without them being a multiple of your threadgroup size. The hardware runs smaller threadgroups along the edges of the grid to shave off that unnecessary work, which both improves GPU performance and just makes your kernels easier to write.
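Dispatching such a grid is one call in Swift; a sketch, assuming a compute encoder with the kernel already set:

    // The grid exactly matches the image, using the four-by-four groups
    // from the example; Metal runs partial threadgroups along the edges.
    let threadsPerGroup = MTLSize(width: 4, height: 4, depth: 1)
    let grid = MTLSize(width: texture.width, height: texture.height, depth: 1)
    computeEncoder.dispatchThreads(grid, threadsPerThreadgroup: threadsPerGroup)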
We've added support for viewport arrays. You can now configure up to 16 simultaneous viewports, and your vertex shader can select, per triangle, which viewport that triangle gets presented into.
I'm not going to go further into
this because it will be
discussed in detail tomorrow in
the VR with Metal 2 session.
It is particularly valuable for
efficiently rendering to the
left and right eyes.
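As a hedged sketch, the selection is just a vertex output with an attribute on it (the parity trick here is only an example):

    #include <metal_stdlib>
    using namespace metal;

    struct StereoVertexOut {
        float4 position [[position]];
        uint   viewport [[viewport_array_index]];   // chosen per primitive
    };

    vertex StereoVertexOut stereoVertex(device const float4 *positions [[buffer(0)]],
                                        uint vid [[vertex_id]],
                                        uint iid [[instance_id]])
    {
        StereoVertexOut out;
        out.position = positions[vid];
        out.viewport = iid & 1;   // e.g. even instances left eye, odd right
        return out;
    }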
We've added the ability to choose where within each pixel your multi-sample locations are placed.
This lets you do a few interesting things, including toggling your sample positions every other frame, which gives you some valuable new input into temporal anti-aliasing algorithms.
In the vein of working to bring our platforms up to date so that they have the same feature set wherever possible, we've brought resource Heaps, which shipped last year in iOS 10, to macOS High Sierra this year.
So I'm going to actually do a
little bit of a refresher on
this because good use of your
Heaps is really important to
getting the most out of argument
buffers.
So with Heaps, I can allocate one big slab of memory up front, rather than going to the kernel again and again -- I want memory for texture A, I want memory for texture B, and so forth. I go to the kernel and get the memory right up front, and then I can add and remove textures and buffers without having to go back to the system.
This has a few advantages.
It means that I can bind
everything in that Heap much
more efficiently.
There's much less software
overhead.
It means that we can oftentimes pack that memory a little bit closer together -- we can save some padding and alignment -- saving you a little bit of memory.
It means that when we delete a texture we don't give the memory back to the system, which could be good or bad, and when we allocate a new texture we don't have to go back to the system for new memory.
It also means that you can choose to alias these textures with each other. Typically I have render targets, or intermediate render targets, between different passes in my render graph. If I have two different intermediates that just don't have to exist at the same point in time, I can alias them over each other and save tons of memory this way.
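A sketch of that aliasing in Swift; the sizes and formats are invented, and the two intermediates are assumed never to be live at the same time:

    // One up-front slab of GPU memory.
    let heapDescriptor = MTLHeapDescriptor()
    heapDescriptor.size = 64 * 1024 * 1024
    heapDescriptor.storageMode = .private
    let heap = device.makeHeap(descriptor: heapDescriptor)!

    let targetDescriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .rgba16Float, width: 1920, height: 1080, mipmapped: false)
    targetDescriptor.usage = [.renderTarget, .shaderRead]
    targetDescriptor.storageMode = .private

    // Mark the first intermediate aliasable once its pass no longer needs
    // it; the next allocation may then reuse the same memory.
    let bloomTarget = heap.makeTexture(descriptor: targetDescriptor)!
    bloomTarget.makeAliasable()
    let blurTarget = heap.makeTexture(descriptor: targetDescriptor)!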
So that's it for a quick survey
of Heaps.
We've brought linear textures from iOS to macOS. Linear textures allow you to create a texture directly from a Metal buffer, without any copies at all.
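A sketch in Swift; note that the offset and row stride have an alignment requirement you can query from the device:

    let linearDescriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .r32Float, width: 1024, height: 1024, mipmapped: false)
    let alignment = device.minimumLinearTextureAlignment(for: .r32Float)
    let bytesPerRow = 1024 * MemoryLayout<Float>.stride
    precondition(bytesPerRow % alignment == 0)

    // The texture is a zero-copy view of the buffer's memory.
    let linearTexture = buffer.makeTexture(descriptor: linearDescriptor,
                                           offset: 0,
                                           bytesPerRow: bytesPerRow)!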
We've extended our function constants feature a little bit. A quick refresher: function constants allow you to specialize bitcode. When you've done all your front-end compilation offline, you can then tweak and customize your uber-shader bitcode a little bit before actually generating the final machine code. If you have a classic uber shader, this can save you the cost of having to re-run the compiler front end for every single permutation.
So we've made this a bit more
flexible and added a few more
cases where you can use these
specialized arguments.
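A tiny sketch of a function constant in the shading language; the host fills in the value with MTLFunctionConstantValues when building the pipeline, and the shading here is just an example:

    #include <metal_stdlib>
    using namespace metal;

    // Set at pipeline creation; the compiler folds the branch away in the
    // specialized machine code.
    constant bool useNormalMap [[function_constant(0)]];

    fragment float4 shadeSpecialized(float4 pos [[position]])
    {
        float4 base = float4(pos.xy / 1024.0, 0.0, 1.0);
        return useNormalMap ? base * 0.5 : base;
    }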
We've added some extra vertex array formats: we had some missing one- and two-component vertex formats, and we've also added BGRA vertex formats.
We've brought IOSurface texture support from macOS to iOS.
And we've also brought dual-source blending to iOS, which is particularly useful in some deferred shading scenarios.
So that brings us to the end of Introducing Metal 2. My colleague Michal started by giving you a little bit of an overview of the overall scope of Metal 2.
From VR to external GPUs, to
machine learning, and to new
developer tools and performance
analysis.
Of that, the pieces that we really covered today are our next big push toward reducing CPU overhead using argument buffers. Argument buffers also unlock the ability for the GPU to start taking control of a little bit of its own destiny when it comes to configuring shader arguments, which is one less reason to go back to the CPU.
Raster Order Groups let us start using the rasterizer for things beyond basic in-order blending. We can now start taking advantage of the latest hardware capabilities to voxelize triangle meshes or do transparency blending, either in order or order-independent -- it makes both possible.
For the new iPad Pros, ProMotion
gives you very fine grained
control over exactly how your
animations are presented to the
user, giving you the ability to
get both peak frame rates and
the lowest possible latency.
Direct to display provides you a
path to reclaim a little bit of
GPU performance from the system
by being aware of what our
compositors do on your behalf.
So you'll be able to find the video and the slides for this session on the WWDC 2017 website.
We have three other sessions on
Metal 2 this year.
In particular, tomorrow
afternoon we're going to have a
session dedicated to VR and
Metal 2.
This is going to go deep into what your application needs to do, give a conceptual overview of how to do VR rendering, dive into specifically how to do VR with the combination of Metal 2 and the SteamVR toolkit, and also go into using Metal with external GPU hardware.
On Thursday we have a
doubleheader starting with Metal
2 optimization and debugging.
This is going to go into what's new in our developer and performance tools and all the new workflows they enable to help you build the best applications possible.
And it's going to be followed up
right after that with using
Metal 2 for compute.
And that's going to really have
a big focus this year on using
the GPU for machine learning
applications.
We've added a whole lot this
year and we want to show you
everything we've done.
I want to point you to a couple
of last year's WWDC sessions.
The first, What's New in Metal
Part One is where we did a deep
dive on resource Heaps.
And if you're looking to get the best performance out of argument buffers: argument buffers and Heaps were built to go together, so I highly encourage you to go check out that video and basically plan your application around both of them together. It covers Heaps in a lot more detail than we did here today.
Second, if the conversation about direct to display and wide color gamut interested you, we have a whole session from last year that goes deep into the concepts and the specifics behind that.
With that I think we'll wrap it
up.
I thank you for all attending
and I hope you enjoy the
remainder of your week.
So thank you.
[ Applause ]