Transcript
[ Music ]
[ Applause ]
>> Good morning everyone. My name is Karol Gasinski, and I am a member of the GPU Software Architecture Team at Apple.
We will start this session with a brief summary of what is new in macOS in terms of VR adoption. Then, we will take a deep dive into the new Metal 2 features designed specifically for VR this year. And finally, we will end this session with advanced techniques for developing VR applications.
Recently, we introduced the new iMac and iMac Pro that have great GPUs on board. iMac is now equipped with Radeon Pro 500 series GPUs and has up to 8 gigabytes of video memory on board, while iMac Pro is equipped with even more advanced Vega-based GPUs, with up to 16 gigabytes of video memory.
That's a lot of power that is
now in your hands.
But we are not limiting ourselves to iMacs alone. With the recent announcement of external GPU support, you can now turn any Mac into a powerful workstation that gives you more than 10 teraflops of processing power.
And that's not all.
Today we are introducing plug and play support for the HTC Vive Pro head-mounted display. It has two panels of 1,440 by 1,600 pixels each, at 615 pixels per inch. That's a 78% increase in resolution, and a 37% increase in pixel density, compared to Vive.
And with support for better
panels comes support for its new
dual-camera front-facing system,
so developers will now be able
to use those cameras to
experiment with pass-through
video on Mac.
And together with Vive Pro support comes an improved tracking system.
So, you might be wondering how you can start developing a VR application on macOS.
Both HTC Vive and Vive Pro work in conjunction with Valve's SteamVR runtime, which provides a number of services, including the VR compositor. Valve also makes the OpenVR framework available on macOS, so that you can develop apps that work with SteamVR.
We've been working very closely with both Valve and HTC to make sure that Vive Pro is supported in the SteamVR runtime on macOS.
So, now let's see how the new Metal features that we're introducing in macOS Mojave can be used to further optimize your VR application.
As a quick refresher, let's review the current interaction between an application and the VR compositor. The application starts by rendering the image for the left and right eye into two 2D multisample textures. Then it resolves those images into IOSurface-backed textures that can be passed further to the VR compositor.
The VR compositor performs a final processing step that includes lens distortion correction, chromatic aberration correction, and other operations; in short, we can call it warp. Once the final image is produced, it can be sent to the headset for presentation.
There is a lot of work here that is happening twice, so let's see if we can do something about that. Until now, if a VR application wanted to benefit from multisample anti-aliasing, it needed to use dedicated textures per eye, or a single shared one for both. But neither of those layouts is perfect.
The dedicated textures require separate draw calls and passes, as we just saw. Shared textures enable rendering of both eyes in a single render and resolve pass, but they are problematic when it comes to post-processing effects. Array textures have all the benefits of both dedicated and shared layouts, but until now they couldn't be used with MSAA.
This was forcing app developers to choose different rendering pipeline layouts based on whether they wanted to use MSAA or not, or to use various tricks to work around it.
So let's see how we can optimize
this rendering.
Today, we introduce a new texture type: the 2D multisample array texture. This texture type has all the benefits of the previously mentioned types, without any of the drawbacks.
Thanks to that, it is now possible to decouple the render size, which simplifies post-processing effects; the view count, so that the application can easily fall back to monoscopic rendering; and the anti-aliasing mode. As a result, an application can now have a single rendering path that can be easily adapted to any situation and, most importantly, that can render with a single draw call and render pass in each case.
So here we see a code snippet for the creation of the mentioned 2D multisample array texture. We set the sample count to 4, as it's an optimal tradeoff between quality and performance, and at the same time we set the array length to 2, as we want to store the image for each eye in a separate slice.
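For reference, a minimal Swift sketch of such a descriptor setup might look like this; the dimensions and pixel format here are placeholder values, not the ones from the slide:

    import Metal

    // `device` is assumed to be an existing MTLDevice.
    let descriptor = MTLTextureDescriptor()
    descriptor.textureType = .type2DMultisampleArray  // the new texture type
    descriptor.pixelFormat = .rgba16Float
    descriptor.width = 1440
    descriptor.height = 1600
    descriptor.sampleCount = 4    // quality/performance tradeoff mentioned above
    descriptor.arrayLength = 2    // slice 0 = left eye, slice 1 = right eye
    descriptor.storageMode = .private
    descriptor.usage = .renderTarget
    let msaaEyeTexture = device.makeTexture(descriptor: descriptor)!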
So let's see how our pipeline will change. We can now replace those two 2D multisample textures with a single 2D multisample array one. The application can now render both eyes in a single render pass, and if it's using instancing, it can even do that in a single draw call.
That already looks great, but we still need to resolve those 2D multisample array texture slices into separate IOSurfaces before we pass them to the compositor. So let's focus on the way the application shares textures with the compositor.
Until now, for sharing textures, we used IOSurfaces. They allow sharing textures between different process spaces and different GPUs, but that flexibility comes at a price. IOSurfaces can only be used to share simple 2D textures; textures that are multisampled, store depth, or have multiple slices couldn't be shared.
That's why today we introduce shareable Metal textures, which allow your applications to share any type of Metal texture between process spaces, as long as these textures stay in the scope of a single GPU. This feature enables advanced use cases, for example, sharing the depth of your scene with the VR compositor. But, of course, it's not limited just to that.
Now, let's look at how those textures can be created. Because shareable textures now allow us to pass complex textures between processes, we will create a 2D array texture that we will pass to the VR compositor. As you can see, to do that, we use the new method, newSharedTextureWithDescriptor:. And while doing that, you need to remember to use the private storage mode, as this texture can only be accessed by the GPU on which it was created.
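In Swift, that method surfaces as makeSharedTexture(descriptor:); here is a rough sketch of the creation call, with illustrative dimensions and pixel format:

    import Metal

    // `device` is assumed to be the MTLDevice driving the headset.
    let descriptor = MTLTextureDescriptor()
    descriptor.textureType = .type2DArray
    descriptor.pixelFormat = .rgba8Unorm
    descriptor.width = 1440
    descriptor.height = 1600
    descriptor.arrayLength = 2        // slice 0 = left eye, slice 1 = right eye
    descriptor.storageMode = .private // required: only the creating GPU may access it
    descriptor.usage = [.renderTarget, .shaderRead]
    let sharedEyeTexture = device.makeSharedTexture(descriptor: descriptor)!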
Now, we see a code snippet showing how our VR application would send IOSurfaces to the VR compositor in the past. We will now go through this code snippet and see what changes need to be applied to switch from using IOSurfaces to shared Metal textures.
We don't need those two IOSurfaces anymore, and the two textures that were backed by them can now be replaced with a single shareable Metal texture of the 2D array type. We then assign this texture to both texture descriptors from the OpenVR SDK, and change its type from IOSurface to Metal. After doing these few changes, we can submit the images for the left and right eye to the compositor.
The compositor will now know that we've passed a shared Metal texture with an advanced layout instead of an IOSurface, and it will check whether its type is 2D array or 2D multisample array. If it is, then the compositor will automatically assume that the image for the left eye is stored in slice 0, and the image for the right eye is stored in slice 1, so your application doesn't need to do anything more about that.
And of course, sharing Metal textures between application and compositor is not the only use case for shareable Metal textures. Here we have a simple example of how you can pass a Metal texture between any two processes. We start in exactly the same way.
We create our shareable Metal texture, but now from this texture we create a special shared texture handle that can be passed between process spaces using a cross-process communication connection. Once this handle is passed to the other process, it can be used to recreate the texture object. But while doing that, you need to remember to recreate your texture object on exactly the same device it was originally created on in the other process space, as this texture cannot leave the scope of its GPU.
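A condensed Swift sketch of that flow; the transport between the two processes (XPC, for example) is an assumption and is only indicated in comments:

    import Metal

    // Process A: create a handle that can cross the process boundary.
    // `sharedEyeTexture` is the texture created in the earlier snippet.
    guard let handle = sharedEyeTexture.makeSharedTextureHandle() else {
        fatalError("texture was not created as shareable")
    }
    // ... send `handle` to the other process, e.g. over an XPC connection ...

    // Process B: recreate the texture object from the received handle.
    // The handle remembers its GPU, so we recreate on exactly that device.
    let recreatedTexture = handle.device.makeSharedTexture(handle: handle)!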
So now let's get back to our pipeline and see what will change. The application can now replace those separate IOSurfaces with one 2D array texture storing the images for both eyes. This allows a further optimization, as the original 2D multisample array texture can now be resolved in one pass as well, directly into the just-created shareable 2D array texture.
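Assuming layered rendering is used, the render pass setup might look roughly like this sketch, reusing the textures from the earlier snippets:

    import Metal

    // One layered render pass draws both eyes into the multisample array
    // texture and resolves straight into the shareable 2D array texture.
    let pass = MTLRenderPassDescriptor()
    pass.renderTargetArrayLength = 2    // layered rendering: one layer per eye
    pass.colorAttachments[0].texture = msaaEyeTexture
    pass.colorAttachments[0].resolveTexture = sharedEyeTexture
    pass.colorAttachments[0].loadAction = .clear
    pass.colorAttachments[0].storeAction = .multisampleResolve  // resolve as part of the pass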
But that's not everything. Let's look at the compositor. Once we have simplified the rendering passes on the application side, there is nothing preventing the compositor from benefiting from those new features as well. The compositor can now take those incoming 2D array textures and perform the work for both eyes in a single render pass, too. And as you can see, we've just simplified the whole pipeline.
So let's do a recap of what we've just learned. We've described two new Metal features, shareable Metal textures and the 2D multisample array texture type, and the way they can be used to further optimize your rendering pipeline. Both features will soon be supported in upcoming SteamVR runtime updates.
So now, let's focus on techniques that will allow your application to maximize its CPU and GPU utilization. We will divide this section into two subsections: advanced frame pacing, and reducing fill rate. We will start with frame pacing. In this subsection, we will analyze application frame pacing and how it can be optimized for VR.
Let's start with a simple, single-threaded application that is executing everything in a serial manner. Such an application will start its frame by calling WaitGetPoses, to receive poses and synchronize its execution to the frame rate of the headset. Both Vive and Vive Pro have a refresh rate of 90 frames per second, which means the application has only 11.1 milliseconds to process the whole frame. For comparison, a blink of an eye takes about 300 milliseconds, so in that time the application should render about 27 frames.
Once our application receives poses from WaitGetPoses, it can start the simulation of its virtual environment. When this simulation is complete, and the state of all objects is known, the application can continue with encoding the command buffer that will then be sent to the GPU for execution.
Once the GPU is done and an image for both eyes is rendered, it can be sent to the VR compositor for final post-processing, as we discussed a few slides before. After that, the frame is scanned out from memory to the panels in the headset. This transfer takes an additional frame, as all pixels need to be updated before the image can be presented. Once all pixels are updated, photons are emitted and the user can see the frame.
So as you can see, from the moment the application receives poses to the moment the image is really projected, it takes about 25 milliseconds. That is why the application receives poses that are already predicted into the future, to the moment when photons will be emitted, so that the rendered image matches the user's pose. And this cascade of events, overlapping with the previous and next frames, creates our frame pacing diagram.
As you can see, in the case of a single-threaded application, the GPU is idle most of the time. So let's see if we can do anything about that.
We are now switching to a multi-threaded application, which separates the simulation of its virtual environment from the encoding of operations for the GPU. Encoding of those operations now happens on a separate rendering thread. Because we've separated simulation from encoding, the simulation for our frame can happen in parallel with the previous frame's encoding of GPU operations. This means that encoding is now shifted earlier in time, and starts immediately after we receive the predicted poses.
Your application will now have more time to encode the GPU workload, and the GPU will have more time to process it. As a result, your application can have better visuals.
But there is one catch. Because simulation is now happening one frame in advance, it requires a separate set of predicted poses. This set is predicted 36 milliseconds into the future, so that it matches the set predicted for the rendering thread, and both match the moment when photons are emitted.
This diagram already looks good from the CPU side; as we can see, the application is nicely distributing its work across the CPU cores. But let's focus on the GPU. As you can see, our example application is encoding all the GPU operations for the whole frame into a single command buffer, so until this command buffer is complete, the GPU is waiting idle.
But it's important to notice that encoding GPU operations on the CPU takes much less time than processing those operations on the GPU. We can benefit from this fact and split our encoding into a few command buffers. Each command buffer will be encoded very fast, with just a few operations, and submitted to the GPU as soon as possible.
This way, our encoding now proceeds in parallel with the GPU already processing our frame, and as you can see, we've just extended the time when the GPU is doing its work, and as a result, further increased the amount of work that you can submit in a frame.
Now, let's get back to our diagram and see how it all looks together. As you can see, both CPU and GPU are now fully utilized. Such an application is already a very good example of a VR application, but there are still a few things we can do.
As you will notice, the rendering thread is still waiting to encode any kind of GPU work until it receives the predicted poses. But not all workloads in the frame require those poses. So let's analyze typical frame workloads in more detail.
Here, you can see a list of workloads that may be executed in each frame. Part of them happen in screen space or require knowledge of the pose for which the frame is rendered. We call such workloads pose-dependent ones. At the same time, there are workloads that are generic and can be executed immediately, without any knowledge of poses. We call those workloads pose-independent ones.
So far, our application was waiting for poses before encoding any kind of work for the GPU. But if we split those workloads in half, we can encode the pose-independent workloads immediately, and then wait for poses to continue with encoding the pose-dependent ones. In this slide, we've already separated the pose-independent workloads from the pose-dependent ones.
The pose-independent workload is now encoded in a separate command buffer, and is marked with a slightly darker shade than the pose-dependent workload following it. Because the pose-independent workload can be encoded immediately, we will do exactly that: we will encode it as soon as the previous frame's workload is encoded.
This gives the CPU more time to encode the GPU work, and what is even more important, it ensures that this GPU work is already waiting to be executed, so there will be no idle time on the GPU at all. As soon as the previous frame is finished, the GPU can start with the next one.
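Schematically, a frame on the rendering thread might then look like this Swift sketch; the three closures (including the pose type) are hypothetical stand-ins for your engine's own code:

    import Metal
    import simd

    // Pose-independent encoding happens before the blocking
    // WaitGetPoses call, pose-dependent encoding after it.
    func renderFrame(queue: MTLCommandQueue,
                     encodePoseIndependentWork: (MTLCommandBuffer) -> Void,
                     waitGetPoses: () -> [simd_float4x4],
                     encodePoseDependentWork: (MTLCommandBuffer, [simd_float4x4]) -> Void) {
        // Encode and submit pose-independent work immediately, so it is
        // already queued on the GPU when the previous frame finishes.
        let earlyCommands = queue.makeCommandBuffer()!
        encodePoseIndependentWork(earlyCommands)   // shadow maps, particles, ...
        earlyCommands.commit()

        // Only now block on the runtime for the predicted poses.
        let poses = waitGetPoses()

        // Encode and submit the pose-dependent remainder of the frame.
        let lateCommands = queue.makeCommandBuffer()!
        encodePoseDependentWork(lateCommands, poses)
        lateCommands.commit()
    }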
The last subsection is multi-GPU workload distribution. We can scale our workload across multiple GPUs. The current MacBook Pro has two GPUs on board, and while they have different performance characteristics, there is nothing preventing us from using both of them. Similarly, if an eGPU is connected, the application can use it for rendering to the headset while using the Mac's primary GPU to offload some work.
So here we've separated the pose-independent work and moved it to a secondary GPU. We could do that because it was already encoded much earlier in our frame, and now this pose-independent workload is executing in parallel with the pose-dependent workload of the previous frame. As a result, we've further increased the amount of GPU time that you have for your frame.
But by splitting this work across multiple GPUs, we get to the point where we need a way to synchronize those workloads with each other. So today we introduce new synchronization primitives to deal with exactly this situation.
MTLEvent can now be used to synchronize GPU work within the scope of a single GPU, across different Metal queues, and MTLSharedEvent extends this functionality by allowing you to synchronize workloads across different GPUs, and even across different processes.
So here we will go through a simple code example. We have our Mac, with an eGPU attached through a Thunderbolt 3 connection. This eGPU will be our primary GPU driving the headset, and we can use the GPU that is already in our Mac as a secondary, supporting GPU. We will use a shared event to synchronize the workloads of both GPUs. The event's initial value is zero, so it's important to start the synchronization counter from 1.
That's because if we waited on a just-initialized event, its counter value of zero would cause the wait to return immediately, so there would be no synchronization.
Our rendering thread now starts encoding work for the supporting GPU immediately. It encodes the pose-independent work that will happen on the supporting GPU, and once this work is complete, its results will be stored in that GPU's local memory. That's why we follow with encoding a blit operation that will transfer those results to system memory, which is visible to both GPUs. And once this transfer is complete, our supporting GPU can safely signal our shared event. This signal will tell the eGPU that it's now safe to take those results.
Our rendering thread has committed this first command buffer, and the supporting GPU is already processing its work. At the same time, we can start encoding the command buffer for the primary GPU that is driving the headset.
start by waiting for our shared
event to be sure that the data
is in system memory, and once
it's there, and the shared event
is signaled, we can perform a
brief operation that will
transfer this data through
Thunderbolt 3 connection, back
to our [inaudible] GPU and once
this transfer is complete, it's
safe to perform pause-dependent
work, so a second command buffer
will signal lockout event to let
pause-dependent work know that
it can start executing.
After encoding and submitting
those two command buffers,
rendering thread can continue as
usual, with waiting for pauses,
and later encoding
pause-dependent work.
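Condensed into a Swift sketch, the synchronization just described might look like this; the queue names are placeholders, and the actual work is only indicated in comments:

    import Metal

    // `supportingQueue` lives on the built-in GPU, `primaryQueue` on the
    // eGPU that drives the headset.
    let sharedEvent = supportingQueue.device.makeSharedEvent()!
    let frameValue: UInt64 = 1  // start at 1: waiting for the initial value 0 returns immediately

    // Supporting GPU: pose-independent work, then a blit of the results
    // into a system-memory buffer visible to both GPUs, then the signal.
    let supportCommands = supportingQueue.makeCommandBuffer()!
    // ... encode pose-independent work and the blit to system memory ...
    supportCommands.encodeSignalEvent(sharedEvent, value: frameValue)
    supportCommands.commit()

    // Primary GPU: wait for the signal, then blit the results over
    // Thunderbolt 3 into local memory before the pose-dependent work.
    let primaryCommands = primaryQueue.makeCommandBuffer()!
    primaryCommands.encodeWaitForEvent(sharedEvent, value: frameValue)
    // ... encode the blit from system memory, then pose-dependent work ...
    primaryCommands.commit()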
So now we have a mechanism to synchronize different workloads between different GPUs. But as you can see, our secondary GPU is still a little bit idle. That's because in this example we decided to push through it only pose-independent workloads that have dependencies on the pose-dependent ones.
But of course there are types of workloads that have no dependencies at all, and they can happen at lower frequencies than the refresh rate of the headset. One example of such a workload can be a physically accurate simulation of [inaudible], or anything else that takes a lot of time to update.
Such a workload can happen in the background, completely asynchronously from the rendered frames, and each time it's ready, its results will be sent to the primary GPU. It's marked here in gray to indicate that it's not related to any particular frame.
Of course, there are different GPUs with different performance characteristics, and they will have connections of different bandwidths. And your application will have different workloads in a single frame, with different relations between them. So you will need to design a way to distribute this workload on your own. But having said all that, it's important to start thinking about this GPU workload distribution now, as multi-GPU configurations are becoming common on Apple platforms.
So let's summarize everything that we've learned in this section. Multithread your application to take full benefit of all CPU cores, and split your command buffers to ensure that the GPU is never idle. When doing that, if possible, try to separate pose-independent from pose-dependent workloads, to be able to encode that work as soon as possible. And go even further by splitting workloads by frequency of update, so that if your application executes on a multi-GPU configuration, you can easily distribute them across those GPUs. And while doing that, ensure that you drive each GPU from a separate rendering thread, so that they all execute asynchronously.
Now, let's switch to reducing fill rate. Vive Pro introduces new challenges for VR application developers. To better understand the scale of the problem, we will compare the fill rates of different displays.
So, for example, an application rendering at the default scaling rate to a Vive headset produces 456 megapixels per second. For comparison, the most advanced game consoles, rendering to 4K HD TVs, have a fill rate of 475 megapixels per second. Those numbers are already so big that game developers use different tricks to reduce this fill rate.
Now, let's see how Vive Pro compares to those numbers. Vive Pro has a nominal fill rate of 775 megapixels per second, and if you add to that four-times multisample anti-aliasing or a bigger scaling rate, this number will grow even more.
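As a back-of-the-envelope check of such numbers: the fill rate is just the render-target width times height, times two eyes, times the refresh rate. The per-eye default render-target size used below (about 1.4 times the Vive panel per axis) is an assumption:

    // Fill rate in megapixels per second, for two eyes.
    func fillRate(width: Int, height: Int, refreshHz: Int) -> Double {
        Double(width * height * 2 * refreshHz) / 1_000_000
    }

    // Vive at an assumed default render target of 1512 x 1680 per eye:
    // 1512 * 1680 * 2 * 90 ≈ 457, matching the quoted figure up to rounding.
    print(fillRate(width: 1512, height: 1680, refreshHz: 90))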
That is why reducing fill rate is so important. There are multiple techniques out there, and new techniques are invented every day, so I encourage you to try them all. But today we will focus on just two of them, as they are the simplest to implement and bring nice performance gains.
We will start with clipping invisible pixels. Here, you can see the image rendered for the left eye. Due to the way the lenses work, about 20% of those pixels are lost after the compositor performs its distortion correction. On the right, you can see the image that will be displayed on the panel in the headset, before it goes through the lens.
So, the simplest way to reduce our fill rate is to prevent our application from rendering those pixels that won't be visible anyway, and you can do that easily by using the SteamVR stencil mask.
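A rough sketch of how such a mask could be applied with Metal's stencil test; the hidden-area mesh itself would come from the SteamVR API, and every name here is illustrative:

    import Metal

    // `device` is assumed to be an existing MTLDevice.
    // Pass 1: draw the hidden-area mesh, writing stencil = 1 wherever
    // pixels will never be visible through the lens.
    let maskDescriptor = MTLDepthStencilDescriptor()
    maskDescriptor.frontFaceStencil.depthStencilPassOperation = .replace
    let maskState = device.makeDepthStencilState(descriptor: maskDescriptor)!
    // encoder.setDepthStencilState(maskState)
    // encoder.setStencilReferenceValue(1)
    // ... draw the hidden-area mesh ...

    // Pass 2: draw the scene only where the stencil is still 0, so the
    // masked pixels are rejected before any shading happens.
    let sceneDescriptor = MTLDepthStencilDescriptor()
    sceneDescriptor.frontFaceStencil.stencilCompareFunction = .equal
    let sceneState = device.makeDepthStencilState(descriptor: sceneDescriptor)!
    // encoder.setDepthStencilState(sceneState)
    // encoder.setStencilReferenceValue(0)
    // ... draw the scene ...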
We've just saved 20% of our fill rate by applying this simple mask, reducing our Vive Pro fill rate to 620 megapixels per second.
Now, we will analyze the implications of this lens distortion correction in more detail. We will divide our field of view into nine sections. The central section covers a field of view of 80 degrees horizontally by 80 degrees vertically, and we have the surrounding sections on the edges and corners. We've color-tinted them to better visualize their contribution to the final image.
As you can see, the corners are almost completely invisible, and the edges contribute much less to the image than in the original one. In fact, if you saw this image in the headset, you wouldn't be able to look directly at the red sections; the only way to see them would be with your peripheral vision.
So this gives us a great hint: we can render those edge and corner sections at a reduced fill rate, as they are mostly invisible anyway. We render the central section as we did before, but then we render the vertical edges at half the width, the horizontal edges at half the height, and finally, the corner sections at one-fourth of the resolution. Once our expensive rendering pass is complete, we perform a cheap upscaling pass that stretches those regions back to the resolution at which they need to be submitted to the compositor. So you may be wondering how much we've gained by doing that.
In the case of an 80 by 80 degree central region, we reduced our fill rate all the way down to 491 megapixels per second. But remember that we just talked about clipping invisible pixels, so let's combine those two techniques together.
Clipping pixels combined with multi-resolution shading can reduce your fill rate even further, to 456 megapixels per second, and that is not a random number. In fact, that's the default fill rate of the Vive headset. So by just using those two optimization techniques, your application can render to Vive Pro at a much higher resolution using exactly the same GPU as it did when rendering to the Vive headset.
Of course, you can use those techniques when rendering to Vive as well, which will allow you to push the visuals of your application even further and make it prettier.
There is one caveat here. Multi-resolution shading requires a few render passes, so it will increase your geometry workload, but you can easily mitigate that by reducing your central region by a few degrees. Here, by reducing our central region by 10 degrees, we've brought the fill rate all the way down to 382 megapixels per second.
And if your geometry workload is really high, you can go even further and experiment with smaller regions, which will reduce the fill rate even more. In the case of a 55 by 55 degree central region, 80% of typical eye movement will still be inside this region, but we've reduced our fill rate by more than half, to 360 megapixels per second.
Of course, there are different ways to implement multi-resolution shading, and you will get different performance gains from them. So I encourage you to experiment with this technique and find what works best for you.
So let's summarize everything that we've learned during this session. We've announced plug and play support for the Vive Pro headset, and introduced new Metal 2 features that allow you to develop even more advanced VR applications. And I encourage you to take advantage of multi-GPU configurations, as they are becoming common on Apple platforms.
You can learn more about this session from this link, and I would like to invite all of you to meet with me and my colleagues during the Metal for VR Lab, which takes place today at 12:00 p.m. in Technology Lab 6.
Thank you very much.
[ Applause ]