Transcript
[ Music ]
[ Applause ]
>> Welcome.
Last year, we introduced Metal
2, which includes new ways for
the GPU to drive the rendering
pipeline.
This year, we're introducing
even more new and exciting
features to solve common game
development challenges.
My name is Brian Ross, and
together with my colleague,
Michael Imbrogno, we'll explore
new ways to make your
applications better, faster, and
more efficient.
But first, I want to talk about
some of the challenges that I'm
trying to help you solve.
Your games are using an
ever-increasing number of
objects, materials, and lights.
Games like Inside, for example,
use a great deal of special
effects to capture and support
the mood of the game.
Making games like this that
truly draw you in is challenging
because it can require efficient
GPU utilization.
At the same time, games are
requiring more and more CPU
cycles for exciting gameplay.
For example, games like Tomb Raider feature breathtaking vistas and highly detailed terrain but, at the same time, are also managing complex physics simulations and AI.
This is challenging because it
leaves less CPU time for
rendering.
And finally, developers are taking AAA titles like Fortnite from Epic Games and porting them to iOS, so you can run a console-level game in the palm of your hand.
This is a truly amazing feat,
but this also leaves us with
even more challenges, like how
to balance battery life with a
great frame rate.
So now, let's look at how Metal
can help you solve these
challenges.
Today, I'm going to show you how
to harness parallelism on both
the CPU and the GPU to draw more
complex scenes.
We'll also talk about ways to
maximize performance using more
explicit control with heaps,
fences, and events.
And then, I'm going to show you
how to build GPU-driven
pipelines using our latest
features, argument buffers and
indirect command buffers.
Now, while all these API
improvements are key, it's
equally important to understand
the underlying hardware they run
on.
So in the next section, my colleague Michael is going to show you how to optimize for the A11 to improve performance and extend playtime.
And finally, I'm really excited
that we're going to be joined by
Nick Penwarden from Epic Games.
He is going to show us how
they've used Metal to bring
console-level games to our
devices.
So let's get started.
Harnessing both CPU and GPU
parallelism is probably the most
important and easiest
optimization you can make.
Building a command stream on a
single thread is not sufficient
anymore.
The latest iPhone has 6 cores,
and the iMac Pro can have up to
18.
So scalable, multithreaded
architecture is key to great
performance on all of our
devices.
Metal is designed for
multithreading.
I'm going to show you 2 ways to parallelize on the CPU, and then I'm going to close this section by showing you how Metal can automatically parallelize for you on the GPU.
So let's set up an example of a
typical game frame.
With classic single-threaded rendering, you build GPU commands, in GPU execution order, into a single command buffer. Typically, you then have to fit this into some small fraction of your frame time.
And, of course, you're going to
have maximum latency because the
entire command buffer must be
encoded before the GPU can
consume it.
Obviously, there's a better way to do this, so we're going to start by building in parallelism on the CPU.
Render and compute passes are the basic granularity of multithreading in Metal.
All you need to do is create
multiple command buffers and
start encoding each into
separate passes on a separate
thread.
You can encode them in any order
you wish.
The final order of execution is
determined by the order they're
added to the command queue.
So now, let's take a look at how
easy this is to do in your code.
So you can see this is not a lot
of code.
The first thing that you're
going to do is create any number
of command buffers from the
queue.
Next, we're going to define the
GPU execution order upfront by
using the enqueue interface.
This is great because you can do
all this without waiting for the
command buffer to be encoded
first.
And finally, we're going to create separate threads and call our encoding functions for each.
And that's it.
That's all you have to do.
It's really fast, it's really
efficient, and it's really
simple.
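For reference, here is a minimal Swift sketch of that pattern; the encodeShadows and encodeGBuffer closures are placeholders for your own per-pass encoding functions, not the session's actual slide code.

    import Metal
    import Dispatch

    func encodeFrame(queue: MTLCommandQueue,
                     encodeShadows: @escaping (MTLCommandBuffer) -> Void,
                     encodeGBuffer: @escaping (MTLCommandBuffer) -> Void) {
        // Create any number of command buffers from the queue.
        guard let shadowCommands = queue.makeCommandBuffer(),
              let gBufferCommands = queue.makeCommandBuffer() else { return }

        // enqueue() fixes the GPU execution order upfront, without
        // waiting for either command buffer to be encoded first.
        shadowCommands.enqueue()
        gBufferCommands.enqueue()

        // Encode each pass on a separate thread, in any order.
        DispatchQueue.global().async {
            encodeShadows(shadowCommands)
            shadowCommands.commit()
        }
        DispatchQueue.global().async {
            encodeGBuffer(gBufferCommands)
            gBufferCommands.commit()
        }
    }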
So now, let's go back to the
previous diagram and look at
another example.
So as you can see, we did a
pretty good job parallelizing
these on the CPU, but what if
you have 1 really long rendering
pass?
So in cases like this, Metal has
a dedicated parallel encoder
that allows you to encode on
multiple threads without
explicitly dividing up the
render pass or the command
buffer.
So now, let's look at how simple
this is in your code.
It looks a lot like the previous
example.
The first thing you're going to
do is create a parallel encoder.
And from that, you create any
number of subordinate encoders.
And it's important to realize
that this is actually where you
define the GPU execution order.
Next, we're going to create
separate threads and encode each
of our G-buffer functions
separately.
And finally, we set up a
notification so that when the
threads are complete, we call
end encoding on the parallel
encoder.
And that is it.
That's all you have to do to
parallelize a render pass.
It's really fast, and it's
really easy.
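Here's a hedged Swift sketch of that flow; the pass descriptor, the chunk count, and the per-chunk encode closure are placeholders for your own setup.

    import Metal
    import Dispatch

    func encodeLongPass(commandBuffer: MTLCommandBuffer,
                        passDescriptor: MTLRenderPassDescriptor,
                        encodeChunk: @escaping (Int, MTLRenderCommandEncoder) -> Void) {
        // Create the parallel encoder for the whole render pass.
        guard let parallelEncoder =
            commandBuffer.makeParallelRenderCommandEncoder(descriptor: passDescriptor)
        else { return }

        // Subordinate encoders: their creation order defines the GPU
        // execution order, regardless of when each thread finishes.
        let encoders = (0..<3).compactMap { _ in
            parallelEncoder.makeRenderCommandEncoder()
        }

        let group = DispatchGroup()
        for (chunk, encoder) in encoders.enumerated() {
            DispatchQueue.global().async(group: group) {
                encodeChunk(chunk, encoder)   // one slice of the G-buffer draws
                encoder.endEncoding()
            }
        }

        // When the threads are complete, end encoding on the parallel encoder.
        group.notify(queue: .main) {
            parallelEncoder.endEncoding()
            commandBuffer.commit()
        }
    }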
So now that I've shown you 2
ways to parallelize on the CPU,
now let's see how Metal can
parallelize for you
automatically on the GPU.
So let's look at the frame
example from the beginning and
see how the GPU executes the
frame.
Based on the capabilities of
your platform, Metal can extract
parallelism automatically by
analyzing your data
dependencies.
Let's look at just 2 of these
dependencies.
So in this example, the particle
simulation writes data, which is
later used by the effects pass
to render the particles.
Similarly, the G-buffer pass
generates geometry, which is
later used by the deferred
shading pass to compute material
lighting.
All this information allows
Metal to automatically and
cheaply identify entire passes
that can run in parallel, such
as using async compute.
So you can achieve parallelism
and async compute for free on
the GPU.
It's free because Metal doesn't require anything special on your part.
So I think we all love getting free optimizations on the GPU, but sometimes, as a developer, you may need to dive a little bit deeper.
For the most critical parts of
your code, Metal allows you to
incrementally dive deeper with
more control.
For example, you could disable
automatic reference counting and
do it yourself to save on CPU
time.
You could also use Metal heaps
to tightly control allocations
really cheaply.
And Metal heaps are complemented
by fences and events, which
allow you to explicitly control
the GPU parallelism.
Many of your games are using a
lot of resources, which can be
costly.
Allocations require a round trip
to the OS, which has to map and
initialize memory on each
request.
If your game uses temporary
render targets, these
allocations can happen in the
middle of your frame, causing
stutters.
Resource heaps are a great
solution to this problem.
Heaps let you allocate large slabs of memory from the system upfront, and you can later add and remove textures and buffers within those slabs without any costly round trip.
So starting from a case where
you allocate 3 normal textures,
Metal typically places these in
3 separate allocations, but
putting these all instead into a
single heap lets you perform all
memory allocation upfront at
heap creation time.
So then, the act of creating
textures becomes extremely
cheap.
Also, heaps can sometimes let us
use the space more efficiently
by packing allocations closer
together.
So with a traditional model, you would deallocate textures, releasing pages back to the system, and then reallocate, paying the full allocation cost for a new set of textures all over again.
With heaps, you deallocate and
reallocate without any costly
round trip to the OS.
Finally, heaps also let you
alias different memory resources
with each other.
This is really helpful if your
game frame has a lot of
temporary render targets.
There's no reason for these to
occupy a different memory all
the time, so you could alias and
save hundreds of megabytes.
Now, the faster allocations and aliasing are great, but they're not entirely free when it comes to dependency tracking.
Let's return to our frame
example for a better
explanation.
With heaps, Metal no longer sees
individual resources, so
therefore, it can't
automatically identify the read
and write dependencies between
passes, such as the G-buffer and
deferred shading pass in our
example.
So you have to use fences to
explicitly signal which pass
produces data and which pass
consumes the data.
So in this example, the G-buffer
updates the fence, and the
deferred shading waits for it.
So now, let's take a look at how
we could apply these basic
concepts in your code.
So the first thing that we're
going to do is we're going to
apply this to our G-buffer and
deferred shading example.
First, we're going to allocate
our temporary render target from
the heap.
This looks just like what you're
probably already doing today
when you allocate a texture.
Next, we're going to render into
that temporary render target.
And finally, update the fence
after the fragment stage
completes.
This will ensure that all the
data is produced before the next
pass consumes it.
So now, let's switch gears over
to the deferred shading pass.
Now, we're going to use this
temporary render target to
compute material lighting.
Then, we're going to wait for
the fence to make sure that it's
been produced before we consume
it.
And finally, we mark it as aliasable so that we can reuse this memory for other operations, saving hundreds of megabytes.
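Here's what that producer/consumer pattern might look like in Swift; the sizes, formats, and draw encoding are placeholder details.

    import Metal

    func renderWithHeap(device: MTLDevice, commandBuffer: MTLCommandBuffer,
                        gBufferPass: MTLRenderPassDescriptor,
                        shadingPass: MTLRenderPassDescriptor) {
        // Allocate the big slab once, upfront (typically at load time).
        let heapDescriptor = MTLHeapDescriptor()
        heapDescriptor.size = 64 * 1024 * 1024
        heapDescriptor.storageMode = .private
        guard let heap = device.makeHeap(descriptor: heapDescriptor),
              let fence = device.makeFence() else { return }

        // Creating a texture from the heap is cheap: no OS round trip.
        let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(
            pixelFormat: .rgba16Float, width: 1024, height: 1024, mipmapped: false)
        textureDescriptor.usage = [.renderTarget, .shaderRead]
        textureDescriptor.storageMode = .private
        guard let temporaryRT = heap.makeTexture(descriptor: textureDescriptor)
        else { return }

        // G-buffer pass: produce the data, then update the fence after
        // the fragment stage completes.
        gBufferPass.colorAttachments[0].texture = temporaryRT
        if let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: gBufferPass) {
            // ...encode the G-buffer draws...
            encoder.updateFence(fence, after: .fragment)
            encoder.endEncoding()
        }

        // Deferred shading pass: wait for the fence before consuming it.
        if let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: shadingPass) {
            encoder.waitForFence(fence, before: .fragment)
            encoder.setFragmentTexture(temporaryRT, index: 0)
            // ...encode the lighting draws...
            encoder.endEncoding()
        }

        // Done with the temporary target: mark it aliasable so the heap
        // can reuse its memory for other allocations.
        temporaryRT.makeAliasable()
    }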
So now that we've talked about how to parallelize and how to optimize performance with explicit control, what if you want to put the GPU more in the driver's seat?
So let's talk about GPU-driven
pipelines.
Your games are moving more and
more of the decision logic onto
the GPU, especially when it
comes to processing extremely
large data sets or scene graphs
with thousands of objects.
With Metal 2, we've made another
really important step forward in
our focus on GPU-driven
pipelines.
Last year, we introduced argument buffers, allowing you to further decrease CPU usage and move a large portion of the workload to the GPU.
This year, we're also
introducing indirect command
buffers, and this will allow you
to move entire rendering loops
onto the GPU.
So first, let's briefly recap
the argument buffer feature.
An argument buffer is simply a structure, represented like this. Previously, these would have held only constants, but with argument buffers, we can also include textures and samplers, which before would have needed separate shader bind points. And since this is just a structure, you have all the features of the Metal shading language at your disposal, so it's really flexible and really easy.
You could do things like add
substructures, or arrays, or
even pointers to other argument
buffers.
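As a rough illustration, such a structure might look like this in the Metal shading language; the member names here are made up for the example.

    #include <metal_stdlib>
    using namespace metal;

    struct Material {
        texture2d<float> diffuse;        // textures live next to...
        texture2d<float> normalMap;
        sampler          linearSampler;  // ...samplers and...
        float            roughness;      // ...plain constants.
        device Material *detailLayer;    // pointer to another argument buffer
    };

    struct VertexOut {
        float4 position [[position]];
        float2 uv;
    };

    fragment float4 shade(VertexOut in [[stage_in]],
                          device Material &material [[buffer(0)]])
    {
        return material.diffuse.sample(material.linearSampler, in.uv);
    }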
You can modify textures and samplers, creating new materials on the GPU without any CPU involvement.
Or you can make giant arrays of
materials and use a
single-instance draw call to
render many objects with unique
properties.
So argument buffers allow you to
offload the material management
onto the GPU and save valuable
CPU resources.
But this year, we're extending it a little bit more.
We started by adding 2 new
argument types.
This includes pipeline states
and command buffers.
Now, these are used to support
our brand-new indirect command
buffer feature.
With indirect command buffers,
you could encode entire scenes
on the GPU.
On the CPU, you only have a few
threads available for rendering,
but on the GPU, you have
hundreds or even thousands of
threads all running at the same
time.
With indirect command buffers,
you can fully utilize this
massively parallel nature.
Also, indirect command buffers
are completely reusable, so you
could spend the encoding cost
once and reuse it again and
again.
And since an ICB is a directly
accessible buffer, you can
modify its contents at any time,
like change the shader type, or
the camera matrix, or anything
else that you might need to
change.
And of course, by moving your
rendering to the GPU, you remove
expensive CPU and GPU
synchronization points that are
normally required to hand over
the data.
So let's take a look at an
example.
Here is a typical game frame.
The usual rendering loop has a
few common stages.
First, you walk your scene graph
to determine which objects you
need to render.
You probably use frustum culling
to determine what objects are
within the view frustum.
Some of you might use a more
complex solution that accounts
for occlusion.
Also, level of detail selection
naturally occurs at this stage.
Only once you encode and submit
your command buffer will the GPU
start to consume it.
More and more games are moving
the process of determining
visible objects onto the GPU.
GPUs are just better at handling
the growing scene complexity of
the latest games.
Unfortunately, this creates a
sync point in your frame.
It makes it so that the CPU cannot encode draw calls until the GPU produces the data.
It's extremely difficult to get
this right without wasting
valuable CPU and GPU time on
synchronization.
With ICBs, the benefits are
immense.
Not only can you move the final
bits of processing to the GPU,
you naturally remove any sync
points required to hand over the
data and you improve your CPU
and GPU utilization.
At the same time, you reduce
your CPU overhead to a constant.
So let's look at the encoding in
a little bit more detail.
I'm going to start by expanding
on our previous example and look
at the massively parallel nature
that only the GPU can provide.
We begin with the list of visible objects and LODs coming from our culling dispatch.
Also, keep in mind that we're
utilizing the power of argument
buffers here.
So in this case, each element
has a pointer to the actual
properties, so we don't need to
store everything in the same
buffer.
This solution saves a lot of memory and a lot of time, because we only build a very small list of references.
The actual argument buffer
contains several levels of
detail for geometry.
This includes position, vertex
buffer, index buffer, and a
material argument buffer.
For rendering, we only select 1
of these LODs per object.
The actual encoding happens in a
compute kernel, and we encode
into an indirect command buffer.
Each thread of the compute
kernel encodes a single draw
call.
So we read the object with all
of its properties, and we encode
these into the ICB.
There's a couple of details
worth noting.
You can think of an ICB as an
array of render commands.
A render command consists of a
pipeline object with shaders,
any number of buffers, and a
draw call.
Next, an ICB is built for
parallelism, so you could encode
concurrently and out of order.
And lastly, we kept the API very
simple, so it's just like what
you might be doing today on the
CPU.
Another thing -- each command
could have different properties
and even draw types.
So this is a really, really
significant step forward from
all the flavors of indirect
rendering that many of you may
have seen elsewhere.
Now, let's take a look at how we
can do this in your code.
So this is how easy it is to
encode a draw call.
The first thing you're going to
do is select the render command
by index using your thread ID.
Then, we're going to set the
properties.
So in this example, we're
setting a shader with a pipeline
state and then a separate buffer
for the geometry and material.
And finally, this is how you
encode a draw call.
Thanks to the Metal shading
language, encoding on the GPU is
really, really simple.
Even though this is in a compute
shader, this looks just like
what you're already doing on the
CPU today.
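Here's a hedged sketch of such a kernel in the Metal shading language; the ObjectLOD layout is illustrative, and the indirect command buffer is passed in through an argument buffer.

    #include <metal_stdlib>
    using namespace metal;

    struct Material { float4 baseColor; };   // illustrative

    struct ObjectLOD {
        render_pipeline_state pipeline;      // new argument type this year
        device float         *vertices;
        device ushort        *indices;
        device Material      *material;
        uint                  indexCount;
    };

    struct ICBContainer {
        command_buffer icb;                  // the indirect command buffer
    };

    kernel void encode_draws(uint objectID [[thread_position_in_grid]],
                             device ObjectLOD    *visibleObjects [[buffer(0)]],
                             device ICBContainer &container      [[buffer(1)]])
    {
        device ObjectLOD &object = visibleObjects[objectID];

        // Select the render command by index, using the thread ID.
        render_command cmd(container.icb, objectID);

        // Set the properties: a pipeline state for the shaders, then
        // separate buffers for the geometry and the material.
        cmd.set_render_pipeline_state(object.pipeline);
        cmd.set_vertex_buffer(object.vertices, 0);
        cmd.set_fragment_buffer(object.material, 0);

        // And finally, encode the draw call itself.
        cmd.draw_indexed_primitives(primitive_type::triangle,
                                    object.indexCount, object.indices,
                                    1,    // instance count
                                    0,    // base vertex
                                    0);   // base instance
    }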
Now, let's look at 1 more
sample.
Here are some of the basic
things you need to do to create,
encode, and execute an ICB.
To create it, you first fill out a descriptor. The descriptor contains things like draw types, inheritance properties, and per-stage bind counts, and it describes the way the indirect command buffer will behave.
When it's time to encode the ICB, you simply create a compute encoder and call dispatch, just like what you've been doing already.
Once the ICB is encoded, you can
optionally decide if you want to
optimize it.
When you optimize it, you remove
all the redundant state, and the
end result is a lean and
highly-efficient set of GPU
commands.
Now, once the ICB is encoded and
optimized, it's time to schedule
it for execution.
You'll notice here that you can actually specify the exact range of commands to execute. Also, in this example, we use an indirect buffer, which itself can be encoded by the GPU.
So once the ICB is encoded, it
could be reused again and again,
and the overhead is completely
negligible.
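A host-side sketch of those steps in Swift might look like this; the counts, the culling pipeline, and the range buffer are placeholders.

    import Metal

    func buildAndRunICB(device: MTLDevice,
                        commandBuffer: MTLCommandBuffer,
                        cullPipeline: MTLComputePipelineState,
                        renderPass: MTLRenderPassDescriptor,
                        rangeBuffer: MTLBuffer) {
        // 1. Fill out a descriptor: draw types, inheritance properties,
        //    and per-stage bind counts.
        let descriptor = MTLIndirectCommandBufferDescriptor()
        descriptor.commandTypes = .drawIndexed
        descriptor.inheritBuffers = false
        descriptor.maxVertexBufferBindCount = 4
        descriptor.maxFragmentBufferBindCount = 4

        guard let icb = device.makeIndirectCommandBuffer(descriptor: descriptor,
                                                         maxCommandCount: 4096,
                                                         options: []) else { return }

        // 2. Encode the ICB from a compute kernel, just like any dispatch.
        if let compute = commandBuffer.makeComputeCommandEncoder() {
            compute.setComputePipelineState(cullPipeline)
            // ...bind the scene data and the argument buffer holding the ICB...
            compute.useResource(icb, usage: .write)
            compute.dispatchThreadgroups(
                MTLSize(width: 16, height: 1, depth: 1),
                threadsPerThreadgroup: MTLSize(width: 256, height: 1, depth: 1))
            compute.endEncoding()
        }

        // 3. Optionally optimize: strip redundant state so the end result
        //    is a lean, highly efficient set of GPU commands.
        if let blit = commandBuffer.makeBlitCommandEncoder() {
            blit.optimizeIndirectCommandBuffer(icb, range: 0..<4096)
            blit.endEncoding()
        }

        // 4. Execute, with the command range supplied by a GPU-written
        //    indirect buffer.
        if let render = commandBuffer.makeRenderCommandEncoder(descriptor: renderPass) {
            render.executeCommandsInBuffer(icb, indirectBuffer: rangeBuffer,
                                           indirectBufferOffset: 0)
            render.endEncoding()
        }
        commandBuffer.commit()
    }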
I'm really excited about this, so we actually went ahead and put together a sample so you can take a look.
So here you could see a number
of school buses in the middle of
a city.
Each bus is composed of 500,000
polygons and 2000 individual
parts.
Each part requires a separate
draw call, its own material
argument buffer, index buffer,
and vertex buffer.
As you could imagine, this would
be a lot of API calls on the
CPU, but we are using indirect
command buffers here, so
everything is being encoded on
the GPU.
We're also selecting the
appropriate level of detail, and
therefore, we're able to render
multiple objects without
increasing the CPU or GPU cost.
So on the left, you could see a
view of the regular camera.
And on the right, we've zoomed
in to a single bus, so you could
see the level of detail actually
changing.
ICBs enabled us to introduce
another really incredible
optimization.
We're able to split the geometry
into chunks of a few hundred
triangles and analyze those
chunks in a separate compute
kernel.
You could see the chunks in
different colors on the screen.
Each thread of the kernel
determines whether triangles are
facing away from the camera or
if they're obscured by other
objects or geometry in the
scene.
This is all really, really fast because we perform the calculation per chunk rather than per individual triangle.
We then tell the GPU to only
render the chunks that are
actually visible.
And again, let's see the
side-by-side view.
The left side is your camera
view, and the right side is
another view of the bus.
You could see the red and
pinkish tint there.
That is what our compute shaders
determined is invisible.
We never actually send this work
to the GPU, so it saves us 50%
or more of the geometry
rendering cost.
Here's 1 last view showing how much this technique can save you.
So notice on the right, many of
the buses and ambulances are
actually invisible.
This is really amazing.
I love this.
So please take a chance to
explore the code, and I hope
I'll see this technology in some
of your games in the future.
I think if utilized, ICBs can
really push your games to the
next level.
So now, I'm pleased to introduce
Michael, who will show you how
to optimize for the A11, improve
performance, and extend
playtime.
Thank you very much.
[ Applause ]
>> Thanks, Brian.
So everything Brian just showed you is available for iOS, tvOS, and macOS.
Next, I'm going to dive into
some of the new Metal 2 features
for Apple's latest GPU, the A11
Bionic, designed to help you
maximize your game's performance
and extend your playtime by
reducing system memory bandwidth
and reducing power consumption.
So Apple-designed GPUs have a
tile-based deferred rendering
architecture designed for both
high performance and low power.
This architecture takes
advantage of a high bandwidth,
low-latency tile memory that
eliminates overdraw and
unnecessary memory traffic.
Now, Metal is designed to take advantage of the TBDR architecture automatically. Within each render pass, load and store actions make explicit how render pass attachments move in and out of tile memory.
But the A11 GPU takes the TBDR
architecture even further.
We added new capabilities to our
tile memory and added an
entirely new programmable stage.
This opens up new optimization
opportunities critical to
advanced rendering techniques,
such as deferred shading,
order-independent transparency,
tiled forward shading, and
particle rendering.
So let's start by taking a look
at the architecture of the A11
GPU.
All right.
So on the left, we have a block
representation of the A11 GPU.
And on the right, we have system
memory.
Now, the A11 GPU first processes
all the geometry of a render
pass in the vertex stage.
It transforms and bins your geometry into screen-aligned, tiled vertex buffers. These tiled vertex buffers are then stored in system memory. Now, each tiled vertex buffer is then processed entirely on chip as part of the fragment stage.
This tiled architecture enables
2 major optimizations that your
games get for free.
First, the GPU rasterizes all primitives in a tile before shading any pixels, using fast, on-chip memory.
This eliminates overdraw, which
improves performance and reduces
power.
Second, a larger, more flexible
tile memory is used to store the
shaded fragments.
Blending operations are fast because all the data is stored on chip, next to the shading cores.
Now, tile memory is written to
system memory only once for each
tile after all fragments have
been shaded.
This reduces bandwidth, which
also improves your performance
and reduces your power.
Now, these optimizations happen
underneath the hood.
You get them just by using Metal
on iOS.
But Metal also lets you optimize
rendering techniques by taking
explicit control of the A11's
tile memory.
Now, during the development of
the A11 GPU, the hardware and
software teams at Apple analyzed
a number of important modern
rendering techniques.
We noticed many common themes, and we found that explicit control of our tile memory accelerated all of them.
We then developed the hardware
and software features together
around this idea of explicit
control.
So let's talk about these
features.
So programmable blending lets
you write custom blend
operations in your shaders.
It's also a powerful tool you
can use to merge render passes,
and it's actually made available
across all iOS GPUs.
Imageblocks are new for A11.
They let you maximize your use
of tile memory by controlling
pixel layouts directly in the
shading language.
And tile shading is our
brand-new programmable stage
designed for techniques that
require mixing graphics and
compute processing.
Persistent threadgroup memory is
an important tool for combining
render and compute that allows
you to communicate across both
draws and dispatches.
And multi-sample color coverage
control lets you perform resolve
operations directly in tile
memory using tile shaders.
So I'm going to talk to you
about all these features, so
let's start with programmable
blending.
With programmable blending, your fragment shader has read and write access to the pixels in tile memory.
This lets you write custom
blending operations.
But programmable blending also
lets you eliminate system memory
bandwidth by combining multiple
render passes that read and
write the same attachments.
Now, deferred shading is a
particularly good fit for
programmable blending, so let's
take a closer look at that.
So deferred shading is a
many-light technique
traditionally implemented using
2 passes.
In the first pass, multiple
attachments are filled with
geometry attributes visible at
each pixel, such as normal,
albedo, and roughness.
And in the second pass,
fragments are shaded by sampling
those G-buffer attachments.
Now, the G-buffers are stored in system memory before being read again in the lighting pass, and this round trip from tile memory to system memory and back again can really bottleneck your game, because the G-buffer traffic consumes a large amount of bandwidth.
Now, programmable blending
instead lets you skip that round
trip to memory by reading the
current pixel's data directly
from tile memory.
This also means that we no
longer need 2 passes.
Our G-buffer fill and lighting
steps are now encoded and
executed in a single render
pass.
It also means that we no longer
need a copy of the G-buffer
attachments in system memory.
And with Metal's memoryless render target feature, saving that memory is really, really simple. You just create a texture with the memoryless flag set, and Metal will only let you use it as an attachment, without load or store actions.
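In Swift, that flag is a storage mode; a minimal sketch, with the format and size as placeholders:

    import Metal

    func makeGBufferAttachment(device: MTLDevice) -> MTLTexture? {
        let descriptor = MTLTextureDescriptor.texture2DDescriptor(
            pixelFormat: .rgba16Float, width: 1920, height: 1080,
            mipmapped: false)
        descriptor.usage = .renderTarget
        descriptor.storageMode = .memoryless   // lives only in tile memory
        return device.makeTexture(descriptor: descriptor)
        // Attach with loadAction .clear or .dontCare, storeAction .dontCare.
    }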
So now, let's take a look at how
easy it is to adopt programmable
blending in your shaders.
Okay, so here's what the
fragment shader of your lighting
pass would look like with
programmable blending.
Programmable blending is enabled
when you both read and write
your attachments.
And in this example, we see that
the G-buffer attachments are
both inputs and outputs to our
functions.
We first calculate our lighting
using our G-buffer properties.
As you can see here, we're
reading our attachments and
we're not sampling them as
textures.
We then accumulate our lighting
result back into the G-buffer,
and, in this step, we're both
reading and writing our
accumulation attachments.
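Here's a simplified sketch of such a shader in the Metal shading language; the G-buffer layout and the lighting math are illustrative, not the session's actual code.

    #include <metal_stdlib>
    using namespace metal;

    struct Light { float3 direction; half3 color; };

    struct GBuffer {
        half4 lighting [[color(0)]];   // accumulation attachment
        half4 albedo   [[color(1)]];
        half4 normal   [[color(2)]];
    };

    static half3 compute_lighting(constant Light &light,
                                  half3 albedo, half3 normal)
    {
        half ndotl = max(dot(normal, half3(light.direction)), 0.0h);
        return albedo * light.color * ndotl;
    }

    // The G-buffer struct is both the input and the output: reading and
    // writing the attachments enables programmable blending.
    fragment GBuffer light_pixel(GBuffer gBuffer,
                                 constant Light &light [[buffer(0)]])
    {
        // Read the attachments directly; no texture sampling involved.
        half3 result = compute_lighting(light,
                                        gBuffer.albedo.rgb,
                                        gBuffer.normal.xyz);
        // Accumulate the lighting result back into the G-buffer.
        gBuffer.lighting.rgb += result;
        return gBuffer;
    }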
So that's it. Programmable blending is really that easy, and you should use it whenever you have multiple render passes that read and write the same attachments.
So now, let's talk about
imageblocks, which allow you to
merge render passes in even more
circumstances.
Imageblocks give you full
control of your data in tile
memory.
Instead of describing pixels as
arrays of render pass
attachments in the Metal API,
imageblocks let you declare your
pixel layouts directly in the
shading language as structs.
It adds new packed data types to the shading language that match the texture formats you already use, and these types are transparently packed and unpacked when accessed in the shader. In fact, you can also use these new packed data types in your vertex buffers and constant buffers to more tightly pack all of your data.
Imageblocks also let you
describe more complex per-pixel
data structures.
You can use arrays, nested
structs, or combinations
thereof.
It all just works.
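For illustration, a sketch of what an explicit imageblock might look like; the fields and the tiny tile function are made-up examples.

    #include <metal_stdlib>
    using namespace metal;

    struct GBufferPixel {
        rgba8unorm<half4>   albedo;    // stored packed, accessed as half4
        rgb10a2unorm<half4> normal;    // transparently packed and unpacked
        half4               lighting;
        float               depth;
    };

    // A tile function can touch any pixel of the imageblock in its tile.
    kernel void reset_tile(imageblock<GBufferPixel, imageblock_layout_explicit> ib,
                           ushort2 tid [[thread_position_in_threadgroup]])
    {
        GBufferPixel pixel = ib.read(tid);
        pixel.lighting = half4(0.0h);
        pixel.depth    = 1.0f;
        ib.write(pixel, tid);
    }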
Now, direct control of your
pixel layout means that you can
now change the layout within a
pass.
This lets you combine render
passes to eliminate system
memory bandwidth in ways that
just weren't possible with
programmable blending alone.
Let's take a look at an example.
So in our previous example, we
used programmable blending to
implement single-pass deferred
shading.
You can also implement
single-pass deferred shading
using imageblocks.
Imageblocks only exist in tile
memory, so there's no render
pass attachments to deal with.
Not only is this a more natural
way to express the algorithm,
but now you're free to reuse the
tile memory once you're finished
reading the G-buffer after your
lighting.
So let's go ahead and do that.
Let's reuse the tile memory to
add an order-independent
transparency technique called
multi-layer alpha blending.
So multi-layer alpha blending,
or MLAB, maintains a per-pixel,
fixed-size array of translucent
fragments.
Each incoming fragment is sorted
by depth into the array.
If a fragment's depth lies beyond the last element of the array, the trailing elements are merged, so it's an approximate technique.
Now, sorting the MLAB array is
really fast because it lives in
tile memory.
Doing the same off chip would be
really expensive because of the
extra bandwidth and
synchronization overhead.
Now, the A11 actually doubles the maximum supported pixel size over the previous generation, but that's still not enough to contain both the G-buffer and MLAB data structures simultaneously.
Fortunately, you don't need both
at the same time.
Imageblocks let you change your
pixel layouts inside the render
pass to match your current
needs.
So changing pixel layouts
actually requires tile shading,
so let's talk about that next.
So tile shading is the new
programmable stage that provides
compute capabilities directly in
the render pass.
This stage is going to execute a
configurable threadgroup for
each tile.
For example, you can launch a
single thread per tile, or you
can launch a thread per pixel.
Now, tile shading lets you
interleave draw calls and
threadgroup dispatches that
operate on the same data.
Tile shaders have access to all
of tile memory, so they can read
and write any pixel of the
imageblock.
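On the API side, a tile dispatch might be set up along these lines in Swift; the function name, pixel format, and sizes are placeholders.

    import Metal

    func encodeLightCulling(device: MTLDevice, library: MTLLibrary,
                            renderEncoder: MTLRenderCommandEncoder) throws {
        // A tile pipeline is built from a tile function, not a
        // vertex/fragment pair.
        let descriptor = MTLTileRenderPipelineDescriptor()
        descriptor.tileFunction = library.makeFunction(name: "cull_lights")!
        descriptor.colorAttachments[0].pixelFormat = .bgra8Unorm
        descriptor.threadgroupSizeMatchesTileSize = true

        let tilePipeline = try device.makeRenderPipelineState(
            tileDescriptor: descriptor, options: [], reflection: nil)

        // Interleaved with draw calls, inside the same render pass:
        renderEncoder.setRenderPipelineState(tilePipeline)
        renderEncoder.dispatchThreadsPerTile(
            MTLSize(width: 16, height: 16, depth: 1))
        // ...later draws in this pass can consume the tile shader's output.
    }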
So let's look at how tile
shading can optimize techniques
such as tiled forward shading.
So like deferred shading, tiled forward shading is a many-light technique. It's often used when MSAA is important or when a variety of materials is needed, and it works equally well for both opaque and translucent geometry.
Now, tiled forward shading
traditionally consists of 3
passes.
First, a render pass generates a
scene depth buffer.
Second, a compute pass calculates per-tile depth bounds and per-tile light lists using that scene depth buffer.
And finally, another render pass
is going to shade the pixels in
each tile using the
corresponding light list.
Now, this pattern of mixing
render with compute occurs
frequently.
And prior to A11, communicating
across these passes required
system memory.
But with tile shading, we can
inline the compute so that the
render passes can be merged.
Here the depth bounds and light
culling steps are now
implemented as tile shaders and
inlined into a single render
pass.
Depth is now stored only in the imageblock, but it's accessible across the entire pass.
So, now, tile shading is going
to help you eliminate a lot of
bandwidth, but these tile shader
outputs are still being stored
to system memory.
Tile shader dispatches are
synchronized with draws, so
that's completely safe to do,
but I think we could still do
better using our next feature,
persistent threadgroup memory.
Okay, so threadgroup memory is a
well-known feature of Metal
compute.
It lets threads within a threadgroup share data using fast, on-chip memory.
Now, thanks to tile shading,
threadgroup memory is now also
available in the render pass.
But threadgroup memory in the
render pass has 2 new
capabilities not traditionally
available to compute.
First, a fragment shader now
also has access to the same
threadgroup memory.
And second, the contents of
threadgroup memory persist
across the entire life of a
tile.
Taken together, this makes a
powerful tool for sharing data
across both draws and
dispatches.
In fact, we believe it's so
useful that we've actually
doubled the maximum size of
threadgroup memory over our
previous generation so that you
can store more of your
intermediate data on chip.
Okay, so now, let's use
threadgroup persistence to
further optimize our tiled
forward shading example.
So with persistence, the tile shading stage can now write both the depth bounds and the culled light lists into threadgroup memory for later draws to use. This means that all our intermediate data now stays on chip and never leaves the GPU. Only the final image is stored to system memory.
Minimizing bandwidth to system
memory is, again, very important
for your game's performance and
playtime.
Now, let's take a look at how
easy it is to make use of
persistence in the shading
language.
Okay, so the top function here
is our tile shader, and it's
going to perform our light
culling.
It intersects each light with a
per-tile frustum to compute an
active light mask.
The bottom function is our
fragment shader that performs
our forward shading.
It shades only the lights
intersecting the tile using that
active light mask.
Now, sharing threadgroup memory
across these functions is
achieved by using the same type
and bind point across both
shaders.
That's how easy it is to take
advantage of threadgroup
persistence.
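Here's a rough sketch of that pair in the Metal shading language; the culling test and the shading helper are stand-ins, and the key point is the matching threadgroup(0) binding across both functions.

    #include <metal_stdlib>
    using namespace metal;

    constant uint kMaxLights = 32;

    struct Light { float4 positionAndRadius; half3 color; };
    struct TileFrustum { float4 planes[4]; };

    static bool intersects(constant Light &light, TileFrustum frustum)
    {
        // ...sphere/frustum plane tests, omitted for brevity...
        return true;
    }

    // Tile shader, dispatched with one thread per tile: writes the
    // active-light mask into persistent threadgroup memory.
    kernel void cull_lights(threadgroup uint &activeLightMask [[threadgroup(0)]],
                            constant Light *lights            [[buffer(0)]],
                            constant TileFrustum &frustum     [[buffer(1)]])
    {
        uint mask = 0;
        for (uint i = 0; i < kMaxLights; ++i)
            if (intersects(lights[i], frustum))
                mask |= (1u << i);
        activeLightMask = mask;
    }

    struct VertexOut { float4 position [[position]]; half3 normal; };

    static half3 shade_light(VertexOut in, constant Light &light)
    {
        // ...attenuation and N.L, omitted for brevity...
        return light.color;
    }

    // Fragment shader in the same pass: the same type at the same bind
    // point sees the mask the tile shader produced for this tile.
    fragment half4 forward_shade(VertexOut in [[stage_in]],
                                 threadgroup uint &activeLightMask [[threadgroup(0)]],
                                 constant Light *lights            [[buffer(0)]])
    {
        half3 color = half3(0.0h);
        for (uint mask = activeLightMask; mask != 0; mask &= mask - 1)
            color += shade_light(in, lights[ctz(mask)]);
        return half4(color, 1.0h);
    }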
Okay, so now that you've seen
tile shading and threadgroup
persistence, let's revisit our
order-independent transparency
example.
Okay, so remember how I said
that changing imageblock layouts
requires tile shading?
That's because tile shading
provides the synchronization we
need to safely change layouts.
This means we actually have to insert a tile shader between the lighting and the MLAB steps.
So tile shading is going to wait
for the lighting stage to
complete before transitioning
from G-buffer layout to MLAB
layout, and it's also going to
carry forward the accumulated
lighting value from the lighting
step into the MLAB step for
final blending.
Okay, so now that we've covered
imageblocks, tile shading, and
threadgroup persistence, it's
time to move on to our final
topic, multi-sample
anti-aliasing and sample
coverage control.
So multi-sample anti-aliasing
improves image quality by
supersampling depth, stencil,
and blending, but shades only
once per pixel.
Multiple samples are later
resolved into a final image
using simple averaging.
Now, multi-sampling is efficient
on all A series GPUs because
samples are stored in tile
memory, where blending and
resolve operations have fast
access to the samples.
The A11 GPU optimizes
multi-sampling even further by
tracking the unique colors
within each pixel.
So blending operations that
previously operated on each
sample now only operate on each
color.
This could be a significant
savings because the interior of
every triangle only contains 1
unique color.
Now, this mapping of unique colors to samples is called color coverage control, and it's managed by the GPU.
But tile shaders can also read
and modify this color coverage.
And we can use this to perform
custom resolves in place and in
fast tile memory.
Now, to see why this is useful,
let's take a look at a
multi-sampled scene that also
renders particles.
Now, particles are transparent,
so we blend them after rendering
our opaque scene geometry.
But particle rendering doesn't
benefit from multi-sampling
because it doesn't really have
any visible edges.
So to avoid the extra cost of
blending per sample for no good
reason, a game would render
using 2 passes.
In the first pass, your opaque
scene geometry is rendered using
multi-sampling to reduce
aliasing.
And then, you're going to resolve your color and depth to system memory; we resolve depth because the particle pass will need it later.
Then in the second pass, the
resolve color and depth are used
in rendering the particles
without multi-sampling.
Now, as you probably guessed by
now, our goal is to eliminate
the intermediate system memory
traffic using tile shading to
combine these 2 passes.
But tile shading alone isn't
enough.
We need color coverage control
to change the multi-sampling
rate in place.
Using color coverage control is
really powerful and really easy.
Let's take a look at the shader.
Okay, so remember that our goal here is to average the samples of each pixel and then store that result back into the imageblock as the overall pixel value. Now, instead of looping through each sample, we're going to take advantage of the color rate capabilities of the A11 and only loop through the unique colors. To properly average across all samples, we need to weigh each color by the number of samples associated with it, and we do this by counting the bits set in the color coverage mask.
We then complete our averaging
by dividing by the total number
of samples and, finally, write
the result back into the
imageblock.
The output sample mask tells
Metal to apply the results to
all samples of the pixel.
And since all samples now share
the same value, the later
particle draws are going to
blend per pixel rather than per
sample.
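Here's a rough sketch of that kernel; the imageblock method names follow the imageblock interface in the Metal shading language, but treat the details as an approximation and check them against the specification.

    #include <metal_stdlib>
    using namespace metal;

    constant ushort kSampleCount = 4;

    kernel void resolve_in_place(
        imageblock<half4, imageblock_layout_implicit> ib,
        ushort2 tid [[thread_position_in_threadgroup]])
    {
        half4 resolved = half4(0.0h);

        // Loop over the unique colors in this pixel, not over samples.
        const ushort numColors = ib.get_num_colors(tid);
        for (ushort i = 0; i < numColors; ++i) {
            // Weigh each color by the number of samples it covers,
            // counting the bits set in the color coverage mask.
            const ushort coverage = ib.get_color_coverage_mask(tid, i);
            const half4 color = ib.read(tid, i, imageblock_data_rate::color);
            resolved += color * half(popcount(coverage));
        }

        // Complete the average, then write the result back to all
        // samples of the pixel via the output sample mask.
        resolved /= half(kSampleCount);
        ib.write(resolved, tid, (1 << kSampleCount) - 1);
    }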
So that's it for sample coverage
control.
Now, optimizing for Apple GPUs is really important for maximizing your game's performance and extending its playtime, but there's a lot more work that goes into shipping a title on iOS, especially one that was originally designed for desktops and consoles.
To talk about that now and to
put into practice what we just
discussed, I'd like to bring on
Nick Penwarden from Epic Games.
Nick?
[ Applause ]
>> Thank you, Michael.
So, yeah. I'd like to talk a
little bit about how we took a
game that was originally made
for desktop and console
platforms and brought it to iOS
using Metal.
So some of the technical
challenges we faced.
The Battle Royale map is a single map, larger than 6 square kilometers.
That means that it will not all
fit into memory.
We also have dynamic time of
day, destruction.
Players can destroy just about
any object in the scene.
Players can also build their own
structures.
So the map is very dynamic,
meaning we can't do a lot of
precomputation.
We have 100 players in the map,
and the map has over 50,000
replicating actors that are
simulated on the server and
replicated down to the client.
Finally, we wanted to support
crossplay between console and
desktop players along with
mobile.
And that's actually a really
important point because it
limited the amount that we could
scale back the game in order to
fit into the performance
constraints of the device.
Basically, if something affected
gameplay, we couldn't change it.
So if there's an object and it's
really small, it's really far
away, maybe normally you would
cull it, but in this case, we
can't because if a player can
hide behind it, we need to
render it.
So I want to talk a little bit about Metal.
Metal is really important in
terms of allowing us to ship the
game as fast as we did and at
the quality that we were able to
achieve.
Draw call performance was key to
this because, again, we have a
really complicated scene and we
need the performance to render
it, and Metal gave us that
performance.
Metal also gave us access to a
number of hardware features,
such as programmable blending,
that we used to get important
GPU performance back.
It also has a feature set that
allowed us to use all of the
rendering techniques we need to
bring Fortnite to iOS.
In terms of rendering features,
we use a movable directional
light for the sun with cascaded
shadow maps.
We have a movable skylight
because the sky changes
throughout the day.
We use physically-based
materials.
We render in HDR and have a
tone-mapping pass at the end.
We allow particle simulation on
the GPU.
And we also support all of our
artist-authored materials.
It's actually a pretty important
point because some of our
materials are actually very
complicated.
For instance, the imposters that
we use to render trees in the
distance efficiently were
entirely created by a technical
artist at Epic using a
combination of blueprints and
the material shader graph.
So in terms of where we ended
up, here is an image of Fortnite
running on a Mac at high
scalability settings.
Here it is running on a Mac at
medium scalability settings.
And here it is on an iPhone 8
Plus.
So we were able to faithfully represent the game on an iPhone at about the quality that we achieve on a mid-range Mac.
So let's talk a little bit about
scalability.
We deal with scalability both
across platforms as well as
within the iOS ecosystem.
So across platform, this is
stuff that we need to fit on the
platform at all, like removing
LODs from meshes that will never
display so we can fit in memory
or changing the number of
characters that we animate at a
particular quality level in
order to reduce CPU costs.
Within iOS, we also defined 3
buckets for scalability -- low,
mid, and high -- and these were
generally correlated with the
different generations of
iPhones, so iPhone 6s on the low
end, iPhone 7 was our mid-range
target, and the iPhone 8 and
iPhone X on the high end.
Resolution was obviously the
simplest and best scalability
option that we had.
We ended up tuning this per
device.
We preferred to use backbuffer
resolution where possible --
this is what the UI renders at
-- because if we do this, then
we don't have to pay a separate
upsampling cost.
However, we do support rendering the 3D scene at a lower resolution, and we do so in cases where we needed a crisp UI but had to reduce the 3D render resolution in order to meet our performance goals -- on the iPhone 6s, for example.
Shadows were another axis of
scalability and actually really
important because they impact
both the CPU and the GPU.
On low-end devices, we don't
render any shadows at all.
On our mid-range target, we have
1 cascade, 1024 by 1024.
We set the distance to be about
the size of a building, so if
you're inside of a structure,
you're not going to see light
leaking on the other side.
High-end phones add a second
cascade, which gives crisper
character shadows as well as
lets us push out the shadowing
distance a little further.
Foliage was another axis of
scalability.
On low-end devices, we simply
don't render foliage.
On the mid range, we render
about 30% of the density we
support on console.
And on high-end devices, we
actually render 100% of the
density that we support on
console.
Memory is interesting in terms
of scalability because it
doesn't always correlate with
performance.
For instance, an iPhone 8 is
faster than an iPhone 7 Plus,
but it has less physical memory.
This means when you're taking
into account scalability, you
need to treat memory
differently.
We ended up treating it as an
orthogonal axis of scalability
and just had 2 buckets, low
memory and high memory.
For low-memory devices, we
disabled foliage and shadows.
We also reduced some of the memory pools. So for instance, we limited GPU particles to a total of 16,000 and reduced the pools used for cosmetics and texture memory.
We still need to do quite a bit
of memory optimization in order
to get the game to fit on the
device.
The most important was level
streaming -- basically, just
making sure that nothing is in
memory that is not around the
player.
We also used ASTC texture
compression and tend to prefer
compressing for size rather than
quality in most cases.
And we also gave our artists a
lot of tools for being able to
cook out different LODs that
aren't needed or reduce audio
variations on a per-platform
basis.
Want to talk a little bit about
frame rate targets.
So on iOS, we wanted to target
30 fps at the highest visual
fidelity possible.
However, you can't just max out
the device.
If we were maxing out the CPU
and the GPU the entire time, the
operating system would end up
downclocking us, then we'd no
longer hit our frame rates.
We also want to conserve battery
life.
If players are playing several
games in a row during their
commute, we want to support that
rather than their device dying
before they even make it to
work.
So for this, what we decided to
do was to target 60 frames per
second for the environment, but
vsync at 30, which means most of
the time when you're exploring
the map in Fortnite, your phone
is idle about 50% of the time.
We use that time to conserve battery life and keep the device cool.
To make sure that we hit those
targets, we track performance
every day.
So every day, we have an
automation pass that goes
through.
We look at key locations in the
map, and we capture performance.
So for instance, Tilted Towers,
and Shifty Shafts, and all of
the usual POIs that you're
familiar with in Battle Royale.
When one goes over budget, we
know we need to dive in, figure
out where performance is going,
and optimize.
We also have daily 100-player
playtests where we capture the
dynamic performance that you'll
only see during a game.
We track key performance over
time for the match, and then we
can take a look at this
afterwards and see how it
performed, look for hitches,
stuff like that.
And if something looks off, we
can pull an instrumented profile
off of the device, take a look
at where time was going, and
figure out where we need to
optimize.
We also support replays.
This is a feature in Unreal that allows us to go and replay that match from a client perspective.
So we can play it over and over,
analyze it, profile it, and even
see how optimizations would have
affected the client in that play
session.
I'm going to talk a little bit about Metal specifically. So on most devices, we have 2 cores, and the way we utilize them is with a traditional game thread/rendering thread split.
On the game thread, we're doing
networking, simulation,
animation, physics, and so on.
The rendering thread does all of
scene traversal, culling, and
issues all of the Metal API
calls.
We also have an async thread.
Mostly, it's handling streaming
tasks -- texture streaming as
well as level streaming.
On newer devices where we have 2
fast and 4 efficient cores, we
add 3 more task threads and
enable some of the parallel
algorithms available in Unreal.
So we move animation simulation onto those threads, along with CPU particles, physics, scene culling, and a couple of other tasks.
I mentioned draw calls earlier.
Draw calls were our main
performance bottleneck, and this
is actually where Metal really
helped us out.
We found Metal to be somewhere
on the order of 3 to 4 times
faster than OpenGL for what we
were doing, and that allowed us
to ship without doing a lot of
aggressive work trying to reduce
draw calls.
We did stuff to reduce draw
calls, mostly pulling in cull
distance on decorative objects
as well as leveraging the
hierarchical level of detail
system.
So here's an example.
This is one of those POIs that
we tracked over time.
If you're familiar with the
game, this is looking down on
Tilted Towers from a cliff and
was kind of our draw call hot
spot in the map.
As you can see, it takes about
1300 draw calls to render this.
This is just for the main pass.
It doesn't include shadows, UI,
anything else that consumed draw
call time.
But Metal's really fast here.
On an iPhone 8 Plus, we were
able to chew through that in
under 5 milliseconds.
I mentioned hierarchical LOD.
This is a feature we have in
Unreal where we can take
multiple draw calls and generate
a simplified version, a
simplified mesh, as well as a
material so that we can
basically render a
representation of that area in a
single draw call.
We use this for taking POIs and
generating the simplified
versions for rendering very,
very far away.
For instance, during the
skydive, you can see the entire
map.
In fact, when you're on the map,
you can get on a cliff or just
build a very high tower of your
own and see points of interest
from up to 2 kilometers away.
Digging into some of the other
details on the Metal side, I
want to talk a little bit about
pipeline state objects.
This was something that took us
a little bit of time to get into
a shippable state for Fortnite.
You really want to minimize how
many PSOs you're creating while
you're simulating the game
during the frame.
If you create too many, it's
very easy to hitch and create a
poor player experience.
So first of all, follow best
practices, right.
Compile your functions offline,
build your library offline, and
pull all of your functions into
a single library.
But you really want to make sure
you create all of your PSOs at
load time.
But what do you do if you can't
do that?
So for us, the permutation
matrix is just crazy.
There's way too many for us to
realistically create at load
time.
We have multiple artist-authored
shaders -- thousands of them --
multiple lighting scenarios
based on number of shadow
cascades and so on, different
render target formats, MSAA.
The list goes on.
We tried to minimize
permutations where we could, and
this does help.
Sometimes a dynamic branch is
good enough and better than
creating a static permutation,
but sometimes not.
What we decided to do was identify the most common subset that we're likely to need, and we create those at load time.
We don't try to create
everything.
The way we achieved this is we
created an automation pass where
we basically flew a camera
through the environment and
recorded all of the PSOs that we
actually needed to render the
environment.
Then, during our daily
playtests, we harvested any PSOs
that were created that were not
caught by that automation pass.
The automation pass also
catches, like, cosmetics, and
effects from firing different
weapons, and so on.
We take all of that information
from automation and from the
playtest, combine it into a
list.
That's what we create at load
time, and that's what we ship
with the game.
It's not perfect, but we find
that the number of PSOs we
create at runtime is in the
single digits for any given play
session, on average.
And so players don't experience
any hitching from PSO creation.
Resource allocation.
So basically, creating and deleting resources is expensive, or can be expensive, so you really want to minimize the number of allocations you're making per frame.
You really don't want to be
creating and destroying a lot of
resources on the fly, but when
you're streaming in content
dynamically, when you have a lot
of movable objects, some of this
just isn't possible to avoid.
So what we did for buffers is we just used buffer suballocation -- basically, a binned allocation strategy.
Upfront, we allocate a big
buffer, and then we suballocate
small chunks back to the engine
to avoid asking Metal for new
buffers all the time.
And this ended up helping a lot.
We also leveraged programmable
blending to reduce the number of
resolves and restores and the
amount of memory bandwidth we
use.
Specifically, the main use case
we have for this is anywhere we
need access to scene depth, so
things like soft particle
blending or projected decals.
What we do is during the forward
pass, we write our linear depth
to the alpha channel.
And then, during our decal and
translucent passes, all we need
to do is use programmable
blending to read that alpha
channel back, and we can use
depth without having ever had to
resolve the depth buffer to main
memory.
We also use it to improve the
quality of MSAA.
As I mentioned, we do HDR rendering, and just an MSAA resolve of HDR can still lead to very aliased edges.
Think of cases where you have a
very, very bright sky and a
very, very dark foreground.
Just doing a box filter over that means, basically, that if 1 of those subsamples is incredibly bright and the others are incredibly dark, the result is going to be an incredibly bright pixel.
And when tone mapped, it'll be
something close to white.
You end up with edges that don't
look anti-aliased at all.
So our solution to this was to do a pre-tonemap over all of the MSAA samples, then perform the normal MSAA resolve, and then the first postprocessing pass just reverses that pre-tonemap. We use programmable blending for the pre-tonemap pass.
Otherwise, we'd have to resolve
the entire MSAA color buffer to
memory and read it back in,
which would be unaffordable.
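To illustrate the idea -- this is not Epic's actual shader, just a sketch of the technique -- a per-sample pre-tonemap pass using programmable blending might look like this in the Metal shading language:

    #include <metal_stdlib>
    using namespace metal;

    // Reading [[color(0)]] enables programmable blending; using
    // [[sample_id]] makes the shader run once per MSAA sample.
    fragment half4 pre_tonemap(half4 color [[color(0)]],
                               uint sampleID [[sample_id]])
    {
        // A reversible Reinhard-style weight keeps one very bright
        // sample from dominating the box-filter resolve.
        half luma = dot(color.rgb, half3(0.2126h, 0.7152h, 0.0722h));
        return half4(color.rgb / (1.0h + luma), color.a);
    }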
Looking forward to some of the work we'd like to do in the future with Metal: first, parallel rendering.
So on macOS, we do support
creating command buffers in
parallel.
On iOS, we'd really need to
support parallel command
encoders for this to be
practical.
A lot of our drawing ends up
happening in the main forward
pass, and so it's important to
parallelize that.
I think it would be very
interesting to sort of see the
effects of parallel rendering on
a monolithic, fast core versus
what we had for parallel command
encoders on the efficient cores
on higher-end devices.
Could be some interesting
results in terms of battery
usage.
Metal heaps. So we'd like to replace our buffer suballocation with Metal heaps -- first, because it'll just simplify our code, but second, because we can also use them for textures.
We still see an occasional hitch
due to texture streaming because
we're obviously creating and
destroying textures on the fly
as we bring textures in and out
of memory.
Being able to use heaps for this
will get rid of those hitches.
For us, the work we have in front of us to make that possible is setting up precise fencing between the different passes.
So we need to know explicitly if
a resource is being read or
written by a vertex or pixel
shader across different passes,
and it requires some reworking
of some of our renderer to make
that happen.
And of course, continue to push
the high end of graphics on iOS.
Last year at WWDC, we showed what was possible by bringing our desktop-class forward renderer to high-end iOS devices, and we want to continue pushing that bar on iOS, continuing to bring desktop-class features to iOS and looking for opportunities to unify our desktop renderer with the iOS renderer.
And with that, I'll hand it back
to Michael.
[ Applause ]
>> So Metal is low overhead out of the box, but rendering many objects efficiently can require multithreading. Metal is built to take advantage of all the CPUs in our systems.
Metal is also really accessible,
but advanced rendering sometimes
requires explicit control.
Metal provides this control when
you need it for memory
management and GPU parallelism.
We also introduced indirect
command buffers, our brand-new
feature that lets you move
command generation entirely to
the GPU, freeing the CPU for
other tasks.
Together with argument buffers,
these features provide a
complete solution to GPU-driven
pipelines.
And finally, Metal lets you
leverage the advanced
architecture of the A11 GPU to
optimize many rendering
techniques for both maximum
performance and extended
playtime.
For more information, please
visit our website, and be sure
to visit us in tomorrow's lab.
Thank you.
[ Applause ]