Transcript
[ Music ]
[ Applause ]
>> So, hello everyone.
My name is Fiona and this
is my colleague Alex.
And I work on the iOS GPU
compiler team and our job is
to make your shaders run
on the latest iOS devices,
and to make them run as
efficiently as possible.
And I'm here to talk
about our presentation,
Advanced Metal Shader
Optimization, that is Forging
and Polishing your
Metal shaders.
Our compiler is based on LLVM. And we work with the open source community to make LLVM more suitable for use on GPUs by everyone.
Here's a quick overview of
the other Metal sessions,
in case you missed them,
and don't worry you can
watch the recordings online.
Yesterday we had part one
and two of adopting Metal
and earlier today we had part
one and two of what's new
in Metal, because there's quite
a lot that's new in Metal.
And of course here's
the last one,
the one you're watching
right now.
So in this presentation we're
going to be going over a number
of things you can do to
work with the compiler
to make your code faster.
And some of this stuff is
going to be specific to A8
and later GPUs including
some information
that has never been
made public before.
And some of it will
also be more general.
And we'll be noting that with
the A8 icon you can see there
for slides that are
more A8 specific.
And additionally, we'll be
noting some potential pitfalls.
That is things that may not
come up as often as the kind
of micro optimizations
you're used to looking for,
but if you run into these,
you're likely to lose
so much performance, nothing is
going to matter by comparison.
So it's always worth making
sure you don't run into those.
And those will be marked
with the triangle icon,
as you can see there.
Before we go on, this
is not the first step.
This is the last step.
There's no point to doing
low-level shader optimization
until you've done the
high-level optimizations before,
like watching the
other Metal talks
on optimizing your
draw calls, the structure
of your engine and so forth.
Optimizing your shaders at this level
should be roughly the last thing
you do.
And, this presentation
is primarily
for experienced shader authors.
Perhaps you've worked on Metal
a whole lot and you're looking
to get more into optimizing your
shaders, or perhaps you're new
to Metal, but you've done a
lot of shader optimization
on other platforms and
you'd like to know how
to optimize better
for A8 and later GPUs,
this is the presentation
for you.
So you may have seen this
pipeline if you watched any
of the previous Metal talks.
And we will be focusing
of course
on the programmable
stages of this pipeline,
as you can see there,
the shader cores.
So first, Alex is going to go
over some shader
performance fundamentals
and higher level issues.
After which, I'll return
for some low-level,
down and dirty shader
optimizations.
[ Applause ]
>> Thanks, Fiona.
Let me start by explaining
the idea
of shader performance
fundamentals.
These are the things that
you want to make sure
that you have right
before you start digging
into source level optimizations.
Usually the impact of the kind
of changes you'll
make here can dwarf
or potentially hide other
more targeted changes
that you make elsewhere.
So I'm going to talk
about four of these today.
Address space selection
for buffer arguments,
buffer preloading, dealing
with fragment function
resource writes,
and how to optimize
your compute kernels.
So, let's start with
address spaces.
So since this functionality
doesn't exist
in all shading languages,
I'll give a quick primer.
So, GPUs have multiple paths
for getting data from memory.
And these paths are optimized
for different use cases,
and they have different
performance characteristics.
In Metal, we expose control
over which path is used
to the developer by requiring
that they qualify all buffer arguments and pointers
in the shading language
with which address
space they want to use.
So a couple of the address
spaces specifically apply
to getting information
from memory.
The first of which is
the device address space.
This is an address space with
relatively few restrictions.
You can read and write data
through this address space,
you can pass as much data as
you want, and the buffer offsets
that you specify at the API
level have relatively flexible
alignment requirements.
On the other end of things, you
have the constant address space.
As the name implies, this is
a read only address space,
but there are a couple of
additional restrictions.
There are limits on how
much data you can pass
through this address space, and
additionally the buffer offsets
that you specify at the API
level have more stringent
alignment requirements.
However, this is the address
space that's optimized for cases
with a lot of data reuse.
So you want to take advantage
of this address space
whenever it makes sense.
Figuring out whether or not the
constant address space makes
sense for your buffer
argument is typically a matter
of asking yourself
two questions.
The first question is, do I
know how much data I have.
And if you have a potentially
variable amount of data,
this is usually a
sign that you need
to be using the device
address space.
Additionally, you want to
look at how much each item
in your buffer is being read.
And if these items can
potentially be read many times,
this is usually a sign
that you want to put them
into the constant address space.
So let's put this into practice
with a couple of examples
from some vertex shaders.
First, you have regular,
old vertex data.
So as you can see, each vertex
has its own piece of data.
And each vertex is the only one
that reads that piece of data.
So there's essentially
no reuse here.
This is the kind of thing
that really needs to be
in the device address space.
Next, you have projection matrices and other matrices.
Now, typically what you have
here is that you have one
of these objects, and they're
read by every single vertex.
So with this kind of complete
data reuse, you really want this
to be in the constant
address space.
Let's mix things up a little bit
and take a look at
skinning matrices.
So hopefully in this case
you have some maximum number
of bones that you're handling.
But if you look at each
bone that matrix may be read
by every vertex that
references that bone,
and that also is a potential
for a large amount of reuse.
And so this really ought to be
in the constant address
space as well.
Finally, let's look
at per instance data.
As you can see all vertices
in the instance will read
this particular piece of data,
but on the other hand you have
a potentially variable number
of instances, so this
actually needs to be
in the device address
space as well.
For an example of why address
space selection matters
for performance, let's move
on to our next topic,
buffer preloading.
So Fiona will spend some
time talking about how
to actually optimize loads and
stores within your shaders,
but for many cases the best
thing that you can do is
to actually offload this work to dedicated hardware. So we can do this for you in two cases, constant buffers and
vertex buffers.
But this relies on knowing
things about the access patterns
in your shaders and what address
space you've placed them into.
So let's start with
constant buffer preloading.
So the idea here is
that rather than loading
through the constant
address space,
what we can actually do is
take your data and put it
into special constant
registers that are even faster
for the ALU to access.
So we can do this as long
as we know exactly
what data will be read.
If your offsets are
known at compile time,
this is straightforward.
But if your offsets aren't known
until run time then we need a
little bit of extra information
about how much data
that you're reading.
So indicating this
to the compiler is usually
a matter of two steps.
First, you need to make
sure that this data is
in the constant address space.
And additionally
you need to indicate
that your accesses are
statically bounded.
The best way to do this
is to pass your arguments
by reference rather than
pointer where possible.
If you're passing only a
single item or a single struct,
this is straightforward, you
can just change your pointers
to references and change
your accesses accordingly.
This is a little different
if you're passing an array
that you know is bounded.
So what you can do in this case is embed that sized array in a struct and pass that struct
by reference rather
than passing the
original pointer.
So we can put this into
practice with an example
of a forward lighting
fragment shader.
So as you can see in sort
of the original version what we
have are a bunch of arguments
that are passed as
regular device pointers.
And this doesn't expose the
information that we want.
So we can do better than this.
Instead if we note the number
of lights is bounded, what we can
do is we can put the light data
and the count together into
a single struct like this.
And we can pass that struct
in the constant address space
as a reference like this.
And so that gets us
constant buffer preloading.
Let's look at another example
of how this can affect
you in practice.
So, there are many ways to
implement a deferred renderer, but what we find is that the actual implementation choices that you make can have a big impact on the performance
that you achieve in practice.
One pattern that's common
now is to use a single shader
to accumulate the
results of all lights.
And what you can see from the
declaration of this function,
is that it can potentially read
any or all lights in the scene
and that means that your
input size is unbounded.
Now, on the other hand if you're
able to structure your rendering
such that each light is handled
in its own draw call
then what happens is that each draw call's shader only needs to read that one light's data, and that means that you can pass it
in the constant address space
and take advantage
of buffer preloading.
In practice we see
that on A8 and later GPUs
that this is a significant
performance win.
Now let's talk about
vertex buffer preloading.
The idea of vertex
buffer preloading is
to reuse the same dedicated
hardware that we would use
for fixed-function
vertex fetching.
And we can do this for regular
buffer loads as long as the way
that you access your
buffer looks just
like fixed-function
vertex fetching.
So what that means
is that you need
to be indexing using the
vertex or instance ID.
Now we can handle a couple
additional modifications
to the vertex or instance IDs such as applying a divisor,
and that's with or
without any base vertex
or instance offsets you might
have applied at the API level.
Of course the easiest way to
take advantage of this is just
to use the Metal vertex
descriptor functionality
wherever possible.
But if you are writing
your own indexing code,
we strongly suggest that
you lay out your data so that vertices fetch linearly
to simplify buffer indexing.
Note that this doesn't preclude
you from doing fancier things,
like if you were rendering quads
and you want to pass one value
to all vertices in the quad,
you can still do things
like indexing by vertex
ID divided by four
because this just
looks like a divisor.
So now let's move on to a couple
shader stage specific concerns.
In iOS 10 we introduced the
ability to do resource writes
from within your
fragment functions.
And this has interesting
implications
for hidden surface removal.
So prior to this you might have
been accustomed to the behavior
that a fragment wouldn't
need to be shaded as long
as an opaque fragment
came in and occluded it.
So this is no longer
true specifically
if your fragment function
is doing resource writes,
because those resource
writes still need to happen.
So instead your behavior
really only depends
on what's come before.
And specifically what
happens depends on whether
or not you've enabled
early fragment tests
on your fragment function.
If you have enabled early fragment tests, your fragment will be shaded once it's rasterized, as long as it also passes the early depth and stencil tests.
If you haven't specified
early fragment tests,
then your fragment
will be shaded
as long as it's rasterized.
So from a perspective of
minimizing your shading,
what you want to do is
use early fragment tests
wherever possible.
But there are a couple
additional things
that you can do to improve
the rejection that you get.
And most of these boil
down to draw order.
You want to draw these objects,
the objects where the fragment
functions do resource writes
after opaque objects.
And if you're using these
objects to update your depth
and stencil buffers,
we strongly suggest
that you sort these
objects from front to back.
Note that this guidance
should sound fairly familiar
if you've been dealing
with fragment functions
that do discard or modify
your depth per pixel.
Now let's talk about
compute kernels.
Since the defining characteristic of a compute kernel is that you can structure your computation however you want, let's talk about what factors
influence how you do this
on iOS.
First we have compute
thread launch overhead.
So on A8 and later GPUs
there's a certain amount of time
that it takes to launch a
group of compute threads.
So if you don't do enough work within a single compute thread, you can potentially leave the hardware underutilized and leave performance on the table.
And a good way to deal with
this and actually a good pattern
for writing compute
kernels on iOS in general is
to actually process multiple
conceptual work items
in a single compute thread.
And in particular a pattern
that we find works well is
to reuse values not
by passing them
through thread group memory, but
rather by reusing values loaded
for one work item when you're
processing the next work item
in the same compute thread.
And it's best to illustrate
this with an example.
So this is a Sobel filter kernel, this is sort of the most straightforward version of it. As you see, it reads a 3 by 3 region of its source
and produces one output pixel.
So if instead we
apply the pattern
of processing multiple
work items
in a single compute thread,
we get something
that looks like this.
Notice now that we're striding
by two pixels at a time.
So processing the first pixel
looks much as it did before.
We read the 3 by 3 region.
We apply the filter and
we write up the value.
But now let's look at
how pixel 2 is handled.
So since we're striding by two pixels at a time, we need
to make sure that there is
a second pixel to process.
And now we read its data.
Note here that a 2 by 3 region
of what this pixel
wants was already loaded
by the previous pixel.
So we don't need
to load it again,
we can reuse those old values.
All we need to load now is the 1
by 3 region that's
new to this pixel.
After which, we can apply
the filter and we're done.
Note that as a result we're now doing 12 texture reads instead of the old 9, but
we're producing 2 pixels.
So this is a significant
reduction in the amount
of texture reads per pixel.
Of course this pattern doesn't
work for all compute use cases.
Sometimes you do still
need to pass data
through thread group memory.
And in that case, when you're
synchronizing between threads
in a thread group, an important
thing to keep in mind is
that you want to use the barrier
with the smallest possible scope
for the threads that
you need to synchronize.
In particular, if your thread
group fits within a single SIMD,
the regular thread
group barrier function
in Metal is unnecessary.
What you can use instead is the
new SIMD group barrier function
introduced in iOS 10.
And what we find is actually
the targeting your thread group
to fit within a single SIMD
and using SIMD group barrier
is often faster than trying
to use a larger thread
group in order to squeeze
that additional reuse,
but having to use thread
group barrier as a result.
So that wraps things up
for me, in conclusion,
make sure you're using the
appropriate address space
for each of your buffer
arguments according
to the guidelines
that we described.
Structure your data
and rendering
to take maximal advantage
of constant
and vertex buffer preloading.
Make sure you're using early
fragment tests to reject
as many fragments as possible
when you're doing
resource writes.
Put enough work in
each compute thread
so you're not being limited
by your compute thread
launch overhead.
And use the smallest barrier
for the job when you need
to synchronize between
threads in a thread group.
And with that I'd like to pass
it back to Fiona to dive deeper
into tuning shader code.
[ Applause ]
>> Thank you, Alex.
So, before jumping into the
specifics here, I want to go
over some general
characteristics of GPUs
and the bottlenecks
you can encounter.
And all of you may be
familiar with this,
but I figure I should
just do a quick review.
So with GPUs typically you
have a set of resources.
And it's fairly common for
a shader to be bottlenecked
by one of those resources.
And so for example if
you're bottlenecked
by memory bandwidth,
improving other things
in your shader will often
not give any apparent
performance improvement.
And while it is important to
identify these bottlenecks
and focus on them to
improve performance,
there is actually still
benefit to improving things
that aren't bottlenecks.
For example, in that example
if you are bottlenecked
on memory bandwidth, but then
you improve your arithmetic
to be more efficient, you
will still save power even
if you are not improving
your frame rate.
And of course being on mobile,
saving power is always
important.
So it's not something to ignore,
just because your frame rate
doesn't go up in that case.
So there's four typical
bottlenecks to keep
in mind in shaders here.
The first is fairly
straightforward, ALU bandwidth.
The amount of math
that the GPU can do.
The second is memory bandwidth,
again, fairly straightforward,
the amount of data that the GPU
can load from system memory.
The other two are
a little more subtle.
The first one is
memory issue rate.
Which represents the
number of memory operations
that can be performed.
And this can come up in the case
where you have smaller
memory operations,
or you're using a lot of thread
group memory and so forth.
And the last one, which I'll
go into detail a bit more
about later, is latency, occupancy, and register usage.
You may have heard about that,
but I will save that
until the end.
So to try to alleviate
some of these bottlenecks,
and improve overall shader
performance and efficiency,
we're going to look
at four categories
of optimization opportunity
here.
And the first one is data types.
And the first thing to consider
when optimizing your shader
is choosing your data types.
And the most important
thing to remember
when you're choosing
data types is that A8
and later GPUs have
16-bit register units,
which means that for example if
you're using a 32-bit data type,
that's twice the register
space, twice the bandwidth,
potentially twice the
power and so-forth,
it's just twice as much stuff.
So, accordingly you
will save registers,
you will get faster performance,
you'll get lower power
by using smaller data types.
Use half and short for
arithmetic wherever you can.
Energy wise, half is
cheaper than float.
And float is cheaper
than integer,
but even among integers,
smaller integers are cheaper
than bigger ones.
And the most effective thing
you can do to save registers is
to use half for texture reads
and interpolates because most
of the time you really do
not need float for these.
And note I do not mean
your texture formats.
I mean the data types you're
using to store the results
of a texture sample
or an interpolate.
And one aspect of A8 and later
GPUs that is fairly convenient
and makes using smaller
data types easier
than on some other GPUs is
that data type conversions
are typically free,
even between float and half,
which means that you don't have
to worry, oh am I introducing
too many conversions in this
by trying to use half here?
Is this going to cost too much?
Is it worth it or not?
No it's probably fast because
the conversions are free,
so you can use half wherever
you want and not worry
about that part of it.
The one thing to keep
in mind here though is
that half-precision numerics
and limitations are
different from float.
And a common bug
that can come up here
for example is people will
write 65,535 as a half,
but that is actually infinity.
Because that's bigger
than the maximum half.
And so by being aware of
what these limitations are,
you'll better be able to
know where you perhaps should
and shouldn't use half.
And less likely to encounter
unexpected bugs in your shaders.
So one example application
for using smaller integer
data types is thread IDs.
And as those of you who worked
on compute kernels will know,
thread IDs are used
all over your programs.
And so making them smaller
can significantly increase the
performance of arithmetic, and
can save registers and so forth.
And so for local thread IDs, there's no reason to ever use uint for them as in this case, because a thread group can't have that many threads.
For global thread IDs, usually
you can get away with a ushort
because most of the
time you don't have
that many global thread IDs.
Of course it depends
on your program.
But in most cases, you won't
go over 2 to the 16 minus 1,
so it is safe to do this.
And this is going to be lower
power, it's going to be faster
because all of the arithmetic
involving your thread ID is now
going to be faster.
So I highly recommend
this wherever possible.
Additionally, keep in mind
that in C like languages,
which of course includes
Metal, the precision
of an operation is defined by
the larger of the input types.
For example, if you're
multiplying a float by a half,
that's a float operation not a
half operation, it's promoted.
So accordingly, make sure
not to use float literals
when they're not necessary, because that will turn what appears here to be a half operation, one that takes a half and returns a half,
into a float operation.
Because by the language
semantics,
that's actually a float
operation since at least one
of the inputs is float.
And so you probably
want to do this.
This will actually
be a half operation.
This will actually be faster.
This is probably what you mean.
So be careful not
to inadvertently introduce
float precision arithmetic
into your code when
that's not what you meant.
And while I did mention that
smaller data types are better,
there's one exception to
this rule and that is char.
Remember as I said that
native data type size on A8
and later GPUs is
16-bit, not 8-bit.
And so char is not going to
save you any space or power
or anything like that
and furthermore there's no
native 8-bit arithmetic.
So it sort of has
to be emulated.
It's not overly expensive if you
need it, feel free to use it.
But it may result in
extra instructions.
So don't unnecessarily
shrink things to char
that don't actually need it.
So next we have arithmetic
optimizations,
and pretty much everything
in this category
affects ALU bandwidth.
The first thing you can do
is always use Metal built-ins
whenever possible.
They're optimized
implementations
for a variety of functions.
They're already optimized
for the hardware.
It's generally better than
implementing them yourself.
And in particular,
there are some of these
that are usually
free in practice.
And this is because GPUs
typically have modifiers.
Operations that can be
performed for free on the input
and output of instructions.
And for A8 and later GPUs
these typically include negate,
absolute value, and
saturate as you can see here,
these three operations in green.
So, there's no point to trying
to "be clever" and speed
up your code by avoiding
those, because again,
they're almost always free.
And because they're free,
you can't do better than free.
There's no way to
optimize better than free.
A8 and later GPUs, like a lot
of others nowadays,
are scalar machines.
And while shaders are
typically written with vectors,
the compiler is going to split
them all apart internally.
Of course, there's no downside
to writing vector code,
I mean often it's clearer,
often it's more maintainable,
often it fits what you're trying
to do, but it's also no better
than writing scalar code
from a compiler perspective
and the code you're
going to get.
So there's no point in
trying to vectorize code
that doesn't really fit a vector
format, because it's just going
to end up the same
thing in the end,
and you're kind of
wasting your time.
However, as a side note, which I'll go into in more detail later, A8 and later GPUs do have vector load and store,
even though they do not have
vector arithmetic.
So this only applies
to arithmetic here.
Instruction Level Parallelism
is something that some
of you may be used to
optimizing for,
especially if you've
done work on CPUs.
But on A8 and later GPUs this
is generally not a good thing
to try to optimize for
because it typically works
against register usage, and register usage
typically matters more.
So a common pattern you may
have seen is a kind of loop
where you have multiple
accumulators in order
to better deal with
latency on a CPU.
But on A8 and later GPUs this
is probably counterproductive.
You'd be better off just
using one accumulator.
Of course this applies to
much more complex examples
than the artificial
simple ones here.
Just write what you mean, don't
try to restructure your code
to get more ILP out of it.
It's probably not going to
help you at best, and at worst,
you just might get worse code.
So one fairly nice feature
of A8 and later GPUs is
that they have very
fast select instructions
that is the ternary operator.
And historically it's
been fairly common
to use clever tricks like this to try to perform select operations without ternaries, to avoid those branches
or whatever.
But on modern GPUs this is
usually counterproductive,
and especially on A8 later GPUs
because the compiler can't see
through this cleverness.
It's not going to figure
out what you actually mean.
And really, this is really ugly.
You could just have
written this.
And this is going to be faster,
shorter, and it's actually going
to show what you mean.
Like before, being overly clever
will often obfuscate what you're
trying to do and
confuse the compiler.
Now, this is a potential
major pitfall,
hopefully this won't
come up too much.
On modern GPUs most of them
do not have integer division
or modulus instructions,
integer not float.
So avoid division or modulus by denominators that are not literals or function constants,
the new feature mentioned in
some of the earlier talks.
So in this example, what we
have over here, this first one
where the denominator
is a variable,
that will be very, very slow.
Think hundreds of clock cycles.
But these other two examples,
those will be very fast.
Those are fine.
So don't feel like you
have to avoid that.
So, finally the topic
of fast-math.
So in Metal, fast-math
is on by default.
And this is because compiler
fast-math optimizations are
critical to the performance of Metal shaders. They can often give a 50% performance gain or more over having fast-math off. So it's no wonder it's on by default.
And so what exactly do
we do in fast-math mode?
Well, the first is that some
of the Metal built-in functions
have different precision
guarantees between
fast-math and non fast-math.
And so in some of them they will
have slightly lower precision
in fast-math mode to
get better performance.
The compiler may increase
the intermediate precision
of your operations, such as by forming fused multiply-add instructions.
It will not decrease the
intermediate precision.
So for example if you write a
float operation you will get an
operation that is at
least a float operation.
Not a half operation.
So if you want to write half
operations you better write
that, the compiler will
not do that for you,
because it's not allowed to.
It can't reduce your precision like that.
We do ignore strict not-a-number (NaN), infinity, and signed zero semantics,
which is fairly important,
because without that
you can't actually prove
that x times zero
is equal to zero.
But we will not introduce new NaNs (not-a-numbers),
because in practice
that's a really nice way
to annoy developers,
and break their code
and we don't want to do that.
And the compiler will perform
arithmetic re-association,
but it will not do
arithmetic distribution.
And really this just comes
down to what doesn't break code
and makes it faster versus
what does break code.
And we don't want to break code.
So if you absolutely cannot use
fast-math for whatever reason,
there are some ways to recover
some of that performance.
Metal has a fused multiply-add
built in which you can see here.
Which allows you to
directly request a fused
multiply-add instructions.
And of course if
fast-math is off,
the compiler is not even
allowed to make those,
it cannot change one bit of
your rounding, it is prohibited.
So if you want to use
fused multiply-add
and fast-math is
off, you're going
to have to use the built-in.
And that will regain
some of the performance,
not all of it, but
at least some.
So, on our third
topic, control flow.
Predicated GPU control flow
is not a new topic and some
of you may already
be familiar with it.
But here's a quick review
of what it means for you.
Control flow that is
uniform across the SIMD,
that is every thread is
doing the same thing,
is generally fast.
And this is true even if
the compiler can't see that.
So if your program doesn't
appear uniform, but just happens
to be uniform when it runs,
that's still just as fast.
And similarly, the opposite of this is divergence,
different lanes doing different
things, well in that case,
it potentially may
have to run all
of the different paths
simultaneously unlike a CPU
which only takes
one path at a time.
And as a result it does more
work, which of course means
that inefficient control
flow can affect any
of the bottlenecks, because it
just outright means the GPU is
doing more stuff, whatever
that stuff happens to be.
So, the one suggestion I'll make
on the topic of control flow is
to avoid switch fall-throughs.
And these are fairly
common in CPU code.
But on GPUs they can potentially
be somewhat inefficient,
because the compiler has to do
fairly nasty transformations
to make them fit within the
control flow model of GPUs.
And often this will involve
duplicating code and all sort
of nasty things you probably
would rather not be happening.
So if you can find a nice way to
avoid these switch fall-throughs
in your code, you'll
probably be better off.
So now we're on to
our final topic.
Memory access.
And we'll start with
the biggest pitfall
that people most
commonly run into
and that is dynamically indexed
non-constant stack arrays.
Now that's quite a mouthful,
but a lot of you probably
are familiar with code
that looks vaguely like this.
You have an array that consist
of values that are defined
in runtime and vary between each
thread or each function call.
And you index it to the
array with another value
that is also a variable.
That is a dynamically indexed
non-constant stack array.
Now before we go on, I'm
not going to ask you to take
for granted the idea that
stacks are slow on GPUs.
I'm going to explain why.
So, on CPUs typically you
have like a couple threads,
maybe a dozen threads, and you
have megabytes of cache split
between those threads.
So every thread can have
hundreds of kilobytes
of stack space before they
get really slow and have
to head off to main memory.
On a GPU you often have tens of
thousands of threads running.
And they're all sharing
a much smaller cache too.
So when it comes down to
it each thread has very,
very little space
for data for a stack.
It's just not meant for that,
it's not efficient and so
as a general rule,
for most GPU programs,
if you're using the
stack, you've already lost.
It's so slow that almost
anything else would have
been better.
And an example for a real
world app is at the start
of the program it needed
to select one of two float4 vectors, so it used a 32-byte array, an array of two float4s, and tried to select
between them using
this stack array.
And that caused a
30% performance loss
in this program even though it's
only done once at the start.
It can be pretty significant.
And of course every time we
improve the compiler we are
going to try harder and harder, doing anything we can, to avoid generating these stack accesses, because it is that bad.
Now I'll show you two
examples here that are okay.
This other one, you can
see those are constants,
not variables.
It's not a non-constant
stack array and that's fine
because the values don't vary
per threads, they don't need
to be duplicated per thread.
So that's okay.
And this one is also okay.
Wait, why?
It's still a dynamically indexed
non-constant stack array.
But it's only done dynamically
indexed because of this loop.
And the compiler is going
to unroll that loop.
In fact, the compiler
aggressively unrolls any loop
that is accessing the stack to
try to make it stop doing that.
So in this case after it's
unrolled it will no longer be
dynamically indexed,
so it will be fast.
And this is worth mentioning,
because this is a fairly
common pattern in a lot
of graphics code and I don't
want to scare you into not doing
that when it's probably fine.
So now that we've gone
over the topic of how
to not do certain types
of loads and stores,
let's go on to making
the loads and stores
that we do actually fast.
Now while A8 and later
GPUs use scalar arithmetic,
as I went over earlier, they
do have vector memory units.
And one big vector load is of course faster
than multiple smaller ones
that sum up to the same size.
And this typically affects the memory issue rate bottleneck, because if you're combining loads, that's fewer loads.
And, so as of iOS 10, one of
our new compiler optimizations,
is we will try to vectorize
some loads and stores that go
to neighboring memory
locations wherever we can,
because again it can give
good performance improvements.
But nevertheless, this is one
of the cases where working
with the compiler
can be very helpful,
and I'll give an example.
So as you can see here,
here's a simple loop
that does some arithmetic and
reads in an array of structures,
but on each iteration,
it needs two loads.
Now we would want that
to be one if we could,
because one is better than two.
And the compiler wants that too.
It wants to try to vectorize
this but it can't, because A
and C aren't next to
each other in memory
so there's nothing it can do.
The compiler's not allowed
to rearrange your structs,
so we've got two loads.
There's two solutions to this.
Number one, of course,
just make it a float2, now it's a vector load, you're done. One load instead of two, we're all good.
Also, as of iOS 10, this
should also be equally fast,
because here, we've
reordered our struct
to put the values
next to each other,
so the compiler can
now vectorize the loads
when it's doing it.
And this is an example again
of working with the compiler,
you've allowed the compiler to
do something it couldn't before,
because you understand
what's going on.
You understand how the
patterns need to be
to make the compiler happy
and make it able to
do a better job.
So, another thing to keep in
mind with loads and stores is
that A8 and later GPUs
have dedicated hardware
for device memory addressing,
but this hardware has limits.
The offset for accessing
device memory must fit
within a signed integer.
Smaller types like short
and ushort are also okay,
in fact they're highly
encouraged,
because those do also fit
within a signed integer.
However, of course uint does
not because it can have values
out of range of signed integer.
And so if the compiler
runs into a situation
where the offset is a
uint and it cannot prove
that it will safely fit
within a signed integer,
it has to manually
calculate the address,
it has to manually
calculate the address,
rather than letting the
dedicated hardware do it.
And that can waste power,
it can waste ALU
performance and so forth.
It's not good.
So, change your offset to
int, now the problem's solved.
And of course taking advantage
of this will typically
save you ALU bandwidth.
So now on to our final
topic that I sort of glossed
over earlier, latency
and occupancy.
So one of the core
design tenets
of modern GPUs is
they hide latency
by using large scale
multithreading.
So when they're waiting for
something slow to finish,
like a texture read,
they just go
and run another thread instead
of sitting there doing
nothing while waiting.
And this is fairly important
because texture reads typically
take a couple hundred cycles
to complete on average.
And so the more latency
you have in a shader,
the more threads you need
to hide that latency,
and how many threads
can you have?
Well it's limited by the fact
that you have a fixed set
of resources that are shared
between threads in
a thread group.
So clearly depending on
how much each thread uses,
you have a limitation on
the number of threads.
And the two things that
are split are the number
of registers and
thread group memory.
So if you use more
registers per thread,
now you can't have
as many threads.
Simple enough.
And if you use more thread group
memory per thread, again you run
into the same problem,
more thread group memory per thread means fewer threads.
And you can actually check out
the occupancy of your shader
by using MTLComputePipelineState and querying maxTotalThreadsPerThreadgroup,
which will tell you what
the actual occupancy
of your shader is based
on the register usage
and the thread group
memory usage.
And so when we say a
shader is latency limited,
it means you have
too few threads
to hide the latency of a shader.
And there's two things
you can do there,
you can either reduce the
latency of your shader,
or save registers
or whatever else it is
that is preventing you
from having more threads.
So, since it's kind of
hard to go over latency
in a very large, complex shader, I'll go over a little bit
of a pseudocode example
that will hopefully give you
a bit of an intuition of how
to think about latency
and how to sort
of mentally model it in your shaders.
So, here's an example
of a real dependency.
We have a texture sample,
and then we use the output
of that texture sample
to run an if statement
and then we do another texture
sample inside that if statement.
We have to wait twice.
Because we have to wait once
before doing the if statement.
And we have to wait again
before using the value
from the second texture sample.
So that's two serial
texture accesses
for a total of twice
the latency.
Now here's an example
of a false dependency.
It looks a lot like the other,
except we're not using
a in the if statement.
But typically, we can't
wait across control flow.
The if statement acts as an
effective barrier in this case.
So, we automatically have
to wait here anyways even though
there's no data dependency.
So we still get twice
the latency.
As you noticed the GPU
does not actually care
about your data dependencies.
It only cares about what the
dependencies appear to be
and so the second one will
be just as long latency
as the first one, even
though there isn't a data
dependency there.
And then finally
here's a simple one
where you just have two
texture reads at the top,
and they can both
be done in parallel
and then we can have
a single wait.
So it's 1x instead of 2x the latency.
So, what are you going to
do with this knowledge?
So in many real world
shaders you have opportunities
to tradeoff between
latency and throughput.
And a common example of this
might be that you have some code
where based on one texture read
you can decide, oh we don't need
to do anything in this shader,
we're going to quit early.
And that can be very useful.
Because now all that work
that's being done in the cases
where you don't need
it to be done,
you're saving all that work.
That's great.
But now you're increasing
your throughput
by reducing the amount
of work you need to do.
But you're also increasing
your latency because now it has
to do the first texture read,
then wait for that texture read,
then do your early
termination check,
and then do whatever other
texture reads you have.
And well is it faster?
Is it not?
Often you just have to test.
Because which is faster
is really going to depend
on your shader, but it's
a thing worth being aware
of that often is a real
tradeoff and you often have
to experiment to
see what's right.
Now, while there isn't
a universal rule,
there is one particular
guideline I can give for A8
and later GPUs and that is
typically the hardware needs
at least two texture
reads at a time
to get full ability
to hide latency.
One is not enough.
If you have to do
one, no problem.
But if you have some choice
in how you arrange your
texture reads in your shader,
if you allow it to do
at least two at a time,
you may get better performance.
So, in summary.
Make sure you pick the correct
address spaces, data structures,
layouts and so forth, because
getting this wrong is going
to hurt so much that often
none of the other stuff
in the presentation will matter.
Work with the compiler.
Write what you mean.
Don't try to be too clever,
or the compiler won't know what
you mean and will get lost,
and won't be able to do its job.
Plus, it's easier to
write what you mean.
Keep an eye out for
the big pitfalls,
not just the
micro-optimizations.
They're often not as obvious,
and they often don't come
up as often, but when
they do, they hurt.
And they will hurt so
much that no number
of micro-optimizations
will save you.
And feel free to experiment.
There's a number of real
tradeoffs that happen,
where there's simply
no single rule.
And try them both,
see what's faster.
So, if you want more
information, go online.
The video of the talk
will be up there.
Here are the other sessions if
you missed them earlier, again,
the videos will be online.
Thank you.