Transcript
[ Applause ]
>> Good morning and welcome.
My name is Dan Omachi.
I work in Apple's
GPU software group
on the OpenGL ES framework.
I also work very closely
with our GPU driver engineers
on improving performance
and implementing features
on our graphics hardware.
And today I'm going to
talk to you about Advances
in OpenGL ES on iOS 7.
Apple offers a number
of rendering APIs
that are highly optimized
for a variety
of specific rendering scenarios
- Core Graphics, Core Animation,
and now Sprite Kit
are among those.
They do a ton for you,
and they do it very well.
OpenGL ES, however, offers
the most direct access
to graphics hardware.
This enables a lot
of flexibility
to create custom effects
and bring something new
and innovative into
your rendering.
Now, this flexibility can
be a challenge to master.
It's a low-level library,
and there can be some
stumbling blocks,
but if you can utilize
the API to its fullest,
you can bring some
really wild custom effects
that people are amazed
by and love.
This can make the difference
between shipping
a good application
that a few people download and
maybe play with for a few days,
and something great that people
talk about, use day to day,
and download in droves.
[ Pause ]
So what am I going to
be talking about today?
First, there are a
number of new features
in the OpenGL ES API on iOS 7.
The first feature I'll
talk about is instancing,
and we support two
new extensions
to implement that feature.
We're also now supporting
texturing in the vertex shader.
I'll talk about why you
might want to do that
and how it can be done.
We're also now supporting
sRGB texture formats,
an alternate color
space that you can use.
I'll also talk in detail about
how you can utilize the API
and really optimize
it for your needs.
I'll give you an
in-depth understanding
of the GPU pipeline, which
should give you some insight
into the feedback that
our GPU tools provide.
[ Pause ]
But before I get
into any of that,
I just want to touch briefly
on a very important
topic: power efficiency.
So rendering requires power.
All the GPUs on iOS
are power efficient.
However, there's still
considerable power needed
to put vertices into the
pipe and spit out pixels.
The easiest thing that
your application can do
to conserve power is to manage
your frame rate appropriately.
You can use the CADisplayLink
API to sync to the display.
The display refreshes
60 times a second.
So that's really the maximum
frame rate you could possibly
achieve, but in many
cases, it really makes sense
to just limit your frame rate to
a steady 30 frames per second.
You can achieve some
smooth animations,
and you're conserving way more
power than rendering at 60.
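Here's a minimal sketch of
that setup; the drawFrame:
method name is just a
placeholder, and a frameInterval
of 2 gives you 30 frames
per second on a 60 Hz display.

    // @import QuartzCore; -- CADisplayLink lives there
    CADisplayLink *link =
        [CADisplayLink displayLinkWithTarget:self
                                    selector:@selector(drawFrame:)];
    link.frameInterval = 2; // fire on every 2nd vsync: 60 / 2 = 30 fps
    [link addToRunLoop:[NSRunLoop mainRunLoop]
               forMode:NSDefaultRunLoopMode];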
Additionally, it's not
necessary to render at all
if there's no animation
or movement in your scene.
You don't have to submit
vertices to the pipe
and have pixels produced
if you're just going
to show the same thing
you showed a sixtieth
of a second ago or a
thirtieth of a second ago.
Just blit what's already in
your buffers to the front
or don't even blit at all
because nothing's
going to change.
This is particularly important
with the multi-layered iOS 7 UI
where a lot of
compositing is going on.
The UI can skip this compositing
if nothing has changed
in the layer, thereby
saving some power
in the compositing operation.
[Pause] Alright.
I just wanted to
touch on that briefly.
Now I would really like to
get onto the meat of our talk
and some of the new features.
The first of which
is instancing.
If you're familiar at all with
the types of games that are
on the App Store, you'll know
that the Tower Defense
genre is quite popular.
In these games, you've got
hundreds of enemies trying
to storm your fortress.
The interesting thing about this
rendering is these enemies often
share the same vertex data
and use the same models.
They may be doing
something different.
Some may be running.
Some may be attacking you,
but it's still the
same base vertex data.
Also, maybe you've
seen an adventure game
where your hero's running
through a forest that's
densely populated.
It's got trees all about.
You've got trees in different
orientations with branches
in different configurations,
but, again,
all using the same vertex data.
They look distinct, however.
This type of rendering
is a prime candidate
for optimization
with instancing.
[Pause] Let me start
with a simple example.
I've got a gear model, and I'd
like to render it 100 times
on the screen as you see here.
Without instancing, what I
would do before iOS 7 is,
I would create a for
loop, and in this case,
I'm going down the width of
the screen via the X axis,
and then within that loop,
I'm going up the
screen on the Y axis.
For each iteration, I'm setting
a uniform with the position
of my gear, and then
drawing that gear.
That's 100 uniform sets and 100
draw calls, and as you may know,
draw calls consume
a lot of CPU cycles.
So it would be great if we
could trim that down a bit.
[Pause] Here's what
instancing does.
It allows you to draw
the same model many,
many times in a single
draw call.
Each instance of that model
can have different parameters.
You can have different
positions for each model,
a different matrix for each
model, or a different set
of texture coordinates.
Even though it's the
same vertex data,
these models can look
significantly different.
So there are two forms of
instancing that we're shipping
on iOS 7, the first of which
is using an extension called
APPLE_instanced_arrays,
and this allows you
to send these instance
parameters
down via another vertex array.
The second form is
Shader Instance ID,
and we support this via an
extension APPLE_draw_instance,
and the way this works is
there's a new built-in ID
variable in the vertex
shader that gets incremented
for each instance drawn within
the draw call that you made.
[Pause] Let me talk about
the first method here:
instanced arrays.
We're introducing a new call
glVertexAttribDivisorAPPLE,
which indicates the
attribute array that is going
to supply the instance data.
It also indicates the
number of instances
to draw before you advance to
the next element in this array.
You could, for example,
have ten instances
that use the same
parameter and then move
on to the next parameter,
but the most common case is
to send a unique parameter
down to each instance
inside your draw call.
Now we're introducing
two new draw calls
to use this form of instancing.
This includes
glDrawArraysInstancedAPPLE
and
glDrawElementsInstancedAPPLE,
and these work exactly the
same as the usual glDrawArrays
and glDrawElements, but
there's an extra parameter
which indicates the number
of instances you
would like to draw.
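For reference, here's a sketch
of how those entry points are
declared, assuming they mirror
the standard instanced-draw
signatures; the trailing
parameter is the instance count.

    void glVertexAttribDivisorAPPLE(GLuint index, GLuint divisor);
    void glDrawArraysInstancedAPPLE(GLenum mode, GLint first,
                                    GLsizei count, GLsizei instanceCount);
    void glDrawElementsInstancedAPPLE(GLenum mode, GLsizei count,
                                      GLenum type, const GLvoid *indices,
                                      GLsizei instanceCount);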
Alright. Here's our example.
We've got three vertex
arrays that have model data,
the first of which is the
position, the second normal,
and the third is vertex colors,
and we have an extra array
that I'll get to in a minute.
We set up our arrays the
same as we usually do.
We use glVertexAttribPointer
to specify the location
of the array.
It also specifies
things like the type,
whether it's unsigned
byte, float, etc.,
whether the elements in it are
normalized or unnormalized,
and the number of scalars or
number of values per element.
We do this for our per vertex
position here, and, again,
for our normal, and then a third
time for our vertex colors.
Now we also do the same
thing for this other array,
the instance positions, and,
additionally, we make a call
to glVertexAttribDivisor.
The first argument here
specifies it's attribute number
three that has our per
instance attribute data.
These are the per
instance parameters
that we'd like to send to OpenGL.
The second argument
here indicates
that each instance
will get its own value.
Alright. We've done the set up.
We're ready to draw.
This K argument here indicates
the size of our model,
the number of vertices
in our model.
It's the same as
in glDrawArrays.
The last argument
here, N, is the number
of instances we would
like to draw,
and since each instance
is getting a unique value,
we're setting it to the
same value as the number
of elements inside
this instance array.
Alright. We're ready
to submit our vertices
to the vertex shader,
and here's what happens.
That instance element gets sent
to the vertex shader, and it's used
for all of the vertices inside
of the vertex array
containing our model.
The second instance is
drawn, in the same draw call,
and we set the second
value here,
and all of the vertices inside
of the model are
submitted to the vertex shader.
They all use that same value
throughout the entire array.
And we go through all of the
instances in our instance array,
and all of them get a unique
instance value, and we submit
for every element inside
that instance array all
of the vertices in our model.
Here's the API setup,
just going over it again.
As we usually do, we
call glVertexAttribPointer
to indicate how we've
set up our model data.
We also call
glVertexAttribPointer
for this instance array
and glVertexAttribDivisor.
We're indicating that attribute
three is our instance array,
and we're iterating one
element for each instance.
Finally, we're ready to draw.
We call glDrawArraysInstanced
with the value 100
since we're going to
render 100 gears.
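Putting that together, here's
a minimal sketch of the setup
and the draw; the buffer and
count names are hypothetical.

    // Attributes 0-2 (position, normal, color) are set up with the
    // usual glVertexAttribPointer calls, not shown here.
    // Attribute 3 holds the per-instance positions:
    glBindBuffer(GL_ARRAY_BUFFER, instancePositionBuffer);
    glEnableVertexAttribArray(3);
    glVertexAttribPointer(3, 2, GL_FLOAT, GL_FALSE, 0, 0);
    glVertexAttribDivisorAPPLE(3, 1); // advance once per instance

    // K = vertices in the gear model (assumed); 100 instances:
    glDrawArraysInstancedAPPLE(GL_TRIANGLES, 0, K, 100);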
Here's the vertex shader.
As usual, we've got attributes
for our per vertex model data.
Here we've got position
and normal.
And another attribute which will
contain our per instance data.
Per instance position.
Not per vertex.
Per instance.
And we do a simple add
of that instance position
to the vertex position.
We're displacing all the
vertices by this constant value,
or at least it's constant
throughout that instance.
And, finally, we will transform
our model space position
into clip space by transforming
with our model view
projection matrix and output
to the built-in gl_Position
variable.
We also will do any other
per vertex processing
such as maybe computing
color via lighting
or generating texture
coordinates, etc. Alright.
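Here's that shader sketched as
a source string you would hand
to glShaderSource; the attribute
and uniform names are
assumptions.

    static const char *kInstancedVertexShader =
        "attribute vec4 position;\n"          // per vertex
        "attribute vec3 normal;\n"            // per vertex (lighting omitted)
        "attribute vec2 instancePosition;\n"  // per instance (divisor = 1)
        "uniform mat4 modelViewProjectionMatrix;\n"
        "void main()\n"
        "{\n"
        "    vec4 tempPosition = position\n"
        "        + vec4(instancePosition, 0.0, 0.0);\n"
        "    gl_Position = modelViewProjectionMatrix * tempPosition;\n"
        "}\n";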
Here's the second method.
This is using the
instance ID parameter.
We've built in this
gl_InstanceIDAPPLE variable
inside the vertex shader,
and it gets incremented
once for each instance.
You can use this ID
in a number of ways.
You can calculate unique
info for each instance.
You can use the standard
math functions that are available
in the vertex shader to
figure out unique details
of that instance, or you
can use it as an index
into a uniform array or
a texture, and I'll talk
about texturing in a vertex
shader in just a minute.
This method also uses the
same glDrawArraysInstanced
or glDrawElementsInstanced
as the previous method.
Here's how this works: We call
glDrawArraysInstancedAPPLE,
and the instance ID is
set inside the shader,
and it's the same value
for all the vertices.
It's incremented for
the next instance,
and we submit all the vertices
using the value of one.
Finally, we iterate through
the entire number of instances
until we get to the
Nth instance,
and we submit all the vertices
for each instance value.
And we can reference
that gl_InstanceID
within our vertex shader.
And here's what that looks like.
We use this gl_InstanceIDAPPLE
variable,
and it's actually
an integer value,
but we don't have integer math
in the OpenGL ES 1.00
shading language.
So the first thing we need
to do is cast it to a float
so that we can use our floating
point math operations on it.
And now we perform
a modulo of ten,
which will give us the x
position, and we multiply it
by a gear size, then
we divide by ten
to give us the y position.
Now we have an instance
position, which we can add
to our vertex and output
to this temp position.
And like the other method,
we will do our model view
projection matrix multiply,
which will put our
position into clip space
and give us a position
that we can output.
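Sketched as a shader source
string, with the grid width and
gear size assumed:

    static const char *kInstanceIDVertexShader =
        "#extension GL_APPLE_draw_instance : require\n" // directive name assumed
        "attribute vec4 position;\n"
        "uniform mat4 modelViewProjectionMatrix;\n"
        "uniform float gearSize;\n"
        "void main()\n"
        "{\n"
        "    float instance = float(gl_InstanceIDAPPLE);\n" // cast: no integer math
        "    vec2 gridPosition = vec2(mod(instance, 10.0),\n"
        "                             floor(instance / 10.0)) * gearSize;\n"
        "    vec4 tempPosition = position + vec4(gridPosition, 0.0, 0.0);\n"
        "    gl_Position = modelViewProjectionMatrix * tempPosition;\n"
        "}\n";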
[ Pause ]
So that was instancing.
The next feature is
vertex texture sampling.
Why would you want a
texture in the vertex shader?
It's not like you can see an
image [in the vertex stage],
right?
Well, there are a
number of uses for this.
The first and most obvious
is displacement mapping.
You can put an image
in memory and fetch it
in the vertex shader,
and if you've got a mesh,
you can take the values from
that texture and displace
that mesh with the
values in the texture.
You can also use it as an
alternative to uniforms.
Uniforms have a much smaller
data store whereas textures have
a very large data store
that you can now access
in the vertex shader.
Here's a height mapping example.
On the left, we've got our
grey scale height map image,
and on the right, we've
got the results of that.
And here's how we
implemented it.
First, we've got an x and
z position that we've sent
down via a vertex array.
Just X and Z.
No Y here, and we have
a height map sampler.
Now this looks exactly like it
would in the fragment shader.
This, however, is
a vertex shader,
and this height map is a
reference to a texture.
Now we sample from that texture
and get our Y value from it.
Now, it splats the Y value
across all four components
of temp position.
And so we overwrite
the X and Z values
with the X and Z positions.
Now we have X, Y, and Z
inside of our temp position.
The Y we just happened to
have gotten from the texture.
And as with our other shaders,
we can transform to clip space
and output to gl_Position.
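Here's roughly what that looks
like as a source string; it
assumes the X and Z positions
also serve as texture
coordinates in the 0 to 1 range.

    static const char *kHeightMapVertexShader =
        "attribute vec2 positionXZ;\n"   // only X and Z per vertex
        "uniform sampler2D heightMap;\n" // sampled in the vertex shader
        "uniform mat4 modelViewProjectionMatrix;\n"
        "void main()\n"
        "{\n"
        // vertex shaders use the Lod variant, since there are no derivatives
        "    vec4 tempPosition = texture2DLod(heightMap, positionXZ, 0.0);\n"
        "    tempPosition.xz = positionXZ;\n" // keep only Y from the texture
        "    tempPosition.w = 1.0;\n"
        "    gl_Position = modelViewProjectionMatrix * tempPosition;\n"
        "}\n";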
[ Pause ]
Alright. That's a
pretty simple example
of how you might use
vertex texturing.
As I mentioned, the more
interesting way you can use this
is to store just about
any kind of generic data
into a texture for
shader access.
It's really just a very large
store of random access memory.
Read-only random
access memory, that is.
Data normally passed in via
a glUniform can be passed
in via a texture.
There are a number
of advantages here.
It's a really, a
much larger store.
We support 4K by 4K textures
on most iOS 7 hardware.
Whereas uniform arrays are
limited to 128 uniforms,
at four values per uniform,
there's way more storage
inside of a texture.
This also potentially means
fewer API calls to set the data.
If you load your
texture at app startup,
and you have all these values
inside this large data store,
you can just bind the texture,
and it's set up for you to draw.
You don't have to load
a bunch of values to set
up for your draw call.
There's a bit more
variety in the types
that you can use, whereas
uniforms only allow you
to use 32-bit floats.
You can use unsigned byte,
half float, and float.
Any of the texture
types that you can use,
you can use for vertex
texture sampling.
You can choose the appropriate
type for the data that you'd
like to consume in your shader.
You can use filtering
with the texture.
Anything you can do
with the texture,
you can do with a vertex
texture, and filtering is kind
of nice because you can average
sequential values that are
in your texture,
and with wrapping,
you can actually
average the last value
in your texture with
the first value.
So you can do a wraparound
of averaging.
And because you can
render to a texture,
you can have the
GPU produce data.
Instead of just loading it
in from CPU generated values,
you can render to the
texture and then consume
that data in the vertex shader.
[Pause] Now I'd like
to show you a demo
with some of these features.
Here we have 15,000 asteroids
rotating about this planet,
and this is using what
we call immediate mode.
There is a draw call
for each asteroid here.
So that's over 15,000
draw calls.
Now we're running at
17 frames per second,
maybe 18 in some cases.
That's alright, I guess.
The real problem here is
that we're consuming
a lot of CPU cycles.
We're really leaving
nothing for the app so that
if you've got some logic
there, the frame rate's going
to slow down even more.
So what we'd like to do is
offload this to the GPU.
Here we have the
first improvement,
which is using instance
ID, the built-in variable
within our vertex shader.
Now what's cool about this
is we're actually rotating
or spinning each asteroid.
They all have unique
values, and,
obviously, unique positions.
And here's another mode
that we've implemented.
This uses the
glVertexAttribDivisor method,
and we're getting even a
slightly better frame rate here.
This is due to our pre-computing
all of the rotations
and position values
outside the shader,
and we're just passing them in.
We're not actually doing
much computation inside
of our vertex shader.
What's cool to note
about this is
that a few years ago
we presented this
on a Mac Pro with, I
don't know how many cores
and a beefy desktop GPU.
This is really pretty nice
that we are now showing this
to you on an iOS device.
[ Pause ]
[ Applause ]
[ Pause ]
Let me talk about some
implementation details here.
With that second mode
using the instance ID,
we calculate the transformation
matrix in the vertex shader.
First, we figure out a spin
value by doing a modulo
of our instance ID, and this
gives us some spin value
in radians and we can then use
the cosine and sine functions
to build a rotation matrix.
We then apply a translation
matrix
that gives us the
position of the asteroid.
We also use the instance
ID variable to figure
out the positions, and
then we create this matrix.
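Sketched as vertex shader
source lines; the scaling
constants here are made up
for illustration.

    static const char *kSpinSnippet = // part of the vertex shader body
        "float spin = mod(float(gl_InstanceIDAPPLE), 628.0) / 100.0;\n"
        "float c = cos(spin);\n"
        "float s = sin(spin);\n"
        "mat4 rotation = mat4( c,  0.0, -s,  0.0,\n" // rotation about Y,
        "                     0.0, 1.0, 0.0, 0.0,\n" // columns listed in
        "                      s,  0.0,  c,  0.0,\n" // column-major order
        "                     0.0, 0.0, 0.0, 1.0);\n";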
Now the matrix calculations
are done per vertex.
So even though this
matrix will be the same
for the entire asteroid,
which is about 30
to 60 vertices (I think
it's maybe a little bit
on the lower end),
that's at least 30 times
that we're calculating this
transformation matrix.
What we'd really like to do is
just create this matrix once per
instance, not per vertex.
This is what the instance
arrays method does.
We actually calculate
this matrix array up front
at app startup, or
all these matrices
up front at app startup.
We calculate positions
and rotations.
We stuff that into a
vertex array, and then set
up the vertex array with the
glVertexAttribDivisor call,
and pass the parameters
down for each asteroid,
not for each vertex.
There are a couple of
advantages and disadvantages
to each of these methods.
Using the instance ID method,
we're not using any memory
or really very little memory
because we're doing
all the calculation
as needed on the GPU.
Another advantage is
that you're using the GPU
as another computation device.
If you're not GPU bound, and
you need the CPU for a lot
of cycles, well, then
this may be the way to go.
But in general, if you have
a large number
of instances using the GPU, you
could potentially overload it
with computation, which would
really slow it down if you need
to do other computations.
So what we've got here
is a different method
where we use instance arrays.
Instance arrays are generally
faster than computing on the GPU
since you can save
cycles on the GPU.
There's a lot more flexibility
in types than with uniforms.
You can use any type that
a vertex array can use,
including bytes, unsigned
bytes, floats, half floats,
etc. Now there's a third method
that I didn't demonstrate,
but this would be to
use the instance ID
as an index into a texture.
So instead of passing parameters
down via a vertex attribute
array, you stuff them
into a texture and then fetch
using the instance ID variable
to get the location, the
position, and the rotation.
Now, as I mentioned before,
the textures are just this large
storage of random access memory.
It's often logically simpler
[to store data in a texture],
since you've got a 2D array,
to put tables or any other sort
of data inside of a texture.
So this is really cool
for bone matrices:
you can use the first row for
the arm matrix, the second row
for the other arm
matrix, the third row
for the leg matrix,
head, and so on.
So it's actually a lot
easier to use a texture
for your bone matrix parameters.
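As a sketch, fetching one texel
of per-instance data in the
vertex shader might look like
this; the texture layout and
names are assumptions.

    static const char *kInstanceFetchVertexShader =
        "#extension GL_APPLE_draw_instance : require\n" // name assumed
        "attribute vec4 position;\n"
        "uniform mat4 modelViewProjectionMatrix;\n"
        "uniform sampler2D instanceDataMap;\n" // one texel per instance,
        "uniform float instanceCount;\n"       // packed in a single row
        "void main()\n"
        "{\n"
        "    vec2 uv = vec2((float(gl_InstanceIDAPPLE) + 0.5)\n"
        "                   / instanceCount, 0.5);\n"
        // assumes the texture stores xyz offsets with w = 0:
        "    vec4 instancePosition =\n"
        "        texture2DLod(instanceDataMap, uv, 0.0);\n"
        "    gl_Position = modelViewProjectionMatrix\n"
        "                  * (position + instancePosition);\n"
        "}\n";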
[ Pause ]
So here's a summary
of instancing
and vertex texture sampling.
Instancing allows you
to draw many models
in a single draw call, which
is particularly important
because draw calls consume
a number of CPU cycles,
and even though it's the same
model that you're drawing,
they can look distinct
since you are passing
down different parameters
for each instance.
Vertex texture sampling:
just think of it
as a large data store for
random access read-only memory
in the vertex shader.
You can use it with
the instance ID
to fetch per instance
parameters.
These extensions and these
features are supported
on all iOS 7 devices.
[ Pause ]
OK. Let's move on to the third
OpenGL ES feature in iOS 7.
sRGB is an alternate
color space,
which is more perceptually
correct.
It matches the gamma
curve of displays.
If you're looking at blacks and
greys and whites, what you'd see
with the usual color space
is that you'd move from black
to grey much more quickly
than from grey to white,
which effectively means
that your brighter colors
are weighted more heavily
when you're doing averaging
or mixing of colors.
So it's not a linear
distribution.
There's weight on
some of the values.
sRGB compensates for this
by basically applying
an inverse curve
so that the darker colors
get a little bit more weight
than usual, and this allows
you to have a linear mixing
when your image is
presented on the display.
Here's some API details.
There are two external formats
that you would put your data in.
These are SRGB_EXT and SRGB_ALPHA_EXT.
There is an internal
format, SRGB8_ALPHA8_EXT,
and four compressed internal
formats that you can read
from that support
this sRGB color space.
Now the non-compressed
format here is renderable.
This allows you to do linear
blending or color calculations
in the shaders and have them
come up in a linear fashion.
You need to check for the
GL_EXT_sRGB extension string
because this is supported
on all iOS 7 devices
except for the iPhone 4.
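In code, a sketch of checking
for the extension and uploading
an sRGB texture; the pixel
pointer and dimensions here
are assumed.

    const char *extensions = (const char *)glGetString(GL_EXTENSIONS);
    if (extensions && strstr(extensions, "GL_EXT_sRGB")) {
        // With EXT_sRGB, both <internalformat> and <format>
        // are SRGB_ALPHA_EXT for an RGBA sRGB texture:
        glTexImage2D(GL_TEXTURE_2D, 0, GL_SRGB_ALPHA_EXT,
                     width, height, 0,
                     GL_SRGB_ALPHA_EXT, GL_UNSIGNED_BYTE, pixels);
    }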
This is a great new feature.
It's perceptually correct.
However, you don't want
to just turn this on.
You'll start getting some
things that may not look right.
You need to author
your textures for it.
You need your artists to keep
the sRGB color space in mind
so that when they're
actually presented,
they look as you
intended them to.
And you should only use these
sRGB textures for color data.
A lot of people encode normal maps
or just use an alpha
map perhaps.
You shouldn't even
use this for alpha.
Alpha is often thought
of as going with RGB,
but alpha should use
its own linear space.
[ Pause ]
Alright. So a lot of great new
features in the OpenGL ES API,
but you really need to have a
rock solid foundation before you
start adding to your
rendering engines.
And, fortunately, Apple provides
a slew of excellent GPU tools
to help you build
this foundation.
The first tool I'd like to talk
about is the OpenGL ES
frame debugger.
It allows you to capture a
frame of rendering and debug it
and play with it and
experiment with it.
Now, there are a ton of
widgets here,
and I'll just go
over a few of them.
The first thing I'd like to
point out is the scrubber bar.
So you've captured a
frame of rendering,
and the scrubber bar
allows you to position
on a particular call
through your frame.
You can stop at a draw call or
a bind or a uniform set, etc.,
and you can see what
has just been rendered.
You can see your scene as
it gets built up not only
in the color buffer,
which is on the left,
but also the depth
buffer on the right,
and whatever you've
just rendered,
the results of the last draw call
you've made, shows up in green.
[ Pause ]
You can also examine all of
the contents of context state
at a particular call
inside that frame.
You can see everything
in the context,
the whole state vector
of OpenGL ES.
Everything that's bound,
the programs, textures,
etc. Your blend state,
your depth state,
whatever state you'd like.
If you think something
may be going wrong
with the state vector, you
can search in there for it.
But what's even nicer
is that in Xcode 5,
you can now view the
information that pertains
to the particular call
that you're stopped on.
Instead of looking through
all of the context state,
you can look at what's really
useful to you at the moment.
Here, I am stopped at
a glUseProgram call.
And so now I can look at
all of the information
that pertains to
that GLSL program.
All the uniforms
and their values,
what attributes are
necessary for that program,
etc. You can set that view in
the lower left-hand corner here.
There's this auto
variables view,
and this is new with Xcode 5.
[ Pause ]
You also have an object viewer.
You can view any of the
objects in the OpenGL context.
You can view textures,
vertex buffer objects,
and I think the most
powerful feature here,
the most powerful object
viewer is your shader viewer.
And you can take a look at the
shaders and edit your shader
within it, and hit
this button here
on the lower left-hand corner,
which will compile your
shader immediately,
apply it to your scene, and then
you can see how it has changed
your rendering.
[ Applause ]
So this allows you to experiment
and even debug shader
compiler errors.
As you see here, I've got use
of an undeclared variable,
and it flags my error, and I can
go ahead and fix it right away.
[ Pause ]
So an often overlooked feature
of the OpenGL ES frame debugger
is the OpenGL issues navigator.
Here we point out a number
of things that you could do
to improve your rendering.
There's also some
information about things
that may cause rendering
errors, but more importantly,
there is a lot of information
about how you can
improve your performance.
Also in Xcode 5, we have the
performance analysis page,
which allows you
to hit this button
in the upper right-hand
corner, and we'll run a couple
of experiments on your frame
and figure out what bottlenecks
that you've got, whether you're
vertex bound, fragment bound,
etc., and there are
some helpful suggestions
as to what you might
like to do next.
It also gives you
some information
such as whether your GPU is
pegged or your CPU is pegged.
So a lot of useful
information here as well.
[ Pause ]
And new in Xcode
5 is the ability
to break on any OpenGL error.
Now, what you used to have to
do is add a glGetError call
after every single OpenGL call
to stamp out these errors
and figure out if your OpenGL call
produced some sort of error
because you sent in
some bad arguments
or the state wasn't
set up properly.
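That old pattern typically
looked something like this
hypothetical macro.

    // The old, tedious approach: wrap every GL call to catch errors.
    #define GL_CHECK(call)                               \
        do {                                             \
            call;                                        \
            GLenum err = glGetError();                   \
            if (err != GL_NO_ERROR)                      \
                NSLog(@"%s -> 0x%04x", #call, err);      \
        } while (0)

    GL_CHECK(glBindTexture(GL_TEXTURE_2D, texture)); // name hypothetical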
Well, you don't have
to do this anymore.
In the lower left-hand
corner here,
you can just say add OpenGL ES
breakpoint, and any OpenGL call
that produces an error
will break immediately,
and you can immediately fix it.
[ Applause ]
We also have the OpenGL
ES Analyzer instrument,
and there are a number
of very helpful views
for improving performance.
And a very powerful part
of the OpenGL ES Analyzer is the
OpenGL ES Expert, which points
out more information, more
things that you can do
to improve the performance
in your application.
This points out a lot of
data that is very similar
to what comes up in
the issues navigator.
While the issues navigator can
actually run some more in-depth
experiments and give
you more data,
it can only analyze one frame,
whereas the OpenGL ES Expert can
analyze multiple
frames of rendering.
[ Pause ]
We offer a number of tools
that really provide
an excellent means
for debugging your rendering.
Additionally, with
the OpenGL ES Expert,
the performance analysis
page and the frame debugger
with the issues navigator, we're
providing lots of valuable data
to improve performance.
But there is a lot of
data coming at you,
and it can be difficult to
digest and assess the severity
of the issues that come up.
So I think it would be helpful
if I can give you a more
in-depth understanding
of how OpenGL works
and, in particular,
how the GPU beneath it takes the
vertex data and transforms it
into pixels on the screen.
That way, you can keep the
OpenGL architecture in mind
when you're designing your
rendering architecture
and really assess the severity
of issues that crop up.
[ Pause ]
I'm going to give
you an overview
of the GPU architecture now.
All of the iOS GPUs are
tile-based deferred renderers.
They are high-performance,
low-power GPUs,
and the TBDR pipeline is
significantly different
than that of traditional
streaming GPUs
that you would find on the Mac.
There are a number
of optimizations
to reduce the processing load,
which increase performance
and really save lots of power.
Very important on
these iOS devices.
Now the architecture
depends heavily on caches
because large transfers
to unified memory are costly
not only in terms of performance
and latency, but also
in terms of power.
It takes a lot of power
to reach out across the bus
and grab something back in.
So we have these very
nice, significantly large,
caches so that we can do
a lot of work on the GPU.
There are certain operations
that developers can do
that can prevent
these optimizations
or cause cache misses.
Fortunately, these operations
are entirely avoidable.
[ Pause ]
What I thought I'd do
is take you on a trip
down the tile-based
deferred rendering pipeline,
and along the way, I'll
point out some issues
that you may stumble across
and describe what's going
on when we warn you
about these issues.
Let's start out with
the vertex processor.
On your left, you've got the
vertex arrays that we've set up.
Hopefully, you've used
a vertex buffer object
or a vertex array object
to encapsulate this data.
And we issue a draw call,
which begins this trip
down the pipeline.
We shade the vertices,
transform them into clip space,
and actually also apply the
view port transformation
so that they're now
window coordinate vertices.
The vertices are shaded and
transformed, as I mentioned,
and stored out to
unified memory.
[ Pause ]
Now a frame's worth of
vertices is stored.
Unlike a traditional
streaming GPU
where it only needs three
vertices to produce a triangle
to go onto the next stage
and start rasterization
and fragment processing,
we defer all of that work
until you call
presentRenderbuffer
or somehow change the
render buffer another way,
by either binding a render
buffer or changing an attachment
to a frame buffer object.
Let's say now we call
presentRenderbuffer.
Then, and only then, is when
we move to the next stage
of the pipeline, which
is the tiling processor.
Every render buffer
is split into tiles.
This allows rasterization
and fragment shading to occur
on the GPU in little tile-sized
pieces of embedded memory.
We can't push the entire
frame buffer onto the GPU;
that's just way too large.
So we just split up this render
buffer into much smaller tiles,
and then we can render
to those one by one.
Here's what the tile processor
does: It works in groups
of triangles, and it figures
out where the triangles
would be rendered here.
Which tile they'll go to.
The larger triangles, which
intersect multiple tiles,
may be binned into
these multiple tiles.
[ Pause ]
And then we're ready
for raster set up,
or set up for the rasterizer.
Here's the first issue
that you could run across -
logical buffer load, and
here's what this means.
The rasterizer uses tile size
embedded memory, as I said.
Now if there is data already
in this render buffer,
the GPU needs to load
it from unified memory
because you're going
to write on top of it.
This is pretty costly, OK.
We need to reach out
across the bus, pull it in.
Same for the depth buffer:
if there is data in it,
we also need to pull it
in from unified memory.
Fortunately, you
guys can avoid this.
Loading tiles is called
a logical buffer load,
and you can avoid such
a logical buffer load
if you call glClear
before your rendering.
The driver knows that there is
nothing important out in memory
since you're clearing the buffer
so it can just start
rendering to this tile memory.
Great. No load necessary.
Very fast.
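Concretely, that's just a clear
at the top of the frame, before
any drawing; the framebuffer
name is hypothetical.

    // Clearing tells the driver the old contents don't matter,
    // so no logical buffer load is needed:
    glBindFramebuffer(GL_FRAMEBUFFER, framebuffer);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    // ... draw the frame ...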
[ Pause ]
Logical buffer loads can happen
in some less obvious ways.
For instance, if we render to a
texture, render to a new buffer
or a new texture,
and then render
to that first texture again.
Here's what happens: we
render to our texture.
Now we want to render
to a new texture.
We clear it, and render to that.
Great. Now we would like to
render to our first texture.
Well, logical buffer load.
Need to load both the color
buffer and depth buffer.
Developers should avoid frequent
switching of render buffers.
Complete your rendering
to one buffer before
switching to another.
Don't just say, "hey, you know,
I've finished a pretty
good amount of rendering.
Let's just switch my buffer.
Go out and render something new,
and then now I'd like to go back
to that first buffer."
You'll get this tile thrashing
that I've just described.
[ Pause ]
Rasterization.
We're ready to actually do
some further computation.
The GPU reads the triangles
assigned to the tile,
and it computes the X
and Y pixel coordinates
and the Z value,
the depth value.
The fragment shader
is not yet run.
Positions and depth
are calculated only.
This allows an optimization
called hidden surface removal.
Now let's say we
submit a triangle,
and it's partially obscured
by another triangle.
Well, a portion of that
triangle is hidden.
We don't need to run
the fragment shader
on that hidden portion.
That saves us from
fragment shader processing.
We can reject those fragments.
Now this is why we
deferred all the rendering
until you called
present render buffer.
We have the entire
frame's worth of triangles.
That's potentially a lot of
fragments that we can reject.
[ Pause ]
But you can get this warning.
Loss of depth test
hardware optimizations.
Loss of hidden surface removal.
It's really costly
to enable blending
or use discard in the shader.
Lots of times we like to
use discard for things
like implementing an alpha test,
but it defeats the hidden
surface removal optimization.
We submit a triangle
that maybe is blended
and transparent,
so you can see stuff behind it.
We need to run that fragment
shader even for triangles
that are behind that
other triangle.
The shader must run
a lot more times.
This costs
performance and power.
We're doing a lot
more processing.
Therefore, you guys need
to be judicious in your use
of discard and blending.
Allow the GPU to reject as
many fragments as possible.
[ Pause ]
Next up, we can perform
fragment shading.
And what's great about
the TBDR renderer is that,
if the hidden surface removal
algorithm is allowed to work,
we only need to run the fragment
shader on each pixel once.
It doesn't matter how
many layers of triangles.
Doesn't matter what your
depth complexity is.
Only one fragment shader
is run on each pixel.
The fragment processor shades
and produces color pixels,
and those colors are written
to the embedded tile
memory on the GPU.
Now we're ready for
tile storage.
[ Pause ]
Alright. The tile is stored
into unified memory,
and once all the
tiles are processed,
the renderbuffer
is ready for use.
You can present it to the user
on the screen or you can use it
as a texture for another pass.
Storing a tile to unified memory
is called a logical buffer
store, and each frame
needs at least one.
It's considered a frame because
you've presented your buffer
to the user, and that requires
a logical buffer store.
However, you can
get this warning -
unnecessary logical
buffer store.
And here's what that's about.
A depth buffer only
needs to be stored
if you're using an
effect like shadowing
or screen space ambient
occlusion.
In general, if you're not
using an effect like that,
it doesn't need to be stored;
it's unnecessary to push it
out to unified memory.
So developers could call
glDiscardFramebuffer
to skip this logical buffer
store on the depth buffer.
It's simply flushed away.
We don't need that after
rendering is complete.
The same thing for multisample
anti-aliased renderbuffers,
and this is particularly
important because these are big.
A 4x multisample render
buffer has four times the amount
of data as a regular
color buffer.
Fortunately, you guys don't need
the pre-resolved MSAA buffer.
What you need is the
resolved, much smaller tile
that you can store
out to unified memory.
Not the large tile that
has not been resolved yet.
You can call
glDiscardFramebuffer
for the MSAA color
buffer as well.
Same thing for depth.
Don't need the MSAA
depth buffer.
Call glDiscardFramebuffer
on the MSAA depth buffer.
Don't store that out.
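Here's a sketch of that
resolve-and-discard sequence
using the EXT_discard_framebuffer
and APPLE_framebuffer_multisample
entry points; the framebuffer
names are assumed.

    // After drawing: resolve the multisample buffer into the
    // single-sample renderbuffer, then discard what we don't need.
    glBindFramebuffer(GL_READ_FRAMEBUFFER_APPLE, msaaFramebuffer);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER_APPLE, resolveFramebuffer);
    glResolveMultisampleFramebufferAPPLE();

    const GLenum discards[] = { GL_COLOR_ATTACHMENT0, GL_DEPTH_ATTACHMENT };
    glDiscardFramebufferEXT(GL_READ_FRAMEBUFFER_APPLE, 2, discards);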
[ Pause ]
We finished our trip down
the tile-based deferred
rendering pipeline.
Here are some takeaways.
Hidden surface removal is
a really unique strength
of this architecture.
It greatly reduces workload,
which saves power and
increases performance.
There are certain
operations, however,
that defeat this HSR
process: alpha blending
or using discard in the shader.
But I'm not saying you
shouldn't use them.
There are some really cool
effects that you can achieve
by enabling blending
or using discard,
but there are some
preferable ways to use them.
First of all, draw all your
triangles using discard
or blending after
triangles that do not.
Hidden surface removal
can at least be used
for the triangles in
that opaque group.
Additionally, trim the
geometry around the triangles
that need this sort
of operation.
If you've implemented
an alpha test,
make sure you wrap your
alpha-tested object
so that you produce
fewer fragments
that need this operation.
It's worth adding more vertices
to reduce fragments
that need them.
[ Pause ]
Also, we've seen that transfers
between the unified memory
and the GPU are expensive, and
the best thing that you can do
to avoid them is to call glClear
to avoid the logical
buffer loads
so that the GPU can just
simply start rendering.
It doesn't need to read
the framebuffer.
Also avoid frequent
render buffer switches,
which can cause tile thrashing.
And avoid logical buffer stores.
Use the glDiscardFramebuffer
call,
especially for large
multisampled
anti-aliased buffers.
[ Pause ]
There are a couple of
things that didn't fit
on that pipeline diagram,
and I want to point
those out to you now.
The first is dependent
texture sampling.
Now this happens if you
calculate a texture coordinate
in the fragment shader and
then sample from that texture
with the texture function.
Here I've got our texture
sampler and two varyings here,
and the first thing I do is
I add these values together
to produce an offset coordinate,
and I use this offset coordinate
in the texture function.
Because it's a result of two
previously-calculated varyings,
we now are making
a dependent fetch
or a dependent sample
or dependent read.
Here's a more devious example,
a much less obvious example
of a dependent texture read.
Some developers get clever, and
they think, "hey, you know what,
I've got two textures
I want to sample from,
and I only need two scalars
for a 2D texture coordinate
for each texture.
What I'm going to do is pack
them into a single vec4.
So I've got an S and
T texture coordinate
in the first two components
of the vec4 and another S
and T texture coordinate
in the second two
components of the vec4.
And then what I'm going to do is
I'm going to use the first two
as the first texture coordinate,
make the first texture fetch
with the X and Y and then
a second one with Z and W."
Now these are actually
both dependent reads.
Because what happens is the
texture coordinates need
to be converted first
from a vec4 to two vec2s.
This is happening
all under the hood.
You don't actually see it,
but there is some
calculation being done
which makes these
dependent texture reads.
[ Pause ]
Here's why it's bad.
There's a high latency to sample
a texture in unified memory.
Now we avoid this latency
when you're not doing a
dependent texture read
because the rasterizer
says, "Hey,
this triangle uses a texture
in this fragment shader,
and we've already
got the coordinates.
So let's signal out to memory
and pull that data back in,
and soon as we start
that fragment shader,
we'll have the data."
We can't do that if you're
calculating the texture
coordinate in the shader.
The shader stalls.
It waits for the data
to come back to it.
So minimize your
dependent texture samples.
Hoist your calculation.
Do it in the vertex shader if
possible, put it in a uniform
or put it in the vertex array.
Try to avoid putting
the calculation
in the fragment shader.
Here's the fixed version of
that devious shader here.
We've now split that
vec4 into two vec2s.
There are no calculations done.
We simply fetch using these
two separate variables.
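Here's roughly what that fixed
fragment shader looks like;
sampler and varying names
are assumed.

    static const char *kIndependentReadFragmentShader =
        "precision mediump float;\n"
        "uniform sampler2D texture0;\n"
        "uniform sampler2D texture1;\n"
        "varying vec2 texCoord0;\n" // each was half of a packed
        "varying vec2 texCoord1;\n" // vec4 before the fix
        "void main()\n"
        "{\n"
        "    vec4 color0 = texture2D(texture0, texCoord0);\n" // independent
        "    vec4 color1 = texture2D(texture1, texCoord1);\n" // independent
        "    gl_FragColor = color0 * color1;\n"
        "}\n";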
[ Pause ]
Alright. Here's another
warning that shows up.
Fragment shader dynamic
branching
or also vertex shader
dynamic branching.
Here we've got varyings and
attributes that vary from vertex
to vertex, and because
they vary,
it becomes a little bit
difficult for the GPU to manage,
because the outcome of the test
in the if statement can differ
for every vertex or fragment.
Here's why it's difficult.
GPUs are highly
parallel devices.
They can process multiple vertices
and fragments simultaneously.
We need a special
branch mode for execution
of a dynamic branch, and
this adds a bit more latency
for the parallel
device to stay in sync.
If it's possible, calculate
the predicate of your
if statements outside
of the shader.
A branch on a uniform does
not incur that same overhead
because it's constant across all
of the vertices or fragments.
All of the shader execution.
And really if there's a shader
that uses both a
dependent texture sample
and dynamic branching,
this adds a lot of latency
and can be really costly.
Really look for that.
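For example, deciding on the
CPU and branching on a uniform,
with names assumed, keeps the
branch constant across the
whole draw.

    static const char *kUniformBranchFragmentShader =
        "precision mediump float;\n"
        "uniform bool useDetail;\n" // set via glUniform1i: cheap branch
        "uniform sampler2D baseMap;\n"
        "uniform sampler2D detailMap;\n"
        "varying vec2 texCoord;\n"
        "void main()\n"
        "{\n"
        "    vec4 color = texture2D(baseMap, texCoord);\n"
        "    if (useDetail)\n" // same outcome for every fragment
        "        color *= texture2D(detailMap, texCoord);\n"
        "    gl_FragColor = color;\n"
        "}\n";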
[ Pause ]
OK. I've talked a lot about how
to utilize the GPU
to its fullest.
You also really want to get to
the GPU as quick as possible
and minimize the CPU overhead.
And as you may know, a lot of
time is spent in draw calls.
But what's less obvious is
that while state setting
looks inexpensive,
if you make a bind
call or an enable call
or a use-program call, and
you profile that or add timers
around it, it doesn't
look like much time,
but that's because
a lot of that time,
a lot of the work is
deferred until draw.
We don't actually do a lot
of processing during
the state setting.
It's all done later on.
The more state you
set before a draw,
the more expensive
that draw becomes.
So maximize the efficiency
of each draw,
and the tools give you a
couple of warnings of ways
that you can reduce the
overhead for a particular call.
Redundant call and inefficient
state update are these two
warnings you should
look out for.
And what you can do is
there are some algorithms
such as shadowing state.
Keep the state vector
that you've been changing
in your application and
don't set it in OpenGL
if you've already set it.
Also a more elegant algorithm
is to use state sorting,
which minimizes the
number of state sets.
You can use a state
tree, for example,
and only set the
expensive states once,
and draw with a unique
vector each time.
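A minimal sketch of shadowing
one piece of state; the cache
variable is, of course,
hypothetical.

    // Shadow the currently bound program; skip redundant sets.
    static GLuint sCurrentProgram = 0;

    static void UseProgramCached(GLuint program)
    {
        if (program != sCurrentProgram) {
            glUseProgram(program); // only touch GL when state changes
            sCurrentProgram = program;
        }
    }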
[ Pause ]
However, there is some
fixed overhead for a draw.
It doesn't matter
how little state
setting you do.
We still have to do
some state validation.
We need to check that
the parameters you've set
in the draw are appropriate for
the state that has been set,
and we need to make a call to
the driver, and the driver needs
to do some calculations to
convert to hardware state.
So minimize the number
of draw calls you make.
The most obvious way
is to not draw things
that don't show up
on the screen.
Cull your objects.
You can use frustum
culling if it's a 3D scene.
Just draw things that are
in the area of visibility,
and don't draw things that are
not in the area of visibility.
You can combine your draw
calls via instancing,
which I talked about
a lot earlier.
And also vertex batching
and texture atlases.
[ Pause ]
Here's a way to reduce
your binds.
What we would normally do is
we'd have these four models
and four textures.
We would bind, draw, bind, draw,
bind, draw, and bind and draw.
Now that's four binds, four
draws, and each draw needs
to validate that that bind
made sense for that draw.
We can reduce the number of
binds, create a texture atlas
by combining all of
these textures into one.
Simply bind once, then we can
draw, draw, draw, and draw.
Great. We can even go
further and combine our draws,
which would allow us to
bind once and draw them all.
This would require us to
combine all of our vertex data
into one vertex buffer object.
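In code, the after picture is
as simple as it sounds; the
names here are assumed.

    // Before: bind, draw, bind, draw, bind, draw, bind, draw.
    // After: one atlas, one combined VBO, one draw.
    glBindTexture(GL_TEXTURE_2D, atlasTexture);
    glBindBuffer(GL_ARRAY_BUFFER, combinedVertexBuffer);
    glDrawArrays(GL_TRIANGLES, 0, totalVertexCount);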
[ Pause ]
There is a new texture
atlas tool.
Sprite Kit is a new framework
in iOS 7, and it is mainly
for 2D games, but there
are some nice tools
that we can take
advantage of in OpenGL.
The texture atlas tool
combines images efficiently,
and it produces a property
list denoting the subimages.
You can scale your
texture coordinates based
on this property list, enabling
you to render your 3D models
with this texture atlas
that has been produced.
This texture atlas
tool comes with Xcode.
[ Pause ]
For more information, you
can talk to Allan Schaffer,
our graphics and games
technologies evangelist,
and there's some
excellent documentation
on our developer site.
You can also contact the
community via the developer
forum, and there are
some engineers that lurk
on those forums as well.
So you can get your questions
answered in a lot of detail.
There are a couple
of related sessions.
There were two Sprite Kit sessions
that happened yesterday,
but you can catch
the video of them.
And the Sprite Kit sessions
talked a little bit more
in detail about their
texture atlas tool.
Later on in the afternoon
there is
"What's new in OpenGL for OS X."
OpenGL ES is derived
from its big brother
on the desktop world.
So you can get a bigger picture
of what's happening
in 3D graphics there.
[ Pause ]
In summary, you want to reduce
your draw call overhead,
use the techniques
including instancing
and texture atlases to do that.
Consider the GPU's operation
when you're architecting
your rendering engine
and in your performance
investigations.
The GPU tools really
help greatly
in this effort while the
tile-based deferred rendering
architecture has some
special considerations
that you want to think about.
Thank you very much.