Transcript
>> Matt Collins: Good morning.
Thanks for all coming this morning.
A little early, I see a few bleary eyes out there.
Today we're going to talk about OpenGL on Mac OS.
We're going to give the desktops some love.
So let's begin.
My name's Matt Collins, I'm a member
of Apple's GPU Software Team,
we work on the OpenGL software stack and the driver stack.
So today I'm going to go over a little bit about things
we added in 10.6.3 and some of the other software updates.
First you might be wondering where we're at.
So you wonder, I want to make a Mac game
or some 3D app on the Macintosh and I want
to know what I can target, what kind of features exist.
Want to know what's new in 10.6.3 and 10.6.4.
You may have heard of new extensions or
new features we've added for you guys.
So we'll go over those.
And you might want to know what does that mean for your app.
What's new, what's cool, how can you leverage it.
We'll also talk a little bit about performance.
So say you got your rendering looking great,
and now you want to get a little bit
of performance, you want some speed.
So I'll go over tips and tricks, and
we'll make your app really shine.
And lastly, we'll do some pretty pictures, because
I'd be remiss if I said I'm going to come up here
and talk about graphics, but it's
going to be nothing cool to look at,
so we'll have some cool techniques,
rendering techniques, and cool demos.
First let's talk about OpenGL.
Give you a little background.
You may have gone to some of the
early talks, talking about OpenGL ES.
This is the lowest level access to the graphics hardware.
So on the desktop if you want to get the GPU's
power and really use it, you've got to use OpenGL.
Most of our other frameworks are built on top of it.
So everything else, like Core Image, Core Animation,
Quartz Composer, they're all built on top of OpenGL
and they all use it to leverage the GPU's power.
So let's talk about last time at WWDC.
So some of this should be a quick recap for some of you.
You might have heard of buffer objects, vertex
buffers, index buffers, frame buffer objects,
fixed function pipeline, multitexturing, shader
pipeline, and you've heard of the vertex shader,
the geometry shader, the fragment shader.
This should all be a quick overview.
We'll go over some of these topics in more detail and if
you have any questions you can always come and talk to us.
So that's where we were.
But where are we now?
Well, now we have new extensions, new features
that give you better access to hardware functionality.
These are things your GPU could already
do but you didn't have good access to.
Most of this is 10.6.3 and above only.
So if you want to target these cool new
things you've got to use 10.6.3 and above.
But first some advice.
The first piece of advice, and you may have heard this at
some of our other talks is use generic vertex attributes.
So when you send a vertex down to the
GPU to render they have attributes.
This could be the position, the
color, the normal, et cetera.
And those are all built in to OpenGL.
But we're going to tell you to use the generic ones.
This is because this is the best way to
use a shader, and shaders are native.
So if you have a fixed function which is the
old style rendering, glBegin, glEnd, et cetera,
those are all actually emulated in the drivers.
If you went to Dan Omachi's talk you
may have heard that whenever you set
up fixed function state your driver will actually
generate a shader behind the scenes to emulate that state.
And even better, you can port to OpenGL
ES 2.0 if you use generic attributes,
because OpenGL ES doesn't have any of
the old fixed function stuff at all.
Now let's talk about some new things.
This is all stuff that's now available in 10.6.3.
First we have a set of extensions that are here
really to make your life easier, compatibility.
So provoking vertex, vertex array BGRA, depth buffer float.
This is all to help you with compatibility and porting.
Next we have a set that's really about empowering your app.
So some new features that allow you to do
techniques you just couldn't do before.
Frame buffer objects, texture arrays, and instancing.
And last we have a set for performance and memory,
conditional rendering, different texture formats,
texture RG, two-channel textures, packed float which
is mainly for HDR, and shared exponent textures.
So we'll go over these in detail individually.
And we'll do some learning.
So first I'll talk about an extension,
I'm going to tell you what it does.
Then I'll tell you why you should care.
Lastly, I'll show you a demo of
something cool that you can do with it.
So let's get started.
First we'll talk about the flexibility
of the compatibility extensions.
The first one is provoking vertex
selection, this is EXT_provoking_vertex.
Now who here is familiar with the term provoking vertex?
Anyone? That's good.
Okay, well I'll give you a little explanation.
When you draw something you can either
have it smooth shaded or flat shaded.
A flat shaded thing is the same color.
So if you have a square, a quad
that's flat shaded, excuse me --
the color has to be chosen from one of
the vertices that make up that quad.
And that is the provoking vertex.
Now in OpenGL, normally the provoking
vertex would be the last one.
This lets you select which one is the provoking vertex,
so you can tell which vertex supplies that attribute.
You have a new entry point, very simple glProvokingVertex.
You can tell it pick the first or pick the last.
Couldn't be easier.
Now I brought up quads.
What about quads?
Well quads are hardware dependent, because all of
our GPUs, they're really good at rendering triangles.
You can make a quad with two triangles, but that's not
how the hardware works, you know, on the very basic level.
So the behavior for a quad is hardware dependent.
And you can query this with this enum
GL_QUADS_FOLLOW_PROVOKING_VERTEX,
and it will tell you whether your
setting is either ignored or obeyed.
Now that's interesting, but why do you care.
Well, for better flexibility, sure.
But mainly it allows you to pick which
vertex you're pulling your color,
your attributes from, without modifying your art assets.
So let's say you had a game and
it had some cool particle system.
And when a Jeep went across the
ground it kicked up a bunch of dust.
So you have a particle spray, and it's a gray dust cloud,
but maybe you want a red dust cloud or a green dust cloud.
You could set a color on one of the vertices for that
particle, and you could get a colored dust cloud.
So here's an example of a flat-shaded gear.
Now when you think about it, you don't just have to use
color, you could use anything to be flat shaded, right?
So here you have an example of
normals being visualized.
So even the normal can be flat shaded.
It's anything you want.
It doesn't just have to be color.
This is just an example of how you
could use anything to be flat shaded.
Next let's talk about BGRA ordering.
Now in OpenGL normally when you
provide a color it's going to be RGBA.
Other APIs may not work this way.
So this allows you to specify colors in BGRA order.
Again, you can use something without
modifying your art assets.
Something a little strange about this extension,
though, is that you actually supply GL_BGRA
as a size parameter, not as -- there's no order parameter.
So normally you would specify ColorPointer,
SecondaryColorPointer, or VertexAttribPointer,
and I put VertexAttribPointer in gold because that's
how you supply a general vertex attribute which is good.
But if you use this, the size is implied to be four, even
though you're actually setting the size as GL_BGRA.
And this has to be unsigned bytes only, because it
has to do with how you're packing your stuff in.
So here's some code, it's very simple.
You bind a buffer and your color VBO name.
If you've gone to some of the other talks
they talk about the importance of using VBOs.
So I put all my colors in this VBO, now I bind it.
And now I say VertexAttribPointer.
I give it the index, which is the
index of my attribute in my shader.
And then instead of size I have GL_BGRA and UNSIGNED_BYTE,
FALSE means I don't want it to be normalized, 0, NULL.
Pretty simple.
Next we'll talk about some floating point depth buffers.
This is just a couple new formats for your floating point
depth buffer -- or for your depth buffer in general.
It allows your depth buffer to be floating point, 32 bits of float,
and there's also a 32-bit with an 8-bit stencil.
New type. And one thing to keep in mind is you notice
32-bits plus 8 bits, that's more than a single double word.
So you're going to be using a lot more
space if you use a 32-bit depth buffer
and an 8-bit stencil buffer, something to keep in mind.
Now why would you want to do this?
Well, it's mainly for very deep scenes or very small scenes.
So keep in mind if you're rendering, like, the
universe, that's really, really deep, right?
So a floating point number is going to
be better at doing something like that.
Or if you're rendering something really, really small, like
the insides of an atom and you want something that's .0001,
you know some extremely small number, a floating point
number is also going to excel at doing something like that.
This is worth keeping in mind, because the standard
depth buffer actually has better precision closer
to the near plane.
So you think of your depth, your
depth is actually projected.
So it's the Z value over the W value.
And that results in a curve with greater precision,
better precision closer to the near plane.
Floating point numbers also have
better precision closer to 0.0,
so when you use this, you have
to keep both of those in mind.
That's something to watch out for.
All right, now let's talk about empowerment.
These are some techniques, some extensions to allow
you to use techniques you couldn't use before.
So the first one I'd like to talk about is array textures.
An array texture is an array of 1D or 2D
images with each layer being a distinct image.
There's no filtering between layers and
you have distinct mipmaps per layer.
This allows you to do some cool things.
It also means you have to use the programmable pipeline.
So the only way to use an array texture is to use a shader.
So you also have new texture targets,
1D array and 2D array, and new samplers.
So you sample these just like you'd
sample a normal texture except that --
so say the 2D case, the third texture
coordinate will actually pick the layer.
So you have layer 0, layer 1, layer 2, et cetera.
And in the 1D case you actually have a 2D texture
coordinate and the second coordinate picks the layer.
Now why is this interesting?
Well, it allows you to store a unique
data slice per layer of this texture.
This is completely unique and it doesn't
-- isn't touched by any of the other ones.
It's like a distinct image you
literally have an array of 2D images.
So you think well 3D texture is kind of like that.
It's a volume texture.
I could just think of it as unique things.
But that doesn't quite work because
you can't mipmap each layer.
If you've ever tried mipmapping a 3D texture
everything becomes this blob that's, you know,
an average of all the layers smashed together.
So let me show you a demo of this technique.
So this is a little terrain demo I wrote.
And you can see here there's a
couple little guys on the mountains.
And there's the water and below that there's a little
gray and some green grass going up to the snow.
Now you've probably seen terrain demos like this,
but the cool thing is this is actually one texture that's an
array texture with four layers depending on the elevation.
So there's a rock level, a grass level,
a more mountainy one, and then the snow.
And I can dynamically change the terrain and it will
actually update which texture is being sampled from.
So you can see that part that went up, and the
top part has snow and the bottom part has grass,
and it sort of blends in between the
layers, and that can go down, changes again.
So this is really cool for, like, a dynamic
terrain engine, because you have one texture,
and it will automatically texture correctly
based on the elevation of the point.
So there's actually sample code that
you can download and check this out.
I highly encourage you to do so.
It's a pretty cool technique.
We'll come back to that a little bit later.
Next I want to talk about instancing.
How many people have heard of instancing.
A few people, that's good.
So we expose instancing with our instanced arrays.
And this allows you to reuse primitives
with a single draw call.
Some of the other talks talked about the
importance of batching your draw calls.
Fewer draw calls is always going to be better.
So if you can draw 100 things with
a single draw call that's awesome.
Again, programmable pipeline only, and you must use
generic vertex attributes to use this technique.
This is because you're essentially
sourcing attributes at different rates.
If you have a position for a point you're always
going to have a separate position per vertex,
otherwise they're going to be on top of each other, right?
But you may think what other things could I pass?
Well, you could pass an orientation matrix,
you could pass color, you could pass normals.
So you could reuse the same model, draw it
100 times with different orientation matrices
and put the same guy in 100 different places.
So to do this you use glVertexAttribDivisorARB.
Now you specify a divisor, which tells OpenGL
how to source these attributes differently.
So let's give a little example.
First you can see we have the positions
and they're moving, and the attributes.
So different positions get the same attribute.
Gets the purple one, now it gets the red attribute, yellow
position, completely different, gets the same red attribute.
Green one gets the yellow attribute and so on.
They're different positions, and
they're reusing the same attributes.
So consider each position as its own instance.
And then the attributes are being -- so here in this case
the divisor is 2, because every 2 you repeat the attribute.
So you might want to do this because it saves overhead.
That's the main reason.
And performance is always going to be
better the fewer draw calls you do.
There's different techniques.
This is commonly referred to as stream
instancing, if you've ever heard of that.
You're sourcing vertex attributes at different
rates and the best way to do this and the example
that you can download is sourcing different
position and orientation matrices per instance.
So let's give a demo of that.
So this is a bunch of spinning gears,
using the gear model I showed you earlier.
And it's a 9 by 9 grid times 6 phases.
And this is a single draw call.
There's one draw call, glDrawElementsInstanced.
And you can see that they're all animating
separately, some of them turn one direction,
the others turn the other direction, they're all lit.
You can see the per-pixel lighting.
They're kind of shiny.
And it's actually a pretty simple technique.
There's just one big buffer that is the attributes.
And I have a matrix packed in there.
So each individual gear gets its own
matrix that represents its orientation,
the turning animation, and its position in space.
And finally I have a camera matrix that
controls the camera so I can move around this --
this cube of moving sphere -- moving gears.
This is also available for you to download.
You can check it out.
This is really common in a lot of games, for say
tree rendering if you want to render some foliage
or some tufts of grass, anything like that.
Even particles are great for instancing.
Any time you have to render something over and over again
that looks really similar this
is a great technique to leverage.
All right, let's talk about frame buffer objects.
This is a big one that a lot of people have
asked for, so we've implemented it for you guys.
Our frame buffer object is a generalized
off-screen render target.
Now we recommend when you do your
rendering, you do render to an FBO.
This is an off-screen target.
You can render to a texture or a render buffer.
If you're familiar with the iPhone
you have to render to an FBO.
So you should also be doing this on the desktop.
You can attach different dimensions to your
FBO now, and you can attach different formats.
This is something you can't do on
the iPhone, but you can do it here.
And there's a bunch of reasons you might want to.
Now, FBOs themselves are not
new, this is a technique that's been
around for a long time, a capability
that we've had for a while.
But now you can attach different size
buffers or different types of buffers.
So you can do things like reuse the Z buffer.
Say you want to render something at full size.
So you have a full resolution Z buffer.
Then you want to render at a quarter size or half size.
You can reuse that Z buffer so you
don't have to rebuild that information,
and you can get the benefits of
having some early Z optimizations.
You can also use this to render data to a texture.
Now you could always do this before,
but you had four components
and you were pretty much forced to use RGBA or RGB.
But you may not always need that.
We'll get a little more on this later.
So with that in mind, let's move
on to performance and memory.
First let's talk about texture_RG.
Texture_RG is a one- or a two-channel
texture, R or RG, respectively.
There's many formats, 8-bit, 16-bit, 32-bit unsigned
int, and we also have 16-bit, 32-bit floating point.
And this is mainly for data storage.
And it can also be a render target which
fits nicely into the ARB FBO extension.
So you can see here on the right we have a teapot.
And you might wonder well, why do I
care about one- or two-channel textures.
It's only going to give me red
and it's going to give me green.
Like this teapot is red and green.
Right? What am I looking at?
Well, you can combine this with ARB FBO to render to.
Render data to a texture.
What type of data.
Oops, let's move on.
Sorry, luminance.
You think I can render to luminance,
which is a one-channel texture, right?
Well, luminance is not renderable, so that won't quite work.
But going back to data, you might say why do I care.
Why don't -- why do I need four components?
Well, take screen space motion blur.
Screen space motion blur really only
needs two components, an X and a Y vector.
So you can write your X and Y vectors out to an RG texture
and then sample along those two vectors to get a nice blur.
This is also really useful for a
technique called deferred shading.
Which -- so deferred shading, let's go into that, how
many people have heard of this, by the way, anyone?
This is commonly used in a lot of games.
And you need two passes.
On your first pass you're going to transform your geometry
as usual, then you'll render the attributes
for the lighting calculations to what's called a G-buffer.
Your second pass you draw a full screen
quad and you read from the G buffer
to perform lighting calculations in screen space.
Now let me give you an example
of a possible G-buffer layout.
So here I have three render targets, three separate
textures that I'm going to render to at the same time.
My first texture will store my position.
This is typically going to be your
position in world space or camera space.
I've chosen to do it in camera space, because
I think it's easier, but it's up to you.
Just make sure everything's in the same
coordinate space so your lighting looks right.
So my first texture will contain
my X, my Y, and my Z position.
My second render target, my second texture will actually
represent the color of that pixel, the unlit colors.
So if I have a texture map, it will be that color.
Or if I don't have a texture map it can be an ambient color
or the material color too, so red, green, blue, and alpha.
Lastly, I'm going to store my normal.
And you see here I actually only have the X and the Y.
That's because I have a two-channel texture, and
I can reconstruct the Z component in the shader.
Consider that for your normal XYZ, you
know the length is going to be 1.
It's normalized.
So you can reconstruct the Z, because you know
the length of your final vector has to be 1.
So the Z component is pretty easy to reconstruct.
You also know that the Z is going to be
at least facing moderately towards you,
if it's in screen space, or if you store it in camera space.
Because if it's facing away from
you it wouldn't be lit at all.
Each of my components here is a 16-bit float.
So the whole thing here, you can see
how much space I'm going to take up.
Space is important to always keep in mind.
So to store out the attributes, here's the pixel shader,
the fragment shader, I have a varying position,
a varying normal, and I have my color as well.
So gl_FragData is what you use to store
to multiple render targets in OpenGL.
So in my first one, gl_FragData[0], my first render target,
I'm storing the X, Y, and Z of the position.
My second render target, I'm just storing the color.
And my last one I'm going to store
the X and Y components of my normal.
Here we go.
So let's take a look at what deferred shading looks like.
Here I have the Utah teapot with some lights.
So the big advantage with deferred shading,
you might wonder why you want to use it,
all your lights will be done in screen space.
Normally, you would do lighting either at
the vertex stage or at the fragment stage.
The vertex stage you're going to apply your lights once
per vertex, it will be interpolated across your primitive.
In a forward render, if you apply your light at the fragment
stage you can redo the lighting calculations any number
of times.
Typically, you'll have lots of overdraw in your scene.
Which means that everything you're
rendering may not be visible
because it may be behind something else, occluded by it.
That still means you have to do the lighting calculations
because you don't know if it will
be visible till the very end.
The good thing about deferred shading is you
only do lighting calculations once per pixel.
So you're guaranteed everything that
you're actually lighting will be visible.
So let's visualize these buffers
that we had, we talked about before.
Move the teapot a little bit so we get a better view.
This is the position buffer, the X, Y, and Z.
Now consider that the Z is probably
going to be pretty consistent across it
because the teapot is roughly an equal distance away,
so you can't really see the blue part.
But the X and the Y are going to be red and green.
And notice there's a big black part.
Well that's because those are negative coordinates, right?
And there's no negative color.
So that would be like negative 1 would be in the lower left.
And you just can't see it with color
because that's how we're visualizing it.
This is the color buffer.
So my teapot is actually one color, it's just gray.
If you had a texture map, the texture map would appear here.
It's the unlit color of the screen.
This is commonly referred to as the albedo.
You may have heard that term.
That's -- this is the color buffer.
Lastly, we have the normal buffer.
This is the transform normal that we
use in our final lighting calculation.
You can see the faceting
in the teapot's spout there.
The normals are nicely smooth around the side so
you can see that I'm reconstructing them correctly.
Finally, we have the deferred shaded teapot.
It's pretty cool, you can see it has
full specular lighting per pixel.
And it's running at 60 frames a second no problem.
I was going to show this with like,
18 lights, because it's easy to do.
But it's so overwhelming when you have a bunch
of colors dancing, it's almost seizure-inducing.
So I decided to keep it a little simple.
Next we'll talk about packed floats.
This is a new format, it allows you to pack
three floating point numbers into a 32-bit value.
A new internal format, R11F_G11F_B10F, so 11, 11, 10 textures.
You may have heard that if you've
been familiar with other APIs.
So you can see how this is packed in.
You have 11 bits for the red, 11 bits
for the green, 10 bits for the blue.
So what is this useful for.
Well, this is mainly for HDR rendering, high dynamic range.
And you can do this because if you look at the sun or say
you look at a bright light, like these here are shining
on me, they're very bright, or
look at the projector's light.
And then if you look at the ground at something in
shadow, that's orders and orders of magnitude difference.
So if you looked at the sun and then you
look at something that's completely dark,
that's like a million times brighter, right?
Just this huge magnitude.
And 8 bits, not really enough to express that.
Floating point is way better at doing
something that's vastly different.
So you could say well, I could just
use some floating point numbers.
We have 16-bit floats, we have 32-bit floats.
That's true, but that's a lot of space.
It would be much better if you could
somehow pack that into just 32 bits.
So that's what this allows you to do.
The next thing I'm going to talk
about is called conditional rendering.
Now conditional rendering is rendering
based on the values of an occlusion query.
Who here has heard the term occlusion query?
Okay, good.
An occlusion query is a test to see how many
pixels are actually rendered by the GPU.
You can use this to do a lot of different rendering.
It will give you a value back called samples passed.
And you can say like, oh, these last rendering calls
rendered 100 pixels, these last rendered 1,000.
You can do rendering based on that to see like,
oh, is this house behind a skyscraper, is it not.
The problem is you have to query the GPU and
it has to wait to give you that value back.
So there's a whole round trip involved here.
And we like to remove any round trip we possibly can.
So conditional render allows you to do that.
You can just give it the query, which is here, as ID.
You can give it a mode which tells it to wait or not.
So you would say begin conditional
render with this information,
put in all your rendering commands,
and then end conditional render.
And the stuff that's bracketed between those
two will be conditionally rendered based
on the result of your occlusion query.
Little code here.
The first thing to keep in mind when you're
rendering an occlusion query is you want some sort
of coarse bounding volume.
But you don't want to actually draw that volume
because you'll have some bounding box, right?
You have a complicated guy, he might be behind some house
or skyscraper, you don't want to draw the whole guy,
you just need a box that goes around him.
You know, sort of like the cone of silence or something.
So you turn color mask and depth mask off.
It is much faster to render to nothing than it is
to actually render something to a frame buffer,
and you don't need to actually draw this stuff.
So you want to turn these off before
you render your occlusion query.
Next, you render the occlusion query as normal.
The coarse bounding volume.
glBeginQuery, GL_SAMPLES_PASSED, with your query name.
Draw elements or whatever your draw
calls are, and then you end the query.
Now I actually want to start drawing
again, so I have to turn my state back on.
Lastly, pretty simple, begin conditional
render based on the same query I used before.
Draw all my other stuff, and then end conditional render.
Pretty simple technique, and could be pretty powerful.
Now the funny thing about conditional rendering
demos is there really isn't much to see,
because if it's working you're
not actually rendering anything.
Now here are something like 10,000
gears stretching off into infinity.
And this is just drawing them all normally.
I have a whole bunch of them, they're all lit
per pixel and kind of multicolored and shiny.
And I put something between the
camera and the back of the gears.
So here the problem is I only need to be
rendering like three rows of gears.
There's still, you know, 9,000 something rows
behind that white plane that you can't see.
But they're still all being rendered.
That's wasting time that doesn't need to be wasted.
So I press this.
And you notice that it jumped and the
frame rate actually got a little better.
That's because all the other 9,000
rows are not being drawn anymore.
So back to normal.
So sometimes you can get, like, a good 10 FPS or maybe
even more, depending on the complexity of your rendering.
This is something you have to build
into your engine or your app.
But it's not that hard to do if you're already using
occlusion queries, that could be a very powerful technique.
I would urge you to take a look at the sample code
because it's a lot easier to see and understand
when you're looking at the code than it is for a demo.
Because as I said before, it's doing
its job, there's nothing to see.
So now we move on to performance.
I promised you I'd talk a little bit about
performance, and we're going to do that.
There's a bunch of performance
characteristics on the desktop
that you may not be familiar with
if you're an iPhone programmer.
There's some stuff -- so let's begin with stuff to avoid.
First of all, I want to advise
you not to use immediate mode.
Immediate mode is costly.
So when you do immediate mode, you say glBegin,
glVertex, glVertex, glVertex, glVertex,
glVertex, you're specifying every point.
And that's really slow.
Consider your average model, which could
be anywhere from 2,000 to 10,000 points.
So do you really want to specify 10,000 points with glVertex,
glVertex, you know, thousands of times? Probably not.
Also, you have to send that data
over the bus every single frame.
Every single time you say glVertex that's
some more data to be sent over the bus.
You have all this VRAM.
All of our desktops have tons and tons of VRAM, 128
Megs, 256 Megs, 512 Megs, you want to use this stuff.
So send all your data up to the card and render
from there instead of specifying it every time.
So if you have any code that looks like this in your
application I want you to go and cut it all out.
Just get rid of it.
Use VBOs. glDrawArrays, glDrawElements, this is the way to go.
By the same token, if you've ever heard of a display list,
I'm here to say that display lists
probably don't really help you.
They're really not much of a performance boost.
You may see a little boost, but it's really not much.
You're caching commands in the display list.
But what really hurts you is caching state.
Now if you went to my colleague's talk yesterday,
he talked about state validation in the driver.
This is really where a lot of the CPU
overhead with the drawing is going to go.
And since display lists inherit state,
we can't really cache it for you.
You could say glCallList, but you can
change all the state in between each call.
So we still have to revalidate all that state, which FBO
is bound, which texture is bound, you have depth tests,
you have alpha tests, all that could be
different, different fragment programs.
We can't validate that, so you're really not
getting any benefit of caching these draw commands.
So if you have stuff that looks like this
in your apps, glNewList, a bunch of stuff,
glEndList, glCallList, that also needs to go away.
Display lists aren't available on the phone anyway, so
if you really wanted to port any code or you wanted
to share code between the two platforms you couldn't.
So it's much better just to use glDrawArrays,
glDrawElements, use vertex buffer objects to draw.
So to reiterate what you might have
heard yesterday, batch your state.
This is an important way to improve performance, because
all state changes require validation by the driver.
There's a ton of state in OpenGL, and it all has
to be consistent before you get good rendering.
So the driver has to go validate all
this stuff, make sure it's coherent.
It also requires a vector to be sent down
to the hardware of all your current state.
So if you're changing this all the time between your draws
it's constantly revalidating and constantly resending back
down to the driver, or the driver
is resending back to the hardware.
And that takes time.
This is expensive, that's precious time
you could be doing something else with.
So you want to avoid it, if you batch
all your similar draw calls together,
all your similar objects, you don't have to repeat this.
Now as my colleague said yesterday there's sort of
a hierarchy of what costs more, what costs less,
and I would urge you to go look at
his presentation, take a look at that
and think about how you can re-architect your app to take
advantage of grouping all your similar draws together.
You can also use Shark to check to see where time is spent.
We have a tool called OpenGL Driver
Monitor that can help you as well.
You can look for things like CPU wait
for GPU, or GPU wait for CPU to find
out if you're bound by the CPU, if you're bound by the GPU.
If your GPU is constantly waiting for the CPU
that means there's probably something you can do
in your app to help your rendering go faster.
Also you may have heard of hoisting.
How many people have heard the term hoisting?
Okay, hoisting is moving something up, you want to
pick it up, you want to move it up the pipeline.
So consider you have a vertex shader.
If I have a model with 10,000 vertices, that
vertex shader is going to run 10,000 times.
Now consider I'm rendering at 1600 by 1200.
That's almost 2 million pixels.
My fragment shader is going to run on the order
of 2 million times, maybe even more with overdraw.
A common scene may have four layers of overdraw, right?
Someone could be looking at a house: you draw
the front wall of the house, but there's a window
so you can see inside the house, so you still have
to draw everything inside, because you don't know
what's going to be visible through the window, right?
That's way too much overdraw, and you're going to run
this shader millions and millions of times per frame.
So if it's possible to move any
calculations out of the fragment shader
into the vertex shader your performance is going to jump.
Because would you rather run something
10,000 times, or over 2 million times?
Pretty simple idea, but something
I think a lot of people overlook.
Another thing to keep an eye out for is fallback.
To keep our platform fairly homogenous,
we implement everything in software.
So if something's not supported on
the particular hardware you have,
like if you're running on an Intel integrated
part, it may fall back to the software path.
There are two parameters you can check: there's a
separate vertex fallback and a separate fragment fallback.
And of course the software path is going
to be much slower than the hardware path.
So if you just change the shader and your
performance went down to like 1 or 2 frames a second,
you might think, this is weird, something's wrong.
Well, the first thing is to check to see if
you're falling back to the software path.
And if you are, you can figure out
exactly where you're falling back
and then fix your app so you stay on the hardware.
Now the other thing I'm going to talk about
is what's commonly referred to as a Z prepass.
A Z prepass is just drawing the
depth information into a buffer.
So you can do this quickly and early in your drawing cycle.
You would draw all your objects just into the depth buffer.
So turn the color buffer off.
Depth-only writes are about two
times as fast as color writes.
And since we're only interested in
the depth information and constructing
that information, we want to turn color writes off.
And this is done via glColorMask, as
we did in the occlusion query example.
And why would you want to do this?
Well typically you've seen some diagrams of the pipeline,
and at the end of the pipeline you saw the fragment stage.
Well the Z test is done after the fragment stage.
This gets back to what I said earlier about rerunning
the same fragment shader on stuff you can't see.
Normally, stuff you can't see is wiped
out by the depth test, the Z test.
But if you already did all the expensive calculations,
this doesn't help you much with performance.
If you have a pre-made Z buffer, it can allow
your GPU to perform early Z optimizations
and actually not do those expensive fragment operations
if the resulting pixel won't actually be visible.
So it's important to do a Z prepass if you have
really expensive shaders or lots of overdraw.
This can be mitigated by certain
techniques such as deferred shading.
But even in those techniques there are things
that can take advantage of this.
There's also rendering techniques
that need an incoming Z value.
If you were just doing a normal forward render, no Z
prepass, you wouldn't have a depth value for that pixel.
You could write to it, but you can't read it back.
Certain techniques, like screen space
ambient occlusion, need an incoming Z value.
Now if any of you have played the game Crysis, they used
this technique; they detailed it in a paper I'll cite later.
And this reads the Z value of
the pixel and the surrounding pixels
to determine this ambient occlusion factor,
which is sort of like how ambient light works.
It's just an emulation of the real
ambient lighting in the real world.
So if I'm just writing to my depth buffer
it's going to look something like this.
This is the terrain demo I showed you earlier.
This is just the Z pass.
You can see the closer a part is to me, the darker
it's going to be; the further away, the lighter.
That's just visualizing the depth
value as sort of a grayscale color.
So let's talk about extensions that are specific to the Mac.
The first one I'm going to talk
about is FlushMappedBufferRange.
Normally when you modify a buffer you're
going to use glMapBuffer and glUnmapBuffer.
This will actually map what's in the VRAM into your
system RAM, then you can modify it or take a look at it.
It allows you to asynchronously modify that VBO,
which is important if you're doing
animation and you're updating things.
Now with this extension, when you flush the
buffer, you can flush only a small amount.
Normally when you unmap it, you're going to
DMA the entire buffer back up to the GPU.
So say you have about a megabyte of buffer
data, a big complicated model where just
the vertex positions are about a meg.
You probably don't want to be pushing a meg back
up every frame if you're only modifying like 10 bytes.
So FlushMappedBufferRange allows
you to only flush those 10 bytes.
This also minimizes data that needs
to be copied back to system memory.
Normally, when you map a buffer, the system
doesn't know whether you're going to read from it,
whether you'll read the whole buffer or, say,
just the first 10 bytes and the last 10 bytes,
so it actually has to copy the entire
thing back to your CPU memory.
This can be really slow, especially if you're constantly
copying back, modifying, sending back up to the GPU,
copying back down to the CPU, sending back up to the GPU.
You can imagine that takes a lot of time.
So if you're only mapping and flushing a
portion of it, you can minimize the copying.
So to do this, here's some sample code.
You would set the buffer's flushing
parameter to false.
Then you could do some unrelated work,
your app goes along, does some things.
Then you map your buffer, you get this data pointer.
You update your buffer, and finally you
say FlushMappedBufferRange with the offset
that you started modifying and
the number of bytes you modified.
Then later on when you're done with all
your modifications you can do your unmap,
start drawing with it, and go on your way.
Pretty simple.
This also ties into our next thing, which is glFence.
A fence lets you test when a command is done.
So you could be adding commands to the command
stream; you can say, like, I want to say glBindBuffer
and then glDrawElements, glDrawElements,
glDrawElements, then I want to say glMapBuffer,
modify some things, glFlushMappedBufferRange.
And then I'm going to say glSetFence.
And then I can see when my flush is done.
So, remember I mentioned this is asynchronous.
When I say this, it's going to tell
OpenGL to flush this up to the GPU via DMA.
And you don't know when the DMA
is going to complete, exactly.
It's going to give you control back right away.
Later on if you want to actually draw with this, you
want to make sure that the data's gotten up to the GPU.
So an easy way to check is to set a
fence after your FlushMappedBufferRange.
And later on you can test to see if the fence is complete.
If the fence is complete you're guaranteed that
everything before it has actually executed.
If you ever used a multithreaded engine
with OpenGL, this is absolutely necessary.
Because there can be latency between
the time you modify your thing,
the time you say FlushMappedBufferRange,
and the time you want to draw.
So if you start drawing and then you start modifying again,
you don't know if the GPU is actually still using
that buffer you modified as you draw.
But with fences you can make sure I'm not modifying
it as I draw, and I'm not drawing it as I modify.
You can also use this for other
multithreaded synchronization.
You may have seen multicontext and people mentioning
you can upload textures on a background context.
Well, you can do this, sure.
But you also want to make sure that your texture
upload is done before you start rendering with it.
If you have two contexts, you can set a fence
after your texture upload and then check it,
and then signal your first context, yeah, my texture
upload's done, you can start drawing with this texture.
This is great for texture streaming in the background.
So this is a little bit of what a fence would look like.
I would map buffer, write some data,
FlushMappedBufferRange, set my fence, and then unmap it.
And somewhere later on in my application
I would call glTestFence.
It would test that same fence that I set before
and tell me, yeah, we're done, we're good.
Or no, we're not done.
So you might want to wait.
Sample code is really easy.
You gen your fence, you do some work, you set
your fence, and then later on, you test it.
Now you might wonder if we can put this all together.
We talked about a bunch of different
techniques, and we want to bring them all
into one application, one demo, leveraging all this stuff.
So leveraging every technique we've talked about.
So what did we talk about?
Well, we talked about instancing.
So let's think of something that has a bunch of objects.
We talked about RG textures.
So we want to do some data storage, some
form of deferred rendering would be cool.
We talked about array textures.
We had that cool terrain demo; can we
bring this together into something else?
So let me show you a little something that I put together.
So this was the previous terrain demo.
This is just forward rendering.
Not as interesting.
Now I can switch it to deferred rendering and I
can have multicolored lights flying around it.
Kind of crazy.
You can see I have these little robot
guys, which are from the Quest demo.
And he is actually being lit correctly.
And it's one draw call drawing all
these, and they're actually all
over the mountain, you just can't see them super-well.
See some of them there.
And I can throw on some animation of
the ground swallowing up this guy.
This puts everything together.
This is using deferred shading to draw.
So you can see, you can use texture
mapping with deferred shading.
And it's perfectly compatible with instancing as well.
And array texture.
So this is putting it all together, and
it runs at a perfectly fine frame rate.
You know, we have quite a few things that are being drawn.
The textures aren't super-high
resolution, but they do the job for this.
So all this sample code is available, you can
go onto the web, onto the WWDC attendees site
and download it, take a look at it, play around with it.
I highly urge you to try to integrate
it into your apps, into your games,
or if you just want to learn, take
a look at it, see how we do things.
Feel free to look on the forums or send us email.
Here are some sources.
There's a lot of information on deferred shading
on the Web, on NVIDIA's site, on other sites.
We have some sources from Crytek and other
game companies to talk about deferred shading,
the Crytek paper has information on screen
space ambient occlusion, if you're interested.
If you want more information, you should
e-mail Allan Schaffer, he is our Graphics
and Game Technologies Evangelist.
We also have a lot of documentation that's
really good on the OpenGL Dev Center,
our programming guide and some other things.
Lastly, we have our developer forums.
You can go on, take a look, talk
to your peers, get answers from us.
Now this is the second-to-last session, although
the game sessions will be repeated tomorrow.
I urge you to go online and take a look at some of the
past ones and see the presentations, the design practices,
and all the ES overview and advanced rendering sessions.
Next up is Taking Advantage of Multiple GPUs and they
have some cool techniques that you can use for desktops
that have multiple graphics cards in them.
Thanks for coming, hope this was a
little informative and enjoyable.
If you have any questions, you
can come talk to me on the side.
Have a good day, and enjoy the rest of your WWDC.