Transcript
>> My name is Dan Omachi.
I work for the Apple GPU Software Team
on the OpenGL framework for Mac OS X,
as well as the OpenGL ES framework on iOS 4.
I hope you guys are here today, because you'd really
like to add some stunning visual
effects to your Mac or iOS applications.
Perhaps you thought of adding some shadows,
reflections, or refractions into your app.
Maybe you've heard of some advanced techniques,
such as parallax occlusion mapping,
tone mapping, or deferred shading.
I'm not going to be talking so much about those advanced
techniques today, however, I am going to be talking
about some essential design practices that you'll need
to consider in your applications if you want to add
such advanced techniques, or invent
your own techniques using OpenGL.
So what is OpenGL?
So many of you know OpenGL as a 3D graphics API.
The OpenGL specification, which is
the definitive document on OpenGL,
actually has what I think is a
slightly more accurate definition.
OpenGL is a software interface to graphics hardware.
In other words, it's an interface
with a graphics processing unit.
Every device that ships with iOS 4 and
Mac OS X has a pretty capable GPU on it.
So what does this GPU do?
Well, many people believe that the GPU is
just there to make your graphics look good.
Actually, you can make some pretty high
quality renderers using just the CPU.
Movie studios do this all the time.
They make very high quality renderers,
and you see some great special effects.
However, their renderers take many,
many minutes to render a single frame.
This isn't so good for an interactive application.
So what does the GPU do?
It accelerates your rendering.
And when you're talking about interactive
frame rates, this matters.
Faster rendering equals better image quality.
Drawing efficiently allows drawing more: more models,
more vertices in those models, more pixels, longer shaders,
better special effects in your application.
All at an interactive frame rate.
All right, so what will you learn today?
So let's say you've got a Formula One car.
Just because you've got it and you've driven it to work every day
doesn't mean that you're going to be winning any Formula One races,
or even qualifying for them,
even though you've got this very advanced,
almost 1,000 horsepower machine.
You need to know how to use it effectively.
Same thing with the GPU and OpenGL.
It's a very complex machine.
You need to know how to utilize it and use it
efficiently in order to harness that power.
So I'll tell you a little bit about
how OpenGL works under the hood.
A good Formula 1 driver knows exactly
the strengths and weaknesses of his machine:
where it excels, and where he needs to work at it more.
I'll talk a lot about this process called state validation.
This is where OpenGL translates API calls into GPU commands.
Now this is actually a CPU intensive operation,
and a lot of applications stumble on it.
So you need to be aware of what happens there.
I'll also give you some fundamental OpenGL techniques to
avoid these CPU bottlenecks, and efficiently access the GPU.
Now it's important to note that these
techniques apply equally to iOS 4 and Mac OS X.
So the code you write on one platform, and the knowledge
that you've gained here, you can
leverage on the other platform.
They've got pretty different architectures,
very different GPUs, very different devices.
But the OpenGL software stack is quite similar,
so you can leverage that knowledge pretty easily.
So let me talk a little bit about OpenGL's design.
OpenGL is designed to be a low-level API,
to allow direct access to the GPU's capabilities.
It allows you to get in and get out;
the software stack doesn't interfere with your code.
Really lean.
However, it's high enough level
to drive many heterogeneous GPUs.
On the Mac we support many GPUs.
On iOS we support many as well.
So there is a fair amount of work to
translate from these API calls to GPU commands.
It's not the lowest-level way to actually get to the GPU;
I mean, we could make an API that's right to the metal.
However, what this high level allows you to do
is write code that's portable, and also write code today
that will run on devices that have changed
drastically underneath your application tomorrow.
So as architecture changes, your code
remains the same and works quite efficiently.
So OpenGL is a state machine.
And this state maps roughly to the GPU pipeline.
If you were to look at Apple's implementation of
OpenGL, what you'd see is this gigantic C struct.
OK? And to see what would be in this struct, you could
actually look at the OpenGL specification.
At the back you'd see this pretty
large table that says OpenGL state.
And it has things like, you know, what stuff is enabled:
the various GL enables in OpenGL, whether they're on or off.
It has things like what's bound at certain points.
And so that's basically what's in this context.
All right?
So much of OpenGL's lifetime is just
spent tracking state changes that your app makes.
So in your application, there's a certain class
of calls that OpenGL has that just exists
to change this context state, this state vector.
So for instance, you call glEnable, and some state changes.
You call glBindTexture, and some more state changes.
You call glUseProgram, and again, the
context is updated with new state.
So at this point we're actually
doing nothing that's GPU specific.
It's all GPU agnostic.
There's no hardware specific command getting generated.
The real work happens when you make a draw call.
This is when API calls get translated into GPU commands.
All of that work is deferred until you draw.
So for instance you call glDrawArrays, and what
happens here is we're crunching all of this context state,
taking that draw call, and creating a GPU command.
So one thing to note about this is that this
translation step is a CPU-intensive operation.
We need to do a lot of processing on the CPU to
figure out exactly what your application has done,
what it wants, and translate that into a GPU command.
All right, so now we've got this
GPU command, what happens to it?
Well, GPU commands are inserted into a command
buffer that's allocated by the kernel.
So we've put that in there.
And as the command buffer fills up, or your
application calls glFlush, we flush it to the GPU.
The GPU now can process it.
But actually there's one step that needs to happen first.
We need to actually transfer all the
resources that are needed for those commands.
For instance, we need to take the textures
that are used by those draw commands
and download them to the GPU.
And that can make this a CPU-intensive process.
In addition to state validation, this is another
potential bottleneck your application could incur.
Now this is why we recommend not
calling glFlush for really any reason.
There are some very specific reasons where you might want
to call glFlush for multicontext, multithreading rendering.
But for the majority of applications,
you're only talking about a single thread,
and you should never need to call glFlush.
It's an expensive operation.
Now you've got the command buffer on the
GPU, and the GPU can begin processing it.
And it starts by fetching vertices, and
then it pushes that data down the pipeline.
So it's important to note that there are
many potential bottlenecks on the GPU.
Any one of these stages could be a bottleneck.
However, a very common bottleneck is actually the CPU.
All of those stages that I just talked about.
So a key point, the GPU is another
processor that runs in parallel to the CPU.
Like in any good multithreaded app, you don't want
one thread blocking and locking up the other thread.
You want both cores running at the same
time, really maximizing the use of the CPU.
Well, same thing happens with a GPU.
As OpenGL pushes commands into a command buffer, the GPU
fetches commands that have already been submitted to it.
So, let me show that again.
OpenGL is simultaneously adding commands to one command
buffer while the GPU is fetching commands from another.
So you really don't want to let the GPU wait for the CPU.
That's just wasting the GPU's resources.
You could be rendering 1,000 enemies in your
game, and maybe you could be rendering 3,000
if you were utilizing the GPU fully.
If you're stalling it, well, you're
not utilizing the GPU as much,
and you're not rendering as many
cool effects as you could be.
Also, some applications tend to use
the CPU to do some graphics work.
Well, as I mentioned before, the
CPU is good at a lot of things.
And you can do some high quality
rendering with it, but it's very slow.
And you really want to use the GPU for
this work, really offload the work.
So this is what happens when you let
the GPU wait: it just sits there.
There is a certain class of calls where the
CPU may need to wait for the GPU.
Anytime the GPU needs to process something and hand
the result back to the CPU, the CPU will sit there waiting for it.
These are things like glReadPixels, queries, and fences.
Here's what happens: the CPU sort of just sits there
waiting until it gets the data it needs from the GPU.
Now there are ways you can use glReadPixels, queries, and
fences efficiently where you're not stalling the CPU,
and I can go into that a little bit later.
So one thing I really want to reiterate
here is that for each draw call,
context state is translated into a GPU command.
This is a very CPU-intensive operation.
We call this state validation.
Here's what happens: you call
glDrawElements, the CPU processing goes way
up, and we translate this into a GPU command here.
If you were to profile your application,
what you would see is that the draw calls,
in this case glDrawElements, take substantially
more time than any of the other OpenGL calls.
OK, here you see it takes an order of magnitude more
than any of the other OpenGL calls that you see here.
And what's important to note about this, is
that there is some cost in making a draw call.
It's actually not the draw call itself that is very
expensive, or causing this expense; it's the state-setting calls
that your app has made before the draw call.
So you enable something, you bind something, those look
cheap on a Shark profile, but that cost does show up.
It doesn't show up in that particular call,
but it shows up in the subsequent draw call.
So state-setting calls look cheap, but they're
not really as cheap as they might appear.
So I'm going to go over a couple of techniques,
which you can use to reduce your CPU overhead.
The first of which is to use OpenGL's various objects.
The second is to manage your rendering state.
Sort it, so that you're not making redundant state changes,
unnecessary state changes that incur more CPU cost.
And batch state: reduce some state
by combining objects into one.
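To make the redundant-state-change idea concrete, here's a minimal sketch of a caching wrapper. The names use_program_cached and set_gpu_program are hypothetical; set_gpu_program is a counting stand-in for a real call like glUseProgram, so the sketch stays self-contained:

```c
/* Stand-in for a real state-setting call such as glUseProgram;
   here it just counts how many times real work was done. */
static unsigned int real_binds = 0;

static void set_gpu_program(unsigned int program)
{
    (void)program;
    real_binds++;
}

/* Track the last program we set; ~0u means nothing set yet. */
static unsigned int current_program = ~0u;

/* Only touch the context when the state actually changes. */
void use_program_cached(unsigned int program)
{
    if (program != current_program) {
        set_gpu_program(program);
        current_program = program;
    }
}
```

Setting the same program twice in a row then costs nothing; only genuine changes reach OpenGL and, later, state validation.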
So objects.
Whenever you use an OpenGL object, some of that
state is cached in that object and pre-validated.
It's not validated at draw time; it's validated up front.
So when it's bound, it's easily translated into a GPU command.
So let's take texture objects for example.
You call glGenTextures, and you create this texture object.
All right, you call glTexImage2D, and
you see this object state here:
it's this little state vector that the object itself has.
And that gets updated and validated.
You call glTexParameter, and that state
gets updated and validated again.
And finally, when you call glBindTexture, that
state is merged with the rest of the context state,
and cached; it's much easier to validate, with much
less CPU processing required to create that GPU command.
Now it's important to note that this process of
setting up objects is actually somewhat expensive.
So you really want to do this up front, when
your app, your level, your document is loading,
when the user isn't expecting some high frame rate.
Don't wait until you're in the middle
of your real-time run loop.
Do it up front.
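Putting those texture-object steps together, here's a sketch of load-time setup; the 256 x 256 size and the pixels pointer are hypothetical, and a current GL context is assumed:

```c
GLuint texture;
glGenTextures(1, &texture);              /* create the texture object */
glBindTexture(GL_TEXTURE_2D, texture);   /* work with this object now */

/* Each call below updates and validates the object's own state vector. */
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 256, 256, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, pixels);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

/* Later, in the run loop, binding merges this pre-validated
   state into the context cheaply. */
glBindTexture(GL_TEXTURE_2D, texture);
```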
So now I've described one way in
which you can reduce CPU validation.
I'll also describe a few other ways.
And I'll also describe some ways in which not
using objects can be an inefficient use of the API.
The first thing I'll describe is how the fixed-function
vertex and fragment pipeline can be
an inefficient use of the API,
and instead, how using GLSL program
objects can be a much more efficient use.
So modern hardware doesn't have any fixed
function vertex or fragment pipeline.
OpenGL needs to convert all of your fixed-function
calls, glEnable(GL_LIGHTING), glLightfv, etc., into a shader internally.
And this can be pretty costly.
Here's what happens.
You call glEnable(GL_LIGHTING), and some context
state changes, as it usually does.
glTexEnv, another fixed-function call,
and that state gets updated.
glDrawArrays.
Here's what happens: we generate this large
shader under the hood for your application.
Now you don't have to pay attention
to exactly what's going on.
We're just emulating what you've told us,
what the state that you'd like to use.
And this can be a pretty expensive operation.
We actually cache the shader away,
so we don't generate the shader
and compile it every single time that you use fixed function.
We cache it away in a cache table.
And when you use fixed function, we need
to generate this cache key, which is somewhat expensive,
and then go inside of our cache and
pull out your shader and set it.
This can be pretty expensive.
So instead, what we'd rather have you do, and what
can be a much more efficient use of the API,
is to use program objects, shader objects.
This is the most efficient way to set up the pipeline.
So you specify shader code, and you compile and link it
into a program object while your level,
your document, your app is loading,
before you enter your real-time run loop.
And then when you want to use that pipeline,
you set it, and bam, it's easily set on the GPU.
No looking into some weird hash table
to pull out the fixed-function shader.
Very efficient use of the API.
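Here's a sketch of that load-time shader setup; error checking is omitted, a current GL context is assumed, and vtxSource and frgSource are assumed to be your shader source strings:

```c
/* Compile the two shader stages. */
GLuint vtxShader = glCreateShader(GL_VERTEX_SHADER);
glShaderSource(vtxShader, 1, &vtxSource, NULL);
glCompileShader(vtxShader);

GLuint frgShader = glCreateShader(GL_FRAGMENT_SHADER);
glShaderSource(frgShader, 1, &frgSource, NULL);
glCompileShader(frgShader);

/* Link them into a program object, once, while loading. */
GLuint program = glCreateProgram();
glAttachShader(program, vtxShader);
glAttachShader(program, frgShader);
glLinkProgram(program);

/* In the real-time run loop: one call sets the whole pipeline. */
glUseProgram(program);
```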
So some people would like a little bit more
understanding of how vertex and fragment shaders work.
They are kind of a complex piece.
So I'm just going to go through and
describe a little bit about how they work.
So the vertex shader is the first
part of the programmable pipeline.
The shader is executed once for every vertex.
So let's say you've got a model with 1,000 vertices.
And you draw that model.
Your vertex shader will run 1,000 times.
OK, now this is great, this is fine, because
GPUs are very efficient at that.
They're designed for this.
The GPU is a high-bandwidth processor specifically
designed to do this sort of processing.
OK, so let's talk about the inputs.
Inputs are per-vertex attributes that
are specified outside the shader.
Things like pre-transformed positions,
pre-transformed normals, untransformed texture coordinates.
Specified outside the shader.
Now there are two classes of outputs.
The first is a position in clip space.
OK? This is the gl_Position built-in variable.
You assign to this gl_Position variable.
And this clip space is used for
OpenGL to map to the screen.
OK? It's to map your 3D model onto a 2D screen.
So that's one output of the vertex shader.
The second are one or many varyings.
And the type of data that these varyings usually
consist of are colors, normals, and texture coordinates.
Values that you'd like to read in your fragment shader.
Now these values are interpolated
across any rasterized triangle.
So, let's say this triangle was generated by three vertices.
And on one vertex you assigned 0.4 to one of these varyings.
On the other you assigned 0.8.
You output that in your vertex shader.
OK, so halfway through, a pixel generated
on this polygon will have a value of 0.6.
One-quarter of the way along, it'll have 0.5.
Three-quarters, 0.7.
So in other words, it's distributed across the polygon from
the two varyings that you output in your vertex shader.
Now it's not always linearly interpolated like
it looks here, where it's evenly distributed.
In fact, if the polygon is facing you, it will be.
But if it's a little bit on edge, you'll have
this sort of perspective-correct interpolation,
where values that are closer to you will
actually be a little further apart.
That's to give it this 3D effect that you need.
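For the linear case described above, the interpolation is just a weighted blend between the two endpoint values. Here's a runnable sketch (the function name is hypothetical) using the 0.4 and 0.8 values from the example:

```c
/* Interpolate a varying along an edge: t = 0 at the vertex that
   output `a`, t = 1 at the vertex that output `b`. The GPU's
   perspective-correct version also divides through by clip-space w,
   but for a polygon facing the camera it reduces to this. */
float interpolate_varying(float a, float b, float t)
{
    return a + (b - a) * t;
}
```

With a = 0.4 and b = 0.8, t = 0.25 gives 0.5, t = 0.5 gives 0.6, and t = 0.75 gives 0.7, matching the values above.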
OK, so let me give you an example
of a pretty simple vertex shader.
So here is a varying that we define, varTexCoord.
And this works just like the varyings I just
described, that are distributed across the polygon.
And here is the main body where all this work is going on.
On the first line, what we're doing is taking
the incoming gl_Vertex, the pre-transformed position,
and multiplying it by the gl_ModelViewProjectionMatrix.
And then we output it to gl_Position.
So now this position is in clip
space, because we transformed it.
And then the second step here is we take in a texture
coordinate that is specified outside the shader.
We take the two components, S and T
components, and we assign it to this varying.
And this varying, again, it works just
like that diagram I just showed you.
And you can read varTexCoord in the fragment shader.
The second stage of the programmable pipeline.
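The slide itself isn't in the transcript, but reconstructed from the description, the shader would look something like this sketch, using the legacy built-in variables:

```glsl
varying vec2 varTexCoord;

void main()
{
    // Transform the incoming pre-transformed position into clip space.
    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;

    // Pass the S and T components of the texture coordinate through,
    // to be interpolated and read in the fragment shader.
    varTexCoord = gl_MultiTexCoord0.st;
}
```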
Now there are a couple of things to note about this shader.
We're using these built in variables here.
Now these are sort of a throwback
to the fixed-function days.
And they're only available on the Mac;
they're not available if you use ES 2.0 on iOS 4.
They're based on, sort of, the legacy pipeline.
And actually we would prefer that you
do not use them for a couple of reasons.
There needs to be some mapping
performed in order to use them.
And they're not forward compatible.
OpenGL ES doesn't have them, as I said.
And code that uses them isn't the future of OpenGL.
In fact, there are a number of extensions that we
ship on Mac OS X where you can't use these variables.
There's a whole set of these variables.
On the right-hand side are the varyings
and attributes, you know,
these legacy fixed-function based variables.
And on the left here we've got uniforms.
The model-view-projection matrix, lighting, points: stuff that
just doesn't exist anymore in the programmable pipeline.
But OpenGL has them for legacy reasons.
So instead of using these variables, you can use generics.
Here's an example of a shader, which does
the exact same thing as the previous shader.
Except instead of using built in variables we use our own.
We define our own, so that we have portable code and OpenGL
can map them to the programmable pipeline much more easily.
So we define an input position
and an input texture coordinate.
We don't use the built-in gl_Vertex
or gl_MultiTexCoord0; we use our own.
And we define our own model-view-projection matrix;
we don't use gl_ModelViewProjectionMatrix.
And we set that outside the shader.
And just as in our last shader, we have varTexCoord,
another varying that we output to.
And in the main body, we multiply our model-view-projection
matrix by our input position that we've defined,
and we output it to gl_Position.
Now gl_Position is a built-in variable, but it's
not one of these legacy throwback variables.
You still need to use it.
And then again, we take our input texture coordinate
variable and pass it through to varTexCoord,
so that it may now be read in the fragment shader.
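Reconstructed from the description, the generic-attribute version would look something like this sketch; the names inPosition, inTexCoord, and modelViewProjectionMatrix are assumptions:

```glsl
// Our own uniform replaces gl_ModelViewProjectionMatrix.
uniform mat4 modelViewProjectionMatrix;

// Our own attributes replace gl_Vertex and gl_MultiTexCoord0.
attribute vec4 inPosition;
attribute vec2 inTexCoord;

varying vec2 varTexCoord;

void main()
{
    gl_Position = modelViewProjectionMatrix * inPosition;
    varTexCoord = inTexCoord;
}
```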
The fragment shader.
So this runs once per pixel produced by each polygon.
So let's say you've got a model with 1,000
vertices, and it draws 100,000 pixels.
This shader will be run 100,000 times.
Again, GPUs are efficient at processing this, so that's great.
But in general, if you have some processing
that could be done higher up in the pipeline,
in the vertex shader instead of the fragment shader,
you probably want to do it up there,
because it will be run fewer times.
A little bit less processing, a
little bit less work for the GPU.
The cool thing about this programmable
pipeline is that you can render effects
that aren't possible using the fixed function pipeline.
So here we've got this toon shading effect.
OK? And here's how that works.
Per fragment we calculate this edginess factor.
So if the polygon on which the pixel lies
is more on edge to the user,
we'll give it a value that's closer to zero.
OK? However, if the polygon is facing the camera,
the user, we'll give it a value that's closer to 1.
So we now know whether this polygon is on edge or facing.
And then we could use this edginess
factor to give it a color.
So, for instance, if it's on edge, we probably
want to give it that sort of toon effect, you know,
the silhouette that the teapot
has, so we'll give it a black value.
And otherwise we'll give it a blue value.
And then we assign this color that we've determined
to gl_FragColor, another built-in variable,
which is the ultimate color that
will be rendered for that pixel.
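Here's a sketch of a fragment shader along those lines. The exact math is an assumption; one common way to get the edginess factor is the dot product of the interpolated surface normal and the direction toward the viewer:

```glsl
varying vec3 varNormal;   // interpolated surface normal
varying vec3 varEyeDir;   // interpolated direction toward the viewer

void main()
{
    // Closer to 0 when the polygon is on edge to the viewer,
    // closer to 1 when it faces the camera.
    float edginess = abs(dot(normalize(varNormal),
                             normalize(varEyeDir)));

    // Black silhouette on edges, blue everywhere else;
    // the 0.3 threshold is an arbitrary choice.
    vec3 color = (edginess < 0.3) ? vec3(0.0) : vec3(0.1, 0.2, 0.8);

    gl_FragColor = vec4(color, 1.0);
}
```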
Let's talk about the inputs to the vertex shader.
There's this call, glVertexAttribPointer, which
points to data that can be fed to the vertex shader.
Input position, for instance, or
texture coordinates, things like that.
Here's how it works.
You allocate some memory in your
application; in this case we're using malloc.
And then you load data into this
buffer that you've allocated.
You call glEnableVertexAttribArray to let OpenGL
know that, hey, I'm going to use a vertex array.
And you give it an index, 0 to 15, that maps to the shader.
And then you call glVertexAttribPointer, and you
give it this position data that you've allocated.
OK? This basically tells
OpenGL, hey, my vertex data lives here.
So there are some issues with using OpenGL this way.
Because you're allocating the buffer on your end,
not telling OpenGL to allocate it itself,
OpenGL has to copy this data into its own stream.
So CPU cycles will be required to copy that vertex data.
Here's what happens.
You call glVertexAttribPointer, and OpenGL now
knows about this buffer that you've allocated.
But then when you call glDrawArrays, or otherwise draw
with this buffer, OpenGL copies it into the command buffer.
And there's a double whammy here.
Because the CPU also needs to do the copy,
there are some CPU cycles being incurred.
But also we're filling up the command
buffer much more quickly.
And then a flush will occur.
Flushes will happen much more often, a second whammy.
So instead of doing this, we'd like you to be
able to just cache that vertex data on the GPU.
And here's how you do it:
you can store it in a VBO, a vertex buffer object.
You call glBufferData, and glBufferData allocates some space
on the GPU and then loads your vertex data into the GPU.
Then you'll call glDrawArrays, a command is created and
it simply references this data that's already on the GPU.
You would call glBufferData probably when your application
loads, before you're in the real-time run loop.
There is some cost to it.
But if you do that it'll be cached ready
to be used in a real time run loop.
So here's how it works.
You call glGenBuffers, that creates
this vertex buffer object.
You call glBindBuffer to bind it to the context.
Tell OpenGL, hey I'm going to work with this object now.
And you call glBufferData to allocate and load your data.
Then you call glVertexAttribPointer, the same call
that you made before with client-side vertex arrays.
But this time, instead of giving it
a pointer, you give it an offset
into the vertex buffer object, where your vertex data lives.
So in this case we give it 0, which says, hey, my
vertex data is at the beginning of this buffer.
You can actually store many attributes within a single
vertex buffer object, so it doesn't have to be 0.
Let's say you've got color data 50 bytes down, then
you give it a value of 50 for the color attributes.
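Putting the VBO steps together, here's a sketch of the load-time setup; vertices, the Vertex type, and POS_ATTRIB_IDX are hypothetical names, and a current GL context is assumed:

```c
GLuint vbo;
glGenBuffers(1, &vbo);                /* create the buffer object   */
glBindBuffer(GL_ARRAY_BUFFER, vbo);   /* work with this object now  */

/* Allocate GPU storage and upload the vertex data in one call. */
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices,
             GL_STATIC_DRAW);

/* Same call as with client-side arrays, but the last argument is now
   an offset into the VBO instead of a pointer; 0 means the data
   starts at the beginning of the buffer. */
glVertexAttribPointer(POS_ATTRIB_IDX, 3, GL_FLOAT, GL_FALSE,
                      sizeof(Vertex), (void *)0);
```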
You may want to modify your data
for animations or some other reason.
If you have a constant number of animations, if you
have few enough of them, just create multiple VBOs.
Let's say you've got 10 frames, 10 models
that you want to animate for your character.
Just make 10 VBOs.
That way all 10 of them are cached on the GPU.
However, you may generate data on the fly,
or maybe you'll load it from disk;
some way that OpenGL may not know about up front.
In this case you can call glBufferSubData or
glMapBuffer to modify this cached vertex buffer object.
Here's how this works.
You call glBufferData as you normally would,
but you give it the GL_DYNAMIC_DRAW hint.
And that says to OpenGL, hey, I'm going to modify this
buffer, so put it someplace that's easily accessible
by the GPU, but can be easily modified by the CPU.
Then you modify the data that you want
to update in the vertex buffer object.
And you call glBufferSubData with this updated data
pointer, and it's loaded into the vertex buffer object.
There are some caveats.
There are some potential problems where, if
you're updating buffers a lot,
you can encounter some problems.
You can force the GPU to sync with the CPU.
So all of a sudden you're running full out in parallel,
and then you call glBufferSubData and
then, lock, one depends on the other.
And you don't want that.
And you don't want that.
So basically what will happen is the CPU will wait for the GPU
to finish drawing with the buffer before it updates the buffer.
OK? Both the CPU and GPU can't read and
write from the buffer at the same time.
So this can happen whenever you use
glBufferSubData or glMapBuffer.
You can use a double buffering
technique to avoid this problem.
And let me explain how that works.
So, you have two vertex buffer objects.
On an odd frame, you'll bind and load an odd buffer.
OK? And you draw with it just as you normally would.
But on an even frame you bind and load this other buffer.
This way the CPU is loading this even buffer, while the
GPU is reading from an odd buffer, a different object.
They don't need to synchronize, because they're
not accessing the same object, the same data.
OK? And then you draw with an even
buffer as you normally would.
And so what you would do is you'd ping-pong between these
two buffers updating one, while the other is being drawn.
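A sketch of that ping-pong, assuming a two-element VBO array and a per-frame update (the names here are hypothetical):

```c
GLuint vbos[2];   /* two VBOs created and allocated at load time */

void update_and_draw(unsigned int frame, const void *data, size_t size)
{
    /* Odd frames use one buffer, even frames the other, so the CPU
       never writes the buffer the GPU is still drawing from. */
    GLuint vbo = vbos[frame & 1];

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, 0, size, data);

    /* ...set attribute pointers and issue the draw as usual... */
}
```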
Vertex array layout.
So, glVertexAttribPointer is a really important GL call,
because not only does it tell OpenGL where
your vertices live, it also tells it the vertex layout.
So you call glVertexAttribPointer and you give
it some data about the attribute.
Like, what kind of data is it?
It's floating point in this case.
The size of the data: it's a position,
so maybe it has an X, Y, and Z,
so it needs a value of 3.
The stride, the number of bytes from
one vertex's attribute to the next:
in this case there are 16 bytes
to the next attribute 0 in the array.
And the offset, where it lives within the vertex buffer object.
Call glVertexAttribPointer again and some more
data is updated for a different attribute.
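To make the stride and offset concrete, here's a runnable sketch of a hypothetical interleaved vertex whose attribute 0 entries are 16 bytes apart: a 3-float position followed by a 4-byte RGBA color:

```c
#include <stddef.h>

/* One interleaved vertex: 12 bytes of position + 4 bytes of color
   gives the 16-byte stride from the example above. */
typedef struct {
    float         position[3];  /* attribute 0, offset 0  */
    unsigned char color[4];     /* attribute 1, offset 12 */
} Vertex;

/* The stride argument to glVertexAttribPointer is sizeof(Vertex),
   and each attribute's offset comes from offsetof, e.g.:
   glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                         (void *)offsetof(Vertex, position));       */
```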
OK? So wouldn't it be nice to cache this, so that
you're not always having to call glVertexAttribPointer?
Well, now you can, because now you
have a vertex array object.
And the way this works is you call glGenVertexArrays.
And this creates this vertex array object.
And any subsequent glVertexAttribPointer call
actually changes the data within this VAO.
So it's all cached right there.
Let me show you some code.
You call glGenVertexArrays, create the vertex array object.
You bind it to the context, tell OpenGL, hey;
I'm going to work with the vertex array object.
You call glEnableVertexAttribArray just as you
normally would, and glVertexAttribPointer.
But instead of this getting set in
the context, it's set in the VAO.
You can set multiple vertex attributes
and it'll all cache within the VAO.
And then you can call glBindVertexArray
when you want to use this VAO to draw with.
You don't have to call glVertexAttribPointer many, many
times to set up your model data to be drawn.
You just call glBindVertexArray once, and it's
all cached, ready to go, ready to be drawn.
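A sketch of that flow; the attribute index, the layout, and vertexCount are hypothetical, and a current GL context is assumed:

```c
GLuint vao;
glGenVertexArrays(1, &vao);  /* create the vertex array object      */
glBindVertexArray(vao);      /* subsequent pointer calls land in it */

/* These settings are cached in the VAO, not in the context. */
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 16, (void *)0);

/* Later, in the run loop: one bind restores the whole layout. */
glBindVertexArray(vao);
glDrawArrays(GL_TRIANGLES, 0, vertexCount);
```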
Framebuffer objects.
These are pretty cool objects.
So with EAGL and OpenGL ES you must
always use an FBO in some form.
The EAGL API guide requires that you use an FBO
to allocate the backing store to which you will render.
However, you can do some pretty cool effects by
attaching a texture to a frame buffer object.
So, here we've got this little demon character.
And what I've done here is we've rendered
this demon character to a texture.
All right?
And then we bind that texture and texture
this plane that you see here on the bottom.
And this plane shows the image that we've rendered
to, which we're now mapping onto this polygon.
So you can do all sorts of reflections,
refractions, shadows.
Some pretty neat effects with a
renderable texture and framebuffer object.
The way this works: you call glGenTextures, create your
texture as you normally would, bind it to the context,
and then you allocate some data with glTexImage2D.
In this case we're making a 512 x 512 texture.
OK? And then we create a framebuffer
object, calling glGenFramebuffers.
And bind that to the context.
Later, when the texture is no longer bound to the
context, we can attach it to this framebuffer object,
which basically says: any drawing that
you do in OpenGL, draw it to this texture.
Don't draw it to the screen.
OK? And then later on you can bind that
texture to the context, and you can read it
and map your rendered image to some other object.
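A sketch of the render-to-texture setup described above, assuming a current GL context; the 512 x 512 size matches the example:

```c
GLuint texture, fbo;

/* Create and allocate the texture that will receive the rendering;
   passing NULL allocates storage without supplying data. */
glGenTextures(1, &texture);
glBindTexture(GL_TEXTURE_2D, texture);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 512, 512, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glBindTexture(GL_TEXTURE_2D, 0);

/* Create the framebuffer object and attach the texture to it. */
glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, texture, 0);

/* While this FBO is bound, drawing goes into the texture,
   not the screen. */
```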
Some notes about object mutability.
OpenGL objects can be changed at
any time during their lifetime.
So in this case we've created a texture,
and we've given it GL_LINEAR filtering.
And then later on, maybe in a real-time run loop when
the user is expecting an interactive frame rate,
we can set something like GL_LINEAR_MIPMAP_LINEAR
and change that texture object.
Change the way that it works.
I would really avoid doing this, because this forces
OpenGL to revalidate the object the next time it's used.
If you need an object with two different
states, just create two different objects.
Set them up up front; don't change
them, because then OpenGL needs
to do some revalidation, and that's
going to cause a stutter.
Or it's potentially going to cause
a stutter in your application.
So OpenGL objects aren't actually
pre-validated when they're created.
Instead they're validated when they're first used to draw.
Now there's some reasons for this.
Shader objects can't be compiled until they're first used
to draw, because the compiler needs to know some context
state that that shader's going to be used with.
For instance, it may need to know the FBO, VAO, or
textures that are bound, or the blend state that is used
in conjunction with that shader, in order for it to
do a good job compiling and optimizing your shader.
Texture objects and vertex buffer objects, memory resources
don't get cached in VRAM until they're first used to draw.
Now this lazy validation step that I just spoke
about can cause some hiccups in your run loop.
And there are some methods to avoid this.
So you can avoid this validation, this
hiccup by pre-warming your objects.
And the way this works is, you bind the object
and draw with it to an offscreen buffer.
Maybe the back buffer and you don't
present that to the user.
And you use the state and other objects it's used with.
So let's say you've got a shader.
Make sure you use it: you turn on all the blending state,
you bind all the textures that that shader is used with,
and draw with it first, before you're
in your real-time run loop.
Don't wait until you're in your
run loop and cause a stutter.
Here's a little bit of pseudo code.
Oh, one thing I should note is to really only consider
this if your application experiences a hiccup,
particularly when you've bound an object
that you've never used to draw with previously.
So here's some pseudo code for how this works.
For every program in your scene bind it to the context.
For every VAO that's used with that program bind that.
And for every texture used with that VAO program, bind that.
And then for every blend state used with all those
other objects, etc., etc., you set that state.
And then you draw.
Draw to the offscreen buffer.
This isn't to present to the user,
it's just to warm up OpenGL.
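As a concrete sketch of that traversal, here's a small C version of the pseudo code above. The bind and draw functions here are hypothetical stand-ins, not real GL calls; in an actual app they would be glUseProgram, glBindVertexArray, glBindTexture, and a draw into an offscreen buffer. Here they just count invocations so the loop structure is checkable.

```c
/* Hypothetical stand-ins for glUseProgram, glBindVertexArray,
   glBindTexture, and an offscreen draw. They only count calls. */
static int binds = 0, draws = 0;
static void bind_program(int p)  { (void)p; binds++; }
static void bind_vao(int v)      { (void)v; binds++; }
static void bind_texture(int t)  { (void)t; binds++; }
static void draw_offscreen(void) { draws++; }

/* Pre-warm every (program, VAO, texture) combination the scene will
   use, so lazy validation happens here instead of mid-frame. */
void prewarm(int nprograms, int nvaos, int ntextures)
{
    for (int p = 0; p < nprograms; p++) {
        bind_program(p);
        for (int v = 0; v < nvaos; v++) {
            bind_vao(v);
            for (int t = 0; t < ntextures; t++) {
                bind_texture(t);
                draw_offscreen();   /* never presented to the user */
            }
        }
    }
}
```

In a real app you would only iterate the combinations your scene actually uses, not the full cross product.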
Note about object size.
Know how much memory your objects take.
All current graphics resources
need to fit in memory somehow.
And memory is limited.
On iOS 4 there's no virtual memory
system, so this constrains you.
On Mac OS X, some of the devices have limited VRAM.
So, you know, if you're using too
much memory there's some cost to it.
There's some paging that might need to happen, and
you don't want to have that in a real time run loop.
Try to use compressed textures.
Textures take up a lot of memory.
There are a number of compressed
texture formats that you can use.
Also, don't use data types like a 32-bit float
for your textures when you could be
using an 8-bit unsigned byte.
Use what you need.
Use the smallest sizes possible
that you need for your scene.
A lot of the time, we see applications allocate a humongous
texture, a 2,000 by 2,000 pixel texture, for a little tiny model
that's going to fill up maybe only 200 pixels on the screen.
You're never going to see most of that texture.
It's just a waste of resources.
So fit the texture to the size of the model that's rendered.
Use a 256 x 256 texture in that case.
And really, to ensure a smooth frame rate, try to fit the
entire frame's resources into VRAM, so that we never need
to page out a texture from VRAM in the middle of a frame.
And if possible, fit the entire level
or scene's textures into VRAM.
That way we will never need to
page in the middle of your scene.
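To put some numbers on that, the arithmetic is just width times height times bytes per texel for each mip level. This is a minimal sketch; the 2,048-pixel size and the 16-byte-per-texel RGBA float format below are illustrative assumptions, not figures from the session.

```c
#include <stddef.h>

/* Bytes for one mip level: width * height * bytes per texel.
   A full mipmap chain adds roughly another third on top. */
size_t texture_bytes(size_t w, size_t h, size_t bytes_per_texel)
{
    return w * h * bytes_per_texel;
}
```

For example, a 2048 x 2048 RGBA texture stored as 32-bit floats (16 bytes per texel) costs 64 MB, while a 256 x 256 RGBA texture at 8 bits per channel (4 bytes per texel) costs only 256 KB.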
So, OpenGL's objects are a very
efficient way to use the API.
However, there is still some cost to binding an object.
You really need to determine this
cost through profiling, however.
Batch your draw calls to reduce the
binding of more expensive objects.
OK? Let's say you've determined that some texture
takes a really long time to bind, or takes a fair
number of CPU cycles to bind, and two objects use this texture.
Don't bind it, draw, bind it again, and then draw.
Instead, bind it once, draw the first object,
and then draw the second one.
Visibility.
OpenGL processes everything sent to it in some
form, even if it's ultimately not visible.
You know, we're not going to draw something
that's, you know, behind the camera.
But there is some processing that needs to occur.
The CPU needs to process the draw call.
The vertex shader needs to run to determine whether
or not the camera can actually see this object.
So try not to send objects that can't be seen to OpenGL at all.
Imagine this is the scene of your application.
Here's the view point, the user, the camera.
Here's its field of view.
And here is the frustum.
OK, this includes the right and left clip planes,
the top and bottom clip planes,
and the Z near and far clip planes.
So now your app will iterate through your list of objects
and determine whether or not they're visible.
Anything within this area we'll send to OpenGL.
Anything outside of this frustum we'll
just discard, we won't send it to OpenGL.
We check this robot; he's clearly behind the camera.
We won't draw him.
We check this hero character; he's off to the right.
We won't draw him.
We check this demon; hey, he is definitely
visible, so we'll mark him as object 0.
We check this other robot; hey, you know, he's visible.
So we'll draw him.
We check this other demon character;
he's not visible way off to the left.
We won't send him to OpenGL.
This hero character, he's definitely visible.
He's also in the frustum.
So we'll send him to OpenGL and then
this demon character also visible.
We'll send him to OpenGL.
OK? So now we have the set of objects
that we want to send to OpenGL.
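A common way to implement that frustum test is a bounding-sphere-versus-planes check: an object is culled if its bounding sphere is entirely behind any one of the six planes. This is a minimal sketch assuming the planes are stored with normals pointing into the frustum; the Vec3 and Plane types are my own, not from the session.

```c
typedef struct { float x, y, z; } Vec3;

/* Plane nx*x + ny*y + nz*z + d = 0, normal pointing into the frustum. */
typedef struct { float nx, ny, nz, d; } Plane;

/* Returns 1 if the bounding sphere is at least partly inside all six
   planes, 0 if it is fully outside any one of them. */
int sphere_visible(const Plane planes[6], Vec3 c, float radius)
{
    for (int i = 0; i < 6; i++) {
        float dist = planes[i].nx * c.x + planes[i].ny * c.y
                   + planes[i].nz * c.z + planes[i].d;
        if (dist < -radius)
            return 0;   /* completely behind this plane: cull it */
    }
    return 1;
}
```

Objects that pass go into the visibility list; everything else never reaches OpenGL.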
You can put this inside of a visibility list.
Ok? Now one thing to note about this
is you don't want to draw the objects
in the same order they were determined visible.
Ok? You want to sort it by render state.
What would happen here is you'd bind the demon's texture,
draw him, bind the robot's texture, draw him,
bind the hero's texture, draw him,
and then we'd rebind the demon's texture
that we bound originally and draw the second demon.
OK? Instead what you want to do is sort them, so
that now the demons are together: one bind, two draws.
All right, so here's an algorithm that
you can use to sort your rendering state.
It's called a state tree.
So let's say you've got these four characters.
You've got these two guys and they're
clearly using the same texture.
And then you've got this human character
with some metal armor on him, and this robot
that's clearly all metal, so you want a shader
that does a little shininess effect.
They use the same shader.
You stick them inside of a tree
and you traverse it in order.
So starting from the top, you bind the first shader.
You bind the texture that's used by these two guys.
You bind the VAO, and you render the demon.
You go up and you've already bound
the texture, don't do it again.
You bind the new VAO and you render this guy.
Go up to the top and now we bind this new GLSL program.
This shininess program.
And we bind the texture, bind the
VAO, render this robot, go on up.
You've already bound the GLSL program, so now we just bind
the texture for this human guy, bind his VAO, and render him.
Now it looks kind of like a binary tree here,
but it would actually be an n-ary tree.
This is a very simplistic scene.
You probably would have, you know, many, many
shaders, which would make this tree much wider.
Also, you might want to account
for different rendering state.
Like you might want to account for depth, blend, clip
state, etc., which would make this a deeper tree.
OK? In this case I determined that the GLSL program
objects I said, well those are pretty expensive to bind.
So I'm going to put them at the top of the tree.
We're going to set those first, so that
we don't have to rebind them very often.
OK? So the more expensive objects at the top of your tree,
less expensive objects like vertex
arrays towards the bottom.
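One simple way to get the same effect as a state tree, without building an actual tree, is to sort a flat list of visible objects keyed on the most expensive state first. This is a sketch under that assumption, with made-up integer handles standing in for GL object names:

```c
#include <stdlib.h>

/* One visible object, tagged with the GL objects it needs.
   Integer handles stand in for real GL object names. */
typedef struct { int program, texture, vao; } DrawItem;

/* Order by the most expensive object first (program), then
   texture, then VAO, so equal states end up adjacent and
   each object only has to be bound once per run. */
static int by_state(const void *a, const void *b)
{
    const DrawItem *x = a, *y = b;
    if (x->program != y->program) return x->program - y->program;
    if (x->texture != y->texture) return x->texture - y->texture;
    return x->vao - y->vao;
}

void sort_by_state(DrawItem *items, size_t n)
{
    qsort(items, n, sizeof *items, by_state);
}

/* How many texture binds a given draw order would cost. */
int count_texture_binds(const DrawItem *items, size_t n)
{
    int nbinds = 0, last = -1;
    for (size_t i = 0; i < n; i++) {
        if (items[i].texture != last) {
            nbinds++;
            last = items[i].texture;
        }
    }
    return nbinds;
}
```

For the demon/robot/hero/demon example above, the unsorted order costs four texture binds and the sorted order costs three.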
One way in which you can reduce CPU overhead, the CPU
overhead of draw calls, is to combine the draw calls.
Basically, make fewer draw calls.
And there are two methods that I'll talk about.
One is texture atlasing.
And this is basically combining
multiple textures into a single texture.
And the second one is instancing.
Instancing requires some special hardware
that's only available on Mac OS X.
My colleague Matt Collins will be talking about
that a little bit more tomorrow and how to use that.
I'm actually going to talk about texture atlasing.
Here you have these four characters, and you've
got these four textures to map to these characters.
The naive way to draw this would be to bind the first
texture, draw with it, bind the second texture, draw with it,
bind the third texture, draw, bind the fourth texture, draw.
OK, you're binding four times for four textures.
Instead you can combine them into one uber texture.
A texture atlas.
Then you bind that one texture, draw, draw, draw, draw.
One bind, four draws.
Much more efficient use of OpenGL.
Here's an example of a texture atlas used in the Quest demo.
Here we've got a lot of different
elements in a single texture.
We've got some flags on the upper right.
We've got a stone wall in the upper left.
We've got stairs, we've got doors, we've
got a statue, all in one texture, one bind,
and they can draw a ton of different
things in their dungeon.
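When tiles are packed into an atlas, the model's UVs have to be remapped from per-tile space into atlas space. Here's a minimal sketch for an atlas laid out as a uniform grid; real atlases, like the one in the Quest demo, are usually packed irregularly, so the offsets and scales would come from a lookup table instead.

```c
typedef struct { float u, v; } UV;

/* Map a tile's local UV (each in 0..1) into atlas UV space, for an
   atlas laid out as a grid of cols x rows equally sized tiles. */
UV atlas_uv(int tile_col, int tile_row, int cols, int rows,
            float u, float v)
{
    UV out;
    out.u = (tile_col + u) / (float)cols;
    out.v = (tile_row + v) / (float)rows;
    return out;
}
```

You'd typically bake this remapping into the vertex data once at load time rather than doing it per frame.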
Multithreading in OpenGL.
So because there is a fair amount of CPU overhead that
OpenGL incurs, there are reasons to multithread it,
so that you can amortize the CPU
costs across multiple cores.
This makes sense on iOS 4 devices as well,
even though they only have a single core.
CPU intensive calls can block.
And that means that while they're blocking, while
you're in your main loop doing some OpenGL processing,
you can't handle UI events, you can't
handle audio, you can't do your app logic.
Additionally, if you're doing a lot of stuff on the main
thread, there is this watchdog process that looks
at your app and determines whether
you're in some infinite loop,
whether you're behaving badly, and it may kill your app.
So it may mistake your app for being stuck in an infinite
loop, or blocked in some sort of deadlock,
and just kill your application.
So you could put some of this processing on another
thread, and the watchdog won't do this to you.
So here is the simplest multithreading
technique I'll describe.
And basically you create a second, maybe a third
thread, each with its own OpenGL context.
And you can use these threads to load data.
Load textures.
Load vertex data.
Compile shaders.
A lot of CPU heavy lifting.
An important thing to know is, once you're done with
all that loading, you want to kill your other threads.
You only want to have one thread
running with an OpenGL context.
OK? Because when two OpenGL contexts
are running at the same time,
there is some CPU overhead that might be
incurred, because there's some locking.
And so two OpenGL threads can block one another.
So just have one OpenGL thread running at a time.
Another more advanced technique is
to use a producer consumer paradigm.
So in this case the main thread produces data:
it can produce, you know, the animation frame
your characters are in, their positions on the screen
or in the world, and it can compute the visibility
of your objects using that frustum culling technique.
On this thread you won't have an OpenGL context; you're
just producing data to be rendered by a second thread.
When the producer thread is done,
you signal the render thread to begin.
And then the render thread can take all that
data, consume it, and render with OpenGL.
That render thread has the only OpenGL context on it;
you don't have two OpenGL contexts.
The main thread can then process
audio input and other app logic
in parallel while the render thread
is actually drawing stuff.
And let's say you still have some CPU
overhead and it's not well balanced.
You can move some work, like maybe the visibility
test, to the render thread to get
a more even distribution.
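Here's a minimal sketch of that producer/consumer handoff using a single-slot mailbox and pthreads. The "frame data" is reduced to an integer and the GL drawing to a counter, which are my simplifications; in a real app the consumer thread would own the only OpenGL context and do the actual rendering.

```c
#include <pthread.h>

/* Single-slot mailbox: the producer (main thread) deposits one
   frame's worth of render data; the consumer (render thread,
   which would own the only GL context) picks it up. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    int frame;        /* the "data": here just a frame number */
    int has_frame;
    int done;
    int consumed;     /* frames the consumer has "rendered" */
} Mailbox;

static void *render_thread(void *arg)
{
    Mailbox *m = arg;
    for (;;) {
        pthread_mutex_lock(&m->lock);
        while (!m->has_frame && !m->done)
            pthread_cond_wait(&m->ready, &m->lock);
        if (m->has_frame) {
            m->has_frame = 0;
            m->consumed++;               /* draw with GL here */
            pthread_cond_signal(&m->ready);
        } else {                         /* done and no frame left */
            pthread_mutex_unlock(&m->lock);
            break;
        }
        pthread_mutex_unlock(&m->lock);
    }
    return NULL;
}

int run_frames(int nframes)
{
    Mailbox m = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                  0, 0, 0, 0 };
    pthread_t t;
    pthread_create(&t, NULL, render_thread, &m);
    for (int i = 0; i < nframes; i++) {
        pthread_mutex_lock(&m.lock);
        while (m.has_frame)              /* wait for previous frame */
            pthread_cond_wait(&m.ready, &m.lock);
        m.frame = i;                     /* animate, cull, etc. */
        m.has_frame = 1;
        pthread_cond_signal(&m.ready);
        pthread_mutex_unlock(&m.lock);
    }
    pthread_mutex_lock(&m.lock);
    m.done = 1;
    pthread_cond_signal(&m.ready);
    pthread_mutex_unlock(&m.lock);
    pthread_join(t, NULL);
    return m.consumed;
}
```

While the render thread is consuming a frame, the main thread is free to handle input, audio, and app logic before producing the next one.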
So I've talked a lot about using the
CPU, or the CPU and the GPU together.
You need to also consider using the
GPU with the display, the other device.
That is important to consider when coding your OpenGL.
One thing to note is you can only render
as fast as the display can refresh.
It doesn't make sense for you to render at 200 frames
a second if the user can only see 60 frames a second.
OK? It just wastes battery power.
You're doing a lot of processing whose
result the user will never see.
On iOS 4 you can use the CADisplayLink API.
We see a lot of apps using the NSTimer
API to initiate their per-frame rendering.
Instead, use the CADisplayLink API, because when NSTimer
fires is arbitrary with respect to the display.
What will happen is NSTimer might fire right before
the refresh, and so there's going to be some latency
between the time you draw and then you can see it.
Or it might be fired right after display has refreshed.
So it's not going to be consistent with respect to display.
And in some very pathological cases
it can reduce your frame rate.
On Mac OS X you can use the analogous API, CVDisplayLink.
Now we see a lot of games trying to control their main loop.
This is a kind of an outdated way of doing it.
Because, again, you don't need to display, or don't
need to render any faster than the user can see.
Looping more than needed wastes power, particularly on
these MacBooks where people want a fairly long battery life.
You could implement some benchmark mode for advanced
users or developers that runs on the main loop.
But in your shipping game, under normal running, you may
just want to use CVDisplayLink to initiate your rendering.
A note about coding for both platforms.
So OpenGL ES is a subset of OpenGL on the desktop.
So if you code using OpenGL ES 2.0, you can port the code
that you've invested a lot of time in on iOS 4 onto the Mac.
And vice versa.
OK? So if you've got this Mac application, and if you
stick to all the calls that OpenGL ES 2.0 provides,
you can pretty easily port it to the iPhone OS 4.
There are some things to be aware of.
Clearly there are more memory and performance
constraints on the embedded iOS 4 devices.
So you need to consider that.
There are different compressed formats.
You will need to translate for that.
And there are, for some kind of silly
reason, slightly different function names.
Now, the parameters of these functions are exactly the same.
On OpenGL ES we have this BindVertexArrayOES
function, and it's the exact same function
as BindVertexArrayAPPLE; it just
has a slightly different name.
So you just need to rename your functions.
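One low-tech way to handle the renaming is a macro shim, so shared code calls a single name on both platforms. This sketch uses stub functions and a made-up TARGET_GL_ES define as stand-ins; in real code the macros would map directly onto glBindVertexArrayOES and glBindVertexArrayAPPLE.

```c
/* The two platforms' entry points differ only by suffix, so one
   macro per function lets shared code use a single name.
   These stubs just record the bound VAO for illustration. */
static int bound_vao = -1;

void glBindVertexArrayOES_stub(int vao)   { bound_vao = vao; }
void glBindVertexArrayAPPLE_stub(int vao) { bound_vao = vao; }

/* TARGET_GL_ES is an illustrative define, not a real GL symbol. */
#ifdef TARGET_GL_ES
#define glBindVertexArrayShim glBindVertexArrayOES_stub
#else
#define glBindVertexArrayShim glBindVertexArrayAPPLE_stub
#endif
```

With a small header of such shims, the rest of your rendering code compiles unchanged on both platforms.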
The sample code that I provided for this session that
you can find for the session's site, compiles for both
and you can kind of use it as a template or
maybe a starter for creating your application
that you might want to port and ship on both platforms.
It's this kind of cool little reflection demo.
And it works pretty well on both platforms.
So in summary, there is a fair amount of
CPU overhead that OpenGL needs to incur,
and you can minimize this to efficiently access the GPU.
Validation is where a lot of this CPU overhead occurs.
You can use OpenGL objects to cache this validation,
and minimize state changes and draw
calls to reduce it.
For more information you can contact our Graphics
and Game Technologies Evangelist, Allan Schaffer.
As well there's a ton of documentation
on OpenGL at the OpenGL Dev Center.
And there are a lot of engineers from Apple at
these Apple Developer Forums, so you can get help
and ask questions throughout the
year at this devforums.apple.com.
Also, I've posted a bunch of code snippets
and the sample code at this link down here.
So you can check that out.
There's a Q&A on the different variables that you shouldn't
use in your GLSL program and some other information.
This is just the first of six OpenGL
sessions at this year's WWDC.
And there's some great information for both
Mac and iPhone developers, or iOS 4 developers.
The first is the OpenGL ES Overview for the iPhone,
and that's mainly geared toward iOS 4 developers.
And the second is OpenGL ES Shading and Advanced Rendering.
And this is a pretty cool one.
My colleagues have come up with some great demos.
And even if you're a Mac developer, a lot of
these demos are easily portable to the Mac.
So they're going to be talking about some shadows,
some reflections, some really cool techniques
that you should probably check out
even if you're a Mac developer.
There's also an OpenGL ES Tuning and Optimization session.
And this is going to be great for developers
coding for iOS, but also some of the techniques
that they mention you'll be able
to use and leverage on the Mac.
They're going to be describing a tool that's
available on iOS 4, the OpenGL Analyzer.
And even though it works only on iOS 4,
if you're a Mac developer you can use some
of the same techniques to profile your applications.
They're also going to be talking
about some of the GPU bottlenecks.
Now I talked a lot about CPU bottlenecks, but there is a
whole class of bottlenecks on the graphics processing unit
that you should be aware of and potentially could run
into after you've optimized the CPU portion of your app.
And then tomorrow morning at 9 o'clock there's OpenGL
for Mac OS X, and my colleague Matt Collins will be talking
about a number of the newly available features
on Mac OS X: instancing, texture arrays.
He's also got some really cool demos
that you might want to take a look at.
And then finally there's this Taking
Advantage of Multiple GPUs session,
and they're going to be talking
about OpenCL, the other GPU API.
And it's used in conjunction with OpenGL
and leveraging multiple GPUs on the Mac Pro
to do some really cool processing with OpenCL and really
great rendering with OpenGL on two different devices.
So thanks a lot for coming.
I really appreciate it.
And I'm hoping you guys will be able to take these
techniques and really efficiently use OpenGL.
[ Applause ]