WWDC2004 Session 210
Transcript
Kind: captions Language: en

Okay, so my name is John Stauffer. I manage the OpenGL engineering group, so let's get right into it.

What we're going to talk about today is a brief overview of what's new in OpenGL. This will be what we have done differently or optimized since WWDC last year, to give you a yearly update. As we go through this session we're going to go into some basic tips. We'll try to keep this short, because we want to get into the more advanced optimization techniques for OpenGL. Then we're going to get into some detail on optimizing OpenGL texture uploads, and then a new optimization technique for asynchronously reading pixels back off the graphics card, which is very important for people who are trying to get pixel data back. Vertex data throughput: how to optimize your vertex data uploads, making sure that you're getting optimal performance uploading your vertex data to the GPU. One-shot images: sometimes you have images that are small or aren't going to be reused, and you just want to draw them once to the screen, so we'll talk briefly about how to optimize that. Pixel copy operations: how to optimally copy pixels around. There are certain ways that you can optimize that, making sure that they are VRAM-to-VRAM copies. And using threads: there are a lot of people out there trying to use threads, and we see a lot of problems with that, so we want to briefly cover it and make sure that people understand the limitations and how to make that work optimally.

So briefly, optimization strategies. There are two basic things that people are trying to do. They're either trying to maximize the performance of their application, or they may be trying to minimize the CPU burden. Depending on what your application's demands are, you'll want to focus on one of those two types of strategies. And as you'll see in my presentation, effectively using the CPU can lead to greater optimizations than simply offloading all of the graphics processing work onto the GPU. There are
ways to balance that and get further performance by burdening the CPU. But applications that want to offload as much work as possible, because they are doing things like decoding an image or reading from a disk, may simply want to take all of the burden of drawing with OpenGL and offload it to the GPU. So there are two different types of techniques. Be sure that you understand that the CPU is still a very effective processing device for getting performance.

So, key concepts. Eliminating CPU copies is very important. One of the key concepts that I go through here over and over is how you eliminate copies of the data as it goes through the OpenGL pipeline, and that is one of the keys to getting performance. What you'll find is that data can be copied multiple times, and limiting those copies, getting a direct pipeline set up such that the data goes directly to the GPU, is your path to getting high performance.

Caching data in VRAM, and this can be textures or vertices, is also very important. You want to be able to leverage the large bandwidth that the graphics processor has to its memory. It typically is going to be, let's say, four or five times the bandwidth that you can get from the CPU's memory bus. So if you have static data, you want to move it into VRAM, leave it there, and reuse it: caching it in VRAM and letting us manage the VRAM memory space for you.

Maximize asynchronous behavior. Obviously you want to minimize your synchronization points between the CPU and GPU. You don't want to have points that stall and lead to synchronous behavior between the CPU and GPU. You want to be able to operate asynchronously, allowing both the CPU and GPU to perform at their fullest. And using threads, briefly, as I had mentioned before.

Okay, so what's new? Well, we've spent a year optimizing OpenGL. Some of the highlights are that we've been focusing quite a bit on immediate mode performance. There are a number of applications out there that have been ported that use immediate
mode drawing, and that makes it a key optimization, so we've been spending a lot of time on that to try to help those types of applications coming over to the platform.

Pixel transfer paths: what I mean by that is any kind of copying of pixel data, ARGB data, RGB data, what have you. We've been spending a lot of time optimizing those paths and continuing to improve them, so if your application is sending a lot of pixel data, we're working on those paths.

Vertex program emulation: for any kind of application that needs to run across all of our hardware and is relying on the vertex program feature, we are continuing to improve the emulation of vertex programs on the CPU, so you can run a vertex program on all of our platforms and get the best performance possible.

Asynchronous texture downloads, as I mentioned before: that is also a new feature since the last time we talked, and something that is very important.

As you can see, we have a list of extensions that we've added since last year, quite a few. We're continuing to add features regularly, and as fast as they get approved and made part of the OpenGL standard, we will fold those into our OpenGL.

Okay, so some basic tips. One thing that you want to avoid is glFlush. What glFlush does is truncate the command stream going to the graphics processor and flush those commands up to the GPU. Now, there are two reasons you want to avoid glFlush. One is that it's a kernel trap, so you want to avoid that kernel trap. Secondly, you only have a limited number of command buffers, so if you keep issuing glFlush calls back to back you will run out of that resource, and that will be a synchronization point where we will have to wait for the graphics processor to finish processing the command buffers in flight before we can get a free command buffer to start working with again. So we tell people just to avoid glFlush. Now, there are times when you'll
want to use it, and we'll go into that a little bit later.

I tell people never to use glFinish. glFinish is a truly synchronous call. What glFinish does is submit the current commands, stop, and wait for the graphics processor to be done with those commands before it will return. So it is truly a serialized synchronization point that will cause the CPU and GPU to stall against each other. I tell people just don't call it at all.

Avoid glReadPixels if you can. You would want to use some of our more modern ways of doing it. One of the techniques for replacing glReadPixels is to use glCopyPixels, and glCopyPixels is useful for a range of different copies. For instance, if you wanted to save off some pixel data, some depth data or stencil data, then instead of reading it across the bus, saving it with the CPU, and then uploading it back, what you want to do is use glCopyPixels to store it somewhere else in VRAM. Don't read it across the bus; save it in another buffer up in VRAM, and that way you get the high bandwidth when copying it and restoring it when you need it. Asynchronous texture downloads are also a good way to replace your glReadPixels: you get an asynchronous readback of your data and don't have a stall waiting for the glReadPixels to finish.

Again, immediate mode performance: we've been optimizing it, but it is still one of the slower paths, so we tell people when possible to avoid immediate mode drawing and instead use some of our more advanced extensions. Now, there's one exception to this, and that is display lists. If you use immediate mode drawing in a display list, we will take that immediate mode data, convert it into a more optimal form, and then upload it and cache it in VRAM for you. So a display list is an acceptable place to use immediate mode, and it turns out that's fairly convenient for a lot of people who are already using immediate mode but realize their data is static. If your
data is static, you wrap a glNewList and glEndList around it, we will post-process the data and stick it in video memory, and then you'll get the benefit of that optimization.

Minimize state changes. Most people who have been working with GL know this one. State changes are expensive: they cause a revalidation of the hardware state, which can be slow if you do it a lot. So avoid redundant state changes and do your drawing in groups of state. You want to coalesce your drawing under a given state setting, which allows you to minimize your state changes.

Okay, so let's get into more detail: texture uploads. We're going to talk about the texture pipeline overview and give people a brief description of what the pipeline looks like. We're going to talk about some of the optimization basics, and then we're going to get into some of the extensions. The extensions can be different depending on whether you're talking about a power-of-two or non-power-of-two texture, so we'll differentiate a little bit between those two types of textures. For people who aren't familiar with that, there are power-of-two textures, which are more standard OpenGL, and over the last few years there have been non-power-of-two textures, which allow you to have a texture of any size. That is very useful for general image data: for video, pictures, what have you, you will use non-power-of-two.

So here's a basic diagram of the OpenGL pipeline. For this section of the talk we're going to focus on the pixel pipeline. Looking at just a block diagram of what the pipeline looks like in standard OpenGL on Mac OS X, if you're passing a texture through the system you can end up with four copies of the data. Obviously that's a lot, and you want to avoid those. So what we're going to talk about is how to eliminate each one of those copies and get you performance increases, obviously, as you
do that. But in the default setting, you can end up with four copies of your texture as it passes through the system. One copy is going to be the copy that you have, one is going to be what the framework has, one is going to be a copy that the driver keeps, and then one is going to be in video memory.

So let's get into some of the ways to optimize that. Again, minimizing CPU copies is the key here. We don't want to give the CPU redundant things to be doing; we want to optimize its time. Correct setup will minimize the CPU copies, and what we mean by that is that you want to use the right texture formats and the right pixel formats, which will ensure optimal paths. It will also ensure that the graphics processor accepts that data type. OpenGL supports a very large number of pixel types, and the graphics processors also accept quite a few pixel types, but if it's possible you want to stay within a confined set, such that you are guaranteed the particular graphics processor you're on has native support for that type and it won't have to go through some kind of conversion to a type that is compatible with that graphics processor.

Here I've got listed three types: BGRA, the 8_8_8_8 reversed, and the 1_5_5_5 reversed. Those are the native Macintosh formats. So when you set your monitor to 32-bit pixel mode, or millions of colors, or thousands of colors, those are the pixel types that the screen is running in, and that will give you a compatible type. It also turns out the graphics processors understand those types natively. You'll also see a YUV type there. People who are doing video, or who have a YUV source, can use a YUV texture, and that will be accepted as well.

When I put these up, some people usually ask: what about RGBA, which is the standard OpenGL type? RGBA isn't natively accepted by all graphics processors. Sometimes it will have to go through a copy and get swizzled into a different format. So usually it's
a fairly optimal copy, and sometimes it might be natively supported, but in general you have to be a little careful with that type.

So let's talk about the extensions. Client storage is an Apple extension. That extension is a way to eliminate the framework's copy of a texture. Instead of having the framework make a copy of the texture, the framework keeps a pointer to that texture in your memory. So if the application has a retained copy of the texture, you can just tell us: use my copy, don't make a copy for yourself. That will eliminate one CPU copy, and it will eliminate the memory associated with keeping that copy.

Apple texture range is another Apple extension. This extension eliminates the driver's copy, and there are two different ways to drive it: cached and shared. Cached is telling the driver to keep a copy of the texture in video memory. Shared means simply point to the copy in system memory when you're doing your drawing. I'll get into a little more detail on this in a bit, but keep those concepts in mind; they are important.

Now, EXT texture rectangle is an extension required by some hardware to allow texture range to work properly. The reason for that is that some hardware requires a power-of-two texture to be laid out in a certain format. So if you don't use that extension, you're not necessarily guaranteed on all graphics processors to eliminate the graphics driver's copy of the texture. Keep in mind that texture rectangles tend to be more widely supported for eliminating the driver copy when using texture range.

Okay, so let's go back to the block diagram. What we see is that by using client storage we eliminate the framework copy, as I said. Looking at a little bit of source, it's fairly simple: all you do is enable the client storage option when you are building your texture. So before you load it, just
call glPixelStorei, enabling the client storage option, and that will eliminate the framework copy. Do remember that when you do that, you are now responsible for the memory. If you go and delete your copy of the texture, the framework is still pointing to it, and if you then do something that requires us to access that memory, you'll crash.

Now, looking at the Apple texture range and texture rectangle extensions: as I said before, this eliminates the driver's copy, and I'm showing a block diagram here of it running in cached mode. Using these two extensions, the driver will be pointing directly at the framework's copy and DMAing it directly from the framework's copy up into video memory, keeping the copy in video memory. That's running in cached mode. And again, here's a little code snippet showing how to use texture range. It's very simple, one call: glTexParameteri, with the texture rectangle EXT target type, the storage hint Apple parameter, and then storage cached Apple for the hint type.

Now, using all of these together, what we get is that the GPU is going to point directly at the application's copy of the memory, and it's going to be DMAing it directly into video memory. So you've eliminated the CPU ever making a copy: we point directly to your copy of the texture and DMA directly. The CPU never actually makes a pixel-to-pixel copy; the graphics processor is DMAing it directly into video memory. Looking at those code snippets together, it simply looks like this. It adds basically two calls. You'll see that I'm using a texture rectangle type, and I've inserted the two previous code snippets between the bind and the glTexImage2D call, so that I'm getting the direct DMA transfer as I just showed.

Now, switching gears a little bit and looking at the shared option. The shared option, as I said before, makes it such that we are not going to cache a copy of the texture in video memory. Instead, what we're
going to do is set it up such that the graphics processor looks up the texels at rasterization time directly from system memory. So while it's drawing the polygon, each time it goes to fetch a texel it will go across the AGP bus and look it up out of system memory. That eliminates the copy that's in video memory. There are some uses for that, and I will show a demo of it shortly.

Here's what the code looks like. It's the same glTexParameteri call as for cached, except that instead of the storage cached Apple hint type you would pass the storage shared Apple hint type, and that means: don't cache a copy in video memory. Looking at the block diagram again, this is what it looks like when you run in shared mode. There's no copy in video memory. As it's rasterizing, it is looking the texels up directly from the application's memory, and you end up with one copy, your copy in the application. There are times when this is useful and times when it's better to use cached. Okay, so it's the same code snippet; all I had to do to get the shared option in there was change the cached to shared.

I'm going kind of fast, so I should point out that this example is actually available on the website. You'll see at the bottom that there is the sample name. You can look it up, and it has this code in there, so you don't need to be copying this down.

Okay, let's talk a little bit about cached versus shared, and when one is appropriate versus the other. Cached mode is better for textures that you're going to use multiple times. You don't want to be reading a texture across the AGP bus multiple times, so if you're going to use it a lot, you're going to want to cache it in video memory and then use it from there multiple times without going back across the bus. It's also best when you're using linear filtering. Linear filtering has a little bit higher bandwidth usage because it has to look up neighboring texels to do the
linear filtering.

And now shared; let's talk about that for a second. Shared is better for one-shot images that are very large. The reason that I say very large is that if you have a low video memory case and you want to upload a large texture, what you don't want to have happen is for that texture to evict everything else out of video memory. So if you're running in a low video memory case, it is possibly a benefit to run in shared mode, where you're not going to consume any video memory. You can leave what's resident in video memory there and just look up your image with a straight DMA as you're rasterizing, as opposed to DMAing it into video memory.

There are some caveats to that. Shared mode runs best when running with nearest filtering. As I said, as it's rasterizing it's going to be looking up those texels, and linear filtering requires neighboring texels, so it will fetch more texels from the neighboring part of the texture, which will cause you some performance degradation. Shared also works really well when you're scaling the image down, for the same reason I just gave: if you're scaling it down, it's not going to have to pull all the pixels across the bus.

So, power-of-two: to talk about that briefly, all the same extensions I just talked about are actually applicable to power-of-two textures. This is the same code snippet I had, and all I did was replace the rectangle texture option with the texture 2D enumerant type. The difference here, as I mentioned before, is that power-of-two sometimes won't get you a direct DMA. Not all graphics processors support direct DMA for it, and instead what will happen is that the driver will make a copy. You can still use all these same extensions I've been talking about, but sometimes you won't get the direct DMA.

Okay, let's talk about how to manage texture range. Now, as we saw in the diagram, the graphics processor is going to be
starting to look directly at your memory. So the graphics processor and the CPU are now going to be sharing the same piece of memory as it's rasterizing. There's a problem with that, and that is that you are now going to have to synchronize the CPU and the graphics processor such that they don't collide. You can't have the CPU writing and the GPU reading the same piece of data at the same time, right? It's the standard problem when you have multiple devices looking at the same piece of data. So what you're going to have to do is double buffer it, and between the buffers you're going to have to use glFlush.

I've got a diagram here to show this. Say you're running in single-buffer mode and the CPU has just generated a texture, let's say read it in or decompressed it. Now the CPU is going to want to flush that up to the graphics processor, so it issues the glFlush to get that command in flight and get the data transferred up into video memory, and then the graphics processor is going to do its work of processing it and swapping it to the screen. So there you just did one frame. When single buffered, I have to synchronize my CPU and GPU, serializing the processing, because I only have one data set and only one of them can work on it at a time. So as we build through this, this is how the frames go: I can only have the CPU and GPU working one at a time.

Now if I go with double buffering, let's see what happens. We start the sequence: the CPU generates a frame and flushes it. Now in the next frame, since I have double buffering, the CPU can start working on the second texture while the GPU is processing the first. So now we can flush the second one and swap the first one: I just showed one frame while I'm submitting the next one to the graphics processor for processing. And likewise it continues on. Now the CPU can start working on texture one again while the GPU is working on texture two, and
so on and so on. This is an exaggeration, where you have perfect parallelization, but it does make a difference when you are getting asynchronous behavior between the CPU and GPU.

Okay, so let's talk about how to synchronize. How do we synchronize the two? How do I know when the GPU is done processing my data? It's pointing at my data set, so how do I know when it's done accessing that data? Well, if you're using texture range, what you need to use is the Apple fence extension. With the Apple fence extension you insert a token into the command stream, and you can query for the token to determine when the GPU is done reading your texture, so that you'll know it is now safe for the CPU to start touching the data again. There are a couple of ways to do that. You can use it by inserting a token, or you can use it directly by referencing a texture object. A texture object is just your standard texture ID, and in the test object command you just send it the GL texture object type.

Looking at a little bit of code for that, the first two commands here show how to set up a fence. You just call glSetFenceAPPLE. You can pass it any name you want; it's just a token that you pass in the command stream. Then, when you are ready to start touching the texture data again with the CPU, you test to make sure that the GPU is done. That's a synchronization point where the CPU will wait for the GPU to be done reading that data, at which point you can start touching it again with the CPU. The last command up on this screen is the way that you could use the fence extension without having to set a fence explicitly. You can just test for a texture object, and all you would do is call glFinishObjectAPPLE with the GL texture object type and the name of the texture. So if you bound to a
texture, you would just test against that same texture ID number.

Okay, so we're going to switch to demo machine two. What I wanted to show here is the example of texture range. The first thing I want to point out very quickly is that the CPU is doing very little here, and that's a sign that the CPU is not making copies of the pixels. If the CPU were copying the pixels, I would be seeing a big spike of CPU processing time. Instead, what's happening is that the graphics processor is talking directly to the memory controller and getting a direct DMA, so the data is not going through the CPU.

Now I'm going to turn on my infinite button. It's a button I recommend everyone put in their app: make it go infinitely fast. So even though I'm now doing 240 frames a second and getting a gigabyte a second across the AGP bus, I still have no CPU time, because again the CPU is not doing anything here. The CPU is orchestrating this and not copying the data; it's simply directing the traffic, as you might think of it.

But what I really want to show with this example is the shared option. As I said before, the cached option is good for drawing multiple times, but the shared option is good when you shrink an image down. Now, it turns out this image actually is 1024 x 1024; it's a lot larger than my window. So if I switch to shared mode, you'll see I'm now getting three gigabytes a second. Well, that's not even possible. The application thinks it is, though, and the reason is that I've shrunk the image and it's using nearest filtering. Some of the texels of that image aren't actually going across the AGP bus, because the graphics processor is skipping scan lines and skipping pixels and only selecting the ones it needs to draw the image. So now I'm getting 700 frames a second, as opposed to one gigabyte a second and 240 frames a second. So it's quite a boost. It doesn't take any video memory, because I'm not
caching a copy in video memory; I'm bringing it directly across the bus. So there are times when this type of technique can be a large win. Okay, I'm going to switch back to slides.

Okay, asynchronous texture downloads. Asynchronous texture downloads are basically the same thing as uploading the texture, which we just talked about, where you set up a texture as an AGP texture for direct DMA. Asynchronous texture download is the same setup, but in reverse. You set up the texture the same way, and then you use glCopyTexSubImage2D to copy the data into that texture. The way it works is that glCopyTexSubImage2D is the call that initiates the transfer from video memory back into your texture in system memory, the reverse direction, and that is an autonomous call that will happen asynchronously. The next time you issue a flush, there will be a copy-tex-sub-image command in there, and the flush will issue a DMA transfer from video memory into system memory. That's autonomous; the CPU doesn't need to wait for that event to happen.

Now, there needs to be a synchronization point, because the CPU needs to know when that's done. So at some later point you use the glGetTexImage call, and that's the synchronization point that will wait until the transfer is done. Hopefully you've done enough processing between your glCopyTexSubImage2D and your glGetTexImage that the transfer is already done and you don't have to stall and wait. So the idea here is that you separate those as far as you can, maybe double buffer them or triple buffer them, and do some processing in between.

The basic setup, again, is the same. The glGetTexImage call takes the same pointer that you originally passed for the texture, and the parameters must match the setup of the texture. However you set up the texture, those same parameters will be used in these calls as well. And again, you do this as late as
possible to get the maximum asynchronous behavior.

So let's look at a little bit of code. You'll notice the setup is the same as it is for texture upload; it's exactly the same. The download is the key part of this, and the two key calls are glCopyTexSubImage2D and glGetTexImage. If you issue those calls on a properly set up texture, you'll get an asynchronous download. On my systems at work I can get about 500 megabytes a second of download performance, which is usually pretty acceptable for most people, particularly considering that it can be an asynchronous operation.

Okay, let's talk about vertex data. You'll notice in this part of the talk that vertex data setup and optimization is about the same as for textures; it's just a different data type. So a lot of the discussion is the same, and all we're going to do is walk you through it and point out some of the differences and peculiarities. We're going to go through a pipeline overview, we're going to talk about the basics again, and then we're going to talk about the extensions. We're going to separate dynamic and static data, talk about some differences, and get into a little more detail on display lists.

In this part of the talk we're looking at the geometry part of the pipeline, not the pixel part of the pipeline. Let's talk about some basics. The first thing is that, like in the pixel part of the talk, we want to pick the data types that are most optimal, and the most optimal for the vertex paths are floats, shorts, and unsigned bytes. If possible, stick with those types. Most graphics processors will handle those types natively, so you will be able to get optimal upload performance, and for the cases where the CPU might be making a copy, we've spent time optimizing those paths. So these are the ones that will give you optimal performance.

The other basic point of optimizing vertex upload
performance is that you want to avoid function call overhead. Obviously immediate mode, where you're sending one vector of data per call, is pretty inefficient as far as a copy routine goes. What you want to do is use a vertex array call, glDrawArrays or glDrawElements, to get the data through the system with as few function calls as possible. Another good technique is to use CGL macros. CGL macros are a way to directly reduce function call overhead, and I'll show an example; it can be pretty dramatic how efficient you can make the function calls when you start using the function pointer dispatch table directly as opposed to using the top-level library entry points. So that's a concept people may want to keep in mind if they are making a lot of function calls.

Another key concept: when you're passing vertex data into OpenGL, you want to maximize the number of vertices per draw command. If you're using arrays, you want to maximize the number of vertices per draw command, or if you're using begin/end, you want to get as many vertices per begin/end as possible. That can make a significant performance improvement if you can do it.

Another optimization technique is to offload CPU processing onto the GPU using vertex programs. If you have computational processing you're doing on the vertices, think about trying to offload that work to a vertex program on the graphics processor.

Okay, so how do we eliminate CPU copies? That's another key concept, and it's almost exactly the same as with textures. What we can use is the Apple vertex array range extension, which is a parallel API to the texture range extension, and you could also think about using the new ARB standard, which is vertex buffer objects. Those are nearly the same type of API. ARB vertex buffer objects is a cross-platform API that will allow you to optimize your vertex data throughput.

Caching
static data in VRAM is a key concept here. Just like with textures, you want to be able to cache that data in VRAM. Use display lists for static data; again, we'll process that data and cache it in VRAM for you. Display lists are a very effective way to get your data processed properly and into VRAM.

So, looking at the pipeline: like textures, there can be multiple copies of the data as it goes through the pipeline. This is showing immediate mode drawing. In immediate mode we are required to keep the current vertex state before we pass it on to the graphics processor, so using immediate mode you get one extra copy immediately. If I switch to using vertex arrays, I can eliminate that one copy, so I'm saving myself some processing time just by using vertex arrays.

Now let's talk about these extensions. For APPLE_vertex_array_range with dynamic data, you want to pass it STORAGE_SHARED_APPLE; just like on the textures, we're using the shared hint for how we want the data to be treated. For ARB vertex buffer objects, we want to use the DYNAMIC_DRAW_ARB constant, which is the equivalent of our storage shared constant, so it will give you the optimal treatment for dynamic data. When we use these extensions combined with vertex arrays, we end up with the same thing we had for textures: a direct DMA from the application's copy of the data into the graphics processor's pipeline. It reads the data directly into its pipeline for processing, with no copies in video memory.

So, looking at some sample code for how to set up for dynamic data using vertex array range, it's very simple. There are two calls that are key to this. With VertexArrayRangeAPPLE you just give it a pointer and a size, and that tells us how big a piece of memory you want to map. You malloc the data, you give it to us, you tell us where the pointer is and what size it is, and we map it into AGP and prepare it as a
suitable storage area for direct DMA. Now, you need to make sure you flush that data, so you need to tell us when the data has changed. You have to call FlushVertexArrayRangeAPPLE any time you change the data; that includes when you first set it up and when you modify some sub-region. You give us a pointer and a size for what you want us to flush. You can flush a sub-region or you can flush the whole thing. You tell us any time you change the data, and we'll make sure that all caches and all copies are synchronized with your copy of the data.

Okay, vertex buffer objects: a little more setup, not much more though. You'll see that BindBuffer has an object-type binding where you bind to a name, and that gives you the ability to switch between buffer objects. You can create many of these and bind to them as you need to. What you do then is pass in your pointer and size, basically setting up your data size, using BufferData, and in that call you'll see that I'm passing DYNAMIC_DRAW, which tells us that this particular buffer object is going to be set up for dynamic drawing and that we're going to be changing the data frequently. Then you call MapBuffer, and MapBuffer is actually where you get the pointer back. So instead of you allocating the memory, OpenGL allocates the memory for you and hands the pointer back to you. Then you fill out the data and you unmap it. Unmap is the equivalent of our flush: on unmap we flush the data out, and now all the caches have been synchronized and the GPU is ready to use that data. Again, the sample code that shows this, which is listed at the bottom of the slide, is going to be available tonight up on the server so that everyone will be able to download it. We've updated it with vertex buffer objects so that you'll be able to see how to use the new extension.

Okay, static
data. Static data is almost like dynamic data; it just uses different constants. For APPLE_vertex_array_range we use the STORAGE_CACHED_APPLE hint, and for ARB vertex buffer objects we use the STATIC_DRAW_ARB constant. The rest of the setup is the same; basically you just pass in different constants.

Display lists: again, between the begin and end you can use immediate mode, and we will post-process the data to put it in optimal form for uploading. A couple of key things to remember for display lists: you want to keep the data you pass in consistent across vertices. What I mean by that is that if you call Begin and then you pass a color and a vertex, and then you pass a normal and a vertex, you're passing different types of data per vertex. What that does is confuse our optimizer, such that it won't optimize the data and you won't actually get any benefit from it. So for the first vertex, you want to make sure that you're passing all the data you're going to require per vertex. If I need a normal and a color, I pass a normal and a color for the first vertex; then you can call anything you want as long as you don't call something besides the color and normal. I'm not sure that's 100% clear, but we can talk about it more.

Okay, so vertex array range and display lists. What that gets us for static data is that, like textures, we can cache the data in VRAM. There are three different ways to do it: vertex array range, display lists, or vertex buffer objects. They get you basically the same behavior, which is that you're caching static data in VRAM, and now when you reuse that data you're getting the full bandwidth of the graphics processor's bus, and it gets you a significant performance boost.

Okay, looking at static data setup: it's similar to the dynamic data setup, except that you are passing the cached hint
instead of the shared hint for the vertex array range, and for the vertex buffer object all you do is change from the dynamic to the static draw constant.

Now let's talk about how to synchronize your data. If you're using vertex array range, the application and the graphics processor can be sharing the same data, so you're going to need to manage the synchronization and flushing between your draws if you're going to double buffer the data. It's the same type of operation as before: the CPU generates data and flushes it to the GPU, and the GPU then processes the data and flushes it up to the screen. For single buffering, the CPU and GPU are going to be running serialized; they're not going to be operating in parallel, and it's the same all the way through the frames. For double buffering the data, the CPU can process data and flush it to the GPU, and while the GPU is processing, the CPU can start processing data again, and so on through the frames. So we can theoretically get up to double the performance for something that is perfectly parallelized.

Okay, similar to the texture range, you use the fence for synchronizing the vertex array range data. You need to know when the GPU is done processing the data, so you set a fence, or you use the test object mechanism referencing a vertex array object, and that will let you know when that processing has been completed so that you can start touching the data again with the CPU. So the fence extension is what you'll be looking to use for vertex array range. Vertex buffer objects don't require this; they have their own synchronization mechanism: you map, you change your data, and you unmap. So you don't need to use the fence extension for buffer objects, only for the vertex array range extension. And just like
the textures, looking at some sample code: the first two lines are the same, where I'm setting a fence and then finishing against that token with FinishFence. The third line of code down there, instead of using a texture type, uses a vertex array type for testing against a vertex array object. That gives me a synchronization point where I can be guaranteed the graphics processor is done touching the data in the vertex array range, and it allows me to synchronize the graphics processor and the CPU.

So here's a little bit of history. Last year I showed this slide (not quite this one; it went out a little bit further), and I talked about some optimizations. This slide shows exactly the data I showed last year, and with all the hardware changes and software changes we've made over the year, here's where we are this year. So it's a huge increase in performance.

Looking at some charts: if we look at immediate mode performance for 8 vertices per begin/end, we've gone up eight hundred percent; vertex arrays, eleven hundred percent; a thousand percent for vertex array range; and seventeen hundred percent for display lists. Now, this is using only eight vertices per draw command, so it's a very small data set per draw command, and it shows some of the function call overhead associated with small drawing batches. Looking at another point in that chart, this is using 42 vertices per draw command, so this is a more optimal setup, but you'll see that we're still making quite a bit of performance gain, up to four hundred seventy-seven percent for immediate mode. As I said, we've been working quite a bit on immediate mode. So since last year we've increased the performance of our systems by almost five hundred percent, and this is not only a software change but a hardware change. This is comparing state-of-the-art hardware and software last year to state-of-the-art hardware and software
this year: almost five hundred percent faster.

Okay, switch to demo two, please. What I wanted to show here is just some of the effects I talked about. Here I've got a simple mesh that has only eight vertices per strip, and you can see that I'm only getting 1.5 million triangles a second with immediate mode. Now if I increase the detail of this mesh by selecting this option, I'm up to a mesh that has 198 vertices per strip, and you'll see that my performance jumped dramatically: now I'm up to 12 million triangles a second. As I said before, you can reduce function call overhead using CGL macros, so I've got an option here to turn on CGL macros. I just select CGL macros and you'll see that I went from 12 million to 17 million, so I gained five million triangles a second in immediate mode performance by enabling CGL macros. It's a really large difference.

But now let's see what we get when we use a more optimal form of drawing. I'm going to switch from immediate mode to just DrawArrays, and now I go to 24 million. So DrawArrays is more optimal than the best you can make immediate mode. Now let's try some of the extensions we talked about and switch to vertex array range. So that's DrawArrays with vertex array range, and I go to 50 million. Okay, so we're making some pretty good strides here. Now let's say my data is static, which this happens to be, and switch to display lists. Again, display lists are set up to cache the static data in video memory, so I'll be using the bus bandwidth available on the graphics processor, and I go from 50 million to over a hundred million. So we started off at just a few million and now we're up at 100 million. Using the proper extensions and understanding how to optimally pass your data through the system can make a very, very large difference.

Okay, back to slides, please. One-shot images: the best way to pass up one-shot images that are small is using DrawPixels, the reason being that the overhead for small images is not going to be the copy of the data
which DrawPixels does (DrawPixels will always copy your pixel data); it's the function call overhead of getting in and out of the system. So you have to weigh the function call cost of driving OpenGL against the expense of copying the pixels. DrawPixels works really well for small images, and I recommend that you experiment with it if your images are smaller than 128 pixels in size. One of the keys to making DrawPixels go fast is that you want to disable any complex rasterization state. The reason is that DrawPixels goes fastest when we're not really going through the 3D pipeline and are instead allowing the graphics processor to use its 2D pipeline, so we can just get a straight blit into the frame buffer. So we're not doing blending, dithering, stenciling, alpha testing, or anything else that the 3D pipe needs to do, and we can stay on the 2D pipeline. Disabling complex state will give you the best performance. Again, this demo is available on the website today, so people can look at the example I'm going to be running here.

Okay, so a simple little code snippet: disable complex state and then issue the DrawPixels. Again, the same as with textures, you want to use a pixel format that is supported by the graphics processor so we don't have to do expensive conversions, because if you pass in some type like a float, we're going to convert it to something the graphics processor can handle, and that may be slower than you would like to see.

Okay, so back to demo two, please. This is a demo showing DrawPixels. Again, I put the infinitely fast button on there, something I do highly recommend. If I zoom this down to something very small (you almost can't see it, and I apologize, but it's two pixels by two pixels), you can see I'm getting a million DrawPixels commands per second. Now, you can see I'm only getting 15 megabytes a second of bandwidth, of actual pixel copying performance, but I'm getting a million draw commands.
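As a rough check on the figures quoted for the two-by-two case, the relationship between draw rate and copy bandwidth can be worked out in a few lines of C. This is only a sketch: the function names are hypothetical, and 4 bytes per pixel is an assumption (a 32-bit format such as BGRA), since the talk does not say which pixel format the demo uses.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes copied per second by a stream of small image uploads.
   bytes_per_pixel is an assumption of this sketch (4 for a 32-bit format). */
uint64_t copy_bytes_per_second(uint64_t draws_per_second,
                               uint32_t width, uint32_t height,
                               uint32_t bytes_per_pixel)
{
    assert(width > 0 && height > 0 && bytes_per_pixel > 0);
    return draws_per_second * (uint64_t)width * height * bytes_per_pixel;
}

/* Convert bytes per second to binary megabytes (MiB) per second. */
double to_mib_per_second(uint64_t bytes_per_second)
{
    return (double)bytes_per_second / (1024.0 * 1024.0);
}
```

Under those assumptions, copy_bytes_per_second(1000000, 2, 2, 4) gives 16,000,000 bytes per second, which to_mib_per_second turns into roughly 15.3. That is in line with the 15 megabytes a second quoted at a million draw commands per second, and it shows why a tiny image is dominated by per-call cost rather than by the copy.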
Now as I move this up in size to draw something larger, you see I'm getting eight hundred megabytes a second of pixel copy performance, and my actual frame rate has dropped from, what was it, a million down to 11,000. So now I'm starting to be memory bandwidth limited: the performance is being throttled very much by the copy operation being done on the pixels, as opposed to the function call overhead of getting in and out of OpenGL. This is the kind of boundary I was describing: small images are going to go really fast, larger images are going to start hitting a memory bandwidth limit, and you might want to start considering some of the other techniques, the ones for uploading textures, for doing this operation.

Okay, back to slides, please. Let's talk about pixel copy operations real quickly. The key to pixel copy operations is to get VRAM-to-VRAM performance. You can get extremely high bandwidth if you don't have to come across the bus. So any time you have data that you want temporarily stashed off and want to be able to restore, CopyPixels is a great way to do that. As for where to store the data, there are a couple of options. One is to use an auxiliary buffer. Apple has extensions where you can have auxiliary buffers that have depth and stencil associated with them, such that you can copy your depth buffer, stencil buffer, or color buffer off into a temporary location and use CopyPixels to copy it back to restore your data. So if you want to refresh any one of those types of buffers, an auxiliary buffer will work well for storing that data. You'll see that I list the pixel format attributes that you would want to include in your pixel format when setting up your OpenGL context to allow your auxiliary buffers to have additional buffers associated with them, and you'll see that the depth/stencil
one on the bottom there is the one for expanding your auxiliary buffer format type. Just like with DrawPixels, with CopyPixels you'll want to have the state in a very simple form, because you want to use the 2D pipeline. When you're copying from one location in the graphics processor's video memory to another, you want to have the 2D engine do that operation if possible, and to do that you want to minimize your state. It turns out to be basically the same restrictions as for DrawPixels, so you'll want to disable as much of the state as you can to try to get that VRAM-to-VRAM 2D blit.

Okay, so looking at a little piece of sample code for that: you just disable your state, set up your read buffer and draw buffer, and then call CopyPixels. Here you can see that I'm going to copy data from the auxiliary buffer, where I may have stored the data temporarily, back into the back buffer, for restoring, say, a depth buffer, or in this case a color buffer, and getting a very fast restore of that image.

Okay, threads. Let's talk about threads a little bit. First off, here's what I'm going to talk about. Rules for threading is what I'm going to go over first, and then I'm going to talk about division of work: what kind of strategies you can use for dividing up your OpenGL processing onto multiple threads, and what some of the effective techniques are. Then sharing data between contexts: how you effectively share data, and how you can set up multiple contexts so that they share some common data set. And synchronizing your threads: we'll go over a little bit about what the proper mechanisms are for synchronizing multiple threads.

Okay, so rules. If you're going to use multiple threads talking to a single context, those threads cannot both be in OpenGL simultaneously. If you do that, really bad things will happen, because we do not mutex lock on a per-context basis against multiple threads going into
OpenGL; that work is required of the application. So you will need to be doing your own mutex locking if you have multiple threads that can be talking to a single OpenGL context. Now, as I'll show in the examples, you can have multiple OpenGL contexts and have a thread for each, and then you don't have to worry about any mutex locking. But if you are sharing a single context across multiple threads, mutex locking is the application's job to get done properly.

Other things you can share: you can share context data across threads. You can set it up such that you have multiple contexts with a common set of object data, object shared state, that multiple threads will be referencing, and we do mutex lock that, such that multiple threads can just be beating on the shared state via their own contexts and we will manage the shared data set. You can also have multiple contexts talk to a single surface, so you can have one video memory surface with multiple threads and multiple contexts talking to that same video memory and doing their drawing that way.

So let's talk about division of work a little bit. One possibility is moving OpenGL onto a separate thread, so you can have your application on one thread and OpenGL on a separate thread. That's a very obvious way of doing it, but it's not always the optimal way. Another thing you can think about is splitting OpenGL vertex and texture processing. That's very useful when you have video data, or you're generating some pixel data coming from a disk or some other source, that you want to load into OpenGL, and you want a second thread to be drawing it. So you can have multiple OpenGL threads, one for loading and one for drawing.

So what gets shared between contexts? A lot of times people don't clearly understand this. When you have multiple contexts set up to share each other's object state, the things that get shared are display lists, textures, vertex and fragment programs, and vertex array objects.
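Going back to the locking rule for a single context used by several threads, the application-side locking could be sketched roughly as follows. This is a minimal illustration using POSIX threads, which is just one choice of OS-level lock and an assumption here; the SharedContext type, its counter, and the function names are hypothetical stand-ins, and no real OpenGL calls are made.

```c
#include <assert.h>
#include <pthread.h>

#define COMMANDS_PER_THREAD 100000

/* Stand-in for one OpenGL context: the counter plays the role of the
   context's command stream. Hypothetical type, not part of any GL API. */
typedef struct {
    pthread_mutex_t lock;
    long commands_issued;
} SharedContext;

/* Worker: takes the application-owned lock around each "GL call". */
static void *issue_commands(void *arg)
{
    SharedContext *ctx = arg;
    for (int i = 0; i < COMMANDS_PER_THREAD; i++) {
        pthread_mutex_lock(&ctx->lock);
        ctx->commands_issued++;   /* real code would make its GL calls here */
        pthread_mutex_unlock(&ctx->lock);
    }
    return NULL;
}

/* Runs two threads against one context and returns the total issued. */
long run_two_threads(void)
{
    SharedContext ctx;
    pthread_t a, b;
    int err;

    ctx.commands_issued = 0;
    err = pthread_mutex_init(&ctx.lock, NULL);
    assert(err == 0);
    err = pthread_create(&a, NULL, issue_commands, &ctx);
    assert(err == 0);
    err = pthread_create(&b, NULL, issue_commands, &ctx);
    assert(err == 0);

    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_mutex_destroy(&ctx.lock);
    return ctx.commands_issued;
}
```

Because every update happens with the lock held, run_two_threads() returns exactly 200000; without the lock the two threads would race on the counter, which is the toy version of two threads being inside the same context at once. Giving each thread its own context, as the talk suggests, removes the need for this lock altogether.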
That data gets shared when you share two contexts, so that data set becomes common between multiple contexts if you set them up properly, and we will manage the mutex locking for accessing that shared data. And like I said, you can share an OpenGL surface, so you can also set it up such that multiple contexts talk to one VRAM buffer.

So let's look at some diagrams of how that looks. Here in the red circles I've got threads. On the left I've got the application doing some CPU processing; it passes that data off to thread two, which takes that data and uses it to draw with OpenGL. Very simple: it's using one OpenGL context, and OpenGL is on its own thread. Here's an example of splitting OpenGL across multiple threads. I've got two threads, one OpenGL context per thread. I've got them set up such that they're sharing OpenGL state and talking to the same video memory surface. So they share state, they share the VRAM buffer, and we manage the shared object state. This shows using texture data on one thread and vertex data on another. Obviously those are arbitrary; you can mix those up and have any kind of input from either thread talking to the shared object state. There's a variation on that if people want to use pbuffers: you can have one thread talking to a pbuffer, link that pbuffer into the shared state as a texture, and then use thread one to draw scenes that use the texture generated in the pbuffer.

So, looking at a little bit of setup code, this is how to set up a shared context using Cocoa. You'll see that I create a context and init it with a pixel format, and I'm passing in a share context, so the third line down is the share context. That's the way you can link two contexts
together to have common shared object data structures, and that allows you to share textures, display lists, programs, and vertex array object data.

Okay, so synchronization between threads. What you want to do is use standard OS thread locking; NSThread and NSLock are one example. Obviously you can use any other type of OS-level facility for managing threads. I guess the main point of this slide is that there's nothing in OpenGL to manage your thread synchronization; it's standard OS facilities that do that. Don't use the Apple fence extension for managing your threads. The Apple fence extension is for managing synchronization between the CPU and the GPU, not between two CPU threads. That's an important point to remember if you're going to start dabbling in multiple threads. And by the way, just as a point: if you mess up threads and you have multiple threads talking into the same context, you will cause all kinds of bad things. Bad things can go as far as hanging your system: you'll introduce a bad command into the graphics processor, the graphics processor may hang and your screen will wedge, the CPU will block up against that, and everybody will come to a halt.

Okay, so let's switch to demo machine two, please. In the beginning of this talk I talked about effectively using the CPU and the GPU. First off, we wrote an AltiVec routine for this little sinusoidal wave simulator here, and you'll see it's going pretty fast; it's generating 18 million triangles a second. What we've got here on this chart is the red at the bottom, which is time spent in the system outside of the application or OpenGL; the green, which is time spent calculating the wave; and the blue, which is time spent in OpenGL. Okay, so I'm going to multi-thread that and split it across both of the CPUs I have on this system. That brought my performance up a little bit: I was at 19,
now I'm up to 21 something, so you can see that I've improved performance a little bit. Now, the thing that is surprising is this: this machine actually has a high-end graphics card in it, and what I'm going to do is move the wave calculation into a vertex program on the GPU. Before I do that, you'll see that the CPUs are very busy; there's a lot of time going into calculating this wave with the CPU. Now I'm going to move the wave calculation off onto the GPU, and the thing to watch is the performance: it's at 21 million triangles a second, 338 frames a second, and look at it drop down to 15 million. I know there are people out here from the hardware vendors saying that can't be possible, our hardware can always outrun the CPU. But it's not true. The CPU is really good at some things, and you can write really efficient code that will sometimes outrun the graphics processor. And by the way, notice that the CPU is now barely doing anything, but my performance went down. So what I'm trying to point out here is that if your goal is maximum performance, sometimes you want the CPU to be doing work. If your goal is to have the CPU do nothing, then for sure offload all the processing onto the GPU, and the CPU will be free to do something else, but that won't guarantee maximum performance. Maximum performance is found by experimenting to see what the optimal combination is.

Okay, back to slides, please. Let's wrap up. After this session there are a couple more OpenGL sessions that are really good, and I recommend people go to them. There's the optimization live session, which is going to be a live session about using our tools, the OpenGL Profiler and OpenGL Driver Monitor, live on stage, showing people how to use them. It's a really good session. I find that I'm always using the OpenGL Profiler for analyzing applications and figuring out where the bottlenecks are and what I need to be optimizing. It's a similar tool for
OpenGL as Shark is for the CPU. And then on Friday we've got the introduction to the OpenGL Shading Language. For those who don't know what the OpenGL Shading Language is, it's a good introduction to what that language looks like and some of the capabilities it has, and it's highly recommended for people who are interested in programming the graphics processor. As for who to contact: contact myself or Travis Browne. If people want to talk to me, come up afterwards and I can give you a business card, so you don't need to write that down too quickly. For more information you can go to the Apple website, developer.apple.com/opengl; that's a good resource for OpenGL information from Apple. Or you can go to the opengl.org website, www.opengl.org. That's OpenGL's official website; it contains specifications and pointers to a variety of resources that people find useful. And the reference library: we do have some references out there, and I want to take note of a couple of pieces of documentation that are out on the system.