WWDC2003 Session 209
Transcript
Kind: captions Language: en

Good afternoon. Welcome to session 209, Advanced OpenGL Optimizations. I'm Travis Browne. I'm sure by now you know what I do, so we'll skip that. But this is a really good session for you to attend, because we're going to be talking about the techniques that optimize OpenGL on Mac OS X. I want to make a couple of quick comments and get out of the way, because we sort of have a super-sized session here, and we're going to try to fit it into the allotted amount of time. It should be a little bit of a struggle, but we've got great content. The key thing I want to point out is that one of the advantages Apple has is in terms of our implementation: we deliver the operating system, the driver, and also the OpenGL stack. Because of that, we're able to put in certain fast paths that enable you to really unlock incredible performance out of the latest generations of GPUs. And because we also work on the operating system and create our own tools, we have a tool chain that's available to you to really take a look at how your OpenGL application is performing, debug it, and also just unlock it so it runs as wide open as it can. So it's my pleasure to welcome John Stauffer, the manager of the OpenGL engineering team, to the stage to take you through the presentation.

Thank you, Travis. As Travis said, I manage the OpenGL engineering team at Apple, and this is going to be a session on optimizing OpenGL, trying to get the most out of OpenGL on the Macintosh.

OK, so what we're going to go over today: we're going to try to cover some basics, move through those as quickly as possible, and get into some techniques that you'll want to leverage in your application to get the maximum performance. Some of those things are optimizing texture uploads; optimizing your vertex data throughput; optimizing for one-shot images (if you're just blitting an image up to the screen and you want to dispose of it, that's what I call a one-shot image); optimizing
for copy pixels (if you want to move pixels around, how do you do that quickly); using threads and OpenGL (OpenGL is thread-safe, and a great way to leverage our systems is to use threads); and lastly, we're going to go into the OpenGL Profiler, which is a tool you can use to analyze and look for hot spots, and look for places in your code where it may be blocking up against GL.

OK, so the goals of optimizing. There are a couple of goals. One is to maximize the performance of your rendering: you may want to get maximum performance, and that means utilizing the CPU and GPU in a combination that will get you that performance. Another possible goal is to minimize CPU burden. It's a little different; sometimes they result in the same code, but sometimes they don't. To minimize CPU burden usually means to maximize the burden on the GPU: you may want a technique that offloads as much burden as possible onto the GPU, leaving the CPU free for doing other work.

So, key concepts. Keep these things in mind while I'm talking, so that hopefully the concepts I'll be presenting will make sense even if I'm not pointing them out as clearly as I need to be. Eliminating CPU copies is a key goal of any optimization. You want to reduce the number of times you're touching data: move it through the system, get it to where it needs to go, and start operating or start drawing on it. Cache static data in VRAM. The video card has a higher memory bandwidth than system memory, so if you have static data, the ideal place to put it is up in video memory. Put it in video memory, leave it there, and draw with it from there, and you'll get dramatically higher performance, as I'll show in some of the demos. Maximize asynchronous behavior between the CPU and GPU; that's key. You've got two asynchronous processors, two asynchronous pieces of hardware, and you're going to want to stay asynchronous. You're going to want to run one in parallel with the other; you're not going to want them to block against each other;
that's the key concept for getting maximum performance. And again, like I said, using threads is a concept that can be beneficial at times.

So, basic things to avoid. We're into the basics here, just a general overview of what you usually don't want to do. What you usually don't want to do is call glFlush. glFlush flushes a command buffer up to the hardware. It uses resources in the driver, and it can cost you a little bit of function overhead getting into the driver to make that happen. Don't do it unless you have to. There are a couple of reasons you might have to, and we'll cover those a little bit later, but in general, avoid it. Never call glFinish. I frankly don't know of any cases where glFinish is really required; there are other ways to do it, possibly more efficiently. Avoid calling glReadPixels. glReadPixels is only really required when you want to get pixels back off the video card and save them somewhere for later. But if you're using an algorithm for rendering some effect and you have glReadPixels in there, you probably want to look for a better way to do it, like caching the data somewhere else on the video card and then reusing it by just copying it out of that cache in video memory back into your scene.

Avoid immediate mode drawing. Immediate mode drawing is when you use glBegin and glEnd, and you have a series of vertices and colors that you put in between that begin and end. There's one caveat to that: you can use immediate mode drawing in display lists. We have a post-processor that will go through, take that begin/end sequence, post-process it, optimize it, and cache it in video memory for you. Display lists are static data: you can't modify a display list once you've created it, so by definition it's fundamentally static. So we will treat it like static data, caching it in video memory, and display lists are a good place to put your static data.

Minimize state changes. State changes in the hardware can be expensive, and in fact the more complex the hardware gets, the
more expensive they tend to get. Therefore you want to group your rendering according to state change. There's usually a hierarchy of how you want to group it: maybe group it by texture, group it by blending mode, group it by drawing method, where you coalesce your types of primitives, triangles, quads. Coalescing your data according to state will cause fewer state transitions in the hardware, and that can be very beneficial to performance.

OK, so jumping into optimizing texture uploads. First we're going to do an overview of the texture pipeline and talk briefly about that. Then we're going to go into texture optimization basics, just an overview of what we want to look at. And then we're going to go into some OpenGL extensions. We're going to break those down into two categories: power-of-two extensions and non-power-of-two extensions. There are ways to optimize slightly differently for those two cases.

OK, so the OpenGL pipeline generally looks like this. For this part of the talk, optimizing texture uploads, we're going to focus on the pixel pipeline, the highlighted yellow boxes there. To zoom in on that a little bit and talk in more depth about what happens while the data is moving through the system, we have a basic block diagram. At each block on this diagram, oval or block, the data may be copied. So you have the application, and the application has a copy of the texture. You hand it to OpenGL, and OpenGL may make a copy of the data and store it in the framework. When it goes to draw it, the driver may make a copy into some hardware-specific format for uploading to video memory. And then of course video memory has a copy. So theoretically it is possible that you might have up to four copies in your system at some point. The goal of some of this discussion is how to avoid some of that, how
to control how that operates.

OK, so basics. Again, like I said, we want to minimize CPU copies and conversions. It is possible that you pass data that is not in the format that the hardware wants, and we'll have to do a conversion. So you're going to want to pick data formats that are optimal, and I mention some here: BGRA with UNSIGNED_INT_8_8_8_8_REV, BGRA with UNSIGNED_SHORT_1_5_5_5_REV, and the YUV format for YCbCr data. Those formats can be used for a fast path through the system. Absent any other state that may cause conversions, those do not need to be converted; the hardware natively understands them.

OK, so like I said before, you'll want to avoid glFlush calls, but there are exceptions. One exception is when you have the GPU asynchronously reading from your data. If you have a texture and the GPU is going to read directly from it, you want to stay asynchronous with the GPU, so you may have to double up your data. For instance, you may have to have texture one and texture two, and while the CPU is uploading or working on texture one, you'll have the GPU working off of texture two. We're going to go into more detail on this, but I want to mention it so you'll keep it in mind as we look at some of the diagrams.

OK, so double buffering. Like I said, to stay asynchronous you want to double buffer your data. If you double buffer your data, it looks like this. You can fundamentally stay asynchronous from the hardware and allow the hardware to have some data to work on while the CPU is working on some other data. It's a standard double-buffering concept that commonly comes up.

OK, so some OpenGL extensions for optimizing your throughput in the texture pipeline. The first one is APPLE_client_storage. Client storage is where you tell OpenGL, "I will allocate a piece of memory and I'll keep it around, and you can just keep a pointer to it." So we will retain a pointer to your data; we will not copy the data and
retain it locally. That requires that you retain your copy of the texture until you delete it, because we're going to be referencing it.

Another extension is APPLE_texture_range. It has a couple of interesting modes that you can define. One is cached, which means that you want the data cached in video memory. One is shared, which says, "I don't want it cached in video memory; I want you to leave it in AGP space, don't put it in video memory." What texture range actually does is define a region of memory. You put a texture there, and we will map that region of memory into AGP space and leave it there, so that the GPU is able to come and DMA directly from that piece of memory, without us having to copy it out of that region and put it into AGP space. We map that memory directly into AGP space.

OK, so let's look at what these extensions do to the stack. What client storage does is bypass one copy that the texture may undergo. It goes from the application to the driver without having to be copied by the framework, so that will automatically increase your performance if you happen to be incurring a copy in the framework. We'll go over some sample code. It's pretty easy to enable: all you have to do is make a single call. When you bind to a texture, you make a call to enable client storage, and you just set it to true.

OK, texture range and rectangle texture. The rectangle texture extension is for allowing some hardware to do direct DMA of the texture. To define power-of-two versus non-power-of-two: power-of-two means a power-of-two width and height, versus rectangle, which means any size, not restricted to power-of-two dimensions. The reason rectangle texture is required by some hardware to do direct DMA is that, for power-of-two textures, some hardware requires the data to be in a hardware-specific format before it can upload it. Therefore rectangle texture is required when you use texture range
to get direct DMA, and you'll need to use those in conjunction. When you do, you can bypass the driver's copy. So now we've shown how to bypass the two copies independently.

Looking at the sample code for texture range, like I said, there's the cached hint. The cached hint in this case is for storing the data up in video memory, and it's for non-power-of-two. And if we look at the shared hint, here's how you set it. It's the same; it's just that instead of a cached hint you have a shared hint. And for those of you for whom I'm going too fast, there's sample code that will be up on the website. You can look at that, and it has all these features, so don't get too worried about writing these things down; you can reference the sample code.

OK, so if we use those two in conjunction, we end up bypassing all the copies. We end up going straight from your copy of the texture directly into video memory, and the GPU is DMAing directly from your memory. So the GPU and the application are talking directly to one another. OpenGL has fundamentally been moved out of the way: OpenGL did the setup, but the transfer is happening between your application and the GPU.

OK, so let's put this all together and look at a piece of code that does all these things together. For non-power-of-two, the first thing we do is bind to a rectangle texture. Then we set up the cached hint, and that works in conjunction with client storage, which is next, on the fourth line. Between those two, that sets up for a direct DMA. Then when we call glTexSubImage or glTexImage2D with a rectangle texture target, it is going to set up the GPU for a direct DMA of your texture from your memory. So it's pretty simple to do. But this particular setup that I'm showing right now does require that you use rectangle texture, and you'll want to read the rectangle
texture specification, because it does have some restrictions on functionality. It's not quite the same as a power-of-two texture, which allows for mipmaps, allows for different clamping modes, and such. So before you use it, read the extension and see if rectangle texture suits your needs.

OK, so for power-of-two it's slightly different, but not much. All I did in this piece of code, different from the previous one, is change rectangle texture to texture 2D. Texture 2D then allows me to use a power-of-two texture and get some of the additional functionality that power-of-two textures bring. But it won't give you a direct DMA; you'll incur one copy, so it's not going to be quite as fast. Typically we see that that's OK, because rectangle textures are usually fine for things like video, while games like Quake 3 are going to use power-of-two textures, and they're going to load them at the beginning of the game or a level, so they don't need to do real-time texture loading as much. So rectangle texture is very powerful for playing video, or for playing through images that you want to get to the screen fast, which typically are non-power-of-two.

OK, so let's switch to the demo machine. I'm going to show you a demo of that.

OK, so the first thing we're going to do is look at this and explain what the demo is. It's hard to see, but this demo has numbers in the middle of that image, and the numbers go from one to five. Even though it's a small window, it is uploading a 1024 x 1024 32-bit image across the AGP bus and blending it onto the screen every frame. You can see that we're getting about 650 megabytes a second. You can also see that we've got a couple of sliders. I can switch from single buffered all the way up to five buffers, to test the different effects that may have on the parallelization of the hardware and the CPU. You can also see a frame rate slider and a number of checkboxes to turn off the different
extensions. The frame rate slider goes all the way to a thousand; we actually had to add that to test the G5, but we're on a G4, so we're going to keep it in the middle. The interesting note here is that, like I said, the idea is to eliminate CPU copies. Well, this actually eliminates all the CPU copies, and you can see that the CPU monitor is showing very little activity. The CPU actually isn't doing much other than running the event loop and drawing a little quad on the screen. So the CPU has effectively been removed from the bulk of the work of this demo.

Now, if we turn off all the extensions, let's just start turning these things off, you can see that our performance drops from 650 megabytes a second down to 111 megabytes a second. And you can see that now we have effectively saturated one CPU. We're a single-threaded app, we've taken one CPU, and we are basically memory-bound by the CPU copying the data to get it into a format that can be uploaded by the GPU. So with those extensions, and I'll turn them back on, not only do you get higher bandwidth, but you save CPU work, because you're not using the CPU and you're getting higher throughput. So for people who are able to use some of these extensions, you can get quite a benefit.

OK, back to the slides, please.

OK, let's jump into optimizing vertex throughput. Optimizing vertex throughput is actually very parallel to optimizing texture throughput, and you'll see as I go through the slides that it parallels the same concepts. In fact we did that on purpose, so that it has very analogous concepts. We're going to look at an overview of the vertex pipeline, we're going to go through some of the basic optimizations that can be done, and we're going to go through some of the APIs that can help you. We're going to break this into a few categories: static, dynamic, and display lists. Those are three separate categories we're going to touch on that will have slightly different
techniques for each of those categories.

OK, again, same pipeline. This time we're going to focus on the display list and vertex path. Let's get into the basics. For minimizing CPU copies for vertex data, just like pixel data, you're going to want it in a format the hardware understands. A safe data type is GLfloat. If you keep all your data in GLfloats you're pretty safe; all the hardware knows how to read GLfloats. If you start using doubles or bytes for vertex data, or some combination that's a little bit off the norm, the driver may say, "I can't directly upload this to the GPU; I may have to do a CPU copy and a conversion, maybe some slow conversion," and you may find your performance dismal. So stick with GLfloats; that's guaranteed to be one of the faster paths.

Use vertex arrays. Like I said before, stay away from immediate mode drawing, because that incurs per-function-call overhead and some other overhead that I'm going to go into in a minute. So use the standard GL array function calls. And maximize your vertices per draw command. I will show some performance charts in a little bit that show the benefit of maximizing the number of vertices you pass OpenGL at one time. For instance, instead of drawing one quad at a time, if you draw 100 quads at a time you'll get dramatically better performance, because you're lowering the per-function-call overhead and you're lowering the work the driver has to do on a per-primitive basis. Cache your static data in VRAM; we've already said that. And use vertex programs to offload your CPU work. I'm going to show a demo in a bit that shows how you can do actual work with vertex programs and free up CPU cycles, not just in the data transfer aspect but in the effects that you may want to do in your application.

OK, and again, the same thing: double buffer your data. It's analogous to textures, and we have the same double-buffer diagram, where if you have the
GPU reading your data directly out of your application's memory, you're going to need some isolation between the asynchronous behavior of the GPU and your application. So you're going to want to double buffer your data: the CPU has one buffer to work on while the GPU is working on the other, and you pass them back and forth. When you do that, you get asynchronous behavior, and you can get some significant performance improvements. We'll show some of that in a demo as well. You'll notice I put a glFlush there. When the CPU is done doing some work, you'll want to get that data in flight to the graphics card. So as soon as the CPU is done, issue a glFlush and send it on its way. Hopefully you've done a substantial amount of work, so you're not calling flush too often, because that will hurt you.

OK, again, very much like the texture pipeline, when vertex data comes through the pipeline it can go through multiple copies, depending on which APIs you're using and how the data is formatted. So we can end up with the data going from your application, and if it's going through immediate mode, OpenGL is required to capture the current vertex state. We retain a golden image of the current vertex state when you're running immediate mode. If you're running vertex arrays, we don't have to do that. So the first copy we have to do is into local storage of a single instance of the current vertex state; we incur that one copy if you're going through immediate mode. Then we have to copy to a format for the hardware upload, so we copy it somewhere into AGP space for the hardware to DMA it up, and then eventually it makes it to the GPU. If you use vertex arrays, you immediately eliminate that one copy. That one's easy to do; immediate mode is an easy one to work around. No extensions needed, just use the right API.

OK, so let's talk a little bit about dynamic data. Analogous to the texture range extension we previously talked about, we have a
vertex array range extension, and it is exactly parallel. It has the same storage hints. You have a shared hint for leaving the data in AGP space, and that's what you're going to want for dynamic data. You don't necessarily want to cache it in video memory; you want to leave it in AGP space. What happens there is that you've allocated an array of memory in your application, and we come along, map it into AGP, and wire it down. Then the application can come along and poke values into it, tell the hardware "I'm done with that," and issue a draw command, and it will be DMA'd up. So the GPU is reading directly from your arrays, and the data never makes it into video memory. For dynamic data, obviously, that can be a benefit: you don't want it cached in video memory, because you're going to change it again the very next frame.

OK, so what does it look like if we use that extension? We have vertex arrays and we use the vertex array range, and just like texture range, we bypass all the copies in the driver and we are DMAing directly from your application's copy of the arrays. So we can get very high throughput doing that, with very little CPU work going on.

OK, so looking at a little bit of sample code for that. The first two calls are just standard vertex pointer setup, standard OpenGL for setting up a vertex array. The next two calls are setting up a vertex array range. What you do is pass in a size and a pointer, and you tell us what memory to map in. You just give us a pointer with the size, and we map that memory in. Then the last call here is a flush. Now, that's an important call, because every time you change that data you have to tell us you changed it. What we'll do with that is potentially flush hardware or GPU caches, or we may DMA it to some other location. What's important is that you have to tell us the areas that you've
changed. So every frame that you come along, write some more vertices, and change that data, you have to tell us the pointer and the size, the size from that pointer, that you want us to flush because you've changed it. We will then know that it has changed and update the hardware.

OK, so static vertex data is very similar. You can use the vertex array range, but instead of using the shared hint you use the cached hint. What will happen is that when you define the vertex array range, if it has a cached hint and you say flush, then we'll know that you've changed it, and we will DMA a copy up to video memory and keep it there. Every time you call flush we will have to re-DMA it back up into video memory, but it'll be cached in video memory. If you're going to draw from it multiple times, that's quite a benefit, because you're not having to re-read that data across the AGP bus every time. Instead you're local to the video card's bus, which is a very high-speed bus.

And like I said previously, you can use display lists with begin/end. The one caveat for using display lists is that we have an optimizer that goes back through the data, parses it, and reconfigures it into an optimal format, and you can fool that optimizer. What you want to avoid is using inconsistent vertex data. What I mean by that is that if you go through glBegin and you say glColor, glVertex, glColor, glVertex, that's consistent. Inconsistent would be glBegin, glColor, glVertex, glVertex, glVertex, glVertex: you gave a color for the first vertex but not for the following ones. You may fool the optimizer into not being able to handle that. So if you want to play it safe, just keep the data in a consistent format, and the optimizer will definitely be able to take that data and pack it into a format that we can then cache in video memory, and you can get optimal performance. One last caveat for display lists is that there's a
minimum threshold for which it's worthwhile for us to work on the data, and that threshold is 16 vertices. If you have fewer than 16 vertices we won't even consider optimizing it. We found that out by testing different machines, finding out where the threshold was, and deciding that if it's not going to give you a performance benefit, and in fact it can actually slow you down because of the other overhead of doing work on the data, then 16 was the minimum.

OK, so what does that diagram look like? When using static data, with either display lists or the cached vertex array range, the data gets DMA'd into video memory and then the GPU draws from that. It takes the data from the video memory cache and draws, so you get very high throughput for data you draw more than once.

And the sample code for that: again, for static data we're setting up a standard vertex array. Again we set up the hint; this time the hint for the vertex array range is cached, not shared like it was for dynamic. Like before, we set up the vertex array range pointer and size, and then again we call flush. This time the flush is going to cause us to re-upload that data. If it's not there already, we'll upload it; if you have touched the data again and it was already uploaded, we'll refresh it with another copy. So it's like a glTexSubImage call, where we refresh the data in video memory.

OK, so a basic review of what display lists look like. It's pretty simple: you call glNewList, do your drawing, and then call glEndList. You can pack anything you want in there; it takes any OpenGL calls. If you put your geometry in between the new list and end list calls, hopefully we'll be able to optimize it and get it cached in video memory.

OK, so looking at what this could do for your performance. This is a chart of low vertex count performance. On the x axis we have the
number of vertices per draw command, and on the y axis we have millions of triangles. As you can see, the orange is immediate mode, and orange tops out pretty quickly as to what benefit you can get by going down that path. The red is vertex arrays. Vertex arrays have a little lower per-function-call overhead and give you a little more performance. But if you look at the blue, the blue is vertex array range. Vertex array range has great potential for performance, but it doesn't give you a whole lot until you start giving OpenGL a lot of work to do at one time. That's the key: giving OpenGL lots of work at one time. And then the green, the top one, is display lists. On this chart it goes up to about 12 million triangles a second, issuing 30 triangles per draw command.

Now this is the high vertex count performance, picking up a little bit where the other one left off. You can see that some of these continue to grow quite a bit. Vertex arrays and immediate mode stay flat. Vertex array range basically grows until you're limited, and in this case the test I was running was AGP bus limited. I was limited to about 640 megabytes a second of vertex data that I could transmit across the bus, so I pretty much bottlenecked at the AGP bus, and that's all the data I could get across. But in the display list case that I was testing here, the data was effectively static. It only went across the bus once, so the GPU was able to utilize its internal bus bandwidth, which on the R300 I was running on is about 20 gigabytes a second. So we can transmit a whole lot more data, and you can see that at the top number I quoted here it was about 2.8 gigabytes a second of geometry going into the GPU. That's quite a bit of data, almost 90 million triangles a second.

So let's do a demo and show a little bit of this.

OK, so what we've got here, for anybody that's been to my session before,
it's the same old thing, but the next iteration of improvement. Initially we're drawing with quads and we're doing the standard basic glBegin/glEnd. Not too impressive; we're getting about 800,000 triangles a second. What I'm going to do, as I move the slider up, is step through different optimizations using different extensions, and we'll see what effect that has on performance. Down at the bottom you see the color coding. The color coding represents where time is being spent: red is system time, time spent outside the application; green is time spent calculating the wave; and blue is time spent in OpenGL. You can see right now that I'm spending a lot of time in OpenGL and a lot of time calculating the wave.

So let's start moving up the levels of optimization. I went to quad strips, and that got us quite a bit of speed improvement, about twenty-five percent, and that was pretty easy to do, so worthwhile. But let's not stop there; let's keep going up. If we go to vertex arrays, we get a little bit more, but it wasn't a great improvement. Then we go to vertex array range. OK, here's where it gets interesting. Now you can see that the time spent in OpenGL, which was the blue, went from filling basically the top of that bar to almost nothing. So now the time spent in OpenGL is very little, and we're basically saturated on the calculation of the wave: we are not able to calculate the wave fast enough to get the data to OpenGL. So if we move up one more notch, we see what AltiVec can do for us. We also AltiVec the wave calculation, because that was my bottleneck: once I optimized OpenGL, GL was no longer the bottleneck, the CPU was, so I optimized that. And then we do one last thing. Like I said before, you may want to offload calculations onto the GPU, so what if we write a vertex program to do that wave? And now, again, the interesting thing to watch is that we are calculating a wave motion and sending almost 12 million triangles a second to the
screen, and look at the CPU: the CPU is doing nothing. So not only have we optimized it, we've offloaded the CPU from doing any work. The CPU again is just running an event loop; the CPU doesn't know that this complex wave is being calculated. And if we actually look at the density of this, it's a really dense wave; there are a lot of triangles there.

OK, back to the slides, please.

OK, let's go into a new subject: optimizing for one-shot images. One-shot images, again, are images that you want to get to the screen fast and then discard. You're not going to blit them to the screen multiple times, so it's just one shot. One possible way of doing that is glDrawPixels. glDrawPixels is fairly effective in some cases. It's best for small images: if you have little widgets you want to draw somewhere, glDrawPixels is probably the fastest way to get the data there. It's a very optimized path, very quick. For images larger than 128 x 128 you probably want to start considering some kind of texturing, like our previous demo showed, where you don't have to make a copy. glDrawPixels is going to make a copy of the data, and the larger the image gets, the more data there is to copy, so your benefit from glDrawPixels is going to go down.

OK, so the trick for one-shot images using glDrawPixels is to get your state right, so that you go down the optimized path. There are three different paths in OpenGL for how to draw these things, and you want to hit the one that's fast. So the first thing you need to do is get your state right, and listed here are a number of things you need to have disabled before you will be going down the fast path. Again, don't worry about writing them down; we'll have a demo posted that you can look at.

OK, so a little bit more code. Draw pixels is very basic: you disable some options and you call draw
pixels. You feed it the right pixel format and type, like we talked about before, which will be a format that the hardware natively understands, you give it the image, and off it goes. Okay, we're going to do another demo, please. I've got to relaunch it here. Okay, so the first thing I'm going to do, and this is a little bit strange: I've got an 'infinite' button. What that button does is, well, it doesn't really go infinite; it sits in a for loop and beats on it really hard, because it goes so fast that just running through an event loop is too slow. So it sits in a for loop and bangs on it really hard, and I reduce the image size to two by two. Now, most of you can't see that, but the key point here is how fast we really get through the stack to OpenGL, and we can get 660,000 of these little images up to the screen. So you can get a lot of little things up to the screen, and that's one of the things to remember, because other paths through the system may have more per-function-call overhead and limit you not because of the pixel data but because of what you have to do to get through OpenGL. So that's the benefit of draw pixels: it has a low per-function-call overhead and can get lots of little small things up to the screen. So if I start increasing the size of these... sorry, I want to go a little smaller than that. So here's a 75 x 75 image. You can see that the throughput is about 400 megabytes a second. Believe it or not, I'm already memory bottlenecked here: I've basically saturated my memory, and it's no longer function call overhead that's stopping me, it's memory bandwidth. Obviously with the G5 these numbers all change, because it goes much faster, and that's actually another trick: how to tune for the different systems can be a delicate job. So if we start increasing the size of this, we quickly run into some rather slower frame rates. So now we're down to... we're still at 400 megabytes a second, so we've bottlenecked
the memory bus; we're just flatlined now. As I increase the number of pixels, I will proportionally decrease the frame rate, because I'm limited to 400 megabytes a second. That's all I get through the system; that's my limiting factor. As I increase it, I go slower, and that's why, when you get to larger images, it's better to relieve the memory bus of that work and go down the texture path. But for small images, draw pixels is great. Okay, back to slides, please. Okay, optimizing pixel copy operations. There are a lot of cases where you want to draw something, save it off, and then be able to grab the saved copy and blit it back to, maybe, your back buffer and use it for some part of a scene, so you want to pre-render it and save it. One of the things you can use to do that is copy pixels. Copy pixels will allow you to do a VRAM-to-VRAM copy. It's like draw pixels in that you have to set up your state correctly. One area where you can store the data is in an auxiliary buffer. On OS X you can create auxiliary buffers. An auxiliary buffer is just another back buffer: you have your main back buffer, and you can create another one off to the side and use that as a temporary holding area for copying data into. You can either draw to it directly or you can copy data between your back buffer and this auxiliary buffer. One additional extension that we have that allows some more flexibility is aux depth stencil, which will not only create the back buffer but also create the depth and stencil buffers associated with that back buffer. So you can have two depth buffers and two stencil buffers, and therefore you can copy not only color data between these aux buffers but also depth and stencil data, and use it as a temporary holding area for fast refresh of some pixels. There are a number of techniques that people use for interacting with very complex geometry where that becomes an important technique. So, like draw pixels, there is certain state that you'll
need to have right to make it go fast. It's very similar to the draw pixels state, and basically what you don't want to be doing is trying to dither or alpha test or blend or other things. Basically what it comes down to is you don't want to do anything that can't be done by the 2D engine on the GPU, because this is a 2D operation and we need to be able to stick within the feature set of the 2D pipeline on the graphics card. So you need to disable all the operations that require the 3D pipeline. The 3D pipeline's not optimal; this is just a memory copy that you can do through the 2D pipe. So there's a number of states that you want to disable, and you can look at the draw pixels example for what state; it's very similar. Okay, so looking at some sample code: very basic. We have the standard 'disable the right state' so you can go down the fast path, and then when you go to draw, you're going to want to set your read buffer and your draw buffer, just a source and a destination. Source and destination can be any of the buffers you have allocated, whether it's the back buffer, aux buffer one, two, what have you, and you can copy between those two. Then you issue copy pixels, and the transfer will be a VRAM-to-VRAM transfer. Okay, so let's jump into using threads with OpenGL. Let's go first through the rules for using threads with OpenGL, then we'll talk about some possible ways to divide up your work onto multiple threads, and then we'll talk about what data you can share between those different threads and how to synchronize those threads. So, rules for threading. What you can't do is thread re-entrance per context. If you have an OpenGL context and you have two threads, only one of those threads can be in OpenGL referencing that context at a time. If both threads are in OpenGL with that same context, you're going to cause corruption in your OpenGL state, and all kinds of bad things can happen. What you can do is share context state
across the threads, and you can share surfaces across contexts, and I'll show some diagrams of how that can be put together to help you with threaded applications. Okay, so division of work. One possible division of work is to move OpenGL wholly onto a separate thread. This is what Quake 3 does: Quake 3 moves OpenGL onto one thread and has a bunch of other CPU work for the game logic and other work on the other thread. It's a reasonable division of work that's easy to manage. Other, more complex ways to divide your work are to potentially split your texture work from your geometry work. You may have texture data that's getting spooled off of a disk on one thread, and another thread that's doing other work for the application but then comes along and uses geometry to utilize those textures for drawing. Another possible way to divide the work is to split your output, your surface. When I say a surface, I mean the OpenGL back buffer; basically, the surface is a piece of video memory that you're drawing to. That's what we call a surface. So you can split the processing of a surface: your CPU work can be divided amongst regions in the surface, and it might be beneficial to split those onto two separate threads and leverage both CPUs to get that work done. Okay, so sharing data between contexts: what gets shared? When two contexts are sharing state, the things that get shared are display lists, textures, vertex array objects, and vertex and fragment programs. Those are the things that get shared, and those are really all objects; they usually have a bind and some name associated with them, and those are the shared items between OpenGL contexts. There's lots of other state in an OpenGL context that does not get shared, and that is going to be per context even if you set these contexts up to share state. And, like I briefly touched on before, you can share a surface
between contexts, so you can have two contexts, multiple contexts, drawing to the same surface, and that's another way to have sharing of data. Okay, so let's look at some diagrams of how to divide up your work and move it onto different threads. Here's the first example, just moving OpenGL onto a separate thread. Very basic: you have one thread doing work for the application, and you have another thread that's driving OpenGL. So thread one is generating data that is used as input for thread two to draw into the OpenGL context, which goes to the surface, which gets swapped to the frame buffer. Okay, you can split your texture and vertex processing onto two different threads with different contexts. So you can have two threads and two OpenGL contexts; they're sharing some of the same state, and they're going to be attached to the same surface. What you can do is have asynchronous processing between the two, where you have one thread spooling data from a disk, say decompressing JPEGs, decompressing a movie, what have you, and feeding that data, those textures, into the OpenGL state machine, and then have the other thread come along referencing those textures and drawing. So that's a way to split your workload if you have spooling or some kind of imaging work that you want to offload. Another possibility, then, is to use our new API for pbuffers, where you're not reading in, say, the data, but you're using geometry to generate a texture. You're using some geometry to draw into a pbuffer, which is in video memory, that is then used as a texture, which is then referenced by thread one's context, which is then drawn, which then goes to the surface and to the frame buffer. So, just to be clear, the only difference between these two examples is that in this example we're dynamically creating the texture by drawing to it; in the previous example we were
basically loading a texture through the OpenGL API. Okay, so we can also split the OpenGL processing of a surface. We can take the surface and split it across some line, and use one OpenGL context to render one part of it and another OpenGL context to render the other. Where this might be beneficial is if you're CPU bound and your CPU work can be regionally divided along some portion of the screen real estate. Where you have a lot of work to do, geometric calculations or what have you, in one portion, you can split that across two CPUs and divide your work across regions of the surface. One way to do that is to just create two threads and two OpenGL contexts, not have the state shared, and they're just open-loop drawing to a surface, but they're drawing to different regions. Now, the way you separate what regions they draw to: you can use a scissor and a viewport. You can just set the scissor and the viewport to the region you want to control, and with a scissor rect the pixels will not come out of that region. You can set it to one half for one context and the other half for the other context and allow the drawing to be open loop to that surface. Or, if your application wanted to, you could just share state; there's no reason you couldn't be sharing state. It's the same basic concept, just that they're possibly sharing geometry, programs, and textures to do their work of drawing into the surface. Okay, so how do we set up OpenGL shared contexts? Here's a little bit of sample code to show how we would create a context and attach another context to it as a shared context. This example assumes that you've already created an OpenGL view through, say, AppKit, and now you come to your initWithFrame. All you're going to do is create an NSOpenGLContext: you can see we're doing an NSOpenGLContext alloc, and then we come along and do an initWithFormat, and we're going to take the format from self. So we're inside the creation of a
view already that has a context, so we're just going to take the pixel format out of that context, use it as the pixel format to create a new context, and provide the current view's OpenGL context as the share context. So we're going to create a new context, and when we create it we're going to hand it the pixel format and the already created context for sharing the state, and that's all you need to do to make sure that the two are connected. The key line here, then, would be the [self openGLContext] on the third line, which hands the new context the previously created context for sharing. Okay, then the next two are fairly standard OpenGL concepts, where you're just making the context current and attaching it to the view. One small deviation on that, then, is that if we wanted to have two contexts that talk to a surface but don't share state, instead of passing the already created context into the newly created context, we just pass nil, but we attach it to the same view. So we're creating two independent contexts and attaching them to the same view; that will allow them to talk to the same surface but be independent as far as state. Okay, thread synchronization. The main tools you'll have for thread synchronization are obviously going to be the OS tools that are provided, the OS APIs, so you'll have NSThread and NSLock; those are going to be what you'll mostly need to leverage. There is one interesting API you want to be familiar with, which is the Apple fence extension. Apple fence is a way to insert tokens into the OpenGL command stream and then to test when they're done. So I can do a set fence, and I can test when that token I've inserted into the command stream has gone through the GPU and made a round trip and completed. So there are ways to test when portions of your drawing are done, and that's another way to potentially synchronize events within your OpenGL
commands. So we'll look at a little bit of sample code for how to do that. Okay, so there are two basic ways you can do it. You can do it with the set fence, which you can see here in the first piece of sample code: I'm setting a fence and giving it a name. I can give it any name I want. You set that token into the command stream, do some work, and later test to see if it's done. So you would do a glFinishFenceAPPLE, and that call will block until that token is completed. There's another, simpler API if you're wanting to block against a texture upload being completed, or a draw against a texture, or a draw against a vertex array object: you can just simply test for that object. What you do is call finish object, glFinishObjectAPPLE, and pass in the type of target you're wanting to check against, so you may have a GL texture or a GL vertex array as the type, and then you pass in the ID number of that texture or vertex array object. That'll be the ID number that you used to create or bind the texture or vertex array object. Okay, so let's do a little bit of a demo again. This one is going to be the same demo we did before, but what we didn't show before is that we have a multi-threaded button at the top, and I want to talk a little bit about that. If we enable multi-threading, we can see that we went from 800,000 triangles a second to 1.5 million triangles a second. So we got pretty good parallelization; we almost got a 2x speed improvement just by doing multi-threading, so that was pretty worthwhile, and you can see on the CPU monitor that we're working two CPUs pretty hard now. What's interesting about this is that if I increase the optimization level of OpenGL and you look at the performance, it's not doing a whole lot. Well, the problem is that the workload for calculating the wave is the bottleneck, and the drawing in OpenGL is not. So we're not going to gain by
improving OpenGL, because the wave calculation is the bottleneck. But once we go to the vertex array range and AltiVec, now we've again removed the bottleneck of the wave calculation, and now multi-threading is paying off. So now we're getting 10.5 million triangles a second as opposed to 8.5. It's not 2x because, as you can see, even with AltiVec the wave calculation, which is represented by the green, is still significantly more expensive than the OpenGL drawing, but we do get some benefit by moving the system and the OpenGL drawing off to another thread. Okay, back to the slides, please. Okay, let's go into an important subject that we want to spend quite a bit of time on: the OpenGL Profiler. It's a tool that comes with the developer CD. It's a very powerful tool; it does have a lot of features and will take a little bit of learning to understand how to use it effectively. What you can use it for is optimizing, debugging, and experimenting with your OpenGL application. There are a lot of different features in it, so 'profiler' is a bit of a restrictive name; it does a lot more than just profile. Let's go through some of the screens and have a brief overview of what the profiler can do for you. So first you open your profiler. Like some of the other tools in the system, you can have it launch your application for you, so it will launch and basically attach to your application, or you can attach to a running application. If you already have an OpenGL application running, you can just attach to it and start utilizing the services of the profiler, just by simply attaching to an already running application. One of the services it provides is function statistics: it will time the in and out times of all the OpenGL functions and provide you counts, percent times, and overall time spent in each function. This way you can quickly get an idea of which OpenGL functions you're spending time in and
quickly get an idea of how expensive those are for you. You can capture call traces: you simply enable call trace capture, and it'll capture all the OpenGL commands and their arguments so that you can scroll through and look at what your application is feeding OpenGL and get an idea of the call sequence. It will capture textures and vertex and pixel programs, so you can actually run your program and it will capture all the textures that you've passed in, and the pixel programs and vertex programs. You can look at those and make sure that you've got the textures you think you have loaded under the right names, or what have you. You can set breakpoints: you can go to an OpenGL function and say 'I want to break here', and at that breakpoint it will give you the application call stack, so you can see what your application call stack was at that point. It'll also give you a complete listing of the OpenGL state, so you can sit there and comb through the OpenGL state and make sure that at that breakpoint the state is what you expected it to be. At a breakpoint it'll also let you look at the off-screen buffers, so it'll let you look at the back buffer, depth buffer, stencil, and alpha buffer, and you can look at them at any point where you can set a breakpoint. It'll let you write scripts and execute OpenGL commands, so at a breakpoint you can type in an OpenGL command and say, 'Well, I think that state's wrong; I'll modify it right here.' Type a GL command, hit execute, and it'll poke that OpenGL command right into your application and change the OpenGL state for you. One useful thing for this, then, is going to be debugging: if you think that you've got a bug in your state setup, you can modify it on the fly. Scripts can be attached to breakpoints, so they can be auto-executed if you want them to be executed every time a breakpoint comes along. The OpenGL Driver Monitor is another powerful tool. What this does is it attaches to
the driver itself and starts collecting stats out of your graphics driver. There's a number of parameters you can monitor, like video memory usage and hardware wait times, so you can watch to see if the CPU is stalled against the hardware, and you can watch what kind of stall it is. It breaks down into many different categories of why the CPU may be blocked up against the GPU, and you can try to monitor those. You can look at bandwidth usage, how much data you're getting through the system, so it'll track bytes per second through the system. So there's a whole bunch of useful stats. It takes a little bit of studying of this tool to get useful data out of it, because it is somewhat complex, so we'll go through a little bit of that in one of our demos. So why don't we switch to the demo machine; let's do that. Okay, so quickly here, before we go any further, some things the screenshots didn't show: you can also customize your pixel format. For instance, if I wanted to make a custom pixel format, I can come in here and change the pixel format attributes that the application uses. So if you have a pre-compiled application, you can modify your pixel formats on the fly without having to recompile. What you can also do is emulate hardware. Now, when I say emulate, yes, it's neat, but I'm going to give you the bad part now: all it really does is deprecate your current hardware to some less capable hardware. So, for instance, if I'm running on an R300 like I am here and I wanted this hardware to look to the application like a Rage 128, I could choose the Rage 128 driver, with the same feature set that got released in, say, OS X 10.3, and any time your application comes along and makes a query into OpenGL for some kind of capability, like an extension or some kind of min/max values, the driver will return a value that looks like a Rage 128. So if your application is coded correctly to respect extension strings and values that are queryable, this is a powerful utility for making
your application think it's running on a Rage 128. This was actually a feature request at last WWDC, so we got it in there. Okay, so I contrived an application, a slight variation of the texture range demo I showed before, and what I did is I did what I said not to do: I stuck a glFinish in there. So the first thing we're going to do is look at the statistics that this thing's collecting. What do we see? We see glFinish is taking eighty-two percent of the time in GL and forty percent of the application time. So let's go over the screen real quick. A couple of values that are interesting to look at when you pull the screen up: one is the number down here, which I highlighted. That's the estimated percent of time spent in GL, so it will try to estimate how much of the total time is spent in OpenGL. We can see we're spending about sixty-seven percent of the total time in OpenGL, and of that time we're spending fifty-five, fifty-six percent in glFinish. So somebody is calling a synchronous call, glFinish, and causing the application to stall and wait for the GPU to flush the pipeline on that call. So here's what we're going to do. Since I don't like that call, we're going to go in here and pull up the breakpoints window and get rid of it. So there's glFinish. You'll notice that not only can we set breakpoints before or after a function call, but we can also stop executing functions, so I'm just going to disable that function. That's a favorite tool at Apple, by the way: when we catch applications doing things we don't like, we'll just disable them. So now you can see that things are looking a little better. We got rid of the glFinish, and now we're only spending twenty-four percent of the time in OpenGL; we're not spending the sixty-five percent that we were before. We are spending the time where we want to be: we're spending it basically in glBegin, where some real work is going on of
getting the data up to the system, and things look good. Now, I talked before about double buffering and the importance of double buffering data to stay asynchronous. Well, this application has the ability to switch down to one buffer, so I can make it look like I'm only feeding one buffer at a time, and you can see I'm stuck on texture zero. So real quick, let's just see what the performance impact is. Right here I'm at about 500 megabytes a second; if I'm at five buffers, I'm at about 600, sometimes 630 megabytes a second. So that's quite a performance difference from stalling, from having the hardware stall on the CPU having to prepare the next texture. So let's look at the difference in what the call stats show in this case, and what we're going to do is pull up the driver monitor, so let's look at a couple of things in conjunction. Okay, so quickly here, let me move this back up a little bit. What we're seeing here on the driver monitor is three lines that I'm drawing. I'm drawing the hardware wait time in red, which represents the total time the CPU is waiting for the hardware, so any time the CPU's blocked up against the hardware, that's going to start registering wait time. In yellow I'm measuring texture paging bytes: since this is a texture demo uploading lots of textures, I'm going to record the number of bytes of textures per second I'm sending up to the hardware. And green is the swap complete wait time. Now let's see what the changes are when I go from single buffered to double buffered, and we can watch the effect that has on some of the statistics and give you a little bit of an idea of how to use this tool and watch for different events that may be going on in your application. Okay, so that was single buffered, and you can see up in the stats that when I'm single buffered I'm spending all my time basically in glTexSubImage2D. So as I change the pixels for that
texture, I'm spending all my time there, and I'm spending my time there because I'm blocked against the hardware. The hardware maybe hasn't completed uploading that texture and the CPU is ready to give it another one, so the CPU has to wait for the hardware to be done. So we're going to block, and that's the effect that single buffering is having on me: the CPU is not running asynchronous to the GPU. So now if I move this up and double buffer this, we can start seeing some of the effects, and I'm going to change a couple of options here to give me a little bit of a better vantage point. Okay, so again, the red is the hardware wait time, and you can see, and it's subtle, so you have to watch, that when I went to two buffers the red line went down, so the CPU's now waiting less on the hardware. And you can see the yellow line went up, meaning that I'm getting more bytes per second up to the graphics card. So by double buffering I have made myself more asynchronous to the GPU, allowing for better parallelization and less blocking on the CPU's behalf. So there's a couple of other things here. Let's display the stats again and just look at what effect that had on the stats. Previously I was spending all my time in glTexSubImage2D; I still am. Let's bump it up a little more and see what happens. Here we go, up to five like we were. So now the blocking point switched again. You can see that we're actually able to catch the driver blocking at different points as we move to different numbers of buffers. So we can see that double buffering wasn't quite enough; double buffering doesn't quite get me the same behavior that three buffers does. Now, one thing to watch out for that can potentially fool you is that there are only limited numbers of different types of resources in the driver. As you vary the way your application works, you can actually start consuming those different resources, and when you consume a resource, the
driver's going to block waiting for resources to become free. So what is happening here is that when I'm only at, say, three buffers, I'm running out of one type of resource: I'm probably blocked up against the hardware waiting for the completion of that command buffer. But when I go to five buffers it changes, because I believe I'm blocked against swap buffers. There's a particular packet type in the driver that is needed to swap, and there's only four of them. When I switch up to five different buffers, I've now made the CPU so asynchronous that I'm running out of a different type of resource. I'm so separated from the hardware that I'm consuming a driver resource that is making me block somewhere else. So the key points here, though, are that hardware wait time is always a good one to look at, and also what kind of byte throughput you're getting. You can look at a variety of different stats for byte throughput. Let me pull up the different stats here. We can look at command bytes GL if we wanted to, for instance, if I put that down there and disable this other one. So that one's not very interesting; actually it looks like a bug. In any case, there's lots of different stats you can put up in here, and we're going to be releasing another version of this that has a little more descriptive names, hopefully with some better information; they are a little bit cryptic. If you need some more detailed information, don't be afraid to post up onto the OpenGL mailing list. Switch back to the slides, please. Okay, so let's wrap up. Texture optimizations: the goal is to minimize your CPU copies of pixel data, and there are different ways to optimize for power-of-two and non-power-of-two. Vertex optimizations: you'll want to use the vertex array range for dynamic data with the shared storage hint, and for static data use the vertex array range with the cached storage hint, or display lists. Offload the CPU onto the GPU with vertex
programs; free up some work onto the GPU. Use threads: you can share different types of data between threads, surfaces and context data. Use draw pixels for one-shot images and copy pixels for fast VRAM-to-VRAM copies of your pixels. Use the OpenGL Profiler to find hot spots and points in your code that may be getting blocked in OpenGL. And with that, if you have more questions, you can contact myself or Travis Browne. So quickly, references: we've got the opengl.org web page you can go to, we have the Apple developer page, and we have some Apple documentation that is available on the developer page. And with that, I'm going to bring Travis up for the roadmap. All right, we're rapidly running out of sessions at this year's WWDC. Let me actually skip forward... one too far. So, yes, the next session in the Graphics and Imaging track isn't specifically related to OpenGL, but it's certainly a popular session nonetheless: Mac OS X printing. And then tomorrow we have Introduction to Quartz Services. If you're a game developer or a full-screen OpenGL application developer, please attend the session; we'll be covering the core APIs that the system uses to do display configuration management. Then, again, our hardware partners from ATI are going to give us a presentation on Friday, and tomorrow is Friday: Cutting-Edge OpenGL Techniques, where they're going to really show us some of the absolute latest things they're able to do with their current Radeon products. We also have a session on accessibility. This will actually contain some content that may be of interest to game developers, because we will be covering, at least in a slide or so, issues affecting the use of assistive technology, which is software that adapts the function of the computer, with OpenGL applications that take over the full screen. We have a suggestion there for some possible ways that you can ensure compatibility. Then we have what is historically the last session of WWDC, which is the
Graphics and Imaging feedback forum. If you're going to voice opinions or give us suggestions, please take the time to attend the feedback forum, because that's where we get a lot of the information that we use to create great new features in the operating system for next year. Thank you.
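A note on the draw pixels demo from earlier in the session: once the path is limited by memory bandwidth, the frame rate falls in proportion to the number of bytes per image. The Python sketch below works through that arithmetic. It is only an illustration, not code from the session: the function names are hypothetical, the 4 bytes per pixel assumes a 32-bit pixel format, and the 400 megabytes a second figure from the demo is read as 400 x 10^6 bytes a second.

```python
# Rough arithmetic for a bandwidth-limited image path (illustrative only).
# Assumptions, not from the session: 4 bytes per pixel (32-bit pixels) and
# "400 megabytes a second" taken as 400 * 10**6 bytes per second.

BANDWIDTH_BYTES_PER_SEC = 400 * 10**6
BYTES_PER_PIXEL = 4


def image_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Size in bytes of one uncompressed image."""
    return width * height * bytes_per_pixel


def max_images_per_second(width, height, bandwidth=BANDWIDTH_BYTES_PER_SEC):
    """Upper bound on images per second when bandwidth is the only limit."""
    return bandwidth / image_bytes(width, height)


if __name__ == "__main__":
    # Each step doubles both dimensions, so bytes per image quadruple.
    for side in (75, 150, 300):
        rate = max_images_per_second(side, side)
        print(f"{side} x {side}: about {rate:.0f} images per second")
```

Doubling both dimensions quadruples the bytes per image, so the upper bound drops to a quarter. That is the proportional slowdown described in the demo, and it is why larger images are better served by the texture path, which avoids the extra copy.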