WWDC2004 Session 210

Transcript

Kind: captions
Language: en
Okay, so my name is John Stauffer. I manage the OpenGL engineering group, so let's get right into it. What we're going to talk about today is a brief overview of what's new in OpenGL — what we have done differently or optimized since WWDC last year — to give you a yearly update. As we go through this session we're going to go into some basic tips; we'll try to keep this short, because we want to get into the more advanced optimization techniques for OpenGL. Then we're going to get into some detail on optimizing OpenGL texture uploads, and then a new optimization technique for asynchronously reading pixels back off the graphics card, which is very important for people who are trying to get pixel data back. Vertex data throughput: how to optimize your vertex data uploads, making sure that you're getting optimal performance uploading your vertex data to the GPU. One-shot images: sometimes you have images that are small or aren't going to be reused — you just want to draw them once to the screen — so we'll talk briefly about how to optimize that. Pixel copy operations: how to optimally copy pixels around; there are certain ways you can optimize that, making sure they are VRAM-to-VRAM copies. And using threads: there are a lot of people out there trying to use threads, and we see a lot of problems with that, so we want to briefly cover it and make sure that people understand the limitations and how
to make that work optimally. So, briefly, optimization strategies. There are two basic things people are trying to do: they're either trying to maximize the performance of their application, or they may be trying to minimize the CPU burden. Depending on what your application's demands are, you'll want to focus on one of those two types of strategies. As you'll see in my presentation, effectively using the CPU can lead to greater optimizations than simply offloading all of the graphics processing work onto the GPU — there are ways to balance that and get further performance by burdening the CPU. But applications that want to offload as much work as possible — because they are doing things like decoding an image or reading from disk — may simply want to take all of the burden of drawing OpenGL and offload it to the GPU. So there are two different types of techniques; just be sure you understand that the CPU is still a very effective processing device for getting performance. So, key
concepts: eliminating CPU copies is very important. One of the key concepts I'll go through here, over and over, is how you eliminate copies of the data as it goes through the OpenGL pipeline. That is one of the keys to getting performance, because what you'll find is that data can be copied multiple times, and limiting those copies — getting a direct pipeline set up such that the data goes directly to the GPU — is your path to high performance. Caching data in VRAM — and this can be textures or vertices — is also very important. You want to be able to leverage the large bandwidth the graphics processor has to its memory; it's typically going to be, let's say, four or five times the bandwidth you can get from the CPU's memory bus. So if you have static data, you want to move it into VRAM, leave it there, and reuse it — caching it in VRAM and letting us manage the VRAM memory footprint for you. Maximize asynchronous behavior: obviously you want to minimize your synchronization points between the CPU and GPU. You don't want points that stall and lead to synchronous behavior between the CPU and GPU; you want to operate asynchronously, allowing both the CPU and GPU to perform at their fullest. And using threads, briefly, as I mentioned
before. Okay, so what's new? Well, we've spent a year optimizing OpenGL. Some of the highlights are that we've been focusing quite a bit on immediate-mode performance. There are a number of applications out there that have been ported that use immediate-mode drawing paths, which makes it a key optimization, and we've been spending a lot of time on that to help those types of applications coming over to the platform. Pixel transfer paths — and what I mean by that is any kind of copying of pixel data, RGB data, what have you — we've been spending a lot of time optimizing those paths and continuing to improve them, so if your application is sending a lot of pixel data, we're off working on those paths. Vertex program emulation: for any application that needs to run across all of our CPUs and relies on the vertex program feature, we are continuing to improve the emulation of vertex programs on the CPU, so you can run a vertex program on all of our platforms and get the best performance possible. Asynchronous texture downloads, as I mentioned before, are also a new feature since the last time we talked — something that is very important. And as you can see, we have a list of extensions we've added since last year — quite a few. We're continuing to add features regularly, and as fast as they get approved and made part of the OpenGL standard, we will fold them into OpenGL. Okay, so, some basic
tips — things that you don't want to do, or want to avoid. You want to avoid glFlush. What glFlush does is truncate the command stream and flush those commands up to the GPU. Now, the reason you want to avoid glFlush is, one, it's a kernel trap, so you want to avoid that kernel trap. Secondly, you only have a limited number of command buffers, so if you keep issuing glFlushes back to back you will run out of that resource, and that becomes a synchronization point where we have to wait for the graphics processor to finish processing the command buffers in flight before we can get a free command buffer to start working with again. So we tell people just to avoid glFlush — now, there are times when you'll want to use it, and we'll go into that a little bit later. I tell people never to use glFinish. glFinish is a truly synchronous call: what glFinish does is submit the current commands, stop, and wait for the graphics processor to be done with those commands before it returns. So it is a truly serialized synchronization point that will cause the CPU and GPU to stall against each other — I tell people just don't call it at all. Avoid glReadPixels if you can;
you want to use some of our more modern ways of doing it. One technique for replacing glReadPixels is to use glCopyPixels, and glCopyPixels is useful for a range of different copies. For instance, if you wanted to save off some pixel data — some depth data, some stencil data — instead of reading it across the bus, saving it with the CPU, and then uploading it back, what you want to do is use glCopyPixels to store it in VRAM somewhere else. Don't read it across the bus; save it in another buffer up in VRAM, and that way you get the high bandwidth of copying it every time and restoring it when you need it. Asynchronous texture downloads are also a good way to replace glReadPixels — you get an asynchronous readback of your data without a stall waiting for the read to finish. Again, on immediate-mode performance: we've been optimizing it, but it is still one of the slower paths,
so we tell people, when possible, avoid immediate-mode drawing and instead use some of our more advanced extensions. Now, there's one exception to this, and that is display lists. If you use immediate-mode drawing inside a display list, we will take that immediate-mode data, convert it into a more optimal form, prepare the data, and then upload it into the VRAM cache for you. So display lists are a reasonable replacement for immediate mode, and it turns out that's fairly convenient for a lot of people who are already using immediate mode but realize their data is static. If your data is static, you wrap glNewList/glEndList around it, we post-process the data and stick it in video memory, and then you get the benefit of that optimization.
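The captions don't include the slide code here, but the display-list pattern being described is standard OpenGL. A minimal sketch — the triangle data is made up, and this assumes a current GL context:

```c
#include <OpenGL/gl.h>

/* Record immediate-mode drawing once; GL post-processes the data and can
 * cache it in VRAM, so replaying the list avoids per-frame conversion. */
static GLuint build_list(void)
{
    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);
    glBegin(GL_TRIANGLES);
    glVertex3f(-1.0f, -1.0f, 0.0f);
    glVertex3f( 1.0f, -1.0f, 0.0f);
    glVertex3f( 0.0f,  1.0f, 0.0f);
    glEnd();
    glEndList();
    return list;
}

/* Per frame, just replay the optimized form: glCallList(list); */
```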
Minimize state changes: most people who have been working with OpenGL know this one. State changes are expensive; they cause a revalidation of the hardware state, which can be slow if you do it a lot. So avoid redundant state changes and do your drawing in groups by state — you want to coalesce your drawing under a given state setting, which lets you minimize your state changes. Okay, so let's get into more
detail: texture uploads. What we're going to talk about is the texture pipeline — a brief overview of what the pipeline looks like — then some of the optimization basics, and then we're going to get into some of the extensions. The extensions can be different depending on whether you're talking about a power-of-two or non-power-of-two texture, so we'll differentiate a little bit between those two types of textures. For people who aren't familiar with that: there are power-of-two textures, which are standard OpenGL, and recently, over the last few years, there's been non-power-of-two, which allows you to have a texture of any size. That's very useful for general image data — video, pictures, what have you will use non-power-of-two. So here's
a basic diagram of the OpenGL pipeline. The part we're going to focus on for this section of the talk is the pixel pipeline. Looking at a block diagram of what the pipeline looks like: in standard OpenGL on Mac OS X, if you're passing a texture through the system, you can end up with four copies of the data. Obviously that's a lot, right? You want to avoid those. So we're going to talk about how to eliminate each one of those copies, and obviously you get performance increases as you do that. In the default setting you can end up with four copies of your texture as it's passed through the system: one copy is the one you have, one is what the framework has, one is a copy the driver keeps, and one is in video memory. So let's get into some of the
ways to optimize that. So again, minimizing CPU copies is the key here. We don't want to give the CPU redundant things to do; we want to optimize its time. Correct setup will minimize the CPU copies, and what we mean by that is that you want to use the right texture formats and the right pixel formats, which will ensure optimal paths. It will also ensure that the graphics processor accepts that data type. OpenGL supports a very large number of pixel types, and the graphics processors also accept quite a few, but if at all possible you want to stay within the confined set such that you are guaranteed the particular graphics processor you're on has native support for that type, and it won't have to go through some kind of conversion to a type that is compatible with that graphics processor. So here I've got
listed three types: GL_BGRA with GL_UNSIGNED_INT_8_8_8_8_REV, and with GL_UNSIGNED_SHORT_1_5_5_5_REV. Those are the native Macintosh formats: when you set your monitor to 32-bit pixel mode — millions of colors — or to thousands of colors, those are the pixel types the screen is running in, so they give you a compatible type. It also turns out the graphics processors understand those types natively. You'll also see a YUV type there: people who are doing video, or who have a YUV source, can use a YUV texture and that will be accepted as well. Now, when I put these up, some people ask, "What about RGBA?" — the standard OpenGL type. RGBA isn't natively accepted by all graphics processors; sometimes it has to go through a copy and get swizzled into a different format. Usually it's a fairly optimal copy, and sometimes it might be natively supported, but in general you have to be a little careful of that type.
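As a host-side illustration of what that 8_8_8_8_REV packing means — GL_BGRA read in reverse component order puts A, R, G, B from the high byte down to the low byte of each 32-bit word — here's a small sketch. The pack/unpack helpers are mine for illustration, not GL API:

```c
#include <assert.h>
#include <stdint.h>

/* Pack one pixel the way GL_BGRA + GL_UNSIGNED_INT_8_8_8_8_REV lays it
 * out in a 32-bit word: A in bits 24-31, R in 16-23, G in 8-15, B in 0-7. */
static uint32_t pack_bgra_8888_rev(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
           ((uint32_t)g << 8)  |  (uint32_t)b;
}

static uint8_t red_of(uint32_t px)   { return (px >> 16) & 0xFF; }
static uint8_t alpha_of(uint32_t px) { return (px >> 24) & 0xFF; }
```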
So let's talk about the extensions. Client storage is an Apple extension; that extension is a way to eliminate the framework's copy of a texture. What it does is, instead of having the framework make a copy of the texture, the framework keeps a pointer to that texture in your memory. So if the application has a retained copy of the texture, you can just tell us, "use my copy, don't make a copy for yourself," and that eliminates one CPU copy and the memory associated with keeping it. Apple texture range is another Apple extension; this one eliminates the driver's copy, and there are two different ways to drive it: cached and shared. What those mean: cached tells the driver to keep a copy of the texture in video memory; shared means simply point to the copy in system memory when you're doing your drawing. I'll get into a little more detail on this in a bit, but keep those concepts in mind; they are
important. Now, EXT_texture_rectangle is an extension required by some hardware to allow texture range to work properly, and the reason is that some hardware requires power-of-two textures to be laid out in a particular internal format. So if you don't use that extension, you're not necessarily guaranteed on all graphics processors to eliminate the graphics driver's copy of the texture. Keep in mind that texture rectangles tend to be more widely supported for eliminating the driver copy when using texture range. Okay, so let's go back to the block diagram. So
what we see is that, using client storage, we eliminate the framework copy, as I said. And looking at a little bit of source, it's fairly simple: all you do is enable the client storage option when you are building your texture. Before you load it, just call glPixelStorei to enable the client storage option, and that eliminates the framework copy. Do remember that when you do this, you are now responsible for the memory: the framework is pointing at your copy of the texture, so if you delete it and then do something that requires us to access that memory, you'll crash.
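The slide code isn't captured in the captions; here's a sketch of the client-storage setup as described. It assumes a current GL context, and texID, width, height, and pixels are placeholder names for the app's own texture data:

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

/* Use the app's copy of the texture instead of a framework copy. The app
 * must keep `pixels` alive for the lifetime of the texture. */
static void upload_client_storage(GLuint texID, GLsizei width, GLsizei height,
                                  const GLvoid *pixels)
{
    glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);

    glBindTexture(GL_TEXTURE_RECTANGLE_EXT, texID);
    glTexImage2D(GL_TEXTURE_RECTANGLE_EXT, 0, GL_RGBA, width, height, 0,
                 GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
}
```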
Now, looking at the Apple texture range and Apple texture rectangle extensions: as I said before, this eliminates the driver's copy, and I'm showing a block diagram here of it running in cached mode. What happens is that, using these two extensions, the driver points directly at the framework's copy and DMAs it directly from the framework's copy up into video memory, keeping the copy in video memory — that's cached mode. And again, here's a little code snippet showing how to use texture range. Very simple — one glTexParameteri call: the texture rectangle extension target, the storage hint Apple parameter, and then cached Apple for the hint type. Now, using all of these together, what we get is that the graphics processor points directly at the application's copy of the memory and DMAs it directly into video memory. So you've eliminated the CPU actually making your copy — we point directly to your copy of the texture and DMA it directly, so the CPU never makes a pixel-by-pixel copy; the graphics processor DMAs it directly into video memory. Looking at those code snippets together, it simply looks like this: it adds basically two calls. You'll see that I'm using a texture rectangle type, and I've inserted the two previous snippets between the bind and the glTexImage2D call, so that I'm getting the direct DMA transfer, as I just showed.
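Reconstructing the combined snippet as described — client storage plus texture range in cached mode — under the same placeholder names as before (this is a sketch, not the literal slide code):

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

/* Assumes a current GL context; `pixels` stays owned by the application. */
static void upload_direct_dma(GLuint texID, GLsizei width, GLsizei height,
                              const GLvoid *pixels)
{
    glBindTexture(GL_TEXTURE_RECTANGLE_EXT, texID);

    /* Cached mode: keep a VRAM copy, DMAed straight from the app pointer. */
    glTexParameteri(GL_TEXTURE_RECTANGLE_EXT,
                    GL_TEXTURE_STORAGE_HINT_APPLE, GL_STORAGE_CACHED_APPLE);
    /* Skip the framework's intermediate copy entirely. */
    glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);

    glTexImage2D(GL_TEXTURE_RECTANGLE_EXT, 0, GL_RGBA, width, height, 0,
                 GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
}
```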
Switching gears a little bit and looking at the shared option: the shared option, as I said before, means we are not going to cache a copy of the texture in video memory. Instead, we set it up such that the graphics processor looks up the texels at rasterization time directly from system memory. So while it's drawing the polygon, each time it goes to fetch a texel it goes across the AGP bus and looks it up out of system memory. That eliminates the copy in video memory, and there are some uses for that — I'll show a demo shortly. Here's what the code looks like: it's the same glTexParameteri as the cached case, except instead of the storage-cached-Apple hint you pass the storage-shared-Apple hint type, and that means don't cache a copy in video memory. Looking at the block diagram again, this is what it looks like in shared mode: there's no copy in video memory. As it rasterizes, it looks the texels up directly from the application's memory, and you end up with one copy — your copy, in the application. There are times when this is useful, and times when it's better to use cached. Okay, so, same code snippet: all I had to do to get the shared option in there was change cached to shared. I'm going kind of fast, so I should point out that this example is actually available on the website — you'll see at the bottom that there is a sample name. You can look it up, and it has this code in it, so you don't need to be copying this down. Okay, let's talk a little
bit about cached versus shared — when one is appropriate versus the other. Cached mode is better for textures you're going to use multiple times: you don't want to be reading a texture across the AGP bus over and over, so if you're going to use it a lot, you want to cache it in video memory and use it from there without crossing the bus. It's also best when you're using linear filtering: linear filtering has somewhat higher bandwidth usage, because it has to look up neighboring texels to do the filtering. Now, shared — let me talk about that for a second. Shared is better for one-shot images that are very large. The reason I say very large is that in low-video-memory cases, if you upload a large texture, what you don't want is for that texture to evict everything else out of video memory. So if you're running in a low-video-memory case, it is possibly a benefit to run in shared mode, where you're not going to consume any video memory: you leave what's resident in video memory there and just look up your image — straight DMA as you rasterize — as opposed to DMAing it into video memory. There are some caveats. Shared mode runs best with nearest filtering: if you're running linear filtering, again, as it rasterizes it looks up those texels, and linear filtering requires neighboring texels — it will fetch more texels from the neighboring parts of the texture, which will degrade performance somewhat. As well, shared works really well when you're scaling the image down, for the same reason I just said: if you're scaling it down, it doesn't have to pull all the pixels across the bus. So, for power-of-two,
briefly: all the same extensions I just talked about are applicable to power-of-two. This is the same code snippet I had, and all I did was replace the rectangle texture target with the GL_TEXTURE_2D enum type. The difference here is that, as I mentioned before, power-of-two sometimes won't get you a direct DMA: not all graphics processors support direct DMA for power-of-two, and instead the driver will make a copy. You can still use all the same extensions I've been talking about, but sometimes you won't get the direct DMA.
Okay, let's talk about how to manage texture range. As we saw in the diagram, the graphics processor is going to be looking directly at your memory, so the graphics processor and the CPU are now sharing the same piece of memory as it rasterizes. Now there's a problem with that: you now have to synchronize the CPU and the graphics processor so they don't collide. You can't have the CPU writing and the GPU reading the same piece of data at the same time, right? It's the standard problem when you have multiple devices looking at the same piece of data. So what you're going to have to do is double-buffer, and between buffers you're going to have to use glFlush. I've got some diagrams here to show this. If you're running single-buffered: the CPU has just generated a texture — say it read it in, or decompressed it — and now the CPU wants to flush that up to the graphics processor. It issues the glFlush to get that command in flight and transfer the data into video memory, and then the graphics processor does its work of processing and swapping it to the screen. So there you just did one frame, right? And when single-buffered, I have to synchronize my CPU and GPU, serializing the processing, because I only have one data set to work on at a time. As we build through this, this is how the frames go: I can only have the CPU or the GPU working, one at a time. Now, if I go with double buffering, let's see what happens. Say we start the sequence: the CPU generates a frame and flushes it. In the next frame you can see that, with double buffering, the CPU can start working on the second texture while the GPU is processing the first, right? So now we can flush the second one and swap the first one — I just showed one frame while I'm submitting the next one to the graphics processor. And likewise it continues: the CPU can start working back on texture one while the GPU works on texture two, and so on. Basically — and this is an exaggeration, with perfect parallelization — it does make a difference, where you are getting asynchronous behavior between the CPU and GPU.
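The alternation being described is just a two-slot ring: each frame the CPU writes one slot while the GPU reads the other. A toy host-side model of the indexing (no GL; the helper names are mine):

```c
#include <assert.h>

/* With two buffers, the slot the CPU writes on frame f alternates 0,1,0,1...
 * while the GPU always reads the slot the CPU filled on frame f-1. */
static int cpu_slot(int frame) { return frame % 2; }
static int gpu_slot(int frame) { return (frame + 1) % 2; }
```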
Okay, so let's talk about how to synchronize. How do I know when the GPU is done processing my data? It's pointing at my data set — how do I know when it's done accessing that data? Well, if you're using texture range, what you need is the Apple fence extension. With the Apple fence extension, you insert a token into the command stream, and you can query for that token to determine when the GPU is done reading your texture, so you know it is now safe for the CPU to start touching the data again. There are a couple of ways to do that: you can use it by inserting a token, or you can use it directly by referencing a texture object. A texture object is just your standard texture ID, and in the fence test/finish object command you just pass it the GL texture object type. Looking at a little bit of code: the first two commands here show how to set up a fence. You just do a glSetFenceAPPLE — you can pass it any name you want; it's just a token you insert into the command stream. Then, when you are ready to start touching the data — the texture — again with the CPU, you test to make sure the GPU is done. That's a synchronization point where the CPU waits for the GPU to be done reading that data, at which point you can start touching it again with the CPU. Now, the last command up on this screen is the way you can use the fence extension without setting a fence explicitly: you can just test for a texture object. All you do is call glFinishObjectAPPLE with the GL texture target type and the name of the texture — so if you bound a texture, you would just test against that same texture ID number.
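A sketch of the fence calls as described — the fence name is assumed to come from glGenFencesAPPLE, and the ordering around the CPU work is my assumption about the surrounding code, not the literal slide:

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

/* Wait until the GPU has finished reading a shared texture before the CPU
 * rewrites it. Assumes a current context with APPLE_fence available. */
static void sync_before_rewrite(GLuint fence, GLuint texID)
{
    glSetFenceAPPLE(fence);        /* drop a token into the command stream */
    glFlush();                     /* make sure the commands are in flight */

    /* ...do unrelated CPU work here while the GPU drains... */

    glFinishFenceAPPLE(fence);     /* blocks until the GPU passes the token */

    /* Alternative: wait on the texture object itself, no explicit fence. */
    glFinishObjectAPPLE(GL_TEXTURE, (GLint)texID);
}
```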
Okay, so let's switch to demo machine two. What I wanted to show here is the example of texture range. You'll notice that the CPU is doing very little here. The first thing I want to point out, very quickly, is that the CPU continues to do very little, and that's a sign that the CPU is not making copies of the pixels — because if the CPU were copying the pixels, I would see a big spike of CPU processing time. Instead, what's happening is that the graphics processor is talking directly to the memory controller, getting a direct DMA, so the data is not going through the CPU. Now I'm going to turn on the infinite button — it's a button I recommend everyone put in their app: make it go infinitely fast. So even though I'm now doing 240 frames a second and getting a gigabyte a second across the AGP bus, I still have no CPU time, because again the CPU is not doing anything here. The CPU is orchestrating this, not copying the data — simply directing the traffic, as you might think of it. Now, what I really want to show with this example is the shared option. As I said before, the cached option is good for drawing multiple times, but the shared option is good when you shrink an image down. It turns out this image is actually 1024 x 1024 — a lot larger than my window. So if I switch to shared mode, now you'll see I'm getting three gigabytes a second, right? Well, that's not even possible — the application thinks it is, though — and the reason is that I've shrunk the image with nearest filtering: some of the texels of that image aren't actually going across the AGP bus, because the graphics processor is skipping scan lines and skipping pixels, only selecting the ones it needs to draw the image. So now I'm getting 700 frames a second, as opposed to one gigabyte a second at 240 frames a second. Quite a boost, and it doesn't take any video memory, because I'm not caching a copy in video memory — I'm bringing it directly across the bus. So there are times when this type of technique can be a large win. Okay, I'm going to switch back to slides.
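The demo numbers are roughly self-consistent: a 1024 x 1024 texture at 4 bytes per texel is 4 MB, so 240 uploads a second is about a gigabyte a second. A quick arithmetic check (the frame rate and size are from the demo; the helper is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second to stream a w*h texture at `fps`, 4 bytes per texel. */
static uint64_t upload_rate(uint64_t w, uint64_t h, uint64_t fps)
{
    return w * h * 4 * fps;
}
```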
Okay — asynchronous texture downloads. Let's talk about how to set up texture range — I'm sorry, asynchronous texture downloads. It's basically the same thing as uploading the texture, which we just talked about, where you set up a texture as an AGP texture for direct DMAing. Asynchronous texture downloads are the same setup, but in reverse. You set up the texture the same way, and then you use glCopyTexSubImage to copy the data into that texture. The way it works is that glCopyTexSubImage is the call that initiates the transfer from video memory back into your texture in system memory — the reverse — and it is an autonomous call that happens asynchronously. The next time you issue a flush, there'll be a glCopyTexSubImage call in there, and the flush will issue a DMA transfer from video memory into system memory. That's autonomous; the CPU doesn't need to wait for that event to happen. Now, there does need to be a synchronization point, because the CPU needs to know when it's done. So what you use, at some later point, is the glGetTexImage call — that's the synchronization point that waits until the transfer is done. Hopefully you've done enough processing between your glCopyTexSubImage and your glGetTexImage that the transfer is done and you don't have to stall and wait. The idea is that you separate those two as far as you can — maybe double-buffer them, triple-buffer them, do some processing between them. The basic setup is the same: glGetTexImage takes the same pointer you originally passed for the texture, and the parameters must match the setup of the texture — however you set up the texture, those same parameters are used in these calls as well. And again, you do this as late as possible to get the maximum asynchronous behavior. So let's look at a little bit of code. The setup, you'll notice, is exactly the same as it is for texture upload. The download is the key part, and the two key calls are glCopyTexSubImage and glGetTexImage. If you issue those calls on a properly set-up texture, you'll get an asynchronous download, and on my systems at work I can get about 500 megabytes a second of download performance, which is usually pretty acceptable for most people, particularly considering it can be an asynchronous operation.
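A sketch of the download pair as described, assuming a texture already configured for texture range exactly as in the upload examples; texID, width, height, and pixels are the same placeholder names:

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

/* Queue an asynchronous VRAM -> system-memory transfer into the texture. */
static void begin_download(GLuint texID, GLsizei width, GLsizei height)
{
    glBindTexture(GL_TEXTURE_RECTANGLE_EXT, texID);
    glCopyTexSubImage2D(GL_TEXTURE_RECTANGLE_EXT, 0,
                        0, 0,            /* offset within the texture    */
                        0, 0,            /* lower-left of the read area  */
                        width, height);
    glFlush();                           /* put the DMA in flight        */
}

/* Later — the synchronization point; blocks until the data has landed in
 * the app's pointer (the one originally passed for the texture). */
static void finish_download(GLvoid *pixels)
{
    glGetTexImage(GL_TEXTURE_RECTANGLE_EXT, 0,
                  GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
}
```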
Okay, let's talk about vertex data. You'll notice in this part of the talk that vertex data setup and optimization is about the same as for textures — it's just a different data type — so a lot of the discussion is the same, and all we're going to do is walk through it and point out some of the differences and peculiarities. We're going to go through a pipeline overview, talk about the basics again, and then talk about the extensions. We're going to separate dynamic and static data, talk about some differences, and get a little more detail on display lists. In this part of the talk we're pointing at the geometry part of the pipeline, not the pixel part. So, some basics. The first thing is that, like in the pixel discussion, you want to pick data types that are most optimal, and the most optimal types for the vertex paths are floats, shorts, and unsigned bytes. If possible, stick with those types: most graphics processors handle them natively, so you'll get optimal upload performance, and in the cases where the CPU might be making a copy, we've spent time optimizing those paths. These are the types that will give you optimal performance. The other basic point of optimizing vertex upload performance is that you want to avoid function call overhead. Obviously, immediate mode — where you're sending one vector of data per call — is pretty inefficient as a copy routine. What you want to do is use the vertex array calls, glDrawArrays and glDrawElements, to get the data through the system with as few function calls as possible. Another good technique is to use CGL macros. CGL macros are a way to directly reduce function call overhead, and it can be pretty dramatic how efficient you can make the function calls when you start using the function-pointer dispatch table directly, as opposed to going through the top-level library entry points. That's a concept people may want to keep in mind if they're making a lot of function calls.
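CGL macros work by routing gl* calls through a context's dispatch table. The usual pattern on Mac OS X looks roughly like this — a sketch: the local variable name cgl_ctx is what the macro header expects, and the drawing calls are just illustrative:

```c
#include <OpenGL/CGLMacro.h>  /* redefines gl* calls as dispatch-table macros */

static void draw(CGLContextObj ctx)
{
    /* The macros expand against a local named cgl_ctx, bypassing the
     * top-level libGL entry points and their per-call overhead. */
    CGLContextObj cgl_ctx = ctx;

    glClear(GL_COLOR_BUFFER_BIT);
    glDrawArrays(GL_TRIANGLES, 0, 3);
}
```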
you are when you're drawing vertex data
using passing vertex data and opened
jail you want to maximize a number of
vertices / draw command so what you're
using arrays you want to maximize the
number of vertices / draught men or if
using begin in you want to maximize the
number of vertices / begin end you want
to get as many vertices / begin end as
possible and that can make a significant
performance improvement if you can do
that another optimization technique is
to offload CP processing using vertex
programs onto the GPU so if you have
computational processing you're doing on
the vertices think about trying to
offload that work to a vertex program on
the graphics processor okay so how do we
Okay, so how do we eliminate CPU copies? Another key concept, almost exactly the same as with textures: we can use the APPLE vertex array range extension, which is a parallel API to the texture range extension, and you can also think about using the new ARB standard, vertex buffer objects. Those are nearly the same type of API; ARB vertex buffer objects is a cross-platform API that will allow you to optimize your vertex data throughput. Caching static data in VRAM is a key concept here: just like textures, you want to be able to cache that data in VRAM. Use display lists for static data — again, we'll preprocess that data and cache it in VRAM for you. Display lists are a very effective way to get your data processed properly into VRAM.
So, looking at the pipeline: like textures, there can be multiple copies of the data as it goes through the pipeline. This is showing immediate mode drawing; with immediate mode, we're required to keep the current vertex state before we pass it on to the graphics processor, so using immediate mode you get one extra copy immediately. If I switch to using vertex arrays, I can eliminate that copy just by using vertex arrays — so I'm saving myself some processing time. Now let's talk about these extensions.
For APPLE vertex array range with dynamic data, what you want to do is pass it the STORAGE_SHARED_APPLE hint, just like the shared hint we were using on textures to say how we want the data to be treated. And for ARB vertex buffer objects, you want to use the DYNAMIC_DRAW_ARB constant, which is the equivalent of our storage-shared constant, and it will give you the optimal treatment for dynamic data. What happens when we use these extensions combined with vertex arrays? We end up with the same thing we had for textures: we get a direct DMA from the application's copy of the data straight into the graphics processor's pipeline. It reads it directly into its pipeline of processing — no copies in video memory.
Looking at some sample code for setting up dynamic data with vertex array range: it's very simple — there are two calls that are key to this. With glVertexArrayRangeAPPLE you just set a pointer and a size, and that tells us how big a piece of memory you want to map: you malloc the data, you give it to us, you tell us where the pointer is and what size it is, and we map it and get it prepared as a suitable storage area for direct DMA. Now, you need to make sure you flush that data — you need to tell us when the data has changed. So you have to call glFlushVertexArrayRangeAPPLE any time you change the data; that includes initially, when you first set it up, and when you modify some subregion. You tell us a pointer and a size that you want us to flush — you can flush a subregion or you can flush the whole thing. Tell us any time you change the data, and we'll make sure that all caches, all copies, are synchronized with your copy of the data.
Okay, vertex buffer objects — a little more setup, though not much more. You'll see that glBindBufferARB has an object-type binding where you bind to a name, and that gives you the ability to switch between buffer objects: you can create many of these and bind to them as you need to. Then you set up your data size using glBufferDataARB, and in that call you'll see we're passing DYNAMIC_DRAW_ARB — that tells us this particular buffer object is going to be set up for dynamic drawing and we're going to be changing the data frequently. And then you call glMapBufferARB; MapBuffer is actually where you get the pointer back. So instead of you allocating the memory, OpenGL is going to allocate the memory for you and hand the pointer back to you. Then you fill out the data, and then you unmap it — and Unmap is the equivalent of our flush. On unmap we flush the data out, and now everything, all the caches, have been synchronized and the GPU is ready to use that data. Again, the sample code that shows this is listed at the bottom of the slide; it's going to be available tonight up on the server so everyone will be able to download it. We've updated it with vertex buffer objects so you'll be able to see how to use the new extension.
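A sketch of that VBO setup (sizes are illustrative; here the initial data pointer passed to BufferData is NULL and the buffer is filled through MapBuffer, one common pattern):

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>
#include <string.h>

#define VERTEX_BYTES (1024 * 1024)

GLuint setup_dynamic_vbo(void)
{
    GLuint buf;
    glGenBuffersARB(1, &buf);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, buf);

    /* DYNAMIC_DRAW: the VBO equivalent of STORAGE_SHARED --
       this buffer's data will change frequently. */
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, VERTEX_BYTES, NULL,
                    GL_DYNAMIC_DRAW_ARB);
    return buf;
}

void update_vbo(const GLfloat *src, GLsizei bytes)
{
    /* GL hands us the pointer; we fill it in, then unmap.
       Unmap is the equivalent of the flush call -- it synchronizes
       all caches so the GPU can use the data. */
    void *dst = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
    memcpy(dst, src, bytes);
    glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
}
```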
Okay, static data. Static data is set up almost like dynamic data; it just uses different constants: for APPLE vertex array range we use the STORAGE_CACHED_APPLE hint, and for ARB vertex buffer objects we use the STATIC_DRAW_ARB constant. The rest of the setup is the same — you basically just pass in different constants.
Display lists: again, you can use immediate mode between the begin and end calls, and we'll preprocess the data to put it in optimal form after uploading. A couple of key things to remember for display lists: you want to keep the data you pass in consistent per vertex. What I mean by that is, if you do a glBegin and for the first vertex you pass a glColor and a glNormal and then a glVertex, but for the next vertex you pass only a normal and a vertex — no color — you're passing different types of data per vertex, and what that does is actually confuse our optimizer. It makes it such that it won't optimize the data, and you won't actually get any benefit from it. So what you want to do is make sure the first vertex you issue carries all the data that's going to be required per vertex: if I need a normal and a color, then pass a normal and a color for that first vertex, and after that you can call anything you want, as long as you don't start adding something besides the color and normal. I'm not sure that's 100% clear, but we can talk about it more.
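A sketch of that consistent per-vertex layout (the geometry here is made up):

```c
#include <OpenGL/gl.h>

GLuint build_list(void)
{
    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);
    glBegin(GL_TRIANGLES);

    /* Consistent: every vertex gets the same attributes
       (normal + color + position), so the optimizer can pack them. */
    glNormal3f(0, 0, 1); glColor3f(1, 0, 0); glVertex3f(0, 0, 0);
    glNormal3f(0, 0, 1); glColor3f(0, 1, 0); glVertex3f(1, 0, 0);
    glNormal3f(0, 0, 1); glColor3f(0, 0, 1); glVertex3f(0, 1, 0);

    /* Inconsistent (avoid): one vertex with only a color, the next
       with only a normal -- mixed per-vertex data defeats the
       display list optimizer. */

    glEnd();
    glEndList();
    return list;
}
```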
So, vertex array range and display lists: what that gets us for static data is that, like textures, we can cache the data in VRAM, and there are three different ways to do it — vertex array range, display lists, or vertex buffer objects. They get you basically the same behavior, which means you're caching static data in VRAM, and when you reuse that data you're getting the full bandwidth of the graphics processor's bus, and that gets you a significant performance boost.
Okay, looking at the static data setup: it's similar to the dynamic data setup, except you pass the cached hint instead of the shared hint for vertex array range, and for vertex buffer objects all you do is change the DYNAMIC_DRAW constant to STATIC_DRAW_ARB.
Now, paralleling textures, let's talk about how to synchronize your data. If you're using vertex array range, the application and the graphics processor can be sharing the same data, so you're going to need to manage the synchronization and flushing between your draws if you're going to double-buffer the data. It's the same type of operation as before: the CPU generates data and flushes it to the GPU, and the GPU processes the data and puts it up on the screen. With single buffering, the CPU and GPU run serialized — they're not operating in parallel — all the way through the frames. With double-buffered data, the CPU can process data, flush it to the GPU, and while the GPU is processing, the CPU can start processing the next batch, and so on through the frames — so theoretically we can get up to double the performance if everything is perfectly parallelized.
Okay, similar to the texture range: you use a fence for synchronizing the vertex array range data. You'll need to know when the GPU is done processing the data, so you would set a fence, or use the test object mechanism referencing a vertex array object, and that will let you know when the processing has been completed so that you can start touching the data again with the CPU. So the fence extension is what you'll be looking to use for vertex array range. Vertex buffer objects don't require this — they have their own synchronization mechanism: you map, you change your data, and you unmap. So you don't need the fence extension for buffer objects, only for the vertex array range extension. And just like with textures, looking at some sample code: the first two lines are the same, where I'm setting a fence and finishing against that token — I've inserted the fence and finished against it — and then in the third line of code, instead of using a GL texture object type, I'm using a vertex array object type for the test. That allows me to establish a synchronization point where I can be guaranteed the graphics processor is done touching the data in the vertex array range, and lets me synchronize the graphics processor and the CPU.
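A hedged sketch of that synchronization with the APPLE fence extension (the fence name and vertex array object name here are illustrative):

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

static GLuint fence;

void init_fence(void)
{
    glGenFencesAPPLE(1, &fence);
}

void draw_and_sync(void)
{
    /* ... issue draws that read from the vertex array range ... */

    glSetFenceAPPLE(fence);     /* drop a marker into the command stream */
    glFinishFenceAPPLE(fence);  /* block until the GPU passes the marker */

    /* Alternatively, wait on the object itself: a vertex array object
       type instead of a GL_TEXTURE type, as the slide shows. */
    glFinishObjectAPPLE(GL_VERTEX_ARRAY, 1);

    /* Now the CPU can safely rewrite the vertex data. */
}
```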
So, here's a little bit of history. Last year I showed this slide — well, not quite; I've extended it out a little bit further — but I want to show what we've been doing. I talked about some optimizations; this slide shows exactly the data I showed last year, and with all the hardware changes and software changes we've made over the year, here's where we are this year. It's a huge increase in performance. Looking at some charts: if we look at immediate mode performance for 8 vertices per begin/end, we've gone up eight hundred percent; vertex arrays, eleven hundred percent; a thousand percent for vertex array range; and seventeen hundred percent for display lists. Now, this is using very small batches — only eight vertices per draw command — so it's a very small data set per draw command.
So this shows some of the function call overhead associated with small drawing batches. Looking at another point in that chart, this one is using 42 vertices per draw command — a somewhat more optimal setup — but you'll see that we're still making quite a bit of performance gains: up to four hundred seventy-seven percent for immediate mode. As I said, we've been working quite a bit on immediate mode, so since last year we've increased performance of our systems by almost five hundred percent, and this is not only a software change but a hardware change, right? This is comparing state-of-the-art hardware and software last year against state-of-the-art hardware and software this year: almost five hundred percent faster. Okay, switch to the demo machine, please.
Okay, what I wanted to show here is some of the effects I talked about. Here I've got a simple mesh with only eight vertices per strip, and you can see I'm only getting 1.5 million triangles a second with immediate mode. Now, if I increase the detail of this mesh by selecting this option — now I'm up to a mesh that has 198 vertices per strip — you'll see my performance jumps dramatically: now I'm up to 12 million triangles a second. Now, as I said before, you can reduce function call overhead using CGL macros, so I've got an option here to turn them on. I just select CGL macros and you'll see that I went from 12 million to 17 million — I gained five million triangles a second of immediate mode performance by enabling CGL macros. It's a really large difference. But now let's see what we get when we use a more optimal form of drawing. I'm going to switch now from immediate mode to DrawArrays — now I'm at 24 million, so DrawArrays is more optimal than the best you can make immediate mode. Now let's try some of the extensions we talked about: let's switch to vertex array range — DrawArrays plus vertex array range — and I go to 50 million, okay, so we're making some pretty good strides here. Now, let's say my data is static, which this happens to be; let's switch to display lists. Again, display lists are set up to cache static data in video memory, so I'll be using the bus bandwidth available on the graphics processor, and I go from 50 million to over a hundred million. So we started off at just a few million and now we're up at 100 million: using the proper extensions and understanding how to optimally batch your data through the system can make a very, very large difference. Okay, back to slides please.
Okay, one-shot images. The best way to pass up one-shot images that are small is using DrawPixels. The reason is that for small images the overhead is not going to be the copy of the data — and DrawPixels will always copy your pixel data — it's the function call overhead of getting in and out of the system. So you have to weigh the function call cost of driving OpenGL against the expense of copying the pixels. DrawPixels works really well for small images, and I recommend you experiment with this if your images are smaller than about 128 by 128 pixels in size. Now, one of the keys to making DrawPixels go fast is to disable any complex rasterization state. The reason for that is that DrawPixels goes fastest when we're not going through the 3D pipeline, but are instead allowing the graphics processor to use its 2D pipeline, so we can do a straight blit into the frame buffer. So: no blending, no dithering, no stenciling, no alpha testing — nothing the 3D pipe needs to do — so we can stay on the 2D pipeline. Disabling complex state will give you the best performance. And again, this demo is available on the website today, so people can look at the example I'm going to be running here. Okay, so it's a simple little code snippet: disable the complex state and then issue your DrawPixels. Again, same as with textures, you want to use a pixel format that is supported by the graphics processor, so we don't have to do expensive conversions — because if you pass in some type like float, we're going to convert it to something the graphics processor can handle, and that may be slower than you'd like to see.
Okay, so back to demo two, please. This is a demo showing DrawPixels — again, I put the "infinitely fast" button on there, something I highly recommend. So if I zoom this down to something very small — you can almost not see it, and I apologize, but it's two pixels by two pixels — you can see I'm getting a million DrawPixels commands per second. Now, you can see I'm only getting 15 megabytes a second of bandwidth of actual pixel copying performance, but I'm getting a million draw commands. As I move this up in size and draw something larger, you see I'm now getting eight hundred megabytes a second of pixel copy performance, but my actual frame rate has dropped from — what was it, a million — down to 11,000. So now I'm starting to be memory bandwidth limited: the performance is now being throttled very much by the copy operation being done on the pixels, as opposed to the function call overhead of getting in and out of OpenGL. And this is the boundary I was describing: small images are going to go really fast, larger images are going to start hitting a memory bandwidth limit, and there you might want to start considering some of the other texture upload techniques for this operation. Okay, back to slides please.
So let's talk about pixel copy operations real quickly. The key to pixel copy operations is to get VRAM-to-VRAM performance: you can get extremely high bandwidth if you don't have to come across the bus. So any time you're storing data that you want to temporarily stash off and be able to restore later, CopyPixels is a great way to do that. Now, as for where you want to store the data, there are a couple of options. One is to use an auxiliary buffer. Apple has extensions where you can have auxiliary buffers with depth and stencil associated with them, such that you can copy your depth buffer, stencil buffer, or color buffer off into a temporary location and use CopyPixels to copy it back and restore your data. So if you want to refresh any one of those types of buffers, an auxiliary buffer will work well for storing that data. You'll see that I list the pixel format attributes you would want to include in your pixel format when setting up your OpenGL context to give your auxiliary buffers those additional buffers — the aux depth/stencil one at the bottom is the one for expanding your auxiliary buffer format type.
Just like with DrawPixels, for CopyPixels you'll want to have the state in a very simple form, because you want to use the 2D pipeline: when you're copying from one graphics processor video memory location to another, you want the 2D engine to do that operation if possible, and to do that you want to minimize your state — keep your state at very simple settings. It turns out to be basically the same restrictions as DrawPixels, so you'll want to disable as much of the state as you can to try to get that VRAM-to-VRAM 2D blit. Okay, looking at a little piece of sample code for that: you just disable your state, you set up your read buffer and draw buffer, and then call CopyPixels. Here you can see I'm going to copy data from the auxiliary buffer — maybe where I stored the data temporarily — back into the back buffer, restoring say a depth buffer, or in this case a color buffer, and getting a very fast restore of that image.
Okay, threads — let's talk about threads a little bit. First off, here's what I'm going to cover: the rules for threading; division of work — what kinds of strategies you can use for dividing your OpenGL processing onto multiple threads, and some of the effective techniques; sharing data between contexts — how you can set up multiple contexts to share a common data set; and synchronizing your threads — the proper mechanisms for synchronizing multiple threads.
So, the rules. In OpenGL, if you're going to use multiple threads talking to a single context, both of those threads cannot be in OpenGL simultaneously. If they are, really bad things will happen, because we do not mutex-lock on a per-context basis against multiple threads entering GL — that work is required of the application. So you will need to do your own mutex locking if you have multiple threads talking to a single GL context. Now, as I'll show in the examples, you can have multiple GL contexts, one thread for each, and then you don't have to worry about any mutex locking; but if you are sharing a single context across multiple threads, mutex locking is the application's job to get done properly. Other things you can share: you can share context data across threads. You can set it up such that you have multiple contexts with a common set of object data — shared object state — that multiple threads will be referencing, and we do mutex-lock that, such that multiple threads can be working on the shared state via their own contexts and we will manage the shared data set. You can also have multiple contexts talk to a single surface, so you can have one video memory surface with multiple threads and multiple contexts talking to that same video memory and doing their drawing that way.
So let's talk about division of work a little bit. The possibilities: moving OpenGL onto a separate thread — you can have your application on one thread and OpenGL on a separate thread; an obvious way of doing it, but not always the optimal way. Another thing you can think about is splitting OpenGL vertex and texture processing. That's very useful when you have video data, or you're generating pixel data coming off a disk or from some other source, that you want to load into OpenGL while a second thread does the drawing — so OpenGL has multiple threads, one for loading and one for drawing. Now, what gets shared between contexts? A lot of times people don't clearly understand this: when you have multiple contexts set up to share each other's object state, the things that get shared are display lists, textures, vertex and fragment programs, and vertex array objects. That data gets shared when you share two contexts, so that data set becomes common between the contexts if you set them up properly, and we manage the mutex locking for access to that shared data. And like I said, you can share an OpenGL surface, so you can also set it up such that multiple contexts all talk to one VRAM buffer.
Let's look at some diagrams of how that looks. Here, the red circles are threads: on the left I've got the application doing some CPU processing; it passes that data off to thread two, which takes that data and uses it to draw some OpenGL — very simple, using one OpenGL context, with GL on its own thread. Here's an example of splitting OpenGL across multiple threads: now I've got two threads, one OpenGL context per thread, set up such that they're sharing OpenGL state and talking to the same video memory surface. So they share state, they share the VRAM buffer, and we manage the shared object state. This shows texture data on one thread and vertex data on another, but those are arbitrary — you can obviously mix those up and have any kind of inputs from either side talking to the shared object state. And here's a variation on that, if people want to use pbuffers: you can have one thread talking to a pbuffer, link that pbuffer into the shared state as a texture, and then use thread one to draw scenes that reference the textures generated into the pbuffer. Now, looking at a little bit of setup code.
This is using Cocoa — how to set up a shared context using Cocoa. You'll see that I create a context and then init it with a pixel format, passing in a share context: the third line down is the share context, and that's how you link two contexts together to have common shared object data structures. That allows you to share textures, display lists, programs, and vertex object data.
Okay, synchronization between threads. What you want to do is use standard OS thread locking — NSThread and NSLock, for instance, as one example — and obviously you can use any other OS-level facility for managing threads. I guess the main point of this slide is that there's nothing in OpenGL to manage your thread synchronization; standard OS tools and facilities do that. Don't use the Apple fence extension for managing your threads: the Apple fence extension is for managing synchronization between the CPU and the GPU, not between two CPU threads. That's an important point to remember if you're going to start dabbling in multiple threads. And by the way, just as a point: if you mess up threads and you have multiple threads talking into the same context, you will cause all kinds of bad things — bad things that can go as far as hanging your system. You'll introduce a bad command into the graphics processor, the graphics processor may hang, your screen may wedge, the CPU will block up against that, and everything will come to a halt.
So, in the beginning of this talk I talked about effectively using the CPU and the GPU, and that's one of the things I want to show. First off, we wrote an AltiVec routine for this little sinusoidal wave simulator here, and you'll see it's going pretty fast: it's generating 18 million triangles a second. What we've got on this chart is: the red at the bottom is time spent in the system outside of the application and OpenGL; the green is time spent calculating the wave; and the blue is time spent in OpenGL. Okay, now I'm going to multithread that — I'm going to split it across both of the CPUs I have on this system. That brought my performance up a little bit: I was at 19, now I'm up to 21 and something, so I've improved performance a little bit. Now, the surprising thing — and this machine actually has a high-end graphics card in it — is what happens when I move the wave calculation into a vertex program on the GPU. Before I do that, you'll see that the CPUs are very busy; there's a lot of time going into calculating this wave on the CPU. Now I move the wave calculation off onto the GPU, and the thing to watch is the performance: from 21 million triangles a second at 338 frames a second — look at it drop down to 15 million. I know there are people out here from the hardware vendors saying "that can't be possible, our hardware can always outrun the CPU," but it's not true. The CPU is really good at some things, and you can write really efficient code that will sometimes outrun the graphics processor. And notice, by the way, that the CPU is now barely doing anything — yet my performance went down. So what I'm trying to point out here is that if your goal is maximum performance, sometimes you want the CPU to be doing work. If your goal is to have the CPU doing nothing, then sure, offload all the processing onto the GPU and the CPU will be free to do something else — but that won't guarantee maximum performance. Maximum performance is found by experimenting with what the optimal combination is. Okay, back to slides please.
Okay, let's wrap up. After this session there are a couple more OpenGL sessions that are really good and that I recommend people go to. There's the optimization live session, which is going to be a live session about our tools — the OpenGL Profiler and the OpenGL Driver Monitor — shown live on stage. A really good session; I find I'm always using the OpenGL Profiler for analyzing applications and figuring out where the bottlenecks are and what I need to optimize. It's a similar tool for OpenGL as Shark is for the CPU. And then on Friday we've got the introduction to the OpenGL Shading Language — for those that don't know what the OpenGL Shading Language is, it's a good introduction to what that language looks like and some of the capabilities it has; highly recommended for people interested in programming the graphics processor. As for contacts: contact myself or Travis Brown — if people want to talk to me, come up afterwards and I can give you a business card, so you don't need to write that down too quickly. For more information you can go to the Apple website at developer.apple.com/opengl — that's a good resource for OpenGL information from Apple — or you can go to www.opengl.org, OpenGL's official website, which contains specifications and pointers to a variety of resources that people find useful, plus a reference library. And we do have some references out there; I want to take note of these couple of documents that are out on the system.