WWDC2003 Session 209

Transcript

Kind: captions
Language: en
Good afternoon. Welcome to Session 209, Advanced OpenGL Optimizations. I'm Travis Brown. I'm sure by now you know what I do, so we'll skip that, but this is a really good session for you to attend, because we're going to be talking about the techniques that optimize OpenGL on Mac OS X. I want to make a couple of quick comments in getting underway, because this is sort of a super-sized session, so we're going to try to fit it into the allotted amount of time. It will be a bit of a struggle, but we've got great content. The key thing I want to point out is that one of the advantages Apple has, in terms of our implementation of the operating system, is that we deliver the drivers and also the OpenGL stack. Because of that, we're able to put in certain fast paths that let you really unlock incredible performance out of the latest generations of GPUs. We also work in the operating system and create our own tools; we have a tool chain that's available to you to really take a look at how your OpenGL application is performing, debug it, and just unlock it so it runs as wide open as it can. So it's my pleasure to welcome John Stauffer, the manager of the OpenGL engineering team, to the stage to take you through the presentation.

Thank you, Travis. As Travis said, I manage the OpenGL engineering team at Apple, and this is going to be a session on optimizing OpenGL, trying to get the most out of OpenGL on the Macintosh. Okay, so what we're going to go
over today: we're going to try to cover some basics, move through those as quickly as possible, and get into some techniques that you'll want to leverage in your application to get the maximum performance. Some of those things are optimizing texture uploads; optimizing your vertex data throughput; optimizing for one-shot images, where if you're just blitting an image up to the screen and you want to dispose of it afterward, that's what I call a one-shot image; optimizing copy pixels, so if you want to move pixels around, how do you do that quickly; using threads with OpenGL, since OpenGL is thread-safe and a great way to leverage our systems is to use threads; and lastly we're going to go into the OpenGL Profiler, which is a tool you can use to analyze your code, look for hot spots, and look for blocks, places where it may be blocking up against OpenGL. Okay, so the goals of optimizing.
There are a couple of goals. One is to maximize the performance of your rendering: you may want to get maximum performance, and that means utilizing the CPU and GPU in a combination that will get you that performance. Another possible goal is to minimize CPU burden. It's a little different; sometimes they result in the same code, but sometimes they don't. To minimize CPU burden usually means maximizing burden on the GPU: you may want a technique that offloads as much work as possible onto the GPU, leaving the CPU free for doing other work.

So, key concepts. Keep these in mind while I'm talking, so that hopefully the concepts I'll be presenting will make sense even if I'm not explaining them as clearly as I need to. Eliminating CPU copies is a key goal of any optimization: you want to reduce the number of times you touch the data, so you move it through the system, get it to where it needs to go, and start operating on it or drawing with it. Cache static data in VRAM: the video card has higher memory bandwidth than system memory, so if you have static data, the ideal place to put it is up in video memory. Put it in video memory, leave it there, and draw with it from there, and you'll get dramatically higher performance, as I'll show in some of the demos. Maximize asynchronous behavior between the CPU and GPU; that's key. You've got two asynchronous processors, two asynchronous pieces of hardware, so you're going to want to stay asynchronous, right? You're going to want to run one in parallel with the other; you're not going to want them to block against each other. That's the key concept to getting maximum performance. And again, like I said, using threads is a concept that can be beneficial at times. So, basic things to
avoid. We're into the basics now, just a general overview. What you usually don't want to do is call glFlush. A flush pushes a command buffer up to the hardware; it uses resources in the driver, and it can cost you a bit of function overhead getting into the driver to make that happen. Don't do it unless you have to; there are a couple of reasons you might have to, and we'll cover those a little bit later, but in general, avoid it. Never call glFinish; I frankly don't know of any case where glFinish is really required, and there are other ways to do it, possibly more efficiently. Avoid calling glReadPixels. glReadPixels is only really required when you want to get pixels back off the video card and save them somewhere for later. But if you're going to use an algorithm for rendering some effect and you have read pixels in there, you probably want to look for a better way to do it, like caching the data somewhere else on the video card and then reusing it by copying it out of that store, that cache in video memory, back into your scene.

Avoid immediate mode drawing. Immediate mode drawing is when you use glBegin/glEnd with a series of vertices and colors in between the begin and end. There's one caveat to that: you can use immediate mode drawing in display lists. We have a post-processor that will go through, take that begin/end sequence, post-process it, optimize it, and cache it in video memory for you. Display lists are static data, right? You can't modify a display list once you've created it, so by definition it's fundamentally static, and we will treat it like static data by caching it in video memory. So display lists are a good place to put your static data.

Minimize state changes. State changes in the hardware can be expensive, and in fact the more complex the hardware gets, the more expensive they tend to get. Therefore, you want to group your rendering according to state changes. There's usually a hierarchy of how you want to group: maybe group by texture, group by blending mode, group by drawing method, where you coalesce your types of primitives, triangles, quads. Coalescing your database according to state will cause fewer state transitions in the hardware, and that can be very beneficial to performance. Okay, so jumping into
optimizing texture uploads. First we're going to do an overview of the texture pipeline and talk briefly about that. Then we're going to go into texture optimization basics, just an overview of what we want to look at, and then we're going to go into some OpenGL extensions. We're going to break those down into categories: power-of-two extensions and non-power-of-two extensions, because there are ways to optimize slightly differently for those two cases.

Okay, so the OpenGL pipeline generally looks like this. For this part of the talk, optimizing texture uploads, we're going to focus in on the pixel pipeline, the highlighted yellow boxes there. To zoom in on that a little bit and talk in more depth about what happens while the data is moving through the system: here we have a basic block diagram, and at each block on this diagram, oval or block, the data may be copied. So you have the application: the application has a copy of the texture. You hand it to OpenGL: OpenGL may make a copy of the data and store it in the framework. When it goes to draw it, the driver may make a copy into some hardware-specific format for uploading to video memory. And then of course video memory has a copy, right? So theoretically it's possible that you might have up to four copies in your system at some point. The goal of some of this discussion is how to avoid some of that, how to control how that operates. Okay, so basics. Again,
like I said, we want to minimize CPU copies and conversions. It's possible that if you pass in data that isn't in the format the hardware wants, OpenGL will have to do a conversion. So you're going to want to pick data formats that are optimal, and I mention some here: BGRA with unsigned int 8_8_8_8 reversed, BGRA with unsigned short 1_5_5_5 reversed, and the YUV format for YCbCr data. Those formats can be a fast path through the system; without any other state that may cause conversions, they do not need to be converted for the hardware to natively understand them.
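As a sketch of what those formats look like at the API level (the texture dimensions and data pointer here are placeholders, not from the session):

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

/* Hypothetical image data; in practice this comes from your app. */
GLubyte *pixels;            /* width * height * 4 bytes, BGRA order */
GLsizei  width, height;

/* 32-bit BGRA: natively understood, no conversion needed. */
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
             GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);

/* 16-bit 1555 variant. */
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB5_A1, width, height, 0,
             GL_BGRA, GL_UNSIGNED_SHORT_1_5_5_5_REV, pixels);

/* YCbCr 4:2:2 (e.g. video frames) via GL_APPLE_ycbcr_422. */
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, width, height, 0,
             GL_YCBCR_422_APPLE, GL_UNSIGNED_SHORT_8_8_APPLE, pixels);
```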
Okay, so like I said before, you'll want to avoid glFlush, but there are exceptions, and one exception is when you have the GPU asynchronously reading data from your data cache. If you have a texture and the GPU is going to read directly from it, you want to stay asynchronous with the GPU, so you may have to double up your data. For instance, you may have texture one and texture two, and while the CPU is uploading or working on texture one, you'll have the GPU working off of texture two. We're going to go into more details here, but I want to mention it so you'll keep it in mind as we look at some of the diagrams. Okay, so double buffering. Like I said, to stay asynchronous, you want to double buffer your data, and if you double buffer your data it looks like this: you can fundamentally stay asynchronous from the hardware, giving the hardware some data to work on while the CPU is working on other data. It's the standard concept of double buffering that commonly comes up.
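The CPU/GPU ping-pong amounts to a tiny piece of bookkeeping; a minimal sketch (the struct and helper names are mine, not code from the session):

```c
#include <assert.h>

/* Hypothetical helper: tracks which of two buffers the CPU should
   fill this frame while the GPU reads the other one. */
typedef struct {
    int current;   /* buffer index the CPU writes this frame */
} DoubleBuffer;

/* Return the buffer index to fill now, then flip for next frame. */
static int db_swap(DoubleBuffer *db) {
    int writer = db->current;
    db->current ^= 1;          /* ping-pong between 0 and 1 */
    return writer;
}
```

Each frame the CPU fills `buf[db_swap(&db)]` while the GPU consumes the other buffer.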
Okay, so some OpenGL extensions for optimizing your throughput in the texture pipeline. The first one is Apple client storage. Client storage is where you tell OpenGL, "I will allocate a piece of memory and I'll keep it around, and you can just keep a pointer to it." We will retain a pointer to your data; we will not copy the data and retain it locally. That requires that you retain your copy of the texture until you delete it, because we're going to be referencing it. Another extension is Apple texture range, which has a couple of interesting modes you can define. One is cached, which means you want the data cached in video memory. The other is shared, which says, "I don't want it cached in video memory; leave it in AGP space, don't put it in video memory." What texture range actually does is define a region of memory: you put a texture there, we map that region of memory into AGP space and leave it there, so the GPU is able to come and DMA directly from that piece of memory without us having to copy it out of that region into AGP space; we map that memory directly into AGP space.

Okay, so if we look at what these extensions do to the stack: client storage bypasses one copy that the texture may undergo, so it goes from the application to the driver without having to be copied by the framework. That will automatically increase your performance if you happen to be incurring a copy in the framework. We'll go over some sample code; it's pretty easy to enable. All you have to do is make a single call: when you bind to a texture, you make a call to enable client storage, and you just set it to true.
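In code, that's one pixel-store setting (a sketch; the texture name and image pointer are placeholders):

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

GLuint   tex;               /* a texture object you've generated  */
GLubyte *pixels;            /* your copy -- must stay valid until
                               the texture is deleted             */
GLsizei  width, height;

glBindTexture(GL_TEXTURE_2D, tex);

/* GL_APPLE_client_storage: OpenGL keeps a pointer to your data
   instead of copying it into the framework. */
glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);

glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
             GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
```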
Okay, so texture range and rectangle texture. The rectangle texture extension is there to allow some hardware to do a direct DMA of the texture. To define power of two versus non power of two: power of two means a power-of-two width and height, versus rectangle, which means any width and height, not restricted to a power-of-two dimension. Rectangle texture is required by some hardware to do a direct DMA, because some hardware requires a hardware-specific format for power-of-two texture data before it can upload it. Therefore rectangle texture is required when you use texture range to get a direct DMA; you'll need to use those in conjunction, and when you do, you can bypass the driver's copy.

So now we've shown how to bypass two copies independently. Looking at the sample code for texture range: like I said, there's the cached hint, and the cached hint in this case is for storing the data up in video memory; this is for non power of two. If we look at the shared hint, here's how you set it: it's the same, except that instead of a cached hint you have a shared hint. And for those of you thinking I'm going too fast, there's sample code that will be up on the website; it has all these features, so don't get too worried about writing these things down, you can reference the sample code.

Okay, so if we use those two in conjunction, we end up bypassing all the copies: we go straight from your copy of the texture directly into video memory, and therefore the GPU is directly DMAing from your memory. The GPU and the application are talking directly to one another; OpenGL has fundamentally been moved out of the way. OpenGL did the setup, but the transfer happens between your application and the GPU.
Okay, so looking at putting this all together, here's a little piece of code that does all these things together for non power of two. The first thing we do is bind to a rectangle texture. Then we set up the cached hint, and that works in conjunction with the client storage enable, which is next, on the fourth line; between those two, that sets up for a direct DMA. Then when we call glTexSubImage or glTexImage2D with a rectangle texture target, that sets up the GPU for a direct DMA of your texture, directly from your memory. So it's pretty simple to do, but this particular setup does require that you use rectangle textures, and you'll want to read the rectangle texture specification, because it does have some restrictions on functionality. It's not quite the same as a power-of-two texture, which allows for mipmaps, allows different clamping modes and such. So if you want to use it, read the extension and see if rectangle texture suits your needs.
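Pieced together from the description above, the non-power-of-two fast path might look roughly like this (texture name, dimensions, and data pointer are placeholders):

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

GLuint   tex;
GLubyte *pixels;            /* your copy; keep it valid            */
GLsizei  width, height;     /* any size -- rectangle textures are
                               not restricted to powers of two     */

glBindTexture(GL_TEXTURE_RECTANGLE_EXT, tex);

/* Map the texture's memory via GL_APPLE_texture_range; the cached
   hint asks for it to be cached in VRAM. */
glTextureRangeAPPLE(GL_TEXTURE_RECTANGLE_EXT,
                    width * height * 4, pixels);
glTexParameteri(GL_TEXTURE_RECTANGLE_EXT,
                GL_TEXTURE_STORAGE_HINT_APPLE, GL_STORAGE_CACHED_APPLE);

/* Don't copy -- reference the app's memory directly. */
glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);

/* With a native format, this sets up a direct DMA from your memory. */
glTexImage2D(GL_TEXTURE_RECTANGLE_EXT, 0, GL_RGBA, width, height, 0,
             GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
```

Swapping GL_TEXTURE_RECTANGLE_EXT for GL_TEXTURE_2D gives the power-of-two variant, at the cost of one copy.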
Okay, so for power of two it's slightly different, but not much. All I did in this piece of code, different from the previous one, is change rectangle texture to texture 2D. Texture 2D then allows me to use a power-of-two texture and get some of the additional functionality that power-of-two textures bring, but it won't give you a direct DMA; it incurs one copy, so it's not going to be quite as fast. Typically we see that as okay, because rectangle textures are usually fine for things like video, while games like Quake 3 are going to use power of two, and they're going to load their textures at the beginning of the game or a level, so they don't need to do real-time texture loading as much. Rectangle textures are very powerful for playing video, for pushing through images that you want to get to the screen fast, which typically are non power of two.
Okay, so let's switch to the demo machine; I'm going to show you a demo of that. The first thing we're going to do is just look at this and explain what the demo is. It's hard to see, but it has numbers in the middle of that image; the numbers go from one to five. Even though it's a small window, it is uploading a 1024 x 1024 32-bit image across the AGP bus and blending it on the screen every frame. You can see that we're getting about 650 megabytes a second. You also see that we've got a couple of sliders: I can switch from single buffered all the way up to five buffers, to test the effect that may have on the parallelization of the hardware and the CPU. You can also see a frame rate slider and a number of checkboxes to turn off the different extensions. The frame rate slider goes all the way to a thousand; we actually had to add that to test the G5, but we're on a G4, so we're going to keep it in the middle. The interesting note here, like I said, is that the idea is to eliminate CPU copies. Well, this actually eliminates all the CPU copies, and you can see that the CPU monitor is showing very little activity. The CPU isn't doing much other than running the event loop and drawing a little quad on the screen, so the CPU has been effectively removed from the bulk of the work of this demo. Now, if we turn off all the extensions, let's just start turning these things off, you can see that our performance drops from 650 megabytes a second down to 111 megabytes a second, and you can see we have now effectively saturated one CPU. Since we're a single-threaded app, we've taken one CPU, and we are basically memory bound, with the CPU copying the data to get it into a format that can be uploaded by the GPU. So with those extensions, and I'll turn them back on, not only do you get higher bandwidth, but you save CPU work, because you're not using the CPU, and you're getting higher throughput. So for people who are able to use some of these extensions, you can get quite a benefit. Okay, back to the slides, please. Okay, let's jump into
optimizing vertex throughput. Optimizing vertex throughput is actually very parallel to optimizing texture throughput, and you'll see as I go through the slides that it parallels the same concepts; in fact we did that on purpose, so that the concepts are analogous. Okay, so we're going to look at an overview of the vertex pipeline, go through some of the basic optimizations that can be done and some of the APIs that can help you, and break this into a few categories: static data, dynamic data, and display lists. Those are three separate categories we're going to touch on, with slightly different techniques for each. Okay, again, the same pipeline; this time we're going to focus in on display lists and the vertex path.
And let's get into the basics. For minimizing CPU copies with vertex data, just like pixel data, you're going to want it in a format the hardware understands. A safe data type is GLfloat: if you keep all your data in GLfloats, you're pretty safe, since all the hardware knows how to read GLfloats. If you start using doubles, or bytes for vertex data, or some combination that's a little off the normal path, the driver may say, "I can't directly upload this to the GPU," and it may have to do a CPU copy and a conversion, maybe a slow conversion, and you may find your performance dismal. So stick with GLfloats; that's guaranteed to be one of the faster paths. Use vertex arrays: like I said before, stay away from immediate mode drawing, because it incurs per-function-call overhead and some other overhead I'm going to go into in a minute, so use the standard GL vertex array function calls. And maximize your vertices per draw command: I'll show some performance charts in a little bit demonstrating the benefit of maximizing the number of vertices you pass OpenGL at one time. For instance, instead of drawing one quad at a time, if you draw 100 quads at a time you'll get dramatically better performance, because you're lowering per-function-call overhead and you're lowering the work the driver has to do on a per-primitive basis. Cache your static data in VRAM; we've already said that. And use vertex programs to offload your CPU work: I'm going to show a demo in a bit of how you can do actual work with vertex programs and free up CPU cycles, not just in the data transfer aspect, but in the effects that you may want to do in your application.
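As a sketch of those first two points, GLfloat data submitted through the standard vertex array calls rather than immediate mode (the array contents and count here are placeholders):

```c
#include <OpenGL/gl.h>

/* Many vertices per draw command: one glDrawArrays call instead of
   a glBegin/glEnd pair per primitive. */
#define NUM_VERTS 3000
static GLfloat verts[NUM_VERTS * 3];   /* x,y,z -- GLfloat is a safe,
                                          hardware-friendly type    */

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, verts);
glDrawArrays(GL_TRIANGLES, 0, NUM_VERTS);
```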
Okay, and again, the same thing: double buffering your data, analogous to the textures, and we have the same double-buffered data diagram. If you have the GPU reading data directly out of your application's memory, you're going to need some isolation between the asynchronous behavior of the GPU and your application, so you're going to want to double buffer your data: you have a buffer for the CPU to work on while the GPU is working on the other buffer, and it ping-pongs back and forth. When you do that, you keep the asynchronous behavior, and you can get some significant performance improvements; we'll show some of that in a demo as well. You'll notice the glFlush there: what you want to do, when the CPU is done with some work, is get that data in flight to the graphics card. As soon as the CPU is done, issue a glFlush and send it on its way to the GPU, and hopefully you've done a substantial amount of work, so that you're not calling flush too often, because that will hurt you.
Okay, again, very much like the texture pipeline: when vertex data comes through the pipeline, it can go through multiple copies, depending on what APIs you're using and how the data is formatted. If the data is going through immediate mode, OpenGL is required to capture the current vertex state: we retain a golden image of basically the current vertex state when you're running immediate mode. If you're running vertex arrays, we don't have to do that. So the first copy is into local storage of a single vertex instance of the current vertex state; we incur that copy if you're going through immediate mode. Then we have to copy it into a format for the hardware upload: we copy it somewhere into AGP space for the hardware to DMA up, and then eventually it makes it to the GPU. If you use vertex arrays, you immediately eliminate that one copy, and that one's easy to do; immediate mode is an easy one to work around, no extensions needed, just use the right API.
Okay, so let's talk a little bit about dynamic data. Analogous to the texture range extension we previously talked about, we have a vertex array range extension, and it is exactly parallel: it has the same storage hints, where you have a shared hint for leaving the data in AGP space, and that's what you're going to want for dynamic data. You're not going to want to cache it in video memory; you're going to want to leave it in AGP space. What happens there is that you've allocated an array of memory in your application; we come along, we map that into AGP space, we wire it down, and then the application can come along and poke values into it, tell the hardware, "I'm done with that," issue a draw command, and it will be DMAed up. Therefore the GPU is reading directly from your arrays, and the data never makes it into video memory. For dynamic data, that's obviously the behavior you want: you don't want it cached in video memory, because you're going to change it again the very next frame. Okay, so what does it look like if
we use that extension? We have vertex arrays, and we use the vertex array range, and just like texture range, we bypass all the copies in the driver, and we are DMAing directly from your copy of the application's arrays. So we can get very high throughput doing that, with very little CPU work going on. Okay, so looking at a little bit of sample code for that: the first two calls are just a standard vertex pointer setup, standard OpenGL for setting up a vertex array. The next two calls are setting up a vertex array range: you pass in a size and a pointer, and you tell us what memory to map in. So you're just going to give us a pointer with a size, and we're going to map that memory in. And then there's the last call, which is a flush. That's an important call, because every time you change that data, you have to tell us you changed it, and what we'll do with that is potentially flush hardware or GPU caches, or we may DMA it to some other location. What's important is that you have to tell us the areas that you've changed. So every frame that you come along and write two more vertices and change that data, you have to tell us the pointer and the size, the offset and size from that pointer, that you want us to flush because you've changed it, and we will then know that it has changed and update the hardware.
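A sketch of that sequence using the GL_APPLE_vertex_array_range calls (buffer size and contents are placeholders):

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

#define NUM_VERTS 10000
static GLfloat verts[NUM_VERTS * 3];   /* app-owned, wired into AGP */

/* Standard vertex array setup. */
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, verts);

/* Shared hint: leave dynamic data in AGP space, don't cache in VRAM. */
glVertexArrayParameteriAPPLE(GL_VERTEX_ARRAY_STORAGE_HINT_APPLE,
                             GL_STORAGE_SHARED_APPLE);
glVertexArrayRangeAPPLE(sizeof(verts), verts);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_APPLE);

/* ...each frame, after modifying the array, tell GL what changed: */
glFlushVertexArrayRangeAPPLE(sizeof(verts), verts);
glDrawArrays(GL_TRIANGLES, 0, NUM_VERTS);
```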
Okay, so static vertex data: very similar. You can use the vertex array range, but instead of using the shared hint, you use the cached hint, and what will happen is that when you define the vertex array, if it has a cached hint and you say flush, then we'll know that you've changed it, and we will DMA a copy into video memory and keep it there. Every time you call flush, we will have to re-DMA that data back up into video memory, but it will be cached in video memory, and if you're going to draw from it multiple times, that's quite a benefit, because you're not having to re-read that data across the AGP bus every time; instead you're local to the video card's bus, which is a very high speed bus. And like I said previously, you can use display lists with begin/end. The one caveat for using display lists is that we do have an optimizer that goes back through the data, parses it, and reconfigures it into an optimal format. You can fool that optimizer, and what you want to avoid is using inconsistent vertex data. What I mean by that is: if you go through glBegin and you say glColor, glVertex, glColor, glVertex, that's consistent. Inconsistent would be glBegin, glColor, glVertex, glTexCoord, glVertex: you gave a color for the first vertex but not the following one. You may fool the optimizer into not being able to handle that, so if you want to play it safe, keep your data in a consistent format, and the optimizer will definitely be able to take that data, pack it into a format that we can cache in video memory, and you can get optimal performance. One last caveat for display lists is that there's a minimum threshold for which it's worthwhile for us to work on the data, and that threshold is 16 vertices: if you have fewer than 16 vertices, we won't even consider optimizing it. We found that out by testing different machines, finding out where the threshold was, and deciding that below it you don't get a performance benefit, and in fact it may actually slow you down because of the overhead of doing work on the data; 16 was the minimum. Okay, so what does
that diagram look like, then? When using static data, with either display lists or the cached vertex array range, the data gets DMAed into video memory, and then the GPU draws from that: it's going to be taking the data from the video memory cache and drawing. So you get very high throughput for data you draw more than once. In the sample code for static data, again, we're setting up a standard vertex array; we set up the hint, and this time the hint for the vertex array range is cached, not shared like it was for dynamic data. Like before, we set up the vertex array range pointer and size, and then again we call the flush, and this time the flush is going to cause us to re-upload that data. If it's not there already, we'll upload it; if you have touched the data again and it was already uploaded, we'll refresh it with another copy. It's like a texture sub-image call, where we're going to refresh the data in video memory.
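The static variant differs from the dynamic sketch only in the hint (again, names and sizes are placeholders):

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

#define NUM_VERTS 10000
static GLfloat verts[NUM_VERTS * 3];   /* static geometry */

glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, verts);

/* Cached hint: DMA a copy into video memory and draw from there. */
glVertexArrayParameteriAPPLE(GL_VERTEX_ARRAY_STORAGE_HINT_APPLE,
                             GL_STORAGE_CACHED_APPLE);
glVertexArrayRangeAPPLE(sizeof(verts), verts);
glEnableClientState(GL_VERTEX_ARRAY_RANGE_APPLE);

/* The flush (re)uploads the range to VRAM, like a texture sub-image. */
glFlushVertexArrayRangeAPPLE(sizeof(verts), verts);
```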
Okay, so for basic review, what do display lists look like? It's pretty simple: you just call glNewList, do your drawing, and then call glEndList. You can pack anything you want in there; it takes any OpenGL calls. If you put your geometry in between a glNewList and glEndList, hopefully we'll be able to optimize it and get it cached in video memory.
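A minimal sketch of that (the list contents are a placeholder, using consistent per-vertex data so the optimizer can repack it):

```c
#include <OpenGL/gl.h>

GLuint list = glGenLists(1);

/* Record once; display lists are immutable, so the post-processor
   can optimize them and cache the geometry in VRAM. */
glNewList(list, GL_COMPILE);
glBegin(GL_TRIANGLES);
glColor3f(1.0f, 0.0f, 0.0f); glVertex3f(0.0f, 0.0f, 0.0f);
glColor3f(0.0f, 1.0f, 0.0f); glVertex3f(1.0f, 0.0f, 0.0f);
glColor3f(0.0f, 0.0f, 1.0f); glVertex3f(0.0f, 1.0f, 0.0f);
glEnd();
glEndList();

/* Draw it as many times as you like. */
glCallList(list);
```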
Okay, so looking at what this could do for your performance, this is a chart of low-vertex-count performance. On the x axis we have the number of vertices per draw command, and on the y axis we have millions of triangles per second. As you can see, the orange is immediate mode, and it tops out pretty quickly in terms of what benefit you can get by going down that path. The red is vertex arrays; vertex arrays have a little lower per-function-call overhead and give you a little more performance. But if you look at the blue, the blue is vertex array range: vertex array range has great potential for performance, but it doesn't give you a whole lot until you start giving OpenGL a lot of work to do at one time. That's the key: the key is giving OpenGL lots of work at one time. And then the green, the top one, is display lists; on this chart it goes up to about 12 million triangles a second when issuing 30 triangles per draw command. Now, this is the high-vertex-count performance, picking up where the other chart left off, and you can see that some of these continue to grow quite a bit. Vertex arrays and immediate mode stay flat. Vertex array range basically grows until you're limited; in this case, the test I was running was AGP bus limited, at about 640 megabytes a second of vertex data that I could transmit across the bus. I pretty much bottlenecked the AGP bus, and that's all the data I could get across. But in the display list case I was testing here, the data was effectively static: it only went across the bus once, so the GPU was able to utilize its internal bus bandwidth, which on the card I was running is about 20 gigabytes a second. So we can transfer a whole lot more data, and at the top number I quoted here, it was about 2.8 gigabytes a second of geometry going into the GPU. That's quite a bit of data, almost 90 million triangles a second. So let's do a demo and show a little bit of this. Okay, so what we've
got here is anybody that's been to my
session before it's the same old thing
the next iteration of improvement. So initially we're drawing with quads, and we're doing the standard basic glBegin. Not too impressive: we're getting about 800,000 triangles a second. What I'm going to do is step through the different optimizations as I move the slider up, using different extensions, and we'll see what effect that has on the performance. Down at the bottom you see the color coding; the color coding represents where time is being spent. Red is system time, time spent outside the application; green is time spent calculating the wave; and blue is time spent in OpenGL. You can see right now that I'm spending a lot of time in OpenGL and a lot of time calculating the wave. So if we start moving up the level of optimizations: I went to quad strips, and that got us quite a bit of speed improvement, about twenty-five percent, which was pretty worthwhile. But let's not stop there, let's keep going up. If we go to vertex arrays, a little bit more, but that wasn't a great improvement. Then we go to vertex array range. OK, so here's where it gets interesting. Now you can see that the time spent in OpenGL, which was the blue basically filling the top of that bar, went to almost nothing. So now the time spent in OpenGL is very little, and we're basically saturated on the calculation of the wave; we are not able to calculate the wave fast enough to get the data to OpenGL. So if we move up one more notch, we see what AltiVec can do for us: we also vectorize the wave calculation, because that was my bottleneck. Once I optimized OpenGL, OpenGL was no longer the bottleneck, the CPU was, so I optimized that. And then we do one last thing. Like I said before, you may want to offload calculations onto the GPU, so what if we write a vertex program to do that wave? Now again, the interesting thing to watch is that we are calculating a wave motion, and we are sending almost 12 million triangles a second to the screen, and look at the CPU: the CPU is doing nothing, right? So not only have we optimized it, we've unloaded the CPU from doing any work. The CPU again is just running an event loop; the CPU doesn't even know that this complex wave is being calculated. And if we actually look at the density of this, it's a really dense wave; there are a lot of triangles there. OK, back to the slides, please.
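To make the first couple of steps in that progression concrete, here's a minimal sketch of the same geometry submitted in immediate mode versus with a vertex array. This is illustrative code, not the demo's actual source; the function names and data layout are assumptions.

```c
#include <OpenGL/gl.h>

/* Immediate mode: a function call per vertex.  This is the
   ~800,000 triangles/sec starting point in the demo. */
void drawQuadImmediate(const GLfloat v[4][3])
{
    glBegin(GL_QUADS);
    for (int i = 0; i < 4; i++)
        glVertex3fv(v[i]);
    glEnd();
}

/* Vertex arrays: the whole mesh goes down in one call, the first
   big step up the demo's optimization ladder (vertex array range
   and a vertex program are the later steps). */
void drawMeshWithArrays(const GLfloat *verts, GLsizei quadCount)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);
    glDrawArrays(GL_QUADS, 0, quadCount * 4);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```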
OK, let's go into a new subject: optimizing for one-shot images. One-shot images, again, are images you may have that you want to get to the screen fast and then discard; you're not going to blit them to the screen multiple times, it's just one shot. So one possible way of doing that is glDrawPixels. glDrawPixels is fairly effective in some cases; it's best for small images. If you have little widgets you want to draw somewhere, glDrawPixels is probably the fastest way to get the data there; it's a very optimized path, very quick. For images larger than about 128 x 128, you probably want to start considering doing some kind of texturing, like our previous demo showed, where you don't have to make a copy. glDrawPixels is going to make a copy of the data, so the larger the image gets, the more data there is to copy, and your benefit from glDrawPixels goes down because of that copy. OK, so the trick for one-shot images using glDrawPixels is to get your state right so that you go down the optimized path. There are three different paths in OpenGL for how to draw these things, and you want to hit the one that's fast. So the first thing you need to do is get your state right, and listed here are a number of things you need to have disabled before you will go down the fast path. Again, don't worry about writing them down; we'll have a demo posted that you can look at.
OK, so a little bit more code. glDrawPixels is very basic, right? You disable some options and you call glDrawPixels. You feed it the right pixel format and type, like we talked about before, a format that the hardware natively understands, you give it the image, and off it goes.
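That setup can be sketched in C. The exact disable list here is an assumption reconstructed from the talk (the posted demo has the definitive set), and the BGRA format choice follows the earlier discussion of hardware-native formats:

```c
#include <OpenGL/gl.h>

/* Sketch of the glDrawPixels fast path: anything that needs the
   3D pipeline keeps you off it, so disable that state first. */
void drawOneShotImage(GLsizei w, GLsizei h, const GLubyte *pixels)
{
    glDisable(GL_ALPHA_TEST);
    glDisable(GL_BLEND);
    glDisable(GL_DEPTH_TEST);
    glDisable(GL_DITHER);
    glDisable(GL_FOG);
    glDisable(GL_LIGHTING);
    glDisable(GL_STENCIL_TEST);
    glDisable(GL_TEXTURE_2D);
    glPixelZoom(1.0f, 1.0f);

    glRasterPos2i(0, 0);
    /* BGRA with reversed 8:8:8:8 packing is a format the hardware
       understands natively, as discussed earlier */
    glDrawPixels(w, h, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, pixels);
}
```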
OK, we're going to do another demo, please. OK, so the first thing I'm going to do is a little bit strange: I've got an infinite button. It doesn't really go infinite; it sits in a for loop and beats on this really hard, because this goes so fast that just running through an event loop is too slow. So it sits in a for loop and bangs on it really hard. And I've reduced the image size to 2 x 2; most of you can't see that, but the key point here is how fast we can really get through the stack to OpenGL, and we can get 660,000 of these little images up to the screen. So you can get a lot of little things up on the screen, and that's one of the things to remember, because other paths through the system may have more per-function-call overhead and limit you, not because of the pixel data, but because of what you have to do to get through OpenGL. So that's the benefit of glDrawPixels: it has low per-function-call overhead and can get lots of little small things up to the screen.
So if I start increasing the size of these (sorry, I want to go a little smaller than that), here's a 75 x 75 image. You can see the throughput is about 400 megabytes per second. Believe it or not, I'm already memory bottlenecked here; I've basically saturated my memory, and it's no longer function-call overhead that's stopping me, it's memory bandwidth. Obviously with the G5 these numbers all change, because that bus goes much faster, and that's actually another trick: tuning for the different systems can be a delicate job. So if we start increasing the size of this, we quickly run into some rather slower frame rates. We're still at 400 megabytes a second: we've bottlenecked the memory bus, we're just flatlined now, and as I increase the number of pixels I will proportionally decrease the frame rate, because I'm limited to 400 megabytes a second. That's all I get through the system; that's my limiting factor; as I increase the image, I go slower. And that's why, when you get to larger images, it's better to relieve the memory bus of that work and go down the texture path. But for small images, glDrawPixels is great. OK, back to the slides, please.
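The proportionality from that demo is easy to check with back-of-the-envelope arithmetic. The 400 MB/s figure is the demo's; the helper function and the image sizes below are illustrative:

```c
/* Frames per second achievable when glDrawPixels is purely
   memory-bandwidth limited, assuming 32-bit (4-byte) pixels. */
double drawpixels_fps(int width, int height, double bytes_per_sec)
{
    return bytes_per_sec / ((double)width * height * 4.0);
}
```

At 400 MB/s a 75 x 75 image is only 22,500 bytes, so bandwidth alone would allow roughly 17,800 blits a second, and function-call overhead dominates instead; a full 1024 x 768 frame (about 3 MB) drops that to roughly 127 frames a second, and doubling the pixel count halves the rate, exactly the flat-line behavior in the demo.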
OK, optimizing pixel-copy operations. There are a lot of cases where you want to draw something, save it off, and then be able to grab that saved copy and blit it back, maybe into your back buffer, to use for some part of a scene; you want to pre-render it and save it. One of the things you can use to do that is glCopyPixels. glCopyPixels will let you do a VRAM-to-VRAM copy. It's like glDrawPixels in that you have to set up your state correctly. One place you can store the data is an auxiliary buffer: on OS X you can create auxiliary buffers, and an auxiliary buffer is just another back buffer. So you have your main back buffer, and you can create another one off to the side and use it as a temporary holding area for copying data into. You can either draw to it directly, or you can copy data between your back buffer and this auxiliary buffer. One additional extension we have that allows some more flexibility is the aux depth stencil extension, which will not only create the aux buffer but also create the depth and stencil buffers associated with it. So you can have two depth buffers and two stencil buffers, and therefore you can copy not only color data between these aux buffers but depth and stencil data too, and use it as a temporary holding area for fast refresh of some pixels. There are a number of techniques people use for interacting with very complex geometry where that becomes important. So, like glDrawPixels, there is certain state you'll need to have right to make it go fast; it's very similar to the glDrawPixels state. Basically, you don't want to be trying to dither or alpha test or blend or things like that. What it comes down to is that you don't want to do anything that can't be done by the 2D engine on the GPU, because this is a 2D operation, and we need to stay within the feature set of the 2D pipeline on the graphics card. So you need to disable all the operations that require the 3D pipeline; the 3D pipeline is not going to be optimal here, because what's optimal is just a memory copy through the 2D pipe. So there are a number of states you want to disable, and you can look at the glDrawPixels example for that state; it's very similar. OK, so looking at some sample code, very basic: we have the standard disables to get the right state so you go down the fast path, and then when you go to draw, you set your read buffer and your draw buffer, the source and destination. The source and destination can be any of the buffers you have allocated, whether it's the back buffer or aux buffer 1 or 2, what have you, and you can copy between those two. Then you issue glCopyPixels, and the transfer will be a VRAM-to-VRAM transfer.
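Here's a hedged sketch of that flow, saving the back buffer into the first aux buffer and restoring it later. The abbreviated disable list and the function names are illustrative assumptions:

```c
#include <OpenGL/gl.h>

/* Save the back buffer into aux buffer 0 with a VRAM-to-VRAM copy.
   Disable list abbreviated: nothing the 2D engine can't do. */
void saveBackBuffer(GLsizei w, GLsizei h)
{
    glDisable(GL_BLEND);
    glDisable(GL_DEPTH_TEST);
    glDisable(GL_DITHER);

    glReadBuffer(GL_BACK);   /* source */
    glDrawBuffer(GL_AUX0);   /* destination: first aux buffer */
    glRasterPos2i(0, 0);
    glCopyPixels(0, 0, w, h, GL_COLOR);
}

/* Restore the saved pixels back into the back buffer. */
void restoreBackBuffer(GLsizei w, GLsizei h)
{
    glReadBuffer(GL_AUX0);
    glDrawBuffer(GL_BACK);
    glRasterPos2i(0, 0);
    glCopyPixels(0, 0, w, h, GL_COLOR);
}
```

With the aux depth stencil extension, the same pattern applies to depth and stencil data by passing GL_DEPTH or GL_STENCIL instead of GL_COLOR.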
OK, so let's jump into using threads with OpenGL. We'll go over the rules for using threads with OpenGL, then we'll talk about some possible ways to divide up your work onto multiple threads, and then we'll talk about what data you can share between those different threads and how to synchronize them. So, rules for threading. What you can't do is per-context re-entrancy. If you have one OpenGL context and two threads, only one of those threads can be in OpenGL referencing that context at a time. If both threads are in OpenGL with that same context, you're going to cause corruption in your OpenGL state, and all kinds of bad things can happen. What you can do is share context state across threads, and you can share surfaces across contexts, and I'll show some diagrams of how that can be put together to help you with threaded applications.
OK, so division of work. One possible division of work is to move OpenGL wholesale onto a separate thread. This is what Quake 3 does: Quake 3 moves OpenGL onto one thread and has a bunch of other CPU work, the game logic and other work, on the other thread. It's a reasonable division of work that's easy to manage. Other, more complex ways to divide your work are to potentially split your texture work from your geometry work. You may have texture data getting spooled off of a disk by one thread that's also doing other work for the application, and then another thread comes along and uses geometry that utilizes those textures for drawing. Another possible way to divide the work is to split your output, your surface. When I say surface, I mean the OpenGL back buffer; the surface is basically a piece of memory in video memory that you're drawing to. So you can split the processing of a surface: you can have your CPU work divided amongst regions of the surface, and it might be beneficial to split those onto separate threads and leverage both CPUs to get that work done.
OK, so sharing data between contexts: what gets shared? When two contexts are sharing state, the things that get shared are display lists, textures, vertex array objects, and vertex and fragment programs. Those are the things that get shared, and they're really all objects, things that usually have a bind and some name associated with them; those are the shared items in OpenGL. There's lots of other state in an OpenGL context that does not get shared; that state is per-context even if you set these contexts up to share. You can also, as I briefly touched on before, share a surface between contexts, so you can have two contexts, or multiple contexts, drawing to the same surface, and that's another way to share data.
OK, so let's look at some diagrams of how to divide up your work and move it onto different threads. Here's the first example, just moving OpenGL onto a separate thread. Very basic: you have one thread doing work for the application and another thread driving OpenGL. Thread one is generating data that is used as input for thread two to draw into the OpenGL context, which goes to the surface, which gets swapped to the frame buffer. OK, you can also split your texture and vertex processing onto different threads with different contexts. You can have two threads and two OpenGL contexts; they're sharing some of the same state, right, they have shared state, and they're attached to the same surface. What you can do is have asynchronous processing between the two, where one thread is spooling data from disk, say decompressing JPEGs or decompressing a movie, what have you, and feeding those textures into the OpenGL state machine, and then the other thread comes along, references those textures, and draws. So that's a way to split your workload if you have spooling or some kind of imaging work you want to offload. Another possibility, then, is to use our new API for pbuffers, where you're not reading the data but using geometry to generate a texture: you're using some geometry to draw into a pbuffer, which is in video memory, which then is used as a texture, which is then referenced by thread one's context, which then is drawn, which then goes to the surface and to the frame buffer. So, just to be clear, the only difference between these two examples is that in this example we're dynamically creating the texture by drawing to it; in the previous example we were loading a texture through the OpenGL API.
OK, so we can also split the OpenGL processing of a surface. We can take the surface, split it across some line, and use one OpenGL context to render one part of it and another OpenGL context to render the other. Where this might be beneficial is if you're CPU bound and your CPU work can be regionally divided along some portion of the screen real estate; if you have a lot of work to do, geometric calculations or what have you, you can split that across two CPUs and divide your work across regions of the surface. One way to do that is to just create two threads and two OpenGL contexts, not have the state shared, and have them draw open loop to the surface, but to different regions. The way you separate what regions they draw to is with the scissor and the viewport: you just set the scissor and the viewport to the region you want to control, and with the scissor rect the pixels will not come outside of it. You can set it to one half for one context and the other half for the other context, and allow the drawing to be open loop to that surface. Or, if your application wanted to, you could also share state; there's no reason you couldn't be sharing state. It's the same basic concept, just that they're sharing, possibly, geometry programs and textures to do their work of drawing into the surface.
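A sketch of the scissor-plus-viewport region setup just described. selectSurfaceHalf is a hypothetical helper, and the left/right split is just one way to carve up the surface:

```c
#include <OpenGL/gl.h>

/* Each thread's context claims one half of the shared surface:
   half = 0 for left, half = 1 for right. */
void selectSurfaceHalf(int half, GLsizei surfaceW, GLsizei surfaceH)
{
    GLint   x = half ? (GLint)(surfaceW / 2) : 0;
    GLsizei w = surfaceW / 2;

    /* The scissor rect keeps pixels from landing outside the region;
       the viewport maps this context's drawing into the same region. */
    glEnable(GL_SCISSOR_TEST);
    glScissor(x, 0, w, surfaceH);
    glViewport(x, 0, w, surfaceH);
}
```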
OK, so how do we set up OpenGL shared contexts? Here's a little bit of sample code showing how to create a context and attach another context to it as a shared context. This example assumes you've already created an OpenGL view through, say, AppKit, and now you come to your initWithFrame. All you're going to do is create an NSOpenGLContext: you can see we're creating an NSOpenGLContext, and then we come along and do an initWithFormat, taking the format from self. We're inside a view that already has a context, so we just take the pixel format out of that context, use it as the pixel format to create a new context, and provide the current view's OpenGL context as the share context. So when we create the new context, we hand it the pixel format and the already created context to share state against, and that's all you need to do to make sure the two are connected. The key line here is the self openGLContext on the third line, which hands the new context the previously created context for sharing. The next two lines are fairly standard OpenGL concepts: making the context current and attaching it to the view. One small deviation on that, then: if we wanted two contexts that talk to a surface but don't share state, instead of passing the already created context into the newly created context, we just pass nil, but we attach it to the same view. So we're creating two independent contexts and attaching them to the same view; that will allow them to talk to the same surface, but they're independent as far as state.
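The AppKit code described above has a C-level equivalent in the CGL API. This is a hedged sketch, assuming you already have a CGLPixelFormatObj; error handling is omitted:

```c
#include <OpenGL/OpenGL.h>

/* Create a second context that shares object state (textures,
   display lists, programs, vertex array objects) with an existing
   one.  Passing NULL instead of shareCtx gives an independent
   context, matching the nil case described above. */
CGLContextObj createSharedContext(CGLPixelFormatObj pix,
                                  CGLContextObj shareCtx)
{
    CGLContextObj ctx = NULL;
    CGLCreateContext(pix, shareCtx, &ctx);  /* second arg is the share context */
    return ctx;
}
```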
OK, thread synchronization. The main tools you'll have for thread synchronization, obviously, are the OS tools that are provided, the OS APIs, so you'll have NSThread and NSLock; those are what you'll mostly need to leverage. There is one interesting API you'll want to be familiar with, which is the APPLE fence extension. A fence is a way to insert tokens into the OpenGL command stream and then test when they're done. So I can set a fence, and I can test when that token I've inserted into the command stream has gone through the GPU, made the round trip, and completed. So there are ways to test when portions of your drawing are done, and that's another way to potentially synchronize events with your OpenGL commands. We'll look at a little bit of sample code for how to do that. OK, so there are two basic ways to do it. You can do it with set fence, which you can see in the first piece of sample code: I'm setting a fence and giving it a name, so I can give it any name I want, set that token into the command stream, do some work, and later test to see if it's done. For that you do a glFinishFenceAPPLE, and that call will block until the token is completed. There's another, simpler API: if you want to block against a texture upload being completed, or a draw against a texture, or a draw against a vertex array object, you can just test for that object. What you do is call glFinishObjectAPPLE, passing the type of target you want to check against, so you might have GL_TEXTURE or GL_VERTEX_ARRAY as the type, and then the ID number of that texture or vertex array object; that's the ID number you used to create or bind the texture or vertex array object.
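Both forms just described can be sketched as follows. The function names are from the GL_APPLE_fence extension; the surrounding flow and the texture name are illustrative assumptions:

```c
#include <OpenGL/gl.h>
#include <OpenGL/glext.h>

void fenceExample(void)
{
    GLuint fence;
    glGenFencesAPPLE(1, &fence);

    /* ... issue some drawing commands ... */
    glSetFenceAPPLE(fence);      /* drop a token into the command stream */

    /* ... do other CPU work in the meantime ... */

    glFinishFenceAPPLE(fence);   /* blocks until the token round-trips the GPU */

    /* Simpler form: block until the GPU is finished with a specific
       object, e.g. an upload of (or draw using) this texture name. */
    GLuint tex = 1;              /* assumed texture ID */
    glFinishObjectAPPLE(GL_TEXTURE, tex);

    glDeleteFencesAPPLE(1, &fence);
}
```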
OK, let's do a little bit of a demo again. This is the same demo we did before, but what we didn't show before is the multithreaded button at the top, and I want to talk a little bit about that. If we enable multithreading, we can see that we went from 800,000 triangles a second to 1.5 million triangles a second. So we got pretty good parallelization, right? We almost got a 2x speed improvement just by multithreading; that was pretty worthwhile, and you can see on the CPU monitor that we're working two CPUs pretty hard. Now, what's interesting is that if I increase the optimization level of OpenGL, the performance isn't doing a whole lot, right? Well, the problem is that the workload for calculating the wave is the bottleneck; the OpenGL drawing is not. So we're not going to gain by improving OpenGL, because the wave calculation is the bottleneck. But once we go with vertex array range and AltiVec, we again remove the bottleneck of the wave calculation, and now multithreading is paying off: we're getting 10.5 million triangles a second as opposed to the 8.5. It's not 2x because, as you can see, even with AltiVec the wave calculation, represented by the green, is still significantly more expensive than the OpenGL drawing, but we do get some benefit by moving the system and OpenGL drawing off to another thread. OK, back to the slides, please.
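The scaling in that demo is classic Amdahl's-law behavior; here's a quick sanity check. The helper and the fractions below are illustrative assumptions, not measurements from the demo:

```c
/* Overall speedup on two threads when a fraction p of the frame's
   work can overlap across them and the rest stays serial. */
double two_thread_speedup(double p)
{
    return 1.0 / ((1.0 - p) + p / 2.0);
}
```

When nearly all of the frame can overlap (p close to 1), you approach the near-2x seen going from 800K to 1.5M triangles; once the wave calculation dominates one thread, the effective p is much smaller and you get the more modest 10.5 versus 8.5 ratio.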
OK, let's go into an important subject that we want to spend quite a bit of time on: the OpenGL Profiler. It's a tool that comes on the developer CD. It's a very powerful tool; it has a lot of features and will take a little bit of learning to use effectively. What you can use it for is optimizing, debugging, and experimenting with your OpenGL application. There are a lot of different features in it, so "profiler" is a bit of a restrictive name: it does a lot more than just profile. Let's go through some of the screens and have a brief overview of what the Profiler can do for you. First, when you open the Profiler, like some of the other tools in the system, you can have it launch your application for you, so it will launch and basically attach to your application; or you can attach to a running application, so if you already have an application running, you can just attach to it and start utilizing the services of the Profiler, simply by attaching to an already running application.
One of the services it provides is function statistics: it will monitor the in and out times of all the OpenGL functions and give you counts, percentage times, and overall time spent in each function. This way you can quickly get an idea of which functions you're spending time in within OpenGL and how expensive those are for you. You can capture call traces: you simply enable call-trace capture and it'll record all the OpenGL commands and their arguments, so you can scroll through and look at what your application is feeding OpenGL and get an idea of the call sequence. It will capture textures and vertex and pixel programs: you can run your program and it will capture all the textures you've passed in, along with the pixel programs and vertex programs, and you can look at them and make sure you've got the textures you think you have loaded under the right names, or what have you. You can set breakpoints: you can go to an OpenGL function and say "I want to break here," and at that breakpoint it will give you the application call stack, so you can see what your call stack was at that point. It'll also give you a complete listing of the OpenGL state, so you can sit there and comb through the OpenGL state and make sure that at that breakpoint the state is what you expected it to be. At a breakpoint it will also let you look at the off-screen buffers: the back buffer, depth buffer, stencil, and alpha buffer; you can look at them at any point where you can set a breakpoint. It'll let you write scripts and execute OpenGL commands: at a breakpoint you can say "I think that state's wrong, I'll modify it right here," type in a GL command, hit execute, and it'll poke that OpenGL command right into your application and change the state for you. One useful thing for this, then, is debugging: if you think you've got a bug in your state setup, you can modify it on the fly. Scripts can be attached to breakpoints so they're auto-executed, if you want a script run every time a breakpoint comes along.
The OpenGL Driver Monitor is another powerful tool. What it does is attach to the driver itself and start collecting stats out of your graphics driver. There are a number of parameters you can monitor, like video memory usage and hardware wait times, so you can see if the CPU is stalled against the hardware, and you can watch what kind of stall it is; it breaks down into many different categories why your CPU may be blocked against the GPU, and you can monitor those. You can look at bandwidth usage, how much data you're getting through the system; it'll track bytes per second. So there's a whole bunch of useful stats. It takes a little bit of studying this tool to get useful data out of it, because it is somewhat complex, so we'll go through a little bit of that in one of our demos. Why don't we switch to the demo machine; let's do that.
OK, quickly here, before we go any further, a screenshot I didn't show: you can also customize your pixel format. For instance, if I wanted to make a custom pixel format, I can come in here and change the pixel format attributes the application uses. So if you have a pre-compiled application, you can modify your pixel formats on the fly without having to recompile. What you can also do is emulate hardware. Now, when I say emulate, yes, it's neat, but I'll give you the bad part: all it really does is downgrade your current hardware to some less capable hardware. So, for instance, if I'm running on an R300 like I am here, and I wanted this hardware to look to the application like a Rage 128, I could say choose driver: Rage 128, with the same feature set that got released in OS X 10.3. Any time your application comes along and makes a query into OpenGL for some kind of capability, like an extension or some min/max values, the driver will return a value that looks like a Rage 128. So if your application is coded correctly to respect extension strings and queryable values, this is a powerful utility for making your application think it's running on a Rage 128. This was actually a feature request at last year's WWDC, so we got it in there.
OK, so I contrived an application, a slight variation of the texture range demo I showed before, and what I did is what I said not to do: I stuck a glFinish in there. So the first thing we're going to do is look at the statistics this thing is collecting. What do we see? We see glFinish taking 82 percent of the time in GL and 40 percent of the application time. So, a couple of values; let's go over the screen real quick. One is the number down here, which I've highlighted; that's the estimated percent time spent in GL, so it tries to estimate how much of the total time is spent in OpenGL. We can see we're spending about 67 percent of the total time in OpenGL, and of that we're spending 55, 56 percent in glFinish. So somebody is calling a synchronous glFinish and causing the application to stall and wait for the GPU to flush the pipeline on that call. So here's what we're going to do: since I don't like that call, we're going to go in here, pull up the breakpoints window, and get rid of it. There's glFinish. You'll notice that not only can we set breakpoints before or after a function call, we can also stop executing functions, so I'm just going to disable that function. That's a favorite tool at Apple, by the way: when we catch applications doing things we don't like, we just disable the call. So now you can see things are looking a little better: we got rid of the glFinish, and now we're only spending 24 percent of the time in OpenGL, not the 65 percent we were spending before. We're spending the time where we want to be, basically in glBegin, where the real work of getting the data up to the system is going on, and things look good. Now, I talked before about double buffering, and the importance of double buffering data to stay asynchronous. This application has the ability to switch down to one buffer, so I can make it look like I'm only feeding one buffer at a time, and you can see I'm stuck on buffer zero, texture zero. So real quick, let's see what the performance impact is: right here I'm at about 500 megabytes a second; at five buffers, I'm at about 600, sometimes 630 megabytes a second. So there's quite a performance difference from stalling: the hardware stalls on the CPU while it prepares the next texture. So let's look at the difference in what the call stats show in this case, and we're going to pull up the Driver Monitor so we can look at a couple of things in conjunction. OK.
Quickly here, let me move this back up a little bit. What we're seeing on the Driver Monitor is three lines. I'm drawing the hardware wait time in red, which represents the total time the CPU is waiting for the hardware; any time the CPU is blocked up against the hardware, that starts registering wait time. In yellow I'm measuring texture paging bytes: since this is a texture demo uploading lots of textures, I'm recording the number of bytes of textures per second I'm sending up to the hardware. And green is the swap complete wait time. Now let's see what happens when I go from single buffer to double buffer, and watch the effect that has on some of the statistics, to give you a little bit of an idea of how to use this tool and watch for different events that may be going on in your application. OK, so that was single buffered, and you can see up in the stats that when I'm single buffered I'm spending all my time basically in glTexSubImage2D. As I change the pixels for that texture, I'm spending all my time there, and I'm there because I'm blocked against the hardware: the hardware maybe hasn't completed uploading that texture, the CPU is ready to give it another one, and so the CPU has to wait for the hardware to be done. We block, and that's the effect single buffering has: the CPU is not running asynchronously to the GPU. Now, if I move this up and double buffer, we can start seeing some of the effects; let me change a couple of options here to give myself a better vantage point. OK, so again, the red is the hardware wait time, and you can see that when I went to two buffers, and it's subtle, so you have to watch, the red line went down: the CPU is now waiting less on the hardware. And the yellow line went up, meaning I'm getting more bytes per second up to the graphics card. So by double buffering I have made myself more asynchronous to the GPU, allowing for better parallelization and less blocking on the CPU's behalf.
There are a couple of other things here; let's pull up the stats again and look at what effect that had. Previously I was spending all my time in glTexSubImage2D, and I still am. Let's bump it up a little more and see what happens; we'll go up to five buffers like we were. Now the blocking point switched back again, so we're actually able to catch the driver blocking at different points as we move to different numbers of buffers. We can see that double buffering wasn't quite enough: it doesn't quite get me the same behavior that three buffers does. Now, one thing to watch out for that can potentially fool you is that there are only limited numbers of the different types of resources in the driver. As you vary the way your application works, you can start consuming those different resources, and when you consume a resource, the driver is going to block waiting for a resource to become free. So what's happening here is that at, say, three buffers, I'm running out of one type of resource: I'm probably blocked up against the hardware waiting for the completion of that command buffer. But when I go to five buffers, it changes, because I believe I'm blocked against swap buffers: there's a particular packet type in the driver that's needed to swap, and there are only four of them. When I switch up to five buffers, I've now made the CPU so asynchronous, so separated from the hardware, that I'm consuming a driver resource that makes me block somewhere else. The key points here, though: hardware wait time is always a good one to look at, along with what kind of byte throughput you're getting. You can look at a variety of stats for byte throughput; let me pull up the different stats here. We can look at GL command bytes, for instance, if I poke that down there and disable this other one; that one's not very interesting, actually looks like a bug. In any case, there are lots of different stats you can put up here, and we're going to be releasing another version of this that has more descriptive names and hopefully some better information; they are a little bit cryptic. If you need more detailed information, don't be afraid to post to the OpenGL mailing list. Let's switch back to the slides, please. OK, so let's wrap up.
text your optimizations the goal is to
minimize your CD copies of pixel data
there's different ways to optimize for
power to non power to vertex
optimizations you'll want to use the
vertex array range for dynamic data with
the shared hint storage hint and for
static data use the vertex array range
with the cash storage hints or display
lifts offload the CPU on to the GPU with
virtue
programs free up some work on to the GPU
Use threads; you can share different types of data between threads: surfaces, contexts, and data. Use draw pixels for one-shot images, and copy pixels for fast VRAM-to-VRAM copies of your pixel data. Use the OpenGL Profiler to find hot spots and points in your code that may be getting blocked in OpenGL. And with that, if you have more questions, you can contact myself or Travis Browne. So, quickly, references: we've got the opengl.org webpage you can go to, we have the Apple developer page, and we have some Apple documentation that is available on the developer page.
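The one-shot image path from the wrap-up can be sketched as below: blitting pixels straight to the framebuffer with glDrawPixels rather than building a texture you will use only once. This assumes a current GL context and an RGBA image; the function name and the ortho setup are illustrative:

```c
/* One-shot image sketch: draw pixels directly instead of creating a
 * texture that would be used once and thrown away. Assumes a current
 * OpenGL context sized to at least width x height. */
#include <OpenGL/gl.h>

void draw_one_shot_image(const GLubyte *pixels, GLsizei width, GLsizei height)
{
    /* Pixel-aligned coordinate system so the raster position maps 1:1. */
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    glOrtho(0, width, 0, height, -1, 1);
    glMatrixMode(GL_MODELVIEW);
    glLoadIdentity();

    glRasterPos2i(0, 0);
    glPixelZoom(1.0f, 1.0f);
    glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
    /* Once the call returns, GL has consumed the data; the app can
     * dispose of the image with no texture object to manage. */
}
```

For moving data that is already in VRAM (for example, between buffers), glCopyPixels keeps the copy on the card instead of round-tripping through the CPU.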
And with that, I'm going to bring Travis up for the roadmap. All right, we're rapidly running out of sessions at this year's WWDC. Let me actually skip... I went one too far. So, yes, the next session in the graphics and imaging track isn't specifically related to OpenGL, but it's certainly a popular session nonetheless: Mac OS X Printing. And then tomorrow we have Introduction to Quartz Services; if you're a game developer or a full-screen OpenGL application developer, please attend that session. It will be covering the core APIs that the system uses to do display configuration and management. Then again, our hardware partners from ATI are going to give us a presentation on Friday... let's see, tomorrow is Friday: Cutting Edge OpenGL Techniques, where they're going to really show us some of the absolute latest things they're able to do with their current Radeon products. We also have a session on accessibility; this will actually contain some content that may be of interest to game developers, because we will be covering, at least in a slide or so, issues affecting the use of assistive technology, which is software that adapts the function of the computer, with OpenGL applications that take over the full screen; we have a suggestion there for some possible ways that you can ensure compatibility. Then we have what is historically the last session of WWDC, which is the Graphics and Imaging feedback forum. Come voice opinions and give us suggestions; please take the time to attend the feedback forum, because that's where we get a lot of the information that we use to create great new features in the operating system for next year.