WWDC2010 Session 422

Transcript

>> My name's Ken Dyke, I'm a member of the
Graphics and Architecture Group at Apple Computer,
and I'm going to talk to you guys this morning about
how to take advantage of multiple GPU's in your apps.
So, what are we going to learn today?
First, the basics of supporting multiple GPU's in your
apps, how you find all of the renderers in the system,
how you switch between them on your own
or when you want the system to do it for you.
Second, we'll talk a little bit about how to support
multiple GPU's at the same time and what that entails.
Some things involved with that are shared contexts, resource
management, how synchronization works between the two GPU's,
and some performance tips when you really want to
get the best, best that you can out of your system.
And lastly, we'll talk a little bit about IOSurface
and multiple GPU's and how that can help you.
All right, so what are the motivations
for talking, for using multiple GPU's?
So, we've been shipping systems with multiple GPU's for
quite a long time, you know, you can buy a Mac Pro with up
to four GPU's in it, recent MacBook Pros
have multiple GPU's in them as well,
so it's good for users to support them there.
But probably the biggest reason is, you know,
getting a better user experience, you know, again,
increased performance is one thing, if your apps respond to
GPU changes the correct way, the system can tell you, "Hey,
you know, the user moved their
window from one GPU to the next."
If you don't do that, the window server might be
stuck copying data between the GPU's on your behalf.
It works, but it's not the best for performance.
Another thing that goes along with that is that it's sort of
important to support multiple GPU's for hot-plug, you know,
it's like on a Mac Pro, you can yank the display
cable out of one card, plug it into the other one,
everything is supposed to move over,
and most of the time it works, but again,
if your app isn't written the right way,
and the system decides to switch GPU's,
things might not work quite so well.
Okay, so let's talk about the basics of Multi-GPU support.
So first, a little bit of terminology here.
So, a renderer is basically a single piece of
graphics hardware in your machine, you know,
it could be an AMD card, an NVIDIA card, or it
could be the CPU software based renderer, as well.
Now each one of these guys has a unique renderer ID in
the system, that way if you have more than one NVIDIA card
or more than one AMD card or like say
you've got four GT120's in your machine,
you can still identify specific pieces of hardware,
even though they're all basically the same.
Now a pixel format object is something you normally
think about being, hey, you know, I want 32-bit depth
and 32-bit color and Multi-Sample and that sort
of thing, but the important thing for this talk is
that it also embodies what set of renderers
your OpenGLContext is going to use.
In this case, I might have a pixel format that
actually supports three renderers at the same time.
Okay, so how does that relate to context?
So when you create a context, you always have to
pass in the pixel format that you want to use,
again that's what decides what
renderers your context is going to use.
Now once you have that, there's
this concept of a virtual screen.
Those are assigned, basically into the
renderer slots within your OpenGLContext.
Now, OpenGLContexts all have the
concept of a current virtual screen.
You know, you're always saying, "For this
OpenGLContext I want to be using my AMD card
or I want to be using a software
renderer, or the NVIDIA card,
depending on maybe what screen I'm on, that sort of thing."
An important thing, I'll show you
in the demo a little bit later,
is that the virtual screen order does match
the order of the renderers in the pixel format.
All of this stuff correlates so that you can go from
a context virtual screen back to the pixel format,
and figure out what piece of hardware you're
on, figure out what display that might support,
core renderer attributes, that sort of things.
Now, we call it a screen, but the virtual part is
really there because it doesn't have any correlation
with physical displays on your system.
You might have one display and three GPU's, you might have
two displays and one GPU; they don't necessarily line
up with each other, so don't let that confuse you.
And last, and something that's, again, sort of cool
for this talk, is that Mac OS X is the only platform
where a single OpenGL context can support different
renderers at the same time and switch between them.
It's just one way that Mac OS X makes supporting
multihead systems a lot easier
than it is on other platforms.
Now OpenCL has similar concepts to all of this.
It's a different set of API's,
but the concepts are all the same.
So instead of using ChoosePixelFormat to
choose a particular set of OpenGL renderers,
you can call clGetDeviceIDs to get a list of
all supported OpenCL devices in the system.
And instead of passing, you know a pixel format
to NSOpenGLContext, when you call clCreateContext,
you get to specify exactly the set
of renderers that you want to use.
And instead of using a virtual screen to select
the OpenGL renderer you want, in OpenCL land,
what you're going to do is create a specific
command queue against an OpenCL context,
and that lets you pick the particular
piece of hardware you're going to use.
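That CL flow can be sketched in C; this is a minimal example, assuming Mac OS X's <OpenCL/opencl.h>, a machine with at least one GPU device, and with error handling trimmed for brevity:

```c
// Sketch: enumerate OpenCL GPU devices and create one command queue per
// device. The queue is the CL analog of picking a GL virtual screen.
#include <OpenCL/opencl.h>

int main(void)
{
    cl_device_id devices[8];
    cl_uint num_devices = 0;

    // Unlike ChoosePixelFormat, clGetDeviceIDs hands back every matching
    // device in the system; CL_DEVICE_TYPE_GPU skips the CPU device.
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 8, devices, &num_devices);

    // Create one context spanning the exact set of devices we want...
    cl_context ctx = clCreateContext(NULL, num_devices, devices,
                                     NULL, NULL, NULL);

    // ...and pick a specific piece of hardware by creating a command
    // queue against it.
    for (cl_uint i = 0; i < num_devices; i++) {
        cl_command_queue q = clCreateCommandQueue(ctx, devices[i], 0, NULL);
        clReleaseCommandQueue(q);
    }
    clReleaseContext(ctx);
    return 0;
}
```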
Okay, so how do you allow your app to
see multiple renderers in the system?
So, say you've got some code like this in your app, you
know, setting up some pixel format attributes for OpenGL,
you want accelerated, you want it to be double buffered,
32-bit color, but say you've also told the system,
"Hey, this is the one display I want to use."
In this case I'm saying, "Hey, I just want whatever
GPU's hooked up to the main display, give me that one."
Please don't do this.
First of all, it's guaranteed you're only ever going
to get a single renderer if you do this, okay,
which means your app is guaranteed not
to support hot-plug, that sort of thing,
or even if the user has two cards plugged in with four
displays, you're only going to end up using one GPU
to get any OpenGL acceleration, and the window
server is going to be stuck copying the contents
of your drawable across to the other card, not good.
The whole point of ScreenMask is really only for the
sort of legacy full-screen contexts where, you know,
you would have your OpenGL context and then you'd just
tell it, SetFullScreen; you had no way to say, "Hey,
I want to go full-screen
on this particular GPU versus another."
Now in 10.5, we added a new API that lets you have multiple
hardware supporting full-screen context, but as of 10.6,
you don't need to use full-screen context anymore anyway.
You can just create a full-screen
window, covers the display,
and will automatically give you
all the performance benefits,
so don't use ScreenMask, it's really not necessary anymore.
Now what you should use, though, is allow offline renderers.
This one is the biggie.
If you don't specify this one, you won't see any
GPU's that don't have displays attached at the time.
So, if, you know, you just do the normal thing and the
user has two GPU's and they start up your app
and they switch to the other card, for whatever
reason, your app isn't even going to be able to move
to the other GPU, even if you wanted it to, okay?
So this is what allows you to see
renderers that don't have displays attached.
Now, the reason we don't do this by
default is primarily for compatibility.
Unlike most of the other pixel format attributes,
this one actually adds to the list of renderers,
rather than takes away from them, and there's
some apps that just don't deal with that,
so this is kind of an opt-in thing, but at
this point, everybody should be doing it.
And, again, it's important for hot-plug.
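As a sketch, the attributes being discussed might look like this at the CGL level; kCGLPFAAllowOfflineRenderers is the C-API spelling of the attribute (NSOpenGLPFAAllowOfflineRenderers in Cocoa):

```c
// Sketch: a pixel format that opts in to offline renderers and,
// deliberately, does not limit itself to one display.
#include <OpenGL/OpenGL.h>

CGLPixelFormatAttribute attrs[] = {
    kCGLPFAAccelerated,
    kCGLPFADoubleBuffer,
    kCGLPFAColorSize, (CGLPixelFormatAttribute)32,
    kCGLPFAAllowOfflineRenderers,   // expose GPU's with no display attached
    (CGLPixelFormatAttribute)0      // note: no kCGLPFADisplayMask here
};

CGLPixelFormatObj pf = NULL;
GLint nvirt = 0;                    // comes back as the number of virtual
CGLChoosePixelFormat(attrs, &pf, &nvirt);  // screens, one per renderer
```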
So, how do you trigger a renderer change
to happen within your OpenGLContext?
Well, normally the system will do it for you, but
if you just call NSOpenGLContext update method,
the system will look at the window
your OpenGLContext is attached to,
and automatically choose the right renderer based on, you
know, if I'm on this display or another display, or,
if I'm straddling, which one has more
screen coverage, that sort of thing.
So, normally you would do this in response to NSOpenGLViews
update method, you just have almost a single line of code
in there that just says, "Hey, my view could have moved
displays, or it's been scrolled or something like that,
OpenGLContext, now's a good time, go ahead and
re-choose what the best renderer to be using is."
Now if you're not using NSOpenGLView and you're sort of
just using a regular NSView yourself and attaching a context
to it by hand, there's an AppKit notification you can
sign up for, which you see here, NSViewGlobalFrameDidChangeNotification;
AppKit will post that notification when it believes
that your view could have moved to a different display.
It doesn't post it all the time for performance reasons, but
if you move to a different display, or you've been scrolled
or something like that, AppKit will post it for you.
You can also tell the OpenGLContext
to use a particular virtual screen.
This forces it to use a particular renderer, no matter
what card it thinks you're on.
This is primarily useful for off-screen context, like
say, you've got one context doing some rendering to an FBO
or something like that, or doing some texture uploading,
well you want to make sure your texture uploading goes
to the same virtual screen or the
same renderer that your onscreen does.
So, this is how you would accomplish that.
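At the CGL level, those two moves, re-choosing the renderer and forcing a virtual screen, might be sketched like this; the two function names are hypothetical wrappers, not API:

```c
// Sketch: letting the system re-choose the renderer, and pinning an
// off-screen context to whatever renderer the on-screen one is using.
#include <OpenGL/OpenGL.h>

void rechooseRenderer(CGLContextObj ctx)
{
    // CGL-level equivalent of -[NSOpenGLContext update]: re-evaluates
    // which renderer the context's drawable should be on.
    CGLUpdateContext(ctx);
}

void matchRenderers(CGLContextObj onscreen, CGLContextObj offscreen)
{
    GLint vs = 0;
    CGLGetVirtualScreen(onscreen, &vs);
    CGLSetVirtualScreen(offscreen, vs);  // keep texture uploads, FBO work,
                                         // etc. on the same GPU
}
```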
Now, how do you respond when a renderer change happens?
What do you have to do?
Well, for a lot of apps, you don't have to do anything.
If you're not using a lot of crazy or very hardware
specific OpenGL extensions, calling update will be enough.
However, if you're doing something NVIDIA specific, or AMD
specific, you're right up against some hardware limits,
you want to basically look and see if the virtual screen
has changed since the last time you called update.
Now let's say that it changed.
What do I do then?
Well, in that case, check the extension
list, see if the extensions you want
to be using have changed, now that you're on a new renderer.
Other things you can look for are, see if there's a hardware
limit, some of your textures might exceed or something
like that, and you might have to re-upload them.
And lastly, again, make sure that if you have any off-screen
context, that they're synchronized with your onscreen ones,
so we don't have to copy data across the bus for you.
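A sketch of that check, assuming a CGL context that is current; the function and variable names here are made up for illustration:

```c
// Sketch: after an update, compare the virtual screen against the one we
// saw last time, and only then re-query anything hardware specific.
#include <OpenGL/OpenGL.h>
#include <OpenGL/gl.h>
#include <string.h>

static GLint lastScreen = -1;   // hypothetical per-context bookkeeping

void didUpdateContext(CGLContextObj ctx)
{
    GLint vs = 0;
    CGLGetVirtualScreen(ctx, &vs);
    if (vs == lastScreen)
        return;                 // same renderer, nothing to do
    lastScreen = vs;

    // New renderer: re-check the extension list and hardware limits.
    const char *ext = (const char *)glGetString(GL_EXTENSIONS);
    GLint maxTex = 0;
    glGetIntegerv(GL_MAX_TEXTURE_SIZE, &maxTex);
    if (ext && !strstr(ext, "GL_EXT_framebuffer_object")) {
        /* fall back, or re-upload any textures that exceed maxTex */
    }
}
```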
Okay, so how does resource management work when you've
got one of these contexts supporting multiple GPU's?
Well, OpenGL and OpenCL will automatically track anything
you've specified via the API, textures, anything like that;
we can keep track of which GPU's that data is
on and move it back and forth as necessary.
Resources that you specify, like an OpenGL texture,
or VBO data, something that you're not going
to be modifying using the GPU, we can just simply
re-upload to the second GPU and everything will continue.
Now, if you've modified a texture, using
like FBO, or something like that, on one GPU,
when a renderer change happens, we're going to have to pull
that data off one card and ship it over to the other one.
Now the nice thing is that that happens
completely automatically, just
as a result of you doing the context update.
Okay, any objects that are bound at the time you
perform the renderer update, we automatically move over,
because you could just keep doing any
rendering right on them and expect it to work.
Any other objects are basically done lazily as you bind
to them, we'll detect that they're on the wrong GPU,
pull them back to system memory, then upload
them to whatever renderer is necessary.
So, this will sort of give you
an idea of how all this works.
So let's say on my OpenGLContext, I go ahead and
create a texture and give some texture data to OpenGL,
in this case it's just a simple little brick texture.
At this point, the OpenGLContext in host memory has a copy
of the texture, but it doesn't exist on either GPU yet.
So let's say I do some simple rendering code that draws
with the texture, it's still sitting just in system memory;
not until that command stream gets flushed to the
GPU does it finally get uploaded, as we see here.
Now let's say I do something like I talked about before,
where I go ahead and modify that texture in some way.
In this case, let's say that
that function was accelerated by this particular driver,
and all of that data is now basically going to be in flight
up to the GPU to modify that texture up in VRAM.
The host memory copy of that texture has now become invalid.
We can't really use it to upload
to the other card if necessary.
And let's say we stay bound to that texture and we do
a context update that switches us to the NVIDIA card,
part of that process involved paging that back to
host memory, and then switching our virtual screen
over to the NVIDIA renderer in this case.
So, again, let's say I draw with that texture over on the
NVIDIA card, again nothing happens until I do a Flush.
At that point, the data gets copied
up and everything is all good.
All right, so let me give you a little demo here.
Was anybody here at my 2002 Quartz Extreme Talk?
Nobody, oh man, all right.
So, all right, Peter was.
So, this is a really simple demo, but it's designed to show
that an app can basically seamlessly move between renderers
without having to do much of anything at all.
So, I have the ever-important Allow Off-screen
renderers in here so that I can detect things.
Now I should back up a step a little bit; the machine that
I'm using for this has both a GTX 285 and an AMD 4870 card
in it at the same time, but we're
only hooked up to one display.
Now we're driving the display off the NVIDIA card, so
if I want to see or be able to get at the AMD renderer,
I have to put allow offline renderers in here,
or I'm not going to be able to get at it.
Now for demo purposes, which you'll see why in a minute, I'm
also signing up to just detect whenever the window moves,
because I want really fine grained control
over when we're going to switch GPU's.
Now in my GPU changed thing, for the purposes of this
app, I really don't have to do anything important.
Everything that's in here, is just
really to update the User Interface
with what renderer I'm going to be showing you guys.
So the simple part of that is here, I'm just calling OpenGL,
doing a little bit of Cocoa to take the C-string I get back
from OpenGL and slam that into a text field, so we
can see the name of the renderer I'm going to be on.
This code down here, which I'm not
going to go into the gory details on,
lets me go from the OpenGLContext virtual screen,
whichever one that is, all the way back to being able
to query the renderer for how much video memory it has.
So finally, down here, I can say, you know,
does it have half a gigabyte or a gigabyte.
Now, also for the purposes of this demo: normally
you would just do this, you'd just call update,
like I said you needed to; for the demo, I have
to do something a little bit more interesting.
In this case, I want to switch renderers based on which half
of the screen I'm on, so this part figures that out for me.
It lets me sort of simulate having
two displays hooked up here.
So, let me go ahead and run this.
So, anybody recognize what this is from?
Come on, geez, all right, nobody
grew up in the '90s, I guess.
>>Ken: Thank you.
All right.
So, the other thing about this is kind of funny,
I implemented this to use VBO's for performance
and of course it doesn't matter in this case, because
there aren't enough polys, and it uses per-pixel lighting
and you still can't tell, so anyway, so now if I move
this over, it switches, now everything's being rendered
on the ATI card and the window
server just automatically figures it out.
The takeaway from this is nothing
happens on the screen when I do it.
There's no glitch, it just keeps going and the
OS basically does the right thing for you.
That's what we want the user experience to be in this case,
all right, you know, just nice and smooth transition as far
as your user's concerned, life is good,
nothing happened, that's the whole idea.
So, check out the sample code, it should be posted already
and you can take a look at what
the rest of the app is doing.
Okay, so let's talk a little bit
about advanced Multi-GPU Support,
when you want to purposely be using
more than one GPU at once.
So, what are the motivations for this?
Well, the biggest one, obviously, is performance.
You know, you've got this other card sitting there,
maybe it's really good at doing OpenCL
stuff and you want to make use of it.
Due to the way GPU contexts switch, or don't these days,
you know, if you have an offline GPU doing a lot of compute,
it doesn't really matter if you're tying that
thing up for really long periods of time,
the GPU driving your display will
be free to just do GUI stuff.
Another reason is, you might just find
yourself on a system that has two GPU's,
but only one of them has the extensions you
want, so you really have to use the one that,
just for whatever reason the user
doesn't have the display hooked up to.
In that case, you know, you might just
simply need to use the other one along
with the one that you're using for display.
So, there's some issues to consider with this.
First, context sharing, how does that work?
How do we, you know, make resources shared between
multiple contexts running on different OpenGL renderers,
and how does synchronization and resource management
work when you have more than one renderer involved.
So, first we'll talk a little bit
about context sharing in OpenGL,
which is something not everybody might be familiar with.
On Mac OS X, you can have multiple
OpenGL contexts sharing the same set of resources.
This means things like textures, display lists,
FBO's, VBO's, a whole bunch of stuff like that.
That way, no matter how many different
OpenGLContext you have in the system,
you only have to give it the texture data once.
It's better for system performance, there's
not duplicates of everything, life is all good.
Now normally when you create a context,
this is just a really simple example,
here you can see that the second parameter here is null;
that parameter is the sort
of shared-context parameter, okay.
If you switch that out to a different context, now
when you create context B, it's going to share all
of your, you know, resources with the first one.
Now there's some limitations here.
Both contexts have to have exactly the same set of renderers
in it, you can't have one context that just has the AMD card
and another context that has just the NVIDIA card
and share resources between them, that won't work.
Both contexts have to have both renderers in them.
So, be very careful if you're going to call
ChoosePixelFormat once for each context.
If you use DisplayMask, or something else that limits
you to one particular hardware device or another,
you're probably not going to get what you want.
The safest thing you can do is just ask the first
context for its pixel format and pass that in here,
that way when you create the second context,
you're guaranteed that it's going to work.
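That pattern might be sketched like this at the CGL level; the helper name is made up, and error handling is omitted:

```c
// Sketch: create a second context that shares resources with the first.
// Reusing the first context's pixel format guarantees both contexts have
// exactly the same set of renderers, which sharing requires.
#include <OpenGL/OpenGL.h>

CGLContextObj makeSharedContext(CGLContextObj first)
{
    CGLPixelFormatObj pf = CGLGetPixelFormat(first);  // same renderer set
    CGLContextObj second = NULL;
    CGLCreateContext(pf, first /* share-context parameter */, &second);
    return second;
}
```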
In OpenCL, it's a pretty similar story, but
it's easier to sort of force the sharing stuff.
Whenever you create a command queue in
OpenCL off of a given OpenCL context,
basically all of the command queues created against
that OpenCL context are guaranteed to share stuff.
You don't necessarily have to jump
through the same hoops you do in OpenGL.
Now the interesting thing you can do on Mac OS X is
create an OpenCL context that automatically has
in it the set of renderers that an OpenGL context is
using, and if you do it this way, as the code shows here,
you automatically get resource sharing between OpenCL
and OpenGL, so like images and textures will be shared,
or buffer objects will be shared automatically,
and you can pass data between them.
That's how you do that.
It is worth pointing out that, in this particular case, your
OpenGL context might have the software renderer in it,
but this code here for OpenCL won't get you the CL software
compute device; you would have to add that in as well.
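That pattern, creating the CL context from the GL context's share group, might be sketched like this; the helper name is hypothetical, and passing zero devices with the share group property is an Apple-specific convention:

```c
// Sketch: hand OpenCL the GL context's share group, so GL textures and
// CL images can be shared automatically.
#include <OpenCL/opencl.h>
#include <OpenGL/OpenGL.h>

cl_context makeCLContextFromGL(CGLContextObj glCtx)
{
    CGLShareGroupObj group = CGLGetShareGroup(glCtx);
    cl_context_properties props[] = {
        CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE,
        (cl_context_properties)group,
        0
    };
    // Zero devices: the context picks up the devices backing the GL
    // renderers. The CL CPU device is NOT included; add it yourself
    // if you want it.
    return clCreateContext(props, 0, NULL, NULL, NULL, NULL);
}
```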
So, let's talk a little bit about multiple-context
synchronization and what goes on there.
So, if you've got two different contexts, even if they're
on the same GPU, you really have to pay attention to order
of operations if you're going to be sort of
doing a producer consumer sort of thing, okay?
On Mac OS X, OpenGL at least uses what we call Flush
and then Bind semantics, if you're going to do this.
Any context that's modifying a resource, like rendering
to a texture using FBO, or anything along those lines,
has to Flush that context before we do
anything else with the data in another context.
That could be a glFlush, glFlushRenderAPPLE,
glFinish, not the best performance, but it would work;
anything that basically drains the entire
OpenGL pipeline dry will ensure that everything is good.
After that point, any contexts that are going
to use that modified resource, must bind to it.
You can't just already be sitting there bound to a
texture that you modified on another GPU or in another context
and then draw with it, it won't work
right; you have to redo the bind.
The bind call is how, and where, we get
a chance to go in and detect
that the data might not be on the GPU where it's now needed.
Now this applies to both single and multi GPU cases.
Even if, you know, I've got two contexts on
the same renderer, if I do a TexSubImage,
or something along those lines, on one context and
don't do a flush and go to the second context,
even though it's all on the same GPU, because of the command
buffering that happens in host memory and everything else,
those texture modifications might simply not be visible
to that second context, unless it's been flushed first.
In the Multi-GPU case, it's also very critical,
because at that point that's what allows us
to pull the data off the first GPU
and ship it over to the second one.
Now in OpenCL you can use the event
model to accomplish the same sort of thing.
If you're shipping data from GL to OpenCL, for example, you
kind of have to follow the appropriate rule for each API.
In this case, if you're going to
render to a texture using OpenGL,
and then you want to do some CL specific image processing
on the result, you're going to have to Flush the GLContext,
make sure all those commands are in flight first,
then you can do an acquire on the image in CL,
and everything will work the way you expect.
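A sketch of that flush-then-acquire sequence; it assumes a current GL context, and a cl_mem image already created from the GL texture (for example via clCreateFromGLTexture2D), with error handling omitted:

```c
// Sketch: GL renders into a texture, then CL post-processes the result.
#include <OpenCL/opencl.h>
#include <OpenGL/gl.h>

void handOffToCL(cl_command_queue queue, cl_mem clImage)
{
    // 1. Flush-then-bind rule, GL side: get the rendering in flight.
    glFlush();

    // 2. CL side: acquire the shared image before touching it...
    clEnqueueAcquireGLObjects(queue, 1, &clImage, 0, NULL, NULL);

    /* ...enqueue the image-processing kernels here... */

    // 3. ...and release it before GL uses the result again.
    clEnqueueReleaseGLObjects(queue, 1, &clImage, 0, NULL, NULL);
    clFinish(queue);
}
```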
So, this is similar to my previous diagram, but what I've
got to show here, is that I have two OpenGLContexts, okay,
they're using the same Share Group, in this
particular case, but each one is talking
to a different GPU, so how's this going to work?
So, using my previous example, I create a new
texture object and I put some data into it.
At that point, it's really the Share
Group that owns the copy of the data.
In the previous animation that I showed for you,
I didn't show the Share Group as a separate thing
from the OpenGLContext because I didn't
want to sort of muddy the waters,
but the reality is that even a standalone
OpenGL context always has a Share Group sitting under it,
and that's really where all the resources are owned.
So in this case, again, I've specified a texture,
it's now owned by the Share Group,
now I come along and I render with it.
Now, again, that draw texture command, or whatever that
entailed in my app, whatever those GL commands were,
is still just sitting in host memory, it
hasn't been shipped off to that other GPU yet.
Or to the AMD card in this case.
Once I call Flush, that's when everything gets going
and that data will be uploaded to the card and consumed.
Now, again, let's say I do a TexSubImage operation
that winds up being accelerated, so we're not going to end
up modifying it in host memory first; that causes the
host memory copy to now be out of date with respect
to what's up in video memory on the AMD card.
And again, Flush it to make sure
that everything is really up there.
Now let's say I go over to my other
OpenGL context that's currently bound
to the NVIDIA renderer, and do
something interesting over there.
So, I'm going to bind to that same texture object.
Now the instant I do the bind, the Share Group is smart
enough to say, "Hey, wait a minute, you know that texture?
We don't have the most up-to-date copy in system memory.
I've got to reach all the way over to this
other piece of hardware and pull the data back."
This is why both contexts have
to have the same set of renderers in them,
because even though over on the second
OpenGL context, on your right, I'm talking to the NVIDIA card,
I might need to have sort of device access to
the AMD card so that I can pull resources
on it back to host memory, on your behalf, okay?
So now I'll go ahead and I draw with it and I Flush, that
causes that data to get pulled up to the NVIDIA card now,
and everything continues the way you want it to.
Now, let's say I go over here and do
an accelerated CopyTexSubImage.
Well the instant you do that, OpenGL knows that not only
is the Share Group copy out of date with respect to what's
up on the NVIDIA card, but the AMD card's data is now going to
be out of date too, so if I go back to the first context
and do something with that, it's going to have
to start this entire process all over again
and pull the data off the NVIDIA, back to
host memory, and back up to the AMD card.
And, again, make sure you finish
your rendering with a Flush.
So, there's some performance things to think about, if
you're going to use multiple GPU's within your application.
The biggest one is make sure you're
doing enough work to make it worthwhile.
You have to take into account how much compute or rendering
you're going to do versus what it's going to cost you
to transfer the data, you know, back to
host memory and up to the second GPU.
Especially in the Mac Pros, where it might even matter
what slots you're in, you have to pay attention to that;
you know, the user may have four cards in the
system, but only two of them will have 16x slots,
the other two will have 4x, so getting data off certain
GPU's is going to be considerably more expensive.
So, what you want to do ideally as well, is decouple
the workloads between the two GPU's as much as possible,
you know, say if you had four GPU's in there,
and you're doing some kind of, I don't know,
like image processing thing where you don't
have to have dependencies between the GPU's,
you can just fire stuff off the one card,
run some image processing, get it back.
You could probably set up four different threads,
one to work on each GPU, get them all fired up
and running completely in parallel with
each other, that's the ideal, okay?
Another thing to watch out for is, while we just talked
about how the resource synchronization stuff works
in multiple contexts, don't rely on
that to get the best performance.
You know, before we can pull data off of
one GPU, we have to synchronously wait
for it to be done before we can pull it off.
Okay, so ideally, you want to find some way, be
it whatever API you're working with, GL or CL,
to make sure that that data has
somehow gotten back to system memory.
So, consider using extra buffering, sort of to
double or triple buffer your data, in cases
where you know you're going to be streaming data from one
GPU to the other; you want to get those two things working
in parallel as much as you possibly can.
Another thing is, if you've got some compute device in the
background, or GL doing offline rendering in the background,
make sure you don't wind up bottlenecking yourself behind
trying to display every single frame that you get out,
it's really not necessary, show a progress
bar, or just take snapshots, you know,
every now and then of where the GPU's at, but don't
try and shove every single frame that you compute
onto the display, it's just going to slow everything down.
All right, so now I am going to bring up Abe Stevens,
one of our OpenCL Engineers to show you a couple of demos
that use multiple GPU's at the same time.
Abe?
[Applause]
>>Hi, my name is Abe Stevens, and I
work with the OpenCL group at Apple.
Yesterday during the OpenCL, OpenGL sharing talk, we
showed a simple demo that consisted of a number
of objects that bounced around the desktop with some
post-process effects that were rendered in OpenCL.
What we've done for this demo is we've taken
that same code and made a certain number
of changes to it, to enable Multi-GPU support.
So first, let's take a look at the demo that
we saw, or that some people saw yesterday.
In this example, the spheres that are bouncing
around the image are rendered with OpenGL,
using a GLSLShader that computes refractions
and reflections and then the caustic effect
that you see is rendered in OpenCL as a post process.
Now with only a few spheres bouncing around the desktop,
there really isn't much need for an additional GPU, however,
if we switch to a more complicated example, with a much
larger number of spheres, if we look at the frame rate,
which is displayed in the lower corner here in milliseconds,
the caustic is relatively expensive
and incurs a relatively high latency.
If I switch the physics step in this program, which
right now is running in OpenCL on the GeForce card,
over to the Radeon card, the performance will
increase by about, in this case, about 35% or so.
This increase in performance isn't, you
know, it's not a 2X increase in performance,
but in this case all we've done is we've taken a simple
CL application and made a small modification to it
that allows some of that CL work
to be performed on the second GPU.
If we wanted to take this application and design it from the
ground up, and I'll show another example in just a moment
that does this, we could set up our application so that
more work could be run concurrently between the two devices.
In that case, we might have to double buffer our
data, so that one GPU doesn't always have to wait
for the other GPU before it can perform work concurrently.
So this is another demo that was built from
the ground up to run on multiple devices.
In this case, we're running the demo on the Radeon
for graphics, so all the OpenGL rendering happens
on the ATI device, and the simulation, which is simulating
the movement of these particles around the desktop,
is occurring in OpenCL on our NVIDIA device.
Now if we switch to the Radeon device for both graphics and
compute, the performance changes significantly because now
that one GPU has to perform both tasks, both the OpenGL
display and the compute process to render the position
of the particles, and of course, if I switch
that back, our performance goes back up.
But this application is very different than
the one that I showed just a second ago.
In this case, we've set up the application to double buffer
the data, and we have really designed it from the ground
up to use both devices, so if you consider using both
devices and allowing work to execute concurrently
on the two pieces of hardware, you might be
able to get about a 2X performance increase.
If you just take your application
and add support for multiple devices
or multiple GPU's, the performance increase will be a lot less,
but we're still able to achieve
that, about a 20 to 30% improvement.
Anyway, I'm going to hand this back over to Ken who
will talk about another advanced multi GPU topic.
[Applause]
>>Ken: Okay.
So let's talk a little bit about IOSurface.
So this API was introduced in Mac
OS X 10.6 with not a lot of fanfare;
this is actually the first place we're talking about it.
The whole point of IOSurface, and what's relevant
to this talk, is that it makes resource sharing
between different parts of the system
a lot easier than it used to be.
IOSurface is basically nothing more than a really nice high
level abstraction around a chunk of system shared memory.
So, what this is designed for is to do very efficient
cross process and/or cross API data sharing, you know,
you might need to send some data from CoreImage to OpenGL
and you don't have control over the context involved,
so you can't set up sharing, this can help you with that.
Germane to this talk is that it's integrated directly
into the GPU software stack, for all
supported hardware on Mac OS X.
This is what allows us to pull off some really cool
tricks that we'll talk about a little bit later.
Now the really neat thing about it is that
from the app developer's point of view,
it hides nearly all of the details
about moving data from one GPU to the other,
or between the CPU and GPU and vice versa, okay?
If you follow a few simple rules, it
pretty much is designed to just work.
So, let's talk about the GPU integrations
stuff, because it's important for this talk.
So, an OpenGL texture can be bound to an IOSurface.
This is sort of a live connection, it means that anytime
the contents of that surface are modified anywhere
in the system, that texture, anytime it gets used,
is going to see those modifications happen right away,
you don't have to keep copying the data into the texture.
Also, IOSurface does support planar image types, so you
can bind an OpenGL texture to a single image plane,
for example, say you had an IOSurface with NV12 4:2:0
style video in it, you can bind a texture once
to the luminance plane and once to the chrominance plane
and write a shader to do the RGB conversion,
works out really well, and we do
that internally in some cases.
If you want to modify an IOSurface, you just take your
IOSurface-backed OpenGL texture, bind it to an FBO and go,
it's really no more complicated than that.
For the most part, you just get to use
the standard OpenGL techniques to do it.
Now, OpenCL itself doesn't have a direct binding to IOSurface
at this time, but via the resource sharing stuff we talked
about before, you can more or less take an OpenGL texture,
bind it to an IOSurface, then take that OpenGL texture
and use it with the appropriate extension, whose name
escapes me at the moment, but you can take that texture
and use it as an OpenCL image
and get access to it that way.
Now, what's also cool about IOSurface textures,
is it doesn't matter how many textures
in the system get bound to that IOSurface.
They're all going to use exactly the
same video memory, on any given GPU.
Okay, so this means, if I have two different processes
in the system, both looking at the same IOSurface,
and they both create a texture off of it, and they're both
using the same GPU, there are not going to be any copies
that happen back to host memory just because we're crossing
process boundaries, okay, that's part of the just-works part
of the whole API, and it's a good performance thing as well.
Also, no matter how many different
renderers we have in the system,
the host memory backing for IOSurface
is shared between them.
So if we do have to transfer stuff from one
GPU to host memory and up to another card,
there aren't ever any CPU copies
involved in this process.
With regular OpenGL texture objects, there actually
can be; in this case it's a DMA to system memory,
a DMA up to the other card and that's
it, the CPU does not touch the data at all.
So this is just a really simple example of creating
an IOSurface and getting it usable inside of OpenGL.
So, IOSurface is a standard, sort of Mac OS X, Core
Foundation-based API, you give it a dictionary of all
of the properties you want for
the IOSurface and away it goes.
In this case I'm cheating a little bit and using
toll-free bridging because it's a lot less code.
In this case, I'm just going to create a
simple 256 by 256 IOSurface, 4-bytes per pixel,
and that's pretty much all I need to
specify as far as IOSurface is concerned.
IOSurfaces do not have an intrinsic
format associated with them.
You can give it a pixel format identifier, like you know,
BGRA or any of the sort of QuickDraw-style FourCC codes
or NV12 or anything like that, but IOSurface really doesn't care.
The only reason that's there at all is just so that
two processes can sort of pick something to agree upon.
Now from the OpenGL side, all I really have to do is
basically generate a new texture object and call this,
you know, Mac OS X specific API,
CGLTexImageIOSurface2D, it's kind of a mouthful,
and that will take that currently bound texture object
and bind it to the backing store of that IOSurface.
Okay in this case, I'm telling OpenGL I want the internal
format of this texture treated as RGBA, that's 256 by 256,
and I want OpenGL to look at that
data as if it's BGRA, unsigned int
8_8_8_8 reversed, which is just your basic ARGB format.
And you give it the surface that's involved and, in this
case it's not a planar surface, so I just specify zero.
If the IOSurface had multiple planes, this
is where you would stick that argument.
I want to call out this, again, because
it's kind of an important point.
OpenGL is going to view that data in
the IOSurface, via these parameters.
It doesn't really matter what the
data is, or what format it is,
what you specify here is what OpenGL
is going to interpret that data as.
When we transfer it back and forth between host
memory and the GPU, there's no CPU touching,
there's no data formatting, it's straight copy up, straight
copy back, you know the GPU's might do hidden tiling
or that sort of thing in their local video memory, but
that's not exposed to the app developer in any way.
Now, the nice thing is, IOSurface follows the same
synchronization rules that we talked about earlier,
there isn't anything new to learn here,
they work exactly the same way, okay?
If you're going to take a texture on one context, and
modify it with the GPU and ship it over to another context,
you just have to do the Flush and you just
have to do the bind, behind the scenes,
IOSurface sort of works outside the
Share Group to figure out, in the system,
that hey, this data is not on the right card.
Now the neat part about this too, is that the two contexts
involved in this don't have to know about each other at all
and they don't have to even share the same renderers, this
is where, because this is integrated at such a low level
on the system, we can still get at a GPU
that your app doesn't even necessarily
know about to go pull the data off of it.
Now the other sort of neat thing about IOSurface is that
it lets you get direct access to that backing memory.
You know, for regular textures, if you're not using
the texture range extension and client storage
and all of that stuff, normally you can't get at the
sort of shared system memory copy of the textures.
With IOSurface, you can get direct access to it,
but you have to be careful about synchronization.
If you're going to write into an IOSurface directly
with the CPU and then consume it using the GPU,
you have to do what we are doing here, you have
to lock it, put your data in it, unlock it.
At that point, we realize that you've changed
the host memory copy and then you can go off
and use it with OpenGL and everything is good.
For the opposite case, where you're consuming some data using
the CPU after you've used a GPU to modify it, you, again,
you have to make sure that all of the
commands that may have been buffered
to that GPU, have been Flushed and are in flight.
If they're not, the kernel part of IOSurface has no way
of knowing how long it has to wait before it can DMA
that copy back to host memory so
that you can use it, so, again,
follow the same synchronization rules as you would before.
So, let's talk a little bit about some
performance tips when using IOSurface.
So, as I alluded to earlier, the automatic synchronization
that we do when data is on the wrong GPU, it's easy to use,
but it's not asynchronous, you know, if you've got one GPU
consuming some data, and you immediately need to use it
on the second one, there's a synchronization
point there that we just simply can't avoid,
and we don't want to give you bad data, so
we're going to go ahead and wait for the data
to be done before we pull it back to host memory.
One trick you can pull here, and this is a little
bit advanced, because you can force IOSurface
to page the data back to host memory by performing a lock,
one trick you might consider if you want to use IOSurface
for doing double buffering between different
GPU's, is you could produce some content,
and on that same thread immediately do a lock.
What that means is that the GPU is going to do all
of its work and then the first thing it's going
to do is page it back to host memory so it's ready to go.
Then you could go and do a second frame, do the same thing.
Get a couple of those frames going like that,
buffered in host memory, then fire up the second GPU
and start consuming the data, that way you get a nice
overlap, you know if you go ahead and bind to the IOSurface
and the host memory copy is already
up to date on a downstream GPU,
you won't pay any synchronization penalty for that.
All right, and again, this gives you really good
tight control over exactly when that DMA happens.
And again, earlier, I talked, said there's no
CPU copies, that's true in this case as well,
so you're not going to pay any extra CPU overhead, other
than the wait, for getting the data from one GPU to another.
Another really neat trick you can do is, remember
in the slide before I showed you that, you know,
OpenGL is going to, you know, view
that texture based on the format and type.
Well, one thing you might want to do, for whatever reason,
is say you've got, you know, a luminance plane in a video,
and you want to do something with all the
luminance channels, like run some kind of filter
or something interesting like that, you
know, change a gamma setting, something,
what you could do is basically lie to OpenGL
and say, you know what, you know, 1920 by,
or let's do something that I can do in my head, a 640 by
480 video frame, it's really 160 by 480 RGBA,
even though it's really luminance, now
I can basically process four pixels
at the same time in my shader instead of one.
So, the neat thing about this is that you can have different
textures all pointing at the same piece of video memory,
viewing it as different pixel formats, which again,
sort of cool trick for doing image processing stuff.
Now if you're going to do that trick, the total data sizes
have to match, you know, if you say, you know,
it was 640 by 480 at four bytes per pixel, whatever width
times the sort of bytes per pixel OpenGL is going to use works
out to, it's going to have to work out
to that same amount, or things will fail.
So what are some sort of cool examples for using
IOSurface and how does it apply to the Multi-GPU stuff?
Well, say you're doing plug-ins, you know, you're an
application developer and you want to support plug-ins
and you're really having this quandary about, well,
do I make this CPU based, or do I make it GPU based,
and if we're going to make it GPU based, how do
we tell them what renderer to use and Oh my God,
this is really complicated, what do I do?
If you just say, here's an IOSurface, go modify it and
hand it back to me, we'll abstract everything for you.
You could be looking at it with a CPU, they'll look at it
with a GPU, they do their thing, they ship it back to you.
Or, if you're really lucky, they're
going to use the same GPU you are,
and there's not going to be any copies
back and forth, so that's pretty cool.
Another really cool thing to use IOSurface
for is Client Server applications.
Because we can pass stuff back and forth across process
boundaries so cheaply, even keeping them on the same GPU,
this is just really good if you need to do like a render
server type operation, we use this internally in Mac OS X
in a couple of situations as well, just to,
you know, do things in sort of a secure manner.
And again, even in the Client Server
situation any resources that are up on the GPU,
will stay there if the downstream sort of
client process is using those exact same GPU's,
so again, there won't be any copies involved.
Now, probably the coolest thing
you could do is combine both.
You could actually run your plug-in, in a different
address space, on a different host architecture,
and even on a different GPU, and it would all still just
work, you know it's a really nice case to say, you know,
as an app developer, "Hey, my plug-in guy
crashes, it's not going to take down my app,
I don't have to care if he's using a CPU to do his work,
I don't care if he's using the GPU,
everything is all pretty cool."
So, let me give you a real quick demo.
Okay, so this app here, this is the server,
he's just generating Atlantis Frames and waiting
for clients to come and check in with him.
The client checks in with the server, and then the
server basically starts sending these Atlantis Frames
over to this other application.
Now in this case, they're both on the same GPU, there
shouldn't be any transfers going back and forth.
But I can say, "Hey, server, start using
the hardware, the other hardware renderer."
Now the system is just automatically still just shipping
the frames across GPU's to the other application.
Now again, you can't really see any visual difference.
I can even force this guy to use the software renderer,
and it starts writing into the IOSurface
directly and the client is still just going.
Now, I wrote this server to actually support multiple
clients simultaneously, so I can make a duplicate
of this client, start it up, and just to be
interesting, I'll force it to run 32-bit, okay,
and now the server's running 64-bit in this case.
So I can launch another copy of it and now he's
running, you know the ponies are in different positions,
but they're basically on two different, or actually in
this case let's even make it more interesting, so now,
the software renderer is writing into the IOSurface, it
has no idea what GPU either of the two clients are using.
The clients, in this case, each one
of them is using a different GPU
and one of them is even a different architecture
than the first, and it all still just works.
I think that's pretty cool, I don't know about you guys.
[ Applause ]
So the code for this, you can check out the sample, it's
really not all that complicated, let's see, here's where,
you know, I set up a little pool of IOSurface buffers and
I am going to do frame rendering in here, I'm using two
or four, something like that, a bunch of Mach port goo to
get the stuff between the two guys, but for the most part,
the server really doesn't have
to do too much complicated stuff.
When it starts up, I set up a texture and an FBO with a
depth buffer all together so I can render into an IOSurface,
I set it up so that if I, you know, had to stretch the
IOSurface I get linear filtering, that sort of thing.
Then I have a little routine that lets me
render that IOSurface, not that complicated,
I bind the FBO that's attached to it, do all the Atlantis
fun stuff, and then bind back into the system drawable,
then I turn around and, just so
you can visualize what I just drew,
it just draws a copy of that IOSurface back into the window.
Now the client, he knows even less about what's going on.
For the most part, where is it, so again, he doesn't have
to set up an FBO, he's just rendering from the IOSurface,
it's not a big deal, so I just set up
a texture, turn on linear filtering,
clamp to edge, a few things like that and go.
I modified the blue pony code a little bit, so that I
could pass in a texture name, and a set of dimensions,
but that's pretty much all I had to do, and now I
can have this guy rendering stuff that was generated
on a completely different GPU, in a different
process entirely, and everything just works.
Okay, all right, so in summary, please support systems
with multiple GPU's whenever possible for you guys.
They're becoming more and more common, they probably,
they're not going to go away anytime
soon and your users will be happier.
Again, if it's advantageous to your app and
you can get a performance win out of it,
please try and take advantage of
multiple GPU's when they are available as well.
Again, the person who spent money on his big
scary Mac Pro is going to be very happy with you.
I know I wish more apps supported it on my system,
so please take advantage of it if it helps you.
And lastly, you know, if you need to use,
if you're in one of these tricky situations
where you can't use OpenGLContext sharing, or you need to
be a different process, or you don't want to have to care
about what GPU you're on and you want to be using multiple
GPU's, IOSurface is a great tool to help let you do that.
And read the sample code, you know,
the samples are not particularly complicated,
the whole idea of IOSurface is it's
not this insanely complicated API.
It is kind of a big API when you look at the header file,
but don't let it seem too daunting,
it's really not that big a deal.
So, for more information, please contact Allan Schaffer,
he's our Graphics and Game Technology Evangelist at Apple,
or check out our Apple Developer forum, you can ask
questions in there and hopefully we can get back to you.
Related sessions, unfortunately, have all happened
before this, but please use these for reference,
and go and check them out, there's, you know, more
details on OpenCL and how to do the sharing in that case,
with the previous session to this, some
performance, cool performance stuff for Mac OS X.
And with that, thank you very much.
[Applause]