Transcript
[ Music ]
>> Good morning and
welcome to the session.
My name is Abe Stephens,
and I'm an engineer
in the GPU Frameworks Team
at Apple, and this morning,
I'm going to tell you
about how to take advantage
of the New Mac Pro Workstation,
which is a really
exciting platform to work
with for graphics and
compute applications.
We're going to talk a
little bit about the Mac Pro
and the hardware
that it contains
and the hardware
that's available
for you as a programmer.
Then we're going to take a
look at some of the graphics
and general purpose GPU
compute APIs that you'll use
to program on this computer.
And then, at the end of the
talk, we're going to take a look
at some common patterns
that you might follow
when you're writing an
application to take advantage
of the hardware that's
available in this configuration.
So let's take a look at the Mac Pro.
And as you can see here, this is the Mac Pro tower and the new desktop. The tower is about eight times larger than the new workstation, and only about four times heavier.
And so there's a lot
of hardware packed
into a very small package.
And if you're an application developer, this means that someone can have a very powerful workstation sitting on their desk - one that can run a pretty high-performance application - in a much smaller form factor than the tower system.
If we take a look at this
computer in some detail,
probably the most exciting thing
as a graphics programmer is
that the Mac Pro has two
GPUs in every configuration.
And these are two
identical devices.
So, in the tower Mac Pro, a customer could configure the system to have multiple GPUs, and there were actually cases in that system where you might end up with two different GPUs from different vendors.
In the new Mac Pro, your application will always have two identical GPUs available to use.
And as we'll see in a little while, that's an advantage, because it means that your application doesn't have to have a lot of logic to query and find out whether the capabilities of the available GPUs are different. You can write to a single platform.
And then there are some very
specific things that you can do
to distinguish the two devices.
In this configuration, one of
the GPUs is directly connected
to the display hardware,
and the other GPU isn't.
And so there's some
specific things
that your application
can do that you can write
into your application to
take advantage of that
and to make sure that when you
are sending work, graphics work,
or compute work to
one of the GPUs,
you know which one is
connected to the display.
So inside this configuration,
the GPUs have
about two thousand
stream processors.
That's important if
you're working with OpenCL
to do general purpose
GPU compute work.
The configuration that
we'll be looking at today
when we show a demo later on
has six gigabytes of memory
and 3.5 teraflops peak.
And so it's a very
capable package
for graphics and
for GPU compute.
So now I'd like to explain some
of the APIs that you can use
to program this configuration.
If you've worked with Apple graphics or compute before, these will be familiar to you. You've probably used OpenCL and OpenGL on, say, an ordinary laptop or a single-GPU system in the past.
Maybe you've even worked with a
tower Mac Pro and done multi-GPU
or multi-display programming.
What I'm going to talk
about in this session is,
the parts of those APIs that
you have to pay attention to
and that you have to
use when you are setting
up an application to use
this Mac Pro configuration,
because it's a little
bit different than some
of the other configurations
that have been available
in the Mac platform in the past.
Okay, so let's take a look
at the Software Stack.
So on top of the Software
Stack, is your application.
So this is your code
that you've implemented
and this might be a
graphics application
that does some type
of 3D rendering.
It could be an application that's doing, say, image processing or video processing using OpenCL for GPU compute.
Maybe your application
does a little bit of OpenCL
and a little bit of OpenGL to
fully take advantage of the GPU.
Anyway, the first sort of
level of GPU programming
that you might have in
your app is some Cocoa code
that is using an
NSOpenGLView or a CAOpenGLLayer.
And I'm going to show you
how to configure this level
of the software to correctly set
up or to set up your application
to take advantage of the
two GPUs in the Mac Pro
in the most efficient way.
And then the next level on this stack is a number of lower-level APIs.
The CGL API is something
that you may be familiar
with in the Mac from
other systems.
It's a lower level API that you
can use to configure the GPU
and to find out information
about the displays
and the hardware
that's in the system.
Then there are a few other programming APIs that you may be quite familiar with - OpenCL and OpenGL - which are the APIs in which you'll end up writing a lot of the code that dispatches work to the GPU, and also the kernels and shaders that are executed on the GPU itself.
And then of course,
underneath OpenCL and OpenGL,
there's some graphics drivers.
And these graphics
drivers are handling things
like allocating memory on
the devices and moving memory
between the host and the device.
And we'll look at certain parts of the driver that end up performing some of this data movement for us, and we'll try to understand exactly how that impacts the way we design applications.
So now I'm going to
talk about OpenCL
and OpenGL in some detail.
And most of the programming that
we'll see in a little while,
is going to focus on
this level of the stack.
The CGL, OpenCL and
OpenGL layer.
So, on the Mac Pro,
we support OpenGL.
It's our accelerated
3D rendering API.
And we support OpenGL 4.1
Core Profile and the shaders
that you write that are
executed on the GPU are written
in a language called GLSL.
And we support version 4.10.
And, OpenCL is the data
parallel programming API
that the Mac Pro supports.
The version that is supported is OpenCL 1.2 with a number of extensions.
So the Mac Pro has a pretty advanced pair of AMD graphics cards.
And there are a couple
extensions that are supported,
so on the Mac Pro you
can use double precision.
And there's actually an
extension that allows you
to set the priority of
the OpenCL command queues
that are used when you
enqueue work onto the GPU.
And so for example,
on this configuration,
you can set up a work or
an OpenCL command queue
for background priority work.
Say, for example, if you
are performing an operation
that isn't related to the
GUI and can be performed
at a lower priority, maybe
you're applying some type
of final render to an image
processing application,
you can send that work off
to a lower priority queue.
Then, if higher-priority work comes to the GPU, the lower-priority work won't interfere with, say, rendering work that's being used to display the GUI in the system.
And so, if you take advantage of
these extensions, you can work
around some of the
challenges that we'll talk
about in a little while.
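As a rough sketch of what a background-priority queue might look like - note that the exact extension symbols used below (`clCreateCommandQueueWithPropertiesAPPLE`, `CL_QUEUE_PRIORITY_APPLE`, `CL_QUEUE_PRIORITY_BACKGROUND_APPLE`) are assumptions here, so verify them against the OpenCL headers on your SDK:

```c
// Sketch: creating a background-priority OpenCL command queue on the Mac.
// The extension symbol names below are assumptions - check <OpenCL/opencl.h>
// on your system before relying on them.
#include <OpenCL/opencl.h>

cl_command_queue make_background_queue(cl_context ctx, cl_device_id dev)
{
    cl_int err = CL_SUCCESS;
    cl_queue_properties_APPLE props[] = {
        CL_QUEUE_PRIORITY_APPLE, CL_QUEUE_PRIORITY_BACKGROUND_APPLE,
        0
    };
    // Work enqueued on this queue yields to higher-priority queues, so it
    // won't interfere with GUI rendering happening on the same GPU.
    cl_command_queue q =
        clCreateCommandQueueWithPropertiesAPPLE(ctx, dev, props, &err);
    return (err == CL_SUCCESS) ? q : NULL;
}
```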
Okay, so when you have this computer and the GPUs that are in it, there are a couple of different things you can do to take advantage of the second device. One thing you can do is take the OpenCL compute portion of your application and simply run it on the second device, and let the primary GPU continue to be used for GUI rendering and maybe OpenGL 3D graphics. And this is actually a relatively easy type of operation to perform: you can simply offload compute work to the second GPU.
This is kind of similar to taking an application where you were running most of your compute work on the CPU and deciding to move some of that work to a GPU with OpenCL - we'd perform the same operation. We'd take that compute work and just move it to the secondary GPU.
Another common design or another
common task that you might use
for the Mac Pro is to
perform off-screen rendering
on the second GPU.
So, if you were going to do some type of OpenGL work, and the results of that work don't necessarily have to be displayed every single frame, that's a great candidate for moving to the secondary GPU as well.
And we'll take a
look at an example
of that in a little while.
So let me tell you about how
to set up your application
to use this configuration and
to take advantage of both GPUs.
And it's important to know that,
if you have an application
that's running on a Mac
and using OpenCL and
OpenGL, it will run just fine
on the new Mac Pro, but
there are a couple things
that you can do to make sure
that it's using one
GPU or the other.
And also a couple things
that you can do to make sure
that if it's possible, you
can divide your application
into pieces and run one piece
of the work on the first GPU
and the other piece of
the work on the second GPU
with a small number of changes.
So to get started with modifying the application, there are a number of steps that you can go through - four steps. The first is creating the context that you're using - either the OpenCL context or the OpenGL context - in a specific way.
And if you follow
this procedure,
the runtime in the system
will be able to do a lot
of the tasks associated
with moving data
between the two devices and the host automatically.
Then, your application should identify which device is the primary GPU and which is the secondary GPU.
Then once you've
figured that out,
you can dispatch work
in a particular way.
And I'll show you
how to do that.
The way that you select
the GPU and send work
to it is a little bit
different in OpenCL and OpenGL.
And then after you've done
that, you can synchronize data
between the two devices
or offload data
from the secondary GPU to
the host's main memory.
So, let's take a look
now at context creation.
And it turns out that this process is similar in OpenCL and OpenGL, but there's some specific terminology that I'd like to go over that I think will make the process a little bit more clear.
So, before, we showed the software stack; now we're going to concentrate on the graphics APIs in the stack.
So, OpenGL is our graphics
API and, like I said before,
CGL is this API that we use.
It's a Mac platform API.
And we use it to set
up our OpenGL context
and to select devices in the system and get some information about the hardware that's in the system - the devices and the renderers.
And we also have OpenCL, and it turns out that a lot of the operations we'd perform in CGL - learning about displays and devices - are actually included in the OpenCL API.
And so as I walk through this,
I'm going to show you an example
of how to perform an operation
using the graphics APIs,
OpenGL and CGL.
And then I'll also show you how
to perform the same
operation using OpenCL.
And in most cases,
we're performing very
similar tasks or operations.
We're just using two
different APIs to do that.
Okay, so the first
thing to think
about in OpenGL is the notion
of a piece of hardware.
And in OpenGL, there
are a certain number
of renderers in the system.
And each renderer is
assigned a render ID number.
So for example, in the Mac Pro,
there's going to be one renderer
for each of the GPUs, and then
there will also be a software
fallback renderer that's
available on all Macs
that the system can use if, for example, you were on a configuration that didn't have hardware support for a certain feature - the system might fall back to this software renderer.
And so, if you look at the renderer IDs in the system, you'd see two renderer IDs for the two discrete GPUs, and then one for the software renderer.
So when you start to
set up an OpenGL context
and an OpenGL application, you
have to figure out how to select
between all of these
different renderers
and these different
render IDs in the system.
And in OpenGL, you do that
by putting together a list
of attributes.
These are called pixel
format attributes.
And these are things like, does
the renderer you're looking
for support double buffering?
Does it support a
certain color format
or a certain depth format?
Does it support Core profile
or legacy profile OpenGL?
Anyway, you put together
this list of attributes
and one important attribute for
setting up an OpenGL application
on the new Mac Pro is the
offline renderers attribute.
Now, the Mac Pro
has two GPUs and one
of those GPUs is
always connected
to the display hardware
that's in the Mac Pro.
The other GPU isn't.
And the terminology for
this on the Mac platform is
that the display connected
GPU is considered "online",
and the GPU that's not connected
to the display is
considered "offline".
Now, it turns out that both GPUs are powered up and both can perform rendering and compute operations,
but the terminology is that
the one that's connected
to the display hardware
is "online",
and the other one's "offline".
So when we put together a pixel format attribute list, we add an attribute that says that we want to include offline renderers.
And then we send that into the system and call the Choose Pixel Format API routine - we'll look at what that looks like in a second.
We're going to get back
a list of the renderers
of all the renderers
in the system.
In this case, we're going to
get both GPUs - the online GPU
and the offline GPU and
then the software renderer.
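In CGL terms, a minimal sketch of that attribute list might look like the following (a sketch with error handling omitted; `kCGLOGLPVersion_GL4_Core` assumes an OS X 10.9 or later SDK):

```c
// Sketch: choosing a pixel format that includes offline renderers via CGL,
// so the resulting context spans both GPUs on the Mac Pro.
#include <OpenGL/OpenGL.h>

CGLContextObj create_context_with_all_gpus(void)
{
    CGLPixelFormatAttribute attribs[] = {
        kCGLPFAOpenGLProfile, (CGLPixelFormatAttribute)kCGLOGLPVersion_GL4_Core,
        kCGLPFAAllowOfflineRenderers,     // the key attribute for the Mac Pro
        (CGLPixelFormatAttribute)0
    };
    CGLPixelFormatObj pix = NULL;
    GLint npix = 0;
    CGLChoosePixelFormat(attribs, &pix, &npix);  // matches online GPU,
                                                 // offline GPU, and software

    CGLContextObj ctx = NULL;
    CGLCreateContext(pix, NULL, &ctx);           // context points at all of them
    CGLDestroyPixelFormat(pix);
    return ctx;
}
```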
Now, the next step in this
context creation process is
to actually create the context.
And this is really just
a container that points
to these renderers and
associates state with them.
So there's a lot of
state in the OpenGL API
and that state is associated
with this context object.
Now, once we have a context,
we need a way for that context
to refer back to the renderers.
And this is actually
a difference
between OpenCL and OpenGL.
In OpenGL, the context assigns
virtual screen numbers to each
of the renderers that
are in the system.
And so here we have a
context with three renderers,
and we have virtual screen
numbers, zero through two.
Now, the last piece of
terminology I want to go
over in OpenGL before
we start talking
about OpenCL, is
the share group.
If you have a context,
now remember,
a context is that
state container.
If you have a context, and
you want to set some state
and create some objects
and then maybe you decide
that you need another thread
to do some OpenGL work
and maybe it's going to have
slightly different state,
you can create another
context that has the same set
of renderers in it
and can share objects,
say, buffers and textures.
And you would obtain a share
group from the first context
and use that to create
or to communicate
with a second context.
So a share group in
OpenGL terminology
and CGL terminology
is this entity
that lets you bridge
two OpenGL contexts.
And that's important because,
as we'll see in a second,
we can also use that share
group to communicate its objects
and state between
OpenGL and OpenCL.
Okay, so let's look at what
the equivalent operations
and components are
of the OpenCL API.
Now, OpenCL has device IDs which
are kind of like renderers.
A CL context object, that's
a lot like that GL context,
and then a command queue, which
as it turns out is a little bit
like the virtual screen.
We use it in a similar way.
The API is presented in a
slightly different fashion,
but these operations
are very similar.
Now, our CL context is a lot like a GL context.
It turns out, when we
set up our CL context,
we actually are going to set
it up using, in some cases,
a share group that we
obtain from a GL context.
And I'll show you what that
looks like in just a moment.
Okay, so now that I've described the terminology - and remember, there's a lot of terminology in OpenGL and some in OpenCL, but the two APIs are performing very similar operations.
Let's take a look at the API that you have to use when you start to set up your application for working on the Mac Pro with OpenCL and OpenGL.
So, let's say that we're
in an application that's using
an NSOpenGLView in Cocoa.
Now, I'm going to create
an NSOpenGLView and I'd
like to use the Core profile.
So I'd like to use the
newest features in OpenGL.
And so I'd like to make sure
that I get a Core
profile OpenGL context.
And in order to do that
with an NSOpenGLView,
I have to implement my
own NSOpenGLView class
that is derived from
the Cocoa base class.
And then I would
implement my own version
of the initWithFrame
method and in that function,
I'm going to set up my pixel format attribute list.
And as you can see here, at the top, I included the Core profile attribute, and then at the very bottom - and this is the important piece for the Mac Pro - I also said that I wanted to allow offline renderers.
And now when I create a GL context using this pixel format attribute list by passing it up to the super class, I'll get a context object initialized that has all
of the devices in the system.
And that's really
the important part.
Okay, let's see how
to do that in OpenCL.
Well, suppose you're in an ordinary OpenCL application - one that is just going to do some OpenCL programming, with no sharing between OpenCL and OpenGL.
The easiest way to get a context
that has all the GPU devices
in it, is just to create
a context with a type.
So here, I'm calling clCreateContextFromType and asking for CL_DEVICE_TYPE_GPU.
That's going to give me a
CL context that contains all
of the GPUs in the system.
On the Mac Pro, that means that I'm going to get a context that has two device IDs, one for each of the discrete GPUs.
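A minimal sketch of that call (error handling trimmed; the device-array size here is an arbitrary illustrative choice):

```c
// Sketch: creating an OpenCL context containing every GPU in the system.
#include <OpenCL/opencl.h>

cl_context create_gpu_context(void)
{
    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU,
                                             NULL, NULL, &err);
    if (err != CL_SUCCESS) return NULL;

    // On the new Mac Pro this should report two identical device IDs,
    // one per discrete GPU.
    cl_device_id devices[8];
    size_t bytes = 0;
    clGetContextInfo(ctx, CL_CONTEXT_DEVICES, sizeof(devices), devices, &bytes);
    return ctx;
}
```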
Now, if we were in an
application that was going
to do some OpenCL
and some OpenGL,
and those operations were going
to interact with each other,
I'd want to create a context
in a slightly different way.
So here, I've already set up my NSOpenGLView on the previous slide.
And then I would obtain
the context object
from that NSOpenGLView.
And then you can see here,
I'm using the CGL API
to get the share group that is
associated with that GL context.
Then I take the share group,
remember the share group is
that entity that we use to
create a pair of contexts
that operate on the same objects
and use the same devices.
I use that share group
now with clCreateContext
and another property list
to create a CL context
that contains the
same devices that were
in that original GL context.
And now I'm going to end up with my CL context here, C, that contains all the GPUs in the Mac Pro.
And that's really
the important part.
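The share-group path just described might be sketched like this (error handling trimmed):

```c
// Sketch: building a CL context from a GL context's share group, so both
// APIs operate on the same devices and can share objects.
#include <OpenCL/opencl.h>
#include <OpenGL/OpenGL.h>

cl_context create_shared_cl_context(CGLContextObj glContext)
{
    // The share group is the entity that bridges the two APIs.
    CGLShareGroupObj shareGroup = CGLGetShareGroup(glContext);

    cl_context_properties props[] = {
        CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE,
        (cl_context_properties)shareGroup,
        0
    };
    // With the share-group property set, no explicit device list is needed:
    // the CL context picks up the devices from the GL context.
    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(props, 0, NULL, NULL, NULL, &err);
    return (err == CL_SUCCESS) ? ctx : NULL;
}
```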
It's very important to always create a context - in either API - that contains all the devices in the system. In this case, I've created a GL context and then a CL context.
Okay, so now that
I've done this,
now that I have this very
versatile and flexible context,
the next step is to take a
look inside it and figure
out which device
corresponds to the primary GPU
and which device corresponds
to the secondary GPU.
And that's important because, if I'm doing a task whose results are going to be displayed on the screen, it might make sense to use the primary GPU, because it's directly connected to the display hardware.
And then maybe if I have a task
that isn't related to the GUI
or isn't related to the display,
I might want to always send
that task to the secondary GPU.
So if I look at the OpenCL
API and the OpenGL API,
it turns out there are a
number of different queries
that I can make, but
since the two devices
in the Mac Pro are identical,
all of those queries,
all those CL device
info queries,
are going to return
exactly the same information
for both devices.
In order to distinguish
the two devices,
we have to do something
different.
So what we're trying to do here is figure out which GPU is the online one - that's the primary GPU - and which GPU is the offline one - that's the secondary GPU.
And then we're going
to try to figure
out what its virtual
screen number is
if we're doing OpenGL work, or
what the CL device ID is for it
if we're doing OpenCL work.
Okay, so let's walk
through some code here.
This is the process that
you go through to decide
which GPU is the primary
GPU or the secondary GPU.
In this particular example,
I'm going to be looking
for the secondary GPU.
So I'm going to go through
a bunch of steps here
where I issue some queries
against the system to figure
out which GPU is
the offline GPU.
So the first thing that I do
is I iterate the renderers
in the system.
I obtain these using this
CGLQueryRendererInfo call.
I iterate over all of the
renderers and I ask the system,
"Is the renderer
online or offline?"
So once I get past this step, I'll know whether I have the GPU that's connected to the display - the online one - or the one that's offline.
Of course, as I mentioned
earlier,
there are some other
renderers in the system.
There's the software renderer
and I have to make sure
that I am able to distinguish
between the offline GPU
and the software
renderer and so,
we'll do that in just a second.
So, if I find the GPU that's
offline, I then check to see
if it supports accelerated
compute.
This is basically saying,
does it support OpenCL?
And now, in the Mac, the OpenCL
API actually does have a CPU
device, but it's presented
to the system differently
than the software renderer.
Those are two separate
entities within the system.
And so if I had iterated over to the renderer ID for the software renderer, it wouldn't match this accelerated-compute query, and I'd be able to distinguish between them by making this check.
And then, if I get
past that step,
I'm going to issue another query
here using CGLDescribeRenderer
and I'm going to ask
for the renderer ID.
So I started by getting
a renderer info object.
I then walked over
all of the renderers
that were in the object.
And then filtered them using
a number of other queries,
and eventually ended up querying
them for their render ID number.
And in this case, I found the secondary GPU's renderer ID, and I'm going to store that value.
And we'll use that
in a little while.
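The query sequence just described might be sketched like this (`kCGLRPAcceleratedCompute` assumes a Yosemite-era SDK; error handling trimmed):

```c
// Sketch: finding the offline (secondary) GPU's renderer ID via CGL.
#include <OpenGL/OpenGL.h>

GLint find_secondary_renderer_id(void)
{
    CGLRendererInfoObj info = NULL;
    GLint nRenderers = 0, secondaryID = 0;
    // 0xFFFFFFFF: consider renderers for all displays.
    CGLQueryRendererInfo(0xFFFFFFFF, &info, &nRenderers);

    for (GLint i = 0; i < nRenderers; i++) {
        GLint online = 0, compute = 0;
        CGLDescribeRenderer(info, i, kCGLRPOnline, &online);
        CGLDescribeRenderer(info, i, kCGLRPAcceleratedCompute, &compute);
        // Offline plus accelerated compute rules out the software renderer.
        if (!online && compute) {
            CGLDescribeRenderer(info, i, kCGLRPRendererID, &secondaryID);
        }
    }
    CGLDestroyRendererInfo(info);
    return secondaryID;
}
```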
So now I have a renderer ID
but in order to actually select
or send work to a
GPU in the system,
I need to know its virtual
screen number because,
if you recall, the virtual
screen number is how the context
refers to the different
renderers that it contains.
And so I actually have to have a context in order to have virtual screen numbers. So I'll get a context from my NSOpenGLView, and then for each virtual screen in the context, I'll check its number and also its renderer ID.
So here I am, getting the number
of virtual screens
that are available.
And then the next step is to
walk over those virtual screens,
make them current, and then ask
for the renderer ID associated
with each virtual screen.
So I've iterated over
all the virtual screens,
gotten their render IDs
and then matched those
with the renderer ID
that I'm looking for,
and that tells me the virtual
screen number that corresponds
to that particular GPU.
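That matching loop might be sketched like this (a sketch; it assumes you still have the pixel format object around, and it uses `kCGLRendererIDMatchingMask` so the comparison ignores revision bits in the renderer ID):

```c
// Sketch: mapping a renderer ID to its virtual screen number in a context.
#include <OpenGL/OpenGL.h>

GLint virtual_screen_for_renderer(CGLContextObj ctx, CGLPixelFormatObj pix,
                                  GLint targetRendererID)
{
    GLint nScreens = 0, match = -1;
    CGLDescribePixelFormat(pix, 0, kCGLPFAVirtualScreenCount, &nScreens);

    for (GLint vs = 0; vs < nScreens; vs++) {
        GLint rendererID = 0;
        CGLSetVirtualScreen(ctx, vs);    // make this virtual screen current
        CGLGetParameter(ctx, kCGLCPCurrentRendererID, &rendererID);
        if ((rendererID & kCGLRendererIDMatchingMask) ==
            (targetRendererID & kCGLRendererIDMatchingMask)) {
            match = vs;                  // found the screen for this GPU
        }
    }
    return match;   // -1 if the renderer isn't in this context
}
```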
Okay, so it's important
to always check virtual
screen numbers.
So in the example that we
just looked at, when I walked
through that and actually
executed that code,
it turned out that the primary
GPU was actually virtual
screen one.
And so if I had just assumed that virtual screen zero would be the primary - because primary comes before secondary -
I would have been wrong
and I might have ended
up rendering all of my work
on say the secondary GPU
instead of the primary GPU.
And so the Mac is very flexible - it can actually handle this case.
It's just not as efficient as,
say, rendering all that 3D work
to the primary GPU and then
displaying it immediately.
Okay, so that's how
you do it in OpenGL.
Let's take a look at how to do the same set of operations in OpenCL.
So, in OpenCL, I would have started with the CGL API and gone through the process of figuring out which renderer ID in the system is the secondary GPU - and if I had flipped that process around, I could have determined which one was the primary GPU.
Now I have to go from a
renderer ID that I obtained
from CGL to a CL device ID.
And in Yosemite, there's an API that will convert a renderer ID directly to a CL device ID, and that function is CGLGetDeviceFromGLRenderer.
I pass in the renderer ID and
it gives me back a CL device ID.
Then I can use that CL device
ID to create a command queue
and dispatch work
directly to that GPU.
And so instead of having to
do a query for virtual screens
in the OpenCL API, I can
just create a command queue
and then use that command queue
to directly dispatch work,
in this case, to
that secondary GPU.
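Put together, that shortcut might look like this (a sketch; the `CGLDevice.h` include is where this call is declared on recent SDKs, so verify it on yours):

```c
// Sketch: turning a CGL renderer ID into an OpenCL command queue (Yosemite+).
#include <OpenCL/opencl.h>
#include <OpenGL/OpenGL.h>
#include <OpenGL/CGLDevice.h>

cl_command_queue queue_for_renderer(cl_context ctx, GLint rendererID)
{
    // CGLGetDeviceFromGLRenderer maps the renderer ID straight to a CL
    // device ID, so no virtual-screen query is needed on the OpenCL side.
    cl_device_id device = CGLGetDeviceFromGLRenderer(rendererID);
    if (device == NULL) return NULL;

    cl_int err = CL_SUCCESS;
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);
    return (err == CL_SUCCESS) ? q : NULL;
}
```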
Okay, so the next step is dispatching work.
So in OpenGL, the
context - the GL context -
refers to the renderer
or interacts
with the renderer via this
virtual screen number.
To set the virtual screen - we saw an example of this earlier - if I'm going to issue some draw calls to one of the devices, the first thing I have to do is make sure that the context that I created is the current context.
So I'll call
CGLSetCurrentContext.
And pass in the context that I'm interested in working with.
Then once that context is set,
I can set the virtual
screen number and,
like I said a couple slides
ago, it's really important
and I can't emphasize
this enough, to make sure
that you know which
virtual screen corresponds
to the primary GPU and the
secondary GPU as opposed
to just assuming
that the first one
or the second one is always
the primary or the secondary.
Anyway, I can call
CGLSetVirtualScreen and pass
in the number that I want.
And then issue my bind calls
and my draw calls in OpenGL.
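That sequence - current context, then virtual screen, then bind and draw - might be sketched like this (the `vao` and `vertexCount` parameters are hypothetical placeholders for whatever your app binds and draws):

```c
// Sketch: directing OpenGL draw calls at a particular GPU.
#include <OpenGL/OpenGL.h>
#include <OpenGL/gl3.h>

void draw_on_virtual_screen(CGLContextObj ctx, GLint virtualScreen,
                            GLuint vao, GLsizei vertexCount)
{
    CGLSetCurrentContext(ctx);                // make our context current first
    CGLSetVirtualScreen(ctx, virtualScreen);  // then pick the GPU

    glBindVertexArray(vao);                         // bind calls...
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);     // ...then draw calls
}
```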
In OpenCL, instead of having
to set a virtual screen,
I just use a command queue.
And so here, I'm
not setting state.
Instead what I'm doing is
I'm creating an object,
this queue object, and then
using that queue object
to enqueue work to
a particular device.
And so there are no
bind calls in OpenCL.
Here I'm just creating
a number of objects.
There was already a kernel here.
I set some arguments on it.
I have a command queue that I've
created based on the device ID
that I looked up using the
process that we just described.
And I can queue work
to that GPU.
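The OpenCL side of dispatch might be sketched like this (the kernel, buffer, and dimensions are placeholders for your own work):

```c
// Sketch: enqueueing a kernel on the queue we created for the chosen device.
#include <OpenCL/opencl.h>

cl_int run_kernel(cl_command_queue queue, cl_kernel kernel,
                  cl_mem buffer, size_t width, size_t height)
{
    // No bind calls in OpenCL - just set arguments on the kernel object...
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);

    // ...and enqueue it; the queue itself already names the target GPU.
    size_t global[2] = { width, height };
    return clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global,
                                  NULL, 0, NULL, NULL);
}
```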
Okay, now that I've created a
context that has two GPUs in it,
then identified the primary
GPU and the secondary GPU,
dispatched work using a virtual
screen or a CL command queue,
the last step is to get results
or to get the data off the GPU
that I've selected and to
use it in my application.
And of course, in an OpenGL application, the results might be displayed on the primary GPU.
In an OpenCL application, if
you were doing CL-GL sharing,
you might end up sending the
results from the secondary GPU
to the primary GPU in
order to render them,
or you might download the
results from the secondary GPU
to host memory if you
were working on something
that wasn't related to
rendering or to display.
And either of those techniques is possible.
So if you're in a CL-GL sharing
case, there are a couple things
that you have to do that
we'll look at in a second,
but the runtime is going to
do most of the work for you.
If you follow a specific procedure, after you've dispatched the work to one GPU,
when you start using that work
on the second GPU, the runtime
and the driver will take
care of moving the data
between the two devices.
And this is actually a great advantage, because it means
that we can very
easily take advantage
of using the secondary
GPU in our application,
but we have to follow
certain rules to make sure
that the system will behave
in an efficient manner
when we move data
between the two devices.
So let's take a look at what to do when we're switching work from one GPU to another.
So, let's say that we were going
to do some OpenGL work
on the secondary GPU.
So we'd call SetVirtualScreen
and we'd pass
in the secondary virtual screen.
And then we would bind some
objects, maybe some textures
that we're going to work
with, and do some drawing.
And then we would call
glFlushRenderAPPLE()
and that's going to cause all
of that GL work to be submitted
to the device, and it's going
to push all of that
work off to the GPU
and the GPU will
start working on it.
At some point in the
future, we're going to want
to use the results
that we had computed.
Maybe we are rendering
into an FBO or something.
We want to use those
results on the primary GPU.
And so, we're going to
call CGLSetVirtualScreen,
pass in the primary GPU's
virtual screen number,
and then start working with
the data on the primary GPU.
Now, in this case, there
is a single OpenGL context.
And because there was a
single OpenGL context,
it wasn't necessary for
me to change the state
or to re-bind the objects
that we are working with.
That state was already set.
I simply changed
the virtual screen
and continued using the
objects - the GL objects -
that I was working
with previously.
And that CGLSetVirtualScreen call
allows the runtime to realize
that I'm going to start
sending work to the other GPU.
And it will take care of
synchronizing the data
that I wrote on the
other device over to the device
that I'm going to start using
when I issue the next draw call.
Okay, so let's take
a look at what
that looks like in a schematic.
So I've taken some graphics work
and I've issued a
bunch of Draw calls.
On the secondary GPU, I
call glFlushRenderAPPLE()
and the runtime pushes all
of that work onto the device.
Now, if the runtime was going
to issue any more commands,
like for example, if the
runtime decided that it had
to issue a page-off command,
that page-off command would
be sitting behind all the work
that I'd previously
flushed to the device.
And that's exactly
what happens.
So when the primary GPU or
when the runtime detects
that CGLSetVirtualScreen call
going to the primary GPU,
it in turn will page off the data
from the secondary GPU,
after those previously flushed
commands have been executed
and then page it on
to the primary GPU
so that I can then execute my
draw calls and continue working
with the data on
the primary GPU.
So, the movement of the data
takes place automatically,
and as a programmer, I've made
sure that that data movement is
in the right order: it
takes place after the commands
that I used to create the data
by calling glFlushRenderAPPLE().
Okay, so in OpenCL, it's
a little bit different.
In OpenCL, we have
command queues instead
of virtual screens,
and we're going
to do something that's
very similar.
We're going to enqueue
work using a command queue
that we created for
the primary GPU
and then flush it
using just clFlush.
That will cause that device
to start working on
the commands that we enqueued.
And then when I enqueue
work to the secondary queue,
once that work gets to the
head of the command queue,
the system will execute a
similar page-off operation
that in this case is going to be
guaranteed to be behind the work
that was sent to
the primary queue.
And so we'll see a similar
type of behavior as we saw
in the GL case where I made sure
that that page-off would arrive
at the GPU, after it had already
started working on the producer
or the operations that
were producing the data.
So, on Mac, you may have heard of a
pattern called Flush and Bind.
And this is a pattern
of APIs that is used
in multiple GPU situations and
in instances where there is more
than one OpenGL context.
It's also used in a situation
where you have an OpenGL context
that's, say, producing the data
and an OpenCL context
that's consuming the data.
So, any instance where on Mac
you have two different contexts,
you have to use Flush and Bind.
And what that means is that
after you enqueue the work
that's doing the production,
producing the texture
or maybe some geometry,
you always have to
make sure that you flush it.
You flush that command queue
or you flush before
switching virtual screens.
And then after that, when
you switch to the other API
or to the other context,
you have to make sure
that you rebind any
objects that were modified,
in this case, by OpenCL.
So in the single instance
that we looked at before,
when we just had one
OpenGL context, we could flush
and then immediately use the
objects on the other device.
In an instance where there
are either two OpenGL contexts
or there's an OpenGL context
and an OpenCL context,
we have to use Flush and Bind.
We have to flush like we
did before, but then we have
to rebind those objects once
we switch to the other device.
So if you follow these steps,
the runtime will take care
of moving this data between the
two GPUs for you as you work
on the data in those
two different places.
And the reason that the
runtime's able to do this,
is that you've created a context
that contains all the devices
in the system, and so that the
runtime and the driver are able
to track enough state to perform
these operations for you.
And so it's very
important to emphasize
that when you create a
context, always create a context
that contains all the
devices in the system,
all of the GPU devices
in the system.
There are some other design
patterns that you might follow.
For example, you might be
tempted to create, say,
a set of objects: a
context, a command queue,
a whole stack of objects.
One stack per device
in the system.
But on Mac, really the
best thing to do is
to always create the context
to contain all of the devices
in the system, even
if you're, say,
on a different configuration,
only going to use
one of the devices.
If you do this, it
will be very easy
when you move your
application onto the Mac Pro
to start using two GPUs
because the application
and the structure of the
program has already been written
to handle a context that
contains both devices.
It makes it a lot easier
to migrate to the system
and allows the runtime
to move objects between
the two GPUs for you.
Okay, so now that I've
shown you the API that's involved
in programming for the Mac Pro,
I'd like to show you some
programming patterns.
And what I'm going to focus on
here is what the system does,
or what the system's
doing on your behalf,
when you perform
different tasks on the GPU.
So, what I'm going to start
with is an example of an offline
or an offload task where you
have some kind of operation
that isn't related to
display, and you're going
to perform this operation
on the secondary GPU.
And so, I call this an offline task.
And you might have an offline
task in your application
if it's something that you
apply once, in one discrete step.
So for example, you have an
image processing application
and the user goes to the Edit
menu and they select a filter
and they change some filter
parameters and click Apply,
that might be a great offline
task because you're not going
to perform the bulk of that, say,
OpenCL compute work until
the user clicks Apply.
Then you're going to perform
a large amount of work
on some input data,
on some giant image.
And then once you're done,
you're going to, say, save
that image off to main memory
or maybe you're going to save
that image off to disk.
And that operation is
a discrete operation.
It takes a long time.
If you were to run it
on say the main thread,
it might cause the GUI
to respond more slowly.
And it's something that's separate
from the main GUI
loop of the application.
So let's take a look
at what I'm talking about here.
So I have some OpenCL
work in green,
and this OpenCL work is
going to apply my operation.
And it might take a long time.
And then I'm going to first have
some OpenGL work that's going
on that's related to my GUI.
And my application may actually
be using OpenGL on the GPU
if I'm using
certain parts of the UI,
even if my application
itself doesn't use OpenGL.
Now, what this looks like is:
we might have a lot
of short or inexpensive OpenGL
operations being performed.
And then we have this
giant compute operation.
And of course, if I then have
some more GUI-related OpenGL
work coming through,
what's going to happen is,
I'm going to - you know,
my system's going to lag.
I might end up with some
sort of progress problem
or maybe I'll even
get a beachball,
if this OpenCL program
or this part
of my OpenCL application
takes too long.
And so what we're going to do is
we're really just going to take
that green box, the expensive
OpenCL compute operation,
and move it over to
the secondary GPU.
And if we've set
up our application
so that we have both
devices in our context,
and we followed the API that we
just described, it's very easy
to perform this offload task
and move an offline operation
off to the secondary GPU.
So here's what this looks like.
It's very straightforward.
I have an application here.
The user is going to go
into my Edit menu and
select Apply Effect.
I end up in an action here.
And I have a kernel that I'm
executing iteratively a large
number of times.
This makes the application
a little bit slower.
And I'm doing this right
now in the primary queue,
and all I'm going to do
is make sure that I've set
up the secondary queue and
just send that operation off
to the secondary queue.
And now, at some
point in the future,
after these operations are
finished, the existing code
in my application to move
the data off the GPU and back
to disk will just move that
data, those memory objects,
off the other GPU
instead of the primary GPU
that I was previously using.
Another pattern that you might
end up following is an instance
where you're going to perform
graphics work on both GPUs.
And once you've divided the
work, the rendering work,
between the two devices,
the window server actually
will take care of copying data
from the secondary GPU to
the primary GPU for display.
So I'm going to show
you what this looks like
and what happens here.
And actually in a second,
I'll show an example
where we perform these
kinds of operations
in an actual application.
So, here I have an application's
app thread, and it's going
to make the context current.
It will set the virtual
screen to the primary.
It's going to call the drawScene
method and that's going
to do a lot of OpenGL work.
And then I'm going to
call glFlushRenderAPPLE
and maybe I call flushBuffer
to put the work on the screen.
And now I'm going to do the
same thing on the secondary GPU
and also call flushBuffer.
In this example, I might
have two separate parts
of my application and I'm going
to render one part on one GPU
and the other part
on the other GPU.
Now, what happens here, if
I look at the operations
that are being performed,
at some point in time,
both of these GPUs are
going to get flush commands
and then flushBuffer commands.
And the window server is going
to wake up and it's going
to realize as it's getting
ready to composite the image
for the next frame, that some
of the data it needs is
on the secondary GPU.
And so it's going to actually
perform a very similar operation
to what we saw a little
while ago with the page-off.
It's going to realize that
the data's on the other device,
send a page-off request for it
to the device, move the
data back, then page it
on to the primary GPU
perform the composite,
and then display the image.
And so, there's a
period of time here
where the window server has
gotten involved for display
and it's going to end up
executing that page-off
and then the secondary GPU
is free to continue working
on more graphics work.
But the primary GPU is
going to have to end
up copying the data back on, and
then rendering the composite.
And so there's a certain
amount of overhead
that your application
has to be aware
of when you're performing
work on both devices.
But the system, if
you follow this API,
the system will handle
this for you
and your application will
be able to use both devices.
Okay, so the challenge here,
when we're taking an application
and modifying it to
use this configuration,
is really that we have to
divide our work somehow.
We might divide it by taking
a task that isn't related
to display and moving
it to the secondary GPU
or maybe we can parallelize the
work between the two devices
in such a way that the
overhead is not a problem.
And we have to always be aware
that there is one device that's
connected to the display,
and the other device that isn't.
And so that can be a
source of overhead,
especially if the data has
to be moved back and
forth very frequently.
So, I'd like to take
a second here and talk
about some other situations
that involve multiple
GPUs on our platform.
So probably the most
common multi-GPU situation is a
laptop that has two
GPUs: a discrete GPU
and an integrated GPU.
And it's important to
remember that although the APIs
that you use when you
modify an application
to support automatic graphics
switching are similar,
the way that you use those
APIs is a little bit different.
And so, if you are interested
in supporting both GPUs
in a laptop, be sure
to take a look
at the automatic graphics
switching feature.
Also, the tower Mac Pro
configuration is still available
or is still out there,
and your application
may run on it.
And it is something that can
support multiple displays
connected to multiple GPUs, and
there's a lot of infrastructure
in Mac graphics for
handling instances
where an application moves
from one GPU to another,
based on the display
that's connected to the GPU.
And that uses a different set
of APIs and callback mechanisms.
And so, if you're concerned
about that situation,
you should take a look
at that documentation.
So let's look at a complete
example of everything
that you have to do
in an application.
The first step of course
was to create a context
that contains all the GPUs.
The second is to check and see
if you have an offline device.
So if you're on a Mac Pro,
basically you should have
one device that's online
and one device that's offline.
Then check to see if you have
two identical devices - in this case,
I'm comparing device names.
And if I fail any of these
checks, it might mean that I'm
in one of those other
situations.
I might be in an instance
where I have a laptop
that has an integrated
and discrete GPU
that are different devices,
or maybe I'm in an instance
where I have a tower Mac Pro
and I have two displays connected
to two different GPUs.
Once I've determined that I'm
on a Mac Pro, a new Mac Pro,
then I have to be
concerned about dividing work
between the GPU and then
synchronizing the results.
So, if I fail that first check,
I'm going to go and take a look
at supporting multiple displays.
And if I - if the second
check doesn't work out,
then I might have
to be concerned
about automatic graphics
switching.
Okay, so let me show you
a multi-GPU example now.
And this demo is a
system that's performing some
OpenCL work and some
OpenGL work.
And we're performing the
OpenGL work on the primary GPU,
and the OpenCL work in the
demo mode that we'll see
in just a second, is
going to be performed
partially on the primary GPU and
then also on the secondary GPU.
So when I launch
this application,
this is performing
a physics simulation
where there's a particle system
that's being rendered in OpenGL
and it's being simulated
in OpenCL.
And right now, we have a
large number of particles.
We're performing the
physics simulation
and we're using one GPU.
And we're getting about
15 frames per second.
This is a relatively
fluid animation but it turns
out we can do much
better if we go over here
and reconfigure the
demo to use two GPUs.
And here you can
see that we're doing
about 30 frames per second,
maybe a little bit more
than 30 frames per second,
and we were able to accomplish
that by working through our
application and making sure
that we enable the application
to move data between
the two GPUs.
And then we found a way
of dividing the data
and the computation in
such a way that the speedup
that we obtained from
dividing our work and executing
with twice the amount
of GPU capability,
that speedup was a lot greater
than the overhead of having
to move a small amount
of data back
to the primary GPU for display.
So here, actually,
towards the end,
we're getting even faster
than the very beginning
of the simulation.
So this is a relatively
simple example.
Let's see if I can restart here.
And without a huge
amount of effort,
we're able to divide our
application into pieces
and then execute the
compute part on both devices
and obtain a significant
speedup.
[ Silence ]
Okay, so in the demo what
we saw was an application
that we had modified
to perform some of the work
on one GPU and a lot of the
work on the secondary GPU.
And we saw that that produced
a pretty significant speedup.
There are a lot of other
applications that can benefit
from working on this
configuration and I hope
that using this API
and understanding some
of the terminology and the way
that the system behaves will
help you port applications
to this configuration.
For more information about
using OpenCL and OpenGL
in the Mac Pro, please talk
to the WWDR representatives.
Thank you very much for
attending this session
and please let us know
how we can help you.
[ Applause ]
[ Silence ]