Transcript
>> My name is Ian Ollmann and I'm the first speaker
in a series that will talk to you about OpenCL.
OpenCL is a Snow Leopard technology; Snow Leopard was
just released a couple of months ago, this fall.
And, I wanted to give you an overview of what OpenCL is
all about before we get started on the more in-depth talks.
So, first I want to discuss OpenCL design and philosophy,
and then explore some common developer questions
that have emerged over the last year.
Some of you may have seen them on the developer forums.
And then, offer up a few debugging tips about
how we go about tuning our OpenCL kernels.
Now, since OpenCL has been out for a while, I'm going
to assume that many of you have already played with it.
Maybe you've downloaded some of the sample
code and looked at it, read through the spec,
maybe even tried to write a few applications of your own.
And so, what I wanted to do is provide an overview to kind
of tie it all back together, because there are a lot of APIs
and a lot of spec to read, and sometimes seeing
the forest for the trees is a little tricky.
So, when we set out to build OpenCL, what we wanted to do is
bring programmability to all of the devices on your system.
One API to bind them all; CPU, GPU.
If we had them, you could also program an accelerator.
And, this brings in new problems
into the programming paradigms.
Some of these devices may have their own memory attached
to them, which sit off in a separate address space
and do not fit together contiguously with
the RAM you're used to using on your system
in your ordinary C or C++ or Objective-C program.
Many of these devices run on a different instruction
set than the Intel instruction set that you're used to.
So, we have to overcome these problems when we build OpenCL.
So, OpenCL, in a nutshell, is sort of the
minimum set of objects that you would need
to encapsulate this information and make it all work.
We have an object which represents a device.
This might be a CPU or a GPU.
We have a context, which does not do very much;
it's just a sandbox to hold everything else.
This serves two purposes.
It allows you to keep the damage localized if something
goes horribly wrong, which we hope won't happen.
And also, if you're doing data sharing between OpenCL and
OpenGL, it acts as a counterpart to the CGL share group.
You'll want to put your data into something.
Because we have to copy the data up to a device
sometimes, we need to know how big your data is,
so a simple C pointer is not enough; we need an extent.
So, the object's designed to encapsulate your data.
There are two kinds.
There is a buffer, which is almost exactly
like what you'd get out of malloc .
When you call malloc, what you get is essentially a range
of bytes; what you put in it is up to you.
And then, there's the image data type, which is
useful for sampled data on a regularly spaced grid,
and images are designed to be used by the GPU texture unit,
which has hardware to make sampling
out of the image much faster.
Collectively, these things are called MemObjects.
You also want to write code, so OpenCL
provides a C-like programming language.
We essentially started with C99, and then we tagged on a
few extra features, like vectors and vector intrinsics,
and we stripped out a little bit of the C standard
library that didn't make any sense for GPUs.
And then, so you'll build a program against
multiple devices, and then you'll need some sort
of function pointer-like thing to go
find the functions within that program
or compilation unit, and these things are called kernels.
And, you can make many kernels for the same function if
you want, and that actually turns out to be quite useful.
Finally, we have to have some verbs in this
sentence, and those are the command queues.
You can queue commands into the queue to
make all of these objects start doing things.
They might be to copy data from one place to another.
They might be to run a program on some of your data.
And, that's basically it in a nutshell.
The important thing with the command queues though,
is that these are fully asynchronous queues.
So, when the device is finished with command one,
it's going to want to go straight to command two,
and if there isn't a command two, then it's going to
sit and be idle, and you're not taking full advantage
of the computational horsepower of your
entire system when they're sitting idle.
So, what you really want to do is enqueue
a pile of stuff into the command queues
and let it take off and run in asynchronous fashion.
As an example of how these things
work, here I have, at top left,
some data that you might have allocated,
called My Buffer or My Image.
And then, in your program somewhere; I
put it in Main, but it can be anywhere,
you'd start enqueueing commands
to then operate on that data.
We might start with an enqueue to write
a buffer, and what this does is copy data
from your buffer into OpenCL's counterpart.
You might call EnqueueWriteImage, which
will do the same thing for images.
And then, you can call EnqueueNDRangeKernel,
which will copy all of that data automatically
up to whatever device you enqueue it for.
The device will run your program, and something will happen.
You could have some results.
And then finally, you might enqueue a read buffer to copy
the data back to your thing, and proceed in that way.
The OpenCL API itself is very consistent.
We have about eight object types and they all have
the same set of functions that are used with them.
They have a creation function, which is used
to create the object; it returns the object
out one side and an error code out the other.
These things follow Core Foundation
reference-counting semantics,
so retain and release will
probably feel familiar.
When the reference count goes to zero, the object
is then destroyed in the background by OpenCL.
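That retain/release convention is easy to sketch in plain C. This toy object (all names invented here, not the OpenCL API) follows the same rules as clRetainContext, clReleaseContext and friends: creation hands you one reference, retain adds one, release drops one, and the object is destroyed when the count reaches zero.

```c
#include <stdlib.h>

/* Toy object with Core Foundation-style reference counting.
 * All names are invented for this sketch. */
struct toy_obj {
    int refcount;
    int *destroyed_flag;   /* lets the caller observe destruction */
};

static struct toy_obj *toy_create(int *destroyed_flag)
{
    struct toy_obj *o = malloc(sizeof *o);
    o->refcount = 1;                      /* creation gives you one reference */
    o->destroyed_flag = destroyed_flag;
    *destroyed_flag = 0;
    return o;
}

static void toy_retain(struct toy_obj *o) { o->refcount++; }

static void toy_release(struct toy_obj *o)
{
    if (--o->refcount == 0) {             /* last reference gone: destroy */
        *o->destroyed_flag = 1;
        free(o);
    }
}
```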
We have getters and setters in the standard object
paradigm to get information in and out of the object.
Almost all of the objects have getters.
Only a very few of them have setters, because
that introduces mutable state into the object,
which makes it harder to keep things thread safe.
And then finally, if you want to
enqueue a command, we have clEnqueue
plus some command type; a whole bunch of functions like that.
So, the next question is; OK, I've
got this object infrastructure,
but I don't really understand, like how I write code for it.
I mean, what does the code I actually write look like?
And, by this I mean the code that you run on the
device, not the thing running in your regular C program.
So, we have to find some way to split
up lots and lots of parallelism in a way
that lets you write your code so it still makes sense to you.
So, a traditional way to do that would
be to invoke task level parallelism.
This is something you might have done yourself,
you know, using Pthreads or NSThread or something,
where you divide the work up into different tasks.
So, if we take a mail reader app as an example, we
might have a thread to get the mail from the server,
another one to scan through the results
and identify junk mail as it comes in,
another one that might run mail filters, a thread
to draw the UI, a thread to get keyboard input,
a thread to play the audio if you
want to make a little beeping noise.
You know, we can divide up many such tasks like
that and, you know, usually you come up with five
or ten different things you can do, and that's great when
you've got five or ten different processors to work with,
because you can get a five- or a ten-way parallelism if
you can manage to stack those things all up concurrently.
But, in OpenCL we're really targeting
a much more parallel system than that.
We're going after many core systems, you know,
that might have hundreds or thousands of cores,
so how do you break up your workload in a way
that's more amenable to those kinds of systems?
And, what we can do is just learn from standard
shader languages, like OpenGL shader language,
and take advantage of data-level parallelism; that
is, rather than breaking things up by different kinds
of tasks, we break things along data boundaries.
So, if we continue our example with the email reader,
you might have a separate thread for each email.
So, if you're like me and you come in in the morning and
you have 200 emails waiting for you to waste a good part
of your morning, then you know, this would be a great
way to get 200 way parallelism out of your system.
And, hopefully these kinds of computations that you're doing
on the different data elements are largely independent.
You know, in my case, what happens to one email probably
doesn't influence much what happens to another one.
And, assuming you have enough data, then you can
presumably get your 1,000 way parallelism potential.
Or, better yet, you know, find a
million way parallelism in there.
And, this is exactly the sort of problem
that OpenCL is set up to go after,
is where you can just get massive
parallelism in your computation.
So, as an example of a way to break
things down, let's take this image.
This is an old OpenCL logo.
And, I've broken it down so that you
can see each pixel outlined in the grid,
and each pixel we would call a workitem, and we
would run a single function against that pixel.
So, for even a very small image of, you know, one kilopixel
by one kilopixel, which is much smaller than you can get
out of most cameras today, you can easily
get a million workitems out of this,
and so you can get a great deal of parallelism.
And, what OpenCL will do is essentially
implement the outer loop for you and go through
and call the function in turn for
each one of these workitems.
And then, in your function, when you get called,
you call a little get_global_id function,
which tells you which one you are,
and then you go operate on that one.
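As a rough illustration, here's that implicit-outer-loop-plus-get_global_id pattern modeled in plain, serial C. Every name here (get_global_id_sim, invert_pixel, run_kernel_over_range) is made up for the sketch; in real OpenCL C the kernel would be a __kernel function and the runtime supplies the loop, potentially running the workitems in parallel.

```c
#include <stddef.h>

/* The "runtime's" counter: in real OpenCL this state lives per-workitem. */
static size_t current_id;
static size_t get_global_id_sim(void) { return current_id; }

/* A per-workitem kernel: invert one 8-bit pixel. */
static void invert_pixel(unsigned char *pixels)
{
    size_t i = get_global_id_sim();             /* which workitem am I? */
    pixels[i] = (unsigned char)(255 - pixels[i]);
}

/* The outer loop OpenCL implements for you: call the kernel
 * once per workitem in the global range. */
static void run_kernel_over_range(unsigned char *pixels, size_t global_size)
{
    for (current_id = 0; current_id < global_size; ++current_id)
        invert_pixel(pixels);
}
```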
And then, just to be complete, it's not really required
that the dimensions of your problem map directly
to the workitem dimensions that you've invoked OpenCL with;
they can be another dimension heading off
in an orthogonal space and you can map it back,
and I'll talk more about that in a minute.
However, there is a little bit of a complication here.
It turns out that modern silicon is generally not a vast array
of very simple scalar processors all running in parallel.
Over the last few decades, as you know, we've been
adding all sorts of ways to get instruction level
and data level parallelism into
the single instruction stream.
We have superscalar processors and symmetric multi-threading.
You can do a lot of work at the same time
in a SIMD fashion using vector units.
And, it turns out that if we actually look
at the possibility for doing concurrent work
on general modern cores, these things are capable of running
many, and sometimes a great many, workitems concurrently.
So, what we do is we take our giant grid of workitems,
and we break it out into little groups, called workgroups.
And, here I've outlined them in yellow,
in one way of breaking down the problem.
And, we run a workgroup on a particular core,
and all of the workitems in that workgroup
will run roughly concurrently on that core.
And this is actually a very convenient motif, because it
means that you have spatially adjacent data being worked
on together at roughly the same time, which
means you get great cache utilization.
You can share resources, like local memory.
If one workitem accesses a piece of memory and another one
right next to it accesses the piece of memory right next
to it and so on down the line, the compiler might be smart
enough to spot; hey, you just loaded a contiguous range
of data, and turn that into one big load
rather than doing it in smaller loads.
So, that's called a coalesced load,
or you can do it with storage, too.
It's also very cheap to synchronize.
It could be as cheap as a no-op to get
all of those things to work together,
because you're essentially doing it all in software.
Or, it could be a little hardware
interlock within a single core.
What's really hard to do though, is to synchronize
between workgroups, because they're running
on different cores, which might be far apart on the silicon.
There are actually sort of speed-of-light
information-travel problems getting from one core
to another, so communication can be slow.
It would add complexity to the chip to
have each core capable of being interrupted
by every other core, in sort of an N-squared fashion.
And then finally, you run up into sort of a memory limit.
If we imagine that we have a relatively small image,
say a million pixels, and we make a little stack
for each workitem, which is four
kilobytes, again very modest.
Multiply that out, and you quickly realize you've
just exhausted your address space in a 32-bit process.
So, we can't actually have all of
those stacks living at the same time.
So, it's not possible to say, put a barrier in the middle
of your code, have all million workitems come up to
that barrier, then once they get there they all continue.
It just doesn't work.
So, how do we map workitems directly to hardware?
There are varieties of ways to do this.
One way is a simple, direct model, and this
is what we do on a CPU today in Snow Leopard,
where essentially one workitem is one hardware
thread running there, just like a Pthread.
And, in this model, you have to vectorize your code if
you want to get full use out of the hardware available,
and then there might be other sources of parallelism
on the core that you might want to take advantage of,
like superscalar execution, or trying to use the reorder
buffers to get more to happen concurrently.
But, there are other ways to do
it in a more GPU-like fashion.
We can parallelize our work items through the SIMD
engine, and here we run a separate workitem down each lane
of the vector register in the vector unit.
We can go further than that.
We can write a software loop to do multiple vectors at
a time; meaning, in this case, 32 workitems at a time.
And of course, since it's a loop, we
can turn the loop over many, many times,
and pretty soon you can see how you can end up with a
workgroup that's hundreds or thousands of workitems in size,
all running on the same piece of parallel hardware.
And finally, we can parallelize this in a different
way, by running each one of these vector workitems
down through a symmetric multi-threading engine,
and using hardware to do all of the scheduling.
So, these are just simple examples of how it might happen.
There are some ways to synchronize within a workgroup.
The simplest is mem_fence.
That actually synchronizes within a single workitem.
And, it's mostly there just for hardware
that has really weak memory ordering.
This is largely unnecessary on Mac OS X, because all
of our hardware is more strongly ordered than that.
So, you should not need to use mem_fence.
Barrier is very important, however.
If you are working with local memory, you'll
find that you will want to copy data over.
Make sure all of the copies are done,
issue a barrier, and then proceed forward.
Now you can use the data that's in local memory.
But, barriers only work across a single
workgroup for the reasons I mentioned before.
Then finally, there's a call, wait_group_events,
which is a barrier-like synchronization,
but it works with an asynchronous copy function that
moves data around, called async_work_group_copy.
There is also a problem with figuring out
how many workitems to put in your workgroup.
You can pick just about any number, as
long as the hardware will swallow it.
So, what will the hardware swallow?
Well, it depends on the hardware.
OpenCL provides quite a diversity of
different interfaces to try to figure that out.
You can see I have eight of them here.
Six of them are APIs, and there are also some constraints
in the standard; the dimensions of your workgroup have
to divide evenly into the total problem
size, and it might turn out that you need
so much local memory per workitem in order to do your work.
There's only so much local memory on the system, so
that kind of can limit the size of your workgroup, too.
So, that's the bottom entry.
So, how do you wade through all of this and try to
figure out how big your workgroup can or should be?
Well, the first solution is to give up.
You can pass in NULL for the workgroup size.
We'll swallow that.
We'll hopefully, you know, do some magic; we'll try.
We'll do the best we can, and magic is wonderful stuff,
but unfortunately it's not really grounded in reality,
so we may not get it right, for
reasons I will describe in a minute.
So, in cases where you have more information than we
do, it's often best to handle this yourself.
So, another approach is divide and conquer,
which is the standard technique, of course.
And, you just go through and sort the information you're
getting back from OpenCL according to dimensionality.
So, you won't need to call all of these interfaces.
You'll quickly realize that only some of
them really apply to your particular problem.
But, you'll get a number of one-dimensional
limits, of which you might have to take the minimum;
we can't have any more workitems than this.
And then, there's a couple of APIs that will give you a
3D shape, which will constrain the size of your workgroup,
and this is required because there are
devices out there that are only able
to vectorize in one dimension and not all three.
So, these APIs can be used to find that out, and then,
of course, the overall size of your global problem set
will constrain your data shape, because your
workgroup size has to divide evenly into it.
You can run into problems when the
global work size is a prime number.
What divides into a prime number?
Well, the only thing that divides
into it is one and the prime number.
Prime number's not going to work; it's too big.
One means we're only running one workitem per
core; you're not going to get any parallelism
that way, so it's going to perform terribly.
So, what do you do?
Well, one thing you can do is just make
your problem domain a little bigger.
And then, in your kernel, when you go write your
code and you ask, you know, which workitem am I;
if you happen to be outside the original
problem size, then you just take an early out.
So, that will let you satisfy the
local-size-must-divide-into-the-global-size rule
and still run with a prime-number global size.
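Here's a hedged plain-C sketch of that pad-and-early-out trick. The helper names round_up_global and process_item are invented; in a real kernel the id would come from get_global_id(0) and the true problem size would arrive as a kernel argument.

```c
#include <stddef.h>

/* Round the global size up to the next multiple of the local size,
 * so the local-must-divide-global rule holds even for prime sizes. */
static size_t round_up_global(size_t real_size, size_t local_size)
{
    return ((real_size + local_size - 1) / local_size) * local_size;
}

/* The kernel-side early out, modeled as a plain function that takes
 * the id explicitly. Returns 1 if real work was done. */
static int process_item(size_t id, size_t real_size, int *out)
{
    if (id >= real_size)
        return 0;                 /* padding workitem: early out */
    out[id] = (int)id * 2;        /* stand-in for the real work */
    return 1;
}
```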
Another thing you can do, as I mentioned earlier, is
enumerate your workitems out into some abstract dimension.
And then, you write a little function, such as this one,
where you take in your global ID in the abstract dimension
and map it back to something more
real, like your X or Y position.
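A minimal sketch of such a mapping in plain C, assuming row-major order; unflatten and coord2d are hypothetical names, and in a kernel the flat id would come from get_global_id(0).

```c
#include <stddef.h>

typedef struct { size_t x, y; } coord2d;

/* Map a flat id in the abstract dimension back to 2D coordinates. */
static coord2d unflatten(size_t global_id, size_t width)
{
    coord2d c;
    c.x = global_id % width;    /* column */
    c.y = global_id / width;    /* row */
    return c;
}
```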
So, these are all things that you can just do yourself in
code and, you know, it's limited only by your imagination.
So, I wanted to work through a few
developer questions we've had over the years.
They come from a variety of topics.
Many of you have noticed that we've added half precision
to the spec. This is a 16-bit floating point number.
And, it's easy to get quite excited
about that; ooh, what is this thing?
You know; ooh, I have a new float to work with.
It's not quite that.
It's a storage-only format, which means that it only
exists as a 16-bit floating point number in memory.
As soon as you pick it up and start trying to work on it,
the first thing OpenCL does is
convert it to a single precision number.
You do all of your arithmetic in single
precision, and then when you go to store it back
out somewhere, then it gets converted back.
Which interface you use to load and store your data
depends on whether you're working with buffers or images.
Buffers will use vload_halfn/vstore_halfn; images will use
read_imagef or write_imagef, as for other pixel formats.
There is an extension you'll see in the
end of the spec, which is cl_khr_fp16,
which actually specifies half precision direct arithmetic.
That's not supported on our platform.
There's a couple of good reasons for that.
The native hardware doesn't do it;
we'd have to emulate it in software,
which would be a lot slower than
doing it in single precision.
And, the other problem is, of course, that half precision
only has about 11 bits of precision; if you do enough
multiplies and adds and whatever else,
you're going to start losing bits, and you'll
be down to eight, seven, six bits of precision,
and for most algorithms, that's just not enough.
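To see what "storage only" means, here is a simplified software sketch of the binary16-to-float round trip that vload_half/vstore_half conceptually perform: convert on load, do the arithmetic in single precision, convert back on store. This version handles only zeros and normal values; the real conversions also cover denormals, infinities and NaNs.

```c
#include <stdint.h>

/* binary16: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits. */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    union { uint32_t u; float f; } v;
    if (exp == 0 && man == 0) { v.u = sign; return v.f; }   /* +/- zero */
    v.u = sign | ((exp - 15 + 127) << 23) | (man << 13);    /* rebias */
    return v.f;
}

static uint16_t float_to_half(float f)
{
    union { uint32_t u; float f; } v;
    v.f = f;
    uint16_t sign = (uint16_t)((v.u >> 31) << 15);
    uint32_t exp  = (v.u >> 23) & 0xFF;    /* 8-bit exponent, bias 127 */
    uint32_t man  = v.u & 0x7FFFFF;
    if (exp == 0 && man == 0)
        return sign;                       /* +/- zero */
    return (uint16_t)(sign | ((exp - 127 + 15) << 10) | (man >> 13));
}
```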
Many people see that OpenCL has four address
spaces, which are sort of disjoint places to put memory,
and wonder: what's that all about?
We have global, which is akin to
the main system memory you're used to using.
We have local, which is a little
user-managed cache tied to your compute unit.
We also have a constant memory space, and private, which
is just local storage for your particular workitem.
And, the confusing one is, what is local memory?
It's just a user-managed cache, and the way you use it
is that you, either explicitly or using a convenience
function like async_work_group_copy, just pick up the
data from global memory and write it over there.
So, it's as simple as, you know, just doing an
assignment from A to B in your OpenCL C code.
So, what you want to do is have all of your workitems
work together to copy in the data from global to local,
then issue a barrier to make sure everyone's done, so that
no one tries to read any of it before all of the copies are done.
And now, the data you know is resident in local
memory and you can read that out much quicker
than it would have taken to access it from global.
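Here is a serial toy model of that copy-then-barrier pattern, using a little three-tap blur so that each local element gets read more than once. All names are invented; in a real kernel, local_mem would be __local storage, the two loops would be concurrent workitems, and the barrier would sit at the marked line (this serial version can't actually race). It assumes n is at most 256.

```c
#include <stddef.h>

static void blur3_via_local(const float *global_in, float *out, size_t n)
{
    float local_mem[256];                  /* stand-in for __local storage */
    size_t i;

    for (i = 0; i < n; ++i)                /* phase 1: cooperative copy */
        local_mem[i] = global_in[i];

    /* ---- barrier(CLK_LOCAL_MEM_FENCE) would go here ---- */

    for (i = 0; i < n; ++i) {              /* phase 2: use the local copy */
        float l = local_mem[i == 0 ? 0 : i - 1];
        float r = local_mem[i == n - 1 ? n - 1 : i + 1];
        out[i] = (l + local_mem[i] + r) / 3.0f;
    }
}
```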
Now, a key point about local memory is that it only
really works if you touch the data more than once.
If you just touched it once, then you would
essentially be reading it once, copying it over here
and then reading it back from over there.
You didn't save yourself any time.
You want to be in a situation where
you read it once here, put it over here
and then all sorts of different people use this over time.
So, suppose it turns out that you only plan to use your data
once; let's say in My Image I'm just converting RGB to YUV,
so each pixel is largely independent
and I only touch it once.
Then you want to use a variety of different
approaches, depending on what kind of hardware you're on.
On a GPU, there's a texture cache designed
to accelerate that kind of read-once access,
and that is backed up by the image
data type, so you'd want to use that.
On the CPU, there is a global cache backing up buffers,
so as long as your data has good spatial locality,
then you should get some acceleration out of the caches.
I should note that, while local memory
seems like a predominantly GPU technology,
we have found that it's actually quite helpful on the CPU.
It can make vectorization easier.
It also can avoid polluting your caches,
and what I mean in that sense is,
let's say your global data structure
is an Array of Structures data type.
Here I have an AoS struct with
x, y and z in it, and then somebody's
telephone number; some unrelated piece of data.
And, I know many of you just love doing code like that.
I've seen it everywhere.
And, this might be an array in your buffer, but
when you go actually work on it in your kernel,
it can be quite useful to then
transpose that around into an array
of x's followed by an array of y's and an array of z's.
If you only intended to work on x, y and z, and
you didn't care about the telephone numbers,
then you end up compacting the data
down into a much smaller space.
Also, because it's planar in orientation,
it's much easier to vectorize.
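A minimal sketch of that AoS-to-SoA transpose in plain C; the struct person record and aos_to_soa helper are hypothetical names invented to match the slide's example.

```c
#include <stddef.h>

/* Hypothetical AoS record: a position plus an unrelated field
 * the kernel doesn't need. */
struct person { float x, y, z; int phone; };

/* Transpose just the fields you use into separate planar arrays.
 * The working set compacts down, and each array vectorizes easily. */
static void aos_to_soa(const struct person *in, size_t n,
                       float *xs, float *ys, float *zs)
{
    for (size_t i = 0; i < n; ++i) {
        xs[i] = in[i].x;
        ys[i] = in[i].y;
        zs[i] = in[i].z;     /* the telephone numbers stay behind */
    }
}
```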
The GPUs have a watchdog timer.
And, what is a watchdog timer?
It's somebody looking over your shoulder to make
sure you don't monopolize the GPU for too long.
And, the reason for that is that the UI cannot
update as long as you're busy on the GPU.
If you use the GPU for more than a few seconds at a
time in a single kernel, you'll probably get a message
in the console, such as this one shown here.
You may see a flash on screen, and you definitely
will not get the right answer out of OpenCL.
Your contexts might be invalidated.
So, if you start running into this, the simple solution
is just to divide your task up into smaller chunks,
so that each one finishes faster and doesn't
use up quite as much time.
You can enqueue them one after another in the
queue; you know, the second one will start as soon
as the first one's done, so you won't waste too much time.
You want to be careful though, because the breadth
of capability between sort of a low-end GPU
and a high-end one can be quite large;
an order of magnitude, maybe.
So, if you are doing this, be sure you're testing out
a low-end system to make sure that it works everywhere.
Some of you have noticed that OpenCL provides a way
to get your kernels back out, compiled as a binary.
There's this interface, clGetProgramInfo
(CL_PROGRAM_BINARIES).
And, the intention of this thing is to give
you a cache, or a way to create your own cache,
to avoid having to compile your
kernels every time your app runs.
Some people want to use it for code obfuscation,
but it's not suitable for that use right now,
and the reason for that is that Apple has
not committed to a stable binary format for the kernels.
And so, that means that on some future OS we might
change the format, and your binary will not work anymore.
When you go try to load it, you have to call clBuildProgram
before you can actually use it, and that call will fail.
If the only thing you shipped with your app was the binary,
you're now in deep trouble because you have no code to run.
So, what you need to do is ship your source.
If this happens to you, then you rebuild
your source fresh and override the cache
that you had set up for yourself, and continue on.
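The fallback logic is simple to sketch. The build functions here are hypothetical stand-ins for your own code: one would call clBuildProgram on the cached binary, the other would rebuild from the source you shipped and refresh the cache.

```c
/* Returns 1 on success, 0 on failure. */
typedef int (*build_fn)(void);

static int get_program(build_fn build_from_cached_binary,
                       build_fn build_from_source,
                       int *rebuilt_from_source)
{
    *rebuilt_from_source = 0;
    if (build_from_cached_binary())
        return 1;                /* cached binary still loads: done */
    *rebuilt_from_source = 1;    /* format changed: fall back to source */
    return build_from_source();
}

/* Hypothetical stand-ins you might pass in. */
static int binary_build_fails(void) { return 0; }
static int source_build_works(void) { return 1; }
```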
Some developers are curious; when should
I use buffers, when should I use images?
I think if you're familiar with
OpenGL, it should be obvious.
But, if you're from CPU land, like
me, then it's not so clear.
Buffers are sort of native territory for a CPU.
We have caches to back them up; they're very fast.
On a GPU, on the current generation, there's no
cache to back up global memory accesses,
so you're taking essentially a big, long trip;
several hundred cycles out to uncached memory.
On the GPU, you'd want to copy that data into local
memory, which is very close and much faster to use.
Or, use coalesced reads wherein multiple workitems are
reading data from largely contiguous regions of memory.
Images are great on the GPU.
There's a texture unit to accelerate those
accesses, if they have great spatial locality.
But, on the CPU there's no such hardware, so
we have to emulate the whole thing in software.
So, these things look a lot more complicated than
you would think just looking at the spec. But,
at least the CPU is extremely accurate, so it's good
for debugging to make sure you're doing the right thing.
But, as you can see, this is the implementation of a single
pixel read using linear sampling; it's 168 instructions.
So, I would only want to use the
read image feature on the CPU
if you've budgeted ample CPU time to
go through and do all of that work.
Some developers are wondering; how do
I use OpenCL in my multi-threaded app?
It says it's not completely thread safe.
The intended design is to use a separate queue for each
thread from which you intend to enqueue work into OpenCL.
The other thing you have to do is make sure you're
not getting reentrant access into individual objects.
And, you can end up in some patterns
also where you can step on yourself.
For example, on this one, I might set the kernel
argument to be a value and then get interrupted,
have another thread which is using the same kernel come
along and set it to a different value and queue its kernel,
then finally I wake up and queue my
kernel, but it has the wrong argument.
So, you want to make sure that you don't get
these sorts of races happening in your app.
Now, you could implement some very fancy locking schemes
to try to guarantee this, but it's going to be heavy,
it's going to damage your concurrency,
and you're not going to like it.
What we are actually thinking that you would do is, if
you intend to call the same kernel for multiple threads,
make multiple kernel objects, all
pointed back to the same kernel function.
That's cheap to do.
You don't have to do any locking, because each thread will
have its own copy of the kernel object, and it's safe to do.
Probably the number one thing that I've seen developers do
is block too much when they're enqueueing stuff into OpenCL.
A number of the enqueue APIs have the capacity
to block until the work is completely done
in OpenCL before returning control to you.
But, that's extremely inefficient, because you don't
get to do anything while OpenCL's doing stuff,
and then once OpenCL is done, it has to wait for you.
And, so you end up losing a lot of your concurrency,
and I'll show you an example of that a little bit later.
So, there's an API entirely intended
to block your queue, clFinish.
You should almost never need to call that.
The only time I've seen where it was a good use case of that
is where somebody wanted to shut down OpenCL completely,
wanting to make sure all of the work was done and
all of the reference counting had resolved itself
and all of the objects were freed
and all of the memory was released.
So, that's great use for clFinish, but you
should otherwise almost never need to do it.
Some people just seem to instinctively
put it in there proactively
after every single call, and it's
killing them, I'll tell you.
There are calls to read and write data in and out of OpenCL.
These can be made blocking if you want, but you'll
only need them to block some of the time.
And, you can probably figure out for yourself when
this is, but often you'll see, like for example;
I need to enqueue multiple reads to read back
results from multiple buffers after my computation.
It turns out because the queue is in order, which means
each job is finished before the next one can start;
you only actually need to block on the last one, because
you know that the other ones have already completed.
Likewise, when you're doing writes, the typical pattern
is write data into OpenCL and queue a bunch of kernels,
and then read back the results; the last one is blocking.
Well, OK; I know my write finished a long time ago,
way up here, so no need to block on that either.
So, there really, in any, like giant sequence of
calls, you probably only need one block at the end.
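Here's a toy in-order queue in plain C that shows why one block at the end is enough: commands run in submission order, so by the time a final blocking call completes, everything enqueued before it has completed too. All names are invented; this is a serial model, not the OpenCL runtime, where enqueues return immediately and the work runs asynchronously.

```c
#define MAX_CMDS 16

/* A "command" just appends its tag to a trace array. */
typedef void (*toy_cmd)(int *trace, int *pos);

struct toy_queue {
    toy_cmd cmds[MAX_CMDS];
    int count;
};

/* Enqueue returns immediately; nothing runs yet (non-blocking). */
static void enqueue(struct toy_queue *q, toy_cmd cmd)
{
    q->cmds[q->count++] = cmd;
}

/* One blocking call at the end drains everything, in order. */
static void blocking_flush(struct toy_queue *q, int *trace, int *pos)
{
    for (int i = 0; i < q->count; ++i)
        q->cmds[i](trace, pos);
    q->count = 0;                 /* all prior work is now done */
}

/* Sample commands standing in for write / kernel / read. */
static void write_cmd(int *trace, int *pos)  { trace[(*pos)++] = 1; }
static void kernel_cmd(int *trace, int *pos) { trace[(*pos)++] = 2; }
static void read_cmd(int *trace, int *pos)   { trace[(*pos)++] = 3; }
```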
People also run into some performance
pitfalls using half-full vectors.
This can come in two forms.
On the CPU, the vectors are fixed width, they're all
16-bytes on SSE, and so if you write, like a float2,
which is only an 8-byte type, you're essentially
issuing an instruction that works on 16 bytes of data,
but you've only populated it half-full,
and so we end up doing extra work on
whatever happens to be in the rest of the register.
So, this is bad for two reasons.
Obviously, you've wasted half of your potential to do work.
But, in floating point, if those lanes in the vector
happen to get any NaNs or infinities or denormals,
you might set yourself up to take a hundred cycle stall
for each one of those, and that can hit you operation
after operation after operation after
operation, and make your code run,
like orders of magnitude slower than it should.
So, you want to be sure, when programming for
the CPU, on the direct model like we have,
that you try to make sure you use 16-byte vectors or larger.
Larger will actually get you a little free unrolling
from the compiler, which is at times a little
faster than just using the 16-byte vectors.
The GPU has sort of the reverse problem.
When it's revectorizing your problem along a different
dimension in the way the GPU would like to do it,
you might have a float4, but you've only put data in the
first two elements and have a bunch of garbage after that
because you couldn't figure out what to
do and you declared it a float4 somewhere.
Well, when the GPU vectorizes that, it will make a
big vector full of x's and a big vector full of y's,
and then two big vectors full of junk, which it will then
go do arithmetic on needlessly, so that just wastes time.
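Both problems come down to the declared vector width. Here is
a hypothetical OpenCL C kernel pair illustrating the CPU case;
the kernel names and arguments are made up for illustration:

```c
/* OpenCL C sketch. On the CPU, a float2 fills only half of
 * each 16-byte SSE register; the other half carries garbage
 * that can trigger NaN/denormal stalls. */
__kernel void scale_bad(__global float2 *data, float s)
{
    size_t i = get_global_id(0);
    data[i] *= s;   /* 8 bytes of real work per 16-byte op */
}

/* Prefer a full 16-byte vector, with every lane populated
 * with real data. */
__kernel void scale_good(__global float4 *data, float s)
{
    size_t i = get_global_id(0);
    data[i] *= s;   /* whole register does useful work */
}
```

On the GPU side, the same advice inverts: if you only have two
meaningful components, declare a float2 rather than a float4
with junk in the tail, or at least initialize the unused lanes.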
Finally, we've noticed that people
often will make objects, use them once,
then delete them; make them, use them once, delete them.
And, that can be kind of wasteful.
Many OpenCL objects are heavy;
they're intended to be reused a lot.
Like a program, you'd have to compile it each time,
which can take a big chunk of a second, sometimes.
Images and buffers have a big, giant backing store,
megabytes in size; a bunch of driver calls to set it up,
and then there's some state associated
with who used it last,
so we can track which device the
actual data lives on right now.
And then, finally, any new buffers that you make
on the system are subject to the usual zero-fill activity
that the kernel will do the first time you use them.
So, if you reuse them, you save yourself
this cost the second and later times.
However, it's only really useful to reuse things if
they're about the same size, or in the case of images,
exactly the same size as the previous use.
Otherwise, OpenCL has no concept of
copying only part of a buffer up there;
it'll copy the whole thing up to the device.
So, you only really want to reuse
them if they're about the same size.
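The reuse pattern looks something like this sketch, where the
buffer and kernel objects are created once and the per-frame
loop only enqueues work (names are placeholders):

```c
/* Sketch: hoist heavy object creation out of the loop.
 * ctx, queue, kernel, nbytes, nframes, global, input[],
 * and output[] are assumed; error checking omitted. */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                            nbytes, NULL, &err);

for (int frame = 0; frame < nframes; frame++) {
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, nbytes,
                         input[frame], 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global,
                           NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, nbytes,
                        output[frame], 0, NULL, NULL);
}

clReleaseMemObject(buf);   /* released once, at the end */
```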
So finally, I'd like to talk about a few debugging tips.
These are standard techniques we use back in Infinite Loop.
Pretty much all of us run with the
environment variable CL_LOG_ERRORS set all of the time.
I just put that in my .bashrc.
And, you can set this to either standard
out or standard error or console,
depending on where you want the error messages to go.
And, what it does is, whenever you call an OpenCL API
and you manage to miss some little gotcha in the spec,
and OpenCL returns an error, you also get
a hopefully human-understandable English message
spewed to the console or standard error or wherever,
about what exactly you did wrong.
So, if you're encountering any problems with
the API, getting that to do what you want,
then CL_LOG_ERRORS is very much your friend.
There's also a way to hook into it programmatically.
You can roll your own callback function and
pass it in when the context is created.
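A sketch of that hook, using the notification callback
signature from the OpenCL spec (the callback name and message
format here are made up):

```c
/* Sketch: roll your own error hook instead of relying
 * only on CL_LOG_ERRORS. */
#include <stdio.h>

static void my_notify(const char *errinfo,
                      const void *private_info,
                      size_t cb, void *user_data)
{
    /* Print the human-readable error description. */
    fprintf(stderr, "OpenCL error: %s\n", errinfo);
}

/* Later, at context creation (devices obtained earlier): */
cl_context ctx = clCreateContext(NULL, num_devices, devices,
                                 my_notify, NULL, &err);
```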
Finally, when you're working on the CPU, we make heavy
use out of Shark and Instruments to see what's going
on in there, and what I'd like to do is give you
a quick look at what that process looks like.
So, this is iPhoto.
It's unmodified, and as it turns out, iPhoto on Snow
Leopard for certain things will use OpenCL on the CPU.
So, we can take a look at what it's doing.
So, we call up our little Adjust panel and we can
start, you know, adding little filters onto the image,
and these are all processed in real time, as you can see.
And then, we can go and start jiggling stuff.
So, if we run Shark; Shark has a bunch
of different ways to run samples,
but the two most interesting are
Time Profile and System Trace.
I'll run a Time Profile first; I'm
sure many of you have done this.
So, while I'm getting OpenCL to do something
through iPhoto, here I'm jiggling this thing around.
I can hit Option-Escape to get it to record, and what it's
doing is taking a sample every so many milliseconds
or microseconds, and recording what instruction
the CPU or CPUs were on.
And then later, it puts it all back together,
maps the samples back to the functions they were in,
and you can get a breakdown as to where your
time was going using this stochastic technique.
So, you can see it was spending about
12 percent of its time in this function.
This is an OpenCL library.
OpenCL provides a number of libraries
which are strangely named;
they look a little bit more like
pixel formats than libraries.
So, if you see this, this is a cost accrued to you
from using read_image on the CPU.
You will also see lines that say unknown library in them.
Well, why doesn't it say OpenCL?
Well, this is your code.
It doesn't actually exist on disk anywhere.
We just built this code, stuck it in the memory and run it.
So, Shark doesn't really know what
it is, so it says unknown library.
But, you can still drill down and see all of
the code that we prepared, and if you're good,
you can figure out what parts map onto your
kernel and see if you got the code you wanted,
and if there are large stalls in here, you can
generally figure out what went wrong in your kernel.
I can also show you a system trace, which is very useful
for understanding how your interactions
with the OpenCL queues are progressing.
So, if we go back and play with the image a bit more,
I'll now record a system trace for a second or two.
This is a 16-core system, or 8 cores with 2-way SMT.
So, we have here a bunch of iPhoto threads.
I'll just limit it down to iPhoto.
And, I'm looking at the timeline, and what you
can see here; these are threads, horizontally.
Regions that are amber are times, as the timeline
progresses to the right, when the CPU was active.
And, what we can see is that it's single-threaded a lot,
but then there are these little
windows where we're multithreaded.
And, these are when OpenCL is running,
and we can zoom in on these things.
These little telephones are system calls, and here
we can see; here is the main thread on the top
and it's running through, making various system calls,
and we can go look at this and track this back to OpenCL.
This is a release_mem_object call
to release some memory object.
This one is enqueue a kernel, and you see a little while
after a kernel we get a little blip of something happening.
And here we enqueue another kernel,
and then this one's a big one.
And, it seems like there is some serial
process here to kick off each CPU
as it goes along, so you can see them all firing up.
But, sometime before we manage to get them all
fired up, we're already done with the work.
So, maybe the kernel you enqueued is too small, because we
didn't actually get all 16 of the threads up and running,
the hardware threads, before we ran out of work to do.
Well, why did that happen?
Well, we can go and look at this one.
What's this thing?
The main thread wasn't doing anything during
this time; that's kind of a little strange.
We can go track that back.
Oh, look! It's a clFinish.
Apparently they enqueued a kernel and then
issued a clFinish to wait for it to be done,
and then did some more work and
then repeated the process again.
You can see another clFinish over here.
Well, this is quite costly for a number of reasons.
For example, let's see if I can zoom in here.
While we weren't doing anything here, we probably could
have been doing this much work in the main thread.
It only would have cost OpenCL one thread out of 16.
So, it probably wouldn't have slowed down too much.
So, you could have gotten this much work done, which meant
all of this dead time, all down here, would have gone away
and we would have compressed the launch-to-launch
time from here to over here by about that much time.
So, you can see, just that clFinish,
that's what it's costing you.
Another thing is, all of the CPUs go back to sleep after
we're done with this, and we have to wake them up again.
If you didn't have that Finish in there, it's possible
we would have just picked right up where we left off,
and then started running these things, but,
except all of the CPUs would be awake now.
So, rather than getting a little, tiny bit of work out
of these threads, you might have gotten a full width.
So, the whole thing, you know, might
have been about twice as fast if we go
by one-half base times height on this little triangle here.
So, you can use Shark to dig right in to how
much residency you're getting on the CPU.
And, the same techniques apply in the GPU.
You won't see these little triangles,
because the work is happening on the GPU,
but you will certainly see time that's dead time in your
main thread, when you could have been doing something else.
So, that's Shark with OpenCL in a nutshell.
And, what I'd like to do now is invite Abe Stephens up
to tell you all about how to integrate your workflow
between OpenCL and OpenGL, and that
allows you to quickly and seamlessly,
and with a minimum of cost, share data between the two.
[ Applause ]
>> Abe Stephens: Hi.
My name is Abe Stephens and I work
with the OpenCL Group at Apple.
Today I'm going to talk about OpenCL and OpenGL sharing,
which is a mechanism that allows us to create objects
in OpenGL and then import them into
OpenCL in such a way that the actual data
that is operated on by both APIs is the same.
So, we can avoid copying data or making duplicate
copies of the data, and accelerate our programs.
The motivation here is that we'd
like to combine these two APIs.
Now, OpenCL and OpenGL are similar in many ways.
A lot of programs that are written
for both APIs run on the GPU.
It's possible to write programs for the CPU in OpenCL, and
that's a little bit less common for OpenGL, but really,
we're looking for a mechanism that allows us to
move data efficiently between these two interfaces.
Let's take a look at a simple example; a case where we
have an application that's going to perform some kind
of physical simulation and then visualize the results.
The physical simulation part of this example;
for example, a bunch of objects bouncing
around the screen, is a very OpenCL oriented kind of task.
It might involve collision detection, computing
maybe Newtonian mechanics or something like that.
And then, it also might involve rendering.
We could compute our position and velocity in our
compute side of the application, and then take that data,
move it to graphics, and actually render the scene.
Now, this type of task could also
be performed without OpenCL.
We could produce data on the CPU and then transfer it to
the GPU, and then in the next frame, repeat that process.
Alternatively, with a sharing API, with a sharing
mechanism, we could produce the data in CL on the GPU,
and then move it from the CL side to the
GL side, and use the same data over again.
So, let's take a look.
In OpenCL, we'll produce a list of vertex data and then
move that data from CL into OpenGL, where we can provide,
or we can implement, some kind of
sophisticated shading operation.
In this example, we've rendered these spheres with
a refraction shader and some other special effects.
That shader operation is obviously best
suited to graphics, and the physics operation
in this case is very well suited to OpenCL.
We can also perform the opposite kind of operation.
Instead of producing data in OpenCL, we can consume
the data in OpenCL and produce something in OpenGL.
For example, in that previous slide, OpenGL could
produce data about surface normals and surface positions
of fragments, and then that data could be passed
into OpenCL for some type of post-process.
So, for example, in our physics application, we could
render the spheres using OpenGL to a framebuffer object,
and then take a surface of that framebuffer
object, transfer it to OpenCL,
where the CL program might use that data as
input to a ray tracer, trace a caustic effect,
and then move that caustic effect back into
OpenGL for final display and compositing.
Now, under the hood, both of these APIs
operate using similar data structures in the
driver, and are implemented using shared structures.
As you might be familiar, OpenGL selects the devices
that are used for computing using a pixel format.
So, a developer or a programmer sets up
a pixel format and that format is matched
with the rendering devices in the system, GPUs and CPUs.
Matching the pixel format to system
devices and then passing that format
to a create function generates an OpenGL context.
At this point in setting up an OpenGL
system, the programmer is done.
We can use this OpenGL context to start
sending draw commands to the system.
But, if we wanted to take our program a step further
and add an OpenCL application or an OpenCL task,
we have to take the context and convert it into
OpenCL, and this is done by extracting the ShareGroup
from the GLContext, and then passing
that ShareGroup into an OpenCL context.
And then, if we looked inside that OpenCL context, we'd
see that the CLContext contained all of the devices
that were originally in that pixel format.
Now, the CGL structure, the CGLShareGroup, and the OpenCL
context both contain very analogous types of structures.
The ShareGroup has a list of vertex buffer objects, textures
and render buffers, and the CLContext has data objects
that are wrapped by the cl_mem object type.
When we convert the ShareGroup into a CLContext,
we're making all of those GL data structures available
to the CLContext, and then we have to obtain references to
those structures from the CLContext to use in our program.
There's some relationship between the CGLContext that's
used by OpenGL, and the list of CL devices, although in CL,
we have a very explicit representation for the device.
In OpenGL, we select a virtual
screen, or we use another mechanism
to select a rendering device that the system will use.
So, although there's a relationship,
the devices aren't exactly analogous.
There are five steps for setting up a
sharing process between OpenGL and OpenCL.
We've already looked at the first step,
which was to obtain that CGLContext.
From the Context, we obtain a ShareGroup and
then use that ShareGroup to create the CLContext
with which we'll send commands on the CL side.
After the Context has been created, we can
import data objects from OpenGL to OpenCL.
In the first example that we looked at, those
data objects consisted of a vertex buffer
and then in that second post-process example,
the shared data object was a GL render buffer.
After we've imported the data objects, the set-up phase
of our application is complete, and now we can concentrate
on executing commands, and there's a specific flush
and acquire semantic that we have to use to coordinate
between the OpenGL side of the program
and the OpenCL side of the program.
And then, finally, when we're done, we have to tear
the whole system apart and clean up by making sure
to release the objects safely in
CL before destroying them in GL.
So, let's take a look at some source code.
The first step is to obtain the CGLShareGroup
for an application from the CGLContext.
And, the example that we'll look at
will focus on a Cocoa application.
In Cocoa, the NSOpenGLView is commonly
extended to add OpenGL to an application.
Within the initialization function of our Cocoa
program, we can obtain first the CGLContext associated
with that OpenGLView, using an accessor function.
And then, we use the CGLContext to obtain a CGLShareGroup,
which is essentially what we'll use to create the CLContext.
Now, the second step in our five-step process is to
take that ShareGroup and use it to create a CLContext.
We do this using an OpenCL
function with a special enumerant,
the CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE
enumerant, which is really hard to forget.
And, this is passed in a property list to the
clCreateContext function and in this case,
without any other arguments except for the error argument.
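Steps one and two together look something like this sketch;
it assumes the view's GL context is current on this thread,
and leaves out error checking:

```c
/* Sketch: obtain the share group from the current CGL
 * context and use it to create the CL context. */
#include <OpenGL/OpenGL.h>
#include <OpenCL/opencl.h>

CGLContextObj    cglContext = CGLGetCurrentContext();
CGLShareGroupObj shareGroup = CGLGetShareGroup(cglContext);

cl_context_properties props[] = {
    CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE,
    (cl_context_properties)shareGroup,
    0
};

cl_int err;
/* No device list needed; the share group supplies them. */
cl_context ctx = clCreateContext(props, 0, NULL,
                                 NULL, NULL, &err);
```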
Now that we've created a CLContext, we have to obtain a
list of CL devices, and if you remember from the slide
at the very beginning of the talk, those
devices will reflect the devices that were used,
that were passed in the pixel format when
that CGLContext was originally created.
In a standard Cocoa application, the runtime has
already taken care of creating a pixel format,
and we simply obtained the CGLContext
that was provided by the runtime.
If we wanted to obtain all of the devices that were
associated with our CGLShareGroup in our CGLContext,
we could simply use the existing or the
standard CL accessor method, clGetContextInfo,
passing it the correct enumerant,
and get a list of all of the devices.
However, in that case, we might obtain
a GPU device that doesn't correspond
to the GPU device our Cocoa application
is currently using for display.
If we want to obtain the device that's currently, the
virtual screen that's currently being used for display,
we have to use another special enumerant and another special
function, clGetGLContextInfoAPPLE, to obtain the CL device
that matches the current virtual screen
that our application's running on.
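A sketch of that query, reusing `ctx` and `cglContext` from
the earlier set-up; the enumerant name is the one from Apple's
Snow Leopard headers, so verify it against your SDK:

```c
/* Sketch: ask for the CL device backing the virtual
 * screen currently used for display. */
cl_device_id displayDevice;
cl_int err = clGetGLContextInfoAPPLE(ctx, cglContext,
        CL_CGL_DEVICE_FOR_CURRENT_VIRTUAL_SCREEN_APPLE,
        sizeof(displayDevice), &displayDevice, NULL);
```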
So, let's take a look now at what we've done.
We've made our way around this entire figure, and we're
back to the point where we have a list of CL devices inside
of our CLContext; however, if we look at those devices,
we'll notice that the CL API has removed the
CPU device from that initial list of GL devices.
If we want to add the CPU device back in; for example,
if we wanted to run some CL programs or some CL kernels
on the CPU, and others on the GPU, when we create our
CLContext we have to explicitly add the CPU device,
as if we were creating a normal CLContext.
So, this involves getting a list of device IDs with the
CPU device type, and then passing that to clCreateContext.
OK, now here's the fun part.
Now that we have our Context, our
CLContext, we have to move data objects,
or tell the runtime which data objects
we'd like to move from OpenGL into OpenCL.
We have to import the shared objects.
OK, so here are the two ShareGroups
we'd like to end up with.
If we'd like to move that vertex buffer object into
OpenCL, we use the function clCreateFromGLBuffer,
passing it both the CLContext and the
GLBuffer, and then memory access flags
that tell OpenCL how we plan on using that data structure.
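As a sketch, the GL side creates the buffer first, and then
the CL side wraps it (sizes and names here are placeholders):

```c
/* Sketch: create a VBO in OpenGL, then import it into
 * OpenCL with clCreateFromGLBuffer. Error checks omitted. */
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, nbytes, NULL, GL_DYNAMIC_DRAW);

cl_int err;
cl_mem clVertices = clCreateFromGLBuffer(ctx,
                        CL_MEM_READ_WRITE, vbo, &err);
```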
Now, if we look under the hood, what we've really done
by sharing the vertex buffer object between OpenGL
and OpenCL is we've created a structure within the OpenGL
runtime and the OpenCL runtime, but those two structures,
the VBO and the cl_mem object, actually
refer to the same piece of driver storage.
This piece of driver storage, called a backing store,
is the actual memory that contains that vertex data,
or texture data, or render buffer data,
and is the piece of memory that is moved
between devices as we execute our program.
Now, if we execute a command either in GL, like a draw
command, or if we execute a kernel in CL, the runtime,
the driver, will take care of moving a cached copy of
that data to the device that the command is executed on.
So in this case, when I created that mem_object in OpenCL,
it's not as if I was allocating memory on a device.
I was really allocating memory inside of this driver storage
pool, and then if I execute a command using a device,
that memory is cached, in this
case in Device VRAM, on the GPU.
Now, if I execute a command in either API on a different
GPU, the runtime will take care of moving that data back
to the host and then to the other device.
Now, the CPU is a little bit different
in this respect: each GPU has its own
piece of device VRAM, but in OpenCL, the CPU device
shares its memory with driver storage.
So, when I execute a command on the CPU in
OpenCL, this copy operation doesn't occur.
The CL kernel is able to use the
same backing store as the runtime.
Now, the operations we've looked at
so far, and the example in the demo,
involve vertex buffer objects or pixel buffer objects.
But, there are three other functions we can use
to create shared data between OpenGL and OpenCL.
The first two functions involve textures; 2D and 3D textures.
And, the third object allows us to manipulate
render buffers, which was the structure we used
for that post-process example that we saw earlier.
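Those three entry points, sketched with the OpenCL 1.0
signatures; the texture and renderbuffer names are assumed to
come from prior GL set-up, and error checking is omitted:

```c
/* Sketch: the other three GL-sharing creation calls. */
cl_mem tex2d = clCreateFromGLTexture2D(ctx, CL_MEM_READ_ONLY,
                   GL_TEXTURE_2D, 0, texture2DName, &err);
cl_mem tex3d = clCreateFromGLTexture3D(ctx, CL_MEM_READ_ONLY,
                   GL_TEXTURE_3D, 0, texture3DName, &err);
cl_mem rbuf  = clCreateFromGLRenderbuffer(ctx,
                   CL_MEM_WRITE_ONLY, renderbufferName, &err);
```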
After these structures are created in OpenGL and then
imported into OpenCL, we can do a lot of things with them.
In either API we can modify their contents,
execute commands that use them; but,
one thing we can't do is we can't modify their properties.
So, after an image is imported from OpenGL into OpenCL,
we can't change the width and height of the image.
We can't change other properties of the image.
We'd have to create a new copy in GL and then move it
back into CL in order to make those sorts of changes.
OK, now that we've created the objects and imported it into
OpenCL, as we're launching commands and executing commands
in either API, we have to use a specific
type of semantic, called Flush and Acquire,
in order to coordinate access between the two APIs.
Let's take a look at the standard queueing
system that's used by OpenGL by itself.
As I execute commands, they're enqueued in a
command queue on the host, and at a certain point,
those commands are flushed to the device and executed.
Now, the order of commands as I call functions
is sort of maintained by that command queue,
and those commands will be executed in the
same order when they get to the device.
And, since I have a single command queue,
there's no chance of commands being executed
in an order other than the one I've specified.
However, if I add the OpenCL command queue alongside
that OpenGL command queue, and then execute a bunch
of OpenGL commands in my thread, and then a bunch of OpenCL
commands in my thread, without any explicit synchronization,
I don't have any control over the order that
those commands get submitted to the device.
So, it's very possible that even though I sent the
two OpenGL functions first, they might be submitted
to the device by the runtime in an interleaved order.
Now, if there are data dependencies between OpenGL and
OpenCL such that I had to execute all of those GL operations
because I was producing that render buffer
in OpenGL before I consumed it in OpenCL,
this type of unsynchronized execution would cause a problem.
Therefore, before we move data, or before we move commands
between those two APIs and execute work on one side
or another, we have to make sure to flush the GL side
and then that flush operation sends those commands
out to the device, and then acquire those shared
objects on the CL side, then execute our CL commands.
Once we then call clEnqueueReleaseGLObjects, our CL commands
will be flushed to the device after the GL commands.
This explicit type of Flush and Acquire semantic
ensures that those GL commands are in progress
on the device before the OpenCL commands
have a chance to be submitted to the device,
and the order between the two APIs will be maintained.
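In code, the flush-and-acquire pattern looks roughly like
this, for the case where GL produces data that a CL kernel
consumes (`clBuf` is a shared object from earlier):

```c
/* Sketch: coordinate GL and CL access to a shared object. */
glFlush();                      /* push GL work to the device */

clEnqueueAcquireGLObjects(queue, 1, &clBuf, 0, NULL, NULL);

/* CL work that reads or writes the shared object. */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                       0, NULL, NULL);

clEnqueueReleaseGLObjects(queue, 1, &clBuf, 0, NULL, NULL);
clFlush(queue);                 /* let GL consume the result */
```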
OK, now that we've gone over how to create the ShareGroup,
move the GLShareGroup into OpenCL, then create data objects
and import those data objects,
and then safely execute commands,
the very last step is how to safely clean up this system.
The key here is that we always release objects
in the opposite order that we created them.
We always release the objects in
OpenCL, and then destroy them in OpenGL.
This ensures that the OpenGL driver won't take
objects out from underneath the OpenCL implementation.
Now, as you might be aware, OpenCL automatically takes
care of retaining objects for you, so if you pass a kernel
into the runtime, or a data object into the runtime,
the runtime will make sure that the reference count
of that object reflects that the runtime
is holding onto a pointer to it.
So, it's necessary to make sure that after
you've enqueued commands that use memory objects,
those commands have been executed and have
been completed before OpenGL has an opportunity
or might accidentally delete or destroy an object.
So, the objects have to be completely released
by OpenCL before they can be destroyed by OpenGL.
And, releasing a mem_object is simple.
Essentially, there's one function, regardless
of whether the mem_object was created
on the GL side, whether it's a buffer or an image.
We simply call clReleaseMemObject.
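The tear-down, as a sketch, using the objects from the earlier
examples; a clFinish (or equivalent event wait) makes sure no
enqueued command still uses the object:

```c
/* Sketch: release in CL first, then destroy in GL. */
clFinish(queue);              /* no command still uses clBuf */
clReleaseMemObject(clVertices); /* release the CL side first */
glDeleteBuffers(1, &vbo);     /* now safe to destroy in GL */
```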
OK, now I'd like to show you an example; a
live demo of the example I showed earlier.
This is the case where a vertex buffer object
is created in OpenCL and shared with OpenGL,
and then a GL framebuffer object, a GL FBO, is created and
then shared with OpenCL to perform some post-processing.
In this example, OpenCL is computing the physics
interaction between the spheres
that are bouncing around the screen, and
then OpenGL is rendering the refraction
and the reflection on each individual sphere.
As OpenGL renders that effect, it also produces a
buffer which contains the surface normal and position
on the surface of various fragments for the spheres.
That buffer is passed back to OpenCL
using a shared render buffer.
OpenCL then performs some
photon tracing to compute the caustic effect
that you see towards the bottom of the screen.
This is a simple example.
The physics that's computed in OpenCL isn't particularly
sophisticated, but this allows us to perform all
of the computation on the GPU, instead of having to
coordinate between the CPU and the GPU for each frame.
So, both the physical simulation and the rendering of the
spheres can happen on the GPU, and then the photon tracing
and the rendering of the caustic
highlight can also occur on the GPU.
OK, so the three steps for that demo
were simply to update vertex positions,
perform the photon trace, and then render the scene.
And, by using the sharing API, it was very easy to
perform all of those operations on a single device.
If we had some type of application where we wanted
to perform, say updating the vertex positions
or rendering the photons on the CPU or on another
device, the OpenCL runtime would have allowed us
to automatically move the data back
and forth between those devices.
And so, even if we're not running applications that do all
of their work on the GPU, it's still possible to use OpenCL
and the sharing mechanism to handle
moving data between the two devices.
OK, so five easy steps to share
data between OpenGL and OpenCL.
The first step is to make sure that we select our pixel
format and the devices that we're going to use for OpenCL,
using that CGLPixelFormat function, and using that
pixel format to create our CGLContext and ShareGroup.
We passed that ShareGroup to clCreateContext
in our second step,
to produce and initialize a CLContext
containing those devices.
Then we create objects in GL, import them into CL.
We use a glFlush-then-acquire pattern to handle
coordination of commands between the two APIs,
and then lastly, when we're done, we
release in CL before destroying in GL.
Now, this concludes the CL session.
For more information, you should contact
Allan Schaffer, who's our Evangelist,
and of course, look at the Apple developer forums.
There's a CL Dev forum and also a GL Dev
forum that are great ways of getting in touch.
There are a number of other sessions this
week that will address OpenCL and OpenGL.
Immediately following this session in this room, we'll
hear from a number of vendors that will describe how
to maximize OpenCL performance for different devices.
Tomorrow there's a session on OpenGL for the Mac,
and then later in the day tomorrow is a session
that will describe multi-GPU programming,
both with OpenGL and OpenCL.