WWDC2010 Session 418

Transcript

>> Hello, everyone.
Welcome to the session on maximizing OpenCL performance.
In the previous session, we talked about an overview of
OpenCL, how to use it, and sharing between CL and GL.
In this session, we're going to talk about tuning
your kernels to get the best possible performance
on AMD and NVIDIA GPUs and Intel CPUs.
First up is a presentation by Benedict Gaster of AMD.
Ben is going to talk about doing game
physics with OpenCL, in particular,
he's going to talk about doing cloth simulation and
Bullet Physics using OpenCL and tuning for AMD GPUs.
So please welcome -- Ben.
>> Thank you.
[ Applause ]
>> Benedict Gaster: Okay, so as Aaftab said I'm
going to talk about OpenCL and accelerating physics.
And when I talk about physics here,
we're talking about games physics.
I'm not trying to solve the world's problems looking
where the next black hole is or something like that.
I'm going to look at games physics.
So what does that mean?
So that means there are a number of APIs out there
and the one in particular that we've been focusing
on is an API called Bullet, which is primarily developed
and architected by a guy called Erwin Coumans at Sony.
And it's used in many third-party games.
You know, and also films and so forth.
It has a Zlib license, so it's completely -- you can do
whatever you like with it, it's completely open source.
And they're collaborating on accelerating it.
Today the implementation is purely CPU and there's
been some experiments with SSE and some other things.
But it's not really an accelerated API.
So what we're looking to do is
how can we give it a real boost.
So we looked at cloth, and I'm going
to talk about that today, or soft body.
So you can think of if you've got your little yellow
duck, my kids are always playing with that in the bath,
and you're going to squeeze it, how do you represent that,
it's kind of a soft body, rather than something like a rock.
A cloth is another form.
And the key thing is that it's fully open source.
So all the contributions, all the work that
we do in OpenCL is going to go back into that,
and you'll be able to pick it up
and run it on your Mac platforms.
So what is Bullet in OpenCL, what does it do?
So in the big picture here, on the
left is something I'm not going to talk about today,
but it's rigid bodies; they're very important in physics.
They're basically little rocks or whatever.
Here we have a pachinko machine, and 5,000 little
things bouncing around hitting off each other,
hitting off the pegs, bouncing off the back.
And all being accelerated in OpenCL.
This all runs in real time.
Everything I'm going to show you today is in real time.
You know, that's the key thing about games.
It may not be exact physics, but it
better give a good response to the user
because otherwise they're not going
to want to play the game and so forth.
Another thing, fluids is becoming more and more popular,
you know, maybe you want to do a river or we have some water
that splashes down on top of the character in the game.
And at the bottom here the stuff that I'm going to talk
about today is the cloth simulation, and so everything --
all the flags, you can see all the posts, every single
one of those are cloth -- real cloth being simulated --
not real cloth, but you know, game
physics cloth being simulated in OpenCL.
And the little guy if he was dancing around in the middle
there, you'd see that he'd have a cape out the back there,
and that would all be simulated as a cloth
simulation, all done or running in real time.
So what is cloth simulation, what does it mean.
Really, it's fairly straight forward.
Basically you have some masses, so
basically think of them as particles
and they have some form of mass,
and you have some connections.
You could have all those springs and
things, and they could all be separated.
But that's not how fabric works, you know?
You grab it and it doesn't all fall apart and fall to bits.
So you need some sort of things connecting them together.
So basically you have what's called a mass spring system.
Turns out there's lots of these masses, really.
If you want to simulate a piece of cloth and make
it look real, you want lots of these particles.
That's really good for parallelism, because lots
of things that you can operate on gives you an idea
that maybe you can do something -- if there's
only two of them, probably not very good to do.
So that's the first thing you want to do when
you're thinking about parallelizing them,
you want to make sure that there's lots of work to be done.
So we need to connect them together with some
constraints, and this is what the springs do.
And I show you here you can see that there are three
types of springs that we are considering, at least.
Maybe there are others.
But the main ones -- so if you see the straight
lines around the outside, they're not in the inside,
these are what we call structural
springs, and they kind of provide --
give a body, give a kind of shape to the
thing so it all doesn't collapse down.
And then you want to be able to at some point,
you're going to want to be able to rip that cloth.
You're going to need to be able to rip that fabric.
So you're going to want to have shearing.
These are the diagonal ones that go across.
And then of course a soft body wouldn't be a soft
body if it didn't bend and deform and things.
So you have these bending constraints
that are all connected together.
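As a rough illustration of those three constraint types, here is a small Python sketch (not Bullet's actual code; the grid layout and the function name are assumptions for illustration) that enumerates structural, shear, and bend springs for an m-by-n grid of particles:

```python
# Illustrative sketch: enumerate the three spring types for an
# m x n grid of cloth particles. idx(r, c) flattens a grid coordinate.

def build_springs(m, n):
    idx = lambda r, c: r * n + c
    structural, shear, bend = [], [], []
    for r in range(m):
        for c in range(n):
            # Structural: immediate horizontal/vertical neighbours,
            # which give the cloth its basic shape.
            if c + 1 < n: structural.append((idx(r, c), idx(r, c + 1)))
            if r + 1 < m: structural.append((idx(r, c), idx(r + 1, c)))
            # Shear: the diagonals that cross each grid cell.
            if r + 1 < m and c + 1 < n:
                shear.append((idx(r, c), idx(r + 1, c + 1)))
                shear.append((idx(r, c + 1), idx(r + 1, c)))
            # Bend: skip-one neighbours, which resist folding.
            if c + 2 < n: bend.append((idx(r, c), idx(r, c + 2)))
            if r + 2 < m: bend.append((idx(r, c), idx(r + 2, c)))
    return structural, shear, bend
```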
So obviously we want to start, you know, want to be able
to parallelize this, we want to be able to do something.
And so what we need -- we have lots of these particles.
And what we do is very straightforward -- we
have lots of them, so it's good for parallel programming.
And what we need to do is basically, at each step
through of the simulation, we generate some forces
and apply them to the different connections --
the different springs, so they'll have
slightly different characteristics.
But basically it's a set of rules.
You calculate some impulses, apply them to
the masses, and calculate the new positions.
You know, and the key thing here, if you imagine
that there's lots of these things connected together
and we could do these somewhat in parallel.
It's not quite as simple as that, because
the fact is if I have one little bit of cloth
and there's another bit connected, if I pull it
this way, this one's going to go that way as well.
So we need to have some form of
locality, connecting things together.
So parallelism is not complete.
You know, it's not -- everything is not done individually.
So one of the things we wanted to look at is that you
know, cloth could interact with rigid bodies, for example,
maybe a throw on a chair or a flag, or
you'll drop a piece of cloth on to a bottle,
so for example things like that, pour some water on to it.
But another thing very much independent of all those things
is you might want to have some flags flying in the wind.
That's an obvious force.
It turns out this can be parallelized really easily and really
nicely, because this is computed per vertex on the cloth.
You know, that's how we're representing
each particle is in some sense a vertex.
Because all you really want to do is describe a
mesh that you're then going to render at some point.
That's the end goal is to render
something that looks like a flag.
Doesn't really need to be a flag.
And so here we can compute it per vertex -- an excellent
parallel computation -- from the force and density of the air medium,
against the mass and the normal of the vertex.
You know, so it's a very simple physics calculation.
There's no rocket science here.
What does it look like if we were
going to do this on the CPU.
So this would be the Bullet implementation today,
other physics implementations would take a very
similar approach if you're doing it on the CPU.
Well, you do an iterative integration --
verlet is a common one to use -- over the vertex positions,
and for each spring you compute a force,
update the vertices with a new position,
and repeat N times, where N is basically configurable.
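The loop just described can be sketched in Python (an illustrative toy position-based version with made-up parameters, not Bullet's actual solver):

```python
# Illustrative sketch of the sequential CPU loop: Verlet-integrate
# the particles, then repeatedly relax each spring toward its rest
# length, updating positions in place so each result feeds the next.

def step(pos, prev, springs, rest, iterations=10, gravity=-0.01):
    # Verlet integration: new = 2*p - prev + a*dt^2 (dt folded into gravity).
    for i in range(len(pos)):
        x, y = pos[i]; px, py = prev[i]
        prev[i] = (x, y)
        pos[i] = (2 * x - px, 2 * y - py + gravity)
    # Constraint relaxation, N times; entirely sequential.
    for _ in range(iterations):
        for (a, b), r in zip(springs, rest):
            (ax, ay), (bx, by) = pos[a], pos[b]
            dx, dy = bx - ax, by - ay
            d = (dx * dx + dy * dy) ** 0.5 or 1e-9
            k = 0.5 * (d - r) / d          # move each end half the error
            pos[a] = (ax + k * dx, ay + k * dy)
            pos[b] = (bx - k * dx, by - k * dy)
    return pos
```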
The idea here is, you know, if you integrate
just once, it's not accurate enough.
So you want to run the simulation -- it turns out
in physics, game physics at least, around 10 iterations --
I couldn't tell you exactly why -- helps it converge
to a reasonable point that gives you a nice simulation.
And that turns out not just to be true for cloth, it
turns out to be true for fluid and also for rigid bodies.
I think ten is maybe just enough convergence
there that gives you the simulation without running
so many CPU cycles that you never get an answer.
So, the critical thing here is that I'm -- you know, this
kind of integration is great, but you're updating everything
in place, you're going from -- say you got the results
from one iteration being fed back into the next iteration.
This is a very sequential algorithm.
Everything is updated in place, and each
result feeds immediately into the next.
But it's not a trivial thing to parallelize
if you try to just take this as is.
So here you can see what this would
look like for each iteration.
So this iteration count is the N that
we talked about in the integrate.
You simply walk through the links and do the integration.
Really very much straight out of a physics book.
It's nothing to be thought about.
So, moving to parallel execution --
and this is a critical thing --
I think the CPU implementation has no atomicity issues.
You know, there -- it doesn't matter
that you're updating everything in place.
You know, I might want to update a single spring,
a single mass may be affected by two links,
and I could just do one and then I could do the next.
And I could update the same memory
location; none of that matters,
because I'm implementing it sequentially,
there's no race conditions.
You know, that's the key thing.
The trouble is, I actually want to do these
updates in parallel when using the GPU.
So this introduces races.
So the critical thing is I can't let you do those
updates in parallel, otherwise there would be a race.
But of course, if it's a node
over here that really doesn't affect the one over here,
I could probably update those in parallel.
So what does that mean?
That means we're going to introduce a notion of batching.
And again, this is not something that I've invented or
anything, it's a kind of well-known technique that's used
in physics simulations and generally
in parallel programming as a whole.
And the idea is we're going to
go through using graph coloring.
We're going to color the nodes.
And so here we're going to -- color the links, sorry.
And the key thing is that each batch,
which is the links of the same color,
I can execute in parallel.
There are no conflicts there.
But between ones where they're different colors,
that means I have to synchronize between them.
So what that means is I'll execute one batch, do as much
as I can in parallel in that batch, execute the next batch,
do as much as that in parallel, and
so I am serializing the algorithm.
But I'm only doing it at certain points.
And you can think of these as kernel
launches in your CL program, for example,
because that's inherently a sequential serialization point
because you're coming back from the parallel to device
to a single -- you know, a thread in your CPU or whatever.
You know, your host program, depending
on how you've designed it.
And so this technique is known as batching.
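The batching idea can be sketched with a simple greedy coloring in Python (an illustrative sketch, not the actual Bullet implementation; the function name is made up):

```python
# Sketch of link batching via greedy graph coloring: assign each link
# the first batch (colour) whose links share no particle with it.
# Links inside one batch touch disjoint particles, so they can run in
# parallel; batch boundaries become synchronization points (e.g. one
# kernel launch per batch).

def batch_links(links):
    batches = []          # each batch: (list of links, set of used nodes)
    for a, b in links:
        for linkset, used in batches:
            if a not in used and b not in used:
                linkset.append((a, b))
                used.update((a, b))
                break
        else:
            # Every existing batch conflicts; open a new one.
            batches.append(([(a, b)], {a, b}))
    return [linkset for linkset, _ in batches]
```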
So basically, deriving batches, we need to be able to
build up the batches, we go -- for each iteration --
again, we're going to have to do this for each iteration.
We're going to step through number -- the size of the
batches, and then we find the first and the second,
this is a kind of structure that we've got.
We pick up the start and the number.
So the start is an offset into the batch that we're going
to do so we can parallelize it on multiple GPUs if we want to.
This implementation that I'll show you in the
demo today is not parallelized multiple devices,
but it's designed such that it could be.
Okay, so the code I'm going to show you now --
this is it, this is all the OpenCL code
for the solver; there are other kernels and
everything, and this is the host-side code.
I'm using the C++ API rather than the standard C API.
So that's why it might look a little different.
But all the names should be fairly straightforward.
But the key thing here is I get the kernel
itself, and get the command queue that I want to run it on.
We've abstracted that out so it can be an arbitrary device.
I set some kernel arguments and then I
enqueue the kernel and execute it.
And if we go to the solver on the next slide, you can see --
if you check back on the previous slide --
that my arguments correspond exactly to the
ones that I set there. And of course, because
in OpenCL, the global work size must
be a multiple of the local work size.
We check -- we sometimes have to do some rounding.
So we check that we can execute
inside that with that little thing.
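The rounding being described is simple enough to sketch (an illustrative Python helper, not the actual Bullet host code):

```python
# OpenCL requires the global work size to be a multiple of the local
# work size, so the host rounds the work-item count up; the kernel
# then guards with something like: if (get_global_id(0) < num_links).

def round_up(global_size, local_size):
    r = global_size % local_size
    return global_size if r == 0 else global_size + local_size - r
```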
So the number of links is the actual count
that we're parallelizing over.
The solver itself is almost identical
to the CPU code, except we're working
on the global GPU buffers and everything like that.
But it's basically just the integration step.
You know, we read the vertexes, the masses,
and then we compute the new positions.
There really isn't that much more to it.
Okay, so what does that mean?
That's pretty good, that turned out to be pretty good.
We can already start to see a pretty
good performance speed up over the GPU.
However, I talked about the batching.
And originally I've shown you here just nine batches.
That's what we did from our original coloring.
That's great.
Except it turns out that the batches are quite small.
You know, and what I mean by that,
there's not that much work in the batches.
So there's not high density per thread of work.
So we execute all this and notice not much work.
So we're not really efficiently using the GPU.
So the idea is, can we keep increasing the size
of the batches so that we can get more done?
and basically we reduce -- not only do we create
larger batches but of course we reduce the number
of serialization points in the program as well.
Because that's what different batches introduce.
So you can see here, it actually turns out that if
we're very careful we can get that down to four batches.
And we've doubled the amount of work that's going on.
We're not now completely -- you know,
again, it's not real physics here.
We've made these observations, we've done some analysis,
and realized that we can use these techniques to do that.
And in particular, the way that we get
around this is that we remove determinism.
So what we were saying before, only wanted to
update a single link at one time, you know a node.
And of course that causes serialization bottleneck.
It turns out that if you're careful
and replicate per-vertex data --
that means recomputing some data at the point of execution,
so you may compute the same value twice -- then
actually you can update links in parallel.
Now of course we've got lots of
ALUs here to use,
so recomputing work -- what's wrong with that?
If you're memory-bound then using extra ALUs
to recompute the value is not a big issue.
So it turns out that this is how
we can start to remove determinism.
Something that we're looking at moving forward -- and this
is not a proven point yet; we've implemented it and it seems
to work, but we haven't mathematically
verified it's correct yet --
is that we can remove determinism
altogether and have a single batch.
And the issue seems to be that it's slower to converge.
And so the N that I talked about earlier
is now not the right value for that.
The problem is if we increased N to 20 rather than
10, that would double the amount of work we're doing.
So it's kind of a balance of is that going
to give us back still better performance,
because we're more effectively using the GPU.
And that's an interesting thing, and
we don't have the answer to that today.
But you can see these are the kind of things that
you really want to think about when building for a GPU.
Okay, so that's kind of it; I've given a number of points
and a number
of techniques for parallelizing for the GPU.
They really work well on our GPUs.
Actually, these work well on any kind
of parallel problem we kind of solve,
and these are really techniques
that you could use for OpenCL.
One thing that's really important,
I think, with parallel programming
and particularly with GPUs, is these are SIMD machines.
So what that means is they're very wide vector machines.
You know, you are executing a single
instruction many times for the same, you know,
for the same vector but different elements.
That means if your threads diverge -- so on
AMD, say, it's a 64-wide vector machine --
if only one thread is executing out
of that 64, you're using 1/64th of the GPU.
You can easily do that by saying if get_global_id equals
1, for example; then you've restricted it to just that thread.
Or if the group ID equals 1, then you're just
restricting it to that work-group and that SIMD.
But it's very easy to do that.
There are some techniques that we'd used in
the cloth simulations itself to reduce this.
It turns out that if we're willing to
not allow an arbitrarily complex mesh --
so what I might mean about that is I have a piece of cloth
that's kind of wrapped up and rolled over and kind of coming
through itself -- if we say actually that's
not that common in games, you know,
we can reduce the mesh to say a nice kind of lace cloth
we'll have flapping in the sky like the flags I'll show you
in a minute, then actually we can make them regular and
we know the shape, we can preprocess them statically,
and we know that we don't have any
divergence when we're generating that.
So techniques like that, you know, you're not solving
the whole world's problems, you're not doing everything,
you're saying what is the problem that
I want to address in this application.
Then you can make trade-offs that
allow you to get, you know,
reduced divergence, better throughput, better efficiency.
I think that's very important,
I think in parallel programming.
You know, it's very difficult to solve the general problem.
If you can reduce and restrict your problem
a little bit, you're much more likely
to get the performance speed-ups that you hoped to get.
Okay, hopefully I can switch to -- okay, that window.
So this is what you just saw: those two flags.
One of them is very far out, flying.
And so this is running purely on the GPU.
And just two flags on here at the
moment, it's completely being simulated.
Just a very simple GL renderer.
And I can simply keep pressing
here and just keep adding flags.
At this point I think I've got 10
flags, you can't see all of them.
But you can spin around and see some are very close up.
And you can see that we haven't seen
any performance degradation at all.
And I've just added these ten flags, and
each one is just adding load to OpenCL.
And actually if you were to look at this application
you'd see that the CPU is basically doing hardly anything.
Turns out there's no bottlenecks on here.
The only main thing is, is that because I knocked this
out very quickly and didn't have a lot of time to spend
on the rendering I'm actually copying the data --
I'm not using the GL sharing that was
talked about in the previous talk.
And if I did that I could probably just keep
adding flags, maybe I have 50 or 60 flags.
And again, not only would the GPU be
running, these would be running in real time.
You actually wouldn't see any of the
CPU performance going up hardly at all.
So you can off load that and do this work.
And yeah, okay, so that's it.
Now I'm going to hand over to Intel for their presentation.
[ Applause ]
>> Vinay Awasthi: Hello, today I am going to talk
about how to develop complex kernels for Intel CPUs.
So jumping right into it -- and the purpose of this
talk is that we want to develop kernels which are used
for solving complex problems such as solvers,
or dynamic programming related issues and problems.
And how do we utilize multiple
cores, SIMD instructions, and caches.
So first let's go through a simple dynamic
programming problem called longest common subsequence.
And this problem is essentially pretty simple --
you have two strands of DNA and you
want to find out how similar they are.
And the way you solve it is you create a table and you align
these strands, top and left side, and then you fill in zeros
so that you can process from the
corners, going from left to right.
And the code that is used to analyze this is on
the right-hand side: wherever the data matches,
you pick up the element that is diagonal
to the element that is being processed, and add one to it.
And if it does not match, then you take the maximum of the
cells that are on top of it and on the left of it.
So it's something like this.
So you're processing let's say first row.
And you went through and A and
A matched for last two columns.
So you picked up 0 and added 1 to it.
And if it doesn't match, then you
take the maximum of the top or left;
whichever is the maximum, you
fill into that particular cell.
And you kind of just go in and fill in this
particular table all the way to the bottom.
And at the bottom right corner, the number
that you end up with is your edit distance.
That is essentially telling you how
many edits -- that is, insertions and deletions --
you need in order to transform one strand to another.
And the higher the distance, that shows
that these are not very similar DNAs.
And usually scientists, or the NIH, tend to process
hundreds of thousands of these long DNA strands.
And essentially, you want to process
them as fast as possible,
and sometimes you just want to know the edit distance.
At other times you want to know what
is the longest common subsequence.
So in this case, what you need to do
is you have to traverse backwards.
And the way you do it, is essentially using the same
comparison algorithm and then wherever diagonals are,
that's where you have your common element.
So in this case, it's GCTA.
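The table fill and backtrace just described can be sketched in Python (an illustrative sketch of the textbook algorithm, not the speaker's OpenCL kernels):

```python
# Sketch of the longest-common-subsequence table and traceback.
# dp[i][j] holds the LCS length of prefixes a[:i] and b[:j]; the
# traceback follows the diagonal (match) moves to recover the
# subsequence itself.

def lcs(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1          # diagonal + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # top or left
    # Traverse backwards: diagonals mark the common elements.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return ''.join(reversed(out))
```

The value dp[m][n] in the bottom-right corner is the LCS length the edit distance is derived from.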
So why is this interesting? Dynamic programming is
used to solve hard problems, NP-complete problems.
A lot of times it's used in SAT solvers, or areas where
you want to solve combinatorial optimization problems.
And they're very different than divide and conquer.
Because in divide and conquer you get almost
half the size of the problem when you divide it.
In the case of dynamic programming, your subproblem is not much
smaller than the original problem that you started out with.
And in this case, as I said earlier we're
going to cover vector data type optimizations,
cache optimizations, and multicore optimizations.
So jumping right into vector data type optimizations.
So say this is your table that you want to work with.
So as I showed you earlier, in case of longest common
subsequence, you kind of set up your requirements or --
you need diagonal element, you need top element,
and left element in order to start processing.
So we just fill in the number first.
So now we are ready to process at least the top left corner.
And usually, if you look at any
programming text such as CLRS or any one
of those standard algorithm texts,
this is how they describe it.
You can go top left corner and process
this row by row, one row at a time.
Kind of like this.
But if you look a little bit deeply, you would notice that
after the first element is filled in, you can go diagonally.
You can process the next two in this way.
Then you go next three in this form.
And four, and then you can have any number of elements
being processed in parallel all the way up to the end.
So it is up to you how you pick up your set.
And then after that you kind of
process all the elements in parallel.
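The anti-diagonal ordering can be sketched like this (a serial Python model of the parallel schedule; the function name is made up for illustration):

```python
# Sketch of the wavefront/anti-diagonal schedule: every cell on the
# same anti-diagonal (i + j constant) depends only on cells from
# earlier diagonals, so all cells of one diagonal can be computed in
# parallel. Here the inner loop stands in for the parallel workers.

def lcs_length_wavefront(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(2, m + n + 1):                    # d = i + j
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```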
But there is more to it.
You can also shift your rows to your right.
And once you do that, those two elements will line up.
And if you organize your table in such
a way that they line up horizontally,
then in that case you have cache locality working for you.
You do not have to have this if statement; the
max part essentially becomes a masked store.
So you set up your masks based on the
matches and then you do a straight store.
And that will write whatever element needs to be written.
You will be doing two stores: one with the mask that
matches, and another with the mask that does not match.
And you can fill in the max of the elements that you have.
Okay, so how do we go about running it more efficiently?
So as you noticed, your thread counts were coming out to be one,
two, three, till you hit the full length.
And then the same way you end up with the
problem at the end, where it's three, two, one.
problem at the end wherein it's three, two, one.
So you do not want to have if conditions in your code.
So what you need to do is pad it more.
So on both sides you pad it, now you can process four
elements starting from the beginning all the way to the end.
And this will run through all the way.
And you will not have just four rows.
You can have 8 rows, 16 rows, as many as you like.
Essentially you are using a lot of vector data types to
process as many elements as you can process in parallel.
Okay, so this also allows you to organize your
problem in such a way so that you can take advantage
of cache locality and mass loads and stores.
And you can also do streaming stores.
Because once you have your data organized,
you don't need to have a regular store;
you can do write combining
of stores using streaming stores.
Anyway, so how would you go about
multicore or multithread execution?
So as you saw earlier, for any set to be processed
you need the top row
and the left column available so that you can process it.
And the number of threads that execute changes.
So first you can see, you can go with 7.
Then you kind of traverse down
diagonally as I showed earlier.
So next time you can just run one thread, then
two, then three, and the same way you kind of exit.
Now let's look at it in context.
This is our group of elements that we want to process.
Each one is running SIMD instruction, they
are processing multiple elements in each set.
And these groups, we can process them in
this fashion in parallel in the first shot.
After that, we can go to the next one.
Then you have to wait till that next one is over, then
you process next 2, next 3, next 2, and then last one.
And that's how you will exit.
And there are other schemes to do the same thing.
And that allows you to take advantage
of the cache hierarchy.
So you know how much of a work set can fit in your caches.
And you divide your work in such a way so that the
problem that you're trying to work on fits in the cache.
And these approaches are called cache-oblivious algorithms.
There's a lot of research going on in academic areas
where they're trying to find ways to fine-tune algorithms
in such a way that they can be cache oblivious.
In this case, you kind of divide your
problem and do execution in this order.
It's important to go from one set to the other to the
right so that you are again taking advantage of the caches.
If you go columnwise, then you lose the
prefetching and cache-related advantages
that you get if you go 0, 1, 2, 3, in that fashion.
Okay, so what is the performance gain
that you get if you go and do all this work?
When I started out, I was getting, let's say,
a baseline of 1; and then as I added all
of this, OpenCL execution went up to almost 7 times.
And then I tried to do it using threading and hand-tune SSE.
So that also changed quite a bit based
on the threaded model that you picked
and the best numbers that I got was using GCD and SSE.
And that was about 7 to 10% better than OpenCL.
So as you can see OpenCL code approach pretty close
to the code that you will get using GCD and SSE.
Okay, so what are the lessons learned.
One is you have to issue large work items,
around 10,000 to 200,000 instructions.
Utilize map and unmap on buffers.
And the advantage of using that is they just
lock the memory, there's no copying happening.
You just map and then you get your data, and then you unmap.
Use CL_MEM_ALLOC_HOST_PTR so
that the framework is allocating memory for you.
And it is aligning memory for you.
You don't have to worry about alignments
and things like that.
But if you do have a need for it, then you use
CL_MEM_USE_HOST_PTR and you align the memory the way you want it.
And try to avoid, if you can, enqueueing buffers, waiting for
barriers, or atomics, because they will tend to slow
down your performance, since each kernel
is executing hundreds of thousands of times.
Now, other areas to consider: there is no
added advantage to using image-related functions on the CPU,
because there is no specialized hardware
optimizing image processing in CPUs.
And you also do not want to do a lot of
edge condition processing in your kernels.
You want to process it either earlier, or
you want to pad your dataset in such a way
that the whole thing can be processed.
Use 1D range because then you can process
everything in one shot, go linearly.
Okay, so here are a few kernels; we are
going to see how we should go about fixing them.
So one pattern you will
see a lot in sample code:
it tends to compute the work-group size using
this kind of pattern, wherein it finds the global_id,
knows what the data size is, and divides --
and this kind of pattern is doing per-item work repeatedly.
And you want to focus in your kernel in such a way so
that you can eliminate as many instructions as possible
when you're trying to write it,
so it is as small as possible,
because that particular code is
going to get executed a lot of times.
So in this case, what you could do is you can either set
up a constant or pass that as a parameter to your kernel.
Now here, you will see a lot more of morphing
algorithms and lot of other image-processing algorithms.
They tend to find -- they tend to work towards --
okay, here's my pixel and here's the
line that I want to morph towards.
And I need to find my distance so that I know the
weight, let's say [Inaudible] algorithm tend to do it,
and there are a lot of other image-related
processing algorithms that do it.
And general code is something like this, you
just kind of do A squared plus B squared and do square root.
So try to use built in functions because
they're already optimized for you.
So in this case, you will use [Inaudible].
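The pattern being described (the inaudible built-in is presumably something like hypot; this is a Python sketch of the idea, not the OpenCL kernel itself) is to replace the hand-written square root with the platform's built-in:

```python
import math

# Sketch: replace a hand-written sqrt(a*a + b*b) distance with the
# built-in hypot(), which the platform has already optimized (and
# which also avoids overflow for large inputs). OpenCL likewise
# provides hypot() as a built-in function.

def dist_naive(ax, ay, bx, by):
    dx, dy = bx - ax, by - ay
    return math.sqrt(dx * dx + dy * dy)

def dist_builtin(ax, ay, bx, by):
    return math.hypot(bx - ax, by - ay)
```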
In this particular code, we are
using int to get the global thread ID.
And this particular code will have a problem
when you're running it on a 64-bit machine,
because there will be downcasting
happening from 64-bit to 32-bit.
You want to use size_t.
Here, what is happening is the exponent
that is being passed is stopping the compiler
from doing loop unrolling for you.
So what you want to do is either pass it as
a build-time configuration, wherein you can --
when you're running your configuration during the build
time you can specify it, or specify it as a constant.
In this particular code, you will
see that float2s are being used.
And what you would like to do is use the
full vector length because Intel CPUs do best
when the full lengths are being utilized.
In this case you want to do twice the work using
float4, and again, don't use uint, use size_t.
And this is the way you should do more
work, use full-length [Inaudible] vectors
or vector data types to get the best performance.
Okay, so let's go with the demo.
You see, here my interest was in
using dynamic programming,
and not so much in the morphing itself.
My goal is to see how dynamic programming
could be used to morph one image
to the other dynamically, or without any user intervention.
And the way we do it is something like this.
So each picture is converted to
gray scale to find out its edges.
And after the edges are detected we do black run analysis.
So you find out how many black runs are
there and you try to do match, split,
or merge for each one of them, for all the pictures.
So think of it all the rows are being analyzed, and then
we are separating the separate black runs for each picture.
And then we are analyzing those black runs for the next
picture across the rows to find out where is the best match.
When we detect the best match, that gives us the
point that we want to transform towards in terms
of image to image, and then we go and execute.
So there's a lot of work that is happening.
Usually you would not do this much work for morphing.
You would just go and have users set
up the points, like [Inaudible]'s and other algorithms do.
There is some work that is involved
from the animator's side to allow you to do it.
But here we are just using dynamic programming.
And then what we also do, we are
creating about 60 frames in between.
And so image -- that juice box to the football,
there are about 60 images that are created.
And the same way that we are going from
soccer ball to bowling pin, again, 60 images.
And all of that is just done, so
let's see how they look.
So here is the transformation.
And then if we look at the images, you
can see how the transition is happening.
All the way.
Okay.
[ Applause ]
>> So here are the conclusions.
It's very easy to use and write.
I wrote so many kernels in the past few
weeks, and it was just straightforward.
You develop your code on the C
side and pretty much transfer it.
And if you're using intrinsics, there are
functions on the OpenCL side which directly map.
And performance is almost on par
with GCD and SSE performance.
If you're trying to use POSIX threads,
then you will see that there is a problem.
And GCD tends to outperform POSIX.
So those are the lessons that we saw.
It scales very well across cores, and effectively utilizes
SSE, and you can write portable code that will work
across devices, so you don't have
to worry about whether the same code
that you have will have problems
with the next generation of CPUs.
That will automatically be handled
behind the scenes for you.
And it works across device generations.
All right, with that I will ask
James to come over from NVIDIA.
>> James Fung: Thanks.
[ Applause ]
>> James Fung: So my name's --
sorry -- my name's James Fung,
I'll be talking about optimizing for GPUs using OpenCL.
So today we're going to talk about how to get the maximum
performance out of the GPU in your OpenCL program.
This is the outline of the talk.
We're going to start talking with
work group size heuristics.
How big should you make your work
groups, how much should you launch?
Then we're going to move on to
some instruction optimizations,
in particular we're going to talk about divergence.
So you're on a parallel machine.
Branching's fine.
But there's some things you want
to think about when you do branch.
Then we're going to move on to memory optimizations.
On a parallel processor keeping the data moving through
that processor and keeping it fed is really important.
And we're going to talk about two key memory
optimizations that let you get the maximum throughput.
In particular, coalescing and how to handle local memory.
And finally, we're going to talk about using the GPU texture
hardware; of course the GPU is a graphics processor in addition
to a compute processor now, but that graphics
functionality is still available to you in OpenCL
and there's some good features you can use.
And we're going to show a demo of optical
flow using some of the texturing hardware.
Now just one note.
In my talk, in OpenCL, as you'll be familiar with, they're
called work items and work groups.
In CUDA, both on the architecture and sort of in the
programming language, we call them threads and blocks.
I might be using the two interchangeably.
But you should know they're interchangeable.
So let's take a look at what's actually on the GPU.
So this is a picture of a GTX 285.
On the GPU, on this -- you have up to
240 thread processors available to you.
These are grouped together into 30 streaming
multiprocessors, with up to 4 gigabytes of RAM.
So on the entire card you have up to 1
teraflop of peak theoretical single-precision performance.
Now that is actually IEEE 754
precision, but it's not fully conformant
with the spec in some ways.
In addition to single precision,
you also have double precision.
And the card provides 87 gigaflops
of double precision to you.
Now in each of these SMs, these are these logical groupings
of thread processors, each SM holds 8 thread processors,
1 double-precision unit, and in addition has 16
kilobytes of local memory and 16K worth of registers.
So with so many processors available to you, it's
really important to be able to keep them all busy.
Now I think the note here is that your work items or your
threads are executed concurrently in warps of 32 threads.
So a warp is 32 threads, and they basically execute in lock step.
So however big your work group is, 32 at a time will
execute in a lock step in actual physical hardware.
Now obviously thread instructions are executed sequentially.
So executing other warps is the only way
to hide latency and keep the hardware busy.
What we mean by that is if one warp needs to do
a fetch from memory and is waiting on memory,
then if you have other warps available to
run on that SM it will schedule another warp
in that maybe can do some arithmetic
computation inside the ALUs.
And you'd like to provide the scheduler with as many warps
for it to choose from in order to keep the machine busy.
So we think of something here called occupancy.
And that's what we define as a number of resident warps
divided by the maximum possible number of resident warps.
Now obviously you'd like an occupancy of 1, which means that
you've put as many warps on that SM as you possibly can.
In practice of course, that's limited by resource
usage, in particular, registers and local memory.
Let's look at an example then.
So we have a hardware SM or shader unit.
And say the specs say we can support
up to eight work groups.
Each SM has 32 kilobytes total local memory
and can support up to 1024 work items maximum.
So that's the hardware.
Now the second point there, let's say we write our
kernel and we define work group size of 128 work items.
But we find inside our kernel we need -- for each work group
we decide we'd like to use 24 kilobytes of local memory.
Well, you've only got 32 kilobytes
of local memory on that SM.
That means that the -- it can only fit one of these work
groups on that SM because it's limited by the local memory.
It would need at least 48 kilobytes to fit two of those.
So what happens now only one work group
is going to be available on the SM.
And that's going to give you 128 threads on the SM.
And that's out of a maximum possible 1024 work items.
So you have low occupancy. Check your GPU
documentation for specific details on your hardware.
So what are some heuristics as to how
big you should make these work groups
and how many work groups you should launch.
Well firstly, the number of work groups should be
at least greater than your number of multiprocessors
so that every multiprocessor has at
least one work group to operate on.
Secondly, the number of work groups per
multiprocessor should be greater than say two,
so you have multiple work groups on each multiprocessor.
So multiple work groups can run concurrently on a
multiprocessor, and then if you have more than one,
work groups that aren't waiting at a barrier, say,
can be swapped in, and that keeps the hardware busy.
And again, that happens all transparently to you.
But as long as you load the machine
up it will give you good efficiency.
And again, subject to resource availability such as
registers or local memory, which is the example we just saw.
The total number of work groups you'd like to launch --
you want to launch on the order of hundreds.
And that's going to let you scale from
small devices across to larger devices.
These work groups are in pipeline
fashion, kind of batched out to the SMs.
If you can even launch thousands of work groups in a kernel,
that's going to let you scale across multiple generations.
The number of work items in a work group
generally should be a multiple of the warp size.
So again, a warp is executed physically
in lock step, and is 32 threads.
If you have some warp
that's only using like 15 or 2 or 3 threads,
then the other threads in the warp aren't used.
So you basically want to keep all the
threads in the warp active.
For control flow, the main thing you
want to be thinking about is divergence.
This happens when you branch.
So branching is fine.
But you get divergence when work items
in a single warp take different paths.
And what happens is different execution
paths then need to get serialized.
And what that means is the GPU, if you
branch within a warp, has to, say, evaluate the
if statement or the if part of the branch.
And then go back, evaluate the else part
of the branch before it can move on.
And so in our first example down
below is an example where that's going
to happen, is when you're branching within a warp.
So say you have some control flow
and you're examining your particular thread ID.
And in this case, thread 0, 1, and 2 would take the if
part, and maybe other ones would take the else part.
In that case, your branch granularity
is less than your warp size.
So the GPU has to evaluate the if part,
then the else part before it moves on.
But you can still have -- sorry -- you can
still have branching but avoid divergence.
If your branching granularity
is a whole multiple of your warp size.
So in the case below, we have the branch,
but it's occurring basically in whole warps.
So if warp 0, 1, or 2, say, take one part
of the if branch and then the other warps
in that work group take the other end of the
branch, the hardware will know that it doesn't need
to evaluate the other end of the
branch for that work group because --
sorry, for that warp because they've
all taken one leg instead.
Let's look at some examples.
So here we have a parallel reduction.
In this case all we're trying to
do is find the sum of a buffer.
And we'd like to use a parallel algorithm
in order to use all the processors.
So we could launch a kernel and
let's look at say one work item.
And here we have the values, say stored in local memory.
And we have the threads of the work group.
And each thread is going to add its neighbor.
But we don't need to have all the
threads active all the time.
But we iterate.
So basically each thread adds its neighbors, then goes
to the next iteration, each thread adds a partial sum,
and we proceed in that fashion
till finally, in the last step,
it's going to add two numbers and
come up with the sum of the buffer.
As you can see in this case we're
going to have a lot of divergence.
We have thread 0 doing one thing, thread 1 in this case
is not doing anything, thread 2 is doing something else.
So we have this interleaved type of workflow
and that's going to give us divergence.
So let's look at the same problem but solve
it in a way that doesn't need divergence.
So again we have our array in local memory.
Now we know each thread needs to
add one neighbor or one partial sum.
But there's no reason that it has to be the
immediate neighbor, we're simply looking for a sum.
So here we've simply restructured the problem.
But we've now created a contiguous set of
threads that are all doing the same work.
Now this is a smaller problem shown
here than you would have on a GPU,
warp size on a GPU is 32, but imagine
you had a warp size of 8.
In this case, in the first step now we have
an entire warp doing the same set of work.
And so we can avoid divergence.
When you're accessing the main memory on the GPU
there's a couple of tricks that are available to you
to get the maximum performance out of it.
So you can say the main memory or the global memory,
that's kind of the memory that's advertised when you go
to the store and buy the card, you
know, 512 megabytes or 1 gigabyte card.
So referring to the global memory there.
And there are some rules that, if you follow
them, let you get the peak data throughput.
And these are rules that kind of vary by generation.
So with -- up to Compute 1.1,
we have different compute levels
that describe the capabilities of the hardware.
But for a GeForce 8600, 8800 GT, 9400 M, 9600
GT, basically if you have a half-warp,
16 threads in this case, a half-warp can access a
contiguous, aligned, and in-order set of data.
Instead of having 16 threads fall back
to 16 different transactions, it can simply do one
larger transaction as long as you follow some rules.
So the rules are basically -- and they need to be --
the starting address for a region
must be a multiple of region size.
So there's an alignment issue there.
The kth thread in the half-warp needs to access
the kth element of the segment.
So there is an in order access there.
And there's also a contiguous access requirement.
There's one exception.
Not all threads need to be participating.
But let's look at some real examples here.
So in these two examples we achieve coalescing.
So in the first, upper case, we
have a half-warp of threads.
And they are accessing a contiguous region of memory, it's
properly aligned, and each one's accessing it in order.
So instead of the 16 threads having to do 16
different memory accesses the GPU coalesces
that into one access of larger width.
And so you get that back in one memory transaction.
And then in the lower case we see our one exception here.
Not all threads need to participate, but as
long as the other rules are all followed,
and again it can do this in one single transaction.
Let's look at some areas, some
situations, where we break coalescing.
So again we have our properly aligned contiguous region,
but now we've permuted the access here in threads 1 and 2.
Now it's going to break our coalescing
on the older generation,
and it's going to fall back to
16 different memory transactions.
It's going to impact your performance quite a bit.
Similarly in the lower case, we're contiguous,
we're in order, but we're misaligned.
And so again it's going to fall back to 16
transactions instead of a single coalesced transaction.
On the newer GPUs the coalescing rules are much improved.
This applies to the GeForce GTX 285 and the GeForce GT 330M.
Basically what the hardware is going
to be doing is it combines the addresses
within a half-warp into one or more aligned segments.
And it figures out the best way
to minimize the number of segments it transfers.
So all threads with addresses within a segment
are serviced by a single memory transaction,
regardless now of ordering or alignment within the segment.
So let's look at some examples.
This is what happens on the GTX 285, for instance.
When you go to access your global memory.
In the left-most case, we have a contiguous segment
again, it's completely out of order but that's fine,
it's all -- it's contiguous and it's aligned.
And so the GPU knows that it can service those
requests with one single 64-byte segment.
So it goes ahead, and you basically get your coalesced
access, despite being completely out of order.
In the second case there's a misalignment problem, so
the GPU can't do a 64-byte aligned memory transaction.
But it looks at what you're doing, and it says
well, we can do this with a larger 128-byte segment.
And so it'll go ahead.
It will do a larger transfer, and so there's still some
data transfer that maybe you're not going to be using.
But again, it won't suddenly fall
back to 16 memory transactions.
So you don't fall off a cliff.
And similarly, the last case has
another misaligned transaction.
But in this case, just because of the way it's aligned, the GPU
says well, I can do a 32-byte segment and a 64-byte segment.
And again you haven't fallen back
to 16 different transactions.
So what happens when you actually do coalescing?
So here's a simple experiment.
This is on the 8800 GTX.
So it's a previous generation GPU
with the more stringent rules.
But basically in our kernel all we do is read
a float, increment this value, write it back.
We're processing 3 million floats, 12 megabytes of data.
We average our time over 10,000 runs.
So 12K blocks, 256 threads per block, reading the floats.
If we achieved coalesced access, we could do
this type of operation in 350 microseconds.
In the second case there, coalesce but
with some threads not participating.
That one exception.
Again, the timing is still basically
the same as a fully coalesced access.
If we break coalescing, we can see the penalty for that.
We've gone from 350 microseconds
up to almost 3500 microseconds.
And that happens if we permuted or
had some misaligned thread access.
So it's almost a 10 times speed difference
if you're able to achieve coalescing.
In the second case, we can see the same kind of effect.
And in this case, what if we were
reading, say, float3 data.
That's kind of a -- it's a 3-element data structure.
It doesn't naturally follow the coalescing rules.
But we can do some tricks where we can
actually coalesce that through local memory.
So you can actually load that in, say -- imagine
that it was simply a set of single element floats,
put into local memory, treat it as a 3-element
float, and then write it back out to global.
The point here though is if you
can use this local memory
to achieve coalescing, again you get really good results.
So uncoalesced cases give you 3 -- 3300 microseconds.
But again, if you achieve coalescing 360 microseconds.
So coalescing is probably one of the major
performance improvements if you're not achieving it.
We've talked about local memory, mentioned it several times.
What is local memory?
Well local memory is basically on our GPU,
it's a high-speed, low-latency on chip memory.
So this is opposed to global memory, which is on
the card but is not actually on the GPU chip itself.
Local memory is actually on the GPU and in fact
you can see there, it's actually inside the SM.
And the way you can think about
it is as a user-managed cache.
If you're processing an image, say, a natural way to use
it would be to load a tile of that image into shared memory.
And then all the threads in your SM can actually read
and write into that tile data as much as they want.
When you get your final result in, you
can push that back out to main memory.
So in that case you can get a lot of quick accesses
without paying the penalty of going to global memory.
When you're using local memory, it's not coalescing you're
worried about, you can actually access it very fast.
But there are still some rules that
will give you better performance.
In particular, something called a bank conflict.
So on the GPU local memory -- local
memory is divided into banks.
And that's essential to achieve high
bandwidth in a parallel application.
And so each bank can service one address per cycle.
And in general, memory can service as many
simultaneous accesses as it has banks.
If you have multiple simultaneous accesses to a single
bank, the same bank, then you get a bank conflict.
And in that case, those accesses need to be serialized.
So let's look at some examples here.
So here we have the 16 banks of our
local memory, each addressing 32-bit words.
So imagine you had an array of integers sitting in local memory.
Then the first 16 integers in your array would be
serviced by banks 0-15, and it loops around again.
Now if your threads go ahead and access it --
in the leftmost case it's kind of nice and orderly,
it's all in order and each thread is accessing only
one bank, so there are no bank conflicts.
The banks can service all the threads at once.
Similarly, a permutation is fine.
The key thing here is as long as those banks don't have
multiple requests coming to them, there's no bank conflicts.
So examples where you do get a bank conflict.
So on the left we have a two-way bank conflict.
So we have multiple threads, in this case, 0 and 8.
Both accessing bank 0.
We got a 2-way bank conflict, hardware has
to service thread 0, then service thread 8.
So it's serialized.
Then you can have higher-order bank conflicts.
On the right side we have an 8-way bank conflict,
where 8 threads are all accessing the same bank
and so it's got to serialize all those 8 transactions.
Finally, the texture functionality on the
GPU is still available to you through OpenCL.
And we're going to show an example here.
The example is optical flow.
Basically, optical flow is calculating
the motion of points in an image pair.
So here we have an image.
It's actually a standard test dataset.
It's a picture of a guy closing his trunk.
As you can see, he's kind of pushing down the trunk.
And what optical flow is going to do is calculate
the motion between image one and image two.
And then down below we have the kind
of the results of the optical flow.
And those blue areas, if you kind of
zoom in, they're basically small arrows.
That tells you the direction and magnitude
of the motion that we see in the image pair.
This type of computation is embarrassingly parallel
and compute-intensive, so it's a great candidate for the GPU.
And it's also used in applications such as image
stabilization, feature tracking, and computer vision.
It's also used in things like video encoding.
Optical flow makes use of the texture cache
and hardware bilinear interpolation.
Let's talk about those.
Let's look at how do we actually implement optical flow?
This is a very high-level description.
There are many, many different
implementations of optical flow.
Here we examine the pyramidal Lucas-Kanade
optical flow, which is a pretty well known one.
Basically, we create an image pyramid first.
So we downsample the image, so we have the
same image downsampled to different sizes.
We do some edge filtering on it to
get the vertical and horizontal edges,
or derivatives, at each level of the pyramid.
Then we basically put it into some
kernels that solve for the optical flow.
Solving for the optical flow is basically as easy as
you know, computing this G inverse B matrix calculation.
And G is 2 by 2.
So we can simply precompute G at every point
in our image so we have those available
to us when we need to do the flow calculation.
And finally, we can go ahead and solve for the flow.
And the optical flow processing sequentially
goes from the smallest level of the pyramid,
which kind of gives you coarse estimates, and you refine
it down to the larger levels of the pyramid, feeding
the flow results into each level
until you get to the final base level.
So solving for flow is an iterative algorithm.
You estimate the motion, calculate
the vector, V, but it's iterative.
So you've got a guess for V, then
you update your current position,
and then you iterate again to get a new guess for V.
And so what you end up doing is you have a small window of
computation that's kind of slowly sliding across your image.
The nice thing is on the GPU you have a texture cache.
So if you keep these images in cache, even
though you're sliding variably along the image,
and some areas of the image would
have large motions and some areas
of the image would have only a small amount of motion.
Basically you slide around the image,
the GPU will manage the cache for you.
So you get cache-coherent -- I'm sorry
-- cached data access via textures.
Another nice thing about the texture pipeline
on the GPU is hardware interpolation.
So in our algorithms, subpixel accuracy
and subpixel sampling are crucial.
Between iterations as you're sliding around, your center
position generally is not going to land exactly on a pixel.
So rather what you're going to need to do is interpolate.
So if you take a closer look at the computations here,
where I and J in the upper equation
are both the images.
So we're looking at the I at X Y, but
we're looking at the second image, J,
with some offset, which is our current guess for our flow.
And so when we resample that, we need to get
interpolated samples, and we use that information in order
to compute our B vector from the previous slide.
The nice thing now though is that because we're using the
texture hardware, we can apply linear interpolation to it.
So we can look up on floating-point coordinates, slightly
offset, and we retrieve a bilinearly interpolated sample.
And that bilinear interpolation
happens in hardware, in GPU hardware.
So that's really useful hardware for you to use.
So we've looked at the pyramidal
Lucas-Kanade flow for this demo,
and the additional thing here is the visualization
that you're going to see is actually done by the GPU
by sharing data between OpenGL and OpenCL.
So maybe we can switch over.
Okay, great.
So in the upper left there we see our optical flow.
And if Eric kind of moves around here, you
can see he's calculating the motion vectors.
So the green is the motion being calculated for each frame.
And we're calculating dense optical flow in this case.
So it's 640 by 480 input.
With older methods they would usually only
try to compute the flow at particular points.
But with the processing power in the GPU
we can compute optical flow at every point.
So we're actually calculating at every point.
We're only displaying kind of a subset, a
grid, in order to make the arrows visible.
This is running on a GTX 285, and these arrows
are actually being drawn via buffer sharing.
So I can simply compute the two end points
of the arrows that I'd like to draw.
Send that right into OpenGL.
And it can go ahead -- we do a GL draw, you know, GL_LINES,
basically, drawing lines -- and that all happens on the GPU.
Okay, thanks a lot.
Actually, just going to --
[ Applause ]
>> Switch it back to this -- okay, and so if we
look at our performance across different GPUs,
different GPUs have different amounts of
SMs, and it certainly pays to have more SMs.
So comparing 8800 GT versus GTX 285, running
the same code across two different GPUs
with more SMs you can get better performance.
Okay, thanks a lot, and I'm going
to hand it over to Aaftab.
[ Applause ]
>> So hope these two sessions gave you a deeper
insight into how OpenCL works, how to use GPUs,
and some guidelines on writing performance
OpenCL programs for CPUs and GPUs.
Thank you.