WWDC2003 Session 625
Transcript
Kind: captions
Language: en
So, welcome to the last Java session of the week: maximizing Java performance for Mac OS X. My name is Victor Hernandez. This talk will be given by three of us. Jim Laskey and myself are from the Java Runtime technology team, and Gerard Ziemski is from the Java platform classes team. We're going to be splitting the talk into three parts, but the goal of the overall talk is to give you a better understanding of why your Java application performs as it does on Mac OS X. Jim is going to be talking about performance improvements that we've made specifically targeting the G5 processor. Then I'll be talking about performance opportunities that arrived with Java 1.4.1. And then Gerard will be talking about Java graphics performance on Mac OS X — and stay tuned specifically for his part, because there's a lot of great demos to be seen there. We've got a lot of material, so let's get right to
it. Here's Jim. Thanks. So, my part of the talk is going to cover specifically what changes were made to the HotSpot VM to target the G5, which was pretty exciting, because we only got to see some of these new machines a few weeks ago and play around with some of the prototypes. First of all, I'll just give an overview of my section of the talk. I want to talk specifically about some of the details of the G5, to give you a sense of what sorts of things we could actually exploit on the G5. Then I'll do a little performance comparison between the G4 and G5, as well as a benchmark, to give you some sense of what kinds of improvements you might see in your application. Then I'll go into some detail on some of the changes that we made specifically to the Java VM interpreter and the JIT, and then quickly at the end I'll go through a couple of changes that we made to the HotSpot runtime.
Okay, so what does the G5 bring to Java developers? Well, the main thing you should know — and it should be obvious — is that the G5 is going to make your application generally run faster. You should expect that from a faster processor: it's forty percent faster than the highest-end G4 currently shipping, with a faster bus structure. There have also been some architectural changes in how the G5 processor works compared to the G4, which actually improve the performance of various types of operations — very specifically, floating point. You'll find that floating point is typically faster than the forty percent projected by just the change in gigahertz on the machine. Now, we could have left the VM alone and not done anything to it, and you would still have gotten a gain in performance running on the G5. But we like to tinker, and there are all these really cool instructions on the new processor that we wanted to take advantage of. Specifically, there's the introduction of 64-bit operations, so if you have any long or long int arithmetic in your Java application, we now use a much simpler and quicker set of instructions to do those operations, and I'll go into some detail on what was actually done. Now, one thing to note — and I know there haven't been a lot of talks about the G5 — is what "G5" means. All you're hearing is that it's a 64-bit processor, and that can mean a lot of things. I like to think of it in terms of two facets of calling a processor 64-bit. One is the processor executing 64-bit instructions — that is, doing operations on 64-bit operands. The second part is the implication that you also have a 64-bit address space. Now, we've chosen to run the Java VM in what's called 32-bit mode, which allows us to maintain the 32-bit address space, so that we don't need to represent object pointers in 64 bits, and hence we don't need twice as much memory to represent things. But we still can use the 64-bit operations to do, say, long integer arithmetic. And finally, the main thing that you can walk away from this session feeling is that you don't have to do anything to your application to gain these improvements in performance. We've modified the VM, so as soon as you run your application on a G5, you're going to gain all the benefits of having 64-bit arithmetic and the faster processing itself. You have to make no changes to your application.
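As a hypothetical sketch (this code is mine, not from the session), this is the kind of pure long-int arithmetic that picks up the G5's 64-bit instructions automatically when run on the adapted VM, with no source changes:

```java
// Hypothetical microbenchmark: heavy 64-bit long arithmetic of the kind the
// G5-adapted VM speeds up. The Java source is identical on G4 and G5; only
// the generated instructions differ.
public class LongArith {
    // Sum of i * (i + 1) over 1..n, all in 64-bit long math:
    // each long multiply and add becomes a single instruction on the G5.
    static long sumOfProducts(int n) {
        long acc = 0L;
        for (long i = 1; i <= n; i++) {
            acc += i * (i + 1);
        }
        return acc;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        long result = sumOfProducts(1000000);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("result=" + result + " (" + elapsed + " ms)");
    }
}
```

Timing the same loop on a G4 versus a G5 is how you would observe the difference; the exact numbers depend on the machine.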
Now, just to start off, I want to show you some comparisons of running some applications on G4 versus G5. I'll be using — sorry — SciMark. We use several different benchmarks internally to test various things. We would normally use SPECjvm, but that has a fair-use policy which requires that you post the scores on a public forum before you can actually use them, and we're still working with prototypes and don't have our final values yet. So we chose to use SciMark, which is a fairly good benchmark, and it will give you a good sense of where we're going. The other thing about SciMark is that it's a scientific and engineering benchmark. People have often said that the client VM is very slow when it comes to computation; well, this SciMark score should give you a sense of where we're headed with computation. SciMark can be found at the National Institute of Standards and Technology website — there's the URL — and if you go there, there's a whole list of current standings. They're fairly up to date; I think the most recent one is in the May/June timeframe. If you look in the list you'll see us way down there somewhere — actually, in 61st position. This was run by somebody back in the fall, running the 1.3.1 version of the VM on a 1.25 gigahertz dual-processor G4. Okay, so note the score there: 78.253. That's what's called a composite score. SciMark is actually five separate tests — such as fast Fourier transform, sparse matrix multiplication, and Monte Carlo — and a composite score is prepared from those five results. It's the composite score which is used to actually rate how you're doing in SciMark. So this graph shows the current high-end G4, which is a 1.42 gigahertz dual G4, against a 2 gigahertz dual G5. Focus on the first column, because that's the one the main score is based on. Currently our composite score would be around 111, and you can see each of the subtests there. I'm not sure whether there's a normalization of these — I haven't seen anybody actually hitting a hundred on any of them — but that's basically what you would find currently. Now, if you took a straight forty percent increase in performance on each of those tests, this would be the projection that you would get. We did this to get a sense of where we should be headed once we ran it on one of these G5s. Okay, so — again, focus on the composite score, because the others are going to vary a little bit — as I said, this would have been the best we would have expected. Well, we were kind of surprised when we actually ran the tests: we got a 232, which is pretty significant. So it's more than just gigahertz; it's also the system itself and the changes that we've made to the VM. And here's an overlay of the projections, just to show you where we're at: the score basically more than doubles on the G5. So where does this put us? Well, if we were to do this today, this would put us at about 12th position. What's interesting is that this is up in the high end, with all the high-end IBM servers running three gigahertz — and we're running a client VM. So that should give you a sense of the power of the G5, and also the potential that we have as we make further improvements to the VM.
Okay, so let's go over some of the changes that took place in the interpreter and the JIT. We enhanced them with G5 instructions. We can do this because the interpreter, the JIT compiler's code generation, and also the runtime are all constructed on the fly when you launch the VM, so this gives us an opportunity to choose which instructions we want to apply. If we ran on a G3 versus a G4, we would choose different instructions; if we're running on a single processor or a dual processor, we would choose different instructions; and now that we're running on the G5, we can actually choose 64-bit instructions. So the long int support is now using 64-bit operations, and I'll go into some of the details of that. There's also been improvement in float and double support, using some of the new floating-point instructions — they're not actually new instructions; they're instructions that are only made available because of 64-bit support.
Let's talk a little bit about the details of what 64-bit means. On a G4 — or a G3, for that matter — everything that goes through the processor has to go through in 32 bits. That's because the data bus and the width of the registers are only 32 bits wide. So if you wanted to do an operation on a 64-bit long integer value, you would require two registers to deal with each of the operands and with the result. In this case, say we had w = x + y: we would require two registers to hold the result — in this example, r3 and r4 — and we'd also need two registers each for x and for y. So we need six registers just to perform a long add or a long subtract operation. In the G5 world, our registers are 64 bits wide. We can still treat them as though they're 32 bits wide — and some operations still deal with them as 32 bits wide — but for the long integer operations we can deal with them as full 64-bit. So in the previous example, where we had w = x + y, the result only needs one register, and x and y only need one register each. So we cut down the number of registers needed for each operation, and that makes more available for other operations. So we get a general win by having more registers available. So let's look
at the specific operations that we can improve on. In your Java code, say you have an expression long x = y. On a G4 this would actually require four steps to perform. We need two steps to load each half of the 64-bit value — the high 32 bits, then the low 32 bits — and then we need two steps to store those back out into memory. So in a 32-bit world, we almost always have to use at least two instructions where one would do. On the G5, we only need one instruction for each of those: one instruction per load, one instruction per store. This also helps for moving data: we have a 64-bit data bus, so we get better throughput through the system, and when we're doing memory copies we're also getting a performance boost there. Let's look at some of the simple operations, like add and subtract. Again, because we only have 32-bit-wide registers, we have to do everything in two steps on the G4. So in this case, if we want to add two long ints, we have to add the low halves of the two operands and bring the carry forward, and then add the high halves along with the carry — those are the two steps highlighted here. I've thrown in the load operations as well, to give you a sense that it's not just the operation itself; it's also the things that go on around it. So it takes eight instructions to perform that. On the G5, it only takes one instruction to do the add — and again, each of the loads takes only one instruction, and the store takes only one instruction. So we've cut the number of instructions required in half, and you can think in terms of fewer instructions, faster code. Now, the more interesting things — and this has been the most trouble for us in implementing the Java VM — have been dealing with long and some of these more complex operations, like multiply, divide, remainder, shifts, and even comparisons. They can take many instructions: a long int divide can literally take hundreds of instructions, or hundreds of steps, to complete; remainder, a few more; shifts can take eight; comparisons can take up to twelve. They're fairly expensive operations. Each of these has been reduced to a single operation. I'll take multiply as the simplest example: on the G4, a long int multiply takes six steps to do the cross-multiply of the low and high parts of the operands; on the G5, it only takes one instruction. So you can see where this is going: if you have a lot of long int computation in your code, where it took many steps before, it's only going to take a few. Now let's take a
look at float — and when I say float, I mean float and double. In the G5 implementation of the Java VM, we have taken advantage of some of the newer instructions that can convert longs to doubles and doubles back to longs, and the same with floats. The G4 implementation has to make a library call, which takes several hundreds of steps, so this speeds up the performance of casting — the conversion of longs to doubles. There have also been some improvements in the float and double bit-extraction routines, such as doubleToLongBits; these are used primarily when you're converting doubles to strings and back again. The most interesting of the changes is square root: on the G5 there's a built-in square root instruction. On the G4, square root is implemented as a library routine, and it can take several steps — on the order of about 40 steps — to complete. So what I did was a little microbenchmark, where I'm iterating through a hundred thousand data points, applying a square root to each of them, and producing a result. And just to make it interesting, I took a slightly more complex operation, where I had 100 million x,y points on a coordinate plane and I want to compute the distance — so it's a little more complicated an equation — and see how long it would take on each of the processors. The first processor is a G4 running at 1.42 gigahertz. It takes about 12 seconds to do all those computations, and 13.5 for the distance formula. Now, if I were to take a straight port over and use the library routine on the G5, it would be reduced to 7.7 seconds and 8.1 seconds — and this is actually better than the projected time it should take. So the floating-point processing is better on the G5, and you're going to get a better result there. Running with the square root instruction built into the code — inlined in the code — it only takes two seconds. So you've got, say, a six-times improvement in performance. But this is a microbenchmark, you know; it's just going to give you a sense of the increase in performance of the square root itself. Your actual application may take a little bit longer, but it gives you a sense of the magnitude of the improvement there.
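The distance microbenchmark Jim describes might look something like this sketch (my reconstruction; the original code wasn't shown):

```java
// Reconstruction of the kind of microbenchmark described: apply the distance
// formula, which uses Math.sqrt, across many x,y points and time it. On the
// G5 the square root can become a single hardware instruction.
public class SqrtBench {
    static double distance(double x, double y) {
        return Math.sqrt(x * x + y * y);
    }

    public static void main(String[] args) {
        int n = 1000000;   // the talk used 100 million points; smaller here
        double sum = 0.0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < n; i++) {
            sum += distance(i * 0.5, i * 0.25);
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("checksum=" + sum + " (" + elapsed + " ms)");
    }
}
```

The checksum keeps the loop from being optimized away entirely; the elapsed time is what you would compare across machines.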
Finally, I want to quickly run through some of the changes to the runtime. In a 32-bit world, we have a little problem where two threads may want to share a long int value — say a static, or a field in an object — and while they're writing to that long int, the upper and lower halves of those values might get slammed by one or the other thread, depending on how the thread switching is going on. To avoid that problem, you can annotate your field with the volatile keyword, and what that does is force the VM to coordinate how that field is accessed and make sure we don't run into that problem. On the G4 we did a little fudge, using a 64-bit double register as an atomic access and copying it through some memory, and so on and so forth — so it took several steps to make that work. On the G5, 64-bit loads and stores are atomic, so you don't have that problem. Okay, so there's no overhead when you're dealing with volatile fields on the G5.
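A minimal sketch of the volatile long case (my example): without volatile, a 32-bit JVM can let a reader observe one half of the long updated and not the other; on the G5 the single 64-bit store is naturally atomic, so the volatile guarantee costs nothing.

```java
public class SharedCounter {
    // volatile guarantees readers never see a half-written long:
    // on 32-bit PowerPC the VM has to take extra steps for this; on the G5
    // a single 64-bit store is already atomic, so the guarantee is free.
    private volatile long value;

    void set(long v) { value = v; }
    long get() { return value; }

    public static void main(String[] args) {
        SharedCounter c = new SharedCounter();
        c.set(0x0123456789ABCDEFL);
        System.out.println(Long.toHexString(c.get()));
    }
}
```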
One of the problems the G5 introduces is that the hardware itself is a little more complex and has more stages when it's doing its computation — this is where it gets its speed. So when you're running on a dual processor, there needs to be some coordination in how memory is being accessed. In the G4 world, we used something called the sync instruction, and this allowed the two processors in a dual-processor environment to sync up the data shared between them. But the problem with the sync instruction is that it somewhat freezes the state of the processors until they're both coordinated before continuing on, so there's a bit of an impact there, and sometimes it can actually be fairly serious. With the introduction of the G5, they brought in a new instruction called lightweight sync, which doesn't require as much handshaking between the processors to determine whether the data is in sync. We use these when we're doing memory allocation — when two threads are trying to allocate memory at the same time — or, worse, when you're using synchronization on an object.
Finally, the last major change that we made in the runtime to deal with the G5 is the atomic long access. There's a class in sun.misc called AtomicLongCSImpl which allows you to do atomic access of long values, and this is primarily used in the net operations — like when you're setting up sockets, and so on and so forth. On the G4 we had to use full Java synchronization — we just used the Java implementation to provide the synchronization — to lock out the access to the object, or to that particular object field, then make the assignment, and release it through normal synchronization. On the G5 we use a lightweight load-and-reserve instruction, which allows us to reserve access to that word, and it can be done fairly quickly. So, in summary: Java 1.4.1, which ships with the G5s once we start shipping, will automatically adapt to the G5 processor, and we're only going to be shipping one version of the VM from that point on. It's not one that runs on the G4 and one that runs on the G5; it's one that runs on all platforms but adapts to the G5 — this is one of the great things about the HotSpot VM. You're going to get significant performance changes somewhat across the board, but specifically in floating point; that's the main thing. If you're doing scientific computing, you're going to see bigger wins there, and then also with the long int arithmetic, if you're using it. And the main thing that, as I say, I want to point out is that you don't have to make any changes to your own code: the VM does the adaptation for you. This is where you're one up on all the C and Objective-C programmers, because if they want to take advantage of the G5 processor, they're going to have to recompile their application, and they're going to have to ship a separate version of their application for the G5 and one for the G4. Java is automatically going to take advantage of it. Okay, and that's all I have. Okay — Victor, there you go. So, for those of you that don't think in terms of bits and instructions, we'll take it at a higher level now. My name is Victor Hernandez, in case you don't remember, and here we go.
So, basically, what I'm going to be talking about is updates to HotSpot that have been made with Java 1.4.1 — specifically, one of the features that we've added for being able to optimize your code, which is aggressive inlining — and also one performance opportunity that you can take advantage of yourself in Java 1.4.1, which is the new I/O APIs. And finally, I'm going to wrap it up with a bunch of conclusions on tips that you can take advantage of to improve your hot methods.
One of the performance bottlenecks that has plagued Java — well, I don't know if plagued, but that Java has encountered since the early days — is the fact that there's a large cost in the overhead of actually invoking a Java method. So our opportunity to minimize that cost is to dynamically inline the method calls made by your methods when we compile them. What is inlining? It should be pretty straightforward, but I'll give a quick example here. You've got average, and average calls sum. Of course, average could avoid the call to sum if it just simply did the a + b itself, but of course you don't want to do that in your code, because that limits the reusability of that method. The good thing is that we're able to do that for you — you don't need to change your code; we just do it for you on the fly. In 1.3.1 there was limited ability to do inlining: we were able to inline your accessor methods to your fields, we were able to inline your calls to create new instances of your objects, and we were able to inline certain intrinsics — intrinsic meaning methods where we don't actually need to look at the bytecodes to know what they're supposed to do; we know what they're supposed to do and have a finely tuned implementation, for example sine, cosine, and the identity function. But one of the main issues with inlining in Java 1.3.1 was the fact that we were actually not able to inline virtual methods. Why are virtual methods difficult to inline? The reason is that there could be multiple possible implementations of that method when you actually go to do an invocation, so we don't know which implementation to actually inline. So how do we go about inlining those virtual methods? We do that with a technique called class hierarchy analysis. The goal of class hierarchy analysis is to determine if a method is monomorphic, and a method is monomorphic if there is only one implementation of that particular virtual method that has actually been loaded. If we know there's only one that has been loaded and you go to call it, that's got to be the one. HotSpot 1.4.1 attempts to aggressively inline all monomorphic methods — that's the main feature we've added beyond 1.3.1. So what are the benefits of this? Well, clearly the fact that now we can actually inline virtual methods. There are certain situations where those methods don't get inlined, but even in that case we can avoid the virtual table lookup when invoking the method, because we know there's only one entry in the virtual table. It also provides the ability to do a faster implementation of certain bytecodes, because the class hierarchy analysis has a data structure which actually tells us the full hierarchy information of all the classes that have been loaded. So when you're doing things like instanceof and checkcast — which are the bytecodes used when casting your objects between various classes — we can actually use that data structure, and it performs a lot faster.
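A hypothetical shape of code that class hierarchy analysis helps (the class names here are mine, not from the session): Shape.area is a virtual method, but while Circle is the only loaded subclass, the call is monomorphic and becomes an inlining candidate.

```java
abstract class Shape {
    abstract double area();
}

class Circle extends Shape {
    private final double r;
    Circle(double r) { this.r = r; }
    double area() { return Math.PI * r * r; }   // small body: good inline candidate
}

public class ChaDemo {
    static double totalArea(Shape[] shapes) {
        double sum = 0.0;
        for (int i = 0; i < shapes.length; i++) {
            // A virtual call, but monomorphic as long as only Circle has been
            // loaded, so class hierarchy analysis lets HotSpot 1.4.1 inline it.
            sum += shapes[i].area();
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Circle(1.0), new Circle(2.0) };
        System.out.println(totalArea(shapes));
    }
}
```

If a second subclass of Shape were loaded later, the VM would fall back to a true virtual dispatch for that call site.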
What is another performance issue that has affected Java in the past? Well, this one actually has two parts to it. One is the fact that if you ever wanted to operate on native data structures from your Java methods, you actually had to have them resident in the Java heap. Why would you need native data structures in the Java heap? If you ever want to interact with any system APIs, you need to have those data structures to pass down once you drop down to native methods. That adds the other heavy cost, which is the fact that the JNI transitions to do those native method calls are quite expensive. I mean, in the previous section I was talking about inlining, where we're trying to minimize the number of method calls — and those method calls are pretty quick compared to these JNI transitions. Not only that, but these JNI transitions definitely cannot be inlined at all, because we're crossing ABIs, and we don't totally control a lot of the issues between calling from Java to C. But those JNI transitions are still necessary as of Java 1.3.1. The other thing to keep in mind is that once you have all those native data structures in the Java heap, they end up being moved around during garbage collection — and yet they don't contain any actual Java pointers, which are what the garbage collection algorithm needs to keep track of. So what is our approach to actually improving this bottleneck? We want to remove this JNI dependency altogether by giving you the ability to access that native memory directly from your Java method. You might be familiar with this: it's basically the new I/O APIs that were provided in 1.4.1. They're available in the java.nio package, and there's basically a buffer class for every single one of the Java scalar types, including byte — underneath, all operations actually happen at the byte level — but you can go from a byte buffer to an IntBuffer or a LongBuffer and operate at the Java scalar level. One of the things you need to keep in mind here is that even though the goal is to have direct access to native buffers that are not located inside the Java heap, you can actually trick yourself into still basically having a copy residing in the Java heap, accessing that, and having it be copied over outside of the heap. Even though that might improve performance over before — since you don't have to drop down into a JNI native method to do it — it still is an added overhead, and you need to be careful about that.
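A small sketch of the buffer APIs just described (my example): a direct buffer allocated outside the Java heap, with a typed view for working in Java scalars.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // allocateDirect places the storage outside the Java heap, so native
        // code can see it without an extra copy; allocate(400) instead would
        // give a heap-backed buffer and reintroduce the copy warned about above.
        ByteBuffer bytes = ByteBuffer.allocateDirect(400);

        // View the same storage as ints to operate at the Java scalar level.
        IntBuffer ints = bytes.asIntBuffer();
        ints.put(0, 42);

        System.out.println("direct=" + bytes.isDirect() + ", ints[0]=" + ints.get(0));
    }
}
```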
There are a few other issues that I want to bring up. This is a pretty straightforward code example that shows an allocation of a byte buffer of size four hundred, and you're basically zero-filling it with a for loop. One of the things you need to be aware of right here is that that for loop is not as optimal as it can be, because the call to the put method does not get inlined — it's not determined to be monomorphic. This is a caveat of the actual class hierarchy of the java.nio package, and it affects all your calls to get and put in the case of ByteBuffer. If you're doing something like this, there is actually one way you can get around it, and that's simply by using a MappedByteBuffer — the MappedByteBuffer get and put methods are determined to be monomorphic, and they do get inlined. You need to keep that in mind, and this is something that we're going to be tracking in the future to see if it can be improved. So how do you actually do high-level I/O with new I/O? That's using channels, found in the java.nio.channels package. The main thing it provides, beyond what was available in the traditional Java I/O of Java 1.3.1, is the ability to do non-blocking and interruptible operations. No longer is there any need to have one thread per socket — that's a thing of the past. Another thing it provides is improved file system support: it gives you a lot more of the system-level primitives that you've come to expect from a robust operating system, things like file locking and also memory-mapped files. Just like in the case where you need to make sure that you have a direct buffer sitting behind your native buffer, this is another example where you not only have access to direct memory, but you're actually accessing the memory-mapped file itself. So let me go into
a little more detail about the socket channel. I'm not going to go into enough depth for this to be a tutorial on this sort of thing, but I do want to bring up a few issues that tutorials might miss on occasion. This is an example of how to create a server socket channel and bind it to a particular address for it to be listening on. One of the things is that, by default, it is not set to be non-blocking, so you actually have to do that by calling configureBlocking and passing it the value false. It's a pretty straightforward thing, but it can be missed, and it definitely makes a huge difference. Then how do you actually communicate with your clients using this model? You use the selector model, which you might be familiar with as a programming pattern. You can see in the code right here: basically what you're doing is registering for a particular key, and then, once you've done that, you can iterate over all of your clients, who communicate with you via keys and who pass you the new channel. You communicate with them in a big while loop, if you want, by iterating over all of the keys, and that way you're abstracting away all the different sockets that you're actually talking to, instead of doing the traditional thing of having to block until your client talks back to you on that particular socket. One of the things to keep in mind is that the socket channel that is returned each of the times you ask for one of these keys is different from the one you had originally, so if you want to continue doing non-blocking I/O, you need to state that once again, with configureBlocking set to false. Okay, so what do you need to keep in mind when using new I/O? Well, it's definitely not free: the cost of allocating those native buffers is definitely much larger than allocating Java arrays. It's pretty hard to reach our performance allocating Java arrays, because we actually do a very good job at doing that as quickly as possible.
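The non-blocking server pattern described above can be sketched roughly like this (a minimal, hypothetical skeleton, not the session's demo code):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NonBlockingServer {
    public static void main(String[] args) throws IOException {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress(0));  // ephemeral port
        server.configureBlocking(false);                 // NOT the default: easy to miss

        Selector selector = Selector.open();
        server.register(selector, SelectionKey.OP_ACCEPT);

        // One pass of the event loop; a real server would loop until shutdown.
        if (selector.selectNow() > 0) {
            for (Iterator it = selector.selectedKeys().iterator(); it.hasNext();) {
                SelectionKey key = (SelectionKey) it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client =
                        ((ServerSocketChannel) key.channel()).accept();
                    // The accepted channel is a new, separate channel: it must
                    // be made non-blocking itself before registering for reads.
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                }
            }
        }
        server.close();
        selector.close();
    }
}
```

Note that configureBlocking(false) has to happen before register, on both the server channel and each accepted client channel.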
The other thing to keep in mind is that the get and put methods of the native buffers are not inlined. You can use the trick I mentioned to get that fixed for a few cases, but there's nothing you can do for IntBuffer and some of the other scalar buffer types. But the gains definitely outweigh the costs in the cases I've been talking about, where you have heavy use of system APIs and related data structures. One good example of actually taking advantage of that win is in the rearchitecture of the AWT done by our team for Java 1.4.1. We took advantage of the new I/O API to talk to Core Graphics and minimize the number of JNI transitions. Basically, we told the classes team to minimize JNI transitions as much as possible, and they did — as much as they could — and you definitely see the performance improvement there. We're hoping that the Java classes on the whole will be seeing more use of new I/O, where that can be done, in the future. The other thing, also: clearly, if you have server I/O with multiple clients, you definitely want to be using this, because the overhead otherwise is definitely costly. Okay.
OK, so what should your takeaway from all this be? Well, the main thing to keep in mind with the server I/O is simply: use it in those cases. And with what I told you about inlining, what you need to do is maximize the opportunities where we can inline your methods. This is mainly important in your hot methods: when you do a profile, you want to figure out which methods you're mainly calling, and make sure that whatever your hottest methods are, all of the things they're calling are hopefully being inlined. This can only be done at a high level; there are actually no flags to notify you if your methods aren't being inlined and that sort of thing, but there are general rules of thumb. Definitely, if all those methods are small, that helps, because we have a certain limit at which point we fail on any further inlines in the method we're trying to compile. Feel free to use accessor methods; those have definitely been inlined since Java 1.3.1. Also, there's no need to use the final qualifier on your methods; for performance it's superfluous. It's not superfluous for object-oriented programming, but we don't get any particular performance benefit out of it. And keep in mind that a lot of the JDK methods do get inlined, so you can count on that if that's a lot of what your hot methods are doing.
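To make the inlining advice concrete, here is a hypothetical sketch (a toy class of my own, not from the session): small accessors like these are exactly the methods HotSpot has inlined since 1.3.1, so a hot loop that calls them compiles down to plain field reads.

```java
// Toy value class: tiny accessors like these are prime inlining
// candidates, so there is no reason to avoid them for performance.
public class Pt {
    private double x, y;

    public double getX() { return x; }   // small body: gets inlined
    public double getY() { return y; }
    public void setX(double v) { x = v; }
    public void setY(double v) { y = v; }

    // A hot method: after inlining, each getX()/getY() call below is
    // just a field load, with no per-call overhead in the loop.
    public static double pathLength(Pt[] pts) {
        double total = 0.0;
        for (int i = 1; i < pts.length; i++) {
            double dx = pts[i].getX() - pts[i - 1].getX();
            double dy = pts[i].getY() - pts[i - 1].getY();
            total += Math.sqrt(dx * dx + dy * dy);
        }
        return total;
    }
}
```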
There are a few things that we're still unable to inline, and you've got to keep them in mind: mainly synchronized methods, obviously large methods, and, if you have an exception handler in your method, that can cause it not to be inlined. The last tips I want to leave you with are ones I always like to reiterate, things that still live on from the days of Java 1. Avoid object pools; there's absolutely no need for them in modern Java. Our new is extremely fast, and it's also inlined. We also have thread-local allocation now, so there's minimal contention between multiple threads allocating in the Java heap at the same time. And we have precise garbage collection, so let us do the work for you in terms of figuring out when an object needs to go away; you don't need to take care of it in terms of the object's lifetime and all that.
Also, avoid programming by exception. There definitely are situations where you want to program with exceptions, like the case where you want to go down a tree and then jump all the way back up, skipping branches in the tree; sure. But HotSpot is definitely not optimized to compile those cases as well: for example, it can cause inlining to be prevented, and the actual creation of the exceptions is expensive, though that creation cost only happens if the exception is actually thrown.
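A small made-up example of that last point (the names are mine): the same lookup written with ordinary control flow and with an exception as the "not found" signal. The second version pays to construct the exception on every miss, and the handler can keep the method from being inlined.

```java
public class LookupStyles {
    static final int NOT_FOUND = -1;

    // Preferred: ordinary control flow, cheap and inlinable.
    static int indexOf(int[] values, int target) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == target) return i;
        }
        return NOT_FOUND;
    }

    // Anti-pattern: using an exception as a "not found" signal.
    // The exception is constructed (stack trace and all) on every miss,
    // and the try/catch can block inlining of this method.
    static int indexOfByException(int[] values, int target) {
        try {
            for (int i = 0; i < values.length; i++) {
                if (values[i] == target) return i;
            }
            throw new RuntimeException("not found");  // expensive to create
        } catch (RuntimeException e) {
            return NOT_FOUND;
        }
    }
}
```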
I hope you came away with some tips for your application. And now let me bring up Gerard.
Hello and welcome. My name is Gerard Ziemski; I'm an engineer on the Java classes team, and I'll be talking to you about graphics performance. First I'll give you a short introduction to the state of Java graphics on Mac OS X, then I'll give you a few actual tips and techniques on what you can do in your application to make your Java app run faster, and finally we'll have some cool demos to show you.

So, Java 1.3.1. One really interesting thing we did there was a hardware-accelerated Java 2D implementation that sat on top of OpenGL. It was a really terrific implementation, incredibly fast. The problem was that it worked only ninety percent of the time, and getting the remaining ten percent right was really difficult for us. We were making strides, we were making progress, but we really could not nail down the correctness. So when we moved to Java 1.4.1 we completely rearchitected our code, we moved from Carbon to Cocoa, and the lesson we learned from 1.3.1 was, first of all: if we cannot do hardware acceleration, we need something to fall back on, and that is a software renderer. So when we moved to 1.4.1 we decided, let's nail this down, let's have a terrific software implementation as far as correctness is concerned. Then, in the future, once we have that down, we'll be looking at new technologies emerging right here within Apple, we'll evaluate them, we'll see which one works best for us, and then we'll adopt that technology. So right now we are still at the transition point. In 1.4.1 we have brand-new code; there is not even one line of code that we share with 1.3.1. It's brand new, everything is written from scratch, but we wanted to nail the correctness first, and of course we're keeping our eyes open on what's going on around us and what technologies we can use later.

So in the 1.4.1 Java update that you guys have access to, first of all our priority was correctness. Second, we also didn't want to neglect Java graphics optimization; as you all know, 1.4.1 is not a speed demon as far as graphics is concerned. So we worked on very basic architectural optimization techniques that we could put in there, and we came up with three of them: lazy drawing, lazy pixel conversion, and lazy state management.
What lazy drawing is about is this: we simply collect all the primitives that you want to draw and put them aside in a queue, in a cache, and only when the time comes to draw them to the screen or into your image do we transition from Java to native and process that queue. The good thing about this lazy drawing implementation is that it's future compatible: whatever technology we choose to use next, the lazy drawing implementation will work with it. And we worked with the Core Graphics guys and made sure that whatever we do with our lazy drawing optimization will not break them in any way.

Second, lazy pixel conversion. There are certain image types that Java provides access to that are not supported natively. What that means is that if we want to do something with such an image, the pixels are in a format that is not understood natively, so we have to convert them. If we didn't do this optimization, drawing of images, or drawing into images, would be terribly slow. So lazy pixel conversion is simply a technique of converting the pixels only when it's necessary.

And thirdly, lazy state management. A graphics context has multiple different states that you can set: transformations, color, and so on. What this optimization technique is about is simply letting us set only those states that have actually changed. Now, we're not quite done with this optimization; we are only part way. So unfortunately, at this point, whenever you change most of the graphics states we have to slam all the other ones as well, and that is terribly inefficient.
But we're working on that. So here is the first benchmark, a micro benchmark, to show you; this one basically shows the performance of the lazy drawing optimization. What you had in your hands with our original 1.4.1 release scores a baseline of 100, and with what you will have now you get a score of 175, which is a seventy-five percent increase. That's not too bad, and we're not done with it by any means.
Second, Robocode. That's a real-world application, and there's an interesting story behind this one. Back when we were working on 1.3.1, Robocode was running pretty darn slow, and we went to the developer, and I think we made a mistake, because we told him: look, the image format that you're using is not fast with our current implementation; why don't you use the image format that we support natively, and it will speed up your application. Well, they listened to us, they changed it, and yes, there was a solid performance improvement in 1.3.1. However, when we moved to 1.4.1, underneath we used a different implementation, different techniques, different technology, and the Robocode score plummeted, because the image format was hard-coded. So there were two things that went wrong in Robocode. First of all, our lazy pixel conversion had a not very efficient filter, and what we found, and this is maybe a small tip for you guys, is that we had mixed floats and ints, and that made it terribly slow; when we switched to integers only, the filter was really much faster. This, combined with the fact that we also improved our lazy pixel conversion, made Robocode perform much better. Now keep in mind that this is still with an image that is not natively supported; we still have to do work there, but we use our lazy pixel optimization. Scott, do you want to show us this?
You might have seen this demo before; I showed it in the State of the Union, and it wasn't getting full frame rates then. We're getting much closer to full frame rates than I've seen before. If you remember, on the 1.4.1 release we were getting about four frames a second, and that's because we were doing all that conversion right on the fly. Now, if I just start this battle up, we should be getting something closer to around 30. Yeah, we're getting about 32, and we may even get above that; right now it's hard-locked to 30, so if I go up to maximum we might even break 30, and it's right around 30. And you can see here that we're not even using both of the processors. There's also something else going on here that you don't actually know about: we're running one of the really cool demos that you'll see later in the session in the background, on the same machine. So it's actually got a lot of extra time; I could start up my word processor and we'd still keep our 30 frames a second. I'd also like to do this: let's restart it and turn on my favorite option, which shows where the robots are scanning, and you get to see this, and we're still holding at 30 frames a second. So it's a pretty good improvement over just hard-coding that native image format, and, seeing that, OK, we had to do a fix.
Thanks a lot. Thanks, guys. So, plans for the future. First of all, we have lazy state management to finish; that should give us a nice free improvement. Then there's more optimization that we can do: we have already tried implementing some of our lazy pixel conversion filters using the multiple processors that are now available in many of our computers. We're not done with it, we're just testing, playing around, seeing how much improvement that can give us, but that's one of the things we're looking at. There are also AltiVec optimizations. So there are still quite a few technologies that we can use to make Java graphics go faster.
And we are talking to Sun's engineers. When we were working on our lazy drawing optimization, we actually went to them, to the Java graphics engineers, and we told them: listen, guys, this is what we're thinking of doing; what do you think, is it dumb, is it going to work, is it not going to work, do you have any other cool ideas we might use? And they loved it; it was none of their ideas, and they said, yeah, go for it, we don't even have that. So we've done something that they wish they had, and we're definitely doing some interesting things.

And then lastly, and this is very important, we are working very closely with the Core Graphics guys. For example, if you went to the Core Graphics sessions, you might have learned that they're starting to provide pbuffers, first in OpenGL; what those are is off-screen images. They can also wrap an OpenGL context in a CGContext. What that means to us is that we basically don't have to do any extra work to get hardware acceleration for certain operations.
Pbuffers, together with Core Graphics giving us encapsulation of an OpenGL context within Core Graphics, give us the ability to very easily implement volatile images. As you know, volatile images right now are not hardware accelerated on Mac OS X, but there you go: Core Graphics just added something that we can use to have a real hardware-accelerated implementation for volatile images. So that's one example. And second, of course, they're looking into an API that sits on top of OpenGL, very similar to the hardware acceleration that we had in 1.3.1, but better, because we would not be the only client of it and we would not have to support it ourselves; it would be system-wide. If they come through, if they get that working, then that's definitely something we would like our code to work with and take advantage of. So we'll be looking very closely at what the Core Graphics guys are doing, and we'll certainly take advantage of any cool technologies they have to offer.

If there is one thing that I would like you guys to take away from this session, it is what to do about images.
If you have to draw into a BufferedImage, how do you determine the correct, the fastest, image type to use? This is the most important thing: please do not hard-code the BufferedImage type. That's an example of code that would be hard-coding it. What you can do instead is ask the system for a compatible image. If you do this, then no matter what technology we use in the future, you're guaranteed to be given a BufferedImage type that will be the fastest on our platform. That is very, very important. And one more point here: if you have the option, use a VolatileImage. Volatile images will hopefully be hardware accelerated soon, so if you have the choice, use createVolatileImage.
Now, there's a misconception among some of you with respect to indexed color formats. On other platforms they're very fast and they also conserve memory; using an indexed format is a way of compressing the pixel data, using less memory. Unfortunately, on Mac OS X they are not supported natively, so what we have to do internally to support that image format is to create a brand-new buffer just to hold those pixels converted into a format we can understand natively; only then can we use them. So indexed color images on Mac OS X do not use less memory; on the contrary, they use more, and they aren't faster. So don't use them unless you have no choice. If you do need to use them, fine, but otherwise use the other image formats; very often it's very easy to switch, you don't need to do a lot, just change the buffered image format type.

And second, and this is very important: the most optimal image format is not something to hard-code; it can change, it can vary from machine to machine. If we were to move, again, to a technology that uses OpenGL, it would be very dependent on the video graphics card in your system. It would also matter to us what resolution monitor you're running and what screen depth you're running at. So there will not be one and only one image format that is the best, the fastest; it will change, and you need to keep that in mind if you're writing for Mac OS X.

Now, if you really need to know which image formats are natively supported at this particular time, and this may not hold even in the next few months, this may change, but at this point only four image types are supported natively. Those are the fastest, and those are the image types into which you can draw, meaning they can be the destination: you can create a context for them, via BufferedImage.createGraphics or getGraphics. So only those four are natively supported; those are the fastest at this point. If you need to draw an image somewhere else, meaning the image whose pixels you have is the source, then the natively supported image formats are a superset of the destination ones: there is one more image format that we can support natively as a source, and that is ARGB alpha non-premultiplied. That is, by the way, the image format that Robocode uses, and we have added a special optimization to our lazy pixel conversion that actually lets the pixels know whether they're in the native format or the Java format; based on that we keep two different CGImageRefs, we can switch very quickly between the two of them, and we choose the pixels that are up to date.
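Given the indexed-color caveat above, one common workaround is to copy such an image once, up front, into whatever format the system reports as fastest. A hedged sketch (the helper name is mine, and the headless fallback is only there so the sketch runs anywhere):

```java
import java.awt.Graphics2D;
import java.awt.GraphicsEnvironment;
import java.awt.image.BufferedImage;

public class ImageConvert {
    // Copy a (possibly indexed-color) image once into whatever the current
    // GraphicsConfiguration says is fastest, so every later draw of it
    // skips the per-operation pixel conversion.
    static BufferedImage toCompatible(BufferedImage src) {
        BufferedImage dst;
        if (GraphicsEnvironment.isHeadless()) {
            // No screen to ask; fall back to a common direct-color format.
            dst = new BufferedImage(src.getWidth(), src.getHeight(),
                                    BufferedImage.TYPE_INT_ARGB);
        } else {
            dst = GraphicsEnvironment.getLocalGraphicsEnvironment()
                    .getDefaultScreenDevice()
                    .getDefaultConfiguration()
                    .createCompatibleImage(src.getWidth(), src.getHeight(),
                                           src.getTransparency());
        }
        Graphics2D g = dst.createGraphics();
        g.drawImage(src, 0, 0, null);   // one conversion, here and only here
        g.dispose();
        return dst;
    }
}
```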
And here are some techniques for rendering. This is important on our platform especially, because we are missing one of the technologies we had in 1.3.1, which allowed you to draw reasonably fast even if you were drawing on a non-AWT event thread. We don't have that quite yet in this current release; we're working on it, but we don't. So the point here is: you can trust us, and we may do this, but if you really want to do something about it, if you're in this situation and want to make sure your application runs fast, then when you're on a non-AWT event thread, please first render off-screen to an image, and then take all of that content and blit it to your final destination in one operation. It will be much faster on Mac OS X.
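A minimal sketch of that render-off-screen-then-blit pattern (the names and the drawing itself are invented for illustration): everything is drawn into a BufferedImage, and the finished frame reaches the destination in a single drawImage call.

```java
import java.awt.Color;
import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class OffscreenRenderer {
    // Build the whole frame off-screen; this can run on any thread.
    static BufferedImage renderFrame(int w, int h) {
        BufferedImage buffer = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = buffer.createGraphics();
        g.setColor(Color.BLACK);
        g.fillRect(0, 0, w, h);             // background
        g.setColor(Color.GREEN);
        g.drawRect(10, 10, w - 20, h - 20); // ...many more primitives here...
        g.dispose();
        return buffer;
    }

    // In paint(): one blit of the finished frame to the destination.
    static void present(Graphics dst, BufferedImage frame) {
        dst.drawImage(frame, 0, 0, null);
    }
}
```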
Now, this is for those of you who really need the fastest access to the image pixels for whatever reason, say you're writing an image manipulation program like Photoshop, something like that. Unfortunately for you, there is no way to determine whether a certain image type is supported natively or not; it's constant only at this point, and it may change in the future. However, if for some reason you need to do this, and you know the image format is supported natively, what you can do is grab the pixels directly from the data buffer. On a non-natively supported image, you do not want to access pixels directly. If you do, we have to turn the lazy pixel conversion optimization off, because the second you touch the pixels, those pixels are, so to speak, stolen: you have access to them, and we do not know when you look at them or when you use them, so we have to do the conversion from native to Java on every single operation. So do not touch pixels directly on a non-natively supported image; go through the Graphics object and draw to it that way.
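To illustrate the difference (the helper names are mine): the first method mutates the image through its Graphics object, which leaves the lazy pixel-conversion bookkeeping intact; the second grabs the raw int array out of the DataBuffer, which is exactly the direct access being warned about, so reserve it for natively supported image types.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferInt;
import java.util.Arrays;

public class PixelAccess {
    // Preferred: draw through the Graphics object; the runtime can keep
    // its cached native copy of the pixels valid.
    static void clearViaGraphics(BufferedImage img, Color c) {
        Graphics2D g = img.createGraphics();
        g.setColor(c);
        g.fillRect(0, 0, img.getWidth(), img.getHeight());
        g.dispose();
    }

    // Direct pixel access: only worth it on natively supported TYPE_INT_*
    // images. Touching the DataBuffer "steals" the pixels from the
    // runtime's tracking, defeating lazy pixel conversion elsewhere.
    static void clearViaDataBuffer(BufferedImage img, int argb) {
        int[] pixels = ((DataBufferInt) img.getRaster().getDataBuffer()).getData();
        Arrays.fill(pixels, argb);
    }
}
```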
Other optimization tips and techniques. These come from my own work: I wrote an application, a DNA sequencing application, and when I was trying to optimize it, here are a few things I found that helped. First of all, avoid creating objects in your paint method; obviously this applies to all platforms. Don't create new Fonts, don't create Rectangles; if you need to use and manipulate them later on, for example for determining the clip, set them up ahead of time, but don't create any objects in your paint method. Use simple primitives instead of Shape objects. Our lazy drawing optimization actually attempts to do that automatically for you, but if you have the choice, on Mac OS X it is faster to draw the primitives directly, with, for example, x, y, width, and height arguments, as opposed to creating a Rectangle object. Use polylines instead of drawing lines one at a time: it not only avoids repeated crossings from Java to native, it's simply faster with the current Core Graphics-based implementation, because we can build one complex path from a polyline; otherwise we have to draw the lines one at a time, and Core Graphics is not terribly good at that. And this one will probably not apply to most of you, but if you have a limited alphabet, so not a text editor kind of application where you'll be drawing arbitrary complex characters, but say an alphabet of four letters, then maybe you can use this optimization: use bytes, not chars. Chars are 16-bit, and we do not know whether one will be a Unicode character or not; if it is, we have to go through the more complex path for drawing Unicode characters. If it's a byte, then we know it falls within the ASCII range, and we can bypass some of the complex text rendering routines and go straight to Core Graphics to blit those characters.
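Those tips together look something like this sketch (a made-up paint routine, not from the session's slides): geometry precomputed outside paint, primitives drawn with int arguments instead of Shape objects, one drawPolyline call instead of a loop of drawLine calls, and drawBytes for a small ASCII-only alphabet.

```java
import java.awt.Color;
import java.awt.Graphics;

public class FastPaint {
    // Precomputed geometry: build these once, never inside paint().
    static final int[] XS = {10, 40, 70, 100};
    static final int[] YS = {10, 60, 20, 80};
    static final byte[] LABEL = {'F', 'P', 'S'};  // limited ASCII alphabet

    static void paintScene(Graphics g) {
        g.setColor(Color.WHITE);
        g.drawRect(5, 5, 110, 90);                  // ints, not a new Rectangle
        g.drawPolyline(XS, YS, XS.length);          // one call, one complex path
        g.drawBytes(LABEL, 0, LABEL.length, 8, 115); // bytes skip Unicode layout
    }
}
```

In a real component this would be the body of paint(Graphics g); here it is a static method so the sketch stands alone.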
Use double buffering for the static portions of your application; that too applies to all platforms. And with this release we have added tons of runtime options for you guys to play with, to turn things on and off: you can turn off the optimizations that we provided, you can turn on and off the rendering of lines or rectangles or shapes. You can use all those runtime options to narrow down and find out what the problem with your application is, if you have one. So now, for the demo, I'd like to welcome Ken Russell from Sun Microsystems.
A couple of weeks ago at JavaOne, Sun announced the new Java gaming initiative, and one of the products of this initiative is a new OpenGL binding for the Java platform called JOGL. JOGL is open source, and you can download the source code right now on java.net: just go to java.net, search for the project name, and you can get it. And thanks to Gerard and a couple of all-nighters, JOGL is now running on OS X. It's running on the Developer Preview that you've got with your 10.3 CDs; it's not going to run on any earlier versions of Java 1.4 for OS X, so keep that in mind, but going forward it will work, and it will be fast and robust.

So we've got a couple of very cool demos to show you. This one is very special: this is Dobie the dog. Dobie was developed by the Synthetic Characters group at the MIT Media Lab, and Dobie is completely autonomous: he perceives his environment, he has his own internal motivations and desires, and you can actually train Dobie the same way you would train a real dog. You can sort of lure him around and show him new motions to do, you can reward him by giving him a little click with a clicker, and you can in some sense scold him by ignoring him when he does a behavior that you don't like. Basically, Dobie represents, or at least it's safe to say that he is, pretty much the state of the art in interactive animated characters that can learn, and you can read more about him in the paper on him from SIGGRAPH 2002.
Now, Dobie, it turns out, is written almost entirely in the Java programming language, with a little bit of native code around the outside to get the custom device inputs, and he uses some of the more advanced OpenGL techniques, like vertex shaders, to do the shadow that you see here and the cartoon-like shading around the edges of the dog. This demo runs at over 50 frames a second on a dual-processor G4, and I should mention that the Synthetic Characters group is a big OS X development house: they do all of their development at this point with Java on OS X, and this is the first demo they've actually had to slow down because it was running too fast. So it's been slowed down to 30 frames a second, because the GeForce is so fast, and the G5 will be even better. We're not actually going to train Dobie right now, he's just going through his paces, but you can sort of see what's going on: there's skinning, there's action selection, and this is running on top of the JOGL binding for OS X. Also notice the CPU usage in the bottom corner: there's almost nothing going on there, because everything goes through the video graphics card. So, cool stuff.
OK, so now here's another demonstration. This is a demo by NVIDIA Corporation that we've ported from C++ to Java. Now, this is not real-time ray tracing; this is using a couple of tricks to get hardware acceleration for this technique of rendering glass's prismatic effect. Many of you, I'm sure, are familiar with the technique of ray tracing: we send a ray of light out of the camera into the scene, and that is in fact being done at every vertex of this wireframe model. But the trick is that it's being done on the graphics card, by what's called a vertex shader, or vertex program. This is a tiny little assembly-language program that is actually uploaded to the graphics card when the demo starts up; it tells the card, OK, we're going to take the camera's position, the vertex's position, and the surface normal, and figure out where the reflected ray should go and where the refracted ray should go through the object. Basically, it looks up, in the surrounding environment, this street scene, the right texture coordinate for where the ray intersects the world, and what it's doing is distorting the background texture, on a per-vertex basis, in such a way that the thing looks like it's made out of glass. So it's not doing this at every pixel, it's doing it at every vertex, but it's close enough that it's really indistinguishable.

Another cool trick here: you'll notice that we just turned off the fringe effect. What's going on is that we're rendering the scene three different times, each with a slightly different refractive index for the glass, which makes the refracted ray land at a slightly different position in the surrounding environment each time. Then those three renderings are added together, again on the graphics card, and you get what basically looks like a prism.

I'd like to point out that the same binary for this demonstration runs on OS X, on Linux, and on Windows, and it runs at one hundred percent of the speed of the analogous C++ code; remember, this was a port, not a new demo. So basically, this is where we are with respect to OpenGL performance in Java: it runs on all platforms, it looks great, so go out and develop cool stuff.
We don't have Java 3D for you guys yet, but if you really need to use 3D graphics, then please use JOGL. Some of you might be familiar with GL4Java, which is a very similar technology, also an OpenGL binding. GL4Java doesn't support the latest OpenGL standard and doesn't give you access to pixel shaders; JOGL does. So if you want to do 3D graphics on Mac OS X using Java, you can.