WWDC2001 Session 504

Transcript

Kind: captions
Language: en
I won't waste any time. The goal of this session is to get the best performance out of your Java application on Mac OS X, whether you're writing a new one or whether you already have one that you're porting from Mac OS 9 or some other platform. What we'll cover in this session: we'll try to give you an understanding of the characteristics of Mac OS X Java performance, some techniques and patterns you can use in your code to try to get optimal performance on X, point out why measurement is critically important whenever you're trying to do optimization, and then lastly we'll have a demonstration of some performance analysis to give you an idea of what sort of tools will be available on Mac OS X. So without further ado, I'll introduce Ivan Posva from the Java VM team; he's our tech lead. Thank you.
[Applause]
Well, the first question is: what are the Java performance factors? First of all, it's your application design and implementation. If you have algorithms that don't scale to your problem set, there is no performance tuning we can do in the VM that will improve your app, so you have to make sure that you don't have n-squared algorithms or anything like that in there. The second factor is the amount of memory your application is consuming: the more memory you use, the more likely you are to be swapping out, to be paging, the more stress you're putting on the VM memory subsystem, and the less likely you are to be benefiting from caches, the data cache and the instruction cache. The next factor down is bytecode execution performance, that is, at what speed does the Java VM execute your Java code. Jim will talk about tips and tricks, do's and don'ts, in the second part of this talk on what you can do in this area; I will concentrate more on Java VM efficiency in the first part of this talk. There are two other factors that influence Java performance: one is the speed of the underlying OS, which we are not touching in this talk, and then there is obviously the speed of the hardware you're running on.
So let's look at what Java VM efficiency means. There are two most-cited issues that influence Java performance at runtime. One of them is memory management. There is the footprint of your application, the footprint of your running Java process: that includes your Java heap, that includes all the supporting Java VM memory structures (we are actually using the Java heap for those as well), and it includes parts for the OS, parts for the code of the VM, and so on. Then second is the speed of allocation: Java is very object-heavy, you allocate a lot of temporary objects, so you need to be able to allocate those objects very quickly. And to keep your footprint down, if you just allocate very quickly and never reclaim, you would grow your footprint to infinity; that's why you have to reclaim very efficiently as well. I will touch briefly on what we do there and how you can actually help us, as Java programmers, in that area. The second part is synchronization: Java has built-in support for multithreading in the language, and to make that viable to use, we actually have to make sure that the synchronization primitives are implemented in a very efficient manner inside the VM. Another part is startup performance. There must be hundreds of Java benchmarks out there on the web, small and big, some useful, some less so, but I've never seen one that actually measures Java startup performance. You don't want your application to take a minute or two to start up while the user watches the bouncing icon in the Dock. So I will touch on some of the things we have done to address that issue, most notably reducing class loading (class loading is about 40% of your Java VM startup), and on what you can do to help us out with startup performance. So let's go to
memory management first. The HotSpot VM that we ship on Mac OS X has an accurate, compacting, generational copying garbage collector. What do those buzzwords mean? Accurate means we know at all times where in the VM you have references to your objects; we can distinguish between real references to objects and memory locations that just look like references to objects. Compacting means that we compact the heap: we don't leave any holes in the heap, meaning all the memory in use is moved together, compacted, as the name says, and that improves both the footprint and the locality of reference. Generational copying means that we allocate objects for the first time in a new generation, and only when they survive for a certain amount of time do we actually copy them into an older generation, which we collect much less frequently than the newly allocated objects; so we spend a lot of cycles collecting newly allocated objects, and for older objects we don't spend that time. So there is one thing to take away: the lifetime of your objects is important. If you have objects that you're not going to use anymore, you have to null out the references; if you have object hierarchies that you're not using anymore, null out the reference to that object hierarchy, to actually give the garbage collector an opportunity to reclaim all the space that is allocated for those object hierarchies. The second part is: if you use some caches, or have objects that can be recreated at a later time, you can use the WeakReference class or the SoftReference class to give the collector a hint, an opportunity to remove objects out of your working set if memory is getting tight.
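The cache idea just described can be sketched in a few lines of Java; the class and method names here are invented for illustration, not part of the session, and the raw (pre-generics) SoftReference style matches the 1.3-era API:

```java
import java.lang.ref.SoftReference;

// Hypothetical cache entry: the collector may drop the softly referenced
// data when the heap is getting tight, and we recreate it on demand.
class PixelCache {
    private SoftReference cached; // raw type, as in the 1.3-era API

    byte[] getPixels() {
        byte[] pixels = (cached != null) ? (byte[]) cached.get() : null;
        if (pixels == null) {
            pixels = loadPixels();               // recreate on demand
            cached = new SoftReference(pixels);  // re-cache for next time
        }
        return pixels;
    }

    private byte[] loadPixels() {
        return new byte[1024]; // stand-in for the real (expensive) load
    }
}
```

The collector clears the soft reference only under memory pressure, so in the common case repeated calls hand back the cached array without reloading.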
Then, avoid finalizers where you can. Objects that have finalizers need to be handled specially in the VM: we need to call into the runtime from interpreted or compiled code to allocate objects with finalizers, because we have to keep track of them, and that makes it very hard to allocate quickly; all other allocation is done inline in interpreted and compiled code. For finalizable objects we first have to keep track of them by allocating through the runtime, and then, when we're throwing them away, we have to make sure to call the finalizers. So if you can avoid finalizers, please do. Then, to reduce your footprint, it is helpful to do lazy initialization and allocation: that way you reduce your footprint as well as your startup time, which improves startup performance.
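A minimal sketch of that lazy-initialization advice, with invented names, using the raw 1.3-era collections: the expensive structure is only built the first time it is actually needed, so an app that never touches it never pays the footprint or startup cost.

```java
// Illustrative lazy initialization: the table is built on first use,
// not eagerly at class-initialization time.
class SymbolTable {
    private java.util.Hashtable entries; // stays null until first needed

    Object lookup(String key) {
        if (entries == null) {
            entries = buildTable(); // pay the cost only when required
        }
        return entries.get(key);
    }

    private java.util.Hashtable buildTable() {
        java.util.Hashtable t = new java.util.Hashtable();
        t.put("version", "1.3.1");
        return t;
    }
}
```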
So what did we do in memory management to improve some footprint issues? We introduced a shared generation that stores the most often used classes and methods, including their bytecodes and so on, and this shared generation is mapped in from a file: we use the underlying Mach virtual memory system to just bring that memory into the VM. We reduce GC time because we never try to collect or reclaim that memory; it is basically free, because we read it from a file and are never touching it, and we share it across multiple running Java processes. The way you can help us there is, well, don't break it: don't change your boot classpath, and do not modify the system jar files that are installed on the system; that way we can always use the shared generation when we start the VM. It has an additional benefit in startup time, which I'm going to touch on now. Avoid eager class loading and class initialization; that has an effect on both startup time and your memory footprint. If you don't load classes when you don't need them, you're not using the memory, and you're not spending cycles to load them from the class file, decode them, bring them into memory, initialize them, and so on. If you want to see which classes are loaded at what time, you can use the -verbose:class flag on the command line to see what is loaded at what time and in which order. As I mentioned before, the shared generation reduces class load time, because most of the classes you are going to use out of java.lang, java.net, Swing, and so on have been preloaded for you in that shared generation and are mapped in from the file when you start the VM; so when you run the command I mentioned, java -verbose:class with -version, you don't see any class loading going on at all. So the other
part was synchronization. The HotSpot VM we are using has very fast synchronization in the uncontended case. What does the uncontended case mean? It's when you synchronize on an object with one thread at a time; that is most often the case: you don't really have any lock contention going on, but you want to be protected if some other thread happens to be in that code. For this we have a constant-time overhead: the inline implementation in the compiled code or in the interpreter takes about eight to ten instructions. It has very low memory overhead: we don't allocate any space for these object locks on the heap, they're all stack-allocated in the stack frame in which you're synchronizing on the object, and we don't use any of the underlying OS resources, because that is expensive; it ties down memory, sometimes even in the kernel, which you want to avoid. The contended case is rare, but since it happens, for that we use Mach primitives directly to get as much performance as we can.
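For illustration, a plain synchronized method like the one below is exactly the kind of code that takes this uncontended fast path when only one thread uses it at a time; the class is a made-up example, not VM code:

```java
// An ordinary synchronized counter. When only one thread calls
// increment() at a time, HotSpot takes the cheap inline path: the lock
// record lives in the caller's stack frame, not on the heap, and no OS
// lock resources are involved.
class Counter {
    private int count = 0;

    synchronized void increment() {
        count++;
    }

    synchronized int get() {
        return count;
    }
}
```

Only when a second thread actually collides on the same object does the VM fall back to the expensive contended path.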
So before I hand this talk over to Jim, I wanted to tell you what's new in 2001. We shipped Mac OS X; it has the shared generation, which is one of the big improvements we made to the Java virtual machine we get from Sun. In the 1.3.1 Developer Preview that we're working on, and that should be ready soon, we did inline interpreter allocation, which was not in the original Mac OS X release; we're working on thread-local allocation; and we have a faster instanceof and a faster arraycopy that is tuned to the G4 with the Velocity Engine. The code we use for arraycopy we are also using inside the GC to copy objects between the generations, so we're making use of that code in multiple areas. So now I want to hand this over to Jim. Thanks.
So, memory management and synchronization are a crucial part of what affects the performance of your application, but I think, fundamentally, when we're working with Java code we think of the speed of code execution as being the key factor in determining what is actually slowing down our particular application. So what I'm going to try to do in this talk is describe how things get interpreted, how they get compiled, when they get interpreted, when they get compiled, and give you some ideas and some coding hints you can use to speed up performance once you find out what kinds of weaknesses you have in your code; I'm also going to run through a few things that have changed since our talk last year. So first of all, I want to point out that people often question: well, why don't we compile everything? Obviously, if you could turn everything into native code, it's going to run a lot faster than if we interpret it. But because the Java environment compiles code on the fly using just-in-time compilers, there's a certain amount of cost, both in CPU and memory usage, to get things compiled, and when you do some analysis of the actual VM, you find that it's often cheaper to interpret the code, because the interpreter is fairly fast, than to go off and compile it and try to run it. So we have to strike a balance between what gets interpreted and what gets compiled. Okay, so when does a method get
compiled? Well, first of all, everything in the VM starts out in the interpreter. We make the assumption that the method will run best in the interpreter, the first few times at least, and try to get a feel for what the method is about. Within the interpreter we actually have monitoring which counts the number of times a particular method is invoked, and once the method is invoked a few times, in the case of the HotSpot client VM a thousand times, we siphon it off to the compiler, and the compiler generates native code. So the one-thousand-and-first time the method is invoked, it fires off to the native code and starts running faster at that point. Now, the invocation count alone is not sufficient to determine whether a method is hot, or used fairly often, so we also keep track of the number of times a method loops: within the interpreter's loop code, every time a method loops we also increment the invocation counter, so a method that loops a lot is going to get compiled a lot quicker. If you have a method with a loop that iterates a hundred times, then it's only going to take ten invocations of that method to actually trigger compilation.
Now, there are certain methods that tend to loop for a long period of time, maybe ten thousand times or a hundred thousand times, or maybe forever, depending on the type of application you have, and one of the things we've added in the last year is something called on-stack replacement, which allows us to take a method that is hot and actually looping, create compiled code for it, and replace the interpreted version of that invocation with a compiled version, and we continue on in the compiled code. This is a pretty neat notion, and hence almost all the things that are hot spots in your application get compiled when they need to be. Last year I had a list of things that don't get compiled; at this point, almost anything that's a good candidate, or is a hot spot in your code, will get compiled, and there are very few things, like a couple of Java assembler concoctions in the JCK, that don't get compiled. So using these criteria, the number of invocations and the number of times it loops, we can actually find out what is hot in your code, and we find it's only about 5% of the bytecodes, 5% of your application's methods, that need to be compiled and hence are hot in your application. Andy will be going over some of the tools that will allow you to determine which of those methods are actually getting compiled to native code, and once we've got that information, we can start tweaking those particular methods, because those are the methods that are going to be problematic. So I'm
just going to go through and discuss a few things you can do to get the best performance out of your application, the types of things you can concentrate on once you find out what's hot in your application. First, and I'm constantly repeating this, the most important thing is that because compilation takes up a lot of resources, CPU time and memory, it's best to keep the size of your methods down, because then the compiler can get through them fairly quickly. And if only a small portion of your method is actually being used all the time, maybe you have some exception code in there or some special-case code, it's doing a disservice to the compilation process to have that embedded in your method. You should try to break that code out into separate methods and keep your methods small and focused, so that they can compile quickly and then go off and execute, and you get good locality of execution of the code as well. So, as I say, separate rarely used code out into separate methods.
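That refactoring can be sketched as follows; the method names are invented for the example, and the point is only that the cold error path no longer bloats the hot method:

```java
// The hot method stays small and focused; the rarely executed
// error handling lives in its own method, so its size no longer
// burdens the compilation of parse().
class Parser {
    int parse(String s) {
        if (s == null || s.length() == 0) {
            return handleBadInput(s); // cold path: separate method
        }
        return s.length(); // hot path, small enough to compile quickly
    }

    // Rarely called, so it can stay interpreted without hurting anything.
    private int handleBadInput(String s) {
        System.err.println("bad input: " + s);
        return -1;
    }
}
```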
You might say to yourself: well, if I put it into a separate method, then we're going to incur the cost of calling that method, basically pushing parameters and so forth. But you'll find that the VM will actually inline things that make sense to inline, when we can make more optimal use of the code by inlining as opposed to having it separate, so don't worry about that. In particular, accessor methods are always inlined, so you don't have to worry about the fact that you've got a very tiny method whose only job is to extract a field; and with the 1.3.1 code, we actually have a much tighter implementation of accessors, so they are very fast.
One of the things that I like to do is find which methods in the class libraries are used fairly frequently and tweak the code specifically to handle those, because those are the routines that are used a lot by everybody, and we want to get good performance for those particular methods. So trust the supplied classes. You may have the urge to go out and rewrite the Vector class, because it synchronizes every time you access it and there's an object checkcast every time you do an extraction from it; these are exactly the sorts of things we notice are used a lot, so we provide special cases for those particular types of methods that are used fairly frequently. So instead of going off and writing your own, trust the supplied classes: classes such as String, StringBuffer, Vector, and the collection classes. Use what's there, because we're going to get the performance up for you, and we've added some more optimizations, more special cases, for those in 1.3.1. If you have a copy from one array to another, use System.arraycopy, because, as Ivan mentioned, we're using G4 acceleration in the array copy, and hence it's going to be the fastest way of doing it; so instead of writing a loop that iterates through, use System.arraycopy.
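In other words, replace the hand-written copy loop with the supplied primitive; the helper below is a made-up example:

```java
class CopyDemo {
    // Prefer System.arraycopy over a manual element-by-element loop;
    // per the talk, the VM's copy routine is tuned (and is the same
    // code the GC uses to copy objects between generations).
    static int[] duplicate(int[] src) {
        int[] dst = new int[src.length];
        // instead of: for (int i = 0; i < src.length; i++) dst[i] = src[i];
        System.arraycopy(src, 0, dst, 0, src.length);
        return dst;
    }
}
```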
And then of course there are certain functions, like sine, cosine, and tangent; it's best to use what's supplied and not go off and write your own JNI routines to deal with it. Make the best use of the native data types. Again,
the G4 is not a 64-bit processor, so whenever you do long arithmetic, it has a certain amount of cost associated with it. Some of the basic operations, like add and subtract, or and, or, and so on, are reasonably cheap, but when you get into shifting or divide, it can be fairly costly. So if you don't need all that precision, stick with ints for the time being. And then also consider using floats instead of doubles; not necessarily in your computation, because sometimes you just need the precision, but when you're dealing with arrays of values, it's best to keep the size of your arrays down by using floats, and there are quite well-known techniques for keeping precision even though you're using a 32-bit value. And then, new in 1.3.1, we've added better register allocation for longs and floats and doubles, so you'll find that, especially if you're doing a looping type of calculation, performance will be improved there. Try to avoid
using the generic data types, because there is a cost in assigning a generic data type to a specific data type: we have to go through a checkcast. Ivan mentioned that we've done some performance improvements in 1.3.1 to deal with instanceof and checkcast, but there is still a cost: instead of a simple assignment, we have to go off and make sure the object really is the right class. So try to avoid using generic types, and use subtyping or subclassing in these circumstances, because that way you can avoid making assignments that require these checks.
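In 1.3-era Java (before generics), one way to follow that advice is a small container typed to its element, so reads need no checkcast; the class below is invented for illustration:

```java
// Extracting from a generic container costs a checkcast on every read:
//     String s = (String) vec.elementAt(i);
// A container typed to its element sidesteps that per-access check.
class StringList {
    private String[] items = new String[4];
    private int size = 0;

    void add(String s) {
        if (size == items.length) {            // grow when full
            String[] bigger = new String[items.length * 2];
            System.arraycopy(items, 0, bigger, 0, size);
            items = bigger;
        }
        items[size++] = s;
    }

    String get(int i) {
        return items[i]; // statically typed: no checkcast needed
    }

    int size() {
        return size;
    }
}
```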
Next, try to work with local copies. Now, one of the things people have been asking about is: why is the code generated by the HotSpot client compiler slower than, say, the server version? Well, some of the optimizations you get in the server version of the compiler are very sophisticated, and they're not there in the client, because again we want to compile things fairly quickly and get them up and running. So if you have an array access and you're working with that array value, it's best to make a copy of that value, work with the copy, and put it back in again. In this particular example, you have three accesses to the array; that means we have to do three bounds checks and three null checks on the table itself, whereas if we make a copy, we only have to do two in this case. Plus you get the locality benefit: the value will be assigned to a register, as opposed to going back to the array, so you get a performance boost there as well. This is a sort of hand optimization of the code.
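The array example described can be sketched like this, with hypothetical method names; the computation is the same in both, and only the number of array accesses (and hence bounds and null checks) differs:

```java
class TableMath {
    // Three accesses to table[i]: up to three bounds/null checks.
    static void scaleSlow(int[] table, int i) {
        table[i] = table[i] * 3 + table[i];
    }

    // Copy to a local, work with the copy, write it back: two accesses,
    // and the client compiler can keep the value in a register.
    static void scaleFast(int[] table, int i) {
        int v = table[i];
        table[i] = v * 3 + v;
    }
}
```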
One of the things that people run into, especially on MP machines with a lot of threading: if you have accesses to global values, they're wondering why the values are changing, or not changing, out from underneath them. Make sure that if you have a global value, a global static, that's being read from several different threads or written to by other threads, you use the keyword volatile. One of the optimizations a compiler will do is to say, well, this is a value I've already got a copy of, why should I go back and get the original? If you put the keyword volatile on it, this will guarantee that things get reloaded every time you access the variable.
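A minimal sketch of that volatile advice (the class is a made-up example): the flag is written by one thread and read by another, and volatile forces each evaluation of the loop condition to reload it instead of reusing a cached copy:

```java
// A stop flag shared between threads. Declared volatile so the compiled
// loop re-reads it on every iteration; without volatile, the compiler
// could legally cache the first value it saw and spin forever.
class Poller implements Runnable {
    volatile boolean stop = false;

    public void run() {
        while (!stop) {
            Thread.yield(); // spin until another thread sets stop
        }
    }
}
```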
Use final constants, or static finals. This basically specifies that the variable, let's say in this case the buffer size, is a constant, and the compiler can treat it as a constant all the way through the code and do constant folding. In this example, we know that the character array we're allocating is a fixed size, and we know that the initialization of the buffer in the loop is going to iterate a fixed number of times, so take advantage of that by declaring your statics final if they're going to be constant throughout your execution.
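The buffer-size example can be sketched as follows, with names assumed for illustration; because BUFFER_SIZE is static final, the compiler can fold it into both the allocation size and the loop bound:

```java
class Config {
    // static final: the JIT can treat this as a compile-time constant,
    // folding it into the array allocation and the loop's trip count.
    static final int BUFFER_SIZE = 1024;

    static char[] makeBuffer() {
        char[] buf = new char[BUFFER_SIZE];      // fixed, known size
        for (int i = 0; i < BUFFER_SIZE; i++) {  // fixed iteration count
            buf[i] = ' ';
        }
        return buf;
    }
}
```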
There's a certain cost involved in invoking anything which is a virtual call or an interface call. In the case of virtual calls, we have to index into a table to find the address of the method we want to dispatch to; with interfaces, it's a little bit more complicated, because we actually have to do a match and make sure that we match the class of the method we want to invoke. So virtuals are a little bit cheaper than interfaces; if you have a choice, try to stick with subclassing as opposed to creating interfaces, and you get better performance that way. In the HotSpot VM, we actually cache the call, so that from any particular call site we know which method worked for us last time and we try to reuse that, so we don't actually do a lookup each time, but there is still a cost in that initial lookup. So try to use virtuals versus interfaces. One of the
optimizations that we've done in the HotSpot compiler deals with switches. You can create switch statements with fairly sparse values in your cases; in traditional compilation, what would happen in those situations is that the compiler would create a big if-then-else. We're using a technique of double indexing which allows us to dispatch fairly quickly on any switch; it's not the nested-if combination. So if you're comparing a single variable against an integer data type, utilize switches over if statements.
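A sketch of such a sparse switch (the tag values are arbitrary examples): even with large gaps between the case values, the switch form gives the compiler a single dispatch, where a chain of ifs would be compared one by one:

```java
class TagDispatch {
    // Sparse case values: per the talk, HotSpot still dispatches these
    // quickly (via double indexing) rather than as a nested-if chain.
    static String kind(int tag) {
        switch (tag) {
            case 3:    return "header";
            case 47:   return "body";
            case 1021: return "trailer";
            default:   return "unknown";
        }
    }
}
```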
As more and more people are learning to program, some people new to programming have a tendency to use exceptions for control of their program flow. You really should use exceptions for the exceptional cases, and not for the actual flow of your program, because there is quite a bit of cost in the VM to actually handle an exception. So if there's a good likelihood that the routine you're calling is going to produce an error, then you should probably use error codes and test the result when you come back, as opposed to throwing an exception; that will be faster than actually throwing the exception and having the VM deal with it.
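That advice can be sketched as follows, with invented names: the routine reports a likely failure through a cheap sentinel value, and the caller tests the result instead of catching an exception:

```java
class Lookup {
    static final int NOT_FOUND = -1; // cheap error code, no exception machinery

    // Failure ("key not present") is an expected outcome here, so it is
    // reported as a return value rather than a thrown exception.
    static int indexOf(int[] data, int key) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == key) return i;
        }
        return NOT_FOUND;
    }

    // The caller tests the result when it comes back:
    static boolean contains(int[] data, int key) {
        return indexOf(data, key) != NOT_FOUND;
    }
}
```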
Finally, I think you should think pure Java. In the 1.3.1 code, we've implemented something called compiled natives, which allows you to call JNI code fairly efficiently: we don't have to go through the usual marshalling code, which marshals up the parameters and then goes off and calls the routine. What happens with these compiled natives is that we have a thunking layer which already knows what the parameters are going to look like and assembles them for the call to the JNI routine, so on the one side, your JNI calls are going to be more efficient and faster. But on the other side, there still is a cost in using JNI, or JDirect, which is built on top of JNI: there's a translation layer involved, and it costs, and if you're dealing with callbacks, that's going to require some kind of lookup. So you should use Java wherever you possibly can, and as time goes on, we're going to get the compiler faster and faster, and you can forget about C and C++. Okay, with that, I'll pass it on to Andy. Hi.
So I get to do my little bit now, where I talk about how important measuring is. All of the information that we've been giving you is kind of useless if you then go and apply it willy-nilly to your entire 60-megabyte code base. All the textbook advice that says don't optimize prematurely is really true: what you should be doing is measuring, finding the major bottlenecks, optimizing those bottlenecks, and making sure the optimization actually worked, because we've seen optimizations that have actually slowed things down. I'll also go through what you should try to measure, how with the 1.3.1 Java VM we've improved things, and how you've actually got tools to enable you to measure those things, and then I'll cover a few little myths that are still around in all those textbooks that are not quite true on Mac OS X. So the
first obvious thing that you always think of is: how fast is my program going? You look at the CPU meter on Mac OS X and it's pegged at 100%, so obviously you should be looking at where the CPU time is going; your program, whatever it's doing, is CPU-limited. The first thing that we do for you in HotSpot is compile all those hot methods: we're counting which ones get called most frequently, and we're compiling those. So obviously, look at the hot methods, look at the ones that are being compiled, and I'll cover how using -Xprof will actually tell you which ones are being compiled. Now, secondly, depending on where that CPU meter reading is coming from, you might be using system CPU and not user CPU, in which case you might be paging, and the poor old OS is just trying to read and write things from disk and shuffle things around in the VM system. Paging is really expensive. So if you're running on a 128 MB system and you set your heap to 256 MB, well, we think you've got 256 MB, so we'll happily go and allocate, and we won't do full GCs until we think we've run out of heap, but in the meantime, you'll be paging madly.
So think very hard about controlling your footprint and heap usage. Now, other times you get into situations where your CPU isn't pegged, and in fact, at first glance, your program seems to be doing nothing; and that's probably exactly what it is doing: it's probably waiting for the disk or the network to reply. So there are some tools on Mac OS X, some of which I will mention here, some of which are covered in the performance tools talk later, that allow you to look at what your program is doing I/O-wise and network-wise. And then lastly, one of the things that we talked about: synchronization. Monitor contention can get very expensive, and the reason is that, if you're used to Mac OS 9, switching threads and processes was relatively inexpensive, because it didn't have the memory protection and preemption behind it, whereas on X, when we switch threads, there's all the state in the processor that has to be saved out to memory, and when we switch threads between processes, you've got to save all of that address-space context as well. So it's a lot more expensive than on 9; that's one thing to bear in mind. So how do you go about measuring
all of the things that I've talked about? The best thing, from your perspective, is to use a commercial performance tool, one example of which is OptimizeIt, which Scott will be demoing just after I've talked. It provides CPU profiling and/or sampling: profiling is a way of tracking each and every time methods get called; sampling does a statistical analysis. There are pros and cons of each: profiling gives you a very precise measure of exactly how often things get called; sampling is less invasive, so your program doesn't slow down as much. So depending on what you're doing, one or the other is better. You can also look at object allocation: which objects are getting allocated, where they get allocated, etc. Scott will cover a lot of that in the demonstration. The other thing you can do is with what HotSpot itself provides in the 1.3.1 Developer Preview:
HPROF is now functional (it wasn't in Cheetah); HPROF is implemented as a library, loaded at runtime, that uses the JVMPI interface in HotSpot. Secondly, you can use -Xprof, which is a per-thread kind of measurement, and there's -Xaprof, which gives you allocation information. So, as I mentioned, HPROF comes with the Developer Preview that's available on the website. It's a basic CPU and monitor profiling tool, so it gives you a lot of nitty-gritty detail and not a lot of analysis. There's a relatively simple UI available from Java Software that gives you a primitive GUI on top and lets you drill down a little bit; I've used that to a certain extent, and it's quite helpful. It's relatively simple to use: you just pass a couple of command-line parameters, and you tell it whether you want to sample or look at monitor contention, etc. It turns out that the PerfAnal tool only works with CPU sampling; it doesn't work if you use profiling, so use the first example. The monitor contention mode will
give you a little bit of information about how much time each thread spends waiting on a particular monitor. So if you're seeing an application where you can't really see why it's slow, but there seems to be a lot going on, probably one of the first things you should do is look at monitor contention; you can see dramatic performance improvements there. Because when we get contention, as opposed to the uncontended case Ivan explained earlier, it's the difference between going through ten instructions inline in the interpreter or the compiled code, versus several thousand cycles going into the kernel and doing a context switch; that's why it's so expensive. Likewise, -Xaprof will give you a simple allocation profile: you run your program, and
at the end, when it exits, it'll spit out a dump of all the objects that got allocated, how much space they took up, the average instance size, etc., and just from that information you can say, well, maybe I shouldn't be allocating so many Vectors or hash tables; but it doesn't give you any information about where they got allocated, which is why OptimizeIt or something like that is much more useful. -Xprof is of somewhat limited use, because it gives you per-thread information: if you have a program that forks 400 threads, like VolanoMark or something that I tend to run, at the end of the program, when it exits, it spits out 400 copies of the information, which is not very useful. But it is the only way that I know of where we actually list out the methods we've compiled versus the ones that get interpreted, and how much time we spend in interpreted code versus native code versus compiled code, and how much time we spend in GC, etc. So that can give you some very useful insight, first of all, into which methods got compiled. You might look at it and say: hang on a minute, I expected method A to get compiled, because I was under the impression that this was my most expensive method; but it turns out that in fact we didn't go anywhere near it and didn't compile it, or maybe we couldn't compile it because it's got some funny assembler or some construct, or it's too big, etc. So that will pinpoint which methods are getting compiled; you can sanity-check that the ones that are getting compiled are the ones you expect, and once you know the ones that are getting compiled, you can then focus your optimizations on those methods. There's a little example of its use down at the bottom; it's very simple to use, but like I said, don't try it with 400 threads; just do it on something with a minimal number. Now,
Now, measuring memory is a little harder, because the Java VM has several different perspectives on what memory is. As far as you are concerned, the only memory you can really have any control over is the memory in the Java heap, and the tips that were explained earlier, where you null out references, avoid using finalizers, et cetera: that's the kind of thing you can control. Beyond that, you can watch the heap as it grows and shrinks using the -verbose:gc flag, and, as we mentioned, -verbose:class will show you classes as they get loaded. You might see classes getting loaded before you think you should be using them, and that's a case where you should go in and pinpoint why they're getting pulled in; maybe you can load them a bit later. There's a command called top which will give you an overall memory view of the whole system, and that's good for splitting out memory that's being shared: for example, when you run multiple Java processes, some of the memory that we pull in from the shared generation is shared between several processes, and you can tell the difference between memory that's privately allocated and used in the heap for you versus memory that's shared in the shared generation, or shared because of dynamic libraries being pulled in by native code, either your code or our VM. vmmap is another command-line utility that gives you a lot more specific insight into the intricacies of the virtual memory being used; it's relatively complicated, and if you want to learn a bit more about it, they might cover it in the performance tools session.
It turns out that Java VMs are improving faster than books about performance in Java can be written, so there are quite a few books out there, most of which contain extremely good advice, but some of their tips just become outdated with time as the technology rolls on. Traditionally, in 1.0 and 1.1 VMs, allocation was very slow; they had a malloc-based allocation scheme or something. Our allocation is now extremely cheap: the initialization of an object may not be, but allocating it is a few instructions. As a result of that, and as a result of scavenging, short-lived objects are very cheap to GC, because we essentially don't do anything with them; we just throw them away at the end of their lifecycle.
So, synchronized method costs are small, as was mentioned earlier, and the contended case is still expensive. Now lastly, as I hinted at before, system calls, which involve entry into the kernel, are expensive, just because of the whole context switch and a little bit more weight involved than on Mac OS 9. There are certain things that we do on your behalf as part of the Java APIs that involve system calls: network operations, I/O operations, things like that, and Thread.yield; all of those are system calls. So if you don't need to do things like that, avoid them.
So here's a quick graph which you've seen before: this is the peak allocation performance of various different technologies; I think Blaine showed it in his talk. This includes the garbage collection side of things, so it's not just allocation, and you can see that compiled Java, which is the tall one, is just way faster than any other technology now.
So here's an example that I pulled from a performance book published a year or two ago. It gives an example of one thing you can do to improve performance: pooling objects, so that by recycling them you avoid the cost of allocating and GCing them. So I wrote a little benchmark and ran it on my G4 PowerBook, and I got this sort of picture. As I increase the number of threads, you can see that for the single- and two-threaded cases the pooling is just slightly faster; I'm allocating a hundred thousand vectors, filling them up, throwing them away, et cetera. But when you get to a large number of threads, you can see that the time taken to actually recycle these vectors is longer than it took to create and GC them. In the dual-processor case, the moment you go to anything other than single-threaded, the simple allocate-and-throw-away mechanism is faster. And the point is that you don't have to incorporate any complicated pooling code if you just do the brain-dead thing: just allocate it and throw it away. So this is one example where the technology has just moved on, and that old truth about pooling things is not quite so true now.
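The kind of micro-benchmark being described can be sketched like this, assuming a Vector workload and a trivial stack-based pool (both illustrative; this is single-threaded for brevity, whereas the session's version ran across multiple threads, where the pool's locking makes recycling even less attractive):

```java
import java.util.ArrayDeque;
import java.util.Vector;

// Illustrative pooling-vs-allocation comparison; absolute timings will
// vary by machine and VM.
public class PoolBench {
    static final int ROUNDS = 100_000;

    static long plain() {
        long t0 = System.nanoTime();
        for (int i = 0; i < ROUNDS; i++) {
            Vector<Integer> v = new Vector<>(); // allocate fresh...
            v.add(i);
            // ...and just drop it; a scavenging GC reclaims short-lived
            // objects essentially for free.
        }
        return System.nanoTime() - t0;
    }

    static long pooled() {
        ArrayDeque<Vector<Integer>> pool = new ArrayDeque<>();
        long t0 = System.nanoTime();
        for (int i = 0; i < ROUNDS; i++) {
            Vector<Integer> v = pool.poll();
            if (v == null) v = new Vector<>();
            v.add(i);
            v.clear();    // recycling work the GC would have done for us
            pool.push(v); // under threads, this pool also needs locking
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        System.out.printf("plain: %d us, pooled: %d us%n",
                plain() / 1000, pooled() / 1000);
    }
}
```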
On the other hand, if you have an object (in this particular example I'm talking about a Java Thread) that is extremely expensive to create and initialize, then it can obviously be beneficial to recycle those: the expense of creating a thread involves a couple of kernel trips to create the internal data structures, so recycling can obviously be cheaper. The corollary is that sometimes, especially with something like a thread, which involves a kernel data structure, it can actually be costly to keep them around as well; so you have this trade-off where some things get more performant, but on the other hand you pay the penalty of keeping the kernel-side memory around, plus the extra stack, et cetera. So in this little graph: there's an example web server on some site which is brain-dead simple; it just sits on a socket listening for a request, gets a request, hands it off to a thread to respond to it, and serves back a response. I took that example and produced three variants: one which forks a new thread for every request, a second which uses a pooled collection of threads, and a third where the worker threads themselves actually sit in accept and handle the requests directly, so there is no listening thread. And the purple, sorry, the line here is the response time as seen by the client. With the version that forks a different thread for every request, it just doesn't scale as the number of requests goes up; you see a response time that degrades, sort of with N squared, with the number of clients.
The others degrade as well, but they degrade much more gracefully. Now, interestingly, when I did this exercise I had kind of expected that the version running multiple threads in accept would scale even better than the pooled version, and lo and behold, that isn't actually true. That's really why I wanted to include this slide: it's an indication of why you should be measuring, because my expectations were dashed. On a dual processor it's really interesting as well, because the per-thread one seems to be doing almost as well as the pooled versions, which was somewhat unexpected; but then I realized that what's happening in the pooled versions is that I'm getting a lot more contention, because I'm on an MP system. So the version where I'm running multiple workers in accept is actually the best-performing one, and the reason is that all of the contention is handled right in the kernel, at the accept call, rather than everything coming out and fighting over the socket.
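The pooled variant can be sketched like this, with a queue of simulated requests standing in for the listening socket (all names are illustrative; this is not the session's web server code):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative "pool of worker threads" pattern; a queue of strings stands
// in for accepted socket connections.
public class PooledServer {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> requests = new LinkedBlockingQueue<>();
        ConcurrentLinkedQueue<String> responses = new ConcurrentLinkedQueue<>();
        int workers = 4;
        CountDownLatch done = new CountDownLatch(workers);

        // The expensive part (thread creation: kernel trips, stacks, kernel
        // data structures) happens once per worker, not once per request.
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        String req = requests.take();   // wait for work
                        if (req.equals("STOP")) break;  // poison pill
                        responses.add("handled " + req);
                    }
                } catch (InterruptedException ignored) {
                } finally {
                    done.countDown();
                }
            }).start();
        }

        for (int i = 0; i < 100; i++) requests.add("req-" + i);
        for (int i = 0; i < workers; i++) requests.add("STOP"); // shut down
        done.await();
        System.out.println(responses.size()); // prints 100
    }
}
```

In the third variant described above there is no shared queue at all: each worker blocks in accept itself, so contention is resolved inside the kernel rather than on a user-level monitor.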
So our conclusion is very simple: your application design is paramount; that's the most important part of the performance of your app. There's a lot of new stuff in recent VMs, in HotSpot, and on Mac OS X that has improved some of those bottlenecks. If you follow our advice you'll get better compiled code, and so your app will run faster; and where you are seeing bottlenecks, just keep measuring and improving those things and you'll see results. So what we're going to do now is Scott's going to come up and give a demonstration of OptimizeIt.
So I'm going to show you OptimizeIt. I showed the memory portion of OptimizeIt at the Java development tools session, so I'm going to concentrate on the profiling portion. It's a tool written by a company called VMGear; they used to be called Intuitive Systems. It's a pure Java tool, well, it uses just a tiny little bit of native code, so it's mostly pure. The way it works is that it instantiates its own hooks in between your application and the VM, and then you run your application on top of it, and you're able to look at all the same kinds of profiling information you see in something like -Xprof, plus memory profiling and all that stuff. It's really cool; we use it at Apple. We've been helping them get it up and running, and we've been using it on all of our AWT and Swing work to find our bottlenecks, and it's saved us tons of time. We're hoping we can convince them to get a developer release out for you guys; they've committed to a fourth-quarter release, so it's good that we have it coming out eventually.
So let me go right over to my demo machine. There's not that much I have to say about it, but this is my sorting table demo; it's right out of SwingSet. I just put in names of people on our team, and I added some sorting to it. I didn't use any of the collection classes; I wrote my own sort, so I wrote the worst sort possible: a little bubble sort. So it's kind of slow: I click here and I sort by first name, and that's sorted, and that's only 58 items that I sorted by first name; and I sort by last name, and that's not really good. I actually pause the whole UI while I'm sorting, so I want to figure out what's going on: why is this taking so long?
So what I'm going to do is go over to OptimizeIt, which I've already launched. If I want to hook into this app, which I started using the OptimizeIt stubs, I'm going to do this through remote debugging. You can launch it all through here, but I kind of like doing the remote thing, because it shows you can do it on a separate machine. So I'll go to Remote Application; it's already been set up on this machine on this port, I have my source path set up, so I'll just attach to that, and it'll take a second to connect. I have this all set up from my demo this morning.
So this is the memory profile: all the objects that have been instantiated. There's a lot of cool stuff in here, but I'm going to go into the CPU profiler. The CPU profiler doesn't profile until you tell it to, and that's one of the big differences between this and something like -Xprof: you get to profile just the segment of the application that you want; you can turn it on, do your work, and turn it back off. So what I'm going to do is press the button, go to my application, click sort, and click sort again, let's do another one; then I'll go back here and stop.
Here's one of the cool things: here are all of our threads. Red is idle, green is active, and there are even groups of threads, like main and system. So I'll just start by staring generally at the main thread, looking around trying to see what's going on, and let me flip this around into the normal execution path. What we have right now is that 49% of this happened through event dispatch, which makes sense, because we clicked on buttons to do most of our work, and 34% of it was in Thread.run. Now, I wrote this application, and I have a separate thread that gets spawned off every time I sort; I actually create a new thread, I wrote this really badly, so I spawn a new thread and run my sort in it. If I look through here, I can see that I have my sort, and it ends up calling greaterThan, because I'm doing an excellent single-directional bubble sort, and I have a greaterThan and some of my time in toString.
But let's see what's going on inside of greaterThan. I've got a compare inside greaterThan, and there's something called toLowerCase, so immediately there's something going on here. I can look: if I just click here and double-click on this, it brings up my source code viewer, and I see there's some toLowerCase that's inside of the AWT code, and I don't care that much about that; but toLowerCase is taking up a lot of time. There's a whole bunch of stuff inside of AWT, but I want to see my own stuff, so, my sort-data class is mine, let's see what's going on in my compare. I can see right here that I get two strings, and to compare them I turn them both into lowercase first, because I wanted it to be case-insensitive, and then I compare them character by character. Okay, so that's really bad; actually, I've had an engineer who's done this before at another company I was at. So there are a lot of things you could do, and this immediately drives you there.
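Reconstructed from that description (this is not the actual demo source), the flagged compare looks something like this: every single call allocates two fresh lowercased Strings before examining a single character:

```java
// Illustrative reconstruction of the compare the profiler flagged: both
// toLowerCase calls allocate a brand-new String on every comparison, and a
// bubble sort makes O(n^2) comparisons.
public class NaiveCompare {
    static int compare(String a, String b) {
        String la = a.toLowerCase();  // allocates a new String
        String lb = b.toLowerCase();  // allocates another
        int n = Math.min(la.length(), lb.length());
        for (int i = 0; i < n; i++) {
            int d = la.charAt(i) - lb.charAt(i);
            if (d != 0) return d;
        }
        return la.length() - lb.length();
    }

    public static void main(String[] args) {
        System.out.println(compare("Apple", "apricot") < 0); // true
    }
}
```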
Now, I sort of dove around here, because I wanted you to see this allocation graph; it's pretty cool. It shows you each allocation entry point and how much time is spent, and you can even get sub-percentages: if I mouse over here, it says that 99.25 percent of the time is spent inside of this compare, and that's just my compare, not anything else going on. And right down here, this tells me immediately what the problems are: these are my hot spots. This is just taking the individual methods, no matter who called them, and showing you what percentage of your time is spent in those hot spots. And if you flip the graph back around, you can start from the hot spots and work back down, and you can see who's calling each hot spot: toLowerCase, okay, who's calling that? And it really is only called from one place: from my sorter. So that's kind of cool, and I found that pretty easily just on the main thread; but you can also go into your individual threads, and I can say, let's look at exactly the sorter. I looked at the whole main thread before, which included all my event loops and stuff; if we look at my sorter, it's even worse, it's got 60% spent in there. So if I actually had a big list I was trying to sort, I could do a little better.
So with that, I'm going to profile again; let me go back here. I have this thing called "fix it"; now, this really shouldn't be called fix it, this should be called "don't be so stupid." What it's doing is, instead of a whole toLowerCase, it's getting each character, lowercasing the characters, and testing those; that's a little better. So I'll turn that on and profile again: do a sort by first name, no, I already did that, so I'll click these a couple of times back and forth. And we'll see that now we don't have anything about toLowerCase in our... let me get up here, actually, sorry, we'll see some toLowerCase in here, but it's not going to be as huge; I can't even find it right now. So what's pretty cool is that now our compare is no longer the huge portion of this whole thing; we see there's something about AppContext and graphics, which, if you're using the hardware acceleration, would be a lot lower than this. So this is a really cool way for you to find out what actually is your bottleneck.
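The "fix it" version described above, sketched under the same assumptions: lowercase one character at a time, so no intermediate Strings are allocated per comparison:

```java
// Illustrative sketch of the per-character fix: Character.toLowerCase works
// on chars, so the comparison allocates nothing. (A production version would
// use String.compareToIgnoreCase, which does much the same thing.)
public class CheaperCompare {
    static int compare(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            int d = Character.toLowerCase(a.charAt(i))
                  - Character.toLowerCase(b.charAt(i));
            if (d != 0) return d;
        }
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        System.out.println(compare("Apple", "APPLE"));    // 0
        System.out.println(compare("Scott", "adam") > 0); // true
    }
}
```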
Obviously you wouldn't be writing these really bad sort routines, but who knows where things like this might pop up. If you're using a library from someone else, you'll actually see what parts of their library are slow, as long as it hasn't been obfuscated or something like that. You can even see into our libraries and see what's going on inside of graphics and things like that; you don't necessarily want to do that, but sometimes it's fun, and we've actually used this a lot. A lot of our graphics code is written in Java, so we use this all the time. I mean, I have engineers, other co-workers of mine, coming in and saying, you know, I made some changes this past week and everything just slowed down, JBuilder doesn't run very fast, what's going on, we haven't changed anything; and we run it through here, and we find that, yeah, someone did a really bad draw-a-circle or something like that, and so we optimize that and we get back our ten-times improvement.
It's pretty cool. Let me just show you another thing that's useful: this is the VM statistics. I've had this running the whole time, and it shows you things that tie into what the people before me were all talking about, which is that you don't want to load all your classes right away. You can turn this on at startup and see your classes being loaded, and it'll show you as you do different things. So if you actually have dynamically loading classes, which is what you want, you want to load them slowly as users get to different portions of your app, and you'll see your class count going up and up and up, plus active threads.
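One way to get that lazy behavior, sketched with illustrative names of my own: a class isn't initialized until its first active use, so a feature class that's only referenced when the user reaches it won't show up in the -verbose:class output at startup:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: Java initializes a class on first active use, so
// HeavyFeature's static initializer runs only when the feature is reached.
class InitLog {
    static final List<String> loaded = new ArrayList<>();
}

class HeavyFeature {
    static { InitLog.loaded.add("HeavyFeature"); } // runs at class init
    static void run() { /* expensive feature code would live here */ }
}

public class LazyLoadDemo {
    public static void main(String[] args) {
        System.out.println(InitLog.loaded.size()); // 0: not touched yet
        HeavyFeature.run();                        // first use triggers init
        System.out.println(InitLog.loaded);        // [HeavyFeature]
    }
}
```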
So if I were to go over here and actually say sort by first name... oh, it probably went way too fast to even see the thread. Yeah, so a thread came in, sorted, and went away. It also shows you your heaps, so you can see the actual amount of memory you have, and you can kick off the garbage collector, and now all the stuff that's just been sitting around, because the garbage collector hasn't needed to run, is collected. You can do a lot of cool things.
Let me just show you really quickly, for the people who haven't seen the memory profiler, one or two little things. In the memory profiler you can look at all your instances, and you can mark a certain instance and then go do something in your app, like a sort or something that's going to be really slow, like resizing these things, and you can see what's changed since you did that. You can see that a whole bunch of char arrays were allocated, and rectangles; obviously we use a lot of rectangles in graphics, and they should mostly go away when you run a garbage collection. So I hit the garbage collector, and we see that Rectangle went down to none, so we did a good job. There's something going on with this one String and one character, and we might go hunt those down for references or something, but that's basically what you have. There are a lot of different things you can do in this sampler; let's see, there it is.
So you have different types of profiling. I did all of this profiling using sampling, so every 5 milliseconds it tried to get what routine we were in; I could crank that down or up. I could also go to this method called instrumentation, where every single call is being counted, so that you don't miss something just because it only happened for half a millisecond and you happened to always miss that half a millisecond. Sampling usually works pretty well; instrumentation will slow your app down even more.
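What sampling does can be sketched in a few lines (illustrative, not OptimizeIt's implementation): wake up every 5 milliseconds and record the top stack frame of the thread you're watching:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative mini-sampler: tally the top frame of a worker thread every
// 5 ms. A real profiler records whole stacks and aggregates across threads.
public class MiniSampler {
    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += Math.sqrt(x + 1); // busy work for us to sample
            }
        });
        worker.setDaemon(true);
        worker.start();

        List<String> samples = new ArrayList<>();
        for (int i = 0; i < 40; i++) {  // roughly 200 ms of sampling
            Thread.sleep(5);            // the 5 ms sampling interval
            StackTraceElement[] st = worker.getStackTrace();
            if (st.length > 0) samples.add(st[0].getMethodName());
        }
        worker.interrupt();
        System.out.println(samples.size() >= 1); // true: we caught it running
    }
}
```

The trade-off described above falls out directly: the sampler only sees what happens to be on the stack at each tick, while instrumentation counts every call but pays for it on every call.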
A couple of things about running this: it requires the 1.3.1 interpreter or the HotSpot version that's in DP1. I'm running this all on the interpreter from a pre-DP1 release, which is why my app is even slower, but it worked pretty well for this sort demo. The other thing I just wanted to mention again is that this UI was done in IFC; the guy who writes OptimizeIt wrote IFC, so he loves it.
But that's why it's not an Aqua look and feel: he has his own UI inside of there. And if you're interested in finding out when it's going to be available and how much it's going to cost and all that, contact VMGear; it's vmgear.com, and I'm sure they'd love to hear from all you guys, because they got this up and working, and they're excited to have a whole bunch of sales to Java programmers. So that's about it. There's a little slide with the roadmap of the relevant talks that are coming up following this one: there's a demonstration and talk about JBuilder that you might want to go to in the Civic Center just after this talk, and then some of the other ones, QuickTime for Java, as I mentioned in my bit, and Apple Performance Tools, which will give you more information about the performance tools if you're specifically interested in that. So what we'll do now is we'll have a quick Q&A session; I'll invite the rest of the