WWDC2001 Session 504

Transcript

I will not waste any time. The goal of this session is to get the best performance out of your Java application on Mac OS X, whether you're writing a new one or porting one you already have from Mac OS 9 or some other platform. What we'll cover in this session: we'll try to give you an understanding of the characteristics of Mac OS X Java performance and some techniques and patterns you can use in your code to get optimal performance on X; we'll point out why measurement is critically important whenever you're trying to do optimization; and lastly we'll have a demonstration of some performance analysis, to give you an idea of what sort of tools will be available on Mac OS X. So without further ado, I'll introduce Ivan Posva from the Java VM team. He's our tech lead.

Thank you. [Applause]

Well, what I wanted to ask first is: what are the Java performance factors? First of all, it's your application design and implementation. If you have algorithms that don't scale to your problem set, there is no performance tuning we can do in the VM that will improve your app. So you have to make sure that you don't have n-squared algorithms or anything like that in there.

The second factor is the amount of memory your application is consuming. The more memory you use, the more likely you are to be swapping out, to be paging, the more stress you're putting on the virtual memory subsystem, and the less likely you are to benefit from the caches, data cache and instruction cache.

The next factor down is bytecode execution performance, that is, at what speed does the Java VM execute your Java code. Jim will talk about tips and tricks, dos and don'ts, in the second part of this talk on what you can do in this area. I will concentrate more on Java VM efficiency in the first part of this talk.

There are two other factors that influence Java performance: one is the speed of the underlying OS; we
are not touching this in this talk. And then there is obviously the speed of the hardware you're running on.

So let's look at what Java VM efficiency means. There are two most-cited issues that influence Java performance at runtime. One of them is memory management. There is the footprint of your application, the footprint of your running Java process. That includes your Java heap; it includes all the supporting Java VM memory structures (we are actually using the Java heap for those as well); it includes parts for the OS, parts for the code of the VM, and so on. Then second is the speed of allocation. Java is very object-heavy: you allocate a lot of temporary objects, so you need to be able to allocate those objects very quickly. But to keep your footprint down, if you just allocated very quickly and never reclaimed, you would grow your footprint to infinity, so you have to reclaim very efficiently as well. I will touch briefly on what we do there and how you can actually help us as Java programmers in that area.

The second part is synchronization. Java has built-in support for multithreading in the language, but to make that viable to use we have to make sure that the synchronization primitives are implemented in a very efficient manner inside the VM.

Another part is startup performance. There must be hundreds of Java benchmarks out there on the web, small and big, some useful, some less so, but I've never seen one that actually measures Java startup performance. You don't want your application to take a minute or two to start up while you watch the bouncing icon in the Dock. So I will touch on some of the things we have done to address that issue, most notably reducing class loading. Class loading is about 40% of your Java VM startup, and I will touch on what you can do to help us out with startup performance.

So let's go to memory management first. The HotSpot VM that we ship on Mac OS X has an
accurate, compacting, generational copying garbage collector. What do those buzzwords mean? Accurate means we know at all times where in the VM you have references to your objects; we can distinguish between real references to objects and memory locations that just look like references to objects. Compacting means that we compact the heap: we don't leave any holes in the heap, meaning all the memory in use is moved together, compacted, as it says, and that improves both the footprint and the locality of reference. Generational copying means that we allocate objects for the first time in a new generation, and only when they survive for a certain amount of time do we copy them into an older generation, which we handle much less frequently than the newly allocated objects. So we spend a lot of cycles collecting newly allocated objects; for older objects we don't spend that time.

So there is one thing: the lifetime of your objects is important. If you have objects that you're not going to use, you have to null out the references. If you have object hierarchies that you're not using anymore, null out the reference to that object hierarchy to give the garbage collector an opportunity to reclaim all the space that is allocated for that hierarchy. Second, if you use caches, or have objects that can be recreated at a later time, you can use the WeakReference class or SoftReference class to give the collector a hint, an opportunity to remove objects from your working set if memory is getting tight.

Then, avoid finalizers where you can. Objects that can be finalized need to be handled specially in the VM. We need to call into the runtime from interpreted or compiled code to allocate objects with finalizers, because we have to keep track of them, and that makes it very hard to allocate quickly. All other allocation is done inline in
interpreted and compiled code. For finalizable objects we first have to keep track of them by allocating through the runtime, and then, when we're throwing them away, we have to make sure to call the finalizers. So if you can avoid finalizers, please do.

Then, to reduce your footprint it is helpful to do lazy initialization and allocation. That way you reduce your footprint as well as your startup time.

So what did we do in memory management to improve some of the footprint issues? We introduced a shared generation that stores the most often used classes and methods, including their bytecodes and so on. This shared generation is mapped in from a file, so we use the underlying Mach virtual memory system to just bring that memory into the VM. We reduce the GC time because we never try to collect or reclaim that memory. It is basically free, because we read it from a file and we are never touching it, so we share it across multiple running Java processes. The way you can help us there is, well, don't break it: don't change your boot classpath and do not modify the system JAR files that are installed on the system. That way we can always use the shared generation when we start the VM. It has an additional benefit in startup time, which I'm going to touch on now.

Avoid eager class loading and class initialization. It affects both your startup time and your memory footprint. If you don't load those classes when you don't need them, you're not using the memory, and you're not spending cycles to load them from the class file, decode them, bring them into memory, initialize them and so on. If you want to see which classes are loaded at what time, you can use the -verbose:class flag on the command line to see what is loaded at what time and in which order. As I mentioned before, the shared generation reduces class load time because most of the classes
you are going to use out of the Java libraries (java.net, Swing and so on) have been preloaded for you in that shared generation and are mapped from the file when you start the VM. So when you run the command that I mentioned, java -verbose:class -version, you don't see any class loading going on at all.

So the other part was synchronization. The HotSpot VM we are using has very fast synchronization in the uncontended case. What does the uncontended case mean? It is when only one thread at a time synchronizes on an object. That is most often the case: you don't really have any locking going on, but you want to be protected in case some other thread happens to be in that code. For this we have a constant-time overhead: the inline implementation in the compiler or in the interpreter takes about eight to ten instructions. It has very low memory overhead: we don't allocate any space for these object locks on the heap; they're all stack-allocated, in the stack frame in which you're locking the object that you're synchronizing on. And we don't use any of the underlying OS resources, because that is expensive; it ties down memory, sometimes even in the kernel, which you want to avoid. The contended case is rare, but since it happens, for that we use Mach primitives directly to get as much performance as we can.

So before I hand over this talk to Jim, I wanted to tell you what's new in 2001. We shipped Mac OS X. It has the shared generation, which is one of the big improvements we made to the Java virtual machine we get from Sun. In the 1.3.1 Developer Preview that we're working on, or rather that should be ready, we did inline interpreter allocation (this was not in Mac OS X), we're working on thread-local allocation, and we have a faster instanceof and a faster arraycopy that is tuned for the G4 with the Velocity Engine. We actually use the same code: the code we
use for our array copy we are also using inside the GC to copy the objects between the generations, so we're making use of that code in multiple areas. So now I want to hand this over to Jim.

Thanks. So memory management and synchronization are a crucial part of what affects the performance of your application, but I think fundamentally, when we're working with Java code, we think of the speed of code execution as the key factor in determining what is actually slowing down our particular application. What I'm going to try to do in this talk is describe how things get interpreted and how they get compiled, when they get interpreted and when they get compiled, and give you some coding hints that you can use to speed up performance once you find out what kinds of weaknesses you have in your code. I'm also going to run through a few things that have changed since our talk last year.

First of all, people often ask, why don't we compile everything? Obviously, if you could turn everything into native code, it would run a lot faster than if we interpret it. But because the Java environment compiles code on the fly using just-in-time compilers, there's a certain cost, both in CPU and memory usage, to get things compiled. When you do some analysis of the actual VM, you find that, because the interpreter is fairly fast, it's actually cheaper to interpret the code than to go off and compile it and then run it. So we have to strike a balance between what gets interpreted and what gets compiled.

Okay, so when does a method get compiled? First of all, everything in the VM starts out in the interpreter. We have to make the assumption that the method will run best in the interpreter, the first few times at least, and try to get a feel for
what the method is about. Within the interpreter we have monitoring tools which count the number of times a particular method is invoked. Once the method has been invoked a number of times (in the case of the HotSpot client VM, once it's been invoked a thousand times) we siphon it off to the compiler, and the compiler generates native code. So the one thousand and first time that the method is invoked, we fire it off to the native code, and hence it starts running faster at that point.

Now, the invocation count is not sufficient to determine whether a method is hot, or used fairly often, so we also keep track of the number of times a method loops. Within the loop code in the interpreter, every time a method loops we also increment the invocation counter, so a method that loops a lot is going to get compiled a lot quicker. If you have a method with a loop that iterates a hundred times, then it's only going to take ten invocations of that method to trigger its compilation.

Now, there are certain methods that tend to loop for a long time, maybe ten thousand times or a hundred thousand times, or maybe forever, depending on the type of application you have. One of the things that we've added in the last year is something called on-stack replacement, which allows us to take a method that is hot and is actually looping, create compiled code for it, and replace the interpreted version of that invocation with a compiled version so that we can continue on in compiled code. This is a pretty neat notion, and hence almost all the things that are hot spots in your application get compiled when they need to be.

Last year I had a list of things that don't get compiled. At this point almost anything that's a good candidate, a hot spot in your code, will get compiled, and there are very few things, like a couple of
Java assembler concoctions in the JCK, that don't get compiled. So using these criteria, the number of invocations and the number of times it loops, we can find out what is actually hot in your code. We find that it's only about 5% of the bytecodes, 5% of your application's methods, that need to be compiled and hence are hot in your application. Andy will be going over some of the tools that allow you to determine which of those methods are actually getting compiled to native code. Once we've got that information we can start tweaking those particular methods, because those are the methods that are going to be problematic.

So I'm just going to go through and discuss a few things that you can do to get the best performance out of your application, the types of things you can concentrate on once you find out what's hot in your application.

First, and I'm constantly repeating this, the most important thing is that because compilation takes up a lot of resources, CPU time and memory, it's best to keep the size of your methods down, because then the compiler can get through them fairly quickly. And if only a small portion of your method is actually being used all the time (maybe you have some exception code in there, or some special-case code), you're doing a disservice to the compilation process by having that code embedded in your method. So you should try to break that code out into separate methods and keep your method small and focused, so that it can compile quickly and then go off and execute, and then you get good locality of execution of the code as well. So, as I say, separate rarely used code out into separate methods. As far as telling yourself that if you put it into a separate method you're going to incur the cost of calling that method,
basically pushing parameters and so on: you'll find that the VM will actually inline things that make sense to inline, if we can make more optimal use of the code by inlining as opposed to having it separate. So don't worry about that. In particular, accessor methods are always inlined, so you don't have to worry about having a very tiny method where all it does is extract a field. And with the 1.3.1 code we have a much tighter implementation of accessors, so they are very fast.

One of the things that I like to do is find the methods in the class libraries that are used fairly frequently and tweak the code specifically to handle those, because those are the routines that are used a lot by everybody, and we want to get good performance for those particular methods. So trust the supplied classes. You may have the urge to go out and rewrite the Vector class because it synchronizes every time you access it and there's a checkcast every time you extract an object from it. These are the sorts of things that we notice are used a lot, so we hand-tune, or rather provide special service for, the methods that are used fairly frequently. So instead of going off and writing your own, trust the supplied classes: classes such as String, StringBuffer, Vector and the collection classes. Use what's there, because we're going to get the performance up for you, and we've added some more optimizations, more special cases, for those in 1.3.1.

If you have a copy from one array to another, use arraycopy, because, as Ivan mentioned, we're using G4 acceleration in arraycopy, and hence it's going to be the fastest way of doing it. So instead of having a loop that iterates through, use arraycopy. And then of course there are certain functions like sine, cosine, tangent and so on,
so it's best to use what's supplied and not go off and write your own JNI routines to deal with them.

Make the best use of the native data types. The G4 is not a 64-bit processor, so whenever you have long arithmetic it has a certain cost associated with it. Some of the basic operations, like add and subtract, or and, or, whatever, are reasonably cheap, but when you get into shifts or divides or that sort of thing it can be fairly costly. So if you don't need all that precision, stick with ints for the time being. Also consider using floats instead of doubles, not necessarily in your computation, because sometimes you just need the precision, but when you're dealing with arrays of values it's best to keep the size of your arrays down by using floats. There are quite well-known techniques for keeping precision even though you're using a 32-bit value. And, new in 1.3.1, we've added better register allocation for longs, floats and doubles, so you'll find, especially if you're doing a looping type of calculation, that performance will be improved.

Try to avoid using the generic data types, because there is a cost in assigning, say, a generic data type to a specific data type: we have to go through a checkcast. Ivan mentioned that we've made some performance improvements in 1.3.1 to deal with instanceof and checkcast, but there's still a cost, because instead of a simple assignment we have to go off and make sure that it's the right class. So try to avoid using generic types and use subtyping or subclassing in these circumstances, because that way you can avoid making assignments that require these checkcasts.

Try to work with local copies. One of the things people have been asking about is why the code generated by the HotSpot client is slower than, let's say, the
server version and so forth. Well, some of the optimizations you get in the server version of the compiler are very sophisticated, and they're not there in the client, because again we want to compile things fairly quickly and get them up and running. So if you have an array access and you're working with that array value, it's best to make a copy of that value, work with the copy, and put it back again. In this particular example you have three accesses to that array, which means we have to do three bounds checks and three null checks on the table itself, whereas if we make a copy we only have to do two in this case. Plus you get the locality benefit: if you're working with the value in registers (in this case, what happens is the value will be assigned to a register, as opposed to going back to the array), you get a performance boost there as well.

This next one is sort of a de-optimization of the code. One of the things that people run into, especially on MP machines with a lot of threading, is that if they have access to global values from several threads, they wonder why the values are changing, or not changing, out from underneath them. Make sure that if you have a global static that's being accessed from several different threads, or written to by other threads, you use the keyword volatile. One of the optimizations a compiler will do is say, well, this is a value that I've already got a copy of, so why should I go back and get the original? If you put in the keyword volatile, this will guarantee that things get reloaded every time you access the variable.

Use constants, that is, static finals. This basically specifies that the variable, in this case the buffer size, is a constant, and the compiler can treat it as a constant all the way through the code and do constant folding. In this example we know that the character array that we're allocating is a fixed size, and we know that the initialization of the buffer in
the loop is going to iterate a fixed number of times. So take advantage of that by making sure that you declare your statics as final if they're going to be constant throughout your execution.

There's a certain cost involved in invoking anything which is a virtual call or an interface call. In the case of virtual calls, we have to index into a table to find the address of the method that we want to dispatch to. With interfaces it's a little bit more complicated, because we actually have to do a match, making sure that we match the class of the method that we want to invoke. So virtuals are a little bit cheaper than interfaces, and if you have a choice, try to stick with subclassing as opposed to creating interfaces; you get better performance that way. In the HotSpot VM we actually cache the call, so that at any particular call site we know which method worked for us last time and we try to reuse it, so we don't actually do a lookup each time. But there is still a cost in that initial lookup and whatnot, so try to use virtuals rather than interfaces.

One of the optimizations that we've done in the HotSpot compiler deals with switches. You can create switch statements with fairly sparse values in your cases. In traditional compilation, what would happen in those situations is that they would create a big if-then-else. We're using a technique of double indexing which allows us to dispatch fairly quickly on any switch; it's not the nested-if combination. So if you're comparing a single variable against integer values, use switches over if statements.

More and more, as more people are learning to program, some of the people new to programming have a tendency to use exceptions to control their program flow. You really should try to use exceptions for the exceptional cases and not for the actual flow of your program, because there is quite a
bit of cost in the VM to actually handle that exception. So if there's a good likelihood that the routine you're calling is going to produce an error, then you should probably use error codes and test the result when you come back, as opposed to throwing an exception, because that would be faster than actually throwing the exception and having the VM deal with it.

And finally, I think you should think pure Java. In the 1.3.1 code we've implemented something called compiled natives, which allow you to call JNI code fairly efficiently. We don't have to go through any kind of marshalling code which marshals up the parameters and then goes off and calls the routine. What happens with these compiled natives is that we have a thunking layer which already knows what the parameters are going to look like and assembles the parameters for a call to the JNI routine. So on the one side, your JNI calls are going to be a little more efficient and faster. But on the other side, there still is a cost in using JNI, or JDirect, which is built on top of JNI: there's this translation layer involved, and it costs. And then also, if you're dealing with callbacks, that's going to require some kind of lookup. So you should try to use Java wherever you possibly can, and as time goes on we're going to get the compiler faster and faster and you can forget about C and C++. Okay, with that I'll pass it on to Andy.

Hi. So I get to do my little bit now, where I talk about how important measuring is. All of the information that we've been giving you is kind of useless if you then go and apply it willy-nilly to your entire 60-megabyte code base. All the textbook advice that says don't optimize prematurely is really true. What you should be doing is measuring, finding the major bottlenecks, optimizing
those bottlenecks, and making sure that it actually worked, because we've seen optimizations that have actually slowed things down. I'll also go through what you should try to measure, how we've improved things with the 1.3.1 Java VM, and how you've actually got tools that enable you to measure those things. And then I'll cover a few little myths that are still around in all those textbooks and are not quite true on Mac OS X.

So the first obvious thing that you always think of is, how fast is my program going? You look at the CPU meter on Mac OS X and it's pegged at 100%, so obviously you should be looking at where the CPU time is going; your program, whatever it's doing, is CPU-limited. The first thing that we do for you in HotSpot is compile all those hot methods: we're counting which ones get used most, and we're compiling the ones that get called most frequently. So obviously, look at the hot methods, the ones that are being compiled, and I'll cover how using -Xprof will actually tell you which ones are being compiled.

Secondly, depending on what the CPU meter is reading, you might be using system CPU and not user CPU, in which case you might be paging, and the poor old OS is just trying to read and write things from disk and shuffle things around in the VM system. Paging is really expensive. If you're running on a 128 MB system and you set your heap to 256 MB, well, we think you've got 256 MB, so we'll happily go and allocate, and we won't do full GCs until we think we've run out of heap, but in the meantime you'll be paging madly. Think very carefully about controlling your footprint and heap usage.

Other times you get into situations where your CPU isn't pegged, and in fact at first glance your program seems to be doing nothing. And that's probably what it is doing: it's probably waiting for the disk or the network to reply. So there are some tools on Mac OS X, some of which are
covered here and some of which are covered in the performance tools talk later, that allow you to look at what your program is doing I/O-wise and network-wise.

And then lastly, one of the things that we talked about, synchronization: monitor contention can get very expensive, and the reason for that applies especially on X. If you're used to Mac OS 9, switching threads and processes was relatively inexpensive, because it didn't have the memory protection and preemption behind it. Whereas on X, when we switch threads, there's all the state in the processor that has to be saved out to memory, and when we switch threads between processes, you've got to save all of that context as well. So it's a lot more expensive than on 9, and that's one thing to bear in mind.

So how do you go about measuring all of the things that I've talked about? The best thing from your perspective is to use a commercial performance tool, one example of which is OptimizeIt, which Scott will be demoing just after I've talked. It provides CPU profiling and/or sampling. Profiling is a way of tracking each and every time methods get called; sampling does a statistical analysis. There are pros and cons to each: profiling gives you a very precise measure of exactly how often things get called; sampling is less invasive, so your program doesn't slow down so much. So depending on what you're doing, one or the other is better. You can also look at object allocation: which objects are getting allocated, where they get allocated, et cetera. Scott will cover a lot of that in the demonstration, I think.

The other thing you can do with HotSpot, which we provide in the 1.3.1 Developer Preview: hprof is now functional (it wasn't in Cheetah). hprof is implemented as a library loaded at runtime that uses the JVMPI interface in HotSpot. Secondly, you can use -Xprof, which is a per-thread kind of measurement, and there's -Xaprof, which gives you allocation information. So as I
mentioned, hprof comes with the Developer Preview that's available on the website. It's a basic CPU and monitor profiling tool, so it gives you a lot of nitty-gritty detail and not a lot of analysis. There's a relatively simple UI available from Java Software that gives you a primitive GUI on top and lets you drill down a little bit; I've used that to a certain extent and it's quite helpful. hprof is relatively simple to use: you just pass a couple of command-line parameters and tell it whether you want to sample or look at monitor contention, et cetera. It turns out that the PerfAnal tool only works with CPU sampling; it doesn't work if you use profiling, so you should use the first example.

Monitor contention will give you a little bit of information about how much time each thread spends waiting on a particular monitor. So if you're seeing an application where you can't really see why it's slow but there seems to be a lot going on, one of the first things you should do is look at monitor contention. You can see dramatic performance improvements there, because when we get contention, as opposed to the cases where we don't, as Ivan explained earlier, it's a case of going through ten instructions inline in the interpreter or the compiler versus several thousand cycles going into the kernel and doing context switches and the like. So that's why it's expensive. -Xaprof
will give you a simple allocation profile. You run your program, and when it exits it'll spit out a dump of all the objects that got allocated, how much space they took up, the average instance size, et cetera. Just from that information you can say, well, maybe I shouldn't be allocating so many Vectors or Hashtables, et cetera. But it doesn't give you any information about where they got allocated, which is why OptimizeIt or something like it is much more useful.

-Xprof is of somewhat limited use because it gives you per-thread information. If you have a program that forks 400 threads, like VolanoMark or something, and I run -Xprof on it, then when the program exits it spits out 400 copies of the information, which is not very useful. But it is the only way that I know of where we actually list the methods that got compiled versus the ones that get interpreted, how much time we spend in interpreted code versus native code versus compiled code, how much time we spend in GC, et cetera. So that can give you some very useful insight. First of all, which methods got compiled: you might look at it and say, hang on a minute, I expected method A to get compiled, because I was under the impression that it was my most expensive method, but it turns out that in fact we didn't go anywhere near it and didn't compile it, or maybe we did and we couldn't compile it because it's got some funny assembler or some construct, or it's too big, et cetera. So that will pinpoint which methods are getting compiled. You can sanity-check that the ones that are getting compiled are the ones you expect, and once you know the ones that are getting compiled, you can focus your optimizations on those methods. There's a little example of its use down at the bottom. It's very simple to use, but like I said, don't try it with 400 threads; just do it on something with a minimal number.

Now, measuring memory is a little harder
because the Java VM has several different perspectives on what memory is. As far as you are concerned, the only memory you really have any control over is the memory in the Java heap, and the tips that Ivan explained, whereby you null out references, avoid using finalizers, etc., are the kind of thing you can control. So you can watch the heap as it grows and shrinks using the -verbose:gc flag. As we mentioned, -verbose:class will show you classes as they get loaded. You might see classes getting loaded before you think you should be using them, and that's an example where you should go in and pinpoint why they're getting pulled in; maybe you can load them a bit later.

There's a command called top which will give you an overall memory view of the whole system, and that's good for splitting out memory that's being shared. For example, when you run multiple Java processes, some of the memory that we pull in from the shared generation is shared between several processes, and you can tell the difference between memory that's privately allocated and used in the heap for you versus memory that's being shared in the shared generation, or is being shared because of dynamic libraries that are being pulled in by native code, either your code or ours. vmmap is another command-line utility that gives you a lot more specific insight into the intricacies of the virtual memory being used. It's relatively complicated; if you want to learn a bit more about it, they might cover it in the performance tools session.

It turns out that Java VMs are improving faster than the books about Java performance can be written. There are quite a few books out there, most of which contain extremely good advice, but some of their tips just become outdated as the technology rolls on. Traditionally, in 1.0 and 1.1 VMs, allocation was very slow; they had a malloc-based allocation scheme or something. Our
allocation is now extremely cheap. The initialization of an object may not be, but allocating it is a few instructions. As a result of that, and as a result of scavenging, short-lived objects are very cheap to GC, because we essentially don't do anything with them; we just throw them away at the end of their life cycle. Synchronized method costs are small, as Ivan mentioned, though the contended case is still expensive.

Lastly, as I hinted at before, system calls, which involve entry into the kernel, are expensive, just because of the whole context switching and the little bit more weight involved than with Mac OS 9. There are certain things that we do on your behalf as part of the Java APIs that involve system calls: network operations, I/O operations, things like that, Thread.yield. All of those are system calls, so if you don't need to do things like that, avoid them.

So there's a quick graph, which you've seen before. This is the peak allocation performance of various different technologies; I think Blaine showed it in his talk. This includes the garbage collection side of things, so it's not just allocation, and you can see that compiled Java, which is the tall one, is just way faster than any of the other technologies.

Now, here's an example that I pulled from a performance book published a year or two ago. They gave an example of one thing you can do to improve performance: pooling objects, so that by recycling them you avoid the cost of allocating and GCing them. So I wrote a little benchmark and ran it on my G4 PowerBook, and I got this sort of result. As I increase the number of threads, you can see that for the single- and two-threaded cases the pooling is just slightly faster. I'm allocating a hundred thousand vectors, filling them up, throwing them away, etc. But when you get to a large number of threads, you can see that the time taken to actually recycle these vectors is longer than it took
to create them and GC them. Now, in the dual-processor case, the moment you go to anything other than single-threaded, the simple allocate-and-throw-away mechanism is faster. The point is that you don't have to incorporate any complicated pooling code if you just do the brain-dead thing and allocate it and throw it away. So this is one example where the technology has just moved on, and that little truth about pooling things is not quite so true now.

On the other hand, if you have an object that is extremely expensive to create and initialize, and in this particular example I'm talking about a Java thread, where the expense of creating it involves a couple of kernel trips to create the internal data structures, it can obviously be beneficial to recycle those. The corollary is that sometimes, especially with something like a thread, which involves a kernel data structure, it can actually be costly to keep them around as well. So you have this trade-off: some things get more performant, but on the other hand you have to pay the penalty of keeping the kernel wired memory around, and the extra stack, etc.

So this little graph: there's an example web server on some site which is brain-dead simple. It just sits on a socket listening for a request, gets a request, hands it off to a thread to respond to it, and serves back a response. I took that example and produced three variants: one which forks a new thread for every request, a second which uses a pooled collection of threads, and a third where all the worker threads themselves sit in accept and handle the requests directly, so there is no listening thread. This is the response time as seen by the client. With the version that forks a different thread for every request, it just doesn't scale: as the number of requests
go up, you just see a response time that degrades sort of with N squared according to the number of clients. The others degrade as well, but they degrade much more gracefully. Now, interestingly, I had kind of expected when I did this exercise that the version running multiple threads in accept would scale even better than the pooled version, and lo and behold, that isn't actually true. I really wanted to include this slide because it's an indication of why you should be measuring: my expectations were dashed.

On a dual processor it's really interesting as well, because the per-thread one seems to be doing almost as well as the pooled versions. That was somewhat unexpected, but then I realized that what's happening in the pooled versions is that I'm getting a lot more contention because I'm on an MP system. So the version where I'm running multiple workers in accept is actually the best-performing one, and the reason is that all of the contention is handled right in the kernel, right in the accept call, rather than everything coming out and fighting over the socket.

So our conclusion is very simple. Your application design is paramount; that's the most important part of the performance of your app. There's a lot of new stuff in recent VMs, in HotSpot, and on Mac OS X that has improved some of those bottlenecks. If you follow our advice you'll get better compiled code, and so your app will run faster. Where you are seeing bottlenecks, just keep measuring and improving those things and you'll see results. So what we're going to do now: Scott's going to come up and give a demonstration of OptimizeIt.

So I'm going to show you OptimizeIt. I showed the memory portion of OptimizeIt at the Java development tools session, so I'm going to concentrate on the profiling portion. It's a tool written by a company called VMGear; they used to be called Intuitive Systems. It's a pure Java tool; it uses just a tiny little bit of native code,
so it's mostly pure. The way it works is that it instantiates its own hooks in between you and the VM, then you run your application on top of it, and you're able to look at all the same kind of profiling information that you see in something like -Xprof, plus memory profiling and all that. It's really cool. We use it at Apple; we've been helping them get it up and running, and we've been using it on all of our AWT and Swing work to find our bottlenecks. It has saved us tons of time, and we're hoping we can convince them to get a developer release out for you guys. They've committed to a fourth-quarter release, so it's good that we have it coming out eventually.

So let me just go right to my demo machine; there's not much more I have to say about it. This is my sorting table demo. It's right out of SwingSet; I just put in the names of people on our team and added some sorting to it. I didn't use any of the collection classes; I wrote my own sort, and I wrote the worst sort possible. I do a little bubble sort here, so it's kind of slow. I click here and sort by first name, and that's sorted, and that's only 58 items. Then I sort by last name. So that's not really good, and I actually pause all of the UI while I'm sorting. I want to figure out what's going on: why is this taking so long?

So I'm going to go over to OptimizeIt, which I've already launched. To hook into this, I started the other app using the OptimizeIt stubs, and I'm going to do this through remote debugging. You can launch it all from here, but I kind of like doing the remote thing because it shows you can do it on a separate machine. So I go to remote application; it's already been set up on this machine, on this port, and I have my source path set up, so I'll just attach to that, and it'll take a second for it to connect up. I
have this all set up for my demo this morning. So this is the memory profile of all the objects that have been instantiated. There's a lot of cool stuff in here, but I'm going to go into the CPU profiler. The CPU profiler doesn't profile until you tell it to, and that's one of the big differences between this and something like -Xprof: you get to profile just the segment of the application that you want. You can turn it on, do your work, and turn it back off. So I'm going to press the button, go to my application, click sort, click sort again, do another one, and then go back here and stop.

Here's one of the cool things: here are all of our threads. Red is idle, green is active, and there are even groups of threads, like main and system. So I'll start by looking generally at the main thread to see what's going on, and let me flip this around into the normal execution path. What we have right now is 49% of this happening through event dispatch, which makes sense because we clicked on buttons to do most of our work, and then 34% of it was in Thread.run. Now, I wrote this application, and I know I have a separate thread that gets spawned every time I sort. I actually create a new thread; I wrote this really badly. So I spawn a new thread and run my sort. If I look through here, I can see my sort, and it ends up calling greaterThan, because I'm doing an excellent single-directional bubble sort. I have greaterThan, and I have some of my time in toString, but let's see what's going on inside of greaterThan. I've got a compare inside greaterThan, and there's something called toLowerCase, so immediately there's something going on here. If I double-click on this it brings up my source code viewer, and I see there's some toLowerCase inside of the AWT code. I don't care that much about that, but toLowerCase is
taking up a lot of time. There's a whole bunch of things inside of AWT, but I want to see my stuff. The sort data class is mine, so let's see what's going on in my compare. I can see right here that I get two strings, and to compare them I turn them into lowercase first, because I want it to be case-insensitive, and then I compare them character by character. OK, so that's really bad. Actually, I've had an engineer who's done this before at another company I was at. So this immediately drives you there.

Now, I sort of dove around here because I wanted you to see this allocation graph. It's pretty cool: it shows you each allocation entry point and how much time is spent, and you can even get sub-percentages. If I mouse over here, it says that 99.25% of the time is spent inside of this compare, and that's just my compare, not anything else going on. If I look right down here, this would have told me immediately what the problems are. These are my hot spots. This takes the individual methods, no matter who called them or in which direction, and shows you what percentage of your time is spent in them. If you flip the graph back around, you can start from the hot spots and work back down, and you can see who's calling each hot spot. You can look at toLowerCase and say, OK, who's calling that? And it really is only from one place: it's from my sorter. So that's kind of cool.

I found that pretty easily here just on the main thread, but you can also go into your individual threads. I can say, let's just look at the sorter. I looked at the whole main thing, and that included all my event loops and stuff. If we look at my sorter, it's even worse: it's got 60% spent in there. So if I actually had a big list I was trying to sort or
something, I could do a little better. So with that, I'm going to re-profile. Let me go back here. I have this thing called "fix it". This really shouldn't be called "fix it"; it should be called "don't be so stupid". What it does is, instead of doing a whole toLowerCase, it gets each character, lowercases the characters, and tests those. That's a little better. So I'll turn that on and re-profile. Do a sort by first... no, I already did that, so I'll click these a couple of times back and forth. And we'll see that now we don't have anything about lowercase in our... let me get up here. Actually, sorry, we'll see some lowercase in here, but it's not going to be as huge. I can't even find it right now. So that's pretty cool: now our compare is no longer the huge portion of this whole thing. We see that there's something about AppContext and graphics, which would be a lot lower than this if you were using hardware acceleration.

So this is a really cool way for you to find out what your bottleneck actually is. Obviously you wouldn't be writing these really bad sort routines, but who knows where things like this might pop up. If you're using a library from someone else, you'll see what parts of their library are slow, as long as it hasn't been obfuscated or something like that. You can even see into our libraries and see what's going on inside of graphics and things like that. You don't necessarily want to do that, but sometimes it's fun. A lot of our graphics code is written in Java, so we use this all the time. I have co-workers coming in and saying, "I made some changes over the past week and everything just slowed down; JBuilder doesn't run very fast. What's going on? We haven't changed anything." We run it through here and we find that, yeah, someone did
a really bad draw-circle routine or something like that, so we optimize that and we get back our 10-times improvement. So it's pretty cool.

Let me just show you another thing which is useful: the VM statistics. I've had this running the whole time, and it shows you things that lead into what the people before me were talking about, which is that you don't want to load all your classes right away. You can turn this on at startup and see your classes being loaded, and it'll show you as you do different things. So if you have dynamically loaded classes, which is what you want (you want to load them gradually as users get to different portions of the app), you'll see your class count going up and up. It also shows active threads. So if I go over here and sort by first name... oh, it probably went way too fast to even see the thread. Yeah, so there, a thread came in, sorted, and went away. It also shows you your heap, so you can see the actual amount of memory you have, and you can kick off the garbage collector, and now all the stuff that's just been sitting around because the garbage collector hasn't needed to run is collected. You can do a lot of cool things.

Let me just show you really quickly, for the people who haven't seen the memory profiler, one or two little things. In the memory profiler you can look at all your instances. You can mark a certain instance, then go do something in your app, like a sort, or something that's going to be really slow here, like resizing these things, and see what's changed since you did that. You can see that a whole bunch of char arrays were allocated, and rectangles; obviously we use a lot of rectangles in graphics, and they should mostly go away when you run the garbage collector. So I hit the garbage collector and we see that Rectangle went down to none, so we did a good job. There's something going on with this one string and one character, and we might go hunt those down for
references or something, but that's basically what you have. There are a lot of different things you can do in this sampler. Let's see... there it is. So you have different types of profiling. I did all this profiling using sampling, so every 5 milliseconds it tried to get what routine we were in. I could crank this down or up. I could also go to this method called instrumentation, in which every single call is measured, so that you don't miss anything because it only happened for half a millisecond and you happened to always miss that half a millisecond. Sampling usually works pretty well; instrumentation will slow your app down even more.

A couple of things about running this. It requires the 1.3.1 interpreter or HotSpot version that's in DP1. I'm running this all on the interpreter from a pre-DP1 release; that's why my app is even slower, but it worked pretty well for this sort demo. The other thing I wanted to mention again is that this UI was done in IFC. The guy who writes OptimizeIt wrote IFC, so he loves it, but that's why it's not an Aqua look and feel: it's his own UI inside of there. If you're interested in finding out when it's going to be available, how much it's going to cost, and all that, contact VMGear; it's vmgear.com. I'm sure they'd love to hear from all you guys, because they got this up and working and they're excited to have a whole bunch of sales to Java programmers. So that's about it.

There's a little slide with the roadmap of the relevant talks coming up after this one. There's a demonstration and talk about JBuilder that you might want to go to in the Civic Center just after this talk, and then some of the other ones: QuickTime for Java, as I mentioned in my bit, and Apple performance tools, which will give you more information about the performance tools if you're specifically interested in that. So what we'll do now is we'll have a quick
Q&A session. I'll invite the rest of the you
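
For readers following the sorting demo, here is a minimal sketch of the two comparison styles it describes: lowercasing both whole strings on every call versus lowercasing one character at a time. The demo's actual source is not shown in the talk, so the class name, method names, and the use of Locale.ROOT are all assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical reconstruction; none of these names come from the demo itself.
public class CaseInsensitiveCompare {

    // Slow style: build two temporary lowercase strings on every call,
    // then walk them character by character.
    static int compareWithCopies(String a, String b) {
        String la = a.toLowerCase(Locale.ROOT); // Locale.ROOT is an assumption
        String lb = b.toLowerCase(Locale.ROOT);
        int n = Math.min(la.length(), lb.length());
        for (int i = 0; i < n; i++) {
            char ca = la.charAt(i);
            char cb = lb.charAt(i);
            if (ca != cb) {
                return ca - cb;
            }
        }
        return la.length() - lb.length();
    }

    // "Fix it" style: lowercase each character as it is compared,
    // so no temporary strings are allocated.
    static int comparePerChar(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char ca = Character.toLowerCase(a.charAt(i));
            char cb = Character.toLowerCase(b.charAt(i));
            if (ca != cb) {
                return ca - cb;
            }
        }
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        String[][] pairs = { {"Scott", "scott"}, {"apple", "Banana"}, {"Jim", "ivan"} };
        for (String[] p : pairs) {
            // Print only the sign, since the two styles agree on ordering
            // for plain ASCII names like these.
            System.out.println(p[0] + " vs " + p[1]
                    + ": copies=" + Integer.signum(compareWithCopies(p[0], p[1]))
                    + " perChar=" + Integer.signum(comparePerChar(p[0], p[1])));
        }
    }
}
```

In a bubble sort the comparison runs on the order of N squared times, so the two temporary strings per call in the first style add up quickly. In practice the library method String.compareToIgnoreCase already compares character by character and is probably the simpler choice; as always in this session, measure before and after.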