WWDC2003 Session 625

Transcript

So welcome to the last Java session of the week, Maximizing Java Performance for Mac OS X. My name is Victor Hernandez. This talk will be given by three of us: Jim Laskey and myself, who are from the Java runtime technology team, and Gerard Ziemski, who is from the Java platform classes team. We're going to be splitting the talk into three parts, but the goal of the overall talk is to give you a better understanding of why your Java application performs as it does on Mac OS X. Jim is going to be talking about performance improvements that we've made specifically targeting the G5 processor. Then I'll be talking about performance opportunities that have arrived with Java 1.4.1, and then Gerard will be talking about Java graphics performance on Mac OS X. Stay tuned specifically for that part, because there are a lot of great demos to be seen there. We've got a lot of material, so let's get right to it. Here's Jim.

Thanks. So my part of the talk is going to be specifically about what changes were made to the HotSpot VM to target the G5, which was pretty exciting because we only got to see some of these new machines a few weeks ago and play around with some of the prototypes. First of all, I'll just give an overview of my section of the talk. I want to talk specifically about some of the details of the G5, to give you a sense of what sorts of things we could actually exploit on the G5. Then I'll do a little performance comparison between the G4 and G5 to give you some kind of sense, as well as a benchmark can, of what kinds of improvements you might see in your application. Then I'll go into some detail on some of the changes that we made specifically to the Java VM interpreter and the JIT, and then quickly at the end I'll go through a couple of changes that we made to the HotSpot runtime.

OK, so what does the G5 mean to Java developers? Well, the main thing you should know, and I guess it should be obvious to you, is that the G5 is going
to make your application generally run faster, and you should expect that from a faster processor, which is forty percent faster than the highest-end G4 currently shipping, and a faster bus structure. There have also been some architectural changes to the way the G5 processor works over the G4, which actually improve the performance of various types of operations, very specifically floating point. You'll find that floating point is typically faster than the forty percent projected by just the change in gigahertz on the machine.

Now, we could have left the VM alone and not done anything to it, and you would have gotten a gain in performance running on the G5. But we like to tinker, and there are all these really cool instructions on the new processor that we wanted to take advantage of. Specifically, there's the introduction of 64-bit operations, and if you have any long, or long integer, arithmetic in your Java application, we now use a much simpler and quicker set of instructions to do those operations. I'll go into some detail on what was actually done.

Now, one thing to note. I know there haven't been a lot of talks about the G5 and what the G5 means. All you're hearing is that it's a 64-bit processor, and that can mean a lot of things to a lot of people. I'd like to think of it in terms of two aspects of calling a processor 64-bit. One is 64-bit as in a processor that is not processing 64-bit instructions, but doing operations on 64-bit operands. The second part is the implication that you also have a 64-bit address space. Now, we've chosen to run the Java VM in what's called 32-bit mode, which allows us to maintain the 32-bit address space, so that we don't need to represent object pointers in 64 bits and hence don't need twice as much memory to represent things. But we can still use the 64-bit operations to do, say, long integer arithmetic.

And finally, the main thing that you can walk away
from the session feeling is that you don't have to do anything to your application to gain these improvements in performance. We've modified the VM; as soon as you run your application on a G5, you're going to gain all the benefits of having 64-bit arithmetic and the faster processor. So you have to make no changes to your application.

Now, just to start off, I want to show you some comparisons of running some applications on the G4 versus the G5. I'll be using SciMark. We use several different benchmarks internally to test various things. We would normally use SPEC JVM, but it has a fair use policy which requires that you post the scores on a public forum before you can actually use them, and we're still working with prototypes and don't have our final values yet. So we chose to use SciMark, which is a fairly good benchmark, and it will give you a good sense of where we're going. The other thing about SciMark is that it's a scientific and engineering benchmark. People have often said, well, the client VM is very slow when it comes to computation. The SciMark score should give you a sense of where we're headed with computation.

SciMark can be found at the National Institute of Standards and Technology website. There's the URL, and if you go there, there's a whole list of current standings. They're fairly up to date; I think the most recent one is from the May or June timeframe. If you look in the list, you'll see us way down there, actually in the 61st position. This was run by somebody back in the fall, running the 1.3.1 version of the VM on a 1.25 GHz dual-processor G4. So note the score there: the score is 78.253. That is what's called a composite score. SciMark is actually five separate tests, such as fast Fourier transform, sparse matrix multiplication, and Monte Carlo, and there's a composite score
prepared from those five results, and it's the composite score which is used to actually rate how you're doing in SciMark.

So this graph will show the current high-end G4, which is a 1.42 GHz dual G4, against a 2 GHz dual G5. Focus on the first column, because that's the one that the main score is based on. So currently our score would be around 111, composite score, and you can see each of the subtests there. I'm not sure whether there's a normalization of these; I haven't seen anybody actually hitting a hundred on any of those, but that's basically what you would find currently. Now, if you took a straight forty percent increase in performance on each of those tests, this would be the projection that you would get. We did this to get a sense of where we should be headed once we ran it on one of these G5s. Once again, focus on the composite score, because the other ones are going to vary a little bit. As I said, this would have been the best we would have expected. Well, we were kind of surprised when we actually ran the tests and got a 232, which is pretty significant. So it's more than just gigahertz; it's also the system itself and the changes that we've made to the VM. And here's an overlay of the projections, just to show you where we're at. So the score basically more than doubles on the G5.

So where does this put us? Well, if we were to do this today, this would put us at about 12th position. What's interesting is that this is up in the high end there, with all the high-end IBM servers running at three gigahertz, and we're running a client VM. So this gives you a sense of the power of the G5 and also the potential that we could have as we make further improvements to the VM.

OK, so let's go over some of the changes that took place in the interpreter and the JIT. We enhanced them with G5 instructions. We can do this because
the interpreter, the JIT compiler's code generation, and also the runtime are all constructed on the fly when you launch the VM. This gives us an opportunity to choose which instructions we want to apply. So if we ran on a G3 versus a G4, we would choose different instructions; if we're running on a single processor or a dual processor, we would choose different instructions; and now that we're running on the G5, we can actually choose 64-bit instructions. So you have long integer support that is now using 64-bit operations, and I'll go into some of the details of that. There have also been improvements in float and double support using some of the new floating-point instructions. They're not actually new instructions; they're instructions that are only made available because of 64-bit support.

So let's talk a little bit about the details of what 64-bit means. On a G4, or a G3 for that matter, everything that goes through the processor has to go through in 32 bits. That's because the data bus and the width of the registers are only 32 bits wide. So if you wanted to do an operation on a 64-bit long integer value, you would require two registers to deal with each of the operands and with the result. In this case, where we have q = x + y, we would require two registers to hold the result, in this example r3 and r4. We'd also need two registers for x and two for y. So we need six registers just to perform a long add or a long subtract operation.

In the G5 world, our registers are 64 bits wide. We can still treat them as though they're 32 bits wide, and some operations still deal with them as 32 bits wide, but for the long integer operations we can deal with them as full 64-bit. So in the previous example, where we had q = x + y, the result only needs one register, and x and y only need one register each. So we cut down the number of registers that are needed for each operation, and that makes more available for other operations. OK, so
we get a win from the reduction, a general win, by having more registers available for operations.

So let's look at the specific operations that we can improve on. Say in your Java code you have an expression long x = y. On a G4 this would actually require four steps in order to perform. We need two steps to load each half of the 64-bit value, the high 32 bits and then the low 32 bits, and then we need two steps to store those back out into memory. So in a 32-bit world we almost always have to use at least two instructions where one would do. On the G5 we only need one instruction for each of those: one instruction per load, one instruction per store. This is also used for moving data. Since we have a 64-bit data bus, we can get better throughput through the system, so when we're doing memory copies we're also getting a performance boost there.

Let's look at some of the simple operations, like add, subtract, and negate. Again, because we only have 32-bit wide registers, we have to do everything in two steps on the G4. In this case, if we want to add two long ints, we have to add the low halves of the two operands, bring the carry forward, and then add the two high halves plus the carry. Those would be the two steps that are highlighted here. I've thrown in the load operations as well, to give you a sense that it's not just the operation itself, it's also the things that go on around it. So it takes eight instructions to perform that. On the G5 it only takes one instruction to do the add, and again, each of the loads only takes one instruction and the store only takes one instruction. So we've cut the number of instructions required in half, and you can think in terms of fewer instructions, faster code.

Now the more interesting things, and this has been the most trouble for us in implementing the Java VM, have been dealing with longs in some of these more complex operations, like multiply and divide
and remainder, and shifts, and even comparisons. They can take many instructions. A long int divide can literally take hundreds of instructions, or hundreds of steps, in order to complete the operation. Remainder takes just a few more. Shifts can take eight. Comparisons can take up to 12. They're fairly expensive operations. Each of these has been reduced to one single operation. I'll take multiply as the simplest example. On the G4, to do a long int multiply takes six steps to do the cross multiply of the low and high parts of the operands. On the G5 it only takes one instruction. So you can see where this is going: if you have a lot of long int computation in your code, where it took many steps before, it's only going to take a few now.

Let's take a look at float. When I say float, I mean float and double. In the G5 implementation of the Java VM we have taken advantage of some of the newer instructions that can convert longs to doubles and doubles back to longs, and the same with floats. The G4 implementation has to make a library call, which takes several hundred steps. So that speeds up the performance of casting, the conversion of longs to doubles. There have also been some improvements in the float and double bit-extraction routines, such as doubleToLongBits. These are used primarily when you're converting doubles to strings and back again.

The most interesting of the changes is square root. On the G5 there's a built-in square root function. On the G4, the square root is implemented as a library routine, and it can take several steps; it's on the order of about 40 steps to complete. So what I did was take a little micro-benchmark where I'm iterating through a hundred thousand data points, applying a square root to each of the data points, and producing a result. And just to make it interesting, I took a slightly more complex operation, where I had 100 million XY points on a coordinate
plane and I want to compute the distance, so it's a slightly more complicated equation, and to see how long it would take to do on each of the processors. The first processor is a G4 running at 1.42 GHz. It takes about 12 seconds to do all those computations, and 13.5 for the distance formula. Now, if I were to just take a straight port over and use the library routine on the G5, it would be reduced to 7.7 seconds and 8.1 seconds. This is actually better than the thirty percent improvement projected for the time it should take, so the floating-point processing is better on the G5, and you're going to get a better result on the G5. Running with the square root instruction built into the code, or inlined in the code, it only takes two seconds. So you've got, say, a six times improvement in performance. But this is a micro-benchmark; it's just going to give you a sense of the increase in performance of the square root itself. Your actual example may take a little bit longer, but it gives you a sense of the magnitude of this improvement.

Finally, I just want to quickly run through some of the changes to the runtime. In a 32-bit world we have a little problem where two threads may want to share a long int value, say a static or a field in an object, and while they're writing to that long int, the upper and lower halves of the value might get slammed by one thread or the other, depending on how the thread switching is going on. To avoid that problem, you can annotate your field with the volatile keyword. What that does is force the VM to coordinate how that field is accessed and make sure that we don't run into that problem. On the G4 we did a little fudge using a 64-bit double register, using that as an atomic access and copying it through some memory and so forth, so it took several steps in order to make that work. On the G5,
64-bit loads and stores are atomic, so you don't have that problem. So there's no overhead when you're dealing with volatile fields on the G5.

One of the problems that the G5 introduces is the fact that the hardware itself is a little bit more complex and has more stages when it's doing its computation. This is where it gets its speed. So when you're running on a dual processor, there needs to be some coordination in how memory is being accessed. In the G4 world we use something called the sync instruction, and this allows the two processors in a dual-processor environment to sync up the data that's shared between the two processors. But the problem with the sync instruction is that it somewhat freezes the state of the processors until they're both coordinated before it continues on. So there's a bit of an impact there, and sometimes it can actually be fairly serious. With the introduction of the G5 they brought in a new instruction called lightweight sync, which doesn't require as much handshaking between the processors to determine whether the data is in sync. We use these when we're doing memory allocation, when two threads are trying to allocate memory at the same time, or when you're using synchronization on an object.

And finally, the last major change that we made in the runtime to deal with the G5 is atomic long access. There's a class in sun.misc called AtomicLongCSImpl which allows you to do atomic access of long values. This is primarily used in the net operations, like when you're setting up sockets and so forth. On the G4 we had to actually use full Java synchronization: we just used the Java implementation to provide the synchronization, so we lock out access to the object, or to that particular object field, then make the assignment, and release it through normal synchronization. On the G5 we use a lightweight load and reserve, which is an instruction which allows us to
reserve access to that word, and it can be done fairly quickly.

So in summary: the Java 1.4.1 that ships with the G5s, once we start shipping, will automatically adapt to the G5 processor. We're only going to be shipping one version of the VM from that point on. It's not one that runs on the G4 and one that runs on the G5; it's one that runs on all platforms but adapts to the G5. This is one of the great things about the HotSpot VM. You're going to get significant performance changes somewhat across the board, and specifically in floating point. That's the main thing: if you're doing scientific computing, you're going to see bigger wins there, and also with long int arithmetic, if you're using it. And the main thing that I want to point out is that you don't have to make any changes to your own code. The VM does the adaptation for you. This is where you're one up on all the C and Objective-C programmers, because if they want to run on the G5 and take advantage of the G5 processor, they're going to have to recompile their application, and they're going to have to ship a separate version of their application for the G5 and one for the G4. So Java is automatically going to take advantage of that. OK, and that's all I have to say. OK, Victor, there you go.

So for those of you that don't think in terms of bits and instructions, we'll take it to a higher level now. My name is Victor Hernandez, in case you don't remember, and here we go. Basically, what I'm going to be talking about is updates to HotSpot that have been made with Java 1.4.1: specifically, one of the features that we've added in being able to optimize your code, which is aggressive inlining, and also one performance opportunity that you can take advantage of yourself in Java 1.4.1, which is the new I/O APIs. Finally, I'm going to wrap it up with a bunch of conclusions and tips that you can take advantage of to improve your hot
methods.

OK, so one of the performance bottlenecks that has plagued Java, well, I don't know if plagued, but that Java has encountered since the early days, is the fact that there's a large cost in the overhead of actually invoking a Java method. Our opportunity to minimize that cost is to dynamically inline the method calls made by your methods when we compile your method. What is inlining? That should be pretty straightforward, but I'll give a quick example here. You've got average and sum, and average calls sum. Of course, average could avoid the call to sum if it simply did the a + b itself, but of course you don't want to do that in your code, because that limits the reusability of that method. The good thing is that we're able to do that for you. You don't need to change your code; we just do it for you on the fly.

In 1.3.1 there was limited ability to do inlining. We were able to inline your accessor methods to your fields, we were able to inline your calls to create new instances of your objects, and we were able to inline certain intrinsics, intrinsics being methods where we don't actually need to look at the bytecodes to know what they're supposed to do. We know what they're supposed to do and have a finely tuned implementation of them, for example sine, cosine, and also the identity function. But one of the main issues with inlining in Java 1.3.1 was the fact that we were not able to inline virtual methods. Why are virtual methods difficult to inline? The reason is that there could be multiple implementations of that method when you actually go to do an invocation, so we don't know which implementation to actually inline.

So how do we go about inlining those virtual methods? We do that with a technique called class hierarchy analysis. The goal of class hierarchy analysis is to determine if a method is monomorphic, and a method is monomorphic if there is only one
implementation of that particular virtual method that has actually been loaded. If we know there's only one that has been loaded and you go to call it, that's got to be the one. HotSpot 1.4.1 attempts to aggressively inline all monomorphic methods. That's the main feature we've added beyond 1.3.1.

So what are the benefits of this? Well, clearly, the fact that now we can actually inline virtual methods. There are certain situations where those methods don't get inlined, and even in that case we can avoid the virtual table lookup when invoking that method, because we know that there's only one entry in the virtual table. It also provides the ability to do a faster implementation of certain bytecodes, because class hierarchy analysis has a data structure which tells us the full hierarchy information of all the classes that have been loaded. So when you're doing things like instanceof or checkcast, which are bytecodes used when casting your objects between various classes, we can use that data structure, and it performs a lot faster.

OK, so what is another performance issue that has affected Java in the past? This one actually has two parts to it. One is the fact that if you ever wanted to operate on native data structures from your Java methods, you had to have them residing in the Java heap. Why would you need native data structures in the Java heap? If you ever want to interact with any system APIs, you need to have those data structures to pass down once you drop down to native methods. That adds the other heavy cost, which is the fact that the JNI transitions to do those native method calls are quite expensive. In the previous section I was talking about inlining, where we're trying to minimize the number of method calls, and those method calls are even pretty quick compared to
these JNI transitions. Not only that, but these JNI transitions definitely cannot be inlined at all, because we're crossing ABIs and we don't totally control a lot of the issues in calling from Java to C. But those JNI transitions are still necessary as of Java 1.3.1. The other thing to keep in mind is that once you have all those native data structures in the Java heap, they need to be copied around during garbage collection, and yet they don't contain any actual Java pointers, which the garbage collection algorithm needs to keep track of.

So what is our approach to improving this bottleneck? We want to remove this JNI dependency altogether by giving you the ability to access that native memory from your Java methods. You might be familiar with this: it's the new I/O APIs that were provided in 1.4.1. They're available in the java.nio package, and there's basically a buffer class for every single one of the Java scalar types and also for the byte type. Actually, all operations happen at the byte level, even though that's not a Java scalar, but you can cast to IntBuffer or to LongBuffer and operate at the Java level.

One of the things you need to keep in mind here is that even though the goal is to have direct access to native buffers that are not located inside the Java heap, you can trick yourself into still having a copy residing in the Java heap, accessing that, and having it copied over outside of the heap. Even though that might be improved performance over before, since you don't have to drop down into a JNI native method to do it, it is still an added overhead, and you need to be careful about that.

There are a few other issues that I want to bring up. This is a pretty straightforward code example that shows an allocation of a byte buffer of size four hundred, and you're basically zero-filling it with a
for loop. One of the things you need to be aware of right here is that this for loop is not as optimal as it could be, because the call to the put method does not get inlined, since it's not determined to be monomorphic. This is a caveat of the actual class hierarchy of the java.nio package, and it affects all your calls to get and put in the case of ByteBuffer. If you're doing something like this, there is one way you can get around it, and that's simply by using a MappedByteBuffer. The MappedByteBuffer get and put methods are determined to be monomorphic, and they get inlined. But you need to keep that in mind, and this is something that we're going to be tracking in the future to see if it can be improved.

So how do you actually do high-level I/O with the new I/O? That's done using java.nio channels, found in the java.nio.channels package. The main thing that it provides beyond what was available in the traditional java.io of Java 1.3.1 is the ability to do non-blocking and interruptible operations. No longer is there any need to have one thread per socket; that's a thing of the past. Another thing that it provides is improved file system support. It gives you a lot more of the system-level primitives that you've come to expect from a robust operating system, things like file locking and also memory-mapped files. Just like the case where you need to make sure that you have a direct buffer sitting behind your native buffer, this is another example where you not only have access to direct memory, but you are actually accessing the memory-mapped file itself.

So let me go into a little more detail about the socket channel. I'm not going to go into enough depth for this to be a tutorial on this sort of thing, but I do want to bring up a few issues that tutorials might miss on occasion. This is an example of how to create a server socket channel and
bind it to a particular address for it to be listening on. One of the things is that by default it is not set to be non-blocking, so you have to do that by calling configureBlocking and passing it a value of false. It's a pretty straightforward thing, but it can be missed, and it definitely makes a huge difference.

Then how do you actually communicate with your clients using this model? You use the selector model, which you might be familiar with from the programming pattern. You can see in the code right here that basically what you're doing is registering for a particular key, and once you've done that, you can iterate over all of your clients, who communicate to you via keys and who pass you the channel you will be communicating with them on, in a big iterator, in a big while loop if you want, by iterating over all of the keys. That way you're abstracting away all the different sockets that you're talking to, instead of doing the traditional thing of having to block until your client talks back to you along that particular socket. One of the things to keep in mind is that the socket channel that is returned each time you ask for one of these keys is different from the one you had originally. So if you want to continue the non-blocking I/O, you need to state that you're doing non-blocking I/O once again, with configureBlocking set to false.

OK, so what do you need to keep in mind when using new I/O? Well, it's definitely not free. The cost of allocating those native buffers is definitely much larger than allocating Java arrays. It's pretty hard to match our performance of allocating Java arrays, because we do a very good job of doing that as quickly as possible. The other thing to keep in mind is that the get and put methods of the native buffers are not inlined. You can use a trick to get at least that
fixed for a few cases, but there's nothing you can do if you're using IntBuffer and some of the other scalar buffer types. But the gains definitely outweigh the costs in the cases I've been talking about, where you have heavy use of data structures related to system APIs. One good example of taking advantage of that win is the re-architecture of the AWT done by our team for Java 1.4.1. We took advantage of the new I/O API to talk to Core Graphics and minimize the number of JNI transitions. Basically, we told the classes team to try to minimize JNI transitions as much as possible, and they did that as much as they could, and you definitely see the performance improvement there. We're hoping that the shared classes on the whole will be seeing more use of new I/O, where it can be done, in the future. The other thing is, clearly, if you have server I/O with multiple clients, you definitely want to be using this, because the overhead is definitely costly.

OK, so what should you take away from all this? The main thing you need to keep in mind with the server I/O is to simply use it in those cases. And with what I told you about inlining, what you need to do is maximize the opportunities where we can inline your methods. This is mainly important in your hot methods. When you do a profile, you want to figure out what the methods are that you're mainly calling, and make sure that, for your hottest methods, all of the things that they are calling are hopefully being inlined. It can only be done at a high level; there are actually no flags to have the VM notify you if your methods aren't being inlined, and that sort of thing. But the general rules of thumb are: definitely, if all those methods are small, that helps, because we have a certain limit at which point we bail on any future
inlining in the method we're trying to compile. Feel free to use accessor methods; those have been inlined definitely since Java 1.3.1. Also, there's no need to use the final qualifier on your methods. That's superfluous for performance; it's not superfluous for object-oriented programming, but we don't particularly get any benefit out of it. And also keep in mind that a lot of the JDK methods do get inlined, so you can keep that in mind if that's a lot of what your hot methods are doing. There are a few things that we're still unable to inline, and you've got to keep that in mind: mainly synchronized methods and, obviously, large methods, and if you have an exception handler in your method, that can cause it not to be inlined. The last tips I want to leave you with are ones I always like to reiterate, which are things that still live on from the days of Java 1. Avoid object pools. There's absolutely no need for them in modern Java. Our new is completely fast; it's also inlined. We also have thread-local allocation now, so there's minimized contention between multiple threads allocating in the Java heap at the same time. And we have precise garbage collection, so let us do the work for you in terms of figuring out when an object needs to go away; you don't need to take care of it in terms of object pools and all that. And also avoid programming by exception. There definitely are situations where you want to program by exception, like the case where you want to go down a tree and then go all the way back up through the branches in the tree, sure, but HotSpot is definitely not optimized to compile those cases as well. For example, it can cause inlining to be prevented. Also, the actual creation of exceptions is expensive, but that creation only happens if the exception is actually thrown. So I hope you came away with some tips for your application, and now I'll bring up Gerard. Hello, welcome. My name is
Gerard Ziemski. I'm an engineer on the Java classes team, and I'll be talking to you about graphics performance. First, I'll give you a short introduction to the state of Java graphics on Mac OS X, then I'll give you a few actual tips and techniques on what you can do to make your Java app run faster, and finally we'll have some cool demos to show you. So, Java 1.3.1: one really interesting thing that we did there was a Java 2D hardware-accelerated implementation that sat on top of OpenGL. That was a really terrific implementation. It was fast, incredibly fast. However, the problem with it was that when it worked, it worked, and it worked only ninety percent of the time, and getting the remaining ten percent was really difficult for us. We were making strides and making progress; however, we really could not nail down the correctness. So when we moved to Java 1.4.1, we completely rearchitected our code. We moved from Carbon to Cocoa, and the lesson that we learned in 1.3.1 was, first of all, that if we cannot do hardware acceleration, we need to have something we can fall back on, and that is a software renderer. So when we moved to 1.4.1, we decided: let's have a terrific software implementation as far as correctness is concerned. Then in the future, once we have that down, we'll be looking at new technologies emerging right here within Apple, and we'll evaluate them, see which one works for us the best, and adopt that technology. So right now we are still at the transition point. In 1.4.1 we have brand new code. There is not even one line of code that we share with 1.3.1; it's brand new, everything is written from scratch. But we want to nail the correctness first, and of course we're keeping our eyes open to what's going on around us and what technologies we can use later. So in 1.4.1, the Java update that you guys have access to, first of all, our priority was correctness.
Second, we also didn't want to neglect Java graphics optimization. As you all know, 1.4.1 is not a speed demon as far as graphics is concerned, so we worked on very basic architectural optimization techniques that we could put in there, and right now we have come up with three of them: lazy drawing, lazy pixel conversion, and lazy state management. What lazy drawing is about is that we simply collect all the primitives that you want to draw and put them aside in a queue, in a cache, and when the time comes to draw them to the screen or into your image, only then do we transition from Java to native and process that queue. The good thing about the lazy drawing implementation that we have right now is that it is future compatible: whatever technology we choose to use next, this lazy drawing implementation will work with it. We worked with the Core Graphics guys, and we made sure that whatever we do with our lazy drawing optimization will not break them in any way. Second, lazy pixel conversion. There are certain image types that Java provides access to that are not supported natively. What that means is that if we want to do something with such an image, the pixels are in a format that is not understood natively, so we have to convert them. If we didn't do this optimization, drawing of images or drawing into images would be terribly, terribly slow. So lazy pixel conversion is simply a technique of converting the pixels only when it's necessary. And thirdly, lazy state management. A graphics context has multiple different states that you can set: transformation, color. This optimization technique simply lets us set only those states that have actually changed. We're not quite done with this optimization; we are only part way. So unfortunately, at this point, whenever you change most of the graphics states, we have to slam all the other ones as well at that time. That is terribly inefficient, but we're working on that.
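The lazy drawing and lazy state management ideas described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Apple's implementation: `LazyRenderer`, `CountingBackend`, and the integer command encoding are hypothetical names invented for the example, and the native side is simulated by a class that merely counts what it is asked to do.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the native renderer; it only counts requests. */
final class CountingBackend {
    int crossings;   // simulated Java-to-native transitions
    int colorSets;   // color state changes actually applied
    int rectsFilled; // rectangles actually drawn

    /** One call models one transition, however many commands it carries. */
    void execute(List<int[]> commands) {
        crossings++;
        for (int[] c : commands) {
            if (c[0] == LazyRenderer.OP_SET_COLOR) {
                colorSets++;
            } else if (c[0] == LazyRenderer.OP_FILL_RECT) {
                rectsFilled++;
            }
        }
    }
}

/** Hypothetical sketch: queues primitives and records only changed state. */
final class LazyRenderer {
    static final int OP_SET_COLOR = 0;
    static final int OP_FILL_RECT = 1;

    private final CountingBackend backend;
    private final List<int[]> queue = new ArrayList<>();
    private boolean colorKnown = false;
    private int lastColor;

    LazyRenderer(CountingBackend backend) {
        this.backend = backend;
    }

    /** Lazy state management: queue nothing if the color is unchanged. */
    void setColor(int rgb) {
        if (colorKnown && rgb == lastColor) {
            return;
        }
        colorKnown = true;
        lastColor = rgb;
        queue.add(new int[] {OP_SET_COLOR, rgb});
    }

    /** Lazy drawing: remember the primitive instead of drawing it now. */
    void fillRect(int x, int y, int w, int h) {
        queue.add(new int[] {OP_FILL_RECT, x, y, w, h});
    }

    /** Hand the whole queue to the backend in a single transition. */
    void flush() {
        if (queue.isEmpty()) {
            return;
        }
        backend.execute(new ArrayList<>(queue));
        queue.clear();
    }

    public static void main(String[] args) {
        CountingBackend backend = new CountingBackend();
        LazyRenderer r = new LazyRenderer(backend);
        r.setColor(0xFF0000);
        r.fillRect(0, 0, 10, 10);
        r.setColor(0xFF0000); // unchanged, so nothing is queued
        r.fillRect(10, 0, 10, 10);
        r.setColor(0x00FF00);
        r.fillRect(20, 0, 10, 10);
        r.flush();
        System.out.println(backend.crossings + " crossing, "
                + backend.colorSets + " color changes, "
                + backend.rectsFilled + " rectangles");
    }
}
```

In this sketch three rectangles and three `setColor` calls reach the simulated native side as one batch containing two color changes, which is the kind of saving the talk attributes to queuing primitives and skipping unchanged state. A real renderer would also have to decide when to flush (for example, before pixels are read back), which the sketch leaves out.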
Here is one benchmark, a micro benchmark, to show you. On this one, the score simply shows the performance of the lazy drawing optimization. What you had in your hands with our initial 1.4.1 release was the base of one hundred, and what you have right now is a score of 175, which is a seventy-five percent increase. That's not too bad, and we're not done with it by any means. And second, Robocode. That's a real-world application, and there's an interesting story behind it. At the time when we were working on 1.3.1, Robocode was running pretty darn slow, and we went to the developer, and I think we made a mistake, because we told him: look, the image format that you're using is not fast with our current implementation in 1.3.1; why don't you use the image format that we support natively, and it will speed up your application? Well, they listened to us, they changed it, and yes, there was a solid performance improvement in 1.3.1. However, when we moved to 1.4.1, underneath we use a different implementation, different techniques, different technology, and the Robocode score plummeted again. The problem was that the image format was hard-coded. There are two things that went wrong in Robocode. First of all, our lazy pixel conversion had a not very efficient filter, and, this is maybe a small tip for you guys, we mixed floats and ints, and that made it terribly, terribly slow. When we switched to integers only, the filter was really much faster. This, combined with the fact that we also improved on lazy pixel conversion, made Robocode perform much better. Now keep in mind that this is still with an image format that is not natively supported; we still have to do work there, but we use our lazy pixel optimization. Scott, do you want to show us this? Demo one, please. You might have seen in the State of the Union I showed this, and it wasn't getting full frame rates, and we're getting much closer to full frame
rates than I've seen before. If you remember, on the 1.4.1 release we were getting about four frames a second, and that's because we were doing all that conversion right on the fly. Now if I just start this battle up, we should be getting something closer to around 30 or so. Yeah, we're getting about 32, and I may even get above that. Right now it's hard locked to 30; if I go up to maximum, we might even break 30, and it stays right around 30. You can see here that we're not even using both of the processors. And something else going on here that you don't actually know is that we're running one of the really cool demos that you'll see later in the session in the background on the same machine. So it's actually got a lot of extra time; I could start up my word processor and we'd still keep our 30 frames a second. I also like to do this: let's restart it again and turn on my favorite option, which is showing where the robots are scanning, and you get to see this, and we're still holding at 30 frames a second. So it's a pretty good improvement from just looking at that native image format and seeing that, okay, we had to do a fix. Thanks a lot. Thanks, guys. Slides, please. And the future: first of all, we have lazy state management to finish; this should give us a pretty nice improvement. Then there's more optimization that we can do. We have already tried implementing some of our lazy pixel conversion filters using the multiple processors that are now available in many of our computers. We're not done with it; we're just testing, playing around, seeing how much improvement that can give us, but that's one of the things we're looking at. Also AltiVec optimizations. So there are still quite a few technologies that we can use to make Java graphics go faster. And then, we are talking to Sun's engineers. When we were working on our lazy drawing optimization, we actually went to them, to the Java 2D graphics engineers, and we told them: listen, guys,
this is what we're thinking of doing. What do you think? Is it dumb? Is it going to work or not? Do you have any other cool ideas for what we might do? And they loved this. It was not one of their ideas, and they said, yeah, go for it, we don't even have it. So we've done something that they wish they had. So we're definitely doing some interesting things. And then lastly, and it's very important, we are working very closely with the Core Graphics guys. For example, if you went to the Core Graphics sessions, you might have learned that they're starting to provide pbuffers in OpenGL; what those are is off-screen images. Also, they can wrap an OpenGL context in a Core Graphics context. What that means to us is that we basically don't have to do any extra work to get hardware acceleration out of certain operations. Pbuffers, together with Core Graphics giving us encapsulation of an OpenGL context within Core Graphics, give us the ability to very nicely, very easily implement volatile images. As you know, volatile images right now are not hardware accelerated on Mac OS X, but there you go: Core Graphics just added something that we can use to have a real hardware-accelerated implementation for volatile images. So that's one example. And second, of course, they're looking into an API that sits on top of OpenGL. That's very similar to the hardware acceleration that we had in 1.3.1, but better, because we would not be the only client of it and we wouldn't have to support it; it would be system-wide. If they come through, if they have that working, then that's definitely something that we would like our code to work with and take advantage of. So we'll be looking very closely at what the Core Graphics guys will be doing and certainly taking advantage of any cool technologies that they have to offer. If there is one thing that I would like you guys to take out of this session, it is what to do about images if you have to draw into
a BufferedImage: how do you determine what is the correct, what is the fastest image type that you should use? And this is the most important thing: please do not hard-code the BufferedImage type. That's an example of code that would be hard-coding it. What you can do is ask the system for a compatible image. If you do this, then no matter what technology we use in the future, you're guaranteed to be given a buffered image type that will be the fastest on our platform. That is very, very important. And one more point here: if you have the option, opt for a volatile image. We will have them hardware accelerated soon, hopefully, so if you have the choice, get a volatile image. Now, there's one misconception among some of you with respect to indexed color formats. On other platforms they're very fast, and they also conserve memory, so using an indexed format is a way of compressing the pixel data, using less memory. Unfortunately, on Mac OS X they are not supported natively. So what we have to do internally to support that image format is create a brand new buffer just to have those pixels converted into a format that we can understand natively, and that's the way we can use them. So indexed color format images on Mac OS X do not use less memory; on the contrary, they use more, and they are slower. So if you have to use them, you have no choice, but if you don't need to use them, use other image formats. It's very often very easy for you; you don't need to do a lot, just change the BufferedImage format type. And second, and that's very important, the most optimal image format is not hard-coded. It can change; it can vary from machine to machine. If we were to move again to a technology that uses OpenGL, that is very dependent on the video graphics card you have in your system. It is also very important to us what the resolution of your monitor is and what screen depth you are running at. So there
will not be one and only one image format that is the best, the fastest. It will change, and you need to keep that in mind if you're writing for Mac OS X. Now, if you really need to know what the natively supported image formats are at this particular time, and this may not hold even in the next few months, this may change, at this point only four image types are supported natively. Those are the fastest, and those are the image types into which you can draw, which means they're the destination: you can create a context, BufferedImage createGraphics or getGraphics. So only those four are natively supported; those are the fastest at this point. If you need to draw an image somewhere else, meaning the pixels of the image you have are the source, then the natively supported image formats are a superset of the destination ones. We have one more image format that we can support natively as a source, and that is ARGB, alpha non-premultiplied. That is, by the way, the image format that Robocode uses. And we have added a special optimization in our lazy pixel conversion that actually allows the pixels to know whether they're in native format or in Java format. Based on that, we have two different CGImageRefs, and we can switch very fast between the two of them; we can choose the pixels that are up to date. And here are some techniques for rendering. This is important on our platform as well, where we are missing one more technology that we had in 1.3.1, which allowed you to draw reasonably fast if you were drawing on a non-AWT event thread. We don't have that quite yet in the current release; we're working on it. So the point here is, you can trust us that we may do this, but if you're in this situation and you really want to make sure that this runs fast in your application, if you're on a non-AWT event thread, please first render off-screen to an image, and then take all of that content and at
one time blit it to your final destination. It will be much faster on Mac OS X. Now, this is for those of you who really need the fastest access to the image pixels for whatever reason, if you're writing an image manipulation program like Photoshop or something like that. Unfortunately for you, there's no way to determine whether a certain image type is supported natively or not. It may change in the future; it's constant only at this point. However, if for some reason you need to do that and you know the image format is supported natively, what you can do is grab the pixels directly from the data buffer. On a non-natively supported image, you do not want to access pixels directly. If you do, we have to turn the lazy pixel conversion optimization off, because the second you touch the pixels, those pixels are stolen: you have access to them, we do not know when you look at them or when you use them, and we have to do the conversion from native to Java on every single operation. So you do not want to touch pixels directly on a non-natively supported image; go through the Graphics object and draw to it that way. Other optimization tips and techniques. This comes from my own work: I've done a DNA sequencing application, and when I was trying to optimize it, here are a few things that I found that helped that application. First of all, avoid creating objects in your paint method. Obviously this applies to all platforms. Don't create new fonts; don't create rectangles. If you need to use them and manipulate them later on for determining the clip, don't create any objects in the paint method. Use simple primitives instead of shapes. Our lazy drawing optimization actually attempts to do that automatically for you; however, if you have the choice, on Mac OS X it is faster to draw the primitives directly, as in fillRect with, you know, x, y, width, height, as opposed to creating a Rectangle object. Use polyline
instead of drawing lines one at a time. It not only avoids the crossing from Java to native, it's also simply faster with the current Core Graphics implementation, because we can build a complex path if you have a polyline; otherwise we have to draw lines one at a time, and Core Graphics is not terribly good at that. This next one will probably not apply to most of you, but it does if you have a limited alphabet and you know you will not be drawing complex characters. If you're writing a text editor kind of application, this will not apply; however, if you have a limited alphabet of, say, four letters, then maybe you can do that. The optimization is that you can use bytes, not chars. Chars are 16-bit, and we do not know whether one could be a Unicode character or not. If it is, then we have to go through a more complex path to draw Unicode characters. If it's a byte, then we know it falls within the ASCII range, and we can bypass some of the complex text rendering routines and go straight to Core Graphics to blit those characters. Use double buffering for the static portions of your application; that will apply to all platforms. And with this release we have added tons of runtime options for you guys to play around with, to turn on and off. You can turn off the optimizations that we provided for you; you can turn on and off rendering of lines or rectangles or shapes. You can use all those runtime options to narrow down and find out what the problem with your application is, if you have one. So now, for the demo, I'd like to welcome Ken Russell from Sun Microsystems. A couple of weeks ago at JavaOne, Sun announced the new Java gaming initiative, and one of the products of this initiative is a new OpenGL binding for the Java platform called JOGL. JOGL is open source, and you can download the source code right now on java.net. So just go to java.net, search for the project name, and you can get it. And thanks
to Gerard and a couple of all-nighters, JOGL is now running on OS X. It's running on the Developer Preview that you've got with your 10.3 CDs, and it's not going to run on any earlier versions of Java for OS X, so keep that in mind, but going forward it will work and it will be fast and robust. So we've got a couple of very cool demos to show you. This one is very special: this is Dobie the dog. Dobie was developed by the Synthetic Characters group at the MIT Media Lab, and Dobie is completely autonomous. He perceives his environment, he has his own internal motivations and desires, and you can actually train Dobie in the same way that you would train a real dog. You can sort of lure him around and show him new motions to do, you can reward him by giving him a little click with a clicker, and you can in some sense scold him by ignoring him when he does a behavior that you don't like. And basically, it's a safe thing to say that Dobie is pretty much the state of the art in interactive animated characters that can learn, and you can read more about Dobie in the paper on him at SIGGRAPH 2002. Now, Dobie, it turns out, is written almost entirely in the Java programming language, with a little bit of native code around the outside to get the custom device inputs, and he uses some of the more advanced OpenGL techniques, like vertex shaders, to do the shadow that you see here and the cartoon-like shading around the edges of the dog. This demo runs at over 50 frames a second on a dual-processor G4. I should mention that the Synthetic Characters group is a big OS X development house, so they do all of their development at this point with Java on OS X, and this is the first demo that they've actually had to slow down because it was running too fast. So it's actually slowed down to 30 frames a second, because the G4 is so fast, and the G5 will be even better. So we're not actually going to train Dobie right now; he's just
going through his paces, but you can sort of see what's going on. There's skinning going on, there's action selection, and this is running on top of the JOGL binding for OS X. Also notice the CPU usage in the bottom corner: there's almost nothing going on in there; everything goes through the video graphics card. Yep. So, cool stuff. Okay, so now here's another demonstration. This is a demo by NVIDIA Corporation that we've ported from C to Java. Okay, now, this is not real-time ray tracing. This is using a couple of tricks to get hardware acceleration for this technique of rendering glass with a prismatic effect. Many of you, I'm sure, are familiar with the technique of ray tracing: you send a ray of light out of the camera into the scene. That is in fact being done at every vertex on this wireframe model, but the trick is that it's being done on the graphics card by what's called a vertex shader or vertex program. This is a tiny little assembly language program that is actually uploaded to the graphics card when this demo starts up, and it tells it: okay, we're going to take the camera's position and the vertex's position and the surface normal and figure out where the reflected ray should go and where the refracted ray should go through the object. Basically, it looks up in the surrounding environment, this street scene, the right texture coordinate for where the ray intersects the world. What it's doing is distorting the background texture in such a way, on a per-vertex basis, that it looks like the thing is made out of glass. So it's not doing it at every pixel, it's doing it at every vertex, but it's close enough that it's really indistinguishable. Another cool trick here: you'll notice that we just turned off the fringe effect. What's going on is that we're rendering the scene three different times, each with a slightly different refractive index for the glass, and that makes the refracted ray
go into a slightly different position in the surrounding environment each time. Then those three renderings are added together, again on the graphics card, and you get what basically looks like a prism. I'd like to point out that the same binary for this demonstration runs on OS X, it runs on Linux, and it runs on Windows, and it runs at one hundred percent of the speed of the analogous C++ code. Remember, this was a port, not a new demo. So basically, we are here with respect to OpenGL performance in Java, and it's running on OS X, and it looks great. So go out and develop cool stuff. We don't have Java 3D for you guys yet, but if you really need to use some 3D graphics, then please use JOGL. Some of you might be familiar with GL4Java, which is a very similar technology, also an OpenGL binding. GL4Java doesn't support the latest OpenGL standard and doesn't give you access to pixel shaders; JOGL does. So if you want to have 3D graphics on Mac OS X using Java, you can, you