WWDC2004 Session 503

Transcript

Good morning. Welcome to Session 503, Optimizing for Power Mac G5. It's been about a year, actually exactly a year, since we introduced the Power Mac G5. Last year, if you were here at WWDC, most of the content that we presented dealt with the architecture of the G5 chip itself. We had optimization labs running for several days, late into the night at times, trying to get as much information as possible to you as developers so you could understand the differences between the G4 chip architecture and the G5. There are very stark differences there. This year we want to again re-emphasize the architectural differences and why they matter for writing optimized code, but we also want to make sure that you understand how to use the tools that we supply in our tool set, as well as the compiler options that you have, to help you optimize your code. We have several speakers this morning, from the compiler team and our performance group, as well as a guest speaker from IBM. So with that, I'd like to start off this morning and introduce Sanjay Patel from our performance group.

Good morning. This is a tough time slot. For those of you who were here last year, we're going to do a bit of review. How many of you were here last year for this talk? Okay. So we're going to go through some G5 architecture, then we want to talk about things you can do to help improve your code on G5, and in fact on all platforms really, and then we'll turn it over to some of our compiler guys to help guide you through the process of optimizing your code.

To start off, when you're talking about the G5 you have to of course start with the PowerPC 970 chip. This is a super-pipelined, superscalar processor that we teamed up with IBM to make. It's based on the POWER4 server architecture, and the big addition that we had to have to make it an Apple chip was to add what we call the AltiVec engine, also known as the Velocity Engine. So this
is a 128-bit vector unit, which does floating-point and integer math. The other big difference is that we have this high-bandwidth point-to-point interface that connects the chips to the memory controller and helps take advantage of all the bandwidth that we have available in the system, and in there are these automatic hardware prefetch engines that help put all that theoretical bandwidth to use.

This is a die shot of the 970 chip. For hardware engineers this is kind of like pornography: you look at this and you'll hear them say things like "look at the FPUs on that one." For software engineers, what you want to take away from this is that there are lots of execution units available for your program, and they all operate in parallel. There are two independent load/store units and two independent IEEE-compliant floating-point units; we have the full implementation of AltiVec, and two fixed-point units as well. So there's just a lot of space here to get things done in parallel.

Another way to look at this, from the software perspective again: if you start at the top, you have the L1 cache, which holds your instructions. From the L1 cache, instructions flow into a fetch queue, and from there they get dispatched, up to five instructions on each clock. So at 2.5 GHz you're dispatching, in theory, over 10 billion instructions per second. Now, that's actually the narrow point of the 970 architecture. You can see that the stuff in the kind of greenish gray is out-of-order execution, so you can feed 12 independent execution units with 10 issue queues once you've dispatched. Then your instructions complete at the bottom of this picture, in order, up to five per clock.

A good way to put this in perspective is to look at how the G5 compares to the G4. We keep talking about the parallelism of this chip. One way to measure that is how many instructions you keep in flight simultaneously, and for the G5 it's over 200 instructions in flight, compared to a little
over 30 for a G4 architecture. We've also increased the pipeline stages, so you can see that it's more than doubled for a simple integer instruction. Usually you'd say, well, that's not so good, why'd you do that? The reason we increased the pipeline stages is to help increase frequency. We just announced that we hit 2.5 GHz; in order to hit higher and higher frequency numbers, you increase the pipeline depth.

We talked about some of the execution units. Again, we've doubled up on the load/store units and doubled up on the floating-point units. You now have two general-purpose fixed-point units, whereas on a G4 architecture you have up to three simple units that do things like adds and subtracts, and one complex unit to handle things like multiplies and divides. The vector units are pretty similar: you have an ALU that handles floating point and integer, and you have a permute unit. Any of you who have done any vector programming with AltiVec know what the power of the permute unit is in terms of swizzling data out of memory and into the registers.

As we work our way out from the core, the biggest programmer-visible difference that you'll find is that the cache line size is different. It's 128 bytes, whereas it was 32 for a G4. Now, that can either be a really good thing or a bad thing, and I'll show an example of how that happens. As we work our way out again to memory: the L1 data cache is the same size, with different associativity and write policy; the L1 instruction cache is doubled in size; and the L2 cache is also doubled in size compared to the 7450. You'll notice there's no L3 cache on a G5 system, whereas you had up to two megabytes of L3 on a G4 system. We've made up for that by increasing the processor bandwidth substantially: one, by doubling the width of the DDR interface and increasing its frequency, but also through the front-side bus. This slide is actually a little out of date: at two and
a half gigahertz, we've increased the front-side bus frequency to 1.25 GHz, so you can actually, in theory, get 5 gigabytes per second in each direction.

So I want to talk about some programmer problems I've seen over the last year, since we introduced the machine. The one that comes up most frequently turns out to be a rather simple thing: conversions between floating-point values and integers. The reason this shows up a lot is that when you write this in C, it looks really cheap, right? You just cast your variable to an int, or cast it to float. But it turns out this is not cheap at all, particularly on a G5, because it has so much going on in parallel when it hits this condition. Because the PowerPC architecture doesn't have direct float-to-integer register transfers, you actually go to the L1 cache and come back, so you have a load and a store operation going on.

There are a lot of things you can do to avoid the problem. The biggest one that I've found, and it actually turns out to be one of the easier solutions, is simply: don't do that. It turns up because it looks so cheap and easy that a lot of people just cast from one to the other without thinking about it, and when you examine that code you realize you could have stayed in one domain or the other without affecting the algorithm in any way. The other cool way you can get around the problem is to use AltiVec. Of course, if you use AltiVec you're going to get a much greater speedup because of all the parallelism in the AltiVec unit, but the AltiVec unit also handles floating point and integer identically, in the same register set, so there are no memory operations when you transfer between types. Another potential optimization is to use the GCC compiler flag -fast. This tries to schedule your loads and stores and inserts no-ops between them to separate them out and keep things flowing through the G5. IBM's XL compilers also do this kind of optimization if you specify the G5 architecture.

So I want to show you a quick and really simple
example of bad code. This is a really lame loop. All it does is int-to-float conversion, because the loop counter, in this case i, has been declared an int, but we're adding it into a floating-point sum. That looks really cheap, but it's not: every time you do that add, you're going to have to convert i from integer to floating point in order to add it into the sum.

So how would you get around this problem? What you can do is create what I call a shadow of the i variable in the floating-point unit. I've just named it i_fp to denote that it's a floating-point value. When I initialize i, I also initialize its shadow, and when I increment i, I increment its shadow. Now, inside the loop, we're going to use the floating-point value for the sum rather than the integer value. On a G5, I measured this, and it turns out this code is three times faster than the previous code where you're doing the conversions, because this code won't have to do all the load and store operations.

The next biggest thing I've seen over the last year is improvements from scheduling your code. Like I said, the G5 has a lot of execution units that you can operate in parallel, but if you write dependent code, so one operation depends on the next, which depends on the next, it's all serial, and you're not going to take advantage of all those units. Furthermore, you're not going to take advantage of these long pipelines. You have to schedule your code so that you're filling in all these pipeline slots instead of causing bubbles in execution. Compiler help is here: you have GCC 3.3, which has G5 architecture tuning that tries to schedule for the available units and slots. XLC also has the same kind of flag, where you specify that you have a G5 architecture. The other thing you can do is use Shark, which you've probably heard about and which we'll talk about in more detail tomorrow at 3:30; we have a full session on how to use Shark and what it can do for you.

Now again, I mentioned that we've increased the
pipeline stages on the G5 compared to a G4. So what does that mean? Well, it means it takes longer, in terms of clocks, for a simple instruction to complete. For example, an addition instruction may take one cycle on a G4 but may have two cycles of latency on a G5. What you want to do is account for that in your program by grouping a bunch of similar operations together. That means you can unroll your important loops, or you can use the compiler flag, and as well you want to schedule your code for the G5 so you're going to fill in all those pipeline slots.

Now, people often ask, "Shouldn't the compiler do that for me?" In all these examples you can always ask that question. In some cases the compiler can do it for you if you specify the right flags, but there's always a downside if you just lean on the automatic way to work around this problem, because the compiler usually doesn't know whether a loop is important or not. That's something you have to tell it. If you choose to unroll all loops, or unroll most loops, you're going to have a big increase in code size, which could be detrimental to your performance. That's why, as a first pass, you should profile your program and try to do some of these optimizations manually, just in the important spots.

Here's another example of some code that's, again, just a silly example: we're just going to sum a bunch of ones in this case. On the 970 architecture, the G5 has two floating-point units, and they're each six stages long, so this code is only going to get approximately one-twelfth efficiency, because every instruction is dependent on the previous sum. So here's a simple fix. Now the code has exploded, right, because we're trying to fill all the pipeline stages. Here we actually only unrolled to eight ways of partial sums, so we wouldn't fill all 12 pipeline slots; you would actually want to do 12 in order to maximize your gains on the G5. You can think of the
floating-point units as either one 12-stage pipeline or 12 single units, but they're all going to operate in parallel. This code turns out to be 10 times faster, just from using partial sums instead of one variable.

The other big thing you have to worry about when you're optimizing code, and this goes for all architectures but particularly for a G5, because the G5 core is so good it makes memory look really, really slow, is to try to reduce operations where you're waiting on memory, so you effectively reduce your latency. There are a couple of ways to do that. One is to rely on the hardware prefetch engine, and I'll show you an example of that. The other thing you can do is use software prefetch instructions to get the data before you actually need to use it for computations. For example, if you're in a loop, you can batch all your loads together at the top of the loop, do a bunch of math, and then do stores at the bottom. That's going to perform better than doing serial operations of load, math, store.

I mentioned that the data cache is different on the G5 than on the G4, and the biggest difference is the cache line size: it's four times as big. What does that mean? Well, it means you might get one fourth the cache misses if your data is organized nicely. It may also mean that you're getting really terrible performance: if you're accessing one byte, skipping 127, and accessing another one byte, at that point you're getting less than 1 percent efficiency from your cache. So what you want to do, and this is sort of basic CS, is pack your data together to maximize its locality, so that as you walk through your array you're stepping sequentially rather than jumping around. This has the additional benefit of triggering the hardware prefetcher: the CPU is automatically going to detect that you're walking in a straight line, either up or down through memory, and start prefetching cache lines from memory into the cache.

So again, here's another simple example, a classic
two-dimensional array where we're walking the wrong way through it. We're iterating on the rows rather than the columns, so we're skipping large chunks of memory. In this case, what you'd want to do is switch the for loops so that you sequentially access every element in this array. Any guesses on how much faster this is going to be? Big difference, right? This is simple stuff, but it's 30 times faster if you do the right thing rather than the wrong thing. It highlights how important memory access is.

So I just want to summarize some of the things you should be doing and looking at while you're trying to optimize code. The first thing to do is try to unroll and schedule important loops, because you have all these execution units, the independent floating-point units and the independent load/store units, and you have hundreds of instructions in flight. This number is actually out of date as well: you can now use AltiVec to calculate more than 36 gigaflops at 2.5 GHz. This is of course the best solution if you have code that's just massively parallel, where you can operate on all the elements simultaneously.

For those of you writing floating-point code, the G5 has a hardware square root instruction, which can be enabled in GCC with the -mpowerpc-gpopt flag. XLC will recognize that this instruction is available if you specify the G5 architecture. This has made a very large difference in some ray tracers and renderers and other programs I've looked at that have a heavy dependence on square root. If you're using 64-bit integers, long long in C, you can turn on flags to specify that you have a 64-bit machine, because the G5 truly does have 64-bit integer registers. This can be a huge difference for code, compared to actually breaking values up into 32-byte chunks, or 32-bit chunks, sorry.

So again, the system and the chip were designed for high bandwidth. They were designed to do lots of things in parallel. It's part of the
server heritage coming from the POWER4. You have 40 gigabytes per second to the L1 cache, up to 80 gigabytes per second between the caches, and up to 5 gigabytes per second to and from main memory. The way you want to take advantage of all this theoretical throughput and put it into practice is to take advantage of the hardware prefetch engines. These will start scooping data out of main memory and bringing it into cache before you actually need it. And that's all I have, so with that I'd like to introduce Steve vaquita from the IBM compiler team.

Good morning, all. I'm actually very excited to be here, as we have now introduced our XL C/C++ and Fortran compilers for Mac OS X, or rather Mac OS 10, my apologies.

So, IBM compiler technology: we've been in the business of compilation technologies for over 15 years, exploiting primarily PowerPC technology, but we've also been on about nine other platforms, mainly IBM platforms. Among all this technology we've got numerous types of optimization patents that truly exploit the PowerPC technology.

Our goal in the IBM compiler team is actually threefold. The first one is to exploit the hardware. Our key here is to drive out the maximum performance we can possibly get out of the G5 processor. For this we have an extensive portfolio of optimizations. These include things like interprocedural analysis, which does whole-program analysis, profile-directed feedback, and loop optimizations for parallelism, instruction scheduling, and locality. One of the things that we do regularly is work very closely with the chip architecture team. We've been working with the core team that actually developed the chips, providing them with information as to ways that they may want to change the actual chip, and also learning the type of information that we can exploit within our own compilers.

The second thing that our compiler group
is really focused on is specifications and standards. For our C and C++, we are ISO C 1999 and C++ 1998 compliant. For Fortran, we're Fortran 77, 90, 95, and partial 2003. We also have OpenMP support, primarily; that was introduced on the AIX platforms, and we're bringing it over to Mac OS X. Also, developers within our C, C++, and Fortran teams are representatives on the standards committees. They're not only on the ISO standards committees, but they're also on the OpenMP consortium. Because we're really focused on compatibility and on standard specifications, source code that can be pumped through our compilers is easily portable between numerous platforms, for example Mac OS X, Linux, AIX, and our mainframe z/OS.

The third thing that we're really focused on is customer care. We work very closely with various ISVs and also customers on tuning their code. As a matter of fact, we're down in the optimization lab all this week, and some of our engineers have been working very closely with people who have brought in their code. We've actually seen speedups of anywhere between twenty percent and two hundred percent, even in a short period of time, by using our compilers to exploit their code.

The C, C++, and Fortran compilers on Mac OS 10 are based on our AIX and Linux compilers. On AIX and Linux we call it, right now, VisualAge C++. The VisualAge C++ compilers on those platforms are essentially the same compilers that we have for the Mac OS 10 platform, so this actually leverages all the proven optimizations and language specifications that we've already introduced on those platforms.

Some of the things common to our XL C and C++ compilers and Fortran are, as I mentioned already, exploitation of the G5 architecture. We are integrated with Xcode, with symbolic debugging through GDB, and we also support a number of Apple's profiling tools. Shark in particular is just an outstanding tool for helping tune your code. You know, for sure, even
ourselves, we wish that was available on some of IBM's platforms. There are two other things that we have as part of what we call the technology preview. These are features that we are actually looking at bringing into our product; although they're there right now, they're not formally supported. One in particular is OpenMP, where our direction is to have full support for OpenMP 2.0, and the other one is automatic parallelization.

Specifically for C and C++, as I mentioned, there's the standards compliance for C99 and C++98, and exploitation of AltiVec: this compiler can now actually generate code for the AltiVec instructions. One of the things that we are looking at on an ongoing basis in our research and development is automatic SIMDization, otherwise known as automatic generation of AltiVec instructions. These are things that we are definitely focused on and looking at for future releases. Compatibility with GCC 3.3 is twofold. One is that we are fully binary compatible, so you can intermix GCC objects with our compiler, and the other is that we have a number of language extensions that are GCC-specific, so you get source code compatibility for C and C++. We also have an Objective-C technology preview.

Then for XL Fortran, as I mentioned, we already have Fortran 77, 90, 95, and partial 2003. We've also introduced many IBM and common industry-standard language extensions, and these include some from VM, from z/OS, and from other well-known platforms for Fortran.

So that's actually just a quick overview of the XL C/C++ and Fortran compilers. If you have any questions on how to exploit your code, or how to gain even more optimization capabilities and performance on your G5, come on down to the optimization lab. A number of us from IBM are there to help answer any of your questions. Thanks, all. The next person up is Ron tight.

Thank you, Steve. Well, this has been a terrific year for the G5, and I think we all
now understand what kind of power lurks in that box. But those of us who have really worked a lot with it understand what it takes to extract that power. Sanjay covered some of that, and of course Apple offers a set of tools that really facilitates understanding your program and being able to aggressively extract that power. I want to talk today about what the compiler can do to get you started on that path, because not everyone is ready to step up and start tuning their program and changing their algorithms and so forth.

Sanjay mentioned a number of compiler options that can help you in certain situations. What we have done within the compiler group is actually put together a mode we call -fast, and I want to talk about that today. I also want to talk about feedback-directed optimization, which is another component of the -fast mode that can help you significantly. And then of course you've all heard the announcement this week that we're on the path to deliver, with Tiger, our initial cut at autovectorization, and I want to talk about exactly what that is.

So, GCC in the -fast mode. Could I just ask if anyone in here is using this mode today? Ah, gee, amazing. We've had so little feedback on how it has been working for people that we wondered if anyone was using it, and that's why we wanted to talk about it today. The -fast mode is really a collection of a lot of the compiler options, but in many ways it's more than just a collection of options. We put them together in a homogeneous fashion, to the best of our ability, to target what I would call typical applications. Of course, we all know there's no such thing as a typical application, but in this case I mean applications that are computationally intensive, so they do a lot of mathematical computation. We've tried to target a mode that will give you a first step into getting some of the performance. However, the details of when you use that mode are important, so you
can't totally say "I don't understand my program and what's going on in my program." So I'll talk a little bit about the details that are important and give you a good feel, at least, for -fast and what it's trying to do. Then finally, there is a variant of -fast called -fastcp, and that's really what you should be using if you're working with C++. There are some things that we do slightly differently to try to address performance in the C++ world.

So what are some of the specifics of -fast mode? What are we actually trying to attack? Well, Sanjay and others have talked about the deeply pipelined nature of the architecture and the wide functional units. One of the things that you have to really be concerned about to get performance is keeping the pipeline filled, as we call it. There are a number of optimizations, some of which Sanjay mentioned, that we have brought together to try to keep this pipeline filled, so we're feeding this monster at the speed it would like to be fed. I want to talk a little bit about standard conformance and some of the things that we do to relax the rules so that the compiler can actually do a better job for you in terms of optimization. And then finally, of course, the G5 instruction set; this is a presentation on the G5.

To start off with, I don't know how many of you have ventured into the -O3 level optimization mode, but I want you to know that's just the starting point for -fast, so you'll get that with -fast. Along with that come a couple of important options. One is inline functions. You may understand this, but basically it says that within a compilation unit, the compiler can use some heuristics to determine how to inline functions within that compilation unit. The real purpose behind the compiler doing that is that the more code the compiler has an inline view of, the better all of the optimizations can be performed. So the bigger view that
the optimizer has, the better the optimization. The second is the rename-registers option. What this does is simply give the compiler more freedom in terms of its register allocation, and it does so at the expense of you being able to debug your code. But if you're on this ragged edge of trying to get optimum performance, that is one of the pitfalls you have to deal with.

The second capability that I want to talk about is inter-module inlining, or function inlining across the whole program. The previous inlining option looks at one compilation unit; inter-module function inlining looks at the whole program, so it gives you that many more opportunities to consider inlining throughout your program. Once again, there are heuristics that we have determined are the best when you're making guesses about inlining and you really don't know whether functions are called a lot or not. I'll be mentioning another feature a little later on that deals with that. This is a command line, then, that would represent you invoking inter-module function inlining. Basically, it's triggered by putting all of your compilation units on the same compile line, so the compiler can look at them all at once.

The next one, and Sanjay talked about this, has to do with loop unrolling. The compiler can actually do that loop unrolling for you. Once again this is a very simple-minded loop, but it will serve as a representation here; the compiler can actually deal with more complex loops. Unrolling simply means that it reduces the number of iterations of the loop and puts various iterations actually in line. Once again, what you're doing is cutting down on the branching operations and trying to give the scheduler more opportunities for scheduling the other operations in the functional units. There is another form of loop unrolling that the compiler does, called loop peeling. In that situation, you can see here, we have an even smaller loop and
only a few iterations in the code, and the compiler will simply unroll the entire loop and eliminate the loop altogether.

The next option is loop transposition, and we have several loop transpositions. This is similar to what Sanjay was talking about, and I think he indicated you could turn on this option. Basically, we have a doubly nested loop here, and it is stepping through memory in fairly large increments, in this case 1335, and that has a terrible effect on the paging within the machine. So for data locality reasons we include this loop transpose function. The compiler is able to recognize that situation and actually do the transposition of the loop, so that now we're stepping in increments of 1 through memory.

We have a specialized optimization called loop-to-memset. If you have initialization types of loops, where you're initializing arrays to zero, the compiler will actually transform that into a call to memset. And memset on each of our architectures, including the G5, has been highly tuned in such a way that you can't beat it with your own code.

Last, we talked about tuning for G5. Tuning for G5 is really important because it tells the compiler this is a G5 architecture. The compiler then understands how to schedule instructions for the maximum grouping, so that we can keep all of the functional units going in parallel as much as possible, so we're really extracting the power of having a wide functional unit set.

Okay, I mentioned standard conformance, so I have a couple of relaxation rules here. One of them has to do with an option called strict aliasing. Aliasing is the situation where you have two pointers and those pointers are actually pointing to the same object, so those objects are then aliased. The compiler often can't tell, even if they are different data types, whether in fact the objects are aliased, and so it has to make the assumption that they are. Well, you can help
the compiler out. This is where your program knowledge comes in: if you know that your pointers are never aliased within your program, you can tell the compiler to use strict aliasing assumptions. Here is what this means in the example, and this is a very simple example. Strict aliasing tells the compiler that if pointers are of different data types, they will not be aliased; you can assume they're not aliased. So in this particular case, without strict aliasing, we would actually have to reload *pi before we return it. By saying strict aliasing, we're able to understand that we don't have to worry about reloading it, and in fact the value can be put into a register and returned that way. Interestingly enough, this can have a pretty big impact in many programs.

The second item that falls within the area of conformance is our fast-math option. Fast math, you should understand, is not IEEE conformant. But, by the way, almost all code doesn't actually require conformance, and you know if yours does. By relaxing that rule, the compiler can assume that the associative, distributive, and commutative principles hold, so it can actually rearrange code in the fashion that I show here on the screen, to best schedule the math and the computation of these operations within the system. This is another one that can really win for you. If you really don't need to know, on the boundary conditions, whether a value is not a number, is a number, or is infinity, then you should try using fast math in your program.

Now, hardware specifics. Of course we say -mcpu=G5, and in the GCC compiler that says you're perfectly free to use any G5 instructions that are available. And then inline floor happens to make use of a couple of specialized instructions in the G5 to actually put the floor intrinsic right in line. The next three have to
do with alignment. One of the things that we've learned about the G5 is that it's very sensitive to alignment, and you can make dramatic improvements in your code performance if you try to deliver well-aligned data and well-aligned code. In this case we're aligning loops, jumps and functions all on 16-byte boundaries, and yes, this does cause some bloat in your program, but our experience is that the performance gain far outweighs the bloat, the size increase in the program. The last item there, -malign-natural, says align all data types on their natural boundary. So you have to be cognizant of that if you're concerned about data being packed together and things of that nature, because data types will then be aligned with perhaps gaps in them. Moving on from the -fast part of this, I would just encourage you to give that a try, and feel free to give us feedback in terms of problems you have. One of the things that it can help you trigger is the need for going to the next step of analysis. If in fact you use -fast and you don't see speedup within your code, then you should be thinking, well, maybe I've got some algorithmic problems or some memory access problems that are the real performance killers, and there's no way the compiler is going to optimize you out of this. You will have to go to Shark and try to understand the program and what's causing that to happen. The next thing I want to talk about is feedback-directed optimization. Feedback-directed optimization is really an optimization that allows you to tell the compiler in more detail exactly how you expect your code to execute, and the compiler will take that knowledge into account and will do a better job of optimizing. It's used, number one, for inlining. The concern about inlining, and it was mentioned by Sanjay, is that if you over-inline you can kill performance as well. Well, using feedback-directed optimization we actually tell the compiler,
from the results of a training run, exactly how many times a function was called at a call site, and how many iterations there are in a loop that has a function call inside it, and so you can make very good decisions in terms of the performance versus size trade-off, as opposed to using guesses, which are the normal parameters that we look at. The second thing it's used for is what we call hot and cold partitioning. The best example I would have for that is an if statement: you have two branches, one of those gets executed predominantly, and the other one only occasionally, maybe only in an error condition. So we tag the hot one and start grouping the hot code together, and we take the cold code and move that off together at the end of the program. That way we help to compact the program down and keep the footprint for that program small, so that once again we reduce paging. In operation, there are a couple of flags that you use to do this. First you would use the create profile flag, and you would create an executable that is instrumented such that it can gather the profiling information. You run that with a training set of data, then you rebuild your program, optimizing it using the profile that you just created. Not all applications, I realize, lend themselves to this type of profiling; maybe it's an interactive type of application. But certainly if you have computationally intensive applications that work on large data sets, training the compiler to optimize the application for that is really a great thing to do, well worth the effort. Then finally I want to talk about auto-vectorization. Just out of curiosity, how many are using the AltiVec processor today? OK, we have quite a few hardy souls, but you also understand that it doesn't come for free; it takes some work to program it today. What we are doing is trying to open up the vista of using AltiVec to a broader scope of
folks, and so to areas where you may not in fact want to spend the effort tuning it yourself. So what is auto-vectorization? It's simply the compiler being able to transform serial loops into vectorizable loops. And what are vectors, for those who don't know? Well, vectors are 128 bits that can be operated on in a number of different sizes, for integers and floating point and bit operations and so forth. All of those operations within that 128 bits actually occur in parallel, and therein lies the speedup. Just a quick overview: the types of operations are arithmetic, logical, compare, rotate and shift, but they're all done within the vector unit, and of course on the data types we just talked about. On the DVD that you've received there is a preview compiler, a preview of the 3.5 compiler, and that is a first introduction for you to auto-vectorization. It has limitations, and our goal is to really work on those limitations between now and the time it's released with Tiger. But today, what can it handle? It can handle loops with both known and unknown bounds, and there is different code that we have to generate to discover the loop iteration count at run time if it's not known; loops with even and odd vector lengths; loops with conditionals, particularly simple conditionals; and misaligned vectors on loads. So we're able to take unaligned vectors, and what I mean by that is, once again, AltiVec operates on a 16-byte-boundary type of vector. If your vectors are not aligned on 16-byte boundaries, and you can get that alignment from malloc'ed arrays and of course your own arrays that you allocate, we go through vector operations to align them, and I'll show you a little bit about the performance penalty that can occur when you do that. Auto-vectorization has difficulties with pointers and aliasing. Well, I talked a little bit about that before in the scalar part of the presentation; that's true here as well. In this particular example there
is no way for the compiler to tell: unless a and b are globals or locals, and they're certainly not local within this function, so unless they're globals, there's no way the compiler can discover that these are not aliased, and so in today's world it will have to make the assumption that they are and not vectorize this loop. However, you can help the compiler out in a simple way: you can use the restrict keyword. Restrict tells the compiler, OK, this pointer does not alias with any other object. So that's a simple help; it turns a loop that can't be vectorized today into one that can. The next thing that it has difficulty with, and that you need to watch out for, is that scalar loops may have data dependencies that work perfectly fine in scalar mode, but when you try to transform that into a vector type of operation, where you have a number of elements being computed at the same time, you can't have those dependencies. The first illustration of a loop here is one which we simply couldn't vectorize, because you have this data dependency. The second one looks similar, but in fact there's no data dependency here because this is all offset by N. And then misaligned vector stores: we simply can't handle that in the preview. We'll have that available in the GM release, but if you're going to play with the 3.5 compiler, be aware that the vector you're storing into needs to be correctly aligned in memory. So what is using auto-vectorization all about? Well, it's about performance, and so I have some initial numbers here. These are already out of date as we continue to tune the code, but for simple types of operations in loops you can see speedups here that go all the way to 14 times, and we're now seeing even around 20 times in some of our work. If you have misaligned data, the type of impact that you can see is that you really reduce the performance significantly. Now this, as I said, I expect to improve.
We're in a very early stage with auto-vectorization. We have a limited set of loops that we're able to recognize and vectorize, and so I would encourage you to take a look at this. We are really open to you sending us kernels of code, things that we don't seem to be able to vectorize, because we want to build up and mature that ability, and this will be something that we keep working at. As you can see, though, the reason we're excited is because this can really offer some speedups, particularly if you haven't already been using the AltiVec processor: on your system it's sitting there just wasting away, and you can get some real performance out of it. The operation here includes the enabling of a couple of options. I believe that in Xcode today there's actually an option for auto-vectorization that will do that for you and enable the process. So if you're looking for more information, you can contact Mark Tozer or Matthew Formica. And Mark, you want to come up? So, to add to that, in the reference library there's some documentation on Apple's developer website, some tech notes that were posted since last developers conference with a lot of information. We'll have George Warner up here in a few seconds for the Q&A, who participated in writing some of that technical documentation. A couple of takeaways I want to make sure you go away with this morning. It's been a year since we introduced the PowerPC chip, as I said earlier, so you should be looking at transitioning your code to the G5. You should be looking at optimizing the code and making sure that it performs at its best. Optimization is a skill; it is not something that comes for free. All those tools that you heard about today, and the compilers, both Apple's compilers as well as IBM's, will provide a lot of assistance, but there are some times when you need to get in there, roll up your sleeves and do the hard work. For that we have optimization workshops at Apple. For the past year we've had over
eight workshops, one a month essentially, helping developers like yourselves to work through the problems of optimizing your code. So I encourage you to participate in those workshops. They're posted through the Apple Developer Connection emails, and they'll be continuing throughout the rest of the year. I'll have the next one starting in August, I believe the first or second week of August if I remember correctly. The other thing is that it does take a lot of work to do optimization, but there are a lot of rewards to it, as Longet pointed out with some of the sample code. We're here to help you through those problems. As Steve mentioned from IBM, there is the optimization lab here all week; please take advantage of those resources. We're here and committed to helping you guys write the best code for this platform. We feel that the G5 has a lot to offer; it has a lot of headroom to grow. The best applications on the platform are those that take advantage of all the abilities that the hardware has to offer. So please, again, make sure that that's something you keep in mind when you're revving your application, writing a new application, or just taking the time to take a look at what you've done in the past and maybe improve upon that.
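
Editor's illustration for the auto-vectorization part of the session: the slides are not reproduced in this transcript, so what follows is only a minimal sketch in C of the two points made about the restrict keyword and loop-carried data dependencies. The function names scale_add and running_sum, and their parameters, are hypothetical and were chosen for illustration; they are not the code shown on screen. Whether a given compiler actually vectorizes the first loop depends on the compiler version and options.

```c
#include <stddef.h>

/* Hypothetical example, not from the session slides.
 * The restrict qualifiers promise the compiler that dst and src never
 * refer to overlapping memory, so each iteration is independent and the
 * loop is a candidate for vectorization. */
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = dst[i] + k * src[i];
}

/* Hypothetical counter-example: each iteration reads the element written
 * by the previous iteration (a loop-carried dependency), so the elements
 * cannot simply be computed several at a time. */
void running_sum(float *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];
}
```

Calling scale_add with two separate arrays keeps the restrict promise; passing the same array as both arguments would break it and make the behavior undefined.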