---
title: WWDC2003 Session 304
framework: wwdc
role: article
path: wwdc/wwdc2003-304
---

# WWDC2003 Session 304

## Transcript

As you probably all know, we GM'd a new GCC 3.3 compiler this week at the show. Today I would like to introduce you to that compiler and answer any questions that may be preventing you from moving forward, and I encourage you to move to 3.3. I also want to introduce you to one of our partners from IBM, whom we worked with very closely in delivering the 3.3 compiler, particularly focused on the G5 and the new G5 system, and to leave you with some idea of future directions for us.

We had a challenge in putting together the 3.3 compiler. We wanted to deliver a very robust compiler, because we know that's what you expect. We wanted to address compile speed; we've heard a lot from you about compile speed and about comparing us against the CodeWarrior compiler. And we wanted to address code quality, because we were introducing a brand new system this week, the G5 system, and we needed to have terrific optimizations for that system coming out of the chute. That's a very hard problem. I've worked in the compiler area for quite a number of years, and usually you get to pick two of those three: you can deliver a robust compiler with great code quality, or you can deliver a robust compiler that's really fast. But to try to deliver a compiler that meets all three of those goals is a real challenge. So I want to talk a little bit about why we were able to meet that challenge, and I think you'll see that when you install and use GCC 3.3.

First of all, as you know, it's based on GCC, the Free Software Foundation technology, and today that technology is very mature and stable and continuing to evolve, primarily because of the advent of Linux systems and Linux system vendors putting together production release compilers. Since that time the GCC compiler has really matured significantly. There are billions of lines of code that have been put through it. We build our own Mac OS X
operating system with the GCC compiler. We have very extensive test suites, something on the order of 30,000 tests a day and growing all the time. So it's just a very robust experience, and I want to assure you that when you install the 3.3 compiler you'll see the same thing. One example: one of our early seeds for 3.3 was one of you folks with an application of about 26 million lines of C++ code. That entire application was migrated to 3.3 with essentially no problems.

So what does Apple contribute? Because we also have been putting a big effort in, and I want to tell you that we're a very active member of the free software community. We have experienced engineers on staff who have worked in that community for quite a while. We're an active maintainer. The Free Software Foundation has actually recognized some of our efforts, and they have made Darwin a reference platform for GCC releases, something that is really fantastic.

New optimizations: as I mentioned, the G5 was really important to us for this release. This system, as you've probably seen in some other presentations, is an order of magnitude more complex to deal with than the G4, and part of our mission is to make that happen for you, so that you don't have to do all the work of extracting the power from the G5. So we've done quite a bit of optimization. You'll hear some about that today, and you'll hear some from our partner IBM, who worked with us on that. As I mentioned, compilation speed is important, and so we made that a high priority within our organization, and I will show you some of the performance improvements in terms of compilation speed. And finally, just general maintenance: our customers report problems, we fix them, and we put the fixes into the community. So we're very active and correct members, if you will, of the community.

So let me get into talking about the three things that we focused on for this release besides robustness. In the language feature area, as you know, GCC is already very compliant
with the ISO and ANSI standards, so it's mature in that way, but it's not a hundred percent compliant, and so it continues to grow. In this release you'll see some of that, particularly with respect to now being able to identify incorrect code that may have slipped by in earlier releases of the compiler. We've also added a feature in Panther, wide character support, which was a missing feature. Now, as you know, within our own frameworks and within Apple we encourage you to use Unicode, which is a much more powerful means of dealing with wide characters, but we have heard your desire to have applications more easily ported into our environment, and in some cases that requires wide characters. So that will be there with the Panther release; you will not have it with Jaguar.

Objective-C and Objective-C++, of course, we maintain, and we do enhancement and testing. One of the problems that you've told us about is that we did not have an exception model, particularly for Objective-C, and so we've added that. That's also available in conjunction with the Panther OS release. And finally, and I know you've heard some of these things in other forums, we've added another migration tool for you, which is the ability to use inline assembly, or port inline assembly, from CodeWarrior. It's completely compatible with CodeWarrior, other than bugs you may find since it's a brand new feature, but I think it's pretty robust; we've ported quite a bit of code with it.

Compatibility from release to release is always a concern and an issue. The main thing I'd like to point out to you is that what you see here is a very short list of issues to deal with, so we're compatible in many, many ways. One of the things that you will need to do, though, is rebuild your C++ apps. That's a necessity: in making some enhancements to C++ we've had an ABI change, and so you simply need to recompile all of your C++ apps with 3.3. As you know,
we've been struggling with precompiled headers over the years in terms of providing a complete, compliant, and very robust solution. We've had cpp-precomp as an early version of that; for a number of years it served us well, but it had limitations. We introduced PFE this time last year, and PFE brought you the entire language coverage that we didn't have in cpp-precomp. PCH is essentially the next generation of PFE: a more robust, faster implementation across all languages. This is a feature that is going back to the Free Software Foundation and will be in other vendors' releases of compilers in the future.

The other thing that we have done within GCC 3.3 came from looking at the migration of some code, where people were getting tripped up a little bit by the flags. We had certain flags that you use for cpp-precomp, we have some others for PFE, and now we have PCH flags. In your development environment, getting those flags corrected, and even recognizing that they needed correcting, was a problem. So we've made the use of flags that are invalid for the particular compiler you're using a warning, so you'll get a heads up. If you see warnings about compiler flags now that you hadn't seen before, you should deal with those; they're really there to try to help you out.

One of the things that we get as a result of deprecating cpp-precomp is this: cpp-precomp was built on a preprocessor model that was K&R based, and so you had to flip some switches within the compiler if you wanted to actually use the ANSI standard C preprocessor. Now that we no longer have that encumbrance, we have made the default preprocessor the ANSI standard preprocessor. It aligns us with all other GCC compilers that are being delivered today.

And finally, one other flag that I would just warn you about, because we've seen some resulting difficulty with this as well: the no-standard-include flag, `-nostdinc`, was originally intended to not look in any system
directories. We found that we actually had a problem with the implementation in the 3.1 compiler, and it was in fact looking in, and finding header files in, a system directory. Well, that's corrected now, and so if you were in fact taking advantage of the fact that certain system directories were being searched, then you'll need to deal with that as well.

Compile speed: really big deal. We began focusing on this about a year ago, and we had an offering, and I'll show you some performance numbers, in the December timeframe where we made improvements. We continue to improve this. We have set our goals very high in terms of being able to provide speedy compilation, so let me talk a little bit about what we've done to do that.

Most of you, if not all, are familiar with precompiled headers. If you are, just bear with me a little bit, because we find that not all of our customers are using them, and they really are beneficial to make use of. The picture that you see here is simply a representation of a precompiled header. You create it with the compiler: the header file is precompiled and saved in an intermediate form for the compiler. Then, when you begin to compile your own program, where you make reference to a header that is precompiled, the compiler is able to restore that state and continue the compilation process. This makes the assumption that header files are a significant portion of your actual compilation process, and that's certainly true if you're using large frameworks like Carbon or Cocoa or AppKit or others like that. So your code is typically a very small portion of what is actually compiled when you're building files like that.

Precompiled headers work best if you can identify a common set of headers that are used by all of the files that you're building in a project. What you can do with that is that you can
create a prefix header, as we call it, which is a precompiled header that contains those common header files. Then you simply include that prefix header in the file that you're building, and the compiler will restore a significant amount of state, bypass all of that compilation process, and move straight into compiling the rest of the code.

The next feature is predictive compilation, and I know that's been talked about a little bit in the Xcode discussions. I have a picture that maybe will help you understand. In predictive compilation we once again are making the assumption that the header files are a big portion of your compilation process, and furthermore that if you're going to modify a header, you do that in the context of modifying that header file, not the source file you're editing. So what happens with predictive compilation is that when you begin editing a file, the compiler in the background begins the compilation process on the header files, and it will proceed up to the point where your code, the code that you're working on, begins. Then once you save that code, the compiler proceeds on and predictively compiles that file for you. You can see a representation of that in the timeline that I provided here.

The final feature, and all of these, by the way, I'm mentioning because, even though they are driven by Xcode, they required modification to the compiler to work synergistically with Xcode, is distributed compilation. Xcode is able to actually distribute your compiles out to other machines that have been designated as available for that processing, so it'll make maximum use of them. It sends over the source file and it sends over the header files that you may need, so there's no special setting up of the environment on the different machines. The only requirement is that you're running the same generation of OS, so if you're doing this on Panther you'll be sharing all Panther machines. But this is a
way that can really speed up your performance as well. So let me just show you the result of this. If you look at this chart, as I mentioned, over on the left, in the Jaguar timeframe, we introduced PFE, and that was our first language-wide performance enhancement. The application that I'm showing the data from here, by the way, is what I would call a moderately sized C++ app: it's about 250,000 lines of C++ code, it's Carbon based, and it has over a thousand files. The middle point, which we offered in the December toolset, was primarily the addition of being able to use the dual processors on your tower, so that using both processors we were able to significantly reduce the amount of compile time. That still had PFE, though. With the introduction this week of GCC 3.3 and Xcode, PCH is the new capability, and you'll see on the chart there the performance over and above PFE that we were able to obtain with PCH. I want to draw your attention, though, to one other point on there, and that's the green one down at the bottom that says distributed build. This is the result of building this application with six Xserves, dual-processor Xserves. Six because we had them available; I actually don't know that this app required the use of all six. But essentially you have a means now of adding horsepower to get faster performance, even above dual processor.

So, code quality. I mentioned that we had a big focus on code quality, particularly with respect to the G5 processor. When you're trying to focus on code quality within compilers, probably the most important tool you can have is a real benchmark, and that's a benchmark that is in a harness so it's easy to run, it provides reproducible results, and it represents applications similar to what you deal with. Prior to the G5 we used an internal benchmark called Skidmarks, and Skidmarks was composed of kernels of code from various applications that we felt were important at Apple. When we reached the G5, though, we had a problem, and the
problem was that not only were the files small enough to fit into cache, so that they didn't really represent true applications on the system, but they also weren't self-checking, so that if we broke an optimization the code might still run, and whether it was producing correct results we had no idea; we just knew how fast it ran. So we looked around in the industry and we chose the SPEC benchmark. There are pluses and minuses with the SPEC benchmarks. You can argue as to whether they're representative of Macintosh applications or not, but we determined that they were close enough that we could make some good progress. And, by the way, we also needed to produce some SPEC numbers for this new processor. So SPEC represents 12 integer benchmarks and 14 floating-point benchmarks. They're real applications that have been encapsulated into the test bench, they run in a harness, they're predictable, and they validate their results.

So let me just show you the results of G5 performance with the SPEC benchmarks. What I'm comparing here is our 3.1 compiler versus our 3.3 compiler: the compiler you're using today versus the one that hopefully you'll be using later on this afternoon. You can see from SPECint, and this is an aggregate across the 12 benchmarks in SPECint, that we actually have a 17 percent performance increase over what we're able to extract from the 3.1 compiler. Correspondingly for floating point, and this is a floating-point machine, as has been noted in almost every presentation, we're able to get thirty percent better performance than we can with the 3.1 compiler on a G5. And this is a single-processor 2 GHz G5.

So what does this mean for you? How can you take advantage of this, build your code for the G5, and get performance? Well, there are several situations that you may be dealing with. First of all, you may have an app that you want to run on more systems than just the
G5. You may have an app that has some computationally intensive code, but by and large the rest of the app is not dependent upon the performance of a G5 system. So you might break up the app such that you put the computationally intensive portion into a runtime library, tune that very heavily for the G5, and leave the rest of your code as G4 or G3 code. You can produce runtime libraries for each category of system. You can do that by turning on `-mcpu` to get G5 instructions and `-mtune` to get G5 scheduling.

Let's suppose, though, that your app also uses long long or double data types. Then you may want to take advantage of the 64-bit arithmetic power of the G5. What we have provided is that when you use the `-mpowerpc64` switch you actually enable 64-bit arithmetic, and it does amazing things with your long long arithmetic or your doubles, loading and storing 64 bits and doing arithmetic on that. Believe me, you'll applaud once you've had a chance to taste it.

We have always encouraged you to use the optimization for space, `-Os`, because that can have a big impact in the system, and we're not discouraging that now. If you have code that is not computationally intensive, by all means build it with `-Os`. What we are encouraging you to do, though, is be a little more adventuresome. If you want to get performance out of this machine, you need to deal with the higher levels of optimization, `-O2` and `-O3`, if at all possible and if it lends itself to your code. You want to deal with things like the alignment of loops and jumps and functions, and make sure that they're aligned on proper boundaries. If you have followed along with some of the architectural discussions, and David will be talking about this a little bit in a moment, alignment can be very important in terms of extracting speed.

So let me call David Edelsohn up to the stage. David and others within IBM have worked very heavily with us on
optimizing for the G5 in the 3.3 release. David?

All right, thanks very much. My name is David Edelsohn and I work at IBM Research. My main focus there is on open-source issues and technology, particularly GCC. I'm glad to see so many people here at the conference interested in GCC. First let me give you a little idea of what I'm going to talk about in this presentation. I'll be talking about the PowerPC 970 processor and the issues in the processor that are important for a compiler to target to get the highest performance. We're also going to talk about how we have optimized GCC to extract the best performance on this processor that Apple is now using. And we're going to mention a little bit about the future optimizations that are going to be coming down the pike in later releases of GCC, and about how IBM and Apple are both working together with the open source community to engage them and produce the best compiler for the PowerPC processors.

So let's start with a little bit of information about this big step in processor performance. We now have a 64-bit PowerPC processor on the desktop. The PowerPC architecture was designed from day one as a 64-bit architecture. The previous implementations have been the 32-bit subset, and now we have an implementation in Apple's computers that uses the full 64-bit architecture. This architecture has instructions that operate on both 32-bit data and 64-bit data. In 64-bit mode one is operating on twice the amount of data, and it's the same performance whether you're in 32-bit mode or 64-bit mode: the same latency, the same speed of the instructions, and they're operating on twice the amount of data. This processor has a lot of resources at its disposal to attack whatever application you have to throw at it. It has two completely symmetric floating-point units, it has two symmetric load/store units, and it has two almost symmetric integer units. These are a lot of
capabilities to bring to bear on your problems, and this increase in resources is an increase in performance for your application. So this is a pictorial depiction of the processor core in the 970. I just want you to have a visual representation of this processor for the upcoming slides.

So let's go through and talk about how an instruction flows through this processor. First, instructions are fetched from the L1 cache into the instruction queue. From the instruction queue the instructions are then decoded and placed into a dispatch group, and up to five instructions can be dispatched at a time to the various function units' issue queues. From the issue queues you can dispatch to any of these 12 function units in the processor. So this is a lot of resources, a very powerful chip, and a lot of complexity that the compiler needs to harness to give you the best performance.

So what are some of the challenges the compiler needs to deal with? First of all, the processor is dealing with internal ops. The actual function units are operating on operations that are mostly one-for-one equivalent to PowerPC instructions, but some of the more complicated instructions in the PowerPC architecture are broken apart, or cracked, and those pieces are what the function units actually operate on. Examples of these instructions are: load algebraic, which is a load with a sign extension; the load with update form, which is a load and then an update of the address register; various forms of stores, which, although not all of them are cracked, are actually issued as a register access and an address generation; implicit compares, which are the mnemonics that have a dot at the end, where one performs a logical or arithmetic operation and then a comparison with zero that stores its result in a condition register; and finally the multi-field condition register operations, where one is performing a condition register logical operation on multiple condition register fields at the same time. Those are
examples of the user application instructions that are cracked.

Another area of complexity is the dispatch groups. The dispatch group isn't completely symmetric in which instruction can be placed into which slot in the dispatch group. For instance, moves to and from special purpose registers are placed in slot one, divisions are placed in slot two, and the slots also are directed to the various versions of the dual function units. So this is another area where the compiler needs to be careful to achieve the best performance. And finally, the instructions are brought from the instruction queue into the dispatch group in order, but once the dispatch groups are dispatched to the issue queues for the function units, the function units are able to pull the instructions out of order, as resources are available, to produce the most performance and get the most parallelism out of the resources available. Then the instructions are recombined at the end and completed as an entire dispatch group. So there's a lot of parallelism available if the compiler is able to take advantage of it.

So what have we done in the compiler to achieve this performance? First of all, we're preferring the non-cracked forms of instructions, again to allow the processor the greatest amount of flexibility in which instructions it can dispatch. As I mentioned about the dispatch groups, the compiler is also trying to order the instructions in the sequence in which they will be placed into dispatch slots, to allow the maximum number of instructions to be dispatched at one time and, again, to allow the maximum number of function units to be in use to work on the various problems. There's also the issue of balancing the use of the function units. We have these dual function units, and we want to make sure that one unit is not starved while the other one is overloaded, so we need to make sure that we're issuing instructions to keep both units
occupied and get the maximum performance from this chip. Additionally, there are issues with dependent instructions, where it's best for the chip if those dependent instructions are placed into separate dispatch groups to avoid stalls within the processor. And we've also done various branch optimizations to deal with other restrictions in the processor that could potentially produce stalls. So there are a lot of issues there that the compiler needs to pay attention to.

And finally, as Ron mentioned, a very important issue is the alignment of instructions. The processor's instruction fetch is fetching 32 bytes at a time, that's eight instructions, into the instruction queue, but it's fetching those instructions on a 32-byte boundary. So if one is jumping into the middle of a group of instructions that isn't on a 32-byte boundary, such as a loop, or a jump to the beginning of a function or label, then you're wasting the instruction fetch bandwidth if that's not on the appropriate boundary. So again we have additional options in the compiler, and the compiler, when tuned, will generate instructions on that boundary if it isn't going to decrease the performance by adding too many other instructions to achieve that alignment.

So I want to make sure you understand that while this processor has a lot of capabilities that the compiler can leverage, this processor works very well with any sort of code that you're going to throw at it. If you're working with G3 or G4 code, there's a lot of dynamic capability in the processor to extract the most performance out of that. You can work with the compiler to achieve the maximum performance and get that little extra percent, but if you're working on an application that needs to be targeted at the G3 and G4 and the new G5, that application will scream on all of these processors, and it'll scream on the G5 as well. So this processor will not just fall down if presented with G4 code.

So how can you help the compiler to produce
better code, now that we've discussed what we've done in the compiler group to tune for this processor? Well, first of all, one thing that you need to do is make sure that the compiler can be as aggressive as possible in its transformations of the code, and that the compiler doesn't need to be more conservative than need be. An area where this is important is aliasing. That's where the compiler cannot tell whether two different accesses to memory, two variables, could potentially point to the same place in memory. If the compiler cannot distinguish this, it needs to be more conservative to ensure that it's going to perform the calculations in the appropriate order. So what you can do to allow the compiler to see and understand more of the procedure that you're working on is to use local variables and to avoid taking the addresses of variables. This also includes areas where the calling convention may implicitly pass something by address, because then the compiler needs to sort of throw up its hands; it can't fully understand the consequences of what may be happening behind its back. So that's one area where you can help the compiler produce more effective code.

Another area is simplifying loops. If you can make a loop as simple and as direct as possible, the compiler has much more opportunity to make transformations that will improve performance. One example of this is to not have changes in flow control inside the loop. For instance, if you have a loop that has a branch inside it, if possible it would be better to move that conditional outside of the loop and actually duplicate the loop in both branches of the conditional. That allows the compiler to optimize each of those loops much more effectively, since it can better understand exactly what's going to happen. Another area is indexes
into arrays. Again, try to have as simple an index as possible, so that the compiler can understand what's going on and make transformations, with prefetching and other sorts of optimizations, that would be much more effective for the throughput of this processor. And finally, there is obeying the type rules. The type rules are essentially an agreement between the programmer and the language about how you're going to use the various types. If you use the types appropriately and don't start changing accesses behind the compiler's back in ways that it doesn't expect, you can use the most aggressive optimizations, as Ron mentioned, and get the best performance from your program.

So let me talk a little bit about where this technology came from. A lot of the work that we've been doing comes from IBM's vast experience in developing compilers over the past decades, including some of the earliest compilers for computers. We have a worldwide research and development organization that has been brought to bear, in partnership with Apple, on improving this compiler for the G5 processor. Some of this involves taking the technology that IBM developed for its POWER4 processor, from which the 970 was derived, and that was used in developing IBM's own compiler: taking that knowledge and that experience and driving that technology and understanding into GCC, to leverage all of that history of development. We're using all of that to exploit this processor and give Apple and its customers the best performance possible.

Now let me mention a little bit about where GCC is going in the future. There are a lot of very exciting optimizations that are going to be included in the compiler in the future, which will further enhance the performance on this processor. We have a new register allocator coming. There are improvements in the instruction scheduler which will allow an even better ability to model these complicated dispatch groups that I mentioned. There's work on
software pipelining, which will help improve floating-point performance. There's work on an SSA optimization infrastructure, which allows one to describe the program in a way that permits much more aggressive optimizations; this will allow easier implementation of loop optimizations and auto-vectorization in the future. There's further work on interprocedural optimizations, so that the compiler can again understand more of the program, understand what's going to happen with aliasing, and work past limitations the programmer may not have intended. There's profile-directed feedback, so that the compiler can reorder the instructions to take advantage of how the application actually runs with the data sets that you have. And of course there's continued improvement in compiler speed, to generate the code more rapidly.

So all of this stems from the partnership that IBM and Apple have in developing and further improving GCC. We are both very committed to GCC and to the open source community. There are a large number of developers for GCC inside Apple and inside IBM. I am a member of the GCC steering committee, and a member of Ron's team is on the GCC steering committee. I'm one of the maintainers of the PowerPC port of the compiler, and a member of Ron's team is also a maintainer of the compiler. We are fully engaged with this community to help design the future of this compiler, and we are taking a leadership position to make sure that this is the best compiler for Apple systems and for the PowerPC. So I just want to let you know that not only are we working aggressively on the processor and the hardware, but there's also a lot of excitement to come in further improving the performance of the system from the software side and the work on the compiler. I hope that you're excited as well. And now I also just want to quickly mention my thanks to Jon "Hannibal" Stokes for the use of one of his graphics in this talk, and hand it back over to Ron. [Applause]

Actually, it's a pretty exciting
time for us in the compiler world. When you're presented with a challenge like this, it just makes your day.

On future directions: as you can see, we've just embarked on this journey with the new processor and on extracting the performance from it. We really consider our role to be one of trying to make this processor work for you without you having to jump through hoops or go through a lot of gyrations. Right now we're telling you, please do a few gyrations if you want to get the performance out, but we're working, as David showed you, in any number of areas within the GCC community to take the optimization capabilities of the GNU compiler to the next level. We also don't consider our work done in terms of compile speed. I think you'll agree we've made some really dramatic progress in the past year, and if I were you, I would expect that we'll make as good progress in the coming year, because we're going to nail performance. That's really important, and we want to remove this compiler from being an obstruction to you getting your work done. Language features will move along at a slower pace, because the language is moving at a slower pace today, but you can count on us continuing to move towards full compliance with the ANSI and ISO standards.

So this compiler, even though it's being GM'd today, is being offered to you as a part of the Xcode tools package. There's documentation in that package that you can read, such as the release notes. One piece of documentation that you'll probably want to look for is about performance tuning your code, and I would encourage you to attend the session which occurs next. I don't think it's in this hall; I may actually have a slide on that. Okay, anyway, there's a formal session during the next period, and I would encourage you to attend that. I was at the CHUD session that preceded this, and I was happy to see an overflow room
there. We've got some dramatic capabilities in our tools to help you improve the performance of your code. If you have contacts that you want to make, there's Godfrey DiGiorgi, who's our technology manager, and you'll meet him in just a moment; his email address is there. There's a feedback address for the development tools, and of course we're always looking for bug reports. Well, I guess we're not happily looking for bug reports, but we appreciate the feedback. Here's session 305 in Presidio, the performance tools session I was talking about. Some of the other sessions that you might be interested in are AppleScript Studio, Carbon development, and so forth. And we encourage you: tomorrow there will be a feedback forum where you can comment, let us know what you would like to feed back to us, and give us your impression of the tools.
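
The advice in the talk about helping the compiler (move a branch out of a loop, work in local variables, obey the type rules) can be sketched in C as follows. This is only an illustrative sketch: the function names and data are hypothetical and do not come from the session, and whether a particular compiler version generates faster code for these forms would have to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical example: a branch inside the loop changes the flow of
 * control on every iteration. */
void scale_or_offset_branchy(float *dst, const float *src, size_t n,
                             int scale, float k)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (scale)
            dst[i] = src[i] * k;
        else
            dst[i] = src[i] + k;
    }
}

/* Same result, with the conditional moved outside and the loop
 * duplicated in both branches, so each loop body is straight-line code. */
void scale_or_offset_hoisted(float *dst, const float *src, size_t n,
                             int scale, float k)
{
    size_t i;
    if (scale) {
        for (i = 0; i < n; i++)
            dst[i] = src[i] * k;
    } else {
        for (i = 0; i < n; i++)
            dst[i] = src[i] + k;
    }
}

/* Aliasing: accumulate in a local whose address is never taken, and
 * store through the pointer once, instead of updating *total on every
 * iteration (where the compiler must assume *total may overlap values). */
void sum_into(const int *values, size_t n, int *total)
{
    int sum = 0;
    size_t i;
    for (i = 0; i < n; i++)
        sum += values[i];
    *total = sum;
}

/* Type rules: read the bits of a float by copying bytes rather than by
 * casting a float pointer to an integer pointer, which would access the
 * object through an incompatible type. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Both `scale_or_offset_*` functions fill `dst` with the same values for the same arguments; the hoisted version simply hands the optimizer two simple loops instead of one loop with a conditional in its body.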
