WWDC2004 Session 503

Transcript

Kind: captions
Language: en
Good morning, and welcome to Session 503, Optimizing for Power Mac G5. It's been about a year, actually exactly a year, since we introduced the Power Mac G5, and last year, if you were here at WWDC, most of the content that we presented dealt a lot with the architecture of the G5 chip itself. We had optimization labs running for several days, into the night, late night at times, trying to get as much information to you as a developer to understand the differences between the G4 chip architecture and the G5; there are very stark differences there. This year we want to again re-emphasize the architectural differences and why they matter to writing optimized code, but we also want to make sure that you understand how to utilize the tools that we supply in our tool set, as well as the compiler options that you have, in terms of helping you optimize your code itself. So we have several speakers this morning, from the compiler team to our performance group, as well as a guest speaker from IBM. So with that, I'd like to start off this morning and introduce Sanjay Patel from our performance group.

Good morning. This is a tough time slot. For those of you here
last year, we're going to do a bit of review. How many of you were here last year for this talk? Okay. So we're going to go through some G5 architecture, and then we want to talk about things you can do to help improve your code on the G5, and in fact all platforms, really. And then we'll turn it over to some of our compiler guys to help guide you through the process of optimizing your code. So,
to start off with, when you're talking about the G5 you have to, of course, start with the PowerPC 970 chip. This is a super-pipelined, superscalar processor that we teamed up with IBM to make; it's based on the POWER4 server architecture, and the big addition that we had to make to turn it into an Apple chip was adding what we call the AltiVec engine, also known as the Velocity Engine. This is a 128-bit vector unit which does floating-point and integer math. The other big difference is that we have a high-bandwidth, point-to-point interface that connects the chips to the memory controller, and helping take advantage of all the bandwidth that we have available in the system, in theory, are these automatic hardware prefetch engines that help put all that theoretical bandwidth to use. So this is a die shot
of the 970 chip. To hardware engineers this is kind of like pornography, you know; you look at it and you'll hear them say things like, "Look at the FPUs on that one." For software engineers, what you want to take away from this is that there are lots of execution units available for your program that all operate in parallel: two independent load/store units, two independent IEEE-compliant floating-point units, the full implementation of AltiVec, and two fixed-point units as well. So there's just a lot of space here to get things done in parallel. Another way to
look at this, from the software perspective: again, if you start at the top, you have the L1 cache, which holds your instructions. From the L1 cache, instructions flow into a fetch queue, and from there they get dispatched, up to five instructions on each clock. So at two and a half gigahertz, you're dispatching, in theory, over 10 billion instructions per second. Now, that's actually the narrow point of the 970 architecture, because everything you see here in the kind of greenish gray is out-of-order execution: once you've dispatched, you can feed 12 independent execution units through 10 issue queues. And then your instructions complete at the bottom of this picture, in order, up to 5 per clock. Now, a good way to put this in perspective is to look at how the G5 compares to the G4. We keep talking about the parallelism of this chip; one way to measure that is how many instructions you can keep in flight simultaneously, and for the G5 it's over 200 instructions in flight, compared to a little over 30 for the G4 architecture. We've also increased the pipeline stages, so
you can see that it's more than doubled for a simple integer instruction. Usually you'd say, well, that's not so good, why'd you do that? The reason we increased the pipeline stages is to help increase frequency; we just announced that we hit two and a half gigahertz, and in order to hit higher and higher frequency numbers, you increase the pipeline depth. Okay, and we talked about some of the execution units: again, we've doubled up on the load/store units, doubled up on the floating-point units, and you now have two general-purpose fixed-point units, whereas on the G4 architecture you have up to three simple units that do things like adds and subtracts, and one complex unit to handle things like multiplies and divides. The vector units are pretty similar: you have an ALU that handles floating point and integer, and you have a permute unit. For any of you who have done any vector programming with AltiVec, you know what the power of the permute unit is in terms of swizzling data out of memory and into the registers. As we work
our way out from the core, the biggest programmer-visible difference that you'll find is that the cache line size is different: it's 128 bytes, whereas it was 32 for the G4. Now, that can either be a really good thing or a bad thing, and I'll show an example of how that happens. As we work our way again out to memory: the L1 data cache is the same size, with different associativity and write policy; the L1 instruction cache is doubled in size; and the L2 cache is also doubled in size compared to the 7450. You'll notice there's no L3 cache on a G5 system, whereas you had up to two megabytes of L3 on a G4 system. Now, we've made up for that by increasing the processor bandwidth substantially: one, by doubling the width of the DDR interface and increasing its frequency, but also the front-side bus. And this slide is actually a little out of date: at two and a half gigahertz, we've increased the front-side bus frequency to 1.25 gigahertz, so you can actually, in theory, get 5 gigabytes per second in each direction.
So I want to talk about some programmer problems I've seen over the last year since we introduced the machine, and the one that comes up most frequently turns out to be a rather simple thing: conversions from floating point to integer. The reason this shows up a lot is that when you write this in C it looks really cheap, right? You just cast your variable to int, or cast it to float. But it turns out this is not cheap at all, particularly on a G5, because the chip has so much going on in parallel when it hits this condition. Because the PowerPC architecture doesn't have direct register transfers between the integer and floating-point register files, you actually go to the L1 cache and come back, so you have a load and a store operation going on.
There are a lot of things you can do to avoid the problem, and the biggest one that I've found actually turns out to be one of the easier solutions: simply don't do that. It turns up, again, because it looks so cheap and easy that a lot of people just cast from one to the other without thinking about it, and when you examine that code you realize you could have stayed in one domain or the other without hampering or affecting the algorithm in any way. The other cool way you can get around the problem: of course, if you use AltiVec you're going to get a much greater speedup because of all the parallelism in the AltiVec unit, but the AltiVec unit also handles floating point and integer identically, in the same register set, so there are no memory operations when you transfer between types. Another potential optimization is to use the GCC compiler flag -fast. This tries to schedule your loads and stores and inserts no-ops to separate them out, to keep things flowing through the G5. IBM's XL compilers also do this kind of optimization if you specify the G5 architecture. So I want to show you a quick
and really simple example of bad code. This is a really lame loop; all it does is an implicit conversion, because the loop counter, in this case i, has been declared as an int, but we're adding it into a floating-point sum. So that looks really cheap, but it's not: every time you do that add, you're going to have to convert i from integer to floating point in order to store it into the sum. So how would you get around this problem? What you can do is create what I call a shadow of the i variable in the floating-point unit. I've just named it i_fp, to denote that it's the floating-point value, and when I initialize i, I also initialize its shadow, and when I increment i, I increment its shadow. Now, inside the loop, we're going to use the floating-point value for the sum rather than the integer value. On a G5, I measured this: it turns out this code is three times faster than the previous code where you're doing the conversions, because this code won't have to do all the load and store operations.
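The shadow-variable pattern Sanjay describes can be sketched in C like this (the function names and the exact shape of the loop are my own reconstruction, not the session's slides):

```c
/* Original pattern: every iteration implicitly converts the int loop
   counter to double, which on a G5 round-trips through the L1 cache. */
double sum_with_conversions(int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += i;               /* int -> double conversion every time */
    return sum;
}

/* Shadow-variable version: keep a floating-point copy of the counter so
   the loop never crosses between the register files. */
double sum_with_shadow(int n) {
    double sum = 0.0;
    double i_fp = 0.0;          /* the "shadow" of i */
    for (int i = 0; i < n; i++) {
        sum += i_fp;            /* pure floating-point add */
        i_fp += 1.0;            /* increment the shadow alongside i */
    }
    return sum;
}
```

Both versions compute the same result; the second keeps the loop entirely in the floating-point register file, which is the point of the trick.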
The next biggest thing I've seen over the last year is just improvements in trying to schedule your code. Like I said, the G5 has a lot of execution units that can operate in parallel, but if you write dependent code, where one operation depends on the next, which depends on the next, it's all serial; you're not going to take advantage of all those units, and furthermore you're not going to take advantage of these long pipelines. You have to schedule your code so that you're filling in all these pipeline slots instead of causing bubbles in execution. So compiler help is here: you have GCC 3.3, which has G5 architecture tuning, so it tries to schedule for the available units and slots; xlc also has the same kind of flag, where you specify that you have a G5 architecture. And the other thing you can do is use Shark, which you've probably heard about and which we'll talk about in more detail tomorrow at three-thirty; we have a full session on how to use Shark and what it can do for you. Now, again, I
mentioned that we've increased the pipeline stages on the G5 compared to the G4. So what does that mean? Well, it means it takes longer, in terms of clocks, for a simple instruction to complete. For example, an addition instruction may take one cycle on a G4 but may take two cycles of latency on a G5. So what you want to do is account for that in your program, by kind of grouping a bunch of similar operations together. That means you can unroll your important loops, or you can use the compiler flag, and as well you want to schedule your code for the G5 so you're going to fill in all those pipeline slots. Now, people often ask, well, shouldn't the compiler do that for me? And in all these examples you can always ask that question: shouldn't the compiler do that for me? In some cases the compiler can do it for you if you specify the right flags, but there's always a downside if you just lean on the automatic way to work around this problem, because the compiler usually doesn't know whether a loop is important or not; that's something you have to tell it. So if you choose to unroll all loops, or unroll most loops, you're going to have a big increase in code size, which could be detrimental to your performance. That's why, as a first pass, you should profile your program and try to do some of these optimizations manually, in just the important spots. So here's another example of some
code that's, again, just a silly example, right? We're just going to sum a bunch of ones in this case. The 970 architecture, the G5, has two floating-point units, and they're each six stages long, so this code is only going to get approximately 1/12 efficiency, because every instruction is dependent on the previous sum. So there's a simple example where the code has exploded, right? Because we're trying to fill all the pipeline stages. Here we actually only unrolled 8 ways with partial sums, so we wouldn't fill all 12 pipeline slots; you would actually want to do 12 in order to maximize your gains on the G5. You can think of the floating-point units as either one 12-stage pipeline or 12 single units, but they're all going to operate in parallel. So this code turns out to be 10 times faster, just using partial sums instead of one variable.
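A sketch of the partial-sums idea (four partials here, just to keep it short; as noted above, on the G5 you would want enough to fill both six-stage FPUs; the names are mine):

```c
/* Serial version: each add depends on the previous one, so the two deep
   floating-point pipelines on the G5 sit mostly idle. */
double sum_serial(const double *a, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Unrolled version with four independent partial sums: the four chains
   can flow through the pipelines simultaneously. */
double sum_partial(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* clean up the leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```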
The other big thing you have to worry about when you're optimizing code, and this goes for all architectures but particularly for the G5, is that the G5 core is so good it makes memory look really, really slow. What you have to do is try to reduce operations where you're waiting on memory, so effectively reduce your latency. There are a couple of ways to do that. One is to rely on the hardware prefetch engine, and I'll show you another example of that. The other thing you can do is use software prefetch instructions to get the data before you actually need to use it for computations. For example, if you're in a loop, you can batch all your loads together at the top of the loop, do a bunch of math, and then do the stores at the bottom; that's going to perform better than doing serial operations of load, math, store.
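A rough illustration of batching loads, math, and stores (hand-unrolled four ways; the function names and the scaling operation are made up for the sketch):

```c
/* Serial pattern: load, compute, store, one element at a time. */
void scale_serial(float *dst, const float *src, int n, float k) {
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Batched pattern: group the loads at the top of the loop body, then the
   math, then the stores, giving the out-of-order core independent work
   while each load is still in flight. */
void scale_batched(float *dst, const float *src, int n, float k) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        float a = src[i],     b = src[i + 1];     /* loads together */
        float c = src[i + 2], d = src[i + 3];
        a *= k; b *= k; c *= k; d *= k;           /* math together */
        dst[i]     = a; dst[i + 1] = b;           /* stores together */
        dst[i + 2] = c; dst[i + 3] = d;
    }
    for (; i < n; i++)                            /* leftover elements */
        dst[i] = src[i] * k;
}
```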
I mentioned that the data cache is different on the G5 than the G4, and the biggest difference is the cache line size: it's four times as big. What does that mean? Well, it means you might get one fourth the cache misses if your data is organized nicely. It may also mean that you're getting really terrible performance: if you're accessing one byte, skipping 127, and accessing another byte, at that point you're getting less than 1 percent efficiency from your cache. So what you want to do, and this is sort of basic CS, right, is pack your data together to maximize its locality; as you walk through your array, you want to be stepping sequentially rather than jumping around. This has an additional benefit of triggering the hardware prefetcher: the CPU is automatically going to detect that you're walking a straight line, either up or down through memory, and start prefetching cache lines from memory into the cache. So, again, here's
another simple example. This is a classic two-dimensional array where we're walking the wrong way through it: we're iterating down the columns rather than across the rows, so we're skipping large chunks of memory. In this case, what you'd want to do is switch the for loops so that you could sequentially access every element in the array. So, any guesses on how much faster this is going to be? Big difference, right? This is simple stuff, but it's 30 times faster if you do the right thing rather than the wrong thing.
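The two walks might look like this in C (the dimensions and names are my own, not the slide's):

```c
enum { ROWS = 256, COLS = 256 };
static double grid[ROWS][COLS];

void fill_grid(double value) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            grid[i][j] = value;
}

/* Wrong way for a C (row-major) array: consecutive accesses are a whole
   row apart, so nearly every access touches a new cache line and the
   hardware prefetcher's straight-line pattern never gets going. */
double sum_columns_first(void) {
    double sum = 0.0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += grid[i][j];
    return sum;
}

/* Right way: step sequentially through memory, using every byte of each
   128-byte line and triggering the hardware prefetcher. */
double sum_rows_first(void) {
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += grid[i][j];
    return sum;
}
```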
So it highlights how important accessing memory is. Now, I just
want to summarize some of the things you should be doing, and looking at, while you're trying to optimize code. The first thing to do is try to unroll and schedule important loops, because you have all these execution units, the independent floating-point units, the independent load/store units, and hundreds of instructions in flight. And this number is actually out of date as well: you can now use AltiVec to calculate more than 36 gigaflops at two and a half gigahertz. This is of course the best solution if you have code that's just massively parallel; you can operate on all the elements simultaneously. For those of you writing floating-point code, the G5 has a hardware square-root instruction, which can be enabled in GCC with the -mpowerpc-gpopt flag; xlc will recognize that this instruction is available if you specify the G5 architecture. This has made a very large difference in some ray tracers and renderers and other programs we've looked at that have a heavy dependence on square root. If you're using 64-bit integers, long long in C, you can turn on flags to specify that you have a 64-bit machine, because the G5 truly does have 64-bit integer registers. This can be a huge difference for your code, compared to actually breaking the work up into 32-bit chunks.
So again, the system and the chip were designed for high bandwidth; they were designed to do lots of things in parallel. It's part of the server heritage, coming from the POWER4. You have 40 gigabytes per second to the L1 cache, up to 80 gigabytes per second between the caches, and up to five gigabytes per second to and from main memory. And the way you want to take all this theoretical throughput and put it into practice is to take advantage of the hardware prefetch engines; these will start scooping data out of main memory and bringing it into cache before you actually need it. And that's all I have, so with that I'd like to introduce Steve Vaquita from the IBM compiler team.

Good morning. I'm actually very excited to be here, as we have now introduced our XL C, C++, and Fortran compilers for Mac OS X, or Mac OS 10, my apologies. So, IBM compiler technology:
we've been in the business of compilation technologies for over 15 years, exploiting primarily PowerPC technology, but we've also been on about nine other platforms, mainly IBM platforms. Among all this technology, we've got numerous types of optimization patents that truly exploit the PowerPC technology. Our goal in the IBM compiler team is actually threefold. The first is to exploit the hardware: our key here is to drive out the maximum performance we can possibly get from the G5 processor. Toward that end, we have an extensive portfolio of optimizations; these include things like interprocedural analysis, which does whole-program analysis, profile-directed feedback, loop optimizations for parallelism, and instruction scheduling for locality.
And one of the things that we do regularly is work very closely with the chip architecture team. We've been working with the core team that actually developed the chips, providing them with information as to the ways they may want to change the actual chip, versus the type of information that we can exploit within our own compilers. The second thing that our compiler group is really focused on is specifications and standards. For C and C++, we are C 1999 and C++ 1998 compliant; for Fortran, we're Fortran 77, 90, 95, and partially 2003 compliant. We also have OpenMP support, which was primarily introduced on the AIX platform, and we're bringing it over to Mac OS X. Also, our developers within our C, C++, and Fortran teams are represented on the standards committees; they're not only on the ISO standards committees but also in the OpenMP consortium. So, being really focused on compatibility and on standard specification, source code that can be pumped through our compilers is easily portable between numerous platforms, for example Mac OS X, Linux, AIX, and our mainframe OS. The third thing that we're really focused on is customer care. We work very closely with various ISVs and also customers on tuning their code; as a matter of fact, we're down in the optimization lab all this week, and some of our engineers have been working very closely with people that have brought in their code. We've actually seen speedups of anywhere between twenty percent and two hundred percent, even in a short period of time, by using our compilers to exploit their code.
So, the C, C++, and Fortran compilers on Mac OS 10 are based on our AIX and Linux compilers. On AIX and Linux we call it, right now, VisualAge C++, and the VisualAge C++ compilers on those platforms are essentially the same compilers that we have for the Mac OS 10 platform. This actually leverages all the proven optimizations and language specifications that we've already introduced on those platforms. Some of the common things between our XL C and C++ compilers and Fortran: as I mentioned already, exploitation of the G5 architecture; we are integrated with Xcode, with symbolic debugging through GDB; and we also support a number of Apple's profiling tools. The Shark one in particular is just an outstanding tool for helping tune your code; even ourselves, we wish it was available on some of IBM's platforms. Among the other things we have is what we call the technology preview; these are features that we are actually looking at trying to bring into our product, although right now they're not formally supported. In particular there is OpenMP, where our direction is to have full support for OpenMP 2.0, and the other one is automatic parallelization. Specifically, then, for C and C++: as I mentioned, there's the standards compliance for C99 and C++98, and exploitation of AltiVec, in that the compiler now can actually generate code using the AltiVec instructions. One of the things that we are looking at on an ongoing basis in our research and development is automatic SIMDization, otherwise known as automatic generation of AltiVec instructions; these are things that we are definitely focused on and looking at in future releases.
Compatibility with GCC 3.3 is twofold: one is that we are fully binary compatible, so you can intermix GCC objects with our compiler's; the other is that we support a number of language extensions that are GCC specific, so you have source-code compatibility as well. We also have an Objective-C technology preview. And then for XL Fortran, as I mentioned, we're already Fortran 77, 90, 95, and partially 2003 compliant; we've also introduced many IBM and common industry-standard language extensions, and these include some from VM and other well-known platforms for Fortran. So that was actually just a quick overview of the XL C, C++, and Fortran compilers. If you have any questions on how to exploit your code, how to gain even more optimization capability and performance on your G5, come on down to the optimization lab; there are a number of us from IBM there to help answer any of your questions. Thanks. And the next person up is Ron Tight.

Thank you, Steve. Well,
this has been a terrific year for the G5, and I think we all now understand what kind of power lurks in that box. But those of us who have really worked a lot with it understand what it takes to extract that power, and Sanjay covered some of that. Of course, Apple offers a set of tools that really facilitates understanding your program and being able to extract that power. I want to talk today about what the compiler can do to get you started on that path, because not everyone is ready to step up and start tuning their program and changing the algorithms and so forth. Sanjay mentioned a number of compiler options that can help you in certain situations, and what we have done within the compiler group is actually put together a mode we call -fast, and I want to talk about that today. I also want to talk about feedback-directed optimization, which is another component of the -fast mode that can help you significantly. And then, of course, you've all heard the announcement this week that we're on the path to deliver, with Tiger, our initial cut at auto-vectorization, and I want to talk about exactly what that is. So, GCC and the -fast mode. Could I just ask in here, is anyone using this mode today? Ah, gee. Amazing. We've had so little feedback on how it has been working for people that we've wondered if anyone was using it, and that's why we wanted to talk about it
today. The -fast mode is really a collection of a lot of the compiler options, but in many ways it's more than just a collection of options: we put them together, in as coherent a fashion as we could, to target what I would call typical applications. Of course, we all know there's nothing like a typical application, but in this case I mean applications that are computationally intensive, that do a lot of mathematical computation. We've tried to build a mode that will give you a first step into getting some of the performance; however, the details of when you use that mode are important, so you can't totally say, "I don't need to understand my program and what's going on in my program." I'll talk a little bit about the details that are important and give you a good feel, at least, for -fast and what it's trying to do. And then, finally, there is a variant of -fast called -fastcp, and that's really what you should be using if you're working with C++; there are some things that we do slightly differently to try to address performance in the C++ world. So, what are some of the specifics of the -fast mode? What are we actually trying to attack? Well, Sanjay and others have talked about the deeply pipelined nature of the architecture and the wide set of functional units, and so one of the things you really have to be concerned about to get performance is keeping the pipeline filled, as we call it. There are a number of optimizations, some that Sanjay mentioned, that we have brought together to try to keep this pipeline filled, so we're feeding this monster at the speed it would like to be fed. I want to talk a little bit about standards conformance and some of the things that we do to relax the rules so that the compiler can actually do a better job for you in terms of optimization. And then, finally, of course, the G5 instruction set; this is a presentation on the G5. So, to start off
with: I don't know how many of you have ventured into the -O3 level of optimization, but I want you to know that's just the starting point for -fast, so you'll get that with -fast. Along with that come a couple of important options. One is inlining functions; basically, and you may understand this, that says that within a compilation unit the compiler can use some heuristics to determine how to inline functions within that compilation unit. The real purpose behind the compiler doing that is that the more code the compiler can have an inline view of, the better all of the optimizations can be performed; the bigger the view the optimizer has, the better the optimization. The second is the rename-registers option, and what this simply does is give the compiler more freedom in terms of its register allocation. It does that at the expense of you being able to debug your code, but if you're on this ragged edge of trying to get optimum performance, that is one of the pitfalls you have to deal with.
The second capability that I want to talk about is intermodule inlining, or function inlining across the whole program. Where the previous inlining option looks at one compilation unit, intermodule function inlining looks at the whole program, and so it gives you that many more opportunities to consider inlining throughout your program. Once again, there are heuristics that we have determined are the best when you're making guesses about inlining and you really don't know whether functions are called a lot or not; I'll be mentioning another feature a little later on that deals with that. This is a command line, then, that would represent you invoking intermodule function inlining; basically, it's triggered by putting all of your compilation units on the same compile line, so the compiler can look at them all at once. The next thing, and Sanjay talked
about this, has to do with loop unrolling, and the compiler can actually do that loop unrolling for you. Once again, this is a very simple-minded loop, but it will serve as a representation; the compiler can actually deal with more complex loops. Unrolling simply means that the compiler reduces the number of iterations of the loop and puts several iterations actually in line, and so, once again, what you're doing is cutting down on the branching operations and trying to give the scheduler more opportunities for scheduling the other operations in the functional units. There is another form of loop unrolling that the compiler does, called loop peeling, and in that situation, as you can see here, we have an even smaller loop with a known, small iteration count, and the compiler will simply unroll the entire loop and eliminate the loop altogether.
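Loop peeling can be pictured as a source-level before-and-after (a hand-written sketch of what the compiler does, not actual compiler output; the names are mine):

```c
/* A tiny loop with a known, fixed trip count... */
void clear4_loop(int *a) {
    for (int i = 0; i < 4; i++)
        a[i] = 0;
}

/* ...can be fully unrolled, eliminating the loop, its counter, and its
   branch altogether. This is the effect of loop peeling: */
void clear4_peeled(int *a) {
    a[0] = 0;
    a[1] = 0;
    a[2] = 0;
    a[3] = 0;
}
```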
The next option is loop transposition, and we have several loop transpositions. This is similar to what Sanjay was talking about, and I think he indicated you could turn on this option. Basically, we have a double-nested loop here, and it is stepping through memory in fairly large increments, in this case 1335, and that has a terrible effect on the paging within the machine. So, for data locality reasons, we include this loop-transpose function: the compiler is able to recognize that situation and actually do the transposition of the loop, so that now we're incrementing in increments of 1 throughout memory. We also have a specialized optimization called loop-to-memset. What that is: if you have initialization-type loops over arrays, where you're initializing things to zero, the compiler actually will transform that into a call to memset, and memset on each of our architectures, including the G5, has been highly tuned in such a way that you can't beat it with your own code.
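The loop-to-memset transform, sketched by hand (under -fast the compiler does this rewrite for you; the function names here are mine):

```c
#include <string.h>

/* An initialization loop like this... */
void clear_loop(unsigned char *buf, int n) {
    for (int i = 0; i < n; i++)
        buf[i] = 0;
}

/* ...is transformed into a call to the highly tuned library memset,
   which is the shape of the code the compiler emits: */
void clear_memset(unsigned char *buf, int n) {
    memset(buf, 0, (size_t)n);
}
```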
And last, we talked about tuning for the G5, and tuning for the G5 is really important because it tells the compiler that this is a G5 architecture; the compiler then understands how to schedule instructions for maximum grouping, so that we can keep all of the functional units going, as much as possible, in parallel. So we're really extracting the power of having a wide set of functional units. Okay, I mentioned standards conformance, and so I have a couple of relaxation rules here. One of them has to do with an option called strict aliasing. Aliasing is the situation where you have two pointers and those pointers are actually pointing to the same object, so those objects are then aliased, and the compiler often can't tell, even if the pointers are of different data types, whether in fact the objects are aliased, so it has to make the assumption that they are. Well, you can help the compiler out: if you know in your program, and this is where your program knowledge comes in, that your pointers are never aliased, you can tell the compiler to use strict aliasing assumptions. In the example, basically, what this means, and this is a very simple example, is that strict aliasing tells the compiler that if pointers are of different data types, it can assume they're not aliased. So in this particular case, without strict aliasing, we would actually have to reload the value through the pointer before we return it; by saying strict aliasing, we're able to understand that we don't have to worry about reloading it, and in fact the value can be put into a register and returned that way. So this can have, interestingly, a pretty big impact in many programs.
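The aliasing example he walks through is along these lines (the variable and function names are my own reconstruction of the slide):

```c
/* With -fstrict-aliasing, the compiler may assume that pointers to
   different data types (here long* and double*) never point to the same
   object. So the store through pf cannot modify *pi, and *pi can stay
   in a register across it; the return may be folded to "return 1"
   without a reload from memory. Without strict aliasing, the compiler
   must reload *pi after the store through pf, just in case. */
long store_and_read(long *pi, double *pf) {
    *pi = 1;
    *pf = 2.0;
    return *pi;
}
```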
The second thing that falls within the area of conformance is our fast-math option, and fast math, you should understand, is not IEEE conformant. But, by the way, almost all code doesn't actually require exact conformance, and you know if yours does. By relaxing that rule, the compiler can assume that the associative, distributive, and commutative principles hold, and so it can actually rearrange code, in the fashion that I show here on the screen, to best utilize the scheduling of the math and the computation of these operations within the system. This is another one that can really win for you, and if you really don't need exact behavior on the boundary conditions, whether something is not-a-number, a number, or infinity, then you should try using fast math in your program.
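One way to picture the kind of reassociation fast math permits (a hand-written sketch; the names are mine, and the regrouped form is what the compiler might produce itself under -ffast-math):

```c
/* As written: one serial chain of dependent adds. */
double dot_serial(const double *a, const double *b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Regrouped so the two halves can issue to the two FPUs in parallel.
   This reordering is not bit-for-bit IEEE identical in general, which
   is why the compiler needs -ffast-math's permission to do it. */
double dot_regrouped(const double *a, const double *b) {
    return (a[0] * b[0] + a[1] * b[1]) + (a[2] * b[2] + a[3] * b[3]);
}
```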
course we we say mcp you it goes g5 and
in the GCC compiler that says you can
you're perfectly free to use any g5
instructions that are available and then
in line floor happens to make use of a
couple of specialized instructions in
the g5 to actually in line the floor
intrinsic right in right in line the
The next three have to do with alignment. One of the things that we've learned about the G5 is that it's very sensitive to alignment, and you can make dramatic improvements in your code's performance if you try to deliver well-aligned data and well-aligned code. In this case we're aligning loops, jumps, and functions all on 16-byte boundaries, and yes, this does cause some bloat in your program, but our experience is that the performance gain far outweighs the bloat and the size increase. The last item there, align-natural, says to align all data types on their natural boundaries. You have to be cognizant of that if you're concerned about data being packed together and things of that nature, because data types may then be laid out with padding gaps in them.
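A small sketch of the padding effect just described (illustrative; exact offsets depend on the ABI, and the figures below assume a typical 64-bit ABI where double is 8-byte aligned):

```c
#include <assert.h>
#include <stddef.h>

/* With natural alignment, the double member sits on an 8-byte
 * boundary, so a 4-byte padding gap follows the int -- this is the
 * packing concern the speaker mentions. */
struct naturally_aligned {
    int    tag;    /* offset 0 */
    double value;  /* offset 8 on ABIs with 8-byte-aligned doubles */
};
```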
I would just encourage you, moving on to the -fast umbrella part of this, to give that a try, and feel free to give us feedback in terms of problems you have. One of the things it can help trigger is the need for the next step of analysis: if in fact you use -fast and you don't see a speedup in your code, then you should be thinking that maybe you've got some algorithmic problems or some memory accessing problems that are the real performance killers, and there's no way the compiler is going to optimize you out of those. You will have to go to Shark and try to understand the program and what's causing that to happen. The next thing I want
to talk about is feedback directed optimization. Feedback directed optimization really allows you to tell the compiler in more detail exactly how you expect your code to execute, and the compiler will take that knowledge into account and do a better job of optimizing. Its first use is in inlining. The concern about inlining, as was mentioned by Sanjay, is that if you over-inline you can kill performance as well. Using feedback directed optimization, we actually tell the compiler, from the results of a training run, exactly how many times a function was called at a call site, and how many iterations a loop makes when it has a function call inside it, so it can make very good decisions in terms of the performance-versus-size trade-off, as opposed to using the guesses that are the normal parameters we look at. The second thing it is used
for is what we call hot and cold partitioning. The best example I have for hot and cold partitioning is an if statement: you have two branches, and one of them gets executed predominantly while the other one runs only occasionally, maybe only in an error condition. So we tag the hot one, we group the hot code together, and we take the cold code and move it off together at the end of the program. That helps compact the program down and keeps its footprint small, so that we reduce paging once again when it's running. There are a
couple of flags that you use to do this. First you use the create-profile flag to build an executable that is instrumented so it can gather the profiling information; you run that with a training set of data; then you rebuild your program, optimizing it using the profile you just created. Not all applications, I realize, lend themselves to this type of profiling; maybe yours is an interactive type of application. But certainly if you have computationally intensive applications that work on large data sets, taking the time to train the compiler to optimize the application for that data is really a great thing to do, well worth the effort.
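The two-pass workflow just described might look like this on the command line. This is a sketch, not the session's exact commands: flag spellings vary by compiler release, and current GCC spells them -fprofile-generate and -fprofile-use.

```shell
# Step 1: build an instrumented executable that records profile data.
gcc -O2 -fprofile-generate -o myapp myapp.c

# Step 2: run it on a representative training data set.
./myapp training-input.dat

# Step 3: rebuild, letting the compiler use the recorded profile for
# inlining decisions and hot/cold code partitioning.
gcc -O2 -fprofile-use -o myapp myapp.c
```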
Then finally I want to talk about auto vectorization. Just out of curiosity, how many of you are using the AltiVec processor today? Okay, we have quite a few hardy souls, but you also understand that it doesn't come for free; it takes some work to program it today. What we are doing is trying to open up the vista of using AltiVec to a broader scope of folks, including areas where you may not in fact want to spend the effort tuning it yourself. So what is auto vectorization? It's simply the compiler being able to transform serial loops into vectorizable loops. And what are vectors, for those who don't know? A vector is 128 bits that can be operated on at a number of different element sizes for integer, floating-point, and bit operations and so forth, and all of the operations within those 128 bits occur in parallel, so therein lies the speedup. Just a quick overview: the types of operations are arithmetic, logical, compare, rotate, and shift, but they're all done within the vector unit, and of course on the data types we just talked about. So on the DVD that you've
received, there is a preview compiler, a preview of the 3.5 compiler, and that is your first introduction to auto vectorization. It has limitations, and our goal is to really work on those limitations between now and the time it's released with Tiger. But what can it handle today? It can handle loops with both known and unknown bounds (there is different code we have to generate to discover the loop iteration count at runtime when it's not known), loops with even and odd vector lengths, loops with conditionals (particularly simple conditionals), and misaligned vectors on loads. So we're able to take unaligned vectors; what I mean by that is, once again, AltiVec operates on a 16-byte boundary, so you want your vectors to lie on 16-byte boundaries, and you can get that from malloc'd arrays and of course your own arrays that you allocate; but when they don't, we go through vector operations to align them, and I'll show you a little bit about the performance penalty that can occur when you do that. Auto vectorization has
difficulties with pointers and aliasing. I talked a little bit about that before for the scalar part of the compiler, and that's true here as well. In this particular example, a and b are certainly not local within this function, so unless they are globals, there's no way the compiler can discover that they are not aliased; it will have to assume, in today's world, that they are, and not vectorize this loop. However, you can help the compiler out in a simple way: you can use the restrict keyword, which tells the compiler that this pointer does not alias any other object. That simple help turns a loop that can't be vectorized today into one that can.
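The example itself isn't captured in the transcript; a loop of this shape (names mine) illustrates the idea. Without restrict, the compiler must assume a and b might overlap; with it, the iterations can be vectorized safely.

```c
#include <assert.h>

/* restrict promises the compiler that a and b never refer to the
 * same memory, removing the aliasing assumption that would otherwise
 * block vectorization of this loop. */
void scale_add(float *restrict a, const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + 2.0f * b[i];
}
```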
The next thing it has difficulty with, and that you need to watch out for, is that scalar loops may have data dependencies. Those work perfectly fine when you're executing in scalar mode, but to transform such a loop into a vector operation, where a number of elements are computed at the same time, you can't have those dependencies. The first loop illustrated here is one we simply couldn't vectorize, because it has that data dependency; the second one looks similar, but in fact there's no data dependency, because the offset is all set by n.
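The slide's loops aren't in the transcript; this pair (my reconstruction of the pattern described) shows the contrast. In the first, each iteration reads the previous iteration's result; in the second, the reads are offset by n, so iterations are independent.

```c
#include <assert.h>

/* Loop-carried dependency: a[i] needs a[i-1] from the previous
 * iteration, so the elements cannot be computed in parallel. */
void prefix_like(float *a, const float *b, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* Looks similar, but a[i] and a[i + n] are always distinct elements,
 * so there is no dependency between iterations and the loop can be
 * vectorized. */
void independent(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i + n] + b[i];
}
```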
Then there are misaligned vector stores: we simply can't handle those in the preview. We'll have that available in the GM release, but if you're going to play with the 3.5 compiler, be aware that the vector you're storing into needs to be correctly aligned in memory. So what is
using auto vectorization all about? Well, it's about performance, and I have some initial numbers here. These are already out of date as we continue to tune the code, but for simple types of operations in loops you can see speedups here that go all the way to 14 times, and we're now seeing even around 20 times in some of our work. If you have misaligned data, the kind of impact you can see is that it really reduces the performance significantly. This, as I said, I expect to improve; we're at a very early stage with auto vectorization, and we have a limited set of loops that we're able to recognize and vectorize. So I would encourage you to take a look at this; we are really open to you sending us kernels of code that we don't seem to be able to vectorize, because we want to build up and mature that ability, and this is something we'll keep working at. As you can see, though, the reason we're excited is that this can really offer some speedups, particularly if you haven't already been using it: the AltiVec processor on your system is sitting there just wasting away, and you can get some real performance out of it. Enabling it is a matter of a couple of options, and I believe that in Xcode today there's actually an option for auto vectorization that will do that for you and enable the process. So if
you're looking for more information, you can contact Mark Tozer or Matthew Formica. Mark, do you want to come up? So, to add to that: the reference library has documentation on Apple's developer website, including tech notes that were posted since last year's developers conference with a lot of information, and we'll have George Warner, who participated in writing some of that technical documentation, up here in a few seconds for the Q&A. A
couple of takeaways I want to make sure you go away with this morning. It's been a year since we introduced the PowerPC G5 chip, as I said earlier, so you should be looking at transitioning your code to the G5, and you should be looking at optimizing that code and making sure it performs at its best. Optimization is a skill; it is not something that comes for free. All the tools you heard about today, and the compilers, both Apple's as well as IBM's, will provide a lot of assistance, but there are times when you need to get in there, roll up your sleeves, and do the hard work. For that we have optimization workshops at Apple: over the past year we've held over eight workshops, one a month essentially, helping developers like yourselves work through the problems of optimizing your code. I encourage you to participate in those workshops; they're announced through the Apple Developer Connection emails, and they'll be continuing throughout the rest of the year, with the next one starting in August, the first or second week of August if I remember correctly. The other thing is that, yes, it does take a lot of work to do optimization, but there are a lot of rewards to it, as was pointed out with some of the sample code, and we're here to help you through those problems. As Steve from IBM mentioned, there is the optimization lab here all week, so please take advantage of those resources. We're here and committed to helping you write the best code for this platform. We feel the G5 has a lot to offer and a lot of headroom to grow, and the best applications on the platform are those that take advantage of all the abilities the hardware has to offer. So please keep that in mind, whether you're revving an application, writing a new application, or just taking the time to look at what you've done in the past and maybe improve upon it.