Transcript
[ Music ]
[ Applause ]
>> Hi, I'm Alex Rosenberg.
I'm incredibly excited to share
with you some amazing new
features we have in store
for the Apple LLVM Compiler.
But first, I want to talk
a little bit about LLVM,
the project upon which we
build the Apple LLVM Compiler.
LLVM is a modular framework
for building compilers
and related tools.
However, it need not be used
solely in this traditional way.
We all know and love
Xcode, which uses many
of the LLVM frameworks
internally.
And now the amazing new Swift
Playgrounds app has joined it
with similar internal
uses of LLVM.
LLVM is also an integral part
of the inner workings of Metal
and our other graphics APIs.
LLVM is open source.
Apple has a long history
with open source
compilers and languages.
The Swift language was born
out of the LLVM project.
We're very happy with how you,
the community of contributors,
have moved Swift forward via
the open evolution process
on Swift.org.
That process and the Swift
development processes were
informed by the existing
LLVM project processes.
LLVM has a thriving
open source community,
backed by the nonprofit
LLVM Foundation.
LLVM is highly customizable
and composable.
By leveraging the same robust
and mature infrastructure
underpinning the
Clang front end,
we power the Swift
compiler as well.
For more info about Swift,
review the talk from earlier.
Let's take a brief
look into some
of these frameworks
available from open source
and how they might be used.
Clang is the front end for
C, Objective-C, and C++.
Its libraries provide
for advanced features,
like static analysis
and in-place rewriting
for code migration
and modification.
It also powers development
environment integration
features, like source
code indexing
and intelligent code completion
that we all love in Xcode.
The open source tooling library
helps harness the power of Clang
for your own custom command-line
tools to process source code.
Imagine the amazing things
you could do with the power
of a full parser at your
command, running on your code.
The LLVM optimizer,
which we'll speak more
about in a little
bit, has a full suite
of modern compiler
optimizations.
And this is where features like
Link-Time Optimization happen.
Stay tuned for some
exciting developments in LTO.
And then we have the back end,
where final code
generation happens.
There are libraries for
object file manipulation,
features like assembly
and disassembly,
and even advanced features
like just-in-time compilation
can be simply plugged together.
LLVM has many other tools
composed from its frameworks.
Let's take a look.
As part of a complete
LLVM-based toolchain,
there is the usual suite
of binary file utilities,
including a burgeoning open
source effort on a framework
for making linkers
and related tools.
But perhaps the most prominent
of these projects is LLDB.
LLDB combines many of the
libraries from Clang, Swift,
the Code Generator, and its own
suite of debugger frameworks.
There will be a great session on
LLDB tips and tricks on Friday.
All of the amazing features
in Xcode, Swift Playgrounds,
and the Apple LLVM Compiler
wouldn't be possible
without the contributions from
the LLVM open source community.
The community is made up of
many developers like you,
inspired to bring their own
ideas into the LLVM frameworks.
The LLVM open source project
is growing at a stunning pace
that would be extremely
challenging for any one team,
at any one company, to match.
We'd like to invite you to check
out the project at LLVM.org,
and see how you can
fit into the community
and contribute your
unique ideas.
And now I'd like to introduce
Duncan, who will go over some
of the exciting new
language features we have
in the Apple LLVM Compiler.
[ Applause ]
>> Let's talk about
language support.
First, we'll talk about new
language features, then updates
to the C++ library, and
finally new errors and warnings
to help improve your code.
Let's start with new
language features.
Objective-C now supports
class properties.
This feature started in
Swift as type properties,
and we've brought
it to Objective-C.
Interoperation works great.
In this example, the class
property someString is declared
using property syntax,
by adding a class flag.
Later, this someString property
is accessed using dot syntax.
Class properties are
never synthesized.
You can provide storage,
a getter, and a setter,
in the implementation.
Or you can use @dynamic to
defer resolution to runtime.
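Here's a minimal sketch of that example; the class name and the backing storage are illustrative assumptions, not code shown in the session:

    #import <Foundation/Foundation.h>

    @interface MyClass : NSObject
    // The class flag turns this into a class property.
    @property (class, copy) NSString *someString;
    @end

    @implementation MyClass
    // Class properties are never synthesized, so provide
    // storage, a getter, and a setter yourself...
    static NSString *_someString;
    + (NSString *)someString { return _someString; }
    + (void)setSomeString:(NSString *)newValue { _someString = [newValue copy]; }
    // ...or declare the property @dynamic to defer resolution to runtime.
    @end

    // Later, the property is accessed using dot syntax on the class:
    // NSString *s = MyClass.someString;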
Over to C++.
LLVM has had great support
for C++11 for years.
The only holdout
has been support
for the thread_local keyword.
This year, we added support.
Let me tell you a
little about it.
If a variable is declared
with the thread_local keyword,
LLVM creates a separate
variable per thread.
Initializers are called
before the first use
after thread entry,
and destructors are
called on thread exit.
C++-style thread-local
storage supports any C++ type.
And its syntax is portable
with other C++ compilers.
The Apple LLVM Compiler
already has support
for C-style thread-local
storage,
even when compiling C++ code.
There are two syntaxes
available:
One with a GCC style keyword,
and another from
the C11 standard.
C-style thread-local
storage has lower overhead
than C++ thread-local storage,
but it has restrictions.
It requires constant
initializers
and plain old data types.
If your code meets
these restrictions,
you should continue to use
C-style thread-local storage
for maximum performance.
Otherwise, use the C++
thread_local keyword.
Both work great.
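As a quick sketch of the three syntaxes just described; the variable names are illustrative assumptions:

    #include <string>

    // C++ thread-local storage: works with any C++ type.
    // Each thread gets its own copy; the initializer runs before
    // first use on a thread, and the destructor runs at thread exit.
    thread_local std::string perThreadLog;

    // C-style thread-local storage: lower overhead, but it requires
    // a constant initializer and a plain old data type.
    __thread int counterGCCStyle = 0;        // GCC-style keyword
    _Thread_local int counterC11Style = 0;   // C11 standard keyword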
Thread-local variables
can help fix bugs
in code that uses threads.
For more on thread related
bugs, watch the Thread Sanitizer
and Static Analysis talk.
That's it for new
language features.
Let's move on to the
C++ standard library.
libc++ has been the default
C++ standard library for years.
We've been encouraging you
to move off of libstdc++,
and in Xcode 8, we've deprecated
it on all our platforms.
Please move on as
soon as you can.
If your Xcode project
still uses libstdc++,
you should upgrade to libc++
by changing the C++ standard
library build setting.
Xcode's project modernization
will offer
to do this automatically.
The libc++.dylib has a
great update this year
for all our platforms.
It has complete library
support for C++14,
as well as many other
improvements.
Standard library features
that require the dylib
now have availability
attributes.
While Apple frameworks encourage
deploying to old targets
through the use of
runtime checks,
the C++ standard library
availability checks are done
at compile time.
To use C++ features that
require the newest dylib,
you need to target a
platform that supports them.
That's the C++ library.
Between Xcode 7 and
Xcode 8, we've added more
than 100 new errors and warnings
to help you find
bugs in your code.
Let's talk about
just a few of them.
In Xcode 7, we added
a great feature
to Objective-C called
Lightweight Generics.
The __kindof type modifier
plays an important role,
allowing implicit downcast
from the __kindof type
to any of its subclasses.
In Xcode 8, we've
improved the diagnostics
for method lookup
on __kindof types.
In this example,
getAwesomeNumber is declared
inside MyCustomType,
which inherits from NSObject.
Later, getAwesomeNumber is
called on a __kindof UIView.
This code is broken, since
MyCustomType is unrelated
to UIView.
Xcode 8 gives an error here.
Type checking of methods called
on __kindof types is restricted
to the same class hierarchy.
This improved type checking
also avoids misleading warnings
when an unrelated type declares
a method with the same name.
In Xcode 8, __kindof types
are way easier to use.
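Here's a small sketch of that broken call; the interfaces are reconstructions of what was described, not code from the session:

    #import <UIKit/UIKit.h>

    @interface MyCustomType : NSObject
    - (NSInteger)getAwesomeNumber;
    @end

    void lookupExample(__kindof UIView *view) {
        // Error in Xcode 8: getAwesomeNumber is declared on
        // MyCustomType, which is unrelated to UIView's class hierarchy.
        NSInteger n = [view getAwesomeNumber];
        (void)n;
    }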
Next, containers in
Objective-C, such as NSArray
or NSMutableSet, can
hold arbitrary objects.
However, the NSMutableSet
called s
in this example is
added to itself.
This creates a circular
dependency,
which triggers a
warning in Xcode 8.
Besides creating a
strong reference cycle,
a circular dependency
can prevent some methods
from being well-defined.
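A minimal sketch of that situation, with the setup assumed for illustration:

    #import <Foundation/Foundation.h>

    void circularExample(void) {
        NSMutableSet *s = [NSMutableSet set];
        // Warning in Xcode 8: inserting a collection into itself
        // creates a circular dependency and a strong reference cycle.
        [s addObject:s];
    }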
Infinite recursion.
This example implements
the factorial function.
If n is positive, it returns n
times factorial of n minus 1,
recursing to calculate
the answer.
If n is zero, it returns
factorial of 1, also recursing.
The compiler now warns
when all paths through a
function call the function itself.
This catches common cases
of infinite recursion.
Here, one possible fix is
to return 1 when n is zero.
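Here's a reconstruction of that factorial example, with the fix noted in a comment:

    // Every path through this function recurses, so it can never
    // terminate; the compiler now warns about this.
    int factorial(int n) {
        if (n > 0)
            return n * factorial(n - 1);
        return factorial(1);  // also recurses; one fix: return 1 here
    }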
std::move is a great
C++ language feature.
It allows you to specify that
ownership should be transferred
from one container to another.
Moving a resource is
faster than a deep copy.
However, using std::move
on a return value
blocks named return
value optimization.
Typically, when a local
variable is returned by value,
the compiler can avoid
making a copy at all.
In generateBars, calling
std::move forces the
compiler to move bars.
Although moving is fast,
doing nothing is even faster.
LLVM now warns about
this pessimizing move.
The fix is to avoid std::move
on return values,
getting back the
lost performance.
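A sketch of generateBars as described; the element type is an assumption:

    #include <utility>
    #include <vector>

    struct Bar { int value; };

    std::vector<Bar> generateBars() {
        std::vector<Bar> bars;
        // ... fill bars ...
        // Warning: this pessimizing move blocks named return value
        // optimization. Writing `return bars;` is faster still.
        return std::move(bars);
    }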
Similarly, when a function
takes an argument by value,
and returns it by value,
the compiler automatically
moves the return value.
There's no need to
call it explicitly.
For example, the std::move
on the return value
in rewriteText is redundant.
Although this doesn't
actively hurt performance,
it makes the code less readable.
It's better to return
text directly.
This is easier to
maintain and consistent
with returning local variables.
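A sketch of rewriteText as described:

    #include <string>
    #include <utility>

    std::string rewriteText(std::string text) {
        // ... modify text ...
        // Warning: a by-value parameter returned by value is already
        // moved automatically; the explicit std::move is redundant.
        // Prefer `return text;`.
        return std::move(text);
    }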
Finally, there are new warnings
for references to temporaries
in C++ range-based for loops.
Here, the loop is iterating
through a vector of shorts,
but the iteration variable i
is a const reference to an int.
Because of the implicit
conversion between short
and int, i is a temporary.
This can lead to subtle
bugs, since it looks
as if i is pointing inside
the range, but it isn't.
The compiler now warns about
this unexpected conversion.
One fix is to change i to
a const reference to short.
Another is to make it
clear that i is a temporary
by removing the reference.
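A sketch of that loop and its two fixes:

    #include <vector>

    void iterate(const std::vector<short> &values) {
        // Warning: each short converts implicitly to int, so i binds
        // to a temporary rather than pointing into the range.
        for (const int &i : values) { (void)i; }

        // Fix 1: match the element type.
        for (const short &i : values) { (void)i; }

        // Fix 2: drop the reference, making the temporary explicit.
        for (int i : values) { (void)i; }
    }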
There is a similar warning
when a range doesn't
return a reference at all.
An iterator to
std::vector<bool>
does not return a reference.
So the iteration variable b in
this example is a temporary.
It's surprising that b does
not point inside the vector.
The compiler now warns
about this unexpected copy.
The fix is to remove the
reference, making it clear
that b is a temporary.
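A sketch of that case:

    #include <vector>

    void iterateFlags(const std::vector<bool> &flags) {
        // Warning: std::vector<bool>'s iterator returns a proxy by
        // value, so b is a temporary, not a reference into the vector.
        for (const bool &b : flags) { (void)b; }

        // Fix: remove the reference.
        for (bool b : flags) { (void)b; }
    }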
The new warnings for infinite
recursion, std::move,
and C++ range-based for
loops are not enabled by default.
To try them out in
your Xcode project,
add them to the Other
Warning Flags build setting.
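For reference, the open source
Clang spellings of these warnings
are likely -Winfinite-recursion,
-Wpessimizing-move,
-Wredundant-move, and
-Wrange-loop-analysis; treat
those exact flag names as an
assumption rather than something
stated in the session.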
That's it for new diagnostics.
Let's move on to advancements
in compiler optimization.
We've made improvements
throughout the LLVM compiler
to optimize the runtime
performance of your code.
We've picked just a
few to highlight today.
We'll talk about improvements
to Link-Time Optimization,
highlight new code
generation optimizations,
and describe techniques
for arm64 cache tuning.
It has been a couple of
years since we've talked
about Link-Time Optimization
and we have major
improvements to share.
Link-Time Optimization, or
LTO, optimizes the executable
as a single monolithic unit.
It inlines functions across
source files, removes dead code,
and performs other powerful,
whole program optimizations.
LTO blurs the line between
the compiler and the linker.
To understand how LTO
works, let's first look
at the traditional
compilation model.
Let's say we have
four source files.
The first step is to compile,
producing four object files.
The object files are linked with
frameworks to produce an app.
An LTO build starts
the same way,
by compiling the sources
into object files.
In LTO, these object files
contain extra optimization
information which
enables the linker
to perform Link-Time
Optimizations,
and produce a single,
monolithic object file.
The output of LTO is
linked with frameworks
such as Foundation,
to produce the app.
LTO maximizes performance.
Apple uses LTO extensively
on our own software,
typically speeding
up executables
by 10 percent compared to
regular release builds.
Its effects multiply
when combined
with profile-guided
optimization.
It can also reduce
code size dramatically
when optimizing for size.
However, it has costs
at compile time.
The monolithic optimization
step can have large memory
requirements, doesn't take
advantage of all your cores,
and has to be repeated
in incremental builds.
Large C++ programs
with debug info
have been the most
expensive to compile.
Over the last two
years, we've worked hard
to reduce the overhead.
For example, let's look
at the memory usage
for linking the Apple
LLVM compiler itself
with LTO and debug info.
Smaller bars are better.
In Xcode 6, this took
over 40 gigabytes.
Since then, we've reduced
the memory usage by 4x.
We've also reduced
compile time by 33 percent.
The Line Tables Only debug
information level uses even less
memory in LTO.
Linking LLVM itself
now takes only 7 gigs.
LTO has never been better.
But there is still a compile
time trade-off, particularly
for incremental builds.
We have an exciting
new technology
that designs away these costs.
Incremental LTO scales
with your system.
It performs global
analysis and inlining
without combining object files.
This speeds up the build because
it can optimize each object file
in parallel.
Moreover, since the
object files stay separate,
a build cache
in the linker keeps
incremental builds super fast.
Let's look at how an
incremental LTO build works.
The compile step is the
same as monolithic LTO,
producing one object file
for each source file.
Without combining object files,
the linker runs an LTO analysis
of the entire program.
This analysis feeds
optimizations
for each object file, enabling
the objects to inline functions
from each other, as well
as other powerful whole
program optimizations.
The LTO optimized object files
are stored in a linker cache,
before being linked with
frameworks to produce the app.
The resulting runtime
performance of programs built
with incremental LTO is
similar to monolithic LTO,
with a few benchmarks
running slower,
and a few running faster.
But it's fast at
compile time too.
Let's look at build times
with my favorite C++ project:
The Apple LLVM compiler itself.
In this graph, smaller
bars are better.
The time at the top is
for building without LTO.
Monolithic LTO adds
significant build time overhead,
taking almost 20
minutes, instead of 6.
Incremental LTO is much
faster at under 8 minutes,
adding only 25 percent overhead.
Let's zoom in on the link step
where LTO is doing its work.
Without LTO, linking the
Apple LLVM compiler itself
takes less than two seconds.
You can't see the
bar in the graph,
because the linker
isn't performing any
compiler optimizations.
Monolithic LTO takes
almost 14 minutes,
even when it can use all your cores.
Incremental LTO takes two
and a quarter minutes,
more than 6x faster
than monolithic LTO.
Memory usage is great too.
Without LTO, linking the Apple
LLVM compiler uses a little more
than 200 megabytes.
As we saw earlier, monolithic
LTO takes 7 gigabytes.
Incremental LTO uses
less than 800 megabytes.
The scaling is incredible.
[ Applause ]
All these results are
for a fresh build.
It gets even better.
With incremental LTO,
incremental builds don't
repeat unnecessary work.
Let's look at an example
where the controller changes,
and you start an
incremental build of the app.
Changing the controller
invalidates the link itself,
but the other LTO object files
are still in the linker cache.
However, if a function from the
controller is inlined into main,
then main needs to be
reoptimized at LTO time as well.
So we start the build, and
only the controller needs
to be recompiled.
After rerunning LTO
analysis, both controller.o
and main.o are optimized and
new LTO object files are stored
in the linker cache.
Then the LTO objects are linked
as before to produce the app.
Incremental LTO gives you
the performance you expect
from an incremental build.
When a source file with small
helper functions is changed,
object files that use
them get reoptimized.
But for a typical
incremental build,
most LTO object files are linked
directly from the linker cache.
Let's take a final
look at the time
to link the Apple
LLVM compiler itself.
The top three bars
show the times
for a fresh build from before.
If we change the implementation
of an optimization pass,
monolithic LTO takes
the same amount of time.
It's the same as a fresh build.
But incremental LTO
takes only 8 seconds.
This is 16x faster
than the initial build,
and 100x faster than
monolithic LTO.
[ Applause ]
This is stunning:
state-of-the-art Link-Time
Optimization,
with low memory requirements
and fast incremental builds.
Try incremental LTO today.
The improvements to
LTO are fantastic,
but if you're using LTO
with a large C++ project,
you can minimize compile time
with the Line Tables Only
debug information level.
This gives you rich back traces
in the debugger,
at the lowest cost.
That's it for Link-Time
Optimization.
I invite Gerolf on stage to talk
about new code generation
optimizations.
[ Applause ]
>> So, let's move on
with optimizations
in the code generator.
We put a lot of effort into
the Xcode 8 Apple LLVM compiler
to improve the performance
of all apps.
In this section, I will
talk about three of them:
Stack packing, shrink wrapping,
and selective fused
multiply-add.
Let's start with stack packing.
This is about local variables
and runtime stack memory.
The Apple LLVM compiler
has always had optimizations
that try to reduce
stack memory usage.
In Xcode 8, the compiler
got better at doing
that than ever before.
And here is why.
Let's look at that example.
Pay attention to the definition
of x inside the scope
of the if statement.
Then look at the definition
of y after the if statement.
If the compiler does not
optimize this code snippet,
you would expect that the
two variables x and y
live at two different
stack locations
on the runtime memory stack.
However, according to C-family
language rules, the lifetime
of a variable ends with the
scope where it is defined.
And this is what the
compiler exploits.
Let's take a look at what this
does to our two variables.
x is defined inside
the if statement.
The scope of that if statement
ends just before y is defined.
When y is defined, the
lifetime of y begins.
And what you see in this picture
is that the lifetimes of x
and y do not overlap.
So the compiler can assign
them the same stack location.
This reduces the total stack
memory required for your program,
and that may provide better
performance for your app.
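A minimal sketch of the shape of that code; the names and the helper function are illustrative assumptions:

    void use(int *p);

    void stackPacking(int cond) {
        if (cond) {
            int x = 1;   // x's lifetime ends at the closing brace
            use(&x);
        }
        int y = 2;       // y's lifetime starts after x's has ended,
        use(&y);         // so x and y can share one stack location
    }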
Now this is all great, but
there is a little caveat
that comes with it.
Programming language rules are
sometimes really hard to check.
And when they are violated,
your program may give
unpredictable results,
or in a technical term, it
may have undefined behavior.
Let's take a look
at the example.
Pay attention to the
pointer variable.
Inside the if block, it gets
assigned the address of x.
Now look at the print statement.
At that point, the address
of x is used via the
pointer variable.
So what is happening
here is that the address
of x escapes the scope
of the if statement,
and that violates our
language rules:
it is undefined behavior.
Luckily, it is easy to fix.
It's just something
to be aware of.
The way you fix this is by
simply extending the scope
of the lifetime for x by moving
the definition of x outside the
if statement, just before
the condition is checked.
And now the definition of x is
in the same scope as the address
that is used in the
print statement.
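Sketches of the broken version and the fix, again with assumed names:

    #include <stdio.h>

    void broken(int cond) {
        int *pointer = NULL;
        if (cond) {
            int x = 42;
            pointer = &x;  // the address of x escapes its scope...
        }
        // ...so this read is undefined behavior: x's lifetime is over.
        printf("%d\n", *pointer);
    }

    void fixed(int cond) {
        int x = 42;  // fix: define x just before the condition is checked
        int *pointer = NULL;
        if (cond) {
            pointer = &x;
        }
        if (pointer)
            printf("%d\n", *pointer);  // x is still in scope here
    }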
The takeaway from this is
simple: please make every effort
to make your program adhere
to programming language rules.
That will give a much
better experience for you,
for the optimizing
compiler, and for your users.
That is stack packing.
Let's move on to
shrink wrapping.
Shrink wrapping is about the
code the compiler generates
at the entry and exit
of your function.
It's resource management code
that manages the runtime
stack and the registers.
The observation here is
that this code is not needed
for all the paths
through your function,
and shrink wrapping will
place these resource-managing
instructions only in the places
where they are actually needed.
So let's take a look
at a simple example.
So here's a simple function that
takes two parameters, a and b,
and simply compares the two.
And if a is smaller than b,
then it calls a function foo,
passing the address of a
local variable as a parameter,
and eventually it returns.
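A sketch of that function; foo and the names are assumptions:

    void foo(int *p);

    void maybeCall(int a, int b) {
        if (a < b) {
            int local;
            foo(&local);  // only this path needs the stack frame and
        }                 // saved registers; shrink wrapping moves the
    }                     // entry/exit code next to this call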
To understand shrink wrapping,
let's look at the
pseudo assembly code akin
to the code the compiler
actually generates.
So here you'll see an entry code
that allocates stack memory,
saves registers, does
the comparison, branches.
And the exit block that
restores the registers,
deallocates the stack,
and returns.
And then if the condition
is right,
the function foo is called.
So what we see here is we have
two paths in that program.
One from the entry block
directly to the exit block,
and another from the entry block
through the call to the exit block.
The key observation here is
that the resource
managing instructions
for the runtime stack memory
and for the registers
are only needed
because we have a call
instruction in our function.
So what shrink wrapping does is
recognize this condition
and move these instructions
out of the entry block and out
of the exit block to the places
where they are actually needed.
So it shrinks the lifetimes
of the resource managing
instructions and wraps them
around the region where
they are actually needed.
In this case, the
region is just the call.
But now imagine
that the hot code
in your function is the path
from the entry to the exit.
No longer do we have
to execute the resource
allocating instructions.
And if this is a hot function,
and you have many functions
like this,
you can imagine that this
way we can save millions
of instructions, providing
nice performance gains,
and also power savings
for your app.
And that is shrink wrapping.
[ Applause ]
Let's move on with
fused multiply-adds.
It takes us back to very
simple arithmetic operations:
additions and multiplications.
The arm64 processor has
a single instruction,
the fused multiply-add,
that computes an expression
like a plus b times c
in one instruction.
You would naively assume that
whenever you see an expression
like this, it's always best
to generate this instruction.
And that's exactly what the
compiler has done so far.
But it may surprise you:
in some cases, it's
actually faster
to generate two
separate instructions,
an add instruction
and a multiply instruction,
to get better performance
for your app.
Why? I've prepared a simple
example to demonstrate this.
This function takes
four integer parameters
and it computes a
simple expression,
a times b plus c times d.
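A sketch of that function:

    // On arm64, the compiler can either fuse one of the products into
    // a multiply-add, or emit two independent multiplies plus an add.
    int computeExpression(int a, int b, int c, int d) {
        return a * b + c * d;  // a*b and c*d can execute in parallel
    }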
What does the code look
like when the compiler
generates a single fused
multiply-add instruction?
Well, it will compute a times
b; then the multiply-add has
to wait for that result.
And the multiply-add
will compute the value
of our expression.
How long does this take?
It takes four cycles
for the multiply, and four
cycles for the add.
So in total, that simple
sequence takes eight cycles.
When we generate two
multiplies and one add,
how can that be faster?
Let's take a look
at that sequence.
First, the compiler issues
a multiply that computes
a times b, then one that
computes c times d.
Eventually it adds the results.
The secret here is that modern
processors have instruction
level parallelism.
They can execute two
or more multiplies
at the same time, in parallel.
So what is happening here is
that the two multiplies
get executed in parallel.
So within four cycles, we don't
get the result of one multiply,
but the results of
two multiplies.
Then we just add the results
in one cycle, and now we see
that to compute the
value of this expression,
we only need five cycles.
Let's compare this with
the fused multiply-add sequence,
and now we see that
for this simple sequence,
we get a speedup of about 2x.
So with selective fused
multiply-add, you can now speed
up many simple expressions
in your app.
And that is fused multiply-add.
[ Applause ]
So let's move on to
arm64 cache tuning.
Here I will talk about two
techniques the compiler has
to impact which data gets
stored into the cache,
because the data that
gets stored into the cache
impacts the performance
of your app.
Before we go into the details
of what the compiler is doing,
I want to quickly review
the memory hierarchy.
At the top, you see
the main memory
that hosts the program
variables.
At the bottom, you
see the registers.
It's very slow to load data from
main memory to the registers.
To bridge that gap, there's
a cache, a temporary storage.
It's about 10,000
to 100,000 times smaller than
main memory, but it's much,
much faster to load
data out of the cache.
About 10 to 100 times faster
than data is loaded
out of main memory.
We not only load a
single register value,
but an entire cache
line out of main memory.
So a cache line contains more
than one register value.
The reason why this design
is usually so successful
is because your program data
have two locality properties:
Temporal locality
and spatial locality.
Temporal locality simply means
that data your program accesses
now will be accessed
again very soon.
Spatial locality simply means
that when your program
accesses data now,
it will also access
neighboring data.
So when your program accesses
an array element,
it will also access
the element next to it.
When it accesses a field
in your data structure,
it will also access
a field next to that.
And so on and so forth.
Now, you look at this
design, and you wonder --
so it's so fast to load
data out of the cache.
Can we somehow magically
preload the data into the cache,
out of main memory while the
processor is doing some other
operations, so that all
data get loaded quickly
when we need them, when
our programs need them?
It might surprise you, on your
processor, in your iPhone,
that magic is already happening.
It's hardware prefetching.
The processor looks at each
address that is loaded,
and tries to find patterns
in those addresses.
When it finds a pattern,
it predicts which data your
program will need in the future,
prefetches this data out
of main memory, again,
while other computations
are happening, puts the data
into the cache and eventually
when your program needs them,
the program can load them
quickly, out of the cache.
Now this year, we worked
with the hardware architects
to see how the compiler
can make
that prefetching magic
work even better for your apps.
And we found a few patterns.
So what the compiler does now,
it analyzes your source code,
predicts which data your
app will need in the future,
and issues prefetch
instructions for this data.
Now while your app is running,
the prefetch instruction
executes, fetches the data
from main memory, puts
it into the cache,
and when your program
needs those data, it simply
and quickly loads
it out of the cache.
Your program no longer
has to wait for the data
to be loaded out of main memory.
And that is the magic
of software prefetching.
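The compiler inserts these prefetches automatically, but as an illustration of the idea, here is what a manual version might look like using the __builtin_prefetch builtin; the loop, the names, and the prefetch distance of 16 are assumptions:

    #include <stddef.h>

    long sumArray(const long *data, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; ++i) {
            // Hint that data[i + 16] will be needed soon, so it can be
            // fetched into the cache while the additions execute.
            if (i + 16 < n)
                __builtin_prefetch(&data[i + 16]);
            total += data[i];
        }
        return total;
    }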
So far--
[ Applause ]
-- so far, I have talked
about optimizations the compiler
does automatically for you.
For the next optimization,
non-temporal stores, the
compiler will need your help.
To understand what is going on
there, we need to take a look
at what is happening when data
is stored into main memory.
So again, let's take a look
at our memory hierarchy,
and assume we have a simple
assignment in our program
where we want to assign a
value of, say, 100 to a variable
in main memory, which has
the current value of 55.
The first thing that
happens is that the old data
from the address you want
to store the new value
in is loaded into the cache.
And since we load data from main
memory, not only the old data
for that variable is loaded,
but also the neighboring data,
because we fill the
entire cache line.
In the second step, we store
the value in our register
into the cache line.
And then finally, when
the cache line is
needed for different data,
our value gets written back
in the third step
to main memory.
So it's a three-step process:
load data from main
memory into the cache,
store the register
value into the cache,
and then eventually store the
data back into main memory.
And this is all because data
usually have the locality
property, and specifically
temporal locality.
But what if your data do
not have temporal locality?
Wouldn't it be much faster to
simply store the value directly
from the register into main
memory in just one single step?
This is exactly what
non-temporal stores do.
They avoid the extra
load of a cache line.
And the other benefit is that,
since you know the data are
no longer needed, they don't go
into the cache, so the cache
can now hold other data
for your app that
are more useful.
The way you instruct
the compiler
to generate non-temporal stores
is with a compiler builtin.
So you use the builtin
to instruct the compiler
to generate the non-temporal
stores.
It takes two parameters:
the value you want to
store, and the address
where you want to store it.
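As a sketch, using Clang's __builtin_nontemporal_store; the copy loop and names are illustrative assumptions:

    #include <stddef.h>
    #include <stdint.h>

    // Copy a large buffer whose destination won't be re-read soon.
    void copyNoReuse(uint64_t *dst, const uint64_t *src, size_t n) {
        for (size_t i = 0; i < n; ++i) {
            // Store straight toward main memory, skipping the cache
            // line fill and leaving the cache free for useful data.
            __builtin_nontemporal_store(src[i], &dst[i]);
        }
    }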
When do you want to use
non-temporal stores?
When there is no reuse, no
temporal locality in your code;
when you copy a large chunk
of data, preferably larger
than the size of the cache;
and, to make it really count
for your app, when it is
where the performance is hiding.
So it should be in hot loops.
And you don't want to use
non-temporal stores
when any of these
conditions does not hold.
What can you gain from this?
In this slide, we look
at the performance gains
from the non-temporal stores
on three benchmarks that contain
hot loops that look similar
to the example that
I showed you before.
So what this data shows is
that for very common
loop bodies,
you get very significant
speedups, in the range
of 30 to 40 percent.
And that is what non-temporal
stores can hopefully do
for many hot loops in your app.
[ Applause ]
It has been a long journey.
We have looked at many,
many new features and a lot
of great new optimizations
that the new compiler
brings to your app.
So we've talked about how
the Apple LLVM compiler is based
on an open source project.
You can interact with us in
the open source community,
even provide patches for
your favorite compiler.
We have seen many
great new features
like Objective-C
class properties
and C++11 thread-local
storage support.
Now the new libc++ has
full support for C++14
and has many new and improved
features, but keep in mind
that libstdc++
is deprecated.
You get many more
warnings and diagnostics
to make your code
cleaner than ever before.
And you heard about this
fantastic new feature,
incremental LTO, which gives you
basically the same performance
as monolithic LTO, and really,
truly, awesome compile time.
And we talked about a
number of optimizations
in the code generator that
automatically should speed
up all your apps,
and we finally talked
about non-temporal stores, where
the compiler provides the means,
but you provide the smarts
to use them.
So I hope all this convinces you
that we put together a really
great compiler release.
We couldn't be happier about it.
Please get your hands on it.
Check out what it
can do for your app.
You'll find more
information on our website.
There are a number of
really cool talks out there
that you should be
interested in.
Check them out.
Thank you all for watching.
Thank you for coming.
Have a great time
at the conference.
[ Applause ]