Transcript
>> Hello everybody.
[ Applause ]
Thanks for coming out this
morning.
I'm Louis Gerbarg.
I work on the dyld Team, and
today we're going to talk about
App Startup, Past, Present and
Future.
So we got a lot to go through,
so I'm just going to get into
it.
So first off I want to do an
overview of what we're going to
be talking about today.
So first we're going to review
some advice we gave from last
year.
Then I want to talk about some
new tooling we've developed to
make finding certain types of
app startup time problems
easier.
After that I want to take a side
tour into a brief history of
dyld on our platforms, and then
I want to discuss the all new
dyld that we're going to be
shipping in macOS High Sierra
and iOS 11.
And then finally, I want to talk
about best practices for this
new dyld.
So before that I just want to do
a little bit of bookkeeping.
So first off, we want your
feedback.
So if you have anything you want
to tell us, please file bugs
with the title DYLD USAGE, and
hopefully they will get to us.
And now I want to talk about
some terminology that I'm going
to use in the rest of this talk.
So first off, what does startup
time mean?
And startup time for the
purposes of this talk means time
spent before main.
Now, if you are writing an app,
you have to do more than that.
After main there will be nib
loading and things like that,
and you have code that runs in
your UIApplication delegate and
whatnot, but you have more
visibility into that, and there
are many other talks about it.
Today we just want to talk about
what happens before your main
executes and how you can speed
that up.
Additionally, I want to define a
launch closure, and this is a
new term.
And a launch closure is all of
the information necessary to
launch your application.
So what dylibs it uses, what the
offsets in them are for various
symbols, where their code
signatures are.
And with that, let's go into the
main body of the talk.
So last year I said do less, and
I'm going to say that again this
year and I'm always going to say
that because the less you do,
the faster we can launch.
And no matter how much we speed
things up, if we have less work,
it's going to go faster.
And the advice is basically the
same.
You should use fewer dylibs,
and if you can, you should embed
fewer dylibs.
System ones are better in
certain ways from a time
perspective, and we'll go into
that.
You should declare fewer classes
and methods and you should run
fewer initializers.
Finally, I'm going to tell you
you can do a little bit more of
something.
You can use more Swift, and the
reason is that Swift is designed
in such a way that it avoids a
lot of the pitfalls that C, C++,
and Objective-C allow.
Swift does not have static
initializers.
Swift does not allow certain
types of misaligned data
structures that cost us time in
launch.
So, in general, moving to Swift
will make it easier for you to
get very responsive app startup.
Also, there are the Swift size
improvements, and smaller is
better, so please move to the
new Swift that we've shipped
this year with the size
improvements; that's going to
help you out.
So now let me talk about some
new tooling we have.
So new in iOS 11 and macOS High
Sierra, we've added Static
Initializer Tracing to
Instruments.
So, yes, this is pretty exciting
stuff because initializers are
code that have to run before
main to set up objects for you,
and you haven't had much
visibility into what happens
before main.
So they're available through
Instruments and they provide
precise timing for each static
initializer.
So with that, I'd like to go to
a demo right now.
So over here I have an
application, and as most
applications at WWDC are, it's a
way of sharing cute pictures of
animals.
So here, let me launch it.
And, you know, it's taking a
little while here; it's still
taking a while, and then it gets
up and we can see some
chinchillas and some cats.
So let's take a look at why it
took that long.
So I'm going to go and I'm going
to rerun it under Instruments.
So we'll stop the execution of
the current one and run it.
And now if we go in, I'm going
to start with a blank template
and we can add the new Static
Initializer tool, which is right
there.
And while we're at it, I'm also
going to add a Time Profiler
because it's always kind of nice
to see what's going on.
There we go.
Okay. So now that we have those,
let's start running our
application.
So we're getting in our trace
data, and it's still not up, but
it just came up and as you can
see in the background there, we
had something fill in there.
So I'm going to just zoom in so
you can get a look, and I have a
function there called
waitForNetworkDebugger, and
that's right, because I was
loading these off of a JSON feed
that we had up on our site, and
I was trying to debug that.
So let's go and -- I just want
to actually take a quick look
here in the CPU Usage tool.
So you can see that that
initializer's roughly the same
length as my CPU usage.
So if I go down there, I can
actually drill down into dyld,
and if I do that, we're going to
see what was taking all that
time: 9.5 seconds, 9.5 seconds
spent in the initializer.
This is pretty deep; you don't
usually have to do this, but I
want to show you what's going
on.
And down in here I can finally
see waitForNetworkDebugger,
which is what we saw up in the
initializer call, but now it's
very easy for you to find that.
So now that we've done that, I'm
going to go back over into Xcode
and, oh, yeah, that's the
waitForNetworkDebugger call that
I implemented.
I implemented it in C, because
Swift won't even let you do
something like this, which is
good, because this is a bad
idea; but I created a
constructor there.
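For context, a static initializer in C looks something like this. This is a hypothetical sketch; the demo's actual function body isn't shown in the transcript, and the busy-wait is only indicated in a comment.

```c
/* A function marked __attribute__((constructor)) is a static initializer:
   dyld runs it before main, so any time it spends delays app launch. */
int initializer_ran = 0;

__attribute__((constructor))
static void waitForNetworkDebugger(void) {
    initializer_ran = 1;
    /* The demo version blocked here for roughly 9.5 seconds, polling for
       a network debugger connection before letting launch continue. */
}
```

The Static Initializer Tracing instrument attributes the elapsed time to this function by name, which is how it showed up in the demo.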
So if I go back to my source
code, I can just delete that
function, because it was just
for debugging anyway.
If I run it, my app's going to
come up almost instantly.
So we just saw how to quickly
find which static initializers
are causing you slowdowns.
This will work across multiple
dylibs, including system dylibs
that may be taking a long time
because of inputs you've given
them, such as complicated nibs.
It depends on new infrastructure
in High Sierra and iOS 11's
kernel and dyld, so you need to
be running the new builds to see
this.
And it catches most initializers
now and there's some edge cases
we're still working on adding,
but we think this is going to
allow you to quickly find out
what is taking time during your
app launch so that you get
quicker, more responsive
application launches that will
make your users happy.
Thank you.
[ Applause ]
Okay. So now I said we'd do a
brief history of dyld.
So: Dynamic Linking Through the
Ages.
Originally we shipped the first
dyld; these didn't have version
numbers, but we're giving them
some retroactively.
This was dyld 1, and it shipped
as part of NeXTStep 3.3 back in
1996.
Before that, NeXT used static
binaries.
And it's worth noting this
predates the POSIX dlopen calls
being standardized.
Now, dlopen did exist on some
Unix systems, as proprietary
extensions that other people
later adopted.
And NeXTStep had different
proprietary extensions, so
people wrote third-party
wrappers on the early versions
of Mac OS X to support standard
Unix software.
The problem was they didn't
quite support the same
semantics.
There were some weird edge cases
where it didn't work, and
ultimately they were kind of
slow.
It also was written before most
systems used large C++ dynamic
libraries, and this is
important.
C++ has a number of features,
such as its initializer ordering
and the one-definition rule,
that work well in a static
environment but are actually
fairly hard to do, at least with
good performance, in a dynamic
environment.
So large C++ code bases cause
the dynamic linker to have to do
a lot of work and it was quite
slow.
We also added one other feature
before we shipped macOS 10.0,
Cheetah, and that's called
prebinding.
And for those of you in the
audience who know what
prebinding is, I know it was
kind of painful.
For the rest of you: prebinding
was a technology where we would
try to find fixed addresses for
every dylib in the system and
for your application.
The dynamic loader would try to
load everything at those
addresses, and if it succeeded,
it would edit all of those
binaries to have those
precalculated addresses in them;
then the next time, when it put
them at the same addresses, it
didn't have to do any additional
work.
And that sped up launch a lot,
but it meant that we were
editing your binaries on every
launch, and that's not great for
all sorts of reasons, not the
least of which is security.
So then came dyld 2, and we
shipped that as part of macOS
Tiger.
And dyld 2 was a complete
rewrite of dyld.
It had correct support for C++
initializer semantics, so we
slightly extended the mach-o
format and we updated dyld so
that we could get efficient C++
library support.
It also has a full native dlopen
and dlsym implementation with
correct semantics, at which
point we deprecated the legacy
APIs.
They are still on macOS; they
have never shipped on any of our
other platforms.
It was designed for speed, and
because of that it had limited
sanity checking; we did not have
the malware environment we have
today.
Because of that, it also had
security issues, and we had to
go back and retrofit a number of
features to make it safer on
today's platforms.
Finally, because it was so much
faster we could reduce the
amount of prebinding.
Rather than editing your
applications, we just edited the
system libraries and we could do
that just at software update
times.
And if you've ever seen the
phrase optimizing system
performance appear in your
software update, that was added
to the installer to be displayed
during the time we were updating
prebinding.
Nowadays it is used for all the
optimizations, but that was the
impetus.
So we shipped dyld 2 back then
and we've done a number of
improvements over the years,
significant improvements.
First off, we've added a ton
more architectures and
platforms.
Since dyld 2 shipped on PowerPC,
we've added x86, x86 64 arm,
arm64, and a number of
subvariants of those.
We've also shipped iOS, tvOS,
and watchOS, all of which
required significant new work in
dyld.
We've improved security in a
number of ways.
We added codesigning support,
and we added support for ASLR
(Address Space Layout
Randomization), which means that
every time you load a library it
may be at a different address.
If you want more details on
that, last year's talk, where
Nick went into extreme detail on
how we launch an app, goes into
it.
And finally, we added
significant bounds checking to a
number of things in the Mach-O
header so that you couldn't do
certain types of attacks with
malformed binaries.
Finally, we improved
performance, and because we
improved performance, we could
get rid of prebinding and
replace it with something called
the shared cache.
So what is the shared cache?
Well, it was introduced in iOS
3.1 and macOS Snow Leopard, and
it completely replaced
prebinding.
It's a single file containing
most of the system dylibs.
And because we merged them into
a single file, we can do certain
types of optimizations.
We can rearrange all of their
text segments and all of their
data segments, and rewrite their
entire symbol tables, to reduce
the size and to make it so we
need to map fewer regions in
each process.
It also allows us to pack binary
segments and save a lot of RAM.
It effectively is a prelinker
for the dylibs.
And while I'm not going to go
into any particular
optimizations here, the RAM
savings are substantial.
On an average iOS system, this
makes a difference of about 500
megabytes to a gigabyte of RAM
at runtime.
It also prebuilds data
structures that dyld and
Objective-C are going to use at
runtime so that we don't have to
build them on launch.
And again, that saves more RAM
and a lot of time.
It's built locally on macOS, so
when you see "optimizing system
performance," we are running
update_dyld_shared_cache, among
other things; but on all of our
other platforms we actually
build it at Apple and ship it to
you.
So now that I've talked about
the shared cache, I want to move
into dyld 3.
dyld 3 is a brand-new dynamic
linker, and we're announcing it
today.
It's a complete rethink of how
we do dynamic linking and it's
going to be on by default for
most macOS system apps in this
week's seed, and it will be on
by default for all system apps
on 2017 Apple OS platforms.
We will completely replace dyld
2 in future Apple OS platforms
for all third-party apps as
well.
So why did we rewrite the
dynamic linker again?
Well, first off, performance.
In case that's not a recurring
theme, we want every ounce of
launch speed we can get.
Additionally, we asked: what is
the theoretical minimum we could
do to get an app up and running,
and how could we achieve that?
Security. So as I said, we
retrofitted a number of security
features into dyld 2, but it's
really hard to add that kind of
stuff after the fact.
I think we've done a good job
with it in recent years, but
it's really, really difficult to
do that.
And so can we have more
aggressive security checking and
be designed for security up
front?
Finally, testability and
reliability.
Can we make dyld easier to test?
So Apple ships a ton of great
testing frameworks, like XCTest,
that you should be using, and we
should be using, but they depend
on low-level features of the
dynamic linker to insert those
libraries
into processes, so they
fundamentally cannot be used for
testing the existing dyld code,
and that also makes it harder
for us to test security and
performance features.
And so how did we do that?
Well, we've moved most of dyld
out of process.
It's now mostly just a regular
daemon and we can test that just
like everybody else does with
standard testing tools, which is
going to allow us to move even
faster in the future in
improving this.
It also lets the bit of dyld
that stays in process be as
small as possible and that
reduces the attack surface in
your applications.
It also speeds up launch because
the fastest code is code you
never write, followed closely by
code you almost never execute.
So to tell you how we did this
I'm going to briefly show how
dyld 2 launches an app.
And again, we went into this in
much more detail in last year's
talk, Optimizing App Startup
Time, so if you want to pause,
if you're watching this on
video, and go watch that, that
might be a good idea.
Or if you just want to follow
along here, I'm going to go
through it briefly.
So first off, we have dyld 2 and
your app starts launching.
We have to parse your Mach-O,
and as we parse it we find what
libraries you need; they may
have other libraries that they
need, and we do that recursively
until we have a complete graph
of all your dylibs.
For an average application on
iOS, that's between 300 and 600
dylibs, so it's a lot of them
and a lot of work.
We then map in all the mach-o
files so we get them into your
address space.
We then perform symbol lookups:
if your application uses printf,
we go and see that printf is in
libSystem, we find its address,
and we basically copy that into
a function pointer in your
application.
Then we do what's called binding
and rebasing: binding is where
we copy those pointers in, and
rebasing is needed because
you're loaded at a random
address, so all of your pointers
have to have that base address
added to them.
And then finally, we can run all
of your initializers, which is
what I showed the tooling for
earlier, and at that point we're
ready to call your main in
launch, and that's a lot of
work.
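As a rough sketch of what rebasing means, here is hypothetical code, not dyld's actual implementation: every pointer stored in the image gets the ASLR slide added to it.

```c
#include <stddef.h>
#include <stdint.h>

/* Rebasing, roughly: pointers in an image's data segments were written
   assuming a preferred base address. Once ASLR loads the image somewhere
   else, each stored pointer needs the slide (actual base minus preferred
   base) added to it. */
void rebase_pointers(uintptr_t *slots[], size_t count, intptr_t slide) {
    for (size_t i = 0; i < count; i++)
        *slots[i] += slide;  /* fix up one stored pointer */
}
```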
So how can we make this faster
and how can we move it out of
process?
Well, first off, we identify the
security-sensitive components.
From our perspective, the
biggest of those are parsing
Mach-O headers and finding
dependencies, because malformed
Mach-O headers allow people to
perform certain attacks, and
your applications may use
@rpaths, which are search paths;
by malforming those or inserting
libraries in the right places,
people can subvert applications.
So we do all of that out of
process in the daemon, and then
we identify the expensive parts
of it, which are cache-able, and
those are the symbol lookups.
Because in a given library,
unless you perform a software
update or change the library on
disk, the symbols will always be
at the same offset in that
library.
So we've identified these.
Let me show you how they look in
dyld 3.
We moved those all up front, at
which point we write a closure
to disk.
As I said earlier, a launch
closure is everything you need
to launch the app, and we can
use that in process later.
So dyld 3 is three components.
It's an out-of-process Mach-O
parser and compiler; it's an
in-process engine that runs
launch closures; and it's a
launch closure caching service.
Most launches use the cache and
never have to invoke the
out-of-process Mach-O parser or
compiler.
And launch closures are much
simpler than Mach-O.
They are memory-mapped files
that we don't have to parse in
any complicated way, we can
validate them simply, and they
are built for speed.
And so let's talk about each one
of those parts a little bit
more.
So dyld 3 is an out-of-process
Mach-O parser.
So what does that do?
It resolves all the search
paths, all the rpaths, and all
the environment variables that
can affect your launch.
Then it parses the Mach-O
binaries and performs all of
those symbol lookups.
Finally, it creates the closure
with the results.
And it's a normal daemon, so we
can get that improved testing
infrastructure.
dyld 3 also has a small
in-process engine, and this is
the part that will be in your
process and that you will mostly
see.
All it does is validate that the
launch closure is correct, then
map in the dylibs and jump to
main.
And one of the things you may
notice is that it never needs to
parse a Mach-O header or perform
a symbol lookup.
We don't have to do those to
launch your app anymore.
And since that's where we're
spending most of our time, it's
going to result in much faster
app launches for you.
Finally, dyld 3 is a launch
closure caching service.
So what does that mean?
Well, for system apps, we build
the closures directly into the
shared cache.
We already have that tool that
runs and analyzes every Mach-O
in the system, so we can just
put the closures directly into
the shared cache, and they're
mapped in with all the dylibs to
start with; we don't even need
to open another file.
For third-party apps, we're
going to build your closure
during app install, and again
during system updates, because
at that point the system
libraries have changed.
So by default these will all be
prebuilt for you on iOS and tvOS
and watchOS before you even run.
On macOS, because you can side
load applications, the
in-process engine can RPC out to
the daemon if necessary on first
launch, and then after that it
will be able to use a cached
closure just like everything
else.
But like I said, that is not
necessary on any of our other
platforms.
So now that I've talked about
this dynamic linker that we'll
be using for system apps this
year and for your apps in the
future, I want to talk to you
about some potential issues you
might see with it so that you
can start updating your apps for
it now.
So first off, it is fully
compatible with dyld 2.x.
However, some existing APIs will
cause you to run slower or use
fallback modes in dyld 3, so
we'd like you to avoid those,
and we'll go into them in a
second.
Also, some existing
optimizations that you are doing
may not be necessary anymore, so
you don't have to rip them out
but, you know, it may not be
worth putting in a lot of
effort.
The other thing I want to talk
about is that we're going to
have stricter linking semantics.
So what do I mean by that?
Well, there's a lot of things
that maybe work most of the
time, but aren't actually
correct even today and so we've
identified a lot of those.
As we've been putting the new
dynamic linker in, that tends to
find all these edge cases.
So what we've been doing is
we've been putting in
workarounds for old binaries,
but we do not intend to carry
those forward.
We will do linked-on-or-after
checks to see what SDK you were
built with, and we will disable
those workarounds for new
binaries so that you fix these
issues.
So for new binaries, these will
surface as linking issues.
So, first off, I want to talk
about unaligned pointers in your
data segments.
So what do I mean by this?
Well, when you have a global
structure that points to a
function or another global
structure, that's a pointer that
we have to fix up before you
launch, and pointers must be
naturally aligned on our system
for best performance.
And fixing up unaligned pointers
is much more complex.
They can span multiple pages,
which can cause more page faults
and other issues, and they can
have atomicity issues related to
multiprocessors.
The static linker already emits
a warning for this: "ld:
warning: pointer not aligned at
address ...", where the address
is often in one of your data
segments.
And if you're fixing all your
warnings, hopefully you've
already taken care of this.
The seeds that we have out this
week have some issues with Swift
keypaths; those will be fixed,
so you can ignore them.
Other than that, please go and
fix these issues.
So for those of you who are
asking how would you get
something like this, I'm going
to just show you real quick.
If you don't know how, it takes
a lot of work.
You can't do it in Swift.
So again, use more Swift.
This code here will do it, so
let me show you what's going on.
First off, I have attributes
forcing specific alignment.
By default the compiler's going
to align things correctly for
you, but sometimes you may need
special alignments, and in this
case I've said to change
whatever the default alignment
rules are to one.
I've done that in two different
ways just to be really, really
bad, so you have to fix both of
these.
Then I constructed a global
variable.
That global variable sets a
pointer into one of those
structures, and that's going to
force the dynamic linker to fix
up that pointer on launch.
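The on-screen code isn't reproduced in the transcript; here is a sketch of what it plausibly looked like, with made-up names, using #pragma pack and the packed attribute as the two ways of forcing one-byte alignment.

```c
#include <stddef.h>

void callback(void) {}

/* Way 1: #pragma pack(1) drops the struct's alignment rules to one byte. */
#pragma pack(push, 1)
struct PackedA {
    char tag;              /* one byte... */
    void (*fn)(void);      /* ...so this pointer lands at offset 1: misaligned */
};
#pragma pack(pop)

/* Way 2: the packed attribute does the same thing. */
struct __attribute__((packed)) PackedB {
    char tag;
    void (*fn)(void);
};

/* A global containing a function pointer: the dynamic linker must fix this
   pointer up at launch, and here it sits at an unaligned address. */
struct PackedA g_bad = { 'x', callback };

/* The fix: remove the packing, or put the pointer first so it is
   naturally aligned even in a packed layout. */
struct Fixed {
    void (*fn)(void);      /* offset 0: naturally aligned */
    char tag;
};
struct Fixed g_good = { callback, 'y' };
```

struct Fixed shows the repair: with the pointer first, it is naturally aligned without touching the packing.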
So if you see code like this,
you can just remove the
alignments, or you could
rearrange the structure so that
the pointer goes first, which
gives it natural alignment.
And there's plenty of guides
online about C structure
alignment if you want to get
into the nitty-gritty, but
hopefully you don't have to deal
with this, and if you write
Swift, you definitely don't have
to.
So next off, eager symbol
resolution.
So what do I mean by this?
So dyld 2 performs what we call
lazy symbol resolution.
So I said up front that dyld has
to look up all those symbols,
and that's something expensive
that we want to cache.
It's actually too expensive to
run up front on existing
applications; it would take too
long.
So instead, we use a mechanism
we call lazy symbol resolution,
where, by default, the function
pointer in your binary for,
let's say, printf, doesn't point
to printf.
By default it points to a
function in dyld that returns a
function pointer to printf.
So when you launch and you call
printf, it goes into dyld; we
resolve printf and call it on
your behalf the first time, and
then from the second time on you
go straight to printf.
But since we are caching and
calculating all these symbols up
front now, there's no additional
cost at app launch time to find
them all up front, so we are
going to do that.
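A toy sketch of the lazy-binding idea in C. This is hypothetical illustration only; dyld's real mechanism uses stub functions and lazy pointer sections, not a plain global function pointer.

```c
/* The "lazy pointer" starts out aimed at a resolver. The first call pays
   for the lookup, rewrites the pointer to the real target, and calls it
   on the caller's behalf; every later call goes straight through. */
int lookups_performed = 0;

static int real_printf_like(int x) { return x * 2; }

static int resolver(int x);
int (*lazy_ptr)(int) = resolver;   /* initial state: points into "dyld" */

static int resolver(int x) {
    lookups_performed++;           /* the expensive symbol lookup */
    lazy_ptr = real_printf_like;   /* bind: rewrite the lazy pointer */
    return real_printf_like(x);    /* first call, made on your behalf */
}
```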
Now, having said that, missing
symbols behave differently when
you do this.
On existing lazy systems, if you
are missing a symbol, the first
call -- you'll launch correctly
and the first time you call that
symbol, you'll crash.
With eager symbols you'll crash
up front.
So we do have a compatibility
mode for this, and the way it
works is that we have a symbol
inside dyld 3 that automatically
crashes; if we can't find your
symbol, we bind that symbol
instead, so on first call you
will crash.
But again, that's how it works
on today's SDK.
In future SDKs we are going to
force all symbol resolution to
be up front.
So if you are missing a symbol,
you will crash, and that should
hopefully result in you
discovering crashes during
development instead of your
users discovering them at
runtime.
And you can simulate that
behavior today.
There's a special linker flag,
-bind_at_load.
As I said, it's much slower, so
please only put it in your debug
builds, but if you add it there
you'll get more reliable
behavior today, and it will get
you ready for what we're going
to be doing in dyld 3.
Again, only use it in your test
builds.
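In practice that might look like the following; the flag is ld's -bind_at_load, passed through the compiler driver with -Wl, and the file names here are made up.

```shell
# Debug builds only: bind every symbol eagerly at launch, so a missing
# symbol crashes at startup during development rather than at first call.
# In Xcode, add -Wl,-bind_at_load to OTHER_LDFLAGS for the Debug
# configuration only; it makes launch slower, so keep it out of Release.
clang -o MyApp main.c -Wl,-bind_at_load
```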
dlopen, dlsym, and dladdr.
So last year I got up here and
said please don't use them
unless you have to, but we
understand that you may have to,
and that's the same thing I'm
saying this year.
They have some problematic
semantics, but they are still
necessary in some cases.
In particular, symbols found
with dlsym have to be found at
runtime; we don't know ahead of
time what they are, so we can't
do all that prefetching and
presearching.
So as soon as you use dlopen or
dlsym, we go and read in all the
pages of your symbol table that
we didn't have to touch before,
so it's going to be a lot more
expensive.
Additionally, we might have to
RPC out to the daemon, depending
on how complicated it is.
So we're working on some better
alternatives.
We don't have those yet.
But we also need to hear about
your use cases to make sure
we're designing something that
will work for you.
So please, again, they're not
going away, but they will be
slower and we want your
feedback.
I want to take a second to talk
about dlclose specifically.
And so dlclose is a bit of a
misnomer.
It's a Unix API, so that's the
name, but on our system, if we
had been writing it, it probably
would have been called
dlrelease, because it doesn't
actually close the dylib: it
decrements a refcount, and only
if the refcount hits zero does
it close it.
And why is that important?
Well, it's not appropriate for
resource management.
If you have a library that
attaches to a piece of hardware,
you shouldn't shut down the
hardware in response to a
dlclose, because some other code
in your app may have opened that
library behind your back, and so
now your hardware is not
shutting down when you expect it
to.
You should have explicit
resource management.
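A small sketch of the refcount behavior, using the process's own handle via dlopen(NULL, ...) as a stand-in so it doesn't depend on any particular dylib path:

```c
#include <dlfcn.h>
#include <stddef.h>

/* dlclose only decrements a reference count; the library is closed only
   when the count reaches zero. Open twice, close once: still loaded. */
int still_loaded_after_one_close(void) {
    void *h1 = dlopen(NULL, RTLD_NOW);  /* refcount goes to 1 */
    void *h2 = dlopen(NULL, RTLD_NOW);  /* same library, refcount 2 */
    if (h1 == NULL || h2 == NULL) return 0;
    dlclose(h1);                        /* refcount 1: nothing unloads */
    /* Symbols still resolve through the remaining reference. */
    return dlsym(h2, "malloc") != NULL;
}
```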
There are also a number of
features on our platforms that
prevent dylibs from unloading,
and I'd like to go through a few
of those because maybe you do
them.
You can have Objective-C classes
in your dylib; that will make it
not unloadable.
You can have Swift classes; that
will also make it not
unloadable.
And you can have C __thread or
C++ thread_local variables.
All of these make it impossible
to unload a dylib.
So on macOS, where there's a
number of existing Unix apps,
obviously we will keep this
working, but because almost
every dylib on all of our other
platforms does one of these
things, effectively it hasn't
really worked on any of them
ever.
So we are considering making it
just a straight-up no-op that
will not do anything on any of
those platforms.
If there's a reason why that's a
problem, please, we want to hear
about it.
Finally, I want to talk about
dyld_all_image_infos.
This is an interface for
introspecting the dylibs in a
process, and it comes from the
original dyld 1.
But it's just a struct in
memory, it's not an API, and
that was okay when we had five
or ten dylibs.
But as we've gotten to 300, 400,
500 dylibs, the way it's
designed wastes a lot of memory,
and we want that memory back.
We always want our performance
and we always want our memory.
So we're going to take it away
in future releases, but we will
be providing replacement API's
for it.
And so it's very rarely used,
but if you're using it, again, I
want to know why you're using it
and how you're using it and make
sure we design API's that fit
your use case.
There are a number of bits of it
that are vestigial and don't
quite do what you expect, or
don't even work today; if you
aren't using those, they may
just go away, and we need to
hear about that.
So please let us know how you
use it.
So finally, let's talk about
best practices.
First, make sure you launch your
app with -bind_at_load in your
LDFLAGS, for debug builds only.
Fix any unaligned pointers in
your data segments.
Again, this warning is there.
You should try to be fixing all
of your warnings.
If you see it with the new Swift
keypath feature, you can ignore
that because we'll fix that.
Make sure you are not depending
on any terminators running when
you call dlclose.
And we want you to let us know
why you're using dlopen, dlsym,
dladdr, and the all image info
structures to make sure that our
replacements are going to suit
your needs.
In the case of the ones that are
part of POSIX, they will stay
around, they will just be lower
performance.
In the case of all image infos,
it is going to go away to save
that memory.
Please file bug reports with
DYLD USAGE in their titles so
that they get to us and we can
find out all of the use cases we
need to support.
And for more information, you
can go to this URL.
Related sessions.
So last year we did Optimizing
App Startup Time, so you may
want to go and watch that for a
refresher on how traditional
dynamic linking works.
It goes into much more detail
than I did here since I was
trying to discuss all the new
stuff we're doing.
So thank you everybody for
coming.
I hope you've had a great WWDC.
[ Applause ]