Transcript
[ Music ]
[ Applause ]
>> Good morning.
This is session 411,
System Trace in Depth.
My name is Chad Woolf.
>> And I'm Joe Grzywacz.
>> And we are performance
tools engineers for Apple.
Now in last year's In Depth
session we covered the
Time Profiler.
And we showed you how to analyze
your applications all the way
down to the disassembly level.
And our goal, then,
was to show you how
to make your code
as fast as possible.
Now at some point you may want
to run that optimized code
on multiple CPUs in order
to get even more work done.
But as you do this, you also
increase the system load.
Now as the system
load increases,
a couple of second-order effects
creep in, such as increases
in preemption, lock contention,
and virtual memory activity.
Now any one of these
three things is enough
to potentially offset the
gains that you had made
when you were doing
your time profiling.
So in today's session,
we're going to show you how
to use system trace to analyze
the second-order effects.
And we're going to show you how
to efficiently load the
system while still maintaining
optimum performance.
So our session's going
to look like this today.
We're going to talk a little
bit about system trace
and why it applies to
application developers.
And then Joe and I are going to
walk you through system trace.
We're also going to talk a
little bit about threading,
signposts, some a little
bit on virtual memory,
and show you some
best practices on how
to get the most out
of the tools.
So why system trace for
application developers?
Well, when your application
becomes front and center
on the device, from the users'
perspective that is the system.
They don't see your
application as a peer
with all these system
services and daemons contending
for shared resources
like CPU time and memory.
They just see your app.
And so if your app
stutters due to a flurry
of virtual memory
activity or due
to maybe a misprioritized
thread, they're going to come
to you looking for a fix.
So that's the bad news.
The good news is that when
your app is front and center
on the device, it is the most
important thing on the device.
So the operating system knows
this, and it's going to give you
as much CPU time and memory
as it can possibly spare.
So when we talk about
tuning with system trace,
we're not really talking about
tuning the system, we're talking
about tuning your application
to use the resources
that it's been given.
Now system trace is a
template in Instruments.
It works great on all four of our platforms.
And when you take a recording
with it, it puts the kernel
into a special tracing mode
that records all the scheduling
activity, system calls,
and virtual memory
operations that are occurring.
Now this can accumulate
over time.
It could be a lot of data.
So one of the changes we
made in Instruments 8 is
that we put the template into
Windowed Mode by default.
And what that means
is that we keep
about the last five seconds
worth of data around.
And the advantage here is that you can start your recording, set up your application,
take as long as you
need to get it
to reproduce the
performance problem,
and when that performance
problem reproduces, you hit stop
and you get that last
five seconds worth
of actionable data.
Now this is what five seconds
worth of data can look like.
These traces can
get pretty dense.
We found it's very
useful if you're able
to correlate this data with some
of the high-level activities
inside your application,
such as am I updating
a table view,
did I have a download
going on in the background,
maybe am I updating a graph.
Those kinds of high-level things.
So this year in Instruments
8 we included the Points
of Interest instrument.
Now the Points of Interest
instrument is essentially a
blank canvas where you tell
Instruments what you consider
interesting and we'll put it
up there on the graph for you.
Now you do this using
Signposts from within your code.
And Signposts have been
released for a while,
but we used to have -- the
way you used to get to it was
to call this direct system call
and you had to dig these macros
out of these header files and put it all together.
And it was kind of clunky.
Now on top of being clunky, in today's Swift world it just simply doesn't work.
So what we're going to do
is deprecate that approach.
But we have added in the new
operating system these nice
functions to do the
exact same thing.
And they're the kdebug_signpost functions.
So they work great from
C, Objective C, C++,
and now Swift as well.
So to get started,
the easiest way is
to drop a point event
on a timeline.
That's one of those
little red lollipops.
And the way you do that
is to invoke kdebug_signpost
and that's it.
So inside of our Mouse Down,
every time we hit our
Mouse Down, you'll see one
of these appear on the timeline.
Now it does take
a few arguments.
The first one is a code that's
just some arbitrary integer
between 0 and 16383.
And it helps you
identify your Signpost.
The next four arguments
are integers as well
and can be anything you want.
By default Instruments will just
pass that right up to the UI.
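Here's a minimal sketch of what that looks like in Swift, assuming the kdebug_signpost functions are visible through the Darwin/Foundation modules; the GraphView class and the code value of 5 are purely illustrative.

    import AppKit

    class GraphView: NSView {
        override func mouseDown(with event: NSEvent) {
            // Emit a point event. Code 5 is an arbitrary identifier;
            // the four trailing integers are free-form arguments that
            // Instruments passes straight up to the UI.
            kdebug_signpost(5, 0, 0, 0, 0)
            super.mouseDown(with: event)
        }
    }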
Now you can name these things.
You just have to toggle over to
the configuration section
of the Instrument and
add to the Table View.
And so we have our code
5 now being Mouse Down.
So next time I take a recording,
I can see that those Points
of Interest are now Mouse Downs.
Now if you like Points
of Interest,
you'll also like
Regions of Interest.
They're basically the same
thing, but they work for states
and actions or things that
occur over a period of time.
Now it's a little bit more
complicated because instead
of one call you do have two.
You've got a start and an end.
And you also have a pairing
rule that's in Instruments.
Now by default it's a
very simple pairing rule.
It just uses the
code, so all you have
to do is supply the same code
in both the start and the end.
Now we know that
that's not going to work
for all applications because
some applications have
to issue a flurry of starts
followed by a flurry of ends.
And in that case, inside Instruments there
can be some ambiguity in how
to pair those things up.
So we do allow you to
change the pairing rule.
Come down here to
the Configuration.
You can select Code and First
Argument or Code and Thread.
So Code and First Argument
means that both your start
and end have to have the same
code and same first argument.
So in this example what
we've done is started a bunch
of URL downloads in parallel,
and we've used the URL
download task pointer
as a first argument.
So now you can see
these parallel downloads
on the Instruments timeline.
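As a rough sketch of that pattern (the code value 4, the helper name, and using the task identifier as a stand-in for the task pointer are all assumptions):

    import Foundation

    func startDownload(_ url: URL, in session: URLSession) {
        var task: URLSessionDownloadTask!
        task = session.downloadTask(with: url) { _, _, _ in
            // Same code and same first argument as the start call below,
            // so the Code and First Argument rule pairs them up.
            kdebug_signpost_end(4, UInt(task.taskIdentifier), 0, 0, 0)
        }
        // Begin a Region of Interest for this particular download.
        kdebug_signpost_start(4, UInt(task.taskIdentifier), 0, 0, 0)
        task.resume()
    }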
If you use Code and Thread,
you have to issue
the same code start
and end on the same thread.
And here's an example
of that happening.
But you can see it's
happening with inside
of a dispatch apply block.
So now we have four Regions of Interest on the graph, one for each worker thread.
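A minimal sketch of that pattern, with a hypothetical renderGraph(at:) standing in for the real work:

    DispatchQueue.concurrentPerform(iterations: 4) { index in
        // Start and end are emitted on the same worker thread, so the
        // Code and Thread pairing rule can match them up even though
        // all four regions share the same code.
        kdebug_signpost_start(2, UInt(index), 0, 0, 0)
        renderGraph(at: index)
        kdebug_signpost_end(2, UInt(index), 0, 0, 0)
    }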
Now so far, the Regions
of Interest you've seen
have all been red.
It's all monochromatic.
But if you're willing to
sacrifice your last argument,
you can come over here
and check this box
that says use the last
argument for color.
And in your fourth argument,
you just need to supply one
of these enumerations
between 0 and 4,
and those are the five
basic colors in Instruments.
And in this example when
my download task completes
successfully, I've
colored it green.
And if it completes with
an error, I color it red.
So you can see very
clearly a difference
between pass and fail.
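In code that might look roughly like this; the numeric color values are placeholders, since the actual 0-through-4 mapping is shown in the instrument's configuration:

    // Placeholder values: 0 through 4 select one of Instruments'
    // five basic colors when the color checkbox is enabled.
    let successColor: UInt = 1
    let failureColor: UInt = 2

    func downloadFinished(taskID: UInt, error: Error?) {
        // The fourth argument now carries the color instead of data.
        kdebug_signpost_end(4, taskID, 0, 0,
                            error == nil ? successColor : failureColor)
    }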
So now when we put this all together,
you can see how it's much
easier to correlate that big,
complicated trace I showed you
with the high level
activity in your application.
So for example, you can see
that this flurry of activity
was actually created
by our download tasks.
So let's talk about our
demo application today.
We're going to be
looking at an app we wrote
for you guys called Graphasaurus
2, it's the spiritual successor
to Graphasaurus from
last year's session.
But just like last
year's session,
we're going to be looking
at real-world problems.
So these are problems
that we encountered
when we were tuning
Instruments 8.
And we decided to
just include them
in an iOS application for demo.
There is a new graphing style.
This looks more like the state
graphing style that you see
in Instruments because
that's something
that we're tuning this year.
And another difference from
last year is that we're going
to assume that our code has already been time profiled,
and it's already optimal
given the constraints we have.
And just like last year,
we're going to be
generating our graphs
on the CPU using core graphics.
Now, what we did is we
did some initial timing.
And we found that to generate all four of these graphs, it was going to take about 20 milliseconds in total.
And that's in the worst case
when everything's zoomed out.
Now that is larger than
our 16 millisecond deadline
if we wanted to hit
60 frames per second.
So we decided that
we needed to try
to introduce some parallelism.
Because what we know
is that all four
of these graphs can be
generated independently.
And they take about five
milliseconds in the worst case.
So we're thinking if we throw
all this work at dispatch
and we have perfect scalability,
then our two-core iPad
over here should be able to
cut that work down in half.
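In other words, four graphs at roughly 5 milliseconds each is about 20 milliseconds serially; with perfect scaling across two cores, that drops to about 10 milliseconds, comfortably under the 16 millisecond frame deadline.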
To see how we did, and to show you how to use System Trace, I'm going to turn it over to Joe.
>> Thank you, Chad.
[ Applause ]
All right, so what
you're looking
at here is a QuickTime mirror
of the iPad Pro I'm
playing with here.
So you can see just
like in Instruments,
you can pan around
nice and smooth.
But when I eventually pinch out a bunch,
the animation gets just a little
bit stuttery way out here.
It's not awful, but
it could be better.
So what I want to do now
is figure out why that is.
So let's go over here to Xcode.
And let's click and hold on the
Run button and choose Profile.
So that'll build your application in release mode.
Install it on the device.
And then Instruments will come
up with its template chooser
where you decide how
you want to profile.
I'm going to go ahead
and double click
on the System Trace template.
And now from here before I
start recording, I went ahead
and added some kdebug_signpost start and end calls
to my code ahead of time.
And so I want to configure
how those will appear inside
of Instruments.
And so you do that down
here in the lower right
in the Record settings.
And first off, I did want to
use my last argument for color,
so I put some unique numbers inside of my kdebug_signpost calls.
And I added three
different codes.
Code 0 is CADisplayLink.
So that's that 60 hertz timer
that's driving the animation.
So basically this region
will correspond approximately
to my frame time because this is
where I do all of my rendering.
Next is Code 1, and that
is, I'll call it CreatePath.
So this one I'm actually
creating the CG paths,
the rectangles and the
labels that are going
to be appearing on the screen.
Finally Code 2 is
called RenderGraph.
So that's when I take those CG
paths and actually render them
into a CG bitmap context and
then display them on the screen.
Finally, since my code's going
to be running in parallel
and they're going to be emitting
the same two codes here,
I need to tell Instruments
how to differentiate them.
And in this case, the
thread is good enough
since it'll be running
on different threads.
All right, so I did
all that configuration.
I don't want to be
doing this again
and again every time
I come to Instruments.
So we go to file and
choose Save As Template.
I'll give it a descriptive name, Graphasaurus System Trace, and hit Save.
And now let's say you close this
document, you close Instruments,
you come back a week
later, whenever you get
to the template chooser,
move over to the Custom tab,
there's your template
ready to go.
You don't have to redo
your configuration.
Just hit record.
Now Instruments is going
to wait for me to kind
of reproduce the problem here.
So I'm just going to pinch out,
just like you saw me do before.
Pinch out where it
gets to the problem.
And then I'm going to just
reproduce that problem
for a couple of seconds so that
it fills that window buffer
with the data I'm interested in.
And once I do that,
I stop the recording.
So now Instruments
will go download all
of that data off the device
and then begin to analyze it.
And since this was a
Windowed Mode recording,
you only get those
last few seconds.
Make sure that when you
reproduce the problem,
you stop the recording
right afterwards.
Otherwise, newer events are
going to come in and kind
of push out the stuff you
were actually interested in.
So we'll wait here for
Instruments to finish analyzing.
And there we go.
So we're looking at a
whole bunch of stuff here.
So let's make this
a little bit larger
so we can see what's going on.
All right, so this first
selected instrument that's the
new Points of Interest
Instrument.
So let's zoom in on a
section here randomly and kind
of see what we're looking at.
So now we can see there's
all those codes I created.
I see DisplayLink, the
CreatePath, the RenderGraph.
And it looks like I'd expect.
I have my big blue
CADisplayLink time here.
And inside there's four pairs
of the green and purple create
and rendering those graphs.
And so that looks good.
But when I mouse
over to this region,
you get a little tool tip
showing you those arguments you
provided along with
the duration.
And I'm actually running here
close to 30 milliseconds.
So it's about half of the
speed I want to be running at.
So that's not good.
But this was just one frame.
I rendered a ton.
So what does it look
like in aggregate?
Well, we come down here and
look at this detail table.
This is currently showing
us a time sorted list
of all those regions.
So you could look through
this huge list here and look
at the arguments and et cetera.
But we have a table here that aggregates that for you.
It's called the KDebug Interval
Signposts by Code table.
And when I select that
here's all the codes
that Graphasaurus 2 emitted.
And we can see here's
my CADisplayLink.
I rendered 152 frames, and
on average, they were taking
about 28 milliseconds.
So yeah, on average,
I'm not running so well.
You can see the min, the max,
the standard deviation
that sort of thing.
You can dive in with this
focus button next to the code.
And now that'll give you a
table of all of that data.
Those are all the
events that happened
that were my CADisplayLink
events.
So I'm going to -- from
here you could sort them
by anything you want,
different arguments,
whatever is important
in your application.
Here I just want to
sort them by duration.
And then what I'm going
to do is, I don't know,
I'm going to pick one of
these ones here somewhere
in the middle, and now
what I want to point out is
when I click on one
of these rows,
the graph view up above shifted.
And what happened was it went
and it made the region
I'm interested in visible.
Here it is.
And it put this blue
inspection head
at the beginning of that region.
So you kind of correlate
the thing you clicked
on in the bottom with where it
is up above in the track view.
So now I'm looking
at this frame.
You can also control-click
on that row
and choose Set Time Filter.
And what that'll do
is gray out everything
in the detail view that's
outside of that time range,
and it does the same up in
the track view up above.
So you can use that to kind
of filter out your data
that you're interested in
or here just use it as kind
of a visual cue at the
frame you're looking at.
So now that I've done that, I'm looking at this frame.
I can see that my
CADisplayLink started here.
It ended here.
But I don't really know
why it looks like this.
All this is telling you is
when it started, when it ended.
You don't know if your
application was doing work,
if it went to sleep.
You can't tell from this graph.
So we need to dive in deeper.
Here in the top right of
Instruments in the toolbar,
we're currently on the
Instruments strategy,
which is the list of all the instruments in this template.
You could click here
on this thread data
to see all the threads
in your application.
Alternatively, let's say we
were already down here looking
at a thread in our detail view.
If you option-click, you'll see you get this hyperlink kind of style that you can click on and choose Reveal in Thread Strategy.
So that'll jump you
to the Thread Strategy
and preselect that
thread for you.
So we can see here, make
this a little bit larger.
It selected the main
thread for us.
And if I look at
this, there's a couple
of other dispatch worker
threads that are running.
And these two in particular,
this one right here
and the one below it,
are doing a lot of work.
These are all those red
dots that are showing up.
So if I option drag to zoom
in on one of those regions,
we can actually start
to see what those are
by hovering over them.
Here, this is a ulock wake
system call, so it's waking
up from some sort of lock.
Here's a ulock wait system call,
so it's waiting on some lock.
And if you keep hovering over
them, you're actually going
to see that pattern repeating.
There's a lot of this ulock
wait and waking going on.
So if you click, you'll set that
inspection head at that point.
And if you come down here to what we call the thread narrative view, down here in this table,
it'll actually show you in that
table kind of where I clicked
up above, what was going on
in the thread at this time.
All right, this is a list
of everything this thread
was doing, kind of a story
of this thread's life.
So here we can see one
of those wait calls.
Okay, well where
did that happen?
Whenever possible Instruments is
going to take a backtrace along
with those system events.
And you can find those
over here on the right
in the Extended Detail View.
So we can see I have some Swift
code building paths and inside
of it it's actually creating
an NSAttributedString.
Okay, and down below
that a couple frames is
where it's taking that lock.
Kind of unexpected, wasn't
expecting a lock inside
of NSAttributedString,
but there it is.
So what are the ramifications
of that?
Let's go back over to the
threads narrative view
and see what happens.
So it took us 109 microseconds
just to take that lock.
And then we blocked for
another 6 microseconds.
What's really cool this
year is now it'll show you
which thread made
your thread runnable.
So basically who released that
lock so that you could take it.
And we can see that
it was made runnable
by a Graphasaurus
thread 0x8468b.
Okay. And we can see that even
after that lock was released,
we still waited another 98 microseconds before we actually got a CPU back.
All right, let's look at what
that other thread was doing.
Why did he release that lock?
Option-click on that, choose Reveal in Thread Strategy.
That'll select that other thread
up above, and then down below
in the narrative view, we'll
see what thread was doing
at that time.
And we can see he was
calling ulock wake.
So he was releasing that lock.
That makes sense.
So we can kind of see that we
have these two threads running
in parallel, except they're kind
of contending over this lock.
And so they're doing a bunch
of things besides
actually just running.
Another way to see that
is up in this track view.
We have those thread
states visible here.
Let me make this a
little bit larger.
So if you hover over
the thread states,
you'll see it was
running for this period
of time, 64 microseconds.
Then it was blocked
for a little while.
Then it was runnable
for kind of a long time,
so that means it's
not actually running.
And then finally it runs.
And if we kept digging
around, looking around,
we would see this pattern
repeating over and over again.
And so what becomes clear is
these two threads are fighting
over this lock.
And I'm creating a whole bunch
of strings during this time.
So if this is taking so much
time and is so important,
why don't I see that
in the time profiler?
Well, truth be told if
you go back and look
at the time profiler,
it does show up.
But it's only like four
or five percent or so.
It didn't really stick
out as a giant red flag.
And a big reason for that is
because these threads are
spending a lot of time
in the blocked and
runnable states.
And time profiler only
samples what's actually running
on a CPU, so it's not
going to show up there.
And so what I need to do to fix
this problem is realize, well,
I'm just putting some attributed
strings onto a state graph.
There's not that many states
that need to be displayed.
I should just cache them ahead of time, and then look them up in a dictionary, lock free.
And everything should
run a lot more smoothly.
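A minimal sketch of that caching idea, where stateNames and labelAttributes are hypothetical stand-ins for the app's real state list and text attributes:

    import Foundation

    // Build every state label once, up front, on one thread.
    let stateLabels: [String: NSAttributedString] = {
        var labels = [String: NSAttributedString]()
        for name in stateNames {
            labels[name] = NSAttributedString(string: name,
                                              attributes: labelAttributes)
        }
        return labels
    }()

    // Worker threads only read the immutable dictionary afterwards,
    // so there is no lock to contend over.
    func label(for state: String) -> NSAttributedString? {
        return stateLabels[state]
    }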
And to show you what that
looks like, back over to Chad.
[ Applause ]
>> Okay, so what Joe is
seeing is a textbook Lock
Contention pattern.
And this is where we have
two threads now working away.
And they're contending for
a shared resource somewhere
in the attributed string creation code.
Now even though we're
only holding that lock
for a few microseconds,
the performance impact is much
more significant than that.
And Joe showed you
that a little bit.
But I want to talk about
it just a little bit more.
So when your thread is in
the running state that means
that the thread is on the CPU.
It's running at full stride,
and all of those
performance optimizations
that you were making with a time
profiler are now paying off.
Now at some point, you do call into ulock wait, and then some short period after that it puts the thread into the blocked state.
And so what's happening here is
ulock wait system call realizes
that that lock is being
held by another thread,
and so it asks the kernel
to take you off the CPU
and put you back when that lock is actually acquired.
Now 3.42 microseconds
later that does happen.
We do go into the
runnable state.
But now when we're in
the runnable state,
this is the amount
of time it takes us
to get back onto the CPU.
Now if you'll notice we're
in the runnable state
for about 7 microseconds.
That's nearly twice
the amount of time
that that lock was
actually contended.
So we're getting a
significant amount of overhead.
Now another way to look at this
quantitatively is you can go
into the thread strategy,
select the thread,
create a time filter selection
the way that Joe showed you,
and then change the detail
section from the narrative
over here to the thread summary.
And what that will show you
is the total time spent per
thread state.
In this particular example,
we can see that our
thread was only running
about 82 percent of the time.
Now what that means is
that we're still getting some
benefit from the multicore.
We are getting some work
done, but our scaling is not
that linear scaling, perfect
scaling that we were hoping for.
We are still wasting
a little bit of time.
Now when Joe makes the fix
that he was talking about,
what's going to happen -- well,
two things are going to happen.
The first one is that the update
graph regions, they're going
to get a little shorter
because we are doing less work
when we're caching
these strings.
But more importantly,
that thread is now running a hundred percent of the time.
So you're going to get
that perfect scalability.
So if you double the number of CPUs, you're going to halve the amount of time that that code takes.
And that's great.
So if you have a fix like that,
you should definitely
try to take it.
Now let's talk about preemption.
We didn't get to see any
preemption in our examples here,
but it is something that shows
up quite frequently
in a system trace.
And what preemption is is
an involuntary removal of your thread from the CPU.
So some other higher priority work needed the CPUs.
There were none available,
and so your thread
had to be removed.
Now there's an exception
to that.
There is a voluntary
form of preemption
that you might see
from time to time.
And that occurs inside
of a spin lock.
When a spin lock realizes it's
not making any more forward
progress, it can call into
the thread switch system call
and essentially yield its quantum of time over to the holder of the lock.
And so what that looks
like in system trace
in the thread narrative
is you'll see a call to thread switch. And then the preemption narrative after it will say that it was yielding the CPU rather than being removed from the CPU.
So another lighter weight form
of preemption is called
the interrupted state.
And the interrupted state is
when your thread is running
on a CPU and that CPU has to
handle a hardware interrupt.
So your thread is suspended.
The interrupt handler runs, and
then your thread is resumed.
Now at this point, the priority
of your thread doesn't matter.
You can have the
highest priority.
It really won't matter.
The interrupt will
always take precedence.
Now the good news is that
these are typically brief,
a couple of microseconds.
And typically they won't add
up to contribute to any sort
of performance problem
inside your application.
But they do show up, so that's
why I wanted to present them.
Now another nice feature
of Instruments 8 is the new System Load instrument. And what that instrument does is help you identify hotspots
in your system trace that
could be contributing
to dropped frames, for example.
Now it does that in two ways.
The first way is this
bottom table view.
And this shows you a picture
of what the scheduling
state looks like: all of the threads
that were ready
to run underneath the
blue inspection line.
So you can tell that
at that moment
in time, we had three threads.
One was a kernel thread and
two were Graphasaurus threads.
And these are the threads
that were not blocked
and trying to run.
You can see the core
assignments for those as well.
Now the other feature of this
instrument is called the User
Interactive Load Average graph.
Each one of these bars
represents a 10 millisecond
window of time.
And the height of the bar is the
average number of active threads
in that 10 millisecond period.
So that's threads that are
either running, runnable,
preempted, or interrupted.
Essentially threads
that are not blocked.
Now since it's the User
Interactive Load Average,
we only include threads that
have a priority greater than or equal to 33 because those are the threads with a priority that could interfere
with the user interactive
quality of service class.
Now to make it stand out
a little bit more clearly,
when the thread bars
turn orange that means
that your load average has
exceeded the number of cores
on that particular device.
So wherever you see
a burst of orange,
you can see that that's a
frame drop waiting to happen.
So you might want to zoom in
to those particular regions
that are orange, make
sure that the threads
that you have running
are balanced
at the right quality
of service level.
Now when Joe makes this
fix we're going to end
up with a little
bit more headroom.
We're going to be able
to add a new feature.
And that new feature is these
hover labels, very similar
to the hover labels that
you see in Instruments,
except on Graphasaurus
you do a long press.
And then the hover labels
follow your fingertip.
So to show how that
feature is going,
I'm going to turn
it back over to Joe.
[ Applause ]
>> Thanks, Chad.
So yeah, so I added the
NSAttributedString fix,
went back to 60 frames per second, looked good, added the new tool tip generation work,
and things got slow again.
You can kind of tell there's a couple frames dropping here and there.
So I looked in the time profiler, and there wasn't anything obvious that I could remove.
There wasn't any extra
work I was doing.
So I went back and
took the system trace
that you're looking at here.
Before I took that
trace, I went ahead
and added a new Signpost code,
number 3, and that's going
to represent my GenToolTips
work.
And you'll see that
show up in red up above.
So let's zoom in on one
of these sections here.
All right, so we can see
here's my new red bars,
these GenToolTips.
And so it's important
for me to kind
of describe how my
algorithm works here.
Basically whenever this
CADisplayLink region starts,
for every single
graph on the scene,
I go and do a dispatch
async of all the render work
for that graph and
I dispatch async
to go generate the tool tip dictionary lookup stuff.
And I do that for every
single graph on the scene.
But then I had kind of a clever
realization that in order
to kind of call my rendering
complete, I don't actually need
to wait for the tool
tip stuff to finish.
And so I do a dispatch group
wait on just the render work.
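Sketched out, with hypothetical renderQueue, toolTipQueue, renderGraph, and generateToolTips standing in for the real pieces, the per-frame dispatch looks something like this:

    let renderGroup = DispatchGroup()
    for graph in graphs {
        // Render work joins the group; the frame waits on it.
        renderQueue.async(group: renderGroup) {
            renderGraph(graph)
        }
        // Tool tip generation is fired off but deliberately left out of
        // the group, so it does not gate the frame.
        toolTipQueue.async {
            generateToolTips(for: graph)
        }
    }
    // Only the render work has to finish before the frame is "done".
    renderGroup.wait()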
And we can see that actually
kind of worked here pretty well.
Here's my start of my frame.
The CADisplayLink time.
We can see some of those
tooltips working here.
Let me scroll over to the right.
You can see actually
one of them here.
It actually doesn't even start until my render frame is done.
So it looks like
I did a good job.
I can give myself
a pat on the back.
That looks nice.
However, when I look at my
CADisplayLink time here,
it's taking 17.4 milliseconds.
Pretty close, but it's
not my 16.6 that I want.
So again, that was just
one of the regions.
Let's look at what we
were doing in aggregate.
Let's go back to that KDebug
Interval Signpost by Code table.
Here we can see our
CADisplayLinks.
I did about 260.
That's more than we did the
last time, so that sounds good.
Sixteen milliseconds on average.
That's actually less
than my 16.6.
So that actually
sounds pretty decent.
However, this max is still
sitting here at about 19.27.
And if we look at all
the individual events,
let's sort this by duration
from longest to shortest,
we can see here's
that one at 19.
There's a bunch in the 18s.
There's some here in the 17s.
A lot in the 17s.
More in these upper 16s.
So we still have
a number of frames
that are actually
rendering too slowly.
Not by much, but
they're still too slow.
That means we're going
to be dropping frames.
So this time, where
do we go from here?
We could go back to diving down
into the thread strategy looking
at all our threads and the
system calls and VM events
and thread events and
all sorts of things.
But whenever possible, well, this system trace template has a bunch of instruments up here, and they give you higher level aggregate information that's useful to look at first.
So before you go diving down
into the 100,000 plus events,
take a look at these
higher level aggregates.
And so what I'm going to
do is let's take a look
at this User Interactive
Load graph.
That's part of that
System Load instrument.
And let's go ahead and zoom
out here to snap track to fit,
so we can see all the data
again back on the screen.
And when we've done that, I'm
going to zoom in over here.
You can see there's a fair
bit of orange in this graph.
Make this a little bit larger.
So you can see there's a
fair bit of orange up there,
which means we have more user
interactive threads running
than we have cores, right.
These are threads that are
saying, I have a lot of work
to do, and I need to do it now.
Give me a CPU.
Well, we're running out of CPUs
and that's why our
graph here is orange.
So let's zoom on one
of these large regions
of orange over here.
You can tell what the value is
just by hovering over a region.
Zoom in a little bit.
So we can see this particular
10 millisecond bucket,
on average there's about
2.84 user interactive threads
that we're trying to run,
again, on a dual-core machine,
so about 0.8 threads
are lacking CPU time.
And that's why that's orange.
And we can see on
average, there's a lot
of regions here that
are too big.
Here actually it's four.
We were trying to run twice as
many threads as we have cores.
So let's look at that
region in more detail.
So let me scooch this
over so you can see this.
As Chad mentioned, you can see
what threads you're actually
trying to run in that
period by click and holding
in the ruler view up
here, and you can move
that blue inspection
head back and forth
and Instruments will tell you
which threads we're trying
to run during that
instant of time.
And if we come down
and look at this table,
I'll sort it by the priority.
We can see, I looked right here
and there are two Graphasaurus
threads that are running, cool.
There's actually two locationd
threads that are trying to run
at a slightly lower priority.
And that's part of
being on a real system.
You're going to see
system daemons coming in
and trying to do their work.
But it's okay.
They're running on a slightly
lower priority than my stuff,
so I still have the
CPU, looks good.
I do, however, have this
third thread that's running
at the same time, well, trying
to run at the same time.
And he's not getting
any CPU resources.
And I know what these
threads are.
The two of them are
my render work.
And then I have the third
thread that's trying
to do my generation
of tool tips.
And so what's happening is one
of them is not able
to get a CPU.
And we can kind of
see this again.
If we look at, say,
one of our frames.
Let's go back up to that
Points of Interest Region.
You know, we could see
that we're doing our rendering
here inside of our frame,
and the generation
of tool tips happens,
so it gets a little
bit of CPU time here.
But what it's doing is when
it does get that CPU time,
it's taking away
from my rendering.
And so basically I've kind
of misprioritized my work
because when I stop and
think about it, well,
I need that rendering
to happen right now.
Being at that user interactive
level makes perfect sense
because I want it to be nice and
smooth and 60 frames per second.
But these tool tips, they're
not quite as high priority.
I do want them done
quickly because as soon
as that user long
presses on that screen,
I want them to show up.
But they're not really as
important as that render work.
And you can see it
pretty clearly here.
They are definitely taking away some of that time, those CPU resources, from this render work that's now being delayed.
And that's what's
helping kind of drag
out that CADisplayLink time.
So the fix for that is
actually simple in this case.
Let's go over to Xcode.
So I have this view
controller class.
And one of the things it does
is it creates a tool tip queue.
We can see that created
right down here.
This is where I do
all my tool tip work.
And it's created with
a couple attributes.
One is, you know, it's
concurrent, so good,
they can run on multiple
CPUs if they're available.
And it's set for the user
interactive QOS class.
And that's that same QOS class
that my render work is happening at,
so they're all contending
for resources.
So like we said, it's not
actually as important,
so I'm going to change
that class.
You can read about the
different classes right
in the header file.
I'm going to take it a
couple of notches down and go
with the utility level class.
And what that'll do is give
that CPU prioritization
to do my rendering work, and
then when there is a little bit
of CPU time at the end
of the frame or whenever,
then the tool tips will run.
And they're still at a high
enough priority that when
that user taps on the screen,
they should be ready to go.
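As a sketch (the queue label is made up, and the original shape of the queue is inferred from the description above), the change amounts to this:

    // Before: tool tip work competed with rendering at the same QoS.
    // let toolTipQueue = DispatchQueue(label: "GenToolTips",
    //                                  qos: .userInteractive,
    //                                  attributes: .concurrent)

    // After: utility lets the render threads win the CPUs right now,
    // and the tool tips fill in whatever time is left over.
    let toolTipQueue = DispatchQueue(label: "GenToolTips",
                                     qos: .utility,
                                     attributes: .concurrent)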
And so to show you what that
looks like, back over to Chad.
[ Applause ]
>> Okay, so when
Joe makes that fix,
the graph's going
to look like this.
We're going to notice that our
CADisplayLink times have come
down to about 12.7
milliseconds on average,
which is much better
than before.
But even better still is
that our max duration is only
about 14.6 milliseconds.
So we're not dropping any
frames, and we're well
within our 16 millisecond
deadline.
Now we're doing that in
spite of also continuing
to overload the system.
If you think about it,
we are still running
three different threads.
But because we've correctly
prioritized the work,
our GenToolTips code is running
down here at priority four.
So that's going to
stay out of the way
of the User Interactive code.
So we're still getting a
lot of work done on the CPU.
We still have a very
high system load.
But at the same time,
we're still getting a perfectly
smooth user experience.
So what is Quality
of Service, really?
Quality of Service, in case
you hadn't seen it yet,
is an attribute that you attach
to blocks, queues, and threads.
And it's basically an expression
to the kernel about how much
of the system resources you're
willing to devote to getting
that particular piece
of work done quickly.
Now the different Quality
of Service classes can
constrain the priority range.
So you can see that our utility class put it down to a priority of four, while our User Interactive code is running in the high 30s and 40s.
But the Quality of Service
classes can also throttle things
like IO and also
the CPU frequency
that the code is running at.
So when you pick a Quality
of Service for your code,
make sure that you look through
the documentation very carefully
and make sure that
it matches the kind
of work that you're doing.
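For reference, QoS can be attached at each of those levels; these forms are illustrative rather than what Graphasaurus actually does:

    // On a dispatch queue:
    let queue = DispatchQueue(label: "downloads", qos: .utility)

    // On an individual block submitted to a queue:
    queue.async(qos: .userInitiated, flags: .enforceQoS) {
        // higher-priority work on an otherwise utility queue
    }

    // On a Thread, set before the thread is started:
    let worker = Thread { /* background work */ }
    worker.qualityOfService = .background
    worker.start()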
Now another thing that
can affect the performance
of your application is
virtual memory faults.
They do get a little
worse under load
as memory pressure increases.
But the good news is
they are manageable.
Now System Trace has all the
tools that you need in order
to analyze virtual
memory faults.
Inside the thread strategy,
virtual memory faults appear
as these little blue capsules.
Inside the thread narrative it
reports virtual memory faults
and even attaches a
backtrace on where
that fault was resolved
inside your code.
Now we also have an
instrument that's dedicated
to analyzing virtual
memory faults.
So for example, you can see
where your code is
more susceptible
to one type of fault or another.
So for example, maybe
you have code
that is experiencing more zero
fills or more copy on writes.
Now the next thing
you need to know
about virtual memory faults
is that the fault occurs
on access rather
than allocation.
So you can ask for a large
allocation from the kernel,
let's say 500 meg, but
you don't actually back
that with physical memory
until you start touching
or accessing the pages of that
allocation in your process.
So it's something to think about
when you're allocating memory.
The other thing that's
important to know
about virtual memory faults is
that they are resolved inline.
So there's no explicit
call that you need to make
to resolve a virtual
memory fault.
All you need to do is touch any
byte inside a page that's marked
as requiring a fault, and
the kernel will take control
of your thread, resolve
the fault,
and then give you control back.
And so when you see those blue
capsules in your System Trace
on the thread strategy that's
exactly what's happening.
So what do you do about
virtual memory faults inside
your application?
Well, the easiest thing to do is just simply absorb them.
What I mean by that is leave
enough room inside your
performance budget where you
can handle a certain amount
of virtual memory faults
before you overrun your deadlines.
This will make you more
resilient under load.
So as memory pressure increases,
if your budgets are big enough
and you have enough slack,
you won't notice the
difference in your performance.
Now we realize that some
people do not have these kinds
of lax deadlines in terms
of their UI generation.
So another alternative is
to try to do the faulting
on a background thread.
So let's say you have
a game, for example,
and your player is coming
to the end of level one
and they're going to
transition to level two.
Well, what you might
do is dispatch async
to a background queue and then
touch the pages for the content
of level two on that
background queue.
And then by the time your
rendering thread comes
around to pick up that
new content there will be
no stutter.
Now, we do have to give you a warning here on this approach.
Make sure you only
touch the pages
that you're absolutely
going to use.
Because if you start touching
more of the pages than you need,
then you're actually going
to make the problem worse.
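Here's a rough sketch of that idea, where levelTwoData is a hypothetical Data blob holding level two's content:

    import Foundation

    DispatchQueue.global(qos: .utility).async {
        let pageSize = Int(getpagesize())
        levelTwoData.withUnsafeBytes { (bytes: UnsafeRawBufferPointer) in
            var checksum: UInt8 = 0
            // Touch one byte per page so the kernel resolves the faults
            // here, off the rendering thread. Only touch the pages you
            // are actually going to use.
            for offset in stride(from: 0, to: bytes.count, by: pageSize) {
                checksum &+= bytes[offset]
            }
            _ = checksum
        }
    }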
And that about does it
for today's session.
We think that the System
Trace makes a great companion
to the Time Profiler.
The Time Profiler helps
you make your code fast,
but System Trace
allows your application
to scale better under
a higher load.
We encourage you to try a System
Trace out on your own apps.
We know that when we try
it against Instruments,
we always find something
that's worth fixing.
And if you've used System Trace
in the past, we invite you
to come back to Instruments
8 and give it another look
because we've made some major improvements
to both the approachability
and the power of the tool.
We think it'll make a great
addition to your toolbox.
For more information, here's our link, Session 411.
We have some related sessions that also happened during the week and on Friday.
Enjoy the rest of your morning.