WWDC2018 Session 716

Transcript

[ Music ]
[ Applause ]
>> Good afternoon, everybody,
and welcome to Object Tracking
in Vision session.
Do you ever need to solve various computer vision problems? If you do, whether you're a Mac or an iOS developer, you're in the right place.
My name is Sergey Kamensky.
I'm excited to share with you
how Vision Framework can help.
Our agenda today consists of
four items.
First, we're going to talk about
Why Vision?
Second, we're going to look at what's new that we're introducing this year. Third, we're going to take a deeper dive into how to interact with the Vision API.
And then, finally, we're going
to get to the main topic of our
presentation, which is Tracking
in Vision.
So, why Vision?
When we designed our framework, we wanted it to be one central stop to solve all of your computer vision problems. First, it offers a simple and consistent interface. Second, it's multi-platform; our framework runs on iOS, macOS, and tvOS. We're privacy oriented. What that means is that your data never leaves the device; all the processing is local. And we're continuously working on the framework: we're enhancing our existing algorithms, and we're developing new ones.
Let's look at the Vision basics.
When you think about how to
interact with Vision API, I want
you to think about in these
terms, what to process, how to
process, and where to look for
results.
What to process is about our family of requests. That is how you tell us what you want to do. How to process is about our request handlers, the engines used to process our requests.
And finally, the results; the
results in Vision come in forms
of observations.
Please take a look at this
slide.
If you were to remember anything
from this presentation, this
slide is probably one of the
more important ones.
This slide presents the philosophy of how to interact with Vision: requests, request handlers, and observations.
Let's look at the requests
first.
This is a collection of requests
that we offer today.
As you can see, we have various
detectors.
We have Image Registration
Request.
We have two trackers, and we
have a CoreML request.
If you're interested in learning more about the integration of Vision and CoreML, I invite you to come to the next session in this room, where my colleague Frank will cover the details of that integration.
Let's take a look at the request
handlers.
In Vision, we have two.
We have Image Request Handler,
and we have Sequence Request
Handler.
Let's compare the two using these criteria.
First, we'll look at the Image
Request Handler.
Image request handler is used to
process one or more requests on
the same image.
What it does inside: it caches certain information, like image derivatives and results of processed requests, so other requests in the pipeline can use that information.
I'm going to give you an
example.
If a request depends on running a neural network, as you know, neural networks expect images in certain sizes and certain color schemes. Let's say your neural network expects 500 by 500 black and white.
It's very rare that you'll get
user input just in that format.
So, what we'll do inside, we'll
convert the image.
We'll feed it into that neural
network to get results for the
current request, but we will
also cache that information on
the request handler object.
So, the next request, when it
comes, if it needs to use the
same format, it's already there,
and it doesn't need to be
recomputed.
We'll also cache results that we
get from the requests so other
requests can use it in
pipelines, and we're going to
look at pipelines going forward
in this presentation.
Let's take a look at the
Sequence Request Handler.
Sequence request handler is used
to process a particular
operation like tracking in a
sequence of frames.
What it does inside, it caches
the state of that operation from
frame to frame to frame for the
entire sequence.
In Vision, it's used to process
tracking and image registration
requests.
All other requests are processed
with our Image Request Handler.
Let's look at the results.
Results in Vision come in the form of observations.
Observations is a collection of
classes derived from
VNObservation class.
How do we get an observation?
Well, first, the most natural
way is when the request is
processed, you're going to look
at the results property of that
request, and that results
property is a collection of
observations.
That's how we tell you results
of processing.
The second way is to create it manually. We're going to look at examples of how to do both later in this presentation.
Let's now look at what's new
coming this year.
First, we have our new face
detector.
Now, we can detect more faces,
and we're now
orientation-agnostic.
Let's look at an example. On the left-hand side, you can see an image with seven faces in it. If we process that image with the detector from last year, we can detect only three faces, and those faces are the ones that are close to the upright position.
If you process the same image
with the face detector that's
coming this year, as you can
see, all faces can be detected,
and orientation is not a problem
anymore.
Let's look a little bit more
into details.
Thank you.
[ Applause ]
So, first, our new face detector uses the same API as last year. The only difference is that if you want to specify the revision, you need to override the revision property of that request and set it explicitly to revision #2.
Again, I'll talk about the
reasons right on the next slide.
We're also introducing two new properties. One is roll, which is when the head tilts to the side, and the other one is yaw, which is when the head rotates around the neck.
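In code, a minimal sketch of pinning the face detector to revision 2 and reading the two new properties might look like this (the request is assumed to be performed elsewhere):

```swift
import Vision

// Pin the face detector to the new revision explicitly.
let request = VNDetectFaceRectanglesRequest()
request.revision = VNDetectFaceRectanglesRequestRevision2

// ... after the request has been performed ...
if let faces = request.results as? [VNFaceObservation] {
    for face in faces {
        // roll and yaw are optional NSNumbers, reported in radians.
        print("box: \(face.boundingBox), roll: \(face.roll ?? 0), yaw: \(face.yaw ?? 0)")
    }
}
```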
Revisions. What happens in Vision is that when we introduce a new algorithm, we don't deprecate the old one right away. Instead, we're going to keep both revisions, or maybe going forward even more, simultaneously for some time.
You tell us which one you want
to work with by specifying
revision property of the
request.
So, that's the explicit
behavior.
But we also have a default
behavior.
If you create a request object,
and you don't tell us anything
at all, and you start processing
that request, here's what's
going to happen.
By default, you're getting the latest revision of the request from the SDK that your app is linked against.
This is important to understand,
and I'll give you an example.
Let's say your app is linked against the SDK from last year. Last year, we had only a single detector. So, this is the detector you're going to get, even if you take that app and, without recompilation, run it on the current OS.
If, on the other hand, you recompile your app with the current SDK without changing a single line of code, and you run it on the current OS, you're going to get revision #2 by default, because this is the latest revision from the current SDK. We highly recommend you to future-proof your apps by specifying revisions explicitly.
What you get is, first, deterministic behavior. You know the performance of the algorithm that you're coding against. You know what to expect. You can also protect your app from errors going forward: for example, if we deprecate a certain revision a couple of years from now, you can guard against that in your code today.
Let's have a deeper dive into
how to interact with Vision API.
First, we're going to look at
the example with Image Request
Handler.
So, as you remember, image
request handler is used to
process one or more requests on
the same image.
It's optimized by caching some
information like image
derivatives and request results.
So, the consecutive requests
that are coming to be processed
can use this information.
Let's look at a code sample. Before we dive into it, I just want to emphasize a couple of points about the code samples in this presentation. The error handling is not a good example of how errors should be handled: I use the short version of try, and I force unwrap optionals. This is just to simplify the examples.
When you code your apps, you
probably should use guards to
protect against unwanted
behavior.
I also use an image URL when I create my image request handler objects, and that is just the location of the image file on disk. Now, let's look at the example.
First, I'm going to create my
detect faces request object.
Then, I'm going to create my image request handler, passing the URL of the image file where the faces should be located.
Then, I'm going to ask my
request handler to process my
request.
And finally, I'm going to look
at the results.
Very simple.
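In code, those steps might look something like this minimal sketch; imageURL is assumed to point to an image file on disk:

```swift
import Vision

// Find all faces in an image on disk.
let faceDetectionRequest = VNDetectFaceRectanglesRequest()
let requestHandler = VNImageRequestHandler(url: imageURL, options: [:])
try! requestHandler.perform([faceDetectionRequest])

// Each result is a VNFaceObservation with a normalized bounding box.
let faceObservations = faceDetectionRequest.results as? [VNFaceObservation]
```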
If I had an image with a single
face in it, my results would
look something like this.
So, what do I get back? I get back a face observation object, and one of the more important fields in that object is the bounding box where the face is located.
Let's take a look at this slide again. The first three lines are pretty much all you need to find all the faces in an image. Isn't that cool?
[ Applause ]
Let's now take a look at the
Sequence Request Handler.
Well, sequence request handler,
as you remember, is used to
process a particular operation
like tracking on a sequence of
frames.
Let's look at a code sample, and this code sample is pretty much the simplest tracking sequence we can imagine with the Vision API.
First, I'm going to create my
Sequence Request Handler.
Then, I need to specify which object I want to track, and I'm going to do that by creating a detected object observation, which takes as a parameter a location, a bounding box. Then, I'm going to start my tracking sequence. In this example, I'm going to track my object for five consecutive frames.
Let's look how the sequence
works.
First, I have a frame feeder object, which in your case could be a camera feed, for example. This is where I get my frames. I get my frame, and I create my request object, passing the detected object observation as a parameter to its initializer. That's the observation I just created before the loop started.
Then, I'm going to ask my request handler to process the request.
I'm going to look at the
results, and this is the place
where I should be analyzing
results and doing something with
them.
And the last step, which is very
important, what I'm doing here,
I'm taking the results from the
current iteration, and I'm
passing it to the next
iteration.
So, when the request for the next iteration is created, it will see those results inside. If I were to run it on a sequence of five frames, my results would look something like this.
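Put together, a minimal sketch of that sequence might look like this; frameFeeder and initialBoundingBox are hypothetical stand-ins for your frame source and the user's selection:

```swift
import Vision

let requestHandler = VNSequenceRequestHandler()
// The object to track, defined by its normalized bounding box.
var inputObservation = VNDetectedObjectObservation(boundingBox: initialBoundingBox)

// Track the object over five consecutive frames (CVPixelBuffers).
for frame in frameFeeder.nextFrames(count: 5) {
    let request = VNTrackObjectRequest(detectedObjectObservation: inputObservation)
    try! requestHandler.perform([request], on: frame)

    if let observation = request.results?.first as? VNDetectedObjectObservation {
        // Analyze the result here, then feed it into the next iteration.
        inputObservation = observation
    }
}
```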
How to create a request object.
Well, first, it's important to
understand that our requests
have two types of properties.
There are mandatory properties
and there are optional
properties.
Mandatory properties, as the name suggests, need to be provided via the initializer in order to create a request object.
Let's look at an example. This is something that we just saw on the previous slide. The detected object observation that is passed into the initializer of the track object request is an example of a mandatory property.
We also have optional properties. By the way, both types of properties are declared in the place where the request object is declared. Optional properties are properties for which we have meaningful defaults. We'll initialize them for you, but you can override them later if you need to.
Let's look at an example. What I'm doing here is creating my detect barcodes request object. If I were to do nothing else and just took my request object and fed it into a request handler, I would be working on the entire image, looking for my barcodes. What I'm going to do here instead is specify a small portion, like the central part of the image. That is where I want to focus my search for barcodes, and I'm going to override my region of interest property with that.
If I take my request object now
and feed it into request
handler, I'm going to be
focusing on the smaller portion
of the image only.
The region of interest property here is an example of the optional properties that we have.
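A minimal sketch of that override might look like this:

```swift
import Vision

// Look for barcodes only in the central part of the image.
// Coordinates are normalized, with the origin at the lower left.
let barcodeRequest = VNDetectBarcodesRequest()
barcodeRequest.regionOfInterest = CGRect(x: 0.25, y: 0.25, width: 0.5, height: 0.5)
```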
What's important to understand here is that once you get a request object in your hands, it's a fully constructed object that you can start working with right away. If you decide to override certain properties later on, you're welcome to do so.
One thing that is maybe not obvious on this slide, and which will be covered in more detail in the next session, is the bounding boxes.
As you can see, the coordinates
that we receive are normalized.
They are from 0 to 1 and they're
always relative to the lower
left corner.
The next session, which is
CoreML integration with Vision,
will cover this aspect in more
details.
Let's look at how to understand
results.
As we said, results in Vision
come in forms of observations.
Observations are populated
through results property of the
request object.
How many observations can you get? The collection can be 0 to N. There's one more aspect here. If you get results set to nil, that means the processing of the request failed. This is different than getting 0 observations back. Getting 0 observations back means that whatever you were looking for is just not there.
As an example, let's say you run a face detector. If you feed in an image with no faces in it, naturally, you will get 0 observations back. On the other hand, if you feed in an image with one or more faces in it, you'll get the appropriate number of observations back.
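In code, a sketch of that distinction, assuming faceRequest has already been performed:

```swift
if let observations = faceRequest.results {
    if observations.isEmpty {
        // 0 observations: the image simply contains no faces.
    } else {
        // One observation per detected face.
    }
} else {
    // results is nil: processing of the request itself failed.
}
```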
Another important property of
observations is that
observations are immutable.
We will look at the example
where it's used.
There are two more properties that I want to call your attention to, and they are both declared in the base class for all observations. One is the unique ID. This unique ID identifies the processing step where this particular result was created. Another one is the confidence level. The confidence level tells you how confident the algorithm was in producing the results.
The confidence is in the range
from 0 to 1.
Again, this topic also will be
covered in the next session in
more details.
Let's look at the request
pipelines.
So, what is a pipeline?
Let's say I have three requests,
and it happens to be that
request #1 depends on execution
of request #2, which in turn
depends on execution of request
#3.
How do we process this? Well, the processing is done in the opposite order.
What I'm going to do here,
first, I'm going to process my
request #3.
I'm going to get the results
from that request and feed it
into request #2.
I'm going to do exactly the same
with request #2.
And finally, I will process my
request #1.
Let's look at examples of how to run a request pipeline implicitly and explicitly. In the next two slides, we're going to look at these two use cases, and we're going to run our face landmarks detector. As you probably know, face landmarks are the features on the face: the locations of your eyes, eyebrows, nose, and mouth.
Let's first look at how to do it
implicitly.
I have a simple code sample, a bit similar to what we have seen already.
First, I'm going to create my
face landmarks request.
Then, I'm going to create my
image request handler.
Then, I'm going to process my
request.
And finally, I'm going to look
at the results.
If I had an image with a single
face in it, my results would
look something like this.
So, I get the bounding box, which is the location of the face, and I get the landmarks for the face.
What's important to understand here is that when the processing of the face landmarks request starts, the face landmarks request figures out that faces have not been detected yet, and it runs the face detector on our behalf inside. It gets the results from that face detector, and that is where the landmarks are searched for.
On the right-hand side, you can
see a snippet of what face
observation object would look
like.
These are just a couple of
fields from that object.
One is a unique ID that we
discussed that is set to some
unique number.
Then, the bounding box, that's
where the face is located, and
then, finally, the landmarks
field, which points to some
object where the landmarks are
described.
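Put together, a minimal sketch of the implicit pipeline might look like this; imageURL is assumed to point to the image file:

```swift
import Vision

// A single landmarks request; the face detector runs internally.
let landmarksRequest = VNDetectFaceLandmarksRequest()
let handler = VNImageRequestHandler(url: imageURL, options: [:])
try! handler.perform([landmarksRequest])

if let face = landmarksRequest.results?.first as? VNFaceObservation {
    // boundingBox comes from the internally run face detector;
    // landmarks holds the eye, eyebrow, nose, and mouth regions.
    print(face.boundingBox, face.landmarks?.nose as Any)
}
```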
Now, let's take a look at the
same use case but now done
explicitly.
What I'm going to do here
first-- first, I'm going to
explicitly run my face detector.
You've seen these four lines of
code already several times in
the presentation.
When I run it, I get my bounding
box back.
As you can see, the results are
returned in the same type, face
observation.
The fields that we saw on the
previous slide may look like
this, so you get some unique
number to identify this
particular processing step.
Then, you get the bounding box
location, which is the main
outcome of processing this
request.
And the landmarks field is set
to nil because face detector
doesn't know anything about
landmarks.
What I'm going to do next is create my landmarks request, and then I'm going to take the results from the previous step and feed them into the input face observations property of that request.
Then, I'm going to ask my
request handler to process it.
And finally, I'm going to look
at the results.
If I run it on the same image, I get exactly the same results as I would on the previous slide.
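A sketch of that explicit pipeline, assuming handler is the image request handler for this image:

```swift
import Vision

// Step 1: run the face detector explicitly.
let faceRequest = VNDetectFaceRectanglesRequest()
try! handler.perform([faceRequest])
let faces = faceRequest.results as? [VNFaceObservation] ?? []

// Step 2: feed the detector's results (possibly filtered) into the
// landmarks request via its input face observations property.
let landmarksRequest = VNDetectFaceLandmarksRequest()
landmarksRequest.inputFaceObservations = faces
try! handler.perform([landmarksRequest])
let facesWithLandmarks = landmarksRequest.results as? [VNFaceObservation]
```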
But let's see what happens with
observations.
Remember, we said that observations are immutable. Even though both the face detector and the face landmarks detector return the same type, we don't overwrite the observation that was fed in. What we do instead is take the first two fields and copy them into a new object, and then we calculate the landmarks and populate the landmarks field. Now, if you look, you will notice that the UUID in both cases is the same.
Why is that?
Because we're talking about the
same face.
It's the same processing step,
if you will.
Where would you use implicit versus explicit? Well, if your application is very simple, you would probably want to opt for the implicit way. It's very simple: you create a single request, and everything else is done on your behalf.
If, on the other hand, your application is more complex, you may want to detect faces first and then do some filtering. Let's say you don't care about faces on the periphery, or you want to focus only on the ones that are in the center; you can do that step, and then run landmarks detection on the remaining set of faces. In this case, you probably want to use the explicit version, because then the landmarks detector is not going to rerun the face detector inside.
We want your apps to have
optimal performance both in
terms of memory usage and
execution speed.
That's why it's important to
look at the next two slides.
How long should you keep your
objects in memory?
Well, for the image request
handler, you should keep it as
long as the image needs
processing.
This may sound like a very
innocent and simple statement,
but it is very important that
you do just that.
If you release the object early, and you still have outstanding requests to be processed, you will have to recreate your image request handler. But now you have lost all the cache that was associated with the previous object, and you'll have to pay the performance cost of recalculating those derivatives.
If you release it too late, on the other hand, then you're going to start causing memory fragmentation, and the memory is not going to be reclaimed by your app for other meaningful things that you want to do. So, it's important to use it as long as you need and release it right after.
Remember, it caches the image
and multiple image derivatives
inside.
The situation with the sequence request handler is very similar, with the only difference that if you release it too early, you pretty much kill the entire sequence, because the entire cache is gone.
What about requests and
observations?
Well, requests and observations
are very lightweight objects.
You can create them and release
them as needed.
There's no need to cache them.
Where should we process your requests? Well, many requests in Vision rely on running neural networks on the device. And, as we know, running neural networks is usually faster on the GPU than on the CPU.
So, the natural question is
where should we run it?
Here's what we do in Vision.
If a request is runnable on the GPU, we will try to do that first. If the GPU is not available for whatever reason at that point in time, we will switch to the CPU; that's our default behavior. But let's say your application depends on displaying a lot of graphics on the screen, so you may want to save the GPU for that particular job. In this case, you can override the uses-CPU-only property and set it to true on the request object.
This will tell us to process
your request directly on the
CPU.
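In code, that override is a single line on the request object:

```swift
// Keep the GPU free for rendering; run this request on the CPU.
let request = VNDetectFaceRectanglesRequest()
request.usesCPUOnly = true
```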
Now that we've covered the basics of how to interact with the Vision API and we've seen a couple of examples, let's switch to the main topic of our presentation, which is tracking in Vision.
So, what is tracking?
Tracking is defined as a problem
of finding an object of interest
in a sequence of frames.
Usually, you find that object in
the first frame, and you try to
look for it in the sequence of
frames.
What are examples of such applications? You have probably seen many of them: live annotation of sports events, focus tracking with a camera, and many, many others.
You may say, why should I use tracking if I can do detection on every frame in the sequence?
Well, there are multiple reasons
for that.
First, you probably don't have a specific detector for every single type of object that you want to track.
Let's say if you're tracking
faces, you're lucky.
You have a face detector for
that purpose.
But if you need, for example, to track a specific type of bird, you probably don't have that detector, and now you're in the business of creating that particular detector, which you may not want to do, given the other things you want to get done in your application.
But let's say you are lucky, and
you're tracking faces, should
you use detector then?
Well, probably not in this case
either.
So, let's look at an example.
You start your tracking
sequence, and you run your face
detector on the first frame.
You get five faces back.
Then, you run it on the second frame; you get another five faces back.
How do you know that the faces
from the second frame are
exactly the same faces as from
the first frame?
One person could have stepped
out; another one showed up.
So, now, you're in the business
of matching objects that you
found, which is a completely
different task that you may not
want to deal with.
Trackers, on the other hand, use temporal information to match objects. They know the trajectory of how the objects move, and they can predict where they will be moving in the next frame.
But let's say you're lucky
again.
You're tracking faces, and your
use case is limited to a single
face in the frame.
Should you use detectors then?
Well, maybe not even in this
case.
Now, speed is a factor. Trackers are usually lightweight algorithms, while detectors usually run deep neural networks, which takes much longer.
In addition, if you need to
display your tracking
information on a graphical user
interface, you may find that
trackers are smoother and not as
jittery.
Remember, in one of the first slides, I asked you to remember these three terms: what, how, and results. Let's see how this maps onto the tracking use case.
First, request.
So, in Vision, we have two types
of requests for tracking.
There is a general purpose
object tracker, and there is a
rectangular object tracker.
How?
As you should have guessed by
now, we're going to use our
sequence request handler.
Results.
There are two types that are
important here.
There is a detected object
observation, which has an
important property in it,
bounding box, which tells you
where the object is located, and
there is a rectangular
observation, which has four
additional properties telling
you where the vertices of the
rectangle are.
Now, you may say, if I have my bounding box, why do I need the vertices of the rectangle? Well, rectangles are rectangular objects in real life, but depending on how they are projected into the frame, they may look different. They may look like a trapezoid, for example. So, the bounding box in this case is not the rectangle itself. It's the minimal box that includes all the vertices of the rectangle.
Let's look at the demo now.
So, what I have here is a sample app that, by the way, you can download from the WWDC website; the link is right next to this session. What the app does is take a movie and parse that movie into frames. In the first frame, you select an object, or multiple objects, that you want to track, and it does the tracking. So, let's first use this movie.
The user interface is simple.
First, you can choose between
objects or rectangles, and
second, you can choose which
algorithm you want to use, fast
or accurate.
In Vision, we support two types, fast and accurate, and this is a trade-off between speed and accuracy. I'm going to choose objects in this case, and I'm going to use my fast algorithm.
Let's select objects.
So, I'm going to track this
person under the red umbrella,
and I'm going to try to track
this group of people here.
Let's run it.
As you can see, we can
successfully track the object
that we selected.
Let's look at a more complex example.
What I want to track here, I
want to track this wakeboarder
guy, and in this case, I'm going
to use my accurate algorithm.
So, I'm going to select my
object,
and I'm going to run it.
As you can see, this object changes pretty much everything about itself: its shape, location, colors; everything changes. We're still able to track it.
I think this is pretty cool.
[ Applause ]
Now, we're going to switch to my
demo machine and see how the
actual tracking sequence is
implemented in this app.
So, I have Xcode running, and I have my iPhone connected to it, which is running the same app that we just saw.
I'm going to run it in the
debugger,
going to select my objects, and
it's not important what I select
because we just want to look at
the sequence.
And I'm going to run it.
So, I have a breakpoint set up
here and it breaks in the
perform tracking function, which
is the most important function
of this app.
That's the function that
implements the actual sequence.
Let's see what we do here.
First, we're creating our video
reader.
Then, we are reading first
frame, and we're discarding that
frame because that frame was
used to select the objects.
There's the cancellation flag
here.
Then, I'm going to initialize
the collection of my input
observations.
Remember, we saw an example of this in the slides.
Then, I have my bookkeeping to
be able to display results in a
graphical user interface, which
are kept in the trackedPolyRect
type.
Then, I'm going to run the
switch on the type, and the type
is something that comes from the
user interface, and in this
case, we're working with
objects.
Now, we selected two objects.
So, this is the information
coming from the user interface.
We should see these two objects
here.
Okay, there are two of them.
So, this loop will run two
times.
It will initialize the input observations. It'll create a detected object observation, as also shown in the slides, by passing the bounding box in.
And we initialize our
bookkeeping structures.
Let's run it.
Let's look at the observation
object.
There are a couple of fields
that are important here.
This is the unique ID that we
discussed.
And then there's our bounding box in normalized coordinates. Now, if I run through, I'm going to hit this breakpoint, because in this case we're not tracking rectangles.
This is where I create my
sequence request handler.
Now, I have my frame counter,
I have my flag if something has
failed, and I'm finally going to
start my tracking sequence.
As you can see, this is an
infinite loop, and the
conditions to get out of that
loop is if the cancellation was
requested, or if the movie has
ended.
I'm going to initialize my rect structure to keep the information to be displayed later in the graphical user interface, and I'm going to start iterating over my input observations, of which we have two. For each one, I'm going to create a track object request, and I'm going to append the request to the collection of all requests; we have two in this case. Then we break out of the loop, and finally, I'm ready to process my requests.
Now, as you can see, the perform function accepts a collection of requests. In the slides, we only passed a single request into that collection, but here we're going to track two objects at the same time.
I'm going to perform it.
Now, since the requests are
performed, I'm going to start
looking at the results, and I'm
going to do that by looking at
the results property of each
one.
So, I'm going to get the results property, and I'm going to get the first object in that property, because we expect a single observation in there.
What I'm going to do here, I'm
going to look at the confidence
property of my observation, and
I set an arbitrary threshold to
0.5.
So, if it's above the threshold,
I'm going to paint the bounding
box with solid line, and if it's
below the threshold, I'm going
to paint it with a dashed line.
So I can have an indication if something is going wrong.
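In code, that confidence check might look something like this sketch; trackingRequests, RectStyle, and drawBoundingBox are hypothetical stand-ins for the app's own types and drawing code:

```swift
// Hypothetical UI helper for the drawing style.
enum RectStyle { case solid, dashed }

for request in trackingRequests {
    guard let observation = request.results?.first as? VNDetectedObjectObservation else {
        continue
    }
    // Arbitrary threshold of 0.5: solid box when we trust the result,
    // dashed box as a visual warning otherwise.
    let style: RectStyle = observation.confidence > 0.5 ? .solid : .dashed
    drawBoundingBox(observation.boundingBox, style: style)
}
```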
The rest is simple bookkeeping.
I'm just going to populate my
rect structure, and this is the
last step, which is very
important where I take the
observation from the current
iteration, and I assign it for
the next iteration.
I'm going to do it a second
time.
I'm going to get to this
breakpoint.
I'm going to display my frame. Then I'm going to sleep for one frame duration to simulate the actual movie frame rate.
And then, before you know it,
you're in the second iteration
of your tracking sequence.
So, let's get back to the slides
now.
Thank you.
[ Applause ]
So, let's look at what's
important to remember from what
we have just seen.
First, how to initialize the object for tracking, and we saw two ways. There's the automatic way, which is usually done by running certain detectors and getting bounding boxes out. And the second is manual, which usually comes from user input.
We also saw that we used a
single tracking request per
tracked object.
The relationship here is
one-to-one.
We also saw that there are two
types of trackers.
One is the general purpose
tracker, and another one is
rectangular object tracker.
We also learned that there are
two algorithms for each tracker
type.
There is fast and accurate, and
this represents the tradeoff
between speed and accuracy.
And last but not least, we
looked at how to use a
confidence level property to
judge whether we should or
should not trust our results.
What are the limits of implementing a tracking sequence in Vision?
First, let's talk about number
of trackers.
How many objects can you track
simultaneously?
Well, in Vision we have a limit
that is set to 16 trackers for
each type.
So, you can have 16 general
purpose object trackers and 16
rectangular object trackers.
If you try to allocate more,
you'll get an error back.
So, if it happens, you probably
need to release some of the
trackers that you're already
using.
How to do that?
The first way is to set the last frame property on the request and feed that request into the request handler for processing. That way, the request handler will know that the tracker associated with this request object should be released.
Another way is to release the
entire sequence request handler;
in this case, all the trackers
associated with that request
handler will be released.
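In code, a sketch of the first approach, assuming trackingRequest, sequenceRequestHandler, and finalFrame come from an ongoing tracking sequence:

```swift
// Mark the request as the last frame and perform it one final time;
// the handler then releases the tracker associated with it.
trackingRequest.isLastFrame = true
try? sequenceRequestHandler.perform([trackingRequest], on: finalFrame)
```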
Now, let's say you've implemented the tracking sequence.
What are the potential
challenges that you may face?
Well, as you've seen, objects in
tracking sequence can change
pretty much everything about
themselves.
They can change their shape,
appearance, color, location, and
that represents a great
challenge for the algorithm.
So, what can you do here?
Well, the unfortunate answer is that there's no one-size-fits-all solution here, but you can try a couple of things.
First, you can play with fast versus accurate; you may find that your particular use case works better with a particular algorithm. If you're in charge of selecting the bounding box, try to select a salient object in the scene.
Which confidence threshold to
use?
Again, there is no single answer
here.
You will find that some use
cases work with certain
thresholds while other use cases
work with other thresholds.
There's one more technique that
I could recommend.
Let's say you have a long
tracking sequence, and for the
sake of this example, 1,000
frames.
If you start that tracking sequence, the object that you selected in the first frame will start deviating; it'll change more and more about itself the further you get from that initial frame.
What you can do instead, you can
break that sequence into smaller
subsequences, let's say 50
frames each.
You run your detector, and you track that object for 50 frames. You rerun the detector, you track again for 50 frames, and you keep going just like that.
From the end user's point of view, it'll look like you're tracking a single object. But what you do inside instead is track smaller subsequences, and that's a smarter way of running a tracking sequence.
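A rough sketch of that technique, where detectObject(in:) and frameChunks(of:) are hypothetical stand-ins for your detector and your frame feed:

```swift
import Vision

for chunk in frameChunks(of: 50) {
    // A fresh handler per subsequence also releases the old tracker.
    let handler = VNSequenceRequestHandler()
    // Re-run the detector to correct the drift accumulated so far.
    var observation = detectObject(in: chunk[0])

    for frame in chunk.dropFirst() {
        let request = VNTrackObjectRequest(detectedObjectObservation: observation)
        try? handler.perform([request], on: frame)
        if let result = request.results?.first as? VNDetectedObjectObservation {
            observation = result
        }
    }
}
```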
Let's summarize what we have
seen today.
First, we talked about why you'd use Vision: it's a multi-platform, privacy-oriented framework, which offers a simple and consistent interface.
Second, we talked about what's
new, and we introduced a new
orientation-agnostic face
detector.
We talked also about revisions.
Then, we talked about how to
interact with Vision API, and we
discussed requests, request
handlers, and observations.
And finally, we looked at how to
implement tracking sequence in
Vision.
For more information, I recommend you refer to the link on the slide. I can also recommend that you stay for the next session, which will be at 3 o'clock in this room, where Frank will cover the details of the integration of Vision and CoreML.
This is especially important if
you want to deploy your own
models.
That session will also cover
some details about Vision
Framework that were not covered
by this session.
And we'll also have a Vision Lab tomorrow, from 3 to 5.
Thank you and have a great rest
of your WWDC.
[ Applause ]