WWDC2019 Session 222

Transcript

[ Music ]
[ Applause ]
>> Good morning.
My name is Brittany Weinert, and
I'm a software engineer on the
Vision Framework Team.
This year the Vision Team has a
lot of exciting new updates that
we think you're all going to
love.
Because we have so much new
stuff to cover, we're going to
dive right into the new
features.
If you're completely new to
Vision, don't worry.
You should still be able to
follow along, and our hope is
that the new capabilities that
we introduce today will motivate
you to learn about Vision and to
use it in your apps.
Today we'll be covering four completely new topics: saliency, image classification, Image Similarity, and face quality.
We also have some technology
upgrades for the Object Tracker
and Face Landmarks as well as
new detectors and improved Core
ML support.
Today, I'm going to be talking
about saliency.
Let's start with a definition.
I'm about to show you a photo,
and I want you to pay attention
to where your eyes are first
drawn.
When you first saw this photo of
the three puffins sitting on a
cliff, did you notice what stood
out to you first?
According to our models, most of you looked at the puffins' faces first.
This is saliency.
There are two types of saliency,
attention based and objectness
based.
The overlay that you saw on the puffin image just now, called the heatmap, was generated by attention based saliency.
But before we get into more
visual examples, I want to go
over the basics of each
algorithm.
Attention based saliency is modeled on human attention; by this, I mean that the attention based saliency models were trained on where people looked when they were shown a series of images.
This means that the heatmap
reflects and highlights where
people first look when they're
shown an image.
Objectness based saliency on the
other hand was trained on
subject segmentation in an image
with the goal to highlight the
foreground objects or the
subjects of an image.
So, in the heatmap, the subjects
or foreground objects should be
highlighted.
Let's look at some examples now.
So, here are the puffins from
earlier.
Here's the attention based
heatmap overlaid on the image,
and here's the objectness based
heatmap.
As I said, people tend to look
at the puffins' faces first, so
the area around the puffins'
heads is very salient for the
attention based heatmap.
For objectness, we're just
trying to pick up the subjects,
and in this case, it's the three
puffins.
So, all the puffins are
highlighted.
Let's look at how saliency works
with images of people.
For attention based saliency,
the areas around peoples' faces
tend to be the most salient,
unsurprisingly because we tend
to look at people's faces first.
For objectness based saliency,
if the person is the subject of
the image, the entire person
should be highlighted.
So, attention based saliency is the more complicated of the two saliencies, I'd say, because it is determined by a number of very human factors. The main factors that determine what is salient or not are contrast, faces, subjects, horizons, and light.
But interestingly enough, it can
also be affected by perceived
motion.
In this example, the umbrella
colors really pop, so the area
around the umbrella is salient,
but the road is also salient
because our eyes try to track
where the umbrella is headed.
For objectness based saliency,
we just pick up on the umbrella
guy.
So, I could do this all day and
show you more examples, but
honestly, the best way to
understand saliency is to try it
out for yourself.
I encourage everybody to
download the Saliency app and
try it on their own photo
libraries.
So, let's get into what's returned from the saliency request, namely the heatmap. In the images that I've been showing you up until now, the heatmap has been scaled, colorized, and overlaid onto the image, but in actuality, the heatmap is a very small CVPixelBuffer made up of Floats in the range of 0 to 1, with 0 designating nonsalient and 1 being most salient. And there's extra code that you'd have to write to get the exact same effect you see here.
But let's go into how to
formulate a request at the very
basic level.
Okay. So, first we start out
with a VNImageRequestHandler to
handle a single image.
Next, you choose the algorithm
that you want to run, in this
case, AttentionBasedSaliency,
and set the revision if you
always want to be using the same
algorithm.
Next, you call perform request,
like you usually would, and if
it's successful, the results
property on the request should
be populated with a
VNSaliencyImageObservation.
To access the heatmap, you call
the pixelBuffer property on the
VNSaliencyImageObservation like
so.
If you wanted to do objectness
based saliency, all you would
have to do is change the request
name and the revision to be
objectness.
So, for attention, it's VNGenerateAttentionBasedSaliencyImageRequest, and for objectness, it's VNGenerateObjectnessBasedSaliencyImageRequest.
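In code, the basic saliency request might look roughly like this (a minimal sketch rather than the exact code from the slides; the CGImage input here is just one of the handler's options):

    import Vision

    func attentionSaliencyHeatmap(for cgImage: CGImage) throws -> CVPixelBuffer? {
        // Handler for a single image.
        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])

        // Choose the algorithm and pin the revision for deterministic behavior.
        let request = VNGenerateAttentionBasedSaliencyImageRequest()
        request.revision = VNGenerateAttentionBasedSaliencyImageRequestRevision1

        // Perform the request; the results are VNSaliencyImageObservations.
        try handler.perform([request])
        let observation = request.results?.first as? VNSaliencyImageObservation

        // The heatmap is a small pixel buffer of Floats in the range 0 to 1.
        return observation?.pixelBuffer
    }

Swapping in VNGenerateObjectnessBasedSaliencyImageRequest (and its revision constant) gives you the objectness based variant.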
So, let's get into another tool other than the heatmap: the bounding box.
The bounding boxes encapsulate
all the salient regions in an
image.
For attention based saliency,
you should always have one
bounding box, and for objectness
based saliency, you can have up
to three bounding boxes.
The bounding boxes are in
normalized coordinate space with
respect to the image, the
original image, and the lower
left-hand corner is the origin
point, much like bounding boxes
returned by other algorithms in
Vision.
So, I wrote up a small method to
show how to access the bounding
boxes and use them.
Here we have a
VNSaliencyImageObservation, and
all you have to do is access the
salientObjects property on that
observation, and you should get
a list of bounding boxes, and
you can access them like so.
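A small sketch of that, converting the normalized boxes back into image coordinates (the observation and image dimensions are assumed to come from your own request code):

    import Vision

    func salientRects(in observation: VNSaliencyImageObservation,
                      imageWidth: Int, imageHeight: Int) -> [CGRect] {
        // One box for attention based saliency, up to three for objectness based.
        guard let objects = observation.salientObjects else { return [] }

        // Boxes are normalized with the origin in the lower-left corner;
        // convert them into pixel coordinates of the original image.
        return objects.map {
            VNImageRectForNormalizedRect($0.boundingBox, imageWidth, imageHeight)
        }
    }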
Okay. So, now that you know how
to formulate a request and now
that you know what saliency is,
let's get into some of the use
cases.
First, for a bit of fun, you can
use saliency as a graphical mask
to edit your photos with.
So, here you have the heatmaps.
On the left-hand side, I've
desaturated all the nonsalient
regions, and on the right-hand
side, I've added a Gaussian blur
to all the nonsalient regions.
It really makes the subjects
pop.
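One way to approximate the desaturation effect is with Core Image, using the heatmap as a mask; this is an illustrative sketch under that assumption, not the exact code behind the slides:

    import Vision
    import CoreImage
    import CoreImage.CIFilterBuiltins

    func desaturateNonSalientRegions(of image: CIImage,
                                     using observation: VNSaliencyImageObservation) -> CIImage? {
        // Scale the small heatmap up to the size of the original image.
        let heatmap = CIImage(cvPixelBuffer: observation.pixelBuffer)
        let scaleX = image.extent.width / heatmap.extent.width
        let scaleY = image.extent.height / heatmap.extent.height
        let mask = heatmap.transformed(by: CGAffineTransform(scaleX: scaleX, y: scaleY))

        // Fully desaturated version of the image for the nonsalient regions.
        let desaturate = CIFilter.colorControls()
        desaturate.inputImage = image
        desaturate.saturation = 0

        // Salient areas keep the original colors; everything else is desaturated.
        let blend = CIFilter.blendWithMask()
        blend.inputImage = image
        blend.backgroundImage = desaturate.outputImage
        blend.maskImage = mask
        return blend.outputImage
    }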
Another use case of saliency is
you can enhance your photo
viewing experience.
So, let's say that you're at
home.
You're sitting on the couch, and
either your TV or your computer
has gone into standby mode, and
it's going through your photo
library.
A lot of times, these
photo-showing algorithms can be
a little bit awkward.
They zoom into seemingly random
parts of the image, and it's not
always what you expect.
But with saliency, you always
know where the subjects are, so
you can get a more
documentary-like effect like
this.
Finally, saliency works really well with other Vision algorithms.
Let's say we have an image, and
we want to classify the objects
in the image.
We can run objectness based saliency to pick up on the objects in the image, crop the image to the bounding boxes returned by objectness based saliency, and run these crops through an image classification algorithm to find out what the objects are. So, not only do you know where the objects are in the image because of the bounding boxes, but it also allows you to home in on what they are by picking out just the crops that contain those objects.
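Here is a rough sketch of that pipeline, using Vision's own classifier (covered next) on each crop; picking the single top label per crop is a simplification for illustration:

    import Vision

    func classifySalientObjects(in cgImage: CGImage) throws -> [(identifier: String, confidence: Float)] {
        // 1. Find the foreground objects with objectness based saliency.
        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        let saliency = VNGenerateObjectnessBasedSaliencyImageRequest()
        try handler.perform([saliency])
        let objects = (saliency.results?.first as? VNSaliencyImageObservation)?.salientObjects ?? []

        var results: [(String, Float)] = []
        for object in objects {
            // 2. Crop the image to each salient bounding box.
            var rect = VNImageRectForNormalizedRect(object.boundingBox, cgImage.width, cgImage.height)
            // Vision uses a lower-left origin; CGImage cropping expects upper-left.
            rect.origin.y = CGFloat(cgImage.height) - rect.origin.y - rect.height
            guard let crop = cgImage.cropping(to: rect) else { continue }

            // 3. Classify the crop.
            let classify = VNClassifyImageRequest()
            try VNImageRequestHandler(cgImage: crop, options: [:]).perform([classify])
            if let top = (classify.results as? [VNClassificationObservation])?
                .max(by: { $0.confidence < $1.confidence }) {
                results.append((top.identifier, top.confidence))
            }
        }
        return results
    }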
Now, you can already classify things with Core ML, but this year, Vision has a new image classification technique that Rohan will now present to you.
[ Applause ]
>> Good morning.
My name is Rohan Chandra, and
I'm a researcher on the Vision
Team.
Today, I'm going to be talking
about some of the new image
classification requests we're
introducing to the Vision API
this year.
Now, image classification as a
task is fundamentally meant to
answer the question, what are
the objects that appear in my
image.
Many of you will already be
familiar with image
classification.
You may have used Create ML or
Core ML to train your own
classification networks on your
own data as we showed in the
Vision with Core ML talk last
year.
Others of you may have been
interested in image
classification but felt you
lacked the resources or the
expertise to develop your own
networks.
In practice, developing a
large-scale classification
network from scratch can take
millions of images to annotate,
thousands of hours to train, and
very specialized domain
expertise to develop.
We here at Apple have already
gone through this process, and
so we wanted to share our
large-scale, on-device
classification network with you
so that you can leverage this
technology without needing to
invest a huge amount of time or
resources into developing it
yourself.
We've also strived to put tools
in the API to help you
contextualize and understand the
results in a way that makes
sense for your application.
Now, the network we're talking
about exposing here is in fact
the same network we ourselves
use to power the photo search
experience.
This is a network we've developed specifically to run efficiently on device, without requiring any server-side processing.
We've also developed it to
identify over a thousand
different categories of objects.
Now, it's also important to note
that this is a multi-label
network capable of identifying
multiple objects in a single
image, in contrast to more
typical mono-label networks that
try to focus on identifying a
single large central object in
an image.
Now, as I talk about this new
classification API, I think one
of the first questions that
comes to mind is what are the
objects it can actually
identify?
Well, the set of objects that a
classifier can predict is known
as the taxonomy.
The taxonomy has a hierarchical
structure with directional
relationships between classes.
These relationships are based
upon shared semantic meaning.
For instance, a class like dog
might have children like Beagle,
Poodle, Husky, and other
sub-breeds of dogs.
In this sense, a parent class
tends to be more general while
child classes are more specific
instances of their parent.
You can of course see the entire taxonomy using VNClassifyImageRequest's knownClassifications method.
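For example, a short sketch of listing the taxonomy (the revision constant is the one introduced with this request):

    import Vision

    // Every class identifier the built-in classifier can report.
    let taxonomy = try VNClassifyImageRequest.knownClassifications(
        forRevision: VNClassifyImageRequestRevision1)
    for classification in taxonomy {
        print(classification.identifier)
    }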
Now, when we constructed the
taxonomy, we had a few specific
rules that we applied.
The first is that the classes
must be visually identifiable.
That is, we avoid more abstract
concepts like holiday or
festival.
We also avoid any classes that might be considered controversial or offensive, as well as those to do with proper nouns, adjectives, or basic shapes.
Finally, we omit occupations,
and this might seem odd at
first.
But consider the range of
answers you'd get if we asked
something like what does an
engineer look like.
There probably isn't a single
concise description you could
give that would apply to every
engineer aside from sleep
deprived and usually glued to a
computer screen.
Let's take a look at the code
you need to use in order to
classify an image.
So, as usual, you form an
ImageRequestHandler to your
source image.
You then perform the
VNClassifyImageRequest and
retrieve your observations.
Now, in this case, you actually
get an array of observations,
one for every class in the
taxonomy and its associated
confidence.
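A minimal sketch of those steps (the source URL is a placeholder for however your app obtains the image):

    import Vision

    func classify(imageAt url: URL) throws -> [VNClassificationObservation] {
        // Point the handler at the source image.
        let handler = VNImageRequestHandler(url: url, options: [:])

        // Run the built-in multi-label classifier.
        let request = VNClassifyImageRequest()
        try handler.perform([request])

        // One observation per class in the taxonomy, each with its own confidence.
        return request.results as? [VNClassificationObservation] ?? []
    }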
In a mono-label problem, you'd
probably expect that these
probabilities sum up to 1, but
this is a multi-label
classification network, and each
prediction is an independent
confidence associated with a
particular class.
As such, they won't sum to 1,
and they're meant to be compared
within the same class, not
across different classes.
So we can't simply take the max
amongst them in order to
determine our final prediction.
You might be wondering then, how
do I deal with so many classes
and so many numbers.
Well, there are a few key tools
in the API that we've
implemented to help you make
sense of the result.
Now, in order to talk about
these tools in the API, we first
need to define some basic terms.
The first is when you get a
confidence for a class, we
typically compare that to a
class-specific threshold, which
we refer to as an operating
point.
If the class confidence is above
the threshold, then we say that
class is present in the image.
If the class confidence is below
the class threshold, then we say
that object is not present in
the image.
In this sense, we want to pick thresholds such that images with the target class typically have a confidence higher than the threshold, and images without the target class typically have a score lower than the threshold.
However, machine learning is not
infallible, and there will be
instances where the network is
unsure and the confidence is
proportionally lower.
This can happen when objects are occluded, appear in odd lighting, or are seen at odd angles, for instance.
So how do we pick our
thresholds?
Well, there are essentially
three different regimes we can
be in depending on our choice of
threshold that yield three
different kinds of searches.
To make this a little more
concrete, let's say I have a
library of images for which I've
already performed classification
and stored the results.
Let's say in this particular
case I'm looking for images of
motorcycles.
Now, I want to pick my
thresholds such that images with
motorcycles typically have a
confidence higher than this
threshold and images without
motorcycles typically have a
score lower than this threshold.
So, what happens if I just pick a low threshold?
As you can see behind me, when I
apply this low threshold, I do
in fact get my motorcycle
images, but I'm also getting
these images of mopeds in the
bottom right.
And if my users are motorcycle
enthusiasts, they might be a
little annoyed with that result.
When we talk about a search that tries to maximize the percentage of the target class retrieved from the entire library, and isn't as concerned with false predictions where we say the motorcycle is present when it actually isn't, we are typically talking about a high recall search.
Now, I could maximize recall by
simply returning as many images
as possible, but I would get a
huge number of these false
predictions where I say my
target class is present when it
actually isn't, and so we need
to find a more balanced point of
recall to operate at.
Let's take a look at how I need
to change my code in order to
perform this high recall search.
So, here I have the same code
snippet as before, but this time
I'm performing a filtering with
hasMinimumPrecision and a
specific recall value.
For each observation in my array
of observations, the filter only
retains it if the confidence
associated with the class
achieves the level of recall
that I specified.
Now, the actual operating point
needed to determine this is
going to be different for every
class, and it's something we've
determined based on our internal
tests of how the network
performs on every class in the
taxonomy.
However, the filter handles this
for you automatically.
All you need to do is specify
the level of recall you want to
operate at.
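For example, a sketch of that filter (observations is the array returned by VNClassifyImageRequest; the 0.1 and 0.7 values are just illustrative choices):

    // High recall search: keep classes whose confidence clears the operating
    // point for a recall of at least 0.7, with some minimum precision.
    let highRecallResults = observations.filter {
        $0.hasMinimumPrecision(0.1, forRecall: 0.7)
    }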
So, we talked about a high
recall search here, but what if
I have an application that can't
tolerate these false predictions
where I'm saying motorcycles are
present when they're not.
That is, I want to be absolutely
sure that the images I retrieve
actually do contain a
motorcycle.
Well, let's come back to our
library of images then and see
what would happen if we applied
the higher threshold.
As you can see behind me, when I
apply my high threshold, I do in
fact only get motorcycle images,
but I get far fewer images
overall.
When we talk about a search that
tries to maximize the percentage
of the target class amongst the
retrieved images and isn't as
concerned with overlooking some
of the more ambiguous images
that actually do contain the
target class, we are typically
talking about a high precision
search.
Again, like with high recall, we need to find a more balanced operating point where I have an acceptable likelihood of my target class appearing in my results, but I'm not getting too few images.
So, let's take a look at how I
need to modify my code in order
to perform this high precision
search.
So here's the same code snippet,
but this time my filtering is
done with hasMinimumRecall and a
precision value I've specified.
Again, I only retain the
observation if the confidence
associated with it achieves the
level of precision that I
specified.
The actual threshold needed for
this is going to be different
for every class, but the filter
handles that for me
automatically.
All I need to do is tell it the
level of precision I want to
operate at.
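A matching sketch for the high precision case (again, the numeric values are only examples):

    // High precision search: keep classes that can be reported with a
    // precision of at least 0.9, while requiring some minimum recall.
    let highPrecisionResults = observations.filter {
        $0.hasMinimumRecall(0.01, forPrecision: 0.9)
    }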
So we've talked about two
different extremes here, one of
high recall and one of high
precision, but in practice, it
can be better to find a balanced
tradeoff between the two.
So, let's see how we can go
about doing that, and in order
to understand what's happening,
I first need to introduce
something known as the precision
and recall curve.
So, in practice, there is a
tradeoff to be made where
increasing one of precision and
recall can lead to a decrease in
the other.
I can represent this tradeoff as
a graph, where for each
operating point I can compute
the corresponding precision and
recall.
For instance, at the operating point where I achieve a recall of 0.7, I find that I get a corresponding precision of 0.74.
I can compute this for a
multitude of operating points in
order to form my full curve.
As I said before, I want to find
a balance point along this curve
that achieves the level of
recall and precision that makes
sense for my application.
So let's see how I need to
change my code in order to
accomplish it and how the
precision and recall curve plays
into that.
So here I have a filtering with
hasMinimumPrecision where I'm
specifying the minimum precision
and a recall value.
When I specify a
MinimumPrecision, I'm actually
selecting an area along the
graph that I want to operate
within.
When I select a recall point
with forRecall, I'm choosing a
point along the curve that will
be my operating point.
Now, if the operating point is
in the valid region that I
selected, then that is the
threshold that the filter will
apply when looking at that
particular class.
If the operating point is not in
the valid region, then there is
no operating point that meets
the constraints I stated, and
the class will always be
filtered out of my results.
In this sense, all you need to
do is provide the level of
precision and recall that you
want to operate at, and the
filter will determine the
necessary thresholds for you
automatically.
So, to summarize, the
observation I get back when
performing image classification
is actually an array of
observations, one for every
class in the taxonomy.
Because this is a multi-label
problem, the confidences will
not sum to 1.
Instead, we have independent confidence values, one for every class, each between 0 and 1, and we need to understand precision and recall and how they apply to our specific use case in order to apply a filtering with hasMinimumPrecision or hasMinimumRecall that makes sense for our application.
So, that concludes the portion
on image classification.
I'd like to switch gears and talk about a related topic, Image Similarity.
When we talk about Image
Similarity, what we really mean
is a method to describe the
content of an image and another
method to compare those
descriptions.
The most basic way in which I
can describe the contents of an
image is using the source pixels
themselves.
That is, I can search for other
images that have close to or
exactly the same pixel values
and retrieve them.
If I did a search in this fashion, however, it would be extremely fragile, easily fooled by small changes like rotations or lighting augmentations that drastically change the pixel values but not the semantic content of the image.
What I really want is a more
high-level description of what
the content of the image is,
perhaps something like natural
language.
I could make use of the image
classification API I was
describing previously in order
to extract a set of words that
describe my image.
I could then retrieve other
images with a similar set of
classifications.
I might even combine this with
something like word vectors to
account for similar but not
exactly matching words like cat
and kitten.
Well, if I performed a search
like this, I might get similar
objects in a very general sense,
but the way in which those
objects appear and the
relationships between them could
be very different.
As well, I would be limited by
the taxonomy of my classifier.
That is, any object that appeared in my image but wasn't in my classification network's taxonomy couldn't be expressed in a search like this.
What I really want is a
high-level description of the
objects that appear in the image
that isn't fixated on the exact
pixel values but still cares
about them.
I also want this to apply to any
natural image and not just those
within a specific taxonomy.
As it turns out, this kind of
representation learning is
something that's naturally
engendered in our classification
network as part of its training
process.
The upper layers of the network
contain all of the salient
information necessary to perform
classification while discarding
any redundant or unnecessary
information that doesn't aid it
in that task.
We can make use of these upper
layers then to act as our
feature descriptor, and it's
something we refer to as the
feature print.
Now, the feature print is a
vector that describes the
content of the image that isn't
constrained to a particular
taxonomy, even the one that the
classification network was
trained on.
It simply leverages what the
network has learned about images
during its training process.
If we look at these pairs of
images, we can compare how
similar their feature prints
are, and the smaller the value
is, the more similar the two
images are in a semantic sense.
We can see that even though the
two images of the cats are
visually dissimilar, they have a
much more similar feature print
than the visually similar pairs
of different animals.
To make this a little more
concrete, let's go through a
specific example.
Let's say I have the source
image on screen, and I want to
find other semantically similar
images to it.
I'm going to take a library of
images and compute the feature
print for each image and then
retrieve those images with the
most similar feature print to my
source image.
When I do it for this image of
the gentleman in the coffee
shop, I find I get other images
of people in coffee shop and
restaurant settings.
If I focus on a crop of the
newspaper, however, I get other
images of newspapers.
And if I focus on the teapot, I
get other images of teapots.
I'd like to now invite the
Vision Team onstage to help me
with a quick demonstration to
expand a little more on how
Image Similarity works.
[ Applause ]
>> Hello everyone.
My name is Brett, and we have a
really fun way to demonstrate
Image Similarity for you today.
We have very creatively called
it the Image Similarity game.
And here is how you play.
You draw something on a piece of
paper, then ask a few friends to
re-create your original as close
as possible.
So I will start by drawing the
original.
Okay. Tap continue to scan it in
as my original.
And then save.
Now, my team will act as
contestants, and they will draw
this as best as they can.
Now, while they're drawing, I should tell you that this sample app is available to you now on the developer documentation website as sample code. Also, we are using the VisionKit document scanner to scan in our drawings, and you can learn more about that in our text recognition session.
Let's give them a few more seconds.
Five, four, three, okay, I guess
they're done.
Okay. Let's bring them up and
start scanning them in.
Contestant number one.
Pretty good [applause].
That might be a winner.
Let's see contestant number two.
Still pretty good.
Nicely done.
[ Applause ]
Contestant number three please.
[ Laughter and Applause ]
I think that's pretty good.
[ Applause ]
And contestant number four.
Well, I don't know about that,
but we'll see how it goes.
[ Applause ]
All right.
So let's save those, and we find
out that the winner is
contestant number one.
Congratulations.
[ Applause ]
Now I can swipe over, and we can see that the faces are more semantically similar that way; they are closer to the original, while the tree is semantically different and so was much further away. And that is the Image Similarity game, and back to Rohan.
[ Applause ]
>> Thanks everyone.
I want to take a quick look at a
snippet from that demo
application to show how we
determined the winning
contestant.
So here I have the portion of the code that compares each contestant's drawing's feature print to Brett's drawing's feature print. Now, I extracted each contestant's feature print with a function we have defined in the application called featureprintObservationForImage.
Once I have each feature print,
I then need to determine how
similar it was to the original
drawing, and I can do that using
computeDistance, which returns
me a floating-point value.
Now, the smaller the
floating-point value, the more
similar the two images are.
And so, once I've determined
this for every contestant, I
simply need to sort them in
order to determine the winner.
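Condensed into a sketch, those two steps look roughly like this (the helper below approximates what the sample app's featureprintObservationForImage does; see the published sample code for the real version):

    import Vision

    // Compute a feature print for an image.
    func featurePrint(for cgImage: CGImage) throws -> VNFeaturePrintObservation? {
        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        let request = VNGenerateImageFeaturePrintRequest()
        try handler.perform([request])
        return request.results?.first as? VNFeaturePrintObservation
    }

    // Smaller distance means the two images are more semantically similar.
    func distance(from a: VNFeaturePrintObservation,
                  to b: VNFeaturePrintObservation) throws -> Float {
        var distance: Float = 0
        try a.computeDistance(&distance, to: b)
        return distance
    }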
Well, this concludes the portion
on Image Similarity.
I'd now like to hand the mic
over to Sergey to talk about
some of the changes coming to
Face Technologies.
[ Applause ]
>> Good morning everybody.
My name is Sergey Kamensky.
I'm a software engineer on the
Vision Framework Team.
I'm excited to share with you
today even more new features
coming to the Framework this
year.
Let's talk about Face Technology
first.
Remember, two years ago when we introduced the Vision framework, we also talked about the Face Landmarks detector. This year, we're coming out with a new revision of this algorithm.
So, what are the changes?
Well, first, we now have a 76-point constellation, versus the 65-point constellation that we had before. The 76-point constellation gives us greater density to represent the different face regions.
Second, we now report confidence
score per landmark point, and
this is versus a single average
confidence score, as we reported
before.
But the biggest improvement
comes in the pupil detection.
As you can see, the image on the
right-hand side has pupils
detected with much better
accuracy.
Let's take a look at the client
code sample.
This code snippet will repeat
throughout the presentation so
the first time we're going to go
line by line.
Also, I use forced unwrapping in my samples. This is just to simplify the slides; when you develop your apps, you should use proper error handling to avoid unwanted boundary conditions.
Let's get back to the sample.
In order to get your facial
landmarks, first you need to
create a
DetectFaceLandmarksRequest.
Then, you need to create an ImageRequestHandler, passing into it the image that needs to be processed, and then you need to use that request handler to process your request.
Finally, you need to look at the
results.
The results for everything that is human-face related in the Vision framework come in the form of face observations. Face observation derives from detected object observation: it inherits the bounding box property, and it also adds several other properties at its level to describe the human face.
This time we'll be interested in
the landmarks property.
The landmarks property is of
FaceLandmarks2D class.
FaceLandmarks2D class consists
of the confidence score.
This is the average single
average confidence score for the
entire set and multiple face
regions where each face region
is represented by
FaceLandmarksRegion2D class.
Let's take a closer look at the
properties of this class.
First is pointCount. PointCount will tell you how many points represent a particular face region. This property will return a different value depending on how you configure your request, with the 65-point constellation or the 76-point constellation. The normalizedPoints property contains the actual landmark points, and precisionEstimatesPerPoint contains the confidence score for each landmark point.
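Putting that together, a sketch of reading the new per-point data from a face observation you already have (the pupil region is just one example):

    import Vision

    func inspectLandmarks(of face: VNFaceObservation) {
        guard let landmarks = face.landmarks else { return }

        // The single averaged confidence for the entire landmark set.
        print("Overall landmarks confidence: \(landmarks.confidence)")

        // Each region exposes its points and, new this year, per-point estimates.
        if let leftPupil = landmarks.leftPupil {
            print("Left pupil point count: \(leftPupil.pointCount)")
            print("Normalized points: \(leftPupil.normalizedPoints)")
            if let precision = leftPupil.precisionEstimatesPerPoint {
                print("Per-point confidence: \(precision)")
            }
        }
    }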
Let's take a look at the code snippet again.
This is the same code snippet as
in the previous slide, but now
we're going to look at it from a
slightly different perspective.
We want to see how revisioning
of the algorithm works in Vision
Framework.
If you take this code snippet and recompile it with last year's SDK, the request object will be configured as follows: the revision property will be set to revision number 2, and the constellation property will be set to the 65-point constellation. Technically, we didn't have a constellation property last year, but if we did, we could have set it to a single value only.
Now, if on the other hand you take the same code snippet and recompile it with the current SDK, the revision property will be set to revision number 3, and the constellation property will be set to the 76-point constellation.
This actually represents the
philosophy of how Vision
Framework handles revisions of
algorithms by default.
If you don't specify a revision, we will give you the latest revision supported by the SDK your code is compiled and linked against. Of course, we always recommend setting those properties explicitly, just to guarantee deterministic behavior in the future.
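For example, a two-line sketch of pinning both properties on the request:

    import Vision

    let request = VNDetectFaceLandmarksRequest()
    // Pin the revision and the point constellation for deterministic behavior.
    request.revision = VNDetectFaceLandmarksRequestRevision3
    request.constellation = .constellation76Points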
Let's take a look at a new metric that we developed this year, Face Capture Quality.
There are two images on the
screen.
You can clearly see that one
image was captured with better
lighting and focusing
conditions.
We wanted to develop a metric that looks at the image as a whole and gives you one score back saying how good or bad the capture quality was.
As a result, we came up with a
Face Capture Quality metric.
We trained our models for this metric in such a way that they tend to score lower if the image was captured in low light or with bad focus, or, for example, if the person had a negative expression.
If we run this metric on these
two images, we will get our
scores back.
These are floating-point
numbers.
You can compare them against
each other, and you can say that
the image that scored higher is
the image that was captured with
better quality.
Let's take a look at the code
sample.
This is very similar to what we
saw just a couple of slides ago,
with the differences being in
the request type and the
results.
Since we are still dealing with faces, we're going to get our face observations back, but now we're going to look at a different property of the face observation, the faceCaptureQuality property.
Let's take a look at a broader example. Let's say I have a sequence of images that could have been obtained by using the burst mode on the selfie camera or a photo burst, for example. And you ask yourself a question: which image was captured with the best quality? What you can do now is run our algorithm on each image, assign scores, and rank them, and the image that ranks the highest is the image that was captured with the best quality.
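As a sketch, ranking a burst that way might look like this (burstImages is a placeholder for however you collect the frames):

    import Vision

    func bestCapture(in burstImages: [CGImage]) throws -> CGImage? {
        var best: (image: CGImage, quality: Float)?
        for image in burstImages {
            let request = VNDetectFaceCaptureQualityRequest()
            try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])

            // faceCaptureQuality is only meaningful relative to other captures
            // of the same subject, not as an absolute threshold.
            guard let face = request.results?.first as? VNFaceObservation,
                  let quality = face.faceCaptureQuality else { continue }
            if quality > (best?.quality ?? -1) {
                best = (image, quality)
            }
        }
        return best?.image
    }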
Let's try to understand how we
can interpret the results that
are coming from the Face Capture
Quality metric.
I have two sequences of images
on the slide.
Each sequence is of the same person, and each sequence is represented by the images that scored lowest and highest in the sequence with respect to Face Capture Quality. What can we say about these ranges? Well, there is some overlapping region, but there are also some regions that belong to one and not to the other.
If you had yet another sequence,
it could have happened that
there was no overlapping region
at all.
The point I'm trying to make
here is that the Face Capture
Quality should not be compared
against a threshold.
In this particular example, if I picked 0.52, I would have missed all the images on the left, and I would pretty much get any image that's just past the midpoint on the right.
But then what is Face Capture
Quality?
We define Face Capture Quality as a comparative or ranking measure of the same subject. Now, comparative and same are the key words in this sentence.
If you're thinking, cool, I have
this great new metric, I'm going
to develop my beauty contest
app.
Probably not a good idea.
In a beauty contest app, you
would have to compare faces of
different people, and that's not
what this metric was developed
and designed for.
And that's Face Technology.
Let's take a look at the new
detectors we're adding this
year.
We're introducing a Human Detector that detects the human upper body, consisting of the head and torso, and also a pet detector, an Animal Detector, that detects cats and dogs. The Animal Detector gives you a bounding box back, and in addition to the bounding boxes it also gives you a label saying which animal was detected.
Let's take a look at the client
code sample.
Two snippets, one for Human
Detector, one for Animal
Detector.
Very similar to what we had
before.
Again, the differences are in
the request types that you
create and in the results.
Now, for the Human Detector, all we care about is the bounding box, so for that we use DetectedObjectObservation. For the Animal Detector, on the other hand, we also need the label, so we use RecognizedObjectObservation, which derives from detected object observation. It inherits the bounding box, but it also adds a labels property on top.
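A sketch of both requests side by side (the result casting mirrors the observation types just mentioned):

    import Vision

    func detectHumansAndPets(in cgImage: CGImage) throws {
        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        let humanRequest = VNDetectHumanRectanglesRequest()
        let animalRequest = VNRecognizeAnimalsRequest()
        try handler.perform([humanRequest, animalRequest])

        // Humans: upper-body bounding boxes only.
        for human in (humanRequest.results as? [VNDetectedObjectObservation]) ?? [] {
            print("Human upper body at \(human.boundingBox)")
        }

        // Animals: a bounding box plus labels (cat or dog) with confidences.
        for animal in (animalRequest.results as? [VNRecognizedObjectObservation]) ?? [] {
            let label = animal.labels.first
            print("Animal \(label?.identifier ?? "?") at \(animal.boundingBox)")
        }
    }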
And that's new detectors.
Let's take a look at what's new
in tracking this year.
We're coming up with a new
revision for the Tracker.
The changes are: we have improvements in the bounding box expansion area, we can now handle occlusions better, the tracker is machine learning based this time, and it can run with low power consumption on multiple [inaudible] devices.
Let's take a look at a sample.
I have a mini video clip where a
man is running in the forest,
and he appears sometimes behind
the trees.
As you can see, the tracker is
able to successfully recapture
the tracked object and keep
going with the tracking
sequence.
[ Applause ]
Thank you.
[ Applause ]
Let's take a look at the client
code sample.
This is exactly the same snippet
that we showed last year.
It represents probably the
simplest tracking sequence you
can imagine.
It tracks your object of
interest for five consecutive
frames.
I won't go line by line, but I want to emphasize two points here.
First, we use a SequenceRequestHandler, as opposed to the ImageRequestHandler that we have used so far throughout the presentation. The SequenceRequestHandler is used in Vision when you work with a sequence of frames and need to cache some information from frame to frame.
The second point is that when you implement your tracking sequence, you need to take the results from iteration number n and feed them as input to iteration number n plus 1. Of course, if you recompile this code with the current SDK, the revision of the request will be set to revision number 2 by default, but we again recommend setting it explicitly.
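A minimal sketch of that loop over a handful of frames (the initial observation and the frames array are placeholders for your own detection and capture code):

    import Vision

    func track(initialObject: VNDetectedObjectObservation,
               over frames: [CVPixelBuffer]) throws {
        // One sequence handler caches state across the whole sequence of frames.
        let sequenceHandler = VNSequenceRequestHandler()
        var inputObservation = initialObject

        for frame in frames {
            let request = VNTrackObjectRequest(detectedObjectObservation: inputObservation)
            request.revision = VNTrackObjectRequestRevision2
            try sequenceHandler.perform([request], on: frame)

            // Feed the result of iteration n in as the input to iteration n + 1.
            guard let result = request.results?.first as? VNDetectedObjectObservation else { break }
            print("Tracked object at \(result.boundingBox)")
            inputObservation = result
        }
    }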
And that's the tracking.
Let's take a look at the news
with respect to Vision and Core
ML integration.
Last year, we presented
integration with Vision and Core
ML, and we showed how you can
run Core ML models through
Vision API.
The advantage of doing that is that you can use one of the five different overloads of the image request handler to translate the image that you have in hand to the image type, size, and color scheme that the Core ML model requires. We will run the inference for you, and we'll pack the outputs or results coming from the Core ML model into Vision observations.
Now, if you have a different
task in mind, for example, if
you want to do image style
transfer, you need to have at
least two images, the image
content and the image style.
You may also need to have some mix ratio saying how much of the style needs to be applied to the content.
So, I have three parameters now.
Well, this year we're introducing API where you can pass multiple inputs through Vision to Core ML, including multi-image inputs.
Also, on the output section,
this sample shows only one
output.
But, for example, if you had more than one, and especially if you have more than one of the same type, it's hard to distinguish them when they come in the form of observations later on.
So, what we do this year is introduce a new field in the observation that maps exactly to the name that shows up here in the output section.
Let's take a look at the inputs
and outputs.
We will use them in the next
slide.
This is the code snippet that
represents how to use Core ML
through Vision.
The highlighted sections show
what's new this year.
Let's keep them for now, and
we'll go over the code, and
we'll return to them later.
In order to run Core ML through Vision, first you need to load your Core ML model. Then, you need to create a Vision CoreMLModel wrapper around it. Then, you need to create a Vision CoreMLRequest and pass in that wrapper.
Then you create
ImageRequestHandler, you process
your request, and you look at
the results.
Now, with the new API that we added this year, the only image that you could use last year, the default or main image, is the image that is passed to the ImageRequestHandler, but that's also the image whose name needs to be assigned to the inputImageFeatureName field of the CoreMLModel wrapper. All other parameters, whether images or not, have to be passed through the featureProvider property of the CoreMLModel wrapper. As you can see, the image style and the mix ratio are passed in that way.
Finally, when you look at the
results, you can look at the
feature name property of the
observation that comes out, and
you can compare it in this case
against image result.
That's exactly the name that
appears in the output section of
Core ML, and that way you can
process your results
accordingly.
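As a rough sketch of that flow, assuming a hypothetical style-transfer model whose inputs are named "image", "style", and "ratio" and whose output is named "imageResult" (all of those names, and the model itself, are illustrative):

    import Vision
    import CoreML

    func stylize(content: CGImage, style: CVPixelBuffer,
                 ratio: Double, modelURL: URL) throws -> CVPixelBuffer? {
        // Load the Core ML model and wrap it for Vision.
        let coreMLModel = try MLModel(contentsOf: modelURL)
        let visionModel = try VNCoreMLModel(for: coreMLModel)

        // The main image is the one passed to the image request handler;
        // its input name is assigned here.
        visionModel.inputImageFeatureName = "image"

        // All remaining inputs go through the feature provider.
        visionModel.featureProvider = try MLDictionaryFeatureProvider(dictionary: [
            "style": MLFeatureValue(pixelBuffer: style),
            "ratio": MLFeatureValue(double: ratio)
        ])

        let request = VNCoreMLRequest(model: visionModel)
        try VNImageRequestHandler(cgImage: content, options: [:]).perform([request])

        // Match outputs by featureName when the model has more than one.
        for case let observation as VNPixelBufferObservation in request.results ?? []
            where observation.featureName == "imageResult" {
            return observation.pixelBuffer
        }
        return nil
    }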
This slide actually concludes
our presentation for today.
For more information you can
refer to the links on the slide.
Thank you, and have a great rest
of your WWDC.
[ Applause ]