Transcript
[ Music ]
[ Applause ]
>> Good afternoon, and welcome
to WWDC.
I know I'm competing now with
the coffee and cookies, but I
don't know if the coffee is
gluten free and the cookies are
actually caffeine free, so stick
with me.
My name is Frank Doepke and I'm
going to talk about some
interesting stuff that you can
do with Computer Vision using
Core ML and the Vision framework.
So, what are we going to talk
about today?
First, you might have heard that
we have something special for
you in terms of custom image
classification.
Then we're going to do some
object recognition.
And last but not least, I'm
going to raise your level of
Vision awareness by diving into
some of the fundamentals.
Now, Custom Image
Classification, we've seen the
advantages already in some of
the presentations earlier, and
we see like what we can do with
flowers and fruits and I like
flowers and fruits as much as
everybody else.
Sorry, Lizzie, but I thought
we'd do something a bit more
technical here.
So, the idea is: we're creating a shop, and since we are all geeks, let's build a shop where we can build robots.
So, there are some parts that we
need to identify.
And I thought it would be great
if I have an app that actually
can help customers in my store
identify what these objects are.
So, we've got to train a custom classifier for that.
Then, once we have our
classifier, we build an iOS app
around that, that we can
actually run on all devices.
And when going through this, I'm
going to go through some of the
common pitfalls when you
actually want to do any of this
stuff and try to guide you
through that.
Let's start with our training.
So, how do we train?
We use of course, Create ML.
The first step of course we have
to do, is we have to take
pictures.
Then we put them in folders and
we use the folder names as our
classification labels.
Now, the biggest question that
everybody has, "How much data do
I need?"
The first thing is, well, we
need a minimum of about 10
images per category, but that's
on the low side.
You definitely want to have
more.
The more, the better -- with more data, your classifier will actually perform better.
Another thing to look out for is
highly imbalanced data sets.
What do I mean by that?
When a data set has like
thousands of images in one
category, and only ten in the
other one, this model will not
train really well.
So, you want to have like an
equal distribution between most
of your categories.
Another thing that we actually
introduce, is augmentation.
Augmentation will help you to
make this model more robust, but
it doesn't really replace the
variety.
So, you still want to have lots
of images of your objects that
you want to classify.
But, with the augmentation, what
we're going to do is, we take an
image and we perturb it.
So, we add noise to it, we blur it, we rotate it, flip it, so it looks different to the classifier when we train it.
Let's look a little bit under
the hood of how our training
actually works.
You might have already heard it,
the term, transfer learning.
And this is what we're going to
use in Create ML when we train
our classifier.
So, we start with a pretrained
model, and that's where all the
heavy lifting actually happens.
These models train normally for
weeks, and with millions of
images, and that is the first
starting point that you need to
actually work with this.
We can then use this model as a feature extractor.
This gives us a feature vector
which is pretty much a numerical
description of what we have in
our image.
Now, you bring in your data, and we train what we call the last layer, which is the real classifier, on your labeled data, and out comes your custom model.
Now, I mentioned already this
large, first pretrained model.
And we have something new in Vision for this.
This is what we call the Vision
FeaturePrint for Scenes.
It's available through Create ML
and it allows you to train an
image classifier.
It has been trained on a very
large data set, and it is
capable of categorizing over a
thousand categories.
That's a pretty good distribution that you can actually use.
And we've already used it. Over the last few years, some of the pictures that you've seen in Photos have actually been using this model underneath.
We're also going to continuously
improve on that model, but
there's a small caveat that I
would like to kind of highlight
here.
When we come out with a new
version of that model, you will
not necessarily automatically
get the benefits unless you
retrain against the new model.
So, if you start developing with this this year, and you want to take advantage of whatever we come out with over the next years, hold onto your data sets so you can actually retrain later.
A few more things about our
feature extractor.
It's already on the device, and
that was kind of an important
decision for us, because it
makes the disc footprint for
your models significantly
smaller.
So, let's compare a little bit.
So, I chose some common, you
know, available models that we
use today.
The first thing would be ResNet. So, if I train my classifier on top of ResNet, how big is my model?
model?
Ninety-eight megabytes.
If I use SqueezeNet -- SqueezeNet is a much smaller model.
It's not capable of
differentiating as many
categories, and that's 5
megabytes.
So, there's some saving there, but it will not be as versatile.
Now, how about Vision?
It's less than a megabyte in
most cases.
The other reason, of course, why we believe that this is a good choice to use: it's already optimized.
And we know a few things about our hardware, our GPUs and CPUs, and we really optimized that model a lot so that it performs best on our devices.
So, how do we train with it?
We start with some labeled images and bring them into Create ML, and Create ML knows how to extract the Vision FeaturePrint. It trains our classifier, and that classifier is all that will go into our Core ML model.
That's why it is so small.
Now, when it comes time that I actually want to analyze an image, all I have to do is use my image and my model, and Vision, or Core ML, knows again how to use our Vision FeaturePrint underneath and will perform the classification.
So, that was everything we
needed to know about the
training.
But, I said there were some caveats that you want to look at when we deal with the app.
So, first thing, we only want to
actually run our classifier when
we really have to.
Classifiers are deep
convolutional networks that are
pretty computational intensive.
So, when we run those, it will
definitely use up kind of you
know, some electrons running on
the CPU and GPU.
So, you don't want to use this
unless you really have to.
In the example that I'm going to show in my demo later, I really only want to classify when the person is actually looking at an item, and not when the camera is just moving around.
So, I'm asking the question, "Am
I holding still?"
and then I'm going to run my
classifier.
How do I do this?
By using Vision, I can use
Registration.
Registration means I can take
two images and align them with
each other.
And it's going to tell me, "Okay, if you shift it by this amount of pixels, this is how they would actually match."
This is a pretty cheap and fast
algorithm, and it will tell me
if I hold the camera still or if
anything is moving in front of
the camera.
I used the VNTranslationalImageRegistrationRequest.
I know that is a mouthful.
But that will give me all this
information.
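To make that concrete, here is a minimal sketch of what such a registration check might look like in Swift; previousBuffer and the function name are my own placeholders, not the exact code from the demo.

```swift
import Vision
import CoreVideo
import CoreGraphics

let sequenceHandler = VNSequenceRequestHandler()
var previousBuffer: CVPixelBuffer?   // the last frame we saw (placeholder property)

// Returns how far the new frame has shifted relative to the previous one, in pixels.
func translation(of currentBuffer: CVPixelBuffer) -> CGPoint? {
    defer { previousBuffer = currentBuffer }
    guard let previous = previousBuffer else { return nil }

    let request = VNTranslationalImageRegistrationRequest(targetedCVPixelBuffer: currentBuffer)
    try? sequenceHandler.perform([request], on: previous)

    guard let alignment = request.results?.first as? VNImageTranslationAlignmentObservation else {
        return nil
    }
    // The affine transform's tx/ty carry the pixel offset between the two frames.
    return CGPoint(x: alignment.alignmentTransform.tx,
                   y: alignment.alignmentTransform.ty)
}
```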
So, to visualize this, let's first look at a little video. What I'm doing in this video is showing a little yellow line that indicates how my camera has moved, according to the registration requests, over the last couple of frames. So, watch that yellow line: if it's long, then I have moved the camera around quite a bit, and when I'm holding it still, it should actually be a very short line.
So, you see the camera is
moving, and now I'm focusing on
this, and it gets very short.
So, that's just a good idea.
It's like, "Okay, now I'm
holding still.
Now, I want to run my
classifier."
Next thing to keep in mind is,
have a backup plan.
It's always good to have a
backup plan.
Classifications can be wrong.
And what that means is, even if my classification actually has a high confidence, I need to plan for the times when it doesn't work quite correctly.
The one thing that I did in my example here, as you will see later on: I have something for which I don't have a physical object. So, how do I solve that? In my example, I'm using the barcode detector that we have in the Vision framework to read a barcode label to identify it.
Alright, enough of slides.
Who wants to see the demo?
[ Applause ]
Okay, what you see on the
right-hand side of the screen is
my device.
I'm going to start now my little
robot shop application.
And when you see I'm moving
around, there's nothing
happening.
When I hold still and I point it at something, I should see a yellow line, and then voila -- yes, that is a stepper motor.
Okay? Let's see what else do we
have here on this table?
That is my micro controller.
That's a stepper motor driver.
We can also like you know, pick
something up and hold it.
Yes, this is a closed loop belt.
What do we have here?
Lead screw.
And as I said, you can also look at the barcode here, if my cable is long enough. And that is my training course.
>> Hey, Frank?
>> Of course, for that I
didn't--
>> Frank?
>> What's going on?
>> Frank, yes, I'm going to need
you to add another robot part to
your demo.
>> This is Brett.
That's my manager.
>> That'd be great.
>> As usual, management, last
minute requests.
>> I'll make sure you get
another copy of that memo.
>> I'm not going to come in on
Saturday for this.
Alright, well what do we have
here?
We have a servo motor.
Alright, let's see.
It might just work.
Let me try this.
Do I get away with that?
No, it can't really read this
object.
Perhaps so?
It's -- no, it's not a stepper
motor.
So, that's a bug.
I guess we need to fix that.
Who wants to fix this?
Alright.
So, what I have to do now is I
have to take some pictures of
the aforementioned servo motor.
So, I need to go to the studio
and you know, set up the lights,
or I use my favorite camera that
I already have here.
Now, let's see.
We're going to take a bunch of
pictures of our servo motor.
And it's kind of important to
just kind of vary it and don't
have anything else really in the
frame.
So, I'm going for a bit of variety.
I need to have at least ten
different images.
Good choices.
I always just like to put it on
a different background.
And make sure that we get a few
captured here.
Perhaps I'm going to hold it a
little bit in my hand.
Okay, so we have now a number of
images.
Now, I'm going to go over to my
Mac and actually show you how to
actually do then the training
work for that.
Okay, I'm bringing up my image
capture application and let me
for a moment just hide my
[inaudible].
If I now look in my Finder, you can actually see I have my training set, which I already used earlier to train the model that I've been using in my application.
And I need to create now a new
folder, and let's call that
Servo.
And from Image Capture, I can
now simply take all the pictures
that I just captured and drag
them into my Servo.
Alright, so now we have that
added.
Now, I need to train my model
again since my manager just
ruined what I've done earlier.
Okay.
I used a simple scripting
playground here.
Not the UI, just because well,
this might be something I want
to later on incorporate as a
build step into my application.
So, I'm pointing it simply at
the data set that we just added
our folder to, and I'm just
simply going to train my
classifier, and in the end,
write out my model.
So, what's going to happen now
as you can see, we're off to the
races.
It's going to go through all
these images that we've already
put into our folders and
extracts the scene print from
that.
It does all the scaling down that has to happen and will then, in the end, train the model based on that.
So, it's a pretty complex task,
but you see for you, it's really
just one line of code, and in
the end, you should get out of
it, a model that you can
actually use in your
application.
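As a rough sketch, a playground like the one described here might boil down to something like this; the directory paths and model name are placeholders, and the classifier is trained on top of Create ML's built-in feature extractor.

```swift
import CreateML
import Foundation

// One subfolder per label, e.g. .../TrainingData/Servo, .../TrainingData/StepperMotor, ...
// (paths are placeholders for this sketch)
let trainingDir = URL(fileURLWithPath: "/Users/me/RobotShop/TrainingData")

// Train the image classifier on top of the built-in feature extractor.
let classifier = try MLImageClassifier(trainingData: .labeledDirectories(at: trainingDir))

// Write out the small, classifier-only Core ML model for the app.
try classifier.write(to: URL(fileURLWithPath: "/Users/me/RobotShop/RobotParts.mlmodel"))
```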
Let's see as it just finishes.
We're almost there.
And voila, we have our model.
Now, that model, I've already
referenced in my robot shop
application.
This is what we see here now.
As you can see, this is my image
classifier.
It's 148 kilobytes.
That's smaller than the little
startup screen that I actually
have.
[ Applause ]
So, one thing I want to
highlight here already, and
we're going to go into that a
little bit later.
So, the image that I need to pass into this has to be a color image of 299 by 299 pixels. A strange size, but this is actually what a lot of these classifiers expect.
Alright.
So, now I have hopefully a model
that will understand it.
Now, I need to go into my sophisticated product database, which is just a plist.
And I'm going to add my Servo to
that.
So, I'm going to rename this one
here.
This is Servo.
I'm giving it a label.
This is actually what we're
going to see.
This is a servo motor and let's
say this is a motor that goes
swish, swish.
Very technical.
Alright.
Let's see if this works.
I'm going to run my application
now.
That's -- okay.
Let's try it.
There's our servo motor.
[ Applause ]
Just to put this in perspective.
This was a world first that you
saw.
A classifier being trained live
on stage, from photos, all the
way into the final application.
I was a bit sweating about this
demo [laughter].
Thank you.
Now we've seen how it works.
There's a few things I want to
highlight actually when we
actually go through the code.
So, I promise, we're going to
live code here a little bit.
Okay, let me take everything
away that we don't need to look
at right now.
And make this a bit bigger.
Okay, so, how did I solve all
this?
So, we started, we actually
created a Sequence Request
Handler.
This is the one that I'm going
to use for my registration work,
just as Sergei already explained
in the earlier session, this is
good for tracking objects.
I will create my requests and put them into one array. And what do you see here for the registration? I'm just keeping the last 15 registration results, and then I'm going to do some analysis on that to see if I'm actually holding still.
I'm going to keep one buffer
that I'm holding onto while I'm
analyzing this.
That's actually the buffer I run my classification on.
And since this can be a longer
running task, I'm actually going
to run this on a separate queue.
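In rough form, the state described here might look like this; the property names are my own placeholders, not necessarily the ones in the demo project.

```swift
import UIKit
import Vision

let sequenceRequestHandler = VNSequenceRequestHandler()       // frame-to-frame registration
var analysisRequests = [VNRequest]()                          // barcode + classification requests
var transpositionHistoryPoints = [CGPoint]()                  // last few registration offsets
let maximumHistoryLength = 15                                 // keep the last 15 results
var previousPixelBuffer: CVPixelBuffer?                       // frame to register against
var currentlyAnalyzedPixelBuffer: CVPixelBuffer?              // the single buffer in flight
let visionQueue = DispatchQueue(label: "com.example.robotshop.vision")  // background queue
```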
Alright, here's some code that I
actually just used to open.
This is actually like the little
panel that we saw.
But the important part is actually, "How do I set up my Vision tasks?"
So, I'm going to do two tasks.
I'm going to do a barcode
request and I'm going to do my
classification request.
So, I set up my barcode request. And in my completion handler, I simply look at: "Do I get something back?" And since I'm only expecting one barcode, I only look at the very first one. Can I decode it? If I get a string out of it, I use that to look up the product -- that's how it worked with my barcode -- to see, "Oh yes, that is my training course."
Alright, so I add that as one of
the requests that I want to run.
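A sketch of that barcode request might look like this, assuming it lives in the same view controller as the properties sketched earlier; showProduct(forBarcodePayload:) is a hypothetical lookup helper.

```swift
let barcodeRequest = VNDetectBarcodesRequest { request, error in
    // We only expect a single barcode, so just look at the first observation.
    guard let barcode = request.results?.first as? VNBarcodeObservation,
          let payload = barcode.payloadStringValue else { return }
    // Use the decoded string to look up the product (e.g. the training course).
    self.showProduct(forBarcodePayload: payload)   // hypothetical helper
}
analysisRequests.append(barcodeRequest)
```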
Now, I'm setting up my
classification.
So, in this case, what I've done: I take my classifier, simply load it from my bundle, and create my model from there. Now, I'm not using the generated model class from Core ML in this case, because this is the only line of Core ML that I'm actually using in my whole application, and it allows me to do my own custom error handling. But you can also choose to use the generated class from Core ML. Both are absolutely valid.
Now, I create my Vision model from that -- my VNCoreMLModel -- and my request. And again, when the request returns, meaning my completion handler is executing, I simply look: "What kind of classifications did I get back?"
And then I set this threshold
here.
Now, this is a threshold that I set empirically on the confidence score. I'm using 0.98 -- so, a 98% confidence that this is actually correct.
Why am I doing that?
That allows me to filter out the cases where I'm looking at something and the classifier is maybe not quite sure what it is. We'll maybe see in a moment what that actually means.
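Put together, the classification setup and the 0.98 threshold might look roughly like this; the model name "RobotParts" and showProduct(forLabel:) are placeholders of my own, and the snippet assumes the properties sketched earlier.

```swift
// Load the Create ML model from the bundle and wrap it for Vision.
guard let modelURL = Bundle.main.url(forResource: "RobotParts", withExtension: "mlmodelc"),
      let coreMLModel = try? MLModel(contentsOf: modelURL),
      let visionModel = try? VNCoreMLModel(for: coreMLModel) else {
    fatalError("Unable to load the classifier model")   // custom error handling goes here
}

let classificationRequest = VNCoreMLRequest(model: visionModel) { request, _ in
    guard let best = request.results?.first as? VNClassificationObservation else { return }
    // Empirical threshold: only accept a classification we are very confident about.
    if best.confidence > 0.98 {
        self.showProduct(forLabel: best.identifier)      // hypothetical helper
    }
}
analysisRequests.append(classificationRequest)
```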
So now, I have all my requests.
When it comes time that I actually want to execute them, I created a little function here that basically means "analyze the current image."
So, when it's time to analyze it, I get my device orientation, which is important to know how I am holding my phone.
I create an image request
handler on that buffer that we
currently want to process.
And asynchronously, I let it
perform its work.
That's all I have to do
basically for actually doing the
processing with Core ML and
barcode reading.
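As a sketch, that "analyze the current image" step could look like this, reusing the placeholder properties and helpers from the earlier sketches.

```swift
func analyzeCurrentImage() {
    guard let pixelBuffer = currentlyAnalyzedPixelBuffer else { return }
    // Camera buffers carry no EXIF data, so pass the orientation explicitly.
    let orientation = exifOrientationFromDeviceOrientation()   // hypothetical helper, sketched later
    let requestHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                               orientation: orientation,
                                               options: [:])
    visionQueue.async {
        // Release the buffer when done so the next stable frame can be analyzed.
        defer { self.currentlyAnalyzedPixelBuffer = nil }
        try? requestHandler.perform(self.analysisRequests)
    }
}
```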
Now, a few more things. How do I do the scene stability part? So, I have a queue that I can reset, and I can add my points into that. And then I created a function that allows me to look through the queue of points that I've recorded and check: "Well, if they all sum together to only a distance of about 20 pixels" -- again, that's an empirical value that I chose -- then I know the scene is stable.
So, I'm holding stable.
Nothing is moving in front of my
camera.
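A minimal sketch of that stability check, using the placeholder history array from before and the empirical 20-pixel value:

```swift
func sceneStabilityAchieved() -> Bool {
    // Only judge stability once we have a full history of registration results.
    guard transpositionHistoryPoints.count == maximumHistoryLength else { return false }
    // Sum up the recorded frame-to-frame offsets.
    var total = CGPoint.zero
    for point in transpositionHistoryPoints {
        total.x += point.x
        total.y += point.y
    }
    // Empirical value: if everything adds up to less than ~20 pixels, call it stable.
    return abs(total.x) + abs(total.y) < 20
}
```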
And then comes our capture output part. So, this is the callback where AVFoundation actually delivers the buffers from the camera. I'm making sure that I hold onto the previous pixel buffer, because that's what I'm going to compare against for my registration, and I swap those out after I'm done with that.
So, I create my Translational
Image Registration Request with
my current buffer.
And then on the Sequence Request
Handler, I simply perform my
request.
Now, I get my observations back.
I can check if they are all
okay.
And add them into my array.
And last but not least, I check
if the scene's stable.
Then I bring up my little yellow box, which is my detection overlay. I note that this is my currently analyzed buffer, and simply ask it to perform its analysis on that.
The one thing that you saw: at the end of the asynchronous call, I released the currently analyzed buffer. And you see here that I check if there is a buffer currently being used. That allows me to make sure that I'm only really working on one buffer, and I'm not constantly queueing up more and more buffers while they're still running in the background, because that would starve the camera of frames.
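Pulling those pieces together, the capture callback might look roughly like this; recordTransposition(_:) and resetTranspositionHistory() are hypothetical helpers that maintain the history array, and the snippet assumes the earlier placeholder properties.

```swift
func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

    guard let previousBuffer = previousPixelBuffer else {
        previousPixelBuffer = pixelBuffer
        resetTranspositionHistory()                     // hypothetical helper
        return
    }

    // Register the new frame against the previous one to measure camera movement.
    let registrationRequest = VNTranslationalImageRegistrationRequest(targetedCVPixelBuffer: pixelBuffer)
    try? sequenceRequestHandler.perform([registrationRequest], on: previousBuffer)
    previousPixelBuffer = pixelBuffer

    if let alignment = registrationRequest.results?.first as? VNImageTranslationAlignmentObservation {
        recordTransposition(CGPoint(x: alignment.alignmentTransform.tx,
                                    y: alignment.alignmentTransform.ty))   // hypothetical helper
    }

    // Only kick off a new analysis when the scene is stable and no buffer is in flight.
    if sceneStabilityAchieved() && currentlyAnalyzedPixelBuffer == nil {
        currentlyAnalyzedPixelBuffer = pixelBuffer
        analyzeCurrentImage()
    }
}
```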
Alright, so when we run this,
there's a few things I want to
highlight actually.
So, let me bring up Number 1,
our console here a little bit on
the bottom.
And you can actually see, when
I'm running this at first, and
so, I'm going to run this right
now here.
Hopefully you should actually
see something.
You see that the confidence scores are pretty low because I'm just moving around over the scene. It's not really sure what I'm looking at.
The moment actually I point it
at something that it should
identify, boom, our confidence
score goes really high and
that's actually how I'm sure
that this is really the object I
want to show.
Now, another thing I wanted to
demo, let's look actually what
happens in terms of the CPU.
Alright, so right now, I'm not
doing anything.
I'm just showing my screen.
So, when I'm just moving the
camera around, scene is not
stable, I'm using about 22% of
the CPU.
Now, if I hold it stable and
actually the classifier runs,
you see how the CPU goes up.
And that's why I always
recommend only run these tasks
when you really need.
Alright, that was a lot to take
in.
Let's go back to the slides and
recap a little bit what we have
just seen to now.
So, go right to the slides.
[ Applause ]
Okay, recap.
First thing, how did we achieve
our scene stability?
We used a sequence request
handler together with the VN
Translational Image Registration
Request, to compare against the
previous frame.
Out of that, we get our translation in terms of an alignment transform, which tells me the X and Y of how the previous frame has shifted relative to the current one.
Then we talked about only analyzing the scene when it's stable. And for that, we created our VNImageRequestHandler off the current buffer, and we passed in both the barcode detection and the classification together. That allows Vision to do its optimization under the covers, and it can actually perform much faster than if you would run them as separate requests.
Next was the part about thinking
about how many buffers do I hold
in flight?
And that's why I say manage your
buffers.
Some Vision requests, like these
convolutional networks, can take
a bit longer.
And these longer-running tasks are better performed on a background queue, so that your UI, or whatever you do with the camera, can actually keep running.
But to do this particularly with
the camera, you do not want to
continuously queue up the
buffers coming from the camera.
So, you want to drop busy
buffers.
In this case, I said I only work with one. That actually works pretty well in my use case scenario.
So, I have a queue of 1, and
that's why I simply held onto
one buffer and check, as long as
that one is running, I'm not
queuing new buffers up.
Once I'm done with it, I can
reset and work on the next
buffer.
Now, you might ask, "Why am I
using Vision when I can run this
model in Core ML?
It's a Core ML model."
Well, there is one thing why
this is important to actually
use Vision.
Let's go back and look what we
saw when we looked at our model.
It was the strange number of 299
by 299 pixels.
Now, this is simply how this
model is trained.
This is what it wants to ingest.
But our camera gives us
something like 640 by 480 or
larger resolutions if you want.
Now, Vision is going to do all the work: it takes these pixel buffers as they come from the camera, converts them into RGB, and scales them down, and you don't have to write any code for that.
That makes it much easier to
drive these Core ML models for
image requests through Vision.
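If you do want a say in how Vision fits the camera frame to the model's 299 by 299 input, there is one knob on the request itself; this line is just an illustration of that option, not something taken from the demo code.

```swift
// Let Vision center-crop (rather than stretch) the camera frame before scaling it
// down to the model's expected input size. Other options are .scaleFill and .scaleFit.
classificationRequest.imageCropAndScaleOption = .centerCrop
```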
So, that was image
classification.
Next, we talk about object
recognition.
Now, a little warning.
In this demo, actually a live
croissant might actually get
hurt on stage.
So, for the squeamish ones,
please look away.
So, what we're using for our
object recognition is a model
that is based on this YOLO
technique, You Only Look Once.
That is a pretty fast-running model that allows us to get the bounding boxes of objects and a label for each. And it finds multiple of them in the frame.
As you see in the screenshots.
The advantage of those is that I
get actually like where they
are, but I won't get as many
classifications as I can do with
like our overall image
classifier.
The training is also a little
bit more involved, and with
that, I would like to actually
refer you to the Turi Create
session that was, I believe,
yesterday, where they actually
showed you how to train these
kind of models.
These models are also a little
bit bigger.
So, how does this look?
Let's go over to our demo.
Robot shop is closed.
Time for breakfast.
Alright.
Let me bring up my quick
template here and I have my new
little application.
It is my Breakfast Finder.
And what do we see?
Oh, we have a croissant, we have
a bagel, and we can identify the
banana.
And see, they can all be kind of
like in the frame.
I'm going to detect them.
So, you know how in these cooking shows, normally they show you how to do it, but then pull the prebaked stuff out of the oven. Well, this model has actually been baked way before this croissant was baked, and I can prove this.
It's fresh.
And still a croissant.
[ Applause ]
Alright.
It's fresh but still, got to
chew.
Let's look quickly how this
looks in the code.
So, what did I do differently? Not much, actually, in terms of setting up my request. All I have to do is use my Core ML model, just as I did in the previous example, create my Core ML request, and afterwards I'm simply looking at: "How do I draw my results?"
Now, this is where we have something new to make this a little bit easier. When we look at all of this, we get a new object, the VNRecognizedObjectObservation, and out of that, I get my bounding box and my observations of the labels.
Now, that's one thing I would
like to show you here.
Let's run our application from here, and I've put a breakpoint. Okay. Alright, we are now at our breakpoint. You can see that I only look at the first label.
So, what we are doing when we
actually process these results,
I'm going to take this, let's
try object observation, labels.
So, what you actually see is
that I get more than one back.
I get my bagel, my banana,
coffee -- I didn't bring any
coffee today.
Sorry about that.
And croissant, egg, and waffle.
Now, they are sorted in the
order of like the highest
confidence on the top.
Usually, this is the one that
you're interested in.
That's why I'm taking the
shortcut here and just looking
at the first one.
But you always get all of the
classifications back in terms of
an array of the ones that we
actually support in the model.
Alright.
That was our Breakfast Finder.
Let's go back to the slides, and
this time, I'm pushing the
button.
Good.
So, we made this possible through a new API, and that is our VNRecognizedObjectObservation. It comes automatically when we perform a Core ML model request, if that model is actually using an object detector as its base -- like in this example, which is based on the YOLO-style models.
Now, you might say, "Well, I
could have already run YOLO like
last year.
There were a bunch of articles
that I saw on the web."
But look at how much code they
actually had to write to take
the output of this model, to
then put it into something that
you can use.
And here, we only had a few
lines of code.
So, it makes YOLO models really,
really easy to use now.
Let's go once more over this in
the code.
So, I create my model.
From the model, I create my
request.
And in my completion handler, I can simply look at the array of objects, because we saw we can get multiple objects back.
I get my labels from that, my
bounding box, and I can show my
Breakfast Finder.
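In sketch form, handling those results might look like this; visionModel is assumed to wrap a YOLO-style object-detection model, and drawBox is a hypothetical drawing helper.

```swift
let objectRecognitionRequest = VNCoreMLRequest(model: visionModel) { request, _ in
    guard let observations = request.results as? [VNRecognizedObjectObservation] else { return }
    for observation in observations {
        // Labels are sorted by confidence, so the first one is usually the interesting one.
        guard let topLabel = observation.labels.first else { continue }
        // The bounding box is in normalized coordinates (origin in the lower left).
        self.drawBox(observation.boundingBox,                 // hypothetical helper
                     label: topLabel.identifier,
                     confidence: topLabel.confidence)
    }
}
```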
Now, there's one more thing I
would like to highlight in this
example.
You saw how these boxes were jittering around a little bit because I was running the detector frame by frame, by frame, by frame. Tracking can often be a better choice here. Why? Tracking is much faster to run than these models are.
So, redetection takes more time
than actually running a tracking
request.
I can use the tracker basically
if I want to follow now an
object on the screen because
it's a lighter algorithm.
It runs faster.
And on top of it, we have
temporal smoothing so that these
boxes will not jitter around
anymore and if you see some of
our tracking examples, they
actually move nicely and
smoothly across the screen.
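As a sketch of that idea, once the detector has produced an observation, you can hand it to the lighter tracker and only update the box from the tracking results; updateBox is a hypothetical drawing helper.

```swift
let trackingHandler = VNSequenceRequestHandler()

// Follow an already-detected object in a new camera frame instead of re-running the detector.
func track(_ detectedObject: VNDetectedObjectObservation, in pixelBuffer: CVPixelBuffer) {
    let trackingRequest = VNTrackObjectRequest(detectedObjectObservation: detectedObject)
    try? trackingHandler.perform([trackingRequest], on: pixelBuffer)
    if let tracked = trackingRequest.results?.first as? VNDetectedObjectObservation {
        // The tracker is fast and temporally smoothed, so the box moves without jitter.
        updateBox(tracked.boundingBox)   // hypothetical helper
    }
}
```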
If you want to learn more about
tracking, the previous session
from my colleague Sergei,
actually talks about how to do
all the implementation work of
that.
Alright, last but not least,
let's enhance our Vision mastery
and go into some of our
fundamentals.
Few things are important to know
when dealing with the Vision
framework.
First and foremost, and this is
a common source of problems,
image orientation.
Now, not all of our Vision
algorithms are orientation
agnostic.
You might have heard earlier that we have a new face detector that is orientation agnostic. But the previous one was not. This means we need to know: what is the upright position of the image? And it can be deceiving, because if I look at the image in a preview, it looks upright, but that is not how it is stored on disk.
There is something that tells us how the device was oriented, and this is called the EXIF orientation. So, when an image is captured, it's normally in the sensor orientation; with the EXIF orientation, we know what is actually upright. And if you pass a URL into Vision as the input, Vision is actually going to do all that work for you and read the EXIF information from the file.
But, as we showed in the demos earlier, if I use my live capture feed, I need to actually pass this information in. So, I have to look at my orientation from the UIDevice's current orientation and convert it to a CGImagePropertyOrientation, because we need it in the form of an EXIF orientation.
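One possible version of that conversion is sketched below. The exact mapping depends on which camera you use and how your capture session is configured, so treat these cases as an assumption to verify rather than the definitive table.

```swift
import UIKit
import ImageIO

// Map the device orientation to an EXIF orientation for Vision.
// Assumed mapping for the back camera in its native landscape sensor orientation.
func exifOrientationFromDeviceOrientation() -> CGImagePropertyOrientation {
    switch UIDevice.current.orientation {
    case .portraitUpsideDown: return .left
    case .landscapeLeft:      return .up
    case .landscapeRight:     return .down
    case .portrait:           return .right
    default:                  return .right
    }
}
```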
Next, let's talk a little bit
about our coordinate system.
For Vision, the origin is always in the lower left corner. And all processing is done as if the image were in an upright position -- hence the orientation is important. All our processing is really done in a normalized coordinate space, except for the registration, where we actually need to know how many pixels the image has shifted.
So, normalized coordinates means our coordinates go from (0, 0) to (1, 1) in the upper right corner.
Now, what you see here is the result of a performed face and landmark detection request. And you will see that I get the bounding box for my face, and the landmarks are actually reported in coordinates relative to that bounding box.
If you need to go back into the image coordinate space, we have utility functions in VNUtils, like VNImageRectForNormalizedRect, to convert back and forth between those coordinates.
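For example, going from a face observation's normalized bounding box back to pixel coordinates might look like this; faceObservation, imageWidth, and imageHeight stand in for your own values.

```swift
import Vision

// Vision's normalized space: (0, 0) is the lower left, (1, 1) the upper right.
let normalizedBox = faceObservation.boundingBox
// Convert back into the pixel coordinate space of the original image.
// imageWidth and imageHeight are the image's pixel dimensions, as Int.
let pixelBox = VNImageRectForNormalizedRect(normalizedBox, imageWidth, imageHeight)
```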
Next, let's talk about
confidence scores.
We touched a little bit on this already during our robot shop example.
A lot of our algorithms can
express how confident they are
in their results.
And that is kind of an important
part to know when I want to
analyze later on what I get out
of these results.
So, do I have a low confidence of zero, or do I have a high confidence of 1?
Clicker. Here we go.
Alright.
Unfortunately, not all
algorithms will have the same
scale in terms of like how they
report their confidence scores.
For instance, if we look at our
text detector, it pretty much
always returns a confidence
score of 1 because if it doesn't
think there's text, it's not
going to return the bounding box
in the first place.
But as we saw, the classifiers
have a very large range actually
of what this confidence score
could be.
Let's look at a few examples.
In my first example, I used an
image from our robot shop
example and I ran my own model
on it.
And sure enough, it had a very
high confidence, this is a
stepper motor.
Now, for these next examples, I'm going to use some of the models that we have in our model gallery. So, don't get me wrong: I don't want to compare the quality of the models. It's about what confidence they return and what you actually want to do with it.
So, what does this tell us when we want to classify this image? Well, it's not that bad, but it's not really sure about it either. The confidence score of 0.395 is not particularly high, but yes, there's sand, there's some beach involved. So, that's usable as a result when I want to search for it, but would I label the image with that? It's probably questionable.
Let's look at this next example.
Girl on a scooter.
What did the classifier do with
this?
Well, I'm not sure she's so
happy to be called a sweet
potato.
Let's look at one more example.
So, here's a screen shot of my
code.
What does the classifier do with
that?
It thinks it's a website.
Computers are so dumb.
So, some conclusions about our
confidence scores.
Does 1.0 always mean it's 100% correct? Not necessarily. It will fulfill the criteria of the algorithm, but our perception, as we saw particularly with the sweet potato, can be quite different.
So, when you create an
application that wants to take
advantage of that, please keep
that in mind.
Think of it: if you were to write a medical application saying, "Oh, you have cancer," that is a very strong statement that you probably want to soften a little bit, depending on how sure you can really be about the results.
So, there are two techniques
that you can use for this.
As we saw in the robot shop example, I used a threshold on the confidence score, because I really label the image, and the threshold filters out everything that had a low confidence score.
If, on the other hand, I want to create a search application, I might actually use some of those images and still show them, probably at the bottom of the results, because they are perhaps still a valid choice in the search results.
As usual, we find some more
information on our website.
And we have our lab tomorrow at
3 p.m. Please stop by.
Ask your questions.
We're there to help you.
And with that, I would first of
all like to thank you all for
these great applications that you create with our technology. I'm looking forward to seeing what you can do with this.
And thank you all for coming to
WWDC.
Have a great rest of the show.
Thank you.
[ Applause ]