WWDC2017 Session 506

Transcript

>> Hello everyone.
[ Applause ]
I hope you're having a great time
at WWDC so far.
Allow me to introduce myself.
My name is Brett Keating and I'm
here with my colleague Frank
Doepke and we're here to tell
you about Apple's new Vision
framework, so let's get started.
We're going to begin by showing
you what Vision can do for your
apps.
We're going to go through a few
visual examples of the
algorithms that are going to be
made available in the Vision
framework this year.
At which point I'll hand it off
to Frank to talk about the
concepts behind the Vision
framework, why we designed
things the way we did, what the
mental model behind our API is.
And then we'll go a little
deeper and go through a code
example.
This code example brings
together a few different
technologies in our SDK,
including Core Image, as well as
the brand-new Core ML framework
that we're offering this year
which enables you to put in your
own custom models and have them
be accelerated using our
hardware.
So, let's begin with what you
can do with Vision.
Let's start off with face
detection.
Now face detection is something
we already have in our SDK, but
we're offering in the Vision
framework new this year a face
detection that's based on deep
learning.
And you may already know that
deep learning has made
groundbreaking changes in the
accuracy of what we can do with
Vision technologies and face
detection is no exception.
We're going to have higher
precision which means fewer
false positives, but we are also
going to have dramatically
higher recall, which means we'll
miss fewer faces.
So, let's look at some of the
examples of faces that we will
now be able to detect with the
Vision framework.
For one thing, we'll be able to
detect smaller faces.
We'll also be doing a better job
of detecting strong profiles.
We'll also do a better job
detecting more partially
occluded faces and that includes
things like hats and glasses.
Sticking with the faces theme
for a little longer, we now are
offering in the Vision framework
new this year face landmarks.
What are face landmarks?
This is a constellation of
points that we detect on the
face, things like the corners
of the eyes, the outline of the
mouth, the contour of the chin.
Here's an example, here's
another example, and one more
example.
We're really excited about this;
I think there are going to be some
great apps created with this
technology.
Next, also new this year in the
Vision framework is image
registration.
If you don't know what image
registration is it's basically
aligning two images based on the
features that are present in
those images.
You can use this for stitching
images together for a panorama,
kind of like this example, or for
image stacking applications.
We have two different kinds, one
that's translation only and one
that gives you a full homography
for greater accuracy.
We're also offering a few
technologies that are already in
our SDK through the CIDetector
interface.
We're making them available in
the Vision API as well.
That includes rectangle
detection as you can see, we
detect the sign in the picture.
We're also doing barcode
detection and recognition in the
Vision API and text detection as
well.
Another new technology,
brand-new in the Vision
framework this year is object
tracking.
You can use this to track a face
if you've detected a face.
You can use that face rectangle
as an initial condition to the
tracking and then the Vision
framework will track that square
throughout the rest of your
video.
We'll also track rectangles, and
you can also define the initial
condition yourself.
So that's what I mean by general
templates, if you decide to for
example, put a square around
this wakeboarder as I have, you
can then go ahead and track
that.
You can see that we handle
pretty large changes in scale,
pretty large deformations fairly
robustly with this technology.
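As a minimal sketch, not from the session itself, defining your own initial condition in Swift could look something like this; the normalized bounding box values are purely illustrative:

```swift
import Vision
import CoreGraphics

// Wrap a normalized bounding box (illustrative values) in a detected-object
// observation and hand it to a tracking request as the initial condition.
let initialBox = CGRect(x: 0.4, y: 0.3, width: 0.2, height: 0.3)
let seedObservation = VNDetectedObjectObservation(boundingBox: initialBox)
let trackingRequest = VNTrackObjectRequest(detectedObjectObservation: seedObservation)
```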
Another really exciting
technology that's new in Apple's
SDK this year is Core ML, and you
can integrate your Core ML
models directly into Vision.
As I've mentioned, machine
learning does great things for
Computer Vision and you can use
Core ML if you want to create
your own models, do your own
solution.
Perhaps for example, you want to
create a wedding application
where you're able to detect this
part of the wedding is the
reception, this part of the
wedding is where the bride is
walking down the aisle.
If you want to train your own
model and you have the data to
train your own model you can do
that.
Core ML as I mentioned, provides
native acceleration for custom
models so they'll run really
fast and Vision provides the
imaging pipeline to support
these models, so you won't have
to do any rescaling or anything
like that; we'll take care of all
of that for you.
We know what your model is
expecting and we'll put the
image in the right format.
If you're interested in Core ML
there are some sessions that you
can go to; we've listed the labs
down here for you.
One of them will be tomorrow
morning and then another one on
Friday afternoon.
So those are basically the features
that are in the Vision
framework.
Overall, what Apple's new Vision
framework provides are
high-level on-device solutions
to Computer Vision problems
through one simple API.
Now let me break this statement
down just a little bit.
What do I mean by high-level
solutions?
Well we don't want you to have
to be a Computer Vision expert
to put the magic of Computer
Vision into your applications.
You don't necessarily want to
have to know which feature
detector to use in combination
with which classifier or set of
classifiers, or whether or not
you want to use machine learning,
for example; we're going to
handle that for you.
If you're a developer you're
probably thinking I just want to
know where the faces are.
And so, we're going to handle
all that complexity for you.
Depending on your use case we'll
use either a traditional
approach, if that's what's needed
for, say, real-time applications,
or deep learning algorithms for
higher accuracy.
Now I also mentioned that we're
doing all these algorithms on
the device, let's talk a little
bit about why we'd want to do
things on device versus providing
a cloud-based solution.
First of all, it's privacy.
As you know, Apple cares a lot
about privacy, I care a lot
about privacy working at Apple,
sometimes it makes my job a
little harder, but nonetheless
keeping all your data on the
device is the best way to
protect your users' data
privacy.
Furthermore, with certain
cloud-based solutions there's a
cost associated with them.
If you're a developer maybe
you're paying usage fees to use
a cloud-based solution.
Your users will have to transfer
the data to that cloud.
All these costs can add up
for both the developers and the
users.
So, when everything's on the
device it's free.
And you can support real-time
use cases like the tracking
example I showed you.
Imagine trying to track
something through a video by
sending every frame to the
cloud, I don't think that's
going to work too well.
So, no latency, fast execution
that's what we're offering with
the Vision framework.
So, I hope you enjoyed that
introduction, now we're going to
go a little deeper and talk
about the Vision concepts.
For this part of the
presentation I'm going to hand
it off to Frank.
[ Applause ]
>> Thank you Brett.
Hi, good afternoon, my name is
Frank Doepke and I'm going to
talk more about the technical
details of what is part of our
Vision framework.
So, what do we want to do?
When we want to analyze an image we
have three major tasks that we
actually want to perform.
First there is the ask: finding out
what is in the image and what do
I want to know about it.
There's the machinery,
somebody's got to do the work
and we get some results out of
it, at least we hope that's
what's going to happen.
So, in the terminology of Vision,
the asks are
requests.
I just listed a few examples
here, like the barcode detection
or face detection, and we feed
them into our request handler.
That, in this case, happens
to be an image request handler;
it will hold on to the image
and it's going to do all the
work for us.
And as a result, we get back
what we call observations, what
did we observe in this image.
And these observations depend on
what you asked us to do.
So, we have classification
observation or detected objects.
Now when you want to track
something in the sequence like
the wakeboarder it's basically
the same concept.
We have some asks, we have the
machinery, and we get some
results out of it in the end.
Again, the asks are requests.
Now since this changes with
every frame the image actually
travels with the request.
Our machinery is again a
request handler; in this case it's
the sequence request handler, and
we get results, which are
observations that go with our
requests.
So, let me talk a little bit
more about these two image
request handlers that I
mentioned so far.
So, we have the image request
handler, which is mostly for when you
want to do something interactive
with the image.
You want to do multiple Vision
tasks on an image, sometimes you
actually do one and then based
on the results you then kick off
the next one and that's what you
want to use the image request
handler for.
It'll hold on to the image that
it's set up with for its
lifecycle and that allows us
under the cover to do
performance optimizations by
holding on to intermediates to
make these requests perform
faster.
On the flipside, if I want to
track something we use the
sequence request handler.
The sequence request handler
allows us to keep the tracking
state across frames in the sequence
request handler.
And it will not hold on to all
the images that get fed into it
over its lifecycle, so they get
released earlier.
But that means on the flipside,
you cannot do the same
optimizations if you want to do
multiple requests on the same
image.
So how does this look in
code? We are developers; that's
what we want to see.
So, we start as a blank slate
that's always good and then we
create a request.
In this case, it's a face
detection request.
Now we create the request
handler, what I'm choosing here
is a request handler based on
the files so I have a file on
disk that I want to use.
Now I ask myRequestHandler to
perform my requests; in this case
it's just one request in my array,
but it could be many, and I get my
observations back.
And these can be many faces that
I detected.
Now the one thing I would like
to highlight here is the results
come back as part of the request
that we actually set up
initially.
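A hedged Swift sketch of those steps, with `imageURL` standing in for the file on disk, might look like this:

```swift
import Vision

// Create the face detection request and a handler set up with a file on disk.
let faceRequest = VNDetectFaceRectanglesRequest()
let myRequestHandler = VNImageRequestHandler(url: imageURL, options: [:])

do {
    // perform(_:) takes an array, so several requests could run in one pass.
    try myRequestHandler.perform([faceRequest])
} catch {
    print("Vision request failed: \(error)")
}

// The observations come back on the request itself.
if let faces = faceRequest.results as? [VNFaceObservation] {
    print("Detected \(faces.count) face(s)")
}
```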
How does it look when we want to
track something?
We create a sequence request
handler, which of course is
not set up with an image, because
we have to feed in all the
frames of the sequence.
So, I started with an
observation that I got from the
previous detection or I mark
something up and I create my
tracking request.
And I simply have to run the
request.
I feed in, in this case as a
pixel buffer, the frame that is
currently being processed.
And out of it again I get some
results.
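A rough sketch of that tracking flow, with `previousObservation` and `pixelBuffer` standing in for a prior detection result and the current video frame:

```swift
import Vision

// The sequence request handler keeps the tracking state across frames.
let sequenceRequestHandler = VNSequenceRequestHandler()

// Seed the tracker with an observation from a previous detection
// (or one you defined yourself).
let trackingRequest = VNTrackObjectRequest(detectedObjectObservation: previousObservation)

// For each new frame, run the same request on the sequence handler.
try sequenceRequestHandler.perform([trackingRequest], on: pixelBuffer)

if let tracked = trackingRequest.results?.first as? VNDetectedObjectObservation {
    print("Tracked bounding box: \(tracked.boundingBox)")
}
```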
So, now that we have talked
about how this API is kind of
structured I would like to guide
you through some best practices
so that you get, you know, the
best experience out of Vision.
So, when we want to put together
a Computer Vision task you have
to think about a few things.
Number one, what is the right
image type that I want to use.
Number two, what am I going to
do with the image.
And number three, what
performance do I need or want.
Of course, you always want
fastest, but there are some
tradeoffs that you have to think
about.
So, let's talk about the image
type.
Vision supports a number of
image types and they range from
CVPixelBuffer and CGImage to even,
as we saw in the previous
example, just data or an NSURL.
And we go over all these types
in the following slides so that
you know what to choose when.
Which to choose depends a lot
on what you want to do.
If you run from a camera stream
or if you run from files on disk,
you have to consider which type
of image you want to use.
Now, two important things to
remember. First, we already have an
imaging pipeline in the Vision
framework, so you don't need to
scale the images.
So, unless you already have a
very small representation that
you absolutely want to use,
please don't pre-scale because
we'll just do the work twice.
And second, mind the orientation.
Computer Vision algorithms are
sensitive to orientation, so
you have to pass that in.
And that is an important part
because if you pass in a
portrait image that's actually
lying on its side we will not
find the faces and that's one of
the common mistakes that usually
happens.
So, I promised to go over the
types.
When you want to do something
streaming, you want to use the
CVPixelBuffer.
When you create a VideoDataOutput
in your capture session you will get
CMSampleBuffers, and out of
those we get your
CVPixelBuffers.
It's also a pretty good format
if you already have something
where you keep your image data
raw in memory, like RGB
pixels; wrap them into a
CVPixelBuffer and this is a great
format to pass into Vision.
When you get files from disk
please use the URL or if it
comes from the web use the
NSData path.
The great thing about that is it
really allows us to reduce the
memory footprint of your
application.
Vision will only read what it
needs to perform the task.
If you think about doing
face detection on a
64-megapixel panorama, Vision
will actually reduce your memory
use by not reading the full
file into memory,
and that is an important thing
to keep in mind.
We will read in this case the
EXIF Orientation out of the
file, but you can override it if
you have to for those formats
that don't support it.
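As a small illustration of that file-based path, with `fileURL` as a placeholder and the orientation value purely illustrative:

```swift
import Vision
import ImageIO

// Vision reads only what it needs from disk, and normally picks up the EXIF
// orientation itself; the orientation here stands in for an override you
// might supply for a format that doesn't carry one.
let handler = VNImageRequestHandler(url: fileURL,
                                    orientation: .up,
                                    options: [:])
try handler.perform([VNDetectFaceRectanglesRequest()])
```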
If you're already using Core
Image in your application by all
means process the CI image.
This is also important when you
want to actually do some
preprocessing.
If you have some domain
knowledge of what you want to do
in your Computer Vision task you
can do some preprocessing and
try and enhance the image and,
therefore, enhance the Vision
results.
If you want to learn a bit more
about Core Image, there's a
session on Thursday at 1:50 and
they will also show the
integration with our Vision
framework.
Last but not least, if you already
have the images in your UI you
can use the CGImage that you get
out of the NSImage or the UIImage,
let's say it comes from
the UIImagePicker, and pass
those into Vision.
Now, what am I going to do with
the image? That's where we
have to decide. If I want to do
something interactive with the
image, in that case I use my
ImageRequestHandler.
It will hold on to the image for
its lifetime, and I can do multiple
passes on that image and get the
best results out of that.
Now, the CVPixelBuffer
would technically allow you
to change the pixels
underneath, but we treat them as
immutable, so don't do that,
because you'll get some strange
results.
Next, if you want to track
something we use the
SequenceRequestHandler.
It allows us to keep the
tracking state, and the lifecycle of
my image is not tied to the
request handler anymore, but
only to how long it is needed for
the tracking.
Performance: these Vision
tasks are very often computationally
intensive and they do
take time, so think
about running your task
on a different
queue, not your main queue.
And think about whether you want to
do it in the background, which
is a bit slower, or, if you need
it very quickly, use a more
interactive quality of service
to get the performance.
A good practice is to use the
completion handler to get the
results back; this is part of
our API.
But keep in mind that this
completion handler gets called
on the queue on which you
kicked off the request.
So, if you need to update your
UI you have to dispatch that
back to the main queue.
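Putting that advice together, a hedged sketch might look like this; the queue label and `updateUI(with:)` are illustrative, not part of Vision:

```swift
import Foundation
import Vision

// Run the heavy Vision work off the main queue.
let visionQueue = DispatchQueue(label: "com.example.vision", qos: .userInitiated)

let faceRequest = VNDetectFaceRectanglesRequest { request, error in
    // This completion handler runs on the queue that performed the request.
    let faces = request.results as? [VNFaceObservation] ?? []
    DispatchQueue.main.async {
        // Hop back to the main queue before touching the UI.
        updateUI(with: faces)
    }
}

visionQueue.async {
    // cgImage is a placeholder for an image you already have.
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([faceRequest])
}
```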
So as Brett already highlighted,
we have a new face detection and
you might say oh God, yet
another one.
But we have good reasons for
this.
Vision uses deep learning and
this gives us really a lot
better precision and recall,
therefore, much better results.
The downside of it, on older
hardware it will run a bit
slower.
So, let's look a little bit at
our overall landscape of face
detectors that we have
available.
So, we have Vision which really
gives us the best results and
it's pretty fast and also pretty
good in its power use as it is
optimized for that.
And we have it available on all
platforms except the watchOS.
And this is the same in terms of
availability for Core Image and
it's a bit faster, but the
results are not quite as good.
In the AVCapture session, which
only applies during the
capture itself, we can actually use
hardware, so it's really fast
in performance, but the results
again are not as good as what we get
out of Vision.
So, depending
on your application and what you
want to do, choose the right
technology for the face
detection.
Now I did mention that our
quality is better, so let me try
to prove that a little bit.
So, I have here an image and I
ran the face detection through
Core Image.
And we find two faces and we see
roughly where the eyes and where
the mouth are.
Now in Vision we find all four
faces even the occluded ones and
we get a whole lot more details
with the face landmarks.
Speaking of Core Image, I would
like to highlight a little bit
what's happening with the
CIDetectors.
So, whoever uses them already can
keep on using them; they are
still in Core Image. But all new
work and all the improvements
in terms of algorithms
moving forward will be
in Vision; that's the new home
for Computer Vision.
So, an awful lot of slides; how
about a demo?
So, what I'm going to show you
is an application that runs an
AV capture session on the device
if the demo Gods are with us.
And we will do a very simple
rectangle detection request.
So, what do I have to do to set
this up?
What you see here is I create my
request, that's my simple
rectangle detection request in
this case.
I'm actually in the wrong sample
that's why I'm getting confused
here, my apologies.
Here we go.
Okay, we have our rectangle
detection request and I'm
setting some parameters just as
an example here. I only want
a minimum size; our
coordinates are normalized, so I
only want rectangles of at least 10%
of the image size, and I just want 20
rectangles.
I could get more, but I want 20;
I just picked a number.
I set up my array of the requests
that I want to perform, and
right here this is our
completion handler. All that
I'm going to do is
draw my rectangles, but as you
notice I'm dispatching it to
the main queue to update our UI.
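A sketch of that setup, using the parameters just described; `drawRectangles(_:)` stands in for the drawing code:

```swift
import Vision

// Rectangle detection request whose completion handler draws the results.
let rectangleRequest = VNDetectRectanglesRequest { request, error in
    let rectangles = request.results as? [VNRectangleObservation] ?? []
    DispatchQueue.main.async {
        // Update the UI on the main queue.
        drawRectangles(rectangles)
    }
}

// Coordinates are normalized, so 0.1 means at least 10% of the image size.
rectangleRequest.minimumSize = 0.1
rectangleRequest.maximumObservations = 20
```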
Where do our images come from?
So, we look at the capture
output here and as I promised,
in the capture output we get our
pixelBuffer from the
CMSampleBuffer.
Right here I'm getting the
cameraIntrinsics.
Now this is something that is
important in some of these
Computer Vision tasks where we
actually know what the camera is
kind of looking at.
As I mentioned, we don't forget
the EXIF orientation, and I
create an image request handler
and perform our tasks.
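A hedged sketch of that capture callback, with `requests` standing in for the array of Vision requests set up elsewhere:

```swift
import AVFoundation
import Vision

// Sketch of the AVCaptureVideoDataOutput delegate callback described above.
func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }

    // Pass the camera intrinsics along when they are attached to the buffer.
    var options: [VNImageOption: Any] = [:]
    if let intrinsics = CMGetAttachment(sampleBuffer,
                                        kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
                                        nil) {
        options[.cameraIntrinsics] = intrinsics
    }

    // Don't forget the orientation; .right is typical for a portrait back camera.
    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                        orientation: .right,
                                        options: options)
    try? handler.perform(requests)
}
```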
So, how does this look when we
actually run it?
All right, so what we're going
to see here is that we now
detected this rectangle, and
that's as simple as it is; we
can find other rectangles.
If the cable is long enough we
can actually look around, and oh,
there we find a computer with various
rectangles.
Now I chose the yellow kind of
on purpose because it's the same
color as you saw in the demo
during the keynote for the new
document camera in Notes.
And I borrowed their color
because they borrowed our code
to do actually the rectangle
detection.
[ Applause ]
Thank you.
[ Applause ]
Now that was simple, let's do a
bit more.
So, how about we throw some
machine learning at this as well
just for the fun of it.
So, what I have to do is I have
a little model that I just
dragged into my project here.
And that is a classifier that
will tell us something
about the image.
And we see when we look at this
part here that we need to feed
it an image of a very strange
size and get out of it some
classification.
Now you don't need to worry
about that size because Vision
will do the work for you.
So, what do I need to do?
I first need to create a Vision
model and my request with that.
And that is the part that we
have here, so I'm simply loading
the Inception model and I create
my classification request.
Now it tells me there's
something missing and I will get
to that in just a moment.
The last thing I want to
highlight here is that the model
expects a square image, but
our cameras don't see squares, so
I need to tell it how
to handle the
aspect ratio that I want to use.
And I say okay, I just
want a center crop.
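A sketch of that setup, assuming an `Inceptionv3` class generated by Xcode from the dragged-in model; `handleClassification` is the completion handler sketched after the next paragraph:

```swift
import Vision
import CoreML

// Wrap the Core ML model for use with Vision and create the request.
let visionModel = try VNCoreMLModel(for: Inceptionv3().model)
let classificationRequest = VNCoreMLRequest(model: visionModel,
                                            completionHandler: handleClassification)

// The model wants a square input, but camera frames aren't square,
// so tell Vision how to fit the image: here, a center crop.
classificationRequest.imageCropAndScaleOption = .centerCrop
```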
So, I need a completion handler
for my task and I have that
already pre-canned here as well.
In this completion handler I
simply look at my observations.
This classifier can see a thousand
different things and I don't
want to show all of them; I only
show the ones that I care about.
So, what I'm doing is a little
bit of filtering: I only take
the top four and I only look at
the ones that have a confidence
of at least 30%. That just works
well for my demo here, but you
will figure out what
works well for your
model.
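A sketch of that filtering, with `showLabels(_:)` as a hypothetical UI helper:

```swift
import Vision

// Keep at most the top four labels with at least 30% confidence.
func handleClassification(request: VNRequest, error: Error?) {
    guard let observations = request.results as? [VNClassificationObservation] else { return }

    let labels = observations
        .filter { $0.confidence > 0.3 }
        .prefix(4)
        .map { "\($0.identifier) \(Int($0.confidence * 100))%" }

    DispatchQueue.main.async {
        // Update the UI on the main queue.
        showLabels(Array(labels))
    }
}
```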
And all I have to do next is add
my classification request into
my array of requests, and now I
will actually run two requests.
So, I have this already loaded
on my device, let's see how this
actually looks.
Of course, you will see it when
I switch to the correct machine,
there we go.
Okay, so we have a coffee mug
which is empty, somebody better
fill that for me.
We have a ballpoint pen, we have
a padlock and look an iPod.
Who has stolen those empty cards
away and didn't realize it was
an iPod?
[ Applause ]
All right, let's go back to the
slides before we get to the next
show-and-tell.
For my next demo, I want to do
something a little bit more
elaborate and with that I chose
something that's called
MNISTVision.
People in the machine learning
community have already looked at
that a little bit more.
MNIST is a dataset where a bunch
of government employees and high
school students wrote numbers
down; this was marked up, and
people have trained
classifiers on that.
Note this is basically like
white numbers on black
background, so I guess they've
written it with chalk on an old
blackboard.
So, in this sample code I'm
going to show you a few concepts
that are important for making
something a bit more elaborate with
Vision.
First, we'll spin off multiple
requests on top of each
other; then we use Core Image in
between to do some image processing;
and last but not least, we use
Core ML again for the machine
learning part.
So, how is this going to work?
We have here an image on which
we find a sticky note.
Well we find it by using the
rectangle detector, there's our
sticky note.
Now, that is perspectively
distorted and it's clearly not
white text on a black background.
So, we use Core Image in the
next step: we'll actually do
the perspective correction of it,
invert the colors, and also enhance
the contrast, so that we get
this black-and-white image.
And last but not least, I need
to run my MNIST classifier on it,
and it should tell me with 80%
confidence that this is
the number four.
Again, let's see how this looks
in the app.
So again, I start off with a
rectangle detection request; it's
my favorite, I know.
But what's more interesting is what
I'm going to do in the
completion handler.
So, I do some validation just to
make sure that the rectangles
I'm getting out of it are
actually okay.
But the interesting part happens
here.
I get the coordinates of the
corners and feed them into CI to
use the CIPerspectiveCorrection.
That allows me to take this
prospectively distorted image
and actually bring it upright as
if I would have the camera
straight on.
I use the CIColorControls to
really bring out the contrast of
the image to make it kind of
binarized.
And as I said, I have to color
invert it.
Now, the resulting image I feed
into a new request handler,
because we have a new
image on which I'll run the
classification.
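A hedged sketch of that Core Image step, run inside the rectangle request's completion handler; the filter parameter values are illustrative, and `classificationRequest` is assumed to be the MNIST request described next:

```swift
import Vision
import CoreImage

// Perspective-correct, binarize, and invert the detected rectangle, then
// cascade into the classification request on the new image.
func classify(rectangle: VNRectangleObservation, in inputImage: CIImage) {
    let size = inputImage.extent.size

    // Rectangle corners are normalized; scale them up to image coordinates.
    func scaled(_ point: CGPoint) -> CGPoint {
        return CGPoint(x: point.x * size.width, y: point.y * size.height)
    }

    let corrected = inputImage
        .applyingFilter("CIPerspectiveCorrection", parameters: [
            "inputTopLeft": CIVector(cgPoint: scaled(rectangle.topLeft)),
            "inputTopRight": CIVector(cgPoint: scaled(rectangle.topRight)),
            "inputBottomLeft": CIVector(cgPoint: scaled(rectangle.bottomLeft)),
            "inputBottomRight": CIVector(cgPoint: scaled(rectangle.bottomRight))
        ])
        .applyingFilter("CIColorControls", parameters: [
            kCIInputSaturationKey: 0,
            kCIInputContrastKey: 32      // push toward black and white
        ])
        .applyingFilter("CIColorInvert")

    // A fresh handler for the new image, cascading into the classification.
    let handler = VNImageRequestHandler(ciImage: corrected, options: [:])
    try? handler.perform([classificationRequest])
}
```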
So, what does the classification
look like?
For the classification I
use my MNIST model, which I
have ready; this is
a small model that I actually
trained on this laptop
very easily with a few lines
of script, and then thanks
to Core ML I can just drag it
in and use it very easily.
So, I have my model here.
Now again, one part I would
like to highlight: this
takes, in this case, a very small
grayscale image.
From a 28 by 28 pixel image it
should be able to read these
numbers.
So that's where my
classification is coming from,
and now I need to feed in the
image.
This sample code has been
made available for the
session; to make it easy to
run in the simulator rather than
live off of the camera, I'm just
going to use the
UIImagePicker and feed it into
my VNImageRequestHandler and let
it perform the rectangle
request.
Now notice I buried the request
for the classification into the
completion handler of my
rectangle detection and that
allows us to basically cascade
multiple requests on top of each
other.
So, let's try the demo for this.
Okay, so I have my app here and
well [inaudible] giveaway.
Okay, so what I see here is
again I have my image on the
top, this was actually the photo
that I took earlier.
We see it's correctly classified
as the number one, with a really
high confidence in this case.
And what you see on the bottom
is basically just to
visualize it: I took this
intermediate image that we
created in CI and show it as
well.
Let's choose another number.
Yes, this is the number three.
Can we guess what this number
is? It's the number four.
It works correctly.
All right, thank you.
[ Applause ]
Let me go back to our slides.
So that is our Vision framework.
Let's recap a little bit of
what we have seen here.
So, Vision is a high-level
framework for Computer Vision
and it should really make it
easy for you to use this in your
applications even if you're not
a Computer Vision expert.
We have a whole variety of
detectors, and they all run
through one consistent interface,
which makes it very easy to learn
that set of APIs.
And last but not least, the
integration with Core ML.
By bringing your own custom
models you can do a lot in your
application, you can find
hotdogs and see if they are
really hotdogs.
I had to make that joke.
So, if you want to learn more
about our session, please go to
our website, and I would
definitely highlight that there are
some related sessions, some of which
you may have watched already.
I'll only call out the Core ML one,
but you can find the rest on our
website.
Please come to our get-together
that we have at 6:30 today and chat
about what we can do.
And for the more
advanced parts of Core ML there's
a session on Thursday, and we
also have a session on Core
Image where they will do
some very fancy stuff with Core
Image and Vision.
And with that I'd like to thank
you for coming today and enjoy
the rest of WWDC.
Thank you.
[ Applause ]