WWDC2017 Session 507

Transcript

>> Welcome to Session 507.
I'm Brad Ford.
I'm from the Camera Software
Team, and I'm very excited to
share some deep thoughts with
you this afternoon.
Did you see what I did there?
All right [applause].
This session is part one of a
two-part series on a very
important initiative for Apple
this year, and that is media
containing depth information.
I'll introduce depth at the
conceptual level, I'll
familiarize you with key terms,
and I'll teach you how to
capture depth data on the
iPhone, much like this.
You'll see a lot of ghostly
images in this session.
Here's the agenda.
First we're going to cover depth
and disparity on iPhone 7 Plus
at a high level.
Then we'll move on to streaming
depth data from the camera,
capturing photos with depth
data, and finally we'll end with
a slight tangent, which is dual
photo capture.
It is the most highly requested
feature we've had on dual
camera, and I'm very excited to
talk about it.
Your job is to listen for all
the truly horrible depth puns
that I have sprinkled throughout
this session, and let's make a
game of it.
Okay? Every time you hear one,
just give me a nice big groan to
let me know that you care.
All right?
Here, let's practice.
Everybody ready for a deep dive?
[ Group groaning ]
Thank you, from the bottom of my
heart.
[ Group groaning ]
All right.
Good. The reason you're all here
today is this guy right here.
This is the iPhone 7 Plus.
The product, of course, has sold
exceptionally well, even better
than its plus size predecessor,
and that's thanks in large part
to the quality of the dual
camera system.
It is a dual prime lens system
consisting of a 28-millimeter
equivalent wide-angle camera,
and a 56-millimeter equivalent
telephoto camera.
Both of them are 12 megapixels.
They share the same feature set,
the same formats.
You can run either of these
cameras on its own, or you can
address them in tandem using a
third virtual camera, the first
time we've ever delivered one on
iOS, and it's called the dual
camera.
It runs them in a synchronized
fashion, the same frame rate,
and running them together
enables two marquee features.
The first is dual camera zoom.
This switches between the wide
and the tele automatically as
you zoom.
It matches exposure, focus and
frame rate so that it's kind of
magical.
You don't even realize that
we're switching cameras, but all
of this happens very seamlessly.
We also are compensating for the
parallax shift to make it a
smooth transition as you go back
and forth between the wide and
the tele.
And the second marquee feature
is, of course, the Portrait
mode, where the dual camera
system locks into the tele
camera's narrower field of view,
but then uses images from both
the wide and the tele to
generate a beautiful shallow
depth of field effect that you'd
expect from a much more
expensive camera with a fast,
wide-open lens.
The foreground is sharply in focus, while the background is progressively blurred in these pleasing little bokeh circles.
The depth effect has gotten even
better in iOS 11.
We've made improvements to the
rendering of the out-of-focus
area.
It more accurately represents a wide-open fast lens with sharp and well-defined bokeh circles.
We've also improved how the
rendering handles the edges
between the foreground and the
background.
Please check it out if you
haven't yet.
I think you'll be pleasantly
surprised at how great the
quality of the shallow depth of
field effect is in iOS 11.
To generate an effect like this
you need to be able to separate
foreground from background.
In other words, you need depth.
And up to now that depth
information has been exclusive
to the Apple camera app's
Portrait mode, but now new in
iOS 11 we are opening up depth
maps to third party apps.
Here's a gray scale
visualization of the depth map
that was embedded in this image
file.
Having depth information opens
up a world of possibilities for
image editing, such as applying
different filters to the
background and the foreground,
like this.
I've applied a noir black and
white filter to the background
and the fade filter to the
foreground.
And notice how the little girl's
tights are still pink, but
everything behind them is black
and white.
Knowing the gradations of depth,
I can get even fancier and I can
move the switch-over point
forward or backward, like this.
Keep your eyes on the flower.
So now notice that just her hand
and her flower are in color,
while everything else is in
black and white.
You can even control foreground
and background exposures
differently, like this.
So now she looks like she was
photoshopped into her very own
photo.
I'm not saying you should do it.
I'm saying you could do it.
All right.
Enough fun.
Let's get technical.
I like to call this section deep
learning [group groaning].
Thank you.
First we need to define what a
depth map is.
In the real world depth means
the distance between you and an
observed object.
A depth map is a transformation
of a three-dimensional scene
into a two-dimensional
representation, and you do that by flattening the scene through a constant distance, the focal length.
Let me explain what I mean.
I'm going to use a diagram of a
pinhole camera often during this
presentation.
If you've studied computer
vision, you'll be really
familiar with pinhole cameras.
A pinhole camera is a simple
lightproof box without a lens.
Instead, it just has a little poked hole, a single small aperture that permits light to enter and project an inverted image on the opposite side of the box, which is known as the image plane, or sensor.
The aperture through which the
light rays pass is called the
focal point, and the field of
view of the image captured
depends on the focal length.
So the focal length is the
distance from the focal point to
the image plane.
A shorter focal length means
wider field of view; whereas,
longer focal length, longer box,
means narrower field of view.
The focal length is that
constant distance by which real
world distances are flattened
into a 2D image.
Put simply, a depth map is a transformation of a 3D scene into a 2D, single-channel image where each pixel value is a depth, like five meters, four meters, three meters.
Now, to truly measure depth you
need a purpose-built camera for
this, something like a
time-of-flight camera.
For instance, a system that
bounces light signals off of
objects and then measures the
time that it takes to return
back to the sensor.
The iPhone 7 Plus dual camera is not a time-of-flight camera.
Instead, it is a disparity-based
system.
Disparity is a measure of the
magnitude of shift of an object
when observed from two different
cameras, like your eyeballs.
Disparity is another name for
parallax.
You can observe this effect by
holding your head steady and
fixing your gaze on something
close, and then without moving
your head, close one eye and
then the other eye.
So, for instance, this would be
left eye, right eye; left eye,
right eye.
And you can see the colored
pencils appear to shift a lot
more than the markers in the
back because they are closer.
That's the parallax effect, or
disparity.
Now back to our pinhole camera
model.
Now I've taken a bird's eye view
of two cameras that are said to
be stereo rectified.
That means, one, that they are
parallel to one another, they're
pointing in the same direction,
and two, they have the same
focal length, which is very
important.
That's the distance from the
focal point to the image plane
or sensor.
Each camera will have a measured
optical center or a principal
point, and if you draw a
perpendicular line from the
pinhole to the image plane, then
the optical center is the point
at which it intersects with the
image plane.
Now, there's another term that
you should be familiar with and
that is baseline.
Baseline refers to the distance
between the two optical centers
of the lenses in a
stereo-rectified system.
Here's how it works.
Rays of light from an observed object pass through the apertures and land at different points on the image planes of the two cameras.
A fourth term that I'm going to
throw at you right now is Z.
Z is the canonical term for
depth, or real-world depth.
Now, watch what happens to the
points on the image plane as the
observed point gets farther
away.
They move closer together.
I'm going to show that to you
one more time.
So as the real point gets
farther away, they get closer
together on the image plane, and
as the object gets closer, the
dots move farther away from each
other.
So when the cameras are stereo
rectified, these shifts only
move in one direction.
They either move closer or
farther away from one another,
but on the same line, or the
epipolar line.
Now, knowing the baseline you
can essentially line up the
cameras along their optical
centers like this and subtract
the distance between the
observed points on the image
planes to get the disparity.
That's what disparity is.
You can express this distance in
whatever units make sense for
your processing.
It could be pixels, meters,
microns.
And it's common to store it in
pixels since we think of RGB
images in pixels.
Now, storing pixel shifts works
fine, as long as the image that
they accompany never changes
size.
It's not so good if you're going
to edit that image because if
you've scaled the image down,
you've now effectively changed
the pixel size.
So you have to go through the
map and you have to scale each
value in the depth map.
That's a very brittle
representation.
Instead, we at Apple have chosen
to express disparity using
normalized values that are
resilient to scaling operations.
So here's how we do that.
Again, going to our observed
point, you'll notice that there
are two similar triangles being
formed.
I'll highlight them for you.
These triangles have equal
ratios of sides and proportions.
Now, if I get rid of the cameras
to just show you the triangles,
the real-world triangle sides
are Z, or meters, and baseline,
the distance between the two
optical centers.
Inside the lightproof box, that same triangle is represented by the focal length in pixels and the disparity in pixels.
Do you feel math coming on?
I feel math coming on.
So stay with me here.
This is pretty painless.
Baseline is to Z as pixel
disparity is to focal length.
Okay. Well, what if we divide
both sides by the baseline so
the b's cancel out on the left,
and what you're left with is 1
over z.
That's pretty nice.
1 over z is inverse depth.
That is literally what disparity
means.
When an object moves farther
away, the disparity shrinks.
When it moves closer, the
disparity grows.
So it is the inverse of depth.
What remains on the right is
what we call normalized
disparity.
So it's not a pixel shift
anymore, it's d over focal
length times baseline.
The baseline is baked in so you
don't need to carry that
information with you separately
when you're dealing with the
depth map.
The units are 1 over meters,
just as it's 1 over z, and it
withstands scaling operations,
and as you can see, converting
from depth to disparity is
trivial, since it's just a
1-over operation.
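To make that concrete, here's the relationship written out, followed by a tiny Swift sketch of the 1-over conversion (the function names are mine, purely illustrative):

    b / Z = d / f   which gives   1 / Z = d / (f * b) = normalized disparity

    import Foundation

    // Hypothetical helpers: convert between depth in meters and normalized
    // disparity (1/meters). Only meaningful for finite, nonzero input.
    func depthInMeters(fromNormalizedDisparity disparity: Float) -> Float {
        return 1.0 / disparity
    }

    func normalizedDisparity(fromDepthInMeters depth: Float) -> Float {
        return 1.0 / depth
    }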
Is anyone feeling way beyond
their depth at this point?
[ Group groaning ]
This is a little tricky stuff,
but the takeaways are simple.
We have a disparity-based
system, not a true
time-of-flight camera, but
disparity is a great proxy for
depth, and normalized disparity
is the inverse of depth.
Hey, speaking of normalized
disparity as being the inverse
of depth, here's a deep thought.
This image has a disparity map,
so I guess this would make this
a depth-defying leap.
[ Group groaning ]
Thank you.
All right.
So in our depth API set we use
the term depth data, and this is
a generic term for anything
that's depthy.
It can refer to either a true
depth map or a disparity map.
Both are related to depth,
they're both depthy, so they are
both depth data.
And we have a purpose-built
object for this.
The canonical representation on
our platform for depth is called
an AVDepthData.
It's available on iOS, macOS and
tvOS.
It's a class in the AVFoundation
framework and it represents
either depth or disparity maps.
It also provides some nice
facilities to convert between
depth and disparity.
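As a quick sketch of those conversion facilities (assuming you already have an AVDepthData from somewhere; this is illustrative, not production code):

    import AVFoundation

    // Convert whatever depth data we were handed into 32-bit disparity for
    // CPU-side work. converting(toDepthDataType:) returns a new AVDepthData.
    func asDisparity32(_ depthData: AVDepthData) -> AVDepthData {
        guard depthData.depthDataType != kCVPixelFormatType_DisparityFloat32 else {
            return depthData
        }
        return depthData.converting(toDepthDataType: kCVPixelFormatType_DisparityFloat32)
    }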
Okay. Let's get into the nuts
and bolts of depth maps.
Depth maps are images, if you haven't figured that out by now.
They're kind of like RGB images, except they're single channel, but they can still be expressed as CVPixelBuffers, and CoreVideo now defines four new pixel formats for the types that we saw on the previous slide.
They're all floating point.
The first two are for normalized
disparity and it's measured in 1
over meters.
Notice that there's a 16-bit
flavor and a 32-bit flavor.
The second two are for depth and
they're measured in meters.
They also come in 16- or 32-bit
flavors.
Why would we do this?
Well, if you're going to be
working with depth on the GPU,
it would make sense for you to
request 16-bit or half float
values of depth.
If you'll be working on the CPU,
you should work with the full
32-bit float variants.
They'll work better.
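In code, those four CoreVideo pixel formats look like this (just a reference list of the constants, with the units from the slide as comments):

    import CoreVideo

    // The four depth-related pixel formats, all floating point.
    let disparity16 = kCVPixelFormatType_DisparityFloat16 // 1/meters, half float (GPU-friendly)
    let disparity32 = kCVPixelFormatType_DisparityFloat32 // 1/meters, full float (CPU-friendly)
    let depth16     = kCVPixelFormatType_DepthFloat16     // meters, half float
    let depth32     = kCVPixelFormatType_DepthFloat32     // meters, full float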
We'll talk later about where an
AVDepthData object might come
from, but for right now let's
just focus on its core
properties.
Given an AVDepthData object you
can query its depth data type,
which is one of those four pixel
formats; you can get access to
the depthDataMap itself which, again, is a CVPixelBuffer; you can iterate through it by row and column using standard CVPixelBuffer APIs.
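Here's a minimal sketch of that kind of iteration, assuming a 32-bit float map (the function name is mine):

    import AVFoundation

    // Walk a 32-bit float depth/disparity map and read one value per row.
    // Assumes depthDataType is DepthFloat32 or DisparityFloat32.
    func dumpCenterColumn(of depthData: AVDepthData) {
        let map = depthData.depthDataMap
        CVPixelBufferLockBaseAddress(map, .readOnly)
        defer { CVPixelBufferUnlockBaseAddress(map, .readOnly) }

        let width = CVPixelBufferGetWidth(map)
        let height = CVPixelBufferGetHeight(map)
        let rowBytes = CVPixelBufferGetBytesPerRow(map)
        guard let base = CVPixelBufferGetBaseAddress(map) else { return }

        for row in 0..<height {
            let rowPtr = base.advanced(by: row * rowBytes)
                .assumingMemoryBound(to: Float32.self)
            let value = rowPtr[width / 2] // may be NaN in an unfiltered map
            print("row \(row): \(value)")
        }
    }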
And the final two properties I
want to highlight here have to
do with inherent problems in
capturing depth data, and we're
going to go through these
problems one at a time and
discuss the solutions.
The first problem is holes,
holes in the depth data.
To calculate disparity both
cameras need to observe that
same point, but from two
different perspectives.
If they can't see it, no
disparity.
So why might they not be able to
see it?
For one, occlusions, such as a
creepy finger coming in and
suddenly blocking one of your
cameras.
If it's partially or fully obscuring one camera's view, you don't have two points of view anymore; therefore, you have no disparity.
Another more common reason is
difficulty in finding features.
When camera one and camera two's
images are compared, remember,
they line them up by optical
center and look for matching features at key points.
Let's say it's dark out and the
observed point may not have very
well-defined features anymore,
the color is a little bit noisy,
the edges are hard to find.
Another example would be if you
point the cameras at a flat,
white wall with no texture to
it, there are no features so
it's very hard to find
differences in matching.
For any of these reasons you
might have areas in your image
where there is no disparity, and
those are called holes.
Holes are expressed in the depthDataMap as NaN, the standard floating-point not-a-number representation, either 16-bit or 32-bit.
Depth maps may also be processed
to fill in the holes.
We can do this by interpolating
based on surrounding depth data
that's good or by using metadata
present in the RGB image.
The isDepthDataFiltered property
of AVDepthData tells you whether
the map has been processed in
this way.
If you receive an unfiltered
AVDepthData, you can expect to
find NaN values within that map.
Okay. We'll talk a little bit
more about how you can request
filtering later on.
The second problem that
interferes with accurate
disparity generation is
calibration error.
There are lots of different
kinds of calibration errors that
can happen that we can correct,
but there's one that we can't,
and that is incorrect accounting
of the optical center in either
of the two cameras.
So for this one I've rotated our pinhole cameras by 90 degrees and shifted them down to the bottom to give myself a little more room at the top.
In an ideal stereo-rectified
system, perspective only shifts
in one direction, left or right,
along these same lines.
So if there's a ray that's
observed from camera one, it
would be viewed as a series of
intersecting points on a line
from camera two, like this.
So for disparities to be
measured accurately you must
have an accurate baseline.
And baseline, again, is the
distance between the two optical
centers.
If you don't have an accurate
baseline, you can't align those
two cameras' optical centers and
you can't figure out how much
disparity there is.
Now, what happens if the optical
center is calculated wrong or
just misreported?
Let's say the true optical
center is here, but for some
reason it's misreported as being
here.
Now suddenly all of our
disparity points on camera two's
image plane are shifted to the
left by the same fixed amount.
Now all the objects will be
reported as being farther than
they truly are.
If the error were in the other
direction, then the objects
would be misreported as being
too close.
So we can detect and fix a lot
of problems, but this one we
can't detect and fix because,
again, all of those points still
look like they're on the same
correct line.
We don't know the difference
between the baseline being wrong
and the person actually moving farther or closer.
Now, how can this happen; why
would there be problems with
optical center calculation?
iPhone cameras don't use
pinholes, they have lenses, and
on iPhones those lenses don't
stay still.
If OIS is engaged, then the lens
may be moving laterally to
counteract hand shake.
Gravity can come into play
because it can cause the lenses
to sag.
The focus actuators are actually
springs to which an electrical
current is applied.
So all of these reasons might
cause it to move around
laterally a little bit, and
these very small errors in
optical center position can
result in large errors in
disparity.
When this occurs, the result is
a constant amount of error in
every pixel in the map.
The disparity values are still
usable relative to one another,
but they no longer reflect
real-world distances.
For this reason AVDepthData
objects have to have a concept
of accuracy.
An accuracy value of absolute
would mean the units do reflect
real-world distances, there's no
calibration problem.
Relative accuracy means that the
Z ordering is still preserved,
but the real-world scale has
been lost.
Depth data captured from, say, a
third-party camera can be
reported as either absolute or
relative, but iPhone 7 Plus
always reports relative accuracy
due to the calibration errors
that I just mentioned.
But I don't want you to be
frightened by that.
Relative accuracy is not bad
accuracy.
Dual camera depth is still
totally usable, and let me show
you how.
Awesome, formulas on slides.
Okay. Here comes a bit of math
again.
Let's say we've got a relative
accuracy disparity value on the
left, which is the d with the
little dunce cap over it because
it's bad, and that's equivalent
to an absolute disparity d plus
a fixed amount of error.
We don't know what the fixed
amount of error is, but it's
there.
Now, let's take a common
operation such as finding the
difference between two
disparities in the same map,
it's like subtracting the
differences.
So let's say the equation looks like this.
You're subtracting two bad disparities, and that's the same as subtracting two good disparities that each carry the same fixed error.
If we reorder things, we find that the errors cancel each other out, and we're left with a very happy coincidence here.
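Written out, with e standing for the unknown fixed error in each bad disparity:

    (d1 + e) - (d2 + e) = d1 - d2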
This happy discovery is that the
differences are the same,
whether your disparity is
perfect or your disparity is
relative.
This formula kind of proves that
relative is just as good as
absolute if you're creating
effects that only rely on, say,
differences within the same map.
And that's why the effects
produced from relative accuracy
depth still look fantastic.
And with that I think we've
wrapped up our AVDepthData
intro, or maybe we've gotten to
the bottom of it [group
groaning].
It's time to move on to our
first capture case, which is
streaming depth, and I feel a
demo coming on.
Okay. Let's start with a demo
called AVCamPhotoFilter.
This is an app that we released
last year as sample code with
the show, and this was to show
you how to apply an effect in
real time to a preview and
render that same effect to the
photo.
So last year it just had one
button at the top and that was
to filter the video, and it did,
you know, kind of a cheesy
little rosy effect to the video,
but it shows it to you in real
time on the preview and it also
renders it to the photo when you
take a photo.
This year we've added some depth
to this sample by showing you
how to preview depth in a
streaming fashion.
So now what we're doing is
turning on depth and we're
previewing it by mixing between
full RGB and full depth.
I'm going to call up my lovely
assistant Vanna -- actually,
it's Eric.
Thanks, Eric.
He's going to come up and show
us something that's dynamic,
like a baseball glove.
I love it.
Now, notice that it's quite
noisy, there's a lot of jumping
around happening.
You can definitely see what it
is, but it's not perfect and
there's a lot of temporal
problems going on, but I can
click the Smooth button and
suddenly we have filtered the
depth to fill in the holes and
temporally smooth them, and now
it's a really nice-looking
disparity.
I'm going to go ahead and take a
photo.
And now if I go back to the
Photos app, we'll find that we
just captured a really lovely
looking depth representation,
and now this is an educational
app because finally we can
answer the question how deep is
your glove, how deep is your
glove [group groaning].
You really need to learn.
All right.
Let's go back to slides.
I know it's late.
I'm trying to keep you awake.
All right.
How did we do that?
AVFoundation framework's camera capture classes are divided into three main groups.
The first is the
AVCaptureSession, which is just
a control object.
You tell it to start or stop
running, but it doesn't do
anything unless you give it some
input, and for that we have AVCaptureInputs, such as an AVCaptureDeviceInput; I've made
one here associated with the
dual camera, and that provides
input to the session, but now
you need to direct it somewhere
as an output.
And now we have a new kind of
output called an
AVCaptureDepthDataOutput.
This is affectionately referred
to on our team as the DDO, and
it functions similarly to our
VideoDataOutput, except that
instead of delivering CoreMedia
sample buffers, it delivers
AVDepthData objects, that
canonical representation that I
was talking about.
It delivers them in a streaming
fashion.
Now, where is
AVCaptureDepthDataOutput
supported?
You can, of course, add it to
any session anywhere, but you're
not going to get depth unless
you are on the dual camera
because that is the only dual
system or stereo system that we
have for calculating disparity.
When you attach a
DepthDataOutput to your session,
some things happen.
The dual camera automatically
zooms to 2X, that is the full
field of view of the tele, and
that's because in order to
calculate disparity, the focal
lengths need to be the same and
at 2X zoom the wide-angle
camera's focal length matches
the tele.
Also zoom is disabled while you
are calculating depth.
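Pulling that together, here's a minimal Swift sketch of a streaming depth setup (the class name and queue label are mine; error handling is abbreviated):

    import AVFoundation

    final class DepthStreamer: NSObject, AVCaptureDepthDataOutputDelegate {
        let session = AVCaptureSession()
        let depthOutput = AVCaptureDepthDataOutput()

        func configure() throws {
            session.sessionPreset = .photo
            // Depth only works on the dual camera, as discussed above.
            guard let dualCamera = AVCaptureDevice.default(.builtInDualCamera,
                                                           for: .video,
                                                           position: .back) else { return }
            let input = try AVCaptureDeviceInput(device: dualCamera)
            if session.canAddInput(input) { session.addInput(input) }
            if session.canAddOutput(depthOutput) { session.addOutput(depthOutput) }
            depthOutput.isFilteringEnabled = true // fill holes, smooth temporally
            depthOutput.setDelegate(self, callbackQueue: DispatchQueue(label: "depth"))
            session.startRunning()
        }

        func depthDataOutput(_ output: AVCaptureDepthDataOutput,
                             didOutput depthData: AVDepthData,
                             timestamp: CMTime,
                             connection: AVCaptureConnection) {
            // One AVDepthData per callback, delivered in a streaming fashion.
        }
    }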
We've added some new accessors
to AVCaptureDevice.
On the dual camera you can
discover which video formats
support depth by querying the
supportedDepthDataFormats
property.
And there's also a new
activeDepthDataFormat property
that lets you see what the
activeDepthDataFormat is or
select a new DepthDataFormat.
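For example, here's a sketch of selecting a 32-bit depth format (assuming the dualCamera device from the earlier sketch):

    import AVFoundation

    // Pick the first depth format on the active video format that delivers
    // 32-bit float depth, and make it the active depth data format.
    let depthFormats = dualCamera.activeFormat.supportedDepthDataFormats
    if let depth32Format = depthFormats.first(where: {
        CMFormatDescriptionGetMediaSubType($0.formatDescription) ==
            kCVPixelFormatType_DepthFloat32
    }) {
        do {
            try dualCamera.lockForConfiguration()
            dualCamera.activeDepthDataFormat = depth32Format
            dualCamera.unlockForConfiguration()
        } catch {
            print("Could not configure depth format: \(error)")
        }
    }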
We currently support three video
resolutions or presets for
depth, and let me go through
them one at a time.
The first is the ever-popular
Photo Preset.
In the Photo Preset you get a
screen-sized preview coming out
of VideoDataOutput, and you get
full res 12-megapixel images
coming out of the photoOutput.
So here you see that the
VideoDataOutput is delivering
1440x1080, which is
screen-sized.
Accompanying that, if you use a
DepthDataOutput, you get 320x240
at a maximum of 24 fps.
Why so small?
Well, it takes a lot of
horsepower to do that disparity
map 24 times a second.
You can also get it at a lower
resolution if you would like,
160x120.
Next we have a 16x9 format.
This is a new format this year.
Last year we had a 720p 16x9
format that went up to 60 fps.
This is a new one that goes up
to 30 fps, but it supports
depth.
And again, it is aspect correct
in the DepthDataOutput at
320x180 or 160x90.
And finally, we have a very
small VGA-sized preset or active
format that you can use if you
just want something very small
very fast.
Let's talk about frame rates.
AVCaptureDevice allows you to
set the min and max video frame
rates, but it does not allow you
to set the depth frame rates
independent of the video frame
rate.
That is because depth needs to
be delivered coincident with the
video or at an even fraction of
the video frame rate.
So, for example, if you select a
max video frame rate of 24, the
depth can keep up with that, so
you get 24 fps of depth.
If, however, you select 30 fps
video, the depth cannot keep up
so it will select not 24, but
15, so that you get nice even multiples.
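In code, you set the video rate on the device and let depth follow; a sketch, again assuming the dualCamera device from earlier:

    import AVFoundation

    // Ask for 24 fps video; depth will track it at 24 fps (or at an even
    // fraction if the video rate is too high for depth to keep up).
    do {
        try dualCamera.lockForConfiguration()
        let frameDuration = CMTime(value: 1, timescale: 24)
        dualCamera.activeVideoMinFrameDuration = frameDuration
        dualCamera.activeVideoMaxFrameDuration = frameDuration
        dualCamera.unlockForConfiguration()
    } catch {
        print("Could not set frame rate: \(error)")
    }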
DepthDataOutput supports
filtering depth data, as I just
showed you in the
AVCamPhotoFilter demo.
That fills the holes and it also
smooths things out as you move
around so that you don't see
temporal jumps from
frame-to-frame.
All right.
Let's look at our current
landscape as far as data
outputs.
We have four of them now.
The first is the
VideoDataOutput, which has been
around since iOS 4, and it is
the thing that gives you video
frames one at a time in a
streaming fashion at 30 fps or
60 fps, whatever you set it to.
We also have an AudioDataOutput, which typically gives you pushes of PCM frames, 1,024 at a time, at 44.1 kHz.
We also have a MetadataOutput
that can deliver either faces,
detected faces or barcodes, and
these come in sporadically.
They may have some latency, up
to four frames of latency for
finding faces.
And now we're adding
DepthDataOutput, which, as I
just mentioned, is either
delivered at the frame rate of
the video or at a rate evenly
divisible by the video.
So now this is kind of getting
ridiculous.
In order to work with all of
these data outputs you have to
have a very sophisticated
buffering mechanism to keep
track of when everything's
coming in if you care about
dealing with all of them at the
same time, or dealing with a
certain presentation time
altogether.
We have recognized this as a
problem for a while now, but the
DepthDataOutput has proven to be
the bridge too far.
That wasn't very loud.
Next one, better effort, please.
In iOS 11 we've added a new
synchronizing object called an
AVCaptureDataOutputSynchronizer.
It delivers all of the available
data for a given presentation
time in a single unified
callback, and it delivers a
collection object called an
AVCaptureSynchronizedDataCollection.
So this allows you to designate
a master output, the one that's
most important to you, the one
that you want everything else to
be synchronized to, and then it
will do the job of holding on to
the media as long as it needs
to, to ensure that all of the
data for a given presentation
time is available before it
gives you that single unified
callback.
It will either give you all of
the data for all of the outputs,
or if it's assured that there is
no data for a particular output,
it will go ahead and give you
the collection with what it had.
So here's a little code snippet
showing how to work with the
data output synchronizer's
unified delegate callback, which
passes you, again, a
SynchronizedDataCollection.
It's cool.
You can use it like an array or
like a dictionary, depending on
what you want to do with it.
You can iterate through it like
you would an array, using fast
enumeration if you just want to
get a list of everything that's
in the current collection.
Or if you want to deal with it
in a dictionary like fashion,
you can index by subscripting a
data output that you're
concerned with.
For instance, here I'm just
looking for the particular
result that came from the
DepthDataOutput and if it's
present, it will give it to me.
You have to guard your code to
look for nil because, again,
there might not be any depth for
that given presentation time.
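The snippet itself lives on the slide, so here's a reconstruction of that unified callback in Swift (assuming a depthDataOutput that was among the outputs handed to AVCaptureDataOutputSynchronizer(dataOutputs:)):

    import AVFoundation

    func dataOutputSynchronizer(_ synchronizer: AVCaptureDataOutputSynchronizer,
                                didOutput collection: AVCaptureSynchronizedDataCollection) {
        // Dictionary-style access: ask the collection for the result from a
        // specific output, and guard against nil, since there may be no
        // depth for this presentation time.
        guard let syncedDepth = collection.synchronizedData(for: depthDataOutput)
                as? AVCaptureSynchronizedDepthData,
              !syncedDepth.depthDataWasDropped else {
            return
        }
        let depthData = syncedDepth.depthData
        // ... process depthData here ...
        _ = depthData
    }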
All right.
For an example of how to use
AVCaptureDataOutputSynchronizer, again, use
AVCamPhotoFilter.
That sample code is already
available.
It's associated with this
session.
You can download it right now.
There's another new streaming
feature in iOS 11, a slight
tangent here, and that is
support for delivering camera
intrinsics with each video frame
when you're using
VideoDataOutput.
If you recall our pinhole
camera, in order to transform
points from a 3D space to a 2D
space, we needed two bits of
information.
We needed the optical center, or
principal point, and we needed
the focal length.
In computer vision you can use
these properties to re-project a
2D image back to the 3D space by
using the inverse
transformation, and this figures
prominently in the new ARKit.
New in iOS 11 you can opt in to
receive such a set of intrinsics
with each and every video frame
that you're delivered, and you
opt in by setting the AVCaptureConnection's isCameraIntrinsicMatrixDeliveryEnabled property.
When you do that you can expect
to get one attachment per buffer
with the intrinsics.
Let me show you what the matrix
itself looks like.
It may look imposing, but it's
really quite simple.
Camera intrinsics are a 3x3
matrix that describe the
geometric properties of the
camera.
fx and fy are the pixel focal
length.
They're separate x and y values
because sometimes cameras have anamorphic lenses or anamorphic pixels.
On iOS devices, our cameras
always have square pixels, so fx
and fy are always going to be
the same value.
Then x naught and y naught are
the pixel coordinates of the
lens' principal point, or
optical center.
These are all in pixel values
and they're given at the
resolution of the video buffer
with which they're provided.
So, once you've opted in, you
can expect to get sample buffers
in a streaming fashion and you
can get this attachment from
them, and the payload is a CFData that wraps a matrix_float3x3, which is a SIMD data type.
If you're doing computer vision,
you'll be really interested in
this new feature.
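Here's a sketch of both halves, the opt-in and the per-buffer read (the helper function name is mine, and videoDataOutput is assumed to be your AVCaptureVideoDataOutput):

    import AVFoundation
    import simd

    // Opt in on the video data output's connection, if supported.
    if let connection = videoDataOutput.connection(with: .video),
       connection.isCameraIntrinsicMatrixDeliverySupported {
        connection.isCameraIntrinsicMatrixDeliveryEnabled = true
    }

    // Pull the intrinsic matrix attachment off a delivered sample buffer.
    func intrinsics(from sampleBuffer: CMSampleBuffer) -> matrix_float3x3? {
        guard let data = CMGetAttachment(
            sampleBuffer,
            key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
            attachmentModeOut: nil) as? Data else { return nil }
        return data.withUnsafeBytes { $0.load(as: matrix_float3x3.self) }
    }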
Okay. I think we've officially
deep sixed the streaming topics.
[ Group groaning ]
Better effort.
Let's move on to the photo
capture, and let's start with a
demo.
This is a two-for-one.
We're going to do two apps here.
AVCam is the venerable piece of
sample code that shows how to
take photos and movies using
AVFoundation.
And notice here, though we've
added depth support to it, you
don't see anything happening
with depth.
That's because while I'm able to
take a picture of these pencils
here, you don't actually see a
representation of the depth, but
it was stored in the photo.
So when I go into the Photos app
and I look at it, and let's say
I go into the Editing menu, look
what popped up, Depth at the
top.
So I can now touch the Depth and
suddenly it will apply that blur
effect to the background, which
is pretty cool.
So now photos that you take in
your app are eligible to have
the shallow depth of field
effect applied to them as well.
That's pretty cool.
We can also do other more
interesting things with depth,
knowing now that we've got them
in all of these photos.
And by the way, in iOS 11 all of
the photos that you take in the
Portrait mode are now storing
depth information in the photos,
so they are fodder for your new
creative apps.
I'm going to use this app called
Wiggle Me to show some creative
things that you can do with the
depth.
I'll select an easy one for
beginning.
What it's doing is taking
something that was flat and it's
re-projecting it out into a 3D
space and it's kind of rolling
it around, or I can just stop it
from rolling and I'm just going
to use the gyro to move my phone
around.
Isn't that a neat effect?
It sort of comes to life.
I'm going to pick a different
one.
I really like the dog.
The dog looks great.
So now he kind of moves around
from side-to-side.
You can also do something which
is force the perspective to
change.
Knowing where the depth is, you
can mess with the depth, like
this.
Dolly zoom [laughter].
Dolly zoom, dog in your face.
I prefer to rotate it while dolly zooming, because it's sort of like a gangster dog.
I think the appropriate music
for this part would be "Rolling
in the Deep," don't you?
[ Group groaning ]
You guys are doing a great job.
I appreciate.
I really appreciate it.
Okay. When taking photos with
depth, we support a wide gamut
of capture options.
You can do flash captures with
depth, you can do still image
stabilization with depth.
You can even do auto exposure
brackets, such as a plus 2,
minus 2, 0 EV.
You can do Live Photos with the
depth stored in the photo
itself.
AVCapturePhotoOutput is what you
need to use to get photos with
depth.
This is a class that we
introduced last year as the
successor to
AVCaptureStillImageOutput.
It excels at handling complex
photo requests.
I'm talking about a request
where you expect to get multiple
assets and they need to be
tracked and delivered, such as
you're going to get a raw and a
JPEG, and a live photo movie, et
cetera.
You could get multiple things
and they're coming in at
different points.
The programming model is that
you fill out a request, which is
called an
AVCapturePhotoSettings, you
initiate the photo capture by
passing the request and the
delegate to be called later.
And the photoOutput is the one and only interface for capturing Live Photos, Bayer RAW images, and Apple P3 wide-color images.
Also, now in iOS 11 it is the
one and only way to capture HEIF
file format, which was mentioned
in the keynote.
A great many changes needed to
be made to the
AVCapturePhotoOutput to support
HEIF and so in iOS 11, to
accommodate those great many
changes, we have added a new
delegate callback.
It's a simple one.
This is a replacement for the
callbacks where you would get a
sample buffer.
Instead, you now get a new
object called an AVCapturePhoto.
AVCapturePhoto is the only
delivery vehicle for depth, so
if you want depth, you need to
opt in by implementing this new
delegate callback.
In addition, you need to
explicitly opt in for
DepthDataDelivery before
starting your session.
Why? Well, remember, the dual
camera needs to do some special
behavior when it's doing depth.
It needs to zoom up to 2X so
that the focal lengths match,
and it needs to lock itself
there so that you're not
zooming.
So the way that you do that is
before you start running your
session, you tell the
photoOutput I want
DepthDataDeliveryEnabled, and
then on a per photo request
basis, that would be when you
actually snap the photo, you
would fill out a settings object
and say, again, I want depth in
this particular photo.
Then you work with the resulting
AVCapturePhoto that comes back
and it has an accessor called depthData, which returns an AVDepthData.
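In code, the whole opt-in dance looks something like this sketch (delegate wiring and error handling elided; treat it as illustrative):

    import AVFoundation

    // 1. Before startRunning(), opt in on the photo output.
    let photoOutput = AVCapturePhotoOutput()
    // (add photoOutput to a dual camera session, then:)
    photoOutput.isDepthDataDeliveryEnabled =
        photoOutput.isDepthDataDeliverySupported

    // 2. Per photo request, ask for depth in the settings.
    let settings = AVCapturePhotoSettings()
    settings.isDepthDataDeliveryEnabled = true
    photoOutput.capturePhoto(with: settings, delegate: self)

    // 3. In the new unified callback, read the depth off the AVCapturePhoto.
    func photoOutput(_ output: AVCapturePhotoOutput,
                     didFinishProcessingPhoto photo: AVCapturePhoto,
                     error: Error?) {
        if let depthData = photo.depthData {
            // The photo's map is over twice the resolution of the streaming map.
            _ = depthData
        }
    }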
Wow, that AVDepthData, it's
everywhere.
It's like pervasive.
It's like deeply integrated into
the API.
[ Group groaning ]
On iOS most AVCaptureDevice
formats have the ability to take
higher resolution stills than
their streaming resolution.
Looking at our formats that
support depth on iPhone 7 Plus,
here you see the streaming video
resolution compared to the high
res photo resolution that you
get.
So, for instance, for photo, if
you're streaming, you only get
screen-sized buffers, but you
get 12-megapixel stills.
The same holds true for depth.
Remember what I told you that
when we're streaming depth,
there's a lot of work to be done
in a real-time fashion to meet
that 24 fps, but when doing a
photo, we have a little extra
time since it doesn't need to be
delivered real time, so we can
give you a very high quality,
great looking map that's over
twice the resolution of the
streaming.
The aspect ratio always matches
that of the video.
So if you're doing 16x9 video,
you get a 16x9 map.
All right.
Now it's time to talk about the
dirty little subject of
distortions.
The depth maps that we capture
and embed in photos are
distorted.
I'm sorry to be the bearer of
that news, but it's actually a
good thing.
Let me explain why.
All the camera diagrams that I
showed you up to this point were
pinhole cameras.
Pinhole cameras have no lenses
so the images are rectilinear;
that is, light passes through
the little aperture in straight
lines and presents a
geometrically perfect replicated
inverted object on the image
plane.
So if you had a perfect grid of
squares like this and you took a
picture of it with a pinhole
camera, it would look like this
on the image plane, but
upside-down.
So straight lines would remain
straight.
Unfortunately, in the real world
we need to let more light in, so
we need lenses, and lenses have
radial distortions.
These distortions are present in
the captured images as well
because they were sort of bent
in slightly odd ways to get to
the image sensor.
And in an extreme case, straight
lines captured through a bad
lens might look something like
this.
This is no good for finding
disparity, since two images need
to be matched to find features.
Well, if camera one has got a
set of distortions and camera
two has got a different set of
distortions, how are you going
to find the same set of features
in those two images since
they're warped differently?
I left out an important step
when I described how we
calculate disparities and I'm
going to fill it in right now.
Before comparing the tele and
the wide images, we have to do
an extra step.
We have to make those warped
images rectilinear; that is, we
unwarp them using a calibrated
set of coefficients and those
characterize the lens'
distortions.
After each image is corrected, they look like this.
Satisfying: straight lines are straight.
Now we can, with certainty,
compare points in the two images
and find a perfect, real-world,
rectilinear disparity map, which
looks like that.
Now we have the opposite
problem.
The disparity map matches the
physical world, but it doesn't
match the image that we just
took, which has warping due to
the lens, so now we have to do
another step, which is to rewarp the disparity map back to match the image.
We use a set of inverse lens coefficients to do this, and the final disparity
map has the same geometric
distortions as its accompanying
image.
So I said that this was a good
thing.
Let me explain why.
It means that out of the box our
depthDataMaps that come with
photos are meant for filters,
for effects.
They always match the image that
they accompany.
So if you're working on effects, if you want to do stuff like the Wiggle Me app or the interesting image effects I showed at the very beginning, they're perfect for that.
What they're not perfect for is
reconstructing a 3D scene.
If you want to do that, you
should make them rectilinear,
and you can do that.
I'm going to talk about that in
a minute.
I'd like to just touch briefly
on the physical structure of the
depth data in our image files.
In iOS 11 we support two kinds
of images with depth.
The first is HEIF HEVC, the new
format, also called HEIC files,
and there, there is first-class
support for depth.
There's an area inside the file
called the auxiliary image,
which can store a disparity or a
depth or an alpha map, and
that's where we store it.
We encode it as monochrome HEVC,
and we also store metadata
that's important for working
with that depth, such as
information about whether or not
it was filtered, what is its
accuracy, camera calibration
information like lens
distortions, and also some
rendering instructions.
All of those are encoded as XMP
along with the auxiliary image.
The second format we support is
JPEG.
Boy, JPEG wasn't meant to do
tricks like this, but we made it
do this trick anyway.
The map is 8-bit lossy JPEG if it's filtered, or, if it has NaNs in it, we use 16-bit lossless JPEG encoding to preserve all of the NaNs, and we store it as a
second image at the bottom of
the JPEG, so it's like a
multipicture object, if you're
familiar with that.
Again, we store the metadata as
XMP, just as we do with HEIF
HEVC.
All right.
On to the most requested
developer feature for the dual
camera, and that's dual photo
capture.
What do I mean by this?
So far, when you use the dual
camera and take a picture, you
still just get one image.
It's either from the wide or
it's from the tele, depending
where you're zoomed, or if
you're in the area between one
and 2X you might get portions of
both as we do some blending to
make an even nicer picture, but
you still only get one.
You've been clamoring for both
images and that's what we're
giving you now.
With a single request, you can
get both the wide and the tele
in their full 12-megapixel glory
and you can do whatever the heck
you want with them.
[ Applause ]
Here's how you do it.
It's very similar to opting in
for depth.
Before starting the capture
session, you need to opt in by
telling the photoOutput I'm
going to ask for dual photo so
enable it.
And then as you are capturing on
a per photo request basis you
can fill out your settings by
saying I would like this
particular photo to be a dual
photo, give me both wide and
tele.
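As a sketch, using the iOS 11 spellings of these properties (delegate wiring elided):

    import AVFoundation

    // Opt in before the session starts running.
    photoOutput.isDualCameraDualPhotoDeliveryEnabled =
        photoOutput.isDualCameraDualPhotoDeliverySupported

    // Then, on a per photo request basis:
    let settings = AVCapturePhotoSettings()
    settings.isDualCameraDualPhotoDeliveryEnabled = true
    photoOutput.capturePhoto(with: settings, delegate: self)
    // Expect double the usual photo callbacks: one wide, one tele.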
When you do that, the number of
photo callbacks that you get
doubles.
It's not just that you get two
callbacks.
Let's say you're asking for RAW
plus HEIF dual photo.
Well, that would be four because
you're going to get two wides
and two teles of RAW and HEIF.
So whatever you were expecting
to get before, the number of
callbacks will double.
Now, we support the same gamut of features that we do with depth: you can do flash with dual photo, auto SIS (still image stabilization), exposure brackets, and you can optionally get depth if you need it.
How do we deal with zoom?
This is a problem of security
and confidence.
Let's say that your app only
shows the field of view of the
tele.
Well, the wide-angle camera has
more information, so if you take
a picture, you're actually
giving people something outside
of the viewable area and that
might be a privacy concern.
So if you are zooming, we
deliver dual photos, but with
the outside blackened so that
they match the field of view
that's seen in preview.
If you want the full images, you can; just don't set the zoom to anything other than 1.
How do you know if it has this
blackened area on the outside?
Well, inside the image we store
a clean aperture rectangle that
defines the area with valid
pixels.
Dual photos can be delivered
with camera calibration data,
too.
Camera calibration data is the
kind of data that you need to do
augmented reality, virtual
reality, lens distortion
correction, et cetera.
So with both a wide and a tele
and camera calibration data, you
can make your own depth maps.
I challenge you to make one
better than Apple does.
You can also augment reality, of
course, because you get the
intrinsics.
Let's talk about the individual
properties of camera
calibration.
This is the last object that I'm
going to introduce tonight.
The AVCameraCalibrationData is
our model class for camera
calibrations.
Where does it live?
Well, if you ask for depth, you
get it with an AVDepthData.
It is a property of that.
You can also get it if you've
opted in from an AVCapturePhoto.
So you opt in by saying I would like camera calibration data with this photo, which works rather nicely.
If you're doing dual photo
capture, you ask for dual photo
and you ask for the camera
calibrations, you get two photo
callbacks and you get the
calibrations for the wide, with
the wide result and the tele
with the tele result.
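Here's that opt-in as a sketch (again illustrative; the callback receives the wide calibration with the wide photo and the tele calibration with the tele photo):

    import AVFoundation

    // Request calibration data with the photo, alongside dual photo delivery.
    let settings = AVCapturePhotoSettings()
    settings.isDualCameraDualPhotoDeliveryEnabled = true
    settings.isCameraCalibrationDataDeliveryEnabled = true
    photoOutput.capturePhoto(with: settings, delegate: self)

    // Later, in the AVCapturePhoto callback:
    func photoOutput(_ output: AVCapturePhotoOutput,
                     didFinishProcessingPhoto photo: AVCapturePhoto,
                     error: Error?) {
        guard let calibration = photo.cameraCalibrationData else { return }
        let intrinsics = calibration.intrinsicMatrix            // matrix_float3x3
        let dimensions = calibration.intrinsicMatrixReferenceDimensions
        let extrinsics = calibration.extrinsicMatrix            // rotation + translation
        _ = (intrinsics, dimensions, extrinsics)
    }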
What does an intrinsicMatrix
look like?
I hope this is a little bit
familiar, since it's the same as
what we looked at earlier for
the streaming VideoDataOutput
case.
Again, it's a 3x3 matrix in the CameraCalibrationData, and it's used for going from the 3D space to the 2D space when flattening an image.
You can apply the inverse when
going back to the 3D space.
It has pixel focal lengths which, again, are two different numbers, but because we have square pixels, they are the same number.
And it also has an x and y for
the optical center.
The pixel values are given at a
resolution of a reference frame.
Again, the depth data might be
very low resolution.
We don't want to give it to you
at that low resolution so,
therefore, we provide a separate
set of dimensions.
Typically, they're the full size
of the sensor, therefore, you
get a lot of accuracy, a lot of
resolution in the
intrinsicMatrix.
Next is the extrinsicMatrix.
This is a property that
describes the camera's pose in
the world.
You need it when you're working
with images from
stereo-rectified cameras to
triangulate where one is
compared to another one.
And our extrinsics are presented
as a single matrix, but kind of
two matrices squashed together.
So the first one, the one on the
left, is the rotation matrix.
It's a 3x3 that describes how
the camera is rotated with
respect to the world origin,
wherever that happens to be.
And there's also a 1x3 matrix
describing the camera's
translation, or sort of distance
from the world origin.
It's important to note that the
tele camera is the origin of the
world when you're using the dual
camera, which makes it very
easy.
If you're just getting a tele
image, the matrix that you get
will be an identity matrix.
If you're working with wide and
tele, then the wide will, of
course, not be an identity
matrix, since it's describing
its pose and distance from the
tele camera.
But using the extrinsics, you
could, for instance, compute the
baseline between the wide and
the tele.
There are also several
properties dealing with the
geometric distortions of the
lens, as we talked about
earlier.
These are useful for when you
need to make either an image or
a depth map rectilinear.
There are two properties that
you need to be concerned with.
The first is
lensDistortionCenter.
This describes the point on the
sensor that coincides with the
center of the lens' distortion.
This is frequently different
from the optical center of the
lens.
It's like if you looked at all
of the distortions, radial
distortions on the lens sort of
like tree rings, this would be
the center of the tree rings.
Also, along with this distortion
center we have a
lensDistortionLookupTable, which
you can think of as being a
number of floating point dots
connecting the
lensDistortionCenter to the
longest radius.
Again, if you drew little
circles from each of these dots,
you would get something that
looks like tree rings that would
show you the radial distortions
of the lens.
The lensDistortionLookupTable is
a C array of floats that are
wrapped in a Data object.
If each and every point along
those dotted lines was a 0, you
would have the one and only
perfect lens in the world; it would have no radial distortions at all.
If there is a positive value, it
indicates that there is a
lengthening of the radius there.
If you have a negative value, it
indicates that it was shrunk
there.
But looking at this entire table
together, you can sort of get a
feel for where the bumps in the
lens are.
To apply distortion correction
to an image you'd begin with an
empty destination buffer and
then iterate through it
row-by-row and for each point
you would use the
lensDistortionLookupTable to
find the corresponding value in
the distorted image, and then
write that value to the right
position in your output buffer.
This is extremely tricky code to
write.
We know this.
So, we've provided a reference
implementation for you in
AVCameraCalibrationData.h. We
actually put code in a header
file.
It's all commented out.
It's a big Objective-C function.
Please take a look at it.
It describes how to rectify an
image or how to rewarp an image,
depending on which table you
pass it.
There is also, as you might
expect, the inverse of that
table, which describes how to go
from the warped back to
unwarped.
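To give a feel for the mechanics, here's a simplified sketch of evaluating the lookup table for a single point, loosely following the commented-out reference implementation in AVCameraCalibrationData.h (the function name and structure are mine; for real work, use Apple's version from the header):

    import Foundation
    import CoreGraphics

    // Map a point through the lens distortion lookup table. Pass the table
    // to warp, or the inverse table to unwarp, per the header's reference code.
    func correctedPoint(for point: CGPoint,
                        table: Data,
                        distortionCenter: CGPoint,
                        imageSize: CGSize) -> CGPoint {
        // Radius from the distortion center to this point, plus the maximum
        // radius (to the farthest corner) used to normalize it.
        let dx = Float(point.x - distortionCenter.x)
        let dy = Float(point.y - distortionCenter.y)
        let radius = (dx * dx + dy * dy).squareRoot()
        let maxX = Float(max(distortionCenter.x, imageSize.width - distortionCenter.x))
        let maxY = Float(max(distortionCenter.y, imageSize.height - distortionCenter.y))
        let maxRadius = (maxX * maxX + maxY * maxY).squareRoot()

        // Look up the magnification for this radius, interpolating linearly
        // between the two nearest entries in the table of floats.
        let values = table.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
        guard values.count >= 2, maxRadius > 0 else { return point }
        let position = min(radius / maxRadius, 1) * Float(values.count - 1)
        let index = min(Int(position), values.count - 2)
        let fraction = position - Float(index)
        let magnification = values[index] + fraction * (values[index + 1] - values[index])

        // Positive values lengthen the radius; negative values shrink it.
        return CGPoint(x: distortionCenter.x + CGFloat(dx * (1 + magnification)),
                       y: distortionCenter.y + CGFloat(dy * (1 + magnification)))
    }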
It's really easier to show you
with a demo.
Let's do a demo.
This will be our fourth and
final sample app of the day, and
it's called Straighten Up.
I bet you can guess what it
does.
This is an app that uses the
AVCameraCalibrationData,
specifically the lens distortion
characterizations, to make an
image rectilinear.
This morning I went outside and
I took a series of dual photos.
You can tell that they're dual
photos and that I was zoomed in
to 2X because of the black
border around them.
This one is, of course, from the
tele and this is the distorted
image.
Now, when I press the Undistort
button, you'll see something
that's a little bit subtle.
You can definitely see it, but
it's pretty subtle.
Typically telephoto lenses have
less curvature so they have
fewer radial distortions at the
edges than wide lenses.
I'll zoom in to a portion so you
can see the difference.
This is rectilinear, straight
lines are straight, and this is
distorted.
Now, if I go into the wide
image, again, we don't have the
corners, but you can see in the
image data that we do have
already that the distortions are
more prominent.
So distorted, undistorted.
Distorted, undistorted.
You can definitely see around
the edges that they are pulling
in more.
Undistorted.
Distorted, undistorted.
All right.
Back to slides.
Time for the wrap-up.
iPhone 7 Plus dual camera is not
a time-of-flight camera system.
It's a?
>> [Group] Disparity.
>> Disparity system.
If you leave with only one piece
of knowledge, it's that, and I
hope you know how disparity
differs from depth.
Also, the canonical
representation on our platform
for depth is AVDepthData.
We learned about intrinsics,
extrinsics, lens distortion
info.
These are all properties of an
AVCameraCalibrationData.
We learned about the
AVCaptureDepthDataOutput, and
that it provides streaming depth
which you can filter, or not.
And we learned that you can
capture photos using
AVCapturePhotoOutput and have
depth delivery enabled.
Finally, we spent a little bit
of time talking about the dual
camera, dual photo delivery
which produces a wide and a tele
for a single image with which
you can do interesting computer
vision tasks, and I hope you do.
We have three pieces of sample
code that are available right
now and are associated with this
session; AVCam, AVCamPhotoFilter and
Wiggle Me.
For more information, here is
the URL for the site.
And don't tune me out just yet.
Directly following this session
there is an informal
get-together for developers with
an interest in photography.
[Whispering] That's all of you.
So you can come and mingle with
members of the Apple Media
Technologies Group.
You can ask questions, of
course, or we can just talk and
socialize.
Tomorrow there is a sister
session at 11:00 a.m. where
you'll learn how to read and
manipulate the depth data that's
in image files.
Today we just briefly touched on
the surface of what you can do
with images with depth.
Tomorrow you get a whole host of
demos.
So I really hope you'll make
time for that tomorrow.
If you do, I'd deeply appreciate
it [group groaning].
And finally, I will be
presenting a dedicated session
on working with HEIF on Friday
morning that I also hope you'll
attend.
In that one I will delve deeply
into the AVCapturePhoto
interface.
Thank you, and enjoy the rest of
the show.
[ Applause ]