Transcript
[ Music ]
>> Multi-camera capture or as we
like to call it internally,
MultiCam.
MultiCam is our single most
requested third-party feature.
We hear it year after year in
the labs.
So what we're talking about here
is the ability to capture video,
audio, metadata, depth and
photos from multiple cameras and
microphones simultaneously.
Third parties aren't the only
ones who benefit from this
though.
We've had many and repeated
requests from first-party
clients as well for MultiCam
capture.
Chief among them is ARKit.
And if you heard the keynote,
you heard about the introduction
of ARKit 3.
These APIs use front camera for
face and pose tracking while
also using the back camera for
world tracking which helps them
know where to place virtual
characters in the scene by
knowing what you're gazing at.
So we've supported MultiCam on
the Mac since the very first
appearance of AVFoundation way
the heck back in Lion.
But on iOS, AVFoundation still
limits clients to one active
camera at a time.
And it's not because we're mean.
There were good reasons for it.
The first reason is hardware
limitations.
I'm talking about cameras
sharing power rails and not
physically being able to provide
enough power to power two
cameras simultaneously full
bore.
And the second reason was our
desire to ship a responsible
API, one that would help you not
burn the phone down when doing
all of this processing
with multiple cameras
simultaneously.
So we wanted to make sure that
we delivered something to you
that would help you deal with
the hardware, thermal and
bandwidth constraints that are a
reality in our world.
All right, so great news in iOS
13, we do finally support
MultiCam capture, and we do it
on all recent hardware, iPhone
XS, XS Max, XR and the new iPad
Pro.
On all of these platforms, the
aforementioned hardware
limitations have been solved
thankfully.
So let's dive right in to the
fun stuff.
We've got a new set of APIs for
building MultiCam sessions.
Now, if you've used AVFoundation
before for camera capture, you
know that we have four main
groups of classes: inputs,
outputs, the session and
connections.
The AVCaptureSession is the
center of our world.
It's the thing that marshals
data.
It's the thing that you tell to
start or stop running.
You add to it one or more
inputs, AVCaptureInputs.
One such input is the
AVCaptureDeviceInput, which is a
wrapper for either a camera or a
microphone.
You also need to add one or more
AVCapture outputs to receive the
data.
Otherwise, those producers have
nowhere to put it.
And then the session
automatically creates
connections on your behalf
between inputs and outputs that
have compatible media types.
So note what I'm showing you
here is the traditional
AVCaptureSession, which on iOS
only allows one camera input per
session.
New in iOS 13, we're introducing
a subclass of AVCaptureSession
called AVCaptureMultiCamSession.
So this lets you do multiple ins
and outs.
AVCaptureSession is not
deprecated.
It's not going away.
In fact, the existing
AVCaptureSession is still the
preferred class when you're
doing single-cam capture.
The reason for that is that
MultiCamSession, while being a
power tool has some limitations,
and I'll address those later.
All right, so let me give you an
example of a bread-and-butter
use case for our new
AVCaptureMultiCamSession.
Let's say you want to add two
devices, one for the front and
one for the back camera to a
MultiCamSession and do two video
data outputs simultaneously, one
receiving frames from the back
camera, one from the front.
And then let's say if you want
to do a real-time preview, you
can add separate
VideoPreviewLayers, one for the
front, one for the back.
You needn't stop there though.
You can do simultaneous metadata
outputs if you want to do
simultaneous barcode scanning or
face detection.
You could do multiple movie file
outputs if you want to record
one for the front and one for
the back.
You could add multiple photo
outputs if you want to do
real-time capture of photos from
different cameras.
So as you can see, these graphs
are starting to look pretty
complicated with a lot of arrows
going from a lot of inputs to a lot
of outputs.
Those little arrows are called
AVCaptureConnections, and they
define the flow of data from an
input to an output.
Let me zoom in for a moment on
the device input to illustrate
the anatomy of a connection.
Capture inputs have AVCapture
input ports, which I like to
think of as little electrical
outlets.
You have one outlet per media
type that the input can produce.
If nothing is plugged into the
port, no data flows from that
port, just like an electrical
outlet.
You have to plug something in to
get the electricity.
Now, to find out what ports are
available for our particular
input, you can query that input's
ports property, and it will tell
you "I have this array of
AVCapture input ports."
So for the dual camera, these
are the ports that you would
find, one for video, one for
depth, one for metadata objects
such as barcode scanning and
faces and one for metadata items
which can be hooked up to a
movie file output.
Now, whenever you use
AVCaptureSession's add input
method to add an input to the
session or add output to add an
output to the session, the
session will look for compatible
media types and implicitly form
connections if it can.
So here we had a
VideoDataOutput.
VideoDataOutputs accept video, and we had an
electrical plug that can produce
video, and so the connection was
made automatically.
That is how most of you are
accustomed to working with
AVCaptureSession if you've
worked with our classes before.
MultiCamSession is a different
beast.
That is because you now have
multiple inputs and multiple
outputs.
You probably want to make sure
that the connections are
happening from A to A and B to B
and not crossing where you
didn't intend them to.
So when building a
MultiCamSession, we urge you not
to use implicit connection
forming but instead use these
special purpose adders,
addInputWithNoConnections or
addOutputWithNoConnections.
And there's likewise one that
you can use for the video preview
layer, setSessionWithNoConnection.
When you use these, it basically
just tells the session "Here are
these inputs, here are these
outputs.
You now know about them, but
keep your hands off them.
I'm going to add connections as
I want to later on manually."
The way you do that is you
create the AVCaptureConnection
yourself by telling it "I want
you to connect this port or
ports to this output," and then
you tell the session, "Please
add this connection," and now
you're ready to go.
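In code, a minimal sketch of that manual wiring for one camera might look like this (the device choice and error handling are just placeholders; the second camera follows the same pattern):

```swift
import AVFoundation

let session = AVCaptureMultiCamSession()
session.beginConfiguration()

// Wrap the back camera in a device input and hand it to the session
// without letting the session form any connections on its own.
guard let backCamera = AVCaptureDevice.default(.builtInWideAngleCamera,
                                               for: .video, position: .back),
      let backInput = try? AVCaptureDeviceInput(device: backCamera),
      session.canAddInput(backInput) else { fatalError("Could not add back camera input") }
session.addInputWithNoConnections(backInput)

// Likewise add the output with no connections.
let backVideoDataOutput = AVCaptureVideoDataOutput()
guard session.canAddOutput(backVideoDataOutput) else { fatalError("Could not add output") }
session.addOutputWithNoConnections(backVideoDataOutput)

// Find the input's video port and create the connection ourselves.
guard let backVideoPort = backInput.ports.first(where: { $0.mediaType == .video }) else {
    fatalError("No video port on the back camera input")
}
let backConnection = AVCaptureConnection(inputPorts: [backVideoPort],
                                         output: backVideoDataOutput)
if session.canAddConnection(backConnection) {
    session.addConnection(backConnection)
}

session.commitConfiguration()
```

A preview layer follows the same pattern: call setSessionWithNoConnection on the layer, then add an AVCaptureConnection created with init(inputPort:videoPreviewLayer:).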
That was very wordy.
It's better shown than talked
about, so I'd like to bring up
Nik Gelo, also from the Camera
Software group to demonstrate
AVMultiCamPIP.
Nik?
>> Thanks, Brad.
AVMultiCamPIP is an app that
demonstrates streaming from the
front and back camera
simultaneously.
Here we have two video previews,
one displaying the front camera
and one displaying the back
camera.
And when I double-tap the
screen, I can swap which camera
appears full-screen and which
camera appears PIP.
[ Applause ]
Now, we can see here that Brad
is live at Apple Park.
And before I ask him a few
questions, I will press the
Record button here at the bottom
to watch his conversation later.
Hey, Brad.
So tell me, how's it going over
at Apple Park?
>> Nik, it's pandemonium here at
Apple Park.
As you can see in front of the
reflecting pool, there's all
kinds of activity happening.
I hear a rushing of water.
Sounds like I'm about to be
drenched at any moment.
I hear wild animals behind me
like ducks or something.
I honestly fear for my life
here.
>> Well, Brad, that seems
absolutely terrifying.
Hope you stay safe out there.
>> Okay, thanks.
>> Got it.
>> So now that we finished
recording the movie, let's take
a look at what we just recorded.
Here we have the movie.
As you can see, when I swap
between the two cameras, it
swaps just like we did when
using the app.
And that's AVMultiCamPIP.
Back to Brad.
[ Applause ]
>> Thanks, Nik.
Awesome demo.
All right, so let's look at
what's happening under the hood
in AVMultiCamPIP.
So we have two device inputs,
one for the front camera, one
for the back camera added with
no connections as I mentioned
before.
We also have two video data
outputs, one for each and two
VideoPreviewLayers.
Now, to place them onscreen,
it's just a matter of taking
those VideoPreviewLayers and
ordering them so that one is on top
of the other and one is sized
smaller.
And when Nik double-tapped them,
we simply reposition them and
reverse the Z ordering.
Now, there is some magic
happening in the Metal Shader
Compositor code.
There it's taking the frames from
those two VideoDataOutputs and
compositing them so that the
smaller PIP is arranged within
one frame. It composites them
into a single video buffer and
then sends that buffer to an
AVAssetWriter, where it's recorded
to one video track in a movie.
This sample code is available
right now.
It's associated with the
session.
You can take a look and start
doing your own MultiCam
captures.
All right, time to talk about
limitations.
While AVCaptureMultiCamSession is a
power tool, it doesn't do
everything, and let me tell you
what it does not do.
First up, you cannot pretend
that one camera is two cameras.
AVCaptureDeviceInput API will
let you create multiple
instances for, say, the back
camera.
You could make 10 of them if you
want.
But if you try to add all those
instances to one
MultiCamSession, it'll say
"Uh-uh," and it will throw an
exception.
Please, only one input per
camera in a session.
Also, you're not allowed to
clone a camera to two outputs of
the same type such as taking one
camera and splitting its signal
to two video data outputs.
You can, of course, add multiple
cameras and connect them to a
VideoDataOutput each, but you
cannot fan out from one to many.
The opposite holds true as well.
AVCapture outputs on iOS do not
support media mixing.
So all the data outputs only can
take a single input.
You can't, for instance, try to
jam two camera sources into a
single data output.
It wouldn't know what to do with
the second video since it
doesn't know how to mix them.
You can, of course, use separate
video data outputs and then
composite those buffers in your
own code, such as the Metal
Shader Compositor that we used
in MultiCamPIP.
You can do that however you
like, but as far as session
building is concerned, do not
try to jam multiple cameras into
a single output.
All right, a word about presets.
The traditional AVCaptureSession
has this concept of a session
preset, which dictates a common
quality of service for the whole
session.
And it applies to all inputs and
outputs within that session.
For instance, when you set the
sessionPreset to high, the
session configures the device's
resolution and frame rate and
all of the outputs so that they
are delivering a high-quality
video experience such as
1080p30.
Presets are a problem for
MultiCamSession.
Think again about something that
looks like this.
MultiCamSession configurations
are hybrid; they're
heterogeneous.
What does it mean to have high
quality for the whole thing?
You might want to do different
qualities of service on
different branches of the graph.
For instance, on the front
camera you might want to just do
a low-resolution preview such as
640 by 480, while also
simultaneously doing something
really high-quality, 1080p60,
for instance, on the back.
Well, obviously, we don't have
presets for all of these hybrid
situations.
We've decided to keep things
simple in MultiCamSession.
It does not support presets.
It supports one and only one
preset, which is inputPriority.
So that means it leaves the
inputs and outputs alone when
you add them.
You must configure the active
format yourself.
All right.
On to the cost functions.
I mentioned at the beginning
that we took our time with this
MultiCam support, because we
wanted to deliver a very
responsible API, one that could
help you account for the various
costs that you incur when
running multiple cameras and
lighting up virtually every
block on the phone.
So this is trite but true.
There is no such thing as a free
lunch.
And so this is the part of the
session where I become your
father, and I'm going to give
you the dad talk.
In the dad talk, I will explain
how credit cards work and how
you need to be responsible with
your money and live within your
means and, like, such things.
So it's a fact of life that we
have limited hardware bandwidth
on iOS.
And though we have multiple
cameras, and thus multiple
sensors, we only have one ISP, or
image signal processor.
So all the pixels going through
those sensors need to be
processed by a single ISP, and
it is limited by how many pixels
it can run per clock at a given
frequency.
So there are limiters to the
number of pixels that you can
run at a time.
The contributors to the hardware
cost are, as you would expect,
video resolution.
Higher resolution means more
pixels to cram through there.
The max frame rate.
If you're delivering those
pixels faster, it's got to do
more pixels per clock as well.
And then a third one which you
may or may not have heard of is
called sensor binning.
Sensor binning refers to a way
to combine information in
adjacent pixels to reduce
bandwidth.
So, for instance, if we have an
image here, and we do a 2 by 2
binning, it's going to take 4
pixels in a square and sum them
into one so that we get a
reduction in size by 4x.
It gives you a reduction in
noise.
It gives you a reduction in
bandwidth.
It gives you 4x intensity per
pixel.
So there are a lot of great
things about sensor binning.
The downside is that you get a
little reduction in image
quality.
So diagonal lines might look a
little stair-stepped.
But their most redeeming quality
is that binned formats are super
low power.
In fact, whenever you use ARKit
with a camera, you are using a
binned format, because ARKit
uses binned formats exclusively
to save on that power for all
the interesting AR things that
you'd like to do.
All right.
How do we account for cost, or
how do we report those costs?
MultiCamSession tallies up your
hardware cost as you configure
your session.
So each time you change
something, it keeps track of it,
just like going to an online
store and putting things into a
shopping cart before you pay for
them.
You know when you're getting
close to your limit on your
budget, and you can kind of try
things out and then put new
things in or move old things
out.
You see the cost before you have
to pay.
It's the same with
MultiCamSession.
We have a new property called
hardwareCost.
And this hardwareCost starts at
zero when you make a brand-new
session.
And it increments as you add
more features, more inputs, more
outputs.
And you're fine as long as you
stay under 1.0.
Anything under 1.0 is runnable.
The minute you hit 1.0 or
greater, you're in trouble.
And that's because the ISP
bandwidth limit is hard.
It's not like you can, you know,
deliver every other frame.
No, this is an all-or-nothing
proposition.
You have to either make it or
you don't.
So if you're over 1.0 and you
try to run the
AVCaptureMultiCamSession, it'll
say "Uh-uh."
It'll give you a notification of
a runtime error indicating that
the reason it had to stop is
because of a hardware cost
overage.
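A minimal sketch of checking that tally yourself, assuming `session` is your configured AVCaptureMultiCamSession:

```swift
// Anything under 1.0 is runnable; 1.0 or greater means the ISP can't keep up.
if session.hardwareCost < 1.0 {
    session.startRunning()
} else {
    // Over budget: drop a resolution, switch to a binned format, or cap the
    // frame rate with the override discussed in a moment.
    print("Hardware cost is \(session.hardwareCost); trim the configuration first")
}
```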
Now you're probably wondering,
"How do I reduce that cost?"
The most obvious way you can do
it is to pick a lower
resolution.
Another way you can do it is if
you want to keep the same
resolution, if there is a binned
format at the same resolution,
pick that one instead.
It's a little bit lower quality
but way lower in power.
Next, you would think that
lowering the frame rate would
help, but it doesn't.
The reason is that
AVCaptureDevice allows you and
has allowed you since, I think,
iOS 4 to change the frame rate
on the fly.
So if you have a 120 FPS format
and you say, "Set the active
format to 60," you still have to
pay the cost for a 120, not 60,
because at any point while
you're running, you could
increase the frame rate up to
120.
We must assume the worst case.
But good news.
We're now offering an override
property on the
AVCaptureDeviceInput.
By setting it, you can turn a
high frame-rate format into a
lower frame-rate format by
promising that you will go no
higher than a particular frame
rate.
Now, this is a point of
confusion in our APIs.
We don't talk about frame rates
as rates.
We talk about them as durations.
So to set a frame rate, you set
1 over the duration.
That's the same as the frame
rate.
So if you want to take a 60 FPS
format and make it into a 30 FPS
format, you do that by making a
CMTime with 1 over 30, which is
the duration, and then setting the
deviceInput's
videoMinFrameDurationOverride to
that thirtyFPS duration.
Congratulations, you've just
turned a 60 FPS format into a 30
FPS format, and you only pay the
hardware cost for 30.
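A minimal sketch of that override, assuming `deviceInput` is the AVCaptureDeviceInput for a 60 FPS format:

```swift
import CoreMedia

// Frame *duration* of 1/30 second, i.e. a 30 FPS ceiling.
let thirtyFPS = CMTime(value: 1, timescale: 30)

// Promise the session we'll never exceed 30 FPS on this input, so it only
// charges the hardware cost for 30 FPS rather than the format's 60 FPS max.
deviceInput.videoMinFrameDurationOverride = thirtyFPS
```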
I should also mention that there
is a great function in the
AVMultiCamPIP app that shows how
to iteratively reduce your cost.
It's a recursive function that
kind of picks things that are
most important to it, and it
throttles down things that are
less important until it gets
under the hardware cost.
Now, next up is system pressure
cost.
This is the second big
contributor that we report.
As you're well aware, phones are
extremely powerful computers in
little bitty thermally
challenged packages.
And in iOS 11, we introduced
camera system pressure states.
These help you monitor the
camera's current situation.
Camera system pressure consists
of system temperature, that is
overall OS thermals, peak power
demands, and that has to do with
the battery.
How much charge does it
currently have?
Is it able to ramp up its
voltage fast enough to meet the
demands of running whatever you
want to do right now?
And the infrared projector
temperature.
On devices that support
TrueDepth camera, we have an
infrared camera as well as an
RGB camera.
Well, that generates its own
heat, and so that's part of the
contribution to system pressure
states.
We have five of them, nominal
all the way up to shutdown.
When the system pressure state
is nominal, you're in great
shape.
You can do whatever you want.
When it's fair, you can still
almost do whatever you want.
But at serious, you start
getting into a situation where
the system's going to throttle
back, meaning you have fewer
cycles for the GPU.
Your quality might be
compromised.
And at critical, you are getting
a whole lot of throttling.
At shutdown, we cannot run the
camera any longer for fear of
hurting the hardware.
So at shutdown, we automatically
interrupt your session, stop it,
tell you that you're interrupted
because of a system pressure
state, and then we wait for the
device to go all the way back to
nominal before we'll let you run
the camera again.
That was all iOS 11.
Now, in iOS 13, we're offering
you a way to account for the
system pressure cost upfront,
okay?
Instead of just telling you
what's happening right now,
which may be influenced by the
fact that you played Clash of
Clans before you restarted the
camera, we now have a way to
tell you what the camera cost as
far as system pressure is,
independent of all other
factors.
So the contributors to this cost
are the same as the ones for
hardware along with a lot of
other ones, such as video image
stabilization or optical image
stabilization.
All of those cost power.
We have a Smart HDR feature,
etc. All of those things listed
here are contributors to overall
system pressure cost.
MultiCamSession can tally that
score upfront just like it does
for hardware, and it will only
account for the factors that it
knows about.
So if you're going to be doing
some wild GPU processing at the
same time, the score won't
include that.
It'll just include what you're
doing with the camera.
Here's how you use it.
By querying the system pressure
cost, you can find out how long
you would be runnable in an
otherwise quiescent system.
So if it's less than 1.0, you
can run indefinitely.
You're a cool customer.
If it's between 1 and 2, you
should be runnable for up to 15
minutes, 2 to 3 up to 10
minutes, and higher than 3, you
may be able to run for a short
little bit.
And, in fact, we will let you
run the camera, even if you're
over 3, but you have to
understand that it's not going
to stay cool very long.
And once it gets up to a
critical or shutdown level, your
session will become interrupted.
So we'll save the hardware even
if you don't want to.
But, hey, it's great.
If you can get what you need to
get done in 30 seconds of
running at a very, very high
system pressure cost, by all
means, do that.
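A minimal sketch of reading that score, again assuming `session` is your AVCaptureMultiCamSession, with the bands mapped as just described:

```swift
let cost = session.systemPressureCost
switch cost {
case ..<1.0:
    print("Can run indefinitely on an otherwise quiescent system")
case 1.0..<2.0:
    print("Expect up to roughly 15 minutes of runtime")
case 2.0..<3.0:
    print("Expect up to roughly 10 minutes of runtime")
default:
    print("Only a short run before the session is interrupted")
}
```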
Now, how do you reduce your
system pressure while running?
I'm not talking about while
you're configuring your session.
I'm talking about once you're
already running and you notice
that you're starting to elevate
in system pressure.
The quickest and easiest way to
do it is to lower the frame
rate.
Immediately, that will relieve
system pressure.
Also, if you're doing things
that we don't know about, such
as heavy GPU or CPU work, you
can throttle that back.
As a last resort, you might try
disabling one or more of the
cameras that you're using.
AVCaptureMultiCamSession has a neat
little feature that, while
running, you can disable one of
the cameras without affecting
preview on the other.
We don't shut everything down.
So if, for instance, you're
running with the front and the
back, you notice that you're way
over budget, and you're soon
going to go critical, you could
choose to shut down the front
camera.
The back camera will keep
previewing.
It won't lose its focus,
exposure, white balance.
And when you disable the last
active input port on the camera
you want to shut down, by setting
that port's enabled property to
false, we will stop that camera's
streaming, save a ton of power
and give the system a chance to
cool off.
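A minimal sketch of that, assuming `frontInput` is the AVCaptureDeviceInput for the camera you want to pause:

```swift
// Disabling the last enabled port on an input stops that camera's streaming
// entirely, while the other camera keeps previewing with its focus, exposure
// and white balance intact.
for port in frontInput.ports {
    port.isEnabled = false
}

// Later, when system pressure recovers, flip the ports back on:
// frontInput.ports.forEach { $0.isEnabled = true }
```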
All right, so I've just talked
about two very important costs,
hardware and system pressure.
There are other costs that we
are not reporting.
I didn't want to trick you into
believing that there aren't
other things at work here.
There are, of course, other
costs such as memory.
But in iOS 13, we are
artificially limiting the device
combinations that we will allow
you to run to the ones that we
are confident will run and will
not get you into trouble.
So we have a limited number of
supported device combinations.
Here I'm listing the ones that
are supported on iPhone XS.
This is kind of an eye chart.
I don't expect you to remember
this.
You can pause the video later.
But there are six supported
configs, and the simple rule to
remember is that you're allowed
to run two physical cameras at a
time.
You might be questioning, like,
Brad, what about config number
one there?
There's only one checkbox.
That's because it's the dual
camera, and the dual camera is a
software camera that's actually
comprised of the wide and the
telephoto, so it is two physical
cameras.
How do you find out if MultiCam
is supported?
Like I said, it's only supported
on newer hardware, so you need
to check if MultiCamSession will
let you run multiple cameras or
not on the device that you have.
There's a class property called
isMultiCamSupported, which lets
you decide right away, yes or no.
And then further when you want
to decide am I allowed to run
this combination of devices
together, you can create an
AVCaptureDevice.DiscoverySession
with the devices that you're
interested in and then ask it
for its new property
supportedMultiCamDeviceSets.
And this will produce an array
of unordered sets that tell you
which ones you're allowed to use
together.
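A minimal sketch of both checks; the device types passed to the discovery session here are just examples:

```swift
import AVFoundation

if AVCaptureMultiCamSession.isMultiCamSupported {
    // Ask which of the devices we care about may be run together.
    let discovery = AVCaptureDevice.DiscoverySession(
        deviceTypes: [.builtInWideAngleCamera, .builtInTelephotoCamera, .builtInTrueDepthCamera],
        mediaType: .video,
        position: .unspecified)

    // Each element is an unordered set of devices supported as a combination.
    for deviceSet in discovery.supportedMultiCamDeviceSets {
        print(deviceSet.map { $0.localizedName })
    }
} else {
    print("MultiCam is not supported on this device")
}
```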
Next up is a way that we are
artificially limiting the
formats that you're allowed to
run.
Last I checked, on iPhone XS
there were more than 40 supported
formats on the back camera.
So there are tons to choose
from.
But we are limiting the video
formats actually allowed to run
with MultiCamSession, because
these are the ones that we can
comfortably run simultaneously
on these devices.
So again, this is a bit of an
eye chart, but I'm going to draw
your attention to three groups.
First group is the binned
formats.
Remember? Low power.
Yay, these are our friends.
At the sensor, you're getting
that 2 by 2 binning, so you're
getting very low power.
All of these are available up to
60 FPS.
You've got choices from 640 by
480 all the way up to 1920 by
1440.
Next group is the 1920 by 1080
at 30.
This is an unbinned format, and
this is the same as the one you
would get if you chose the high
preset on a regular traditional
session.
This one is available for
MultiCam use.
The final one is 1920 by 1440
unbinned at 30 FPS.
This is kind of a good stand-in
for the photo format.
We do not support 12-megapixel
streaming on multiple cameras.
That would certainly do bad
things to the phone, but we do
allow you to do 1920 by 1440 at
30 FPS.
And notice, this format still
allows you to take 12-megapixel
high-resolution stills.
So this is a very good proxy for
when you want to do photography
with multiple cameras
simultaneously.
Now, how do you find out if a
format supports MultiCam?
You just ask it.
So while iterating through the
formats, you can say, "Is
MultiCam supported?"
And if it is, you're allowed to
use it.
In this code here, I'm iterating
through the formats on a device
and picking the next lowest one
in resolution that supports
MultiCam and then setting it as
my active format.
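A sketch along those lines, picking the largest MultiCam-capable format no bigger than the current active format (the exact selection policy is up to you):

```swift
import AVFoundation
import CoreMedia

// Walk a device's formats and pick a MultiCam-capable format no bigger than
// the current active format, then make it the active format.
func switchToMultiCamFormat(on device: AVCaptureDevice) throws {
    let activeDims = CMVideoFormatDescriptionGetDimensions(device.activeFormat.formatDescription)

    let candidates = device.formats.filter { format in
        let dims = CMVideoFormatDescriptionGetDimensions(format.formatDescription)
        return format.isMultiCamSupported
            && dims.width <= activeDims.width
            && dims.height <= activeDims.height
    }

    // Formats are ordered roughly by increasing resolution, so take the last candidate.
    if let best = candidates.last {
        try device.lockForConfiguration()
        device.activeFormat = best
        device.unlockForConfiguration()
    }
}
```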
The last way that we're
artificially limiting things is
this: because we need to report
costs, and those costs are
reported by the MultiCamSession,
we're specifically not supporting
multiple sessions with multiple
cameras in one app on iOS, and
we're also not supporting
multiple cameras in multiple
apps simultaneously.
Just be aware that MultiCam
support on iOS is still limited
to one session at a time, but of
course you can run multiple
cameras within that session.
Thus concludes the dad talk.
Okay, write good code.
Be home by 11.
If your plans change, call me.
All right.
All right, now back to the fun
stuff.
Synchronized streaming.
I talked a little bit about
software cameras.
Dual camera for one was
introduced on iPhone 7 Plus, and
it's now present on the iPhone
XS and XS Max as well.
And the TrueDepth camera is
another kind of software camera,
because it's comprised of an
infrared camera and an RGB
camera that is able to do depth
by taking the disparity between
those two.
Now, we've never given these
special types of cameras a name,
but we're doing that now.
In iOS 13, we're calling them
virtual cameras.
DualCam is one of them.
It presents one video stream at
a time, and it switches between
them based on your zoom factor.
So as you get closer to 2x, it
switches over to the telephoto
camera instead of the wide
camera.
It also can do neat tricks with
depth, because it has two images
that it can use to generate
disparity between them.
But still, from your
perspective, you've only been
able to get one stream at a
time.
Because we have a name now,
there's also a property in the
API which you can query.
So as you're looking at your
camera devices, you can find
out, programmatically, is this
one a virtual device?
And if it is, you can ask it,
"Well, what are your physical
devices?"
And in the API, we call this its
constituentDevices.
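A minimal sketch of that query:

```swift
import AVFoundation

let discovery = AVCaptureDevice.DiscoverySession(
    deviceTypes: [.builtInDualCamera, .builtInTrueDepthCamera, .builtInWideAngleCamera],
    mediaType: .video,
    position: .unspecified)

for device in discovery.devices where device.isVirtualDevice {
    // A virtual device reports the physical cameras it is built from.
    let constituents = device.constituentDevices.map { $0.localizedName }
    print("\(device.localizedName) is virtual; constituents: \(constituents)")
}
```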
Synchronized streaming is all
about taking those
constituentDevices of a virtual
device and running them
synchronized.
In other words, for the first
time, we're allowing you to
stream synchronized video from
the wide and the tele at the
same time.
You continue to set the
properties on the virtual
device, not on the
constituentDevices.
And there are some rules in
place.
When you run the virtual device,
the constituentDevices aren't
allowed to run willy-nilly.
They have the same active
resolution.
They have the same frame rate.
And at a hardware level, they
are synchronized.
That means the sensors are reading
out those frames in a
synchronized fashion so that the
middle line of the readout is
exactly at the same clock time.
So that means that they match at
the frame centers.
It also means that the exposure,
white balance and focus happen
in tandem, which is really nice.
It makes it look like virtually
it is the same camera, just
happens to be at two different
fields of view.
This is best shown rather than
talked about, so let's do a
demo.
This one's called AVDualCam.
There we are.
Okay, AVDualCam lets you see
what a virtual camera sees by
showing you a display of the two
cameras running synchronized.
And it does this by showing you
several different views of those
cameras.
Okay, here I've got the wide and
the tele constituent streams of
the dual camera running
synchronized.
On the left is the wide, and on
the right is the tele.
Don't believe me?
Here, I'm going to put my finger
over one side.
I'm going to put my finger over
the other side.
See? They're different cameras.
All I've done with the wide is
zoom it so it's at the same
field of view as the tele.
But you can notice that they're
running perfectly synchronized.
There's no tearing.
There's no weirdness in the
vertical blanking.
Their exposures and focuses
change at the same time.
Now we can have a little bit
more fun if we change from the
side-by-side view to the Split
View.
Now, this is a little bit hard to
see, but I'm showing the wide on
the left and the tele on the
right.
So I'm only showing you half of
each frame.
Now, if I triple-tap, I bring up
a distanceometer [phonetic]
which lets me change the plane
of depth convergence for the two
images.
This app knows how to register
the two images relative to one
another, so it lets me play with
the plane at which the depth
converges, kind of like with
your eyes when you focus on
something up close or far away,
you're kind of changing that
depth plane of convergence.
So, for instance, up close with
my hand, I can find the place
where the depth converges
nicely.
There we go.
Now I've got one hand.
But that's not right for the car
behind me so I can keep going
further away.
There we go.
And that's not right for the car
behind it.
So now I can pull that guy back
too.
And that's dual camera streaming
synchronized from the dual
cameras.
[ Applause ]
Here's a diagram showing
AVDualCam's graph.
Instead of using separate device
inputs, it just has one.
So it's using a single device
input for the dual camera, but
it's sourcing wide and tele
frames in a synchronized fashion
to two VideoDataOutputs.
You'll notice that there is a
little object, little pill at
the bottom called the
AVCaptureOutputSynchronizer.
I don't want to confuse you.
That thing is not doing the
hardware synchronization that I
talked about.
It's just an object that sits at
the bottom of a session, if you
desire, which lets you get
multiple callbacks for the same
time in a single callback.
So instead of getting a separate
VideoDataOutput callback for the
wide and the tele, you can slap
a DataOutputSynchronizer at the
bottom and get both frames for
the same time through a single
callback.
So it's very handy that way.
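A minimal sketch, assuming `wideOutput` and `teleOutput` are the two video data outputs, `dataOutputQueue` is your delegate queue, and `process(wide:tele:)` is a hypothetical handler (all of this lives inside your capture class):

```swift
// Create the synchronizer once the outputs are in the session.
let synchronizer = AVCaptureDataOutputSynchronizer(dataOutputs: [wideOutput, teleOutput])
synchronizer.setDelegate(self, queue: dataOutputQueue)

// AVCaptureDataOutputSynchronizerDelegate: one callback delivers both frames
// that share the same presentation time.
func dataOutputSynchronizer(_ synchronizer: AVCaptureDataOutputSynchronizer,
                            didOutput collection: AVCaptureSynchronizedDataCollection) {
    guard let wideData = collection.synchronizedData(for: wideOutput)
            as? AVCaptureSynchronizedSampleBufferData,
          let teleData = collection.synchronizedData(for: teleOutput)
            as? AVCaptureSynchronizedSampleBufferData,
          !wideData.sampleBufferWasDropped, !teleData.sampleBufferWasDropped else { return }

    process(wide: wideData.sampleBuffer, tele: teleData.sampleBuffer)
}
```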
Now, below it, there's a Metal
Shader Filter Compositor that's
doing some magic.
Like I said, it knows how to
blend those frames together, and
it decides where to render those
frames to the correct places in
the preview, and it also can
send them off to an
AVAssetWriter to record into a
video track.
Now, recall my earlier diagram.
I showed you a close-up view of
the AVCaptureDeviceInput,
specifically the dual camera
one.
The ports property of the dual
camera input exposes the ports
you see here.
Anybody see two video ports
there?
I don't see two video ports.
So how do we get both wide and
tele out of those input ports
that we see here?
Is that one video port somehow
giving us two?
No, it's not giving us wide or
tele.
It's giving us whatever the dual
camera decides is right for the
given zoom factor.
That's not going to help us get
both constituent streams at the
same time, so how do we do that?
Well, I'll tell you, but it's a
secret, so you have to promise
not to tell anybody, okay?
Virtual devices have secret
ports, okay?
The secret ports, previously
unbeknownst to you, are now
available, but you don't get
them out of the port's array,
you get them by knowing what to
ask for.
So instead of just getting an
array of every conceivable type
of port, including ports that
are not allowed to be used with
single-cam session, you can ask
for them by name.
So here we have the
dualCameraInput, and I'm asking
for its ports with
sourceDeviceType WideAngleCamera
and sourceDeviceType
TelephotoCamera.
It goes "Aha, those are the
secret ports that I know about.
I'll give them to you now."
Once you've got those input
ports, you can hook them up to a
connection the same way that you
would when doing your own manual
connection creation.
Then you're streaming from
either the wide or the tele or
both.
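A sketch of fetching those constituent ports and wiring them up, assuming `dualCameraInput` is the dual camera's device input and the two video data outputs were added with no connections:

```swift
// The dual camera's device input can hand back dedicated wide and tele ports
// when you ask for them by source device type and position.
let widePorts = dualCameraInput.ports(for: .video,
                                      sourceDeviceType: .builtInWideAngleCamera,
                                      sourceDevicePosition: .back)
let telePorts = dualCameraInput.ports(for: .video,
                                      sourceDeviceType: .builtInTelephotoCamera,
                                      sourceDevicePosition: .back)

// Connect each constituent stream to its own video data output.
let wideConnection = AVCaptureConnection(inputPorts: widePorts, output: wideVideoDataOutput)
let teleConnection = AVCaptureConnection(inputPorts: telePorts, output: teleVideoDataOutput)
if session.canAddConnection(wideConnection) { session.addConnection(wideConnection) }
if session.canAddConnection(teleConnection) { session.addConnection(teleConnection) }
```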
Now, in the AVDualCam demo, I
was able to change the depth
convergence plane of the wide
and tele cameras with the
correct perspective.
And you saw that it wasn't kind
of moving and shaking all over.
It was just moving along the
plane that I wanted it to, along
the plane of the baseline.
And I was able to do that
because AVFoundation offers us
some homography aids.
A homography, if you're
unfamiliar with the term, just
relates two images of the same
plane.
Homographies are a basic tool in
computer vision, common for such
tasks as image rectification and
image registration.
Now, camera intrinsics are not
new to iOS.
We introduced those in iOS 11.
They're presented as a 3 by 3
matrix that describes the
geometric properties of a
camera, namely its focal length
and its optical center, seen here
using the pinhole camera model:
light enters through the pinhole
and hits the sensor, the point
where it lands being the optical
center, and the distance between
the two being the focal length.
Now, you can opt in to receive
per-frame intrinsics by
messaging the
AVCaptureConnection and saying
you want to opt in for intrinsic
delivery.
Once you've done that, then
every video data output buffer
that you receive has this
attachment on it,
CameraIntrinsicMatrix, which
again is an NSData wrapping a
matrix_float3x3, which is a
simd type.
When you get the wide frame, it
has the matrix for the wide
camera.
When you get the tele frame, it
has the matrix for the tele
camera.
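A sketch of the opt-in and the per-buffer read; the delegate callback shown is the standard video data output one, and `connection` is assumed to be the relevant video connection:

```swift
import AVFoundation
import CoreMedia
import simd

// Opt in on the connection (check support first).
if connection.isCameraIntrinsicMatrixDeliverySupported {
    connection.isCameraIntrinsicMatrixDeliveryEnabled = true
}

// AVCaptureVideoDataOutputSampleBufferDelegate
func captureOutput(_ output: AVCaptureOutput,
                   didOutput sampleBuffer: CMSampleBuffer,
                   from connection: AVCaptureConnection) {
    if let attachment = CMGetAttachment(sampleBuffer,
                                        key: kCMSampleBufferAttachmentKey_CameraIntrinsicMatrix,
                                        attachmentModeOut: nil) as? Data {
        let intrinsics = attachment.withUnsafeBytes { $0.load(as: matrix_float3x3.self) }
        // Focal lengths sit on the diagonal; the optical center is in the last column.
        print(intrinsics)
    }
}
```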
Now, new in iOS 13, we offer
camera extrinsics at the device
level.
Extrinsics are a rotation matrix
and a translation vector that
are kind of crammed into one
matrix together.
And those describe the camera's
pose compared to a reference
camera.
This helps you if you want to
kind of relate where the two
cameras are, both their tilt and
how far away they are.
So AVDualCam uses the extrinsics
to know how to align the wide
and the tele camera frames with
respect to one another so it's
able to do those neat
perspective shifts.
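A sketch of that device-level query, assuming the wide camera is used as the reference device; the extrinsics come back as Data wrapping a simd matrix:

```swift
import AVFoundation
import simd

if let wide = AVCaptureDevice.default(.builtInWideAngleCamera, for: .video, position: .back),
   let tele = AVCaptureDevice.default(.builtInTelephotoCamera, for: .video, position: .back),
   let data = AVCaptureDevice.extrinsicMatrix(from: tele, to: wide) {
    // The rotation occupies the 3x3 block; the remaining column is the translation.
    let extrinsics = data.withUnsafeBytes { $0.load(as: matrix_float4x3.self) }
    print(extrinsics)
}
```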
That was a very, very brief
refresher on intrinsics and
extrinsics.
I described them in
absolutely excruciating detail
two years ago in Session 507, so
I'd invite you to review that
session if you have a very
strong stomach for puns.
Okay, the last topic of MultiCam
capture is multi-mic capture.
All right, let's review the
default behaviors of microphone
capture when using a traditional
AVCaptureSession.
The mic follows the camera.
That's as simple as I can put
it.
So if you have a front-facing
camera attached to your session
and a mic, it will automatically
choose the mic that's pointed in
the same direction as the front
camera.
Same goes for the back.
And it'll make a nice cardioid
pattern so that it rejects audio
out the side that you don't
want.
That way you're able to follow
your subject, be it back or
front.
If you have an audio-only
session, we're not really sure
what direction to direct the
audio, so we just give you an
omnidirectional field.
And as a power feature, you can
disable all of that by saying,
"Hands off AVCaptureSession, I
want to use my own
AVAudioSession and configure my
audio on my own," and we'll
honor that.
So now comes the time for
another dirty little secret.
There is no such thing as a
front mic.
I totally just lied to you.
In actuality, iPhones contain
arrays of microphones, and there
are different numbers depending
on the devices.
Recent iPhones happen to have
four.
iPads have five, and they are
positioned at different
strategic locations.
On recent iPhones, you happen to
have two that point straight out
the bottom.
And at the top, you have one
pointing out each side.
All of them are omnidirectional
mics.
Now, the top ones do get some
acoustic separation because
they've got the body of the
device in between them which
acts as a baffle, but it's still
not giving you a nice
directional pattern like you
would want.
So what do you do to actually
get something approximating a
front or back mic?
What you do is called microphone
beam forming.
And this is a way of processing
the raw audio signals to get
them to be directional.
And this is something that Core
Audio does on our behalf.
Here we've got two blue dots
which represent two microphones
on either side of an iPhone, and
the circles are roughly the
pattern of audio that they are
hearing.
Remember, they are both
omnidirectional mics.
If we take those two signals and
we just simply subtract them, we
wind up with a figure-eight
pattern, which is cool.
It's not what we want, but it's
cool.
If we want to further shape
that, we can add some gain to
the one that we want to keep
before subtracting them, and now
we wind up with a little Pac-Man
ghost, and that's good.
Now we've got rejection out the
side that we don't want, but
unfortunately, we've also
attenuated the signal, so it's
much quieter than we want.
But if, after doing all that, we
apply some gain to that signal,
we get a nice, big Pac-Man
ghost, and now we've got that
beautiful cardioid pattern that
we want, which rejects out of
the side of the camera that we
don't want.
Now, this is extremely
oversimplified.
There's a lot of filtering going
on to ensure that white noise
isn't gained up, but essentially
that is what's happening.
And up to now, only one
microphone beam form has been
supported at a time.
But the good folks over in Core
Audio land did some great work
for this MultiCam feature, and
as of iOS 13, we now support
multiple simultaneous beam
forming.
[ Applause ]
So going back to the old
AVCaptureSession.
When you get a microphone device
input and you find its audio
port, that port lives many
lives.
It can be the front, back or
omni depending on what cameras
the session finds.
But when you're using the
MultiCamSession, the behavior is
rigid.
The first audio port you find is
always for omni, and then you
can find those secret ports that
I was talking about to get a
dedicated back beam or dedicated
front beam.
The way you do that is by using
those same device input port
getters, this time by specifying
which position you're interested
in.
So you can ask for the front
position or the back position,
and that will give you the ports
that you're interested in, and
you'll get a nice back or front
beam form.
Here is for the front, and here
is for the back.
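A sketch of fetching those beam-form ports, assuming `micInput` is the device input for the built-in microphone and the two audio data outputs were added with no connections:

```swift
// Dedicated front- and back-facing beam forms come from the same microphone
// input, addressed by source device position.
let frontMicPorts = micInput.ports(for: .audio,
                                   sourceDeviceType: .builtInMicrophone,
                                   sourceDevicePosition: .front)
let backMicPorts = micInput.ports(for: .audio,
                                  sourceDeviceType: .builtInMicrophone,
                                  sourceDevicePosition: .back)

// Wire each beam form to its own audio data output.
let frontAudioConnection = AVCaptureConnection(inputPorts: frontMicPorts,
                                               output: frontAudioDataOutput)
let backAudioConnection = AVCaptureConnection(inputPorts: backMicPorts,
                                              output: backAudioDataOutput)
if session.canAddConnection(frontAudioConnection) { session.addConnection(frontAudioConnection) }
if session.canAddConnection(backAudioConnection) { session.addConnection(backAudioConnection) }
```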
Now, going back to the
MultiCamPIP demonstration we had
with Nik, we stuck to the video
side while we were showing you
the whizzy part of the graph.
Now I'm going to go back and
tell you what we were doing on
the audio side.
The whole time, we were running a
single device input with two
beam forms, one for the back and
one for the front, and we were
running those to two different
audio data outputs.
This slide should say
AudioDataOutputs.
And then we were choosing between
them at runtime.
So depending on which camera was
the larger of the two onscreen,
we would switch to the back or
front beam form and give you the
one that we desired.
There are a couple rules to know
about multi-mic capture.
Beam-forming only works with
built-in mics.
If you've got something
external, USB, we don't know
what that is.
We don't know how to beam form
with it.
If you do happen to plug in
something else, including
AirPods, we will capture audio
of course, but we don't know how
to beam form, so we'll just pipe
that microphone through all of
the inputs that you have
connected, thus ensuring that
you don't lose the signal.
And that's the end of the
multi-camera capture part of
today's talk.
Let's do a quick summary.
MultiCam capture session is the
new way to do multiple cameras
simultaneously on iOS.
It is a power tool, but it has
some limitations.
Know them.
And thoughtfully handle hardware
and system pressure costs as
you're doing your programming.
And if you want to do
synchronized streaming, use
those virtual devices with
constituent device ports.
And lastly, if you want to do
multi-mic capture, be aware that
you can use front or back beam
form or omni.
Thank you.
[ Applause ]