WWDC2010 Session 411

Transcript

>> James McCartney: My name is James McCartney.
After me, Eric Allamanche will be up to talk.
There's going to be two parts to this talk.
I'm going to talk about Audio Processing
Basics, and then Eric will talk
about the Voice Processing Audio Unit and Audio Codecs.
So, in many of the past WWDCs, we in Core Audio have given a lot of talks about Core Audio and how to use it.
And we've always assumed a certain knowledge about audio, that everyone is an audio programmer.
But a lot of people just want to be able to play audio.
So, we're going to step back a bit, and I'm going to talk about what digital audio is and how it's represented.
So, there'll be three parts to my talk
about audio representation formats,
converting audio, and processing audio.
So, what is digital audio?
Sound is moving air molecules.
And a microphone transduces that
into an electronic waveform.
And you sample that waveform at a periodic interval; the rate of that sampling is called the sampling rate.
And each sample is a number.
And so, you need to have at least two numbers per cycle of
a sine wave in order to be able to reconstruct a sine wave.
And therefore, the highest reproducible
frequency is half of the sampling rate.
So, once you have sampled the waveform as numbers, there are different ways to represent that in the computer.
I'm going to talk about linear PCM, nonlinear PCM, and packetized compressed formats.
Linear PCM or LPCM is the most direct
way to represent the sampled audio.
And you just store the sampled numbers in binary form.
But there's a lot of ways to do that.
Are the numbers you're storing integer or floating point?
If you're storing integers, are they signed or unsigned?
How many bits are in each number?
That's called the bit depth.
What order are you storing the bytes in?
If the most significant byte is coming
first, that's called big endian.
If the least significant byte is coming
first, that's called little endian.
How many channels of audio are there?
And are you storing the channels of audio together, which
is called interleaved, or are you storing them separately?
That's called non-interleaved.
And are the bytes packed or is there padding?
So, now, different groups in different
areas refer to these things differently.
So, in Core Audio, we have a consistent way
of referring to samples, frames, and packets.
Now, if you work in codecs, you might
call a frame what we call a packet.
Or if you work in the music industry,
you might call things in another way.
So, we've decided on a certain way of calling things.
So, a sample is one sample of a waveform.
A frame is a collection of samples, one for each channel of an audio stream, for the same vertically aligned moment in time.
And then a packet, for a particular stream of audio, is the smallest cohesive unit of data for that format.
It's what you pass around.
For a linear PCM, one packet and one frame are synonymous.
But for compressed formats, one packet is
some group of bytes that you can decompress
into some number of frames of linear PCM.
And so, this shows the difference between what interleaved audio and non-interleaved audio look like.
In Core Audio, we have a universal container
for audio called the AudioBufferList
which we use to pass around audio to all our APIs.
That contains an array of buffers.
So, in the interleaved case, you see that for stereo sound,
you have the left and right channels are alternating.
They're interleaved in a single buffer.
And in the non-interleaved case,
each channel is in its own buffer.
So, when we see what one frame
looks like in each of these formats,
one frame of interleaved audio,
this is linear, stereo linear PCM.
So, one frame of stereo would be the left
and right sample in the same buffer together.
So, if you're talking about 2-byte
like 16-bit audio samples,
the left and right sample together will be 4 bytes of audio.
Whereas in the non-interleaved case, you have, in each
buffer, one sample, so there would be 2 bytes occupied
by a frame in each buffer in a non-interleaved case.
So, in linear PCM, there are two dimensions of quality, basically: sample rate and bit depth.
The sample rate, as I mentioned earlier, determines what the highest reproducible frequency is.
It's the bandwidth of the audio you're listening to, which is just half the sampling rate.
So, this is the list of common
sample rates and how they're used.
Eight kHz is narrow-band speech.
You can only represent frequencies up to
4 kHz, so it's not very good for music.
Sixteen kHz is wide-band speech.
You're able to represent other incidental sounds
in this format, and it just sounds better.
Then 44 kHz, 44.1 kHz, that's CD quality audio.
That contains basically the full human audible spectrum.
And another common rate is 48 kHz, which is used in digital audio tape and a lot of audio hardware.
Then you see other higher rates used in
pro equipment, and now it's showing up like
in home theater equipment with 96
kHz and 192 kHz sampling rates.
So, human hearing extends up to 20 kHz.
So, you don't really need sampling
rates above 48 kHz in order to be able
to reproduce the entire human audible spectrum, but there's
technical reasons why you might want higher sampling rates
than 48 kHz, and that has to do with being
able to simplify your audio processing.
So, in a lot of pro situations or in
audio processing situations internally,
you'll have rates like 96 kHz or 192 kHz.
So, the other dimension is bit depths.
And bit depth determines Signal to Noise Ratio.
And so what is Signal to Noise Ratio?
It's the amplitude of the signal, which is what
you're interested in, the music or the speech,
divided by the amplitude of the noise, and
the noise in this case is quantization noise.
When you convert the analog signal to a number, especially if you convert to an integer, there's only an integer number of steps available to you to represent the amplitude of the waveform.
And the error between the value that you chose and the
actual value of the audio is called the quantization error,
and that becomes a noise in your audio stream.
It's measured in decibels, and every 6 decibels is roughly a factor of two in amplitude.
So, every bit you add to the audio gives you 6 more decibels of Signal to Noise Ratio.
So, in the integer format, Signal to
Noise Ratio is amplitude dependent.
So, when you see a quoted Signal to Noise Ratio for an integer format, that's referring to what you get with a full amplitude signal.
But the quieter signals have a worse Signal to Noise Ratio.
And so, if you've got a signal that's at -20 dB, and you've got an 8-bit audio signal with only 48 dB to work with, then you're going to have a 20 dB worse Signal to Noise Ratio.
But in floating point, Signal to Noise
Ratio is independent of the amplitude.
So, this is some common bit depths and how they're used.
8 bit integer is sort of the format that was used
when it was very expensive to store and process audio.
It's used in the old games and
a lot of gear made in the 1980s.
It's got 48 decibels of Signal to
Noise Ratio for a full scale signal.
So, that's not very good especially if you've got things
that are not at full scale, which is quite often the case.
Then 16 bit integer is CD quality sound.
It's 96 decibels of Signal to Noise Ratio; that's quite good, except if you're playing a quiet sound and you're turning the volume way up, then you can start to hear the limitations of 16 bit integers.
So, 24 bit integers give you 144 decibels of Signal to Noise Ratio.
That's quite good.
A lot of-- actually, I think almost all hardware can't actually reproduce that much quality.
So, then for internal processing, we use 32 bit values.
The AudioUnitSampleType on iPhone OS, or iOS, is a 32 bit integer in 8.24 fixed point, which is 8 bits of integer, including the sign bit, and 24 bits of fraction.
That gives you 144 decibels of Signal to Noise Ratio, but you've also got 42 decibels of headroom, so you can go over unity gain and not worry about some kinds of processing problems when you're dealing with internal processing.
Then on the desktop, we'll use 32 bit floating point
which has 144 decibels of Signal to Noise Ratio
for any amplitude even very quiet signals.
And it's got an essentially unlimited dynamic range.
So, one point about quality, once you've
lost quality, you can't add it back.
So, sometimes, you hear about people converting their
audio to a higher sample rate or to a higher bit depth.
That's not going to gain you any quality.
Similarly, re-encoding compressed audio to a higher bit
rate or re-encoding it with a better codec is not going
to give you quality back for your signal.
It's only if you have the original uncompressed source
at a higher quality will you be able to then re-encode
to a better format than you previously encoded it to.
So, then, the next step past linear
PCM was kind of historical.
But it's called nonlinear PCM.
And that's instead of storing the number, you store the
logarithm of the number, and that increases the Signal
to Noise Ratio of quiet signals at
the expense of the loud signals.
And there are two common algorithms, mu-law and A-law, which differ in how they quantize or approximate the logarithm function.
And those both encode audio in 8 bits per sample.
Then you get to packetized compressed formats.
And that's when a group of frames is
compressed into a packet of bytes.
One thing to note is that packets often
have dependencies on preceding packets.
So, when you're decoding audio, you're putting
the codec into a particular state, and that--
the next packet will assume that the codec is in that state.
So, if you take chunks of compressed audio from
different streams and you splice them together just
by appending the packets to each other, the codec
is not going to be in the state that the packet,
the next packet after the splice is going to think
that it's in, and you're going to get a glitch.
So, the way you really have to do that is to edit
non-compressed data together and then compress that,
or you do something more sophisticated about
how you splice your packetized audio together
by overlapping and decompressing or re-compressing.
So, but there's going to be more about
compressed formats in the next section.
Now, formats are represented in Core Audio by a
structure called the AudioStreamBasicDescription.
It's used in nearly every API in Core Audio.
It's been covered extensively in previous WWDC talks.
There's information about how to fill one
out, how you get one from the various APIs
like AudioFormat, AudioFile, and AudioConverter.
And one thing to note is that if you use the AV Foundation classes like AVAudioPlayer, which play a file to the audio hardware, then you can avoid AudioStreamBasicDescriptions altogether.
So then, converting audio.
When you've got audio in all these formats,
how do you get from one format to another?
So, we have an API called the AudioConverter
which does this.
And there are three main conversions you can do.
Linear PCM to linear PCM handles all kinds of transformations, like sample rate conversion, bit depth changes, converting integer to floating point, interleaved to non-interleaved, or any combination of these, or changing the number of channels.
Then you have encoding and decoding: encoding is taking linear PCM and converting it to a compressed format, and decoding is taking the compressed format and turning it back into linear PCM.
Some of our APIs have AudioConverters built within them.
So, if you use these APIs, then you can avoid having
to deal with the complexity of the AudioConverter
and sort of take advantage of some
of the work that's been done for you.
So, if you're on one of these scenarios like if you
are playing or recording buffers of audio from memory
and you want to play it out to the hardware or record
from the hardware, then you can use the AudioQueue API
and that will take care of converting
between your format and the hardware format.
If you want to read and write audio files to and from memory, then you can use the ExtAudioFile API, and that will handle the conversions between the format you want and the format the file is in.
Then if you want to just play a file out to the audio
hardware or record a file from the audio hardware,
then AVAudioPlayer will handle the conversion between
the file format and the hardware format for you
so you don't have to deal with format change at all.
And in this case, you don't have to even name
the formats using audio stream basic description.
OK. So, when you're doing a sample rate
conversion using the AudioConverter,
you have a number of ways to set the quality.
Sample rate conversion is a relatively expensive operation, depending on what quality you want.
There's a property called the AudioConverterSampleRateConverterComplexity property, which allows you to choose several different algorithms with different levels of quality.
So, there's linear, normal, and on the
desktop, you have mastering quality.
Linear is just a linear interpolation between samples.
It's fast, but it's not very good quality.
And then normal does a better job using a more sophisticated algorithm.
So, within the normal and mastering complexities, you have several bands of quality, from minimum to maximum, that you can set.
Linear is just linear, there's
no quality setting for linear.
A higher quality costs more CPU.
The other thing is on the desktop
where you have normal and mastering,
the lowest quality of mastering is better
than the highest quality of normal.
So, they're completely disjoint bands of quality, but
mastering is quite a lot more expensive especially
if you're doing-- if you choose
maximum of mastering quality.
So, processing audio: in Core Audio, we have AudioUnits, which are used to process audio; they are components.
And the main attributes of them is they have inputs
and outputs, so you have ways to get audio in and out,
and then you have parameters that
you can adjust in real time.
And there's a lot of different kinds of AudioUnits.
There's I/O which talk to the hardware so you
can read audio from the hardware or play it out.
And then there are effects, which is the most numerous category, giving you filters, compressors, delays, reverbs, and time/pitch changes.
And then there are panners and mixers.
For the I/O AudioUnits on iOS or the
iPhone OS, you have the remote I/O unit
which is your most direct access to the audio hardware.
And then on the desktop, you have AUHAL
which fulfills basically the same role.
It's an AudioUnit that talks to the hardware.
On the desktop, the AUHAL is built on top of the HAL, which is the hardware abstraction layer, and that's the low level access to the hardware.
If you use the AUHAL AudioUnit, you're going to benefit from having a lot of the details of dealing with the low level handled for you, including audio conversion and format conversion.
And there's no cost in latency for using that AudioUnit.
Just an example: on the desktop, you have various filter AudioUnits.
These images are from the UI of the AudioUnits, showing a graph of frequency versus gain for the various filters.
There's a Parametric EQ, Graphic EQ, Lowpass,
Highpass, Bandpass, Low and High Shelf Filters.
And then you also have Compressors.
And there's Delay unit, Reverb unit.
There's also Panner units.
We have Mixers.
On the iPhone, you have a Multichannel Mixer, which actually does mono and stereo, and then there's the Embedded 3D Mixer.
And then on the desktop, you have Multichannel
Mixers, 3D Mixers, Stereo Mixer and Matrix Mixer.
So, I'm going to talk about the Embedded 3D Mixer.
It gives you two basic algorithms: equal power panning for stereo, and a spherical head algorithm, which gives you interaural time delay cues, intensity differences, filtering due to the head, and distance filtering.
So, the 3D Mixer uses Azimuth, Elevation, and Distance as its parameters, a listener-centric parameterization of the source of the audio.
So, Azimuth is the angle from directly
forward for the listener.
So, positive is around to the listener's right,
and negative is around to the listener's left.
And 180 is in the rear.
So, this just illustrates these parameters: the source sits at some distance from the listener.
And then on some of the desktop panners and 3D Mixers, you also have Elevation.
So, then, there's also a property on the 3D Mixer, which is Distance Attenuation.
There's a reference distance, below which there is no change in the amplitude of the audio, and there's a maximum distance, above which there is no further change in the amplitude.
But between that, there's a distance curve, and there's
several different distance curves you can choose.
Another way to access 3D spatialization is through OpenAL.
And OpenAL is an OpenGL-like library for 3D audio.
It's cross-platform, and allows
3D spatialized source positioning.
It's built on top of the 3D Mixer, so you're using the 3D
Mixer underneath, but it allows you to use world coordinates
which are in x, y, z, so the listener can be anywhere
and the source can be anywhere in space rather
than using listener-centric coordinates
like the 3D Mixer uses.
OK. So, now, it's time for Eric Allamanche to talk
about the Voice Processing Unit and Audio Codec.
[ Applause ]
>> Eric Allamanche: Thank you, James.
And welcome again.
My name is Eric Allamanche, and I'm going to
walk you through the Voice Processing Audio Unit
and the Audio Codecs we provide on the iPhone.
So, let's start with the Voice Processing Audio Unit.
The Voice Processing Audio Unit was
added to iPhone OS 3.0 last year.
And it is basically a dedicated RemoteIO
unit with a built-in Acoustic Echo Canceler.
So, from a programmer's perspective, this Echo Canceler can be accessed exactly the same way as you would access the RemoteIO unit.
So, basically, setting up, creating the
instance, setting parameters, and so on.
But this year, in iOS 4, we added a new algorithm
which provides significantly better quality,
of course, at the cost of a heavier CPU load.
And this Echo Canceler was specifically designed
to allow extremely high quality audio
chat like in the FaceTime application.
So, now, we offer two algorithms in this RemoteIO unit, and this allows you to make a tradeoff between the quality and the CPU you want to spend for this kind of application.
Let me just recall the functionality
of an Acoustic Echo Canceler.
So, on the left-hand side, you have what we call the far end speaker, which is a device with a microphone and a loudspeaker.
And on the right-hand side, there is the near end speaker with the same appliance.
So, the far end speaker starts to talk, and this is
visualized with the blue arrows, so the speech signal goes
to the microphone on the far end speaker's end.
And then it's encoded and propagated over the internet
and comes to our device to the near end speaker device.
And so, this signal is then rendered through the loudspeaker.
But because, on most devices, the microphone is really in the vicinity of the loudspeaker, there is a certain amount of acoustic energy which goes back into the near end speaker's microphone.
And if we don't take any measures at this level, well,
this signal is then propagated back to the far end speaker.
And because of all the delays in the chain being
from the encoding, network delays and et cetera,
the far end speaker signal comes back after
a certain amount of time typically around 100
or 200 milliseconds; and this is perceived as an echo.
So, what happens now if the near end speaker talks on top of the far end speaker?
Well, the signals are mixed together acoustically in the air, and this mixed signal is then captured by the microphone and sent back to the far end speaker.
So, this is now where the Echo Canceler
comes in to play because we want
to eliminate the amount of blue
signal here from the lower path.
And this is done with several algorithms which
have been developed over the last decades.
And basically, the purpose of the Echo Canceler is to analyze both signals: the signal which is about to be sent to the loudspeaker, and the signal which is captured by the microphone.
And from these two signals, the Echo
Canceler tries to make an estimate
of the amount of echo included in the lower path.
And this echo is then removed from the path.
This is visualized by the subtraction sign on the lower part of this diagram.
And in either case, what should go back to the far end speaker is the red signal, which is the speech of the near end speaker only.
So, now, what happens if other app sounds are playing, or you get a notification, an e-mail or whatever?
Then this sound gets mixed in with the far end speaker's signal, and then rendered through the speaker.
But what we want to do here is eliminate this sound, or it could be some background audio playing in the case of a game, for example; we don't want this signal to be echoed back to the far end speaker.
And in order to do this, we have to put the summation point of the signals on the left, on the farthest side of the system.
And this is why this mixing happens before the Echo Canceler is invoked.
One important thing to note here is that your application will see all the signals that it has to deal with, but it won't see any other signals coming in, like an e-mail notification or sounds played by other applications.
So, because you won't see these signals here,
in this case, the mixing will happen further
on the right side, and those won't be eliminated.
So, that's why this RemoteIO with a built-in Canceler is an ideal solution
to eliminate all ancillary sounds
coming from other applications.
The way you open and interact with this processing
unit is exactly the same as with the RemoteIO units,
and this will be explained in much more
detail in the next session by Murray.
I just wanted to point out some basic setup here.
So, you create the audio component description.
And the difference here is that you just provide the
componentSubType to be the VoiceProcessingIO instead
of the RemoteIO, and that's basically it.
So, once you find the component and create a new instance,
you have your RemoteIO with a built-in
Canceler ready to be used.
Now, with this voice processing
unit, we provide a few parameters.
The first one is the Bypass parameter which
allows you to bypass the whole process,
which means that basically nothing happens,
everything is mixed together and nothing is removed.
This can be useful in certain circumstances.
The voice processing unit also has a built-in automatic gain control to boost the resulting signal coming out of the Echo Canceler; this can be controlled by a property, and it is on by default.
And another property is for ducking the
NonVoiceAudio, as I mentioned in the diagram.
So, all the other app sounds are what we call the NonVoiceAudio.
And there's a property to duck
this audio to a certain extent.
Now, in iOS 4, we've added two more properties.
And as I mentioned, on the first slide, we added a
new algorithm which provides much better quality,
and this algorithm is now controlled by the VoiceProcessingQuality property.
So, with this property, you can select either the old echo suppressor we had in iPhone OS 3.0, or the new, better one available in iOS 4.
And the last property we added is a MuteOutput,
which basically zeroes out the signal
coming out of the Echo Canceler.
So, this is for the muting of a conversation.
So that was about the Voice Processing Unit.
Now, let's dive into the Audio Codecs.
The term CODEC is a contraction of encoder and decoder, and this is not specific to audio; it applies to any kind of codec, like video codecs and so forth.
And the main purpose of a codec is to compress and decompress PCM audio signals.
So, because we're talking about audio
codecs, we only deal with PCM audio signals.
And in general, we differentiate two big categories of codecs.
One being the lossy codecs which are associated with loss
of information, and on the other hand, lossless codecs.
And of course, audio codecs nowadays are a core technology in digital audio; they are basically the backbone of the iPod, of any media player application, and also of the iTunes Store.
Now, let's talk about lossy versus lossless audio codecs.
In the case of lossless codecs,
there is no loss of information.
So, after one encoding and decoding cycle, the resulting
signal should be bit identical to the input signal,
and this is regardless of the bit depth,
be it 16, 24 or 32 bits integer or float.
So, no loss of information.
This can be compared to the Unix zip command, for example.
But zip is a general-purpose tool, and it doesn't provide good compression ratios for audio signals in general; that's why it's better to have dedicated lossless audio codecs.
And typical compression factors for state of the art lossless codecs nowadays are in the range from 1.5 to 2, with 2 being a very good compression ratio already.
On the other hand, there are the lossy codecs, which are the most widely used ones, like MP3 and AAC.
These typically rely on a perceptual model of the human auditory system.
What this means is that, as James mentioned before, we can only hear signals up to 20 kHz, but this is already a very optimistic case, because as we grow older, this upper limit shifts down towards 15 kHz and even lower.
Lossy codecs take advantage of many properties of our auditory system, especially the masking effect, which basically says: if you have a tone playing at a certain magnitude, and another tone comes in somewhere in the same frequency range but with a much lower magnitude, that second tone won't be perceived at all.
So, because it is not perceived,
there is no reason to encode it.
And this is what the lossy codecs try to achieve
by evaluating first what information
can be discarded and what should remain.
So, this is the irrelevant part of the information, which we're going to try to remove.
The other part is the redundant information, which is more of a mathematical nature; this is basically the predictable part of the signal.
The lossy codec is basically controlled by the bit rate.
It is obvious that the higher the bit rate, the higher the quality, and conversely, the lower the bit rate, the worse the quality.
So, it is extremely important to make good decisions
regarding the bit rate because we want to make sure
that we don't degrade the signal or
that degradations won't be perceived.
And in contrast to lossless codecs, lossy codecs have
typical compression factors ranging between 6 and 24,
which is a very big range, and this is
achieved with very sophisticated algorithms.
Now, I'm just going to talk about the audio
decoders, which are available on the iPhone.
So, we first have the Adaptive Differential Pulse Code Modulation codecs, like IMA4, DVI, and MS-ADPCM, which are very simple codecs.
They don't provide very good audio quality, but because they were simple, they were widely used historically, and they are still used today: for example, when a Voice over IP provider sends you an e-mail because you can't pick up the phone, the voice messages are typically encoded in one of these formats.
Then we have QDesign versions 1 and 2, which are actually the old audio codecs QuickTime used before we moved over to AAC, so they're just there for historical reasons.
Then of course, there's GSM, the GSM Full
Rate Codec used on the mobile networks.
Then we have the Internet Low Bitrate Codec, iLBC, which is a free and open codec providing decent quality at pretty low bit rates.
Then of course, MP3 which is MPEG-1/2 Layer 3.
Apple Lossless which is the only
lossless codec we provide on the iPhone.
And then the MPEG-4 AAC family of codecs which I
will discuss in more details in the next slides.
Now, regarding encoders, we don't provide an encoder for every decoder.
There is not always a need for an encoder for every format.
Therefore, the choice of encoders is much more restricted here.
So, for the ADPCM, we have IMA4.
We have the iLBC codec, which you can use for Voice over IP applications, for example.
And Apple Lossless, which in certain cases is used for voice recording.
And then for MPEG-4 AAC, we provide three different codecs: the Low Complexity Codec, the Low Delay Codec, and the Enhanced Low Delay Codec.
Now, regarding MP3, we don't provide any MP3
encoder, and this is also true on the desktop.
Core Audio doesn't provide an MP3 encoder.
So, if you want to encode to MP3 on
the desktop, you need to use iTunes.
The AudioConverter can't encode to MP3.
If we put all the codecs together with their characteristics, we get the following table.
So, let's first start with iLBC.
iLBC runs at 8 kHz.
It is a speech codec, optimized for speech, and offers two bit rates in the ballpark of 15 kilobits per second.
Then of course, we have MP3, which has a sampling rate range of 16 to 48 kHz and can encode mono and stereo signals.
MP3 is what we call a general audio codec.
What this means is that MP3 hasn't been designed to encode one certain class of signals; it can be used for any kind of signal, including speech, of course.
ALAC has the particularity that the bit rate can't be set.
And the reason is obvious because it is a lossless codec.
So, the content itself will actually
determine what the final bit rate will be.
And it is also a general audio codec.
Then the AAC Low Complexity codec provides a very broad sample rate range, going from 8 to 48 kHz, and it is also a general audio codec, the same as MP3.
And in iOS 4, we've added two more channel
configurations which are the 5.1 and the 7.1.
What this means is that if you have an AAC file encoded in 5.1 or 7.1, you will be able to decode it; however, it will only be rendered in stereo.
We do a downmix at the end of the decoding process.
Now, the High Efficiency AAC codec also provides mono and stereo channels.
This codec, as I will explain in more detail on the next slide, has been optimized for streaming audio.
And AAC Enhanced Low Delay has been specifically optimized for AV chat kinds of applications.
Now, let's talk about the AAC codecs in more detail.
Why do we even have to bother with AAC? I mean, isn't MP3 ubiquitous now, and isn't MP3 good enough?
AAC stands for Advanced Audio Coding, and I just want to point out that the A in AAC does not stand for Apple; it's all about advanced audio, and it's an MPEG standard.
MP3 is almost 20 years old now.
It was standardized in 1991.
And at that time, the requirements for audio codecs
were completely different than they are nowadays.
And even at that time, it was a challenge to decode an MP3 stream in real time; this could only be done with expensive DSP boards.
Therefore, MP3 had serious limitations in its design, most specifically in the bit rates, sampling rates, and channel configurations it would support.
So, right off the bat, MP3 can only handle up to stereo signals.
And there were some mathematical underpinnings in the codec design itself which wouldn't allow it to be transparent for certain signal classes.
So, all of this together led the
MPEG consortium to start
a new working group which was focused
on designing a much better codec,
a non-backwards-compatible codec,
and this is where AAC came to life.
The first version was standardized
in the course of MPEG-2 in 1997.
Since that time, AAC was adopted as the basic
codec in MPEG-4, and it has seen many,
many additions and new variations coming
in, which makes it a very versatile codec.
So, about the AAC codecs: first,
there is the Low Complexity codec.
The Low Complexity codec is actually the codec you
use for any kind of media playback application.
This is what the iPod uses,
and what the iPod application
uses on iPhone and iPod touch.
Then we have the High Efficiency and High
Efficiency v2 codecs, which have the advantage
that they provide similar quality to some
extent, but at significantly lower bit rates.
And this makes them really interesting
for internet radio stations.
And if you look at the iTunes radio station
library, you will notice that anything at a low bit rate,
around 64 kilobits per second and below, is
most of the time encoded using
High Efficiency or High Efficiency v2.
And on the other side, we have the Low Delay codec,
which was first introduced in Mac OS X Leopard.
It is the default codec for the iChat AV application.
And today, in iOS 4, we've added a new codec
called the Enhanced Low Delay codec;
this is the codec which is used
for the FaceTime application.
Now, Low Complexity and High Efficiency.
Low Complexity provides the highest
audio quality and multi-channel support.
High Efficiency uses some tricks in
order to significantly reduce the bit rate.
What High Efficiency is basically
doing is that during the decoding process,
it synthesizes the upper frequency
bands rather than encoding them.
And so, this results in some significant bit rate savings.
And as an extension to the High Efficiency,
the High Efficiency v2 Codec expands from mono
to stereo signals using some parametric stereo techniques.
And this v2 version can even provide lower
bit rates down to 20 kilobits per second.
So, to summarize, the highest quality will
always be achieved using the Low Complexity codec,
and the lowest bit rate, with degraded quality,
can be provided using High Efficiency v2.
Now, just to give you an idea--
I hit the wrong button, sorry.
I have a few sound examples here.
The first sound example is encoded
using the Low Complexity codec
at 120 kilobits per second, which is a very popular bit rate.
[ Music ]
The next example is encoded
using the High Efficiency codec
at 64 kilobits per second, so half the bit rate.
[ Music ]
And the next one is going to be High Efficiency v2, encoded
at 32 kilobits per second, so again
half of what we just heard.
[ Music ]
Now, in order to put this in contrast
with Low Complexity,
the next example is actually the same item
encoded with Low Complexity, but at 32 kilobits per second.
So, exactly the same bit rate as High Efficiency v2.
[ Music ]
So, even with these room
acoustics, the difference is obvious.
But just to be clear, the message here is not
that I want you to rush back to your offices
and re-encode all your assets using High Efficiency v2 at
the lowest possible bit rate; that's not the message.
I just want to show how efficient this technique is.
And as I mentioned before, High Efficiency
does some clever synthesis
of the upper frequency bands,
but they do not really reflect
what the original content was.
So, you should always be aware of this.
The way this works with High Efficiency:
when High Efficiency was introduced,
many systems were already using AAC decoders, and
the MPEG consortium wanted to preserve backwards compatibility.
So, what they did
was use a layered approach.
So, we start with the Low
Complexity base layer, which is either mono
or stereo, typically at a 22 kHz sampling rate.
And on top of this Low Complexity layer, we add
the High Efficiency layer, which is also mono or stereo.
But the High Efficiency layer operates at
double the sampling rate of the Low Complexity one.
What this means is that, as I said
before, we synthesize the upper frequency bands.
Therefore, only the lower frequency portion
will be encoded using Low Complexity.
And similarly for High Efficiency
v2, we start with a Low Complexity layer,
but this time, this layer is mono only.
Then we add a High Efficiency layer at twice
the sampling rate, which will also be mono.
And on top of this comes the High Efficiency v2
layer, which will then expand the mono signal to stereo.
The discovery mechanism and how you deal
with this format is described in tech note 22.3.6,
so I recommend having a closer look at that document.
Now, given that Low Complexity provides such high quality
and we have very good efficiency using the High Efficiency
codecs, what's the problem?
Why do we need Low Delay?
Well, if we look at the waveform
as it would come out directly
from a Low Complexity decoder,
we would see something like this.
The input signal is actually the same
signal, but left-justified.
So, we see here that before the signal onset, there is a
huge lag: a region which contains only zeros.
And this lag is obviously too much for
full-duplex, chat-like applications.
This problem was addressed with the Low
Delay AAC codec, and you see that the output
of the AAC Low Delay codec has a lag that
is substantially less than for the Low Complexity one.
And in the case of Enhanced Low Delay, the
lag is only half of the Low Delay one.
If we put this in terms
of numbers, we see that Low Complexity has a lag
of 2112 samples, whereas Enhanced
Low Delay has only a 240-sample lag.
So, this shows that this codec is
well suited for AV chat applications.
The Low Delay codecs share the same foundations as AAC,
so they are actually extensions of the AAC standard,
but they provide much smaller delays, typically 15 to 40
milliseconds, and they have been specifically designed
for full-duplex communication applications.
Low Delay, which can be created using the format
ID constant kAudioFormatMPEG4AAC_LD,
has a minimum delay of 20 milliseconds.
Enhanced Low Delay provides 15
milliseconds as the minimum delay.
And because they are part of the AAC family of
codecs, they have a large bit rate range
and they even allow for transparent quality.
One thing I just wanted to point out: because the
window and block sizes are fixed in samples for all the codecs,
the delay in time is actually inversely
proportional to the sampling rate.
So, the higher the sampling rate, the
lower the delay will be, and conversely,
the delay will increase if you
go down with the sampling rate.
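To make that relationship concrete, here is a minimal sketch of the arithmetic. The 2112- and 240-sample figures are the onset lags from the slide; a codec's full algorithmic delay also includes things like encoder look-ahead, so treat these millisecond values as illustrative only.

```c
#include <assert.h>
#include <math.h>

/* Convert a codec lag given in samples to milliseconds.
   Because the lag is fixed in samples, the lag in time shrinks
   as the sampling rate goes up. */
double lag_ms(double lag_samples, double sample_rate_hz) {
    return lag_samples / sample_rate_hz * 1000.0;
}
```

For example, the 2112-sample Low Complexity lag is about 47.9 ms at 44.1 kHz, while the 240-sample Enhanced Low Delay lag is about 5.4 ms at 44.1 kHz and exactly 15 ms at 16 kHz, which lines up with the minimum delay figure mentioned below.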
I just want to briefly give you an
overview of where those codecs live.
Allan, in the previous session,
explained when a software codec
comes into play and how a codec comes into play.
This table just summarizes in which world each codec lives.
We see that the High Efficiency and High Efficiency
v2 codecs aren't yet available as software codecs.
And this is important for your application: if
you want to use High Efficiency encoded material,
you have to be aware that you
may not be able to decode it
at its full quality if the hardware
codec is already in use, either
by another application or by another AV player instance.
But the Low Delay codecs are software only,
so you can have multiple instances of them.
Now, I just wanted to go over some key
parameters for the encoding process.
First, we have the sampling rate, of course, the number
of channels, and the bit rate, the bit rate modes,
and all this leads to the subjective
quality, which we want to maintain.
The bit rate determines the compression ratio.
The higher the bit rate, the bigger the
resulting file, but the better the quality.
And the bit rate typically accounts for all channels.
So we don't specify it on a per-channel basis, because
sometimes that doesn't make sense, like for 5.1
or 7.1 material, where the Low Frequency Effects
channel doesn't require many resources.
And the bit rate typically also grows with the number
of channels and with the sampling rate.
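Since the bit rate covers all channels, sizing an encoded file is simple arithmetic. Here is a small sketch; the 128 kbps and one-minute figures are my own illustration, not from the talk.

```c
#include <assert.h>

/* Encoded size in bytes for a given total bit rate and duration.
   AAC bit rates cover all channels, so the channel count does
   not appear in this formula. */
double aac_bytes(double bitrate_bps, double seconds) {
    return bitrate_bps / 8.0 * seconds;
}

/* Uncompressed LPCM size: rate * channels * bytes per sample * time. */
double pcm_bytes(double rate_hz, int channels, int bytes_per_sample,
                 double seconds) {
    return rate_hz * channels * bytes_per_sample * seconds;
}
```

One minute of 128 kbps AAC comes to 960,000 bytes, versus 10,584,000 bytes for one minute of 44.1 kHz, 16-bit stereo LPCM, a compression ratio of about 11:1.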
And one thing to be aware of is that the software AAC
encoder has an internal sample rate converter,
and it may do a sample rate conversion if you specify
a bit rate which is too low
for the sampling rate of the input signal.
So, in the example I showed
before, the 32-kilobit Low Complexity one,
the sampling rate was actually brought down to 16 kHz.
Bit rate modes are another knob
you can turn to make some tradeoffs.
The simplest mode is the Constant Bit Rate mode, which
allocates a fixed number of bytes for every packet.
But it is therefore not flexible at all.
It doesn't adapt to the content.
So, encoding one second of silence takes as many
resources as encoding a complete symphony orchestra.
We recommend, and the default is actually, the Average
Bit Rate mode, which has much more flexibility in the sense
that it dynamically allocates resources to
every packet according to its content,
but with the constraint of trying to maintain
the average bit rate provided by the user.
And the most flexible mode, which is also known
from MP3, is the so-called Variable Bit Rate mode,
where there are basically no limitations on the bit rate.
VBR is expressed in terms of quality
instead of bit rate.
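To make the constant bit rate case concrete: an AAC Low Complexity packet spans 1024 samples per channel, so at a fixed bit rate every packet gets the same byte budget regardless of content. A sketch of that budget (the 1024-sample packet size is standard AAC-LC; the rates are my own example):

```c
#include <assert.h>
#include <math.h>

/* Bytes allotted to one CBR packet. Each AAC-LC packet spans
   1024 samples, i.e. 1024 / sample_rate seconds of audio, and
   CBR gives every such span the same share of the bit rate. */
double cbr_bytes_per_packet(double bitrate_bps, double sample_rate_hz) {
    const double samples_per_packet = 1024.0;
    return bitrate_bps / 8.0 * samples_per_packet / sample_rate_hz;
}
```

At 128 kbps and 44.1 kHz, every packet gets roughly 371.5 bytes, whether it holds silence or a full orchestra, which is exactly the inflexibility described above.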
Just to wrap up about the encoder, I just want to give a
few recommendations, and hopefully you will follow them.
You should choose the codec
according to the use case and its limitations.
If you want high-quality audio, like media playback, then
there is no doubt: you should always use Low Complexity.
For streaming kinds of applications, like streaming
radio, High Efficiency is the best choice, obviously
because of the significantly lower bit rates.
But if you want high-quality voice chat, then
you should use the Enhanced Low Delay codec.
And whenever possible, you should
favor the highest possible quality
by choosing the right codec,
the best encoding mode, and the highest possible bit rate.
And also, as James stated in the previous section,
lost information cannot be recovered.
What this means is that if you convert an MP3 to an AAC,
even at a higher bit rate, the quality will be degraded,
even though the bit rate is higher.
So, this is something you should really avoid.
If you don't have the source material,
avoid transcoding from one format to the next.
Now, about the way these AAC streams are
packaged: well, you know these .mp4 endings, which
is the MPEG-4 native file format.
There's also .m4a, which is MPEG-4 compatible
but adds iTunes-specific data chunks,
and this can also be used
to embed ALAC material.
And there's the preferred format, which is the
Core Audio file format with the ending .caf.
For streaming, you have the endings .adts or .aac, which is a
self-framing format, and this is what is used in SHOUTcast,
internet broadcast, and HTTP Live Streaming.
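"Self-framing" means a receiver can join mid-stream and resynchronize: every ADTS packet begins with a 12-bit 0xFFF syncword, and the header carries the 13-bit frame length. A minimal sketch of that framing, following the MPEG-4 Audio (ISO/IEC 14496-3) ADTS header layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True if the buffer starts with the 12-bit ADTS syncword 0xFFF,
   which occupies byte 0 and the top nibble of byte 1. */
int adts_sync(const uint8_t *b) {
    return b[0] == 0xFF && (b[1] & 0xF0) == 0xF0;
}

/* The 13-bit frame length (header bytes included) spans the low
   2 bits of byte 3, all of byte 4, and the top 3 bits of byte 5.
   It tells the player where the next syncword should be. */
size_t adts_frame_length(const uint8_t *b) {
    return ((size_t)(b[3] & 0x03) << 11) |
           ((size_t)b[4] << 3) |
           ((size_t)(b[5] >> 5));
}
```

A streaming client can scan for the syncword, read the frame length, skip ahead, and verify the next syncword lands where expected; this is what makes ADTS suitable for SHOUTcast-style broadcast, where there is no file header to consult.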
So, this concludes my talk.
I just wanted to point you to the next session, Audio
Development for iPhone OS Part 2, by Murray, who will go
into much more detail about
RemoteIO and the AudioUnits.