WWDC2016 Session 509

Transcript

[ Music ]
>> Hi. I'm Henry Mason,
an engineer working
on speech recognition for Siri.
Today we're incredibly excited
to announce a brand new API
which will let our speech
recognition solve problems
for your apps too.
A quick overview of what
speech recognition is.
Speech recognition is
the automatic process
of converting audio of
human speech into text.
It depends on the
language of the speech.
English will be recognized
differently
than Chinese, for example.
On iOS, most people
think of Siri
but speech recognition is also
useful for many other tasks.
Since Siri was released
with iPhone 4S,
iOS has also featured
keyboard dictation.
That little microphone
button next
to your iOS keyboard spacebar
triggers speech recognition
for any UIKit text input.
Tens of thousands of apps
every day use this feature.
In fact, about a third
of all requests come
from third party apps.
It's extremely easy to use.
It handles audio recording
and recording interruption.
It displays a user interface.
It doesn't require you to write
any more code than you would
to support any text input.
And it's been available
since iOS 5.
But with its simplicity
comes many limitations.
It often doesn't make sense
for your user interface
to require a keyboard.
You can't control when the
audio recording starts.
There's no control over
which language is used.
It just happens to use the
system's keyboard language.
There isn't even a way to know
if the dictation
button is available.
The default audio recording
might not make sense
for your use case and you
may want more information
than just text.
So now in iOS 10,
we're introducing a
new speech framework.
It uses the same underlying
technology that we use
in Siri and Dictation.
It provides fast
and accurate results
which are transparently
customized to the user
without you having to
collect any user data.
The framework also
provides more information
about recognition
than just text.
For example, we also provide
alternative interpretations
of what your users might
have said, confidence levels,
and timing information.
Audio for the API
can be provided
from either pre-recorded files
or a live source
like a microphone.
iOS 10 supports over 50
languages and dialects
from Arabic to Vietnamese.
Any device which runs
iOS 10 is supported.
The speech recognition API
typically does its heavy lifting
on our big servers which
requires an internet connection.
However, some newer devices do
support speech recognition all
the time, even without a connection.
We provide an availability
API to determine
if a given language is
available at the moment.
Use this rather than looking
for internet connectivity
explicitly.
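As a rough sketch, assuming US English is the language we care about, that availability check might look something like this:

```swift
import Speech

// A minimal sketch, assuming US English is the language we want.
// The initializer is failable and returns nil if the locale
// isn't supported at all.
if let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US")) {
    // isAvailable reflects whether recognition can be used right now
    // (it may be false with no network connection, for example).
    print(recognizer.isAvailable ? "Ready to recognize" : "Not available right now")
}
```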
Since speech recognition
requires transmitting the user's
audio over the internet,
the user must explicitly
provide permission
to your application before
speech recognition can be used.
There are four major steps
to adopting speech
recognition in your app.
First, provide a
usage description
in your app's Info.plist.
For example, your camera app, Phromage,
might use a usage description for speech recognition
of "This will allow you to take a photo
just by saying cheese."
Second, request authorization
using the requestAuthorization class method.
The explanation you provided earlier
will be presented to the user in a familiar dialog,
and the user will be able to decide
whether they want to give your app
access to speech recognition.
Next, create a speech recognition request.
If you already have a recorded audio file,
use the SFSpeechURLRecognitionRequest class.
Otherwise, you'll want to use
the SFSpeechAudioBufferRecognitionRequest class.
Finally, hand the recognition request
to an SFSpeechRecognizer to begin recognition.
You can optionally hold onto
the returned recognition task,
which can be useful for monitoring
and controlling recognition progress.
Let's see what this
all looks like in code.
We'll assume we've already
updated our Info.plist
with an accurate description
of how it will be used.
Our next step is to
request authorization.
It may be a good idea
to wait to do this
until the user has invoked
a feature of your app
which depends on
speech recognition.
The requestAuthorization class
method takes a completion
handler which doesn't guarantee
a particular execution context.
Apps will typically want to
dispatch to the main queue
if they're going to do
something like enable
or disable a user
interface button.
If your authorization handler
is given the authorized status,
you should be ready
to start recognition.
If not, recognition won't
be available to your app.
It's important to gracefully
disable the relevant functionality
when the user makes
this decision
or when the device is
otherwise restricted
from accessing speech
recognition.
Authorization can
be changed later
in the device's privacy
settings.
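As a minimal sketch of that flow, the authorization step might look roughly like this; updateRecordButton(enabled:) is a hypothetical UI hook standing in for whatever your app enables or disables:

```swift
import Speech

// Hypothetical UI hook; replace with whatever your app enables or disables.
func updateRecordButton(enabled: Bool) { /* update your UI here */ }

// The string shown in the permission dialog comes from the
// NSSpeechRecognitionUsageDescription entry in your Info.plist.
SFSpeechRecognizer.requestAuthorization { authStatus in
    // The completion handler has no guaranteed execution context,
    // so dispatch to the main queue before touching UI.
    OperationQueue.main.addOperation {
        switch authStatus {
        case .authorized:
            updateRecordButton(enabled: true)
        default:
            // Denied, restricted, or not yet determined: gracefully
            // disable the feature. The user can change this later
            // in the device's privacy settings.
            updateRecordButton(enabled: false)
        }
    }
}
```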
Let's see what it looks like
to recognize a prerecorded
audio file.
We'll assume we already have a file URL.
Recognition requires a speech recognizer,
which only recognizes a single language.
The default initializer for
SFSpeechRecognizer is failable,
so it'll return nil if the
locale is not supported.
The default initializer uses
the device's current locale.
In this function, we'll
just return in that case.
While speech recognition for this locale
may be supported, it may not be available
at the moment, perhaps due
to having no internet connectivity.
Use the isAvailable property on
your recognizer to monitor this.
Now we create a recognition request
with the recorded file's URL and give
that to the recognitionTask method
of the recognizer.
This method takes a completion handler
with two optional
arguments, result and error.
If result is nil, that
means recognition has failed
for some reason.
Check the error parameter
for an explanation.
Otherwise, we can read the
speech recognized so far
by looking at the result.
Note that the completion handler
may be called more than once
as speech is recognized
incrementally.
You can tell that recognition is finished
by checking the isFinal
property of the result.
Here we'll just print the
text of the final recognition.
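Put together, a minimal sketch of the file-based flow might look like this, assuming authorization has already been granted; recognizeFile(url:) is an illustrative helper, not part of the framework:

```swift
import Speech

// A minimal sketch, assuming authorization was already granted and
// `url` points at a recorded audio file on disk.
func recognizeFile(url: URL) {
    // The default initializer uses the device's current locale and is
    // failable, so it returns nil if that locale isn't supported.
    guard let recognizer = SFSpeechRecognizer() else { return }

    // Supported isn't the same as available: availability can change,
    // for example when there's no network connection.
    guard recognizer.isAvailable else { return }

    let request = SFSpeechURLRecognitionRequest(url: url)
    // For a one-shot file we can ignore the returned task.
    _ = recognizer.recognitionTask(with: request) { result, error in
        guard let result = result else {
            // A nil result means recognition failed; error explains why.
            print("Recognition failed: \(error?.localizedDescription ?? "unknown error")")
            return
        }
        // The handler may fire many times as speech is recognized
        // incrementally; isFinal marks the last callback.
        if result.isFinal {
            print(result.bestTranscription.formattedString)
        }
    }
}
```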
Recognizing live audio from
the device's microphone is very
similar but requires
a few changes.
We'll make an audio buffer
recognition request instead.
This allows us to
provide a sequence
of in-memory audio buffers
instead of a file on disk.
Here we'll use AVAudioEngine to
get a stream of audio buffers.
Then append them to the request.
Note that it's totally okay
to append audio buffers
to a recognition
request both before
and after starting recognition.
One difference here is that
we no longer ignore the return
value of the recognitionTask method.
Instead, we store it in a property.
We'll see why in a moment.
When we're done recording,
we need to inform the request
that no more audio is coming so
that it can finish recognition.
Use the endAudio
method to do this.
But what if the user
cancels our recording
or the audio recording
gets interrupted?
In this case, we really
don't care about the results
and we should free up any
resources still being used
by speech recognition.
Just cancel the recognition
task that we stored when
we started recognition.
This can also be done for
prerecorded audio recognition.
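Here's a minimal sketch of the live flow, assuming authorization and microphone access have already been granted; the SpeechRecorder class and its method names are illustrative, and audio session configuration and error handling are omitted for brevity:

```swift
import Speech
import AVFoundation

// An illustrative wrapper for the live-audio flow described above.
final class SpeechRecorder {
    private let recognizer = SFSpeechRecognizer()
    private let audioEngine = AVAudioEngine()
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    func startRecording() throws {
        let request = SFSpeechAudioBufferRecognitionRequest()
        self.request = request

        // Stream microphone buffers from AVAudioEngine into the request.
        // Buffers may be appended both before and after recognition starts.
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)
        inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }
        audioEngine.prepare()
        try audioEngine.start()

        // Keep the returned task so we can monitor or cancel it later.
        task = recognizer?.recognitionTask(with: request) { result, _ in
            if let result = result, result.isFinal {
                print(result.bestTranscription.formattedString)
            }
        }
    }

    func stopRecording() {
        // We're done: tell the request no more audio is coming
        // so recognition can finish.
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
    }

    func cancelRecording() {
        // The user cancelled or the recording was interrupted: drop the
        // results and free any resources still used by recognition.
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        task?.cancel()
    }
}
```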
Now just a quick talk
about some best practices.
We're making speech recognition
available for free to all apps
but we do have some
reasonable limits in place
so that the service remains
available to everyone.
Individual devices may
be limited in the number
of recognitions that can
be performed per day.
Apps may also be
throttled globally
on a requests-per-day basis.
Like other service-backed
APIs, for example CLGeocoder,
be prepared to handle network
and rate-limiting failures.
If you find that you're
routinely hitting your
throttling limits,
please let us know.
It's also important to be aware
that speech recognition can
have a high cost in terms
of battery drain
and network traffic.
For iOS 10 we're starting with
a strict audio duration limit
of about one minute
which is similar to that
of keyboard dictation.
A few words about
being transparent
and respecting your
user's privacy.
If you're recording the user's
speech, it's a great idea
to make this crystal clear
in your user interface.
Playing recording sounds and/or
displaying a visual recording
indicator makes it
clear to your users
that they're being recorded.
Some speech is a bad
candidate for recognition.
Passwords, health data,
financial information,
and other sensitive
speech should not be given
to speech recognition.
Displaying speech as it
is recognized like Siri
and Dictation do can also help
users understand what your app
is doing.
It can also be helpful for
users so that they can see
when recognition errors occur.
So, developers:
your apps now have free access
to high-performance speech
recognition in dozens of languages.
But it's important to
gracefully handle cases
where it's not available
or the user doesn't
want your app to use it.
Transparency is the best policy.
Make it clear to the user
when speech recognition
is being used.
We're incredibly excited
to see what new uses
for speech recognition
you come up with.
For more information
and some sample code,
check out this session's
webpage.
You might also be interested
in some sessions on SiriKit.
There's one on Wednesday and
a more advanced one later
on Thursday.
Thank you for your time
and have a great WWDC.