WWDC2001 Session 127
Transcript
Kind: captions
Language: en
Thank you. Thank you. So the good news is that — can you all hear me? Yep — the spoken language technologies, in case you haven't noticed, are in Mac OS X. We got on there! What we're going to do today is describe briefly the speech recognition and the speech synthesis that are there, give you guidelines about where to use them in your applications and why you would want to use them, and then actually lead you through the process of getting your applications talking and listening. So let's start off by doing a demo of what we've got over here. Can we switch over? Oh, I'm supposed to do it.
There we go. The user interface to speech recognition is this round window that you may have seen; this replaces the face that some of you might be familiar with in OS 9. It consists of three parts. The middle — you might be able to see it says "Esc" — shows you the listening mode. There are two different listening modes with speech recognition: what we call push-to-talk mode, and continuous listening. In push-to-talk mode it's only listening when you hold down a key. It says "Esc", which means that by default it's the Escape key; users can configure that. In continuous listening mode it's listening all the time, and optionally you can have it wait for a keyword, like "computer", before you speak your commands. I am using push-to-talk mode here, and I recommend that you do as well when you're demoing, so that when you're explaining things to people it's not trying to recognize things that you're actually saying to other people, not to the computer. What time is it? "It's 10:30." What day is it? "It's Thursday, May 24th." Show me what to say. Okay.
The other part of the feedback is this Speech Commands window, which has two halves. The top half, which is scrollable, shows what it has recognized — and, if it speaks back to you, what it says to you. The bottom half is also scrollable, and all of it is resizable by the user, at last — thank you, Cocoa. It shows what you can say. There are now disclosure triangles, so that it no longer scrolls off the bottom of the screen. The middle item there — I don't know whether you can see it — says Speakable Items. That shows the commands that can be spoken all of the time, no matter what application is running. The things I spoke at the start — "what time is it", "what day is it" — are items down in there. These are actually kept in the Speakable Items folder. Let's take a look at it. Open the Speakable Items folder. There it is. So any item that is in this Speakable Items folder can be launched by speaking it, and it's just the same as double-clicking on it: applications, aliases, documents, servers, URLs — anything that you can launch by double-clicking, you can now launch by speech on OS X.
The real power of this is that users can customize it to the way they work, by dragging their own items into the Speakable Items folder. In addition, just like in OS 9, the Speakable Items folder itself contains a folder called Application Speakable Items. This contains folders which are named after applications, and the items in those folders are only speakable when that application is in the foreground. That's shown in the Speech Commands window in the top item, the top disclosure triangle. So you see at the moment — let me close the others — that it says Finder, and the Finder is in the foreground. Let me demonstrate this in action: I'll switch to my browser, and as I do, you watch the items there change. Open my browser. There — so it now says Internet Explorer, and there are some different items there. This is an opportunity for you developers: you can make application-specific folders for your applications and populate those folders with scripts or other commands that control just your application. They won't be speakable when your application is not in the foreground. I'd like to encourage you to explore this by yourselves. I just want to show you one other thing. Hide this application.
And that is: you may have noticed we shipped one game, and that is Chess. In telling you this I should point out that we keep track of all the applications that you've launched since startup — we actually restarted the machine — and you can switch to any application, running or not, just by saying its name. Switch to Chess. So the Chess application — you might have seen it mentioned and demoed in the keynote on Monday, one of the keynotes, as an example of a good, Aqua-quality user interface — and one of the things about it is that you can control it by speech. Pawn d2 to d4. Knight b1 to c3. What's it going to do with that? Let's see — it's thinking. It's a slower thinker than I am, but it's a better player than I am, too. One of the things I like about this is there's one thing I can do here that I can't do when I'm playing with real people, and it's the following: I can say, take back move. My twelve-year-old son does not let me do that when I'm playing with him.
Okay, let's move on. So what speech is there in Mac OS X? Bear with me — there are a lot of people at this conference this year who are new to the Mac OS platform; some of you who are already familiar with the platform will know what we have, so this is just a brief mention for those that aren't familiar with it. We have speech synthesis and speech recognition. The speech synthesis will take any text and convert it into audible speech. There are 22 different voices; they range from adult male and adult female, through to voices that sing and sound like aliens, and novelty voices. And we have speech recognition. There are a number of characteristics of the speech recognition that are important. First, it's speaker independent: that means you don't have to train it to your voice — you just take it out of the box and it just works. The Mac was, and remains, the only computer platform I know of that you can just take out of the box, put on the desk, and command by voice. It's continuous speech: you don't have to pause between words. And it uses a far-field microphone.
This is a technical point, but a very important one. Speech recognition is very sensitive to background noise — so sensitive that all other recognizers that I know of require that the users purchase and use a noise-canceling, close-talking, head-mounted microphone. We tuned our recognizer to work with the microphones that are built into the iMacs and the other CPUs that Apple delivers. That means we are getting the background noise that other recognizers are not getting, so we have several layers of software to adaptively model, subtract, and deal with that background noise. But there are limits on how much background noise we can deal with, so we've tuned it to work well in the situations where most of our users are using their computers: in the office, at home. If you have users or customers who are using speech recognition in noisy environments, such as in classrooms, then that might be pushing the limits of the built-in microphones a little bit too far. One such environment, for example, is giving a presentation in an auditorium with a — what, 300-watt? — sound system that's providing a slapback echo from the back of the hall. So this is one of those environments where I've been using this head-mounted microphone. This one is produced by VXI; they have worked quite a lot with us over the last couple of years to optimize their microphones to work with our speech recognition, so you can direct your users to those as an alternative solution. Our speech technologies currently are US English only on OS X.
So what's new in speech synthesis? We've done a few things. We merged the MacinTalk 3 and the MacinTalk Pro codebases into a single codebase. This has a couple of advantages. One is that we have, at last, a totally new codebase: we have divested ourselves of deep, intricate, impenetrable legacy code and positioned ourselves to at last be able to fold in the research improvements that we've been making in speech synthesis over the last few years. So we've got ourselves onto a new platform, ready to go forward. An immediate benefit right now is that you get consistent behavior across voices. A lot of developers have said to me that they change from one voice to another and words are pronounced differently, or the intonation is different; that will no longer happen. All the Pro voices are now just the high-quality versions — so what was "Victoria, high quality" in OS 9 is now called "Victoria", and there is no lower-quality version in OS X. That means the speech is crisper, it's easier to understand, it's more robust in the presence of background noise, and people can understand it without it being so loud. And we've improved the pronunciation in a number of ways. We've enlarged the dictionary from about 20,000 words to about 120,000 — Jerome's work, down here. The morphological decomposition is now recursive; if you want to know what that means, ask us in the question time. It's fascinating, really. It's very cool — it's a multiplier on the effectiveness of the dictionary. And the letter-to-sound rules are now automatically trained on the new, large dictionary, rather than handwritten based on one linguist's intuitions.
For speech recognition we also have a benefit here. We have factored that pronunciation subsystem out of the speech synthesis and made it a separate subsystem that's now shared between speech synthesis and speech recognition. This means, first of all, it reduces the overall RAM use, which improves performance of everything running on the platform. It means the recognition is more accurate, because the pronunciation subsystem is expecting the correct pronunciations for more words. And it gives consistent behavior across speech recognition and speech synthesis. Developers have said to me: "I was prototyping some spoken commands for my application, and one of the commands was not recognized very well. I thought that perhaps the recognizer was not listening for the correct pronunciation, so to find out what the recognizer was listening for, I typed my command into your text-to-speech system — and sure enough, it was spoken with an incorrectly pronounced word." Well, up until today that's been irrelevant; now it is relevant. Now, if you want to know whether the recognizer is expecting the correct pronunciation, type the word or the command into speech synthesis, and it will tell you the way the recognizer thinks people will pronounce that word. The user interface has been completely revised, as I showed you. And there's an improvement to Speakable Items: we've added XML-based command files that let you associate a spoken command with a keystroke sequence. Kevin Aitken, later in this presentation, will show you that in some more detail.
I want to talk about why you should use speech in your applications. There are two classes of applications, I think, from our perspective. There are applications that are centered around speech, where speech technology is central to the user's experience and central to the value that the application delivers. And then there's a huge number of applications — I think most of you write them — for which speech is not centrally relevant at all. I'd like you to think about places that you can use speech in those applications as well. Chess is an example: speech is not really relevant to chess, but we added it, and people say, hey, that's cool. If you add speech to your application, then you'll increase the number of potential users, and that increases your market. For example, younger users will find your application more approachable, people with disabilities will be able to use your applications, and people who are less familiar with computation will be less scared of trying out your application. Speech is a very natural form of communication: we've all been talking and listening since we were, what, two years old? Wait — one and a half years old.
Speech input and output enable your application to move beyond the limits of point-and-click. There's nothing wrong with point-and-click; in fact, it's very good at letting you control things that you can see on the user interface and reach with a single gesture. But there are lots of things that you want people to control that they can't see to point to, to click on. Speech gives you a way to get past that. If you think about it, clicking is rather like grunting — it turns our back on what, about 200,000 years of human evolution? Because when I click on things, I'm just going — [grunts]. I'd like to think that we've come somewhat further than that. Similarly, speech output, I think, can be a lot better than just "beep" — and so many of us are still using alert sounds. Well, beep was the mentality of the 1960s, when all that computers had was a tiny little speaker. We've come forward since then; speech gives you a way to bring yourselves into the 21st century. And conversation is an appropriate modality for delegating tasks to a computer — we'll illustrate that a bit more shortly. So what are some of the ways that you can use speech synthesis in your application?
One is notifications. We recommend that you judiciously get the user's attention back if it has wandered away from your application. Some of you may have known about, or experienced, the talking alerts that we did in OS 9. Now there's a slider that lets you set the delay between an alert coming up and it being spoken; you may want that longer or shorter than the default, but the point is that you normally should not hear any speech — except when your attention wanders and the computer wants your attention, and then, if you don't respond, it gets your attention back. You can use notifications for asynchronous events. For example, in AOL Instant Messenger, if your buddies enter your chat room, it will announce that with our speech synthesis: it will say, "Buddy Smith just entered the room." You can use speech synthesis to give additional feedback for younger users. We have found — and our users have found — that applications for the 6-to-12 space become accessible to the K-through-6 space if they don't change anything but just read out the text messages that they're putting up on the screen. And you can use speech synthesis for proofreading. For example, say you have an application where people are entering data into a spreadsheet, and people are entering budget figures in a column: allow them to select that column and have it read back, so they can just check what they were entering to make sure there aren't any errors. Speech will give your application more accessibility for those with disabilities — I think that's pretty obvious.
Speech is really cool in games — you saw that we did it with Chess. There are a lot of games already that are using speech for things like cheat codes and changing weapons. Use speech for non-time-dependent control: I don't think it would be appropriate to use speech recognition in your game to say "fire now — quick, quick, quick!" — but you can do it for that as well, if you want to try it. In noisy games, these headsets are probably the right thing to do, so the speech recognition doesn't confuse your commands with the game audio. There are a lot of people who are successfully using speech in education applications, and I think there are a lot more of you with education applications that could take advantage of it. One example would be the DynEd product, which is using speech recognition for pronunciation correction for adults learning English — it's way cool; check it out. You can use speech to enhance the web browsing experience, for navigating within a browser. Those of you who have explored will have seen that we actually ship a couple of spoken commands for Internet Explorer that let you do some simple navigation. If you have a browser, you can do a much better job than we do by working within it.
For example, people could speak the links, jump to pages by topic, or have web pages read out. And there's a big opportunity with VoiceXML. The enterprise industry, which is moving more and more of its information onto web access, is now doing two or three different versions of all of their websites: they're doing the HTML version; they're doing a web version for portable digital assistants to access their stuff via wireless; and they're making VoiceXML versions, so that people can ring up the web page and have the information read out over the phone. The way this is done is with an extra set of tags in the web pages — which are just the thing that you need to interpret for speech access. And you could do that on the desktop with a Mac, using our APIs. It would be pretty straightforward, because the infrastructure and the hard work have all been done for you by the web developers.
We recommend that you think about using speech for form-filling. As an alternative to people filling out things with pop-up menus, you can now have people speak the contents of each field. That lets you use a constrained language model for each field, to increase the recognition accuracy. For example, the person could say "create a new customer record", and the computer could respond "what type?" and then narrow down its search to just the alternative customer record types. The person could say "okay, corporate account"; then the next field would be payment schedule, and the person could say "30 days" — and at each step the recognition model is changed again, to listen for just the possible payment schedules.
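In code, that per-field constraint looks roughly like this — a minimal sketch against the Carbon Speech Recognition Manager calls that Kevin and Matthias walk through later in this session; the model name and phrases here are invented for illustration:

```c
#include <Carbon/Carbon.h>

// Hypothetical sketch: constrain the recognizer to one form field's
// valid answers. Assumes the system and recognizer were created as
// shown later in this session.
static void ListenForPaymentSchedule(SRRecognitionSystem system,
                                     SRRecognizer recognizer)
{
    SRLanguageModel fieldLM;
    SRNewLanguageModel(system, &fieldLM, "<payment>", 9);

    // Only the phrases that are plausible answers for this field.
    SRAddText(fieldLM, "thirty days", 11, 0);
    SRAddText(fieldLM, "sixty days", 10, 0);
    SRAddText(fieldLM, "ninety days", 11, 0);

    // Swap the constrained model in; accuracy goes up because the
    // recognizer now listens for just these answers.
    SRSetLanguageModel(recognizer, fieldLM);
}
```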
Okay. There are a lot of tasks where people's eyes are busy and their hands are busy. For example, you're in a graphics program, you're drawing, you've got the mouse down, you're putting a line across an object, and you want to move it around, send it to the back, or change the brush size. When the eyes are busy and the hands are busy, speech gives you another way for your users to control your app. So at this stage I'd like to invite Sal Soghoian up. Sal Soghoian is the AppleScript product manager, and he's going to show you some way cool ways that he's been using speech.
So, is this on? Great. Hi. It's amazing that I got up this early for this whole thing — those that know me know that it's an impressive feat. What we're going to be showing today — can you switch me to this? — is how to use AppleScript with speech. One of the best integrations on Mac OS is the ability to use these two technologies together. On Mac OS 9 we introduced the ability to have a script listen for a response and, based upon the user's response, perform a different set of actions, incorporating a technology called the Speech Listener. So I'm going to show a couple of scripts today that use this technology on Mac OS X, and both scripts will involve a conversation to get a task done. The first one is a rather straightforward example where I ask the script for some music, it prompts me with a series of questions, and we have some music played. So let's see if my voice is back where it should be, and we'll try this out. Some music, please. "Which artist or category?" Christine Kane. "Which song from Christine Kane?" Tucson. [Music] Stop playing songs. So in this example, the script starts up, gets the information about which artists are available, holds it in memory, and says "which artist or category?" When I say "Christine Kane", it matches that, then queries and finds out which songs are available by Christine Kane, holds that in memory, and says "which song?" I said "Tucson", that matched, and then it had the song play with iTunes. So this is a simple example — you've got to program a certain amount of grief into these things, just to keep you honest. So there's an example of being able to carry on a conversation with a script. It's a limited conversation, but it is a way of gathering information and moving forward. Hide this application.
In the next example I'm going to use a program from ByUp Systems — it's called GoatReap — and one of the things this application does is access information over the Internet. I'm going to use a script acting as a personality, called Victoria, and Victoria will act independently of the Speakable Items speech recognition, in that she will have her own set of scripts that she's going to use in conversing with me. So here we go; let's try this out and see if she's awake. Victoria. "Yes, Sal?" Show my newspaper. "Here you go. Something else?" Clear all stories. "All stories have been removed from your newspaper. Anything else?" Add multiple stories. "Ready." Motley Fool. "Adding Motley Fool. Ready." Apple stock quote. "Adding Apple stock quote. Ready." Apple top story. "Adding Apple top story. Ready." That's all. "Anything else?" Not at this time. "Anything else?" Not right now. "Goodbye." So in this instance, when the script called Victoria — which exists in Speakable Items — loads up, it goes to a subfolder in the Speakable Items folder called Tasks, and within that folder are individual scripts, the names of which she holds in memory. When I say "show my newspaper", she loads that script and then executes it — so you have a script running a script. And all of the commands that Victoria was handling are not included in the standard Speakable Items commands; they are a separate set, so you can create these individual personalities. So that's just two examples of how you can use AppleScript and speech, and if you're interested in how to do this, the AppleScript website has a complete overview — it has an AppleScript guidebook on how to use speech and AppleScript together. Thank you.
[Applause]
Some issues: if you're going to include speech in your application, there are a few things that you need to keep in mind. Educate your users about how to speak. A good example: go to the Speech preference panel and turn on Speakable Items, and you'll see a sheet come down showing how we explain to users that they shouldn't pause, and so on. Let them know about background noise being a problem; you might want to refer them to head-mounted microphones. We trained the speech recognizer on North American English, so officially that's what we say we support. It happens that it is somewhat forgiving: I'm Australian; in our group we have Jerome from France; we have Devang — where are you, Devang? — a native Gujarati speaker; we have Matthias, whose native language is Swiss German; we even have Tom from the Bronx. And it understands all of us. But there again, there are limits. And localization is an issue: currently, as I said, we are US only. So, it's time to code — speech programming 101. I'd like to invite Matthias Neeracher up on the stage. Matthias?
Okay. Now that Kim has told you what to do with speech, Kevin and I are going to talk about how to do it. Using our speech technologies on Mac OS X is pretty simple. We are installed on every install of Mac OS X. To use us, just link with the Carbon framework — or, if you have a CarbonLib-based application, with CarbonLib. Our APIs are identical for Cocoa and for Carbon — the same APIs that you used on Mac OS 9 — and you can use them from Objective-C, from C, or from pretty much any language that we ship on Mac OS X. Let's start with speech synthesis. Let's say you want your application to say something. How difficult can this be? Turns out it's not difficult at all: it's a single line, and you will get "Hello, world."
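That single line is SpeakString. A minimal sketch in C against the Carbon framework (the busy-wait is only so a bare test program doesn't exit before the asynchronous speech finishes):

```c
#include <Carbon/Carbon.h>

int main(void)
{
    // SpeakString takes a Pascal string and speaks it with the
    // user's default voice; it returns before speech completes.
    OSErr err = SpeakString("\pHello, world!");
    while (err == noErr && SpeechBusy() > 0)
        ;  // wait for the utterance to finish
    return 0;
}
```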
Okay, that was simple enough. If you want to have a little bit more control, you open a speech channel, giving a voice. This can either be something that you get from a menu you offer the user or, if you pass in NULL, you get the default voice — you probably shouldn't hard-code a voice unless you know exactly why you want to do that. Then you can adjust parameters as you like them, and once you have them to your liking, you can speak the actual text by calling SpeakText. All of these calls are asynchronous, so it will return control to you before the text is entirely spoken. We offer a lot of control: you can control the speech rate — to speak slower for younger users, for instance, or quicker in a game situation — and you can control the pitch base and modulation, so it sounds more lively, and the volume, to customize the way speech sounds in your application. We also give you callback routines, so if you have a screen reader you can highlight words on screen as they are spoken, or if you have an animated character on screen you can animate the lips of the character as the phonemes are spoken. You can see many of these options in action in our Cocoa speech synthesis example, which ships on the Developer CD.
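Put together, that flow looks roughly like this — a sketch; the rate value is arbitrary, and a real application would keep the channel around rather than tearing it down immediately:

```c
#include <Carbon/Carbon.h>
#include <string.h>

static void SpeakSlowly(const char *text)
{
    SpeechChannel chan = NULL;

    // NULL voice means the user's default voice; don't hard-code
    // a voice unless you know exactly why.
    if (NewSpeechChannel(NULL, &chan) != noErr)
        return;

    // Adjust parameters before speaking: the rate is a Fixed value
    // in words per minute (slower here, e.g. for younger users).
    Fixed rate = Long2Fix(150);
    SetSpeechInfo(chan, soRate, &rate);

    // Asynchronous: SpeakText returns before the text is spoken.
    SpeakText(chan, text, strlen(text));

    // ... later, once SpeechBusy() reports idle, clean up:
    // DisposeSpeechChannel(chan);
}
```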
Most of these controls you actually don't have to write any code for, because you can simply embed them in the text. For instance, if you want to emphasize the word "next" in a sentence, you just embed an emphasis command in front of the word. [synthesized speech]
because to really have speech synthesis
work for you the best it can
you should customize what is spoken
basically your application just knows a
lot more about how things should be
spoken then our engines can know by
default so there are a number of things
you can and should do with your
application first of all you should
filter the text that is passed on to the
text-to-speech engine for instance if
you have a stock ticker application and
you come across the acronym AAPL what
you should do is tell the speed texture
speech engine to say Apple computer
instead second you should customize the
pronunciation of words that don't come
out right and last of all you should
customize the intonation of what is
spoken now we try to have a huge
dictionary as cam already said but even
the biggest dictionary cannot possibly
handle all the words and not all proper
names especially for instance my first
name is tricky it's certainly our system
cannot pronounce it by default in the
past some developers have just used
funny spellings to get it to work
approximately the right way like this my
name is Monty here sounds almost right
but we don't recommend this because if
you use words that are not
part of the English language or strange
combinations we might change in the
future how this is pronounced second
this is not a very precise way of
specifying what you want said so instead
what you should do is use embedded
commands to temporarily switch to
phoneme input using the phoneme notation
we describe this notation in inside
McIntosh speech it's explained with
examples on a page doesn't take very
long at all to learn how to use it and
the result is something like this my
name in whom RTL sounds somewhat better
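The embedded command for that mode switch is inpt. A sketch — the phoneme string here is only my approximation, not the one used in the session; the exact symbol set is in Inside Macintosh: Speech:

```c
// Switch to phoneme input for the tricky name, then back to text.
// "mAAtIY1AXs" is an illustrative guess at the phoneme spelling.
const char *kIntro =
    "My name is [[inpt PHON]] mAAtIY1AXs [[inpt TEXT]].";
SpeakText(chan, kIntro, strlen(kIntro));
```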
Second, you should customize the intonation of the text you pass on to text-to-speech, because the written words alone are not always enough to convey the meaning. For instance, take the sentence "John only introduced Mary to Bill." You can read it as "John only introduced Mary to BILL" — he didn't introduce her to anybody else. You can read it as "John only introduced MARY to Bill" — he didn't introduce Caroline to him. Or you can read it as "John only INTRODUCED Mary to Bill" — he didn't ask her to marry him. So these distinctions can be very important, and you should annotate the text you pass on. Our system tries the best it can to find out how a sentence should be spoken, but this can be very difficult, if not impossible, to do in the general case. Your application has domain knowledge of much of the text that is spoken, and has the potential to do this much better. For instance, take a flight reservation system: at the end, it gives a confirmation text. I'm going to play you two different versions of this confirmation text — the first not annotated at all, and the second annotated — and I'm not going to say anything between the two versions. "Your first flight is with Alaska Airlines flight 2762, departing from San Jose on Monday, May 24th, at 6:10, landing in San Francisco at 7:10 p.m. Thank you for using TT's Travel." "Your first flight is with Alaska Airlines, flight 2762. It departs from San Jose on Monday, May 24th, at 6:10 p.m., landing in San Francisco at 7:10 p.m. Thank you for choosing TT's Travel!"
So — did you hear a difference? Raise your hand if you heard the difference between the two versions. Excellent; I see that the hands are up. We did this with quite a bit of annotation, and it can basically be distilled into five principles for improving the intonation of the spoken output. The first principle is: let the user catch up, by adding pauses at strategically important points — at punctuation, wherever appropriate. And "appropriate" does not mean appropriate in the sense of English grammar: nobody is going to see what is spoken, so feel free to add a comma if you think a pause is necessary at a point. Break larger sentences up into smaller ones, and insert some explicit pauses with the silence command at major pause points. So in our example, we added punctuation and we added pauses; all of this lets the user catch up. The second principle is: let familiar things go into the background, by de-emphasizing repeated words. For instance, if the minutes of two times are identical, you should de-emphasize the second instance. Also de-emphasize items inferable from your overall application scenario: you know that you're booking a flight, so you don't have to emphasize that word. The third principle is: liven it up — simply by adding an exclamation point at the end. The fourth principle is: focus the user's attention, by emphasizing the important words. This can be done with an emphasis command, or simply by inserting a colon before the most important item. And fifth — and maybe most important — use paragraph intonation. Group your sentences together into intonational paragraphs: on the first sentence in each paragraph you raise the pitch base and increase the pitch modulation, and then reset them for the rest. This makes quite a bit of difference for longer texts that are read out. And between paragraphs, add an extra pause.
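Pulling the five principles together on text like the demo's — an illustrative annotation, not the session's actual markup; the command values are ones to experiment with:

```c
// [[pbas]] / [[pmod]] raise the pitch base and modulation for a new
// intonational paragraph; [[slnc]] inserts silence in milliseconds;
// [[emph +]] / [[emph -]] emphasize or de-emphasize the next word.
const char *kConfirmation =
    "[[pbas +2; pmod +2]] Your first flight: [[slnc 300]] "
    "Alaska Airlines, [[emph +]] flight 2762. [[pbas -2; pmod -2]] "
    "It departs from San Jose, on Monday May 24th, "
    "at [[emph +]] 6 [[emph -]] 10 p.m. [[slnc 400]] "
    "Thank you for choosing TT's Travel!";
```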
So, to summarize: you should customize the pronunciation of the words that you say — if you notice that a word you have hard-coded in your application gets mispronounced, use phonemes to get it pronounced correctly — and you should customize the intonation of the text that is said, which helps the user understand the text a lot better and gives the user a much better overall speech experience. Now let's move on to speech recognition, with my colleague Kevin Aitken — who is not lazy at all, I might add. Can y'all hear me? Yes, I can hear myself. Well, my manager, Kim, has actually assured me that this is not necessarily a commentary on my work ethic, so I feel better. But if you're like me — once in a while I do feel a little bit lazy, and at those times I just love having a simple solution to be really productive. So in the next 15 minutes or so I'm going to show you two easy methods that you can use to add spoken commands to your application — so that, hopefully, with an afternoon's worth of work, you can walk into your manager's office or your co-worker's office and say, "Oh, by the way, our Mac OS X application understands spoken commands." So let's get started.
As I mentioned, I'm going to present two methods. The first method is to use the Speakable Items application that's built into Mac OS X, as Kim demoed at the beginning. It's designed for end users, so they can easily add spoken commands to any application — and you as a developer can use it too. It's great because you don't have to write any speech code: Speakable Items takes care of that for you, by taking a list of items, building the language model, and then waiting for the recognition result. And as Sal showed in his demonstration, it understands how to execute AppleScripts, so you can easily send Apple events to your application — or even to other applications, for that matter. The second method I'm going to describe is to use the Speech Recognition API. You may be familiar with this from Mac OS 9. It gives you a little bit more flexibility: you can have multiple command lists, so you can have one set of commands for when the user has selected an object and another set for when they haven't. And I'm going to show you, in a little bit, an example that gives you a really easy three-step approach to adding spoken commands to your application using the Speech Recognition API.
that both of these methods have in
common are commands and so let's talk
just for a second about what makes a
good command
well commands are like menu items but we
suggest that they're normally from three
to six words long the longer the better
generally because the recognition system
can understand them easier and they're
more unique amongst your other commands
but you don't want to too long to where
the user has a hard time speaking and
fluently also you should avoid single
words and especially words like hot cut
and quit because those are oftentimes
Mis recognized or they sound a light to
the recognition system the other
important item is that you should test
your commands
especially test them together to make
sure that they're confused with each
other and test them with the global
commands that are shipped with Mac os10
and a prototype your commands you can
use speakable IMS or the SR language
modeler applications
you'll find on a developer CD okay let's
Okay, let's talk about method one in a little more depth — using the Speakable Items application. The first thing, as I mentioned, is you want to create a number of items. You can easily do this by bringing your application to the foreground and speaking the command "make this application speakable." That creates a folder for your application in the Speakable Items directory, inside the Application Speakable Items folder Kim showed you earlier, and then you can begin adding your items, as he showed you. Once you have all your items together, the next thing is to bundle them inside your application — I'll show you in a little bit how you can use Project Builder to easily copy these files into your application bundle at the time you build it. And then, finally, you need to install those items. We really suggest that you install them at runtime. This gives you a couple of added benefits. It allows your application to be drag-and-drop installed, so in order to support Speakable Items you don't have to have a separate installer. And it's great for Mac OS X's multiple-user support: say that after your application has been installed, the administrator creates new users — those new users will just get the speakable items the next time they run your application, because you'll automatically install them. Let's talk about items real quick. Kim briefly mentioned these: an item is basically a file that can be opened, but there are really two types that are best for you developers. The first is AppleScript files, which, as I talked about, allow you to send Apple events to your application. The other, which Kim mentioned, is the new XML-based command files; these allow you to send keyboard events to your application, so that you can activate menus or controls via keyboard shortcuts.
Well, one of the things that I wanted to do in preparing for WWDC this year was to create an example that really shows how easy it is to add spoken commands to your application — I really wanted to make it as simple as copy, paste, go. As you saw in Sal's demonstration, he used iTunes, and that's a pretty good real-world application: it's shipping, and you can see how spoken commands are integrated into it. Well, since I can't ship or give you the source code for iTunes, I thought, let me create a clone of it. So I've named mine FauxTunes — courtesy of our French person in the group — and it's up on the web right now. You can go grab it at this URL, either this week or when you get back, and start taking a look at it; I believe it really shows an easy way of getting going. So let me go show it to you. Oh — I need to switch; like everyone else, I've forgotten to switch. Okay, let me hide this real quick. So let me show you the application, my clone of iTunes, real quick. It has the identical menu items, and the window is, you know, pretty close. If you haven't done anything in Cocoa or Interface Builder: I basically took 15 minutes to throw all the menus in there and lay out the window, and pretty much got an automatically resizing window. It's really awesome. Okay, let me show you that it really is listening for commands. Show Speech Commands window — okay, that of course displays the window. Get song info. Get song info — there we go. As you see, it doesn't do anything; it just shows the command down below. Okay, so it really is FauxTunes.
tunes okay let's switch into project
builder and I'll show you how this is
set up let me try the command switch to
project builder yeah okay go I'll move
those out of the way so we can see the
window here I'll put this down since
I'll be giving it more commands okay so
let me show you real quick what this
basic object looks like that manages the
window it's really simple has a couple
of instance variables and then it pretty
much has a method for each one of the
menu items and a couple extras to handle
some of the controls in the windows so
it's really simple all these methods do
is basically display at the bottom of
that window what's happening okay so as
I mentioned the first step was creating
items so we've created those items the
next step is that we need to include
them in our application bundle so what
we'll do is we'll go to at the active
target and the way we do this is we use
a file copies build phase and so we
include those down here let me show you
where you do that and here if you
haven't already seen it you got a new
build phase and you get a new copy build
phase it's not highlighted right at the
moment because I haven't selected a
particular item but as you can see I've
included the items here I have two sets
of of items I have the command files
it's a majority of them and so I'm
saying place these any folder name
command files inside the inside the
resources directory of the application
bundle and then I have a single Apple
script file that I've included in here
as well okay
So now Project Builder has made that easy: the items are copied inside the application bundle when I build. The next task is to install them at runtime. We've tried to simplify this a lot by providing a single routine that you can call. Here it is: "install speakable items for this application." You pass in the names of the folders that you placed your items into in your Resources directory, and then you call it. It's smart enough to go out and create the folder — and if the folder is already there, it doesn't create it again. In the case of this demo, I actually call this routine every time at startup, but you could choose to call it lazily later, or in response to the user specifying it in a Preferences dialog or something like that. The rest of this file is a tutorial — documentation, in more depth than I've talked about here, about creating the items, adding those items to your application bundle, how to call this routine, and special notes. So it's all there. Okay, let's go back to the slides for a second.
Okay, great. So let's touch on the second method for a minute here: using the Speech Recognition API. Just like in the previous example, I wanted to make it a copy-paste-and-go solution, so we've tried to provide some really simple routines that you can use. What I've done is broken the process down — setting up for recognition and handling the recognition — into three easy steps, and you'll see in a minute that we provide a single routine to execute each one of these steps. So let's talk a bit about what the recognition process basically looks like. This is a very simplified, graphical version of it, just to cement in your minds what this example is doing and what generally happens during recognition. Step one: we provide a routine that basically sets up all the recognition objects — it instantiates a recognition system, a recognizer object, and a language model object, and hooks them all together — so it's all set up and ready to go. That's something that virtually every developer has to do when they adopt the Apple Speech Recognition API. The second step is that you need to tell it what commands to listen for. In the routine that we give you, you pass in the recognizer object and an array of commands: it gives the recognizer object those commands to display in the Speech Commands window, and then it gives the commands to the language model object, so the recognition engine knows what to listen for. And then, for the third step, you need to implement one Apple event handler: the speech-done Apple event handler. So now your application is just sitting there — you've set it up, it's running, ready to handle the user's spoken command. When the user says something, the recognition engine passes it off to the recognizer object, which then sends an Apple event to your application — in this case, "play this song." We provide you a single routine: you pass in the Apple event that you get, it returns an ID, and then you can map this ID, through a switch statement or a table lookup or however you like, to a particular action or routine.
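Condensed into real code, the three steps look roughly like this. This is my sketch of what those helper routines boil down to — the sample's actual routines differ (it maps commands to IDs via refCons; here I just pull the recognized text), and error handling is trimmed:

```c
#include <Carbon/Carbon.h>

static SRRecognitionSystem gSystem;
static SRRecognizer        gRecognizer;
static SRLanguageModel     gCommandsLM;

// Step 3: handle the speech-done Apple event.
static pascal OSErr HandleSpeechDone(const AppleEvent *theAEvent,
                                     AppleEvent *reply, long refCon)
{
    SRRecognitionResult result;
    DescType actualType;
    Size     actualSize;
    OSErr err = AEGetParamPtr(theAEvent, keySRSpeechResult,
                              typeSRSpeechResult, &actualType,
                              &result, sizeof(result), &actualSize);
    if (err == noErr) {
        char text[256];
        Size len = sizeof(text) - 1;
        if (SRGetProperty(result, kSRTEXTFormat, text, &len) == noErr) {
            text[len] = '\0';
            // ... dispatch on the recognized command text here ...
        }
        SRReleaseObject(result);
    }
    return err;
}

static OSErr SetUpSpokenCommands(void)
{
    // Step 1: create the recognition objects and hook them together.
    OSErr err = SROpenRecognitionSystem(&gSystem,
                                        kSRDefaultRecognitionSystemID);
    if (err == noErr)
        err = SRNewRecognizer(gSystem, &gRecognizer,
                              kSRDefaultSpeechSource);
    if (err == noErr)
        err = SRNewLanguageModel(gSystem, &gCommandsLM,
                                 "<commands>", 10);

    // Step 2: tell the recognizer which commands to listen for.
    if (err == noErr) {
        SRAddText(gCommandsLM, "play this song", 14, 1);
        SRAddText(gCommandsLM, "get song info", 13, 2);
        err = SRSetLanguageModel(gRecognizer, gCommandsLM);
    }

    // Step 3: register the handler, then start listening.
    if (err == noErr)
        err = AEInstallEventHandler(kAESpeechSuite, kAESpeechDone,
                  NewAEEventHandlerUPP(HandleSpeechDone), 0, false);
    if (err == noErr)
        err = SRStartListening(gRecognizer);
    return err;
}
```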
Okay, so let's go back to Project Builder and I'll show you how this is done. By the way, this single project has two targets: one for building the Speakable Items-based version of FauxTunes, and one that uses the API. So it's all in here — you can build either one and, to some extent, compare and contrast. We've provided, in the SpeechRoutines.c and header files, those three routines. So let's see — here it is. You basically call the setup-speech-recognition routine, which accomplishes step one. Then you call the add-commands routine with a list of your commands — you can call it over and over as you need to, to change the set of commands, if you want to be able to provide the user different commands based on whether they've selected a particular object in your interface, or they're in a particular portion of your application. And then, inside the Apple event handler that you've created, you call this one routine, passing in the Apple event, and you get back the ID that was connected with the original command you set up. Okay, so let's look at the application delegate object where I do this. What I've done is use an application delegate that gets the applicationDidFinishLaunching method after Cocoa has brought my app up and it's basically running, and I pretty much do it all up front, just for this demonstration, so that the code is all in one place. The way I approach it is: I create a simple table that has my command name and then the method to be called. That's the way I'm doing it here — I actually do it programmatically, at the bottom of this file; you could do it with an XML file, you could do it differently, it's up to you. I register the speech-done Apple event handler by calling AEInstallEventHandler; then I call the routine to set up speech recognition that we provide in that utility file; I create an array of the commands, because I need that to pass to the add-commands routine we provide; and then, finally, I call SRStartListening, which is part of the API — and now the application is up and running. Page down here: this is the Apple event handler, just a couple of lines of code. I call the routine that we provide in the utility file to convert the Apple event into the ID, and then — since this one is in Objective-C — I just use it as an index into a table and go off to the appropriate method selector. So that's pretty much it.
I really urge you to go out and grab this and see how it can be applied to your application. So let me summarize real quick. We saw that the first method, using the Speakable Items application, is really easy because you don't have to write any additional speech code: all you need to do is include those items inside your application bundle and then install them at runtime with the routine that we provide. The second method is using the Speech Recognition API; as I explained, that's an easy three-step process, and we give you a single routine to execute each one of those steps. So I've discussed the lazy way to do it. There's more you can do with the Speech Recognition API, and Matthias is going to come up here and talk about what to do if you're feeling a little bit more ambitious. Thank you. — Thank you, Kevin.
My manager assures me that "overachiever" does not necessarily apply to my performance either. So, Kevin has shown you how to get 95% of the benefits with 5% of the work. However, there are some situations where you might need the extra 5%. One example of this is Chess. You've seen it demoed; it ships with Mac OS X as of this week. You can get the source code from this URL — you will find the speech-related code in ChessListener — and Chess illustrates important lessons in language model design. Now, you might think the language model of chess is not very complex, right? "d2 to d4" — all simple sentences. The problem is, if you just do this as a list of all possible moves, it gets out of hand pretty quickly. If you do the math, you find that a model with all the possible moves ends up with more than 24,000 of them, and clearly this is unacceptable. It doesn't help accuracy — plus, you're not doing the user any favor if you're listening for things like "rook a1 to h8"; it won't do them any good at all. In fact, it turns out that in each chess position there are only 20 to 30 moves that are actually legal, so there is no reason whatsoever to include the extra moves. Performance is going to go way up, and user satisfaction is going to go up, if you only include legal moves. However, you shouldn't over-constrain your model: there are some illegal moves which are still plausible. For instance, people frequently put their king into check accidentally — even experienced chess players — so you want to leave a move like that in, so you can say "I heard you, but I won't do it." Another technique that we use in Chess is prefabricated parts. There are not so many words that are actually used in this language model, so we fabricate them at startup, by calling SRNewWord to get the word objects; then, when we come to a position and see that, for instance, "pawn d2 to d4" would be appropriate, we simply grab these prefabricated objects and paste them together to form the command.
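In code, the prefabricated-parts idea looks something like this — my sketch, not the actual ChessListener source; the names are illustrative:

```c
static SRRecognitionSystem gSystem;  // from SROpenRecognitionSystem
static SRWord gTo;                   // prefabricated once at startup:
//   SRNewWord(gSystem, &gTo, "to", 2);

// Paste one currently legal move into the model, e.g. "pawn d2 to d4".
static void AddLegalMove(SRLanguageModel movesLM,
                         SRWord piece, SRWord from, SRWord to)
{
    SRPath path;
    SRNewPath(gSystem, &path);         // one spoken command
    SRAddLanguageObject(path, piece);  //   "pawn"
    SRAddLanguageObject(path, from);   //   "d2"
    SRAddLanguageObject(path, gTo);    //   "to"
    SRAddLanguageObject(path, to);     //   "d4"
    SRAddLanguageObject(movesLM, path);
    SRReleaseObject(path);             // the model keeps its own reference
}
// When the position changes: SREmptyLanguageObject(movesLM), then
// re-add just the 20-30 moves that are now legal.
```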
So, to summarize: for complex language models, you will want to constrain your language model to only those commands which are plausible in each situation, and consequently adapt the language model when the situation changes. Furthermore, in very complex situations, you might consider using prefabricated language objects to quickly build your list of commands. To help you build these language models, we've included a tool called SRLanguageModeler, which helps you quickly experiment with different language models and see how well they work for your users. SRLanguageModeler allows live microphone tests, for rapid turnaround if you want to try something, or grab somebody into your office to have them try something. And if you want to do systematic, scientific tests, you can record your users saying these commands into AIFF files and feed those files into SRLanguageModeler to get a systematic evaluation of how well a model performs. This tool, and all of our sample code, ships with Mac OS X; you will find it on the Developer CD in the Examples/Speech folder, and we encourage you to start with that if you want to do anything with speech. So let me now turn our session back over to our fearless leader, Kim Silverman.
For some values of fearless. So, to summarize: speech synthesis and speech recognition are there. We've given you a conceptual overview of the APIs; I've tried to give you some ideas about why you would want to use them, and Matthias and Kevin have followed up with how to use them well. So I want to encourage you all to speech-enable your apps. At this stage I'd like to single out just a couple of developers who've been doing this. You might remember ThinkingHome, which got an Apple Design Award last year. They've ported their application to OS X, and they've added speech to it, and they've found that it adds a lot of value for their users. I was talking to one of their developers on the phone yesterday; he said they're getting a lot of feedback saying that users just think it's great when they can walk into the room and say things like "dim the lights in the living room" or "turn the upstairs thermostat to cool." And the folks who are working on OmniWeb — you may have seen their cool browser on OS X — have been experimenting with using speech to integrate it into the browsing experience, for those that don't want to, or can't, deal with the keyboard and the mouse. We saw a prototype of that this morning; it's looking really good, and they've got some great ideas about how to do it.
So now I'm going to give you the good, better, and best guidelines for how to put speech into your apps. Good is the easy way: use speech recognition to allow people to speak the visible controls on the screen — the things that they would normally manipulate by hand, let them say as well — and use speech synthesis to speak simple alerts and alert panels when they come up. You can do these, by the way, either with the Speakable Items framework or by calling the API directly. If you want to go better, then use delegation. I've mentioned this a few times, and you've probably inferred what I mean. Normally, when we interact with a computer, we specify explicitly each step we want the computer to take in order to reach a goal that we have in mind. With delegation, we delegate the goal to the computer and have it figure out the steps for how to get there, and then execute them for us. So group what would otherwise be multiple interactive actions into one spoken command. For speech synthesis, start to customize your text using the guidelines that Matthias went through, and read back information to your users. And if you want to be best, then move to interactive spoken dialogues, like you saw Sal demonstrating — where you delegate a goal to the computer, or your agent, and it comes back and asks you questions to refine that goal — and think about using speech for form-filling. So that's it. Thanks a lot for coming.