WWDC2003 Session 402
Transcript
Kind: captions
Language: en
So we're going to talk about the OS X speech technology today. To introduce myself: I'm a Principal Research Scientist and the manager of Spoken Language Technologies. Spoken language technologies means, for us, the speech technologies that you've heard of, speech recognition and speech synthesis, and also language technologies, because we believe you can't deal with speech without dealing with language. So, for example, the junk mail filter is one of our spoken language technologies, and if you heard the State of the Union address about OS X, the Japanese input method is now using our speech technologies, and that's why it's doing so much better than Windows at the moment. So before we start,
I figured I'd go straight into a demo. I'm going to show you some of the ways speech is usable in Panther, and you guys being developers, I'm allowed to give you the caveat that this is beta software, right? I've been through this demo lots of times, and it always works, but because speech sits on top of every other component of the operating system, if anything goes wrong underneath, something could break here, so I'd ask you to bear with me.

The first thing you should do when turning on speech recognition is let the machine adjust to the acoustic environment in which it's being used. We have a speaker-independent recognizer; I'll talk about what that means a bit more later, but even though it's independent of who you are, it does need to sample and adapt to the acoustic environment. We have made it track the acoustic characteristics of most places where people would use it, but this room is outside of the parameters for which we developed it: the distance between these walls, combined with the positions of the PA system speakers, means the durations of the echoes are a little bit outside of the spectral range that we're looking at. So I'm going to need to adapt it. What you do, any time you use speech recognition in a new place, is go to the Speech preferences, to the Speech Recognition tab, Listening,
and click on Volume. This ostensibly lets you set the volume on the microphone, and it also gives you a chance to make sure that you've got the right microphone connected. Our speech recognition works with a far-field microphone, meaning a desktop microphone that's about this far from you. In this particular situation, because of these echoes, I'm going to use the alternative, which is a headset microphone, which I'll put on now. You can purchase these pretty cheaply from the Apple stores. There are a few brands; this particular one is by VXI. Your task is just to read down this list of commands. As each command is recognized, it will flash; if it doesn't flash, you just repeat it until it does, and if it doesn't flash after one or two repetitions, you go on. So I'll do that now. It's actually sampling my voice in this environment while I'm talking right now. "What time is it. Quit this application" ("quickness"; there we go) "Quit this application. Open a document. Show me what to say. Make this page speakable. Move page down. Move page down. Hide this application. Switch to Finder." I'll go through that again: "What time is it. Quit this application. Open a document. Show me what to say. Make this page speakable. Move page down. Hide this application. Switch to Finder." Good.
So let's try it. "What time is it." The audio is not plugged in; just a moment, let's get the audio in, guys. This might go bang. I'm still here, even though you can't see my face. Okay, the audio is plugged in. Let's try that again. "What time is it." "What day is it." "Quit this application." "Open my browser." "Close this window." "Go to Google News." "Go to Google News." You can set up any web page to be speakable. "Quit this application." "Get my mail." So suppose I've got a message here and I see some text that I'd like to send to somebody else. I can do that by speech, and we do that by integrating with the Address Book to find out the email address; the whole Address Book is speakable. So I can say, for example, "Send this to Speech Dude." "Send." And this should complain... oh, I love that sound. Okay. "Hide this application." "Switch to Chess." Whoo, he's the man down there, Matthias. "Pawn d2 to d4." "Pawn d2 to d4." "Knight b1 to c3." You see these ghost pieces wandering around; it's great. "Quit this application." Thanks to Tom Bonura for that one.
Okay. So for those of you who know about speech, there's some stuff here that you've seen before and some stuff that's new. Let me ask: how many people have not sat down and used Speakable Items on OS X before? Okay. So I'll walk through quickly what we've done here. What we've seen is a bunch of different things. We have a speech recognition engine which has a robust API. You can call those APIs from your applications; many of you do already, and more all the time. In addition, we ship a couple of applications that also use that API. One of those is called Speakable Items, and that's largely what I've been showing you. You turn it on from the System Preferences: the Speech preferences, Speech Recognition, in the On/Off pane.

Speakable Items is a very simple idea that's been developing over the years, and it's turning out to be quite powerful. When you turn it on, we create a folder in your home directory called the Speakable Items folder. "Open the Speakable Items folder." Anything that's in that folder can be launched by speaking it; it's just the same as double-clicking on it. Applications, documents, templates, aliases, stationery, URLs: anything that you can launch in the Finder by double-clicking, you can launch by speech, just by putting it into that folder. Most of the commands that I used just now are part of a kind of starter kit that we ship. We pre-populate that folder with a few items that do generally useful things, and that's what I've been using. So, for example, there's "What day is it." "What day is it."
The real power of this is that you can add your own items to that folder and make your own things speakable, so you can customize speech recognition according to the kinds of things that you do and how you work. Within that folder, the Speakable Items folder, there is itself another folder called Application Speakable Items. That folder contains folders named for applications, and the items in those folders are only speakable when that application is in the foreground. That provides a framework where you developers can ship commands that are specific to your applications, and you don't have to worry about accidentally using the same wording as somebody else with a different application, because if you put them into that folder, they'll only be speakable when your application is in the foreground. And in fact, you don't need to install them into this folder: you can put them in your own application bundle, and Speakable Items will find them. We have documentation on how to do that.
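To picture the hierarchy just described, here is a sketch of how the Speakable Items folder is laid out on disk. The item and application names are illustrative, and I'm assuming the Panther-era location under ~/Library/Speech in your home directory:

```
~/Library/Speech/Speakable Items/
    What Time Is It
    Open My Browser
    Application Speakable Items/
        TextEdit/
            Make This Bold
        Chess/
            Pawn D2 To D4
```

Anything at the top level is speakable at any time; items inside an application's folder are speakable only while that application is frontmost.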
The kinds of things you can put in here are scripts that send AppleScript commands, Apple events, to your application, or keyboard shortcuts. What do I mean by that? What I mean is that any menu item with a keyboard shortcut can have a spoken command associated with it. Let me give you an example of that. First we'll go back to the web. "Open my browser." What have we got here? I'm looking for a picture; of course, the web is slow. One of the accessibility features is that you can zoom in on the screen, and that's a keyboard shortcut, so I have attached spoken commands to it. So now I can say "Zoom in." "Zoom out a bit." Right.
So anything that has a keyboard shortcut, you can attach speech to, and that's one easy way that you can have speech control of your application without doing a lot of extra work.

Speech is really important for disability solutions. Let me just give you a story that I heard just as I was setting up here. I kid you not, this is absolutely true. Just as I was setting up here, about half an hour ago, a guy from the projection booth behind came out, I think his name's Bill, and said, look, I've just got to tell you, excuse me for interrupting while you're setting up. He said, I saw you give a demo of this stuff a couple of years ago at Macworld, and so I showed it to a blind friend. I turned on his machine and I said "Open my browser," and the browser opened, and we got his mail, and we got it to read his mail out to him. He said he was just blown away. Well, it turns out that this guy teaches an exercise class full of blind people. So they do their exercise class, and they have an iMac over on the side, and they go over to the iMac after exercise class and surf the web, a bunch of blind people. They do this every Monday: have their exercise, and then go and surf the web by voice, using just the things that you're seeing here. Yeah, I was touched, really. I wanted to share that with you, because it means that you can make your applications available to folks with disabilities through technologies like this. The Section 508 rule, as I understand it, of the Americans with Disabilities Act, is that everything that can be done with your application must be possible to do without requiring the keyboard, or the mouse, or that you see the screen. We have an Accessibility API that lets you get at screen controls and provide alternative ways of controlling those things, and we ship one method already built in, and that's speech. So with very little effort, you can have stories like that circulating about your applications.

Within the Speakable Items framework, you can choose what kinds of commands you can give. This is controlled by the Commands tab here, and the reason I want to tell you about this is because of this particular one, which is off by default: front window commands. This lets you speak any of the controls in the frontmost window of whatever application is in the foreground. So with this on, I can navigate these preferences here, for example: "Speech Recognition." "Default Voice." "Spoken User Interface." "Speak the phrase." So I'm going down the checkboxes here. "Speak the alert text." "Speech Recognition." We did not build this specially into the Speech preferences; this is just the general accessibility feature that uses speech. So as long as you use standard Apple controls, you get that for free.

In addition, you can have the computer read out text that appears on the screen, and there are a few different ways of doing this. One that I just turned on is Talking Alerts. Here's the mentality we have, the philosophy: when you're interacting with a computer, sometimes the computer needs to tell you things, and the standard way that it will do that is by putting up a sheet or an alert dialog in front of you. You should be able to read that, think about it, and respond to it. But sometimes your attention is elsewhere. I'm typing on my machine, I quit, I turn around to have a conversation with somebody, and I don't realize that there's an alert saying that I need to save my changes before quitting. Perhaps I go away to lunch, and I come back and find that my work wasn't saved and somebody couldn't get at it. So what happens is, if the alert has been up for a certain amount of time and you haven't responded to it, then we read it out to you, in order to get your attention back. So let me demonstrate that. "Switch to TextEdit." I'll type something here and then try to quit from it, close it. Oh, before I do: we ship it with a delay of about 20 seconds by default between the alert appearing and its being spoken. I'll put that back to zero for now, so that you guys don't have to wait around so long. And that will quit, or close this document.
So that's Talking Alerts. People tell us that they love it. The kinds of scenarios I hear in feedback from users: one guy said to me he was crawling around on his hands and knees under his desk and accidentally kicked out the Ethernet cable, and he didn't know that he had done that. Then he heard a voice come from his computer saying "The network has been disconnected," and he turned around and thought, oh, yes indeed.

I was giving a keynote at an international conference in Berlin. I arrived the night before; actually, I wrote my talk on the flight, landing at Heathrow on the way over there. When I arrived, I thought I'd better get my presentation printed out on transparencies, just in case my computer wouldn't plug into their projection system. So I went to a print shop that had maybe two dozen different kinds of printers. They had big printers, little slide printers, high-quality stuff, kind of like Kinko's on steroids. We started printing these things off, and while we were doing that, I demoed this to them. They were delighted. They had two dozen Power Macs, and they put Talking Alerts onto each of them, because their typical business scenario is that somebody arrives at about nine o'clock in the morning with a CD containing a large image that needs to be printed out, camera-ready, on a large poster, and it will take about two hours to print. So they put it into an appropriate machine and start printing, and then at about eleven o'clock they go to make sure it has finished, and there's a message on the screen that came up five minutes into the printing, saying that there was some problem with the cyan ink. And so they lose the business, because they don't have it done in time. They found that this was great: with 24 machines going all the time, there were voices coming up from all over the place saying "The printer is out of paper," "The network is down," and this saved them a lot of time. But there was a problem: they all had the same voice. So a voice would come out of the middle machine saying "The printer is out of paper," and... who said that? Which machine did that? Well, you can set the default voice, so they put a different voice onto each machine. And you can set what phrase is spoken before the alert is read out; by default we choose it by stepping through a small list we have here. So they made each machine announce itself by name. So we think Talking Alerts is useful. I'll put its delay back so it doesn't keep talking all the time.

Another disability feature that we have is the ability to speak any text that's under the mouse. This again uses the Accessibility API. So if I turn this on and slide the mouse around, yep, as I fly the mouse around, you hear the text being read out. I'm not clicking here. Okay, so if you use standard controls, you'll get that for free as well.

Okay, so let's go back and talk a bit more about what we've actually been seeing here. We'll get back to the main machine now.
So I want to talk a little bit about why you should adopt speech, some things to think about, the reasons for putting it in there rather than it just being a novelty.

First of all, speech gives you a way to get beyond the limits of a graphical user interface. Graphical user interfaces are mature, and they present well the information that they can present, but there are limits to what they can do. Screen real estate is at a premium, and so we have all sorts of technology to try to squeeze a little bit more power out of our screen real estate. But no matter how big my screen, no matter how many monitors I have in front of me, there are always things that I need to get to, or see, that are behind other things, and I can't see them. And of course, there's always the issue of what happens when the user is not attending: no matter how cool the graphics are, the user just might be staring out the window.

Speech gives you an extra modality; it gives your users more choices about ways of interacting with the computer. It's more natural; that is, we've all been speaking and listening since we were about two years old, so it's something that comes to us without too much thinking, whereas working through a typical computer user interface requires some training. So if you put speech into your applications as an alternative control modality, you'll find that some users who are new to computing will be more likely to try out your application.

It's particularly appropriate in an eyes-busy, hands-busy scenario. So think about your application: is there any time where the user is looking at something on the screen and their hands are busy? For example, they're drawing something and they need to make a control change. Say I'm drawing a line and I want to change the brush size or increase the amount of blur. Normally, with our graphical user interfaces, I have to stop drawing, go up to a menu, pull up a dialog, set some settings, click out of that, and then return to drawing. So speech is good when the hands and the eyes need to keep busy with what they're doing. Think about that in your application. And finally,
speech gives us a way to move out of the 1980s. Back in the 1980s, computers had a little weak speaker soldered onto the motherboard, and all it could do was go "beep." And so we got into the habit of writing our programs with beeps written into them: whenever we needed to get the user's attention, we'd put up an alert and go "beep." Well, I like to think that life has moved forward somewhat since then. Of course, now, instead of just going beep, we play lots of different sounds, but the burden is still on the user to understand what all those sounds mean. So, for example, if we want to let the user know that mail has been sent, we play one sound; if we want to let the user know that somebody has logged onto iChat, we play a different sound. It seems to me we should be able to do better than that. The application developer who is playing a sound knows the meaning, knows what information it is trying to convey to the user. So why not just say it? And think about the mouse: the mouse is essentially the equivalent of pointing and grunting. It's such a narrow, I guess, 1.2-bit interface. We should be able to do better than that. If I want to do something with a computer, rather than just poking and grunting, I should be able to say what I want to do.

So, we have a couple of engines that I've mentioned and shown you already. The speech recognition is speaker-independent. That means you don't have to train it to your voice. There are speaker-dependent speech recognizers around, and they have different characteristics. One of those characteristics is that you need to spend at least four hours speaking to train them to your voice, which actually takes more than four hours, and you get tired at the end of it. And at the end of that four hours, you still need to use them for a month or two before they finally adapt to your particular voice. We think that the kinds of users who buy Macintoshes expect to just walk up to the machine and have it work. So we made it speaker-independent.

It works with a far-field microphone: we have layers of software that are tracking, adapting to, and compensating for the background acoustics and the microphone characteristics. You can also use it with a head-mounted microphone, as you just saw me do over there. It's robust against background noise. I use it now at Apple, in the cafeteria, at lunchtime, and to my delight it works. The secret there is that the kind of noise that's easiest to compensate for is noise that's steady-state. So in the cafeteria, when there are hundreds of people talking, the overall spectrum tends to be fairly constant. A situation that we have not solved is if I'm in front of a computer trying to talk to it, and right next to me there's somebody else talking, because then there are two voices at once, and the spectrum of the distracting voice is changing all the time. So we don't claim to have solved that one yet.

It's a large-vocabulary speech recognizer: we have over 120,000 words in the dictionary, and we have layers of software to figure out how to pronounce words that aren't in the dictionary. And it's a continuous speech recognizer: you don't have to pause between words, which is a great relief. It's driven by a finite-state grammar; that's how your application tells the recognizer what to listen for.
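To give a feel for what a finite-state grammar is: your application declares the exact word sequences the recognizer should listen for, rather than leaving it to transcribe arbitrary speech. A BNF-style sketch of the idea (illustrative notation only, not the literal Speech Recognition Manager data structures):

```
<command> ::= <action> <target>
<action>  ::= open | close | quit | switch to
<target>  ::= this application | this window | my browser | the Finder
```

At any moment the recognizer only has to decide among the paths through this network, which keeps the search space small and the accuracy high.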
So why should you use speech recognition? Well, as I mentioned, speech is a very natural way of controlling a computer. It gets you beyond the limits of point-and-click, because you can't click on what you can't see. Point two: conversation is a particularly appropriate modality for delegating goals to a computer. You can tell the computer what you want done; if you haven't specified enough, it can come back and ask you questions to refine the nature of the goal, and it can then do what it's good at, which is figuring out the steps necessary along the way to get there. And of course, speech recognition is right for accessibility, the latest story being the one that's only 30 minutes old.
Okay, we have speech synthesis in there. It takes any text and converts it into American English speech. I have to say that, because I'm getting requests all the time for other varieties of English and for other languages. There's a range of different voices. You can control the speaking rate, and that is important, because there is no correct answer, no single answer, to the question: what is the appropriate speaking rate for a speech synthesizer? The rate at which a speech synthesizer speaks should depend on why it is speaking. We'll talk more about that later. I do want to let you know that we are working steadily, all the time, on improving the quality and the naturalness of the speech synthesis. We did a lot going from Puma to Jaguar, and we got good feedback from folks who listened to it and said, oh wow, that's a lot better now. And we're doing a lot more work.

So when should you use speech synthesis? There's a bunch of different areas; I won't go through all of these now, but one thing that I think is useful: when something happens inside the computer that is outside of the user's control, or not directly relevant to the current task at hand, then speech is an appropriate modality for letting the user know. For example: you have new mail from your boss, or your compile failed.

Another area is proofreading. You know, creation of documents used to be an art form, and people would spend a lot of time crafting them, but the world's got too busy for that; we don't have the time. So we have tools like spell checkers and grammar checkers. Well, grammar checkers don't do very well: they often don't catch awkward constructs, and the constructs that they do catch, we don't always agree are incorrect. And spell checkers can only find a word that is not in the dictionary. Often when we make typing mistakes (the psychologists will tell us; there's good evidence on this) we are much more likely to transpose letters if it creates another real word, and spelling checkers can never catch that. But if you have the text read out to you, you immediately spot it; it just becomes so painfully obvious. People have asked me
to talk a little bit about why you should use speech synthesis versus recorded speech. There are a few reasons. If you only have a small number of things that you need to say to your user, then I say go ahead and record them; get your voice talent. But sometimes recording is impractical. For example, if you have a huge amount to read out or to say to your users, then it takes ages to record, and it takes a huge amount of storage. The average CD is, what, about 640 megabytes, and usually about two-thirds of that is media content. So if you can reduce the audio by a factor of, well, typically about 80, by going from audio recordings down to text, then you have much more space for real content on your titles, on your CDs.

You also get a consistent voice. If you record a voice talent, and then later on you bring them back to record some more, even the next week, from one day to another their voices tend to be inconsistent: they're speaking louder one day, a bit more relaxed the next day, and then in the user interaction the voice sounds like it's going up and down. With speech synthesis you get a consistent voice.

You can save costs, because you don't have to hire a voice talent, and you don't have to hire a recording studio. It's flexible. I don't know if this has happened to you; it's happened to me a lot. You're working on an application, you're about to ship it, and just as you're about to go GM, somebody says, oh, we have to change some of the strings. So you have to call up the voice talent and get them back into the studio to record something different, but no, they're on vacation in Brazil now, or they've got a sore throat, and it's a real pain. With speech synthesis, you just type in the new strings and you're done.

Another important reason for using speech synthesis is that if the things you're saying to your users are longer than a single short sentence, then you need to control the intonation, to make sure they're spoken in a way that lets people track the meaning across the longer sentences, and you can't do that if you're piecing together real sentences that were recorded at different times and just concatenating them. And you get lip synchronization for free.
All right. At this stage I want to invite up Jack Minsky, who is the president of Software MacKiev. Jack's company has produced World Book, which, as you may have seen, is this wonderful application; I think it's about the best OS X UI of any application I've seen. He'll show it to you; it's gorgeous. These guys have been using speech, and he's going to tell you about it.

Yes, good morning. We had a pretty simple goal in mind at the creative labs of Software MacKiev when we set out to build World Book Speech Edition, and that's that we wanted visually impaired users, or even blind users, to be able to use the World Book, to be able to search all 22 volumes, 18,000 articles, on their own, without assistance. And that meant we really had to be kind of creative: not just letting text be read by passing your cursor over it or highlighting something, but building in the kind of interaction that would allow a user really to do this on their own. I'd like to show it to you. The first thing I'm going to do is, as Kim did, adjust to the acoustics: I'm going to let this Mac adjust to my voice in this room. "What time is it. Quit this application. Open a document. Open a document. Show me what to say. Make this page speakable. Move page down. Hide this application. Switch to Finder." There, so that's done. And then, just to get started, we wanted users to be able to launch this from the Finder. We were thinking "launch" and "start," and then we thought we needed something even friendlier, so we chose "Hello World Book" as our starting command. Let's try and see if that works. "Hello World Book."
And immediately they get the feedback of the music and World Book starting up, so a blind person already knows they're in. Let this go by just for a second. [Music] So you can see there are all kinds of sounds and things built in there, so even someone who can't actually see the screen can hear some of the things going on.

The next step was to let them search through the encyclopedia for a particular article that they're looking for, and here you're going to hear me say "Search, please." A window will open, and blind users can touch-type things in. So let's see that work. "Search, please." So now I have a window opening, and if you heard that, it said "Ready to search," again giving the user feedback to know that the thing is working. I'm going to type in a simple word here: "horse." I'm picking "horse" in particular because, for a whole bunch of articles, what we've done is to embed sound at the top of the article, again aural feedback, so when they reach that article they'll actually hear the animal noise or whatever it is that's going on. You'll also hear more feedback when I hit Return, because we're dealing with a blind user who might not have typed the right word successfully. We wanted to give them feedback, so it will actually say "Searching for horse," and then at the end, if the horse article is found, it will say "Search complete." So let's try that. And when the horse stops running, they can now simply ask the computer to read to them, so I'd say "Read to me." So in this way, assuming they have typed it in correctly, they can get to any article that they can think of
the name for now of course we thought
people aren't going to be able to
necessarily type it incorrectly i know i
miss type all the time and i can see
just fine so what we did was to build in
a catch for that so i'm going to type in
and misspell apple computer here and
you're going to see on this one that
it's going to come up with a series of
suggested alternative words and we
thought even beyond that as you'll see
from this example it will read through
the instructions first once and then go
through the list one by one pronouncing
the alternatives that the user might
have meant to type in the first place it
will then pause briefly at the end of
the list assume that the user didn't
hear what they wanted or maybe they
didn't weren't sure start the list again
but without that long intro introduction
of explanation of what they need to do
simply repeat the words again so let's
try that
And there I go, and I've got my Apple Computer articles. So the user can do that quite nicely.

We also built in a lot of other speech technologies, to try to go to the maximum of what Mac OS X has to offer. Kim showed the speech-under-the-mouse feature, so I won't show that, but we have custom controls in some places. It comes for free if you simply enable it for all the dialogs with normal tabs and so forth, but if you build custom controls, you can go the extra step of making sure those will also work with text under the mouse.

And then we've done one more thing. I'm going to pull up another page here; it's already set up. And that's to take a bunch of the abbreviations that are very common in an encyclopedia, and which won't mean anything to a blind person. For example, population is a very common item in an article about cities. So what we've done here is made it so that "pop." will not be read as "pop," but read out as the full word. I'll just show you that. Let's try that again with the sound down.
So it reads out the full word, "population," instead of "pop." All of those things have been, for us, the way that someone who's unsighted can navigate an encyclopedia like this and really use it on their own, without assistance, without someone standing over their shoulder.

We've gotten a lot of recognition for this. This is the first encyclopedia, and the only one, that's fully ADA-compliant under Section 508, and that's resulted in a number of magazine articles written in the education space about this application. Also, just three weeks ago, we had the great honor of the American Association of Education Publishers voting this the best children's software of the past year. That's the first time that a Mac-only application has ever been nominated for this; it's, you know, Windows, cross-platform, everything competing for the prize, but a Macintosh-only product won that category. Another great reason to do this: Apple has put this application on every eMac, iMac, and iBook they sell. And probably the best thing of all is that we know, at Software MacKiev, that because of the work we did in implementing the speech technologies, which was already set up for us with all the things that are built into Mac OS X, there are literally tens of thousands of visually impaired users, and even blind people, out there who now have a whole new world opened up to them, to explore independently the World Book Encyclopedia. And we feel really great about that. Thank you.
You can purchase World Book on the Apple Store online or in the retail stores; check it out. All right, so now it's your turn. We want to talk to you a little bit about what you can do to incorporate speech in your applications, and we'll start with customizing speech synthesis.

What I mean by this is that when we send text to a speech synthesizer, the synthesizer looks at each sentence, scratches its head, and says, hmm, how should I speak this? The answer is that the way a sentence is spoken depends on why it's being spoken and what it is intended to convey to the user. The problem is difficult in the general case, but you guys have an advantage: your application knows a lot more about how things should be spoken than the text-to-speech engine does. For example, Jack's application knows that "pop" within brackets, followed by digits, should not be spoken as "pop" but should be expanded to "population." The speech synthesizer could never figure that out by itself. So there are three things that you can do. One is to filter the text, the way the MacKiev guys did; another example would be stock quote abbreviations. Then you can customize the pronunciations, and you can customize the intonation. Let's talk about that in a little bit more detail.
To customize the pronunciation, you're dealing with the problem that the way the synthesizer pronounces a word is not the way that you want it pronounced. This is most often a problem with names, or invented names of characters. If you have a fantasy game, I'm sure you've got some character names in there that are written to be difficult to pronounce. Some developers send special strings to the synthesizer that just use funny spelling. We don't recommend that, because the way we pronounce an unorthodox spelling might change from version to version. Instead, we recommend that you use what we call phoneme input, which looks obscure but is actually very quick to learn, as a totally precise, unambiguous way to specify how words ought to be pronounced. You can embed phonemes like this into the text, or you can load a custom dictionary into the synthesizer that has these mappings already in it.
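The two approaches — inline phoneme input versus an app-side pronunciation table — come down to plain string handling before the text reaches the synthesizer. Here's an illustrative sketch only: [[inpt PHON]] and [[inpt TEXT]] are Apple's documented embedded commands for switching input modes, but the character name and its phoneme spelling below are invented, not a real transcription.

```python
# Sketch: controlling pronunciation with Apple speech synthesis.
# [[inpt PHON]] switches the synthesizer to phoneme-input mode;
# [[inpt TEXT]] switches it back to plain text.
# The name "Xylgath" and its phoneme spelling are made-up examples.

PRONUNCIATIONS = {
    "Xylgath": "zIHLgAEth",   # invented phoneme spelling
}

def apply_pronunciations(text):
    """Wrap any known troublesome word in phoneme-input mode."""
    words = []
    for word in text.split():
        if word in PRONUNCIATIONS:
            words.append(f"[[inpt PHON]] {PRONUNCIATIONS[word]} [[inpt TEXT]]")
        else:
            words.append(word)
    return " ".join(words)

print(apply_pronunciations("The warlock Xylgath awakens"))
```

The same table could instead be handed to the synthesizer as a custom dictionary, as described above, so the substitution happens inside the engine rather than in your code.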
Then you should customize the intonation. The intonation is the pitch and the timing that we use when we speak: it's not what we say, it's the way that we say it. And the problem is that once you've synthesized the words so that they are clear, you've not synthesized enough. Consider the sentence "John only introduced Mary to Bill." Now, if I say it like that, it means he didn't introduce Mary to anybody else: John only introduced Mary to Bill. But suppose I say "John only introduced Mary to Bill" — then he might in truth have introduced her to other people as well, but to Bill he only introduced Mary; quite a different meaning. And if I say "John only introduced Mary to Bill," then it means he didn't encourage them to form a partnership together. So the problem is that the meaning of a sentence depends crucially on the intonation. It's difficult to generate in the general case, because we need to know the intended meaning — but your application often knows that, and so you developers can employ the domain knowledge within your application to do a better job. Let me work through an example. So here is text from an application that people are using to book flights, and here's the confirmation that's being sent to the user. I'll read this out first by just passing the text, as you see it, directly to the speech synthesizer; it will sort of do okay, but it won't sound all that great. Here we go... oh, the audio is not going out from the demo machine. Is there a reason for that? All right, well, what can we do... okay, we can put the microphone up to it.
Yeah — is this mic working? Okay, this is high-tech; let's see if this works. [The synthesizer reads the flight confirmation through the microphone, ending "…thank you for choosing TTS Travel."] Whoa — okay, wasn't that terrible? All right, I thought it didn't sound that good, so let's talk
about what you can do about that. There are commands that you can embed into the text that you send to the synthesizer — you can embed those commands by rule — and they'll give the synthesizer hints about how to speak the text. Last year, and in prior developers conferences, we've given some instruction on how to use some of these commands, and according to those, this would be the kind of way that you would annotate the text for the synthesizer. I've put the embedded commands into a smaller font so that you can see them. But we've been working on the front end of the synthesizer, and some of this information we can now infer, because we're now tracking the topic as we go through the text and modifying the way we say it according to the topic structure and the block structure. That means that some of these are no longer needed — those ones you get for free — but there are others here that I've left behind, which do depend on domain knowledge. Let's take some
examples. I'm going to go through these by laying out some simple principles you can use. The first one I'm calling "let the user catch up." What you should do is add pauses at major sense units: where pieces of information seem to cohere together, make sure they are separated from other pieces of information, and you can do that just by sprinkling punctuation around. If you want to make a pause longer, you can add the embedded command I've got there — [[slnc]] — which means add, in this case, 500 milliseconds of silence. You can also adjust the speaking rate to be appropriate for the purpose of the speech. In this particular case the user needs to transcribe the information, and so you want to read it a little more slowly; if the user already knew the information and you were just reading it back for confirmation, then you would read it back more quickly. So here, for example, is one of those sentences, first with just the plain text; it sounds like this... we'll play this out through the demo machine again — well, hang on, it didn't play. All right. So what I've done here is added a command to slow down the rate a little bit, added some colons and commas that you can see at the end of those lines, and a little bit of extra silence. Let's see if this one will play. That's the unadorned text; now, with those commands that you can see, it sounds like this. Do you hear a difference? Okay, moving on. The second
principle is: many things go into the background. When we speak, we don't equally highlight every word; we mark for our listeners which things we're saying refer to what they already know, and which things we're saying are new and important. And the way we do that is by reducing the emphasis on things that listeners already know. So you can do that by de-emphasizing repeated words — for example, "departing at 6:10, landing at 7:10" — but that one you now get for free, because we're tracking things like that in the synthesizer. In addition, though, you can de-emphasize words whose status can be inferred from the overall application scenario. So, for example, in this case the text started with "your first flight is" — but the user already knows that it's talking about flights, and so it's appropriate to de-emphasize "flight"; it should be spoken as "your first flight is." I'll play that first without the embedded command, and you'll hear there's equal emphasis on the words "first" and "flight"; then I'll play it immediately afterwards with this embedded command, which takes the emphasis off the word "flight" — see if you hear a difference. Did the audio go off again? Because we had the audio on... all right, well, let's go through this again: first without that embedded command, and then with. Do you hear a difference? Okay. The third
principle is: liven it up. If you add an exclamation mark at the end of a sentence, that stops us from gradually rolling the pitch off all the way through the sentence, and it makes it sound a little bit more involved, a little bit more lively. So if you're hearing your synthesizer having a kind of a bored sound, this is one way that you can reduce that — don't use it everywhere; use it judiciously. Then you can focus the user's attention on what's important by putting extra emphasis on the most important words, by embedding [[emph +]] just before them. And finally, we suggest using what we call paragraph intonation. When we speak, we don't string all of our sentences together into one long, undifferentiated stream of speech; rather, we group our sentences together into larger units that span multiple sentences and relate to the topic structure, and we mark that for our listeners. For example, when I start talking about a new topic, I raise my voice just a little bit; then, as I talk about that topic, I keep lowering my voice down to its normal range; then, towards the end of that topic, I kind of roll my voice off; and then for the next topic I raise my voice again — you hear that? We all do this. Listen to people at lunchtime: you'll hear pitch going up and down all the time to signal the topic structure. So you can do that.
We have told people that you can — or should — raise the pitch range at the first sentence of a paragraph with some embedded commands, then lower the pitch range at each subsequent sentence, and then put extra silence in. Well, now you get all of that for free: all you need to do is put in a blank line between the paragraphs, and we will do the rest. So in this particular case, the last sentence — "thank you for choosing TTS Travel" — is not related to the topic of the previous information, and so we can separate it just with a blank line, and it now sounds like this. [The synthesizer reads the fully annotated confirmation, ending "…thank you for choosing TTS Travel."] For comparison, I'll just play that text again unadorned, so you can hear how far we've come by accumulating all these commands... oh no, it won't play. All right, we'll go on. So, to summarize: customize the pronunciations when you're using speech synthesis, and customize the intonation using those principles, and together those things will help you give your users a better experience.
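Taken together, those principles amount to mechanically decorating plain text with the embedded commands discussed in this section. Here's a rough sketch of that: [[rate]], [[slnc]], and [[emph]] are Apple's documented embedded speech commands, but the helper functions, the rate value, and the flight details below are invented for illustration, not taken from the session's actual demo text.

```python
# Illustrative sketch: combining the intonation principles by
# annotating plain text with Apple's embedded speech commands.
#   [[rate n]]  - speaking rate, in words per minute
#   [[slnc n]]  - n milliseconds of silence ("let the user catch up")
#   [[emph -]]  - de-emphasize the next word (background known info)
#   [[emph +]]  - extra emphasis on the next word (focus attention)
# A blank line between paragraphs lets the synthesizer apply its own
# paragraph intonation. All flight details here are made up.

def deemph(word):
    return f"[[emph -]] {word}"

def emph(word):
    return f"[[emph +]] {word}"

def confirmation():
    body = (
        "[[rate 160]] "                       # slower, for transcription
        f"Your first {deemph('flight')} is: [[slnc 500]] "
        f"departing at {emph('6:10 PM')}, [[slnc 500]] "
        "landing in San Francisco at 7:10 PM."
    )
    closing = "Thank you for choosing TTS Travel!"
    # Blank line -> new paragraph, so the closing gets a fresh pitch range.
    return body + "\n\n" + closing

print(confirmation())
```

The exclamation mark on the closing sentence applies the "liven it up" principle, and the blank line before it applies the paragraph-intonation principle, both for free.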
Now I'd like to introduce a new tool that we're making available to you guys, starting today, to further customize the intonation. I'd like to go back over to the demo machine, please. The problem that we're addressing is that sometimes, no matter how many embedded commands you put in the text, you can't quite get it to be spoken the way you want it to be spoken, with the personality or the emotion that you want. Wouldn't it be great if you could just record yourself saying a sentence the way you'd like the synthesizer to say it, and have it copy you? Well, that's what this tool does. Let's start it up over here... that's not there... all right.
It's called Repeat After Me. We've had this tool going in the lab for quite some time; it was an internal tool that ran on Mac OS 9. It has been ported to Mac OS X, and a new user interface has been put onto it to make it easier to use and more consistent with Mac OS X — and that work was done for us by the folks at Software MacKiev, so we're very grateful to Jack and the team who did the work. So: you can type in some text — "we are at WWDC" — and this will tell you, first of all, the phonemes that the synthesizer used to pronounce it: there's the "we," there's the "are." Then, down here, it plots — with time going this way and pitch going this way — the fundamental frequency, the tune, that's generated by the synthesizer for that sentence. So if I speak it, it will sound like this... is this machine going through the sound system now? Okay, here we go — I'll play it again. Now, suppose I think that was all spoken a little bit too quickly and I'd like to slow it down, perhaps to time it with an animation that I have. Well, I can just click on the end up here and drag it out to make it take longer; or, if my animation is really quick, I can make it quite short.
[Music]
Let's go back to the default. If I want to emphasize the "we" — which is here; here's the "w" and here's the "e" of "we" — and lean on that a bit more, then I can just raise the pitch up there: let's pick it up, and pick this up, and stretch it to take longer. And now that will sound like this. I can record myself — let's try it — and have it give me a recording. All right, let's try it: "hello, hello" — is the audio input working? One, two, three... let's check the Sound preferences... Sound... sound input... oh, let's plug this microphone in. It's a trick, computer — I'm going to disappear again, but I'm actually still here, so you can't escape yet. Where's the connection? Aha, there it is. So sound input works on Mac OS X: "hello" — there we go, we're up. Now let's try that again: "we are at WWDC." All right, save that... my audio file comes up... where is it? All right, audio isn't completely working — as I said, this is Panther, bear with it — but I've got some pre-prepared ones here to show you, just in case that happened. My recording's sound wave should have come up down here, so we'll show you one we prepared previously, which is... here we go. So here's the original signal, and here's Victoria now copying it.
We're going to make this available to developers — watch the speech developers mailing list to find out the method by which you can get hold of it. We're also planning on running a kitchen on it; can I see a show of hands of people who would be interested in coming to a kitchen to learn how to use this? Define "kitchen"? That's a good question: a bunch of you come along to Apple, you bring along text from your application, sit down with some machines, and we sit with you all day and teach you how to use it. So, who would like to come along and do that? Quite a number of you — okay, cool. Let me just give you a couple of examples of what you can do with this.
Our buddies at World Book used this for some of the speech that you heard — although it wasn't very loud — that's spoken back while you're doing a search. For example, you type in "Panther" and the computer will say "searching for Panther." If you just send the text to the synthesizer, it sounds like this... but with customization using this tool, they got it to sound like this — do you hear the difference? When the search is complete, it would say... which didn't sound that natural; so they used this tool, and now it sounds like this. Yeah. I had an application where people would call up an information system, type in their ID number, and it would then read out news and email and so on to them, and it would greet them by name. The developers of this system got a voice talent to record greetings for about 5,000 different names, and they found, to their dismay, that this gave very little coverage of names. We actually have 65,000 names in our dictionary; that gives us about eighty percent coverage of English names. If we increased it by another 65,000, that would put it up to about eighty-nine percent coverage — so names are difficult; that's the statistics of names. So they used our speech synthesizer, and when they passed test text to the synthesizer, it didn't sound the way they wanted. Here's an example of some names being spoken just from text... a bit tedious. So, using this tool, they got it to sound like this. So that's the tool that you guys can use. Okay, let's go back to the main machine. Another thing we want to introduce for you today is Cocoa classes,
and our philosophy here is that they should be simple to use — inspired by Alan Kay, we think simple things should be simple and complex things should be possible. And so here to tell you about them is Kevin Aitken; he's the author, and you can blame him. — All right, thanks, Kim. Yeah, there are definitely a lot of people who have contributed to this, but I'm willing to take the blame, I guess. So let me get started. First of all, we've worked really hard on this: Panther now offers Cocoa developers the ability to easily access the most popular features of our speech engines.
So over the next few slides, I'm going to take you through the NSSpeechRecognizer class, which allows you to listen to and respond to the user's spoken commands, as well as the NSSpeechSynthesizer class, which will allow you to generate synthesized speech, either through the computer's speaker or to a file. So let's get started with the NSSpeechRecognizer class. First of all, we designed this to be really easy, so virtually all you do is give it a list of strings and tell it to start listening. That means that you don't need to understand concepts like language models and recognition results just to get started. But we've made sure that it's dynamic — you can change it on the fly, and you can have several recognition objects running at the same time — so it's very flexible. What I'm going to take you through is a couple of coding examples. For this NSSpeechRecognizer example, think of writing an application — a game — that allows the user to move through a maze using four commands: north, south, east, and west. So let's get started. I've broken this into kind of two sections: the first section just gets us listening, and the second section handles the result. So the first
thing we're going to do is create a recognizer object, and then we're going to set the delegate. Remember, a delegate object is just a helper object — in this case, it's going to receive the message when the recognition system has heard something. Then we're going to set the commands; as I said before, this is just a simple array of strings — in this case, north, south, east, west — and then we're going to start listening. So now your application is listening for those four commands. So the user starts navigating through that maze, and they say one of those commands. What's going to happen is that your delegate object is going to receive a didRecognizeCommand message, and as the command parameter you're just going to receive one of those strings that you originally gave it. So you can compare that string to one of your known strings — I've just used a simple if-then-else; I'm sure there are more efficient or more exciting ways to do it — and then convert that into some action. Okay, so that's pretty easy.
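The dispatch step just described is plain string comparison, independent of the Cocoa API itself. Here's the same if-then-else logic sketched in Python for brevity (the real code would live in an Objective-C delegate implementing NSSpeechRecognizer's didRecognizeCommand callback; the movement deltas are a hypothetical stand-in for the game's move logic):

```python
# Sketch of the didRecognizeCommand dispatch described above. The
# command strings match those given to the recognizer; the returned
# (dx, dy) deltas are a made-up stand-in for the game's maze moves.

COMMANDS = ["north", "south", "east", "west"]

def handle_command(command):
    """Map a recognized command string to a movement delta."""
    if command == "north":
        return (0, 1)
    elif command == "south":
        return (0, -1)
    elif command == "east":
        return (1, 0)
    elif command == "west":
        return (-1, 0)
    raise ValueError(f"unrecognized command: {command}")

print(handle_command("north"))
```

Because the delegate hands back exactly one of the strings you registered, a simple comparison chain (or a dictionary lookup) is all the interpretation you need.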
So let's go on and talk about NSSpeechSynthesizer. It's going to allow you to speak a string of text, either through the computer's speaker or to a file. And because it speaks asynchronously, you can handle certain events during the speech generation process: specifically, you can get a notification when the speech is finished, when a phoneme is about to be spoken, and when a word is about to be spoken. We give you access to all the voices that are installed on the system, so you can get information about each one of those and create a pop-up for the user to select one. And finally, you can combine both the NSSpeechSynthesizer and NSSpeechRecognizer classes to create spoken user interactions — those are kind of dialogues between your application and the user. So we've got a code example of that: we're just going to instantiate our synthesizer object using the default initializer here, so it's going to use the default voice the user has chosen in the Speech preference panel; we're going to set that delegate object; and then we're going to start speaking by calling startSpeakingString. Now, this is going to come out of the default output device; alternatively, we can call startSpeakingString:toURL: to have it written to a file. Now that your
application is speaking away, you can handle some of those events. We can implement the didFinishSpeaking method on our delegate object, so we know when it's finished speaking and can, say, update the user interface. You can be notified when it's about to speak a word, so that you could do the follow-the-bouncing-ball on screen, or highlight a word on screen as it's being spoken. And you can also find out when it's about to speak a phoneme, so you can animate a mouth on screen — an avatar, some character, whatever you like. So anyway, that's a wrap-up of the speech classes. I'm going to go over to the demo machine really quick to show you the example, if this guy's up and going... all right. So let me quickly show you where this is: we have example applications in here, under Speech — we've added some for recognition, and we have an NSSpeechSynthesizer example here. Let me show you what we built with this, using most of the callbacks... let me choose a voice here and start them speaking. So that's what we created using the NSSpeechSynthesizer class, and it was really fast and really easy, and hopefully you'll find that as well. That example is on your Panther CD, and there are some other examples in there, so go take a look. So that brings us to the end of the
material we've prepared for you today. To summarize: we've introduced the speech technologies, for those who weren't quite familiar with them; we've introduced a tool for customizing speech synthesis, which we're going to make available to all developers; we've introduced Cocoa classes; and we have given some guidelines about when you should use speech and what kinds of principles lie behind your adoption of it. For those who are interested in more background information about this, you might want to look at the Introduction to Developing Applications with Cocoa session to find out about Cocoa programming, or you might want to see the AppleScript update — speech and AppleScript have such a strong synergy that many people at Apple say those two together are two of the most strategic technologies at Apple — and you can find out more about the accessibility API at the Mac OS X accessibility session. We don't have time for questions now, but the team will be gathered just outside, happy to stay as long as any of you would like to ask questions. If you have any questions subsequently, the person to contact is John Geleynse, who is a manager of software evangelism, and his email address is up there — it's hard to read — it's Geleynse, G-E-L-E-Y-N-S-E, at apple.com. And go to the speech web page to find out about the speech developers list and documentation for all the things we've shown you, and more — the URL is up there. Thanks a lot.
[Applause]