WWDC2001 Session 502
Transcript
Kind: captions
Language: en
today's session is going to be about
describing at a high level what we have
to offer in three new Apple Java
frameworks the Java spelling framework the Java
speech synthesis framework and the Java
speech recognition framework these
API are free and they will be available
this week
surely by Friday they're in the pipeline
and we just need to put them on the web
on the Apple developer website and you
can download them and use them with Java
and your favorite IDE JBuilder that you
got in the bag Metrowerks Project
Builder what have you and have some fun
with them we're not going to go into a
lot of detail about how speech synthesis
and speech recognition work or about all
their capabilities and by that I mean
the underlying technologies built into
OS 10 that have been around for a while
in different versions of OS 9 OS 8 etc
but you can find more information about
those things on our Apple developers
website and we're assuming that you all
are familiar with Java
now one thing we're going to talk about
is Java beans one of the things we've
designed into these APIs is to allow
them to be used in visual builders the
premier example is JBuilder and before
we go on I just wanted to review exactly
what a Java Bean is so it's basically at
a high level a method that was
designed by Sun and its partners to be
able to share discrete components among
third party developers and tool vendors
so that each tool could interoperate and
use visually these discrete components
so technically a Java Bean is basically
a class that has certain method
signatures or another class called a
bean info that is its pair that
describes properties it can also
generate events so for example the
classic bean would be a button and you
could get an action event on that button
for example and the JavaBeans
specification also defines methods
to provide editors and customizers
that is something richer than you'd get
by default in a visual builder such as
just editing a string or cranking a
number up and down switching true to
false etc and JavaBeans of course are
part of the Mac OS 10 J2SE
implementation part of all J2SE
implementations so the first framework
we'll discuss is the Java speech
framework and the Java speech
framework is basically layered upon the
synthesis and recognition technologies
built into OS 10 and as I said before
these are from Carbon they exist on Mac
OS 9.1 and Carbon as well the native
versions and of course we're running on
OS 10 here today so what are some of the
differences between what we're talking
about today and the underlying
technologies well the Java speech API
are object based whereas the Carbon API are
C API native API the Java frameworks
provide programmatic as well as visual
programming for the frameworks the
Carbon API are only programmatic the Java
speech API are at a higher level the
Carbon C API provides a very very low
level every single possible thing you'd
ever want to do with speech synthesis or
speech recognition the java speech API
are only available on Mac OS 10 the
native API are available on carbon 7.5
and beyond
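Since the frameworks are described here as usable from visual builders, a minimal sketch of what makes a class a JavaBean may help. It uses only the standard `java.beans` package. `VoiceBean` and its `rate` property are hypothetical names invented for illustration and are not classes from the Apple frameworks.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class VoiceBeanSketch {

    /** Hypothetical bean for illustration; not a class from the Apple frameworks. */
    public static class VoiceBean {
        private final PropertyChangeSupport changes = new PropertyChangeSupport(this);
        private int rate = 180; // arbitrary default chosen for this sketch

        public VoiceBean() { } // beans need a public no-argument constructor

        // A getter/setter pair with matching names is what defines the "rate" property.
        public int getRate() { return rate; }

        public void setRate(int newRate) {
            int oldRate = rate;
            rate = newRate;
            changes.firePropertyChange("rate", oldRate, newRate);
        }

        // An add/remove listener pair is how a bean advertises the events it generates.
        public void addPropertyChangeListener(PropertyChangeListener listener) {
            changes.addPropertyChangeListener(listener);
        }

        public void removePropertyChangeListener(PropertyChangeListener listener) {
            changes.removePropertyChangeListener(listener);
        }
    }

    /** Lists the properties a builder tool would discover through introspection. */
    public static List<String> propertyNames(Class<?> beanClass) throws IntrospectionException {
        BeanInfo info = Introspector.getBeanInfo(beanClass, Object.class);
        List<String> names = new ArrayList<>();
        for (PropertyDescriptor descriptor : info.getPropertyDescriptors()) {
            names.add(descriptor.getName());
        }
        return names;
    }

    public static void main(String[] args) throws IntrospectionException {
        VoiceBean bean = new VoiceBean();
        bean.addPropertyChangeListener(event -> System.out.println(
                event.getPropertyName() + ": " + event.getOldValue() + " -> " + event.getNewValue()));
        bean.setRate(220);
        System.out.println(propertyNames(VoiceBean.class));
    }
}
```

A visual builder does roughly what `propertyNames` does: it introspects the class, finds the property, and offers an editor for it.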
so speech synthesis basically it's about
converting text to speech and we provide
in our Java frameworks about 99%
coverage of the functionality you'll
find in the Carbon API it's architecturally
neutral now what we mean by that is
speech synthesis provides the ability to
get callbacks instead of just saying
speak this text you may want to know
when it's done speaking a particular
word a sentence when it's done speaking
a part of a word you can even stick
other cues into the string that it's
speaking and we'll go into that more
later and get callbacks from the speech
technology now those callbacks certainly
won't happen on the UI thread of
whatever particular UI toolkit you're
using Swing AWT IFC Cocoa etc
so what we've done is we've put in a
thread task handler and a default
implementation for swing to provide that
synchronization for you by default and
an abstraction in the task handler
interface that allows you to add your
own synchronization between the speech
thread callback thread and the UI thread
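The task handler idea can be sketched in a few lines. The interface and class names below (`TaskHandler`, `SwingTaskHandler`, `DirectTaskHandler`, `deliverWord`) are assumptions made up for this sketch, not the framework's actual API. The only real library call is `SwingUtilities.invokeLater`, which is the standard way to move work onto the Swing event dispatch thread.

```java
import java.util.function.Consumer;
import javax.swing.SwingUtilities;

public class TaskHandlerSketch {

    /** Hypothetical stand-in for the task handler abstraction described in the session. */
    public interface TaskHandler {
        void handle(Runnable task);
    }

    /** Swing flavour: hop to the event dispatch thread before touching any widget. */
    public static final class SwingTaskHandler implements TaskHandler {
        @Override
        public void handle(Runnable task) {
            if (SwingUtilities.isEventDispatchThread()) {
                task.run();
            } else {
                SwingUtilities.invokeLater(task);
            }
        }
    }

    /** Runs the task on the calling thread; enough for non-UI programs and tests. */
    public static final class DirectTaskHandler implements TaskHandler {
        @Override
        public void handle(Runnable task) {
            task.run();
        }
    }

    /** What a speech callback thread would do: hand the event to the handler, not the listener. */
    public static void deliverWord(TaskHandler handler, String word, Consumer<String> listener) {
        handler.handle(() -> listener.accept(word));
    }

    public static void main(String[] args) throws InterruptedException {
        // A plain thread plays the role of the speech callback thread here.
        Thread speechThread = new Thread(() ->
                deliverWord(new DirectTaskHandler(), "hello",
                        word -> System.out.println("finished word: " + word)));
        speechThread.start();
        speechThread.join();
    }
}
```

Swapping `DirectTaskHandler` for `SwingTaskHandler` is the whole point of the abstraction: the listener code stays the same and only the thread it runs on changes.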
and we provide javabeans and customized
property editors for tight integration
with the Java IDEs so the main classes
there are others but the main two
classes of speech synthesis are the
synthesizer itself and it basically
allows you to speak the text and attach
event listeners to the different events
it generates the voice class which
embodies all the characteristics of a
voice now Mac OS 10 as Mac OS 9 and
previous Mac OSes have has built-in
voices about a half dozen different
voices of different characteristics one
sounds like a robot one sounds like a
young girl one sounds like an older man
etc etc and all these voices have
names and the voice class allows you to
pick one of these voices by name adjust
its properties properties such as the
rate with which the voice speaks its
pitch its volume etc and we have three
basic events we generate from the speech
synthesizer the first event is the word
event and you can listen for that event
whenever the synthesizer concludes
speaking a particular word you can also
listen for the sync event now the sync
event is something that you can embed a
cue for in the text you're asking the
synthesizer to speak and I'll
demonstrate that later and a phoneme event
which is used to animate for example a
face let's say you wanted a
computer-generated face with a mouth to
really move with the sound of the word
you would use these callbacks and again
you can find out detailed information
about those callbacks
those events in the speech synthesis
documentation for carbon on the Apple
Developer website so you also have other
events since you basically say here
synthesizer speak this text and then
your thread continues on you're not
going to be able to listen for an
exception right there that may be
generated down the road so if the
synthesizer itself has a problem say you
embedded incorrectly some sync event or
some other embedded command and we'll go
into embedded commands later you can
listen for error events you can also
listen for exactly when the synthesizer
begins speaking when it finishes
speaking whether that is someone stopped
it some other API just called stop stop
talking or it concluded itself it read
all the text and that would be the done
event so let's do a speech synthesis
demo and this demo is packaged in the
speech synthesis SDK and let's launch it
here this is just a normal Java app
wrapped with MRJAppBuilder and it's
built this is all Java with a Swing
Aqua look and feel and it's built to not
only show not only be a little fun but
be educational because I'm assuming that
Java programmers that want to do speech
synthesis don't necessarily know all the
capabilities of the synthesis system
underneath on Mac OS 10 so it's fun to
play with we'll go through and we'll
just ask it to speak something we
basically have a voice here Agnes and we
can see her age about 35 female etc the
rate that she speaks at the pitch base
and pitch modulation the volume that she
speaks at and we'll just ask her to
speak isn't it nice to have a computer
that will talk to you and maybe we want her
to speak a little faster it'd be nice to
have a computer that'll talk to you so
you can play around with these and see
the different voices we'll go through
some more interesting
ones
do that act on all these strings by the
way are built into the system for the
particular voices you choose they
particularly exemplify
the type of voice that you have selected
and we'll look at one more let's look at
something really alien that looks like a
peaceful planet so different voices it's
pretty interesting and this again
is just the Java framework now another
thing that you can do with speech
synthesis is embed commands so let's say
for example we have many commands that
you can embed an embedded command
basically is just a bracketed keyword in
the text that's how you send speech
synthesis embedded commands that's not
part of the Java speech framework that
is actually part of the underlying
speech synthesis technology on OS 10 and
you can do many things you can change
the emphasis of a particular word if you
really want to hit it heavily
as the computer speaks it you can change
a number mode for example and this is
interesting so let's let's go play with
this for a minute speech synthesis
is pretty intelligent it tries to guess
how it should pronounce numbers but
sometimes you need it to pronounce
it digit by digit other times you
need it to read it as if it was a whole
number say five hundred two instead of five oh two but let's
type something in let's say please call
me at 5 5 5 - 1 2 3 4 at extension 1 2 3
4 5 6 7 8 5 6
I mean okay so first we'll listen to the
computer speak this please call me at 5 5
5 1 2 3 4 at extension 5678 so there was
a little bit of interesting hiccup in
the highlighting there but it pronounced
the first number correctly it recognized
as a phone number so it read it
individually but an extension number you
wouldn't normally tell someone to call
you at five thousand six hundred
seventy-eight you'd say extension five
six seven eight so we'll embed here the
number command to ask it to speak this
literally and then once we're done at
the other end of the number so it
returns to a normal number mode we'll
insert the normal command and again we'll ask
it please call me at five five five one
two three four at extension five six
seven eight so you can do things like
that so it's a little more intelligent
about what it's pronouncing another
thing that you can do is among all these
embedded commands is ask for a sync
event as I mentioned a few slides
earlier let's say you're using speech
synthesis to do a presentation or
something and after a certain sentence
or a certain word certain paragraph you
want to present a graphic on the screen
and then you want it to continue on with
what it's doing or you want to sync in
some other way a UI with the speech
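Because embedded commands are just bracketed text, building them is plain string work. The sketch below is an illustration under assumptions: the `[[` and `]]` delimiters, the `nmbr LTRL` and `nmbr NORM` number-mode commands, and the `sync` command name are written from memory of the Carbon speech synthesis documentation and should be checked there. The helper names are invented for this example, and the sync tag is treated as a four-character key, which is how the session goes on to describe it.

```java
public class EmbeddedCommandSketch {

    // Assumption: commands are wrapped in double square brackets, per the Carbon docs.
    private static final String OPEN = "[[";
    private static final String CLOSE = "]]";

    /** Formats one embedded command, for example [[nmbr LTRL]]. */
    public static String command(String name, String argument) {
        return OPEN + name + " " + argument + CLOSE;
    }

    /** Wraps a number so it is read digit by digit, then restores normal number mode. */
    public static String literalDigits(String number) {
        return command("nmbr", "LTRL") + " " + number + " " + command("nmbr", "NORM");
    }

    /** Builds a sync cue; the four-character tag identifies which callback fired. */
    public static String sync(String tag) {
        if (tag.length() != 4) {
            throw new IllegalArgumentException("sync tag must be four characters: " + tag);
        }
        return command("sync", tag);
    }

    public static void main(String[] args) {
        String phone = "please call me at 555-1234 at extension " + literalDigits("5678");
        String pitch = "you can buy product A for five dollars " + sync("AAAA")
                + " or product B for ten dollars " + sync("BBBB");
        System.out.println(phone);
        System.out.println(pitch);
    }
}
```

The resulting strings are what you would hand to the synthesizer in place of plain text.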
so let's type in a sentence such as you
can buy product A for five dollars or
product B for ten dollars and then
let's enter a sync command and the way
to enter a sync command this is sort of
legacy OS 7.5 or so OS 9 ish of how
they had many four-letter keywords for
different technologies in the OS so in
this case since the roots of this are
still in Carbon and in earlier OSes
even though we're on OS 10 now we basically
need to add a four-letter keyword to our
sync events so that when we get a call
back we'll know exactly which callback
that was for the sync event so first
I'll embed this here and then and by the
way this is this is just a convenience
this whole mechanism I'm using to embed
these commands this is just text so I
can actually just
cut and paste here yes and I'll change
this to BCC or something and pay
attention to the embedded command
feedback area here at the bottom of the
window you can buy product A for
five dollars or product B for ten
dollars so that's a method that you
could use to sync something with the
speech synthesis and before we go on let
me just replay that and pay attention to
the phoneme feedback down here the phoneme
codes are actually in
the speech synthesis documentation for
Carbon on the Apple developer website
but you'll see it roll through all the
different sounds as we speak you can
buy product A for five dollars
or product B for ten dollars and so again
you could use that to animate properly
the shape of the mouth of an animated
head for example so that is speech
synthesis that's about all the
capabilities that are in speech
synthesis of course there are variations
not shown in this demo for example you
can pause the speech you can choose to
stop the speech at a word break at a
sentence break immediately I won't have
you insane for pausing but let's
continue on now to java speech
recognition and this is the nightmare
demo because we've already tried it in
here and i feared this before i even got
in this room and the acoustics even with
none of you in here really plays havoc
with this microphone but we'll give it a
shot anyways but first let's just talk
about Java speech recognition so Java
speech recognition
it recognizes spoken language contained
in a language model and we'll talk a
little bit more about a language model
later but that could be something for
example like call a meeting or phone
home that's a simple language model
itself we provide about 70% coverage for
speech recognition when compared to the
carbon speech API
again it's architecture neutral as I
said before with speech synthesis so
that you can coordinate the recognition
callbacks with your UI callbacks as a
convenience and again we provide
JavaBeans and customized property editors
for tight integration with visual IDEs
such as JBuilder or Metrowerks
the two main classes for Java speech
recognition are the recognizer and the
recognizer does about what you'd expect
it allows you to start recognition stop
recognition set a particular language
model
add your listeners for events
etc the second important class is the
language model there are two different
ways to use the language model class one
is just hey here's an array of different
sentences I want you to listen for tell
me when you get one of them and tell me
which one you got which one you
recognized another one is more complex
and it really is something that if we
have time I'm going to demonstrate a bit
but it's still in really an alpha stage
for a couple different reasons
that'll probably be clearer when when we
give that demo the events that the
recognizer can generate are unrecognized
hey you know somebody said something but
I don't really know what they said it
didn't match up with anything in my
language model that you gave me and you
would use that to do something like ask
the user to speak that phrase again
another event that you can listen for is
the detected event that just means
someone has started to speak but
the recognizer has not yet completed
recognizing what they said they are not
done saying what they're saying and
the other event is the done event this
means that the recognizer actually
recognized what they said and it can
tell you exactly what that was so
language model and editor this is an
example of a language model the top
model top LM is the top language
model and it has two possibilities call
person or schedule meeting now schedule
meeting is literal listen for the words
schedule meeting call person actually is
a concatenation of two other models the
call model third line down and the
person model
the call model can be call phone or dial the
person model could be literally Arlo
Brent Matt or my wife so in this
language model and then also you have
the top language model with person or
view today's schedule variations of the
other two so in this language model you
could possibly receive dial Brent which
would be the first part of the first
language model or schedule meeting or
view today's schedule or schedule
meeting with Brent for example so let's
go to the speech recognition demo keep
our fingers crossed and this is the
speech recognition mic here I'm going to
start up speech recognizer and this is a
very simple demo it's actually something I
used to control my web browser but right
now I just have it as a simple list
that's going to pick up what I'm saying
so let's wait for it to start okay and
if we can cut this mic for a second
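The language model from the slide can be pictured as a small grammar whose rules refer to other rules. The sketch below is plain Java written for illustration, not the framework's language model class. It expands such a grammar into every phrase it accepts. The rule contents are an assumption reconstructed from the spoken description (call, phone or dial, the four person entries, and the schedule variations), and the angle-bracket rule names are invented here. It assumes the grammar has no cycles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LanguageModelSketch {

    // Rule name -> alternatives, each alternative being a sequence of tokens.
    private final Map<String, List<String[]>> rules = new LinkedHashMap<>();

    /** Registers a rule; tokens that match another rule's name are treated as references. */
    public LanguageModelSketch rule(String name, String... alternatives) {
        List<String[]> parsed = new ArrayList<>();
        for (String alternative : alternatives) {
            parsed.add(alternative.split(" "));
        }
        rules.put(name, parsed);
        return this;
    }

    /** Returns every phrase the named rule can produce (the grammar must not be recursive). */
    public List<String> expand(String name) {
        List<String> phrases = new ArrayList<>();
        for (String[] alternative : rules.get(name)) {
            List<String> partial = new ArrayList<>(Collections.singletonList(""));
            for (String token : alternative) {
                List<String> choices = rules.containsKey(token)
                        ? expand(token)
                        : Collections.singletonList(token);
                List<String> extended = new ArrayList<>();
                for (String prefix : partial) {
                    for (String choice : choices) {
                        extended.add(prefix.isEmpty() ? choice : prefix + " " + choice);
                    }
                }
                partial = extended;
            }
            phrases.addAll(partial);
        }
        return phrases;
    }

    /** The model from the slide, as best it can be reconstructed from the talk. */
    public static LanguageModelSketch slideModel() {
        return new LanguageModelSketch()
                .rule("<call>", "call", "phone", "dial")
                .rule("<person>", "Arlo", "Brent", "Matt", "my wife")
                .rule("<top>",
                        "<call> <person>",
                        "schedule meeting",
                        "schedule meeting with <person>",
                        "view today's schedule");
    }

    public static void main(String[] args) {
        for (String phrase : slideModel().expand("<top>")) {
            System.out.println(phrase);
        }
    }
}
```

Under these assumed rules the top model accepts 18 phrases in total, which shows why composing small models is easier than listing every sentence by hand.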
so in the acoustics of this room some of
the softer sounds I found were hard to
pick up between this microphone and the
fans and all the computers and in the
different space in the room but there we
see the speech recognition working and
by the way we're going to go into sample
code and actually write some code
take a look at some other code after
we're done with these demos let me quit
this here and can we have the main
presentation board back thank you so the
last framework that I want to take a
look at is the Java spelling framework
the java spelling framework is a set of
java api that allows you to access the
underlying cocoa spelling service and
it's built on the cocoa spelling
services which in turn are built on top
of mac OS 10
it provides three levels of usage three
styles or levels one is just a simple
set of API you can call to ask it is
this particular word spelled correctly
give me a list of suggested corrections
modify the system dictionary that is I
have a new word it should be in the
dictionary it's spelled correctly it just
isn't add it to the dictionary and that
would be system-wide so then other
Cocoa apps such as TextEdit would then
pick up that word in its spell-checking
session another way to integrate is
complete integration with the swing text
components now I don't know if you're
aware but in Swing all text components
derive from JTextComponent and JTextComponent
has a classic
model-view-controller paradigm design
and that makes it very easy for us to
tie in to all text widgets in Swing
whether it's one-line unstyled text
multi-line unstyled text or fully
styled multi-line text and it also has
an abstraction for integrating with non
swing text components now its main
classes are the spelling checker itself
and this just basically has the
simple static API that you would call if
you wanted to just roll all this
yourself another important class is the
misspelled word class you'll receive
that in callbacks from the spelling
panel which we'll go into later
that will say what word was misspelled
where it was in a group of text how long
it is etc so that you can use that to
mark up your text etc if you're not
using the full-featured second mode of
operation we described earlier which is
full integration with Java text
components and a word class which is the
parent class of misspelled word it just
indicates a correctly spelled word so
this is a snapshot of the classic
spelling session these are all Swing
Aqua widgets and as we can see we have a
classic spelling panel on top of our
demo and I'll show that in a second and
through this panel you can receive a
bunch of different events but let's see
how we bring that panel up so in the
case of the second usage just to get that
session on a standard Java text
component it takes these two lines of
code and your text component has to be
nothing special
so you just instantiate the JTextComponentDriver
and you ask it to check
the spelling of the text widget and you
can still if we go back you can still
edit the underlying text component if
you wish and you can just close the
spelling panel and manage all of your
dictionaries etc this spelling panel
that's on top now will appear as soon as
we call driver dot check
spelling so another manner of checking
the spelling is with the inline pop-up
in real time spell checking in real time
spell checking you will get cues for
misspelled words which are red dashed
underlines just as you would in any
Cocoa application and by
control clicking on the misspelled words
you can get a list of
suggested corrections and you can add
and delete things from the system
dictionary the Java spelling framework
UI API which is to do with this window
on top allows you to integrate the spell
checking panel with some other text
component there are since Swing
didn't come along for a while in Java's
life many different text widgets out
there and so you can still use for the
most part the Java spelling framework as
is with the spelling panel with
your own custom text widget as long as
you do a few things now one is
implementing a new driver
JTextComponentDriver
is the default Swing implementation and
it implements the interface
below driver and you would need to
write your own implementation of the
driver interface and then once you did
so you would tie in with the spelling
panel the class above and you would be
able to have this spelling panel tied in
with your text component of course you
would have to handle the highlighting of
the misspelled word you would just get
cues and the cues you would get are
events generated by the spelling panel
one of the cues would be an event called
ignore the user wants to ignore the
currently suggested misspelled word find
next these just basically all apply to
the buttons in that spelling panel
correct the user wants to correct the
misspelling the correct event would have
the choice of the new correct spelling
and it would be your job if you were
implementing this driver yourself to
replace that text another event is found
and it would be your job to highlight a
misspelled word when you receive that
event so let's take a look at this spell
checking demo
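To make the custom driver idea concrete, here is a minimal sketch of a driver for a non-Swing text widget reacting to the four panel events just listed. Everything in it is hypothetical: the `SpellingPanelListener` interface, its method signatures, and `PlainTextDriver` are invented for illustration and only mirror the event names from the talk (found, correct, ignore, find next). The real driver interface ships with the framework.

```java
public class SpellingDriverSketch {

    /** Hypothetical callbacks mirroring the spelling panel events named in the session. */
    public interface SpellingPanelListener {
        void found(int offset, int length);                       // highlight this misspelling
        void correct(int offset, int length, String replacement); // replace it with the chosen word
        void ignore(int offset, int length);                      // leave the word alone
        void findNext();                                          // move on to the next word
    }

    /** Toy "text widget" backed by a StringBuilder, standing in for a custom component. */
    public static final class PlainTextDriver implements SpellingPanelListener {
        private final StringBuilder text;
        private int highlightOffset = -1;
        private int highlightLength = 0;

        public PlainTextDriver(String initialText) {
            this.text = new StringBuilder(initialText);
        }

        @Override
        public void found(int offset, int length) {
            highlightOffset = offset;
            highlightLength = length;
        }

        @Override
        public void correct(int offset, int length, String replacement) {
            // Replacing the text is the driver's job, as the session points out.
            text.replace(offset, offset + length, replacement);
            clearHighlight();
        }

        @Override
        public void ignore(int offset, int length) {
            clearHighlight();
        }

        @Override
        public void findNext() {
            clearHighlight();
        }

        private void clearHighlight() {
            highlightOffset = -1;
            highlightLength = 0;
        }

        public String getText() {
            return text.toString();
        }

        /** Returns the currently highlighted word, or an empty string when nothing is highlighted. */
        public String highlightedWord() {
            return highlightOffset < 0
                    ? ""
                    : text.substring(highlightOffset, highlightOffset + highlightLength);
        }
    }

    public static void main(String[] args) {
        PlainTextDriver driver = new PlainTextDriver("this s a test");
        driver.found(5, 1);
        System.out.println("highlighted: " + driver.highlightedWord());
        driver.correct(5, 1, "is");
        System.out.println("corrected: " + driver.getText());
    }
}
```

A real driver would repaint its widget in `found` and `correct` rather than just tracking offsets.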
and we'll go through three methods of
doing this so the first one is the
standard spell checking session and as
you can see we have a list of possible
suggested corrections to the misspelled
word and it's picked the top one the top
one is always the most likely to be
correct it's not always though and we'll
see that so in this case I will choose
that word but for the third word I
really want a fourth word I really want
one not the first and I'll correct that
then and this is just your typical
spelling session but the thing to
remember is that this is all Java these
are all Java widgets this is all swing
accessing the underlying cocoa spelling
service will correct that will correct
that etc and since this particular
integration is that first two line call
using full Swing integration with the
spelling service when we have a word
down below that needs to be highlighted
the text panel scrolled for us etc
we'll replace that word and we'll
replace the last word so that is the
first manner with which you can check
spelling and let's get back our
misspellings and start over
the second manner is just to simply
check them all so it checks them all and
again puts a red underscore misspelling
cue under all the misspelled words and
we can right click as we saw in the
presentation and select the correct
word we'll just we'll just select a few
here so you can do that as well let's
get back a fresh copy and go to the real
time spell checker and while that's
coming up all of these calls are being
made to check the spelling when it
eventually has a word that it wants to
find out is this spelled correctly it's
taken from the world of Java in this
particular process and it's sent via DO
I don't know if you know anything about
DO DO is a distributed
objects mechanism in Cocoa so it's
packaged up in Cocoa sent over the wire
to another process
running on Mac OS 10 the spell server
that spell server looks at the word
finds it in the dictionary or doesn't
and sends back a response whether it's
spelled correctly or not so all that
happens pretty quickly and to give an
example let's turn on the real-time
spell checker and let's just type
garbage at first let's just type blah
you know who knows what this is but as
you can see as I type it is checking
every single correction I make to the
document and all that it's just
happening extremely rapidly now let's go
back and let's actually do something
look at our classic sentence again so
since the real-time spell checker is
still on we can delete make our
correction manually we know that this is
s for example should be is but we can
also then do as we did for the check all
option which is to ask it to give us a
list of suggested corrections and we can
correct it in this manner so that is a
demo of Java spell checking so where to
get the code you'd go to Apple's Java
developer website it's sometime this
week I'd look by Friday and it'll
definitely be there but while on the
subject of code let's see how
easy it is just to roll some of these
things on our own and let's see a bit about
what I was talking about when I said
Java beans so let's launch jbuilder and
let's close this nice little app again
okay oh go back okay let's start this
guy up again since the demo gods don't
like me I think I'll begin to save
incrementally
okay okay let's save all let's go back
to the visual designer
and let's add our speech synthesis
bean again okay and let's set the
layout of this this will be the world's
most horrible GUI to null meaning XY
let's go back to Swing buttons okay
let's add a button to start
our speech synthesis another
one to stop it
and let's add a Swing scroll pane and
inside of that let's add a text widget and
let's put some default text in it okay
and save again and let's go and you know
we won't even bother naming this buttons
let's kill and asked the synthesizer to
begin speaking so let's say this I love
this feature of jbuilder it's just
totally awesome oh this is codeinsight
so especially during these demos is it
really great jacent the sizes should be
down there somewhere did it name it just
synthesizer I think it did okay so the
size R 1 dot speak text and we'll get
that text from this dot J text area 1
dot get
next I can't get any easier than that
well you can write it itself no I'd be
out of a job
okay let's don't add that feature Blake
please so let's go back and add stop
stop stop speech okay and okay let's run
it so obviously this is about the
simplest form of synthesizer but you
could also add callbacks to it but we'll
just take a look at those events on it
so we can speak it this one's neat let's
add watch the stuff and hope that stop
works huh because otherwise it's gonna
repeat this okay and ask it to speak
this was neat okay that's good this was
neat this was neat okay that's good it
worked not actually near an orphan thank
you so so that is speech synthesis so
while we're in this demo let's add one
more button to the beauty of jbuilder
and let's add our spell checking button
spell checking bean
one more time this J this J text yeah
okay compile it again I'll fast it
compiles and let's type something more
interesting in let's see try to think of
something new okay that's right and
let's bring up the spell checking panel
that one line of text that we added and
let's see if it has a guess for me here
and yes it does okay so we'll do that
and correct that spelling doesn't have
one for there so I'll just change it
myself there it does have one for that
for the heck of it let's just say ignore
that one and let's also say that for
whatever reason we want that word to be
declared as spelled correctly let's
learn it and continue on and and then
just kill our session here and then we
could just continue on or whatever now
let's open up let's quit that so there
are two examples of easily incorporating
the speech synthesis and spell checking
framework into your application if we
look again we need to look at the top of
this application because JBuilder did
some things for us it instantiated the
JTextComponentDriver when we added it
to the application visually it also
did the same thing for the synthesizer
class as well and then when we look
down at our code below it just took one
line of code to ask the synthesizer to
speak the text another line of code to
ask it to stop
one line of code to ask it to check the
spelling and for checking spelling
there are also check real-time and
check all API as well so any way you
slice it you have a couple lines of code
that only would work on Mac OS 10
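One way to keep a couple of platform-specific lines from breaking a cross-platform build is to reach them through reflection, so the class is looked up at run time rather than compile time. The sketch below uses only standard `java.lang.reflect` calls. The class name `com.example.speech.Synthesizer` and the method name `speakText` are placeholders chosen for this example, not the framework's real package or signature.

```java
import java.lang.reflect.Method;

public class OptionalFeatureSketch {

    /**
     * Creates an instance of the named class and calls the given one-String-argument
     * method on it. Returns false instead of failing when the class or method is
     * missing, which is what happens on a platform without the framework installed.
     */
    public static boolean invokeIfPresent(String className, String methodName, String argument) {
        try {
            Class<?> type = Class.forName(className);
            Object target = type.getDeclaredConstructor().newInstance();
            Method method = type.getMethod(methodName, String.class);
            method.invoke(target, argument);
            return true;
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class not found, no such method, or the call itself failed: feature unavailable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder names: substitute the framework's real class and method.
        boolean spoke = invokeIfPresent("com.example.speech.Synthesizer", "speakText", "hello");
        System.out.println(spoke ? "spoke the text" : "speech not available on this platform");

        // The same helper against a class that exists everywhere, to show the success path.
        boolean appended = invokeIfPresent("java.lang.StringBuilder", "append", "x");
        System.out.println("StringBuilder.append reachable: " + appended);
    }
}
```

The application then runs unchanged everywhere and simply gains the feature where the class can be found.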
but then again since there are only a
couple of lines of code and because of
the types of frameworks they are for
example spell checking you might want to
consider integrating them into your
application java applications that you
want to run on all platforms
there are several ways you could do this
but the benefit is that it would
basically run the same on all the
platforms it's just that when you got to
Mac OS 10 and users ran it
there they'd also have for example spell
checking so you might use since it's
only a couple lines of code for spell
checking reflection to call this you
might also wrap it in another API build
it on Mac OS 10 or build it on a
platform that has these same API stubbed
out so you could compile and then run it
on all of your systems and simply
when it ran on Mac OS 10 it would
have the spell checking capabilities
built into it and same for basic speech
synthesis although you know that may be
more involved in your application when
you do speech synthesis so those are the
first two easy frameworks to demo now
the bleeding-edge guy is speech
recognition so let's actually go and
open up a new file open up a file that
was put on our system when we installed
the speech framework okay let's see here
dudududu where am i pilot
was that okay Wooper
and research and java speech by mark
samples speech okay let's look at
example one this is the sample code to
the speech recognition sample we saw and
for the most part what it's doing is
instantiating the recognizer and
creating and instantiating a language model
setting the language model on the
recognizer and in this case using the
simplest manner of setting the data in
the language model that is just an array
of strings so and then it asks the
recognizer just start now the code we
don't see here is that there's a utility
class built into the java speech
recognition framework called the
feedback panel and that was that set of
strings i'm not going to go through the
whole thing actually we don't need to go
that machine I can look I can launch it
up here that was the set of strings that
you saw like the the list of strings in
the small sample and we'll just bring it
up just to associate again so everything
in this panel is the feedback panel and
you can use that as well you might want
to cue your user into what they can
speak and we've just wrapped that in a
Java frame and then we have told the
feedback frame what recognizer it should
attach to and it's gone and attached
all the listeners especially the
done event listener so that it can
highlight the
correctly spoken phrase now the other
thing that I wanted to show you that is
the real bleeding edge stuff let me
close this file
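The flow of the sample just described (set an array of phrases, attach listeners, start, react to the done event) can be sketched without the real framework. In the code below `SimulatedRecognizer`, `RecognitionListener`, and every method name are stand-ins invented for illustration. The `hear` method replaces the microphone so the example runs anywhere. The three callbacks follow the detected, done, and unrecognized events named earlier in the session.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RecognizerSketch {

    /** Hypothetical listener mirroring the three recognizer events from the session. */
    public interface RecognitionListener {
        void detected();           // someone started speaking
        void done(String phrase);  // a phrase from the language model was recognized
        void unrecognized();       // speech did not match anything in the model
    }

    /** Stand-in recognizer; the real one listens to the microphone instead of hear(). */
    public static final class SimulatedRecognizer {
        private final Set<String> phrases = new LinkedHashSet<>();
        private final List<RecognitionListener> listeners = new ArrayList<>();
        private boolean running;

        /** The simple language model: just an array of phrases to listen for. */
        public void setPhrases(String... newPhrases) {
            phrases.clear();
            phrases.addAll(Arrays.asList(newPhrases));
        }

        public void addListener(RecognitionListener listener) {
            listeners.add(listener);
        }

        public void start() { running = true; }

        public void stop() { running = false; }

        /** Simulates an utterance arriving; ignored unless the recognizer is started. */
        public void hear(String utterance) {
            if (!running) {
                return;
            }
            boolean known = phrases.contains(utterance);
            for (RecognitionListener listener : listeners) {
                listener.detected();
                if (known) {
                    listener.done(utterance);
                } else {
                    listener.unrecognized();
                }
            }
        }
    }

    public static void main(String[] args) {
        SimulatedRecognizer recognizer = new SimulatedRecognizer();
        recognizer.setPhrases("call a meeting", "phone home");
        recognizer.addListener(new RecognitionListener() {
            @Override public void detected() { System.out.println("listening..."); }
            @Override public void done(String phrase) { System.out.println("heard: " + phrase); }
            @Override public void unrecognized() { System.out.println("please say that again"); }
        });
        recognizer.start();
        recognizer.hear("phone home");
        recognizer.hear("open the pod bay doors");
        recognizer.stop();
    }
}
```

The unrecognized callback is the natural place to prompt the user to repeat the phrase.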
let's go back to JBuilder
example one let's go back to the
designer and let's go and just add a
recognition bean to this little
sample application and take a look at
the rough early version of the language
model customizer so the thing that
actually makes this rough is that
drag-and-drop
probably if you're familiar with Java at
all you know is still in a rough state
of affairs not only do we have some
bugs on our side but there are just
some bugs in the shared code that all
the VM implementers share with Sun and
I basically discovered those specific
bugs while doing this editor and we need
to get those changes fed back to Sun get
them into the pipeline so
everyone can benefit from them but I
will try to struggle my way through this
to give you a bit of an example of how
it works
so I'm going to try to recreate that
first language model we had on the
slides when we had a call person for
example schedule meeting with person so
let's say first of all that we just
wanted to say schedule meeting with
person so we could say we could add a
new language model down here oops I need
to add a new one first called person and we
could say this is let's see we could say
that this is Dan Mickey
Vincent mind not in the room and then we
could say we could drag this person now
this feedback window is
one of the bugs that I was
referencing before so we could type
up here call person
and then I should be able to say call
Dan
or call Nicky now actually we had a
bit of a difficulty with setting up the
mic so the development
environment for this sample is on this
machine I'm sitting at here and the mic
actually is attached to the other
machine so I can't actually show you
this working but trust me it works
drag and drop is absolutely a problem but you
can go into a test mode and build your
language model you could speak it etc go
back to edit mode to edit other things
once you're done you can save it as a
file somewhere and it'll bring up the OS
10 a save dialog box just save it
somewhere and say ok you're done and
then if we go back and we look at the
source JBuilder because JBuilder
adheres to even the nitty-gritty of the
bean specification it's gone in and used
the Java speech recognition recognizer
bean language model bean info and
bean to allow our framework to add
source code to the file that you're
editing that's pretty neat in
this case the source code we added was
and allow me to just split this into
two lines so you can see it better the
source code that we added was new
language model set data file for the
particular file that now contains our
language model and then we could listen
for those same phrases call Dan and call
Nikki out of that complex language model
more will be added to that so that you
can have easier ways to recognize a more
complex model but that's the state of
the situation with speech recognition at
the moment so speech recognition is
provided as one of those frameworks made
freely available on our website as of
this week the way that I would recommend
you use it for the time being is with
the simple language model that we saw
below which is just setting a simple set
of phrases listening for the recognition
event the done event and then getting
the sentence out of that event to see
what the user has spoken so that's a bit
of code and taking a look at how easy it
is to use the frameworks of course
there's a lot more in the frameworks
than we've shown today there are all
sorts of interesting things and
callbacks and synthesis and recognition
that you can get at we've actually
covered most of the spelling framework
as is so again you can get the code from
the developer website and send me email
if you have any questions
you