WWDC2001 Session 127
Transcript
Thank you. Thank you. So the good news is that... can you all hear me? Yep. The spoken language technologies, in case you haven't noticed, are in Mac OS X. We got them in there. Whoo! Thank you. What we're going to do today is briefly describe the speech recognition and the speech synthesis that are there, give you guidelines about where to use them in your applications and why you would want to, and then actually lead you through the process of getting your applications talking and listening.

So let's start off with a demo of what we've got over here. Can we switch to... oh, I'm supposed to do that myself. Whoo, there we go. The user interface to speech recognition is this round window, which you may have seen. It replaces the face that some of you might be familiar with from OS 9, and it consists of three parts. The middle, if you can see it, says "Esc"; this shows you the listening mode. Speech recognition has two listening modes: what we call push-to-talk mode, and continuous listening. In push-to-talk mode it is listening only while you hold down a key. Here it says "Esc", which means the key is the Escape key, the default; users can configure that. In continuous listening mode it is listening all the time, and optionally you can have it wait for a keyword, like "computer", before you speak your commands. I am using push-to-talk mode here, and I recommend that you do as well when you're demoing, so that when you're explaining things to people it isn't trying to recognize remarks that you intended for the people, not for the computer.

What time is it? "It's 10:30." What day is it? "It's Thursday, May 24th." Show me what to say.

Okay. The other part of the feedback is this Speech Commands window, which has two halves. The top half, which is scrollable, shows what it has recognized and, if it speaks back to you, what it says. The bottom half, which also is
scrollable (and the whole window is resizable by the user, at last; thank you, Cocoa), shows what you can say. There are now disclosure triangles, so the list no longer scrolls off the bottom of the screen. The middle item there, I don't know whether you can see it, says "Speakable Items". That shows the commands that can be spoken all of the time, no matter what application is running. The "What time is it?" and "What day is it?" that I spoke at the start are items down in there. They are actually kept in the Speakable Items folder; let's take a look at it. Open the Speakable Items folder. There it is.

So any item in this Speakable Items folder can be launched by speaking it, and that's just the same as double-clicking on it: applications, aliases, documents, servers, URLs. Anything that you can launch by double-clicking, you can now launch by speech on OS X. The real power of this is that users can customize it to the way they work, by dragging their own items into the Speakable Items folder.

In addition, just as in OS 9, the Speakable Items folder itself contains a folder called Application Speakable Items. This contains folders named after applications, and the items in those folders are speakable only when that application is in the foreground. That's shown in the Speech Commands window under the top disclosure triangle. So you see at the moment (let me close the others) that it says Finder, because the Finder is in the foreground. Let me demonstrate this in action. I'll switch to my browser, and as I do, watch the items there change. Open my browser. There: it now says Internet Explorer, and there are some different items there.

This is an opportunity for you developers. You can make application-specific folders for your applications and populate those folders with scripts or other commands that control just your application; they won't be speakable when your application is not in the foreground. I'd like to encourage you to explore this by yourselves. I
just want to show you one other thing. Hide this application. As you may have noticed, we shipped one game, and that is Chess. Oh, in telling you this I should point out that we keep track of all the applications you've launched since you last restarted the machine, and you can switch to any of them, running or not, just by saying its name. Switch to Chess.

So the Chess application, which you might have seen mentioned and demoed in one of the keynotes on Monday as an example of a good, Aqua-quality user interface, is one you can control by speech. Pawn d2 to d4. Knight b1 to c3. What's it going to do about that? Let's see. It's thinking... it's slower than I am, but it's a better player than I am, too. One of the things I like about this is that there's one thing I can do with it that I can't do when I'm playing with real people, and that's the following. I can say "Take back move." My twelve-year-old son does not let me do that when I'm playing with him.

Okay, let's move on. So what speech is there in Mac OS X? Bear with me: there are a lot of people at this conference this year who are new to the Mac OS platform. Those of you who are already familiar with the platform will already know what we have, so this is just a brief mention for those who aren't. We have speech synthesis and speech recognition. The speech synthesis will take any text and convert it into audible speech. There are 22 different voices, ranging from adult male and adult female voices, through voices that sing and voices that sound like aliens, to other novelty voices.

And we have speech recognition. A number of characteristics of the speech recognition are important. First, it's speaker independent. That means you don't have to train it to your voice: you just take it out of the box and it works. The Mac was, and remains, the only computer platform I know of that you can take out of the box, put on the desk, and command by voice. It's
continuous speech: you don't have to pause between words. And it uses a far-field microphone. This is a technical point, but a very important one. Speech recognition is very sensitive to background noise; so sensitive that all other recognizers I know of require users to purchase, mount, and use a noise-canceling, close-talking, head-mounted microphone. We tuned our recognizer to work with the microphones built into iMacs and the other CPUs that Apple delivers. That means we pick up background noise that other recognizers don't, so we have several layers of software to adaptively model, subtract, and otherwise deal with that noise. But there are limits on how much background noise we can handle, so we've tuned it to work well in the situations where most of our users use their computers: in the office and at home. If you have users or customers who use speech recognition in noisy environments, such as classrooms, that might be pushing the built-in microphones a little too far. One such environment, for example, is giving a presentation in an auditorium with a 300-watt sound system that produces a slapback echo from the back of the hall, which is why for this presentation I've been using this head-mounted microphone. This one is made by VXI; they have worked quite a lot with us over the last couple of years to optimize their microphones for our speech recognition, so if you want to, you can direct your users to those as an alternative. Our speech technologies are currently US English only on OS X.

So what's new in speech synthesis? We've done a few things. We merged the MacinTalk 3 and MacinTalk Pro code bases into a single code base. This has a couple of advantages. One is that we have, at last, a totally new code base: we have divested ourselves of deep, intricate, and uninterpretable legacy code and positioned ourselves to at
last be able to fold in the research improvements we've been making in speech synthesis over the last few years. So we've moved onto a new platform, ready to go forward. An immediate benefit right now is that you get consistent behavior across voices. A lot of developers have told me that they changed from one voice to another and words were pronounced differently, or the intonation was different; that will no longer happen. All the Pro voices are now just the high-quality versions, so what was Victoria High Quality in OS 9 is now called Victoria, and there is no lower-quality version in OS X. That means the speech is crisper, it's easier to understand, it's more robust in the presence of background noise, and people can understand it without it being so loud.

And we've improved the pronunciation in a number of ways. We've enlarged the dictionary from about 20,000 words to about 120,000; that's Jerome's work, down here. The morphological decomposition is now recursive. If you want to know what that means, ask us at question time; it's fascinating, really, very cool. It acts as a multiplier on the effectiveness of the dictionary. And the letter-to-sound rules are now trained automatically on the new, large dictionary, rather than handwritten based on one linguist's intuitions.

Speech recognition benefits here too. We have factored the pronunciation subsystem out of speech synthesis and made it a separate subsystem that's now shared between speech synthesis and speech recognition. First of all, this reduces overall RAM use, which improves the performance of everything running on the platform. It makes recognition more accurate, because the recognizer is now expecting the correct pronunciations for more words. And it gives consistent behavior across speech recognition and speech synthesis. Developers have said to me, "I was prototyping some spoken commands for my application, one of the commands was not recognized very well, and I
thought that perhaps the recognizer was not listening for the correct pronunciation. So to find out what it was listening for, I typed my command into your text-to-speech system, and sure enough it was spoken with an incorrectly pronounced word." Well, up until today that test has been irrelevant; now it is relevant. If you want to know whether the recognizer is expecting the correct pronunciation, type the word or the command into speech synthesis, and it will tell you how the recognizer thinks people will pronounce it.

The user interface has been completely revised, as I showed you, and there's an improvement in Speakable Items: we've added XML-based command files that let you associate a spoken command with a keystroke sequence. Kevin Aitken will show you that in more detail later in this presentation.

I want to talk about why you should use speech in your applications. From our perspective there are two classes of applications. There are applications centered on speech, where speech technology is central to the user's experience and to the value the application delivers. And then there's a huge number of applications, which I think most of you write, for which speech is not centrally relevant at all. I'd like you to think about places where you can use speech in those applications as well. Chess is an example: speech is not really relevant to chess, but we added it and people say, "Hey, that's cool." If you add speech to your application, you'll increase the number of potential users, and that increases your market. For example, younger users will find your application more approachable, people with disabilities will be able to use it, and people who are less familiar with computers will be less scared of trying it out. Speech is a very natural form of communication; we've all been talking and listening since we were, what, two years old? Well, one and a half.
Speech enables your application to move beyond the limits of point-and-click. There's nothing wrong with point-and-click; in fact it's very good at letting you control things that you can see in the user interface and reach with a single gesture. But there are lots of things you want people to control that they can't see to point at and click on, and speech gives you a way past that. If you think about it, clicking is rather like grunting: it turns our back on about 200,000 years of human evolution, because when I click on things I'm just going [grunts]. I'd like to think that we've come somewhat further than that. Similarly, speech output can be a lot better than just a beep, yet so many of us are still using alert sounds. The beep was the mentality of the 1960s, when all computers had was a tiny little speaker. We've come forward since then, and speech gives you a way to bring yourselves into the 21st century. And conversation is an appropriate modality for delegating tasks to a computer; we'll illustrate that a bit more shortly.

So what are some of the ways you can use speech synthesis in your application? One is notifications. We recommend that you use speech judiciously, to get the user's attention back when it has wandered away from your application. Some of you may know about, or have experienced, the talking alerts we did in OS 9. Now there's a slider that lets you set the delay between an alert coming up and it being spoken; you may want that longer or shorter than the default. But the point is that normally you should not hear any speech: only when your attention wanders, the computer wants your attention, and you don't respond, does it speak to get your attention back. You can also use notifications for asynchronous events. For example, in AOL Instant Messenger, when your buddies enter your chat room, it announces that with our speech synthesis: "Buddy Smith just entered the room."
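A minimal sketch of the talking-alert rule just described, written in Python only to keep it self-contained (it does not call Apple's APIs): an alert is spoken once, and only if it is still unanswered when the user's delay has elapsed. The `TalkingAlert` class and its methods are hypothetical, `speak` is a stand-in for your real synthesis call, and the current time is passed in explicitly so the logic can be exercised without a clock.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TalkingAlert:
    """Hypothetical talking alert: speak `message` once, but only if the
    alert is still unanswered `delay` seconds after it was posted."""
    message: str
    delay: float                   # the user's slider setting, in seconds
    speak: Callable[[str], None]   # stand-in for your synthesis call
    posted_at: Optional[float] = None
    dismissed: bool = False
    spoken: bool = False

    def post(self, now: float) -> None:
        self.posted_at = now

    def dismiss(self) -> None:
        self.dismissed = True

    def tick(self, now: float) -> None:
        """Call periodically, e.g. from an event-loop timer."""
        if self.posted_at is None or self.dismissed or self.spoken:
            return
        if now - self.posted_at >= self.delay:
            self.speak(self.message)
            self.spoken = True


# Usage: the user looks away, so after the delay the alert is spoken once.
said = []
alert = TalkingAlert("Your download has finished.", delay=10.0, speak=said.append)
alert.post(now=0.0)
alert.tick(now=5.0)    # still within the delay: stay silent
alert.tick(now=12.0)   # attention has wandered: speak
alert.tick(now=20.0)   # never repeated
print(said)            # ['Your download has finished.']
```

Keeping the delay check in one place makes it easy to honor the user's slider setting and guarantees the alert never nags by repeating itself.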
You can use speech synthesis to give additional feedback for younger users. We have found, and our users have found, that applications for the 6-to-12 age range become accessible to the K-to-6 range if they change nothing except reading out the text messages they put up on the screen. You can use speech synthesis for proofreading. For example, say you have an application where people enter data into a spreadsheet, such as budget figures in a column: let them select that column and have it read back, so they can check what they entered and make sure there aren't any errors. Speech also makes your application more accessible to people with disabilities; I think that's pretty obvious.

Speech is really cool in games; you saw that we did it with Chess. A lot of games already use speech for things like cheat codes and changing weapons. Use speech for control that is not time-dependent. I don't think it would be appropriate to use speech recognition in your game to say "Fire! Now! Quick, quick, quick, right, left!", but you can do that as well if you want to try it. In noisy games, headsets are probably the right thing to use, so that speech recognition doesn't confuse the game's own sounds with your commands.

A lot of people are successfully using speech in education applications, and I think many more of you with education applications could take advantage of it. One example is the DynEd product, which uses speech recognition for pronunciation correction for adults learning English. It's way cool; check it out.

You can use speech to enhance the web browsing experience, for navigating within a browser. Those of you who have explored will have seen that we actually ship a couple of spoken commands for Internet Explorer that let you do some simple navigation. If you have a browser, you can do a much better job than we do by working within it. For example, people could
speak the links, jump to pages by topic, and have web pages read out.

And there's a big opportunity with VoiceXML. Enterprises, which are moving more and more of their information onto web access, are now doing two or three different versions of all of their websites: the HTML version; a version that personal digital assistants can access via wireless; and VoiceXML versions, so that people can phone the web page and have the information read out over the phone. This is done with an extra set of tags in the web pages that carry just what you need for speech access. You could do that on the desktop, on a Mac, using our APIs, and it would be pretty straightforward, because the infrastructure and the hard work have all been done for you by the web developers.

We recommend that you think about using speech for form filling, as an alternative to people filling things out with pop-up menus. You can have people speak the contents of each field, and that lets you use a constrained language model for each field to increase recognition accuracy. For example, the person could say "Create a new customer record", and the computer could respond "What type?" and then narrow its search to just the alternative types of customer record. The person could say "Corporate account". The next field would be payment schedule, and the person could then say "30 days"; at that stage the recognition model has been changed again, to listen only for the possible payment schedules.

There are a lot of tasks where people's eyes are busy and their hands are busy. For example, you're in a graphics program, you're drawing, you've got the mouse down, you're dragging a line across an object, and you want to move it around, send it to the back, or change the pressure setting. When the eyes are busy and the hands are busy, speech gives you another way for your users to control your app.
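The form-filling idea above amounts to a small dialogue state machine in which only the current field's phrases are active. Here is a hedged Python sketch of that control flow; the `SpokenForm` class, the field names, and every phrase other than "corporate account" and "30 days" are made up for illustration. In a real application, `active_phrases()` is what you would install as the recognizer's language model for that turn.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical form definition: (field name, spoken prompt, allowed phrases).
# Each phrase list plays the role of the constrained language model that the
# recognizer would listen with while that field is active.
Field = Tuple[str, str, List[str]]

CUSTOMER_FORM: List[Field] = [
    ("record_type", "What type?", ["corporate account", "personal account"]),
    ("payment_schedule", "What payment schedule?", ["30 days", "60 days", "90 days"]),
]


class SpokenForm:
    def __init__(self, fields: List[Field]) -> None:
        self.fields = fields
        self.index = 0
        self.values: Dict[str, str] = {}

    def done(self) -> bool:
        return self.index >= len(self.fields)

    def prompt(self) -> Optional[str]:
        """What the computer should say next, or None when the form is complete."""
        return None if self.done() else self.fields[self.index][1]

    def active_phrases(self) -> List[str]:
        """The only phrases the recognizer should accept right now."""
        return [] if self.done() else self.fields[self.index][2]

    def hear(self, utterance: str) -> bool:
        """Fill the current field if the utterance is in its vocabulary."""
        if utterance not in self.active_phrases():
            return False
        name = self.fields[self.index][0]
        self.values[name] = utterance
        self.index += 1
        return True


# Usage, following the dialogue described above.
form = SpokenForm(CUSTOMER_FORM)
print(form.prompt())            # What type?
form.hear("corporate account")
print(form.active_phrases())    # ['30 days', '60 days', '90 days']
form.hear("30 days")
print(form.values)  # {'record_type': 'corporate account', 'payment_schedule': '30 days'}
```

Rejecting out-of-vocabulary phrases per field is the point of the design: the smaller the active set, the fewer alternatives the recognizer can confuse.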
So at this stage I'd like to invite Sal Soghoian up. Sal Soghoian is the AppleScript product manager, and he's going to show you some way cool ways he's been using speech.

So is this... huh, great. Hi. It's amazing that I got up this early for this whole thing; those who know me know that it's an impressive feat. What we're going to be showing today (can you switch me to this machine?) is how to use AppleScript with speech. One of the best integrations on Mac OS is the ability to use these two technologies together. On Mac OS 9 we introduced the ability to have a script listen for a response and, based on the user's response, perform a different set of actions, and we incorporated a technology called the Speech Listener. So I'm going to show a couple of scripts today that use this technology on Mac OS X, and both scripts involve a conversation to get a task done. The first one is a rather straightforward example where I ask the script for some music, it prompts me with a series of questions, and we have some music played. So let's see if my voice is back where it should be, and we'll try this out.

Some music, please. "Which artist or category?" Christine Kane. "Which song from Christine Kane?" Tucson. [Music] Stop playing songs.

So in this example, the script starts up, gets the information about which artists are available, holds it in memory, and says "Which artist or category?" When I say "Christine Kane", it matches that, then queries to find out which songs by Christine Kane are available, holds those in memory, and says "Which song?" I say "Tucson", that matches, and then it has iTunes play the song. This is a simple example; you can program in a certain amount of grief with these things, just to keep you honest. So there's an example of being able to carry on a conversation with a script. It's a limited conversation, but it is a way of gathering information and moving forward. Now let me hide this
application. In the next example I'm going to use a program from Up Systems called Goat Reap. One of the things this application does is access information over the Internet. I'm going to use a script acting as a person, a personality called Victoria. Victoria acts independently of Speakable Items' speech recognition, in that she has her own set of scripts that she uses in conversing with me. So here we go; let's try this out and see if she's awake.

Victoria. "Yes?" Show my newspaper. "Here you go. Something else?" Clear all stories. "All stories have been removed from your newspaper." Add multiple... "Anything else?" Add multiple stories. "Ready." Motley Fool. "Adding Motley Fool. Ready." Apple stock quote. "Adding Apple stock quote. Ready." Apple top story. "Adding Apple top story. Ready." That's all. "Anything else?" Not at this time. "Anything else?" Not right now. Not right now. "Goodbye."

So in this instance, the script is called Victoria, and it lives in Speakable Items. When the script loads, she goes to a subfolder of the Speakable Items folder called Tasks; within that folder are individual scripts, whose names she holds in memory. When I say "Show my newspaper", she loads that script and executes it. So you have a script running a script, and none of the commands Victoria was handling are part of the standard Speakable Items commands; they are a separate set, so you can create these individual personalities. Those are just two examples of how you can use AppleScript with speech. If you're interested in how to do this, the AppleScript website has a complete overview, including an AppleScript guidebook on how to use speech and AppleScript together. Thank you. [Applause]

Some issues: if you're going to include speech in your application, there are a few things you need to keep in mind. Educate your users about how to speak. A good example: go to the Speech preference panel and turn on Speakable Items, and you'll see a sheet come down, and
you'll see how we explain to users that they shouldn't pause, and so on. Let them know that background noise is a problem; you might want to refer them to head-mounted microphones. We trained the speech recognizer on North American English, so officially that's what we say we support. It happens to be somewhat forgiving: I'm Australian, and in our group we have Jerome from France, we have Devon (where are you, Devon?), a native Gujarati speaker, we have Matthias, whose native language is Swiss German, and we even have Tom from the Bronx, and it understands all of us. But there again, there are limits. And localization is an issue: currently, as I said, we are US English only.

So it's time to code: speech programming 101. I'd like to invite Matthias Neeracher up on the stage. Matthias.

Okay, now that Kim has told you what to do with speech, Kevin and I are going to talk about how to do it. Using our speech technologies in Mac OS X is pretty simple. We are included in every install of Mac OS X. To use us, just link with the Carbon framework or, for a CFM-based Carbon application, with CarbonLib. Our APIs are identical for Cocoa and for Carbon, and they are the same APIs that you used on Mac OS 9. You can use them from Objective-C, from C, or from pretty much any language that we ship on Mac OS X.

Let's start with speech synthesis. Say you want your application to say something. How difficult can this be? It turns out it's not difficult at all: it's a single line, and you will get "Hello, world." Okay, that was simple enough. If you want a little more control, you open a speech channel, giving it a voice. This can be a voice you get from a menu you give to the user, or, if you pass in NULL, you get the default voice. You probably shouldn't hard-code a voice unless you know exactly why you want to. Then you can adjust parameters as you like, and once you have them to your liking, you speak the actual text by calling SpeakText. All of these calls are asynchronous, so they will
actually return control to you before the text is entirely spoken.

We offer a lot of control. You can control the speech rate, to speak slower for younger users, for instance, or faster in a game situation. You can control the pitch modulation, so it sounds more lively, and the volume, to customize the way speech sounds in your application. We also give you callback routines, so if you have a screen reader you can highlight words on screen as they are spoken, or if you have an animated character on screen you can animate the character's lips as the phonemes are spoken. You can see many of these options in action in our Cocoa speech synthesis example, which ships on the developer CD. For most of these controls you don't actually have to write any code, because you can simply embed them in the text. For instance, in the sentence "Don't meet next Tuesday", if you want to emphasize the word "next", you just embed an emphasis command in front of that word.

This is very important, because to make speech synthesis work for you as well as it can, you should customize what is spoken. Your application simply knows a lot more about how things should be spoken than our engines can know by default. So there are a number of things you can and should do in your application. First, you should filter the text that is passed to the text-to-speech engine. For instance, if you have a stock ticker application and you come across the symbol AAPL, you should tell the text-to-speech engine to say "Apple Computer" instead. Second, you should customize the pronunciation of words that don't come out right. And last, you should customize the intonation of what is spoken.

Now, we try to have a huge dictionary, as Kim already said, but even the biggest dictionary cannot possibly handle all words, and especially not all proper names. For instance, my first name is tricky; our system certainly cannot pronounce it by default.
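As a sketch of the filtering and embedding steps described above, the Python helpers below only build the string you would hand to the synthesizer (for example via SpeakText); they do not call any Apple API. The `[[emph +]]` and `[[inpt PHON]]`/`[[inpt TEXT]]` embedded commands are from the Speech Synthesis Manager's embedded-command set as we understand it, so check the exact syntax against Inside Macintosh: Speech. `SPOKEN_FORMS` is a hypothetical table, and `phonemes()` expects a string you write yourself in Apple's phoneme notation; none is shown here.

```python
import re
from typing import Dict

# Hypothetical substitution table built from your own domain data.
SPOKEN_FORMS: Dict[str, str] = {"AAPL": "Apple Computer"}


def filter_for_speech(text: str, spoken_forms: Dict[str, str]) -> str:
    """Replace whole-word tokens (e.g. ticker symbols) with speakable text."""
    if not spoken_forms:
        return text
    alternatives = "|".join(map(re.escape, spoken_forms))
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: spoken_forms[m.group(1)], text)


def emphasize(text: str, word: str) -> str:
    """Put an [[emph +]] embedded command in front of the first `word`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.sub(pattern, lambda m: "[[emph +]] " + m.group(0), text, count=1)


def phonemes(phoneme_string: str) -> str:
    """Switch to phoneme input for one span, then back to text input.
    `phoneme_string` must be written in Apple's phoneme notation."""
    return "[[inpt PHON]]" + phoneme_string + "[[inpt TEXT]]"


print(filter_for_speech("AAPL closed higher today.", SPOKEN_FORMS))
# Apple Computer closed higher today.
print(emphasize("Don't meet next Tuesday.", "next"))
# Don't meet [[emph +]] next Tuesday.
```

Matching on word boundaries keeps the filter from rewriting longer tokens that merely contain a symbol, such as a different ticker that starts with the same letters.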
In the past, some developers have just used funny spellings to get approximately the right pronunciation; respelling my name phonetically, for example, gives "My name is Matthias", which sounds almost right. But we don't recommend this, because if you use words that are not part of the English language, or strange combinations of letters, we might change how they are pronounced in the future. Also, it is not a very precise way of specifying what you want said. So instead, you should use embedded commands to switch temporarily to phoneme input, using the phoneme notation we describe in Inside Macintosh: Speech. It's explained, with examples, on a single page, and it doesn't take very long at all to learn. The result is something like this: "My name is Matthias." That sounds somewhat better.

Next, you should customize the intonation of the text you pass to text-to-speech, because the written words alone are not always enough to convey the meaning. For instance, if you see the sentence "John only introduced Mary to Bill", you can read it as "John only introduced Mary to Bill; he didn't introduce her to anybody else." You could read it as "John only introduced Mary to Bill; he didn't introduce Caroline to him." Or you can read it as "John only introduced Mary to Bill; he didn't ask her to marry him." These distinctions can be very important, so you should annotate the text you pass to us. Our system does the best it can to work out how a sentence should be spoken, but that can be very difficult, if not impossible, in the general case. Your application has domain knowledge about much of the text that is spoken, and has the potential to do much better. For instance, take a flight reservation system that, at the end, gives a confirmation text. I'm going to play you two different versions of this confirmation text. The first is not annotated at all, and the second is annotated; I'm not going to say anything between the two versions.

"Your first flight is with Alaska Airlines flight 2762, departing from San Jose on Monday, May 24th at 6:10, landing in San
Francisco at 7:10 p.m. Thank you for using TT Years Travel."

"Your first flight is with Alaska Airlines, flight 27-62, departing from San Jose on Monday, May 24th, at 6:10 p.m., landing in San Francisco at 7:10 p.m. Thank you for choosing TT Years Travel!"

So, did you hear a difference? Raise your hand if you heard the difference between the two versions. Excellent, I see that the hands are up. We did this with quite a bit of annotation, and it can basically be distilled into five principles for improving the intonation of spoken text.

The first principle is to let the user catch up, by adding pauses at strategically important points: at punctuation, and wherever else appropriate. Appropriate does not mean appropriate in the sense of English grammar; nobody is going to see what is spoken, so feel free to add a comma wherever you think a pause is necessary. Break up longer sentences into smaller ones, and insert explicit pauses with the silence command at major pause points. So in our example we added punctuation and we added pauses; all of this lets the user catch up.

The second principle is to let familiar things go into the background, by de-emphasizing repeated words. For instance, if the minutes are identical, you should de-emphasize the second instance. Also de-emphasize items that can be inferred from your overall application scenario: you know that you're booking a flight, so you don't have to emphasize that word.

The third principle is to liven it up, simply by adding an exclamation point at the end.

The fourth principle is to focus the user's attention by emphasizing the important words. This can be done with an emphasis command, or simply by inserting a colon before the most important item.

The fifth, and maybe the most important, is to use paragraph intonation. Group your sentences into intonational paragraphs: for the first sentence in a paragraph, raise the pitch range, and then reset it for the rest. This makes quite a bit of difference for longer texts that are read aloud. You raise the pitch base and increase the pitch modulation, decrease them again after the first sentence, and add an extra pause between paragraphs.
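The five principles above can be applied mechanically once you know the structure of your text. This is a hedged Python sketch that builds an annotated confirmation string using the `slnc`, `emph`, `pbas`, and `pmod` embedded commands, with relative `+`/`-` values as we understand that syntax. It is not the annotation used in the demo, and the lift of 4 and the 500 ms paragraph gap are assumed values to tune by ear.

```python
from typing import List

# Illustrative settings, not values from the demo: tune them by ear.
PITCH_LIFT = 4          # relative pitch-base / modulation change for sentence one
PARAGRAPH_GAP_MS = 500  # extra silence between paragraphs


def pause(ms: int) -> str:
    """Principle 1: an explicit silence, in milliseconds."""
    return f"[[slnc {ms}]]"


def deemph(word: str) -> str:
    """Principle 2: let a repeated or predictable word recede."""
    return f"[[emph -]] {word}"


def emph(word: str) -> str:
    """Principle 4: focus attention on the important word."""
    return f"[[emph +]] {word}"


def intonational_paragraph(sentences: List[str], lift: int = PITCH_LIFT) -> str:
    """Principle 5: raise pitch base and modulation for the first sentence,
    then undo the change for the rest of the paragraph."""
    if not sentences:
        return ""
    up = f"[[pbas +{lift}]][[pmod +{lift}]]"
    down = f"[[pbas -{lift}]][[pmod -{lift}]]"
    return " ".join([up + sentences[0] + down] + sentences[1:])


def annotate(paragraphs: List[List[str]], gap_ms: int = PARAGRAPH_GAP_MS) -> str:
    """Join paragraphs with an extra pause between them."""
    return f" {pause(gap_ms)} ".join(intonational_paragraph(p) for p in paragraphs)


confirmation = annotate([
    [
        # Commas add pauses; "flight" is predictable in a booking app.
        f"Your first flight is with Alaska Airlines, {deemph('flight')} 27 62.",
        f"Departing from San Jose, on Monday, May 24th, at {emph('6:10')} p.m.",
        "Landing in San Francisco, at 7:10 p.m.",
    ],
    ["Thank you for choosing TT Years Travel!"],   # Principle 3: liven it up
])
print(confirmation)
```

Generating the markup from structured data, rather than hand-editing strings, keeps the annotation consistent when the flight details change.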
So, to summarize: you should customize the pronunciation of the words you say. If you notice that a word hard-coded in your application gets mispronounced, use phonemes to get it pronounced correctly. And you should customize the intonation of the text that is said, which helps the user understand it a lot better and gives the user a much better overall speech experience. Now let's move on to speech recognition with my colleague Kevin Aitken, who is not lazy at all, I might add.

Can y'all hear me? Yes, I can hear myself. Well, my manager Kim has actually assured me that this is not necessarily a commentary on my work ethic, so I feel better. But if you're like me, once in a while you feel a little bit lazy, and at those times I just love having a simple solution that lets me be really productive. So in the next 15 minutes or so, I'm going to show you two easy methods you can use to add spoken commands to your application, so that with hopefully an afternoon's worth of work you can walk into your manager's office, or your co-worker's office, and say, "Oh, by the way, our Mac OS X application understands spoken commands." So let's get started.

As I mentioned, I'm going to provide two methods. The first method is to use the Speakable Items application that's built into Mac OS X, as Kim demoed at the beginning. It's designed for end users, so they can easily add spoken commands to any application, and you as a developer can use it too. It's great because you don't have to write any speech code: Speakable Items takes care of that for you by taking a list of items, building the language model, and then waiting for the recognition result. And as Sal showed in his demonstration, it understands how to execute AppleScript, so you can easily send Apple events to
your application, or even to other applications for that matter. The second method I'm going to describe is to use the Speech Recognition API. You may be familiar with it from Mac OS 9. It gives you a little more flexibility: you can have multiple command lists, so you can have one set of commands for when the user has selected an object and another set for when they haven't. I'm going to show you an example that gives you a really easy three-step approach to adding spoken commands to your application using the Speech Recognition API.

One thing both of these methods have in common is commands, so let's talk for a second about what makes a good command. Commands are like menu items, but we suggest that they normally be three to six words long. Longer is generally better, because the recognition system can recognize them more easily and they're more distinct from your other commands, but you don't want them so long that the user has a hard time speaking them fluently. You should also avoid single words, especially words like "hot", "cut", and "quit", because those are often misrecognized or sound alike to the recognition system. The other important item is that you should test your commands: test them together, to make sure they aren't confused with each other, and test them with the global commands that ship with Mac OS X. To prototype your commands, you can use Speakable Items or the SRLanguageModeler application, which you'll find on the developer CD.

Okay, let's talk about method one, using the Speakable Items application, in a little more depth. The first thing, as I mentioned, is that you want to create a number of items. You can easily do this by bringing your application to the foreground and speaking the command "Make this application speakable". That creates a folder for your application in the Speakable Items directory, inside the Application Speakable Items folder, as Kim showed you earlier, and then you can begin adding your items, as he showed you.
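The command guidelines above are easy to check automatically before you start testing by voice. This Python sketch (the `check_commands` name and the 0.8 similarity threshold are assumptions) flags single-word commands, commands outside the three-to-six-word range, and pairs whose spelling is very similar. Spelling similarity is only a crude stand-in for acoustic confusability, so this complements, rather than replaces, prototyping with Speakable Items or SRLanguageModeler.

```python
import difflib
from itertools import combinations
from typing import List


def check_commands(commands: List[str], min_words: int = 3, max_words: int = 6,
                   similarity: float = 0.8) -> List[str]:
    """Flag commands that break the length guideline or look confusable.

    Spelling similarity is only a crude, assumed proxy for acoustic
    confusability; it does not replace testing with the recognizer.
    """
    warnings: List[str] = []
    for cmd in commands:
        n = len(cmd.split())
        if n == 1:
            warnings.append(f"single word: {cmd!r}")
        elif not (min_words <= n <= max_words):
            warnings.append(f"{n} words (aim for {min_words}-{max_words}): {cmd!r}")
    for a, b in combinations(commands, 2):
        if difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= similarity:
            warnings.append(f"possibly confusable: {a!r} / {b!r}")
    return warnings


for w in check_commands(["quit", "play the next song", "play the next songs",
                         "show my newspaper"]):
    print(w)
```

Running a check like this whenever the command list changes catches the cheap mistakes early, leaving voice testing for the genuinely acoustic problems.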
begin adding your items, as he showed you. Once you have all your items together, the next step is to bundle them inside your application. I'll show you an example in a moment: you can use Project Builder to copy these files into your application bundle when you build it. Finally, you need to install those items. We strongly suggest that you install them at runtime, which gives you a couple of added benefits. It allows your application to be drag-and-drop installed, so you don't need a separate installer just to support Speakable Items. It also works well with Mac OS X's support for multiple users: if the administrator creates new users after your application has been installed, each new user simply gets those speakable items the next time they run your application, because you install them automatically. Let's talk about items for a second. Kim mentioned them briefly: an item is basically any file that can be opened, but there are really two types that work best for developers. The first is AppleScript files, which, as I mentioned, allow you to send Apple events to your application. The other, which Kim also mentioned, is the new XML-based command files, which allow you to send keyboard events to your application so that you can activate menus or controls through their keyboard shortcuts. One of the things I wanted to do in preparing for this year's WWDC was create an example that really showed how easy it is to add spoken commands to your application; I wanted to make it as simple as copy, paste, go. As you saw in Sal's demonstration, iTunes is a pretty good real-world example: it's shipping, and you can see how the spoken commands are integrated into the application. Since I can't ship or give you the source code for iTunes, I thought, well, let me create a clone of it. I've named mine Faux Tunes, courtesy of the French speaker in our group, and
it's up on the web right now; you can grab it at this URL. Hopefully you can grab it this week, or right now, or when you get back, and start taking a look at it. I believe it shows a really easy way of getting going. So let me show it to you. Oh, I need to switch; like everyone else, I forgot to switch. Okay, let me hide this real quick and show you the application, my clone of iTunes. It has identical menu items, and the window is pretty close. If you haven't done anything in Cocoa or Interface Builder yet: it basically took me 15 minutes to go through all the menus, lay out the window, and get a window that resizes automatically. It's really awesome. Okay, let me show you that it really is listening for commands. "Show commands window." "Show speech commands window." Okay, that's of course "Display visual." "Get song info." "Get song info." There we go. As you can see, it doesn't actually do anything; it just shows the command down at the bottom. So it really is Faux Tunes. Okay, let's switch into Project Builder and I'll show you how this is set up. Let me try the command "Switch to Project Builder." Yeah, okay, good. I'll move those out of the way so we can see the window, and I'll put this down, since I'll be giving it more commands. Let me show you what the basic object that manages the window looks like. It's really simple: it has a couple of instance variables, a method for each of the menu items, and a couple of extras to handle some of the controls in the window. All these methods do is display what's happening at the bottom of that window. As I mentioned, the first step was creating items, and we've created those. The next step is to include them in our application bundle. To do that, we go to the active target and use a Copy Files build phase, and we include the items down here. Let me show you where you do that. And here, if
you haven't seen it already, you choose New Build Phase and get a new Copy Files build phase. It's not highlighted at the moment because I haven't selected a particular item, but as you can see, I've included the items here. I have two sets of items. The command files are the majority of them, and I'm saying: place these in a folder named "Command Files" inside the Resources directory of the application bundle. Then I have a single AppleScript file that I've included as well. So Project Builder has made that easy: when I build, the items are copied inside the application bundle. The next task is to install them at runtime. We've tried to simplify this a lot by providing a single routine that you can call. Here it is: install speakable items for this application. You pass in the names of the folders in your Resources directory that you placed your items into, and then you call it. It's smart enough to create the folder, and if the folder is already there, it doesn't create it again. In this demo I actually call this routine every time the application starts up, but you could choose to call it lazily later, or in response to the user turning it on in the Preferences dialog or something like that. The rest of this file has a tutorial, documentation in more depth than I've covered here, about creating the items, adding them to your application bundle, calling this routine, and some special notes, so it's all there. Okay, let's go back to the slides for a second. Great. Let's touch on the second method for a minute: using the speech recognition API. Just like the previous example, I wanted to make it a copy, paste, and go solution, so we're providing some really simple routines that you can use. What I've done is break the recognition process, setting up for recognition and handling the recognition, into three easy steps, and you'll see in a
minute where I provide you with a single routine to execute each one of these steps. So let's talk a bit about what the recognition process looks like. This is a very simplified, graphical version of it, just to cement in your minds what this example is doing and what generally happens during recognition. In step one, we provide a routine that sets up all the recognition objects: it instantiates a recognition system, a recognizer object, and a language model object, and hooks them all together, so everything is set up and ready to go. That's something virtually every developer has to do when adopting the Apple speech recognition API. The second step is that you need to tell it what commands to listen for. In the routine we give you, you pass in the recognizer object and an array of commands; it gives the recognizer object those commands to display in the Speech Commands window. You also pass in the language model object, and it gives the commands to that, so the recognition engine knows what to listen for. For the third step, you need to implement one Apple event handler: the speech-done Apple event handler. Now your application is just sitting there, set up, running, and ready to handle the user's spoken command. When the user says something, in this case "play this song," the recognition engine passes it off to the recognizer object, which then sends an Apple event to your application. We provide a single routine: you pass in the Apple event you received, and it returns an ID. You can then map this ID through a switch statement, a table lookup, or however you like, to a particular action or routine. Okay, let's go back to Project Builder and I'll show you how this is done. By the way, this single project has two targets, one for building the Speakable Items-based version of Faux Tunes and one that uses the API, so it's all in here; you can build
either one and, to some extent, compare and contrast them. We've provided those three routines in this speech routines .c file and its header file. Let's see, here it is. You call "set up speech recognition," which accomplishes step one. Then you call "add commands" with a list of your commands. You can call "add commands" over and over, as you need to, to change the set of commands: for example, if you want to offer the user different commands depending on whether they've selected a particular object in your interface or are in a particular part of your application. Then, inside the Apple event handler you've created, you call this one routine, pass in the Apple event, and get back the ID that was associated with the original command you set up. Okay, let's look at the application delegate object where I do this. I use an application delegate that gets the applicationDidFinishLaunching: message after Cocoa has brought my app up and it's running. Just for this demonstration, I do it all up front, so that the code is all in one place. The way I approach it is to create a simple table that has each command name and the method to be called. I build that table programmatically at the bottom of this file, but you could use an XML file or do it differently; it's up to you. I register the speech-done Apple event handler by calling AEInstallEventHandler. Then I call the routine we provide in that utility file to set up speech recognition. I create an array of the commands, because I need to pass that to the "add commands" routine we provide. Finally, I call SRStartListening, which is part of the API, and now the application is up and running. If I page down, here's the Apple event handler, just a couple of lines of code. I call the routine we provide in the utility file to convert the Apple event into the ID, and then the
wonder of Objective-C is that I just use it as an index into a table and then go off to the appropriate method selector. So that's pretty much it. I really urge you to go out and grab this and see how it can be applied to your application. Let me summarize real quick. The first method, using the Speakable Items application, is really easy because you don't have to write any additional speech code: all you need to do is include the items inside your application bundle and then install them at runtime with the routine we provide. The second method, using the speech recognition API, is, as I explained, an easy three-step process, and we give you a single routine to execute each of those steps. So I've discussed the lazy way to do it. There's more you can do with the speech recognition API, and Matthias is going to come up here and talk about what to do if you're feeling a little more ambitious. Thank you.

Thank you, Kevin. My manager assures me that "overachiever" does not necessarily apply to my performance either. Kevin has shown you how to get 95% of the benefits with 5% of the work. However, there are some situations where you might need the extra 5%. One example of this is Chess. You've seen it demoed; it ships with Mac OS X as of this week, and you can get the source code from this URL. There you will find the speech-related code in ChessListener, and Chess illustrates important lessons in language model design. Now, you might think the language model for chess is not very complex, right? "d2 to d4": all simple sentences. The problem is that if you just do this as a list of possible moves, it gets out of hand pretty quickly. If you do the math, you find that a model with all the possible moves ends up with more than 24,000 moves, and clearly this is unacceptable. It doesn't help accuracy, plus you're not doing the user any favors by listening for stuff like "rook a1 to h8," which won't do him any
good at all. In fact, it turns out that in each chess position only 20 to 30 moves are actually legal, so there is no reason whatsoever to include the extra moves. Performance is going to go way up, and user satisfaction is going to go up, if you include only legal moves. However, you shouldn't over-constrain your model. There are some illegal moves that are still plausible; for instance, people frequently put their king into check accidentally, even experienced chess players. So you would want to leave a move like that in, so that you can say, "I heard you, but I won't do it." Another technique we use in Chess is prefabricated parts. There are not that many words actually used in this language model, so we prefabricate them at startup by calling SRNewWord to get these word objects. Then, when we come to a position and see that, for instance, "pawn d2 to d4" would be appropriate, we simply grab these prefabricated objects and paste them together to form the command. So, to summarize: for complex language models, you will want to constrain your language model to only those commands that are plausible in each situation, and consequently adapt the language model when the situation changes. Furthermore, in very complex situations, you might consider using prefabricated language objects to quickly build your list of commands. To build these language models, we've included a tool called SR Language Modeler, which helps you quickly experiment with different language models and see how well they work for your users. SR Language Modeler allows live microphone tests for rapid turnaround, if you want to try something out or grab somebody into your office to have them try something. If you want to do systematic, scientific tests, you can record your users saying these commands into AIFF files and feed those files into SR Language Modeler to get a systematic evaluation of how well the model performs. This tool and all of our sample
code, which we shipped with Mac OS X, you will find on the developer CD in the Examples/Speech folder, and we encourage you to start there if you want to do anything with speech. Let me now turn our session back over to our fearless leader, Kim Silverman.

For some values of fearless. So, to summarize: speech synthesis and speech recognition are there. We've given you a conceptual overview of the APIs, I've tried to give you some ideas about why you would want to use them, and Matthias and Kevin have followed up with how to use them well. So I want to encourage you all to speech-enable your apps. At this stage I'd like to single out a couple of developers who've been doing this. You might remember Thinking Home, which got the Apple Design Award last year. They've ported their application to Mac OS X and added speech to it, and they've found that it adds a lot of value for their users. I was talking to one of their developers on the phone yesterday, and he said they're getting a lot of feedback from users who think it's great that they can walk into a room and say things like "Dim the lights in the living room" or "Turn the upstairs thermostat to cool." The folks working on OmniWeb, their cool browser on Mac OS X that you may have seen, have also been experimenting with integrating speech into the browsing experience, for those who don't want to or can't use the keyboard and the mouse. We saw a prototype of that this morning, and it's looking really good; they've got some great ideas about how to do it. Now I'm going to give you guidelines for good, better, and best ways to put speech into your apps. Good is the easy way: use speech recognition to let people speak the names of the visible controls on the screen, the things they would normally manipulate, and use speech synthesis to speak simple alerts and alert panels when they come up. By the way, you can do these with either the Speakable Items framework
or by calling the API directly. If you want to go better, use delegation. I've mentioned this a few times, and you've probably inferred what I mean. Normally when we interact with a computer, we explicitly specify each step we want the computer to take in order to reach a goal we have in mind. With delegation, we delegate the goal to the computer and have it figure out the steps to get there and then execute them for us. So group what would otherwise be multiple interactive actions into one spoken command. For speech synthesis, start customizing your text using the guidelines Matthias went through, and read back information to your users. And if you want to be best, move to interactive spoken dialogues like the one you saw Sal demonstrating, where you delegate a goal to the computer, or to your agent, and it comes back and asks you questions to refine that goal. Also think about using speech for form filling. So that's it. Thanks a lot for coming.
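Kevin's rule of thumb that a spoken command should normally be three to six words long, and never a single word, is easy to check mechanically before you start trying commands out by voice. What follows is a minimal C sketch of such a check. The function names are hypothetical and are not part of the speech recognition API or the Faux Tunes sample, and counting words is only a first filter: it is no substitute for testing the commands together, and against the global commands, with the recognizer itself.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: count the whitespace-separated words in a phrase. */
size_t command_word_count(const char *phrase)
{
    size_t words = 0;
    bool inWord = false;

    assert(phrase != NULL);
    for (const char *p = phrase; *p != '\0'; ++p) {
        if (isspace((unsigned char)*p)) {
            inWord = false;           /* a gap ends the current word */
        } else if (!inWord) {
            inWord = true;            /* first character of a new word */
            ++words;
        }
    }
    return words;
}

/* Hypothetical helper: apply the three-to-six-word rule of thumb,
   which also rejects single-word commands such as "quit". */
bool command_length_ok(const char *phrase)
{
    size_t n = command_word_count(phrase);
    return n >= 3 && n <= 6;
}
```

For example, "play this song" passes, while "quit" and an eight-word phrase are both rejected.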
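Because the sample's "add commands" routine can be called over and over, an application can keep one command list per interface context and pass in whichever list matches the current state, for example one set when an object is selected and another when nothing is. The C sketch below only chooses the list. The `SpokenCommand` type, the phrases, and the IDs are hypothetical, and the call that loads the list into the recognizer is left out because the talk doesn't show that routine's exact signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command record: the phrase to listen for plus the ID
   that should come back when it is recognized. */
typedef struct {
    const char *text;
    long        id;
} SpokenCommand;

/* Commands offered when nothing is selected. */
static const SpokenCommand kNoSelectionCommands[] = {
    { "play this song",     1 },
    { "show the song list", 4 },
};

/* Commands offered while an object is selected. */
static const SpokenCommand kSelectionCommands[] = {
    { "play this song",             1 },
    { "get info for the selection", 3 },
    { "delete the selected song",   5 },
};

/* Return the command list for the current context and store its length
   in *count. Call again whenever the selection changes. */
const SpokenCommand *active_commands(bool hasSelection, size_t *count)
{
    assert(count != NULL);
    if (hasSelection) {
        *count = sizeof kSelectionCommands / sizeof kSelectionCommands[0];
        return kSelectionCommands;
    }
    *count = sizeof kNoSelectionCommands / sizeof kNoSelectionCommands[0];
    return kNoSelectionCommands;
}
```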
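The talk doesn't spell out the arithmetic behind "more than 24,000 moves" for Chess. One count that lands just above that figure, assuming the naive model pairs each of the six piece names with every ordered pair of distinct squares (64 starting squares, 63 destinations), is:

```latex
N_{\text{naive}}
  = \underbrace{6}_{\text{piece names}}
  \times \underbrace{64}_{\text{from-squares}}
  \times \underbrace{63}_{\text{to-squares}}
  = 24{,}192 > 24{,}000
```

Against the 20 to 30 legal moves in a typical position, that is roughly 800 to 1,200 times as many phrases for the recognizer to tell apart.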
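Matthias's point about constraining the chess language model without over-constraining it can be sketched as a filter over candidate moves: keep the legal moves, keep moves whose only flaw is leaving the player's own king in check, and drop everything else. This is a minimal C sketch, not the code from the Chess sample; the three-way `MoveStatus` classification is a hypothetical simplification assumed to come from a chess engine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification a chess engine could provide per move. */
typedef enum {
    kMoveLegal,          /* may be played in this position */
    kMoveExposesKing,    /* piece can move so, but own king is left in check */
    kMoveOtherIllegal    /* anything else, e.g. "rook a1 to h8" */
} MoveStatus;

typedef struct {
    int        from, to;   /* square indices 0..63 */
    MoveStatus status;
} CandidateMove;

/* Copy the moves worth listening for (legal moves plus the plausible
   king-in-check mistakes) into `out`, which must have room for `n`
   entries. Returns how many were copied. */
size_t select_moves_to_listen_for(const CandidateMove *candidates, size_t n,
                                  CandidateMove *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; ++i) {
        if (candidates[i].status != kMoveOtherIllegal)
            out[kept++] = candidates[i];
    }
    return kept;
}

/* Reaction once one of those moves has been recognized. */
const char *respond_to_move(const CandidateMove *m)
{
    assert(m != NULL && m->status != kMoveOtherIllegal);
    return m->status == kMoveLegal ? "Moving."
                                   : "I heard you, but I won't do it.";
}
```

Keeping the king-in-check moves in the model costs a few extra phrases per position but lets the program answer a plausible mistake instead of misrecognizing it as some other move.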
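The prefabricated-parts technique from Chess amounts to creating each word object once and then assembling phrases from references to those cached objects. In the Chess sample the word objects come from `SRNewWord`; the C sketch below substitutes a hypothetical `Word` struct that holds only the text, so that it runs anywhere. The square numbering (a1 = 0 through h8 = 63, rank by rank) is also an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a prefabricated word object. */
typedef struct {
    const char *text;
} Word;

enum { kPieceCount = 6, kSquareCount = 64 };

static const char *const kPieceNames[kPieceCount] = {
    "pawn", "knight", "bishop", "rook", "queen", "king"
};

/* Built once at startup and reused for every position. */
static Word gPieceWords[kPieceCount];
static Word gSquareWords[kSquareCount];
static char gSquareText[kSquareCount][3];
static Word gToWord = { "to" };

void prefabricate_words(void)
{
    for (int p = 0; p < kPieceCount; ++p)
        gPieceWords[p].text = kPieceNames[p];
    for (int sq = 0; sq < kSquareCount; ++sq) {
        gSquareText[sq][0] = (char)('a' + sq % 8);   /* file */
        gSquareText[sq][1] = (char)('1' + sq / 8);   /* rank */
        gSquareText[sq][2] = '\0';
        gSquareWords[sq].text = gSquareText[sq];
    }
}

/* Assemble "<piece> <from> to <dest>" from the cached words. No new
   word objects are created here; only pointers are copied into `out`.
   Returns the number of words in the phrase. */
size_t build_move_phrase(int piece, int from, int to, const Word *out[4])
{
    assert(piece >= 0 && piece < kPieceCount);
    assert(from >= 0 && from < kSquareCount && to >= 0 && to < kSquareCount);
    out[0] = &gPieceWords[piece];
    out[1] = &gSquareWords[from];
    out[2] = &gToWord;
    out[3] = &gSquareWords[to];
    return 4;
}
```

With this numbering, "pawn d2 to d4" is `build_move_phrase(0, 11, 27, phrase)`, and every phrase built this way shares the same 71 word objects.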