WWDC2003 Session 402

Transcript

So we're going to talk about the OS X speech technology today. To introduce myself to you: I am principal research scientist and manager of the spoken language technologies. Spoken language technologies means the speech technologies that you've heard of, the speech recognition and the speech synthesis, and also we have language technologies, because we believe you can't deal with speech without dealing with language. So, for example, the junk mail filter is one of our spoken language technologies, and if you heard the State of the Union address about OS X, the Japanese input method is now using our speech technologies, and that's why they're doing so much better than Windows at the moment.

So before we start I figured I'd go straight into a demo. I'm going to show you some of the ways speech is usable in Panther, and you guys being developers, I'm allowed to give you the caveat: this is beta software, right? I've been through this demo lots of times and it always works, but because speech sits on top of every other component of the operating system, it means that if anything goes wrong, something could break here. So I'd ask you to bear with me.

The first thing you should do when turning on speech recognition is let the machine adjust to the acoustic environment in which it's being used. We have a speaker-independent recognizer. I'll talk about what that means a bit more later, but even though it's independent of who you are, it does need to sample and adapt to the acoustic environment. We have made it track the acoustic characteristics of most places where people would use it, but this room is outside of the parameters for which we've developed it: the distance between these walls, combined with the positions of the PA system speakers, means the durations of the echoes are a little bit outside of the spectral range that we're looking at. So I'm going to need to adapt it. What you do any time you use speech recognition in a new place is go to
the Speech preferences, to the Speech Recognition tab, Listening, and click on Volume. This ostensibly lets you set the volume on the microphone, and it also gives you a chance to make sure that you've got the right microphone connected. Our speech recognition works with a far-field microphone; that means a desktop microphone that's about this far from you. In this particular situation, because of these echoes, I'm going to use the alternative, which is a headset microphone, which I'll put on now. You can purchase these pretty cheaply from the Apple stores; there are a few brands. This particular one is by ZXI. Your task is just to read down this list of commands. As each command is recognized it will flash. If it doesn't flash, you just repeat it until it does; if it doesn't after one or two repetitions, go on. So I'll do that now. It's actually sampling my voice in this environment while I'm talking right now.

What time is it? Quit this application. Quit this... there we go. Quit this application. Open a document. Show me what to say. Make this page speakable. Move page down. Move page down. Hide this application. Switch to Finder. I'll go through that again. What time is it? Quit this application. Open a document. Show me what to say. Make this page speakable. Move page down. Hide this application. Switch to Finder. Good.

So let's try it. What time is it? The audio is not plugged in, just a moment. Let's plug the audio in, guys. This might go bang. I'm still here even though you can't see my face. OK, the audio is plugged in; let's try that again. What time is it? What day is it? Quit this application. Open my browser. Close this window. Go to Google News. Go to Google News. You can set up any web page to be speakable. Quit this application. Get my mail. So suppose I've got a message here and I see some text that I'd like to send to somebody else. I can do that by speech, and we do that by integrating with the Address Book to find out the email address. The whole Address Book is speakable, so I can say, for example: Send this to
speech dude. Send. And this should complain. Oh, I love that sound. OK. Hide this application. Switch to Chess. Whoo, he's the man down there, Mattias. Pawn d2 to d4. Pawn d2 to d4. Knight b1 to c3. You see these ghost pieces wandering around; it's great. Quit this application. Phone for Tom Bonora. Thanks.

OK, so for those of you who know about speech, there's some stuff that you've seen before and there's some stuff that's new there. Can I ask how many people have not sat down and used Speakable Items on OS X before? OK. So I'll walk through quickly what we've done here. What we've seen is a bunch of different things. We have a speech recognition engine, which has a robust API. You can call that API from your applications; many of you do already, and more are all the time. In addition, we ship a couple of applications that also use that API. One of those is called Speakable Items, and that's largely what I've been showing you now. You turn that on from System Preferences, from the Speech preferences, Speech Recognition, in the On/Off pane.

Speakable Items is a very simple idea that's been developing over the years, and it's turning out to be quite powerful. When you turn it on, we create a folder in your home directory called the Speakable Items folder. Open the Speakable Items folder. Anything that's in that folder can be launched by speaking it; it's just the same as double-clicking on it. Applications, documents, templates, aliases, stationery, URLs: anything that you can launch in the Finder by double-clicking you can launch by speech, just by putting it into that folder. Most of the commands that I used just now are part of a kind of starter kit that we ship; we pre-populate that folder with a few items that do generally useful things, and that's what I've been using. So, for example, there's "What day is it". What day is it? The real power of this is that you can add your own items to that folder and make your own things speakable, so you can customize speech recognition according
to the kinds of things that you do and how you work.

Within that folder, the Speakable Items folder, there is itself another folder called Application Speakable Items. That folder contains folders named by applications. The items in those folders are only speakable when that application is in the foreground. That provides a framework where you developers can ship commands that are specific to your applications, and you don't have to worry about accidentally using the same wording as somebody else with a different application, because if you put them into that framework, that folder, then they'll only be speakable when your application is in the foreground. And in fact you don't need to install them into this folder; you can put them in your own application bundle and Speakable Items will find them. We have documentation on how to do that.

So the kinds of things you can put in here are scripts that send AppleScript commands, Apple events, to your application, or keyboard shortcuts. What do I mean by that? What I mean is that anything that is speakable, sorry, any menu item with a keyboard shortcut can have a spoken command associated with it. Let me give you an example of that. First we'll go back to the web. Open my browser. What have we got here? I'm looking for a picture. Of course the web is slow. One of the accessibility features is that you can zoom in on the screen, and that's a keyboard shortcut, so I have associated spoken commands with that. So now I can say: Zoom in. Zoom out a bit. Right. So anything that has a spoken shortcut, oh sorry, keyboard shortcut, you can attach speech to, and that's one easy way that you can have speech control of your application without doing a lot of extra work.

Speech is really important for disability solutions. Let me just give you a story. I just heard this as I was setting up here, I kid you not, this is absolutely true, just as I was setting up here about half an hour ago. I think his name's Bill. Bill was
a guy from the projection booth behind us. He came out and said, "Look, I've just got to tell you, excuse me," interrupting me while I was setting up. He said he saw me give a demo of this stuff a couple of years ago at Macworld, and so he showed it to a blind friend. He turned on his friend's machine and said "Open my browser", and the browser opened, and they got his mail, and they got it to read his mail out to him. He said his friend was just blown away. Well, it turns out that this guy teaches an exercise class full of blind people, and so they were doing their exercise class and they had an iMac over on the side. They go over to the iMac after exercise class and surf the web, a bunch of blind people, and they do this every Monday: have their exercise and then go and surf the web by voice, using just the things that you're seeing here. I was touched, really. I wanted to share that with you because it means that you can make your applications available to folks with disabilities through technologies like this.

The Section 508 rule, as I understand it, of the American Disabilities Act is that everything that can be done with your application must be possible to do without requiring the keyboard or the mouse, or that you see the screen. And we have an accessibility API that lets you get at screen controls and provide alternative ways of controlling those things. We ship one method already built in, and that's speech. So with very little effort you can have stories like that circulating about your applications too.

Within the Speakable Items framework you can choose what kinds of commands you can give. This is controlled by the Commands tab here, and the reason I want to tell you about this is because of this particular guy, which is off by default, which is Front Window commands. This lets you speak any of the controls in the frontmost window of whatever application is in the foreground. So with this on I can navigate the preferences here. For example: Speech Recognition. Well, we're already there, so that doesn't show much. Default Voice.
Spoken User Interface. Speak the phrase. So I'm going down the checkboxes here. Speak the alert text. Speech Recognition. We did not build this specially into the Speech preferences; this is just the general accessibility features that use speech. So as long as you use standard Apple controls, you get that for free.

In addition, you can have the computer read out text that appears on the screen, and there are a few different ways of doing this. One that I just turned on is Talking Alerts. The philosophy we have here is this: when you're interacting with a computer, sometimes the computer needs to tell you things, and the standard way it will do that is by putting up a sheet or an alert dialog in front of you. You should be able to read that, think about it, and respond to it. But sometimes your attention is elsewhere. I'm working on my machine, I hit quit, I turn around and have a conversation with somebody, and I don't realize that there's an alert saying that I need to save my changes before quitting. Perhaps I go away to lunch and come back and find that my work wasn't saved and somebody couldn't get at it. So what happens is, if the alert has been up for a certain amount of time and you haven't responded to it, then we read it out to you in order to get your attention back.

So let me demonstrate that. Switch to TextEdit. So I type something here and then try to quit from it, or close it. Oh, first, before I do: we ship it with a delay of about 20 seconds by default between the alert appearing and its being spoken. I'll put that back to zero for now so that you guys don't have to wait around so long. And then I'll quit, or close this document. So that's Talking Alerts. People tell us that they love it. The kinds of scenarios I hear in feedback from users: some guy said to me he was crawling around on his hands and knees under his desk and accidentally kicked out the Ethernet cable, and he didn't know that he had done that, and then he heard this voice
come from his computer saying "The network has been disconnected", and he turned around and thought, oh yes, indeed it has.

I was giving a keynote at an international conference in Berlin. I arrived the night before, and actually I wrote my talk on the flight, landing in Heathrow on the way over there. When I arrived I thought I'd better get my presentation printed out on transparencies, just in case my computer wouldn't plug into their projection system. So I went to a print shop that had maybe two dozen different kinds of printers. They had big printers, little slide printers, high-quality stuff, kind of like Kinko's on steroids. We started printing these things off, and while we were doing that I demoed this to them. They were delighted. They had two dozen Power Macs, and they put Talking Alerts onto each of their Power Macs, because their typical business model is that somebody arrives at about nine o'clock in the morning with a CD containing a large image that needs to be printed out on a large poster, camera-ready, and it will take about two hours to print out. So they put it onto an appropriate machine and start it printing, and then at about eleven o'clock they go to make sure it has finished OK, and there's a message on the screen that came up five minutes into the printing saying that there was some problem with the cyan ink. And so they lose the business because they don't have it done in time. They found that this was great: with 24 machines going all the time, there are voices coming up from all over the place saying "The printer is out of paper", "The network is down", and this saves them a lot of time.

But there was a problem: they all had the same voice. So a voice would come out of some machine saying "The printer is out of paper", and who said it? Who did that? You can set the default voice, so they put a different voice onto each machine. And you can set what phrase is spoken before the alert is read out; by default we just choose by stepping through a small list we have here. And they,
so they made each machine announce itself by name. So we think Talking Alerts is useful. I'll put the delay back so it doesn't keep talking all the time here.

Another disability feature that we have is the ability to speak any text under the mouse. This again uses the accessibility API. So if I turn this on and slide the mouse around... yep. So if I slide the mouse around, you hear text being read out. I'm not clicking here. OK, so if you use standard controls you'll get that for free as well.

OK, so let's go back and talk a bit more about what we've actually been seeing here. We'll get back to the main machine now. I want to talk a little bit about why you should adopt speech: things to think about, the reasons for putting it in there rather than it just being a novelty. First of all, speech gives you a way to get beyond the limits of a graphical user interface. Graphical user interfaces are mature, and they present well the information that they can present, but there are limits to what they can do. Screen real estate is at a premium, and so we have all sorts of technology to try to squeeze a little bit more power out of our screen real estate. But no matter how big my screen, no matter how many monitors I have in front of me, there are always things that I need to get to or see that are behind other things; I can't see them. And of course there's always the issue of what happens when the user is not attending: no matter how cool the graphics are, the user just might be staring out the window. Speech gives you an extra modality, to give your users more choices about ways of interacting with a computer.

It's more natural. That is, we've all been speaking and listening since we were about two years old; it's something that comes to us without too much thinking, whereas working through a typical computer user interface requires some training. So if you put speech into your applications as an alternative control modality, you'll find that some users who are new to computing will
be more likely to try out your application. It's particularly appropriate in an eyes-busy, hands-busy scenario. So think about your application; think whether there's any time where the user is looking at something on the screen and their hands are busy, for example they're drawing something, and they need to give a command to the computer. For example, I'm drawing a line and I want to change the brush size or increase the amount of blur. Normally with our graphical user interfaces I have to stop drawing, go up to a menu, pull up a dialog, set some settings, click out of that, and then return to drawing. So speech is good when the hands or the eyes need to keep busy with what they're doing. Think about that in your application.

And finally, speech gives us a way to move out of the 1980s. Back in the 1980s computers had a little weak speaker soldered onto the motherboard, and all it could do was go beep. And so we got into the habit of writing our programs with a beep written into them: whenever we needed to get the user's attention, we put up an alert and go beep. Well, I'd like to think that life has moved forward somewhat since then. Of course, now instead of just going beep we play lots of different sounds, but the burden is still on the user to understand what all those sounds mean. So, for example, if we want to let the user know that mail has been sent, we play one sound; if we want to let the user know that somebody has logged onto iChat, we play a different sound. And it seems to me we should be able to do better than that. The application developer who is playing a sound knows the meaning, knows what information it is trying to convey to the user, so why not just say it?

And think about the mouse. The mouse is essentially the equivalent of doing this; it's such a narrow, I guess one- or two-bit, interface. We should be able to do better than that. If I want to do something with a computer, rather than just poking and grunting I should be able to say what I want to do. So
we have a couple of engines that I've mentioned and shown you already. The speech recognition is speaker independent; that means you don't have to train it to your voice. There are speaker-dependent speech recognizers around, and they've got different characteristics. One of those characteristics is that you need to spend at least four hours of speaking to train it to your voice, which takes more than four hours because you get tired at the end of it, and then at the end of that four hours you still need to use them for a month or two before they finally adapt to your particular voice. We think that the kinds of users that buy Macintoshes expect to just walk up to it and have it work, so we made it speaker independent.

It works with a far-field microphone; we have layers of software that are tracking, adapting to, and compensating for the background acoustics and the microphone characteristics. You can also use it with a head-mounted microphone, as you just saw me do over there. It's robust against background noise. I use it at Apple in the cafeteria at lunchtime, and to my delight it works. The secret there is that the kind of noise that's easiest to compensate for is noise that's steady state. So in the cafeteria, when there are hundreds of people talking, the overall spectrum tends to be fairly constant. A situation that we have not solved is if I'm in front of a computer trying to talk to it and right next to me there's somebody else talking, because then there are two voices at once and the spectrum of the distracting voice is changing all the time. So we don't claim to have solved that one yet.

It's a large-vocabulary speech recognizer: we have over 121,000 words in the dictionary, and we have layers of software to figure out how to pronounce words that aren't in the dictionary. And it's a continuous speech recognizer: you don't have to pause between words, which is a great relief. It's driven by a finite-state grammar; that's how your application tells the recognizer what to listen
for.

Why should you use speech recognition? Well, as I mentioned, speech is a very natural way of controlling a computer. It gets you beyond the limits of point-and-click, because you can't click on what you can't see to point to. And conversation is a particularly appropriate modality for delegating goals to a computer: you can tell a computer what you want to do; if you haven't specified enough, it can then come back and ask you questions to refine the nature of the goal, and can then do what it's good at, which is figuring out the steps necessary along the way to get there. And of course speech recognition is right for accessibility, the latest story being that one that's only 30 minutes old.

OK, we have speech synthesis in there. It takes any text and converts it into American English speech. I have to say that because I'm getting requests all the time for other varieties of English and other languages. There's a range of different voices. You can control the speaking rate, and that is important, because there is no correct answer, or no single answer, to the question of what's the appropriate speaking rate for a speech synthesizer. The rate at which a speech synthesizer speaks should depend on why it is speaking; we'll talk more about that later. I do want to let you know that we are working steadily all the time on improving the quality and the naturalness of the speech synthesis. We did a lot going from Puma to Jaguar, and we got good feedback from folks who listened to it and said, "Oh wow, that's a lot better now", and we're still doing a lot more work.

So when should you use speech synthesis? There's a bunch of different areas. I won't go through all of these now, but one thing that I think is useful is this: when something happens inside the computer that's outside of the user's control, or not directly relevant to the current task at hand, then speech is an appropriate modality for letting them know. For example: "You have new mail from your boss", or "Your compile failed". Another area is proofreading. You know,
creation of documents used to be an art form, and people would spend a lot of time crafting them, but the world's got too busy for that; we don't have the time. So we have tools like spell checkers and grammar checkers. Well, grammar checkers don't do very well: often they don't catch awkward constructs, and the constructs that they do catch, we don't always agree that they're incorrect. And spell checkers can only find a word that is not in the dictionary. The psychologists will tell us, and there's good evidence on this, that when we make a typing mistake we are much more likely to transpose letters if it creates another real word, and spelling checkers can never catch that. But if you have the text read out to you, you immediately spot it; it just becomes so painfully obvious.

People have asked me to talk a little bit about why you should use speech synthesis versus recorded speech. There are a few reasons. If you only have a small number of things that you need to say to your user, then I say go ahead and record them; get your voice talent. But sometimes recording is impractical. For example, if you have a huge amount to read out or to say to your users, then it takes ages to record it and it takes a huge amount of storage. The average CD is what, about 640 megabytes, and usually about two-thirds of that is media content. So if you can reduce the audio by a factor of, well, typically about 80 by going from audio recordings down to text, then you have much more space for real content on your titles, on your CDs.

You also get a consistent voice. If you record a voice talent and then later on you bring them back to record some more, or even the next week, from one day to another their voices tend to be inconsistent: they're speaking louder one day, a bit more relaxed the next day, and then in the user interaction the voice sounds like it's going up and down. With speech synthesis you get a consistent voice. You can save costs
because you don't have to hire a voice talent; you don't have to hire a recording studio. It's flexible. I don't know whether this has happened to you; it's happened to me a lot. You're working on an application, you're about to ship it, and just as you're about to go GM somebody says, "Oh, we have to change some of the strings." So you have to call up the voice talent to get them back into the studio to record something different, but no, they're on vacation in Brazil now, or they've got a sore throat, and it's a real pain. With speech synthesis, you just type in the new strings and you're done.

Another important reason for using speech synthesis is if the things that you're saying to your users are longer than a single short sentence; then you need to control the intonation to make sure they're spoken in a way that people can track the meaning across the longer sentences, and you can't do that if you're piecing together real sentences that were recorded at different times and you just concatenate them together. And you get lip synchronization for free.

All right, at this stage I want to invite up Jack Minsky, who is the president of Software MacKiev. Jack's company has produced World Book, which, as you may have seen, is this wonderful application. I think it's about the best OS X UI on any application I've seen. He'll show it to you; it's gorgeous. These guys have been using speech, and he's going to tell you about it.

Yes, good morning. We had a pretty simple goal in mind at the creative labs of Software MacKiev when we set out to build World Book Speech Edition, and that's that we wanted visually impaired users, or even blind users, to be able to use the World Book, to be able to search all 22 volumes, 18,000 articles, on their own without assistance. And that meant we really had to be kind of creative: not just have text be able to be read by passing your cursor over it or highlighting something, but build in the kind of interaction that would allow a user really to be
able to do this on their own. I'd like to show it to you. The first thing I'm going to do is, as Kim did, adjust the levels. There you go. I'm going to let this Mac adjust to my voice in this room. What time is it? Quit this application. Open a document. Open a document. Show me what to say. Make this page speakable. Move page down. Hide this application. Switch to Finder. So that's done.

Then, just to get started, we wanted users even to be able to launch this from the Finder, and we were thinking "launch", "start", and we decided we needed something even friendlier, so we chose "Hello World Book" as our starting command. Let's try and see if that works. Hello World Book. And immediately they get the feedback of the music and World Book starting up, so a blind person already knows they're in. Let this go by just for a second. [Music] So you can see there are all kinds of sounds and things built in there, so even someone who can't actually see the screen can hear some of the things going on.

So the next step was to get them to be able to search through the encyclopedia for a particular article that they're looking for. Here you're going to hear me say "Search please"; a window will open, and blind users can touch-type things in. So let's see that work. Search please. So now I have a window opening. If you heard that, it said "Ready to search", letting the user again have feedback to know that the thing is working. I'm going to type in a simple word here, "horse", picking "horse" in particular from a whole bunch of articles. What we've done is to embed sounds at the top of the articles, again aural feedback, so they know when they reach that article: they'll actually hear the animal noise or whatever's going on. You'll also hear more feedback when I hit return, because we're dealing with a blind user who might not be able to type in the right word successfully. We wanted to give them feedback, so it will actually say "Searching for horse", and then at the end, if
the horse article is found, it will say "Search complete". So let's try that. And when the horse stops running, they can now simply ask the computer to read to them. So I'd say: Read to me. So in this way, assuming they have typed correctly, they can get to any article that they can think of the name for.

Now of course we thought, people aren't necessarily going to be able to type it in correctly; I know I mistype all the time, and I can see just fine. So what we did was to build in a catch for that. I'm going to type in and misspell "Apple Computer" here, and you're going to see on this one that it's going to come up with a series of suggested alternative words. And we went even beyond that: as you'll see from this example, it will read through the instructions first, once, and then go through the list one by one, pronouncing the alternatives that the user might have meant to type in the first place. It will then pause briefly at the end of the list, assume that the user didn't hear what they wanted, or maybe they weren't sure, and start the list again, but without that long introduction explaining what they need to do; it simply repeats the words again. So let's try that. And there I go, and I've got my Apple Computer articles. So the user can do that quite nicely.

We also built in a lot of other speech technologies to try and go to the maximum of what Mac OS X has to offer. Kim showed the speech under the mouse, so I won't show that, but we have custom controls in some places here. It comes for free if you simply enable that for all the dialogs with normal tabs and so forth, but if you build custom controls you can go the extra step of making sure those will also work with text under the mouse. And then we've done one more thing. I'm going to pull up another page here, it's already set up, and that's to take a bunch of the abbreviations that are very common in encyclopedias, which won't mean anything to a blind person. For example, "pop." for population is a very common thing in an article about cities.
So what we've done here is made it so that it will not read as "pop" here but read out the word "population", and I'll just show you that. Let's try that again... something like this instead. So all of those things have been, for us, the way that someone who's unsighted could navigate the encyclopedia and really use it on their own, without assistance, without someone standing over their shoulder.

We've gotten a lot of recognition for this. This is the first encyclopedia, and the only one, that's fully ADA compliant with Section 508, and that's resulted in a number of magazine articles written in the education space about this application. Also, just three weeks ago we had the great honor that the American Association of Education Publishers voted this one the best children's software of the past year. That's the first time that a Mac-only application has ever been nominated for this. This is, you know, Windows, cross-platform, everything competing for the prize, but a Macintosh-only product won that category. And also, another great reason to do this: Apple has put this application on every eMac, iMac and iBook they sell. And probably the best thing of all is that we know at Software MacKiev, because of the work we did in implementing the speech technologies, which were already set up for us with all the things that are built into Mac OS X, that there are literally tens of thousands of visually impaired users and even blind people out there who now have a whole new world opened up to them, to be able to explore independently the World Book encyclopedia, and we feel really great about that. Thank you.

You can purchase World Book on the Apple Store online or in the retail stores; check it out. All right, so now it's your turn. We want to talk to you a little bit about what you can do to incorporate speech in your applications, and we'll start with customizing speech synthesis. What I mean by this is that when we send text to a speech synthesizer, the speech synthesizer looks
at each sentence scratches its head and says hmm how should I speak this the answer is the way a sentence is spoken depends on why it's being spoken and what it is intended to convey to the user the problem is difficult in the general case but you guys have an advantage your application knows a lot more about how things should be spoken than the text-to-speech engine does for example Jack's application knew that pop within brackets followed by digits should not be spoken as pop but should be expanded to population the speech synthesizer could never figure that out by itself so there are three things that you can do one is filter the text the way the MacKiev guys did another example would be stock quote abbreviations then you can customize the pronunciations and you can customize the intonation let's talk about that in a little bit more detail to customize the pronunciation you're dealing with a problem where the way the synthesizer pronounces the word is not the way that you want it pronounced this is most often a problem with names or invented names of characters if you have a fantasy game I'm sure you've got some character names in there that are written to be difficult to pronounce some developers send special strings to the synthesizer that just use funny spellings we don't recommend that because the way we pronounce an unorthodox spelling might change from version to version instead we recommend that you use what we call phoneme input which looks obscure but is actually very quick to learn and is a totally precise unambiguous way to specify how words ought to be pronounced you can embed phonemes like this into the text or you can load a custom dictionary to the synthesizer that has these mappings already in it then you should customize the intonation the intonation is the pitch and the timing that we use when we speak it's not what we say it's the way that we say it and the problem is that once you've synthesized the words so that they are clear you've
not synthesized enough consider the sentence John only introduced Mary to Bill now if I say it like that it means he didn't introduce Mary to anybody else John only introduced Mary to Bill but suppose I say John only introduced Mary to Bill then he might have in truth introduced her to other people as well but to Bill he only introduced Mary quite a different meaning and if I say John only introduced Mary to Bill then it means he didn't encourage them to form a partnership together so the problem is the meaning of a sentence depends crucially on the intonation it's difficult to generate in the general case because we need to know what's the intended meaning but your application often knows that and so you developers can employ your domain knowledge the knowledge within your application to do a better job let me work through an example so here is a text for an application that people are using to book flights and here's the confirmation that's being sent to the user I will read this out first by just passing the text as you see it directly to the speech synthesizer and it will sort of do okay but it won't sound all that great here we go oh the audio is not going out from the old demo machine is there a reason for that all right well uh what can you do okay we can put a microphone on it yeah is this mic working ok this is high-tech let's see if this works life is a lesson in life like when he can see speaking recording dance and as we all know that may 24-second exchange here landing in san francisco we're collecting champion thank you confusing to you to dance travel whoa okay was that horrible all right I thought it didn't sound that good so let's talk about what you can do about that there are commands that you can embed into the text that you send to the synthesizer you can embed those commands by rule and they'll give the synthesizer hints about how to speak the text last year and in prior developers conferences we've
given some instruction on how to use some of these commands and according to those this would be the kind of way that you would annotate the text for the synthesizer I've put the embedded commands into a smaller font so that you can see them but we've been working on the front end of the synthesizer and some of this information we can now infer because we're now tracking the topic as we go through texts and modifying the way we say it according to the topic structure and the block structure that means that some of these are no longer needed so those ones you get for free but there are others here that I've left behind which do depend on domain knowledge let's take some examples I'm going to go through these by laying out some simple principles you can use the first one I'm calling let the user catch up what you should do is add pauses at major sense units where pieces of information seem to cohere together make sure they are separated from other pieces of information and you can do that just by sprinkling punctuation around there if you want to make it longer you can add the embedded command I've got there slnc which means add in this case 500 milliseconds of silence you can also adjust the speaking rate to be appropriate for the purpose of the speech in this particular case the user needs to transcribe the information and so you want to read it a little more slowly if the user already knew that information and you're just reading it back for confirmation then you would read it back more quickly so here for example is one of those sentences with just the plain text it sounds like this I'll play this out through the demo machine again well hang on it didn't play all right so what I've done here is added a command to slow down the rate a little bit and added some colons and commas that you can see at the end of those lines and a little bit of extra silence let's see if this one will play out that's the unadorned text now with those
commands that you can see it sounds like this do you hear a difference okay go on second principle is let many things go in the background when we speak we don't equally highlight every word we mark for our listeners which things we're saying refer to what they already know and which things that we're saying are new and important and the way we do that is by reducing the emphasis on things that listeners already know so you can do that by de-emphasizing repeated words for example departing at 610 landing at seven ten but that one you now get for free because we're tracking things like that in the synthesizer but in addition you can de-emphasize words that could be inferred from the overall application scenario so for example in this case the text started with your first flight is but the user already knows that it's talking about flights and so it's appropriate to de-emphasize flight so it should be spoken as your first flight is I'll play that first of all without that embedded command and you'll hear there's equal emphasis on the words first and flight then I'll play it immediately afterwards with this embedded command which takes the emphasis off the word flight see if you hear a difference did you put the audio off again because we have the audio on all right well let's go through this again okay first without that embedded command and then with do you hear a difference okay third principle is liven it up if you add an exclamation mark at the end of a sentence then that stops us from gradually rolling the pitch off all the way through the sentence and it makes it sound a little bit more involved a little bit more lively so if you're hearing your synthesizer having a kind of a bored sound this is one way that you can reduce that don't use it everywhere use it judiciously then you can focus the user's attention on what's important by adding extra emphasis on the most important words by embedding emph + just before them and finally we
suggest using what we call paragraph intonation when we speak we don't string all of our sentences together into one long undifferentiated stream of speech but rather we group our sentences together into larger units that span multiple sentences that relate to the topic structure and we mark that for our listeners for example when I start talking about a new topic I raise my voice just a little bit and now as I talk about that topic I lower my voice down to its normal voice range and then towards the end of that topic I kind of roll my voice off then for the next topic I raise my voice again you hear that we all do this listen to people at lunch time you'll hear it's going up and down all the time to signal the topic structure so you can do that we have told people that you should raise the pitch range at the first sentence of a paragraph with some embedded commands and then lower the pitch range at each subsequent sentence and then put extra silence in well now you get all that for free so what you need to do is put in a blank line between sentences and we will do the rest so in this particular case the last sentence thank you for choosing TTS travel is not related to the topic of the previous information and so we can separate it just by a blank line and that now sounds like this it's are absolutely unforgivable at least concern among international certain p.m. 
thank you for choosing the example for comparison I'll just play that text again unadorned so you can see where we've come by accumulating all these commands oh no it won't play all right we'll go on all right nothing this so to summarize customize the pronunciations when you're using speech synthesis customize the intonation using those principles and together those things will help you to give your users a better experience now I'd like to introduce a new tool that we're making available to you guys starting today to further customize the intonation I'd like to go back over to the demo machine please the problem that we're addressing is that sometimes no matter how many embedded commands you put in the text you can't quite get it to be spoken the way you want it to be spoken with the personality or the emotion that you want wouldn't it be great if you could just record yourself saying a sentence the way you'd like the synthesizer to say it and have it copy you well that's what this tool does let's start up over here that's not there alright it's called repeat after me this tool we've had going in the lab for quite some time and was an internal tool that ran on Mac OS 9 it has been ported to Mac OS X and a new user interface has been put onto it to make it easier to use and more consistent with Mac OS X and that work was done for us by the folk at Software MacKiev and so we're very grateful to Jack let's give him a hand for doing this and the plants down who did the work so you can type in some text we are at WWDC and this will tell you first of all the phonemes that the synthesizer used to pronounce it there's the we here's the are then down here it plots with time going this way and pitch going this way the fundamental frequency the tune that's generated by the synthesizer for that sentence so if I speak it will sound like this is this machine going through the sound system now ok here we go I'll play it again now suppose I think that's all spoken a
little bit too quickly I'd like to slow it down perhaps to time it with an animation that I have well I can just click on the end up here and drag it out and make it take longer or if my animation is really quick so I'm gonna have this quite short [Music] let's go back to the default if I want to emphasize the we which is here here's the w and here's the e of we and put emphasis on that a bit more then I can just raise the pitch up there let's pick it up and pick this up and it could take longer and now that will sound like this I can record myself let's try it and have it give me a recording all right let's try it hello hello is the audio input working one two three let's check the sound preferences sound guitar sound input oh oh let's plug this microphone is a trick computer I'm going to disappear again but I'm actually still here so you can't escape yet where's the connection aha all right oh there is so sound input works on OS X hello there we go we're up now let's try that again we are at WWDC alright save that my audio file comes up where is it all right audio isn't completely working as I said this is Panther bear with it but I've got some pre-prepared ones here to show you just in case that happened so my recording my sound wave should have come up down here so we'll show you one we've previously prepared which is here we go so here's the original signal and here's Victoria now copying we're going to make this available to developers watch the speech developers mailing list to find out the method by which you can get hold of this we're also planning on running a kitchen on it can I see a show of hands of people who would be interested in coming to a kitchen to learn how to use this define kitchen that's a good question a bunch of you developers come along to Apple and sit there and we teach you how to use it you bring along text from your application and sit down with some machines and we sit
with you all day and teach you how to use it so who would like to come along and do that oh quite a number of you okay cool let me just give you a couple of examples of what you can do with this I've cued these up in iTunes Baba bong bong bong well our buddies at World Book used this for a speech that you heard although it wasn't very loud that's spoken back while you're doing a search for example you type in Panther and the computer will say searching for Panther if you just send the text to the synthesizer it sounds like this well with customization using this tool they got it to sound like this do you hear the difference when the search is complete it would say which didn't sound that natural so they used this tool and now it sounds like this yeah I had an application where people would call up an information system type in their ID number and it would then read out news and email and so on to them and it would greet them by name the developers of this system got a voice talent to record greetings to about 5,000 different names and they found to their dismay that this has very little coverage for names we have actually 65,000 names in our dictionary that gives us about eighty percent coverage of English names if we increase it by another 65,000 that would put it up to about eighty nine percent coverage so names are difficult right that's the statistics of names so they used our speech synthesizer and when they passed text to the synthesizer it didn't sound the way they wanted it to sound here's an example of some names being spoken just from text a bit tedious so using this tool we got it to now sound like this so that's the tool that you guys can use ok let's go back to the main machine another thing we want to introduce for you today is Cocoa classes and our philosophy here is that they should be simple to use inspired by Alan Kay we think simple things should be simple and complex things should be possible and so here to tell you
about them is Kevin Aitken he's the author and you can blame him alright thanks Kim and yeah fill three there's definitely a lot of people who have contributed too but I'm willing to take the blame I guess so let me get started into this first of all we've worked really hard on this and Panther now offers Cocoa developers the ability to easily access the most popular features of our speech engines so over the next few slides I'm going to take you through the NSSpeechRecognizer class which allows you to listen to and respond to the user's spoken commands and the NSSpeechSynthesizer class which will allow you to generate synthesized speech through the computer speaker or to a file so let's get started with the NSSpeechRecognizer class so first of all we designed this to be really easy so all you do virtually you just give it a list of strings and tell it to start listening that means that you don't need to understand concepts like language models and recognition results just to get started but we've made sure that it's dynamic you can change it on the fly and you can have several recognition objects running at the same time so it's very flexible so what I'm going to take you through is a couple of coding examples for this example with NSSpeechRecognizer think of you writing an application a game that allows the user to move through a maze using four commands north south east and west and so let's get started so I've broken these in kind of two sections the first section will just get us listening and the second section will handle the result so first thing we're going to do is going to create a recognizer object and then we're going to set the delegate remember a delegate object is just a helper object in this case it's going to receive the message when the recognition system has heard something then we're going to set the commands as I said before this is just a simple array of strings in this case north south east west and then we're going to start
listening so now your application is listening for those four commands so the user starts navigating through that maze and so they say one of those so what's going to happen is your delegate object is going to receive a didRecognizeCommand message and as the command parameter you're just going to receive one of those strings that you originally gave it so you can use that string to compare to one of your known strings I've just used a simple if then else I'm sure there's much more efficient and more exciting ways to do it and then it converts that into some action okay so that's pretty easy so let's go on and talk about NSSpeechSynthesizer so it's going to allow you to speak asynchronously either to the computer's speaker or to a file ok because it's speaking asynchronously you can handle certain events during the speech generation process specifically you can get notification when the speech is finished you can get notification when a phoneme is about to be spoken and when a word is about to be spoken we give you access to all the voices that are installed on the system so you can get information about each one of those and create a pop up for the user to select one and finally you can combine both the NSSpeechSynthesizer and NSSpeechRecognizer classes to create spoken user interactions those are kind of dialogues between your application and the user so we've got a code example of that we're just going to instantiate our synthesizer object using the default initializer here so it's going to use the default voice that the user has chosen in the speech preference panel we're going to set that delegate object and then we're going to start speaking by calling startSpeakingString now this is going to be coming out of the default output device alternatively we can call startSpeakingString toURL to have it written to a file and then now your application is speaking away you can handle some of those events so
we can implement the didFinishSpeaking method on our delegate object so we know when it's finished speaking so you can say update your user interface you can be notified when it's about to speak a word so that you could do the follow the bouncing ball on the screen or highlight a word on screen as it's being spoken and you can also find out when it's about to speak a phoneme so you can animate a mouth on screen or avatar some character whatever you like so anyway that's a wrap up of the speech classes I'm going to go over to the demo machine and really quick show you the example get this guy up and going all right so let me quickly take you and show you where this is so we have example applications in here under speech we've added some for recognition and we have an NSSpeechSynthesizer example here let me show you what we built with this so using most of the callbacks let me choose a voice here and start them speaking so that's what we created using the NSSpeechSynthesizer class and it was really fast really easy and hopefully you'll find that as well so that example is on your Panther CD and there's some other examples in there so go take a look you want face up so that it brings us to the end of the material we've prepared for you today to summarize we've introduced the speech technologies for those who aren't quite familiar with them we've introduced a tool for customizing speech synthesis which we're going to make available to all developers we've introduced Cocoa classes and we have given some guidelines about when you should use speech and what kinds of principles are behind your adoption of it for those that are interested in more background information about this you might want to look at the introduction to developing applications with Cocoa to find out about Cocoa programming or you might want to see the AppleScript update because speech and AppleScript have such a strong synergy that many
folk at Apple say those two together are the two most strategic technologies at Apple and you can find out more about the accessibility API at the Mac OS X accessibility session we don't have time for questions now but the team will be gathered just outside there and happy to stay as long as any of you would like to answer any questions if you have any questions subsequently your contact is John Geleynse he is a manager of software evangelism and his email address is up there it is hard to read it's Geleynse g e l e y n s e at apple.com and go to the speech web page to find out about the speech developers list and documentation of all the things we've shown you and more the URL is up there thanks a lot [Applause]
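
The text-filtering and embedded-command principles from the session can be illustrated with a small preprocessing sketch. This is only a minimal sketch under stated assumptions: the function names, the bracketed "pop" pattern, and the default rate and pause values are hypothetical and are not taken from the session slides. The command strings `[[slnc ...]]`, `[[rate ...]]`, and `[[emph -]]` follow Apple's embedded speech command syntax, which the talk refers to as slnc and emph.

```python
import re

# Hypothetical pattern: "pop" inside brackets or parentheses followed by digits,
# e.g. "(pop. 12,345)". The real application's format may differ.
POP_PATTERN = re.compile(r"[\[(]pop\.?\s*([\d,]+)[\])]")


def expand_population(text: str) -> str:
    """Filter step: expand the bracketed abbreviation to the spoken word."""
    return POP_PATTERN.sub(r"population \1", text)


def silence(ms: int) -> str:
    """Embedded command asking the synthesizer for a pause of `ms` milliseconds."""
    return f"[[slnc {ms}]]"


def de_emphasize(text: str, word: str) -> str:
    """Put an [[emph -]] command before the first whole-word match of `word`."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.sub(pattern, f"[[emph -]] {word}", text, count=1)


def build_confirmation(sentences, closing, rate=160, pause_ms=500):
    """Slow the rate, pause between sense units, and separate the closing
    sentence with a blank line so it is treated as a new paragraph."""
    body = f" {silence(pause_ms)} ".join(sentences)
    return f"[[rate {rate}]] {body}\n\n{closing}"


if __name__ == "__main__":
    first = de_emphasize("Your first flight is flight 201.", "flight")
    print(build_confirmation(
        [first, "Departing at 6:10, landing at 7:10."],
        "Thank you for choosing TTS Travel!",
    ))
    print(expand_population("Springfield (pop. 12,345)"))
```

The resulting string is what an application would hand to the synthesizer in place of the raw text. Keeping each principle in its own small function makes it easy to drop the ones that the synthesizer now infers on its own, such as de-emphasis of repeated words.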