WWDC2003 Session 422

Transcript

Good afternoon. I ate too many brownies at lunch, and if you're like me you'll probably just nod right off in about an hour here, but it's the last couple of sessions this week. The purpose of this session is to present to you a new technology that's in Panther called Search Kit. The value of Search Kit, I think, is first of all that it provides some functionality that's been missing in the system and that wasn't available to you as the developer. Secondly, if you can leverage that functionality in your application, you can really deliver a very consistent user experience for searching, not to mention the fact that you'll have a search engine that's already written for you in your application. So to talk you through the specifics of this new technology in Panther, I want to introduce Wayne Loofbourrow [unclear].

[Applause]

Thanks, John. So welcome to the Search Kit session. If you're having trouble finding a seat, there are some up front. Today we're going to talk a bit about Search Kit and how it's used in Mac OS X (it's actually been used for a while, even before it was released as a public API), a bit about what it can do, and how to add search to your application. Specifically, we'll talk about what Search Kit is, some of the challenges of providing searching that Search Kit solves, then how to go about doing indexing in three steps and how to do searching in three steps. Then we'll talk a bit about an application of Search Kit called summarization, which provides a specific API you may also be interested in using in your application.

So where does Search Kit fit into the technology framework? You've all seen this diagram by now. In this diagram we show Core Services in the middle, the middle green one there, and that's precisely where Search Kit sits. So this is a layer above Darwin and below the level of the frameworks.

Just a bit of history. Some of you may actually already be familiar with Search Kit if you looked at
all at V-Twin or AIAT. This is a technology that was available in Mac OS 9. It was called the Apple Information Access Toolkit, if you can believe it, and it's a C++ API. It used its own custom data types, and it was a rather complex API; in fact, I think the manual for it was about 300 pages. We wanted to improve this in Mac OS X, so what we did was create a very streamlined C API that's based on the Core Foundation types, has maybe 40 or 50 calls, and has much shorter documentation.

As I mentioned, Search Kit is used in a number of the applications in the system. In Address Book, the search-as-you-type that lets you find addresses in your address book uses Search Kit. In Apple Help, when you want to know "how do I connect to the Internet", Apple Help uses Search Kit to find all the documents that are relevant to that query. In Apple Mail, it's used to search mailboxes: each mailbox is indexed using Search Kit, you can search individual indexes or all of them at the same time, and you get relevance-ranked results. And in the Finder, in content searching: when you do content searching in the Finder, Search Kit does the underlying search, and again you get relevance-ranked results based on the contents of the documents on your hard drives.

So now, to demonstrate a few of these applications, I'd like to invite David Casseres, the Find By Content tech lead.

Thank you very much. If we can have demo one here, I'm going to do a couple of very brief demonstrations. First I'll show you Mail, where we have a few mailboxes, and I can put a little text right in here. I type in "itunes" and I find the one message in all those mailboxes that has it. I can type in "beatles" and I get that one. How about "switch"? There are several there: I've got a Switch commercial, I've got an article. Yes, it does have the word "switch" in it. And because this is Mail, it's doing prefix searching, so I can type there and I get the ones that have "Abbey Road"; all I typed was "abb". That's Mail.
Now in the Finder... no, that's still Mail. There we go. In the Finder I can search some files. I have a set of demo files here; there are 770 files in here, and they were previously indexed, so you're not going to see the initial indexing process. But the Finder does something much more complicated. It uses a couple of other frameworks to do something complicated with Search Kit: it will first search the existing index, then start validating the index to make sure that it's still up to date, and it will search repeatedly while it's doing that. You'll see that it's all very quick. There we are: relevance-ranked results. We've got all kinds of files here; we've got PDFs, we've got text files, and so forth and so on. This works in practically any language you can use with the Mac, and it's very much improved in Panther. And that's Find By Content.

Thank you, David. So there are a number of challenges to providing search in Search Kit. In particular, there's doing relevance-ranked results: you don't just want the documents that match a particular query, you also want them ranked by how relevant they are to what it is you were asking for. There's being able to support multiple languages: as David mentioned, with Search Kit you can index a large number of different languages (these are human languages, not computer languages) and get correct results. We need to be flexible about what a document means: it could be a document on your disk, but it could also be a message in Mail or an address in Address Book. And we have to be flexible about the kinds of queries: you saw an example of prefix searching, and you can also do word-based searching and some other ones that I'll talk about in a bit.

So how does Search Kit deal with some of these challenges? Well, one example that comes up in Apple Help is that you'd like to be able to type in a query like "how do I connect to the Internet". Do you really mean to find documents that contain the exact phrase "how do I connect to the Internet"? Well, no, that's not what you're
looking for. What you want is documents that are relevant to the query "how do I connect to the Internet", so you're not just doing a string-matching search. So what documents match this query? Well, it's really hard to know without first breaking it into words, so that's what Search Kit does: it breaks it into words. But which document is relevant to these words? Is a document that contains the words "do", "I", and "to" very relevant to the query? Well, probably not. It's probably not as relevant as a document that contains "how", "connect", and "Internet". So one of the things that Search Kit does is try to determine which words in the query are most relevant to what it is that you're looking for, and it does this using some statistical analysis.

Another challenge in Search Kit is dealing with different languages. It's fairly easy to break things into words when you're dealing with English, say, or a language that separates words by spaces. But what if you're dealing with Japanese? There are no spaces between words. So what Search Kit does is use a Japanese language analysis that was developed by the folks in Apple Japan, the Language Analysis framework. It uses this to do a grammatical analysis of Japanese, and then it can break Japanese into words and use the same statistical techniques to find the most relevant document.

What is a document? Is a document just a file on your hard disk? Well, that would be a fairly limited capability if that's all you could look for. So we do search text documents, and we also search documents of other file types, such as [unclear], Word, PDF, and HTML. But in addition we support searching items that are within the application, such as mail messages, address book items, or, in your application, really any object, or maybe an entry from a database. Really anything that contains text can be indexed and searched with Search Kit.

The other area of flexibility that we need is in what
a query is. I already gave the example of a natural language query such as "how do I connect to the Internet". But we also want to be able to do prefix queries, such as the one that David demonstrated, so that you can type just the beginnings of words, and even in the middle of your typing you can do a search and get relevant results back without having to complete a word. You also want to be able to do a Boolean query: in some cases you want to have an exact match, and so you'll be able to construct a Boolean query. Then there's a popular method, used by some of the Internet search engines, of marking things with pluses and minuses to indicate inclusion or exclusion. And you might also want to be able to find similar things: you've already found some items, and now you want to find other items that are most like the items you've already found. So we also support a similarity search, where you provide documents as the input and get documents as the output.

Let's talk a little bit about a typical usage scenario for using Search Kit. Really, it's very simple: you index the documents, and then you search the index. We'll go into more detail about that, of course. Really, the only purpose of creating an index is to make searching fast. You could certainly look through all the text content of all your documents to find things, if that's what you wanted to do, but you'd be missing a couple of things. You'd be missing speed, first of all, but you'd also be missing the relevance ranking and some of the statistical analysis that goes into the index. Once you've created an index, you don't need the original documents in order to do searching. The index contains references to those documents, not the content itself, but it also contains statistical information about the words in the documents. The index can be stored in memory if you like, for performance, or, if you want it to be persistent because you're going to use it again, then you could
create it in a file.

The basic process of indexing is pretty simple: you provide the documents to Search Kit, and it analyzes the text and updates the index. The process of searching is equally simple: you provide it with a query and the index to search, and it searches the index and returns results. So that's the basic outline; of course, we'll go into more detail.

Indexing can be done in three basic steps: you open the index that you'd like to search, you add the documents to the index, and then you flush it. Opening the index, of course, could be opening an existing one or creating a new one. Opening an existing one is pretty basic; I won't bother talking about that. When creating a new one, there's a decision to be made about what kind of index you want to create, and there are three different types. The first is an inverted index. This kind of index maps the terms, or words, in each document to the document itself and provides statistical information about them. This kind of index is used for most of the query kinds of searches. A vector index maps the other way: it maps documents to terms. This kind of index is actually seldom used. More common is the third kind of index, an inverted-vector index, which does both.

So you might ask, why not just always create an inverted-vector index, because it obviously has the capabilities of the first two? Really, the answer is just space and performance. It takes a bit longer to create, as there's more information to put in it, and it takes more space on disk or in memory. So it turns out that for most purposes an inverted index is really the kind that you're going to want to use, both for performance reasons and for space reasons. But just breaking it down a bit more: if you're doing ranked, prefix, required, or Boolean searching, four of the kinds that I talked about, then you really want to be using an inverted index. It'll give you the best performance and size ratio. If
you're going to be doing similarity searching, it's okay to use an inverted index; it will still work. It's going to be a bit lower performance in doing similarity searches than the inverted-vector index is going to be, but if you're only doing similarity searching pretty seldom, then that's probably the best choice. If you're going to do a lot of similarity searching, then you probably want an inverted-vector index.

Okay, so we've either opened an existing index or created one. Now we've got to add the documents. As I mentioned before, documents can be any of a number of different kinds: it could be something on your disk, or something inside your application. For each of these documents, what you provide is a document reference. The index doesn't need to store a lot about the document itself, other than a reference and the statistical information that it wants to gather, and when you get results back, you're going to get references back rather than the actual document itself. These references are just URLs, or they're created from URLs, I should say, to be more precise.

The other thing you need to provide is, of course, the text of the document, and there are a number of ways to do this. If you have a file, such as the first example on the left, then you use a file URL reference, and that tells Search Kit that it's something that it knows how to read, and the text from the document will be read automatically. In addition, if you'd like it to handle multiple file formats, we have built-in support for a number of them in Search Kit, and all you need to do is load the default extractors, which is a call in Search Kit, and that means whenever you read a file it will do the text extraction for you. But if you'd like to handle any kind of item, in particular something that's inside your application, or you have a document format that we don't know about but that you know how to get the text from, then you can provide the text as a CFString. That's the universal method, and the
document reference at that point can be any kind of scheme and any kind of hierarchy that you'd like to create within your application. And then the last step is simply flushing the index to disk or to memory, and at that point you're done indexing. So it's actually pretty simple. Now you're ready to search.

Searching has three basic steps as well, and I'll talk about each of those. First you create the search group, then you send the query to Search Kit, and then you process the results. I keep hitting the blank button there. Okay.

So, searching. One thing that Search Kit supports is the ability to have more than one index searched at the same time. Now you might ask why we do this. One reason is to support multiple attributes. It might be that the objects in your application have not just one text attribute, the content, but other text attributes as well; maybe it's a description of a movie, or additional attributes that you'd like to index as well. One example of this is in the Finder. When Find By Content does indexing in the Finder, it indexes not only the names of the files but also the contents of the files. You'll see in these results that the top few items there were found because of their names. The search was for "Quartz", right? So a couple of files had "Quartz" in the name, and the bulk of the files there had "Quartz" in the content. So there are two indexes, and they get searched simultaneously. Creating a search group allows you to search multiple indexes at the same time.

Now you might ask, why use a search group? Why not just search each index individually and then combine the results somehow? Well, the reason is... let me give another example here before I get to that. Multiple containers is another reason why you might create multiple indexes. An example here is Mail. In Mail, each mailbox is indexed distinctly, so when you want to do an entire content search for all the mailboxes in Mail at the same time, it creates a
search group in order to search multiple mailboxes.

All right, now we'll get to the point about normalizing ranking. If you were to do searching on individual indexes and try to combine the results yourself, the problem is that the relevance ranking would only be relative to the content of each individual index; you haven't normalized the rankings across all the indexes. By creating a search group, the statistics are normalized, and the documents you find will be the most relevant across all the different categories. And of course it's possible to create a search group with just one index, and that may be a common case as well.

Okay, so you've got your search group created, and now it's time to send the query to Search Kit to do the search. But we're not quite ready for that, because we have to determine what kind of search we're going to do. The kinds, as I mentioned before, are ranked, prefix, Boolean, required, and similarity searching. Just to go into a little more detail about each one:

Ranked searching supports a natural-language kind of query. So if you're trying to do something like Apple Help does, where you've got a large number of documents and you're trying to find the ones that are most relevant from a sort of English, or other language, point of view, then this is the kind of search you're going to want to do. The user can ask a question, they can name a topic, or they can provide just related words, and it'll do a reasonably good job of finding the documents that are most relevant. Again, an example of this is in Apple Help.

Prefix searching allows you to do search-as-you-type. So if you want one of those interfaces where, as you type, the results come down, like you'll see in Address Book or in Mail, then this is the kind of searching you want to do. It's basically just like the previous one except that it supports prefixes, which has the advantage of supporting search-as-you-type. One disadvantage is, of course, that you'll get more matches than might be intended, because even
if you type what seems like a complete word, it may be the prefix of some other word, and so you'll probably get more results than you would otherwise. An example of this was in Address Book.

Boolean searching is just what you think it would be: AND, OR, NOT, and you can combine things with parentheses, etc. This one's maybe less often used, because most users aren't really familiar with doing this, except advanced users, but it is in fact supported in Mail. You'll notice that if you type an ampersand or a vertical bar in Mail, it'll do Boolean searching. The way it does this is it looks at the query that you typed in, and if it contains any of those symbols it sends it on as a Boolean search, and otherwise it sends it on as a ranked or a prefix search.

Then required searching is the last type, and that allows you to add pluses and minuses. We don't have any examples in any of the Apple apps of using this, but it's a popular search technique in use on the Internet.

And then similarity searching: there's a different call for that one. You provide the sample documents, you don't have any sort of text query at all, and Search Kit returns documents that are similar to the ones that you provided.

Okay, so now we're ready to send the query. We decided, let's say, to do a ranked search in this case. So you send the query to Search Kit, Search Kit efficiently searches the index and then returns the results, and the next step is of course to process those results. The results contain a number of bits of information that are helpful in displaying the results to your user. One is the document reference, which is really the same thing you provided before. It's up to you what kind of a hierarchy you want to create to make it easy to find items in your application. In this case, as an example, I said "mail message 21". Now, that's one way to identify a document; you might have more of a hierarchy, like which mailbox it was in, and maybe by subject or something. So it's up to
you to then map that back and say, okay, these are the documents I want to display based on that reference. The second bit of information it returns is the relevance ranking, and this is of course used in displaying the results. And third, in the case that you were searching multiple indexes simultaneously, it's going to return a reference to which index this particular result came from. In the Mail example, let's go through all of them: the document reference tells you which message, the relevance rank tells you, of course, the [unclear] column, and the third one I mentioned, the index reference, tells you which mailbox it came from. Because it has multiple indexes, it has to know where the result came from.

So that's basically searching in three steps, and at that point you've added searching to your application. How easy is that?

Now, one of the things that Search Kit is capable of doing, if you use it in just the right way, is summarization. But to make it easy, we've brought that API up to a higher level, so you can directly do summarization of documents. I'd like to invite David back up to talk about summarization.

[Applause]

Hello again. Summarization is my favorite feature of OS X, and I'm wondering if I could see a show of hands: how many of you have discovered the Summary service? Some of you, not all of you. It'll be my pleasure to show that off. We've had summarization as a service since 10.1, and it became better in 10.2; it'll be essentially the same in Panther. Now, you heard me mention Find By Content before. It's a high-level interface to Search Kit, and it lives in ApplicationServices, which you probably already link to, and you'll want to find a header called FindByContent.h to see the syntax. There's Search Kit, there's Find By Content; Find By Content calls Search Kit.

Now, to do summarization of text, it uses a very simple technique, and it's using Search Kit. It's
indexing the sentences (a document now is a sentence), and it takes that index and searches for the best sentences. It constructs an idea of what the meaningful stuff is in this document, in this thing that we're summarizing, looks for that among the sentences, and gives you back the best ones. So you see, it's indexing and it's searching, exactly like what Wayne described. And here is the demo. They should play a little music while we walk back and forth.

Here I have a weblog, a document here, an interesting thing that I found on the web. It's an article by a guy named David Stutz that he wrote as part of his farewell to Microsoft when he quit working there. I love it because it has this great sentence at the end; I don't know if you can read it. He says, "Stop looking over your shoulder and invent something." But, you know, when I get this on the web, I don't really know if it's something that I want to read or not. It's kind of long. So here in Safari I'm going to Select All, go to the Safari menu, Services, Summarize, and there's the summary. It's giving me a default-size summary. I can read that, or I can take the slider here: I can find the best sentence, I can find the paragraph that contains the best sentence, I can zoom up a little bit to get more, I can just scrub back and forth and get more and less, and I can get the entire text, like that. Is that cool? That's the demo.

But because it's a service, a lot of people have never even found it at all, and this should be in your application. By the fact that you're here at all, I'm guessing that quite a few of you have applications that are text-oriented. So why would you use this? Well, to begin with, it's useful. As I just described, you can make little thumbnails, you can make these dynamic summaries that a user can play around with, and generally improve the usefulness of your application, and you can advertise it. And it's a lot of fun. Text is not usually much fun, and maybe you envy the graphics people who can put all this
wonderful Quartz stuff in their applications. You can put in summarization, with dynamic-size summaries, and it's quite easy. If you're doing a fixed-size summarization, like to create a set of thumbnails of text, there's one function that you call. You give it a CFString and you tell it how many sentences you want (if you pass 0, it'll figure out a good number of sentences), and it gives you back a CFString. It's that simple.

The resizable summarization, like what you saw there, is twice as hard: there are two functions you have to call. There are two others, but they're not interesting. The first function does all the hard work, and you can do that before the user even asks for a summary; do that in the background on a thread. It does all of the analysis of the text into sentences, it does the indexing, and it gives back an object that you can keep around, and from that object you can construct the different-sized summaries very, very quickly, as you saw. One thing you should think about, though, is that that object takes about as much memory as the original text.

How does it work? It works by some very simple stuff. It's statistics on the words in the sentences; it's just Search Kit searching, like you saw Wayne doing. The key is that it's analyzing the text into sentences, and we do that probably as well as any other code that I've seen. We only keep complete sentences. We throw out everything else, and that makes the summary actually come out better, because it throws away page footers, it throws away little headings and things, so it's only keeping the real sentences. And it does not care anything about meaning and grammar. And that's summarization. Thank you very much.

Thanks, David. So as you've seen, you can add searching to your application fairly easily, you can add summarization to your application fairly easily, and both of these APIs are available in Mac OS X Panther; summarization is actually available in Jaguar. So Search Kit is a powerful text searching framework
that's now available in Mac OS X. It indexes anything that has text in it, whether it's in your application or on disk, and it provides powerful, fast searching and summarization. So now the cat's out of the bag, and it's up to you guys.

I'd like to point you to some other sessions that may be of interest to you. Of course, these sessions have already occurred, but for folks watching the DVDs: if you're using Unicode in your application, there's session number 40 [unclear], Unicode for Japanese, Chinese, and everything else. If you're interested in searching the address book, then you probably want to use the Address Book APIs rather than the Search Kit APIs. Similarly, if you're interested in providing Apple Help: Apple Help uses Search Kit underneath, but you'll want to go look at the session for Apple Help, number 408. And if you're interested in indexing things off the Internet, then you probably want to look at an Internet technology session on the advanced Foundation URL API. And of course our friend John is the person to contact.

For more information, there is a Search Kit reference that's available on the ADC site for ADC members, which is a free membership. As well, you can look at the header files on the Panther CD that you have: if you just go into System/Library/Frameworks/CoreServices, you'll see in the Frameworks folder there's a Search Kit folder which has all the headers, and those are fairly well documented with comments as well. Similarly, you can look at the Find By Content and summarization APIs there.
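
Editorial sketch, not part of the session: the indexing and searching flow described in the talk (add documents by reference plus text, then run ranked or prefix queries against an inverted index) can be illustrated with a toy Python model. Everything here is an assumption made for illustration: the class `ToyInvertedIndex`, its methods, the `mail://` references, and the TF-IDF style weighting are hypothetical stand-ins, not Search Kit's API or its actual statistics.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (only adequate for space-delimited languages)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyInvertedIndex:
    """Hypothetical inverted index: term -> {document reference: term count}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_refs = set()

    def add_document(self, doc_ref, text):
        # The index keeps only the reference and word statistics, not the text itself.
        self.doc_refs.add(doc_ref)
        for term, count in Counter(tokenize(text)).items():
            self.postings[term][doc_ref] = count

    def _weight(self, term):
        # Assumed weighting: terms found in fewer documents count for more.
        return math.log(1 + len(self.doc_refs) / len(self.postings[term]))

    def search(self, query, prefix=False):
        """Return (document reference, score) pairs, most relevant first."""
        scores = defaultdict(float)
        for word in set(tokenize(query)):
            if prefix:
                terms = [t for t in self.postings if t.startswith(word)]
            else:
                terms = [word] if word in self.postings else []
            for term in terms:
                weight = self._weight(term)
                for doc_ref, count in self.postings[term].items():
                    scores[doc_ref] += count * weight
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical document references; any URL-like scheme would do.
index = ToyInvertedIndex()
index.add_document("mail://inbox/1", "How do I connect my Mac to the Internet")
index.add_document("mail://inbox/2", "Do I need to do this? I do.")
index.add_document("mail://inbox/3", "Abbey Road is an album by the Beatles")

ranked = index.search("how do i connect to the internet")
prefixed = index.search("abb", prefix=True)
```

In this toy model the ranked query puts the first message on top because it matches the rarer words "how", "connect", and "internet", while the prefix query for "abb" finds only the message containing "Abbey". A real application on Mac OS X would use the Search Kit calls documented in the headers mentioned above instead.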