WWDC2003 Session 422
Transcript
Kind: captions
Language: en
good afternoon I ate way too many brownies at
lunch and if you're like me you'll
probably just knock right off in about
an hour here but it's the last couple of
sessions this week and the purpose of
this session is to present to you a new
technology that's in Panther called
Search Kit and the value of Search Kit I
think is that it provides some
functionality that's been missing in the
system first of all and that wasn't
available to you as the developer and
secondly if you can leverage that
functionality in your application you
can really deliver a very consistent
user experience for searching not to
mention the fact that you'll have a
search engine that's already written for
you in your application so to talk you
through the specifics of this new
technology in Panther I want to
introduce Wayne Loofbourrow to the stage
[Applause]
thanks John so welcome to the Search
Kit session if you're having trouble finding a
seat there are some up front so today
we're going to talk a bit about
Search Kit and how it's used in Mac OS
X it's actually been used for a while
even before it's been released as a
public API a bit about what it can do and
how to add search to your application so
specifically we'll talk about what
Search Kit is some of the challenges of
providing searching that Search Kit
solves and then how to go about doing
indexing in three steps and how to do
searching in three steps and then we'll
talk a bit about an application of
Search Kit called summarization which
provides a specific API you may also
be interested in using in your
application so where does Search Kit fit
into the technology framework so you'll
all have seen this diagram by now in
this diagram we also show Core Services
in the middle the middle green one there
and that's precisely where Search Kit
sits so this is a layer above Darwin
and below the level of the frameworks so
just a bit of history some of you
actually may already be familiar
with Search Kit if you looked at all at
V-Twin or AIAT this is a technology that
was available in Mac OS 9 it was called
the Apple Information Access Toolkit
if you can believe it and it's a C++ API
it used its own custom data types and it
was a rather complex API in fact I think
the manual for it was about 300 pages so
we wanted to improve this in Mac OS X
and so what we did was we created a very
streamlined C API that's based on the
Core Foundation types has about maybe 40
or 50 calls and it's a much shorter
documentation
so as I mentioned Search Kit is used in a
number of the applications in the system
in particular in Address Book the sort
of search as you type that lets you find
addresses in your address book uses
Search Kit in Apple Help if you want to
know how do I connect to the internet
Apple Help will use Search Kit to find
all the documents that are relevant to
that query in Apple Mail it's also used
to search mailboxes each mailbox is
indexed using Search Kit and you can
search individual indexes or all of them at
the same time and you get relevance
ranked results and in the Finder in the
content searching so when you do content
searching in the Finder Search Kit does the
underlying search and again you get
relevance ranked results based on the
contents of the documents on your hard
disk so now to demonstrate a few
of the applications I'd like to invite
David Casseres the Find by Content tech
lead thank you very much I'm going to do
we can have demo one here I'm going to
do a couple of very brief demonstrations
first I'll show you Mail
where we have a few mailboxes and I can
put a little text right in here I type
in iTunes I find the one message in all
those mailboxes that has it I can type
in Beatles and I get that one how about
switch there are several there got a switch
commercial got an article yeah yes it
does have the word switch in it and
because this is Mail it's doing prefix
searching so I can type there and I get
the ones that have Abbey Road all I
typed was abb that's Mail
now in the Finder no that's still Mail
there we go in the Finder I can search
some files I have a set of demo files
here there's 770 files in here and they
are previously indexed so you're not
going to see the initial indexing
process but the finder does something
much more complicated it uses a couple
of other frameworks to go through to do
something complicated with Search Kit
and it will first search the existing
index then start validating the index to
make sure that it's still up to date and
it will search repeatedly while it's
doing that you'll see that it's all very
quick there we are relevance ranked
results we've got all kinds of files
here we've got PDFs we've got text files
and so forth and so on this works in
practically any language you can use
with the Mac and it's very much improved
in Panther and that's Find by Content
thank you David so there are a number of
challenges to providing search in Search
Kit in particular doing relevance ranked
results you don't just want the
documents that match a particular query
you also want them ranked by how
relevant they are to what it is you were
asking for being able to support
multiple languages as David
mentioned with Search Kit you can index
a large number of different languages
these are human languages not computer
languages and get correct results we
need to be flexible about what a document
means it could be a document on your
disk but it could also be a message in
Mail or an address in Address Book and
we have to be flexible about the kinds
of queries you saw an example of prefix
searching as well you can do word
based searching and some other ones that
I'll talk about in a bit so how does
Search Kit deal with some of these
challenges well one example that comes
up in Apple Help is that you'd like to
be able to
type in a query like how do I connect to
the internet do you really mean to find
documents that contain the exact phrase
how do I connect to the internet well no
that's not what you're looking for what
you want to know is documents that are
relevant to the query how do I connect
to the internet so you're not just
doing a string matching search so what
documents match this query well it's
really hard to know without breaking it
first into words so that's what Search
Kit does it breaks it into words but
which document is relevant to these
words is a document that contains the
words do I and to very relevant to
the query well probably not it's
probably not as relevant as a document
that contains how connect and internet
and so one of the things that Search Kit
does is to try to determine which words
in the query are most relevant to what
it is that you're looking for and it
does this using some statistical
analysis so another challenge in Search
Kit is dealing with different languages
so it's fairly easy to break things into
words when you're dealing with English
say or a language that separates
words by spaces but what if you're
dealing with Japanese there are no
spaces between words and so what Search
Kit does is use a Japanese language
analysis which was developed by the
folks in Apple Japan it's the Language
Analysis framework and it uses this to do
a grammatical analysis of Japanese and
then it can break Japanese into words
and then use the same statistical
techniques to then find the most
relevant document what is a document
is a document just a file on your hard
disk well that would be a fairly limited
capability if that's all you could look
for but you know so we do search text
documents we also search documents
of other file types such as stand back
here for the quicker works word and PDF
and HTML but in addition we support
searching items that are within the
application such as mail messages address
book items or in your application really
any object or maybe an entry from a
database really anything that contains
text can be indexed and searched with
Search Kit the other area of
flexibility that we need is in what a
query is I already gave the example of a
natural language query such as how do
I connect to the internet but we also
want to be able to do prefix queries
such as the one that David demonstrated
so that you can type just the beginnings
of words and even in the middle of your
typing you can do a search and get
relevant results back without having to
complete a word you also want to be able
to do a boolean query in some cases you
want to have an exact match and so
you'll be able to construct a boolean
query and then there's a popular method
used in some of the internet search
engines of marking things with pluses
and minuses to indicate inclusion or
exclusion and then as well you might
also want to be able to find things when
you've already found some items and now
you want to find other items that are
most like the items you've already found
and so we also support a similarity
search so you can provide documents as
the input and produce documents as the
output let's talk a little bit about a
typical usage scenario for using Search
Kit really it's very simple you index
the documents and then you search the
index and then we'll go into more detail
about that of course really the
index's only purpose the only purpose
of creating an index is to make
searching fast you could certainly look
through all the text content of all your
documents to find things if that's what
you wanted to do but you'd be missing a
couple of things you'd be missing speed
first of all but you'd also be missing
the relevance ranking and some of the
statistical analysis that goes into the
index once you've created an index you
don't need the original documents in
order to do searching the index contains
references to those documents not the
content itself but it also contains
statistical information about the words
in the document
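The idea of an index that holds only document references plus word statistics can be sketched in a few lines. This is a conceptual illustration in Python, not Search Kit's C API: the class name `TinyIndex`, the `help://` references, and the inverse-document-frequency weighting are all assumptions chosen for the example, since the talk does not say which statistics Search Kit actually uses.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (only adequate for space-delimited languages)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Hypothetical index: maps each term to {document reference: count}; stores no document text."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_refs = set()

    def add(self, ref, text):
        self.doc_refs.add(ref)
        for term, count in Counter(tokenize(text)).items():
            self.postings[term][ref] = count

    def weight(self, term):
        """Terms that occur in fewer documents get a larger weight."""
        df = len(self.postings.get(term, {}))
        return math.log((1 + len(self.doc_refs)) / (1 + df)) if df else 0.0

    def search(self, query):
        """Return (reference, score) pairs, most relevant first."""
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            w = self.weight(term)
            for ref, count in self.postings.get(term, {}).items():
                scores[ref] += w * count
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


index = TinyIndex()
index.add("help://connect", "how to connect your computer to the internet")
index.add("help://print", "how do i print a page to the printer")
index.add("help://todo", "what do i do to add a to do item")

results = index.search("how do i connect to the internet")
```

In this toy data the word "to" appears in every document and so contributes nothing, while "connect" and "internet" each appear in one document and dominate the score, which is the behavior the Apple Help example describes.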
and the index can be stored in memory
if you like for performance or if
you want it to be persistent because
you're going to use it again then you
can create it in a file the basic
process of indexing is pretty simple you
provide the documents to Search Kit it
analyzes the text and updates the index
the process of searching is equally
simple you provide it with a query and
the index to search and it searches the
index and returns results so that's the
basic outline of course we'll go into
more detail indexing can be done in
three basic steps you open the index
that you'd like to search you add the
documents to the index and then flush it
so opening the index of course could be
opening an existing one or creating a
new one opening an existing one is
pretty basic I won't bother
talking about that creating a new one
there's a decision to be made about what
kind of index you want to create and
there are three different types
the first is an inverted index this kind
of an index maps the terms or words in
each document to the document itself and
provides statistical information about
them this kind of index is used for most
of the query kinds of searches a vector
index maps the other way it maps
documents to terms this kind of index is
actually seldom used more common is the
third kind of index which is an inverted
vector index which does both so you
might ask you know why not just always
create an inverted vector index because
it obviously has the capabilities of the
first two and really the answer is just
space and performance it takes a bit
longer to create there's more information to
put in it and it takes more space on the
disk or in memory so it turns out that
for most purposes an inverted index is
really the kind that
you're going to want to use both for
performance reasons and for space
reasons but just breaking it down a bit
more if you're doing ranked prefix
required or boolean searching the four
kinds that I talked about four of the
kinds that I talked about then you
really want to be using an inverted
index it'll give
you the best performance and size ratio
if you're going to be doing similarity
searching it's okay to use an inverted
index it will still work it's going to
be a bit lower performance in doing
similarity searches than the
inverted vector index is going
to be but if you're only doing similarity
searching pretty seldom then that's
probably the best choice if you're going
to do a lot of similarity searching then
you probably want an inverted vector
index OK so we've either opened an
existing index or created one now we've got
to add the documents so as I mentioned
before documents can be any of a number
of different kinds it could be something
on your disk something inside your
application for each of these documents
what you provide is a document reference
the index doesn't need to store a lot
about the document itself other than
a reference and the statistical
information that it wants to gather and
when you get results back you're going
to get references back rather than the
actual document itself these references
are just URLs or they're created from
URLs I should say to be more precise
the other thing you need to provide is
of course the text of the document and
there are a number of ways to do this if you
have a file such as the first example on
the left then you use a file URL
reference and that tells Search Kit that
it's something that it knows how to read
and the text from the document will be
read automatically in addition if you'd
like it to handle multiple file formats
we have built-in support for a number of
them in Search Kit and all you need to
do is load the default extractors which
is a call in Search Kit and that means
whenever you read a file it will do the
text extraction for you
but if you'd like to handle any kind of
item in particular you want to handle
something that's inside your application
or you have a document format that we
don't know about that you know how to
get the text from then you can provide
the text as a CFString that's the
universal method and
the document reference at that point can be
any kind of scheme and any kind of
hierarchy that you'd like to create
within your application and then the
last step is simply flushing the index
to disk or to memory and at that point
you're done indexing so it's actually
pretty simple now you're ready to search
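The three indexing steps (open or create the index, add documents by reference and text, flush) can be sketched as follows. This is a Python illustration of the workflow only: `InvertedIndex`, `open_or_create`, `add_document`, the JSON file format, and the example URLs are all hypothetical and are not Search Kit's functions or on-disk format.

```python
import json
import os
import re
import tempfile


class InvertedIndex:
    """Hypothetical file-backed inverted index: term -> {document URL: occurrences}."""

    def __init__(self, path, postings):
        self.path = path
        self.postings = postings

    @classmethod
    def open_or_create(cls, path):
        """Step 1: open an existing index file, or start a new empty index."""
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return cls(path, json.load(f))
        return cls(path, {})

    def add_document(self, url, text):
        """Step 2: index the caller-supplied text under the caller-supplied reference."""
        for refs in self.postings.values():
            refs.pop(url, None)  # re-adding a document replaces its old entries
        for term in re.findall(r"\w+", text.lower()):
            refs = self.postings.setdefault(term, {})
            refs[url] = refs.get(url, 0) + 1

    def flush(self):
        """Step 3: write pending changes to the backing file."""
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.postings, f)


path = os.path.join(tempfile.mkdtemp(), "demo.index")
index = InvertedIndex.open_or_create(path)
index.add_document("file:///demo/quartz.txt", "Quartz draws text and graphics")
index.add_document("myapp://mail/message/21", "the Quartz demo is on Friday")
index.flush()

# Reopening shows that the index persists references and counts, not the text itself.
reopened = InvertedIndex.open_or_create(path)
```

The second document uses a made-up `myapp://` scheme to mirror the point that a reference for an in-application item can follow any scheme and hierarchy the application chooses.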
so searching has three basic steps as
well and I'll talk about each of those first
you create the search group then you
send the query to Search Kit and process
the results I've got to hit the blank
button there okay so searching
one thing that Search Kit supports is
the ability to have more than one index
to be searched at the same time now you
might ask why we do this one reason is
to support multiple attributes it might
be that the objects in your application
have not just one text attribute the
content but they may have other text
attributes maybe it's a description of a
movie or maybe you know additional
attributes that you'd like to index as
well one example of this is in the
Finder when Find by Content does
indexing in the Finder it indexes not only
the names of the files but also the
contents of the files and you'll see in
these results the top few items there
were found because of their names the
search was for Quartz right and so a
couple of files had Quartz in the name
and the bulk of the files there had
Quartz in the content so there are two
indexes and they get searched
simultaneously so creating a search
group allows you to search multiple
indexes at the same time now you
might ask why use a search group why not
just search each index individually and
then combine the results somehow well
the reason is I guess I gave
an example let me give another example
here before I do and that's that
multiple containers is another reason
why you might create multiple indexes an
example here is Mail so in Mail each
mailbox is indexed distinctly so when
you want to do an entire content search
for all the mailboxes in Mail at the
same time it creates a search group in
order to search multiple mailboxes
all right now we'll get to the point about
normalizing ranking if you were to do
searching on individual indexes and try
to combine the results yourself the
problem is that the relevance ranking
would only be relative to the content of
each individual index you haven't
normalized the rankings across all the
indexes so by creating a search group
the statistics are normalized and then
the documents you find will be the most
relevant across all the different
categories and of course it's possible
to create a search group with just one
index and that may be a common case as
well
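Why a search group ranks better than merging per-index results can be shown with a small sketch. It is a Python illustration under assumptions: the helper names `build_index` and `search_group`, the mailbox data, and the pooled inverse-document-frequency weighting are invented for the example and do not describe how Search Kit normalizes its statistics.

```python
import math
from collections import Counter


def build_index(docs):
    """docs: {reference: text}. Returns {reference: term counts} for one index."""
    return {ref: Counter(text.lower().split()) for ref, text in docs.items()}


def search_group(indexes, query):
    """Score every document with statistics pooled over all indexes in the group.

    Returns (score, index name, document reference) tuples, best first.
    """
    total_docs = sum(len(idx) for idx in indexes.values())
    terms = set(query.lower().split())
    # Document frequency of each query term across the whole group, not per index.
    df = {
        t: sum(1 for idx in indexes.values() for counts in idx.values() if t in counts)
        for t in terms
    }
    weight = {t: math.log((1 + total_docs) / (1 + df[t])) for t in terms}
    hits = []
    for name, idx in indexes.items():
        for ref, counts in idx.items():
            score = sum(counts[t] * weight[t] for t in terms)
            if score > 0:
                hits.append((score, name, ref))
    return sorted(hits, key=lambda h: (-h[0], h[1], h[2]))


group = {
    "inbox": build_index({
        "msg1": "itunes update available",
        "msg2": "itunes gift card",
        "msg3": "itunes receipt",
    }),
    "archive": build_index({
        "msg9": "beatles abbey road on itunes",
        "msg10": "lunch on friday",
    }),
}

hits = search_group(group, "itunes beatles")
```

Because the weights come from one shared set of statistics, scores from the two mailboxes are directly comparable, and each hit also carries the name of the index it came from.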
okay so you've got your search group
created and now it's time to send the
query to Search Kit to do the search but
we're not quite ready for that because
we have to determine what kind of search
we're going to do and the kinds as I
mentioned before are ranked prefix
boolean required and similarity
searching just to go into a little more
detail about each one ranked supports
kind of a natural language kind of query
so if you're trying to do something like
Apple Help does where you've got a large
number of documents you're trying to
find the ones that are most relevant
from a sort of English or other language
point of view then this is the kind of
search you're going to want to do so the
user can ask a question they can name a
topic or they can provide just related
words and it'll do a reasonably good job
of finding the documents that are most
relevant again an example of this is
Apple Help prefix searching allows you
to do the search as you type so if you
want one of those interfaces where as
you type the results come down like
you'll see in Address Book or in Mail as
well then this is the kind of searching
you want to do it's basically just like
the previous one except that it
supports prefixes which has the
advantage of supporting search as you
type one disadvantage it has is of
course you get more matches than might
be intended because even if you type
what seems like a complete word it may
be the prefix of some other word and so
you'll probably get more results than
you would otherwise an example of this is
in Address Book boolean searching is
just what you think it would be you know
and or not and you can combine things
with parentheses etc and this one's
maybe less often used because most users
aren't really familiar with doing this
except advanced users but it is in fact
supported in Mail so you'll notice if
you type an ampersand or a vertical bar
in Mail it'll do boolean searching and
the way it does this is it looks at the
query that you typed in and if it
contains any of those symbols then it
sends it on as a boolean search and
otherwise it sends it on as a ranked
or rather a prefix search
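The dispatch described for Mail (boolean if the query contains an ampersand or vertical bar, prefix otherwise) and the over-matching behavior of prefix search can be sketched like this. The function names and sample messages are hypothetical, and the scoring (count of matched prefixes) is a deliberate simplification rather than Search Kit's ranking.

```python
def choose_search_kind(query):
    """Pick the search kind the way the talk describes Mail doing it."""
    return "boolean" if any(op in query for op in "&|") else "prefix"


def prefix_search(docs, query):
    """Return references whose text has a word starting with at least one query word.

    docs: {reference: text}. Results are ordered by how many query prefixes matched.
    """
    prefixes = query.lower().split()
    scored = []
    for ref, text in docs.items():
        words = text.lower().split()
        matched = sum(1 for p in prefixes if any(w.startswith(p) for w in words))
        if matched:
            scored.append((-matched, ref))
    return [ref for _, ref in sorted(scored)]


messages = {
    "msg1": "abbey road remastered",
    "msg2": "abbreviations list",
    "msg3": "lunch on friday",
}

kind_plain = choose_search_kind("abb")
kind_boolean = choose_search_kind("beatles & itunes")
matches = prefix_search(messages, "abb")
```

Typing "abb" finds the Abbey Road message but also the unrelated "abbreviations" one, which is the disadvantage mentioned above: a prefix query tends to return more matches than the user intended.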
and then required searching is the last
type and that allows you to add pluses
and minuses we don't have any
examples in any of the Apple apps of
using this but it's a popular search
technique in use on the internet and
then similarity searching is a different
call for that one you provide the sample
documents you don't have any sort of
text query at all and Search Kit
returns documents that are similar to
the ones that you provided okay so
now we're ready to send the query we
decided let's say to do a ranked search in
this case so you send the query to
Search Kit Search Kit efficiently
searches the index and then returns the
results and then the next step is of
course to process those results so the
results contain a number of bits of
information that are helpful in
displaying the results to your user one
is the document reference which is
really the same thing you provided
before it's up to you what kind of a
hierarchy you want to create to make it
easy to find items in your application
in this case as an example I said mail
message 21 now that's one way to
identify a document you might have more of
a hierarchy like which mailbox it's
in and maybe by subject or something
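One way to give in-application items a hierarchical reference, and to map a returned reference back to an object, is to encode the hierarchy in a URL. The sketch below is an assumption for illustration: the `x-demo-mail` scheme, the mailbox/message layout, and both function names are invented, and nothing here reflects how Mail actually names its documents.

```python
from urllib.parse import quote, unquote, urlsplit

SCHEME = "x-demo-mail"  # hypothetical application-defined scheme


def make_reference(mailbox, message_id):
    """Build a reference such as x-demo-mail://Inbox/message/21."""
    return f"{SCHEME}://{quote(mailbox, safe='')}/message/{message_id}"


def resolve_reference(url):
    """Map a reference returned by a search back to (mailbox, message id)."""
    parts = urlsplit(url)
    kind, message_id = parts.path.strip("/").split("/")
    if parts.scheme != SCHEME or kind != "message":
        raise ValueError(f"not a mail message reference: {url}")
    return unquote(parts.netloc), int(message_id)


reference = make_reference("Sent Items", 21)
mailbox, message_id = resolve_reference(reference)
```

Keeping the mailbox in the reference means a search result alone is enough to locate the message, without a separate lookup table.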
so it's up to you to then map that back
and say okay these are the documents I
want to display based on that reference
the second bit of information it returns
is the relevance ranking and this is
of course used in displaying the results
and the third in the case that you have
multiple indexes that you were searching
simultaneously then it's going to return
a reference to which index this
particular result came from and the
reason for that in the Mail example is
you need to know well let's go through
all of them the document reference tells
you which message the relevance rank
tells you of course the relevance column
and the third one I mentioned the index
reference is to tell you which mailbox
that came from because it has multiple
indexes so it has to know where it
came from so that's basically searching
in three steps and at that point you've
added searching to your application so
how easy is that now one of the things
that Search Kit is capable of doing if
you use it in just the right way is
summarization but to make it easy we've
brought that API up at a higher level so
you can directly do summarization of
documents and I'd like to invite David
back up to talk about summarization
[Applause]
hello again summarization is my favorite
feature of OS X and I'm wondering if I
could see a show of hands how many of
you have discovered the summary service
some of you not all of you it'll be
my pleasure to show that off
we've had summarization as a service
since 10.1 and it became better in
10.2 it'll be essentially the same in
Panther now we've provided this you
heard me mention Find by Content
before and it's a high level interface
to Search Kit and it lives in Application
Services which you probably already
link to and you want to find a header
called FindByContent.h to see the
syntax there's Search Kit there's Find
by Content
Find by Content calls Search Kit now to
do summarization of text it uses a very
simple technique and it's using
Search Kit it's indexing the sentences
a document now is a sentence and it takes
that index and it searches for the best
sentences it constructs an idea
of what is the meaningful stuff in this
document in this thing that we're
summarizing looks for that among the
sentences and gives you back the best
ones so you see there it's indexing and
it's searching it's exactly like what
Wayne described and here is the demo
they should play a little music while we
walk back and forth here I have a weblog
a document here for an
interesting thing that I found on the
web it's an article by a guy named David
Stutz that he wrote as part of his
farewell to Microsoft when he quit
working there I love it because it has
this great sentence at the end I don't
know if you can read it he says stop
looking over your shoulder and invent
something but you know when I get this
on the web I don't really know if it's
something that I want to read or not
it's kind of long so here in Safari I'm
going to select all go to the Safari
menu Services Summarize and there's the
summary and it's giving me a default
size summary I can read that or I can
take the slider here I can find the best
sentence I can find the paragraph that
contains the best sentence I can zoom up
a little bit to get more I can just
scrub back and forth and get more and
less I can get the entire text like that
is that cool that's the demo but because
it's a service a lot of people have
never even found it at all and this
should be in your application by the
fact that you're here at all i'm
guessing that quite a few of you have
applications that are text oriented so
why would you use this well to begin
with it's useful as I just described you
can make little thumbnails you can make
these dynamic summaries that a user can
play around with and generally improve
the usefulness of your application and
you can advertise it and it's a lot of
fun text is not usually much fun
maybe you envy the graphics people who
can put all this wonderful Quartz stuff
in their applications you can put in
summarization with dynamic size
summaries and it's quite easy
if you're doing a fixed-size
summarization like to create a set of
thumbnails of text there's one function
that you call you give it a CFString
you tell it how many sentences you want
if you pass 0 it'll figure out a good
number of sentences it gives you back a
CFString it's that simple the resizable
summarization like what you saw there is
twice as hard there are two functions
you have to call there are two
others but they're not interesting the first
function does all the hard work and you
can do that before the user even asks
for a summary do that in the background
on a thread it does all of the analysis
of the text into sentences it does the
indexing it gives back an object
that you can keep around and
from that object you can construct the
different sized summaries very very
quickly as you saw one thing you should
think of though is that that object
takes about as much memory as the
original text how does it work it works
by some very simple stuff it's
statistics on the words in the sentences
it's just Search Kit searching like you
saw Wayne doing and the key is that it's
analyzing the text into sentences and
we do that probably as well as
any other code that I've seen we
only keep complete sentences we throw
out everything else and that makes the
summary actually come out better because
it throws away page footers it throws
away little headings and things so it's
only keeping the real
sentences and it does not care anything
about meaning and grammar and that's
summarization thank you very much
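The two-phase approach described here (analyze the text into sentences once, then build summaries of any size cheaply) can be sketched in Python. Everything in this sketch is an assumption for illustration: the `Summarizer` class, the regular expression used to keep only punctuated sentences, and the scoring by how many sentences share each word are stand-ins, not the Find by Content implementation.

```python
import re
from collections import Counter


class Summarizer:
    """Hypothetical two-phase summarizer: expensive analysis up front, cheap resizing after."""

    def __init__(self, text):
        # Keep only runs that end in sentence punctuation on one line, which
        # drops headings and page footers that have no terminating punctuation.
        found = re.findall(r"[^.!?\n]+[.!?]", text)
        self.sentences = [s.strip() for s in found]
        words = [self._words(s) for s in self.sentences]
        # For each word, the number of sentences that contain it.
        freq = Counter(w for ws in words for w in ws)
        scores = [sum(freq[w] for w in ws) / len(ws) if ws else 0.0 for ws in words]
        # Rank sentence positions once; every summary size reuses this ranking.
        self.ranking = sorted(range(len(scores)), key=lambda i: (-scores[i], i))

    @staticmethod
    def _words(sentence):
        """Distinct lowercase words longer than three letters."""
        return {w for w in re.findall(r"[a-z]+", sentence.lower()) if len(w) > 3}

    def summary(self, count):
        """Return the `count` best sentences, in their original order."""
        chosen = sorted(self.ranking[:count])
        return " ".join(self.sentences[i] for i in chosen)


text = (
    "Search Kit Overview\n"
    "Search Kit indexes text. Search Kit ranks text by relevance. Lunch was good.\n"
    "Page 3\n"
)
summarizer = Summarizer(text)
one_sentence = summarizer.summary(1)
everything = summarizer.summary(3)
```

The constructor corresponds to the work that could run on a background thread before the user asks for anything, and `summary` is the part a slider could call repeatedly, since it only slices a ranking that already exists.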
thanks David
so as you've seen you can add
searching to your application fairly
easily you can add summarization to your
application fairly easily and both of
these APIs are available in Mac OS X
Panther summarization is actually
available in Jaguar so Search Kit is a
powerful text searching framework that's
now available in Mac OS X it indexes
anything that has text in it whether
it's in your application or on disk it
provides powerful fast searching and
summarization so now the cat's out of the
bag and it's up to you guys so I'd like
to point out some other sessions that
may be of interest to you if you use
Unicode in your application well
of course these sessions have already
occurred but for folks watching DVDs if
you'd like to look at the session
if you're using Unicode in your
application the session number 40 for
Unicode for Japanese Chinese and
everything else if you're
interested in searching the address book
then you probably want to use the
Address Book APIs rather than the
Search Kit APIs and similarly if
you're interested in providing Apple
Help Apple Help uses Search Kit underneath
but you want to go look at the
session for Apple Help number 408 and if
you're interested in indexing things
off the internet then you probably want
to look at the session an internet
technology session on the advanced
Foundation URL API and of course our
friend John is the person to contact
and for more information there is a
Search Kit reference that's available on
the ADC site for ADC members which is a
free membership and as well you can look
at the header files on the Panther CD
that you have and if you just go into
System Library Frameworks CoreServices
you'll see in the Frameworks folder
there's a Search Kit folder which has
all the headers and those are fairly
well documented with comments as well
and similarly you can look at the Find
by Content and summarization APIs there
you