WWDC2004 Session 412

Transcript

Kind: captions
Language: en
today we're going to talk about how to
move your application to Unicode well
one question you might ask is why should
i move my application to unicode and
there are several reasons probably the
most important reason is that customer
requirements are changing unicode is the
character set used on Windows more and
more it's the character set that
is used on the internet and so unicode
support is very important for
cross-platform compatibility in addition
in asian markets China Japan Korea and
elsewhere the number of characters that
customers require has been increasing
more and more and the old legacy
character sets that we used on Mac OS 9
are limited in the number of characters
they can support and customers need more
characters than you can supply using
those character sets and Unicode is the
solution for that also and as a result
starting in Tiger we're deprecating the
old world script api's thank you so
they're still there your app will still
continue to run but we strongly
recommend that you not use these api's
for future application development and
that includes QuickDraw Text the Script
Manager Text Utilities basically
everything related to world script
fortunately Mac OS X has a complete set
of api's that work with Unicode so it's
easy to move your application from the
world script world to the Unicode world
and we're going to talk about how to do
that today so it's definitely time to
move your application to Unicode and
lots of big app developers are doing
that the latest release of Microsoft
Office Office 2004 is a
Unicode application suite Apple's own
FileMaker Pro 7 is a Unicode application
so it's time to get on board so today
we're going to cover in detail what's
required to move your world script based
application to Unicode and we're going
to do a lightning tour of all the
aspects of an application how to store
text
your human interface and your
localization drawing editing and input
sorting doing transformations on text
and analyzing it and also formatting and
scanning of dates times and numbers and
also calendar manipulations so we're not
going to go into incredible detail on
any of those areas but to sort of give
you a tour so you know which APIs to
use to convert your application to
Unicode and all of the api's we're going
to talk about today are in the core
foundation on and carbon areas of Mac OS
10 and we're going to show how to map
from the older technologies the script
manager text utilities and so on to
these newer api's but before we do that
I'd like to just spend a couple of
slides talking about what's new in
Tiger in the international area and we
do have some new features and the first
one is something that people have been
asking for for a long time in Tiger we
have the first stage of our support for
opentype font layout tables so this
allows opentype fonts to work in unicode
applications if you're using our
standard Unicode api's you don't have to
do anything special in your application
opentype fonts will just work so we're
supporting features like ligatures and
language shaping in certain cases and
you'll see the support for open type
layout increase more and more as time
goes on something that was missing from
our Unicode API suite was string
transliterations this is something you
could do in the script manager there is
an API in the Text Utilities to do it
but there was no unicode equivalent and
we've now got that in Tiger we have even
more locale data available and as with
panther much of that locale data is only
available through unicode api's so it's
very important to move your application
to Unicode so you can take advantage of
all the languages in all the locales
that Mac OS X supports
just as a side note this is a carbon
session but there has been a carbon date
control for a long time and new in
Tiger is an equivalent Cocoa date control
so that's available we had some support
for non-Gregorian calendars in Panther
and we're improving that support in
Tiger so in addition to the japanese and
thai buddhist calendars which we had in
panther we're adding islamic and hebrew
calendar support and in addition in
panther you could only use one of the
non-Gregorian calendars if you were
using the date and time formats that
went with it so for example you can only
use the japanese calendar with the
japanese locale or the thai buddhist
calendar with the thai locale in tiger
you can select the calendar separately
from the date and time formats so here you see
examples of the Islamic calendar and the
Japanese calendar being used with the US
English locale we've also added more
control over a number and date
formatting so we introduced CF date
formatter and CF number formatter in
Panther but now you have more control
over how they operate there are more
options there's also a new option for
number spell out so in addition to all
the formatting options we had before you
can spell out numbers there's an example
there 123.45 and this is not just
for English it works with any of the
locales that Mac OS X supports so that's
a new feature every release we try to
extend our Unicode coverage a little
bit and so this time we're moving more
Roman and Greek and Cyrillic support
into our core fonts so we used to have
separate fonts for example for Cyrillic
support and we're extending our core
fonts and that's Times Helvetica and
Courier to support a wider variety of
Roman and Greek characters and also
adding Cyrillic in addition we're
covering some new Unicode blocks that
Tomeo Braille Yijing hexagram symbols
Tai Xuan Jing and Tai Xuan Jing symbols
and not all of this is in the preview
release that you have in your hands but
it will be showing up and there are
possibly more blocks we might be
covering that aren't listed here that
are still a little bit up in the air but
every release we try to add a little bit
more so oh and one last thing some of
the language IDs that we've used in Mac
OS X up till now have not exactly
followed the standards in the area so
for example we use zh_TW for
traditional Chinese and in Tiger we're
adding support to move to canonical
language IDs there are some examples and
there are new APIs that have been added
to CF locale that can help you
canonicalize language IDs so that if you
have a localization for your application
that uses an old ID and you need to
compare it against the new IDs these
APIs can help you make sure you do the
comparison in a canonical way so in
order to show some of the new features
in Tiger I'd like to ask John Jenkins to
come up on stage and he'll give us a
short demonstration of what's new in
Tiger all right as Deborah says there's an
awful lot that's new in Tiger and we
don't really have time to go over
anywhere near all of it so I'm just
going to hit some of the highlights some
of the really exciting things and we'll
start out with a exciting thing which is
the transliteration api's so here we
have a sentence or a word I guess which
is in Latin and I want to know what this
would look like for example in Greek and
I can change it or I want to know what
it would look like in katakana and I can
change it although I'm not quite sure
that the accent works or the the upside
down exclamation mark really works in Japanese but
that's okay also useful is turning it to
xml hex this is very handy this goes
through and takes all of the non-ascii
letters and converts them into the
numeric
entities that you would use on a web
page or in XML or if you really want to
know what's going on you can always get
the unicode name of the character that's
not ASCII which is also handy and
of course we can go the other way we
have a lot of things that will let us
transform to Latin if I get a sentence
here for example this is the beginning
the first line of the Iliad for all the
people who are fans of the movie troy i
can turn it into latin i could also
strip it of its combining marks if i
wanted to remove all of those i can take
other examples and turn them into latin
so i have an arabic sentence i hope and
i can turn that into latin or it can
take something that's in japanese and
again i can turn it into latin so i can
get a first approximation
transliteration one thing that's useful
for chinese here i have an instance of a
sentence which is partly in Cyrillic
partly in Chinese I can turn the whole
thing into Latin if I want or I can just
take the Chinese part and turn it into
Latin so a lot of useful transliteration
APIs are available now now Deborah also
mentioned that we have a lot more
locales let's just bring up a list of
our locales these are the locales that
are available on the system in Tiger
together with some of the information
you can get about them this is pretty
useful information too the one thing
which I don't think is wildly useful is
metric that is whether or not this
locale uses the metric system because it
amounts to whether or not you are in the
United States but that's okay currently
I you know it defaults to showing them
in the current in the global locale but
we can switch back and forth so I can
see what each of these looks like and
you'll notice that some of these are
showing up in the last resort font these
are locales that we don't have system
support for but that's ok on Mac OS X
it's very easy for third parties to add
support so if I want to take one of
these and I want to see a date say in
the Islamic calendar
let's not use a unsupported locale okay
German so I want to see what today's
date is this is today's date in the
Islamic calendar as shown in Latin
letters in the German locale so we have
a great deal of flexibility for date and
time formatting that we didn't use to
have all right so that covers two the
third is the really exciting thing I
think this is something as Deborah says
that people have been desperately asking
us for for a long time and that's
opentype support so it's not fully wired
up yet but it is enough there that we
can show it to you so I'm going to take
this is world text which is a standard
application that comes with the
developer tools and I'm going to switch
to Adobe Caslon Pro this is unaltered
nothing up my sleeves straight out of
the box Adobe Caslon and I can start
typing see let's come up with something
here okay and you'll notice that as i
typed the ligature formed automatically
fl formed automatically fi formed
automatically again so the ligatures are
forming automatically as is the case
with AAT fonts at the moment this will
happen with opentype fonts as well if i
bring up the typography palette which
was introduced in Jaguar sorry in ten
wait I always get the code name in 10.3
there we go you can see more of what's
available in the font for example i can
turn on rare ligatures and as the system
does now with AAT fonts it does with
opentype fonts in Tiger it scans
through it sees what is in the font it
gives us support for all of these things
i can turn on lining figures i can turn
on superiors let me see if that works
yeah oh that's kind of cool and so on so
there's a lot of flexibility here that's
available the font has it built in the
system just picks it up so look forward
to it thank you thank you
[Applause]
okay so now that we've seen what's new
in Tiger let's start moving into the
detailed part of the presentation we're
going to go in our whirlwind tour of the
world script api's and what to replace
them with so before we start talking
about things that you can do with text
we have to talk about how you store your
text in the first place so that's our
first topic before we do that let's have
a quick refresher on what's different
about unicode compared to world script
the you know uni in Unicode means one
and the most important thing about
unicode is there's only one character
set you have to worry about unlike world
script where there were many Unicode
stores characters in 16-bit units in
the utf-16 form which is what we use in
Mac OS X in Cocoa and Carbon but since
unicode has more than 96,000 characters
how do you fit that in a 16-bit unit
the answer is you can't and some
characters need more than one unit to be
stored there's an example right there
the Unicode character U+2000B which is one of
the plane 2 Han characters is actually
stored as two 16-bit units called a
surrogate pair now when I talk about a
character here a Unicode character
that's the programmers concept of a
character what the user thinks of as
character can actually be larger than
that what the user thinks of as a
character we call a grapheme or a
cluster and it can consist of one
unicode character or several so here's
here's a couple of examples we have the
word résumé but the accented e's are
represented by base letter e plus what's
called a combining acute accent so you
have two Unicode characters the e and the
accent that represent one user
character in the next example there's
even more this is the Vietnamese word
for Vietnamese and you can see that we
have an e with two combining accents
dot below and a circumflex above so this
is 3 unicode characters that represent
what the user thinks of as a single
character to make life a little more
interesting there's actually multiple
ways to do this in unicode in addition
to the base letter and combining marks
that we used in this example there are
also precomposed versions of these
characters so there is an e with an
acute accent that's a single Unicode
character and there is an e with a dot
below and a circumflex above that's a
single Unicode character but you can't
always represent every character in a
totally precomposed form and conversely
you can't always represent a given
character in a totally decomposed form
so even though there are versions of
Unicode that we call precomposed and
decomposed they really mean as
precomposed as possible and as
decomposed as possible so even in
precomposed unicode you do have to worry
about things like combining marks
because they can be present okay so in
the world script world the way you
stored your text was in a pascal string
or in a c string those don't support
Unicode or at least not in the form that
we needed for carbon and cocoa so what
do you do in the new world wealthy if
you using core foundation you can use CF
string or CF mutable string or new and
tiger is CF attributed string if you
need to work at a lower level you can
just store unico text as a raise of unit
car which is a type defined in carbon or
actually at a very low level so there
are a lot of api's for CF string and
friends and I don't have time to go
through them all but just to give you a
flavor of how the API works here's a few
examples you can create a CF string
using an array of UniChars in this
example we pass null which indicates the
default storage allocator for core
foundation we pass an array of UniChars
and we pass their number and that will
give us back a CF string object you can
also get characters back from a CF
string
in order to get the best performance you
can use an inline buffer and the way you
do that is you set up an inline buffer
on a CF string and then you can ask to
get a character at any index and the
inline buffer will take care of batching
access to the string so you get the most
efficient access possible new in Tiger
is attributed string support and an
attributed string wraps an existing CF
string as opposed to you putting the
characters into the attributed string
directly so you can create an attributed
string by passing a CF string and a
dictionary of attributes and they can be
totally arbitrary attributes that's not
just a fixed set although there is there
are a predefined set and you can also
get the attributes at a particular point
of an attributed string so you pass the
attributed string the index where you
want to get attributes and also a
pointer to a range the range gets set to
the run of the attributes so you know
how big a stretch of text has those
particular attributes and then the
function call returns the dictionary of
attributes that apply to that range now
something that those of you who have
programmed in WorldScript know is that
when you're dealing with double-byte
character sets you can't just break a
string at an arbitrary byte offset
because it might be in the middle of a
double-byte character and you used
CharacterByteType to determine if there
was a safe place to break we can't use
CharacterByteType in a Unicode
application but there's a similar issue
to worry about and that is the user
characters that I talked about earlier
or what's called a cluster or grapheme
you don't want to break in the middle of
that because if you do and then you only
display the first part before the
break you'll actually mangle what the
user thinks of as their character and
display the wrong thing so there are
api's available to help you find a safe
place to break if you're using a CF
string you can use CF string get range of
composed characters at index and that
will find a safe place to break if
you're using a UniChar
array then you can use the Unicode
utilities find text break API and look
for a cluster boundary and that will
also tell you a safe place to break so
here's an example we have a string and
an offset and we want to figure out a
safe place to break so we call CF string
get range of composed characters at
index we pass our string and we pass
the index that is the place where we
would like to break and what the API
returns is a range which is the
beginning and the end of the user
character or the cluster that
corresponds to that offset so in this
case we take that range and we go to the
end of it we take the beginning location
and add the length in and that's the
place where we can break we could also
just use the beginning part of the range
instead of the end that depends on how
you want your application to work
another thing you need to do is to
figure out what kind of character a
given character is is it a letter is it a
digit so forth in the world script world
you used CharacterType for that but that
doesn't work with Unicode so we can't
use it anymore and in the Unicode world
there's two ways to do this there's the CF
character set APIs in Core Foundation
and at a lower level there's UCGetCharProperty
so here's an example to
determine whether a character is a
decimal digit now you might think to
determine whether something is a decimal
digit you can just say well is it ASCII
0 through 9 but it turns out in Unicode
there are lots more decimal digits than
just those there are decimal digits for
indic languages for arabic and all of
those are just as valid as decimal
digits as the ASCII versions that we're
used to so to test whether any character
is a decimal digit we can use CF
character set so first we get a
predefined CF character set in this case
the set of decimal digits and then we
can call CF character set is long
character member in order to
determine whether a given Unicode
character is a member of that set or not
and then we can branch one way
or the other depending on the answer
well it would be wonderful if your
application could deal only with Unicode
and never have to think about anything
else but there's still a lot of data out
there that's not in Unicode there's
documents that users have I have
documents on my system that date back
almost to the time the Mac was
introduced and those are definitely not
in Unicode because it hadn't even been
invented then there are protocols on the
internet that still require non unicode
character sets the web is a big example
you can use Unicode on the web but many many
web pages are not in Unicode so you need
to be able to move between the Unicode
world in the non-unicode world and we've
had support in Mac OS X for this for a
long time in the form of the text
encoding converter which is a fairly
low-level API that actually
dates back to Mac OS 9 but there's
easier ways to do it using CF string and
again there's a wide variety of APIs
that you can use to do this and we're
only going to go through a couple of
them in the first example you can create
a CF string using C string and all you
need to do is pass the C string which is
null-terminated and a text encoding to
use and that will give you back a CF
string which is in Unicode if you want a
little bit more control for example your
string isn't null-terminated you want to
control what happens if the data can't
be converted completely then you can use
CF string create with bytes which gives
you finer control so one question you
might have is what do I pass for that
text encoding and that's actually a
non-trivial question it depends a lot on
where the data is coming from if you're
lucky and the data is coming say from an
internet protocol and it's tagged with
its character set then you know what
encoding to pass but sometimes you have
to guess and two good guesses are the
encoding that corresponds to your
application's human interface so for
example if your application is running
in Japanese and you call GetApplicationTextEncoding
you'll get MacJapanese
back
a different encoding is CF
string get system encoding and that's
the text encoding that corresponds to
the user's most preferred language now
the users most preferred language is not
always the same as the language that
your application is running in and the
reason for that is that the user's
most preferred language may be one that
your application is not localized into
so for example if the user's most
preferred language is Inuktitut and
you don't have an Inuktitut
localization in your application then
you're not going to be running in
Inuktitut in that case the
application text encoding and the user's
text encoding are not going to match so
which one of these you call depends on
your application and where the data is
coming from another thing you have to
worry about on the internet or when
sending Unicode to Windows is that other
systems do not deal with the decomposed
form of Unicode quite as well as Mac
OS X does and so it's better to convert
Unicode to what's called normalization
form C which is the as precomposed as
possible form before you send it to
those systems and you can use CF string
normalized to do that and a new feature
in Tiger is that you can determine the
text encoding used by MLTE so if
you're using the multilingual text
engine which is the carbon unicode text
engine you can now specify the text
encoding to use when opening or saving
to plain text files and that's a new
feature in tiger okay so we've covered
the basics of how to store your
text and how to get it in and out of
your application but there's more text
to your application than just what's in
the users document there's also the text
that you create yourself for your human
interface and let's spend a little time
talking about that well in the old world
you use the resource manager to store
the localized pieces of your application
you used resources like DLOG or MENU or
if you're using PowerPlant maybe you're
using PPob
resources well those resources are all
based on the old world script world and
they can't support Unicode so the modern
equivalents for a Unicode application or
indeed any modern application is the
bundle which I'm sure you've all heard
about already but I'll just give a
very brief review an application bundle
is a directory tree in the file system
that's made to look to the end-user as
if it's a single file you can store non
localized files localized files files of
any type actually it's totally up to you
movies strings what have you localized
files are stored in an .lproj directory
and the .lproj directory is tagged with
the ISO language code for the particular
language that that localization
corresponds to so for example en for
English ja for Japanese one of the most
important kinds of things you can store
in your application bundle are interface
builder files or nib files and those are
the files that contain UI elements and
replace the old resources that were used
with the control manager and the
dialogue manager and so forth the ones
that didn't support Unicode and there's
a small set of api's you can use for
nibs with carbon applications you can
create a nib reference from your
application bundle and once you have
that you can get your menu bar out you
can get menus out you can get windows
out with HIView hierarchies it's very
straightforward to use another thing you
have to worry about in localizing is
strings so in the world script world you
would use an 'STR ' or 'STR#' resource
to store your localized strings of
course those resources don't support
Unicode so we need a modern equivalent
and the modern equivalent is the
localized strings file that's just
basically a plain text file it can be in
utf-8 or utf-16 and you use CF copy
localized string to get a localized
string out so here's an example we have
two strings in our localized strings
file one asks the user a question and I
guess this application needs a little
work because it doesn't let the user
give any other answer
but yes that could be a problem but
that's for another session so in this
case we have the English version of the
question in the English version of the
answer in the in the English version of
the file and then we have the equivalent
Japanese translation in the Japanese
version and you'll notice that the keys
are the same but the string that the key
corresponds to differs depending on the
localization and we can use CF copy
localized string we just pass the string
that acts as the key and the second
string in the call is actually just
there for documentation purposes if you
run the genstrings tool that will get
written out as a comment but that's all
it's used for and the function returns
the proper localized version of the
CF string and off we go ok well a big
part of any application that deals with
text is drawing it editing it and
inputting it there are several api's
available to do that now when you talk
about drawing text you can sort of
partition applications into two classes
or at least you can partition text
drawing into two classes first is
drawing short strings and in the world
script world or QuickDraw namely
QuickDraw Text you did that with either
DrawString or TextBox the Unicode
equivalents are DrawThemeTextBox
which is very straightforward it just
takes a CF string and you can use that
when you're happy to just use one of the
standard theme fonts if you need more
control you can call one of two MLTE
APIs either TXNDrawCFStringTextBox
or TXNDrawUnicodeTextBox and the
only difference between them is one
takes a CF string and the other takes a
UniChar * so depending on how your
text is stored and that gives you
actually a lot more control not just
fonts but also you can specify a CG
context you can control things like
rotation and so on
now sometimes an application has to draw
large amounts of text and by that I mean
drawing a document implementing a text
editing engine implementing a web
browser where you have to paint large
amounts of text and the api's on the
previous slide are not really
appropriate for those kinds of tasks
also sometimes you need a lot more
control over the way text is rendered
and again the previous api's are a
little too simple well in the quick-draw
text world we used things like DrawText
MeasureText if you supported
bidirectional text you had to call
GetFormatOrder there are a whole bunch of
api's to call and it's too complicated
to go into in a
talk like this the equivalent set of APIs
to use in the Unicode world for
Carbon is ATSUI Apple Type Services for
Unicode Imaging and again it's a rather
large API set and rendering complex text
is a sufficiently difficult problem that
I'm not going to get into it in the two
or three minutes I would have to cover
it in this session so there's a great
online reference Rendering Unicode Text
With ATSUI I strongly recommend you
start there if ATSUI is new to you
in addition there's a session on Friday
session 425 modern text layout and
editing for carbon applications where
you can go to hear all about ATSUI and
MLTE and to talk to the engineers who
work on it now a much more ideal way of
dealing with text is not to have to
render large amounts of that yourself
but to use one of the built-in text
editing engines that's a lot easier than
building your own the text editing
engine in the world script world was
called TextEdit and there's also a
control to go along with it the edit
text control but unfortunately they can't
support Unicode and they're now
deprecated so the modern Unicode
equivalent is MLTE the multilingual
text engine and again I'm not going to
go into the details of MLTE whoops oh
it's up there but it's not down here I'm
not going to go into the details of
MLTE
but there's a very nice online reference
that you can read and a new option
that was introduced I think in
Panther was HITextView
which makes it even easier to use MLTE it
wraps it up in an HIView object so
it can be part of an HIView hierarchy and
could my monitor picture has disappeared
so it would be nice to get some support
for that in addition to HITextView
there's also a Unicode version of the
edittext control so that basically gives
equivalent functionality but supports
unicode and i have here a few API
examples just to give you a flavor you
call HITextViewCreate and that
will create a new HITextView for you
that wraps up an MLTE object and the
nice thing about HITextView is
it's not totally opaque you can get at
the underlying MLTE object so that
you can do more advanced operations with
it you can save and open documents and
so forth and so on and you just call
HITextViewGetTXNObject to get
that out and Unicode text control is
very easy to create you just call
CreateEditUnicodeTextControl you'll
have to forgive me as my head swivels
around for a while as I've lost my
monitor here maybe I'll move over here
so I can see the podium monitor while
they're taking care of that okay another
problem if you are implementing your own
text editing engine or for some other
reason you have to handle text input
directly then in the very very very old
world you might have called WaitNextEvent
or in the ancient world even
GetNextEvent hopefully nobody's calling
that anymore if you are supporting
languages like Japanese or Chinese
hopefully your application is already
using TSM
and you were calling NewTSMDocument
specifying kTextServiceDocumentInterfaceType
well unfortunately that
doesn't support Unicode but there is a
new document type kUnicodeDocumentInterfaceType
that you can call NewTSMDocument
with and that will create a TSM
document that supports unicode in older
versions of the OS that was done with
Apple events but for the last several
releases it's been done with carbon
events and you want to avoid the
keyboard class carbon events because
those are raw keyboard events and if you
look at those that
will be before the input method has a
chance to work on them so you want to
look at the text after the input method
has processed it and the two carbon
events for that are the text input
Unicode for key event and that's
what comes from input methods or
keyboard layouts and then there's the
text input Unicode text event and that's
what comes from non keyboard entry
methods such as the character palette or
ink and you can basically handle those
pretty much the same way now if you're a
TSM aware application there's several
more carbon events you have to deal with
but those are the same between Unicode
and non-unicode applications and so
we're not going to talk about them today
okay so we know how we're storing our
text we know we know how we're getting
it into and out of our application we
know how we're drawing and inputting it
but there's also operations on the text
itself something that's important in a
lot of applications is sorting and
searching in the old WorldScript world
we only supported sorting and you would
call StringOrder or TextOrder in order
to do a comparison of two strings and of
course that depended on what the current
script system was in the Unicode world
there's several api's available
the easiest one to use is
CFStringCompare and you just give it two
CFStrings and some options on how you
want the strings compared and it will
tell you whether they're
the same or one is less than or greater
than the other if you're working with
arrays of UniChars you can call the
lower level API UCCompareText now
if you're going to be doing sorting
you're going to be doing a lot of key
comparisons in your sort and you may be
comparing the same key multiple times
there is some overhead involved in doing
a language and unicode sensitive
comparison so if you're going to be
doing something like sorting a large
amount of data it's more efficient to
get something that's called a collation
key and a collation key is a string of
bytes that does a binary compare the
same way that the underlying string
would do a language and Unicode
sensitive compare so what you can do is
call the Unicode Utilities
UCGetCollationKey
for a given collator and string
of UniChars and you'll get back a
binary key that you can just compare
using binary ordering and that can make
your sort go significantly faster
something that you couldn't do in
the WorldScript world but you can do in
the Unicode world is search for
substrings and again CFString makes it
very easy there's CFStringFind you
give your target string and a substring
that you want to look for in that target
string and search options and it will
find the instances you can step through
them you can also look for more than
just a substring you can also find
instances of characters in a
CFCharacterSet and again this is just a
sample of the api's that are available
there are a lot more api's available for
sorting and searching and I urge you to
check out the documentation for
CFString it has a lot of capabilities
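The collation key idea described above can be tried outside Carbon. The sketch below is Python, not the C calls named in the talk, and its comparison rule (base letters first, then raw code points) is a deliberately simplified stand-in for real language-sensitive collation. The names `primary`, `compare`, and `collation_key` are hypothetical. It only illustrates the point being made: compute one byte key per string, then sort with plain binary comparison.

```python
import unicodedata
from functools import cmp_to_key


def primary(text):
    """Base letters only: decompose, drop combining marks, fold case."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()


def compare(a, b):
    """Toy 'expensive' comparison: base letters first, then raw code points."""
    ka, kb = (primary(a), a), (primary(b), b)
    return (ka > kb) - (ka < kb)


def collation_key(text):
    """Bytes whose binary order matches compare().

    Assumes the text contains no U+0000, which is used as a separator.
    """
    return primary(text).encode("utf-8") + b"\x00" + text.encode("utf-8")


words = ["zebra", "\u00c9mile", "apple", "Zo\u00eb", "eclair", "\u00e9clair", "Apple"]

# Slow path: every comparison in the sort redoes the normalization work.
slow = sorted(words, key=cmp_to_key(compare))

# Fast path: one key per string, then plain byte comparison.
fast = sorted(words, key=collation_key)
```

Both sorts produce the same order; the second one pays the normalization cost once per string instead of once per comparison, which is the trade the talk describes for large sorts.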
sometimes you need to change the case of
something and we had UppercaseText and
LowercaseText available in Text
Utilities for doing that but they don't
work with Unicode the modern equivalent
for Unicode applications is on CFString
and there's a CFStringUppercase which
converts everything to uppercase you'll
notice that it takes two parameters a
string and a locale the reason for that
is that the rules about how to convert
uppercase to lowercase or lowercase to
uppercase differ a little from language
to language for example in Turkish the
rules are different from English and so
you need to pass a locale if you want
the case conversion to be done in a
correct language sensitive fashion
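To make the Turkish case concrete, here is a small Python sketch rather than the Core Foundation call from the talk. Python's `str.upper()` and `str.lower()` take no locale, so the `language` parameter and the two helper functions are hypothetical, written only to show why a locale argument is needed: Turkish pairs dotted i with dotted capital İ (U+0130) and dotless ı (U+0131) with plain I.

```python
def upper(text, language="en"):
    """Uppercase text, applying the Turkish dotted/dotless i rule on request.

    The language parameter is a hypothetical stand-in for the locale
    argument described in the talk.
    """
    if language in ("tr", "az"):
        # Turkish and Azerbaijani: i -> U+0130, U+0131 -> I
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()


def lower(text, language="en"):
    """Lowercase text, applying the Turkish dotted/dotless i rule on request."""
    if language in ("tr", "az"):
        # Turkish and Azerbaijani: I -> U+0131, U+0130 -> i
        text = text.replace("I", "\u0131").replace("\u0130", "i")
    return text.lower()


english = upper("istanbul")        # plain capital I
turkish = upper("istanbul", "tr")  # dotted capital I (U+0130)
```

The same input gives two different uppercase strings depending on the language, so a case conversion without a locale can only be right for some users.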
something that you can do with CFString
that you couldn't do in the script
manager is capitalize that is convert
only the first letter of every word to a
capital letter and we talked about
CFStringNormalize a little earlier it
can convert a string not just to the
precomposed normalization form C but to
any of the four Unicode normalization
forms so one of the things
that you used to have to do in the
script manager because there was no
Unicode equivalent was to do
transformations on texts such as
transliterate to a different script or
to strip out accents or diacritics and
this is one of the last pieces that
we've come up with a Unicode equivalent
for new in Tiger there's a new API called
CFStringTransform and you pass it a
mutable string an identifier for the
kind of transformation you want to
perform you can optionally limit it to a
sub range of the string and you can also
specify whether you want the transform to
go forward or reverse and there's
several transforms available this is not
a complete list but one is to strip
combining marks another will transform
as much as possible to Latin from
arbitrary Unicode scripts it doesn't
cover all of Unicode and these first two
transforms are not reversible
because of a basic property called
entropy once you lose the information
you can't get it back so once i have it
in latin I don't know what the original
scripts were so these are irreversible
there are reversible transformations
such as between Latin and hiragana so
there's an example we transliterate
konnichiwa from a romanization
to hiragana and we can go in either
direction because we're specifying
which script we're using there's also a
transformation of Unicode ideographic
characters Han characters according to
the Mandarin pinyin transliteration
system so in this case we have the city
named Shanghai written in Han characters
and that's transliterated to Latin and
finally there's the XML hex
transliteration which John demoed
earlier which will take non-ascii
printable characters and convert them to
a hex escape sequence and you can apply
some of these transformations serially
for example you could convert to Latin
and then call strip diacritics to strip
out the diacritics if you don't want
them there and again this is new in
Tiger and this is in the WWDC preview
release that you've received so you can
experiment with it there's also other
manipulations on strings just basically
moving parts of strings around and in
the WorldScript world we had Munger
Munger just works on bytes and
in addition it requires that your text
be in a handle and there are several
options available to replace Munger if
you're working with Unicode
CFStringReplace
is very easy to use it takes a
mutable string a range of that string
that you want to replace and what to
replace it with very straightforward
there's also CFStringCreateWithFormat
and CFStringAppendFormat
which work a lot like printf and again
those are fully Unicode compatible
there's also CFStringTrim which will
remove constant strings from the
beginning or end of a CFString or a
mutable string that is and also
CFStringTrimWhitespace which will
remove whitespace characters and if you
really need to just move bytes around
then there's the standard C library
routine memmove which handles arbitrary
byte moves and deals with issues
like overlapping source and destination
if you have an application that displays
text in a list or presents text in a
fixed size space if you have a string
that's too large for that space or in a
list if it's too large for the column
then you need to truncate the string and
that needs to be done in a Unicode and
language sensitive way in the script
manager world we had TruncString and
TruncText to do that there's two ways
to do that in the Unicode world one very
nice option if you're using ATSUI
directly is to use the ATSUI line
truncation tag and what that will
actually do is truncate the string while
it's being drawn so you don't actually
have to modify the string itself in
memory what you can do is tell ATSUI
that you need to draw the string in a
fixed width and if you specify the line
truncation tag if it fits by itself
that's fine if it's a little too big
ATSUI will try to squish it down a
little bit first so it can draw the
whole thing and if it still doesn't fit
then ATSUI will truncate
the string and insert an ellipsis if you
want to actually truncate the data
itself which is the way that TruncString
and TruncText worked then you can call
TruncateThemeText which is a Unicode
equivalent something that's very
important for applications that deal with
text is finding appropriate boundaries
so we already talked about a cluster
boundary which corresponds to what the
user thinks of as a character but there
are other boundaries as well so let's
take a look at this slide there's an
example at the bottom that illustrates
line and word break and you'll see that
line break and word break are not the same
thing although they're often thought of
as being the same thing so for example
if I'm doing line breaking it's
acceptable to break after the
hyphen but if I'm doing word breaking
that is determining what constitutes a
word either for double clicking or for
doing whole word searching then breaking
in the middle of that is not acceptable
so line breaking and word breaking are
different
at the moment the only api's that are
available for doing this kind of
breaking operate at the UniChar array
level so that's the Unicode Utilities
the first step is to create a text break
locator by calling
UCCreateTextBreakLocator
and you specify when you
create it which kinds of text
boundaries you're interested in whether
it's a cluster boundary or a word
boundary or a line boundary and then you
can call UCFindTextBreak to
iterate through the breaks in your text
either in a forward or backward
direction if you're interested in
cluster boundaries then as I mentioned
earlier in the talk there's
CFStringGetRangeOfComposedCharactersAtIndex
which works at the CFString level
but if you need line or word breaks then
you need to call the Unicode utilities
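The difference between word breaks and line breaks can be shown with a toy example. The Python sketch below uses two regular expressions that are assumptions made for illustration only; they are not the rules the Unicode Utilities apply and will not handle scripts that do not separate words with spaces. `word_spans` and `line_break_offsets` are hypothetical helpers.

```python
import re


def word_spans(text):
    """Toy word boundaries: a hyphenated compound counts as one word."""
    return [m.group() for m in re.finditer(r"\w+(?:-\w+)*", text)]


def line_break_offsets(text):
    """Toy line-break opportunities: after whitespace runs and after a hyphen."""
    return [m.end() for m in re.finditer(r"\s+|-", text) if m.end() < len(text)]


sample = "a well-known example"

words = word_spans(sample)           # the hyphenated compound stays whole
breaks = line_break_offsets(sample)  # includes the offset just after the hyphen
```

For this sample a line may be broken right after the hyphen, while double clicking or whole word search would still treat the hyphenated compound as a single word.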
ok the last topic that we're going to
cover is dates times and numbers so
there are several things you need to
be able to do with dates times and
numbers one is to convert a date that's
in a binary format or a time into a
string to display to the end user or the
user might have typed a date or a time into
a text entry field and you need to
convert it back to a binary number so
you can perform an operation on it and
in the old world there were several
api's available for that I'm not going
to read them all off but they're all
deprecated now in Panther we introduced
CFDateFormatter which is a new set of
api's in Core Foundation that do this
in the Unicode world and so we'll go
through a small example here again
CFDateFormatter has a fair number of
api's that we don't have time to go into
detail on all of them so I'll just go
through a short example you can use
CFDateFormatterCreateStringWithAbsoluteTime
to use a CFDateFormatter
and convert a time a binary number into a
string if you're going in the other
direction you use
CFDateFormatterGetAbsoluteTimeFromString
again you pass
a CFDateFormatter and a
string and you'll get back a binary time
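The round trip between a binary time and a string can be sketched in Python with the standard `datetime` module. This is not the Core Foundation API: the fixed `PATTERN` and both function names are assumptions made for the example, and a real formatter would take its pattern from the user's locale. The one borrowed fact is that a CFAbsoluteTime counts seconds from 1 January 2001 00:00:00 GMT.

```python
from datetime import datetime, timedelta, timezone

# CFAbsoluteTime counts seconds from 1 January 2001 00:00:00 GMT.
REFERENCE_DATE = datetime(2001, 1, 1, tzinfo=timezone.utc)

# Fixed pattern chosen for this sketch; not locale-sensitive.
PATTERN = "%Y-%m-%d %H:%M:%S"


def string_from_absolute_time(seconds):
    """Binary time -> display string."""
    moment = REFERENCE_DATE + timedelta(seconds=seconds)
    return moment.strftime(PATTERN)


def absolute_time_from_string(text):
    """Display string -> binary time; raises ValueError if the pattern does not match."""
    moment = datetime.strptime(text, PATTERN).replace(tzinfo=timezone.utc)
    return (moment - REFERENCE_DATE).total_seconds()


one_day = 24 * 60 * 60
shown = string_from_absolute_time(one_day)
parsed = absolute_time_from_string(shown)
```

Parsing here is strict: a string that does not match the pattern exactly is rejected with an error rather than interpreted.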
thirdly CFDateFormatters have
properties that you can set on them that
control how the formatting is done and
you can use
CFDateFormatterSetProperty
to set a particular property on
the date formatter so
here's a complete example we'll go
through first we create our date
formatter again we pass NULL to indicate
the standard storage allocator for Core
Foundation we need to pass a locale to
specify what kind of date formatting
we're doing because the date formatting
for say US English is very different
from that for Japanese or German or
Dutch or what have you so we call
CFLocaleCopyCurrent which gives us back
the user's current locale now if you were
doing this in a real application you'd
want to save the user's current locale so
that you don't keep calling
CFLocaleCopyCurrent
over and over again because
first of all you get a lot of copies and
second of all you want to take a
snapshot of the user's current locale so
that you get consistent results the
other thing we need to specify when
we're creating our CFDateFormatter is
what style of date and time we want in
this case we're saying we want the long
date style and the long time style and
the next thing we're going to do is
since we're in this example we're going
to convert a date entered by the user
into a binary time we're going to set
the lenient property on the date
formatter and we do that by calling
CFDateFormatterSetProperty passing the
formatter and the key for lenient
property and setting it to true now what
that's used for is if you don't set this
property when you try to convert a date
or time string to a binary number
CFDateFormatter will try to match it exactly
against the template that's used for
formatting dates for converting a date
from a binary number to a string and if
it doesn't exactly match that template
the conversion will fail what the
lenient property does is it sets the
date formatter so that it will try as
hard
is possible to interpret the input
string as a date or time even if it
doesn't match the template that it's
expecting so you pretty much always
want to set this unless you're doing
some kind of validation and the final
call we make is
CFDateFormatterGetAbsoluteTimeFromString
we pass our CFDateFormatter the string
that's the input you have the ability to
pass some options but we're passing NULL
in this case and finally you pass a
pointer to the CFAbsoluteTime to be
filled in now sometimes you have to do
operations on dates other than
converting them to strings and
converting them back from strings to a
binary number so I mean sometimes you
need to do calendar operations an
example might be take this date and add
one month or take this date and add one
year and so in the script manager world
there were api's like ToggleDate and
ValidDate and LongDateToSeconds and
LongSecondsToDate that converted
between the binary form of time and a
structure which specified the year month
day etc separately so the time type for
the new api's is
CFAbsoluteTime and for a while
there's been a set of api's for
CFAbsoluteTime for doing computations
with the Gregorian calendar and those
were I don't know what release they
were introduced in but they've been in
for a couple of releases now but those
api's can't handle non Gregorian
calendars which we're adding more
support for in Tiger and so we're
introducing a new type CFCalendar it's
a new Core Foundation type and it's a
set of api's that will work with any
kind of calendar to do calendar
computations such as toggling dates
validating dates and getting components
of dates and this API did not make
the WWDC
preview release but it is something we're
working on for Tiger so I'm just going
to tell you a little bit about it today
since you can't work with it yet
CFCalendar can do things like convert
a set of calendar values to an absolute
time so for example if you give it a
year a month and a day you can convert
that to an absolute time it can also go
in the other direction it can take an
absolute time and pick out the calendar
components that correspond to it and
finally you can do toggling operations
such as taking an absolute time and
adding a fixed quantity to it such as a
year a month or a day so
this is the multi calendar replacement
for the Gregorian calendar api's that are
in there right now and look for it in a
Tiger release coming soon well very
similar to dates and times we also need
to be able to convert numbers between a
binary format and a string that a user
can understand so and again that needs
to be done in a locale sensitive way
because different countries have
different conventions for the way that
numbers are formatted in the WorldScript
world there were api's available for
doing that in Panther we introduced
CFNumberFormatter which is the Unicode
equivalent and again we'll go through a
short example CFNumberFormatter has
several api's that we don't have time to
go into you can create a string with a
value using CFNumberFormatter and you
just pass the formatter you have to
specify the type of the value because it
could be say a floating point number a
double a long what have you so you need
to specify what type it is you can also
go in the other direction you can take a
string and interpret it as a number
using
CFNumberFormatterGetValueFromString
and again you pass the
formatter the string and some other
options and you'll get a number out
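The two directions, number to string and string to number, can be sketched in Python. This is not the Core Foundation API: the `CONVENTIONS` table, the locale identifiers used as its keys, and both function names are assumptions made for the example, covering only the grouping and decimal separators of two locales.

```python
# Hypothetical per-locale conventions: (grouping separator, decimal separator).
CONVENTIONS = {
    "en_US": (",", "."),
    "de_DE": (".", ","),
}


def format_number(value, locale_id):
    """Binary number -> string using the locale's separators, two decimals."""
    group, decimal = CONVENTIONS[locale_id]
    text = f"{value:,.2f}"  # always produces "," grouping and "." decimal
    # Swap via a placeholder so the two separators do not clobber each other.
    return text.replace(",", "\x00").replace(".", decimal).replace("\x00", group)


def parse_number(text, locale_id):
    """String -> binary number, reading separators according to the locale."""
    group, decimal = CONVENTIONS[locale_id]
    return float(text.replace(group, "").replace(decimal, "."))


american = format_number(1234567.891, "en_US")
german = format_number(1234567.891, "de_DE")
```

The same digits with the separators swapped mean different things in the two locales, so parsing has to know which convention the user typed in.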
finally you can also set the format
that's used for a number formatter if
you create a number formatter with a
locale you'll get the default format for
that locale but number formatters use a
formatting string which is very similar
to the pattern string that you might see
in a spreadsheet program such as Excel
and you can set your own format strings
to format numbers in a particular way
and you do that by calling
CFNumberFormatterSetFormat and passing a
string that represents the format
pattern to use so here's an example we'll
format a number we create our number
formatter using again the default
storage allocator again we pass a copy
of the user's current locale and again
you want to save that away as opposed to
getting it every time you make this API
call and in this case we're saying we
want a number formatter that uses the
currency style because we're going to be
formatting currency we have a double
which stores the currency amount we want
to format it's a floating point number
42 we call
CFNumberFormatterCreateStringWithValue
again the default
storage allocator we pass the number
formatter that we created two lines back
we specify that we're passing a double
and then we pass the address of the
variable and this API will then return a
string with that number formatted as
currency according to the conventions of
the user's current locale so that
has been our whirlwind tour of the
Unicode api's that are replacements for
WorldScript again we did not have time
to go into detail on all of them because
there are a lot of api's out there but
the goal of this presentation was to
help you to understand how to translate
a particular piece of your existing
WorldScript application to the Unicode
world so hopefully this
presentation gave you the pointers you
need to know where to go in the
documentation to do that if you have
further questions the first person you
should contact is Xavier Legros who is the
representative for these technologies
in Worldwide Developer Relations you
can also contact me but please do try
Xavier first rather than give you a long
list of URLs to go to for information on
Unicode api's there's a one-stop
shopping page and this is the URL if you
go to our Unicode reference library page
you'll find links to all the API sets
and all the documentation you need to
convert your application to unicode