---
title: WWDC2004 Session 412
framework: wwdc
role: article
path: wwdc/wwdc2004-412
---

# WWDC2004 Session 412

## Transcript

Today we're going to talk about how to move your application to Unicode. One question you might ask is: why should I move my application to Unicode? There are several reasons. Probably the most important is that customer requirements are changing. Unicode is the character set used on Windows, and more and more it's the character set that is used on the Internet, so Unicode support is very important for cross-platform compatibility. In addition, in Asian markets (China, Japan, Korea, and elsewhere), the number of characters that customers require has been increasing. The old legacy character sets that we used on Mac OS 9 are limited in the number of characters they can support, customers need more characters than you can supply using those character sets, and Unicode is the solution for that as well.

As a result, starting in Tiger we're deprecating the old WorldScript APIs. Thank you. They're still there, and your app will continue to run, but we strongly recommend that you not use these APIs for future application development. That includes QuickDraw Text, the Script Manager, Text Utilities: basically everything related to WorldScript. Fortunately, Mac OS X has a complete set of APIs that work with Unicode, so it's easy to move your application from the WorldScript world to the Unicode world, and we're going to talk about how to do that today.

So it's definitely time to move your application to Unicode, and lots of big app developers are doing that. The latest release of Microsoft Office, Office 2004, is a Unicode application suite. Apple's own FileMaker Pro 7 is a Unicode application. So it's time to get on board.

Today we're going to cover in detail what's required to move your WorldScript-based application to Unicode, and we're going to do a lightning tour of all the aspects of an application: how to store text; your human interface and your localization; drawing, editing, and input; sorting; doing
transformations on text and analyzing it; and also formatting and scanning of dates, times, and numbers, and calendar manipulations. We're not going to go into incredible detail on any of those areas, but we'll give you a tour so you know which APIs to use to convert your application to Unicode. All of the APIs we're going to talk about today are in the Core Foundation and Carbon areas of Mac OS X, and we're going to show how to map from the older technologies, the Script Manager, Text Utilities, and so on, to these newer APIs.

But before we do that, I'd like to spend a couple of slides talking about what's new in Tiger in the international area. We do have some new features, and the first one is something people have been asking for for a long time: in Tiger we have the first stage of our support for OpenType font layout tables. This allows OpenType fonts to work in Unicode applications. If you're using our standard Unicode APIs, you don't have to do anything special in your application; OpenType fonts will just work. We're supporting features like ligatures and language shaping in certain cases, and you'll see the support for OpenType layout increase as time goes on.

Something that was missing from our Unicode API suite was string transliteration. This is something you could do in the Script Manager; there is an API in the Text Utilities to do it, but there was no Unicode equivalent, and we've now got that in Tiger. We have even more locale data available, and as with Panther, much of that locale data is only available through Unicode APIs, so it's very important to move your application to Unicode so you can take advantage of all the languages and all the locales that Mac OS X supports. Just as a side note, this is a Carbon session, but there has been a Carbon date control for a long time, and new in Tiger is an equivalent Cocoa date control, so that's available.

We had some support for non-Gregorian calendars in Panther, and we're
improving that support in Tiger. In addition to the Japanese and Thai Buddhist calendars, which we had in Panther, we're adding Islamic and Hebrew calendar support. Also, in Panther you could only use one of the non-Gregorian calendars if you were using the date and time formats that went with it; for example, you could only use the Japanese calendar with the Japanese locale, or the Thai Buddhist calendar with the Thai locale. In Tiger you can select the calendar separately from the date and time formats, so here you see examples of the Islamic calendar and the Japanese calendar being used with the US English locale.

We've also added more control over number and date formatting. We introduced CFDateFormatter and CFNumberFormatter in Panther, but now you have more control over how they operate; there are more options. There's also a new option for number spell-out, so in addition to all the formatting options we had before, you can spell out numbers. There's an example there, 123.45, and this is not just for English; it works with any of the locales that Mac OS X supports.

Every release we try to extend our Unicode coverage a little bit, and this time we're moving more Roman, Greek, and Cyrillic support into our core fonts. We used to have separate fonts, for example, for Cyrillic support, and we're extending our core fonts, that's Times, Helvetica, and Courier, to support a wider variety of Roman and Greek characters and also adding Cyrillic. In addition, we're covering some new Unicode blocks: [unclear], Braille, Yijing Hexagram Symbols, and Tai Xuan Jing Symbols. Not all of this is in the preview release that you have in your hands, but it will be showing up, and there are possibly more blocks we might be covering that aren't listed here that are still a little bit up in the air. Every release we try to add a little bit more.

Oh, and one last thing: some of the language IDs that we've used in Mac OS X up till now
have not exactly followed the standards in the area. For example, we use zh_TW for Traditional Chinese, and in Tiger we're adding support to move to canonical language IDs. There are some examples. There are new APIs that have been added to CFLocale that can help you canonicalize language IDs, so that if you have a localization for your application that uses an old ID and you need to compare it against the new IDs, these APIs can help you make sure you do the comparison in a canonical way.

In order to show some of the new features in Tiger, I'd like to ask John Jenkins to come up on stage, and he'll give us a short demonstration of what's new in Tiger.

All right. As Deborah says, there's an awful lot that's new in Tiger, and we don't really have time to go over anywhere near all of it, so I'm just going to hit some of the highlights, some of the really exciting things. We'll start out with an exciting thing, which is the transliteration APIs. Here we have a sentence, or a word I guess, which is in Latin, and I want to know what this would look like, for example, in Greek, and I can change it. Or I want to know what it would look like in katakana, and I can change it, although I'm not quite sure that the accent works or that the upside-down exclamation mark
really works in Japanese, but that's okay. Also useful is turning it into XML hex. This is very handy: it goes through and takes all of the non-ASCII letters and converts them into the numeric entities that you would use on a web page or in XML. Or, if you really want to know what's going on, you can always get the Unicode name of each character that's not ASCII, which is also handy.

And of course we can go the other way. We have a lot of things that will let us transform to Latin. If I get a sentence here, for example, this is the first line of the Iliad, for all the people who are fans of the movie Troy, I can turn it into Latin. I could also strip it of its combining marks if I wanted to remove all of those. I can take other examples and turn them into Latin. So I have an Arabic sentence, I hope, and I can turn that into Latin. Or I can take something that's in Japanese and, again, turn it into Latin, so I can get a first-approximation transliteration. One thing that's useful for Chinese: here I have a sentence which is partly in Cyrillic and partly in Chinese. I can turn the whole thing into Latin if I want, or I can just take the Chinese part and turn it into Latin. So a lot of useful transliteration APIs are available now.

Now, Deborah also mentioned that we have a lot more locales. Let's just bring up a list of our locales. These are the locales that are available on the system in Tiger, together with some of the information you can get about them. This is pretty useful information. The one thing which I don't think is wildly useful is metric, that is, whether or not this locale uses the metric system, because it amounts to whether or not you are in the United States, but that's okay. Currently it defaults to showing them in the current, in the global locale, but we can switch back and forth so I can see what each of these looks like. You'll notice that some of these are showing up in the Last Resort font; these are locales that we don't have
system support for, but that's OK; on Mac OS X it's very easy for third parties to add support. So if I want to take one of these and see a date, say, in the Islamic calendar... let's not use an unsupported locale. Okay, German. I want to see what today's date is: this is today's date in the Islamic calendar as shown in Latin letters in the German locale. So we have a great deal of flexibility for date and time formatting that we didn't use to have.

All right, so that covers two. The third is the really exciting thing, I think. This is something, as Deborah says, that people have been desperately asking us for for a long time, and that's OpenType support. It's not fully wired up yet, but there is enough there that we can show it to you. This is WorldText, which is a standard application that comes with the developer tools, and I'm going to switch to Adobe Caslon Pro. This is unaltered, nothing up my sleeves, straight-out-of-the-box Adobe Caslon. I can start typing... let's come up with something here. Okay, and you'll notice that as I typed, the ligature formed automatically: FL formed automatically, FI formed automatically. So the ligatures are forming automatically, as is the case with AAT fonts at the moment; this will happen with OpenType fonts as well.

If I bring up the typography palette, which was introduced in Jaguar... sorry, in 10... wait, I always get the code name... in 10.3, there we go, you can see more of what's available in the font. For example, I can turn on rare ligatures. As the system does now with AAT fonts, it does with OpenType fonts in Tiger: it scans through, sees what is in the font, and gives us support for all of these things. I can turn on lining figures. I can turn on superiors, let me see if that works. Yeah, oh, that's kind of cool. And so on. So there's a lot of flexibility here. The font has it built in; the system just picks it up. So look forward to it. Thank you.

Thank you. [Applause] Okay, so now that we've seen what's new in
Tiger, let's start moving into the detailed part of the presentation. We're going to go on our whirlwind tour of the WorldScript APIs and what to replace them with. Before we start talking about things that you can do with text, we have to talk about how you store your text in the first place, so that's our first topic.

Before we do that, let's have a quick refresher on what's different about Unicode compared to WorldScript. The "uni" in Unicode means one, and the most important thing about Unicode is that there's only one character set you have to worry about, unlike WorldScript, where there were many. Unicode stores characters in 16-bit units in the UTF-16 form, which is what we use in Mac OS X in Cocoa and Carbon. But since Unicode has more than 96,000 characters, how do you fit that in a 16-bit unit? The only answer is you can't, and some characters need more than one unit to be stored. There's an example right there: the Unicode character U+2000B, which is one of the plane 2 Han characters, is actually stored as two 16-bit units called a surrogate pair.

Now, when I talk about a character here, a Unicode character, that's the programmer's concept of a character. What the user thinks of as a character can actually be larger than that. What the user thinks of as a character we call a grapheme or a cluster, and it can consist of one Unicode character or several. Here are a couple of examples. We have the word "résumé", but the accented e's are represented by the base letter e plus what's called a combining acute accent, so you have two Unicode characters, the e and the accent, that represent one user character. In the next example there are even more: this is the Vietnamese word for "Vietnamese", and you can see that we have an e with two combining accents, a dot below and a circumflex above. So this is three Unicode characters that represent what the user thinks of as a single character.

To make life a little more interesting, there are actually multiple ways to do this in Unicode. In addition to the base
letter and combining marks that we used in this example, there are also precomposed versions of these characters. There is an e with an acute accent that's a single Unicode character, and there is an e with a dot below and a circumflex above that's a single Unicode character. But you can't always represent every character in a totally precomposed form, and conversely, you can't always represent a given character in a totally decomposed form. So even though there are versions of Unicode that we call precomposed and decomposed, they really mean as precomposed as possible and as decomposed as possible. Even in precomposed Unicode you do have to worry about things like combining marks, because they can be present.

Okay, so in the WorldScript world, the way you stored your text was in a Pascal string or in a C string. Those don't support Unicode, or at least not in the form that we need for Carbon and Cocoa. So what do you do in the new world? Well, if you're using Core Foundation, you can use CFString or CFMutableString, or, new in Tiger, CFAttributedString. If you need to work at a lower level, you can just store Unicode text as arrays of UniChar, which is a type defined in Carbon, or actually at a very low level.

There are a lot of APIs for CFString and friends, and I don't have time to go through them all, but just to give you a flavor of how the API works, here are a few examples. You can create a CFString using an array of UniChars. In this example we pass NULL, which indicates the default storage allocator for Core Foundation, we pass an array of UniChars, and we pass their number, and that will give us back a CFString object. You can also get characters back from a CFString. In order to get the best performance, you can use an inline buffer: you set up an inline buffer on a CFString, and then you can ask to get a character at any index, and the inline buffer will take care of batching access to the string so you get the most efficient access possible.
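
The surrogate-pair storage described earlier in the talk can be sketched in portable C. This is only an illustration of the UTF-16 arithmetic, not code from the session: `utf16_encode` and `utf16_count_code_points` are hypothetical helper names, and the sketch uses `uint16_t` in place of the UniChar type so that it builds without Core Foundation or Carbon.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written,
   or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                               /* out of range, or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* plane 0: a single 16-bit unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}

/* Hypothetical helper: count Unicode characters (code points) in a UTF-16
   buffer, treating each well-formed surrogate pair as one character. */
size_t utf16_count_code_points(const uint16_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int is_high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        if (is_high && i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;                                /* skip the low half of the pair */
        count++;
    }
    return count;
}
```

With these helpers, U+2000B encodes as the two units 0xD840 0xDC0B. Note that the count is of Unicode characters, not user characters: a base letter followed by a combining accent still counts as two, which is why cluster-aware APIs are needed on top of this.
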
New in Tiger is attributed string support. An attributed string wraps an existing CFString, as opposed to you putting the characters into the attributed string directly. You can create an attributed string by passing a CFString and a dictionary of attributes, and they can be totally arbitrary attributes; it's not just a fixed set, although there is a predefined set. You can also get the attributes at a particular point of an attributed string: you pass the attributed string, the index where you want to get attributes, and a pointer to a range. The range gets set to the run of the attributes, so you know how big a stretch of text has those particular attributes, and the function call returns the dictionary of attributes that apply to that range.

Now, something that those of you who have programmed in WorldScript know is that when you're dealing with double-byte character sets, you can't just break a string at an arbitrary byte offset, because it might be in the middle of a double-byte character, and you used CharacterByteType to determine if there was a safe place to break. We can't use CharacterByteType in a Unicode application, but there's a similar issue to worry about, and that is the user characters I talked about earlier, what's called a cluster or grapheme. You don't want to break in the middle of one, because if you do, and then you only display the first part before the break, you'll mangle what the user thinks of as their character and display the wrong thing. There are APIs available to help you find a safe place to break. If you're using a CFString, you can use CFStringGetRangeOfComposedCharactersAtIndex, and that will find a safe place to break. If you're using a UniChar array, you can use the Unicode Utilities' UCFindTextBreak API and look for a cluster boundary, and that will also tell you a safe place to break.

So here's an example. We have a string and an offset, and we want to figure out a safe place to
break, so we call CFStringGetRangeOfComposedCharactersAtIndex. We pass our string, and we pass the index, that is, the place where we would like to break, and what the API returns is a range, which is the beginning and the end of the user character, or cluster, that corresponds to that offset. In this case we take that range and go to the end of it: we take the beginning location and add the length, and that's the place where we can break. We could also just use the beginning of the range instead of the end; that depends on how you want your application to work.

Another thing you need to do is figure out what kind of character a given character is: is it a letter, is it a digit, and so forth. In the WorldScript world you used CharacterType for that, but that doesn't work with Unicode, so we can't use it anymore. In the Unicode world there are two ways to do this: there are the CFCharacterSet APIs in Core Foundation, and at a lower level there's UCGetCharProperty.

Here's an example that determines whether a character is a decimal digit. Now, you might think that to determine whether something is a decimal digit you can just ask, is it ASCII 0 through 9? But it turns out that in Unicode there are lots more decimal digits than just those. There are decimal digits for Indic languages and for Arabic, and all of those are just as valid as decimal digits as the ASCII versions that we're used to. So to test whether any character is a decimal digit, we can use CFCharacterSet. First we get a predefined CFCharacterSet, in this case the set of decimal digits, and then we can call CFCharacterSetIsLongCharacterMember to determine whether a given Unicode character is a member of that set or not, and then we can branch one way or the other depending on the answer.

Well, it would be wonderful if your application could deal only with Unicode and never have to think about anything else, but there's still a lot of data out there that's not in
Unicode. There are documents that users have; I have documents on my system that date back almost to the time the Mac was introduced, and those are definitely not in Unicode, because it hadn't even been invented then. There are protocols on the Internet that still require non-Unicode character sets. The web is a big example: you can use Unicode on the web, but many, many web pages are not in Unicode. So you need to be able to move between the Unicode world and the non-Unicode world. We've had support in Mac OS X for this for a long time in the form of the Text Encoding Converter, which is a fairly low-level API that actually dates back to Mac OS 9, but there are easier ways to do it using CFString.

Again, there's a wide variety of APIs that you can use to do this, and we're only going to go through a couple of them. In the first example, you can create a CFString using a C string: all you need to do is pass the C string, which is null-terminated, and a text encoding to use, and that will give you back a CFString, which is in Unicode. If you want a little bit more control, for example your string isn't null-terminated or you want to control what happens if the data can't be converted completely, then you can use CFStringCreateWithBytes, which gives you finer control.

One question you might have is: what do I pass for that text encoding? That's actually a non-trivial question; it depends a lot on where the data is coming from. If you're lucky and the data is coming, say, from an Internet protocol and it's tagged with its character set, then you know what encoding to pass. But sometimes you have to guess, and there are two good guesses. One is the encoding that corresponds to your application's human interface: for example, if your application is running in Japanese and you call GetApplicationTextEncoding, you'll get MacJapanese back. A different encoding is CFStringGetSystemEncoding, and that's the text encoding that corresponds to the user's most preferred language. Now,
the user's most preferred language is not always the same as the language that your application is running in, and the reason is that the user's most preferred language may be one that your application is not localized into. For example, if the user's most preferred language is Inuktitut and you don't have an Inuktitut localization in your application, then you're not going to be running in Inuktitut. In that case the application text encoding and the user's text encoding are not going to match. So which one of these you call depends on your application and where the data is coming from.

Another thing you have to worry about on the Internet, or when sending Unicode to Windows, is that other systems do not deal with the decomposed form of Unicode quite as well as Mac OS X does. So it's better to convert Unicode to what's called Normalization Form C, which is the as-precomposed-as-possible form, before you send it to those systems, and you can use CFStringNormalize to do that.

A new feature in Tiger is that you can determine the text encoding used by MLTE. If you're using the Multilingual Text Engine, which is the Carbon Unicode text engine, you can now specify the text encoding to use when opening or saving plain text files.

Okay, so we've covered the basics of how to store your text and how to get it in and out of your application. But there's more text in your application than just what's in the user's document. There's also the text that you create yourself for your human interface, so let's spend a little time talking about that.

In the old world, you used the Resource Manager to store the localized pieces of your application. You used resources like 'DLOG' or 'MENU', or if you were using PowerPlant, maybe 'PPob' resources. Those resources are all based on the old WorldScript world, and they can't support Unicode. So the modern equivalent for a Unicode application, or indeed
any modern application, is the bundle, which I'm sure you've all heard about already, but I'll just give a very brief review. An application bundle is a directory tree in the file system that's made to look to the end user as if it's a single file. You can store non-localized files, localized files, files of any type; it's totally up to you: movies, strings, what have you. Localized files are stored in an .lproj directory, and the .lproj directory is tagged with the ISO language code for the language that the localization corresponds to, for example en for English or ja for Japanese.

One of the most important kinds of things you can store in your application bundle is Interface Builder files, or nib files. Those are the files that contain UI elements and replace the old resources that were used with the Control Manager, the Dialog Manager, and so forth, the ones that didn't support Unicode. There's a small set of APIs you can use for nibs with Carbon applications. You can create a nib reference from your application bundle, and once you have that, you can get your menu bar out, you can get menus out, and you can get windows out with HIView hierarchies. It's very straightforward to use.

Another thing you have to worry about in localizing is strings. In the WorldScript world you would use an 'STR ' or 'STR#' resource to store your localized strings. Of course, those resources don't support Unicode, so we need a modern equivalent, and the modern equivalent is the Localizable.strings file. That's basically a plain text file; it can be in UTF-8 or UTF-16, and you use CFCopyLocalizedString to get a localized string out.

Here's an example. We have two strings in our Localizable.strings file. One asks the user a question, and I guess this application needs a little work, because it doesn't let the user give any other answer but yes. That could be a problem, but that's for another session. In this case we have the English version of the question and the English version
of the answer in the English version of the file, and then we have the equivalent Japanese translation in the Japanese version. You'll notice that the keys are the same, but the string that the key corresponds to differs depending on the localization. We can use CFCopyLocalizedString: we just pass the string that acts as the key, and the second string in the call is actually just there for documentation purposes. If you run the genstrings tool, it will get written out as a comment, but that's all it's used for. The function returns the properly localized version of the CFString, and off we go.

OK, a big part of any application that deals with text is drawing it, editing it, and inputting it, and there are several APIs available to do that. When you talk about drawing text, you can partition applications into two classes, or at least you can partition text drawing into two classes. The first is drawing short strings. In the WorldScript world, namely QuickDraw Text, you did that with either DrawString or TextBox. Among the Unicode equivalents, DrawThemeTextBox is very straightforward: it just takes a CFString, and you can use it when you're happy to use one of the standard theme fonts. If you need more control, you can call one of two MLTE APIs, either TXNDrawCFStringTextBox or TXNDrawUnicodeTextBox. The only difference between them is that one takes a CFString and the other takes a UniChar *, depending on how your text is stored. That gives you a lot more control: not just fonts, but you can also specify a CGContext, and you can control things like rotation and so on.

Now, sometimes an application has to draw large amounts of text, and by that I mean drawing a document, implementing a text editing engine, or implementing a web browser, where you have to paint large amounts of text. The APIs on the previous slide are not really appropriate for those kinds of tasks. Also, sometimes you need a
lot more control over the way text is rendered, and again the previous APIs are a little too simple. In the QuickDraw Text world we used things like DrawText and MeasureText; if you supported bidirectional text, you had to call GetFormatOrder. There are a whole bunch of APIs to call, and it's too complicated to go into in a talk like this. The equivalent set of APIs to use in the Unicode world for Carbon is ATSUI, Apple Type Services for Unicode Imaging. Again, that's a rather large API set, and rendering complex text is a sufficiently difficult problem that I'm not going to get into it in the two or three minutes I would have to cover it in this session. There's a great online reference, Rendering Unicode Text with ATSUI; I strongly recommend you start there if ATSUI is new to you. In addition, there's a session on Friday, session 425, Modern Text Layout and Editing for Carbon Applications, where you can go to hear all about ATSUI and MLTE and talk to the engineers who work on them.

Now, a much better way of dealing with text is not to render large amounts of it yourself but to use one of the built-in text editing engines. That's a lot easier than building your own. The text editing engine in the WorldScript world was called TextEdit, and there was also a control to go along with it, the edit text control, but unfortunately they can't support Unicode, and they're now deprecated. The modern Unicode equivalent is MLTE, the Multilingual Text Engine. Again, I'm not going to go into the details of MLTE... whoops, oh, it's up there but it's not down here. I'm not going to go into the details of MLTE, but there's a very nice online reference that you can read. A new option that was introduced, I think, in Panther is HITextView, which makes it even easier to use MLTE: it wraps it up in an HIView object so it can be part of an HIView hierarchy. And my monitor picture has disappeared, so it would be nice to get
some support for that. In addition to HITextView, there's also a Unicode version of the edit text control, which gives basically equivalent functionality but supports Unicode. I have here a few API examples just to give you a flavor. You call HITextViewCreate, and that will create a new HITextView for you that wraps up an MLTE object. The nice thing about HITextView is that it's not totally opaque: you can get at the underlying MLTE object so that you can do more advanced operations with it, you can save and open documents, and so forth. You just call HITextViewGetTXNObject to get that out. The Unicode text control is very easy to create: you just call CreateEditUnicodeTextControl. You'll have to forgive me as my head swivels around for a while, as I've lost my monitor here. Maybe I'll move over here so I can see the podium monitor while they're taking care of that.

Okay, another problem. If you are implementing your own text editing engine, or for some other reason you have to handle text input directly, then in the very, very old world you might have called WaitNextEvent, or in the ancient world even GetNextEvent; hopefully nobody's calling that anymore. If you are supporting languages like Japanese or Chinese, hopefully your application is already using TSM, and you were calling NewTSMDocument, specifying kTextServiceDocumentInterfaceType. Unfortunately, that doesn't support Unicode, but there is a new document type, kUnicodeDocumentInterfaceType, that you can call NewTSMDocument with, and that will create a TSM document that supports Unicode. In older versions of the OS that was done with Apple events, but for the last several releases it's been done with Carbon events. You want to avoid the keyboard-class Carbon events, because those are raw keyboard events, and if you look at those, that will be before the input method has a chance to work on them. So you want to look at the text
after the input method has processed it. The two Carbon events for that are kEventTextInputUnicodeForKeyEvent, which is what comes from input methods or keyboard layouts, and kEventTextInputUnicodeText, which is what comes from non-keyboard entry methods such as the Character Palette or Ink. You can handle those pretty much the same way. Now, if you're a TSM-aware application, there are several more Carbon events you have to deal with, but those are the same between Unicode and non-Unicode applications, so we're not going to talk about them today.

Okay, so we know how we're storing our text, we know how we're getting it into and out of our application, and we know how we're drawing and inputting it. But there are also operations on the text itself. Something that's important in a lot of applications is sorting and searching. In the old WorldScript world we only supported sorting, and you would call StringOrder or TextOrder to do a comparison of two strings; of course, that depended on what the current script system was. In the Unicode world there are several APIs available. The easiest one to use is CFStringCompare: you just give it two CFStrings and some options on how you want the strings compared, and it will tell you whether they're the same or one is less than or greater than the other. If you're working with arrays of UniChars, you can call the lower-level API UCCompareText.

Now, if you're going to be doing sorting, you're going to be doing a lot of key comparisons in your sort, and you may be comparing the same key multiple times. There is some overhead involved in doing a language- and Unicode-sensitive comparison, so if you're going to be doing something like sorting a large amount of data, it's more efficient to get something called a collation key. A collation key is a string of bytes that compares in binary the same way that the underlying string would in a language and Unicode
sensitive compare. So what you can do is call the Unicode Utilities' UCGetCollationKey for a given text collator and string of UniChars, and you'll get back a binary key that you can just compare using binary ordering, and that can make your sort go significantly faster. Something that you couldn't do in the WorldScript world, but you can do in the Unicode world, is search for substrings. And again, CFString makes it very easy. There's CFStringFind: you give your target string, a substring that you want to look for in that target string, and search options, and it will find the instances; you can step through them. You can also look for more than just a substring: you can also find instances of characters in a CFCharacterSet. And again, this is just a sample of the APIs that are available. There are a lot more APIs available for sorting and searching, and I urge you to check out the documentation for CFString; it has a lot of capabilities. Sometimes you need to change the case of something, and we had UppercaseText and LowercaseText available in Text Utilities for doing that, but they don't work with Unicode. The modern equivalent for a Unicode application is on CFString, and there's CFStringUppercase, which converts everything to uppercase. You'll notice that it takes two parameters, a string and a locale. The reason for that is that the rules about how to convert uppercase to lowercase or lowercase to uppercase differ a little from language to language. For example, in Turkish the rules are different from English, and so you need to pass a locale if you want the case conversion to be done in a correct, language-sensitive fashion. Something that you can do with CFString that you couldn't do in the Script Manager is capitalize, that is, convert only the first letter of every word to a capital letter. And we talked about CFStringNormalize a little earlier; it can convert a string not just to the precomposed normalization form C, but to any of the four Unicode
normalization forms. So one of the things that you used to have to do in the Script Manager, because there was no Unicode equivalent, was to do transformations on text, such as transliterating to a different script or stripping out accents or diacritics, and this is one of the last pieces that we've come up with a Unicode equivalent for. New in Tiger is an API called CFStringTransform. You pass it a mutable string and an identifier for the kind of transformation you want to perform; you can optionally limit it to a subrange of the string, and you can also specify whether you want the transform to go forward or in reverse. There are several transforms available. This is not a complete list, but one is to strip combining marks; another will transform as much as possible to Latin from arbitrary Unicode scripts, although it doesn't cover all of Unicode. These first two transforms are not reversible, because of a basic property called entropy: once you lose the information, you can't get it back. So once I have it in Latin, I don't know what the original scripts were, so these are irreversible. There are reversible transformations, such as between Latin and Hiragana. So there's an example: we transliterate "konnichiwa" from Latin, from a romanization, to Hiragana, and we can go in either direction because we're specifying which script we're using. There's also a transformation of Unicode ideographic characters, Han characters, according to the Mandarin Pinyin transliteration system. So in this case we have the city name Shanghai written in Han characters, and that's transliterated to Latin. And finally there's the XML hex transliteration, which John demoed earlier, which will take non-ASCII printable characters and convert them to a hex escape sequence. You can apply some of these transformations serially: for example, you could convert to Latin and then call strip diacritics to strip out the diacritics if you don't want them there. And again, this is new in Tiger, and this is in
the WWDC preview release that you've received, so you can experiment with it. There are also other manipulations on strings, just basically moving parts of strings around, and in the WorldScript world we had Munger. Munger just works on bytes, and in addition it requires that your text be in a handle. There are several options available to replace Munger if you're working with Unicode. CFStringReplace is very easy to use: it takes a mutable string, a range of that string that you want to replace, and what to replace it with. Very straightforward. There's also CFStringCreateWithFormat and CFStringAppendFormat, which work a lot like printf, and again those are fully Unicode compatible. There's also CFStringTrim, which will remove a constant string from the beginning or end of a CFString, or a mutable string, that is, and also CFStringTrimWhitespace, which will remove whitespace characters. And if you really need to just move bytes around, then there's the standard C library routine memmove, which handles arbitrary byte moves and deals with issues like overlapping source and destination. If you have an application that displays text in a list or presents text in a fixed-size space, then if you have a string that's too large for that space, or in a list if it's too large for the column, you need to truncate the string, and that needs to be done in a Unicode and language sensitive way. In the Script Manager world we had TruncString and TruncText to do that. There are two ways to do that in the Unicode world. One very nice option, if you're using ATSUI directly, is to use ATSUI's line truncation tag. What that will actually do is truncate the string while it's being drawn, so you don't actually have to modify the string itself in memory. What you can do is tell ATSUI that you need to draw the string in a fixed width, and if you specify the line truncation tag, then if it fits by itself, that's fine; if it's a little too big, ATSUI will try to squish it down a
little bit first so it can draw the whole thing, and if it still doesn't fit, then ATSUI will truncate the string and insert an ellipsis. If you want to actually truncate the data itself, which is the way that TruncString and TruncText worked, then you can call TruncateThemeText, which is a Unicode equivalent. Something that's very important for applications that deal with text is finding appropriate boundaries. We already talked about a cluster boundary, which corresponds to what the user thinks of as a character, but there are other boundaries as well. So let's take a look at this slide. There's an example at the bottom that illustrates line and word breaks, and you'll see that line break and word break are not the same thing, although they're often thought of as being the same thing. So, for example, if I'm doing line breaking, it's acceptable to break after the hyphen, but if I'm doing word breaking, that is, determining what constitutes a word either for double-clicking or for doing whole-word searching, then breaking in the middle of that is not acceptable. So line breaking and word breaking are different. At the moment, the only APIs that are available for doing this kind of breaking operate at the UniChar array level, so that's the Unicode Utilities. The first step is to create a text break locator by calling UCCreateTextBreakLocator, and you specify when you create it which kinds of text boundaries you're interested in, whether it's a cluster boundary or a word boundary or a line boundary. Then you can call UCFindTextBreak to iterate through the breaks in your text, in either a forward or backward direction. If you're interested in cluster boundaries, then as I mentioned earlier in the talk, there's CFStringGetRangeOfComposedCharactersAtIndex, which works at the CFString level, but if you need line or word breaks, then you need to call the Unicode Utilities. OK, the last topic that we're going to cover is dates, times, and numbers.
There are several things you need to be able to do with dates, times, and numbers. One is to convert a date or a time that's in a binary format into a string to display to the end user, or the user might have typed a date or a time into a text entry field, and you need to convert it back to a binary number so you can perform an operation on it. In the old world there were several APIs available for that. I'm not going to read them all off, but they're all deprecated now. In Panther we introduced CFDateFormatter, which is a new set of APIs in Core Foundation that do this in the Unicode world, and so we'll go through a small example here. Again, CFDateFormatter has a fair number of APIs, and we don't have time to go into detail on all of them, so I'll just go through a short example. You can use CFDateFormatterCreateStringWithAbsoluteTime to use a CFDateFormatter and convert a time, a binary number, into a string. If you're going in the other direction, you use CFDateFormatterGetAbsoluteTimeFromString; again you pass a CFDateFormatter and a string, and you'll get back a binary time. Thirdly, CFDateFormatters have properties that you can set on them that control how the formatting is done, and you can use CFDateFormatterSetProperty to set a particular property on the date formatter. So here's a complete example we'll go through. First we create our date formatter. Again we pass NULL to indicate the standard storage allocator for Core Foundation. We need to pass a locale to specify what kind of date formatting we're doing, because the date formatting for, say, US English is very different from that for Japanese or German or Dutch or what have you. So we call CFLocaleCopyCurrent, which gives us back the user's current locale. Now, if you were doing this in a real application, you'd want to save the user's current locale so that you don't keep calling CFLocaleCopyCurrent over and over again, because first of all you get a lot of copies, and second of all
you want to take a snapshot of the user's current locale so that you get consistent results. The other thing we need to specify when we're creating our CFDateFormatter is what style of date and time we want. In this case we're saying we want the long date style and the long time style. The next thing we're going to do, since in this example we're going to convert a date entered by the user into a binary time, is set the lenient property on the date formatter, and we do that by calling CFDateFormatterSetProperty, passing the formatter and the key for the lenient property, and setting it to true. Now, what that's used for: if you don't set this property, then when you try to convert a date or time string to a binary number, CFDateFormatter will try to match it exactly against the template that's used for formatting dates, for converting a date from a binary number to a string, and if it doesn't exactly match that template, the conversion will fail. What the lenient property does is set the date formatter so that it will try as hard as possible to interpret the input string as a date or time, even if it doesn't match the template that it is expecting. So you pretty much always want to set this unless you're doing some kind of validation. The final call we make is CFDateFormatterGetAbsoluteTimeFromString. We pass our CFDateFormatter and the string that's the input; you have the ability to pass some options, but we're passing NULL in this case; and finally you pass a pointer to the CFAbsoluteTime to be filled in. Now, sometimes you have to do operations on dates other than converting them to strings and converting them back from strings to a binary number. Sometimes you need to do calendar operations. An example might be "take this date and add one month" or "take this date and add one year". And so in the Script Manager world there were APIs like ToggleDate and ValidDate and LongDateToSeconds and LongSecondsToDate that
converted between the binary form of time and a structure which specified the year, month, day, etc. separately. The time type for new APIs is CFAbsoluteTime, and for a while there's been a set of APIs for CFAbsoluteTime for doing computations with the Gregorian calendar. I don't know what release they were introduced in, but they've been in for a couple of releases now. But those APIs can't handle non-Gregorian calendars, which we're adding more support for in Tiger, and so we're introducing a new type, CFCalendar. It's a new Core Foundation type, and it's a set of APIs that will work with any kind of calendar to do calendar computations such as toggling dates, validating dates, and getting components of dates. This API did not make the preview release, the WWDC release, but it is something we're working on for Tiger, so I'm just going to tell you a little bit about it today, since you can't work with it yet. CFCalendar can do things like convert a set of calendar values to an absolute time: for example, if you give it a year, a month, and a day, it can convert that to an absolute time. It can also go in the other direction: it can take an absolute time and pick out the calendar components that correspond to it. And finally, you can do toggling operations, such as taking an absolute time and adding a fixed quantity to it, such as a year, a month, or a day. So this is the multi-calendar replacement for the Gregorian calendar APIs that are in there right now, and look for it in a coming Tiger release. Very similar to dates and times, we also need to be able to convert numbers between a binary format and a string that a user can understand, and again that needs to be done in a locale-sensitive way, because different countries have different conventions for the way that numbers are formatted. In the WorldScript world there were APIs available for doing that. In Panther we introduced CFNumberFormatter, which
is the Unicode equivalent, and again we'll go through a short example. CFNumberFormatter has several APIs that we don't have time to go into. You can create a string with a value using CFNumberFormatter, and you just pass the formatter; you have to specify the type of the value, because it could be, say, a floating-point number, a double, a long, what have you, so you need to specify what type it is. You can also go in the other direction: you can take a string and interpret it as a number using CFNumberFormatterGetValueFromString, and again you pass the formatter, the string, and some other options, and you'll get a number out. Finally, you can also set the format that's used for a number formatter. If you create a number formatter with a locale, you'll get the default format for that locale, but number formatters use a formatting string which is very similar to the pattern string that you might see in a spreadsheet program such as Excel, and you can set your own format strings to format numbers in a particular way. You do that by calling CFNumberFormatterSetFormat and passing a string that represents the format pattern to use. So here's an example: we'll format a number. We create our number formatter using, again, the default storage allocator; again we pass a copy of the user's current locale, and again you want to save that away as opposed to getting it every time you make this API call; and in this case we're saying we want a number formatter that uses the currency style, because we're going to be formatting currency. We have a double which stores the currency amount we want to format; it's a floating-point number, 42. We call CFNumberFormatterCreateStringWithValue: again the default storage allocator, we pass the number formatter that we created two lines back, we specify that we're passing a double, and then we pass the address of the variable, and this API will then return a string with that number formatted as currency according to the conventions of the user's current
locale. So that has been our whirlwind tour of the Unicode APIs that are replacements for WorldScript. Again, we did not have time to go into detail on all of them, because there are a lot of APIs out there, but the goal of this presentation was to help you understand how to translate a particular piece of your existing WorldScript application to the Unicode world. So hopefully this presentation gave you the pointers you need to know where to go in the documentation to do that. If you have further questions, the first person you should contact is Xavier Legros, who is the representative for these technologies in Worldwide Developer Relations. You can also contact me, but please do try Xavier first. Rather than give you a long list of URLs to go to for information on Unicode APIs, there's a one-stop-shopping page, and this is the URL. If you go to our Unicode reference library page, you'll find links to all the API sets and all the documentation you need to convert your application to Unicode.
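
The collation-key idea from the sorting part of the talk can be illustrated in portable C. This is only a sketch by analogy, not the Carbon API from the session: the standard C function `strxfrm` stands in for the Unicode Utilities collation key, since it turns a string into a key whose plain `strcmp` order matches the locale-sensitive `strcoll` order. The names `make_sort_key`, `struct keyed`, and `sort_strings` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build a binary sort key for s under the current LC_COLLATE locale.
   The caller owns the result and must free() it. Returns NULL on failure. */
char *make_sort_key(const char *s)
{
    assert(s != NULL);
    size_t n = strxfrm(NULL, s, 0);      /* bytes needed, excluding the NUL */
    char *key = malloc(n + 1);
    if (key != NULL)
        strxfrm(key, s, n + 1);
    return key;
}

/* A string paired with its precomputed key (hypothetical helper type). */
struct keyed {
    const char *text;
    char *key;
};

/* Binary comparison of two precomputed keys; no locale work happens here. */
static int compare_keyed(const void *a, const void *b)
{
    const struct keyed *x = a;
    const struct keyed *y = b;
    return strcmp(x->key, y->key);
}

/* Sort an array of strings in place by collation order.
   Each key is computed once, then reused for every comparison.
   Returns 0 on success, -1 if memory could not be allocated. */
int sort_strings(const char **strings, size_t count)
{
    if (count == 0)
        return 0;

    struct keyed *items = malloc(count * sizeof *items);
    if (items == NULL)
        return -1;

    size_t made = 0;
    int status = 0;
    for (; made < count; made++) {
        items[made].text = strings[made];
        items[made].key = make_sort_key(strings[made]);
        if (items[made].key == NULL) {
            status = -1;
            break;
        }
    }

    if (status == 0) {
        qsort(items, count, sizeof *items, compare_keyed);
        for (size_t i = 0; i < count; i++)
            strings[i] = items[i].text;
    }

    for (size_t i = 0; i < made; i++)
        free(items[i].key);
    free(items);
    return status;
}
```

A program would normally call `setlocale(LC_COLLATE, "")` before sorting so that the keys follow the user's locale; without that call the default "C" locale applies and the keys order like plain byte strings. The design point is the one made in the talk: the expensive, language-sensitive work is done once per string, and the sort itself only performs cheap binary comparisons.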

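For the raw byte case mentioned in the string manipulation part of the talk, where `memmove` is recommended because it copes with overlapping source and destination, a minimal sketch of a replace-in-buffer operation might look like the following. The function name `replace_bytes` and its parameters are hypothetical, and the sketch works on plain bytes only; it knows nothing about Unicode character boundaries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replace old_len bytes at offset in buf with repl_len bytes from repl.
   buf currently holds len valid bytes and has room for cap bytes in total.
   repl must not point into buf. Returns the new length, or (size_t)-1 if
   the arguments are out of range or the result would not fit in cap. */
size_t replace_bytes(char *buf, size_t len, size_t cap,
                     size_t offset, size_t old_len,
                     const char *repl, size_t repl_len)
{
    assert(buf != NULL);
    assert(repl != NULL || repl_len == 0);

    if (len > cap || offset > len || old_len > len - offset)
        return (size_t)-1;
    if (repl_len > cap - (len - old_len))
        return (size_t)-1;

    /* Bytes that follow the replaced range and must be shifted. */
    size_t tail = len - offset - old_len;

    /* The old and new positions of the tail usually overlap,
       which is exactly the case memmove is specified to handle. */
    memmove(buf + offset + repl_len, buf + offset + old_len, tail);

    if (repl_len > 0)
        memcpy(buf + offset, repl, repl_len);

    return len - old_len + repl_len;
}
```

For example, with an 11-byte buffer holding `hello world`, replacing the 5 bytes at offset 0 with the 2 bytes `hi` shifts the 6-byte tail left and yields the 8 bytes `hi world`. The function does not write a terminating NUL; the caller tracks the length, much as the older handle-based routines did.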