CFStringTokenizer
Declaration
class CFStringTokenizer
Overview
CFStringTokenizer allows you to tokenize strings into words, sentences, or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words with spaces, and it can de-compound German compounds. You can obtain a Latin transcription for tokens. It also provides a language identification API.
You can use a CFStringTokenizer to break a string into tokens (substrings) on the basis of words, sentences, or paragraphs. When you create a tokenizer, you can supply options that further modify the tokenization; see Tokenization Modifiers.
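As an illustration of word-level tokenization, the following is a minimal sketch rather than a definitive implementation. The helper name words(in:) is hypothetical, and the sketch assumes the current locale is acceptable for the text being tokenized.

```swift
import Foundation

/// Splits `text` into word tokens with CFStringTokenizer.
/// `words(in:)` is a hypothetical helper, not part of Core Foundation.
func words(in text: String) -> [String] {
    let cfText = text as CFString
    let fullRange = CFRangeMake(0, CFStringGetLength(cfText))

    // kCFStringTokenizerUnitWord asks for word-level tokens.
    // Assumption: the current locale suits the text.
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, fullRange,
        kCFStringTokenizerUnitWord, CFLocaleCopyCurrent()
    ) else { return [] }

    var result: [String] = []
    // A token type with raw value 0 means no further token was found.
    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        // Token ranges use UTF-16 offsets, which match NSString indexing.
        let nsRange = NSRange(location: range.location, length: range.length)
        result.append((text as NSString).substring(with: nsRange))
    }
    return result
}
```

Passing kCFStringTokenizerUnitSentence or kCFStringTokenizerUnitParagraph instead of kCFStringTokenizerUnitWord should yield sentence or paragraph tokens with the same loop.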
In addition, with CFStringTokenizer:
- You can de-compound German compounds
- You can identify the language used in a string (using CFStringTokenizerCopyBestStringLanguage(_:_:))
- You can obtain a Latin transcription for tokens
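Two of the capabilities listed above can be sketched as follows. The helper names bestLanguage(of:) and latinTranscriptions(of:) are hypothetical, and the use of the current locale is an assumption. Language identification is a guess and may return nil or an inaccurate result for short strings.

```swift
import Foundation

/// Guesses the language of `text`, returning a language code or nil.
/// `bestLanguage(of:)` is a hypothetical helper.
func bestLanguage(of text: String) -> String? {
    let cfText = text as CFString
    let range = CFRangeMake(0, CFStringGetLength(cfText))
    guard let language = CFStringTokenizerCopyBestStringLanguage(cfText, range) else {
        return nil
    }
    return language as String
}

/// Collects the Latin transcription of each word token, where one is available.
/// `latinTranscriptions(of:)` is a hypothetical helper.
func latinTranscriptions(of text: String) -> [String] {
    let cfText = text as CFString
    let range = CFRangeMake(0, CFStringGetLength(cfText))
    // Assumption: the current locale suits the text.
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, range,
        kCFStringTokenizerUnitWord, CFLocaleCopyCurrent()
    ) else { return [] }

    var result: [String] = []
    while CFStringTokenizerAdvanceToNextToken(tokenizer).rawValue != 0 {
        // The attribute comes back as an untyped object; it is a string when present.
        let attribute = CFStringTokenizerCopyCurrentTokenAttribute(
            tokenizer, kCFStringTokenizerAttributeLatinTranscription)
        if let latin = attribute as? String {
            result.append(latin)
        }
    }
    return result
}
```

Tokens without a transcription are skipped here, so the returned array may be shorter than the number of tokens in the string.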
To find the token that contains the character at a given index and set it as the current token, call CFStringTokenizerGoToTokenAtIndex(_:_:). To advance to the next token and set it as the current token, call CFStringTokenizerAdvanceToNextToken(_:). To get the range of the current token, call CFStringTokenizerGetCurrentTokenRange(_:). Use CFStringTokenizerCopyCurrentTokenAttribute(_:_:) to get an attribute of the current token. If the current token is a compound, call CFStringTokenizerGetCurrentSubTokens(_:_:_:_:) to retrieve the subtokens or derived subtokens contained in the compound token. To guess the language of a string, call CFStringTokenizerCopyBestStringLanguage(_:_:).
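The index-based lookup described above might look like the following minimal sketch. The helper name word(atUTF16Offset:in:) is hypothetical, and the sketch assumes the index is a UTF-16 offset into the string and that the current locale is acceptable.

```swift
import Foundation

/// Returns the word token containing the character at `offset`, or nil if none is found.
/// `word(atUTF16Offset:in:)` is a hypothetical helper.
func word(atUTF16Offset offset: Int, in text: String) -> String? {
    let cfText = text as CFString
    let length = CFStringGetLength(cfText)
    guard offset >= 0, offset < length else { return nil }

    // Assumption: the current locale suits the text.
    guard let tokenizer = CFStringTokenizerCreate(
        kCFAllocatorDefault, cfText, CFRangeMake(0, length),
        kCFStringTokenizerUnitWord, CFLocaleCopyCurrent()
    ) else { return nil }

    // Make the token containing `offset` the current token.
    let tokenType = CFStringTokenizerGoToTokenAtIndex(tokenizer, offset)
    // A raw value of 0 means no token was found at that index.
    guard tokenType.rawValue != 0 else { return nil }

    let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
    let nsRange = NSRange(location: range.location, length: range.length)
    return (text as NSString).substring(with: nsRange)
}
```

After a successful lookup, calling CFStringTokenizerAdvanceToNextToken(_:) on the same tokenizer continues from the token that was found.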
Topics
Creating a Tokenizer
Setting the String
Changing the Location
Getting Information About the Current Token
- CFStringTokenizerCopyCurrentTokenAttribute(_:_:)
- CFStringTokenizerGetCurrentTokenRange(_:)
- CFStringTokenizerGetCurrentSubTokens(_:_:_:_:)
Identifying a Language
Getting the CFStringTokenizer Type ID
Constants
See Also
Related Documentation
- String Programming Guide for Core Foundation