CFStringTokenizer

Declaration

class CFStringTokenizer

Overview

CFStringTokenizer tokenizes strings into words, sentences, or paragraphs in a language-neutral way. It supports languages such as Japanese and Chinese that do not delimit words with spaces, and it can de-compound German compounds. It can also provide a Latin transcription of each token, and it includes a language identification API.

You can use a CFStringTokenizer to break a string into tokens (substrings) on the basis of words, sentences, or paragraphs. When you create a tokenizer, you can supply options that further modify the tokenization (see Tokenization Modifiers).
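As a rough illustration of word tokenization, the following Swift sketch creates a tokenizer over a whole string and collects each token. The helper name `wordTokens(in:)` is hypothetical and not part of the API, and passing the current locale is an assumption made for the example.

```swift
import Foundation

// Hypothetical helper (not part of CFStringTokenizer): returns the word tokens of `text`.
func wordTokens(in text: String) -> [String] {
    let cfText = text as CFString
    let fullRange = CFRangeMake(0, CFStringGetLength(cfText))

    // Assumption: tokenize by word using the user's current locale.
    guard let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
                                                  cfText,
                                                  fullRange,
                                                  kCFStringTokenizerUnitWord,
                                                  CFLocaleCopyCurrent()) else {
        return []
    }

    var tokens: [String] = []
    // An empty token type means there are no more tokens.
    while !CFStringTokenizerAdvanceToNextToken(tokenizer).isEmpty {
        let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
        if let token = CFStringCreateWithSubstring(kCFAllocatorDefault, cfText, range) {
            tokens.append(token as String)
        }
    }
    return tokens
}
```

Replacing `kCFStringTokenizerUnitWord` with `kCFStringTokenizerUnitSentence` or `kCFStringTokenizerUnitParagraph` should switch the same loop to sentence or paragraph tokens.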

In addition, with CFStringTokenizer:

You can de-compound German compounds
You can identify the language used in a string (using CFStringTokenizerCopyBestStringLanguage(_:_:))
You can obtain Latin transcription for tokens
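Language identification does not require a tokenizer instance. The sketch below wraps CFStringTokenizerCopyBestStringLanguage(_:_:) in a hypothetical helper named `guessLanguage(of:)`. The result is a best guess and tends to be more reliable for longer input, so the example treats it as optional.

```swift
import Foundation

// Hypothetical helper (not part of CFStringTokenizer): returns a language
// code such as "en" or "ja", or nil if no language could be guessed.
func guessLanguage(of text: String) -> String? {
    let cfText = text as CFString
    let fullRange = CFRangeMake(0, CFStringGetLength(cfText))

    // The function returns NULL (nil in Swift) when it cannot identify a language.
    guard let language = CFStringTokenizerCopyBestStringLanguage(cfText, fullRange) else {
        return nil
    }
    return language as String
}
```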

To find the token that includes the character at a given character index and make it the current token, call CFStringTokenizerGoToTokenAtIndex(_:_:). To advance to the next token and make it the current token, call CFStringTokenizerAdvanceToNextToken(_:). To get the range of the current token, call CFStringTokenizerGetCurrentTokenRange(_:). To get an attribute of the current token, call CFStringTokenizerCopyCurrentTokenAttribute(_:_:). If the current token is a compound, call CFStringTokenizerGetCurrentSubTokens(_:_:_:_:) to retrieve the subtokens or derived subtokens it contains. To guess the language of a string, call CFStringTokenizerCopyBestStringLanguage(_:_:).
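The calls described above can be combined as in the following sketch, which jumps to the token containing a given index and then reads its range and Latin transcription. The helper name `token(at:in:)` and its tuple return type are hypothetical. The index is assumed to be a UTF-16 offset into the string, and the transcription may be nil when none is available.

```swift
import Foundation

// Hypothetical helper (not part of CFStringTokenizer): returns the word token
// containing the UTF-16 offset `index`, plus its Latin transcription if any.
func token(at index: Int, in text: String) -> (token: String, latin: String?)? {
    let cfText = text as CFString
    let length = CFStringGetLength(cfText)
    guard index >= 0, index < length else { return nil }

    // Assumption: word units and the current locale, chosen for illustration.
    guard let tokenizer = CFStringTokenizerCreate(kCFAllocatorDefault,
                                                  cfText,
                                                  CFRangeMake(0, length),
                                                  kCFStringTokenizerUnitWord,
                                                  CFLocaleCopyCurrent()) else {
        return nil
    }

    // An empty token type means no token contains the index (for example, whitespace).
    guard !CFStringTokenizerGoToTokenAtIndex(tokenizer, index).isEmpty else {
        return nil
    }

    let range = CFStringTokenizerGetCurrentTokenRange(tokenizer)
    guard let token = CFStringCreateWithSubstring(kCFAllocatorDefault, cfText, range) else {
        return nil
    }

    // The attribute is returned as a CFTypeRef; it is a string when present.
    let latin = CFStringTokenizerCopyCurrentTokenAttribute(
        tokenizer, kCFStringTokenizerAttributeLatinTranscription) as? String

    return (token as String, latin)
}
```

After a successful call to CFStringTokenizerGoToTokenAtIndex(_:_:), CFStringTokenizerAdvanceToNextToken(_:) can be used to continue from that token onward.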

Topics

Creating a Tokenizer

CFStringTokenizerCreate(_:_:_:_:_:)

Setting the String

CFStringTokenizerSetString(_:_:_:)

Changing the Location

CFStringTokenizerGoToTokenAtIndex(_:_:)
CFStringTokenizerAdvanceToNextToken(_:)

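Getting Information About the Current Token

CFStringTokenizerGetCurrentTokenRange(_:)
CFStringTokenizerCopyCurrentTokenAttribute(_:_:)
CFStringTokenizerGetCurrentSubTokens(_:_:_:_:)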

Identifying a Language

CFStringTokenizerCopyBestStringLanguage(_:_:)

Getting the CFStringTokenizer Type ID

CFStringTokenizerGetTypeID()
