Tokenizing Natural Language Text

Enumerate the words in a string.

Overview

When you work with natural language text, it’s often useful to tokenize the text into individual words. Using NSLinguisticTagger to enumerate words, rather than simply splitting the string on whitespace, ensures correct behavior across multiple scripts and languages. For example, neither Chinese nor Japanese uses spaces to delimit words.
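To make the limitation of whitespace splitting concrete, here is a minimal sketch. The sample sentence and the names `sample` and `pieces` are illustrative assumptions, not part of the original example.

```swift
import Foundation

// Hypothetical sample: a Japanese sentence, written without spaces between words.
let sample = "私は学生です"

// Splitting on whitespace finds no delimiter in this sentence,
// so the whole sentence comes back as a single component.
let pieces = sample.components(separatedBy: .whitespacesAndNewlines)
print(pieces.count) // prints 1
```

Because the sentence contains no whitespace, a whitespace-based split cannot recover its individual words, which is the case a linguistic tagger is meant to handle.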

The example and accompanying steps below show how you use NSLinguisticTagger to enumerate over the words in natural language text.

import Foundation

let text = """
All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
"""

// Create a tagger that uses the tokenType scheme.
let tagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)
tagger.string = text

// NSRange counts UTF-16 code units, so measure the text with utf16.count.
let range = NSRange(location: 0, length: text.utf16.count)
// Skip punctuation and whitespace so that only words are enumerated.
let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace]
tagger.enumerateTags(in: range, unit: .word, scheme: .tokenType, options: options) { _, tokenRange, _ in
    // tokenRange is an NSRange, so bridge to NSString to take the substring.
    let word = (text as NSString).substring(with: tokenRange)
    print(word)
}

1. Create an instance of NSLinguisticTagger, specifying tokenType as the tag scheme to be used.
2. Set the string property of the linguistic tagger to the natural language text.
3. Enumerate over the entire range of the string by calling the enumerateTags(in:unit:scheme:options:using:) method, specifying NSLinguisticTaggerUnit.word as the tag unit and tokenType as the tag scheme to enumerate, and omitting any punctuation or whitespace.
4. In the enumeration block, take a substring of the original text at tokenRange to obtain each word.
5. Run this code to print out each word in text on a new line.
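If you need the words as a collection rather than printed output, the same steps can be wrapped in a function. The following is a sketch only: `words(in:)` is a hypothetical helper name, not a Foundation API, and it assumes an Apple platform where NSLinguisticTagger is available.

```swift
import Foundation

// Hypothetical helper (not part of Foundation): returns the words of `text` in order.
func words(in text: String) -> [String] {
    let tagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)
    tagger.string = text

    // NSRange is measured in UTF-16 code units.
    let range = NSRange(location: 0, length: text.utf16.count)
    let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace]

    var result: [String] = []
    tagger.enumerateTags(in: range, unit: .word, scheme: .tokenType, options: options) { _, tokenRange, _ in
        // Collect each token instead of printing it.
        result.append((text as NSString).substring(with: tokenRange))
    }
    return result
}

let tokens = words(in: "All human beings are born free and equal.")
print(tokens)
```

Returning an array keeps the enumeration details in one place, so callers can count, filter, or index the words without touching the tagger directly.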
