Tokenizing Natural Language Text
Enumerate the words in a string.
Overview
When you work with natural language text, it’s often useful to tokenize the text into individual words. Using NSLinguisticTagger to enumerate words, rather than simply splitting the string on whitespace, ensures correct behavior across multiple scripts and languages. For example, neither Chinese nor Japanese uses spaces to delimit words.
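As a rough illustration of why whitespace splitting falls short, the sketch below compares the two approaches on a Japanese sentence. The sentence and the variable names are assumptions chosen for illustration, and the exact word boundaries the tagger reports may vary by OS version.

```swift
import Foundation

// Illustrative sentence (an assumption, not part of the original example).
let sentence = "すべての人間は生まれながらにして自由である"

// Naive approach: split on whitespace. The sentence contains no spaces,
// so the whole sentence comes back as a single component.
let pieces = sentence.components(separatedBy: .whitespaces)
print(pieces.count)

// Tagger-based approach: ask for word-level tokens instead.
let tagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)
tagger.string = sentence
let fullRange = NSRange(location: 0, length: sentence.utf16.count)
var words: [String] = []
tagger.enumerateTags(in: fullRange, unit: .word, scheme: .tokenType,
                     options: [.omitPunctuation, .omitWhitespace]) { _, tokenRange, _ in
    words.append((sentence as NSString).substring(with: tokenRange))
}
print(words)
```

The whitespace split cannot see any word boundaries here, while the tagger relies on linguistic analysis rather than on spaces.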
The example and accompanying steps below show how you use NSLinguisticTagger to enumerate over the words in natural language text.
import Foundation

let text = """
All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
"""
let tagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)
tagger.string = text
// NSRange counts UTF-16 code units, so measure the string in UTF-16.
let range = NSRange(location: 0, length: text.utf16.count)
let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace]
tagger.enumerateTags(in: range, unit: .word, scheme: .tokenType, options: options) { _, tokenRange, _ in
    // Extract the word at the range reported for this token.
    let word = (text as NSString).substring(with: tokenRange)
    print(word)
}

Create an instance of NSLinguisticTagger, specifying tokenType as the tag scheme to be used.
Set the string property of the linguistic tagger to the natural language text.
Enumerate over the entire range of the string by calling the enumerateTags(in:unit:scheme:options:using:) method, specifying NSLinguisticTaggerUnit.word as the tag unit and tokenType as the tag scheme to enumerate, and omitting any punctuation or whitespace.
In the enumeration block, take a substring of the original text at tokenRange to obtain each word.
Run this code to print out each word in text on a new line.
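The steps above can also be wrapped in a small reusable function that collects the words into an array instead of printing them. This is a minimal sketch: words(in:) is a hypothetical helper name, not part of NSLinguisticTagger, and it converts each NSRange to a Swift range with Range(_:in:) rather than bridging to NSString.

```swift
import Foundation

// `words(in:)` is a hypothetical helper name used only for illustration.
func words(in text: String) -> [String] {
    let tagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)
    tagger.string = text
    let range = NSRange(location: 0, length: text.utf16.count)
    let options: NSLinguisticTagger.Options = [.omitPunctuation, .omitWhitespace]
    var result: [String] = []
    tagger.enumerateTags(in: range, unit: .word, scheme: .tokenType, options: options) { _, tokenRange, _ in
        // Convert the NSRange into a Swift range; skip the token if conversion fails.
        if let swiftRange = Range(tokenRange, in: text) {
            result.append(String(text[swiftRange]))
        }
    }
    return result
}

print(words(in: "Hello, world!"))
```

Returning an array keeps the tokenization separate from whatever is done with the words afterwards, such as counting or filtering them.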