Tokenization

From Wikipedia, the free encyclopedia

Tokenization is the process of breaking up a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes the input for further processing such as parsing or text mining. Tokenization is useful both in linguistics, where it is a form of text segmentation, and in computer science, where it forms part of lexical analysis.

Methods and obstacles

Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often a tokenizer relies on simple heuristics, for example:

  • Each contiguous string of alphabetic characters forms one token; likewise for each contiguous string of digits.
  • Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.
  • Punctuation and whitespace may or may not be included in the resulting list of tokens.
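
The heuristics above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name tokenize, the keep_punctuation parameter, and the exact regular expression are assumptions made for this example. It treats each run of letters and each run of digits as one token, uses whitespace only as a separator, and optionally keeps each punctuation character as a token of its own.

```python
import re

# Assumed pattern for this sketch: a run of letters, a run of digits,
# a single punctuation character, or an underscore (treated as punctuation).
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def tokenize(text, keep_punctuation=True):
    """Split text into tokens using the simple heuristics described above."""
    tokens = TOKEN_RE.findall(text)
    if keep_punctuation:
        return tokens
    # Drop tokens that do not start with a letter or digit.
    return [t for t in tokens if t[0].isalnum()]


print(tokenize("Hello, world! It costs 42 dollars."))
print(tokenize("Hello, world! It costs 42 dollars.", keep_punctuation=False))
```

Whitespace is never emitted here; a tokenizer that needs to reconstruct the original text exactly would have to keep it as well.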

In languages such as English (and most programming languages), where words are delimited by whitespace, this approach is straightforward. However, tokenization is more difficult for languages such as Chinese, which do not mark word boundaries with whitespace.[1] Simple whitespace-delimited tokenization also presents difficulties when a collocation such as New York should be treated as one token. Ways to address this problem include developing more complex heuristics, querying a table of common collocations, and fitting the tokens to a language model that identifies collocations in a later processing step.
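
As an illustration of the table-lookup approach, the following sketch merges adjacent tokens that appear in a table of collocations, preferring the longest match. The function name merge_collocations, the max_len parameter, and the small COLLOCATIONS table are hypothetical and exist only for this example; a practical system would use a much larger table or a statistical model.

```python
# Hypothetical collocation table for this sketch; each entry is a tuple of tokens.
COLLOCATIONS = {
    ("New", "York"),
    ("New", "York", "City"),
}


def merge_collocations(tokens, collocations, max_len=3):
    """Join runs of tokens that appear in the collocation table."""
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so that "New York City"
        # is preferred over "New York".
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in collocations:
                merged.append(" ".join(candidate))
                i += n
                break
        else:
            # No collocation starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_collocations(["I", "love", "New", "York", "City", "pizza"], COLLOCATIONS))
```

The lookup is exact and case-sensitive, so "new york" would not be merged unless it is also listed in the table.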

References