Document-term matrix

From Wikipedia, the free encyclopedia

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in each document of a collection. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix is a specific instance of a document-feature matrix, where "features" may refer to other properties of a document besides terms.[1] It is also common to encounter the transpose, or term-document matrix, where documents are the columns and terms are the rows. Both forms are used in natural language processing and computational text analysis.[2] While the value of each cell is commonly the raw count of a given term, there are various schemes for weighting the raw counts, such as relative frequency (proportions) and tf-idf.

Terms are commonly single tokens (unigrams), separated by whitespace or punctuation on either side. In such a case, this is also referred to as a "bag of words" representation, because the counts of individual words are retained but not the order of the words in the document.
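As a small illustration of why this is called a bag of words, the sketch below counts whitespace-separated tokens with Python's `collections.Counter`. The helper name `bag_of_words` and the two example sentences are made up for this illustration, and lowercasing plus whitespace splitting is only one simple tokenization choice among many.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens (a deliberately simple tokenizer)."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# The two sentences differ in meaning, but their unigram counts are identical,
# because the counts keep no information about word order.
same_counts = (a == b)
print(same_counts)
```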

General Concept

When creating a dataset of terms that appear in a corpus of documents, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. Each cell (i, j), then, is the number of times term j occurs in document i. As such, each row is a vector of term counts that represents the content of the document corresponding to that row. For instance, if one has the following two (short) documents:

  • D1 = "I like databases"
  • D2 = "I dislike databases",

then the document-term matrix would be:

      I    like    dislike    databases
D1    1    1       0          1
D2    1    0       1          1

which shows which documents contain which terms and how many times they appear. Note that, unlike representing a document as just a token-count list, the document-term matrix includes all terms in the corpus (i.e. the corpus vocabulary), which is why there are zero-counts for terms in the corpus which do not also occur in a specific document.
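The construction described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `document_term_matrix` is hypothetical, tokens are obtained by splitting on whitespace only, and columns are ordered by first appearance in the corpus, so the column order differs from the table above.

```python
def document_term_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term vocab[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]

    # Corpus vocabulary, in order of first appearance.
    vocab = []
    seen = set()
    for tokens in tokenized:
        for tok in tokens:
            if tok not in seen:
                seen.add(tok)
                vocab.append(tok)

    index = {term: j for j, term in enumerate(vocab)}

    # Every row has one cell per vocabulary term, so absent terms stay at zero.
    matrix = [[0] * len(vocab) for _ in tokenized]
    for i, tokens in enumerate(tokenized):
        for tok in tokens:
            matrix[i][index[tok]] += 1
    return vocab, matrix

docs = ["I like databases", "I dislike databases"]
vocab, dtm = document_term_matrix(docs)

# The term-document matrix is simply the transpose.
tdm = [list(col) for col in zip(*dtm)]

print(vocab)
for row in dtm:
    print(row)
```

Storing every zero explicitly is only practical for small corpora; real vocabularies are large and most cells are zero, which is why implementations typically use sparse representations.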

Because token frequencies in nearly every corpus follow a power-law distribution (see Zipf's law), it is common to weight the counts. This can be as simple as dividing counts by the total number of tokens in a document (called relative frequency or proportions), dividing by the maximum frequency in each document (called prop max), or taking the log of frequencies (called log count). If one wants to give more weight to the words most distinctive of an individual document as compared to the corpus as a whole, it is common to use tf-idf, which multiplies the term frequency by the inverse document frequency, typically the logarithm of the total number of documents divided by the number of documents that contain the term.
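The weighting schemes above can be sketched as follows. The function names are hypothetical, and two details are assumptions not fixed by the text: log count is taken as log(1 + count) so that zero counts stay at zero, and tf-idf uses the unsmoothed form count × log(N / df). Libraries often use smoothed or normalized variants.

```python
import math

def relative_frequency(row):
    """Divide each count by the total number of tokens in the document."""
    total = sum(row)
    return [c / total if total else 0.0 for c in row]

def prop_max(row):
    """Divide each count by the largest count in the document."""
    peak = max(row, default=0)
    return [c / peak if peak else 0.0 for c in row]

def log_count(row):
    """log(1 + count): an assumed variant that maps a zero count to zero."""
    return [math.log1p(c) for c in row]

def tf_idf(matrix):
    """Weight counts by log(N / df), where df is the term's document frequency."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]

# Columns: I, like, dislike, databases (the two example documents above).
dtm = [[1, 1, 0, 1],
       [1, 0, 1, 1]]

weighted = tf_idf(dtm)
for row in weighted:
    print(row)
```

In this toy corpus, "I" and "databases" occur in both documents, so their inverse document frequency is log(2/2) = 0 and only "like" and "dislike" keep a non-zero weight.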

Choice of Terms

One way to view the matrix is that each row represents a document. In the vectorial semantic model, which is normally the one used to compute a document-term matrix, the goal is to represent the topic of a document by the frequency of semantically significant terms. The terms are semantic units of the documents. It is often assumed, for Indo-European languages, that nouns, verbs and adjectives are the more significant categories, and that words from those categories should be kept as terms. Adding collocations as terms improves the quality of the vectors, especially when computing similarities between documents.
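One simple way to add collocations as terms is sketched below. It treats any adjacent word pair that occurs at least `min_count` times in the corpus as a collocation and adds it as an extra term next to the unigrams. The function name, the threshold, and the example sentences are all assumptions made for illustration, and a raw frequency cutoff is only a crude stand-in for proper collocation detection, which usually relies on a statistical association measure.

```python
from collections import Counter

def terms_with_collocations(docs, min_count=2):
    """Return one Counter per document over unigrams plus frequent bigrams."""
    tokenized = [doc.lower().split() for doc in docs]

    # Count adjacent word pairs over the whole corpus.
    bigram_counts = Counter(
        pair for tokens in tokenized for pair in zip(tokens, tokens[1:])
    )
    frequent = {pair for pair, n in bigram_counts.items() if n >= min_count}

    result = []
    for tokens in tokenized:
        terms = list(tokens)
        # Add each frequent bigram as a single joined term, e.g. "new_york".
        terms += ["_".join(pair) for pair in zip(tokens, tokens[1:]) if pair in frequent]
        result.append(Counter(terms))
    return result

docs = [
    "new york has tall buildings",
    "she moved to new york",
    "a new car in york",
]
counts = terms_with_collocations(docs)
for c in counts:
    print(c["new"], c["york"], c["new_york"])
```

All three documents contain the unigrams "new" and "york", but only the first two contain the term "new_york", so the extra column separates them from the third document.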

Applications

Improving search results

Latent semantic analysis (LSA, performing singular-value decomposition on the document-term matrix) can improve search results by disambiguating polysemous words and searching for synonyms of the query. However, searching in the high-dimensional continuous space is much slower than searching the standard trie data structure of search engines.
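The synonym effect can be illustrated with a small sketch that uses NumPy's `numpy.linalg.svd`. The toy corpus, the choice of k = 2 latent dimensions, and the way the query is projected (multiplying by the top right singular vectors) are assumptions made for this example; practical LSA pipelines usually weight the matrix first and use far more dimensions.

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal", "garden"]
# Rows are documents, columns follow `terms` (toy counts made up for illustration).
A = np.array([
    [1, 0, 1, 0, 0, 0],  # d0: car engine
    [0, 1, 1, 0, 0, 0],  # d1: automobile engine
    [1, 1, 0, 0, 0, 0],  # d2: car automobile
    [0, 0, 0, 1, 1, 0],  # d3: flower petal
    [0, 0, 0, 1, 1, 1],  # d4: flower petal garden
], dtype=float)

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Truncated SVD: keep the k largest singular values and their vectors.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_basis = Vt[:k].T            # shape (n_terms, k)

doc_vecs = A @ term_basis        # documents in the k-dimensional latent space
query = np.array([1, 0, 0, 0, 0, 0], dtype=float)  # the one-word query "car"
query_vec = query @ term_basis   # the query projected the same way

raw_sim = cosine(query, A[1])             # d1 does not contain "car"
lsa_sim = cosine(query_vec, doc_vecs[1])  # similarity in the latent space
unrelated_sim = cosine(query_vec, doc_vecs[3])

print(raw_sim, lsa_sim, unrelated_sim)
```

In the raw term space the query "car" has no overlap with the document "automobile engine", whereas in the latent space both load on the same dimension because "car" and "automobile" co-occur with "engine" and with each other elsewhere in the corpus.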

Finding topics

Multivariate analysis of the document-term matrix can reveal topics/themes of the corpus. Specifically, latent semantic analysis and data clustering can be used, and more recently probabilistic latent semantic analysis and non-negative matrix factorization have been found to perform well for this task.
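As a rough sketch of one of these methods, the code below factorizes a document-term matrix X into two non-negative matrices, W (documents × topics) and H (topics × terms), using multiplicative updates that minimize the squared reconstruction error. The function name `nmf`, the toy matrix, the number of topics, the iteration count, and the random initialization are all assumptions for illustration; library implementations add better initialization, stopping criteria, and regularization.

```python
import numpy as np

def nmf(X, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Non-negative matrix factorization X ~ W @ H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, n_topics))   # document-topic weights
    H = rng.random((n_topics, n_terms))  # topic-term weights
    for _ in range(n_iter):
        # eps avoids division by zero; the updates keep W and H non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Columns: car, automobile, engine, flower, petal, garden (toy counts).
X = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
], dtype=float)

W, H = nmf(X, n_topics=2)
error = np.linalg.norm(X - W @ H)

# Each row of H can be read as a topic: larger entries mark its characteristic terms.
print(np.round(H, 2))
print(np.round(W, 2))
```

The rows of W then indicate how strongly each document is associated with each topic, which is what makes the factorization usable for grouping documents by theme.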

Implementations

  • Gensim: Open-source Python framework for vector space modelling. Contains memory-efficient algorithms for constructing term-document matrices from text, plus common transformations (tf-idf, LSA, LDA).

References

  1. "Document-feature matrix :: Tutorials for quanteda". tutorials.quanteda.io. Retrieved 2021-01-02.
  2. "15 Ways to Create a Document-Term Matrix in R". Dustin S. Stoltz. Retrieved 2021-01-02.