Language model

From Wikipedia, the free encyclopedia

A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability $P(w_1, \ldots, w_m)$ to the whole sequence.

The language model provides context to distinguish between words and phrases that sound similar. For example, in American English, the phrases "recognize speech" and "wreck a nice beach" sound similar, but mean different things.

Data sparsity is a major problem in building language models: most possible word sequences are not observed in training. One solution is to assume that the probability of a word depends only on the previous n − 1 words. This is known as an n-gram model, or a unigram model when n = 1. The unigram model is also known as the bag-of-words model.

Estimating the relative likelihood of different phrases is useful in many natural language processing applications, especially those that generate text as an output. Language modeling is used in speech recognition,[1] machine translation,[2] part-of-speech tagging, parsing,[2] optical character recognition, handwriting recognition,[3] information retrieval and other applications.

In speech recognition, sounds are matched with word sequences. Ambiguities are easier to resolve when evidence from the language model is integrated with a pronunciation model and an acoustic model.

Language models are used in information retrieval in the query likelihood model. There, a separate language model is associated with each document in a collection. Documents are ranked based on the probability of the query $Q$ in the document's language model $M_d$: $P(Q \mid M_d)$. Commonly, the unigram language model is used for this purpose.

Model types


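A unigram model can be treated as the combination of several one-state finite automata.[4] It splits the probabilities of different terms in a context, e.g. from

$$P(t_1 t_2 t_3) = P(t_1) P(t_2 \mid t_1) P(t_3 \mid t_1 t_2)$$

to

$$P_{\text{uni}}(t_1 t_2 t_3) = P(t_1) P(t_2) P(t_3).$$
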
In this model, the probability of each word only depends on that word's own probability in the document, so we only have one-state finite automata as units. The automaton itself has a probability distribution over the entire vocabulary of the model, summing to 1. The following is an illustration of a unigram model of a document.

Terms    Probability in doc
a        0.1
world    0.2
likes    0.05
we       0.05
share    0.3
...      ...

The probability generated for a specific query is calculated as

$$P(\text{query}) = \prod_{\text{term} \in \text{query}} P(\text{term})$$

Different documents have different unigram models, each assigning its own hit probabilities to words. The probability distributions from the different documents are used to generate a hit probability for each query, and documents can then be ranked for a query according to these probabilities. Example of unigram models of two documents:

Terms    Probability in Doc1    Probability in Doc2
a        0.1                    0.3
world    0.2                    0.1
likes    0.05                   0.03
we       0.05                   0.02
share    0.3                    0.2
...      ...                    ...
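
As a rough illustration of ranking by query likelihood, the following Python sketch scores a two-word query against the two unigram models in the table. The function names and the example query are assumptions made for this sketch, and terms hidden behind the "..." rows are simply left out.

```python
from math import prod

# Unigram probabilities copied from the two-document table.
# Terms elided by "..." in the table are not modeled here.
DOC_MODELS = {
    "Doc1": {"a": 0.1, "world": 0.2, "likes": 0.05, "we": 0.05, "share": 0.3},
    "Doc2": {"a": 0.3, "world": 0.1, "likes": 0.03, "we": 0.02, "share": 0.2},
}


def query_likelihood(query_terms, model):
    """P(query) under a unigram model: the product of the term probabilities."""
    # A term missing from the model gets probability 0 (no smoothing here).
    return prod(model.get(term, 0.0) for term in query_terms)


def rank_documents(query_terms, doc_models):
    """Return (document, score) pairs sorted by descending query likelihood."""
    scores = {name: query_likelihood(query_terms, m) for name, m in doc_models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical query, chosen only for illustration.
ranking = rank_documents(["we", "share"], DOC_MODELS)
print(ranking)
```

For this query, Doc1 scores 0.05 × 0.3 and Doc2 scores 0.02 × 0.2, so Doc1 is ranked first.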

In information retrieval contexts, unigram language models are often smoothed to avoid instances where P(term) = 0. A common approach is to generate a maximum-likelihood model for the entire collection and linearly interpolate the collection model with a maximum-likelihood model for each document to smooth the model.[5]
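
A minimal sketch of this linear interpolation is given below. The toy documents, the helper names, and the mixing weight `lam = 0.5` are all assumptions for illustration; in practice the weight would be tuned.

```python
from collections import Counter


def mle_unigram(tokens):
    """Maximum-likelihood unigram model: relative frequency of each term."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def interpolated_prob(term, doc_model, collection_model, lam=0.5):
    """Linearly interpolate the document model with the collection model.

    lam is a hypothetical mixing weight chosen for illustration.
    """
    return lam * doc_model.get(term, 0.0) + (1.0 - lam) * collection_model.get(term, 0.0)


# Toy collection of two tokenized documents (made up for this example).
docs = {
    "d1": ["we", "share", "a", "world"],
    "d2": ["a", "world", "likes", "a", "world"],
}
collection_model = mle_unigram([t for tokens in docs.values() for t in tokens])
d1_model = mle_unigram(docs["d1"])

# "likes" never occurs in d1, yet its smoothed probability is non-zero.
p_likes = interpolated_prob("likes", d1_model, collection_model)
print(p_likes)
```

Because both component models sum to 1 over the vocabulary, the interpolated model does as well.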


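In an n-gram model, the probability $P(w_1, \ldots, w_m)$ of observing the sentence $w_1, \ldots, w_m$ is approximated as

$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$$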

It is assumed that the probability of observing the i-th word $w_i$ in the context history of the preceding i − 1 words can be approximated by the probability of observing it in the shortened context history of the preceding n − 1 words ((n − 1)th-order Markov property).

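The conditional probability can be calculated from n-gram model frequency counts:

$$P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1})}$$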
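
The count-based estimate can be sketched in Python as follows. The corpus and function names are made up for illustration, and the sketch assumes one particular convention: a history is counted only where it is followed by another word, so that the conditional probabilities for each history sum to 1.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count n-grams and their (n-1)-word histories in tokenized sentences."""
    ngrams, histories = Counter(), Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngrams[gram] += 1
            # The history is counted as the prefix of an observed n-gram.
            histories[gram[:-1]] += 1
    return ngrams, histories


def mle_prob(word, history, ngrams, histories):
    """count(history + word) / count(history); 0.0 if the history is unseen."""
    history = tuple(history)
    if histories[history] == 0:
        return 0.0
    return ngrams[history + (word,)] / histories[history]


# Toy corpus, made up for illustration.
corpus = [
    ["i", "saw", "the", "red", "house"],
    ["i", "saw", "the", "dog"],
    ["the", "red", "car"],
]
bigrams, bigram_histories = ngram_counts(corpus, n=2)
print(mle_prob("red", ["the"], bigrams, bigram_histories))
```

Here "the" is followed by another word three times, twice by "red", so the estimate for P(red | the) is 2/3, while any bigram absent from the corpus gets probability 0.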

The terms bigram and trigram language models denote n-gram models with n = 2 and n = 3, respectively.[6]

Typically, the n-gram model probabilities are not derived directly from frequency counts, because models derived this way have severe problems when confronted with any n-grams that have not been explicitly seen before. Instead, some form of smoothing is necessary, assigning some of the total probability mass to unseen words or n-grams. Various methods are used, from simple "add-one" smoothing (add 1 to the count of every n-gram, so that unseen n-grams receive a count of 1, as an uninformative prior) to more sophisticated models, such as Good-Turing discounting or back-off models.
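
As a small sketch of the simplest of these methods, the Python below applies add-one smoothing to bigram counts by adding 1 to every bigram count and the vocabulary size V to the history count. The corpus and names are assumptions for illustration only.

```python
from collections import Counter


def add_one_bigram_prob(word, prev, bigram_counts, history_counts, vocab_size):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + vocab_size)


# Toy corpus, made up for illustration.
corpus = [["i", "saw", "the", "red", "house"], ["i", "saw", "the", "dog"]]
vocab = {w for sent in corpus for w in sent}

bigram_counts, history_counts = Counter(), Counter()
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        bigram_counts[(prev, word)] += 1
        history_counts[prev] += 1

V = len(vocab)
seen = add_one_bigram_prob("red", "the", bigram_counts, history_counts, V)
unseen = add_one_bigram_prob("house", "the", bigram_counts, history_counts, V)
print(seen, unseen)
```

The unseen bigram "the house" now receives a small non-zero probability, at the cost of lowering the estimate for the observed bigram "the red".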


Bidirectional representations condition on both pre- and post- context (e.g., words) in all layers.[7]


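In a bigram (n = 2) language model, the probability of the sentence I saw the red house is approximated as

$$P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \langle s \rangle) \, P(\text{saw} \mid \text{I}) \, P(\text{the} \mid \text{saw}) \, P(\text{red} \mid \text{the}) \, P(\text{house} \mid \text{red}) \, P(\langle /s \rangle \mid \text{house})$$

whereas in a trigram (n = 3) language model, the approximation is

$$P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \langle s \rangle, \langle s \rangle) \, P(\text{saw} \mid \langle s \rangle, \text{I}) \, P(\text{the} \mid \text{I, saw}) \, P(\text{red} \mid \text{saw, the}) \, P(\text{house} \mid \text{the, red}) \, P(\langle /s \rangle \mid \text{red, house})$$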

Note that the context of the first n – 1 n-grams is filled with start-of-sentence markers, typically denoted <s>.

Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence *I saw the would always be higher than that of the longer sentence I saw the red house.
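
The effect of the end-of-sentence marker can be checked with a short Python sketch. It trains an unsmoothed bigram model on a two-sentence toy corpus (an assumption made for this example) with and without the `</s>` marker, then compares the prefix "I saw the" with the full sentence.

```python
from collections import Counter

START, END = "<s>", "</s>"


def pad(tokens, use_end_marker):
    """Add the start marker and, optionally, the end marker."""
    return [START] + tokens + ([END] if use_end_marker else [])


def train_bigram(sentences, use_end_marker=True):
    """Collect bigram and history counts over the padded sentences."""
    bigrams, histories = Counter(), Counter()
    for tokens in sentences:
        padded = pad(tokens, use_end_marker)
        for prev, word in zip(padded, padded[1:]):
            bigrams[(prev, word)] += 1
            histories[prev] += 1
    return bigrams, histories


def sentence_prob(tokens, bigrams, histories, use_end_marker=True):
    """Product of unsmoothed bigram probabilities over the padded sentence."""
    padded = pad(tokens, use_end_marker)
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        if histories[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, word)] / histories[prev]
    return prob


# Toy corpus, made up for illustration.
corpus = [["i", "saw", "the", "red", "house"], ["i", "saw", "the", "dog"]]
full = ["i", "saw", "the", "red", "house"]
prefix = ["i", "saw", "the"]

for use_end in (False, True):
    bigrams, histories = train_bigram(corpus, use_end)
    print(use_end,
          sentence_prob(prefix, bigrams, histories, use_end),
          sentence_prob(full, bigrams, histories, use_end))
```

Without the marker, the prefix scores 1.0 against 0.5 for the full sentence. With the marker, the prefix drops to 0.0 because "the" is never followed by `</s>` in this corpus, while the full sentence keeps 0.5.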


Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The equation is

$$P(w_m \mid w_1, \ldots, w_{m-1}) = \frac{1}{Z(w_1, \ldots, w_{m-1})} \exp\left(a^{T} f(w_1, \ldots, w_m)\right)$$

where $Z(w_1, \ldots, w_{m-1})$ is the partition function, $a$ is the parameter vector, and $f(w_1, \ldots, w_m)$ is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on $a$ or some form of regularization.
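
To make the roles of the partition function, parameter vector and feature function concrete, here is a minimal Python sketch with indicator features. The vocabulary, the feature naming scheme and the hand-set weights are all hypothetical; a real model would learn the weights from data.

```python
import math


def maxent_prob(word, history, vocab, weights, features):
    """P(word | history) = exp(a . f(history, word)) / Z(history)."""
    def score(candidate):
        # With indicator features, a . f is the sum of the active weights.
        return sum(weights.get(name, 0.0) for name in features(history, candidate))

    # Z(history): the partition function, summing over every candidate word.
    z = sum(math.exp(score(candidate)) for candidate in vocab)
    return math.exp(score(word)) / z


def indicator_features(history, word):
    """Names of the active indicator features: one unigram and one bigram."""
    active = [("unigram", word)]
    if history:
        active.append(("bigram", history[-1], word))
    return active


# Hypothetical vocabulary and weights, chosen by hand for illustration.
vocab = ["house", "car", "the"]
weights = {
    ("unigram", "the"): 0.5,
    ("bigram", "red", "house"): 2.0,
    ("bigram", "red", "car"): 1.0,
}

probs = {w: maxent_prob(w, ["the", "red"], vocab, weights, indicator_features)
         for w in vocab}
print(probs)
```

Dividing by the partition function is what turns the exponentiated scores into a distribution that sums to 1 over the vocabulary.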

The log-bilinear model is another example of an exponential language model.

Neural network

Neural language models (or continuous space language models) use continuous representations or embeddings of words to make their predictions.[8] These models make use of neural networks.

Continuous space embeddings help to alleviate the curse of dimensionality in language modeling: as language models are trained on larger and larger texts, the number of unique words (the vocabulary) increases.[a] The number of possible sequences of words increases exponentially with the size of the vocabulary, causing a data sparsity problem because of the exponentially many sequences. Estimating probabilities properly from counts alone would therefore require far more data than is available. Neural networks avoid this problem by representing words in a distributed way, as non-linear combinations of weights in a neural net.[9] An alternate description is that a neural net approximates the language function. The neural net architecture might be feed-forward or recurrent, and while the former is simpler the latter is more common.

Typically, neural net language models are constructed and trained as probabilistic classifiers that learn to predict a probability distribution

$$P(w_t \mid \text{context}) \quad \forall t \in V$$


I.e., the network is trained to predict a probability distribution over the vocabulary, given some linguistic context. This is done using standard neural net training algorithms such as stochastic gradient descent with backpropagation.[9] The context might be a fixed-size window of previous words, so that the network predicts

$$P(w_t \mid w_{t-k}, \ldots, w_{t-1})$$

from a feature vector representing the previous k words.[9] Another option is to use "future" words as well as "past" words as features, so that the estimated probability is

$$P(w_t \mid w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k})$$


This is called a bag-of-words model. When the feature vectors for the words in the context are combined by a continuous operation, this model is referred to as the continuous bag-of-words architecture (CBOW).[10]
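
The forward pass of such a model can be sketched in plain Python. This is an illustration only: it assumes that the continuous operation is an average of the context vectors, and the two small embedding tables are set by hand rather than trained.

```python
import math

# Hypothetical 2-dimensional embeddings, set by hand for illustration;
# a trained model would learn both tables.
INPUT_VECTORS = {
    "the": [0.1, 0.0],
    "red": [0.9, 0.2],
    "house": [0.8, 0.1],
    "is": [0.0, 0.3],
    "big": [0.2, 0.9],
}
OUTPUT_VECTORS = {
    "the": [0.0, 0.1],
    "red": [1.0, 0.0],
    "house": [0.9, 0.3],
    "is": [0.1, 0.2],
    "big": [0.1, 1.0],
}


def cbow_distribution(context_words):
    """Average the context embeddings, then softmax over output scores."""
    dim = len(next(iter(INPUT_VECTORS.values())))
    hidden = [
        sum(INPUT_VECTORS[w][d] for w in context_words) / len(context_words)
        for d in range(dim)
    ]
    scores = {
        w: sum(h * o for h, o in zip(hidden, out))
        for w, out in OUTPUT_VECTORS.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Distribution over the middle word of "the red ___ is big", using k = 2
# words on each side as context.
dist = cbow_distribution(["the", "red", "is", "big"])
print(dist)
```

Because the context vectors are averaged, reordering the context words leaves the predicted distribution unchanged, which is the sense in which the model treats the context as a bag of words.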

A third option that trains slower than the CBOW but performs slightly better is to invert the previous problem and make a neural network learn the context, given a word.[10] More formally, given a sequence of training words $w_1, w_2, w_3, \ldots, w_T$, one maximizes the average log-probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-k \le j \le k,\, j \ne 0} \log P(w_{t+j} \mid w_t)$$

where k, the size of the training context, can be a function of the center word $w_t$. This is called a skip-gram language model.[11] Bag-of-words and skip-gram models are the basis of the word2vec program.[12]
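
The objective can be illustrated by enumerating the (center, context) pairs it sums over. The sketch below is a simplification under stated assumptions: k is fixed rather than depending on the center word, positions that fall outside the sequence are skipped, and the probability model is a uniform placeholder rather than a trained network.

```python
import math


def skipgram_pairs(words, k):
    """Yield (center, context) pairs for every offset j with 0 < |j| <= k."""
    for t, center in enumerate(words):
        for j in range(-k, k + 1):
            # Skip the center itself and offsets that leave the sequence.
            if j != 0 and 0 <= t + j < len(words):
                yield center, words[t + j]


def average_log_prob(words, k, prob):
    """(1/T) * sum over t and j of log P(w_{t+j} | w_t).

    prob(context, center) is any caller-supplied model of P(context | center).
    """
    total = sum(math.log(prob(context, center))
                for center, context in skipgram_pairs(words, k))
    return total / len(words)


words = ["i", "saw", "the", "red", "house"]
pairs = list(skipgram_pairs(words, k=1))
print(pairs)


def uniform_prob(context, center):
    """Placeholder model (an assumption): uniform over the distinct words."""
    return 1.0 / len(set(words))


print(average_log_prob(words, k=1, prob=uniform_prob))
```

Training a skip-gram model amounts to adjusting the parameters behind `prob` so that this average increases.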

Instead of using neural net language models to produce actual probabilities, it is common to instead use the distributed representation encoded in the networks' "hidden" layers as representations of words; each word is then mapped onto an n-dimensional real vector called the word embedding, where n is the size of the layer just before the output layer. The representations in skip-gram models have the distinct characteristic that they model semantic relations between words as linear combinations, capturing a form of compositionality. For example, in some such models, if v is the function that maps a word w to its n-dimensional vector representation, then

$$v(\text{king}) - v(\text{male}) + v(\text{female}) \approx v(\text{queen})$$

where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.[10][11]
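
The nearest-neighbor reading of ≈ can be sketched as below. The 2-dimensional vectors are invented by hand so that the example works, and two choices are assumptions of this sketch rather than part of the definition above: cosine similarity is used as the neighbor metric, and the three query words are excluded from the candidates.

```python
import math

# Hypothetical 2-dimensional embeddings, set by hand so that one axis roughly
# tracks "royalty" and the other "gender"; real embeddings are learned.
VECTORS = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "male": [0.1, 0.9],
    "female": [0.1, 0.1],
    "house": [0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two 2-dimensional vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c, vectors):
    """Nearest neighbor of v(a) - v(b) + v(c), excluding the three query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "male", "female", VECTORS))
```

With these toy vectors the nearest remaining neighbor of v(king) − v(male) + v(female) is "queen".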


A positional language model[13] assesses the probability of given words occurring close to one another in a text, not necessarily immediately adjacent. Similarly, bag-of-concepts models[14] leverage the semantics associated with multi-word expressions such as buy_christmas_present, even when they are used in information-rich sentences like "today I bought a lot of very nice Christmas presents".

Despite the limited successes in using neural networks,[15] authors acknowledge the need for other techniques when modelling sign languages.


Various data sets have been developed for evaluating language processing systems.[7] These include:

  • Corpus of Linguistic Acceptability[16]
  • GLUE benchmark[17]
  • Microsoft Research Paraphrase Corpus[18]
  • Multi-Genre Natural Language Inference
  • Question Natural Language Inference
  • Quora Question Pairs[19]
  • Recognizing Textual Entailment[20]
  • Semantic Textual Similarity Benchmark
  • SQuAD question answering Test[21]
  • Stanford Sentiment Treebank[22]
  • Winograd NLI


Although contemporary language models, such as GPT-2, can be shown to match human performance on some tasks, it is not clear they are plausible cognitive models. For instance, recurrent neural networks have been shown to learn patterns humans do not learn and fail to learn patterns that humans do learn.[23]

See also


  a. ^ See Heaps' law.



  1. ^ Kuhn, Roland, and Renato De Mori. "A cache-based natural language model for speech recognition." IEEE transactions on pattern analysis and machine intelligence 12.6 (1990): 570-583.
  2. ^ a b Andreas, Jacob, Andreas Vlachos, and Stephen Clark. "Semantic parsing as machine translation." Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2013.
  3. ^ Pham, Vu, et al. "Dropout improves recurrent neural networks for handwriting recognition." 2014 14th International Conference on Frontiers in Handwriting Recognition. IEEE, 2014.
  4. ^ Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: An Introduction to Information Retrieval, pages 237–240. Cambridge University Press, 2009
  5. ^ Buttcher, Clarke, and Cormack. Information Retrieval: Implementing and Evaluating Search Engines. pg. 289–291. MIT Press.
  6. ^ Craig Trim, What is Language Modeling?, April 26th, 2013.
  7. ^ a b Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018-10-10). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
  8. ^ Karpathy, Andrej. "The Unreasonable Effectiveness of Recurrent Neural Networks".
  9. ^ a b c Bengio, Yoshua (2008). "Neural net language models". Scholarpedia. 3. p. 3881. Bibcode:2008SchpJ...3.3881B. doi:10.4249/scholarpedia.3881.
  10. ^ a b c Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Efficient estimation of word representations in vector space". arXiv:1301.3781 [cs.CL].
  11. ^ a b Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed Representations of Words and Phrases and their Compositionality (PDF). Advances in Neural Information Processing Systems. pp. 3111–3119.
  12. ^ Harris, Derrick (16 August 2013). "We're on the cusp of deep learning for the masses. You can thank Google later". Gigaom.
  13. ^ Lv, Yuanhua; Zhai, ChengXiang (2009). "Positional Language Models for Information Retrieval" (PDF). Proceedings. 32nd international ACM SIGIR conference on Research and development in information retrieval (SIGIR).
  14. ^ Cambria, Erik; Hussain, Amir (2012-07-28). Sentic Computing: Techniques, Tools, and Applications. Springer Netherlands. ISBN 978-94-007-5069-2.
  15. ^ Mocialov, Boris; Hastie, Helen; Turner, Graham (August 2018). "Transfer Learning for British Sign Language Modelling". Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018). Retrieved 14 March 2020.
  16. ^ "The Corpus of Linguistic Acceptability (CoLA)". Retrieved 2019-02-25.
  17. ^ "GLUE Benchmark". Retrieved 2019-02-25.
  18. ^ "Microsoft Research Paraphrase Corpus". Microsoft Download Center. Retrieved 2019-02-25.
  19. ^ Aghaebrahimian, Ahmad (2017), "Quora Question Answer Dataset", Text, Speech, and Dialogue, Lecture Notes in Computer Science, 10415, Springer International Publishing, pp. 66–73, doi:10.1007/978-3-319-64206-2_8, ISBN 9783319642055
  20. ^ Sammons, Mark; Vydiswaran, V.G.Vinod; Roth, Dan. "Recognizing Textual Entailment" (PDF). Retrieved February 24, 2019.
  21. ^ "The Stanford Question Answering Dataset". Retrieved 2019-02-25.
  22. ^ "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". Retrieved 2019-02-25.
  23. ^ Hornstein, Norbert; Lasnik, Howard; Patel-Grosz, Pritty; Yang, Charles (2018-01-09). Syntactic Structures after 60 Years: The Impact of the Chomskyan Revolution in Linguistics. Walter de Gruyter GmbH & Co KG. ISBN 978-1-5015-0692-5.


  • J M Ponte and W B Croft (1998). "A Language Modeling Approach to Information Retrieval". Research and Development in Information Retrieval. pp. 275–281.
  • F Song and W B Croft (1999). "A General Language Model for Information Retrieval". Research and Development in Information Retrieval. pp. 279–280.
  • Chen, Stanley; Joshua Goodman (1998). An Empirical Study of Smoothing Techniques for Language Modeling (Technical report). Harvard University.

External links