# Kneser–Ney smoothing

Kneser–Ney smoothing, also known as Kneser-Essen-Ney smoothing, is a method primarily used to calculate the probability distribution of n-grams in a document based on their histories.[1] It is widely considered the most effective method of smoothing due to its use of absolute discounting by subtracting a fixed value from the probability's lower order terms to omit n-grams with lower frequencies. This approach has been considered equally effective for both higher and lower order n-grams. The method was proposed in a 1994 paper by Reinhard Kneser, Ute Essen and Hermann Ney [de].[2]

A common example that illustrates the concept behind this method is the frequency of the bigram "San Francisco". If it appears several times in a training corpus, the frequency of the unigram "Francisco" will also be high. Relying on only the unigram frequency to predict the frequencies of n-grams leads to skewed results;[3] however, Kneser–Ney smoothing corrects this by considering the frequency of the unigram in relation to possible words preceding it.

## Method

Let ${\displaystyle c(w,w')}$ be the number of occurrences of the word ${\displaystyle w}$ followed by the word ${\displaystyle w'}$ in the corpus.

The equation for bigram probabilities is as follows:

${\displaystyle p_{KN}(w_{i}|w_{i-1})={\frac {\max(c(w_{i-1},w_{i})-\delta ,0)}{\sum _{w'}c(w_{i-1},w')}}+\lambda _{w_{i-1}}p_{KN}(w_{i})}$[4]

Where the unigram probability ${\displaystyle p_{KN}(w_{i})}$ depends on how likely it is to see the word ${\displaystyle w_{i}}$ in an unfamiliar context, which is estimated as the number of times it appears after any other word divided by the number of distinct pairs of consecutive words in the corpus:

${\displaystyle p_{KN}(w_{i})={\frac {|\{w':0

Note that ${\displaystyle p_{KN}}$ is a proper distribution, as the values defined in the above way are non-negative and sum to one.

The parameter ${\displaystyle \delta }$ is a constant which denotes the discount value subtracted from the count of each n-gram, usually between 0 and 1.

The value of the normalizing constant ${\displaystyle \lambda _{w_{i-1}}}$ is calculated to make the sum of conditional probabilities ${\displaystyle p_{KN}(w_{i}|w_{i-1})}$ over all ${\displaystyle w_{i}}$ equal to one. Observe that (provided ${\displaystyle \delta <1}$) for each ${\displaystyle w_{i}}$ which occurs at least once in the context of ${\displaystyle w_{i-1}}$ in the corpus we discount the probability by exactly the same constant amount ${\displaystyle {\delta }/\left(\sum _{w'}c(w_{i-1},w')\right)}$, so the total discount depends linearly on the number of unique words ${\displaystyle w_{i}}$ that can occur after ${\displaystyle w_{i-1}}$. This total discount is a budget we can spread over all ${\displaystyle p_{KN}(w_{i}|w_{i-1})}$ proportionally to ${\displaystyle p_{KN}(w_{i})}$. As the values of ${\displaystyle p_{KN}(w_{i})}$ sum to one, we can simply define ${\displaystyle \lambda _{w_{i-1}}}$ to be equal to this total discount:

${\displaystyle \lambda _{w_{i-1}}={\frac {\delta }{\sum _{w'}c(w_{i-1},w')}}|\{w':0

This equation can be extended to n-grams. Let ${\displaystyle w_{i-n+1}^{i-1}}$ be the ${\displaystyle n-1}$ words before ${\displaystyle w_{i}}$:

${\displaystyle p_{KN}(w_{i}|w_{i-n+1}^{i-1})={\frac {\max(c(w_{i-n+1}^{i-1},w_{i})-\delta ,0)}{\sum _{w'}c(w_{i-n+1}^{i-1},w')}}+\delta {\frac {|\{w':0[5]

This model uses the concept of absolute-discounting interpolation which incorporates information from higher and lower order language models. The addition of the term for lower order n-grams adds more weight to the overall probability when the count for the higher order n-grams is zero.[6] Similarly, the weight of the lower order model decreases when the count of the n-gram is non zero.

## Modified Kneser–Ney smoothing

Modifications of this method also exist. Chen and Goodman's 1998 paper lists and benchmarks several such modifications. Computational efficiency and scaling to multi-core systems is the focus of Chen and Goodman’s 1998 modification.[7] This approach is once used for Google Translate under a MapReduce implementation.[8] KenLM is a performant open-source implementation.[9]

## References

1. ^ 'A Bayesian Interpretation of Interpolated Kneser-Ney NUS School of Computing Technical Report TRA2/06'
2. ^ Ney, Hermann; Essen, Ute; Kneser, Reinhard (January 1994). "On structuring probabilistic dependences in stochastic language modelling". Computer Speech & Language. 8 (1): 1–38. doi:10.1006/csla.1994.1001.
3. ^ 'Brown University: Introduction to Computational Linguistics '
4. ^ 'Kneser Ney Smoothing Explained'
5. ^ 'NLP Tutorial: Smoothing'
6. ^ 'An empirical study of smoothing techniques for language modeling'
7. ^
8. ^ Large Language Models in Machine Translation
9. ^