Spelling suggestion is a feature of many computer software applications used to suggest plausible replacements for words that are likely to have been misspelled.
Spelling suggestion features are commonly included in Internet search engines, word processors, spell checkers, medical transcription, automatic query reformulation, and frequency-log statistics reporting.
Any spell checker must have some data about the words in the target language, either in general usage or with specialized knowledge (like medical vocabulary). This can come from:
- A dictionary of all known words.
- A text corpus which includes typical text, known to be correctly spelled.
- A list of frequently misspelled words, mapping errors to corrections.
- Logs of human text input, such as from a popular search engine. This is essentially a crowdsourced corpus, but it is assumed there will be some spelling mistakes. Data might be included about when people click on a spelling suggestion or make a second, very similar query; this creates a crowdsourced mapping of misspelled words to reliable corrections.
A list of frequently misspelled words, possibly including multi-word phrases, can simply be consulted to see if any of the input words or phrases are listed.
To make use of a dictionary without a pre-existing mapping from misspellings to corrections, the typical technique is to calculate the edit distance between an input word and any given word in the dictionary. The Levenshtein distance metric considers an "edit" to be the insertion, deletion, or substitution (with another letter) of one letter. The Damerau–Levenshtein distance adds transpositions (the swapping of neighboring letters). Dictionary words that are an edit distance of 1 away from the input word are considered highly likely as corrections, edit distance 2 less likely, and edit distance 3 sometimes included in suggestions and sometimes ignored.
A text corpus can be summed up as a dictionary of known words, with a frequency of appearance for each word. This can be used to sort the spelling suggestions. For example, if there are multiple suggestions of edit distance 1, the words that appear most frequently in the corpus are most likely to be the desired correction.
Because a dictionary of known words is very large, calculating the edit distance between an input word and every word in the dictionary is computationally intensive and thus relatively slow. Various data structures can be utilized to speed up storage lookups, such as BK-trees. A faster approach adopted by Peter Norvig generates all the permutations from an input word of all possible edits. For a word of length n and an alphabet of size a, for edit distance 1 there are at most n deletions, n-1 transpositions, a*n alterations, and a*(n+1) insertions. Using only the 26 letters in the English alphabet, this would produce only 54*n+25 dictionary lookups, minus any duplicates (which depends on the specific letters in the word). This is relatively small when compared to a dictionary of hundreds of thousands of words. However, tens or hundreds of thousands of lookups might be required for edit distance 2 and greater. A further innovation adopted by Wolf Garbe, known as SymSpell ("sym" as in "symmetry") speeds the input-time calculation by utilizing the fact that only permutations involving deletions need to be generated for input words if the same deletion permutations are per-calculated on the dictionary.
The algorithms described so far do not deal well with correct words that are not in the dictionary. Common sources of unknown words in English are compound words and inflections, such as -s and -ing. These can be accommodated algorithmically, especially if the dictionary contains the part of speech.
These algorithms have also assumed that all errors of a given distance are equally probable, which is not true. Errors that involve spelling phonetically where English orthography is not phonetic are common, as are errors that repeat the same letter or confuse neighboring letters on a QWERTY keyboard. If a large set of known spelling errors and corrections is available, this data can be used to generate frequency tables for letter pairs and edit types; these can be used to more accurately rank suggestions. It is also more common than chance for a word to be spelled in the wrong dialect compared to the rest of the text, for example due to American and British English spelling differences.
Spelling suggestions can also be made more accurate by taking into account more than one word at a time. Multi-word sequences are known as n-grams (where n is the number of words in the sequence). A very large database of n-grams up to 5 words in length is available from Google for this and other purposes.
- Search 101 - Google VP of Engineering and CIO Douglas Merrill
- Edit distance
- Damn Cool Algorithms, Part 1: BK-Trees
- How to Write a Spelling Corrector
- 1000x Faster Spelling Correction algorithm (2012)
- Alex Franz; Thorsten Brants (3 Aug 2006). "All Our N-gram are Belong to You".
- Deep Spelling
- Algorithms and techniques for spell checking