Letter frequency: Difference between revisions
Undid revision 223419499 by 79.102.110.113 (talk) |
No edit summary |
||
Line 1: | Line 1: | ||
The '''frequency of letters''' in text has often been studied for use in [[cryptography]], and [[frequency analysis]] in particular. No exact letter frequency distribution underlies a given language, since all writers write slightly differently. [[Linotype machine]]s sorted the letters' frequencies as '''[[ETAOIN SHRDLU|etaoin shrdlu]] cmfwyp vbgkqj xz''' based on the experience |
The '''frequency of letters''' in text has often been studied for use in [[cryptography]], and [[frequency analysis]] in particular. No exact letter frequency distribution underlies a given language, since all writers write slightly differently. [[Linotype machine]]s sorted the letters' frequencies as '''[[ETAOIN SHRDLU|etaoin shrdlu]] cmfwyp vbgkqj xz''' based on the experience |'''s'''||6.327% |
||
More recent analyses show that letter frequencies, like word frequencies, tend to vary, both by writer and by subject. One cannot write an essay about x-rays without using frequent Xs, and different authors have habits which can be reflected in their use of letters. |
|||
Writers differ in style: for example, [[Ernest Hemingway|Hemingway]]'s writing is visibly different from [[William Faulkner|Faulkner]]'s. Letter, [[bigram]], [[n-gram|trigram]], and word frequencies, as well as word length and sentence length, can be calculated for specific authors and used as evidence for or against authorship of a text, even for authors whose styles are less obviously divergent.
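The kinds of measurements listed in this paragraph can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular attribution method: the function names `ngram_counts` and `average_word_length` are hypothetical, words are taken to be runs of the ASCII letters a to z, and n-grams are assumed not to span word boundaries.

```python
from collections import Counter
import re

def ngram_counts(text, n):
    """Count letter n-grams inside words (hypothetical helper).

    Assumption: a word is a run of ASCII letters a-z after case folding,
    and an n-gram never crosses a word boundary.
    """
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

def average_word_length(text):
    """Mean length of the words in text, or 0.0 if there are none."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(len(w) for w in words) / len(words) if words else 0.0

if __name__ == "__main__":
    sample = "the then"
    print(ngram_counts(sample, 2))      # bigram counts
    print(ngram_counts(sample, 3))      # trigram counts
    print(average_word_length(sample))
```

Such counts are only raw features; comparing two authors would additionally require a statistical comparison of their profiles, which this sketch does not attempt.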
|||
Accurate average letter frequencies can only be gleaned by analyzing a large amount of representative text. With the availability of modern computing and collections of large [[corpus linguistics|text corpora]], such calculations are easily made. |
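As a minimal sketch of such a calculation, the following Python counts the letters a to z in a string and reports each as a percentage of all counted letters. The function name `letter_frequencies` and the sample sentence are illustrative assumptions; a meaningful estimate would need a large, representative corpus rather than a single sentence.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Return {letter: percentage} for the letters a-z found in text.

    Hypothetical helper: case is folded, and anything outside ASCII a-z
    (digits, punctuation, accented letters) is ignored.
    """
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {letter: 100.0 * n / total for letter, n in counts.items()}

if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog"
    freqs = letter_frequencies(sample)
    # Most frequent first; ties are broken alphabetically.
    for letter, pct in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        print(f"{letter}: {pct:.3f}%")
```

Ignoring accented letters is a simplification that suits English; for other languages the set of counted characters would have to be widened.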
|||
==Relative frequencies of letters in the English language== |
|||
[[Image:English-slf.png|right|380px|thumbnail|Relative frequencies of letters in text.]] |
|||
[[Image:English-slf2.PNG|right|380px|thumbnail|Relative frequencies ordered by frequency.]] |
|||
{|class="wikitable sortable" |
|||
|- |
|||
!Letter |
|||
!Frequency |
|||
|- |
|||
|'''a'''||8.167% |
|||
|- |
|||
|'''b'''||1.492% |
|||
|- |
|||
|'''c'''||2.782% |
|||
|- |
|||
|'''d'''||4.253% |
|||
|- |
|||
|'''e'''||12.702% |
|||
|- |
|||
|'''f'''||2.228% |
|||
|- |
|||
|'''g'''||2.015% |
|||
|- |
|||
|'''h'''||6.094% |
|||
|- |
|||
|'''i'''||6.966% |
|||
|- |
|||
|'''j'''||0.153% |
|||
|- |
|||
|'''k'''||0.772% |
|||
|- |
|||
|'''l'''||4.025% |
|||
|- |
|||
|'''m'''||2.406% |
|||
|- |
|||
|'''n'''||6.749% |
|||
|- |
|||
|'''o'''||7.507% |
|||
|- |
|||
|'''p'''||1.929% |
|||
|- |
|||
|'''q'''||0.095% |
|||
|- |
|||
|'''r'''||5.987% |
|||
|- |
|||
|'''s'''||6.327% |
|||
|- |
||
|'''t'''||9.056% |
||
Line 57: | Line 5: | ||
|'''u'''||2.758% |
||
|- |
||
| |
|.68%||9.3% |
||
|- |
|||
|'''w'''||2.360% |
|||
|- |
|||
|'''x'''||0.150% |
|||
|- |
|||
|'''y'''||1.974% |
|||
|- |
|||
|'''z'''||0.074% |
|||
|} |
|||
<ref>[http://pages.central.edu/emp/LintonT/classes/spring01/cryptography/letterfreq.html English letter frequencies<!-- Bot generated title -->]</ref> |
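The table's percentages can be turned into a frequency ordering by sorting. The sketch below is illustrative only: it copies a subset of the table (the thirteen letters listed with a frequency above 2.5%) into a dictionary, and the names `ENGLISH_FREQ` and `rank_letters` are hypothetical.

```python
# Percentages copied from the table above for the letters listed above 2.5%.
ENGLISH_FREQ = {
    "a": 8.167, "c": 2.782, "d": 4.253, "e": 12.702, "h": 6.094,
    "i": 6.966, "l": 4.025, "n": 6.749, "o": 7.507, "r": 5.987,
    "s": 6.327, "t": 9.056, "u": 2.758,
}

def rank_letters(freq):
    """Return the letters of freq as one string, most frequent first."""
    return "".join(sorted(freq, key=freq.get, reverse=True))

print(rank_letters(ENGLISH_FREQ))  # prints "etaoinshrdlcu"
```

The first eleven letters agree with the Linotype ordering etaoin shrdl; in these figures c (2.782%) ranks slightly ahead of u (2.758%), whereas the Linotype sequence places u first.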
|||
==Relative frequencies of letters in other languages== |
|||
[[Image:Frecuencia de uso de letras en español.PNG|thumb|right|175px|Spanish letter frequencies]] |
|||
{|class="wikitable sortable" |
|||
|- |
|||
!Letter |
|||
![[French_language|French]] <ref>{{cite web |url=http://gpl.insa-lyon.fr/Dvorak-Fr/CorpusDeThomasTemp%C3%A9 |title= CorpusDeThomasTempé|accessdate=2007-06-15}}</ref> |
|||
![[German_language|German]] <ref>Albrecht Beutelspacher, ''Kryptologie'', 7. Aufl., Wiesbaden: Vieweg Verlagsgesellschaft, 2005, ISBN 3-8348-0014-7, p.10</ref> |
|||
![[Spanish_language|Spanish]] <ref>Fletcher Pratt, ''Secret and Urgent: the Story of Codes and Ciphers'', Blue Ribbon Books, 1939, pp. 254-255.</ref>
|||
![[Esperanto_language|Esperanto]] <ref>{{cite web |url=http://lingvakritiko.com/2007/09/13/literoftecoj-kaj-tabelvortoftecoj/ |title= La Oftecoj de la Esperantaj Literoj|accessdate=2007-09-14}}</ref> |
|||
![[Italian_language|Italian]]<ref>Simon Singh, ''Codici e Segreti'', 1999, RCS, ISBN 88-17-12539-3</ref> |
|||
![[Turkish_language|Turkish]] |
|||
![[Swedish_language|Swedish]]<ref>Simon Singh, ''Kodboken'', 1999, Norstedts, ISBN 91-1-1300708-4</ref> |
|||
|- |
|||
|'''a'''||7.636%||6.51%||12.53%||12.12%||11.74%||11.68%||9.3% |
|||
|- |
||
|'''b'''||0.901%||1.89%||1.42%||0.98%||0.92%||2.95%||1.3% |
||
|- |
||
|'''c'''||3.260%||3.06%||4.68%||0.78%||4.5%||0.97%||1.3% |
||
|- |
|||
|'''d'''||3.669%||5.08%||5.86%||3.04%||3.73%||4.87%||4.5% |
|||
|- |
|||
|'''e'''||14.715%||17.40%||13.68%||8.99%||11.79%||9.01%||9.9% |
|||
|- |
|||
|'''f'''||1.066%||1.66%||0.69%||1.03%||0.95%||0.44%||2.0% |
|||
|- |
||
|'''g'''||0.866%||3.01%||1.01%||1.17%||1.64%||1.34%||3.3% |
Revision as of 11:25, 14 July 2008
The frequency of letters in text has often been studied for use in cryptography, and frequency analysis in particular. No exact letter frequency distribution underlies a given language, since all writers write slightly differently. Linotype machines sorted the letters' frequencies as etaoin shrdlu cmfwyp vbgkqj xz based on the experience

{|class="wikitable sortable"
|-
!Letter
!Frequency
|-
|s||6.327%
|-
|t||9.056%
|-
|u||2.758%
|}

{|class="wikitable sortable"
|-
!Letter
!French
!German
!Spanish
!Esperanto
!Italian
!Turkish
!Swedish
|-
|a||7.636%||6.51%||12.53%||12.12%||11.74%||11.68%||9.3%
|-
|b||0.901%||1.89%||1.42%||0.98%||0.92%||2.95%||1.3%
|-
|c||3.260%||3.06%||4.68%||0.78%||4.5%||0.97%||1.3%
|-
|g||0.866%||3.01%||1.01%||1.17%||1.64%||1.34%||3.3%
|-
|h||0.737%||4.76%||0.70%||0.38%||1.54%||1.14%||2.1%
|-
|i||7.529%||7.55%||6.25%||10.01%||11.28%||8.27%*||5.1%
|-
|j||0.545%||0.27%||0.44%||3.50%||0.00%||0.01%||0.7%
|-
|k||0.049%||1.21%||0.00%||4.16%||0.00%||4.71%||3.2%
|-
|l||5.456%||3.44%||4.97%||6.14%||6.51%||5.75%||5.2%
|-
|m||2.968%||2.53%||3.15%||2.99%||2.51%||3.74%||3.5%
|-
|n||7.095%||9.78%||6.71%||7.96%||6.88%||7.23%||8.8%
|-
|o||5.378%||2.51%||8.68%||8.78%||9.83%||2.45%||4.1%
|-
|p||3.021%||0.79%||2.51%||2.74%||3.05%||0.79%||1.7%
|-
|q||1.362%||0.02%||0.88%||0.00%||0.51%||0||0.007%
|-
|r||6.553%||7.00%||6.87%||5.91%||6.37%||6.95%||8.3%
|-
|s||7.948%||7.27%||7.98%||6.09%||4.98%||2.95%||6.3%
|-
|t||7.244%||6.15%||4.63%||5.27%||5.62%||3.09%||8.7%
|-
|u||6.311%||4.35%||3.93%||3.18%||3.01%||3.43%||1.8%
|-
|v||1.628%||0.67%||0.90%||1.90%||2.10%||0.98%||2.4%
|-
|w||0.114%||1.89%||0.02%||0.00%||0.00%||0||0.03%
|-
|x||0.387%||0.03%||0.22%||0.00%||0.00%||0||0.1%
|-
|y||0.308%||0.04%||0.90%||0.00%||0.00%||3.37%||0.6%
|-
|z||0.136%||1.13%||0.52%||0.50%||0.49%||1.50%||0.02%
|-
|à||0.486%||0||0||0||see a||0||0.0%
|-
|å||0||0||0||0||0||0||1.6%
|-
|ä||0||0||0||0||0||0||2.1%
|-
|œ||0.018%||0||0||0||0||0||0
|-
|ç||0.085%||0||0||0||0||1.26%||0
|-
|ĉ||0||0||0||0.66%||0||0||0
|-
|è||0.271%||0||0||0||see e||0||0.0%
|-
|é||1.904%||0||0||0||see e||0||0.0%
|-
|ê||0.225%||0||0||0||0||0||0
|-
|ë||0.000%||0||0||0||0||0||0
|-
|ĝ||0||0||0||0.69%||0||0||0
|-
|ğ||0||0||0||0||0||1.13%||0
|-
|ĥ||0||0||0||0.02%||0||0||0
|-
|î||0.045%||0||0||0||0||0||0
|-
|ì||0||0||0||0||see i||0||0
|-
|ï||0.005%||0||0||0||0||0||0
|-
|ı||0||0||0||0||0||5.20%*||0
|-
|ĵ||0||0||0||0.12%||0||0||0
|-
|ñ||0||0||0.03||0||0||0||0
|-
|ò||0||0||0||0||see o||0||0
|-
|ö||0||0||0||0||0||0.87%||1.5%
|-
|ŝ||0||0||0||0.38%||0||0||0
|-
|ş||0||0||0||0||0||1.94%||0
|-
|ß||0||0.31%||0||0||0||0||0
|-
|ù||0.058%||0||0||0||see u||0||0
|-
|ŭ||0||0||0||0.52%||0||0||0
|-
|ü||0||0||0||0||0||1.99%||0
|}
<nowiki>*</nowiki> See Turkish dotted and dotless I
See also
- Corpus linguistics
- ETAOIN SHRDLU
- RSTLNE
- Frequency analysis (cryptanalysis)
- Linotype machine
- Most common words in English
- Scrabble