Talk:Letter frequency

This article is of interest to the following WikiProjects:

WikiProject Mathematics (Rated B-class, Low-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Field: Probability and statistics

WikiProject Statistics (Rated B-class, Low-importance)
This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

WikiProject Cryptography / Computer science (Rated B-class, Low-importance)
This article is within the scope of WikiProject Cryptography, a collaborative effort to improve the coverage of Cryptography on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article is supported by WikiProject Computer science (marked as Low-importance).

Dispute

Actually, the source is Cryptographical Mathematics, by Robert Edward Lewand, and it does not state the sample size. The 15000 word sample is from someone named Tom's apparently independent analysis. hnw555, 11/28/06

The sources for the statistics are taken from: http://www.central.edu/homepages/LintonT/classes/spring01/cryptography/letterfreq.html And are based on a ridiculously small sample size (15000 characters). It might also be copyrighted information.

Oh dear, I didn't notice that sample size. I think the thing to do is remove the data from this page and link to (preferably several) others with data. Objections? Frencheigh 03:23, 20 July 2005 (UTC)
I found a nice letter frequency calculator and I'm gonna give it input from a largish number of Wikipedia articles. It'll contain some bias from stub templates and things, but it should be a somewhat accurate estimate of the frequencies of letters in the English language. --Ihope127 19:37, 9 September 2005 (UTC)
...Whoops, it poofed. Ah well... --Ihope127 13:41, 10 September 2005 (UTC)
Wikipedia is a very biased source; besides the templates, there are going to be many foreign words, which will affect the frequency. For English you'll be better off getting 50MB of text files from Project Gutenberg. -- 6 October 2005
I don't think that sort of thing could be used here, as original research is not permitted on Wikipedia. (Wikipedia:No original research)
I guess the results from Project Gutenberg that I just posted could be construed as original research. If so, I apologize, and I guess take them down. :-( I can provide detailed explanation of my methods and source code, as well as full result data, to anyone interested. Matt Whitlock 21:54, 12 April 2006 (UTC)
Removed, for interested persons this is when they were added. Frencheigh 17:59, 5 July 2006 (UTC)
Meanwhile, I've heard that plain-old data isn't protected by copyright. So instead of what I suggested above, I envision a table where each row is a letter and each column is a different study, and in the cells is the frequency of that letter as given by that study. That way the page would be immediately useful for what I bet is the main reason somebody would want to view it. Anyone know if that would be legal? Frencheigh 23:38, 6 October 2005 (UTC)

Well the data that is in there now doesn't seem terribly good, so much better would be to replace it with something else. There's got to be published sources that have used large representative samples. Any ideas of where to look? - Taxman Talk 18:38, 30 November 2005 (UTC)

Ok I did some searching and found corpus linguistics, which seems a much better way to do it. Summary statistics on the prominent corpora such as the Brown Corpus and the British National Corpus seem much more valuable than what is in the article. Only I couldn't find them. All I could find when searching was this that lists some interesting letter frequencies in various languages, but they appear to be just from some guy's webpage that calculated them. Help on finding summary statistics on the corpora would be great. - Taxman Talk 22:02, 30 November 2005 (UTC)

You've all misunderstood the original quoted article. The frequencies given in the Wikipedia article are correct; note that they all match the second source quoted of British National Corpus to the accuracy given. The mistake was that the title "Tom's Letter Frequencies (in order)" in the center of the page is NOT the caption to the table above; rather, it is the heading for the paragraph and table BELOW. Note that the paragraph even says it is "below" and also that the second table, based on the 15,000 letter sample, is in "order" of frequency. Thus, the original Wikipedia article should stand as being accurate. (JPP)

Is the factual accuracy still disputed? Argyriou 20:09, 3 July 2006 (UTC)
It appears that the four following sections are still based upon that 15,000-char analysis. If there are no objections, I think I'll remove those four sections and attribute the rest above them to "Cryptographical Mathematics" by Robert Edward Lewand. Now, are we sure on the title? "Cryptological Mathematics" gets many more google hits. ([1], [2]). ((signature added later - comment by User:Frencheigh, PDT 15:59, 5 July 2006))
I've removed the {{disputed}} label. The sections User:Frencheigh removed can be found at [3] Argyriou 21:58, 11 July 2006 (UTC)
Directly above I was referring to the "Top 10 beginning of word letters", "Top 10 end of word letters", "Most common bigrams (in order)", and "Most common trigrams (in order)" sections, which were present during the above discussion, unlike the other Project Gutenberg ones I deleted recently (on account of their being OR, see farther up). I suppose I'll leave it for a bit again and clarify; I intend to remove all sections but "Relative frequencies of letters", "See also", and "External links", because the others are from the 15000-char analysis. Frencheigh 08:41, 12 July 2006 (UTC)
Done. Frencheigh 20:15, 19 July 2006 (UTC)

When I was a kid, I read a book on cryptography (I think it may have been "The First Book of Codes and Ciphers," which you can see Neal Stephenson reading in his author photo in "Cryptonomicon"!) that gave the frequency list as ETAONRISHDLFCMUGYPWBVKXJQZ. Anyone else recognize this ordering? Anyone know what statistical source it might have come from? I'm obviously not the only one who's ever thought it was the authoritative ordering, since googling that string of letters produces 187 results. --Mr. A. 21:07, 16 July 2006 (UTC)

Chart ordered by frequency would be helpful

The chart shown graphing letter frequency vs. letter is ordered alphabetically. An additional chart ordering the vertical bars by frequency (rather than alphabetically) would enhance the presentation.

I generated such a frequency-ordered chart on my Windows system using the Excel spreadsheet chart facility. I have not tried to add the result to the Wiki article because it's relatively ugly and because I couldn't figure out how to convert it to a .png file.

I've found an ordered letter frequency table for the English language on this page: http://www.csm.astate.edu/~rossa/datasec/frequency.html The source of the table is: H. Beker and F. Piper, Cipher Systems, Wiley-Interscience, 1982. I don't know if it would be ok to put it here.


How about using this source:

Case-sensitive letter and bigram frequency counts from large-scale English corpora. MN Jones, DJK Mewhort - Behavior Research Methods, Instruments, & Computers, 2004

A link to it can be found here [4]

--Zip123 (talk) 16:37, 23 October 2008 (UTC)

CAN SOMEONE PLEASE ADD INFORMATION ABOUT HOW TO GENERATE LETTER FREQUENCY TABLES IN FOREIGN LANGUAGES (i.e. from texts that are loaded into a computer program)?!? [24.59.100.23]

I wrote a program that takes a file as input and generates a primitive frequency table...it won't work for Unicode, though. I have to fix that. If you want it, I can upload it to Wikipedia (is that legal?). 7 July 2006 - dargueta

I made some simple Mathematica code for this purpose. Save a text with only lower-case characters and with only letters from a to z (don't use other characters like á, à, ê, etc.). Save it with the name Liber.txt. The code is the following.

Doc = Import["Liber.txt"];
Numb = Sum[StringCount[Doc, FromCharacterCode[i]], {i, 97, 122}];
K = Table[{StringCount[Doc, FromCharacterCode[i]]*(100./Numb), FromCharacterCode[i]}, {i, 97, 122}];
TableForm[K[[Ordering[100 - K]]]]

—Preceding unsigned comment added by 201.58.15.73 (talk) 14:59, 25 December 2007 (UTC)
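For texts in other languages, roughly the same count can be sketched in Python, where str.isalpha() is Unicode-aware, so the restriction to a-z is not needed. This is only an illustrative sketch, not anyone's posted program: the function name letter_frequencies and the sample string are made up here, and folding case with str.lower() is an assumption about how upper and lower case should be treated.

```python
from collections import Counter

def letter_frequencies(text):
    """Return (letter, percentage) pairs, most frequent first."""
    # Keep alphabetic characters only; str.isalpha() accepts accented
    # letters, so characters such as á, à, ê are counted in their own right.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return []
    counts = Counter(letters)
    # Sort by descending count, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(letter, 100.0 * n / total) for letter, n in ordered]

# Hypothetical sample; a real run would read a file instead, for example
# open("Liber.txt", encoding="utf-8").read()
for letter, pct in letter_frequencies("Noël, égoïste, été"):
    print(f"{letter} {pct:.1f}")
```

One caveat: if the input stores accents as separate combining characters, those marks are not alphabetic and would be dropped, so normalizing the text first with unicodedata.normalize("NFC", text) may be advisable.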

Statistics from a larger sample size

In the book The Code Book: The Science of Secrecy from Ancient Egypt to Quantum Cryptography by Simon Singh, I found the following table with a caption that reads:

This table of relative frequencies is based on passages taken from newspapers and novels, and the total sample was 100,362 alphabetic characters. The table was compiled by H. Beker and F. Piper, and originally published in Cipher Systems: The Protection Of Communication.

Note that the values below add to 100.3 due to rounding.

Letter Percentage Letter Percentage
a 8.2 n 6.7
b 1.5 o 7.5
c 2.8 p 1.9
d 4.3 q 0.1
e 12.7 r 6.0
f 2.2 s 6.3
g 2.0 t 9.1
h 6.1 u 2.8
i 7.0 v 1.0
j 0.2 w 2.4
k 0.8 x 0.2
l 4.0 y 2.0
m 2.4 z 0.1


The following table sorts the values given above in order of letter frequency.

Letter Percentage Letter Percentage
e 12.7 m 2.4
t 9.1 w 2.4
a 8.2 f 2.2
o 7.5 g 2.0
i 7.0 y 2.0
n 6.7 p 1.9
s 6.3 b 1.5
h 6.1 v 1.0
r 6.0 k 0.8
d 4.3 j 0.2
l 4.0 x 0.2
c 2.8 q 0.1
u 2.8 z 0.1
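Both the rounding note and the sorted table can be checked mechanically. The sketch below simply re-enters the percentages from the first table; the names FREQ, total_percentage and by_frequency are made up for the illustration, and ties are broken alphabetically, which is an assumption that happens to match the ordering shown above.

```python
# Percentages as printed in the table above (Beker and Piper, quoted by Singh).
FREQ = {
    "a": 8.2, "b": 1.5, "c": 2.8, "d": 4.3, "e": 12.7, "f": 2.2, "g": 2.0,
    "h": 6.1, "i": 7.0, "j": 0.2, "k": 0.8, "l": 4.0, "m": 2.4, "n": 6.7,
    "o": 7.5, "p": 1.9, "q": 0.1, "r": 6.0, "s": 6.3, "t": 9.1, "u": 2.8,
    "v": 1.0, "w": 2.4, "x": 0.2, "y": 2.0, "z": 0.1,
}

def total_percentage(freq):
    # Round to one decimal place to hide binary floating-point noise.
    return round(sum(freq.values()), 1)

def by_frequency(freq):
    # Descending percentage; ties (such as c and u) are broken alphabetically.
    return sorted(freq, key=lambda letter: (-freq[letter], letter))

print(total_percentage(FREQ))        # 100.3
print("".join(by_frequency(FREQ)))   # etaoinshrdlcumwfgypbvkjxqz
```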


I took a little class on cryptology once, and The Code Book and Cryptological Mathematics were our textbooks. I'm pretty sure they have the same data, but in The Code Book it's rounded. --Ravi12346 19:40, 30 July 2006 (UTC)

Query

sth is a surprise in a list of high-frequency trigrams. On its own it's an abbreviation of south, and I can think of a few words containing it, but not enough to account for its listing here. Can anyone tell me what it is I haven't thought of?

I grepped a dictionary and came up with 414 results... admittedly, almost all of them you wouldn't use in conversation (try "somesthetic" and "chromesthesia"), but there are a couple like 'firsthand' and 'guesthouse' that aren't so outlandish.

is as has was / this the that there they

Trigraphs ignoring spaces may not be of great practical use though. Uldoon 10:33, 10 March 2006 (UTC)
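The effect of ignoring spaces can be seen on the short words listed above. The following is a minimal sketch, with made-up function names, that counts trigrams once with all non-letters stripped and once only inside individual words; it is not the method of any source cited on this page.

```python
from collections import Counter

def trigrams_spaces_stripped(text):
    """Count trigrams after removing everything except letters."""
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    return Counter(letters[i:i + 3] for i in range(len(letters) - 2))

def trigrams_within_words(text):
    """Count only trigrams that lie entirely inside a single word."""
    counts = Counter()
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        counts.update(letters[i:i + 3] for i in range(len(letters) - 2))
    return counts

# Sample built from the common short words mentioned above.
sample = "is the was this has that"
print(trigrams_spaces_stripped(sample)["sth"])  # 3
print(trigrams_within_words(sample)["sth"])     # 0
```

In this sample every "sth" straddles a word boundary, which would be consistent with the trigram ranking highly only when spaces are removed.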

Given that this seems suspicious, and that we have reproducible numbers from PG, might this section (and sections 1-4) be gotten rid of? Onepairofpants 14:38, 30 May 2006 (UTC)

I agree that the top portion of the page should be deleted. The sample size of that portion is 15000 characters with only 2700 words. And the input is definitely biased (license agreement from Sun, teaching philosophy of a computer science professor, letter of recommendation). This is probably why "sth" appears in the results.

American English

contains a lot more "z"s than British English. 218.102.218.250 03:02, 5 April 2006 (UTC)

Mainly, I assume, thro' a preference for -ize as a suffix in the US rather than -ise; this despite the reverence in which the Oxford English Dictionary is held, and its general preference for the former spelling.

Average Word length

I would be interested to know some more statistics about these letter frequencies, but I lack the skill to extract the relevant information from the PG archive's ample selection of texts. What is the average word length in English? I read somewhere that it was 4.26, though this was with a rather small sample size. Is the distribution of word lengths a standard distribution? If so, what is the std deviation? How does letter frequency vary with word length? Obviously, for words of 1 letter, the frequencies will be 0 apart from "I", "A" and possibly "O"... Would anyone have the ability and the capability to satisfy my curiosity? 86.20.233.151 20:59, 1 June 2006 (UTC)
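For anyone wanting to run these numbers on their own texts, here is a rough Python sketch. It assumes that a word is a maximal run of the letters A-Z (so "don't" counts as two words, which is a simplification), and the function names are invented for this example. It reports the mean and population standard deviation of word length, and tallies letters separately for each word length.

```python
import re
from collections import Counter, defaultdict
from statistics import mean, pstdev

def words_of(text):
    # Assumption: a word is a maximal run of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())

def word_length_stats(text):
    """Return (mean length, population std deviation, histogram of lengths)."""
    lengths = [len(w) for w in words_of(text)]
    if not lengths:
        raise ValueError("text contains no words")
    return mean(lengths), pstdev(lengths), Counter(lengths)

def letters_by_word_length(text):
    """Map each word length to a Counter of the letters in words of that length."""
    table = defaultdict(Counter)
    for w in words_of(text):
        table[len(w)].update(w)
    return dict(table)

avg, sd, hist = word_length_stats("the quick brown fox")
print(avg, sd, dict(hist))
print(letters_by_word_length("I saw a cat")[1])
```

Whether the resulting histogram resembles any standard distribution is left open here; the sketch only produces the numbers needed to look.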

Wheel of Fortune

So...H and D appear more than L, but the "gimme" letters in the last round of Wheel of Fortune are RSTLNE. That should appear somewhere towards the bottom of this article. --JD79 19:20, 26 January 2007 (UTC)

Pure speculation here, but (1) I think those are the gimme letters because people had gotten the idea that they were a "good set" and guessed the exact same letters all the time, and the producers probably wanted to mix it up; and (2) if you've already guessed S and T, you can probably infer the locations of H's, as well as the likelihood of the last letter's being D (the past tense suffix no doubt being the reason why D is so common). --Mr. A. 04:21, 27 January 2007 (UTC)

"Text messages"?

From the article:

"The frequency of letters in text messages has often been studied for use in cryptography..."

In the UK at least, the term "text messages" means SMS messages, and readers could hardly understand it to mean anything else. However, from the context it doesn't seem very likely that this was the intended meaning, at least not exclusively. If not, then I suggest "text messages" is replaced by just "text". I won't change it just yet in case anyone has a good reason why it should read as it does. Matt 20:36, 21 April 2007 (UTC).

I think you're absolutely right, and I've gone ahead and made the change. --Mr. A. 10:49, 24 April 2007 (UTC)

Hemingway and Faulkner

I've removed the Hemingway vs Faulkner section in Letter frequencies, again, because, aside from the quote about conventional use of punctuation, it's unreferenced, and may be original research. It's also not really germane to the article; showing that two particular works have different letter frequencies doesn't really say anything about those writers' styles - they may have chosen character or place names which account for the entire discrepancy. An analysis, to be meaningful, would need to show similarities across all of an author's works for several authors, and consistent differences between authors. If there is any published literature on the subject, then it would be appropriate to summarize it and include it in this article under a section heading such as letter frequency variation and authorial style. Argyriou (talk) 19:39, 23 July 2007 (UTC)

ETAONRISH

I recall having read a book that used an alphabet ordered by frequency, beginning with the letters above, that was useful in solving substitution ciphers. Searching for "etaonrish" on Google gets quite a few results, too, so I wonder why this order isn't discussed in this article. B7T (talk) 04:11, 17 April 2008 (UTC)

This is the list that I mention at the end of the Dispute section above. I'd be interested in knowing where this list of frequencies came from. --Geenius at Wrok (talk) 10:41, 19 April 2008 (UTC)
If the title in that section is correct, a query on Amazon.com suggests that the book in question may be a children's book, The First Book of Codes and Ciphers by Sam and Beryl Epstein, illustrated by Laszlo Roth, published in 1956, if anyone wants to try to find a copy to verify it; I'm fairly certain that I first heard of this sequence in a different book for young people. A search for "etaonrish" on Amazon.com yielded results of its mention in five other books in the context of ciphers or cryptography, none of which are children's books (although one is about activities to do with children). B7T (talk) 19:44, 24 April 2008 (UTC)
If you ever figure it out, perhaps it would also be useful in the ETAOIN SHRDLU article. --68.0.124.33 (talk) 17:21, 9 March 2010 (UTC)

German ä, ö, and ü

This list is immediately suspect. Although I have no figures to back it up, I know enough German to wager that these three letters are nowhere near the given zero percent. Mamarazzi (talk) 05:35, 16 June 2008 (UTC)

The appendix to Fletcher Pratt's classic Secret and Urgent also gives a frequency table for German that omits these letters; it notes that umlauted a, o, and u have been counted with the non-umlauted a, o, and u. The reason for doing this is not explained. Perhaps in telegraphy they were treated the same? 76.199.88.163 (talk) 12:30, 8 July 2008 (UTC)
In The American Cryptogram Association's Xenocrypts, umlauts and other diacritical marks are stripped. Pratt does seem to have been a member:
http://voynichcentral.com/transcriptions/Voynich-101/strong_letters.pdf.
My guess would be that the frequency tables he printed had been compiled by members of the ACA.
--jdege (talk) 22:18, 15 July 2008 (UTC)
But if they weren't counted, don't put in a number such as 0... AnonMoos (talk) 14:02, 20 August 2008 (UTC)

Shouldn't we be warning readers about potential problems in the data? I know German but blithely copied the data without thinking, assuming them to be correct until I started seeing major anomalies in the statistical totals I was tabulating across languages! If I hadn't noticed this, it could have thrown my work way out! Matthew Slyman (talk) 09:07, 29 April 2013 (UTC)

I've just noticed the discrepancy as well. It should be fixed since it's incorrect and misleading. Both this page and its German Wikipedia equivalent cite the same source, but unfortunately I do not have a copy of the source available. The German page says that ä, ö, and ü were counted as if they were ae, oe, and ue (standard practice when those characters aren't available). It also states that the ligature ſz (which would later develop into ß) is counted independently from ß itself, which suggests some older texts were used in the analysis. Ideally another source should be found if possible. Note I do in fact assume the German page is right, since its totals sum to around 100% while the English page sums to about 102 or 103%. 74.12.29.232 (talk) 03:45, 4 October 2013 (UTC)
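To make the two counting conventions concrete, here is a small Python sketch. The digraph mapping follows the description of the German page given above; whether the cited source really worked this way is an assumption, and the names UMLAUT_DIGRAPHS, expand_umlauts and count_letters are invented for the example.

```python
from collections import Counter

# Substitutions described for the German table: ä -> ae, ö -> oe, ü -> ue.
# Treating the source's method this way is an assumption.
UMLAUT_DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue"}

def expand_umlauts(text):
    """Lower-case the text and spell each umlaut as its two-letter form."""
    text = text.lower()
    for umlaut, digraph in UMLAUT_DIGRAPHS.items():
        text = text.replace(umlaut, digraph)
    return text

def count_letters(text, expand=False):
    """Count letters, either keeping umlauts or expanding them first."""
    text = expand_umlauts(text) if expand else text.lower()
    return Counter(ch for ch in text if ch.isalpha())

print(count_letters("Tür"))               # ü kept as its own letter
print(count_letters("Tür", expand=True))  # ü counted as u plus e
```

Under either convention each character of the (possibly expanded) text is counted exactly once, so percentages computed from one of these tallies should total about 100%. A table that lists umlauts separately and also folds them into the base vowels would total more than that.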

Apostrophe

This is a great article, but I would find it useful to include the apostrophe character in the list of characters for which data is gathered, sorted by frequency, etc. I recall that in some languages (including English?) the apostrophe has a higher frequency than a few other letters of the alphabet. I went to this Wikipedia page to check this, but couldn't find such information, because the apostrophe is not even considered. Even more importantly, in languages like Italian the apostrophe can be part of a word. Even in languages like English or German there may be scenarios where one could wish to include the apostrophe (as used in possessives and contractions) in such statistics. Hi-Toro (talk) 01:19, 22 March 2011 (UTC)

Frequencies are inherently a property of the specific type of text. The traditional letter frequency counts are from telegraph text - upper case, with numbers spelled out and punctuation either dropped or spelled out. Mixed case, with punctuation, and/or with spaces, and/or with umlauts/accents/etc., will all have different frequency counts. As will Baudot code, ASCII, every different flavor of Unicode, etc.
We can't possibly include tables for them all.
jdege (talk) 01:43, 23 March 2011 (UTC)

Move to "Letter frequency"?

Shouldn't this article be moved to Letter frequency per WP:SINGULAR? Leon math (talk) 21:47, 4 January 2009 (UTC)

From what it can be read, it might be either because English people do not know how to use a keyboard, or because dictionary spelling is not enforced. — Preceding unsigned comment added by 86.75.160.141 (talk) 22:29, 27 October 2012 (UTC)

French ë?

Why is the ë 0.000% instead of 0? Is there only one word in their entire language with it? And is it only used in certain times of year? (Noël) Is Noël the only example? Is it 0.000% because the percentage is less than 0.000% but greater than 0? Uber-Awesomeness (talk) 19:49, 10 February 2009 (UTC)

Off the top of my head, I can think of "canoë", "contiguë", "ambiguë", "noël", "exiguë", "ciguë", "aiguë", and "Israël", "Staël", "Saint-Saëns". We also use "ü" in "capharnaüm", "Saül", "Esaü". And of course, we use lots of "ï" as in "ambiguïté", "exiguïté", "égoïste", "aïeul", "glaïeul", "haïr", "maïs", "coïncider", "inouï", etc... —Preceding unsigned comment added by 84.72.92.4 (talk) 00:58, 16 April 2009 (UTC)
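On the question of why the table shows 0.000% rather than 0: a letter that does occur, but rarely, would round to 0.000 when printed to three decimal places. The counts in this sketch are hypothetical and chosen only to show the rounding; they are not taken from any French corpus.

```python
def percent_3dp(count, total):
    """Format count/total as a percentage with three decimal places."""
    return f"{100.0 * count / total:.3f}%"

# Hypothetical counts, chosen only to illustrate the rounding effect.
print(percent_3dp(4, 1_000_000))  # a rare letter, 0.0004% of the sample
print(percent_3dp(0, 1_000_000))  # a letter that never occurs
```

Both lines print 0.000%, so the printed value alone cannot distinguish "very rare" from "absent".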

Diacritics in the English language

Why not include in the statistics English diacritics, such as those in the English terms with diacritical marks article? — Preceding unsigned comment added by 86.75.160.141 (talk) 22:23, 27 October 2012 (UTC)

Letter Frequency Not Reliable Enough?

I do not think that the English letter frequency on this page is very reliable. The source (http://pages.central.edu/emp/LintonT/classes/spring01/cryptography/letterfreq.html) only counted 15000 characters, which is not really enough to get a good picture of letter frequency. It only uses three documents, all of which are relatively specialized and so do not accurately represent letter frequency in the language as a whole.

Additionally, letter frequency from Project Gutenberg alone (http://en.wikipedia.org/w/index.php?title=Letter_frequencies&diff=48178370&oldid=45375638) is unreliable, as it only contains certain types of text and styles of writing. A combination of styles is necessary to be truly reliable. —Preceding unsigned comment added by Humanperson0 (talk · contribs) 19:38, 28 June 2009 (UTC)

(http://letterfrequency.org/) cites letter frequency as "e t a o i n s r h l d c u m f p g w y b v k x j q z". I do not know how reliable that site is, but my own study of letter frequency (http://mtgap.bilfo.com/letter_frequency.html) – in which I used 750,000 characters and a variety of writing styles – came up with the same result so it's probably pretty reliable. Also, the site includes many different types of frequency (word frequency, letter frequency in religious writings, etc) which points towards the study being thorough.

I do not have the skill necessary to completely modify the English section of the letter frequency page, but I recommend that it be seriously modified. The letter frequency that is currently used is unreliable.

Humanperson0 (talk) 19:24, 28 June 2009 (UTC)

I don't think we should be doing our own frequency counts. That would be original research. We should be referring to - and citing - frequency counts which we can properly document via reliable sources. Yes, different sources have different frequency counts, but different kinds of text have different letter frequencies. We shouldn't try to hide that.
--jdege (talk) 12:34, 30 June 2009 (UTC)

As someone who's interested in these frequencies in order to teach someone Braille, I'd like to point out that the deaf and blind folks have a slightly different ordering. Maybe it's worth it to mention, although, to be fair, they don't cite their sources either: http://www.deafandblind.com/word_frequency.htm for what it's worth Phillipkwood (talk) 22:39, 10 July 2009 (UTC)

The English letter frequency counts in the article from http://pages.central.edu/emp/LintonT/classes/spring01/cryptography/letterfreq.html (at the top of the webpage) as well as the bigram and trigram frequencies listed at the bottom of the webpage are from Cryptological Mathematics by Robert Lewand, pages 36 and 37 respectively (google books link:[5]) and are NOT based on the webpage author Tom Linton's own 15000 character sample. (Lewand's book is approaching letter statistics from the perspective of cryptanalysis of messages with all spaces stripped, which gives the odd distribution of trigrams: some like "edt" and "sth" are found mostly across word boundaries.) It seems more appropriate to cite Lewand's book rather than Linton's website for the numbers used in the article so as to avoid any confusion with Linton's own low-quality dataset. I'm not sure what corpus Lewand used, but it must have been much larger than 15,000 characters. Alternatively, if we could find a reliable source that actually states what corpus they used, that could be better. --Speight (talk) 02:36, 4 August 2009 (UTC)

Toki Pona

I must question whether Toki Pona — a constructed language of recent origin, with only a few dozen users, and intentionally designed not to be of general use as an international auxiliary language — is sufficiently notable to merit inclusion in the "Relative frequencies of letters in other languages" table. Comments? Richwales (talk · contribs) 06:11, 22 December 2010 (UTC)

There is absolutely no excuse for the obscure Toki Pona to be in this article. The language itself is not really notable enough to deserve an article (I'm shocked it has one). If the goal is to include a made-up language, I'd suggest Pig Latin. If the goal is to include a language with a very different distribution, I'd suggest Hawaiian. If it weren't such a pain to edit tables on Wikipedia, I would have ripped Toki Pona out already. Perhaps someone knows of an easier way to edit tables. RoyLeban (talk) 22:06, 9 September 2011 (UTC)

There is already a "made-up" language in the article, Esperanto. This designed language is in use by many people all over the world, so covering another made-up language by including Toki Pona or Pig Latin isn't necessary unless those languages have interest in their own right. Decorian (talk) 13:12, 28 September 2011 (UTC)

My impression

I have always understood that - although, admittedly, there is dispute about this - the most common letters in English by popular consent are - in decreasing order of frequency - ETAONRISH

I am not sure what comes next in the list, but it would probably be D followed by U or L. ACEOREVIVED (talk) 17:22, 24 March 2011 (UTC)

Universal translator?

Is there a source for letter frequency by language more generally? (That is, for languages not tabulated here.) It seems to me it would be useful to mention or link to, for people wanting more information. TREKphiler any time you're ready, Uhura 16:29, 13 April 2011 (UTC)

Strictly WP:OR

Entirely original research, but if anyone is seeking independent confirmation of (approximate) letter frequencies, I'd recommend taking a look at their computer keyboard. My netbook - a few months old - had a matt surface on the keys. The well-used ones are now glossy... AndyTheGrump (talk) 03:15, 8 June 2011 (UTC)

Introduction

At 7 paragraphs, the intro for this article appears to be excessive. Rather than deleting any content, perhaps some of the material could be moved into its own section or merged with others. I'd like to add a cleanup tag, called {{lead too long}}, to resolve this problem ... unless there are objections, or a specific reason for the verbosity? — VoxLuna  orbitland  22:11, 9 November 2011 (UTC)

Data miscopied from source?

Upon comparing the letter frequencies in the first table of this Wiki article against the source citation [4], I find that the frequencies for the three letters K, V, W do not agree with the source, as follows:

Letter Freq in Wiki Freq in [4]
K 0.747 0.772
V 1.037 0.978
W 2.365 2.360

As verification, the sum of the frequencies of all 26 letters of the alphabet should add up to 1. The frequencies listed in the Wikipedia article add up to 1.00038. The frequencies listed in the source [4] add up to .99999 (differs from 1 presumably due to roundoff). I went ahead and revised the numbers for these three letters in the Wiki table to match those of the source [4]. AlanSiegrist (talk) 15:31, 14 October 2012 (UTC)
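The three discrepancies can be cross-checked against the two totals quoted above. The sketch below assumes the table values are percentages while the totals are fractions of 1 (the comment does not say so explicitly), and uses Decimal so the arithmetic is exact; the names ROWS and net_difference_percent are made up for the illustration.

```python
from decimal import Decimal

# (letter, value in the Wiki table, value in the cited source), as listed above.
ROWS = [
    ("K", "0.747", "0.772"),
    ("V", "1.037", "0.978"),
    ("W", "2.365", "2.360"),
]

def net_difference_percent(rows):
    """Sum of (Wiki value - source value) over the listed letters."""
    return sum(Decimal(wiki) - Decimal(src) for _, wiki, src in rows)

net = net_difference_percent(ROWS)             # in percentage points
gap = Decimal("1.00038") - Decimal("0.99999")  # difference of the two totals

print(net)        # 0.039
print(net / 100)  # the same difference as a fraction of 1
print(gap)        # 0.00039
```

Under that assumption the net difference of 0.039 percentage points equals the 0.00039 gap between the two totals, which suggests these three letters account for the whole mismatch.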

Not sure what "[4]" source you were talking about last year, but the current article seems a little confusing, citing the 1982 Cipher Systems for the letter frequencies "listed below", then telling the reader that it differs from some Cornell table, then saying that the Concise Oxford dictionary includes some analysis, before finally saying that the Wikipedia list is using "this table" - sourced to some coder's personal website! The data seems to match the numbers on the coder's site, but he doesn't say where he got them from, only vaguely citing a 2000 textbook as "sources".
I've tried to clean this up by cutting the first statement as false, moving the second after the table, leaving the third and explaining the fourth. But what do we think is the best source to use here? --McGeddon (talk) 18:53, 19 July 2013 (UTC)

Tables do not add up to 100%

The sums of the letter frequencies for English, French, German, Spanish, Portuguese, Esperanto, Italian, Turkish, Swedish, Polish, Dutch, Danish, Icelandic, and Finnish are currently 100.00%, 99.16%, 102.36%, 104.47%, 104.09%, 99.99%, 101.14%, 94.01%, 100.00%, 108.01%, 101.59%, 100.00%, 100.00%, and 100.00%, respectively. Whilst I understand that frequencies may not always add up to 100%, I cannot imagine why they should ever exceed 100%. I cannot avoid the conclusion that some letters are being counted twice. — Preceding unsigned comment added by 2.104.4.142 (talk) 14:08, 15 February 2014 (UTC)

It's possible that the diacritics are being counted twice - checking a reference at random, the source for the Portuguese distribution differs slightly (it totals 100.01% versus this article's 104.09% - the .01% is presumably just a rounding error, although H, W and X have different values) and it says absolutely nothing about diacritics, so perhaps somebody just added them afterwards from a different source. --McGeddon (talk) 09:58, 21 February 2014 (UTC)
Looks like these edits from May/June 2013 might be the culprit - an IP editor added more diacritic rows and updated the stats for every language, somehow bumping them up to three decimal places of accuracy despite the sources only using two. Assuming good faith, this could have been somebody running their own analysis of corpus texts and getting a finer level of accuracy, but it's left us with a lot of data that doesn't match its sources, and apparently doesn't quite add up. --McGeddon (talk) 10:05, 21 February 2014 (UTC)
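A check of this kind is easy to automate. The sketch below re-enters the per-language totals quoted at the top of this section and flags those that differ from 100% by more than a tolerance; the 0.05 tolerance is an arbitrary assumption meant to allow for rounding, and the names TOTALS and off_by_more_than are invented here.

```python
# Totals (in percent) as quoted at the top of this section.
TOTALS = {
    "English": 100.00, "French": 99.16, "German": 102.36, "Spanish": 104.47,
    "Portuguese": 104.09, "Esperanto": 99.99, "Italian": 101.14,
    "Turkish": 94.01, "Swedish": 100.00, "Polish": 108.01, "Dutch": 101.59,
    "Danish": 100.00, "Icelandic": 100.00, "Finnish": 100.00,
}

def off_by_more_than(totals, tolerance=0.05):
    """Split languages into those above and below 100% by more than tolerance."""
    over = sorted(lang for lang, t in totals.items() if t - 100.0 > tolerance)
    under = sorted(lang for lang, t in totals.items() if 100.0 - t > tolerance)
    return over, under

over, under = off_by_more_than(TOTALS)
print("over 100%: ", ", ".join(over))
print("under 100%:", ", ".join(under))
```

With these figures the check flags German, Spanish, Portuguese, Italian, Polish and Dutch as over 100%, and French and Turkish as under. It only shows which tables look inconsistent, not which rows are wrong.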