Wikipedia:Typo Team/moss/Archive

DNA

DNA sequences, like those in

Hmm, I will have to ask around on MoS or something. Thanks for finding that. -- Beland (talk) 01:38, 19 July 2018 (UTC)

If not, we could make one, a template with at bare minimum <span class="dna-sequence">{{{1}}}</span>. Do similarly for the poem structure patterns. We did this with trade designations for horticultural plants, and it has worked out well: {{tdes}}. Turns out the nomenclature authority requires them (in a scientific name) to be in a different font, so we used kerned monospace (it supports extra options, but that part was probably a bad idea). Anyway, here's all "Template:"-namespace pages with "dna" in their titles, and here's those with "gene", in case there's already a template for this (I have not pored over them).  — SMcCandlish ¢ 😼  14:48, 21 July 2018 (UTC)
Oof, that has resulted in some pretty ugly plant name typography; I wish we hadn't followed the typographical conventions of that source. I do like the idea of a template, though - that would make it easy for anyone who is interested to find all of the DNA sequences on Wikipedia...which is a thing that could happen? I put your code in {{DNA sequence}} and applied that to this article; thanks for throwing that together! I'll ponder poem patterns a bit more. -- Beland (talk) 05:44, 25 July 2018 (UTC)
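
For sequences that have not been wrapped in a template, a spell checker could also recognize them directly. The following is only a minimal sketch in Python, not the actual moss code: the name `looks_like_dna` and the minimum length of 6 are assumptions made for illustration.

```python
import re

# Hypothetical pattern: a "word" made up only of the bases a, c, g, t.
# The minimum length of 6 is an arbitrary illustrative cutoff so that short
# English words such as "act" or "tag" are not treated as DNA.
DNA_RE = re.compile(r"[acgt]{6,}", re.IGNORECASE)


def looks_like_dna(word: str) -> bool:
    """Return True if the whole word consists only of a/c/g/t."""
    return DNA_RE.fullmatch(word) is not None
```

A word that passes this check could be reported under its own category instead of being listed as a possible typo.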

Fixed

These are due to difficult-to-parse mixtures of tables and templates. ::sigh:: I think I can fix this in code. -- Beland (talk) 00:49, 19 July 2018 (UTC)

These should be ignored in the next run (20 July 2018 dump or later). -- Beland (talk) 22:28, 25 July 2018 (UTC)
ghola is either related to West Bengal, Pakistan, Afghanistan, or related to the Dune universe; Wiktionary does not have either wikt:ghola or wikt:gholas;
  24 - "gholas" : of 24 matches only one (Hasnabad (community development block)) is not from the Dune universe
427 - "ghola"
  59 - "ghola" -"bengal"
of 59, only 7 are not about the Dune universe: Ghoul; Prem Pujari / List of songs recorded by Kishore Kumar (a song title); Mount Paiko / Kharkoo (places); Bogeyman; List of rampage killers / List of rampage killers (familicides) (a town)
So this is the plural of a word that is most often a made-up term from the Dune universe, not exactly ready for Wiktionary! What to do? Shenme (talk) 19:12, 28 August 2018 (UTC)
Ah, we have a redirect from ghola; I can add redirects to the exclusion list. I'll have to be careful of those with {{R from misspelling}} and variations, and we'll have to go through all untagged redirects and tag those that are also misspellings. (In the end, I think all redirects will be tagged; categorizing them helps projects decide whether or not they are worthy for inclusion in a print version or CD, etc.) -- Beland (talk) 19:53, 13 September 2018 (UTC)
Oh, redirects are already included in the dictionary. I just created a redirect from gholas, so this should be ignored on the next run. -- Beland (talk) 04:43, 24 September 2018 (UTC)
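
The idea of counting redirect titles as known words while skipping redirects tagged as misspellings could look roughly like this. It is a sketch under assumptions, not the moss implementation: `build_known_words` and the shape of the `redirects` mapping (redirect title to the set of template names on that redirect page) are hypothetical.

```python
def build_known_words(dictionary_words, redirects):
    """Combine dictionary words with redirect titles that are not misspellings.

    dictionary_words: iterable of words already accepted by the spell checker.
    redirects: hypothetical mapping of redirect title -> set of template names
               found on the redirect page, e.g. {"R from misspelling"}.
    """
    known = {w.lower() for w in dictionary_words}
    for title, tags in redirects.items():
        # A redirect tagged as a misspelling is evidence of a typo,
        # so it must not be added to the list of accepted words.
        if "R from misspelling" in tags:
            continue
        known.add(title.lower())
    return known
```

Variations of the misspelling tag, which the comment above warns about, are not handled here and would need their own entries in the check.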

Notes from Apr 2018

Poems

These are patterns used to describe poetry. Not sure they are appropriate for Wiktionary; if not, I will whitelist them. -- Beland (talk) 00:49, 19 July 2018 (UTC)

There may be a better and even conventionally marked-up way to represent these. Check poetry sources? Maybe they are done as c-d-c-d or whatever.  — SMcCandlish ¢ 😼  14:38, 21 July 2018 (UTC)

Oh, there are lots more where that came from. Maybe these should be tagged or maybe I can fix in code with a pattern recognizer or something. I'll have to ponder. -- Beland (talk) 01:42, 19 July 2018 (UTC)

From longest:

I think these should be capitalized or enclosed in quotes, either of which would prevent them from showing up here as spelling errors. I started a discussion at Wikipedia talk:Manual of Style § Rhyme scheme patterns. -- Beland (talk) 22:26, 25 July 2018 (UTC)

Continued at Wikipedia:Typo Team/moss#Repeating patterns. -- Beland (talk) 02:04, 17 August 2018 (UTC)
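
A pattern recognizer of the kind floated above might be sketched as follows. This is an illustration only: the function name, the restriction to the letters a through h, and the minimum of four letters are all assumptions, not anything taken from the moss code.

```python
import re

# Hypothetical shape of a rhyme scheme: at least four letters from a-h,
# optionally separated by hyphens (e.g. "abab" or "c-d-c-d").
RHYME_RE = re.compile(r"[a-h](?:-?[a-h]){3,}")


def looks_like_rhyme_scheme(word: str) -> bool:
    """Return True for strings shaped like rhyme schemes."""
    w = word.lower()
    if RHYME_RE.fullmatch(w) is None:
        return False
    letters = w.replace("-", "")
    # A rhyme scheme repeats at least one letter; "badge" does not.
    return len(set(letters)) < len(letters)
```

Real words built from the same letters with a repeat (for example "cabbage") would also match, so a check like this only makes sense for strings that have already failed a dictionary lookup.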

Notes from Jan 2019

Statistics

2018-04 to 2018-09

Misspellings per article 2018-04-01 dump (moss 4933ad4) 2018-07-01 dump (moss 4933ad4) 2018-07-20 dump (moss 5e6b2ce) 2018-08-01 dump (moss 0f7ddbf) 2018-08-20 dump (moss 032a6be) 2018-09-01 dump (moss 816c025) 2018-09-20 dump (moss 7e26fe6) Total change (to 2018-09-20)
0 4839889 4910541 (+70652) 4948698 (+38157) 4956727 (+8029) 4975895 (+19168) 4986531 (+10636) 5066713 (+80182) (+226824)
1 319509 319315 (-194) 315926 (-3389) 312871 (-3055) 311641 (-1230) 309785 (-1856) 268592 (-41193) (-50917)
2 104405 104591 (+186) 90630 (-13961) 89861 (-769) 89701 (-160) 89286 (-415) 71105 (-18181) (-33300)
3 40270 40099 (-171) 38430 (-1669) 37891 (-539) 37832 (-59) 37669 (-163) 29796 (-7873) (-10474)
4 22793 22739 (-54) 21069 (-1670) 20900 (-169) 20909 (+9) 20859 (-50) 16180 (-4679) (-6613)
5 13355 13331 (-24) 12561 (-770) 12392 (-169) 12357 (-35) 12315 (-42) 9483 (-2832) (-3872)
6 9398 9422 (+24) 8700 (-722) 8620 (-80) 8625 (+5) 8574 (-51) 6411 (-2163) (-2987)
7 6599 6614 (+15) 6150 (-464) 6095 (-55) 6098 (+3) 6076 (-22) 4573 (-1503) (-2026)
8 5314 5312 (-2) 4854 (-458) 4832 (-22) 4812 (-20) 4839 (+27) 3474 (-1365) (-1840)
9 3992 3985 (-7) 3723 (-262) 3643 (-80) 3665 (+22) 3631 (-34) 2640 (-991) (-1352)
10-19 16753 16879 (+126) 15508 (-1371) 15437 (-71) 15497 (+60) 15458 (-39) 10260 (-5198) (-6493)
20-29 4997 4992 (-5) 4597 (-395) 4594 (-3) 4524 (-70) 4512 (-12) 2596 (-1916) (-2401)
30-39 2169 2211 (+42) 1976 (-225) 1962 (-14) 1934 (-28) 1929 (-5) 1011 (-918) (-1158)
40-49 1177 1205 (+28) 1061 (-144) 1061 (0) 1031 (-30) 1027 (-4) 525 (-502) (-652)
50-59 674 695 (+21) 619 (-74) 618 (-1) 560 (-58) 553 (-7) 296 (-257) (-378)
60-69 453 476 (+23) 420 (-56) 419 (-1) 378 (-41) 377 (-1) 179 (-198) (-274)
70-79 299 326 (+27) 243 (-83) 241 (-2) 214 (-27) 218 (+4) 119 (-99) (-180)
80-89 213 218 (+5) 179 (-39) 181 (+2) 177 (-4) 179 (+2) 81 (-98) (-132)
90-99 140 153 (+13) 131 (-21) 126 (-5) 126 (0) 128 (+2) 61 (-67) (-79)
100-199 456 521 (+65) 434 (-87) 435 (+1) 416 (-19) 414 (-2) 196 (-218) (-260)
200-299 90 113 (+23) 93 (-20) 95 (+2) 91 (-4) 96 (+5) 44 (-52) (-46)
300-399 44 45 (+1) 41 (-4) 42 (+1) 41 (-1) 42 (+1) 27 (-15) (-17)
400-499 19 26 (+7) 21 (-5) 22 (+1) 18 (-4) 18 (0) 9 (-9) (-10)
500-599 12 13 (+1) 13 (0) 13 (0) 16 (+3) 16 (0) 7 (-9) (-5)
600-699 8 9 (+1) 9 (0) 8 (-1) 6 (-2) 7 (+1) 2 (-5) (-6)
700-799 2 3 (+1) 3 (0) 5 (+2) 5 (0) 5 (0) 1 (-4) (-1)
800-899 2 3 (+1) 3 (0) 3 (0) 3 (0) 2 (-1) 0 (-2) (-2)
900-999 6 7 (+1) 6 (-1) 6 (0) 5 (-1) 5 (0) 0 (-5) (-6)
1000-1999 25 27 (+2) 27 (0) 26 (-1) 24 (-2) 23 (-1) 0 (-23) (-25)
2000-2999 3 5 (+2) 5 (0) 5 (0) 5 (0) 3 (-2) 0 (-3) (-3)
4000-4999 0 2 (+2) 2 (0) 2 (0) 3 (+1) 1 (-2) 0 (-1) (0)
Parse failed 193671 194777 (+1106) 191813 (-2964) 195147 (+3334) 203588 (+8441) 203583 (-5) 201420 (-2163) (+7749)

The spell checker has been getting smarter over time, so more recent versions report fewer false alarms. This explains most of the drop in the number of possible typos reported. Most of the gains for pages with more than 100 possible typos are due to changes that ignore pages with {{cleanup}} and similar tags, which indicate the page may not be ready for spell checking. I have been specifically tagging pages with a high number of possible typos to bring them to the attention of interested editors. Pages tagged for cleanup are reported in the statistics of cleanup-related work queues.

Some variation in the number of typos fixed between runs is also explained by the differences in the amount of time between runs. The biggest sources of variance are the unusually long time between the first two runs and the fact that dumps snapshotted on the first day of the month (which have a lot of additional data the spell checker doesn't need) take longer for Wikimedia servers to generate than the dumps snapshotted on the twentieth day of the month. There is also considerable activity from other editors writing new material and correcting typos as they find them while reading or editing articles.

moss project participants have been correcting hundreds or thousands of typos per month (yay!), mostly in articles with a single typo. We have also been adding somewhere from handfuls to dozens of entries to Wiktionary a month. Looking only at the generated reports, these numbers are difficult to separate from the other changes in data and code, but we do see progress as we strike through or remove items from the todo lists.

Since figuring out which words are not typos is such a big part of the problem to be solved, the code may need to get smarter in the future, but we're probably going to have an upcoming period of relative stability as we work through some low-hanging fruit. Hopefully upcoming statistics will reflect progress in actually reducing typos more than changes in spell checker code. -- Beland (talk) 18:20, 12 October 2018 (UTC)

2018-09 to 2019-03

At least 10% of possible typos reported in the old statistics are definitely misspellings, but it's unclear how many of the remaining 90% are. Below is a new way of breaking down possible typos, by type instead of count per article. The "T1" items are almost all typos, and those are what we've been working on in the main "by article" section. Some of the other types have their own reports on this page, but most will require further analysis to either automatically distinguish typos vs. legitimate strings, or produce a more useful report for human editors.

Reporting symbol Explanation Instances/Unique strings, 2018-09-20 dump (7e26fe6) Instances/Unique strings, 2018-10-20 dump (7649023) Instances/Unique strings, 2018-11-01 dump (0aa8575) Instances/Unique strings, 2018-12-20 dump (03be966) Instances/Unique strings, 2019-01-20 dump (1bcf51c) Instances/Unique strings, 2019-02-01 dump (c6ce3ab) Instances/Unique strings, 2019-03-01 dump (ff8b9d2) Instances/Unique strings, 2019-03-20 dump (692642d)
TS Missing or extra whitespace or dash (or new compound) 152985/84720 194758/114535 194711/114518 195044/114675 192811/114167 193752/114734 191701/113928 183795/109989
T1 Edit distance 1 from common English word 111429/70527 104280/68352 103043/67652 96081/64513 89549/61018 89355/60879 83353/57483 75941/53339
T2 Edit distance 2 from common English word 82638/53517 81793/53146 81721/53191 81536/53093 81170/52980 82727/53945 81410/53326 72093/47849
T3 Edit distance 3 from common English word 91844/61332 90769/60713 90778/60760 90382/60574 89841/60397 91893/61566 90328/60825 79609/54610
T4 Edit distance 4 from common English word 76336/52684 75139/52090 75006/52101 74757/51828 74536/51752 76323/52938 75335/52296 -
T5 Edit distance 5 from common English word 52071/36450 50970/35807 50882/35812 50614/35649 50571/35624 51785/36446 50852/36022 -
T6 Edit distance 6 from common English word 30437/21927 29755/21481 29704/21478 29490/21302 29440/21280 30134/21759 29685/21506 -
T7 Edit distance 7 from common English word 15392/11095 14972/10854 14977/10858 14858/10736 14765/10698 15153/10939 14929/10790 -
T8 Edit distance 8 from common English word 7138/5060 6966/4936 6970/4947 6911/4902 6863/4881 6967/4959 6811/4886 -
T9 Edit distance 9 from common English word 2450/1868 2383/1823 2380/1822 2349/1822 2348/1819 2407/1867 2386/1848 -
T10 Edit distance 10 from common English word 1027/721 987/705 986/706 995/702 978/697 992/708 960/693 -
T11 Edit distance 11 from common English word 399/324 390/317 389/316 380/312 378/309 386/315 388/316 -
T12 Edit distance 12 from common English word 122/105 119/102 119/102 120/103 117/101 118/101 118/101 -
T13 Edit distance 13 from common English word 44/29 44/29 44/29 44/29 45/30 45/30 45/30 -
T14 Edit distance 14 from common English word 15/13 14/12 14/12 13/11 1/1 6/5 5/5 -
T15 Edit distance 15 from common English word 1/1 1/1 1/1 0/0 1/1 0/0 0/0 -
T16 Edit distance 16 from common English word 2/2 0/0 0/0 0/0 0/0 1/1 1/1 -
R A-Z only, not near a common English word 168446/121107 165841/119452 165960/119619 165403/119208 165091/119086 169103/121936 166235/120111 101178/77389
I Letters with accents or mixed with punctuation (other than hyphen) 266937/143960 261310/144833 261653/145040 263654/145754 263679/146027 275444/153887 229579/149303 93902/70014
W Not in English Wiktionary, in non-English Wiktionary - - - - - - - 82548/48389
L Probable Romanization (transLiteration) - - - - - - - 4294/2610
ME Probable coMpound, English - - - - - - - 51279/33301
MI Probable coMpound, non-English (International) in English Wiktionary - - - - - - - 194949/133055
MW Probable coMpound, found in non-English Wiktionary - - - - - - - 51656/36961
ML Probable coMpound, transLiteration - - - - - - - 4010/2791
C Chemistry words 6581/4604 6597/4619 6613/4629 6631/4638 6633/4624 6618/4618 6637/4625 1853/1399
D DNA sequences (a, c, g, t) 51/18 15/3 16/4 16/4 15/3 15/3 2/2 0/0
N A-Z plus numbers and hyphens 25061/20114 25728/20854 25702/20846 25748/20899 25582/20737 26201/21255 25969/21130 26620/21685
P Patterns (e.g. rhyme schemes) 808/461 796/484 790/484 778/478 736/439 744/443 493/423 47/33
H HTML/XML/SGML tag - - - - - - 3389/1592 3519/1593
HB Known bad HTML tag, like <font> - - - - - - 14417/49 15366/49
HL Bad HTML-like linking, like <http://...> - - - - - - 519/5 516/5
Parse failure Mismatched punctuation ? ? ? 202583 203044 203611 214525 199130 articles
Total 1092214/690639 1113627/715148 1112459/714927 1105804/711232 1095150/706671 1120169/723334 1075547/711296 1043175/695061
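
The T1 to T16 rows above sort strings by their edit distance from a common English word. As a rough illustration, here is a sketch that uses plain Levenshtein distance; whether moss uses this exact metric (as opposed to, say, one that counts a transposition as a single edit) is an assumption, and `classify` is a hypothetical helper name.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def classify(word: str, common_words) -> str:
    """Label a word T<n>, where n is the distance to the nearest common word.

    common_words must be a non-empty collection.
    """
    nearest = min(edit_distance(word, w) for w in common_words)
    return f"T{nearest}"
```

For example, "encyclopdia" is one deletion away from "encyclopedia", so it would land in T1 under this sketch.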

2019-03 to 2020-02

From 2018-09-20 to 2019-03-01, the number of typos classified as T1 (edit distance 1 from an English word, the most likely to be actual misspellings) dropped by 35,488, or 32%, and this appears to be due to the hard work of editors participating in the moss project fixing typos on the T1 lists. Amazing progress! The numbers for categories we aren't fixing have remained relatively stable, though for all categories there is some bouncing around as new typos are created and fixed in the normal course of writing and editing articles.
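
The drop quoted above can be reproduced from the T1 instance counts in the first and last columns of the 2018-09 to 2019-03 table (111429 and 75941). A small sketch of the arithmetic, with `change` as a hypothetical helper name:

```python
def change(old: int, new: int):
    """Return (absolute change, percent change rounded to a whole number)."""
    delta = new - old
    return delta, round(100 * delta / old)


# T1 instances: first column vs. last column of the 2018-09 to 2019-03 table.
delta, pct = change(111429, 75941)
print(delta, pct)
```

The same helper can be applied to any pair of columns in the tables on this page.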

While processing the 2019-03-01 dump, I made a major change to how typos are classified. (You can see the old method in the archived statistics.) I've dropped categories with an edit distance greater than 3 from an English word (T4 thru T16) since these are quite unlikely to be misspellings. Most of the reported typos that are not likely English misspellings are either compound words or non-English words. (Some of the non-English words are also misspelled.) Some English compounds end up as TS, if they are caught by a conventional spell checker; the rest are now classified as ME. (There are various other categories for compounds, all starting with M, and these will all need to be refined later because a fair number of words are up there that don't belong.) In an effort to exclude as many non-English words as possible, I've started looking at non-English Wiktionaries; any words found there but not in the English Wiktionary are classified as W. Romanizations are not eligible for Wiktionary; words native to non-Latin writing systems are entered under those other systems. I've written some code that attempts to perform transliteration from any given writing system. It's starting to catch a few thousand words (classified as L) but is obviously missing a lot and so will need to be further refined. I've also added some categories for bad HTML tags and similar problems.

Since the classification changes make the new numbers incomparable with the old numbers, I've started a new table below. I've started posting some TS typos as well as T1s, so expect both of those numbers to improve significantly in the coming months. -- Beland (talk) 07:30, 23 March 2019 (UTC)
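
To make the W and ME categories described above more concrete, here is a heavily simplified sketch of a dictionary-based cascade. Everything in it is an assumption for illustration: the function name, the order of the checks, the two-part compound rule, and the use of plain Python sets as stand-ins for the English and non-English Wiktionary word lists. The real classifier has many more categories (TS, T1 to T3, L, MI, MW, ML and so on).

```python
def classify_word(word, english_wikt, other_wikt):
    """Return a category symbol, or None if the word is already known.

    english_wikt / other_wikt: hypothetical sets of lowercase words from the
    English Wiktionary and from non-English Wiktionaries respectively.
    """
    w = word.lower()
    if w in english_wikt:
        return None            # known word, nothing to report
    if w in other_wikt:
        return "W"             # only found in a non-English Wiktionary
    # Try to split into two known English parts of at least two letters each.
    for i in range(2, len(w) - 1):
        if w[:i] in english_wikt and w[i:] in english_wikt:
            return "ME"        # probable English compound
    return "unclassified"      # placeholder for all remaining categories
```

For instance, with "dog" and "house" in the English set, "doghouse" would come out as ME under this sketch.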

Reporting symbol Explanation Change from 2019-03-01 to 2020-02-20 Instances, 2019-03-01 dump (692642d) Instances, 2019-03-20 dump (802b6c0) Instances, 2019-04-01 dump (ab3fabd) Instances, 2019-04-20 dump (7bb97ba) Instances, 2019-05-01 dump (dcb388a) Instances, 2019-05-20 dump (dcb388a) Instances, 2019-06-01 dump (30a59f6) Instances, 2019-07-01 dump (2fc381f) Instances, 2019-07-20 dump (41f99ab) Instances, 2019-08-01 dump (bc954d6) Instances, 2019-08-20 dump (c600526) Instances, 2019-09-01 dump (4660042) Instances, 2019-09-20 dump (18f7307) Instances, 2019-10-01 dump (08a1438) Instances, 2019-10-20 dump (e07a89f) Instances, 2019-11-01 dump (e07a89f) Instances, 2019-11-20 dump (e07a89f) Instances, 2019-12-01 dump (95d1a53) Instances, 2019-12-20 dump (0434c67) Instances, 2020-01-20 dump (99af116) Instances, 2020-02-20 dump (99af116)
TS Missing or extra whitespace or dash (or new compound) -39368 (-21%) 183795 182018 (-1777/.97%) 178591 (-3427/1.9%) 177391 176266 175163 173312 170828 168401 166966 164205 161344 160707 157832 155980 155218 152621 147666 146591 144424 144427
T1 Edit distance 1 from common English word -36192 (-48%) 75941 73600 (-2341/3.1%) 70756 (-2844/3.9%) 69261 68790 66099 64732 61255 57141 55160 51987 48904 45926 44275 40436 39285 39106 39721 39301 38737 39749
T2 Edit distance 2 from common English word -7560 (-10%) 72093 71615 (-478/.66%) 70949 (-666/.93%) 70909 70684 70247 69741 69629 69365 69266 69146 68748 68657 67161 66173 65589 64952 64890 64886 64691 64533
T3 Edit distance 3 from common English word -5276 (-7%) 79609 78925 (-684/.86%) 78209 (-716/.91%) 78139 78046 77541 76954 76887 76672 76691 76663 75998 76061 75096 74636 74327 73995 74030 74551 74419 74333
R Regular word (A-Z only) not near a common English word -3525 (-3%) 101178 100067 (-1111/1.1%) 99491 (-576/.58%) 99722 99694 99236 98856 98788 98646 98498 98411 97438 97588 96865 96775 96746 96490 96593 96948 97342 97653
I Definitely not English (International) due to accents or mixed with punctuation (other than hyphen) -22196 (-24%) 93902 90875 (-3027/3.2%) 88564 (-2311/2.5%) 87748 87925 84690 81042 81284 82263 82412 82431 71982 71240 70248 70349 70385 70510 70468 70714 70856 71706
W Not in English Wiktionary, in non-English Wiktionary -6764 (-8%) 82548 82519 (-29/.04%) 80041 (-2478/3.0%) 79664 79486 77888 76310 76309 76224 76177 76142 75508 76248 75263 74906 74816 74851 74991 75294 75663 75784
L Probable Romanization (transLiteration) +81 (+2%) 4294 4306 (+12/.28%) 4206 (-100/2.3%) 4219 4237 4197 4168 4181 4189 4188 4191 4191 4234 4115 4126 4132 4182 4195 4228 4282 4375
ME Probable coMpound, English (with and without dash) +976 (+2%) 51279 51052 (-227/.44%) 50845 (-207/.41%) 50932 50902 50659 50263 50352 50439 50419 50700 50606 50708 50392 51830 51791 51782 51830 52026 52173 52255
MI Probable coMpound, non-English (International) in English Wiktionary (both A-Z and non-ASCII characters, with and without dash) -18475 (-9%) 194949 192743 (-2206/1.1%) 189661 (-3082/1.6%) 189758 190172 187870 184497 185101 185733 185960 186074 175904 176069 174746 173592 173700 173611 173710 174881 175528 176474
MW Probable coMpound, found in non-English Wiktionary -5544 (-11%) 51656 51240 (-416/.81%) 50288 (-952/1.9%) 50026 49785 48728 47641 47642 47544 47831 47555 46854 46850 46342 46232 46026 45944 45968 46031 45947 46112
ML Probable coMpound, transLiteration -124 (-3%) 4010 3964 (-46/1.1%) 3925 (-39/.98%) 3881 3892 3835 3829 3827 3826 3857 3853 3849 3852 3779 3750 3759 3786 3798 3834 3863 3886
C Chemistry words -176 (-9%) 1853 1855 (+2/.11%) 1863 (+8/.43%) 1862 1858 1864 1569 1559 1554 1560 1561 1552 1551 1665 1662 1651 1635 1639 1657 1662 1677
D DNA sequences (a, c, g, t) 0 0 0 (-) 0 (-) 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
N A-Z plus numbers and hyphens -1391 (-5%) 26620 25854 (-766/2.8%) 25711 (-143/.56%) 25739 26263 26134 25945 25841 25703 25650 25664 26664 25776 25557 25245 25072 24942 24993 25119 25107 25229
P Patterns (e.g. rhyme schemes) -20 (-43%) 47 50 (+3/6.4%) 49 (-1/2.0%) 50 48 47 50 49 45 42 38 37 39 17 18 16 17 19 21 19 27
H HTML/XML/SGML tag -539 (-15%) 3519 3459 (-60/1.7%) 3423 (-36/1.0%) 3420 3404 3237 3197 3160 3173 3180 3190 3059 3078 3003 3016 3673 3012 3019 3019 2978 2980
HB Known bad HTML tag, like <font> -1080 (-7%) 15366 14837 (-529/3.4%) 14541 (-296/2.0%) 14776 14622 16313 16286 16818 16816 ? 15558 14620 15525 15262 14494 14891 14872 15003 15116 14164 14286
HL Bad HTML-like linking, like <http://...> -98 (-19%) 516 510 (-6/1.2%) 501 (-9/1.8%) 500 497 492 491 496 492 493 492 474 482 459 448 449 441 441 446 433 418
U URL -94 (-7%, from 2019-03-20) - 1284 1242 (-42/3.3%) 1235 1222 1225 1218 1225 1227 1213 1200 1219 1213 1192 1197 1196 1194 1199 1205 1192 1190
BC Bad characters -12678 (-6%, from 2019-09-01) - - - - - - - - - - - 205046* 196231 194847 194674 194281 192895 192845 192679 192523 192368
BW Bad words -6542 (-5%, from 2019-09-20) - - - - - - - - - - - 306181* 120289* 115983 116073 115612 115522 117419 115418 114602 113747
Total -39115 (-3%, from 2019-09-20) 1043175 instances 1030773 instances (-12402/1.2%) 1012856 instances (-17917/1.7%) 1009232 1007793 995465 980102 975232 969454 964828 959061 1440178* instances 1242324* instances 1224099 instances 1215612 instances 1212615 instances 1206360 instances 1204437 instances 1203965 instances 1200605 instances 1203209 instances
Parse failure Mismatched punctuation -5145 (-3%) 199130 articles 200032 articles (+902/.45%) 195598 articles (-4434/2.2%) 195995 articles 196330 articles 196566 articles 196882 articles 197380 articles 197810 articles 198086 articles 198442 articles 158283 articles + 40465 MOS:STRAIGHT violations 158564 articles + 40523 MOS:STRAIGHT violations 151604 articles + 39214 MOS:STRAIGHT violations 151827 articles + 39333 MOS:STRAIGHT violations 152017 articles + 39428 MOS:STRAIGHT violations 152167 articles + 39590 MOS:STRAIGHT violations 152254 articles + 39727 MOS:STRAIGHT violations 152557 articles + 39971 MOS:STRAIGHT violations 152835 articles + 40112 MOS:STRAIGHT violations 153494 articles + 40491 MOS:STRAIGHT violations

* Affected by significant algorithm changes. 1 Sep 2019: Added BC and BW. (Parse failures dropped due to JWB-powered MOS:STRAIGHT cleanup.) 20 Sep 2019: BC and BW restricted to lowercase; added TS+COMMA, TS+BRACKET, TS+EXTRA.

  • red = Probably need to fix
  • yellow = Unsorted
  • blue = Probably OK (but may need to verify)
  • bold = actively working on fixing
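
The HB and HL rows in the table above flag bad HTML tags such as <font> and HTML-like links such as <http://...>. A sketch of how such strings might be found with regular expressions follows. Only <font> is named on this page; the other tags in the pattern (center, strike, tt) are obsolete in HTML5 but are included here purely as an assumption, and both function names are hypothetical.

```python
import re

# Opening tags that are treated as "known bad" in this sketch.
BAD_TAG_RE = re.compile(r"<\s*(font|center|strike|tt)\b[^>]*>", re.IGNORECASE)

# A URL wrapped in angle brackets, as if it were an HTML tag.
LINK_TAG_RE = re.compile(r"<https?://[^\s>]+>")


def find_bad_tags(text: str):
    """Return the lowercase names of bad opening tags found in text (HB)."""
    return [m.group(1).lower() for m in BAD_TAG_RE.finditer(text)]


def find_html_like_links(text: str):
    """Return angle-bracketed URLs found in text (HL)."""
    return LINK_TAG_RE.findall(text)
```

Closing tags such as </font> are deliberately not counted, so each element is reported once.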

2020 statistics

In the year from March 2019 to March 2020, moss volunteers fixed over 94,000 typos! The most impressive progress is in the T1 category (single-letter misspellings), where we eliminated about half from the English Wikipedia. During this period we also started fixing missing spaces (focusing on those around punctuation) and those have dropped by about one-fifth. As we make progress, clear misspellings are increasingly mixed in with unclear cases; I'll be doing some more work on separation algorithms to keep the typo reports useful, so you'll probably see some more changes to typo classifications. Thanks to everyone who has been helping out! -- Beland (talk) 16:54, 28 April 2020 (UTC)

Reporting symbol Explanation Change from 2019-03-01 to 2020-02-20 Instances, 2020-04-01 dump (9f6d726) Instances, 2020-04-20 dump (5ff589d) Instances, 2020-05-01 dump (1a96ded) Instances, 2020-05-20 dump (e511f74) Instances, 2020-06-01 dump (509f79a) Instances, 2020-06-20 dump (825ceb4) Instances, 2020-07-01 dump (db9db23) Instances, 2020-07-20 dump (caa619f) Instances, 2020-08-01 dump (cf76e8c) Instances, 2020-08-20 dump (f104e58) Instances, 2020-09-01 dump (4654d88) Instances, 2020-09-20 dump (a26ccca) Instances, 2020-10-01 dump (686f5db) Instances, 2020-10-20 dump (4f90810) Instances, 2020-11-01 dump (ac54580) Instances, 2020-11-20 dump (6dbd61d) Instances, 2020-12-01 dump (917bcc8) Instances, 2020-12-20 dump (0b3409d)
TS Missing or extra whitespace or dash (or new compound) -39368 (-21%) 145297 144673 331658** 330624 328249 325399 324179 322282 321801 318621 317183 315825 314747 312110 310537 309386 308280 308977
T1 Edit distance 1 from common English word -36192 (-48%) 41090 41081 39967 39452 38783 38379 38436 38271 37803 36783 35976 34036 33539 33764 32347 33097 33559 33427
T2 Edit distance 2 from common English word -7560 (-10%) 64526 63263 60690 60321 59589 58603 58649 58521 58200 58085 57845 57329 57152 57487 57387 57511 57386 57348
T3 Edit distance 3 from common English word -5276 (-7%) 74396 73255 70516 70039 68887 68192 68149 68020 67769 67788 67482 67226 67025 67101 67002 67213 67298 67399
R Regular word (A-Z only) not near a common English word -3525 (-3%) 97726 96916 94793 93855 93252 91537 91489 91746 91521 91729 91513 91613 91339 91813 92329 93246 93377 93493
I Definitely not English (International) due to accents or mixed with punctuation (other than hyphen) -22196 (-24%) 72151 69118 65842 64827 63630 61844 61888 61782 61899 62113 61916 62003 62049 62274 62287 62390 62234 62471
W Not in English Wiktionary, in non-English Wiktionary -6764 (-8%) 75913 74351 86935 85604 83173 81894 81946 82173 81943 82170 81912 81968 81792 81256 81052 81224 81131 81192
L Probable Romanization (transLiteration) +81 (+2%) 4435 4486 4266 4199 4120 4122 4104 4113 4137 4140 4151 4164 4165 4207 4203 4234 4240 4260
ME Probable coMpound, English (with and without dash) +976 (+2%) 52269 48761 47187 47153 46830 46856 46967 47163 47052 47170 47009 47070 47066 47045 47023 47193 47142 47302
MI Probable coMpound, non-English (International) in English Wiktionary (both A-Z and non-ASCII characters, with and without dash) -18475 (-9%) 177646 176929 171484 169592 166216 164828 165140 165351 165605 166016 166208 166499 166572 167349 167961 169044 168953 169409
MW Probable coMpound, found in non-English Wiktionary -5544 (-11%) 46113 45103 43501 42931 40436 41383 41325 41440 41173 41234 40990 40956 40795 40353 40272 40454 40411 40338
ML Probable coMpound, transLiteration -124 (-3%) 3909 3874 3707 3663 3672 3575 3589 3593 3628 3639 3658 3717 3724 3779 3769 3825 3830 3822
C Chemistry words -176 (-9%) 1782 7564 7530 7644 7640 7655 7658 7659 7660 7662 7654 7644 7659 7661 7665 7659 7674 7700
N A-Z plus numbers and hyphens -1391 (-5%) 25209 23813 22650 22511 22290 22020 22052 22053 21971 22009 21960 21923 21879 21856 21885 21898 21893 21943
Z Decimal fraction missing leading Zero - 47* 0* 11405** 11418 11414 11398 11402 11421 11455 11530 11546 11578 11598 11669 11683 11703 11728 11762
P Patterns (e.g. rhyme schemes) -20 (-43%) 27 28 7 9 7 7 3 2 2 4 5 4 5 5 4 5 5 5
H HTML/XML/SGML tag -539 (-15%) 3010 2886 2938 2903 2904 2848 2693 2697 2680 2747 2757 2729 2565 2569 2542 2538 2540 2572
HB Known bad HTML tag, like <font> -1080 (-7%) 14465 14121 12903 13928 12919 14733 14022 11428 11670 11198 10191 8860 8756 8842 9725 11088 10164 10556
HL Bad HTML-like linking, like <http://...> -98 (-19%) 414 418 377 394 394 421 408 425 420 413 373 359 356 329 324 315 318 328
U URL -94 (-7%, from 2019-03-20) 1179 1152 1118 1134 1117 1122 1129 1124 1120 1124 1124 1103 1101 1099 1091 1096 1050 1055
BC Bad characters -12678 (-6%, from 2019-09-01) 192230 190482 186651 186517 185572 178698 175325 166116 159095 124158 112959 112755 112695 112633 112479 110608 110025 109808
BW Bad words -6542 (-5%, from 2019-09-20) 113682 106327 381288** 380259 378710 374982 375107 375206 375431 375306 374622 374740 374560 375010 375008 375557 374989 375663
Total -39115 (-3%, from 2019-09-20) 1207516 instances 1188601 instances 1647413** instances 1638977 instances 1619804 instances 1600496 instances 1595660 instances 1582586 instances 1574035 instances 1535639 instances 1519034 instances 1514101 instances 1511139 instances 1510211 instances 1508575 instances 1511284 instances 1508227 instances 1510830 instances
Parse failure Mismatched punctuation -5145 (-3%) 154084 articles + 40705 MOS:STRAIGHT violations 153033 articles + 40838 MOS:STRAIGHT violations 214365 articles + 37697 MOS:STRAIGHT violations 214463 articles + 37667 MOS:STRAIGHT violations 214101 articles + 37607 MOS:STRAIGHT violations 214465 articles + 37767 MOS:STRAIGHT violations 214732 articles + 37849 MOS:STRAIGHT violations 215081 articles + 37993 MOS:STRAIGHT violations 215447 articles + 38067 MOS:STRAIGHT violations 215915 articles + 38169 MOS:STRAIGHT violations 216227 articles + 38210 MOS:STRAIGHT violations 216472 articles + 38205 MOS:STRAIGHT violations 216738 articles + 38213 MOS:STRAIGHT violations 216991 articles + 38246 MOS:STRAIGHT violations 217192 articles + 38338 MOS:STRAIGHT violations 217660 articles + 38498 MOS:STRAIGHT violations 217861 articles + 38625 MOS:STRAIGHT violations 218207 articles + 38789 MOS:STRAIGHT violations
  • red = Probably need to fix
  • yellow = Unsorted
  • blue = Probably OK (but may need to verify)
  • bold = actively working on fixing

* Identification of Z was broken
** Affected by major bug fix for counting inter-word typos (e.g. involving punctuation)
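
For the Z category (decimal fraction missing a leading zero), detection could be sketched with a single regular expression. This is an assumption about the general approach, not the moss code, and `find_missing_leading_zero` is a hypothetical name.

```python
import re

# A dot followed by digits, where the dot is not preceded by a letter,
# digit, underscore or another dot (so "0.5", "v1.2" and "...5" are skipped).
MISSING_ZERO_RE = re.compile(r"(?<![\w.])\.\d+")


def find_missing_leading_zero(text: str):
    """Return fractions such as '.5' that lack a leading zero."""
    return MISSING_ZERO_RE.findall(text)
```

Conventional forms such as ".22 caliber" would also be matched, so results from a check like this would still need human review.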

2021 statistics

Dump (moss version) Parse failures (articles + articles with MOS:STRAIGHT violations) TOTAL (instances) BC BW C H HB HL I L ME MI ML MW N P R T1 T2 T3 TS U W Z D
2021-01-01 (b4af24a) 218317 + 38841 1505808 108661 375875 7705 2550 10726 311 62583 4262 47274 169504 3841 40131 21954 4 93373 32968 56903 66819 306445 1054 81112 11753
2021-01-20 (a249b2d) 218455 + 38930 1506940 108030 376079 7679 2616 11036 298 62746 4298 47044 170234 3885 39960 21959 4 93467 33598 56688 66688 306776 1042 81049 11764
2021-02-01 (8279235) 218833 + 38960 1506004 107000 375979 7677 2595 11729 298 62829 4305 47053 171005 3888 39771 21971 2 93726 33237 56822 66707 305573 1035 81079 11723
2021-02-20 (2f00c51) 218991 + 39035 1504064 106534 375909 7682 2602 11697 275 62942 4342 47036 171313 3897 39732 22009 3 93959 32705 56529 66617 304463 1020 81041 11757
2021-03-01 (248159a) 219198 + 39155 1494162 106421 376305 7669 2624 9291 281 62978 4328 46830 169666 3876 39189 21936 4 92221 32762 56197 66069 302377 1020 80338 11780
2021-03-20 (57aaae7) 219556 + 39371 1492923 106284 375853 7695 2610 9965 278 63055 4331 47064 170453 3880 39172 21998 2 92721 32523 56052 66087 299751 1002 80305 11842
2021-04-01 (d47c725) 219692 + 39478 1484879 105670 375757 7697 2620 8857 205 62842 4309 46966 170369 3884 38886 21964 0 92575 32160 55810 65706 296009 995 79736 11862
2021-04-20 (d169566) 220014 + 39634 1476477 104505 374548 7686 2648 8863 199 62668 4327 47036 170547 3878 38644 21973 4 92336 30560 55284 65191 293170 985 79487 11938
2021-05-01 (7719363) 219292 + 39601 1445819 103253 367236 7661 2387 7682 178 59749 3966 44397 165787 3774 38591 21697 4 91448 30666 56556 65257 283967 980 78634 11949
2021-05-20 (c6359fc) 219284 + 39761 1444570 102794 368258 7678 2271 7878 176 59913 3978 44514 166538 3804 38629 21725 4 91887 29205 56341 65171 282093 983 78651 12079
2021-06-01 (076f14c) 219111 + 39759 1441769 102409 368046 7689 2275 7827 166 59876 3943 44658 166622 3818 38567 21755 5 92077 28507 56157 64919 280645 975 78682 12151
2021-06-20 (ffbc72f) 219625 + 39935 1435330 101926 367522 7694 2276 7108 162 59650 3964 44692 167038 3819 38298 21687 8 92365 28020 55983 64688 276538 955 78621 12316
2021-07-01 (cb3d5e8) 219791 + 39990 1433415 101916 367581 7704 2263 6921 169 59663 3960 44770 167508 3837 38299 21674 8 92600 27369 55755 64301 275024 946 78720 12427
2021-07-20 (5c3b9e9) 220086 + 40132 1429627 101518 367954 7688 2136 6702 137 59995 3955 44805 167818 3824 38179 21646 7 92660 26469 55565 64171 272147 950 78624 12677
2021-08-01 (86e7022) 220338 + 40213 1424448 101229 367552 7708 2123 6252 121 61727 3767 44851 168279 3812 36769 21643 0 93146 26555 55547 64124 271406 953 74189 12695
2021-08-20 (33a14e3) 220370 + 40254 1414854 100973 367172 7719 2047 5736 119 59520 3746 44729 167010 3811 37772 21537 2 92763 24146 54950 63571 266761 960 77075 12735
2021-09-01 (90e0a3b) 220449 + 40268 1411194 100113 367110 7714 2046 5801 120 59567 3733 44623 167222 3824 37710 21525 2 92833 23310 54796 63455 265044 953 76926 12767
2021-09-20 (c71a444) 220781 + 40328 1412140 99635 367286 7713 2040 5650 121 59595 3766 44828 167997 3843 37719 21561 0 93701 22924 54661 63575 264775 948 76966 12836
2021-10-01 (cdd699c) 221094 + 40362 1405448 99065 367498 7683 2060 5774 111 59546 3710 44579 167357 3831 37696 21381 2 93027 22576 54268 63134 261463 952 76883 12851 1

A major upgrade to word categorization was made in October 2021. The same dump (2021-10-01) is shown on the old and new systems for comparison. The R, I, W, MI, MW, and ML categories were eliminated; their words are now sorted by language as TE or TF instead. New categories:

  • A = mAth
  • T/ = Suspected MOS:SLASH violation
  • TE = AI thinks it's trying to be English
  • TF = AI thinks it's trying to be a non-English language (Foreign to English Wikipedia), sorted by language (e.g. TF+el)
Dump (moss version) Parse failures (articles + articles with MOS:STRAIGHT violations) TOTAL (instances) A BC BW C H HB HL L ME N P T/ T1 TE TF TS U Z
2021-10-01 (2ec07e4) 221094 + 40362 1457644 17030 175488 367537 4049 2060 5774 111 5428 237959 2329 37 3237 54108 10076 439099 118822 1649 12851
2021-10-20 (b44e087) 221396 + 40415 1452333 22433 173701 381776 7762 2032 5341 95 5399 219482 2351 6 3252 53679 10151 438103 112265 1613 12892
2021-11-01 (0786728) 221592 + 40396 1476996 22385 97423 481799 7793 1573 5122 97 5399 219638 2297 9 3246 53546 10145 440061 111957 1607 12899
2021-11-20 (34069e9) 153165 + 42992 1491000 23808 99945 497995 7816 1609 5587 111 5688 222435 2340 9 3373 53516 9847 426498 116119 1642 12662
2021-12-01 (0fc2fb3) 153177 + 42994 1489025 23727 99782 496905 7828 1558 5602 104 5702 222571 2346 8 3359 53405 9816 425937 116070 1627 12678
2021-12-20 (d20f520) 153289 + 42902 1488550 23761 99074 496904 7845 1561 5601 108 5715 223063 2351 4 3337 53580 9806 425623 115890 1618 12709

2022 statistics

Dump (moss version) Parse failures (articles + articles with MOS:STRAIGHT violations) TOTAL (instances) A BC BW C D H HB HL L ME N P T/ T1 TE TF TS U Z
2022-01-01 (92506e2) 153265 + 42919 1488043 23730 98949 496872 7872 0 1561 5712 108 5744 222842 2355 8 3337 53020 9801 425923 115845 1608 12756
2022-01-20 (f63dc78) 153371 + 42894 1490532 23729 98433 497315 7875 1 1603 6158 108 5794 223402 2345 5 3325 53057 9667 426560 116722 1594 12839
2022-02-01 (8fbf720) 153444 + 43002 1621627 23804 98366 497551 7934 1 1579 6051 108 6007 240216 2381 13 3334 58724 11652 531477 117630 1599 13200
2022-02-20 (8245233) 153724 + 43135 1622459 23835 98083 497766 7956 1 1604 5177 102 5999 240497 2370 14 3281 59384 11661 531576 118343 1616 13194
2022-03-01 (8245233) 153733 + 43208 1624427 23837 98107 497855 7989 1 1571 5815 102 6027 240789 2371 16 3278 59744 11669 531890 118567 1608 13191
2022-03-20 (fb66b79) 153882 + 43327 1624509 23823 97961 498466 7996 1 1552 4746 106 6059 241192 2363 15 3311 60058 11638 531382 119054 1601 13185
2022-04-01 (fb66b79) 153932 + 43430 1626452 23823 97828 498085 8000 1 1594 4793 105 6063 241718 2375 16 3327 60572 11642 532088 119684 1591 13147
2022-04-20 (fb66b79) 154017 + 43596 1630486 23789 97841 498611 8012 1 1607 4990 105 6065 242940 2374 17 3337 60977 11649 532927 120483 1587 13174
2022-05-01 (fb66b79) 153825 + 43698 1631287 23793 97801 498632 8020 1 1609 5048 104 6073 243306 2384 20 3337 61453 11694 533878 119359 1579 13196
2022-05-20 (cc63e5f) 153870 + 43814 1635174 23851 97718 498090 8043 1 1636 4925 107 6103 243986 2385 19 3337 59550 11866 538310 120406 1574 13267
2022-05-20 (ae346b0)* 164831 + 29862 1620797 23846 92522 487792 8099 1 1631 4930 110 6076 244851 2308 18 3335 60170 11838 538751 119670 1580 13269
2022-06-01 (6090418) 164899 + 29887 1620209 23786 92402 487512 8099 1 1620 4620 113 6090 245017 2309 16 3331 60318 11803 538115 120085 1587 13385
2022-06-20 (97d23b9) 164770 + 29816 1617952 23775 91799 486712 8102 0 1611 4705 116 6087 245190 2319 13 3300 59666 11763 538585 119215 1568 13426
2022-06-20 (1432a2f)† 164877 + 29821 1677855 23781 91816 547534 8102 0 1611 4706 116 6071 245153 2318 13 3297 59659 11764 537643 119292 1554 13425
2022-07-01 (9ab6dad) 164769 + 29855 1674273 23732 91585 547881 8113 0 1644 4657 116 6110 244376 2295 143 3261 59286 11657 535628 118761 1559 13469
2022-07-20 (06d752b) 164636 + 29850 1674512 23605 91172 547558 8111 0 1663 4856 126 6127 244725 2294 144 3272 58857 11659 536841 118429 1550 13523
2022-08-01 (622271d) 164730 + 29865 1675287 23593 90912 547590 8080 0 1660 4926 127 6144 244829 2284 145 3273 58908 11604 537355 118773 1553 13531
2022-08-20 (597dbd2) 163908 + 29808 1667614 23508 90561 544710 8081 0 1653 5137 121 6136 243853 2287 122 3234 58163 11473 536597 117099 1535 13344
2022-08-20 (5ee7ffd)‡ 162500 + 29580 1210578 10681 86656 540463 7981 0 1611 5136 122 2073 182672 1964 114 2307 43457 6582 206072 97829 1522 13336
2022-08-20 (6965e1f)⹋ 162432 + 29567 1205869 10669 86557 538964 7979 0 1610 5131 122 2041 181481 1963 114 2298 43278 6540 204575 97689 1520 13338
2022-09-01 (cda0784) 161909 + 29468 1198769 10663 86161 536440 7990 0 1603 5399 120 1977 180548 1945 99 2270 42927 6445 202651 96760 1485 13286
2022-09-20 (4689b50) 162154 + 29594 1199166 10676 85924 536599 7981 0 1621 6730 125 1985 180428 1950 99 2267 42279 6383 202327 96972 1487 13333
2022-10-01 (e725bbd) 161370 + 29450 1193722 10646 84999 534429 7981 0 1623 6988 123 1964 179378 1934 99 2259 42089 6356 201547 96530 1466 13311
2022-10-20 (e725bbd) 161347 + 29546 1192591 10632 84851 534850 7998 0 1623 6987 121 1981 178500 1921 101 2271 41414 6264 201358 96915 1454 13350
2022-11-01 (ebbea0e) 161388 + 29603 1192455 10634 84376 535156 8036 0 1633 6505 116 1976 178546 1917 102 2270 41341 6217 201463 97334 1450 13383
2022-11-20 (84f0fc4) 161548 + 29683 1193478 10659 84327 535811 8112 0 1614 6622 115 1970 178817 1918 102 2259 41326 6187 201180 97563 1444 13452
2022-12-01 (d57116b) 161334 + 29741 1193626 10650 84229 536307 8124 0 1604 6503 110 1981 178844 1913 102 2262 41018 6181 201090 97779 1446 13483
2022-12-20 (003741b) 161351 + 29828 1189035 10658 83972 535095 8218 0 1592 4957 110 1971 178831 1917 1 2236 41413 6177 198807 98124 1431 13525

* ae346b0 started ignoring content inside curly quotes
† 1432a2f excluded more end sections
‡ 5ee7ffd started ignoring italicised content
⹋ 6965e1f started ignoring content inside single quotes
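
As a rough illustration of what "ignoring" such content could mean, the sketch below blanks out curly-quoted and italicised spans before any spell checking would run. The regular expressions and the function name `strip_ignored_spans` are assumptions made for this example; the actual moss commits may detect these spans quite differently. Single-quoted content is left out because apostrophes make it ambiguous to match with a simple pattern.

```python
import re

# Hypothetical patterns, not taken from the moss source.
CURLY_QUOTED = re.compile(r"“[^”]*”")
# Wikitext italics use exactly two apostrophes; the lookarounds skip '''bold'''.
ITALIC = re.compile(r"(?<!')''(?!')(.+?)(?<!')''(?!')")


def strip_ignored_spans(wikitext):
    """Replace curly-quoted and italic spans with a space, then tidy whitespace."""
    text = CURLY_QUOTED.sub(" ", wikitext)
    text = ITALIC.sub(" ", text)
    return " ".join(text.split())


print(strip_ignored_spans("He wrote “teh end” in ''Zeitschrift'' yesterday"))
```

Replacing each span with a space rather than deleting it keeps the words on either side from being fused into one token.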

Likely new words by frequency (non-English)

From 2019-02-01 dump:

From 2019-02-01 dump, but clearly not foreign words (need to figure out what to do with them):

Case notes from 2019-06-01 dump

  • 1 - QueA RNA motif - wikt:preQ - This appears as preQ1, which does have a Wiktionary entry (wikt:preQ1), so why is it included here?
    • Weird, I'll have to debug that. -- Beland (talk) 08:47, 16 June 2019 (UTC)
      • Oh, of course, because sup and sub tags cause text on either side to be in different tokens. I'll try changing that and see if it is an overall improvement. That should also fix things like chemical formulas, so I think it will be good. -- Beland (talk) 02:10, 9 May 2020 (UTC)
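
The tokenization effect described in this thread can be reproduced with a minimal sketch. It assumes a simple whitespace tokenizer and a regular expression for sup/sub tags; both are stand-ins chosen for illustration and are not the actual moss implementation.

```python
import re

# Matches opening and closing <sub> and <sup> tags (hypothetical pattern).
SUP_SUB_TAG = re.compile(r"</?su[bp]\b[^>]*>", re.IGNORECASE)


def tokenize_splitting(text):
    """Treat each tag as a token boundary, so preQ<sub>1</sub> falls apart."""
    return SUP_SUB_TAG.sub(" ", text).split()


def tokenize_joining(text):
    """Drop the tags without adding a boundary, so preQ<sub>1</sub> stays whole."""
    return SUP_SUB_TAG.sub("", text).split()


sample = "preQ<sub>1</sub> riboswitch and H<sub>2</sub>O"
print(tokenize_splitting(sample))
print(tokenize_joining(sample))
```

With the splitting variant, "preQ" is looked up on its own and has no Wiktionary entry; with the joining variant the token is "preQ1", which does, and a formula such as H2O likewise stays in one piece.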