Wikipedia talk:AutoWikiBrowser/Typos



Womens > Women's

Should this rule apply to lowercase "womens" only? See Apostrophe#Possessives in names of organizations. -- John of Reading (talk) 17:13, 11 September 2010 (UTC)[reply]

I can't find any "womens", capitalized or not, in wikipedia. We can delete the rule altogether.--BillFlis (talk) 06:45, 1 October 2010 (UTC)[reply]

Womens Bay, Alaska, Sheffield Wednesday Womens F.C., List_of_WWE_Women's_Champions, University of Pittsburgh Medical Center, Womens Bay, Women in Ancient Rome, Apostrophe, 2009 Adelaide Football Club season. I think you can get the picture! Regards, SunCreator ^(talk) 00:40, 15 October 2010 (UTC)[reply]

Italicise Latin words and phrases

Please italicise Latin words and phrases, the most common being et cetera (or etcetera, et caetera or et cætera), de facto, de jure, id est, ad libitum, circa, floruit and exempli gratia. McLerristarr / Mclay1 07:49, 14 September 2010 (UTC)[reply]

I suggested this earlier but it got archived before anything was done about it. Manual archiving, like on Wikipedia talk:AutoWikiBrowser/Feature requests, would be much better. McLerristarr / Mclay1 03:18, 4 October 2010 (UTC)[reply]

Rules for "Consider" and "Considered"

I don't agree with the rule for Considered changing "consideres" → "considered", as the proper word could be "considers". (e.g. this edit) I hope you'll reconsider (pun intended) this rule. Speaking of which, adding "(Re)" to the beginning of these rules would be good too. Thanks! GoingBatty (talk) 02:55, 24 September 2010 (UTC)[reply]

Rules expanded for Re- prefix. Rjwilmsi 11:21, 24 September 2010 (UTC)[reply]

"consideres" could be either -ed or -s, we don't support options so choose the most likely one. Rjwilmsi 11:21, 24 September 2010 (UTC)[reply]

False positive

"Diary products" could be legitimate; I nearly committed this edit to "Dairy products" before I noticed. I was too scared to screw up the code to edit it; could someone who knows what they're doing, please? --John (talk) 06:48, 29 September 2010 (UTC)[reply]

Done here. I removed "diary product" but I added some other similar trailing words. Shadowjams (talk) 08:40, 29 September 2010 (UTC)[reply]

What does '"Diary products" could be legitimate' mean? Did you actually find it anywhere? It seems way beyond likely to me.--BillFlis (talk) 03:16, 30 September 2010 (UTC)[reply]

My initial instinct too. I found 2 examples of it (searching for the phrase finds the two... I don't remember them now). Frankly the typo seems more likely; I'd be fine with it added back (although I added some others too so don't remove those) Shadowjams (talk) 03:43, 30 September 2010 (UTC)[reply]

Actually, I just now found an instance of "diary products"! I corrected it to "personal organizers". I think the rule can now be safely restored.--BillFlis (talk) 07:14, 1 October 2010 (UTC)[reply]

womens' → women's'

For the article Guide to Life, AWB wants to convert womens' to women's'. Could someone please update the men's rule to fix this? Thanks! GoingBatty (talk) 21:44, 30 September 2010 (UTC)[reply]

There's another suggestion about the men/women rule at the top of the page, too -- John of Reading (talk) 06:04, 1 October 2010 (UTC)[reply]

Profiling heads up for you guys

Hi All, Thanks for the great work.

Little heads up for you. I was poking at AWB doing some profiling, and Regextypofix takes nearly a third of the time whilst processing an article. Most of this is match evaluation.

—Reedy 17:55, 3 October 2010 (UTC)[reply]

Is it possible for you to drill down deeper and see which or what kinds of regexes take the longest? Any ways we can optimize what's here from the rule-writing perspective? Shadowjams (talk) 21:25, 3 October 2010 (UTC)[reply]

Not exactly. MaxSem seems to think there was, but we'll have to dig it out. I imagine there are a lot of rules that won't ever get matched and are probably pointless to keep around. I need to do a new TypoScan dump, and if I do it with some extra stats, such as the word/the rule it matched, it might give us a better idea. We have a lot of regexes!! —Reedy 21:51, 3 October 2010 (UTC)[reply]

Yeah, it's huge. One-third is less than I would have guessed for the typo rules. There was a conversation (I think it's above) about whether using alternation (pipes) or character classes (brackets) was faster, since the latter is significantly faster in some implementations. For AWB it turns out the difference is small, but classes are slightly faster.
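For anyone who wants to try the alternation-versus-class comparison themselves, here is a minimal sketch. It uses Python's `re` engine rather than AWB's own, so the timings say nothing definite about AWB; the "grey/gray" patterns and the sample text are made up for illustration.

```python
import re
import timeit

# Two equivalent ways to match "gray" or "grey" (hypothetical example rule).
ALTERNATION = re.compile(r"\bgr(?:a|e)y\b")
CHAR_CLASS = re.compile(r"\bgr[ae]y\b")

# Made-up sample text, repeated to give the engine something to scan.
SAMPLE = "The gray cat sat on the grey mat near a great grid. " * 200


def time_pattern(pattern, repeat=50):
    """Return the seconds taken to scan SAMPLE `repeat` times."""
    return timeit.timeit(lambda: pattern.findall(SAMPLE), number=repeat)


if __name__ == "__main__":
    print("alternation:", time_pattern(ALTERNATION))
    print("char class: ", time_pattern(CHAR_CLASS))
```

Both patterns find the same matches, so any difference in the printed times is down to how the engine handles the two forms.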

While I'm interested in the optimization issues it's mostly academic; I don't personally find the speed right now a serious issue. Even on old hardware I don't have trouble working with anything in AWB. If anything the API for saving changes (gets are quick) is a larger slow-down. If I do large database dump scans that takes a while but even then it's not extraordinarily long, and it's easily batched which is probably a more long-term and cheaper solution (in terms of coding time) than optimizing everything. That's something I guess you ultimately get to decide, but just my two cents. Thanks for the info, let me know if I can help speed anything up. Shadowjams (talk) 22:59, 3 October 2010 (UTC)[reply]

1/3 of the time seems very good! Rich Farmbrough, 11:01, 7 October 2010 (UTC).[reply]

Re: which typo rules are the slowest. We have the 'profile typos' option to run on a particular page, but that is only for a particular page. We also have to be careful that just because a rule doesn't match any pages in a given database dump doesn't mean the rule is useless. Somebody may have fixed 20 typos using that rule the day before the dump. However, the last time I did profile typos on a page there were certain rules that were much slower than others, so we might achieve a reasonable performance improvement by focusing on a handful of rules. Still, I don't think current performance is a problem; the "1/3 of the time" Reedy mentions depends entirely on the page you run against. Rjwilmsi 11:13, 7 October 2010 (UTC)[reply]

I have posted the 50 slowest typo rules, based on profiling Tiger Woods. The number at the start is the time (I think this is probably the time in milliseconds to apply the typo 100 times or something), and then the regex of the rule is given. Note that the quickest typo has a time of 2, a typical value for the majority of the rules is around 50. Therefore some rules are 5 or 10 times slower than average. Rjwilmsi 11:31, 7 October 2010 (UTC)[reply]

Quick example on the 11th slowest: ($1nally): originally 0.87 seconds using Expresso for 10 iterations on Tiger Woods, using \b([A-Za-z]{2,}[a-mo-z])(?:nalyl|anlly)\b instead is 0.67 seconds. That's about 20% faster with no change to the rule's matching. Rjwilmsi 11:53, 7 October 2010 (UTC)[reply]
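As a quick check of what the faster pattern quoted above matches, here is a sketch in Python syntax (`$1` becomes `\1`). The original, slower form of the rule is not quoted in this thread, so only the replacement pattern is shown.

```python
import re

# The faster pattern quoted above, translated to Python syntax.
NALLY = re.compile(r"\b([A-Za-z]{2,}[a-mo-z])(?:nalyl|anlly)\b")


def fix_nally(text):
    """Rewrite the transposed endings "-nalyl" and "-anlly" as "-nally"."""
    return NALLY.sub(r"\1nally", text)


print(fix_nally("It was origianlly a personalyl chosen name."))
```

Correctly spelled words such as "originally" are left alone because they end in neither of the two transposed forms.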

A lot of these start with "\b(\w+)", which I think can be safely eliminated.--BillFlis (talk) 12:37, 7 October 2010 (UTC)[reply]

No, not quite true, we want to match the whole word so the edit summary shows whole words being corrected. Rjwilmsi 13:00, 7 October 2010 (UTC)[reply]

Converting \w to [A-Za-z] for performance improvement: that reduced typical typo time on Tiger Woods from average 7.7 seconds to average 6.9 seconds on my laptop, ~10% better. [A-Za-z] may be better as [a-z], I'll see about that. Rjwilmsi 13:36, 7 October 2010 (UTC)[reply]

I think \w covers [A-Za-z0-9_] and maybe (depending on the language) extended Latin/Cyrillic characters. Mitigating that though, in most cases those probably aren't intended. Shadowjams (talk) 16:18, 7 October 2010 (UTC)[reply]
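The difference between `\w` and `[A-Za-z]` can be seen in a small sketch. This uses Python's `re` module for illustration only; the engine AWB uses may treat `\w` differently in detail, so take it as an indication rather than a statement about AWB.

```python
import re

# In Python 3, \w on str patterns covers letters from any script, digits and "_".
WORD = re.compile(r"\w+")
ASCII_LETTERS = re.compile(r"[A-Za-z]+")

# Made-up sample: an accented Latin word, an identifier, and a Cyrillic word.
sample = "na\u00efve x_1 Москва"

print(WORD.findall(sample))
print(ASCII_LETTERS.findall(sample))
```

With `[A-Za-z]` the accented word is split at the accented letter and the Cyrillic word is not matched at all, which is the trade-off mentioned above.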

2007 Brazilian Grand Prix

Oposta => Opposta wrongly. Rich Farmbrough, 11:01, 7 October 2010 (UTC).[reply]

Marking sections so AWB doesn't search for typos?

Is there a way to mark sections of articles that are in foreign languages (e.g. Middle Scots#Sample text) so that AWB won't search them for typos? Thanks! GoingBatty (talk) 00:32, 10 October 2010 (UTC)[reply]

Yes. You can enclose them in the language template, like this:

{{lang|es|Mi gato se llama Rebecca.}}

That comes out like this:

Mi gato se llama Rebecca.

It doesn't make the text look different in the article, but AWB doesn't flag typos inside it. --Auntof6 (talk) 03:56, 10 October 2010 (UTC)[reply]

Perfect - thanks! GoingBatty (talk) 04:06, 10 October 2010 (UTC)[reply]

Inocentes → Innocentes

[1] Doesn't AWB usually not run typo fixing within quotes? –xeno^talk 20:50, 12 October 2010 (UTC)[reply]

That's within italics, not quotes, and we've only had hiding of text in italics since rev 7042. Rjwilmsi 21:15, 12 October 2010 (UTC)[reply]

My bad, looked like quotes in the diff view. –xeno^talk 21:21, 12 October 2010 (UTC)[reply]

Please see es:Día de los Santos Inocentes and wikt:inocente. The Spanish word inocente (inocentes in the plural) (meaning "innocent") has only one n before the o.

—Wavelength (talk) 00:41, 7 November 2010 (UTC)[reply]

Edit summary incorrect when two different sets of duplicated words fixed

In this edit, AWB changed "be be" to "be" and "with with" to "with", but the edit summary automatically created was "typos fixed: be be → be (2)"

Yes, when the same typography rule makes more than one fix, the effect of the rule is summarised as you describe. Imagine how long this edit summary would have been if it hadn't done this. -- John of Reading (talk) 10:38, 14 October 2010 (UTC)[reply]

John's explanation is correct, though his example uses AWB find & replace rather than typo fixing, but both do the same edit summary condensing he's explained. Rjwilmsi 11:05, 14 October 2010 (UTC)[reply]
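The condensing described above can be imitated with a short sketch. This is not AWB's code; the rule (covering only three words) and the summary format are assumptions chosen to reproduce the summary quoted in this thread.

```python
import re

# Hypothetical duplicate-word rule covering a few words only.
DUPLICATE = re.compile(r"\b(be|with|the)\s+\1\b")


def fix_and_summarise(text):
    """Apply the rule; label the summary with the first fix plus a count."""
    matches = list(DUPLICATE.finditer(text))
    fixed = DUPLICATE.sub(r"\1", text)
    if not matches:
        return fixed, ""
    first = matches[0]
    summary = f"typos fixed: {first.group(0)} → {first.group(1)}"
    if len(matches) > 1:
        # All fixes made by the same rule are folded into one entry.
        summary += f" ({len(matches)})"
    return fixed, summary


print(fix_and_summarise("It will be be used with with care."))
```

Because both fixes come from one rule, only the first ("be be → be") appears in the summary, with the count in brackets.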

Philippino and variants

Please add the following:

Philippino --> Filipino
Philippinos --> Filipinos
Philippinoes --> Filipinos
Philippina --> Filipina
Philippinas --> Filipino
Filipinoes --> Filipinos

I don't know if there's one out there, in case there aren't please add them. Thanks.--JL 09 ^_{q?_c} 08:11, 16 October 2010 (UTC)[reply]

Why does "Philippinas" not convert to "Filipinas"? --Auntof6 (talk) 12:12, 16 October 2010 (UTC)[reply]

It doesn't "convert" because there is no rule for it here. "Philippina" is a word. E.g., 631 Philippina.--BillFlis (talk) 14:44, 16 October 2010 (UTC)[reply]

Done #1-3 here

Not done #4-5 per comment above

Will let someone else do #6 to ensure rule isn't expanded to "fix" correct spellings too. GoingBatty (talk) 16:39, 16 October 2010 (UTC)[reply]

Done #6 here. -- JHunterJ (talk) 20:31, 22 October 2010 (UTC)[reply]

Could the rule be expanded to cover double and single Ls and Ps? McLerristarr | Mclay1 14:03, 26 October 2010 (UTC)[reply]

Sorry, but I don't understand your request. Could you please specify the exact misspellings that you want to be identified and fixed? Thanks! GoingBatty (talk) 02:16, 27 October 2010 (UTC)[reply]

Possible State capitalization issue

I have had a few pages lately where AWB is trying to capitalize states that are within a web address and I don't think we want to do that. Here is one example. --Kumioko (talk) 19:52, 22 October 2010 (UTC)[reply]

It looks like AWB properly ignored the web address (the part in the brackets that uses the http:// prefix) and only tried to fix the unfortunately worded description of the web address (not in brackets, with no http:// prefix). -- JHunterJ (talk) 20:19, 22 October 2010 (UTC)[reply]

Plurals of SI units

Could the typo facility be used without false positives to change 'kms' and 'kgs' to 'km' and 'kg'? Lightmouse (talk) 18:49, 25 October 2010 (UTC)[reply]

Please look at this code change:

<Typo word="kg/km (kilogram/kilometer)" find="([\d\.]+(?:\s| |-)?)K(g|m)\b" replace="$1k$2" />

to:

<Typo word="kg/km (kilogram/kilometer)" find="([\d\.]+(?:\s| |-)?)K(g|m)s\b" replace="$1k$2" />

Would that work? Lightmouse (talk) 23:20, 25 October 2010 (UTC)[reply]

Neither of them seem to work for me in the AWB Regex Tester. In particular, although you want to change "kms" and "kgs" (which contain lower case "k"), the regex only has an uppercase "K". GoingBatty (talk) 02:29, 26 October 2010 (UTC)[reply]

Good call, thanks. Let me add lower case 'k' as an option:

<Typo word="kg/km (kilogram/kilometer)" find="([\d\.]+(?:\s| |-)?)[Kk](g|m)s\b" replace="$1k$2" />

How about that?

I tried the AWB Regex Tester again using your Find and Replace on the text "Kgs and Kms and kgs and kms and kg and km", and it didn't find anything to replace. Hopefully one of the experts can give you a hand with this. Good luck! GoingBatty (talk) 02:20, 27 October 2010 (UTC)[reply]

Ah, I see the error of my ways - the rule is set up to look for a number before the symbol. GoingBatty (talk) 17:26, 27 October 2010 (UTC)[reply]

If you change that, it will no longer correct "Km" or "Kg", which was the intent of the rule.--BillFlis (talk) 10:51, 27 October 2010 (UTC)[reply]

Are you sure BillFlis? It works for me. Lightmouse (talk) 14:27, 27 October 2010 (UTC)[reply]

The way the proposed rule is written above, it's looking for a terminal "s", as in "Kgs" or "kms".--BillFlis (talk) 17:01, 27 October 2010 (UTC)[reply]

Ah yes! I thought you were saying it wouldn't find an upper case 'K'. Thanks for being patient with me. How about:

<Typo word="kg/km (kilogram/kilometer)" find="([\d\.]+(?:\s| |-)?)[Kk](g|m)s?\b" replace="$1k$2" />

That's also going to result in false positives where it tries to fix km and kg. Since we want to fix kms, Kms, Km, kgs, Kgs, Kg - but not km or kg - how about splitting this into two rules:

<Typo word="kg (kilogram)" find="([\d\.]+(?:\s| |-)?)(Kgs?|kgs)\b" replace="$1kg" />
<Typo word="km (kilometre)" find="([\d\.]+(?:\s| |-)?)(Kms?|kms)\b" replace="$1km" /> GoingBatty (talk) 17:26, 27 October 2010 (UTC)[reply]

The one line version should be faster than the two line version. Yes, it does over-write 'km' with 'km' but it has to parse the text anyway and the outcome is unchanged. Lightmouse (talk) 17:44, 27 October 2010 (UTC)[reply]

How about one rule: <Typo word="kg/km (kilogram/kilometre)" find="([\d\.]+(?:\s| |-)?)(?:K([gm])s?|[Kk]([gm])s)\b" replace="$1k$2$3" /> Could someone please test this? If two rules are necessary, I'd suggest that one handle the capital "K" error, and the other the terminal "s" error.--BillFlis (talk) 19:09, 27 October 2010 (UTC)[reply]

It works for me, Bill. I used the regex tester on:

"foo 5 Kg, 6 Kgs, 7 kgs, 8 Km, 9 Kms, 10 kms bar"

and it produced:

"foo 5 kg, 6 kg, 7 kg, 8 km, 9 km, 10 km bar"

Thanks. Lightmouse (talk) 19:15, 27 October 2010 (UTC)[reply]
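For anyone without the AWB Regex Tester to hand, the same check can be repeated with a sketch in Python. Two assumptions: the blank alternative in the separator group as rendered above is dropped (Python's `\s` already covers ordinary and non-breaking spaces), and `$1k$2$3` is written as `\1k\2\3`.

```python
import re

# Combined rule from above: capital "K" with optional "s", or either case with "s".
KG_KM = re.compile(r"([\d.]+(?:\s|-)?)(?:K([gm])s?|[Kk]([gm])s)\b")


def fix_kg_km(text):
    # Only one of groups 2 and 3 takes part in a match; Python 3.5+ treats
    # the other as an empty string in the replacement.
    return KG_KM.sub(r"\1k\2\3", text)


print(fix_kg_km("foo 5 Kg, 6 Kgs, 7 kgs, 8 Km, 9 Kms, 10 kms bar"))
```

Already-correct text such as "11 km" is not matched at all, so it would not show up as a no-op fix.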

I've made the change to the rule. Also, modified the watt rule to correct also "kw" and removed the now-redundant kilowatt rule.--BillFlis (talk) 11:42, 28 October 2010 (UTC)[reply]

Thanks. Lightmouse (talk) 11:51, 28 October 2010 (UTC)[reply]

SI unit spelling: 'gramme' -> 'gram' and 'kilogramme' -> 'kilogram'

I'm trying to add a typo for 'kilogramme' -> 'kilogram'. I think the code is:

<Typo word="kilogram" find="\b([Kk]ilog|[Gg])ramme(s?)\b" replace="$1ram$2" />

Is that correct? Lightmouse (talk) 23:18, 25 October 2010 (UTC)[reply]

This works for me in the AWB Regex Tester - thanks! GoingBatty (talk) 02:31, 26 October 2010 (UTC)[reply]
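A sketch of the same check in Python, with `$1ram$2` written as `\1ram\2`. It only tests what the pattern matches, not whether the replacement is desirable.

```python
import re

# Proposed rule from above, translated to Python syntax.
GRAMME = re.compile(r"\b([Kk]ilog|[Gg])ramme(s?)\b")


def fix_gramme(text):
    return GRAMME.sub(r"\1ram\2", text)


print(fix_gramme("2 kilogrammes, 500 grammes, one Gramme, a programme"))
```

The leading `\b` is what keeps "programme" from being touched, since there is no word boundary before its "g".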

Adding these to the typo rules would be against WP:ENGVAR. Rjwilmsi 07:21, 26 October 2010 (UTC)[reply]

"Gramme" is rarely used in British English. It's an old spelling. But people must also note that the SI spelling of "meter" is "metre" so just basing spelling on SI is not OK. McLerristarr | Mclay1 13:53, 26 October 2010 (UTC)[reply]

Quite. I'm just referring to the SI unit of mass. wp:engvar says "Wikipedia tries to find words that are common to all varieties of English." There is an occasionally quoted misconception that British spelling requires 'kilogramme'. The spelling 'kilogramme' merely has the status of an old alternative. Since metrication started in the 1970s, the spelling 'kilogram' started to be adopted and is now the default.

The spelling 'kilogram' has been used in legislation for the last 25 years (e.g. Weights and Measures Act 1985). It's the spelling taught by the Department of Education and in style guides:

Economist: "kilogram or kilo (not kilogramme)"
The Times (UK): "kilogram (not kilogramme)"
The Guardian: "kilogram/s"
University of York: "kilogram not kilogramme"
National Health Service (NHS): "kilogram – not kilogramme"
[http://be.macmillan.org.uk/AboutOurBrand/Macmillanstyleguide.doc Macmillan Cancer Support]: "gram not gramme; hence kilogram"
Birmingham (UK) Grid for Learning "kilogram" not "kilogramme"

If there's any doubt, it would be simple enough to raise it in several forums but it seems clear cut to me. Regards Lightmouse (talk) 14:32, 26 October 2010 (UTC)[reply]

Is this in WP:MOSNUM? Rjwilmsi 14:38, 26 October 2010 (UTC)[reply]

Wikipedia:Manual of Style (spelling) says "gramme vs gram: gram is the more common spelling; gramme is also possible in British usage." Lightmouse (talk) 14:47, 26 October 2010 (UTC)[reply]

I would interpret that to mean that the typo rules shouldn't change it then. Rjwilmsi 14:52, 26 October 2010 (UTC)[reply]

OK. Thanks. Lightmouse (talk) 15:06, 26 October 2010 (UTC)[reply]

Excess code in "SI unit symbols"

All of the code in SI unit symbols seems excessive to me. For example, the code that will turn '100 kw' into '100 kW' is:

find="([\d\.]+(?:\s| |-)?)kw\b" replace="$1kW" />

It looks for a digit string. But I think it could be simplified by looking only for the last digit in the string. Thus:

find="(\d(?:\s| |-)?)kw\b" replace="$1kW" />

As far as I can see, that would give the same hit rate and the same false positive rate. The same applies across all 14 SI units. Am I correct? Lightmouse (talk) 14:44, 26 October 2010 (UTC)[reply]

Looks OK to me (unless someone writes "25. kw", which is a different error), but I would change the "?" to "*" to catch multiple spaces:

find="(\d(?:\s| |-)*)kw\b" replace="$1kW" /> --BillFlis (talk) 14:52, 26 October 2010 (UTC)[reply]

We match the entire number so that the edit summary shows the entire unit to make it easier for editors to understand the change. Rjwilmsi 14:53, 26 October 2010 (UTC)[reply]

Ah, good point. I wasn't aware of that. I thought the speed of the code was the deciding factor. Lightmouse (talk) 14:57, 26 October 2010 (UTC)[reply]

Possible duplicate in "SI unit symbols"

It seems to me that the line for kilowatt could be eliminated by changing:

<Typo word="W (watt)" find="([\d\.]+(?:\s| |-)?)([µmMGT])w\b" replace="$1$2W" />

to

<Typo word="W (watt)" find="([\d\.]+(?:\s| |-)?)([µmkMGT])w\b" replace="$1$2W" />

Have I missed something? Lightmouse (talk) 15:05, 26 October 2010 (UTC)[reply]
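A quick sketch of the merged watt rule in Python, again dropping the blank alternative in the separator group as an assumption:

```python
import re

# Watt rule with "k" added to the prefix class, as proposed above.
WATT = re.compile(r"([\d.]+(?:\s|-)?)([µmkMGT])w\b")


def fix_watt(text):
    return WATT.sub(r"\1\2W", text)


print(fix_watt("100 kw, 5 Mw and 3 mw"))
```

The prefix letter is copied through unchanged, so the rule fixes only the case of the "w" and does not try to decide between "m" and "M".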

Duplicate words section

Since "It is" has its own entry in the Duplicate words section to fix "it it" and "is is", should the specific Duplicate words entry be tightened so it doesn't also look for "it it" and "is is"? GoingBatty (talk) 03:46, 30 October 2010 (UTC)[reply]

km² rule

Two questions about the km² rule:

Could someone please expand it so it also fixes "km2" (without the superscript)?
Speaking of superscript, why is the replacement "km<sup>2</sup>" instead of "km²"? GoingBatty (talk) 06:08, 30 October 2010 (UTC)[reply]

For the same reasons people still use HTML &ndash; instead of the UTF-8 character (which they can get from the little tool strip below the edit window): tradition, recalcitrance, personal preference, obstinacy, obtuseness, drunkenness.--BillFlis (talk) 07:58, 30 October 2010 (UTC)[reply]

We use the tags because it's in the MOS. Rjwilmsi 21:01, 31 October 2010 (UTC)[reply]

Thanks for the feedback. So could someone expand the rule so it fixes both "km2" and "km²" (without superscript tags)? GoingBatty (talk) 01:40, 1 November 2010 (UTC)[reply]

I would point out that I have my own personal convert template regex rule, and I think there's a bot going around doing similar things. While both mine and the bot's rules could fix all versions, I currently don't and I don't know what the bot does. It pays to have some standardization... but I'm not hell bent to change the MOS rules for something like this. Shadowjams (talk) 06:26, 4 November 2010 (UTC)[reply]

Ha, there's a bit of a disconnect here somewhere. If on the "Insert" pull-down menu below the "Save page" button you select "Symbols", it makes available both "m²" and "m³" (with the Unicode exponents, not the markup).--BillFlis (talk) 12:04, 4 November 2010 (UTC)[reply]

in in

This is a recent addition; I've only seen it produce false positives so far. There are many phrases ending in "in", such as "bring in", "buy in", "carry in" and so on, which can legally be followed by another phrase that starts with "in", such as "in many cases", "in 2007", and so on. -- John of Reading (talk) 08:21, 31 October 2010 (UTC)[reply]

Hi John - I'm the one who made the addition based on the typo corrected in this edit. Could you please give an example of a grammatically correct sentence that contains "in in"? Thanks! GoingBatty (talk) 14:56, 31 October 2010 (UTC)[reply]

I've just done an AWB Google search for "in in". The rule made no correct changes, and was going to damage these:

Betting (poker) - a player may go all in in exactly the same manner...
Cheating in poker - unless this exceeds the maximum buy-in in which case the player...
List of active drive-in theaters - projection screen of Route 66 Drive-in in Carthage...
Winter of 1962–1963 in the United Kingdom - The thaw set in in early March

A search for "in in early" found a roughly even mixture of correct and incorrect fixes. I didn't save anything, so you can try it yourself. -- John of Reading (talk) 20:50, 31 October 2010 (UTC)[reply]

I recently corrected an "in in" error by an experienced and usually careful AWB user. I added an extraneous comma to prevent it from happening again. MANdARAX • XAЯAbИAM 17:08, 2 November 2010 (UTC)[reply]

Based on John's feedback, I updated the rule here so it looks for a space before the duplicated word, so it won't catch "buy-in in" or "Drive-in in" anymore. GoingBatty (talk) 02:33, 3 November 2010 (UTC)[reply]

"I let the dog in in the morning." Two in's is the same situation as two on's. There's no way of getting around it. The typo fixer cannot possibly correct every typo so copyediting still needs to be done regularly. This is another typo that will have to be found the traditional way. McLerristarr | Mclay1 06:42, 3 November 2010 (UTC)[reply]

I think it's simply too complicated of a grammatical issue to handle with the typo rules. I'd note that there's absolutely nothing stopping anyone from using their own rules in AWB to identify common types of duplicate words (pretty much pronouns and prepositions), or just identifying duplicate words in any case (this should do it \b(\w+)\b\1\b) and using human judgment to fix them. This is probably better used for words that don't have this error. I don't have enough grammar knowledge to be confident about which words those are, but the usual "the the" examples are a good place to start. Shadowjams (talk) 06:24, 4 November 2010 (UTC)[reply]

Based on the discussion, I've reverted my change here. However, I disagree that "There's no way of getting around it."

"a player may go all in in exactly the same manner" → "a player may go all in exactly the same way"
"The thaw set in in early March." → "The thaw set in early March"
"I let the dog in in the morning." → "I let the dog inside in the morning." GoingBatty (talk) 17:41, 4 November 2010 (UTC)[reply]

Thanks for the regex suggestion, Shadowjams, but that didn't work for me. While \b(\w+)\s\1\b did work, I found that \s(\w+)\s\1\s helps to avoid the "buy-in in" examples above. GoingBatty (talk)

Even better is \s([a-z]+)\s\1\s to limit it to lowercase words (e.g. avoid fixing Bora Bora) GoingBatty (talk) 02:58, 5 November 2010 (UTC)[reply]
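The four patterns mentioned in this thread can be compared side by side with a sketch like the one below. It is written for Python's `re` module, and the dictionary keys are labels made up for this example.

```python
import re

# Patterns quoted in the discussion above; the keys are illustrative labels.
CANDIDATES = {
    "no_space": r"\b(\w+)\b\1\b",         # first suggestion: no whitespace between the words
    "word_boundary": r"\b(\w+)\s\1\b",    # works, but also flags "buy-in in"
    "space_delimited": r"\s(\w+)\s\1\s",  # skips "buy-in in", still flags "Bora Bora"
    "lowercase_only": r"\s([a-z]+)\s\1\s",
}


def flags(name, text):
    """Return True if the named pattern finds a duplicated word in text."""
    return re.search(CANDIDATES[name], text) is not None


for name in CANDIDATES:
    print(name,
          flags(name, "saw the the cat"),
          flags(name, "the maximum buy-in in which case"),
          flags(name, "flew to Bora Bora today"))
```

The first pattern can never match, because it asks for the repeated word to start exactly where the first one ends at a word boundary. The space-delimited forms also miss duplicates at the very start or end of the text, which is a side effect of requiring whitespace on both sides.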

As well as avoiding "buy-in in" it could avoid "buy buy-in". I know that's not a good example but I can't think of a real one right now. McLerristarr | Mclay1 08:15, 5 November 2010 (UTC)[reply]

GoingBatty, your examples do not really avoid the problem because the typo fixer cannot possibly know what the change should be. McLerristarr | Mclay1 08:17, 5 November 2010 (UTC)[reply]

As a postscript I've tackled "in in" using a variety of Google searches ("in in 1857", "born in in", and so on) and a long regexp to skip most of the false positives; 450 fixes from around 2000 candidates. There will be many others that I've missed, I'm sure. -- John of Reading (talk) 19:51, 6 November 2010 (UTC)[reply]

Great job, John! I've done quite a few too (but not as many as you!) GoingBatty (talk) 23:38, 6 November 2010 (UTC)[reply]

Exactly the same

Please expand the "exactly the same" rule:

this exact same → exactly the same
that exact same → exactly the same
those exact same → exactly the same

Thank you. McLerristarr | Mclay1 16:05, 2 November 2010 (UTC)[reply]

Done here GoingBatty (talk) 16:45, 2 November 2010 (UTC)[reply]

sq.kms → sq.km → km²

Typo fixing will change "sq.kms" to "sq.km" on the first parse, and then change to "km<sup>2</sup>" in the second parse. (Try Pakhal Lake.) What's the best way to combine the SI unit symbols so this all happens in one parse? GoingBatty (talk) 16:27, 6 November 2010 (UTC)[reply]

continguous → contiguous

The extra n in continguous is an error sometimes seen in the phrase "contiguous United States".

([Cc])ontinguous → $1ontiguous
([Cc])ontinguity → $1ontiguity

Continguity appears rarely. I haven't found continguously and continguousness so those might not be worth the trouble. —Mrwojo (talk) 19:14, 6 November 2010 (UTC)[reply]

Done here to cover all of these. GoingBatty (talk) 23:33, 6 November 2010 (UTC)[reply]
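One way a single rule could cover all four forms is sketched below in Python. The rule actually added is not quoted in this thread, so the pattern here is a hypothetical illustration, not a copy of it.

```python
import re

# Hypothetical combined rule: continguous(ly|ness) and continguity.
CONTINGUOUS = re.compile(r"\b([Cc])ontingu(ous(?:ly|ness)?|ity)\b")


def fix_continguous(text):
    return CONTINGUOUS.sub(r"\1ontigu\2", text)


print(fix_continguous("the continguous United States"))
```

Capturing the first letter keeps the original capitalisation, as in the two rules proposed above.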

Other duplicated words

Before starting another controversy, does anyone object to expanding the Duplicated words entry to fix "had had" and "that that"? GoingBatty (talk) 00:21, 7 November 2010 (UTC)[reply]

"had had" definitely is not acceptable in the typo list; for sentences like "He had had the apple," that would change the meaning. PleaseStand ^(talk) 00:37, 7 November 2010 (UTC)[reply]

Thanks for the example. Sorry for being dense, but what's the difference between "He had the apple" and "He had had the apple" ? GoingBatty (talk) 00:41, 7 November 2010 (UTC)[reply]

The second is used to refer to an action that happened before another (had something before another thing happened), as in "He had had a drinking problem, so he attended an AA meeting." The typo fixer shouldn't change something that is completely correct. PleaseStand ^(talk) 01:23, 7 November 2010 (UTC)[reply]

I agree that the typo fixer shouldn't change something that is completely correct. So does your example mean "He had a drinking problem, so he attended an AA meeting, and he no longer has a drinking problem." ? Thanks! GoingBatty (talk) 01:40, 7 November 2010 (UTC)[reply]

Found two more: "more more" and "other other" GoingBatty (talk) 01:40, 7 November 2010 (UTC)[reply]

For "had had" see Pluperfect or the splendid article James while John had had had had had had had had had had had a better effect on the teacher; for "that that" consider the sentences "He said that that man was the impostor" or "Not that that made any difference". Please don't add either of these to the automatic list.

"more more" and "other other" look OK to me, though "more more" will run into some false positives with song and TV program titles. (Comment revised after I saw the error in my test regexp) -- John of Reading (talk) 07:52, 7 November 2010 (UTC)[reply]

Thank you for the links. I definitely won't add "had had" or "that that". I hope that the song and TV program titles would be "More More" instead of "more more". GoingBatty (talk) 15:46, 7 November 2010 (UTC)[reply]

Done here so the typo fixer now fixes "more more", "other other" and "become become". GoingBatty (talk) 23:31, 7 November 2010 (UTC)[reply]

Does the typo fixer remove duplicate words in different casings (e.g. other Other)? I don't think it should because the capitalised word could be part of a proper name, making the duplication completely correct. McLerristarr | Mclay1 01:04, 9 November 2010 (UTC)[reply]

No, this rule has been written to match lowercase text only. -- John of Reading (talk) 07:16, 9 November 2010 (UTC)[reply]

There are many other duplicates, though; obviously lupus lupus, bubo bubo etc. are legitimate. The top entries as of the last dump are:

solid 17216 "!style="border-style: none none solid solid;"
the 16219
that 15967
new 8773
history 7648
had 7008
in 6213
is 3285
sortable 3155 (table?)
to 3121
edit 2988 (?)
blah 2690
etc 2610 (etc etc should be just etc.)
very 2393 (very very is bad style)
and 2057
on 2050
many 1871 (bad style)
it 1832
of 1672

Full list at User:Rich Farmbrough/temp113. Rich Farmbrough, 16:04, 10 November 2010 (UTC).[reply]

Thank you for generating that list - interesting. Why is "history history" so frequent? There are examples at History of Manila and Surviving History, which have [http://www.somewhere.com/history History of Something], but I'm surprised at the 7648 figure. -- John of Reading (talk) 17:40, 10 November 2010 (UTC)[reply]

Thanks indeed. The above includes uppercase instances, Rich? --LilHelpa (talk) 17:46, 10 November 2010 (UTC)[reply]

Is your list across all namespaces? I think the primary concern should be the article namespace. If someone wants to type "very very" or "blah blah blah" on a talk page, that isn't something we should be correcting. GoingBatty (talk) 17:59, 10 November 2010 (UTC)[reply]

Cool list! For comparison, the typo rule is currently fixing the following duplicates: a, am, an, as, at, and, are, become, be, by, could, did, do, for, go, has, he, if, is, it, me, more, no, of, or, other, she, should, the, their, them, then, these, they, this, thus, to, was, were, what, where, when, which, who, whom, why, with, would. GoingBatty (talk) 17:56, 10 November 2010 (UTC)[reply]
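A word-list rule of this kind can be sketched as follows in Python. The real rule's regex is not quoted here, so the construction is an assumption, and only a subset of the listed words is used.

```python
import re

# Subset of the words listed above; "had" and "that" are deliberately absent.
WORDS = ["a", "an", "and", "is", "it", "more", "of", "other", "the", "to", "with"]

# No IGNORECASE flag, so only all-lowercase duplicates are matched.
DUPLICATE = re.compile(r"\b(" + "|".join(WORDS) + r")\s+\1\b")


def fix_duplicates(text):
    return DUPLICATE.sub(r"\1", text)


print(fix_duplicates("in the the house"))
print(fix_duplicates("He had had the apple"))
print(fix_duplicates("the other Other side"))
```

Keeping the list explicit means a word with known false positives can be dropped from it without touching the rest of the rule.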

"her", "him", "how" and "its" seem to fit amongst those words. Could they be added? McLerristarr | Mclay1 07:03, 11 November 2010 (UTC)[reply]

"have", "shall", "should", ~~"will"~~... There are many words that are unlikely to have false positives. McLerristarr | Mclay1 07:05, 11 November 2010 (UTC)[reply]

Actually, "will" has two meanings so that one is out. McLerristarr | Mclay1 07:07, 11 November 2010 (UTC)[reply]

Done here, except for "shall" (not on Rich's list) and "should" (already part of typo rule) GoingBatty (talk) 01:30, 12 November 2010 (UTC)[reply]

Removed "her her" from list, as there were too many false positives (e.g. "It cost her her life") GoingBatty (talk) 05:12, 12 November 2010 (UTC)[reply]

This rule is getting very long - any speed benefit in breaking it into two rules vs. keeping it as one long rule? GoingBatty (talk) 01:41, 12 November 2010 (UTC)[reply]

What's more more problems are caused by including “more more” than omitting it! Please can we remove “more more”? — Hebrides (talk) 08:56, 22 November 2010 (UTC)[reply]

"What's more" should be followed by a comma. That's a problem with a lot of these rules; they would be correct if they were separated by a comma. McLerristarr | Mclay1 10:40, 22 November 2010 (UTC)[reply]

Pronomial

Is valid, as is pronominal that AWB wants to change it to. Rich Farmbrough, 04:45, 10 November 2010 (UTC).[reply]

Done here GoingBatty (talk) 01:04, 11 November 2010 (UTC) [reply]

Look up pronomial in Wiktionary, the free dictionary.

Also binominal. Rich Farmbrough, 16:07, 10 November 2010 (UTC).[reply]
- Done here - per dictionary.com GoingBatty (talk) 01:04, 11 November 2010 (UTC)[reply]

Rule didn't change "european" → "European"

In this edit, AWB fixed several typos, but did not change "european" to "European". The "Eur(asia/ope)" looks like it should do it, but didn't. GoingBatty (talk) 02:59, 15 November 2010 (UTC)[reply]

I think the automatic typo fixes are all turned off inside wikilinks. The only kind of fix that wouldn't break the link is this one, changing the case of the initial letter. -- John of Reading (talk) 07:56, 15 November 2010 (UTC)[reply]

You're right - I wouldn't expect AWB to change [[european individualist anarchism]]. However, since AWB changed "And so an european tendency..." to "And so a european tendency...", I expected it to change to "And so a European tendency..." GoingBatty (talk) 13:40, 15 November 2010 (UTC)[reply]

I found this in the manual - "If a typo rule is matching a wikilink target, this rule will be ignored on the whole page". So on that page, only, AWB thinks that "european" is allowable. -- John of Reading (talk) 14:12, 15 November 2010 (UTC)[reply]

Aha - that explains it! I tried to RTFM before posting this question, but looked in the wrong place. Could this sentence be added to the appropriate place on WP:AWB/T ? Thanks! GoingBatty (talk) 17:25, 15 November 2010 (UTC)[reply]

Done -- John of Reading (talk) 17:37, 15 November 2010 (UTC)[reply]

It would be really cool if we had some data from these in-link matches. Rich Farmbrough, 04:21, 17 November 2010 (UTC).[reply]

Interestingly Creedence at Woodstock Festival does not seem immune. Rich Farmbrough, 12:15, 17 November 2010 (UTC).[reply]

Time for someone to look at the source code... -- John of Reading (talk) 12:26, 17 November 2010 (UTC)[reply]

Not really. The logic works as described. On Woodstock Festival none of the "Creedence Clearwater..." wikilinks match the "Credence" typo rule, so it is applied. Rjwilmsi 17:53, 17 November 2010 (UTC)[reply]

Yes, my mistake. -- John of Reading (talk) 18:34, 17 November 2010 (UTC)[reply]
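The behaviour quoted from the manual can be sketched roughly as follows. This is not AWB's source code: it is a simplified Python illustration, the function name `apply_rule` and the two example rules are assumptions made for the demonstration, and unlike AWB it does not protect the text inside wikilinks when a rule is applied.

```python
import re

# Captures the target part of [[target]] or [[target|label]].
WIKILINK_TARGET = re.compile(r"\[\[([^\]|]+)")


def apply_rule(page: str, find: str, replace: str) -> str:
    """Apply one typo rule unless it matches any wikilink target on the page."""
    rule = re.compile(find)
    targets = WIKILINK_TARGET.findall(page)
    if any(rule.search(target) for target in targets):
        return page  # the rule is ignored on the whole page
    return rule.sub(replace, page)


linked = "See [[european individualist anarchism]]. And so a european tendency arose."
print(apply_rule(linked, r"\beuropean\b", "European"))

woodstock = "[[Creedence Clearwater Revival]] played. Credence played again."
print(apply_rule(woodstock, r"\bCredence\b", "Creedence"))
```

In the first page the link target matches the rule, so "european" is left alone everywhere. In the second, no link target matches "Credence", so the rule is applied, which is the Woodstock Festival case Rjwilmsi describes.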

Pre-Columbian

Not Pre-Colombian. Rich Farmbrough, 04:20, 17 November 2010 (UTC).[reply]

Not sure what you're asking for here. There's already a rule set up to change "Pre-Colombian" to "Pre-Columbian". Are you saying this rule isn't working, or are you suggesting this rule be disabled, or something else? GoingBatty (talk) 04:26, 17 November 2010 (UTC)[reply]

My mistake. I was skipping the change on Columbia - reading the warning, not the diff. Rich Farmbrough, 10:04, 17 November 2010 (UTC).[reply]

Etc.…

OK I'm finding a lot of these, in variations; "etc. ..." etc. I will try and fix as many as possible but looks like a candidate for a typo rule. Rich Farmbrough, 10:04, 17 November 2010 (UTC).[reply]

Do you mean as in a proper etc. and then trailing periods (with or maybe without a space)? The current etc. rule has a kind of complicated negative lookback, so it's probably easier to just make a new rule for properly spaced etc.'s that have that feature. Test this:

Find: ([Ee])tc\.(\s)*\.*([Ee]tc\.?\s*\.*)*

Replace: $1tc.$2

I haven't tested it, that's a first draft attempt though. Shadowjams (talk) 11:00, 17 November 2010 (UTC)[reply]
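A quick way to try the draft above outside AWB is a few lines of Python. This is only a sketch: it assumes Python's `re` module treats the pattern the same way AWB's regex engine does, `$1`/`$2` are written as `\1`/`\2`, and `collapse_etc` is a made-up name.

```python
import re

# The draft Find pattern from the comment above, unchanged.
DRAFT = re.compile(r"([Ee])tc\.(\s)*\.*([Ee]tc\.?\s*\.*)*")


def collapse_etc(text: str) -> str:
    """Apply the draft rule; AWB's "$1tc.$2" is written here as r"\\1tc.\\2"."""
    # Group 2 may not take part in the match; Python 3.5+ substitutes "" for it.
    return DRAFT.sub(r"\1tc.\2", text)


for sample in ["etc....", "Etc. ...", "etc. etc."]:
    print(repr(collapse_etc(sample)))
```

Under these assumptions the trailing dots and the repeated "etc." are removed, but because `$2` puts back the last whitespace character matched, the second and third samples end with a space where the removed text used to be.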

Changing "etc" to "etc." is often not helpful. The added period can be read as a full stop, so converting it with AWB is not an automatic process. How about instead converting "etc" to the full wording "etcetera", or otherwise not converting at all? Regards, SunCreator ^(talk) 11:08, 17 November 2010 (UTC)[reply]

I'm not sure the distinction between a full stop and a period... they're effectively the same thing... and I don't understand the issue with the change unless you prefer "etc" remains instead of becoming "etc." If you have an example of where the rule's making a mistake, please provide the diff. The manual of style, however, has long considered the "etc." version correct, as has every other style guide I've ever seen outside of Wikipedia. Shadowjams (talk) 11:45, 17 November 2010 (UTC)[reply]

Period and full stop are the same; I was attempting to show the difference between the dot at the end of "etc." and the dot ending a sentence with "etc.". They look the same, and so it's an issue. Here is a made-up example.

"During the succession of lead singers of Jones, Tomson, Harry, Dickson etc Smith's vocals had always been distinguishable."

Now if you change "etc" to "etc." you end up with two sentences. "During the succession of lead singers of Jones, Tomson, Harry, Dickson etc. Smith's vocals had always been distinguishable"

A better way would be to change "etc" to "etc.," to keep the sentence going. Splitting the sentence into two by "etc." is grammatically messy at best. Regards, SunCreator ^(talk) 23:24, 17 November 2010 (UTC)[reply]

We can't possibly account for mistakes. That sentence should be "During the succession of lead singers of Jones, Tomson, Harry, Dickson etc., Smith's vocals had always been distinguishable". If the comma has been omitted, that's not our problem. McLerristarr | Mclay1 06:23, 18 November 2010 (UTC)[reply]

You make a good point. Regards, SunCreator ^(talk) 02:42, 19 November 2010 (UTC)[reply]

It's not automatic, but it is complicated. I'm currently using 4 rules

<Typo word="<enter a name>" find="etc\s*.\s*…" replace="etc." />
<Typo word="<enter a name>" find="etc\s*\.\.\.\." replace="etc." />
<Typo word="<enter a name>" find="etc\s*\.\.\." replace="etc." />
<Typo word="<enter a name>" find="etc\. +([A-Z])" replace="etc.. $1" />

Plus of course the built in etc => etc.

Rule 1 deals with the actual ellipsis character.
Rule 2 assumes that four dots represent an abbreviation stop and an ellipsis, and removes the ellipsis.
Rule 3 assumes that three dots represent an ellipsis, and removes the ellipsis, replacing it with a stop.
Rule 4 assumes (very shakily) that a new sentence starts on the next word and inserts an end of sentence stop after the abbreviation stop.

This is, of course, only valid outside quotes, and even then only rules 1-3 can be given a very high positive and low negative hit rate. Rule 4 fails positively on succeeding proper nouns and fails negatively on intervening punctuation, breaks, titles, end of page etc. Rich Farmbrough, 12:25, 17 November 2010 (UTC).[reply]
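The four rules listed above can be tried in sequence with a short Python sketch. It assumes the rules are applied in the order given, leaves out the built-in etc => etc. rule, and writes `$1` as `\1`; `apply_rules` is a hypothetical helper, not part of AWB.

```python
import re

# The four find/replace pairs quoted above, in the same order.
RULES = [
    (r"etc\s*.\s*…", "etc."),          # rule 1: the ellipsis character
    (r"etc\s*\.\.\.\.", "etc."),       # rule 2: four dots
    (r"etc\s*\.\.\.", "etc."),         # rule 3: three dots
    (r"etc\. +([A-Z])", r"etc.. \1"),  # rule 4: capital letter follows
]


def apply_rules(text: str) -> str:
    """Run each rule over the text in turn."""
    for find, replace in RULES:
        text = re.sub(find, replace, text)
    return text


for sample in ["cats, dogs, etc....", "cats, dogs, etc...",
               "cats, dogs, etc.…", "Dickson etc. Smith sang"]:
    print(apply_rules(sample))
```

Note that the unescaped `.` in rule 1 matches any single character, not only a full stop, so that rule is slightly broader than its description. The last sample is the shaky case for rule 4: "Smith" is a proper noun, not the start of a new sentence.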

Rereading the archived discussions about this rule has been enlightening. When would "Etc." (with a capital "E") be correct? GoingBatty (talk) 18:13, 17 November 2010 (UTC)[reply]

There were discussions about that in the archives too. *shrug* Shadowjams (talk) 09:14, 18 November 2010 (UTC)[reply]

I just reread the archives and didn't see it. Could you please show me where this was discussed? Thanks! GoingBatty (talk) 00:43, 19 November 2010 (UTC)[reply]

Sorry, I may be confused; come to think of it, it may have been regarding e.g. or i.e. or something like that. The discussion I'm thinking of had to do with trailing punctuation I think... In any case I think that issue dealt with some peculiarities of the old rule. So your question raises a good point. Shadowjams (talk) 00:52, 19 November 2010 (UTC)[reply]

etc..

I did some searching for etc and found lots of occurrences of "etc..". It seems much more common than "etc" in fact. Regards, SunCreator ^(talk) 10:45, 18 November 2010 (UTC)[reply]

Etc. and etc should be avoided in formal prose, IMO. "Such as ..." and "including ..." are just two subset terms that indicate that a list is incomplete, and avoid the brush-off informality of "etc".

"[number]-fold"

I have just removed the following as not being a typo.

 <Typo word="T(wo/hree/en/welve/wenty/hirty/housand)fold" find="\b([Tt])(wo|hree|en|welve|wenty|hirt(?:y|een)|housand)[-\s]+fold\b" replace="$1$2fold" />

 <Typo word=";(Four/Five/...)fold" find="\b([Ff](our|ive|orty|ift(y|een))|[Ss](ix|even)(teen|ty)?|[Ee](ight(y?|een)|leven)|[Nn]ine(teen|ty)?|[Hh]undred)[-\s]+fold\b" replace="$1fold" />

AFAIK, usage of the -fold suffix (i.e. 'three-fold' as opposed to 'threefold') is an accepted/bona fide variant, and does not fall to be treated as a typo. --Ohconfucius ^¡digame! 04:10, 22 November 2010 (UTC)[reply]

Oxford Dictionaries Online doesn't list them as variants and I can't find any instances on Google, which thinks it's a typo. Usually hyphenated compound words are British but British usage seems to be no hyphen. McLerristarr | Mclay1 07:24, 22 November 2010 (UTC)[reply]
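For reference, this is roughly what the first of the two removed rules did. The pattern is copied from the rule quoted above; the Python wrapper and the name `to_solid_fold` are only an illustration, with `$1$2fold` written as `\1\2fold`.

```python
import re

# Find pattern of the removed "T(wo/hree/...)fold" rule, unchanged.
T_FOLD = re.compile(
    r"\b([Tt])(wo|hree|en|welve|wenty|hirt(?:y|een)|housand)[-\s]+fold\b"
)


def to_solid_fold(text: str) -> str:
    """Join hyphenated or spaced '-fold' compounds into one word."""
    return T_FOLD.sub(r"\1\2fold", text)


for sample in ["a three-fold increase", "Ten fold", "thirteen-fold", "twofold"]:
    print(to_solid_fold(sample))
```

The rule only rewrote forms with a hyphen or space before "fold"; solid forms such as "twofold" never matched.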

Saavy --> Savvy

A new user recently requested that this typo be fixed by an AWB user. I found the misspelling in 18 articles when I ran the request. --Andrew Kelly (talk) 03:49, 23 November 2010 (UTC)[reply]

Done here GoingBatty (talk) 04:21, 24 November 2010 (UTC)[reply]

Tamborine

Please add "tamborine" → "tambourine", but not when capitalised to avoid changing Tamborine, a place in Queensland. McLerristarr | Mclay1 04:33, 23 November 2010 (UTC)[reply]
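One possible shape for such a rule, sketched in Python rather than as a finished AWB entry. The pattern, the optional plural "s" and the name `fix_tamborine` are assumptions for illustration only; the match is deliberately case-sensitive so that the place name is skipped.

```python
import re

# Lower-case only, so "Tamborine" (the place in Queensland) is never touched.
TAMBORINE = re.compile(r"\btamborine(s?)\b")


def fix_tamborine(text: str) -> str:
    """Correct lower-case 'tamborine(s)' and leave capitalised forms alone."""
    return TAMBORINE.sub(r"tambourine\1", text)


print(fix_tamborine("She played the tamborine."))
print(fix_tamborine("Two tamborines and a drum."))
print(fix_tamborine("Tamborine is a town in Queensland."))
```

The trade-off is that a misspelt "Tamborine" at the start of a sentence would also be skipped.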