Wikipedia talk:AutoWikiBrowser/Typos/Archive 1



Archive 1 | Archive 2

Womens'

I'm trying to correct many instances of "womens'" or "womens" to "women's", but I'm having trouble grabbing that trailing apostrophe in the regex. Can someone help me with the syntax? I'm wondering if this is an AWB bug or whether you have to do something special for apostrophes. --Thiseye 00:19, 13 January 2007 (UTC)

It seems there is some sort of problem related to identifying the end of a word. However, using a whitespace instead of a wordbreak seems to work.
"\b(W|w)omens'(\s)" -->> "$1omen's$2"
Gaius Cornelius 13:08, 13 January 2007 (UTC)
Thanks, that did the trick. :) --Thiseye 01:59, 14 January 2007 (UTC)
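
The behaviour described in this thread can be reproduced outside AWB. The sketch below uses Python's re module as a stand-in for AWB's regex engine (an assumption; AWB's `$1` is written `\g<1>` here). A word boundary only exists between a word character and a non-word character, and an apostrophe followed by a space are both non-word characters, so a trailing `\b` has nothing to match.

```python
import re

# Rule with a trailing \b: the apostrophe and the following space are both
# non-word characters, so no word boundary exists between them.
with_boundary = re.compile(r"\b(W|w)omens'\b")

# Workaround from the thread: capture the following whitespace instead.
with_space = re.compile(r"\b(W|w)omens'(\s)")

text = "the womens' team won"

print(with_boundary.search(text))                  # None
print(with_space.sub(r"\g<1>omen's\g<2>", text))   # the women's team won
```

Note that the whitespace version will not match when "womens'" is followed by punctuation or sits at the very end of the text; a lookahead such as `(?!\w)` is one possible alternative for those cases.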

Greece

There's an error in the "Greece" entry. It should have $1, not $2. --Thiseye 01:43, 3 January 2007 (UTC)

Gandhi

There are two entries for Gandhi. I believe the newest one was added to avoid some false positives, but the old one wasn't removed. --Thiseye 01:26, 2 January 2007 (UTC)

Poss reconsider

Bizarre as in Some Bizarre Records. Rich Farmbrough 22:53 11 August 2006 (GMT).

Attempt

If a fix for attemp is desired, "\b(A|a)tt?em(p|t)(|ed|ing|s)\b" --> "$1ttempt$3" seems to work for all cases. I don't think it matches any real words.—Mrkwcz 17:23, 12 August 2006 (UTC)
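
A quick way to check the proposed rule against a handful of words, sketched in Python's re module rather than AWB itself (an assumption; `$1` and `$3` become `\g<1>` and `\g<3>`). The word list is illustrative only.

```python
import re

# The rule proposed above, with Python-style group references.
attempt = re.compile(r"\b(A|a)tt?em(p|t)(|ed|ing|s)\b")

def fix(text):
    return attempt.sub(r"\g<1>ttempt\g<3>", text)

for word in ["attemp", "attemped", "atemting", "attemts",
             "attempt", "attempted"]:
    print(word, "->", fix(word))
```

The last two words are already correct and come back unchanged, because after "attemp" the rule requires a word boundary or one of the listed endings, and a following "t" is neither.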

Opposites

What about alternative beginnings to words, as in opposites like accessible and inaccessible? Instead of having two separate entries to check and maintain, we could easily just have one:

<Typo word="Accessible/Inaccessible" find="\b(A|a|Ina|ina)ccessab(le|ility)\b" replace="$1ccessib$2" />

This would simply require a rule that opposites (starting with in-, un-, etc) should not be placed alphabetically, but placed with their root word, and in many cases in the same regex.

Another strategy would be a rule that any word covered like this outside its normal alphabetical order should have a comment line placed in the alphabetical list where it would have gone.

Euchiasmus 12:55, 20 August 2006 (UTC)

Sounds like a great idea, reducing duplication is always good. thanks Martin 18:58, 20 August 2006 (UTC)
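
For illustration, the combined Accessible/Inaccessible rule above can be tried in Python's re module (an assumption standing in for AWB's engine; `$1` and `$2` become `\g<1>` and `\g<2>`). The first group carries both the capitalisation and the optional "in" prefix through to the replacement.

```python
import re

# One rule covering the root word and its opposite.
accessible = re.compile(r"\b(A|a|Ina|ina)ccessab(le|ility)\b")

def fix(text):
    return accessible.sub(r"\g<1>ccessib\g<2>", text)

print(fix("inaccessable"))    # inaccessible
print(fix("Accessability"))   # Accessibility
print(fix("accessible"))      # accessible (already correct, not matched)
```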

Victuals and eke

I removed these new additions:

Typo word="Victuals" find="\b(V|v)ittles\b" replace="$1ictuals"

Typo word="Eke" find="\b(E|e)e(ke|ked|kes|king)\b" replace="$1$2"

Typo word="Eke" find="\b(E|e)e(k)\b" replace="$1ke"

Typo word="Ekes" find="\b(E|e)eks\b" replace="$1kes"

"Vittles" is so old a misspelling that it's kind of its own word now, not to mention the cat food Tender Vittles, etc. (see the Google search).

"Eek" is a really common onomatopoeia for screaming, among other things. There are a lot of false positives on this Google search, and the words on the list should have 0 false positives. "Eeks" seems the same (lots of legit uses), there seem to be two legit uses of "eeke", and there are only 4 mainspace results for "eeked", 2 for "eeking", and none for "eekes". --Galaxiaad 18:34, 25 August 2006 (UTC)

Ah, sorry. I even happen to know that Eek is a town in Alaska (you can't get there from here, or even from there—Google Maps fails!). But has "eek"-the-onomatopoeia been verbed? "The scream queen eeked out a living"? BTW, I've just made probably a few hundred changes to the list—I gather that you're genuinely interested, so you might want to take a gander.--BillFlis 23:36, 25 August 2006 (UTC)
Yeah, all the changes are impressive and a bit overwhelming. I definitely want to look though. I didn't mean to sound harsh in my previous comment; sometimes it's hard for me to sound human instead of just stating facts, heh. Hm, doesn't look like it's been verbed, but there is the plural in "Eeks and Squeaks". (The instances of "eeked" actually were typos for "eked" but there were only 4, which isn't enough to merit inclusion.) Hey, I'm just wondering and you'd probably know: what does the word="whatever" bit actually do? --Galaxiaad 13:58, 26 August 2006 (UTC)
Actually, I thought your points were well-taken. I figure the word="whatever" is just informational. My understanding of the AWB is that it's not a bot, it just helps someone make the same kind of edit over and over very quickly. I have a question about how it uses this typo list: I've noticed that some of the rules here have sort of the opposite of a false positive; that is, the correct spelling will trigger a change, back to the correct spelling. There's no harm done, but isn't this inefficient? Should I be stamping these cases out?--BillFlis 14:27, 26 August 2006 (UTC)
The word property means they can be sorted in proper alphabetic order (sorting by order of the typo was very difficult to deal with, as duplicates were not adjacent to each other); it also allows easy location of a specific word, which will hopefully avoid future duplication, and probably explains the enormous amount of duplication that previously existed. Matching the correct spelling is much more efficient than having 2 separate regexes (which is how it used to be) but not as efficient as having a single regex that manages to avoid the correct spelling, so yes, avoid them when possible, but if it is becoming complicated then it doesn't really matter. And thanks for all the work you have done on this! Martin 14:48, 26 August 2006 (UTC)

Airbourne

I got a false positive running AWB when Airbourne was changed to Airborne. Should this be removed? — Loudsox 16:46, 27 August 2006 (UTC)

I think it should be removed, or maybe changed. I think a more likely misspelling is airborn. What would be really nice is some way to tag a word within the encyclopedia as a deliberate misspelling, like adding "[sic]".--BillFlis 17:34, 27 August 2006 (UTC)

Regexes that match the correct spelling

Sometimes a regex, in providing matches for a variety of possible misspellings, matches the correct spelling. As best I can tell, AWB stops on an article when the regex matches the correct spelling and therefore makes no change.

Example: for "Apparel", the regex

(A|a)pp?arr?e(l|ls|ling|lling|led|lled) 

corrects "Aparrel", "Aparel", and "Apparrel". Unfortunately, those alternatives allow "Apparel" to match, so AWB stops on "Apparel" but shows no diff. Example article: Jones Apparel Group.

So, 1) is what I'm saying true; 2) is there a preference against such regexes; 3) is there a way to fix the regex (while keeping only one regex) to avoid this? (And/or, can AWB be programmed to realize that a null edit has occurred?) Thanks, –Outriggr § 01:23, 16 September 2006 (UTC)

Well, I just played with the "Skip article when no change made" setting (which I could swear was on by default, or that I have always had it on), and I see that AWB no longer stops in the above case. Not such an issue then? –Outriggr § 01:31, 16 September 2006 (UTC)
I've been told that the regex does make the "change" (to the same correct spelling, thus useless work) and is thus wasteful of resources. Have I been given some bad info? I've been trying to stamp out such cases, but maybe the program is smart enough to recognize (i.e., it checks) whether any real change is made, and I'm the one doing the useless work!--BillFlis 20:15, 16 September 2006 (UTC)
The program is smart enough to know if a change was actually made, but it is slightly preferable not to match the correct spelling, though not critical. I suppose it might be more critical in the future if some other software wanted to make use of this list though. Martin 09:31, 17 September 2006 (UTC)
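
A rough sketch of the two ideas in this thread, written in Python's re module rather than AWB's engine (an assumption). The replacement string is also an assumption, since the thread quotes only the find pattern, and `real_changes` and `apparel_strict` are hypothetical names. The first part detects null edits by comparing each match with its replacement; the second shows one possible way to stop the rule matching the correct spelling, using a negative lookahead.

```python
import re

# Find pattern as quoted above; the replacement is assumed, not quoted.
apparel = re.compile(r"(A|a)pp?arr?e(l|ls|ling|lling|led|lled)")
REPLACEMENT = r"\g<1>ppare\g<2>"

def real_changes(pattern, replacement, text):
    """Return (old, new) pairs only for matches that actually change."""
    changes = []
    for m in pattern.finditer(text):
        new = m.expand(replacement)
        if new != m.group(0):
            changes.append((m.group(0), new))
    return changes

# Hypothetical variant: refuse to match when the correct spelling follows.
apparel_strict = re.compile(
    r"(A|a)(?!pparel)pp?arr?e(l|ls|ling|lling|led|lled)")

print(real_changes(apparel, REPLACEMENT, "Jones Apparel Group"))   # []
print(real_changes(apparel, REPLACEMENT, "Aparrel and Apparrel"))
print(apparel_strict.search("Jones Apparel Group"))                # None
print(apparel_strict.sub(REPLACEMENT, "Aparel"))                   # Apparel
```

The lookahead keeps everything in a single rule, at the cost of a slightly harder-to-read pattern, which is the trade-off Martin describes.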

Suggestion of a change

How about "alot" to "a lot". But I am not sure how to program it.--Esprit15d 17:50, 27 September 2006 (UTC)

But it might be "allot".--BillFlis 19:46, 27 September 2006 (UTC)

I suppose:

<Typo word="Alot" find="\b(A|a)lot\b" replace="$1 lot" />

<Typo word="Allot" find="\b(A|a)llot\b" replace="$1 lot" />

Reedy Boy 17:10, 16 October 2006 (UTC)

When I did it manually with AWB's find and replace, the words allotment and ballots came up, causing a problem with the search on Allot.

Would running those like that, ensure that only that word is used? Or would it include words that include alot/allot?

Reedy Boy 17:11, 16 October 2006 (UTC)


Seems some people use allot instead of allocate...?

Reedy Boy 17:14, 16 October 2006 (UTC)

Reject. Allot comes from the sense of "assigning by lot" and therefore implies random allocation. Allotment has a specific political meaning of "to select by random selection" - aka "jury" selection and "sortition". Allocation does not have any sense of chance and e.g. to allocate a person to a jury rather than allot them would imply they were chosen rather than selected at random (which would dramatically change their nature). The two words are very different and in my view to replace "allot" with "a lot" was just vandalism. --Mike 16:10, 18 October 2006 (UTC)

I think what you intended was:

<Typo word="Alot" find="\b(A|a)lot\b" replace="$1 lot" />
<Typo word="Allot" find="\b(A|a)lot\b" replace="$1llot" />

Then you'd have to run AWB manually (isn't this always how it's run?), and decide which rule to accept: alot --> a lot or alot --> allot. Yes, allot means allocate, as "within the allotted time". This would be safe to add, I think:

<Typo word="Allot_" find="\b(A|a)lot(ted|ting|ments?|tees?)\b" replace="$1llot$2" />

where we add the low-line character (_) to signal that only certain endings are being treated.--BillFlis 17:22, 16 October 2006 (UTC)
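
On the earlier question of whether word boundaries keep these rules away from words that merely contain "alot" or "allot": the sketch below tries the `Alot` rule and the `Allot_` rule in Python's re module (an assumption standing in for AWB; `$1`, `$2` become `\g<1>`, `\g<2>`). The sample sentence is made up for the test.

```python
import re

alot = re.compile(r"\b(A|a)lot\b")
allot_endings = re.compile(r"\b(A|a)lot(ted|ting|ments?|tees?)\b")

text = "He spent alot of the alotted time counting ballots for the allotment."

step1 = alot.sub(r"\g<1> lot", text)
step2 = allot_endings.sub(r"\g<1>llot\g<2>", step1)

print(step1)
print(step2)
# He spent a lot of the allotted time counting ballots for the allotment.
```

"ballots" and "allotment" are left alone: the leading `\b` cannot match in the middle of "ballots", and "allot..." has a second "l" where both rules expect "o".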

Reconsider

Rich Farmbrough, 19:33 3 October 2006 (GMT).

    • I'm a bit concerned that people—both those who use AWB, and those who see bad edits—forget that this system is semi-automated. In conjunction with the fact that the AWB user is reviewing his edits, I don't see why it is necessary to get rid of a spelling correction rule even if there are very rare exceptions to that rule. I managed not to "correct" Garry Tallent (in another article) once. I'm not pressing for the removal of the spelling error "tallent". –Outriggr § 00:27, 4 October 2006 (UTC)
      • Simply because the stated aim is to have no false positives. "The lofty goal of RETF is to be completely automatic." It is a courtesy to the creator to report problems here. Rich Farmbrough, 21:58 7 October 2006 (GMT).

Two questions

  1. Is "first-hand" really bad? dictionary.com
  2. Comunal->Communal breaks Estadio Comunal de Aixovall, do we care?

Rich Farmbrough, 21:58 7 October 2006 (GMT).

Also, "first hand" can occur together. "I won the first hand."--BillFlis 12:01, 8 October 2006 (UTC)
Actually, "first hand" occurs in Canasta.
Each player is dealt a hand of 11 and a second hand of 13, sometimes referred to as the "hand" and the "foot", respectively. The hand with the lowest bottom card is played first. Once a player plays all cards from his first hand he picks up the second and continues normal play.
It has caused a false positive.Punainen Nörtti 18:15, 25 October 2006 (UTC)

Countries

I've added entries to convert names of countries to Title Case. My process was:

  • copy list of countries from List of countries
  • process to remove text in () or []
  • process "See * for *" lines
  • change lines with "1, 2" into "2 1" (eg "Congo, Republic of")
  • manually inspect and make special changes (eg Taiwan)
  • add to AutoWikiBrowser/Typos and test
  • remove duplicates that had already been put onto the list
  • remove country names that are also words that can be in lowercase (chad, guinea, jersey)

I guess that many of the lines could be manually tweaked to give greater coverage of variants - but this is a start, anyway...

Hope this doesn't generate too many erroneous matches that I haven't thought of...

Euchiasmus 07:40, 8 October 2006 (UTC)

"wale(s)" and "coco(s)" have uncapitalized meanings in http://www.m-w.com. "chile" is a valid spelling of "chili" (capsicum). "india" (occasionally before "ink" and "rubber") isn't always capitalized.--BillFlis 11:54, 8 October 2006 (UTC)

Thanks, Bill - I've removed those. I also realised about turkey and took that out too. Euchiasmus 19:51, 9 October 2006 (UTC)

Because this is an issue of capitalisation rather than spelling, I suggest that these entries are placed in a separate section rather than being distributed into the A, B, C, sections. Gaius Cornelius 13:21, 6 November 2006 (UTC)

Predominately?

Suggested addition - replacing "predominately" (not a word) with "predominantly." | Mr. Darcy talk 20:22, 6 November 2006 (UTC)

Sorry, but "predominately" is indeed a word, meaning--guess what?--"predominantly". See here.--BillFlis 19:58, 10 November 2006 (UTC)

'Logical' punctuation in quotations

I'm changing punctuation at the end of quotations to 'logical' style, per Wikipedia:Manual of Style#Quotations by replacing <," > (comma-quote-space) with <", > (quote-comma-space) throughout (e.g. <"Yes," he said.> to <"Yes", he said.>). I haven't come across any false positives yet. A similar replacement might be possible for embedded full stops at the end of quotations, but that's more controversial and would produce too many false positives, I think, unless someone could suggest a clever method to exclude the case where an entire sentence, including its final punctuation, is being quoted. Colonies Chris 22:59, 6 November 2006 (UTC)

Orignal --> Original

There is a town in Ontario called L'Orignal, mentioned in a few articles, so the regex should exclude this if possible. Colonies Chris 08:23, 9 November 2006 (UTC)
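
One possible way to exclude the place name, sketched in Python's re module. The actual Orignal rule is not quoted in this thread, so the pattern below is a hypothetical stand-in; the only point illustrated is the negative lookbehind that skips the word when it directly follows "L'" or "l'".

```python
import re

# Hypothetical rule: the real list entry may differ.
orignal = re.compile(r"(?<![Ll]')\b(O|o)rignal\b")

def fix(text):
    return orignal.sub(r"\g<1>riginal", text)

print(fix("the orignal town of L'Orignal"))
# the original town of L'Orignal
```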

Problem with "definitions"

When presented with the misspelling "defintions" it tries to replace it with "definitons" which is still not the correct spelling. I took a look at the RegEx and I am not quite sure how to fix this problem, so if somebody with more experience can fix it, that would be great. --Maelnuneb (Talk) 19:49, 10 November 2006 (UTC)

OK, fixed, thanks.--BillFlis 19:58, 10 November 2006 (UTC)

Firsthand

I am getting a ton of false-positives with this one. Card game pages are a real big source of false-positives. I am going to remove it from the list due to this. Code for the RegEx was: <Typo word="Firsthand" find="\b(F|f)irst[ -]hand\b" replace="$1irsthand" /> Possible fix: only match first-hand, but I'm not positive that version isn't an acceptable spelling. Any comment on that would be great. --Maelnuneb (Talk) 20:59, 13 November 2006 (UTC)

After looking up first-hand on [1], it suggested firsthand, so I will add checking for "first-hand" back into the system, but not "first hand" as the possibility of a false positive for "first-hand" is non-existent. If people believe that "first hand" should be included still, please debate here. --Maelnuneb (Talk) 21:05, 13 November 2006 (UTC)
And the OED and Webster Unabridged, both more reliable dictionaries, have "first hand" and "first-hand". This is certainly not a typo, and at the very least is an acceptable alternative spelling, if not the better spelling. —Centrxtalk • 21:29, 14 November 2006 (UTC)
Given that, I would agree to not have firsthand in the list of typos. I personally didn't write the rule in the first place, just tweaked it to get rid of false positives and then did a quick search to see if "first-hand" was a correct spelling, running on the assumption that the original contributor that added the rule for firsthand was in fact correct. Centrx, thank you very much for finding evidence of the other spellings and bringing them here. --Maelnuneb (Talk) 17:46, 15 November 2006 (UTC)

Also, this list really does need to be restricted to typos, not bad usage, because quotations and normal sentences will be filled with cases that should not be "corrected". Also, with compound words there are common sentences (such as actually referring to the first hand of something, as in a game of cards or something about physiology) that would never warrant changing. —Centrxtalk • 06:34, 16 November 2006 (UTC)

Typos would still show up in those cases unfortunately. That is the entire reason that the process of fixing typos is not automated. Your point about "first hand" was exactly why I changed the rule to match only "first-hand" actually. I was getting tired of fixing false positives, so I changed the rule to prevent it. --Maelnuneb (Talk) 18:00, 17 November 2006 (UTC)

referrences -> referencces

<Typo word="Reference" find="\b(R|r)efe(?:rr?a|rre)n(ce[ds]?|cing|ts?)\b" replace="$1eferenc$2" />
should likely be
<Typo word="Reference" find="\b(R|r)efe(?:rr?a|rre)n(ce[ds]?|cing|ts?)\b" replace="$1eferen$2" />
~ BigrTex 20:19, 15 November 2006 (UTC)

Thank you for your suggestion! When you feel an article needs improvement, please feel free to make those changes. Wikipedia is a wiki, so anyone can edit almost any article by simply following the Edit this page link at the top. You don't even need to log in (although there are many reasons why you might want to). The Wikipedia community encourages you to be bold in updating pages. Don't worry too much about making honest mistakes — they're likely to be found and corrected quickly. If you're not sure how editing works, check out how to edit a page, or use the sandbox to try out your editing skills. New contributors are always welcome. ~ BigrTex 20:00, 16 November 2006 (UTC)
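
The difference between the two replacement strings can be checked directly. This sketch uses Python's re module in place of AWB (an assumption; `$1`, `$2` become `\g<1>`, `\g<2>`). Because the second group already begins with "c", a replacement ending in "eferenc" doubles it.

```python
import re

reference = re.compile(r"\b(R|r)efe(?:rr?a|rre)n(ce[ds]?|cing|ts?)\b")

BUGGY = r"\g<1>eferenc\g<2>"   # current replacement
FIXED = r"\g<1>eferen\g<2>"    # suggested replacement

print(reference.sub(BUGGY, "referrences"))   # referencces
print(reference.sub(FIXED, "referrences"))   # references
print(reference.sub(FIXED, "referance"))     # reference
```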

Society, abundant

  • Societ -> Society
  • abundandt -> abundant
  • abundandtly -> abundantly

I stumbled across "Societ" today, and I have a tendency to add an unnecessary d to abundant as well, but I don't know how to add these to the filters myself. --Lethargy 00:14, 16 November 2006 (UTC)

I have just added <Typo word="Abundant" find="\b(A|a)bundand(t|tly)\b" replace="$1bundan$2" /> Tankred 00:38, 16 November 2006 (UTC)

<Typo word="Oft(en)times" find="\b(O|o)ft(|en)[- ]times\b" replace="$1ft$2times" />

Often Times to Oftentimes ???

It might be me, but that seems like a use that would be sparsely used?

Or is it just me?

Reedy Boy 15:32, 19 November 2006 (UTC)

New additions section

Can we be more explicit about whether new additions should be put at the beginning or at the end of the "New additions" section? People put them in both places, which makes the chronology of the section a bit problematic to follow. The section is fairly large now and it would perhaps be a good idea to check the oldest additions again and then move them into the main body. Tankred 16:55, 19 November 2006 (UTC)

Increase

Suggested addition: While fixing other typos I stumbled upon 'increse' (missing a).

<Typo word="Increase" find="\b(I|i)ncres(e|ed|ing|ingly)\b" replace="$1ncreas$2" />

Thanks. ChrisCork 06:51, 28 November 2006 (UTC)

Added, with the handling of "Decrease" as well.--BillFlis 12:52, 28 November 2006 (UTC)

Super Bowl

Superbowl -> Super Bowl. I see that one a lot, not just on the Wiki. I'm not sure how to add listings that split into two words, so I'm adding it here. --cholmes75 (chit chat) 20:56, 28 November 2006 (UTC)

Done!--BillFlis 21:02, 28 November 2006 (UTC)

Guerilla

<Typo word="Guerilla" find="\b(G|g)uer(?:r?i|ril?)l(as?)\b" replace="$1uerill$2" />

We are replacing Guerrilla with Guerilla, even though the article spells it the 'wrong' way. I have removed the line. ~ BigrTex 00:12, 1 December 2006 (UTC)

Problem with kW, kJ, Hz

I'm getting problems with kW, kJ, Hz because AWB now changes (eg on the Bible page)

[[kw:Bibel]] to [[kW:Bibel]]
[[kj:Ombibeli]] to [[kJ:Ombibeli]]
[[hz:Ombeibela]] to [[Hz:Ombeibela]]

They then get moved out of sequence. I suggest the regex be amended to exclude situations where the word is preceded by square brackets and followed by a colon.

Sorry haven't got time to do it at present - I'm rushing off to work!

Cheers - Euchiasmus 07:08, 1 December 2006 (UTC)
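
A sketch of the kind of exclusion suggested above, in Python's re module. The actual unit rules are not quoted in this thread, so `naive` is a hypothetical stand-in for them. Note also that `guarded` is slightly stricter than the suggestion: it skips a match that is either preceded by "[[" or followed by ":", rather than requiring both.

```python
import re

UNITS = {"kw": "kW", "kj": "kJ", "hz": "Hz"}

# Hypothetical stand-in for the current rules.
naive = re.compile(r"\b(kw|kj|hz)\b")

# Same rule, but skipping interwiki-style prefixes such as [[hz:...]].
guarded = re.compile(r"(?<!\[\[)\b(kw|kj|hz)\b(?!:)")

def fix(pattern, text):
    return pattern.sub(lambda m: UNITS[m.group(1)], text)

text = "a 50 hz supply [[hz:Ombeibela]]"

print(fix(naive, text))     # a 50 Hz supply [[Hz:Ombeibela]]
print(fix(guarded, text))   # a 50 Hz supply [[hz:Ombeibela]]
```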

Rule Problems

  • The rule as written changes governement to governmen. -- Saaber 04:07, 4 December 2006 (UTC)
  • The rule as written changes quanity to quantituanit. -- Saaber 11:02, 4 December 2006 (UTC)
  • The rule as written changes 'dominican' to 'Dominica' -- ChrisCork 15:48, 15 December 2006 (UTC)

Miniscule

... is cool, listed as a variant of "minuscule" here and here.--BillFlis 12:50, 9 December 2006 (UTC)

The misspelling has become so widespread that some authorities are listing it as an alternative. However, there is still a clear majority in favour of the correct spelling. I vote we go with the majority and stick to minuscule. Euchiasmus 16:07, 9 December 2006 (UTC)
Dictionary.com shows "miniscule" in three different sources here, which makes a total of at least four, since M-W isn't one of them. Given the policy against changing from one spelling of the same word to another, I don't think we should be automatically changing this. —Krellis 17:31, 11 December 2006 (UTC)
Whatever you do, don't change the occurrences of "miniscule" in the minuscule article. This article does indeed say that "miniscule" has been "traditionally regarded as a spelling mistake," although no reference is offered for this contention. Some discussion with references may be found here.--BillFlis 19:03, 11 December 2006 (UTC)

Changing ordinals to cardinals in dates

Please can we remove the ordinal to cardinal conversion in dates? Maybe the Americans don't habitually use dates like "1st May", but we British do use them and I can't see anything wrong with them. When I read "1 May" it looks very strange, especially in narrative prose.

Here in the UK the use of st|nd|rd|th is very common in dates. For example, glancing through filed correspondence I find that the majority of my documents (insurance policies, bank statements, nominet registration, etc) use ordinal numbers in dates. With other regional variations WP allows alternative forms - why not in dates?

Euchiasmus 14:18, 10 December 2006 (UTC)

I personally have mixed feelings about adding things to the typo list that aren't typos or misspellings, but the intention here was clearly to go with the Manual of Style guideline on ordinal suffixes in dates (relevant section here). So you'd really probably be better off bringing it up there. Hope this helps. --Galaxiaad 19:02, 10 December 2006 (UTC)
  • Here are a couple of points:
    • because WP:DATE is a guideline, consensus was reached about the date format to be used. While a guideline is not a rule, we should be striving towards the suggestions given unless there is a strong push for a change, which would mean that there is no longer consensus. Therefore, while consensus still exists, there is no reason to remove the rules removing ordinals from dates.
    • A note to users of WP:AWB/T: be careful not to remove ordinals in direct quotes. --Maelnuneb (Talk) 17:44, 12 December 2006 (UTC)

Error in proclaim rule?

The current rule for proclaim:

word="Proclaim" find="\b(P|p)roclam(e[dsr]?|ing)\b" replace="$1roclaim$2"

changes proclame to proclaime. Was this intended? Euchiasmus 11:36, 17 December 2006 (UTC)

I think not. The "?" shouldn't be there.--BillFlis 13:36, 17 December 2006 (UTC)
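
The effect of the "?" can be seen by running the rule with and without it. This is a sketch in Python's re module rather than AWB (an assumption; `$1`, `$2` become `\g<1>`, `\g<2>`).

```python
import re

current = re.compile(r"\b(P|p)roclam(e[dsr]?|ing)\b")
without_q = re.compile(r"\b(P|p)roclam(e[dsr]|ing)\b")
REPLACEMENT = r"\g<1>roclaim\g<2>"

for word in ["proclame", "proclamed", "proclaming"]:
    print(word,
          "| current:", current.sub(REPLACEMENT, word),
          "| without ?:", without_q.sub(REPLACEMENT, word))
```

With the "?", a bare "proclame" matches and the captured "e" is copied into the result, giving "proclaime". Without it, "proclame" is simply left alone, so it would need a separate rule if it is to be corrected to "proclaim".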

'Receive' typo

I see there've been some recent changes to the way 'receive' is corrected, but unfortunately it's now broken. I'm not too hot on regexp, so could someone take a look for me please? ^_^ ShakingSpirittalk 07:18, 19 December 2006 (UTC)

New words

I'm looking at Wikipedia:Lists of common misspellings and am trying to fix some of them, using AWB. As thus, I'd like someone more skilled with regexes than me to add:

  • Sacrifice
  • Satellite
  • Sandwich
  • Sergeant

Come to think about it, someone with enough time on their hands could just go ahead and look through everything in Wikipedia:Lists of common misspellings. Obviously, I was looking at S, but there's probably a lot missing elsewhere too. Thank you! Jobjörn (Talk ° contribs) 02:00, 25 December 2006 (UTC)

Jobjörn: I am currently working through all the 'S' typos myself. I am about halfway through a dump of the 30-Nov-2006 database. It might make more sense for us not to duplicate this effort - would you mind working on another letter? There are plenty to go round. If you wish I can help you with a whole bunch of regexes. Personally, I like to work on a set of regexes to make sure that there are not too many errors or false positives before submitting them to the Wikipedia:AutoWikiBrowser/Typos list. Still, I have added sacrifice, sandwich and satellite for you - but not sergent because it generates false positives against a common surname. You might like to try this regex for lowercase only:
"sargant(s?)" --> "sergeant$1"
Let me know what you think - but it is Christmas and I will be away for a few days! Gaius Cornelius 13:25, 25 December 2006 (UTC)
No, definitely. I'll grab some other letter. Jobjörn (Talk ° contribs) 17:06, 25 December 2006 (UTC)

Targetting/targeting

I don't have the right dictionaries handy to confirm, but AFAIK 'targetting' and 'targetted' are accepted spellings in UK English (and possibly Australian English as well). Could somebody with access to the OED and/or Macquarie please check this and remove them from the list if this is so? --Calair 05:13, 30 December 2006 (UTC)

Typicaly & Essentialy

If someone could add 'typicaly' (typically) & 'essentialy' (essentially) to the regex list that would be great, there seem to be a lot of these errors at the moment.--Hooperbloob 07:31, 4 January 2007 (UTC)

Done. Gaius Cornelius 21:11, 4 January 2007 (UTC)

Manoeuver

I just merged (Out)Manoeuver into Maneuver as (Out)Maneuver. This is the line I deleted:

<Typo word="(Out)Manoeuver" find="\b([Oo]utm|M|m)an(?:[oeu]{1,2})ver(s?|ing|e[dr]|abl[ey]|ability)\b" replace="$1anoeuver$2" />

If someone could double-check my merge, I'd appreciate it. ~ BigrTex 21:23, 5 January 2007 (UTC)

AFAIK, the British spelling is 'manouevre''manoeuvre', so it's probably not a good idea to auto-correct a spelling halfway between the two to the US option without checking context. --Calair 23:19, 5 January 2007 (UTC)
My big American dictionary here has "manoeuvre" and "manoeuver" (but not "manouevre"--are you sure that's right?) as variants of "maneuver", without any indication that they are only British spellings. However, this dictionary says "manoeuvre" is "Chiefly British"; no listing for "manoeuver".--BillFlis 00:05, 6 January 2007 (UTC)
Oops, typo fixed, thanks :-)
I don't have good references handy, but as per American_and_British_English_spelling_differences#-re_.2F_-er the usual UK spelling is 'manoeuvre' and the US spelling is 'maneuver'. (This comes from a combination of US/UK differences on whether to end words with '-re' or '-er', combined with different rules on rendering the ligature 'œ' in a modern alphabet - UK spellings tend to split it into two letters, US spellings go with a single phonetic 'e'.)
'Manoeuver' is halfway between the two; it probably should be corrected where it appears, but I'd recommend checking context (i.e. the subject matter of the article, and failing that the style of the rest of it) to judge which way the correction should go. --Calair 01:31, 6 January 2007 (UTC)

Prepubescent or pre-pubescent

I'm not sure which is the correct format but both exist in quantity here.--Hooperbloob 03:02, 6 January 2007 (UTC)

Comital

While it's a common misspelling for "committal", "comital" is also a legitimate word meaning "pertaining to the count". I don't know enough about regexps to fix this, but perhaps something should be done; I've seen this change made twice in the past month or so. Choess 15:56, 12 January 2007 (UTC)

Sponser

Over 300 of these last time I checked. Should be 'sponsor', 'co-sponsor', 'sponsored', 'sponsoring', etc. --Hooperbloob 08:01, 14 January 2007 (UTC)

Only just under 100 in mainspace, according to wikisearch, I'll take a stab at them and report back. —Krellis 23:13, 15 January 2007 (UTC)
All of these in mainspace and Images: should now be taken care of. —Krellis 00:05, 16 January 2007 (UTC)

Trailor

trailor -> trailer --Hooperbloob 08:25, 14 January 2007 (UTC)

"_Strange" Pattern

I just removed the following pattern:

<Typo word="_Strange" find="(?<!\b([A-Z][a-z]*))(\s[Ss])tange\b" replace="$1trange" />

For two reasons:

  1. "Stange" is a last name that I've run across a number of times, particularly in Major League Baseball articles.
  2. The pattern is broken, replacing "Stange" with "trange" - the negative lookbehind assertion appears to be capturing, so the $1 would need to be $2.

Replacing "stange" to "strange" is probably fine, as long as we don't replace the capitalized version. I don't quite understand why this pattern has the lookbehind stuff, rather than just using word boundaries like other patterns, so I don't feel comfortable replacing it - if the original author (or anyone else) wants to do so, please go ahead, as long as you preserve "Stange" and make sure it replaces the right captured string. —Krellis 20:53, 15 January 2007 (UTC)

I originally added this fix. The purpose of the lookbehind was to eliminate instances of Stange preceded by a word that begins with a capital letter - which may be a first name. I found this pretty effective at reducing false positives. Gaius Cornelius 21:09, 15 January 2007 (UTC)
Aha, okay, that makes so much more sense now. My brain just wasn't in a regex parsing mood earlier, I guess. Unfortunately, I've come across at least four or five false positives in the past few days - many articles use just the last name to identify individuals once they have been introduced. At least some of the FPs I've seen have been at the beginning of a sentence or line, so matching that in a lookbehind could theoretically help avoid some more, though of course it would probably prevent legitimate errors from being found as well. Given the advice of "don't add if there is one (false positive)" at the top of the list, I would suggest "Stange" be considered a lost cause, and just the lower case version be re-added. —Krellis 23:01, 15 January 2007 (UTC)

((In)De/In/Af)Finite misbehaves!

The list of typos includes the almost impossibly complicated:

<Typo word="((In)De/In/Af)Finite" find="\b([Ii]n|)(F|f|[Dd]ef|[Aa]ff)(?:finite?|f?in[ae]te?|f?init)(s?|ly|ness|y)\b" replace="$1$2init$3" />

It changes infinetly to infinitly - (for example, try it with the Home Construction article).

If I could work out what it was doing right and what it was doing wrong, I would correct it! I think my example is not the only thing it does wrong. Somebody please help! Thanks. - Euchiasmus 20:17, 25 January 2007 (UTC)
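
To see what the rule is doing, it helps to print the captured groups. The sketch below runs the same pattern in Python's re module (an assumption standing in for AWB; `$1$2init$3` becomes `\g<1>\g<2>init\g<3>`). `show` is a hypothetical helper.

```python
import re

finite = re.compile(
    r"\b([Ii]n|)(F|f|[Dd]ef|[Aa]ff)"
    r"(?:finite?|f?in[ae]te?|f?init)(s?|ly|ness|y)\b")
REPLACEMENT = r"\g<1>\g<2>init\g<3>"

def show(word):
    m = finite.fullmatch(word)
    groups = m.groups() if m else None
    print(word, "groups:", groups, "->", finite.sub(REPLACEMENT, word))

show("infinetly")    # groups ('in', 'f', 'ly') -> infinitly
show("definate")     # groups ('', 'def', '')   -> definit
show("infinitely")   # no match, left unchanged
```

The middle of the word is always rewritten as "init", and whatever the suffix group captured is appended. That suggests the trailing "e" is lost whenever the misspelled middle part swallows it (or never had it), which would explain "infinitly" and also gives "definit" for "definate".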

Light Year

I ended up having a problem with the light year regex, so I removed it. Here is the original code: <Typo word="_Light year" find="(?<!\b(Buzz ))(L|l)ig?h?tyea(rs?)\b" replace="$1ight yea$2" /> This is what was happening when it ran for me: AWB found "lightyears" and wanted to replace it with "ight yeal". Obviously a problem with the substitution. I tried changing the $1→$2 and the $2→$3, but that did not end up working for me, which does not make any sense to me. If somebody with more experience can attempt to fix this one, that would be great. --Maelnuneb (Talk) 20:26, 26 January 2007 (UTC)

My fix was actually correct. I just had a cache problem getting in the way of having an updated set of typo rules. Problem solved. --Maelnuneb (Talk) 20:31, 26 January 2007 (UTC)
This dictionary says that it's "light-year", with a hyphen.--BillFlis 13:22, 27 January 2007 (UTC)
My home dictionary gives it as two words whereas the wikipedia article says it is either one word or hyphenated. I guess that typo fix had better come out. Gaius Cornelius 19:04, 27 January 2007 (UTC)
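
The group-numbering problem in the original rule can be reproduced in Python's re module, used here as a stand-in for AWB's engine (an assumption; `$n` becomes `\g<n>`). In Python, as in the behaviour reported above, the parentheses inside the lookbehind count as group 1, so the letter and the ending are groups 2 and 3. Python 3.5 or later is assumed, where an unmatched group expands to an empty string.

```python
import re

light_year = re.compile(r"(?<!\b(Buzz ))(L|l)ig?h?tyea(rs?)\b")

ORIGINAL = r"\g<1>ight yea\g<2>"   # group 1 is the lookbehind's (Buzz )
SHIFTED = r"\g<2>ight yea\g<3>"    # the $1->$2, $2->$3 fix from the thread

print(light_year.sub(ORIGINAL, "lightyears"))      # ight yeal
print(light_year.sub(SHIFTED, "lightyears"))       # light years
print(light_year.sub(SHIFTED, "Buzz Lightyear"))   # Buzz Lightyear
```

Writing the lookbehind with a non-capturing group, `(?<!\b(?:Buzz ))`, would be another way to keep the original `$1` and `$2` numbering.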

Peleton

peleton -> peloton

Thanks, Mk3severo 00:55, 2 February 2007 (UTC)

ususally --> usually

Please add this typo to the list. Harryboyles 05:03, 2 February 2007 (UTC)

Added. Wow, a quick search turned up 380 instances of this weird misspelling!--BillFlis 13:02, 2 February 2007 (UTC)

Simalar -> similar

not really sure how to add that... -ΖαππερΝαππερ BabelAlexandria 14:18, 13 February 2007 (UTC)

 <Typo word="Similar" find="\b(S|s)imalar\b" replace="$1imilar" /> 
Reedy Boy 14:43, 13 February 2007 (UTC)
Just Looked, there is
 <Typo word="(Dis)Similar" find="\b(S|s|[Dd]iss)im(?:mi|u)lar(|ly|ity)\b" replace="$1imilar$2" /> 

So, possibly incorporate it into that?

 <Typo word="(Dis)Similar" find="\b(S|s|[Dd]iss)im(?:mi|u|a)lar(|ly|ity)\b" replace="$1imilar$2" /> 

I think. That adds |a to the alternation in the middle of the word. Reedy Boy 14:46, 13 February 2007 (UTC)
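A quick way to check the merged rule is to run it over a few spellings. This is only a sketch, assuming Python's re module matches the way AWB does for this pattern; fix_similar is a made-up helper name.

```python
import re

# The merged "(Dis)Similar" rule from above, with "a" added to the
# alternation so that "simalar" is caught as well.
SIMILAR = re.compile(r"\b(S|s|[Dd]iss)im(?:mi|u|a)lar(|ly|ity)\b")


def fix_similar(text: str) -> str:
    # \1 and \2 play the role of $1 and $2 in the typo rule.
    return SIMILAR.sub(r"\1imilar\2", text)


for word in ["simalar", "Simmilarly", "dissimularity", "similar"]:
    print(word, "->", fix_similar(word))
```

The correct spelling "similar" is left alone because none of "mi", "u" or "a" follows the "im".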

Moniter -> Monitor

Need to handle moniter, monitering, monitered, etc..--Hooperbloob 23:48, 28 February 2007 (UTC)

Misspellings to be added

Should new misspellings go here or in the "Misspellings to be Added" section of the main project page? Regardless, here's about 90 that I've amassed. I'd add them myself, but some of those regexes are pretty complex and scare me. I've verified that all these aren't acceptable by dictionary.com and that there are at least 10 instances of each in Wikipedia. False positives haven't been checked for, however. And there are probably prefixes/suffixes that can be added to most of them.

(Can someone please add some of these? --Thiseye 07:02, 2 March 2007 (UTC))

  • jeapordy → jeopardy
  • likley → likely
  • liqour → liquor
  • literaly → literally
  • minsitry → ministry
  • mountian → mountain
  • newstands → newsstands
  • nobilty → nobility
  • oppenent → opponent
  • orginial → original
  • personna → persona
  • editted → edited
  • posibility → possibility
  • precip(a|ia)tion → precipitation
  • prepatory → preparatory
  • pricipal → principal
  • recruting → recruiting
  • reliquish → relinquish
  • reminicent → reminiscent
  • replacment → replacement
  • responed → responded
  • sectretary → secretary
  • signiture → signature
  • similarily → similarly
  • similiar → similar
  • unsheath → unsheathe
  • valiently → valiantly
  • wherupon → whereupon
  • wheter → whether
  • protray → portray
  • protrayed → portrayed

Questioned

  • widly → widely
    • Might be a typo for "wildly" instead of "widely" -- JHunterJ 11:27, 13 April 2007 (UTC)
  • intitution → institution
    • Might be a typo for "intuition" instead of "institution" -- JHunterJ 16:39, 22 June 2007 (UTC)
  • summery -> summary --John 23:32, 23 July 2007 (UTC)

Reliable sources

Is dictionary.com a reliable source?--Andeh 06:04, 11 August 2006 (UTC)

Nope. See here. alphaChimp laudare 06:19, 11 August 2006 (UTC)
OK, what about Microsoft Word 2000's or higher dictionary?--Andeh 06:25, 11 August 2006 (UTC)

This looks like a good source for misspellings: http://www.misspelled.com/common/a.htm --BillFlis 10:45, 27 August 2006 (UTC)

Full stops, commas, colons, brackets and double spaces

I feel that the following mistakes are too common (especially in stubs) to ignore:

  • c denotes any alphanumeric character
  • s denotes a space character
Mistake   Correction   Suggested code
c.c       c.sc         <Typo find="\b(a-zA-Z).(a-zA-Z)\b" replace="$1. $2" />
cs.c      c.sc         <Typo find="\b(a-zA-Z) .(a-zA-Z)\b" replace="$1. $2" />
cs.sc     c.sc         <Typo find="\b(a-zA-Z) . (a-zA-Z)\b" replace="$1. $2" />
c,c       c,sc         <Typo find="\b(a-zA-Z),(a-zA-Z)\b" replace="$1, $2" />
cs,c      c,sc         <Typo find="\b(a-zA-Z) ,(a-zA-Z)\b" replace="$1, $2" />
cs,sc     c,sc         <Typo find="\b(a-zA-Z) , (a-zA-Z)\b" replace="$1, $2" />
c;c       c;sc         <Typo find="\b(a-zA-Z);(a-zA-Z)\b" replace="$1; $2" />
cs;c      c;sc         <Typo find="\b(a-zA-Z) ;(a-zA-Z)\b" replace="$1; $2" />
cs;sc     c;sc         <Typo find="\b(a-zA-Z) ; (a-zA-Z)\b" replace="$1; $2" />
c(c       cs(c         And so forth
c(sc      cs(c         And so forth
cs(sc     cs(c         And so forth
c)c       c)sc         And so forth
cs)c      c)sc         And so forth
cs)sc     c)sc         And so forth
ss        s            And so forth

Note: Suggested code is based on my preliminary understanding of the pattern of the working code at Wikipedia:AutoWikiBrowser/Typos, and I am very sure it is wrong and needs to be corrected.

Szhaider 15:39, 9 October 2006 (UTC)

These are indeed common mistakes, but unfortunately, in my experience there are too many legitimate exceptions, such as ".NET". The other mistakes may not have so many exceptions, though. Martin 16:16, 9 October 2006 (UTC)
Yeah, and what about U.S.A.? Or T.S. Eliot? Also, semi-colon is part of many HTML entities, like "&mdash;" etc., which will butt right up against letters.--BillFlis 02:11, 10 October 2006 (UTC)
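To make the exceptions above concrete, here is a rough sketch of the first two rows of the table. The patterns are my own guess at what was intended (a character class [a-zA-Z] and an escaped full stop, with the \b anchors dropped so that whole words can match), not the suggested code as posted. Python's re module stands in for AWB, and the function names are made up.

```python
import re

# "c.c" row: a full stop squeezed between two letters.
NO_SPACE = re.compile(r"([a-zA-Z])\.([a-zA-Z])")
# "cs.c" row: the space sits before the full stop instead of after it.
SPACE_BEFORE = re.compile(r"([a-zA-Z]) \.([a-zA-Z])")


def fix_no_space(text: str) -> str:
    return NO_SPACE.sub(r"\1. \2", text)


def fix_space_before(text: str) -> str:
    return SPACE_BEFORE.sub(r"\1. \2", text)


print(fix_no_space("It ended.The next day"))   # the intended kind of fix
print(fix_no_space("U.S.A."))                  # abbreviation gets damaged
print(fix_space_before("the .NET framework"))  # ".NET" gets damaged
```

The first line is repaired as hoped, but the abbreviation and ".NET" are both changed, which is the false-positive problem described in the two replies.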

facilitate

The new entry for facilitate is not correct. It's changing facilitate to facilitatli. I think it should have $3 instead of $2. --Thiseye 00:44, 1 March 2007 (UTC)

Thanks for reporting; fixed. -- intgr 00:47, 1 March 2007 (UTC)

secretarty -> secretary

found in Marita Ulvskog. Jobjörn (Talk ° contribs) 01:21, 8 March 2007 (UTC)

Added to existing "Secretary" entry.--BillFlis 22:33, 8 March 2007 (UTC)

RETF oddities

I noticed something strange that could be a bug in AWB. I've noticed in several articles that if a typo is in wiki tags [[]], then RETF will not catch this. I assumed this was because it's not excluding the brackets as part of the word so it wasn't matching the regex. But then I noticed in the Akshay Pratap Singh article, that the FAR does catch typos within wiki tags. In this article, "politican" is misspelled. I had a FAR entry to correct this which I recently added to RETF. However, I noticed when I disabled the FAR entry, it would no longer be corrected. I updated the FAR regex to exactly that of the RETF regex, and still FAR would correct it, but RETF would not. --Thiseye 22:43, 11 March 2007 (UTC)

I believe this has been discussed a few times over on the AWB talk pages; it has been set up like this on purpose. There are reasons for doing it both ways, and I think we are looking into having it check more... Post it on the AWB talk page... Reedy Boy 17:55, 12 March 2007 (UTC)

Not sure if anyone will see this...

I was wondering if the AWB could include the often misused words "reoccur", "reoccured", and "reoccuring". These are not actual words (contrary to popular assumption)! They should all be changed to "recur", "recurred", and "recurring". Mahalo. --Ali'i 20:44, 13 March 2007 (UTC)

Oops, they already are included:

<Typo word="(Re(o)c/Re)currence" find="\b([Rr]eoc|[Oo]c|Re)curran(ces?|t|tly)\b" replace="$1curren$2" />
<Typo word="Recurr(ed/ing)" find="\b(R|r)ec(?:cur?|u)r(ed|ing|ent|ently)\b" replace="$1ecurr$2" />

Sorry about that. Thanks anyway. --Ali'i 20:47, 13 March 2007 (UTC)

Includeing -> Including

As above, suggest replacing includeing with including. Harryboyles 05:59, 17 March 2007 (UTC)

Asian needs to be updated

There is a misspelling in Kai Chen, "asain"; the current rule accounts for "aisian"....

Dependant vs. Dependent

It appears that "dependant" is acceptable in British English, esp. as a noun. If people concur, it should be removed from the typo list IMHO. —Wknight94 (talk) 15:21, 23 March 2007 (UTC)

It's not just British. An American dictionary http://www.m-w.com/dictionary/dependant lists it too.--BillFlis 18:10, 23 March 2007 (UTC)
So it should be removed, no? —Wknight94 (talk) 14:05, 24 March 2007 (UTC)
It definitely needs to be removed. As a noun a dependant is a person looked after by another e.g. a father's dependants are his children (sorry for the approximate definition). Dependant may well be incorrectly used e.g. 'dependant on the weather ...' but can't be fixed this way. Rjwilmsi 19:19, 26 March 2007 (UTC)
I removed it shortly after my last message. —Wknight94 (talk) 21:21, 26 March 2007 (UTC)

Regex/CPU question

I know that we want to reduce the number of regexes to reduce the amount of CPU time used to process them all. I'm assuming this means that there is little to no CPU cost associated with adding a variant to an existing regex compared to adding a completely new entry. Should we avoid adding variants to an existing regex that don't occur too often, or does that matter?

Also, it seems we avoid "catching" the correct spelling within the regex. Is that the standard we should go by? And to what extent should we go to avoid that situation? I've seen some regexes that do catch the correct spelling, so should I try to rework these, or is this sometimes acceptable ("available" is an example). Further, should we avoid trying to catch certain variants of typos to avoid catching the correct spelling? Should we avoid adding a new entry to try to catch a variant to avoid catching the correct spelling ("Vancouver" is an example)? --Thiseye 18:28, 25 March 2007 (UTC)

  • I guess since I received no feedback regarding this, what do people think about creating a "sub-list" of typos that people could manually synch with in AWB to use. This typo list will contain regexes that catch a lot of typos, but occasionally catch false positives. In other words, a list where we're not necessarily trying to get "100% accuracy" like the current list, but still accurate more often than not. A user using this list would have to be more careful of the substitutions made. I keep such a separate list already, but it'd be nice to have a common one. The "New Jersey" post below is one such candidate. --Thiseye 01:29, 8 July 2007 (UTC)

Combining regexes that catch missing "e" before "ly" suffix

I wanted to get others' thoughts on combining several regexes (and incorporating some new ones). The thing is that if we want to add other variants to these, we'd probably want to separate them out again.

<Typo word="(Accurate/Active/Affectionate/Alternate/Appropriate/(Ab/Re)solute/Collective/Consecutive/Desperate/Exclusive/Extensive/False/Large/Separate/Severe)ly" find="\b((A|a)(ccurat|ctiv|ffectionat|lternat|ppropriat)|([Aa]b|[Rr]e)solut|(C|c)o(llec|nsecu)tiv|(D|d)esperat|(E|e)x(clu|ten)siv|(F|f)als|(L|l)arg|(S|s)e(parat|ver))ly\b" replace="$1ely" />

--Thiseye 00:01, 26 March 2007 (UTC)

I think this is a good idea, I have been using some regexes like this personally and they can work pretty well. Gaius Cornelius 00:05, 26 March 2007 (UTC)
Good idea, but I have a suggestion. No English words end in "ivly" or "avly". This:
<Typo word="-(a/i)vely" find="(a|i)vly\b" replace="$1vely" />
catches your "-ively" words and over a thousand more. I went ahead and added this and a few others under New Additions; I'll let them cook for a while to see if any unforeseen problems arise before deleting any existing entries.--BillFlis 10:29, 26 March 2007 (UTC)
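The general suffix rule above can be tried out in isolation. A small sketch, assuming Python's re module as a stand-in for AWB (fix_vly is a made-up name):

```python
import re

# The "-(a/i)vely" rule quoted above. There is no leading \b, so it
# matches the ending of any word.
SUFFIX = re.compile(r"(a|i)vly\b")


def fix_vly(text: str) -> str:
    return SUFFIX.sub(r"\1vely", text)


for word in ["activly", "exclusivly", "bravly", "actively"]:
    print(word, "->", fix_vly(word))
```

Correctly spelled words such as "actively" never contain "ivly", so they pass through untouched.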

'infinate' fixed to 'infinit'

The typo correction ((In)De/In/Af)Finite fixes 'infinate' to 'infinit'. I'm not competent enough with regex to fix it. Rjwilmsi 19:16, 26 March 2007 (UTC)

Fixed, but I had to take out the case of "infinity".--BillFlis 19:33, 26 March 2007 (UTC)
Thanks. And another: ballon can't be corrected to balloon as 'ballon' exists in French and is quoted e.g. Ballon D'or in the Roberto Baggio article.
That sounds questionable since this is the English Wikipedia. That's one that would need to be rejected manually by the WP:AWB user but shouldn't be removed from the typo list. (My opinion anyway). —Wknight94 (talk) 21:21, 26 March 2007 (UTC)
Yes, but if you search for "ballon", you get not just Ballon D'Or but a host of articles with that word in the title. On the other hand, we could certainly keep the corrections of "balloning", "ballonist", etc. On the third hand, there aren't a lot of these errors.--BillFlis 10:24, 28 March 2007 (UTC)

'responsable(s)' fix needs to be removed

Responsable(s) exists in French so needs to be removed from the "(Ir)Responsible" correction. Rjwilmsi 20:27, 27 March 2007 (UTC)

tPA is corrected to TPa but it's correct in articles such as Serpin. Rjwilmsi 20:37, 27 March 2007 (UTC)

Sorry to push back again (as I did above) but this is the English Wikipedia. Shouldn't French words be occurring very very rarely? To me, that's better to cover as an exception by the WP:AWB user (which is what this list is for). —Wknight94 (talk) 22:03, 27 March 2007 (UTC)
While I tend to agree, the RETF project page does state that the "lofty goal of RETF is to be completely automatic. That is, 100% accuracy." So something's got to give. We can't really have it both ways. I have a couple of ideas that I'm going to propose soon to alleviate this. --Thiseye 04:27, 28 March 2007 (UTC)
From that goal, anytime someone runs across any change in WP:AWB that they need to roll back, they should remove it from the list, right? I'll do that then. Thanks. —Wknight94 (talk) 11:21, 28 March 2007 (UTC)

For phrases in a language other than English, use {{lang}} for the phrase, for example {{lang|fr|Responsable}}, where the second parameter is the ISO 639 code. It stops AWB changing the text, but I'm not sure about WikEd (if not, it probably should). mattbr 10:53, 28 March 2007 (UTC)

Thanks. That's a really useful tip I didn't know about. I'll probably go through and tag all French 'responsable's like that. Rjwilmsi 17:25, 28 March 2007 (UTC)

Typica

Typica exists (in English!) but is corrected to Typical. Wasn't sure how to fix the regex myself. Rjwilmsi 07:03, 28 March 2007 (UTC)

I have removed the regex doing this ((A)Typically). Other changes in the removed regex appear to be covered already in (A)Typical, but someone please update it if not. Thanks, mattbr 10:53, 28 March 2007 (UTC)

Another: In (fact/the/a/an) corrects the name Ina

Removed "ina" and "inan" from regex because of name false positives. I'd also be concerned "inan" would be a typo of "inane". --Thiseye 01:24, 29 March 2007 (UTC)

Nation name capitalization

What do folks think about taking out some of the capitalizations since there are so many animal species that use lower-case versions of words that would ordinarily be upper-case (see this edit for an example of the mistakes that are often made). —Wknight94 (talk) 22:03, 27 March 2007 (UTC)

"gum arabic" too. -- Euchiasmus 20:17, 7 April 2007 (UTC)

Millenium Hall

Proposing to remove "Millennium_" since there is a well-known 18th century book, Millenium Hall. —Wknight94 (talk) 00:06, 30 March 2007 (UTC)

There's a band called 'Agression', so the 'agression' -> aggression fix needs to be edited. Rjwilmsi 06:24, 31 March 2007 (UTC)

Official

There is currently an entry for Official, but I'm not sure if it corrects "Offical" --> "Official". Can someone either please add this or let me know that it is in there already? --After Midnight 0001 05:09, 1 April 2007 (UTC)

I added that case, as well as a couple more word endings.--BillFlis 11:17, 1 April 2007 (UTC)

.coms

I couldn't get negative lookahead to work properly on the .com's (OK, brainfart Harvard would be .edu anyway). Try 1 and Try 2. I'm trying to get it to ignore URLs and emails (ex NSAKEY). Can somebody take a peek? I was reloading the file with click/unclick of the RETF option. — RevRagnarok Talk Contrib 17:40, 1 April 2007 (UTC)

AWB ignores external http: links (and from the next release https:, ftp: and mailto:), so these shouldn't be a problem. In regular text, I can't think of a situation where you would write a web or email address outside a link. Could you point me to where you are having the problem? You can try out a regex using the find-and-replace option in AWB, and I don't think clicking/unclicking the checkbox reloads the list, but you can from the last option on the 'General' menu. mattbr 18:12, 1 April 2007 (UTC)
The developers told me click/unclick reloads and that seems to work. The test article is listed above - NSAKEY has the public key for an email @microsoft.com. — RevRagnarok Talk Contrib 18:18, 1 April 2007 (UTC)
Sorry missed that. Wrap the text in <pre></pre> rather than using a space at the beginning. AWB will then ignore them. mattbr 18:30, 1 April 2007 (UTC)
That fixes this case, but on a side note, I'd like to know why the regex didn't work. — RevRagnarok Talk Contrib 18:35, 1 April 2007 (UTC)
Ticking and unticking the box just enables and disables it; it doesn't refresh the typo list. I've just committed a change so that if you use the option on the General menu, it will reload them. Reedy Boy 18:41, 1 April 2007 (UTC)
Two weeks ago you said it did reload the typo page. Guess there was a misunderstanding somewhere. Either way, I < pre> tagged the one spot anyway per Matt. — RevRagnarok Talk Contrib 18:52, 1 April 2007 (UTC)
Sorry about that; I thought (as it was a bit of a quick fix) that it did. When I looked over the code just now, I realised that unless the declaration for the typos was blank (i.e. = null), it wouldn't load them. I've now put a parameter on that, so that you can force a reload, and that works. Sorry for the confusion/lack of complete attention on my part, and for the next release, it definitely has been sorted!! Reedy Boy 19:01, 1 April 2007 (UTC)

Re the regex, sorry bit of a regex novice. Can anyone else help? mattbr 18:50, 1 April 2007 (UTC)

august > August

Since august is a word, should this correction be removed, or improved to fix <number> august > <number> August only? Rjwilmsi 17:53, 3 April 2007 (UTC)

Good point. Probably, but I was having some problems with lookahead in the past (see above). — RevRagnarok Talk Contrib 18:10, 3 April 2007 (UTC)
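One way to limit the fix to dates without any lookahead is to capture the number and put it back. This is a hypothetical pattern written for illustration, not the rule from the list, and it only covers the day-before-month form. Python's re module is assumed to behave like AWB here.

```python
import re

# Hypothetical rule: only capitalise "august" when a one- or two-digit
# day number comes directly before it.
DAY_AUGUST = re.compile(r"\b(\d{1,2}) august\b")


def fix_august(text: str) -> str:
    return DAY_AUGUST.sub(r"\1 August", text)


print(fix_august("on 3 august 2006"))
print(fix_august("an august institution"))
```

The adjective "august" is left alone because no day number precedes it.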

discribed -> described

As in [2]? Jobjörn (Talk ° contribs) 12:06, 4 April 2007 (UTC)

Added to "Describe", which is now "(De/Pre)scribe".--BillFlis 19:49, 4 April 2007 (UTC)

strengtened > strengthened

as here. Jobjörn (Talk ° contribs) 14:16, 4 April 2007 (UTC)

Added to "Strength".--BillFlis 19:43, 4 April 2007 (UTC)

"significatly" --> "significately" ???

The rule <Typo word="-(b/c/d/g/i/m/s/t/v)ately_" find="([bcdgimstv])atly\b" replace="$1ately" /> converts significatly to significately.

Surely that can't be what the inventor intended?

--Euchiasmus 20:13, 7 April 2007 (UTC)

Yeah, that needs to go away. —Wknight94 (talk) 21:19, 7 April 2007 (UTC)
I added this case to the existing rule for "Significant", and moved the general rules to the end, so this will be treated as a special case before the general rules kick in.--BillFlis 19:39, 9 April 2007 (UTC)

"distictively" --> "districtively" ???

The word "districtively" doesn't even exist.

Let's have rules that rectify a recognised and bounded set of incorrect words, rather than trying to make the rules too general. What do you think? Euchiasmus 20:30, 7 April 2007 (UTC)

Agreed as your other significatly example demonstrates. —Wknight94 (talk) 21:19, 7 April 2007 (UTC)
As the "inventor" of these attempts at general rules, may I ask, what is the harm in replacing one type of error by another? If you did not have the general rule, you would still leave an error. At least its presence in this case alerted you that we need separate rules for these exceptional misspellings. I'll add a rule for "(Di/In)stinctive" to handle your clever discovery!--BillFlis 19:11, 9 April 2007 (UTC)
It turns out that there was an existing rule to handle "distictively" but it was down in the D's, behind the general rules. I've now moved the general rules to the end, to allow the special cases to be handled first. I also modified the previous "Distinction" to "(Di/In)stinctive".--BillFlis 19:17, 9 April 2007 (UTC)
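The effect of moving the general rules to the end can be sketched as below. It assumes the rules are applied one after another in list order, which is what the comments above imply; the special-case pattern is a hypothetical one written for this sketch, not the actual "Significant" entry, and Python's re module stands in for AWB.

```python
import re

# The general "-ately" rule quoted in the section above.
GENERAL = (re.compile(r"([bcdgimstv])atly\b"), r"\1ately")
# Hypothetical special-case rule, written only for this sketch.
SPECIAL = (re.compile(r"\b([Ss])ignificatly\b"), r"\1ignificantly")


def apply_rules(text: str, rules) -> str:
    # Apply each (pattern, replacement) pair in order, feeding the
    # output of one rule into the next.
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


print(apply_rules("significatly", [GENERAL, SPECIAL]))  # general rule wins
print(apply_rules("significatly", [SPECIAL, GENERAL]))  # special case wins
print(apply_rules("immediatly", [SPECIAL, GENERAL]))    # general rule still works
```

With the general rule first, the word has already been turned into "significately" by the time the special case runs, so the special case never matches.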

"other than"

The regexp for other than would change "Will have to agree with each other then convince the rest." Another regexp to change "(another|(?:the|each|some)) other then" to "$1 other, then" first and then apply the "then" to "than" fix would avoid it. This could also be extended to handle "better then" and "worse then". Or the line could be removed. Is it too much processing for these cases? -- JHunterJ 11:31, 13 April 2007 (UTC)

KPA vs. kPa

KPA is being changed to kPa because a rule in the Wikipedia:AutoWikiBrowser/Typos#Abbreviations of SI units section is too general. (I've run into other problems with the SI units but I haven't seen them in a while. I'll bring them up next time I see them.) Can we do a (k[pP][Aa]|[Kk][pP]a|KpA) rule instead? —Wknight94 (talk) 21:58, 15 April 2007 (UTC)
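The narrower alternation suggested above can be checked against each capitalisation. A sketch only: the surrounding \b anchors and the fixed "kPa" replacement are my assumptions, and Python's re module stands in for AWB.

```python
import re

# The proposed alternation, wrapped in word boundaries (an assumption).
# It covers every mixed-case spelling except the all-caps "KPA".
KPA = re.compile(r"\b(k[pP][Aa]|[Kk][pP]a|KpA)\b")


def fix_kpa(text: str) -> str:
    return KPA.sub("kPa", text)


for sample in ["100 Kpa", "100 kpa", "100 KPa", "100 kPA", "the KPA"]:
    print(sample, "->", fix_kpa(sample))
```

The pattern also matches the already-correct "kPa", but replacing it with itself changes nothing.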

Unrelated: Canarian Black Oystercatcher subspecies' scientific name is "Haematopus niger meade-waldoi" but "niger" gets changed to "Niger". —Wknight94 (talk) 02:31, 16 April 2007 (UTC)

Easter

I have found a false positive for the capitalization of "easter" and that is "easter egg" in the sense of Easter egg (virtual). After looking at the what links here page, there are between 250 and 500 links to that page, so there are a fair number of instances of this false positive out there. I personally cannot think of a way to alter the rule to fix this problem. I will be removing the rule. If anyone can think of a way to fix it, feel free to add it back. For future reference, this was the rule: <Typo word="Easter" find="\beaster\b" replace="Easter" /> --Maelnuneb (Talk) 17:07, 17 April 2007 (UTC)

According to Easter egg (virtual), that usage is capitalized as well. I don't see the problem. -- JHunterJ 17:14, 17 April 2007 (UTC)
As an aside, if it had been needed, I believe
<Typo word="Easter" find="\beaster(?! egg)\b" replace="Easter" />
would have accomplished the exception-handling desired. -- JHunterJ 17:29, 17 April 2007 (UTC)
Looking through the page they don't consistently capitalize Easter. Given that, I am going to change the rule to the one you suggested. --Maelnuneb (Talk) 20:25, 17 April 2007 (UTC)
Meh. Inconsistency in that article is the problem to fix first, IMO. Which I've done, just now. I'd still say the original rule should be restored here, but I'll see if someone else agrees. -- JHunterJ 21:09, 17 April 2007 (UTC)
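For anyone wanting to see what the negative lookahead in the suggested rule does, here is a short sketch using Python's re module as a stand-in for AWB (fix_easter is a made-up name):

```python
import re

# The suggested rule: capitalise "easter" unless " egg" follows it.
EASTER = re.compile(r"\beaster(?! egg)\b")


def fix_easter(text: str) -> str:
    return EASTER.sub("Easter", text)


print(fix_easter("on easter Sunday"))
print(fix_easter("a hidden easter egg"))
print(fix_easter("several easter eggs"))
```

Because the lookahead only checks for " egg" and no boundary after it, the plural "easter eggs" is skipped as well.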

"Comprised of" rule

I'm a little confused about these rules:

<Typo word="comprises" find="\bis comprised (?:up )?of\b" replace="comprises" />
<Typo word="comprise" find="\bare comprised (?:up )?of\b" replace="comprise" />
<Typo word="comprised" find="\b(?:was|were|been) comprised (?:up )?of\b" replace="comprised" />
<Typo word="comprising" find="\b([Cc])omprised (?:up )?of\b" replace="$1omprising" />

Could somebody with a little more English grammar knowledge please explain these. I don't remember there being a problem with "comprised of" but I could be wrong. --Maelnuneb (Talk) 16:33, 19 April 2007 (UTC)

If X, Y, and Z compose a thing (or a thing is composed of X, Y, and Z), that thing comprises X, Y, and Z. See wikt:comprise -- JHunterJ 16:58, 19 April 2007 (UTC)

Thanks for looking that one up. One would think that I would have known to look there before asking questions, but apparently not. --Maelnuneb (Talk) 17:07, 19 April 2007 (UTC)
There was an earlier question about this, which I answered on the RegExTypoFix talk page under the heading Urgh!! -- Euchiasmus 21:26, 19 April 2007 (UTC)

It has been suggested on my Talk that the replacement for "is comprised of" be "is composed of" instead of "comprises". I tend to prefer keeping the base word, and as a bonus I like active voice over passive voice. Any other suggestions or agreements with either choice? -- JHunterJ 11:01, 26 April 2007 (UTC)

Significantion?

signification -> significantion? Looked weird to me so I didn't save the change. --Guinnog 23:04, 26 April 2007 (UTC)

It was wrong. The pattern for "significant" was lacking the word boundaries. Fixed. -- JHunterJ 23:22, 26 April 2007 (UTC)
Wow, that was fast! Thank you. --Guinnog 23:26, 26 April 2007 (UTC)
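The missing word boundaries can be reproduced with a cut-down pattern. The two patterns below are hypothetical stand-ins, since the actual "significant" entry is not quoted in this thread, and Python's re module is assumed to behave like AWB here.

```python
import re

# Hypothetical rule without word boundaries: it can match inside a
# longer, correctly spelled word.
UNBOUNDED = re.compile(r"([Ss])ignificat(|ly)")
# The same rule anchored with \b on both sides.
BOUNDED = re.compile(r"\b([Ss])ignificat(|ly)\b")


def fix_unbounded(text: str) -> str:
    return UNBOUNDED.sub(r"\1ignificant\2", text)


def fix_bounded(text: str) -> str:
    return BOUNDED.sub(r"\1ignificant\2", text)


print(fix_unbounded("signification"))  # damaged
print(fix_bounded("signification"))    # left alone
print(fix_bounded("significatly"))     # still fixed
```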

Distinguish

I think this is wrong:

replace="$1istinguis$2" />

Shouldn't it be

replace="$1istinguish$2" />

It seems to be changing 'distinguish' to 'distinguis'. Colonies Chris 13:14, 28 April 2007 (UTC)

  • It looks like this has been fixed. --Thiseye 01:26, 30 April 2007 (UTC)

Turks

I think I just found a minor regexp bug while editing this revision of self-loading rifle. The suggested edit for "turks" was "Turks$4" (ie. with the variable in the string). Cheers, -- Seed 2.0 14:51, 1 May 2007 (UTC)

Fixed.--BillFlis 17:46, 1 May 2007 (UTC)
Great. Thank you. -- Seed 2.0 18:22, 2 May 2007 (UTC)

Question

Sorry I don't know much about programming or anything, but I'm guessing that we should copy the codes on the page somewhere on our AWB so that it can fix the mistakes when we're using the program, right? I was wondering how we could do that, like where & how do we copy all the typo codes in the program to "make it work" if you see what I mean... Thanks in advance. Zouavman Le Zouave (Talk to me!) 13:06, 3 May 2007 (UTC)

Nope, just set the option and it will be "on" -- AWB reads it from this article itself. -- JHunterJ 13:12, 3 May 2007 (UTC)

Thanks a lot for such a fast answer! ^^ Zouavman Le Zouave (Talk to me!) 13:14, 3 May 2007 (UTC)

comprised of

AWB changes comprised of to composed of. This is not a typo--one of the meanings of comprised is "to constitute, to make up, to compose", or, pass "to be composed of, to consist of". Could someone explain why AWB is effectively making a word choice change under the guise of fixing a typo? Miss Mondegreen talk  00:42, 4 May 2007

One of the meanings of "comprised" is as you say. "Comprised of" is informal or incorrect though (see wikt:comprise and my Talk page), and should be changed either to "comprising" or "composed of". I added it as "comprising", but received some complaints that that was hard to understand, so I switched it to "composed of". I certainly don't mind switching it back, and will do so now. -- JHunterJ 11:02, 4 May 2007 (UTC)
The day I turn to Wiktionary as a dictionary is the day I...well, I don't know what, but something as drastic as hell freezing and pigs flying but less clichéd.
Considering that we aren't supposed to cite Wikipedia or other wiki sites as references, I'll cite the OED:

8. Of things: a. To take up, fully occupy (a space). Obs. rare.

b. To constitute, make up, compose.

c. pass. To be composed of, to consist of.

"Comprised of" is not incorrect, nor is it listed as informal.
"Comprised of" isn't listed at all in that excerpt. If "comprise" means "to be composed of", "comprised of" therefore means "to be composed of of", which is why it's wrong (or at best informal). Does the OED not have any usage information for "comprise"? Merriam Webster does [3], as do the American Heritage Dictionary [4], Bartleby's [5], and the Random House Word of the Day [6]. -- JHunterJ 12:33, 4 May 2007 (UTC)
There does seem to be a lot of fervor about this usage. Googling not only gets a variety of definitions that do or don't include the usage, it also yields a number of grammar junkies lecturing on it. This may mainly come from the fact that it is a fairly recent form of the word. The first known occurrence comes a century after "comprising" (same meaning), and didn't become well used until halfway through the twentieth century. Regardless, it's both correct, and not listed in the dictionary as being an informal usage. American Heritage alludes to it, but the definition there covers scarcely a third of the usages and meanings of the word that the OED covers. I've removed comprised altogether, and unless I've done so incorrectly or there's an issue with the source I provided, it should stay that way.
I also think it's a bad idea for word changes (changing incorrect word usage) to be made under the guise of typo fixing. Even if the change is absolutely correct, the AWB edit summary reflects a typo change, unless the edit summary is manually changed. I'm also concerned that apparently, definitions and usages for words are being obtained from Wiktionary, or at best, online dictionaries that do not list complete definitions and usages. Is it too much to ask that people actually go and look the word up in a comprehensive dictionary before making an edit that sets in motion changes throughout Wikipedia? Changes that cannot necessarily be easily undone. This seems to me to be the height of irresponsibility. Miss Mondegreen talk  05:00, 4 May 2007 (UTC)
Well, it's certainly turned out to be contentious, so I don't object to its removal, but I did look it up in other dictionaries first (although I don't have ready access to an OED), so please don't cast the move as irresponsible or hasty. I choose to point to the Wiktionary definition first because it is, like this, a Wikimedia project. -- JHunterJ 12:33, 4 May 2007 (UTC)
One thing that I am 100% sure of is that this change was made with the best intentions and with a lot of debate. This rule has been talked about a lot recently. There are 2 sections now on this page, one of which I personally started, one on JHunterJ's talk page, and one on this page. As far as I am concerned, this rule has been pretty well defended. JHunterJ's talk page and the talk page for RegExTypoFix have more complete information to look at. Please make sure to look at these pages. --Maelnuneb (Talk) 16:55, 4 May 2007 (UTC)
I saw the debate on the talk page. And I understand that it was made with good intentions. However, I'm still concerned. Wiktionary, wikipedia, wikimedia projects are NOT acceptable sources per WP:V, WP:RS etc. An issue with the change was raised, and no one really had the answer. Information was taken from Wiktionary, both incomplete and not ok due to policy, and from a seriously incomplete American Heritage entry. I understand not having access to sources, but if you don't have access to information, then don't make changes based on something you can't cite or prove. All the discussions show me is that people fought about something factual without using verified facts; no one pulled out a complete dictionary until I came to the discussion. And that's just irresponsible. If this was important to you, to any of the people who believed in this change, if you honestly thought that it was incorrect and AWB should be fixing it, then you really needed to get access to a dictionary somehow. I find it very hard to believe that none of you could have gone to a library, or at least hunted down a fellow wiki user with access to something like the OED. Look, if someone creates userboxes for access to online databases, I'll be the first person to put them on my userpage and field requests. Half of the articles I do get involved in are because I do have access and I show up to stop a "he said she said" argument about something factual, where all someone needs to do is go and look it up.
But aside from the particulars of this case, I'm concerned in general. Pretend that this is right. Who on earth thinks that replacing a misused word is fixing a typo? Changing a word, misused or no is always going to have subtleties that you can't program into code. The people who use AWB move a mile a minute, and while they check the proposed edits to make sure that they make sense, you're asking them to know a fair amount to catch stuff like this. And I'm betting that when they don't see them not making sense, they just let it go ahead. And that's a problem. Because now you have a machine correcting grammar based on programming by users who don't always use dictionaries when programming that machine, and that's a really bad kind of self-correcting wiki.
Having a mistake in AWB can't really be undone. How many edits get performed with the error before it's caught? How many correct spellings of "dependant" were changed to a different spelling, and therefore different meaning by AWB and had to, or still have to be caught by hand? That's not saying that AWB doesn't serve a good purpose, but if word changes are an even greater liability and if they are going to continue to be considered typos, they should at least be kept in a separate section, so that a closer eye can be kept on them. Miss Mondegreen talk  14:49, 5 May 2007 (UTC)
Please quote the OED passage that uses the phrase "comprised of" as acceptable in other than informal usage before continuing with the idea of the "mistake" done by AWB. I've "really" had an answer for each objection raised so far. -- JHunterJ 22:02, 5 May 2007 (UTC)

[restarting indent] I was not using an OED passage to rebut the American Heritage passage. I was using the OED definition. The OED will list the definition as informal or slang or archaic, if it is in fact informal or slang or archaic, and you can see above that definition 8 was archaic. The definition and usage I was referring to had no such listing--the OED does not list it as any of these things. I'm including the quotes, and spelling and etymology, and everything for the definition and usage that is being discussed here. Then, you'll have everything I have. Miss Mondegreen talk  16:41, 5 May 2007 (UTC)

(kəmˈpraɪz) Also 5-7 compryse, 5 Sc. compris, 7-9 comprize. [f. F. comprendre (pa. pple. and pret. Ind. compris): L. comprendĕre, contr. from comprehendĕre to COMPREHEND. Probably formed by association with emprise, and possibly with enterprise, both of which verbs were derivatives from Eng. ns. of the same form (repr. F. emprise, entreprise, fem. ns. from pa. pple.), but being used as the Eng. reprs. of emprendre, entreprendre, formed a precedent for the analogous representation of other compounds of -prendre by verbs. in -prise: cf. apprise, surprise. (Many of the early passages in which this word occurs are so vague that it is difficult to gather the exact sense.)]

b. To constitute, make up, compose.
1794 G. ADAMS Nat. & Exp. Philos. II. xvi. 238 The wheels and pinions comprizing the wheel-work. 1794 PALEY Evid. I. ix. (1817) 169 The propositions which comprise the several heads of our testimony. 1850 W. S. HARRIS Rudimentary Magnetism iv. 73 These substances which we have termed diamagnetic..and which comprise a very extensive class of bodies. 1907 H. E. SANTEE Anat. Brain & Spinal Cord (1908) iii. 237 The fibres comprising the zonal layer have four sources of origin. 1925 Brit. Jrnl. Radiology XXX. 148 The various fuses etc. comprising the circuit. 1950 M. PEAKE Gormenghast (1968) xiv. 94 Who, by the way, do comprise the Staff these latter days? 1959 Chambers's Encycl. XIII. 653/1 These fibres also comprise the main element in scar tissue. 1969 W. HOOPER in C. S. Lewis Sel. Lit. Ess. p. xix, These essays together with those contained in this volume comprise the total of C. S. Lewis's essays on literature. 1969 N. PERRIN Dr. Bowdler's Legacy (1970) i. 20 As to who comprised this new reading public, Jeffrey..guessed in 1812 that there were 20,000 upper-class readers in Great Britain.

c. pass. To be composed of, to consist of.
1874 Art of Paper-Making ii. 10 Thirds, or Mixed, are comprised of either or both of the above. 1928 Daily Tel. 17 July 10/7 The voluntary boards of management, comprised..of very zealous and able laymen. 1964 E. PALMER tr. Martinet's Elem. Gen. Ling. i. 28 Many of these words are comprised of monemes. 1970 Nature 27 June 1206/2 Internally, the chloroplast is comprised of a system of flattened membrane sacs.

9. The participles are used absolutely: = Including, included (cf. F. y compris); so the gerund.
1653 H. COGAN tr. Pinto's Trav. vii. 21 He had lost above three thousand and five hundred men, not comprising the wounded. 1663 GERBIER Counsel 37 One quarter of the Ionick Column, the Base and Capital comprised. Ibid. 56 Brick-layers will work..the inside for thirty three shillings, arches comprised. 1887 W. G. PALGRAVE Ulysses, Phra Bat, The edifice..is square, about thirty feet in dimension each way, without comprising the outer colonnade.

Hence comˈprised ppl. a., comˈprising vbl. n. and ppl. a.
c1575 SIR J. BALFOUR Practicks (1754) 147 Redemptioun of comprysit landis. Marg. Difference betwix comprysit landis and wodset landis. 1603 FLORIO Montaigne (1634) 295 If he be in himselfe, they are also two, the comprizing and the comprized. 1609 SKENE Reg. Maj. 110 Comprisings of lands. 1691 E. TAYLOR tr. Behmen 316 Which breaketh the comprized Life again. 1879 SIR G. SCOTT Lect. Archit. I. 229 The subdivisions..three or four under one comprising arch.


Other rules

Thanks. It's the "c" definition above that I was looking for. I'm surprised to learn that it doesn't address the usage question that arises in the other sources. And I am happy to have removed the rule that was replacing a form of comprise with a form of compose. Just to be sure, are you objecting to the other rules (is comprised of -> comprises, etc) or no? I'd still like to replace them, even if both are correct according to the OED, under the "Try to find words that are common to all" part of the style guidelines, but if they're also at issue, they should be removed as well. -- JHunterJ 00:59, 6 May 2007 (UTC)
I'm a little surprised too, but I find time and time again when an issue arises that the OED is so much more complete than other sources that I just go back to it. I suspect that "comprised of" is regarded sometimes as informal because it came into existence later--a whole century after comprises. And it's not like there was no other way to say "comprised of"--there were a few other ways to say it just with the word comprised alone, and in this meaning comprised is practically a synonym for composed, so the usage most likely didn't become integrated into the language quickly the way that other usages and words do when there is a need for them to. However, it's not listed as informal by the OED, the only dictionary I've found to actually list all of the definitions and usages, and I read the stuff you linked to, and the way that the issue is written about seems to be of historical note, though I agree--there are always going to be people who prefer one usage over another and enforce that wherever they can.
My issue with the other rules is that I'd prefer not to mess with people's grammar or writing. I assume that you're referring to Wikipedia:Manual of Style#National varieties of English? Maybe I'm being completely dense, but I really fail to see how on earth that applies to this at all. Can you explain? The thing is at this point, with the remaining rules that you're referring to, is that both are correct, in most instances (unlike spelling, I won't say all). But fixing with AWB could potentially fix something that was correct to something that isn't, or something that read nicely to something that sounds really clumsy because of the sentence structure. Each article is written by different people and they're going to have slightly different tones and be written in different fashions and I think that switching wording like that is a bad idea. There is only a certain extent to which you can copy-edit blindly--there is an art to editing, and it can't be done with an automated browser. Miss Mondegreen talk  02:54, May 6 2007
The "national varieties" reads to cover variations in usage national and otherwise, and this seems to fit its description, if not its heading. While I have come across replacements that would have been wrong to use "comprised of" -> "comprising", I haven't yet found any that would be rendered incorrect by the other rules, "is comprised of" -> "comprises", etc., and I don't think there would be any. Could there be? -- JHunterJ 12:04, 6 May 2007 (UTC)
Sure. "Manjung's land area is predominantly comprised of agricultural land" That's actually the change that brought this to my attention. This is why I'm so against fixing grammar automatically--it's hard enough for a human to do. English grammar is complex, obscure, complicated and bizarre--humans have immense difficulty with it. I'm not sure it can be programmed--what absolutes are there? And even then, the programming is dependent on the rest of the article being correct, which is ironic, since it's meant to fix errors. Maybe the minor grammar error that AWB detects and attempts to fix is really a grammar edit elsewhere, but it triggers that phrase that AWB is programmed with. In terms of grammar and word usage, phrases and sentences and paragraphs have to be looked at as ever increasing wholes, until you get to the article as a whole. I just don't think that this is possible. Miss Mondegreen talk  13:05, May 6 2007
That was a change of "comprised of" to "composed of", and would be eliminated by the elimination of the "comprised of" rule. Is there a potential problem with "is comprised of" -> "comprises"? -- JHunterJ 17:26, 6 May 2007 (UTC)
Ooh, sorry, I misread that. Uhh...I'm trying examples in my head. I'm not sure if it makes it incorrect, but there are certainly cases where it makes it clumsy, though I'll admit that the wording I'm using to begin with is clumsy already. For example, "a fruit salad is comprised of apples, oranges and grapes" -- "a fruit salad comprises apples, oranges and grapes" -- "a fruit salad is composed of apples, oranges and grapes".
Now really, I wouldn't use any of these wordings, but composed of and comprised of are best, and comprises is just awful here, though it may be technically correct. But everything I said before, with the wrong example about not wanting to correct grammar with AWB still stands, and it will stand for every instance. English grammar is ridiculously complex and there are so many ifs and ors and buts and we use different spellings and dialects and there are so many variables that I can't see a machine doing this by absolutes, when it is so hard for humans to do this with each individual scenario. Do you really think that AWB can work with grammar the way it does with spelling? Miss Mondegreen talk  21:15, May 6 2007
Well, in that example, "comprises" and "is composed of" are best to my ear. I don't think substituting "comprises" for "is comprised of" reaches the level of grammar fixing, any more than replacing "I ain't" with "I am not" would. It's still just a rote copy edit. (I can go on like this all day, and wouldn't mind the exchange. If you're still not swayed, though, you can edit the list to remove them, or say so here and I'll remove them.) :-) -- JHunterJ 11:08, 7 May 2007 (UTC)
Hmmm, then clearly some people are familiar with some usages, because to me, comprises sounds painful there, even though technically, I know.... I don't think it should be in the list though: since all are technically correct, what you are or are not familiar with is closer to a dialect issue than a grammar issue, and AWB definitely shouldn't correct for that. Could you remove it? I'm sure I could, but it's code I'm really not familiar with and I noticed you fixed my removal last time.
By the way, I was serious about the whole userbox thing before. I don't know if anyone is interested in making them, but if so, let me know. Miss Mondegreen talk  10:40, May 8 2007

Capitalization of state names

I just noticed that we seem to have a rule to %s/georgia/Georgia/gcI but not for other states. I haven't gone through the regexp list but we're at least missing the Carolinas and from the looks of it a few other states. *insert semi-obscure Friends quote about getting 56 states here* ;). -- Seed 2.0 01:35, 5 May 2007 (UTC)

You must mean "state names of the United States of America", whereas the Georgia you found is a state of the former Soviet Union. Since we have that, we don't need to duplicate it in the long-but-incomplete list of Geographical Place Names of the United States.--BillFlis 12:01, 5 May 2007 (UTC)

Mineral, suggestion

miniral -> mineral, came across it the other day. Pax:Vobiscum 22:51, 9 May 2007 (UTC)

Stratagy -> stratey?

Should go to strategy, of course. I don't know regexes well so I can't really fix it myself. —Dark•Shikari[T] 13:51, 10 May 2007 (UTC)

Also directer -> director should be added. —Dark•Shikari[T] 21:20, 10 May 2007 (UTC)

efectiv -> effectiveive

Just a quick heads up. I just noticed that the suggested fix for the 'efectiv' on Silver Nanoparticles was 'effectiveive' and figured that I'd rather just report it than mess with the regexp myself. -- Seed 2.0 10:39, 17 May 2007 (UTC)

out added as a prefix to {{infobox}}

Can someone explain why AWB would have made this change? Miss Mondegreen talk  09:02, May 18 2007

I think that's going to be user error. The cursor starts in the upper left, and he may have not realized that he was typing in the AWB window. Note the edit summaries in this sequence:
  1. 13:59, 12 May 2007 (hist) (diff) One Piece Grand Battle! (Typo fixing, Typos fixed: american → American, english → English, using AWB) (top)
  2. 13:59, 12 May 2007 (hist) (diff) InuYasha the Movie: Fire on the Mystic Island (Typo fixing using AWB)
  3. 13:58, 12 May 2007 (hist) (diff) Yotsuya Kaidan (Typo fixing, Typos fixed: the the → the, using AWB) (top)

If the user enters text manually, he loses the "Typos fixed:" portion of the automatic edit summary. -- JHunterJ 10:55, 18 May 2007 (UTC)

Leftfield

What should be done about regexes that are likely to generate false positives? I mean specifically this one:

<Typo word="(Center/Left/Right) field" find="\b([Cc]enter|[Ll]eft|[Rr]ight)f(?:ie|ei)ld(|ers?)\b" replace="$1 field$2" />

It changes "leftfield" to "left field" which is problematic in case of the Leftfield duo. Jogers (talk) 11:44, 22 May 2007 (UTC)

In the case where the false positive is a proper noun, just remove the relevant capital letter:
<Typo word="(Center/Left/Right) field" find="\b([Cc]enter|left|[Rr]ight)f(?:ie|ei)ld(|ers?)\b" replace="$1 field$2" />
That will remove the false positives and some of the real positives, which can be added back in as a separate rule:
<Typo word="Left field" find="\bLeftf(?:eild|ield(ers?))\b" replace="Left field$1" />
(untested). -- JHunterJ 12:14, 22 May 2007 (UTC)
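
The two rules quoted above can be tried out with a short sketch. It uses Python's re module as a stand-in, on the assumption that AWB's own regex engine behaves the same way for these particular patterns; the function name fix_fields is hypothetical, and the $1/$2 references are rewritten in Python's \1/\2 form.

```python
import re

# Rule 1: only lowercase "left" is accepted, so the proper noun "Leftfield"
# is never touched by the general rule.
GENERAL = re.compile(r"\b([Cc]enter|left|[Rr]ight)f(?:ie|ei)ld(|ers?)\b")

# Rule 2: capitalized "Left..." is fixed only when it is misspelled ("eild")
# or followed by -er/-ers, which the band name is not.
CAPITAL_LEFT = re.compile(r"\bLeftf(?:eild|ield(ers?))\b")


def fix_fields(text: str) -> str:
    """Apply both rules in order (hypothetical helper, not part of AWB)."""
    text = GENERAL.sub(r"\1 field\2", text)
    # When the "eild" branch matches, group 1 did not take part in the match;
    # Python 3.5+ substitutes an empty string for it.
    return CAPITAL_LEFT.sub(r"Left field\1", text)


if __name__ == "__main__":
    for sample in ("Leftfield released an album",
                   "He played leftfield",
                   "Leftfielders and centerfielders"):
        print(sample, "->", fix_fields(sample))
```

Note that, as written, the second rule leaves "Leftfeilders" alone, which fits the "(untested)" caveat above.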

francophone --> Francophone and anglophone --> Anglophone

I was advised by another user that the capitalisation of these words and their derivatives is not used in all variants of English - see WP:CAPITAL#Anglo-_and_similar_prefixes. Therefore I think it would be appropriate to remove / comment out these corrections. Opinions? Rjwilmsi 01:21, 2 June 2007 (UTC)

Just the "-one" section? Yes, I think that would definitely be appropriate. I think commenting out the "-ile" and "-obe" entries would also be appropriate, since they should remain lowercase on Canada-related articles. -- JHunterJ 11:01, 2 June 2007 (UTC)

Problem with "operational" typo fix

My AWB just replaced "opperational" with "operationional" here, so I think the regex could use a second look. TomTheHand 15:29, 4 June 2007 (UTC)

Thanks. I adjusted it. -- JHunterJ 15:34, 4 June 2007 (UTC)
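
The rule that was adjusted is not quoted in this thread, so the following is only a guess at the kind of mistake that produces "operationional": a replacement that hard-codes part of the suffix while a capture group still carries the original suffix. Both patterns below are hypothetical, written for illustration in Python's re syntax, and are not the actual AWB entries.

```python
import re

# HYPOTHETICAL buggy rule: the replacement spells out "ion", but group 2
# already holds "ional", so the suffix is doubled.
BUGGY_FIND, BUGGY_REPLACE = r"\b([Oo])pp?erat(\w*)", r"\1peration\2"

# One possible repair (also hypothetical): match only the misspelled stem
# and pass whatever suffix follows through unchanged.
FIXED_FIND, FIXED_REPLACE = r"\b([Oo])pperat(\w*)", r"\1perat\2"


def apply_rule(find: str, replace: str, text: str) -> str:
    """Apply a single find/replace pair to the text."""
    return re.sub(find, replace, text)


if __name__ == "__main__":
    print(apply_rule(BUGGY_FIND, BUGGY_REPLACE, "opperational"))
    print(apply_rule(FIXED_FIND, FIXED_REPLACE, "opperational"))
```

Running each new rule against a handful of sample words, including correctly spelled ones, before saving it to the list would catch this class of error.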

Duplicated words

I collapsed the duplicated words into one entry. It could be made even more generic:

<Typo word="Duplicated words" find="\b(\w+)\b\s+\1\b" replace="$1" />

but that'll have more false positives. If you want to be careful with it, add it explicitly to your personal Find & Replace section in AWB. -- JHunterJ 00:18, 10 June 2007 (UTC)

I think your elegant rule is a good contribution, but it doesn't work when the first of the duplicated words is capitalized, as at the beginning of a sentence, which the old clumsy rules were able to deal with. I don't see how to handle all those cases in a general rule.--BillFlis 00:55, 10 June 2007 (UTC)
The rule as written just fixed By by -> by here. -- JHunterJ 00:59, 10 June 2007 (UTC)
Of course, that was in the AWB Find & Replace section, not in the Typos, so maybe it behaves differently in the Typo list. -- JHunterJ 01:00, 10 June 2007 (UTC)
Ah, if that's the case, as I see it is, it seems that AWB is using a very non-standard type of regular expressions!--BillFlis 01:49, 10 June 2007 (UTC)

In my experience of using the duplicate words rules so far, if we only correct lowercase entries there are fewer false positives (say hardly any compared to a few), so perhaps it's better than separate rules for each word. I agree that the above generic line is far too broad for inclusion in the typo list (just consider 'had had', 'in in'), but is useful for very careful use by an individual. Rjwilmsi 07:48, 10 June 2007 (UTC)

BTW, I found the case-insensitive solution:
<Typo word="Duplicated words" find="\b(?i:(\w+)\b\s+\1)\b" replace="$1" />
but I'll just leave it here based on Rjwilmsi's note. -- JHunterJ 16:59, 26 June 2007 (UTC)
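
The difference between the two duplicated-word patterns can be sketched as follows. This uses Python's re module, which also supports the scoped (?i:...) form; whether AWB's engine gives identical results in the Typo list versus Find & Replace is exactly what is in doubt above, so treat this as an illustration of the patterns only. The helper name dedupe is hypothetical.

```python
import re

# Case-sensitive form: the backreference must repeat the word exactly.
CASE_SENSITIVE = re.compile(r"\b(\w+)\b\s+\1\b")

# Scoped case-insensitive form: "By by" counts as a duplicate.
CASE_INSENSITIVE = re.compile(r"\b(?i:(\w+)\b\s+\1)\b")


def dedupe(pattern: re.Pattern, text: str) -> str:
    """Replace each duplicated pair with its first word."""
    return pattern.sub(r"\1", text)


if __name__ == "__main__":
    for sample in ("over the the hill", "By by the river", "he had had enough"):
        print(repr(sample),
              "->", repr(dedupe(CASE_SENSITIVE, sample)),
              "/", repr(dedupe(CASE_INSENSITIVE, sample)))
```

The "had had" sample shows the false positive Rjwilmsi mentions: both forms collapse a legitimate repetition.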

Using the ?: part

If you need to use parentheses for grouping but not for capturing, it's a good idea to use the (?:blah|yadda) form. This allows subsequent capturing parentheses to be accessible in order ($1 and $2 instead of $1 and $3). Even if there are not subsequent capturing parentheses in the regexp, it's a good idea because it (a) alerts future readers/maintainers that the group is not used in the replacement and (b) it allows for a future editor to add a trailing capture without having to figure out what number it is -- the next $x number can be assumed. In my opinion; that's how I do it in my non-Wikipedia programming. -- JHunterJ 22:39, 18 June 2007 (UTC)
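
A small sketch of the numbering point, using Python's re module and two made-up patterns (they are not entries from the typo list): with a capturing alternation the suffix is group 3, while with (?:...) it is group 2, so the replacement can count groups in order.

```python
import re

# Hypothetical rule shapes, for illustration only.
CAPTURING = re.compile(r"\b([Ff])(ie|ei)ld(ers?)\b")        # suffix is group 3
NON_CAPTURING = re.compile(r"\b([Ff])(?:ie|ei)ld(ers?)\b")  # suffix is group 2


def fix_capturing(text: str) -> str:
    # Group 2 (the vowel pair) is captured but never used, so the
    # replacement has to skip over it.
    return CAPTURING.sub(r"\1ield\3", text)


def fix_non_capturing(text: str) -> str:
    # Groups are numbered in the order they are used: \1, then \2.
    return NON_CAPTURING.sub(r"\1ield\2", text)


if __name__ == "__main__":
    print(CAPTURING.groups, NON_CAPTURING.groups)
    print(fix_capturing("feilders"), fix_non_capturing("Feilder"))
```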

Febuary ->> February

A typo I usually do, Febuary ->> February

37 Pages have that typo.

-Flubeca (t) 16:31, 23 June 2007 (UTC)

Thanks, we've already got that one listed as a correction. I'll do a search for it later today to correct any articles containing it. Rjwilmsi 16:17, 24 June 2007 (UTC)
Update: corrected two more articles. I ran the correction about a month ago using a Google search and got most of them. We'll need to wait for the Google cache to reparse the pages before a Google search is clean (mainspace articles only). Rjwilmsi 21:03, 24 June 2007 (UTC)

Affluent (false positive)

Affluent should NOT correct to Afluent.

Affluent - being rich and wealthy --Breno talk 14:22, 27 June 2007 (UTC)

Fixed. -- JHunterJ 18:31, 27 June 2007 (UTC)

Intension

I suppose that intension should not be changed to intention. Jogers (talk) 17:32, 1 July 2007 (UTC)

Fixed. -- JHunterJ 19:09, 1 July 2007 (UTC)

Centerfield

Changing "Centerfield" to "Center field" produces false positives. Jogers (talk) 17:44, 1 July 2007 (UTC)

Fixed. -- JHunterJ 19:09, 1 July 2007 (UTC)

Cristian → Christian

Cristian is a given name and place and shouldn't be corrected to Christian. Thanks, mattbr 19:37, 2 July 2007 (UTC)

New Jersey

One more, new jersey should not auto-capitalise.

The soccer player got his new jersey today. --Breno talk 13:18, 3 July 2007 (UTC)

Did you actually come across that in wikipedia? It doesn't sound like a very encyclopedic sentence, and ought to be copy-edited.--BillFlis 13:31, 3 July 2007 (UTC)
Yeah, on Australia national rugby union team. The actual quote is "The new jersey, custom-designed by Canterbury, was also designed in consultation..." I hit save on it without checking the sentence context and someone pulled me up on it. --Breno talk 12:51, 6 July 2007 (UTC)
Tsk, that's even worse! "Custom-designed"? "Was also designed"? I've cleaned it up a bit.--BillFlis 13:17, 6 July 2007 (UTC)
Well it's written in Australian English if that helps. I know this is probably the only article that uses "new jersey" in lowercase. Still, it was a false positive and I got feedback for it, so I thought I'd pass it on. --Breno talk 02:45, 7 July 2007 (UTC)

ablilities

  • Abilites & abilitis -> abilities ( i or e to ie) Harryboyles 07:36, 4 July 2007 (UTC)

Three new ones you might want to consider

  • league, instead of leauge
  • science, instead of sciene
  • wonder, instead of woner

There aren't many (if any) on Wikipedia right now, because I fixed them all by myself before I learned of this wonderful thing known as RegexTypoFix. Before I fixed them though, there were a good number of each.

Alex 22:32, 8 July 2007 (UTC)

I won't do "sciene" -> "Science", could be a typo of "scene" instead.
Same "woner" -> "wonder", could be a typo of "owner"
If I understand how it works :
<Typo word="League" find="\b(L|l)eauge\b" replace="$1eague" />

-FlubecaTalk 21:39, 10 July 2007 (UTC)
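
To see how an entry like the proposed "League" line behaves, here is a minimal sketch that reads the find and replace attributes from the XML and applies them with Python's re module. The helper names load_rule and apply_rule are hypothetical, the $1-to-\g<1> conversion is an assumption about how the replacement syntax maps onto Python, and AWB itself may process the list differently.

```python
import re
import xml.etree.ElementTree as ET

RULE_XML = r'<Typo word="League" find="\b(L|l)eauge\b" replace="$1eague" />'


def load_rule(xml_line: str):
    """Return (compiled find pattern, Python-style replacement string)."""
    attrs = ET.fromstring(xml_line).attrib
    # Rewrite $1-style references as \g<1>, which stays unambiguous in
    # Python even when a digit or letter follows the reference.
    replacement = re.sub(r"\$(\d+)", r"\\g<\1>", attrs["replace"])
    return re.compile(attrs["find"]), replacement


def apply_rule(xml_line: str, text: str) -> str:
    """Apply one typo rule to a piece of text."""
    pattern, replacement = load_rule(xml_line)
    return pattern.sub(replacement, text)


if __name__ == "__main__":
    print(apply_rule(RULE_XML, "a leauge of Leauge teams"))
```

Because of the leading \b, this rule on its own does not touch "colleauge".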

"League" is a subset or special case of "(Col)League" under New Additions.--BillFlis 21:53, 10 July 2007 (UTC)

Request: Nassarawa → Nasarawa

I was directed to Wikipedia:Bot requests for requesting this typo be fixed, and from there I have been sent here. Would it be possible to add the change: "Nassarawa" → "Nasarawa"? See the first line of Nasarawa State for an explanation. Thanks! Picaroon (Talk) 19:02, 15 July 2007 (UTC)

Not overly sure if it should be added to the list... As it won't be that common. Did do a wikisearch, and found 30 odd pages with it on, so I'm just currently using AWB to fix them for you. See Special:Contributions/Reedy Boy Reedy Boy 19:13, 15 July 2007 (UTC)
Thanks. I appreciate it. :-) Picaroon (Talk) 19:14, 15 July 2007 (UTC)
All done. And I moved a picture with a misspelt name. Reedy Boy 19:34, 15 July 2007 (UTC)

pf

"pf" should not automatically correct to "pF". It's a common misspelling of "of". --Breno talk 15:01, 19 July 2007 (UTC)

as well as a firewall and notation for "piano forte".  — gogobera (talk) 20:44, 30 July 2007 (UTC)

capitalization of species' names

In the Binomial nomenclature, the species name is not capitalized. I just caught a change that made a mistake because of it. I can't think of any good way to keep AWB from making this mistake. One way would be to check, whenever capitalizing a word, if the previous word is in a list of genus names. I can't say that this would be a good way, though. Just thought I'd point it out. Thanks. — gogobera (talk) 20:52, 30 July 2007 (UTC)

One solution is to tag the Latin species names as Latin language, that way the English language typo script will ignore it e.g. use {{lang|la|Hyoscyamus niger}}. Rjwilmsi 17:35, 1 August 2007 (UTC)

That seems like the right idea, regardless of AWB issues. Any thought on how to get people doing it? — gogobera (talk) 03:18, 3 August 2007 (UTC)

diferent -> different

diferent -> different :) -- Stwalkerster talk 12:08, 3 August 2007 (UTC)

Added as special case of "(In)Different".--BillFlis 13:03, 3 August 2007 (UTC)

supercede -> supersede

I've arguably seen the prior spelling more often (though both are valid). Wiktionary notates them as alternative but both correct spellings. Is there a policy on this, like there may or may not be for ise/ize?

[The alteration was noted on National Rugby League (2007 Season).]

Agreed, my desktop dictionary as well as Merriam Webster online lists supercede as an accepted variant spelling "since the 17th century". It is probably not a big deal, except I am seeing several articles where the only change is supercede to supersede. In a group of typos there may not be resistance to the change, but changing an article for a single typo which is not a typo may cause friction, given the strong feelings about article content adopted by a number of editors. Perhaps AWB might rethink the change? -- Michael Devore 18:26, 6 August 2007 (UTC)
Interesting. I have an older dead-tree M-W (7th ed.), which has only SUPERSEDE. This online American Heritage Dict. has only SUPERSEDE too.--BillFlis 19:45, 6 August 2007 (UTC)

Typos currently not caught by AWB

I have gone over this talkpage, to check whether any of the suggested typos haven't been implemented. Here is the list of typos which are currently not recognized (together with a google count).

  • likley → likely (297) - 25 corrections. Rjwilmsi 21:32, 12 August 2007 (UTC), added it now Voorlandt 19:19, 16 August 2007 (UTC)
  • signiture → signature (273) - added to list & run through ~30 corrections. Rjwilmsi 21:32, 12 August 2007 (UTC)
  • similarily → similarly (233) - added to list & run through. Rjwilmsi 22:05, 12 August 2007 (UTC)
  • wheter → whether (186) - done Rjwilmsi 17:51, 14 August 2007 (UTC)
  • literaly → literally (149) - added to list & run through. Rjwilmsi 21:32, 12 August 2007 (UTC)
  • orginial → original (109) - added to list & run through. Rjwilmsi 21:49, 12 August 2007 (UTC)
  • posibility → possibility (107) - added to list & run through. Rjwilmsi 21:49, 12 August 2007 (UTC)
  • responed → responded (100) - added to list & run through. Rjwilmsi 17:51, 14 August 2007 (UTC)
  • prepatory → preparatory (99) - added to list & run through. Rjwilmsi 17:26, 15 August 2007 (UTC)
  • mountian → mountain (84) - added to list & run through. Rjwilmsi 17:26, 15 August 2007 (UTC)
  • abilites → abilities (77) - added to list & run through. Rjwilmsi 17:26, 15 August 2007 (UTC)
  • replacment → replacement (72) - run through. Rjwilmsi 17:26, 15 August 2007 (UTC), added it now Voorlandt 19:19, 16 August 2007 (UTC)
  • pricipal → principal (65) - added to list & run through. Rjwilmsi 17:26, 15 August 2007 (UTC)
  • protrayed → portrayed (65) - added to list & run through. Rjwilmsi 21:39, 12 August 2007 (UTC)
  • infinate → infinite (55) - done Rjwilmsi 22:05, 12 August 2007 (UTC)
  • personna → persona (52) - done Rjwilmsi 19:31, 17 August 2007 (UTC)
  • newstands → newsstands (47) - done Rjwilmsi 19:31, 17 August 2007 (UTC)
  • protray → portray (40) - added to list & run through. Rjwilmsi 21:39, 12 August 2007 (UTC)
  • jeapordy → jeopardy (36) none to fix Rjwilmsi 19:31, 17 August 2007 (UTC)
  • nobilty → nobility (31) - done. Rjwilmsi 13:30, 8 September 2007 (UTC)
  • includeing → including (31) - done. Rjwilmsi 13:30, 8 September 2007 (UTC)
  • minsitry → ministry (24) - done & added to list. Rjwilmsi 13:30, 8 September 2007 (UTC)
  • unsheath → unsheathe (23) - done. Rjwilmsi 13:30, 8 September 2007 (UTC)
  • oppenent → opponent (19) - done & added to list. Rjwilmsi 13:30, 8 September 2007 (UTC)
  • wherupon → whereupon (18) - done & added to list. Rjwilmsi 13:30, 8 September 2007 (UTC)
  • precipation → precipitation (18) - done. Rjwilmsi 13:30, 8 September 2007 (UTC)
  • reliquish → relinquish (15) - done. Rjwilmsi 13:30, 8 September 2007 (UTC)
  • valiently → valiantly (10) - done. Rjwilmsi 13:30, 8 September 2007 (UTC)

I might try my luck on regex, otherwise could someone please add the most important ones? If you want to test AWB, this list is also on User:Voorlandt/Sandbox Voorlandt 08:24, 7 August 2007 (UTC)

Thanks for pointing these out, I'll work through them over the next couple of days. You can see how many I fix by looking at my contributions. Thanks Rjwilmsi 21:39, 12 August 2007 (UTC)
Thanks a lot for this, I tried my luck on one (as you can see in the history), but I got discouraged since it didn't work when I ran AWB through my sandbox (it contains this list) and nothing showed up. Now I tried it again, with your additions to the Regex, but it still doesn't detect any of these typos. Could you try it on my sandbox to see if it works on your end? Voorlandt 22:04, 12 August 2007 (UTC)
It works now, maybe it was a cache issue? (AWB still using the old regexes?) Voorlandt 07:22, 13 August 2007 (UTC)

anerobic > anaerobic

This is not a common misspeelin but it annoys me because it MUST be right for the pedantic science types... -- Alan Liefting talk 06:26, 25 August 2007 (UTC)

Seperate -> separate (vs separte)

AWB tried to change it to separte instead of separate --dputig07 20:26, 29 August 2007 (UTC)

Łódź -> Lodz

I know how AWB works, although I rarely use it anymore. But I'm expanding four articles in relation to the TV show Carnivàle where a (major?) character is named Lodz. And each time a wikipedian comes by with AWB, he replaces Lodz with the town name Łódź, which has to be undone by hand in order to not revert the real typo fixes. So I'd like to either suggest removing

<Typo word="Łódź" find="\bLodz\b" replace="Łódź" />

from Wikipedia:AutoWikiBrowser/Typos, or (if it's possible) ask whether Carnivàle, Avatars (Carnivàle), Characters of Carnivàle and List of Carnivàle episodes can be excluded from typo-autofixing (for this word). Thank you. – sgeureka t•c 16:49, 4 September 2007 (UTC)

It's only really a list of pages where no typo fixing should happen at all. The general way to do it is to remove that line from the typo fixing. Reedy Boy 17:07, 4 September 2007 (UTC)
I don't understand (or I'm not sure that I understand correctly). Just remove "<Typo word="Łódź" find="\bLodz\b" replace="Łódź" />" from Wikipedia:AutoWikiBrowser/Typos, or what did you mean? I would prefer if someone else does what needs to be done and just lets me know that the Carnivàle articles will no longer be bothered by "Łódź". :-) – sgeureka t•c 17:48, 4 September 2007 (UTC)

Use XML instead of 'pre' for typo list markup?

I noticed that on the French RETF list, they use <source lang="xml"> and </source> instead of <pre> and </pre> and I think the colour markup looks better, and is maybe helpful for reviewing the regex. Opinions? Rjwilmsi 09:32, 9 September 2007 (UTC)

I agree, and I've changed it as such. Good thinking! Reedy Boy 11:11, 9 September 2007 (UTC)

episiode (episoide) ->episode

current regex doesn't account for these 2 misspellings dputig07 18:26, 16 September 2007 (UTC)

Added.--BillFlis 12:00, 20 September 2007 (UTC)

accessdate and accessdate

Should be accessdate. These 2 words are commonly misspelled when creating references using the 'web cite' template.[7] [8]. Could someone help me go through these. Maybe by using some sort of bot? MahangaTalk 02:33, 20 September 2007 (UTC)

OK, I added a rule here so that AWB will make those corrections.--BillFlis 11:57, 20 September 2007 (UTC)
I'll run through the articles needing correction. Rjwilmsi 17:26, 20 September 2007 (UTC)
Done, 246 pages fixed (mainspace only) Rjwilmsi 20:09, 20 September 2007 (UTC)
That was quick. Thank you! MahangaTalk 02:20, 21 September 2007 (UTC)

march

The changing of 'march' to 'March' is problematic. 'march' is also a verb. ssepp(talk) 00:22, 21 September 2007 (UTC)

Is it? The match uses numbers to try to distinguish the verb from the month. Did it generate a false positive? -- JHunterJ 11:03, 27 September 2007 (UTC)
Ahh, I didn't know that. It did create a false positive, the sentence contained something like 'march 241 kilometers.' ssepp(talk) 22:39, 27 September 2007 (UTC)
I'll narrow the match a bit further... -- JHunterJ 22:45, 27 September 2007 (UTC)

Referer

Perhaps referer->referrer should be removed per HTTP referer: Referer is a common misspelling of the word referrer. It is so common, in fact, that it made it into the official specification of HTTP – the communication protocol of the World Wide Web – and has therefore become the standard industry spelling when discussing HTTP referers. ssepp(talk) 00:25, 23 September 2007 (UTC)

I think the correction should stay as 'referer' is a typo in all other situations - Merriam Webster doesn't list it. Rjwilmsi 17:24, 26 September 2007 (UTC)
Removed the "referer" match. I don't think a regexp to determine which situation we're in is likely. -- JHunterJ 11:03, 27 September 2007 (UTC)
Okay, but 'referer' is still 'corrected' by the "(Re/De/In/Trans/Con/Pre)ferred" rule. Rjwilmsi 17:52, 28 September 2007 (UTC)
Aha. Fixed too. -- JHunterJ 21:12, 28 September 2007 (UTC)

Catalog(u)ing

Interesting fact: Cataloging ('incorrect') gets 10 million google hits, while cataloguing ('correct') gets 4 million google hits. Do we still consider it a spelling error if it has this widespread usage? ssepp(talk) 16:12, 26 September 2007 (UTC)

While Merriam Webster accepts 'cataloged' and 'catalogued', only 'dialogued' is accepted, so I vote we leave the correction as it is, since it correctly fixes other variants. Rjwilmsi 17:21, 26 September 2007 (UTC)
Fixed by splitting, so other variants will be handled correctly still. (Note there is no voting -- false positives are false positives and are to be removed regardless.) -- JHunterJ 11:03, 27 September 2007 (UTC)

useable -> usable

I'm not comfortable enough to change this. Thanks Yngvarr (t) (c) 20:23, 27 September 2007 (UTC)

[9] - Added. Reedy Boy 20:45, 27 September 2007 (UTC)
I removed it though. M-W.com lists "useable" as a variant spelling. -- JHunterJ 21:39, 27 September 2007 (UTC)

How to use

Might be stupid but how to use Regex? Is there something to do in AWB? Thanks! --Bombastus 19:55, 3 October 2007 (UTC)

In the 'set options' menu in the bottom panel, check "enable regextypofix". ssepp(talk) 21:13, 4 October 2007 (UTC)

Vigourously?

It currently changes: vigourously → vigorously I'm from the U.S., so I'm not sure, but isn't this the British spelling? Rocket000 22:54, 5 October 2007 (UTC)

Does it? Sample diff? The "vigorous" entry doesn't seem to make that sub, but perhaps one of the other rules does. -- JHunterJ 23:36, 5 October 2007 (UTC)
Well, I ran into this but I didn't save the changes, so there's no diff. I can produce one if you want. Rocket000 23:42, 5 October 2007 (UTC)
Now I ran into just "vigourous" and it wanted to change it. Rocket000 09:23, 7 October 2007 (UTC)
Found it in the new additions. I removed it. Thanks! -- JHunterJ 11:16, 7 October 2007 (UTC)

rarified → rarefied

Sample diff Isn't this an acceptable variant? [10] Rocket000 23:51, 5 October 2007 (UTC)

I removed the entry based on that. Thanks! -- JHunterJ 02:47, 6 October 2007 (UTC)

Abysinnian

can something like this be added?

Abysinnian → Abyssinian

Thanks. Rocket000 09:55, 7 October 2007 (UTC)

Added. -- JHunterJ 11:13, 7 October 2007 (UTC)
Thanks, man! Rocket000 13:00, 7 October 2007 (UTC)

comfirmed → confirmed

Example. Can someone please add this, thanks. --Closedmouth 04:37, 9 October 2007 (UTC)

Added "Conf(i/o)rm".--BillFlis 10:25, 9 October 2007 (UTC)

Proffesor → Professor

and Proffesor → Professor, Profesor → Professor. Please add these misspellings. Tirkfl 11:50, 12 October 2007 (UTC)

There seem to be some legitimate occurrences with one S, e.g., El Profesor Hippie.--BillFlis 18:23, 12 October 2007 (UTC)

Cristian → Christian (again)

Please can Cristian be removed as a typo for Christian, as there are numerous false positives: it is a given name and there are also places of this name. See the prefix search for pages beginning with Cristian. I would remove it myself, but the regex it appears to stem from looks complicated and makes many other (apparently valid) corrections. Thanks, mattbr 13:59, 13 October 2007 (UTC)

I removed this.--BillFlis 12:22, 15 October 2007 (UTC)
Thank you, mattbr 19:48, 15 October 2007 (UTC)

-fuly -> -fully

I did a dictionary search and couldn't find any words that end in "fuly". How about adding a general rule something like this? find="fuly\b" replace="fully"--Thiseye 19:28, 14 October 2007 (UTC)

I added this rule. "Usefully" is now just a special case.--BillFlis 12:23, 15 October 2007 (UTC)
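A minimal sketch of the general rule as proposed above, written in Python rather than AWB's own regex engine (an assumption made purely for illustration; the name `fix_fuly` is hypothetical). It shows the rule firing only when "fuly" ends a word:

```python
import re

# The rule as proposed in this thread: find="fuly\b" replace="fully".
FULY_RULE = re.compile(r"fuly\b")

def fix_fuly(text: str) -> str:
    """Rewrite a word-final 'fuly' as 'fully'; leave everything else alone."""
    return FULY_RULE.sub("fully", text)

print(fix_fuly("He carefuly and usefuly explained it."))
```

Because the pattern has no leading word boundary, it covers every "-fuly" word at once, which is why a separate "usefully" entry becomes a special case.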

Illegible superscripts.

Would it be possible to delete the automatic reformatting of superscripts, for example the replacement of 3 by a tiny illegible 3. See for example the superscript in the AWB edit of Lanthanide contraction at 14:30 27 October 2007, which I have just reverted manually to 3. Yes, the source code is now longer, but the article is legible which is more important. Another alternative would be to have the AWB change to a much larger 3.

And similarly for 2 which I have seen replaced by a tiny illegible 2. Dirac66 20:07, 27 October 2007 (UTC)

I can't seem to find the rule here that you're talking about. As I understand it, AWB users can also write their own personal editing rules, which they then apply not entirely automatically, as they have to okay every change that AWB makes. You might want to complain to the AWB user who made the edits in question. But I'm curious: how did you know it was a "3" if it was illegible?--BillFlis 09:26, 28 October 2007 (UTC)

Thanks. I haven't used AWB myself so I thought the edit was due to application of an automatic rule. Since you find no such rule, I'll leave a note on the talk page of the user who made the edit. As for identifying the illegible "3", the previous text (before the edit) had a clearly legible 3 superscript so I assumed the editor must have inserted a 3 also, and I also checked by increasing my screen text size to maximum so that the 3 became (barely) legible. However readers who are not editors should be able to read the article without either checking previous edits or increasing screen text size. Dirac66 17:59, 28 October 2007 (UTC)

See - Wikipedia_talk:AutoWikiBrowser#Superscript Reedy Boy 21:15, 28 October 2007 (UTC)

CalTrans → Caltrans; also Caltrain

I see a lot of this, not just in Wikipedia, but on public agency websites that work with Caltrans. The confusion may also be due to the logo being just a "c" and a "t". The California Department of Transportation writes its abbreviated name with a lowercase "t".

In the wake of the frenetic 1960s, the 1970s were a time of austerity. The then-current political philosophy urged alternatives to highway building, a trend that would continue into the 1980s. Such thinking led to a new name for the department, Caltrans, short for the California Department of Transportation. The name change was emblematic of new thinking, and a rise in the concept that while highways have long been vital to the state, other forms of transportation were emerging to complement roadways.

On a different note, there may also be confusion between Caltrans and Caltrain. Oh, and another note: CalTrain was actually an official old name for Caltrain. --Geopgeop (T) 11:55, 30 October 2007 (UTC)

critcism → criticism

A while back, I fixed a case where criticism was spelled with the 2nd I omitted (critcism). I just searched Wikipedia, and I found at least 8 other non-talk pages that appear to still be uncorrected.

  <Typo word="Criticism" find="\b(C|c)ritisi[sz]?(ms?|e[ds]?|ing)\b" replace="$1riticis$2" />

It appears the current rule for criticism does not correct this spelling error, so I'd like to suggest that this be added to the list. --Smiller933 21:00, 31 October 2007 (UTC)

BillFlis kindly added this correction last week and I've fixed all mainspace articles with this error. Thanks Rjwilmsi 21:57, 11 November 2007 (UTC)
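A hedged sketch of why the rule quoted in this thread misses "critcism". The first pattern is the quoted rule translated to Python syntax; the second is a hypothetical extra rule for the dropped "i" (an assumption for illustration, not necessarily the wording that was added to the list). `apply_rule` is likewise a made-up helper:

```python
import re

# Rule quoted in the request above, with $1/$2 rewritten as \1/\2.
QUOTED = (re.compile(r"\b(C|c)ritisi[sz]?(ms?|e[ds]?|ing)\b"), r"\1riticis\2")

# Hypothetical rule for the missing second "i" (illustration only).
MISSING_I = (re.compile(r"\b(C|c)ritcis(ms?|e[ds]?|ing)\b"), r"\1riticis\2")

def apply_rule(rule, text):
    """Apply one (compiled pattern, replacement) pair to the text."""
    pattern, replacement = rule
    return pattern.sub(replacement, text)

print(apply_rule(QUOTED, "critisism"))    # handled by the quoted rule
print(apply_rule(QUOTED, "critcism"))     # not matched by the quoted rule
print(apply_rule(MISSING_I, "critcism"))  # handled by the extra rule
```

The quoted rule requires the letters "ritisi", so it only covers the s-for-c substitution; a dropped letter needs its own alternative.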

payed → paid

Payed is an obsolete spelling. In non-quote situations "paid" should be used. Mbisanz 15:54, 8 November 2007 (UTC)

Wiktionary agrees that 'payed' is obsolete as you say, but to include this correction would introduce a lot of false positives, so I don't think we should do it. Thanks Rjwilmsi 22:02, 11 November 2007 (UTC)
My American Heritage Dictionary says that "payed" is an acceptable spelling (not obsolete) for the past of "paying" out a line (rope). Also, here's a link at Merriam Webster that says "payed" is OK.--BillFlis 00:18, 12 November 2007 (UTC)
Here is a partial OED extract: "Past tense and past participle paid, (chiefly in nautical senses) payed." So unless it's being used in a nautical sense, paid would be the appropriate usage. Maybe this is something better done by hand rather than built into the spellchecker. Mbisanz 22:16, 13 November 2007 (UTC)

A few suggestions for inclusion

Here are some misspellings I ran across that AWB missed:

  • mimicing → mimicking
  • catholic → Catholic
  • anglophone → Anglophone
  • parmesan → Parmesan

Sorry I don't trust my regex skills. -Rocket000 00:47, 12 November 2007 (UTC)

I say no to anglophone and catholic, Merriam-Webster accepts them as both lower and uppercase - anglophone and catholic. I've added the other two. Thanks Rjwilmsi 07:55, 12 November 2007 (UTC)

Why doesn't this work?

Anyone have an idea why this doesn't work?

<Typo word="Triplets" find="([aeiou])([bdfgklmnprstvz])\2\2+(ed|[eo]rs?|ings?)\b" replace="$1$2$2$3" />

It's supposed to fix triple letter errors like "lettter" and "errrors" but it seems to match nothing. Is the \2 backreference feature not supported? It works fine if I plug it into the standard "Find and replace" of AWB. —Wknight94 (talk) 05:12, 13 November 2007 (UTC)

Maybe the backslash is getting "eaten"? Try \\2\\2 there. -- JHunterJ 00:26, 14 November 2007 (UTC)
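As a sanity check on the pattern itself, here is a minimal sketch in Python (not AWB's engine; `fix_triplets` is a hypothetical name, and the `$1`-style replacement tokens are rewritten as Python's `\1` style). In a standard backtracking engine the expression behaves as intended, which suggests the problem lies in how the typo list is loaded rather than in the regex:

```python
import re

# The "Triplets" rule from this thread, unchanged apart from replacement syntax.
TRIPLETS = re.compile(r"([aeiou])([bdfgklmnprstvz])\2\2+(ed|[eo]rs?|ings?)\b")

def fix_triplets(text: str) -> str:
    """Collapse a tripled (or longer) consonant run to a doubled one."""
    return TRIPLETS.sub(r"\1\2\2\3", text)

print(fix_triplets("lettter errrors runnning"))
```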

Some to be added

[Cc]incinati → Cincinnati
cincinnati → Cincinnati
[Cc]inncinati → Cincinnati

Thanks,   jj137 (Talk) 02:06, 22 November 2007 (UTC)

British

Could someone look over the British entries. I got a weird error that resulted in a spelling of Britiish being entered from what I think was Brititish. Mbisanz (talk) 06:57, 25 November 2007 (UTC)

I looked over "Britain" and "British", and they seem correct. Are you sure it was one of these? Or could it have been some other rule?--BillFlis (talk) 12:12, 25 November 2007 (UTC)

Enmore

Please stop Enmore → Emmore. As a locality the spelling is correct, eg Enmore, New South Wales. Many thanks. --Breno talk 13:34, 27 November 2007 (UTC)

AWB's RegexTypoFix on other wiki(pedia)s

Is it possible to make something similar to use on another wiki(pedia)? And to choose language in AWB? 20:54, 28 November 2007 (UTC) - pl:user:Matma Rex

Yep, follow the format of the page. Create it at [[11]] Reedy Boy 21:25, 28 November 2007 (UTC)
Page created and user notified! Reedy Boy 23:39, 28 November 2007 (UTC)

AWB Typo Profiling

Hi Guys, Just a heads up, and a pointer for some reworking for you - Wikipedia:AutoWikiBrowser/Typos/Profiling

MaxSem has added a typo profiler to AWB in debug.

The time, on the left hand side, in milliseconds, is the time for the runs over the page text, and therefore the time taken to "run" the typo fixing...

If people could work on reducing some of the larger times, it'll help speed up AWB's operation, and the initial page processing time.

MaxSem would probably be able to answer any more in-detail questions....

Reedy Boy 16:23, 4 December 2007 (UTC)

Interesting and useful. I have scanned a recent database dump for instances of the most time consuming check which is "(A/Air/In/...)field". I found only four instances which I have since fixed. I propose to simplify the search, removing some of the more obscure cases for the sake of efficiency. Gaius Cornelius (talk) 09:11, 8 December 2007 (UTC)
That rule has been around for a while, which is surely why you found only four instances left. If we divide that long rule into several shorter rules, they'll each probably only take as much time as other, shorter rules. But is that what we really want to do? Anyway, I went ahead and shortened the rule and eliminated a few rare words (proper names).--BillFlis (talk) 14:31, 8 December 2007 (UTC)
Well the output in general puzzles me. Why is the "field" expression more than a hundred times longer than [36, \blatin(|[ao]s?|ate|is[mt]s?|i[sz](e[sd]?|ing))\b > Latin$1]? Are the (aaa|bbb) constructs really so expensive? If so, we need to rethink the overall approach it seems. —Wknight94 (talk) 16:12, 8 December 2007 (UTC)
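For anyone wanting to compare rule costs outside AWB, here is a rough per-rule timing sketch in Python. It is not MaxSem's profiler and says nothing about the .NET engine's behaviour; `profile_rules`, the sample rules and the sample text are all assumptions for illustration:

```python
import re
import time

def profile_rules(rules, text, repeats=5):
    """Return (seconds, name) pairs, slowest rule first.

    `rules` maps a rule name to a regex string; each timing is the best of
    `repeats` full scans of `text`.
    """
    timings = []
    for name, pattern in rules.items():
        compiled = re.compile(pattern)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            compiled.findall(text)
            best = min(best, time.perf_counter() - start)
        timings.append((best, name))
    return sorted(timings, reverse=True)

# Hypothetical sample rules and text, for illustration only.
SAMPLE_RULES = {
    "Latin": r"\blatin(|[ao]s?|ate|is[mt]s?|i[sz](e[sd]?|ing))\b",
    "-fuly": r"fuly\b",
}
report = profile_rules(SAMPLE_RULES, "the latin text was carefuly latinised " * 200)
for seconds, name in report:
    print(f"{seconds * 1000:.3f} ms  {name}")
```

Taking the best of several runs reduces noise from other processes, which matters when the per-rule times are only a few milliseconds.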

Misspellings in actual quotations shouldn't be corrected

I'm not sure if this is the proper place to ask about this, but I've noticed that typos have been corrected with AWB on the Mitchell Map, but the "typos" are direct quotes from the map, odd spellings and all. It seems like a case where one wants the words spelled "wrong" and not fixed. So I wonder if there is an easy way to mark text like this so that it doesn't get "fixed"? Thanks. Pfly (talk) 23:28, 8 December 2007 (UTC)

You could cheat by using tags like {{lang|fr|the word}} to identify the text as not modern English - see List_of_ISO_639-2_codes, or add a [sic] (even as a comment) to remind users, or simply remind the user who made the edit to take a little more care. Thanks Rjwilmsi (talk) 19:49, 12 December 2007 (UTC)
And that works because things inside templates are not spell-checked, right? I'd like to get away from that though and have only things marked specifically as {{do not spellcheck}} not be spell-checked. I find myself also putting regexes into the find-and-replace section so templated areas are caught as well. —Wknight94 (talk) 20:08, 12 December 2007 (UTC)

Cret

In the article Pennine Alps, it flags "Cret" as something that should be changed to "Correct". Isn't this a stretch? Mbisanz (talk) 19:50, 9 December 2007 (UTC)

Should be fixed now. —Wknight94 (talk) 20:11, 9 December 2007 (UTC)

Question

Is there a way to have certain articles excluded from specific spelling corrections? Earlier today, an AWB user edited the article on Robert Cliche to correct Cliche to "Cliché", per the inclusion of that here as a common spelling error. However, the politician's surname was definitively Cliche (cleesh) rather than "Cliché", so I had to revert it. Is there an exclusions list that I can have this article added to, or a code I can insert into the article to flag the typo bot to skip this article when looking for "cliche" → "cliché" corrections? Bearcat (talk) 01:55, 10 December 2007 (UTC)

Yo. Anybody home? Another alternative, if possible, would be to have the typo bot skip "cliche → cliché" if the article contains the phrases Robert Cliche, Cliche Commission or Robert-Cliche Regional County Municipality. Bearcat (talk) 23:39, 11 December 2007 (UTC)
Wikipedia talk:AutoWikiBrowser/Dev may be a better venue to bring this up. It seems we need some support for excluding certain words from spell checking. Perhaps by wrapping such words in some template? —Wknight94 (talk) 12:12, 12 December 2007 (UTC)

unspayed

"unspayed" is being changed to "unpaid" Mbisanz (talk) 04:01, 10 December 2007 (UTC)

I'm not seeing that. Example? —Wknight94 (talk) 12:30, 10 December 2007 (UTC)
And I've forgotten the article, I may be back if I find it. Mbisanz (talk) 09:42, 12 December 2007 (UTC)

Steps

AWB is suggesting that words in the form step-x be corrected to stepx as in step-son becoming stepson. Another user mentioned that this really isn't a misspelling or even a poor usage of the word. Could we pull it out of the regex? Mbisanz (talk) 09:42, 12 December 2007 (UTC)

I'd support removing it as a legitimate variant spelling. However, I would keep grand-father→grandfather etc in. Inconsistent I know... iridescent 00:13, 15 December 2007 (UTC)
I need to weigh in on the step issue and thank Iridescent for leading me here. All my life my step-bro was my step (hyphen) brother, not stepbrother. I agree the grandX should remain granddaughter, grandmother, et al, but the stepX/step-X should not be part of AWB "bad words" and I'm glad to see others have noticed. KellyAna (talk) 04:48, 15 December 2007 (UTC)

referenses -> references

Can someone do it? I am having difficulty correcting the "Refer" entry myself. -- Magioladitis (talk) 14:20, 14 December 2007 (UTC)

Additionally, it is now catching the correct versions, i.e., it will try to turn "reference" into "reference" which is a waste of time. We try to avoid that. I'll look into catching your new case. —Wknight94 (talk) 14:42, 14 December 2007 (UTC)

destroied -> destroyed

In this diff here [12] , I had to manually correct AWB's output to correct the error. Can this be built into regex? Mbisanz (talk) 07:08, 17 December 2007 (UTC)

I added a rule. —Wknight94 (talk) 10:01, 17 December 2007 (UTC)

disicplined -> dissicplined?

AWB recommended this change on this page. It looks like the same change was made on this diff. Both appear to be a typo of disciplined. KathrynLybarger (talk) 04:00, 21 December 2007 (UTC)

I expanded the Discipline rule to catch this. Since it wasn't being covered there, it was falling through to the Diss- beginning rule. —Wknight94 (talk) 04:20, 21 December 2007 (UTC)

appriciated → apppreciated

The RegexTypoFix suggests the fix: appriciated → apppreciated on Jungle Run. Could someone correct the regex? BTW, I'd like the RegexTypoFix to be able to fix loosly → loosely and makeing → making too. Thanks, Warut (talk) 11:41, 21 December 2007 (UTC)

Part 1 complete (fixed appreciated)... —Wknight94 (talk) 12:51, 21 December 2007 (UTC)
I did Part 2 as well. —Wknight94 (talk) 16:00, 26 December 2007 (UTC)
Thanks a lot! :) Warut (talk) 21:49, 26 December 2007 (UTC)

uber →

I am seeing this stolen prefix used frequently without the umlaut. When transLITERATED into English, the German "ü" must be transcribed to "ue".

Transliterations should follow accepted rules such as those established by the M.L.A., Chicago, etc., regardless of a user's knowledge of German (in this case), transliteration, or the history of a word's origins.

I suggest using the ü, as opposed to the ue transliteration, which I think might further confuse the issue to those unfamiliar with transliterating rules. 76.180.174.222 (talk) 13:23, 26 December 2007 (UTC)

In American English at least, the prefixes uber- and über- are both acceptable. I think ueber- would confuse many readers. (I have never seen the latter used in English.) Of course, if German language is being quoted, the umlaut should just be used here; no transliteration is necessary. And nobody has "stolen" anything, English merely borrowed it; AFAIK, Germans still use "über" and "über-"!--BillFlis (talk) 13:52, 26 December 2007 (UTC)

More requests

  • (in)definately → (in)definitely
  • fianlly → finally
  • Lousiana → Louisiana

Thanks, Warut (talk) 11:31, 28 December 2007 (UTC)

Definately was already done. I added Fianlly and Lousiana. —Wknight94 (talk) 12:23, 28 December 2007 (UTC)
Thanks for your quick addition, Wknight94. I don't know definitely why I included definately in the list. :) Warut (talk) 17:52, 28 December 2007 (UTC)
Now I know why I asked for indefinately: AWB cannot detect indefinately in List of General Hospital characters. But I don't understand why it can't. Warut (talk) 18:09, 28 December 2007 (UTC)
AWB is a bit odd in what it will and won't fix. It won't fix typos inside links - internal or external. Maybe it won't fix yours because it's indented. You can test it out by adding it into the find-and-replace portion of AWB (and check regular expressions on). If it changes it after you do that, then it is in a section that is being excluded for some reason. "Indefinately" is definitely :) in the typo list because I tried it this morning. —Wknight94 (talk) 19:22, 28 December 2007 (UTC)
Your explanation and suggestion are much appreciated. I've finally fixed that typo with find-and-replace without further ado with the reason why RegExTypoFix couldn't. Cheers, Warut (talk) 12:14, 29 December 2007 (UTC)

"Musicial" to "musical"

I suggest changing all uses of "musicial" to "musical" -- I found 17 such misspellings on Wikipedia. Thanks, --Skb8721 (talk) 23:14, 3 January 2008 (UTC)

Actually, it now looks like there are over 100 such misspellings. --Skb8721 (talk) 21:12, 4 January 2008 (UTC)
I added the typo, fixed 46 articles, deleted one article, and nominated a category for speedy rename. —Wknight94 (talk) 05:10, 5 January 2008 (UTC)
Great, thanks for doing this! --Skb8721 (talk) 17:13, 6 January 2008 (UTC)

"march" to "March"

Per this diff here [13] the word march as in a parade is being picked up as the month. Can this be fixed. MBisanz talk 21:57, 9 January 2008 (UTC)

I removed the offending match entirely. —Wknight94 (talk) 02:47, 10 January 2008 (UTC)

Teh is very difficult to fix

Many people and things have the word "teh" in them; it appears both uppercase and lowercase. Is there a way to filter out these "legitimate" uses of "teh"? -- King of ♠ 06:07, 14 January 2008 (UTC)

Yeah, it's a problem, but when I've searched Wikipedia via Google, there are many valid chemical uses of teh, plus the whole article on teh as a different spelling. I don't think it's possible. On the other hand, you could write a custom replacement for teh->the for individual use. MBisanz talk 19:23, 14 January 2008 (UTC)

Critised

Sorry I can't add these myself. I'm not good with regexes and wouldn't want to screw things up.

Currently: critised → criticized

Could it instead go to the Commonwealth English "criticised" rather than the US English spelling? Websters 1996 [14] —Preceding unsigned comment added by Breno (talkcontribs) 07:22, 15 January 2008 (UTC)

Arctic -> arctic

Perhaps arctic should not be changed to Arctic, since arctic as an adjective is lowercase according to wikt:arctic. Arthena(talk) 18:26, 16 January 2008 (UTC)

correspondance -> correspondence

The rule correspondance -> correspondence gives false positives because correspondance is French for correspondence, and the French word is used in many articles. A search for correspondance [15] shows many legitimate uses of the word. Arthena(talk) 18:26, 16 January 2008 (UTC)

Typo lists regarding American and British English

American and British English spelling differences#Compounds and hyphens should probably be cross referenced on our typo list. I recently noticed (and a fellow editor also pointed out) that words like extracurricular are being marked as wrong if spelled extra-curricular, which is correct in British English (and, by extension, in Australia, Hong Kong, India, and sometimes the Philippines). Is there a way we can do this? - Jameson L. Tai talkcontribs 05:10, 19 January 2008 (UTC)

Would be better placed here. The "typo moderators" are more likely to see it now! Reedy Boy 10:31, 19 January 2008 (UTC)
Thanks for redirecting me here.  :-) - Jameson L. Tai talkcontribs 08:01, 20 January 2008 (UTC)

Louisianian → Louisianan

I've noticed that RegexTypoFix always changes Louisianian → Louisianan. However, both Louisianian and Louisianan are valid according to the list of U.S. state residents names. So this may need a fix. Thanks, Warut (talk) 11:58, 20 January 2008 (UTC)

What's more, this dictionary lists both spellings.--BillFlis (talk) 16:28, 21 January 2008 (UTC)

The negative lookbehinds used in the regex lists

We've got four instances of the use of negative lookbehinds (a ?<! to exclude a string) in the regex typo list. wikEd now uses the list directly, and these can't be supported by wikEd as JavaScript, which wikEd uses, apparently doesn't support lookbehinds. I've tried and failed to find replacement regex for these, can anybody else come up with one? Cacycle has commented the four occurrences with // "invalid quantifier" JS error:. I'll ask him/her for help too. Thanks Rjwilmsi 21:56, 7 September 2007 (UTC)

There's no regular expression equivalent for zero-width look-behind assertions. Commenting them out means that AWB can't use them either, it appears. Can we uncomment them now to restore functionality to AWB? Or invoke some alternate "tagging" so that the wikEd program can recognize them and ignore them, or AWB can recognize and include them? -- JHunterJ (talk) 00:48, 5 February 2008 (UTC)
wikEd can actually handle it, it has an error detection and just skips erroneous RegExps. I have uncommented the lines in question. Сасусlе 04:09, 5 February 2008 (UTC)
Thanks! -- JHunterJ (talk) 12:08, 5 February 2008 (UTC)
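The "skip erroneous RegExps" approach described above can be sketched as follows. This is Python, not wikEd's JavaScript, and `compile_usable` and the sample rules are hypothetical. Python accepts fixed-width lookbehinds but rejects variable-width ones, which stands in here for an engine with no lookbehind support at all:

```python
import re

def compile_usable(rules):
    """Compile each (find, replace) pair, skipping patterns the engine rejects."""
    usable, skipped = [], []
    for find, replace in rules:
        try:
            usable.append((re.compile(find), replace))
        except re.error:
            # The engine cannot parse this rule; remember it and carry on.
            skipped.append(find)
    return usable, skipped

# Hypothetical rules: the second uses a variable-width lookbehind, which
# Python's re module refuses to compile.
RULES = [
    (r"fuly\b", "fully"),
    (r"(?<!x+)fuly\b", "fully"),
]
usable, skipped = compile_usable(RULES)
print(len(usable), "usable;", len(skipped), "skipped")
```

With this design one shared list can serve several engines: each consumer keeps whatever it can compile and ignores the rest, so nothing needs to be commented out.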

+ emplyed -> employed

Can someone please add this? Example. Thanks. --Closedmouth (talk) 08:13, 28 January 2008 (UTC)

Done. —Wknight94 (talk) 05:04, 29 January 2008 (UTC)

runnning -> rngnning

This edit is a rather odd mistake that I didn't notice at the time. I'm not good enough at reading regex to figure out what caused it, could someone look into it?--Dycedarg ж 21:18, 28 January 2008 (UTC)

Hmmm, I've seen something similar with a prior version of AWB. Has to do with all regexes being auto surrounded by parentheses in the code. Backreference numbering gets screwed up. I changed it to use named backreferences and now things appear to work. Sorry for the confusion. —Wknight94 (talk) 21:58, 28 January 2008 (UTC)
Thank you!--Dycedarg ж 03:32, 29 January 2008 (UTC)
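The numbering problem described above can be reproduced in miniature. The sketch below is Python rather than AWB's code, `wrap` is a hypothetical stand-in for whatever surrounds each rule with parentheses, and it does not reproduce the exact "rngnning" garbling; it only shows that an extra outer group shifts every numbered backreference while named groups are unaffected:

```python
import re

NUMBERED = r"([aeiou])([bdfgklmnprstvz])\2\2+(ed|[eo]rs?|ings?)\b"
NAMED = (r"(?P<v>[aeiou])(?P<c>[bdfgklmnprstvz])(?P=c)(?P=c)+"
         r"(?P<end>ed|[eo]rs?|ings?)\b")

def wrap(pattern: str) -> str:
    """Mimic a loader that surrounds every rule with one extra group."""
    return "(" + pattern + ")"

# Unwrapped, \2 means the consonant, so the rule works as written.
plain = re.sub(NUMBERED, r"\1\2\2\3", "runnning")

# Wrapped, every number shifts by one: \2 now means the vowel, so the
# pattern no longer matches "runnning" at all.
shifted = re.sub(wrap(NUMBERED), r"\1\2\2\3", "runnning")

# Named groups keep their meaning whether or not the rule is wrapped.
named = re.sub(wrap(NAMED), r"\g<v>\g<c>\g<c>\g<end>", "runnning")

print(plain, shifted, named)
```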

summery -> summary

The summary fix is incorrectly changing summery to summary. I don't see easily how to change it without dropping the fix for 'sumary'. Ideas? Thanks Rjwilmsi (talk) 22:58, 4 February 2008 (UTC)

I may be misunderstanding your concern... how about this? —Wknight94 (talk) 23:04, 4 February 2008 (UTC)
Hmm, my point was that that regex now misses words matching [Ss]ummer(i[sz](ation|e[ds]?|ing)) (i.e. all endings except just y to make summery) which we would want to correct to summar$1 etc. Ideas? Rjwilmsi (talk) 19:59, 6 February 2008 (UTC)
Actually I think the followup edit to mine accomplished more of what you had in mind. Catches all summerxxx words except summery. (Or I assume it does - I didn't actually try it). —Wknight94 (talk) 20:10, 6 February 2008 (UTC)
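One way to express the behaviour discussed above, as a hedged Python sketch: a hypothetical pair of rules (not the wording in the typo list) that doubles the m in "sumary" words and corrects "summeris-"/"summeriz-" forms, while never touching plain "summery":

```python
import re

# Hypothetical rules for illustration; not the entries in the typo list.
RULES = [
    # "sumary", "sumaries", "sumarise", ... -> double the m
    (re.compile(r"\b([Ss])umar(y|ies|i[sz](?:ation|e[ds]?|ing))\b"), r"\1ummar\2"),
    # "summerise", "summerization", ... but never plain "summery"
    (re.compile(r"\b([Ss])ummer(i[sz](?:ation|e[ds]?|ing))\b"), r"\1ummar\2"),
]

def fix_summary(text: str) -> str:
    """Apply both rules in order and return the corrected text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(fix_summary("a summery day, a sumary, and a summerised report"))
```

Requiring an "is"/"iz" ending after "summer" is what keeps the legitimate word "summery" (and "summer" itself) out of the match.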

interupt -> interrupt

Lemon Interupt is the alternate name of this band. From a search of google it appears that the name is used in only seven different articles. Is it worth removing the typo from the list or modifying it somehow?--Dycedarg ж 11:44, 20 February 2008 (UTC)

I added a lookbehind assertion to allow Lemon Interupt (and Lemon Interupts, for that matter, but I don't think that's a big problem). -- JHunterJ (talk) 12:24, 20 February 2008 (UTC)
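A minimal sketch of the lookbehind idea, in Python. The pattern is a hypothetical reconstruction, not the rule actually added to the list, and `fix_interrupt` is a made-up name:

```python
import re

# Hypothetical reconstruction: skip the fix when "Lemon " comes just before.
INTERRUPT = re.compile(r"(?<!Lemon )\b([Ii])nterupt(|s|ed|ing|ions?)\b")

def fix_interrupt(text: str) -> str:
    """Double the r in 'interupt' words unless preceded by 'Lemon '."""
    return INTERRUPT.sub(r"\1nterrupt\2", text)

print(fix_interrupt("An interupt occurred while Lemon Interupt played."))
```

As noted above, the assertion also exempts "Lemon Interupts", because the lookbehind inspects only the text before the match and not the ending.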

Broken regex

I'm being told regex is broken and I don't know how to fix it. MBisanz talk 02:57, 29 February 2008 (UTC)

Looks like AWB replaces "manouvers" with "manoeuvers".

Isn't the latter a misspelling? Shouldn't it be "manoeuvres"? I would say "manouvers" should be replaced with "maneuvers" and "manoeuvers" with "manoeuvres". But I could be wrong, as English is a second language for me. TestPilot 11:02, 10 March 2008 (UTC)

Imtrec Aviation -> Intrec Aviation

Can someone add this as an exception? Imtrec Aviation is a legitimate company. Should the "imtrec"->"intrec" rule be kept at all? TestPilot 15:02, 10 March 2008 (UTC)

Looks like this one got fixed by User:BillFlis. Thanx. TestPilot 07:05, 11 March 2008 (UTC)

Imdadkhani

The script wants to replace perfectly good "Imdadkhani"(28 pages in WP) with nonexistent "Indadkhani" for some reason. TestPilot 16:24, 11 March 2008 (UTC)

Retuned

It is not a valid word - it is a misspelling of returned. TestPilot 17:57, 11 March 2008 (UTC)

Re+tuned, see the usage[16]. MaxSem(Han shot first!) 18:03, 11 March 2008 (UTC)
Oops. Yes. Correct, sorry. TestPilot 18:05, 11 March 2008 (UTC)

in so far → insofar?

Looks like "in so far" is a legitimate spelling. Should we really replace it? TestPilot 14:05, 10 March 2008 (UTC)

I can see a lot of false positives with that. Also [17]. Rocket000 (talk) 23:06, 15 March 2008 (UTC)

AutoCorrect database

I have created a page with a huge list of typo corrections from the AutoCorrect software. RegExTypoFix already covers lots of the entries, but far from all. The list itself was originally based on an old list of wiki typo corrections, and it was created by the AHK community. The easiest way to check it out in AWB is to create a list from "what links here" for the Zelavin article. Make sure you enable user space pages. Second, today I started to work on my own utility for typo autocorrection on the fly. It is sort of working already, as I type :), and the good news is that it checks against 2200 regular expressions (all that were on the AutoWikiBrowser/Typos page) in the blink of an eye, or even faster than that, on a relatively old computer. So it does look like we can expand the regex list something like tenfold without having to worry too much about performance. TestPilot 02:24, 13 March 2008 (UTC)

I cleaned out the list and updated the typo page with new rules. TestPilot 03:58, 14 March 2008 (UTC)

heavly → heavily

In this edit, it somehow used avly → avely. Can this be fixed? Thanks. — E talk 23:44, 20 March 2008 (UTC)

Removed. MaxSem(Han shot first!) 13:03, 23 March 2008 (UTC)

Thru -> through

Given the number of legitimate uses (including in article titles - see Special:Prefixindex/Thru), should this be an automatic correction? Black Falcon (Talk) 22:41, 22 March 2008 (UTC)

I think no, so I've removed it. Thanks Rjwilmsi (talk) 12:44, 23 March 2008 (UTC)

Inbhir -> Imbhir

I'm not sure which line is causing the change, but I think there are too many false positives associated with this change of "In" to "Im". Examples of articles on which this would cause errors include Ayr and Cullen. – Black Falcon (Talk) 16:49, 24 March 2008 (UTC)

A similar issue takes place with replacement of "En" with "Em" (e.g. "Enman" -> "Emman", in the article William George Barker). Black Falcon (Talk) 21:36, 24 March 2008 (UTC)

The Ayr and Cullen articles were incorrectly tagged. I've fixed them [18] and [19]. Thanks Rjwilmsi (talk) 00:07, 30 March 2008 (UTC)

I've added an exception so 'Enman' isn't caught - [20]. Rjwilmsi (talk) 00:13, 30 March 2008 (UTC)

Thanks. Black Falcon (Talk) 06:47, 6 April 2008 (UTC)

Consitution > Constitution

Hm? Jobjörn (talk) 14:42, 7 April 2008 (UTC)

Vitaly → Vitally

I recently ran AWB on Category:Soviet actors (123 articles) and encountered three false positives (Boris Babochkin, Vasily Livanov, and Vitaly Solomin) with "Vitaly" → "Vitally". While "vitaly" is probably a common misspelling of "vitally" (and, thus, the fix for it is useful), the fix could cause errors in articles about Russian people. Since names are likely to be written in upper-case, is there any way to restrict the change to lower-case instances of "vitaly" only? If not, is there some other way to reduce the potential for false positives while preserving the typo fix? Black Falcon (Talk) 06:29, 6 April 2008 (UTC)

Hopefully fixed. TestPilot 10:20, 16 April 2008 (UTC)
It seems to be working: I just tried AWB on the articles that produced the false positives and was not prompted for any typo fixes. Thanks! Black Falcon (Talk) 16:49, 16 April 2008 (UTC)
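The lowercase-only restriction asked about above might look like this in Python, where matching is case-sensitive by default. This is a hypothetical rule for illustration, not the fix TestPilot applied:

```python
import re

# Hypothetical lowercase-only rule; no (V|v) alternation, so the capitalised
# given name never matches.
VITALLY = re.compile(r"\bvitaly\b")

def fix_vitally(text: str) -> str:
    """Correct lowercase 'vitaly' only, leaving the name 'Vitaly' untouched."""
    return VITALLY.sub("vitally", text)

print(fix_vitally("Vitaly Solomin was vitaly important to the film."))
```

The trade-off is that a sentence-initial "Vitaly important..." typo would be missed, which seems acceptable given how often the capitalised form is a name.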

Capitalization in Wikipedia DNS

Noticed it changed wikipedia to Wikipedia in Wikiquote in the dns addresses listed there. Convention dictates they remain lowercase. - Kaobear (talk) 15:05, 8 April 2008 (UTC)

Yeah, I agree, strings "wikipedia.org", "wikipedia.com", "wiktionary.org" and "microsoft.com" should not be capitalized - too many false positives. But, unfortunately, I don't know how to fix that. TestPilot 10:34, 16 April 2008 (UTC)
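One possible approach, sketched in Python under the assumption that lookaround assertions are available: capitalise "wikipedia" only when it is not part of a host name, i.e. not preceded by a dot or word character and not followed by a dot plus a word character. The pattern and the name `fix_wikipedia` are hypothetical, not an entry from the typo list:

```python
import re

# Hypothetical rule: leave "wikipedia" alone inside host names such as
# en.wikipedia.org or wikipedia.com.
WIKIPEDIA = re.compile(r"(?<![.\w])wikipedia\b(?!\.\w)")

def fix_wikipedia(text: str) -> str:
    """Capitalise free-standing 'wikipedia' but not DNS names."""
    return WIKIPEDIA.sub("Wikipedia", text)

print(fix_wikipedia("Read it on wikipedia. The address is en.wikipedia.org."))
```

The lookahead checks for a dot followed by a word character, so an ordinary sentence-ending full stop does not block the fix.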

esp. --> especially

I was thinking:

<Typo word="especially" find="\b(Esp|esp)\.([ \t])\b" replace="$1ecially$2" />

..but am open to corrections... Ling.Nut (talk) 19:29, 26 April 2008 (UTC)

Most likely it will be encountered in quotations, where it shouldn't be changed. MaxSem(Han shot first!) 19:57, 26 April 2008 (UTC)
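For reference, a sketch of how the proposed rule behaves, translated to Python (`expand_esp` is a hypothetical name). It also illustrates the concern raised above: the rule cannot tell quoted text from running prose.

```python
import re

# The proposed rule, with $1/$2 rewritten as \1/\2.
ESP = re.compile(r"\b(Esp|esp)\.([ \t])\b")

def expand_esp(text: str) -> str:
    """Expand 'esp.' to 'especially' when a word follows the space."""
    return ESP.sub(r"\1ecially\2", text)

print(expand_esp("Cold, esp. in winter."))
print(expand_esp('He wrote that he was "esp. fond" of it.'))  # quotation is altered too
```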

WP:MOS fixes, such as "no spaces around mdashes"

Is there a reason why AWB doesn't do the more mechanical WP:MOS fixes? Ling.Nut (talk) 09:32, 28 April 2008 (UTC)

These types of fixes can be proposed at Wikipedia talk:AutoWikiBrowser/Feature requests. I'm not sure whether a spacing fix could (or should) be incorporated into this page... Black Falcon (Talk) 20:15, 28 April 2008 (UTC)
Thanks! Ling.Nut (talk) 02:12, 29 April 2008 (UTC)

Fix may be needed

While I was adding a nav box and also doing the general and typo fixes, AWB changed the spelling of the word Succeeded from succedded to succeededd in the preview. But when I checked the diff after saving, it was Succeeded. Can someone look at this problem? --SMS Talk 16:57, 28 April 2008 (UTC)

BillFlis has corrected this [21]. Thanks Rjwilmsi (talk) 23:44, 30 April 2008 (UTC)

Souffle

It seems to change Souffle into Souffléouffl. Interesting word, but not strictly a correction... -- 20.133.0.13 (talk) 09:40, 29 April 2008 (UTC)

Thanks, this has already been corrected [22] by BillFlis. Thanks Rjwilmsi (talk) 23:42, 30 April 2008 (UTC)

suggesting a change

Here, AWB changed "reciding" to "resideing". Using the link to Dictionary.com on User:Mboverload/RegExTypoFix/rejectedwords, the word "resideing" is not a real word. I suggest that the typo fix for "reciding" be changed to "residing".--Rockfang (talk) 12:33, 7 May 2008 (UTC)

Fixed.--BillFlis (talk) 13:18, 7 May 2008 (UTC)
Thanks.--Rockfang (talk) 13:51, 7 May 2008 (UTC)

Jewelery

Can someone remove "jewellery" → "jewelery" from the typo list? "Jewelery" is an Americanism; in the rest of the world the correct spelling is with two l's. iridescent 16:16, 11 May 2008 (UTC)

Merriam Webster says 'jewelery' isn't a word [23] (jewellery is the British version, jewelry is the American one), so which of these do you think is wrong?
<Typo word="jewellery" find="\b(J|j)ewelery\b" replace="$1ewellery"/>
<Typo word="Jewelery" find="\b(J|j)ewl(|le)ry\b" replace="$1ewel$2ry" />

Thanks Rjwilmsi (talk) 17:38, 11 May 2008 (UTC)

The second is correct as it replaces "jewlery" which definitely isn't a word to "jewelry"; the first should go as it just converts British to American English. I'd add one to convert "jewllery" to "jewellery", too. iridescent 17:45, 11 May 2008 (UTC)

Ignore me (I can never understand regexes) - the corrections should be "jewllery" to "jewellery", "jewelery" to "jewelry" and "jewlery" to "jewelry". I think. iridescent 17:48, 11 May 2008 (UTC)

Okay, 1 is already corrected, 2 is actually corrected to jewellery which I think is better so haven't changed and 3 is now corrected. Thanks Rjwilmsi (talk) 18:21, 11 May 2008 (UTC)
I am so glad there's someone here who actually understands the way this thing works... iridescent 18:55, 11 May 2008 (UTC)
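Since the two rules quoted above are easy to misread, here is a small Python sketch of what each one does as written (`apply_rule` is a hypothetical helper; only the replacement syntax has been changed from `$1` to `\1`):

```python
import re

# The two rules quoted in this thread.
RULE_1 = (re.compile(r"\b(J|j)ewelery\b"), r"\1ewellery")
RULE_2 = (re.compile(r"\b(J|j)ewl(|le)ry\b"), r"\1ewel\2ry")

def apply_rule(rule, word):
    """Apply one (compiled pattern, replacement) pair to a word."""
    pattern, replacement = rule
    return pattern.sub(replacement, word)

print(apply_rule(RULE_1, "jewelery"))  # rule 1 inserts a second l
print(apply_rule(RULE_2, "jewlry"))    # rule 2, empty alternative
print(apply_rule(RULE_2, "jewllery"))  # rule 2, "le" alternative
print(apply_rule(RULE_2, "jewlery"))   # matched by neither alternative
```

Read this way, the first rule produces the British spelling rather than converting it away, and "jewlery" is covered by neither rule as quoted, which is consistent with it being the third case that still needed correcting.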

Targetting and Targetted...

are perfectly fine in British, Canadian and other kinds of English. Please remove them from the list asap. --Slp1 (talk) 21:19, 11 May 2008 (UTC)

Do you have a link to support this – targetted etc. are not listed with double ts at wiktionary, Merriam Webster nor Dictionary.com. Thanks Rjwilmsi (talk) 21:47, 11 May 2008 (UTC)
well, well. How very, very interesting. I have to confess that I can't find any. The OED does include examples of the double tt, but from centuries ago. However major media such as the BBC,[24][25] CBC, [26] Globe and Mail,[27] reputable publishers [28] [29] and scholarly journals [30][31], all use the spelling regularly. It is a fascinating example of dictionaries as prescriptive rather than descriptive. I wonder how long dictionaries can possibly continue not to include it as a frequently used variant given its wide use by reputable sources. What do you guys do in situations like this? I guess it is probably desirable to stick to what dictionaries say, no matter how extensively the variant is used, but I do think some care needs to be taken: there are a number of books [32][33] and articles [34] [35], for example, that use the "incorrect" spelling, and fixing them as typos would not be right obviously. --Slp1 (talk) 14:32, 12 May 2008 (UTC)
Okay, there are sufficient grounds to remove this as the typo list does not aim to be controversial, so I have [36]. Thanks Rjwilmsi (talk) 17:54, 12 May 2008 (UTC)
No, thank you! I very much appreciate this commonsense approach to the problem! --Slp1 (talk) 18:19, 12 May 2008 (UTC)

Remove references to "Encyclopedia of Cajun Culture"?

I believe that several Wikipedia entries cite my personal web site, the Encyclopedia of Cajun Culture, located at www.cajunculture.com, as a source of information.

However, I have discontinued the Encyclopedia of Cajun Culture and now use the domain in question for other purposes.

As such, could someone create an AWB that would remove all references in Wikipedia to my website, whether it's listed as "Encyclopedia of Cajun Culture" or as "www.cajunculture.com" or even some combination of the two? (I manually deleted one such reference, which included not only "Encyclopedia of Cajun Culture" and "www.cajunculture.com", but also my personal name and that of my co-author.)

Sincerely, --Skb8721 (talk) 01:19, 14 May 2008 (UTC)

Your request sounds reasonable if the website's content is now not relevant to the articles in which it is referenced. However, this is the talk page for typo fixing, so I suggest you re-post your request on the appropriate page – the AWB talk page. Thanks Rjwilmsi (talk) 09:17, 14 May 2008 (UTC)

Ukulele

Can someone remove "ukelele"→"ukulele" from the regex please? "Ukelele" is the correct spelling in British English, and an acceptable variant in the US. Thanks! iridescent 01:33, 14 May 2008 (UTC)

Fixed – wiktionary agrees that ukelele is a valid variant – wikt:ukelele. Rjwilmsi (talk) 09:29, 14 May 2008 (UTC)

Of fornames and feilds

It's currently changing "forname" to "oref" and "feilding" to "field$S" which, while both interesting words, are probably not correct; can someone who understands these things fix it? iridescent 20:57, 8 May 2008 (UTC)

I fixed the "feilding" one. Someone else is going to have to tackle the other one.--BillFlis (talk) 23:01, 8 May 2008 (UTC)
I think I've now fixed the forname one too, but I can't test it properly until this evening. Rjwilmsi (talk) 08:52, 16 May 2008 (UTC)
It's still doing this - could someone have another look (or remove it from the regex entirely as an interim measure)? Thanks! iridescent 20:50, 17 May 2008 (UTC)
The typo list is fixed e.g. [37] but there is a bug with AWB in that the released version is stuck loading some old version of the typo list. I reported the bug, it's been fixed in the SVN version, but no new official update has been released. I have re-requested a new release - Wikipedia_talk:AutoWikiBrowser/Dev#Release_next_version_please. I would suggest you politely petition the developers for a release to fix this! Thanks Rjwilmsi (talk) 22:04, 17 May 2008 (UTC)
And to clarify, even if you remove the entry from the typo list (which isn't necessary as it's now correct), the released version of AWB will not pick up the new version of the typo list! Rjwilmsi (talk) 22:06, 17 May 2008 (UTC)
A new version of AWB has been released to resolve this bug. Rjwilmsi (talk) 07:18, 29 May 2008 (UTC)

Compleat vs. Complete

In the past two months, there have been three edits to the "Weird Al" Yankovic page using AWB that say the word 'Compleat' was changed to 'Complete'. Looking at the diffs, the first time the word was actually changed (said change was reverted); the last two times it wasn't. 'Compleat' is not a misspelling and should not be treated as such, whether or not a change is actually made. My regex skillz are not enough to correct this myself or I would. Hopefully some kind soul can help out.

Here are the diffs in question:


-- BullWikiWinkle 02:40, 29 May 2008 (UTC)

There's a known bug in AWB that explains the second and third cases. I've added an exception to the typo list so 'Compleat' will not be changed in the future. Thanks Rjwilmsi (talk) 07:07, 29 May 2008 (UTC)
Thanks for your quick attention to the issue. -- BullWikiWinkle 18:59, 29 May 2008 (UTC)

iii, www, xxx

Not sure what's changed with the 3 letters → 2 letters rule, but can exemptions be made for iii, xxx and www? At the moment AWB's attempting to shorten Roman numerals & website addresses. Thanks... iridescent 19:04, 29 May 2008 (UTC)

Yes, I thought I'd added exceptions for iii and www already. I've tweaked them and added xxx [38]; please refresh status in AWB to test. Thanks Rjwilmsi (talk) 20:21, 29 May 2008 (UTC)
Great... thanks! iridescent 20:27, 29 May 2008 (UTC)

Entries to move hyphens to en dashes

Per WP:DASH, I'd like to add some entries here that will convert hyphens to en dashes. This is a bit of a departure, though, so I wanted to discuss it first. I've tested these extensively, and not encountered any false positives (I have others that do have a lot of false positives, but I'm not adding them here).

<Typo word="en dash in page ranges" find="(pages\ ?=\ ?|pp\.?\ )([0-9]+)-([0-9]+)" replace="$1$2&ndash;$3" />
<Typo word="en dash in date ranges" find="(\[?\[?(January|February|March|April|May|June|July|August|September|October|November|December)\ [1-3]?[0-9]\]?\]?,\ \[?\[?[1-2][0-9][0-9][0-9]\]?\]?)\ ?-\ ?(\[?\[?(January|February|March|April|May|June|July|August|September|October|November|December)\ [1-3]?[0-9]\]?\]?,\ \[?\[?[1-2][0-9][0-9][0-9]\]?\]?)" replace="$1&ndash;$3" />
<Typo word="en dash in money ranges" find="(\$[1-9]?[0-9]?[0-9]?[0-9])\ ?-\ ?(\$?[1-9]?[0-9]?[0-9]?[0-9])" replace="$1&ndash;$2" />
<Typo word="en dash in measurement ranges" find="([1-9]?[0-9])\ ?-\ ?([1-9]?[0-9])(\ |\&nbsp;)(years|months|weeks|days|hours|minutes|seconds|kg|mg|kb|km|GHz|Hz|kHz|miles|mi\.|%|MPH|mph)\b" replace="$1&ndash;$2$3$4" />
<Typo word="en dash in time ranges" find="([0-1]?[0-9]:[0-5][0-9]\ ?([AaPp][Mm])?)\ ?-\ ?([0-1]?[0-9]:[0-5][0-9]\ ?([AaPp][Mm])?)" replace="$1&ndash;$3" />
<Typo word="en dash in age ranges" find="([Aa]ge[sd])\ ([1-9]?[0-9])\ ?-\ ?([1-9]?[0-9])" replace="$1 $2&ndash;$3" />

So let me know what you think...—Chowbok 17:29, 6 May 2008 (UTC)
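For anyone who wants to try the page-range entry outside AWB, here is a minimal Python sketch. It assumes Python's re module behaves like AWB's .NET regex engine for this particular pattern, and the function name fix_page_range is made up for illustration; it is not part of AWB.

```python
import re

# Pattern copied from the "en dash in page ranges" entry above.
PAGE_RANGE = re.compile(r"(pages\ ?=\ ?|pp\.?\ )([0-9]+)-([0-9]+)")

def fix_page_range(text: str) -> str:
    """Replace the hyphen in a page range with the &ndash; entity."""
    # AWB's "$1$2&ndash;$3" is written "\1\2&ndash;\3" in Python's syntax.
    return PAGE_RANGE.sub(r"\1\2&ndash;\3", text)

if __name__ == "__main__":
    for sample in ["pp. 12-34", "pages = 100-105", "ISBN 0-306-40615-2"]:
        print(sample, "->", fix_page_range(sample))
```

The ISBN sample is only there to show that a hyphenated number without the "pp." or "pages=" prefix is left alone by this entry.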

Since Wikipedia is now UTF-8–compatible, why don't you replace the hyphens with the single en-dash character "–", rather than the lame old HTML entity "& n d a s h ;", which takes up seven times the space?--BillFlis (talk) 17:40, 6 May 2008 (UTC)
Because the edit box is (for most people) in a monospaced font, which makes it impossible to tell the difference between a hyphen, an en dash, and an em dash. You'll also note that the dash characters are not converted to UTF-8 automatically by AWB, for the same reason.—Chowbok 17:45, 6 May 2008 (UTC)
I'm with BillFlis in preferring that the single character be used rather than the html entity. If AWB can only support the html entity, I'd rather not see this implemented. olderwiser 18:05, 6 May 2008 (UTC)
Sigh. Did you read what I just wrote? At least try to address my point...—Chowbok 18:23, 6 May 2008 (UTC)
I don't really see why the monospaced font display is an issue. I venture that most editors could care less about the difference and we shouldn't be unnecessarily filling the edit screen with techno-jargon. If AWB is unable to make the distinction, I don't think we should be using AWB to implement such a "solution". olderwiser 18:30, 6 May 2008 (UTC)
AWB is capable of putting in the UTF-8 character, I'm not sure how you got that it isn't. Anyway, the monospaced font is very much an issue, and editors that know the difference between the dashes absolutely need to be able to see which has been implemented. It's ridiculous to say that it's not a big deal that commonly-confused characters look identical in the edit box.—Chowbok 18:34, 6 May 2008 (UTC)
A bad assumption perhaps, because it is rather inconceivable why anyone would want to clutter the articles up with html entities when there is a perfectly good UTF character available. If it is so very important for editors to be able to distinguish them, then why does the MOS make no mention whatsoever of the distinction, let alone indicate any sort of preference? Now that you indicate AWB is capable of inserting the UTF character, I very very strongly oppose having it insert the kludgy html entity. olderwiser 19:02, 6 May 2008 (UTC)
I don't see why it's "inconceivable" when I've explained it several times now. The reason is that editors need to be able to see if something is a hyphen, en dash, or em dash when editing an article. The advantage of doing it this way is that it allows that. The disadvantage is that you think HTML entities are ugly. Sorry, I'm not convinced that's the better argument.—Chowbok 19:22, 6 May 2008 (UTC)
Well, if as you say, it is so important to see the distinction, then why is the MOS and other editing guidelines silent on this point? If it is simply a matter of your preference vs. mine, that is certainly something that should be more widely discussed before encoding it into AWB. olderwiser 19:41, 6 May 2008 (UTC)
Do keep in mind that, as I said, AWB already does not move &ndash; to – when fixing Unicode. So if we're discussing this, we need to discuss them removing that exception as well. Also, please see below for my question.—Chowbok 18:45, 7 May 2008 (UTC)

This certainly seems like a worthwhile fix for AWB to do, but I think it would be better as an AWB general fix so it's available to all AWB users, not just those doing typo fixing. Therefore I suggest you post it at Wikipedia talk:AutoWikiBrowser/Feature requests. Thanks Rjwilmsi (talk) 17:52, 6 May 2008 (UTC)

Will do, thanks.—Chowbok 18:23, 6 May 2008 (UTC)

Just beneath the edit window are all these special characters, which an editor can simply click on to insert. Guess what's the very first one? An en-dash character. The second is an em-dash. If we're not supposed to use them, then why are they there?--BillFlis (talk) 22:28, 6 May 2008 (UTC)

I'm not saying it's a policy to use the entities, just good practice. Let me ask you and Bkonrad a question. Suppose I'm editing a page by hand, and I see 1941—1945 in the edit box. What should I do to quickly determine if the correct dash is being used?—Chowbok 18:42, 7 May 2008 (UTC)
Hmm, well just eyeballing it in my edit window it looks to me like an endash. And confirmed by using Firefox's search function. olderwiser 19:29, 7 May 2008 (UTC)
Put a hyphen, an em dash, and an en dash in an edit window. Assuming you're using a monospaced font, I guarantee two of those will be identical.—Chowbok 22:21, 7 May 2008 (UTC)
Yep, I did that. I did nothing special to configure Firefox. The difference between them was pretty easy to spot. olderwiser 00:36, 8 May 2008 (UTC)
Well, I don't know what font you're using, but in Courier, these look the same:
Chowbok 03:01, 8 May 2008 (UTC)
Hmm, I misspoke. First, in response to your example of 1941—1945, I said it looked like an endash in the edit window, but on more careful examination it is an mdash. Second, a regular hyphen and an ndash do appear identical in the edit window (immediately above, you show an endash and an mdash, which are clearly different, even to my not particularly acute vision). But the Firefox search function does find the correct characters. In any case, you have not responded to my query: if it is so important for editors to be able to make this distinction (based on using the html entities), why is no mention made of it in the MOS or other editing guidelines? olderwiser 12:23, 8 May 2008 (UTC)

This brings up a good question: why is the edit box in a monospace font? Plrk (talk) 14:36, 14 June 2008 (UTC)

legitmacy → llegitimacy

I was recently swapping over some templates when I came across Mid-Sha'ban. AWB suggested a change of legitmacy → llegitimacy. I wasn't sure if this was a bug or not.--Rockfang (talk) 17:13, 30 May 2008 (UTC)

You're very polite to only suggest that this might be a bug ;) It is, and I've fixed the erroneous entry. Thanks Rjwilmsi (talk) 23:09, 30 May 2008 (UTC)


Illinios -> Illinois

Could the spell check fix this?--DAW0001 (talk) 13:14, 6 June 2008 (UTC)

This entry already fixes it. Thanks Rjwilmsi (talk) 15:21, 6 June 2008 (UTC)
<Typo word="Illinois" find="\b(?:[Ii]l(?:[li]a?noi|ll+[ai]noi?|l+[ai]ni?o|l+ioni)s|illinois)\b" replace="Illinois" />
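As a rough way to check an entry like this without loading AWB, the sketch below parses the line as XML and applies it with Python's re module. It assumes the pattern means the same thing in Python as in AWB's .NET engine (which seems to hold here), and apply_rule is a hypothetical helper name, not an AWB function.

```python
import re
import xml.etree.ElementTree as ET

# The entry quoted above, copied verbatim.
RULE = r'<Typo word="Illinois" find="\b(?:[Ii]l(?:[li]a?noi|ll+[ai]noi?|l+[ai]ni?o|l+ioni)s|illinois)\b" replace="Illinois" />'

def apply_rule(rule_xml: str, text: str) -> str:
    """Apply one <Typo .../> entry to text (hypothetical helper)."""
    entry = ET.fromstring(rule_xml)
    # Convert .NET-style $1 group references to Python's \g<1> form.
    replacement = re.sub(r"\$(\d+)", r"\\g<\1>", entry.get("replace"))
    return re.sub(entry.get("find"), replacement, text)

if __name__ == "__main__":
    for sample in ["Chicago, Illinios", "Ilinois", "illinois", "Illinois"]:
        print(sample, "->", apply_rule(RULE, sample))
```

Entries whose attributes contain HTML entities such as &ndash; would need extra handling before an XML parser accepts them, so this only suits plain entries like the one above.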

recieve -> receive

Could the spell check fix this?--DAW0001 (talk) 13:14, 6 June 2008 (UTC)

This entry already catches it. Thanks Rjwilmsi (talk) 15:15, 6 June 2008 (UTC)
<Typo word="(Re/De/(Mis/Pre)Per/(Mis)Con/Trans)ceive" find="\b([RrDd]e|[Pp]er|[Mm]isper|[Cc]on|[Mm]iscon|[Pp]recon|[Tt]rans)ce?iev(e[sd]?|ers?|ing|ership|ables?)\b" replace="$1ceiv$2" />
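A quick sketch of how the prefix group and the suffix group in this entry carry over to the replacement, written for Python's re module on the assumption that it treats this pattern the same way AWB does. The name fix_ceive is only for illustration.

```python
import re

# Pattern copied from the entry above, split over two lines for readability.
CEIVE = re.compile(
    r"\b([RrDd]e|[Pp]er|[Mm]isper|[Cc]on|[Mm]iscon|[Pp]recon|[Tt]rans)"
    r"ce?iev(e[sd]?|ers?|ing|ership|ables?)\b"
)

def fix_ceive(text: str) -> str:
    """Rewrite '-ciev-' / '-ceiev-' as '-ceiv-', keeping prefix and suffix."""
    # "$1ceiv$2" in the entry corresponds to "\1ceiv\2" here.
    return CEIVE.sub(r"\1ceiv\2", text)

if __name__ == "__main__":
    for sample in ["recieve", "Decieved", "percieving", "receive", "believe"]:
        print(sample, "->", fix_ceive(sample))
```

Because the prefix alternatives are anchored by \b, a word such as "believe" never reaches the "ciev" part of the pattern.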

"Dependant" is valid

Please see this diff. Plrk (talk) 14:30, 14 June 2008 (UTC)

Oddly, just came to make the exact same point. "Dependant" is the correct spelling in British English, can someone remove this one? – iridescent 19:46, 15 June 2008 (UTC)
Well, it's the correct spelling in some cases as US English uses 'dependent' when British English uses 'dependant'. Unfortunately 'dependant' is often used when 'dependent' should be, so fixing 'dependant' to 'dependent' is often correct: Plrk seems to have made three recent edits changing 'dependant' to 'dependent', ([39] [40] [41]) and only the second one was incorrect. Anyway, typo list updated [42] and [43]. Thanks Rjwilmsi 23:07, 15 June 2008 (UTC)
Thanks! At some point I'll run a search for "dependant" and search-and-replace it when it's being used as an adjective. – iridescent 23:08, 15 June 2008 (UTC)

feauture > feature

feauture > feature! Plrk (talk) 21:21, 15 June 2008 (UTC)

Added.--BillFlis (talk) 21:53, 15 June 2008 (UTC)

Immigrant / Inmigrante

Ignoring "inmigrante" when correcting "immigrant" would save us a lot of accidental "corrections" of Spanish proper nouns. Plrk (talk) 22:02, 15 June 2008 (UTC)

I've added an exception for it. Thanks Rjwilmsi 22:48, 15 June 2008 (UTC)

programm > program

programm > program, programms > programs Plrk (talk) 23:05, 15 June 2008 (UTC)

Occurring vs occuring

AWB keeps changing the (British) spelling of "occuring" to (US) "occurring" on After Dark (TV series). As I understand it this goes against WP:ENGVAR, so is there a way to improve the situation (e.g. maybe telling the software, don't be too quick, this spelling is intended and may even be correct)? Thanks AnOpenMedium (talk) 16:37, 16 June 2008 (UTC)

See my answer on my talk page. Rjwilmsi 17:09, 16 June 2008 (UTC)
Have just checked in the full OED; "occuring" isn't listed as a legitimate variant. – iridescent 17:28, 16 June 2008 (UTC)
So sorry. Many apologies. AnOpenMedium (talk) 11:54, 17 June 2008 (UTC)

It's->its

This rule is an amazing one. Works like a charm. Kudos to whoever created it. It makes AWB not only a spellchecker, but a grammar fixer too :) TestPilottalk to me! 17:53, 17 June 2008 (UTC)

Progess

I found this one accidentally: "Progess" --> "Progress", "Progessed" --> "Progressed", "Progession" --> "Progression", etc. SpencerT♦C 23:11, 26 June 2008 (UTC)

This existing fix already catches the three typos above:
<Typo word="Progress" find="\b(P|p)(?:rog|togr)ess(ed|ing|ive(?:ly)?|ions?)\b" replace="$1rogress$2" />
Thanks Rjwilmsi 23:49, 7 July 2008 (UTC)
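A small sketch for trying this entry by hand, assuming Python's re module matches it the same way AWB does. One thing it suggests: the suffix group is mandatory, so under this pattern alone the bare form "Progess" appears to be left untouched, while the suffixed forms are fixed. Other entries in the list may cover the bare form; that is not checked here. The name fix_progress is illustrative only.

```python
import re

# Pattern copied from the "Progress" entry above.
PROGRESS = re.compile(r"\b(P|p)(?:rog|togr)ess(ed|ing|ive(?:ly)?|ions?)\b")

def fix_progress(text: str) -> str:
    """Apply the entry's "$1rogress$2" replacement in Python syntax."""
    return PROGRESS.sub(r"\1rogress\2", text)

if __name__ == "__main__":
    for sample in ["Progessed", "Progession", "ptogressive", "Progess", "progesterone"]:
        print(sample, "->", fix_progress(sample))
```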

byproduct and by-product

A recent edit[44] implies that "byproduct" is an incorrect spelling. However, some dictionaries [45] [46] imply that "byproduct" is correctly spelled. Someone at Talk:By-product thought that "byproduct" was the preferred form. I hope this doesn't become a repeat of the "email vs. e-mail" controversy (Talk:E-mail#E-mail_vs._email). --68.0.124.33 (talk) 23:38, 7 July 2008 (UTC)

Removed the correction as per your link above, wiktionary allows both variants. Thanks Rjwilmsi 23:46, 7 July 2008 (UTC)

Triple letter

It strikes sometimes, but seems to be giving too many false positives. Can it be modified so that it at least won't catch triple letters at the beginning and end of a word? TestPilottalk to me! 17:53, 17 June 2008 (UTC)

There are 478 ohhhs in en.wikipedia.org. And oooh is a legitimate word, and so on. TestPilottalk to me! 20:09, 17 June 2008 (UTC)
Changed so that the rule doesn't catch triple h, and oooh is an exception too. Thanks Rjwilmsi 22:04, 17 June 2008 (UTC)
Hmmm.. That is not the way to go! There are 2350 ummms in en.wikipedia.org. And hmmm is a legitimate word, and so on. The majority of false positives I saw were with triples at the start or end of a word - sometimes made-up words, sometimes just short words like the above. All the real fixes have the triple in the middle. If it cannot be written that way, then it is better to remove that rule altogether. TestPilottalk to me! 23:21, 17 June 2008 (UTC)
Okay, we'll have it your way for the moment to avoid the false positives you mention – rule changed to only match triple letters in the middle of words. I'll look into whether the rule can be more broad tomorrow. Thanks Rjwilmsi 23:48, 17 June 2008 (UTC)
Thank you. You see, the whole point is to catch and fix typos. And think about this: there are 26 letters in the Latin alphabet. So there are 26*25*2 == 1300 possible combinations like "abbb" and "bbba". How many of them are actual typos? None, or very few. Even among all entries of "asss" here, not a single one is a misspelling of "ass". And that rule catches triples not only in 4-letter words. What are the odds that someone made an actual typo with a triple letter at the beginning of a word? I cannot think of a real English word that actually starts with a double letter. And I don't think I ever saw such a typo. TestPilottalk to me! 00:17, 18 June 2008 (UTC)
Aardvark! Rjwilmsi 06:31, 18 June 2008 (UTC)
Eephus?  Uuencode?  Aa (hmmm, might be a misspelling of a'a)?  Oort cloud?  Aardwolf?    Xeriphas1994 (talk) 20:14, 23 June 2008 (UTC)

Llama? Aaron (ok, it's a name) Reedy 20:20, 23 June 2008 (UTC)

I found this AWB edit had incorrectly changed a triple f to a double f in the German word Luftschifffahrt. Could that be added as an exception? Or better, could words inside italics be excluded, as italics are sometimes used for foreign words. Thanks. -84user (talk) 04:43, 11 July 2008 (UTC)

I've added language tags here to the text in the article, so now AWB knows not to apply the English typo list to it. Rjwilmsi 06:48, 11 July 2008 (UTC)

RegExTypoFix

Hello everyone! I'm mboverload and I was the original designer and builder of RegExTypoFix. Before I say anything a quick history (may be out of order - it has been awhile!):

_____________________________
RegExTypoFix began as a vision. A vision of a 100% accurate set of typo fixing regular expressions. Although it would start as being used for Wikipedia it could easily be ported over to any other websites or applications that needed this ability. (I see now that a 100% accurate list of typos is extremely hard and limiting. I am glad that the project took on a more general, but still very accurate, direction.)

- It started off as a saved settings file for AWB. To load it you would have to download the file and use AWB to "Open" the settings and it would be in the list of Find+Replace entries.
- Of course I needed a place to host these typo files so I turned to SourceForge and made a project - RegExTypoFix.
- After awhile Martin (Orig developer of AWB) invited me to integrate the typo list directly into AWB. Every release of AWB would have a new version of the typo list built in, in a file called typos.cs. He gave me SVN access so I could just update it myself every day. Thus I became a "developer", if you'll forgive me, of AWB.
- Even this was not fast enough so I tried to design a system where people would be alerted about new releases of Typos.cs. That didn't work out too well, either.
- Then Martin had the genius idea of parsing the list directly from the wiki so everyone could work on it. This is how it stands to this day.

I built the first 1,300 lines of RETF by myself by hand. There was no automation at all or anything that it was based off of. I would just surf the wiki and when I found a misspelling I would make an entry for it.

However, due to certain circumstances I had to leave the project abruptly in 2006. A year later I considered coming back but decided against it. Now it is 2008 and I have returned to Wikipedia.
_____________________________

I would like to come back to developing this list. However, I do not want to be seen as "coming back to take the credit". The community's additions have more than eclipsed my original work. Thus I wish to become a fellow steward.

Thank you EVERYONE who has taken this project under their wing for the betterment of Wikipedia!

Comments/questions/concerns?--mboverload@ 04:29, 8 July 2008 (UTC)

Welcome back! Rjwilmsi 23:18, 8 July 2008 (UTC)
Thank you! I would LOVE to talk to some of you. I am available on MSN, AIM, and IRC. I can make arrangements to give my nickname to you via email or I can chat with you on the #AutoWikiBrowser channel on IRC. Let me know if this works for anyone =). --User:mboverload) 01:52, 9 July 2008 (UTC)

Proposal: List of developers

I think it would be a good idea to have an official "Project Template" on the main Typos page. I have the old project template here: WP:RETF
That way everyone who contributes gets an "official" recognition.
I'm not sure how we would sort the developers list. Either by seniority (I left so I might be on the bottom) or alphabetically.

Ideas/comments/concern/disdain?--mboverload@ 02:01, 9 July 2008 (UTC)

A list of regular maintainers of the list might be useful, so that somebody with an urgent issue would have a point of contact. For regular stuff there's the talk page.
I don't know how we would sort such a list – perhaps the AWB statistics reports info on users fixing typos? Rjwilmsi 06:42, 9 July 2008 (UTC)
Tis a good idea.. Just as long as you either put it on the page that is transcluded for information already, or on a new page =) and transclude that! Reedy 11:05, 11 July 2008 (UTC)

Cool. Thinking about it...who ARE the developers now? Right now I only know of me, Rjwilmsi, and Reedy. Anyone else? --mboverload@ 18:47, 11 July 2008 (UTC)

Page history? ;p - BillFlis and Rjwilmsi did the majority of the development when you were away... Hmm. As for the Actual AWB developers, MaxSem and I are doing 99% of the work that is being done to AWB, Kingboyk is too busy with other things to really contribute atm Reedy 08:21, 29 July 2008 (UTC)

TypoScan Announcement

From now on I will be scanning every database extract against the entire Typo list. In the future we will be able to "assign" a section of the 'pedia with known typos to an editor and see a real, tangible benefit. The expected size of this list is projected to be over 100,000 articles, or around 4.5% of all articles on Wikipedia.

Once we go through the list we can start recording a blacklist of articles that should not be checked. Eventually this number will be brought down by the information about false-positives.


Technical details
This is EXTREMELY SLOW GOING. At 17 gigabytes of pure text the database is MASSIVE. In addition, our ever expanding typo list needs to be checked against EVERY ARTICLE in Wikipedia. Over 2.4 MILLION! At max speed my current computer will process this in about 3-5 days.

My current limiting factor is the database scanning software and my CPU. The database scanner is not built for dual core systems and thus only uses 50% of my computer's potential.

Amount of memory is not a problem. The DB scanner only takes about 400 megabytes. It's the speed of the memory.

If my hard drive then becomes the problem I will move the database onto my 10,000 RPM SATA system drive.


Current system

  • CPU - Intel Core 2 Duo E6600 (Conroe) @2400 MHz
  • Motherboard - MSI MS-7350 | nForce 650i SLI
  • Memory - 4 gigabytes of DDR2 PC-5400 memory @333MHz

Hardware updates
In order to better support this new endeavor I am going to be upgrading my computer's hardware.

  • CPU - I will be buying the fastest CPU that I can find that doesn't cost 1000 dollars
  • Memory - I will upgrade my computer to DDR2 PC-6400 from DDR2 PC-5400
  • Overclocking - My computer's entire system was built to be overclocked. I anticipate even further gains in speed

--mboverload@ 07:43, 29 July 2008 (UTC)

I'm not sure how you could really thread off something reading from a file. I wonder if it's worth looking at having a way of using the DBScanner to run against a MySQL instance or similar, so the file has been loaded back into a database (obviously it would have to be local or a mirror, not a WP one, to save bandwidth). Overclocking your CPU will probably help reduce processing time, and the faster RAM should help. I would also move it to your 10k rpm drive; that's 33% faster rotation, so less seek time etc. Reedy 08:27, 29 July 2008 (UTC)
Can't we just find a handful of users to scan a portion of the database dump each, if we all download the same one? I assume the list of articles dump is in the same order as the articles-list dump, then we can just start from article x?
I did try this myself a couple of months ago (March db dump, ~65,000 hits) but gave up due to there being so many false positives for foreign words and Latin/scientific names. A great idea if we get it right though. Rjwilmsi 11:07, 29 July 2008 (UTC)
I don't think that 10K RPM drives will help. The main slowdown is running all those shiny regexes, so you don't need much raw HDD read speed, and if your file system isn't badly fragmented, you don't need a fast seek time either. Probably, we could improve speed by making it parallel, but CPU will still be the main dependency. MaxSem(Han shot first!) 16:11, 29 July 2008 (UTC)
If an article-specific exception list is implemented, plus the long-ago requested "Prune list" option, it will become possible to spellcheck very long lists, even online. Sure, the first run will be slow, because you will need to mark those foreign and made-up words as exceptions. But then... Just imagine spellfixing the whole of en.Wikipedia in a few hours (human time; how long the computer works in the background doesn't really matter). TestPilottalk to me! 08:50, 30 July 2008 (UTC)
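On the idea above of having a handful of users each scan a portion of the same dump, a minimal sketch of how the article indexes could be divided up. It assumes, as suggested above, that everyone downloads the same dump and that articles appear in the same order for everyone; split_ranges is a hypothetical helper, not part of AWB or its database scanner.

```python
from typing import List, Tuple

def split_ranges(total_articles: int, volunteers: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) article-index ranges, one per volunteer."""
    base, extra = divmod(total_articles, volunteers)
    ranges = []
    start = 0
    for i in range(volunteers):
        # The first `extra` volunteers take one additional article each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

if __name__ == "__main__":
    # Roughly the article count mentioned above, shared between four people.
    for start, end in split_ranges(2_400_000, 4):
        print(f"scan articles {start} to {end - 1}")
```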

Reset =(

During a power flux at my house my computer turned off. I will restart the database scan at about 1/3 of the way through. --mboverload@ 19:15, 30 July 2008 (UTC)

ENDING DISCUSSION HERE - WIKIPROJECT NOW FORMED AND UNDER DEVELOPMENT

License

Mboverload tried to claim that the list is under GPL. No, it is not! You cannot switch the license at will, unless you are the developer and own the code. TestPilottalk to me! 10:13, 30 July 2008 (UTC)

What list? Wikipedia:AutoWikiBrowser/Typos, this list? It is under GFDL, or at least that's what the edit box tells me when I make my contributions to it and agree to license my contributions under GFDL, right? -- JHunterJ (talk) 10:31, 30 July 2008 (UTC)
Yeah, Wikipedia:AutoWikiBrowser/Typos is under GFDL, which basically means that no one can integrate it into any GPL-based project. TestPilottalk to me! 10:52, 30 July 2008 (UTC)

-->I am the one who built the software InfoBox. I copied the AutoWikiBrowser infobox, which is licensed under the GPL. I simply forgot to change the license to GFDL. --mboverload@ 18:09, 30 July 2008 (UTC)
=( --mboverload@ 23:33, 30 July 2008 (UTC)

Triple letters

I have removed the triple letter RegEx temporarily and it is pasted here:
<Typo word="Triple letters" find="(?!\b(?:Eisschnelllauf|Killlai|(?:Pya|G|g)rrrl?|[Rr]sssf|[Oo]ooh|[A-Za-z]+([a-z])\1\1\1[a-z]*|[a-fw]+)\b)\b([A-Za-z]+)([a-gj-wyz])\3\3([a-z]+)\b" replace="$2$3$3$4" />

The reason I removed it is because, in spite of the great work that went into building it, I have not come across anything that it has fixed properly after around 1000 randomized edits. Could someone explain this one to me? --mboverload@ 22:57, 27 July 2008 (UTC)
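By way of explanation, the entry has two parts: a negative lookahead that lists exceptions (specific words, any run of four identical letters, and words made only of the letters a-f and w), followed by the actual match, which needs at least one letter before the tripled letter and at least one after it. The sketch below tries it in Python, on the assumption that Python's re module handles this pattern the same way as AWB's engine; fix_triple is an illustrative name.

```python
import re

TRIPLE = re.compile(
    # Exceptions: if the whole word matches any of these, do nothing.
    r"(?!\b(?:Eisschnelllauf|Killlai|(?:Pya|G|g)rrrl?|[Rr]sssf|[Oo]ooh"
    r"|[A-Za-z]+([a-z])\1\1\1[a-z]*|[a-fw]+)\b)"
    # Main match: letters, a tripled letter (h, i and x excluded), more letters.
    r"\b([A-Za-z]+)([a-gj-wyz])\3\3([a-z]+)\b"
)

def fix_triple(text: str) -> str:
    """Collapse a tripled letter in the middle of a word to a double."""
    # "$2$3$3$4" in the entry corresponds to "\2\3\3\4" here.
    return TRIPLE.sub(r"\2\3\3\4", text)

if __name__ == "__main__":
    for sample in ["commmittee", "Luftschifffahrt", "oooh", "hmmm", "www", "Grrrl"]:
        print(sample, "->", fix_triple(sample))
```

The Luftschifffahrt sample reproduces the false positive reported earlier on this page, which is why tagging foreign words with a language template was suggested rather than growing the exception list.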

Are you saying there are too many false positives? I did run it against a database dump last month, so that might explain why it doesn't make many fixes at the moment. Rjwilmsi 06:42, 28 July 2008 (UTC)
Ah, ok. Thanks for doing all that work! I'm just wondering if there are too many false positives? I basically think of you as the lead developer so let me know what you think. (See the typos page - we're a project now with your name highlighted in the dev list) --mboverload@ 14:56, 28 July 2008 (UTC)
False positives seemed to be not that many once the exceptions above were included, and remaining ones were usually for foreign words/phrases which needed to be tagged as {{lang|de|worrrd}} etc. What false positives were you getting? There's no need to remove this simply if there are currently no hits – they are sure to build up again. Rjwilmsi 11:31, 29 July 2008 (UTC)
Dear god, I've been looking for that template - no one on IRC seems to know about it! Thank you Rjwilmsi! Can I call you Rj if I'm lazy? --mboverload@ 02:06, 31 July 2008 (UTC)

Per Rjwilmsi I have readded the regular expression. --mboverload@ 20:53, 31 July 2008 (UTC)

"Nasalisation"

These rules seem rather useless. The first two letters are merely transposed, which could happen to any word in the English language. Why not check transposals of interior letters too?! This sort of thing might be worth checking for very common words, but "Nasalisation" isn't one of them. I suggest we delete these two rules.--BillFlis (talk) 22:29, 31 July 2008 (UTC)

Hey Bill, in the future could you copy the rules you are referring to? I'm lazy. --mboverload@ 03:08, 1 August 2008 (UTC)

error while loading typo list

I am getting this error while trying to load the typo list in AWB.--Rockfang (talk) 15:08, 2 August 2008 (UTC)

It appears it was just fixed. :) Rockfang (talk) 15:12, 2 August 2008 (UTC)

Yeah, I was just doing some testing and saw it.. =) 15:49, 2 August 2008 (UTC)

TYPO REVIEW: Imp-/Imm-/Imb-

I have disabled this line in production:
<Typo DISABLED="Imp-/Imm-/Imb-" find="(?!\b[Ii]n(?:ba[lr]|migrante)\b)\b(I|i)n(p[b-gi-tv-z]|m[b-np-z]|b[a-npqstv-z])\B" replace="$1m$2" />
It has a nasty habit of finding every word that begins with "In" and replacing it with "Im". Is there a way to make this less inclusive? --mboverload@ 02:42, 3 August 2008 (UTC)
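To see which words the disabled line actually touches, here is a small Python sketch. It assumes Python's re module reads the pattern the same way AWB does, and fix_imp is just an illustrative name. On these samples the rule only fires when "in" is followed by p, m or b plus one of the listed letters, but that still includes legitimate words.

```python
import re

# Pattern copied from the disabled entry above.
IMP = re.compile(
    r"(?!\b[Ii]n(?:ba[lr]|migrante)\b)"
    r"\b(I|i)n(p[b-gi-tv-z]|m[b-np-z]|b[a-npqstv-z])\B"
)

def fix_imp(text: str) -> str:
    """Apply the entry's "$1m$2" replacement in Python syntax."""
    return IMP.sub(r"\1m\2", text)

if __name__ == "__main__":
    samples = ["inpossible", "input", "inmate", "income", "inmigrante", "inbetween"]
    for sample in samples:
        print(sample, "->", fix_imp(sample))
```

The "inbetween" sample is one example of a false positive; narrowing the letter classes or adding exceptions in the lookahead would be the obvious places to tighten it.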

Inserting in a few hours. --mboverload@ 01:58, 4 August 2008 (UTC)

Tae Kwon Do (taekwondo)

<Typo word="Know" find="\b(K|k)(?:wno|on?w|n?wo)(n?|s)\b" replace="$1now$2"/>
<Typo word="Know" find="\bNk(?:wo|ow)\b" replace="Know"/>

In the former, kwon is wanting to be changed to known...
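The behaviour can be reproduced with a short Python sketch, assuming Python's re module matches the first entry the same way AWB does (fix_know is an illustrative name). "Kwon" fits the "n?wo" alternative followed by the optional "n", so it comes out as "Known".

```python
import re

# First "Know" entry from above.
KNOW = re.compile(r"\b(K|k)(?:wno|on?w|n?wo)(n?|s)\b")

def fix_know(text: str) -> str:
    """Apply the entry's "$1now$2" replacement in Python syntax."""
    return KNOW.sub(r"\1now\2", text)

if __name__ == "__main__":
    for sample in ["konw", "knwo", "know", "known", "Tae Kwon Do"]:
        print(sample, "->", fix_know(sample))
```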

Presumably we should change Tae Kwon Do --> taekwondo

Reedy 20:37, 3 August 2008 (UTC)

I agree. Change it. --mboverload@ 20:45, 3 August 2008 (UTC)
Working on it now.--mboverload@ 02:00, 4 August 2008 (UTC)

Amerias

Shouldn't be a typo for America(s)

Reedy 23:27, 3 August 2008 (UTC)

need help - rogue regex

There is some regex that keeps making these changes: [47]. It always changes the second n in a word to m and I can't figure out which regex is doing this. (Note: I saved the page to show you what was happening - I have already undone the edit.) --mboverload@ 01:49, 4 August 2008 (UTC)

TYPO REVIEW: "Honshu-" find="\bHonshu\b" replace="Honshu-"

<Typo DISABLED="Honshu-" find="\bHonshu\b" replace="Honshu-" />
Why does this add the - at the end of the word?--mboverload@ 02:56, 1 August 2008 (UTC)

It used to add a macron over the u, until the massive resort. -- JHunterJ (talk) 08:29, 1 August 2008 (UTC)
=( --mboverload@ 08:30, 1 August 2008 (UTC)
Another problem with a character-based sort is that it separates root words that have rules with and without prefixes. This makes it awkward to detect redundant rules and to consolidate sets of rules within a single rule. Also, words having accented characters get put in unexpected places within the sort. The purpose of sorting is to make it easy on the developers, and a computer sort disturbs this. I've added a guideline not to do this, but to alphabetise in a sensible way, like you would find the root words in a dictionary.--BillFlis (talk) 11:54, 1 August 2008 (UTC)
Also, isn't the intended rule rather hypercorrective? My (American) English dictionary lists Honshu without any macron.--BillFlis (talk) 11:54, 1 August 2008 (UTC)
The WP article uses the macron in its title - that's why I added this rule. The same applies to a lot of other names (e.g. Valparaíso, Chile or Zürich or Łódź), which are often spelled without the diacritics in English, but WP ought to be internally consistent. Colonies Chris (talk) 08:13, 6 August 2008 (UTC)

knots in terms of speed (abbrev. as kn)

The abbreviation for knots (kn) in terms of speed is not on the safe list (currently trying to correct as "know"). Adding this to the library would be great. Thanks. - Jameson L. Tai talkcontribs 22:54, 5 August 2008 (UTC)

RegEx tools

Any suggestions? I would love a tool that showed me all the words that a regex would fit (to a reasonable limit for greedy ones). --mboverload@ 06:40, 6 August 2008 (UTC)

I never thought that was possible for anything but the simplest regexes. MaxSem(Han shot first!) 07:54, 6 August 2008 (UTC)
I've got a 3.1GHz Core2Duo - I can stand to bruteforce it =P --mboverload@ 17:11, 6 August 2008 (UTC)
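
A rough sketch of the brute-force idea: rather than enumerating every string a regex could generate, run it over a word list and report which words it hits. Everything here is an assumption for illustration: the helper name words_matching, the limit parameter, and the use of Python's re instead of the .NET engine AWB uses (the two dialects differ in places, so results are only indicative).

```python
import re
from typing import Iterable, List


def words_matching(pattern: str, words: Iterable[str], limit: int = 50) -> List[str]:
    """Return up to `limit` words from `words` that the pattern matches."""
    rule = re.compile(pattern)
    hits: List[str] = []
    for word in words:
        if rule.search(word):
            hits.append(word)
            if len(hits) >= limit:
                break  # keep output manageable for greedy patterns
    return hits


# Hypothetical tiny word list; in practice this would be a dictionary file
# plus a list of known proper nouns and abbreviations.
sample = ["know", "kn", "konw", "knot", "Kwon"]
print(words_matching(r"\b(K|k)(?:wno|on?w|n)(n?|s)\b", sample))
```

Feeding it a list of correctly spelled words is the useful part: any hit there is a likely false positive for the rule.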

Imtuk→Intuk

Moved from WT:AWB

Any idea why the spellchecker is doing this? It's done it twice in two days. CambridgeBayWeather Have a gorilla 22:08, 7 August 2008 (UTC)

It's due to rule <Typo word="Ind-/Inn-/Int-/Inv-" find="\b(I|i)m(d[ac-z][a-ce-z]|n[b-z]|t[a-hj-qs-z]|v)\B" replace="$1n$2" />. Thoughts on fixing, guys? MaxSem(Han shot first!) 22:28, 7 August 2008 (UTC)
1, how did you figure that out, 2, kill it with fire. It is more destructive and false-positiveish than people realize. --mboverload@ 01:17, 8 August 2008 (UTC)
How? You should really keep up with SVN, it has many cute things;) MaxSem(Han shot first!) 08:02, 8 August 2008 (UTC)
Especially when it was partially added from his request, hey Max. ;) Reedy 18:30, 8 August 2008 (UTC)
Meanwhile, I removed that rule. MaxSem(Han shot first!) 21:33, 8 August 2008 (UTC)
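
For the record, a minimal sketch reproducing the false positive with the rule quoted above. It is written for Python's re module, not AWB's .NET engine (an assumption, but the pattern uses only constructs the two share), and apply_in_rule is a made-up helper name.

```python
import re

# The "Ind-/Inn-/Int-/Inv-" rule as quoted in this thread.
IN_RULE = re.compile(r"\b(I|i)m(d[ac-z][a-ce-z]|n[b-z]|t[a-hj-qs-z]|v)\B")


def apply_in_rule(text: str) -> str:
    """Apply the rule's replacement ("$1n$2" in AWB notation)."""
    return IN_RULE.sub(r"\g<1>n\g<2>", text)


# "Imtuk": "tu" fits t[a-hj-qs-z], and \B holds because "k" follows.
print(apply_in_rule("Imtuk"))
# An intended target, for comparison.
print(apply_in_rule("imvolve"))
```

The rule only excludes a handful of letter combinations, so any proper noun beginning "Imt", "Imd", "Imn" or "Imv" is at risk.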

on on→on

Not sure how big a problem this is, but thought I'd mention it. A recent edit to The Culture using this tool resulted in a problem (which has been fixed). The text "see everything going on on a given planet" was changed to "see everything going on a given planet". Naturally there are better ways of wording that sentence that eliminate the double "on" but removing one and leaving it otherwise intact is not exactly an improvement. Just mentioning it because there might be other instances as yet undetected. SilentC (talk) 01:20, 8 August 2008 (UTC)

Reedy =0 Thanks Silent! --mboverload@ 01:36, 8 August 2008 (UTC)
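
The thread does not show which rule made this change, so the following is only a guess at the kind of doubled-word pattern that would produce it. The pattern and the helper name remove_doubled_words are hypothetical, and the sketch uses Python's re module, not AWB itself.

```python
import re

# Hypothetical doubled-word rule: a word, whitespace, then the same word again.
DOUBLED_WORD = re.compile(r"\b(\w+)\s+\1\b")


def remove_doubled_words(text: str) -> str:
    """Collapse an immediately repeated word into a single occurrence."""
    return DOUBLED_WORD.sub(r"\1", text)


# A genuine slip is repaired...
print(remove_doubled_words("the the cat"))
# ...but a legitimate "going on on a given planet" loses a word too.
print(remove_doubled_words("see everything going on on a given planet"))
```

A pattern like this cannot tell a slip from valid grammar ("on on", "had had", "that that"), which is why such words usually need to be excluded or reviewed by hand.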

payed to paid

I've seen it a few times where the payed should've been played...

Reedy 19:36, 9 August 2008 (UTC)

"Payed" is in this dictionary.--BillFlis (talk) 12:35, 10 August 2008 (UTC)
So it seems better to leave it. I meant that payed --> paid wasn't always correct; it sometimes should've been played, etc. But if it's in a dictionary, we shouldn't be changing it! Reedy 12:46, 10 August 2008 (UTC)
And here I thought you were making a pun! I've just taken out the part of the "Paid" rule miscorrecting "payed".--BillFlis (talk) 13:51, 10 August 2008 (UTC)

discogrpahy → discography

A relatively common misspelling: I found and corrected 15 just a few moments ago (diffs), and "discogrpahy" receives about 87,000 Google hits. Since "Discography" (or "Discogrpahy") is often a section title, capitalisation should generally be preserved. –Black Falcon (Talk) 04:16, 16 August 2008 (UTC)

I'm going to add a more general rule, as "graph" is a common root.--BillFlis (talk) 11:14, 16 August 2008 (UTC)
Even better, thanks. –Black Falcon (Talk) 23:26, 16 August 2008 (UTC)

Buoy

Bouy is a place in France...

Probably wants removing then?

Reedy 22:55, 2 August 2008 (UTC)

Let's just remove the question mark so it finds only "Bouys" and "Bouyant".--BillFlis (talk) 11:48, 3 August 2008 (UTC)
Makes more sense. Cheers Reedy 14:20, 3 August 2008 (UTC)
Tweaked to replace "bouy" -> "buoy" and still leave "Bouy" unchanged. -- JHunterJ (talk) 13:31, 22 August 2008 (UTC)
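
The final rule text is not quoted in this thread, so here is one way the described behaviour could be sketched: fix lower-case "bouy" in any form, but touch capitalised "Bouy" only when it carries a suffix. Both patterns, the suffix list, and the name fix_buoy are assumptions for illustration, written for Python's re module.

```python
import re

# Hypothetical patterns, not the actual AWB rule.
LOWER_BOUY = re.compile(r"\bbouy(s|ant|ancy)?\b")  # suffix optional
UPPER_BOUY = re.compile(r"\bBouy(s|ant|ancy)\b")   # suffix required


def fix_buoy(text: str) -> str:
    """Correct "bouy" spellings while leaving the place name "Bouy" alone."""
    text = LOWER_BOUY.sub(lambda m: "buoy" + (m.group(1) or ""), text)
    return UPPER_BOUY.sub(r"Buoy\1", text)


print(fix_buoy("a bouy moored off Bouy"))
print(fix_buoy("Bouyant material"))
```

A sentence-initial "Bouy" meaning the float would still be missed, which seems an acceptable trade for not mangling the place name.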

Double fix..

See Scaffolding .."typos fixed: reffered → refered, refered → referred ".

Rich Farmbrough, 10:04 22 August 2008 (GMT).

That should fix it. Rjwilmsi 11:30, 22 August 2008 (UTC)

Borken

With a cap.. is a town. Rich Farmbrough, 10:09 22 August 2008 (GMT).

fixed.Rich Farmbrough, 10:10 22 August 2008 (GMT).
Fixed the fix [48] ;) Rjwilmsi 11:08, 22 August 2008 (UTC)

Date Fix

How do you use AWB to change "05/07/2006" to "[[2006-05-07]]"? Thanks. - plau (talk) 11:53, 23 August 2008 (UTC)

That date in particular? Find 05/07/2006 and replace it with [[2006-05-07]]. Dates in general? Find (as a regular expression)
(0\d|1[012])/([0-2]\d|3[01])/(\d{4})

and replace it with

[[$3-$1-$2]]

But be careful not to change dates that are actually in DD/MM/YYYY format: when the day and month are both 12 or less, the regex cannot tell the two formats apart. -- JHunterJ (talk) 12:11, 23 August 2008 (UTC)
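
A minimal sketch of the find-and-replace above, for trying it outside AWB. It assumes Python's re module (where the replacement is written with \3, \1, \2 in place of $3, $1, $2), and link_us_dates is a made-up helper name.

```python
import re

# MM/DD/YYYY pattern as given above.
US_DATE = re.compile(r"(0\d|1[012])/([0-2]\d|3[01])/(\d{4})")


def link_us_dates(text: str) -> str:
    """Rewrite MM/DD/YYYY as a wikilinked ISO date, [[YYYY-MM-DD]]."""
    return US_DATE.sub(r"[[\3-\1-\2]]", text)


print(link_us_dates("05/07/2006"))  # read as 7 May 2006
print(link_us_dates("25/12/2006"))  # month "25" cannot match, left unchanged
print(link_us_dates("07/05/2006"))  # ambiguous: 5 July or 7 May?
```

Note that the pattern also accepts "00" as a month or day, and has no boundary checks around it, so it is best used with manual review of each change.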

Sep...

Church of the Holy Sepulchre should not be given the spelling Sepulcher, regardless of what was wrong with it before. Do we remove the rule or just for capitalised uses? What do we do about alternative spellings of target words? Rich Farmbrough, 14:43 23 August 2008 (GMT).

<Typo word="Sepulcher" find="\b(S|s)epulchure\b" replace="$1epulcher"/>
This doesn't touch 'Sepulchre', so I don't see the issue? Rjwilmsi 18:03, 23 August 2008 (UTC)
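
A quick check of that claim, as a sketch in Python's re module (an assumption; AWB itself uses .NET regexes, but this pattern behaves the same in both). The helper name apply_sepulcher_rule is made up.

```python
import re

# The "Sepulcher" rule as quoted above.
SEPULCHER_RULE = re.compile(r"\b(S|s)epulchure\b")


def apply_sepulcher_rule(text: str) -> str:
    """Apply the rule's replacement ("$1epulcher" in AWB notation)."""
    return SEPULCHER_RULE.sub(r"\g<1>epulcher", text)


# The British spelling contains no "chure", so it is not matched.
print(apply_sepulcher_rule("Church of the Holy Sepulchre"))
# Only the misspelling "sepulchure" is rewritten, to the American form.
print(apply_sepulcher_rule("Church of the Holy Sepulchure"))
```

So the rule leaves a correct "Sepulchre" alone, but a misspelt "Sepulchure" in that church's name would be changed to the American "Sepulcher", which may be the case that was observed.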

Quotes

Would it be possible to skip words in single or double quotes, or italicised? Rich Farmbrough, 14:49 23 August 2008 (GMT).

Or indeed "." as in www.harvard.com Rich Farmbrough, 16:03 23 August 2008 (GMT).
Or in brackets, braces or parentheses. As in "a castrated male (wether)" . And best if we can also have some template(s) or marker(s) such as {{Sic|wether}}, wether<sic />, or {{flquote|lang=lat|Patriam fecisti diversis de gentibus unam}} .Rich Farmbrough, 16:14 23 August 2008 (GMT).
There's an AWB feature request (or bug, I don't remember exactly) to cover the double quotes. Currently the {{quote}} family and {{sic}} templates are already ignored for typo fixing. I don't think we should ignore words just because they're in brackets. Rjwilmsi 18:08, 23 August 2008 (UTC)
As noted below, {{Lang}} already exists too... so it looks like I need to research these templates a bit more. I did mean, incidentally, a word on its own in brackets, not part of a phrase, but I have my doubts too. The only other thing that would be nice would be a form of {{sic}} that says - "hey we've spotted this mis-spelling in a quote, but we've left it, can someone check that it's mis-spelt in the original?" Maybe that is already there too! Rich Farmbrough, 01:22 24 August 2008 (GMT).
The template can be used with a question mark "?" to mark situations where the correctness or incorrectness of an apparent error cannot readily be determined. Rich Farmbrough, 11:34 24 August 2008 (GMT).

De-hyphenation of sea- words

The reason these are hyphenated, or often separate words, is to break the ea vowel from the following consonant - thus sea-dog and sea bed, not sead-og or seab-ed/seab'd. I have doubts about removing these hyphens; maybe we should split the words if the hyphens aren't liked. Rich Farmbrough, 14:56 23 August 2008 (GMT).

User:JHunterJ added that correction a couple of days ago. Perhaps you would contact him/her to see if (s)he has good sources to say it's all correct. Rjwilmsi 17:48, 23 August 2008 (UTC)
I took my list from Merriam-Webster Collegiate 10th (which should correspond closely to http://www.m-w.com ). The rule does not result in seab-ed, but rather seabed. Googling "define:seabed" works; "define:sea-bed" fails, for instance. I looked them up after seeing "sea-port" and knowing it wasn't correct. If there's a contradictory source of some of them being hyphenated or phrases, the entry could be reduced. -- JHunterJ (talk) 21:47, 23 August 2008 (UTC)
I didn't mean it would render as seab-ed, just that that's how it could be read, by analogy with, say, seeded. I'll get back to you if I can. Rich Farmbrough, 23:35 23 August 2008 (GMT).

Disimprove->dissimprove

Is this an improvement? Rich Farmbrough, 15:12 23 August 2008 (GMT).

No ;) fixed. Rjwilmsi 17:54, 23 August 2008 (UTC)

kn->know

AWB tried to change "kn" to "know" in one of my recent edits. I suppose <Typo word="Know" find="\b(K|k)(?:wno|on?w|n)(n?|s)\b" replace="$1now$2"/> needs to be changed, but I don't know to what exactly. --Conti| 15:34, 23 August 2008 (UTC)

That's fixed for you. Rjwilmsi 17:57, 23 August 2008 (UTC)
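
For reference, a minimal sketch of why the rule as quoted by Conti catches the abbreviation. It assumes Python's re module in place of AWB's .NET engine, and apply_know_rule_v2 is a made-up name. The corrected rule is not quoted in this thread, so it is not shown.

```python
import re

# The "Know" rule as quoted in this section (note the bare "n" branch).
KNOW_RULE_V2 = re.compile(r"\b(K|k)(?:wno|on?w|n)(n?|s)\b")


def apply_know_rule_v2(text: str) -> str:
    """Apply the rule's replacement ("$1now$2" in AWB notation)."""
    return KNOW_RULE_V2.sub(r"\g<1>now\g<2>", text)


# "kn" matches: "k", then the bare "n" branch, then an empty second group.
print(apply_know_rule_v2("a top speed of 12 kn"))
# Longer words starting "kn" fail at the trailing \b and are untouched.
print(apply_know_rule_v2("12 knots"))
```

The bare "n" branch was presumably meant to catch "knn" and "kns", but with the second group allowed to be empty it matches "kn" on its own as well.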

Various

  • Oficial is Official in other languages
  • Monserrat seems to be a name
  • -chang is part of a number of names

Rich Farmbrough, 16:08 23 August 2008 (GMT).

  1. Then they should be tagged like {{lang|es|Oficial}} and no English typo fixes will be applied to them.
  2. Fixed
  3. Also fixed. Thanks Rjwilmsi 18:32, 23 August 2008 (UTC)

One more

instution → intuition: if anything, it should be institution. Rich Farmbrough, 23:39 23 August 2008 (GMT).

Fixed -- JHunterJ (talk) 11:34, 24 August 2008 (UTC)

realibilty → realibility

Should be reliability. Rich Farmbrough, 13:49 24 August 2008 (GMT).

Expanded existing rule. Rjwilmsi 14:05, 24 August 2008 (UTC)