Please stop your bot from repeating this mistake
here. Thanks! Johnbod (talk) 02:27, 4 December 2019 (UTC)
- @Johnbod: Ah, a different type of Song. Done with Special:Diff/929169451. Sun Creator(talk) 02:45, 4 December 2019 (UTC)
AWB code suggestion
Based on this edit, I think you may have a minor bug in your AWB code. I agree that 'a en suite' is wrong, but I'd expect the correction to be either 'an en suite bathroom' (a fully anglicized version) or 'a bathroom ensuite' (leaving the modifier in French). Judging from your edit history, I think you're probably intending the former. Hope this is helpful. (Also, are you…some kind of editor who must recharge for seven years in order to combat typos faster than any normal Science Patrol member could?) —jameslucas ▄▄▄ ▄ ▄▄▄ ▄▄▄ ▄ 02:51, 6 December 2019 (UTC)
- @JamesLucas:, thanks for the comments! Unfortunately, I manually (in AWB) changed the suggested 'an en suite' because '\an än\' sounds so awkward. As you correctly point out, my edit was faulty. According to merriam-webster.com only [en suite] is English, so a French word would seem inappropriate in an English sentence; using French also doesn't remove the need for an indefinite article. So with reluctance it's amended to 'an en suite', although feel free to amend it if you feel otherwise. Regarding the recharge, somehow it seems like fun at present both to correct AWB's typo rules and to make a dent in the 500K+ known typos on an industrial scale. Sun Creator(talk) 15:23, 6 December 2019 (UTC)
- I agree that the standard English spelling and order you opted for is the reasonable default choice, but I wouldn't fault someone for using the French loanword version since there's widespread acceptance of similar deployments such as 'soup du jour' or 'pie à la mode'. Thanks for the fix.
- The volume of your edits awes me. I have dozens of text files each listing 10000 articles that are missing commas in certain specific contexts, and I hope to make it through one file every three years. I'll probably never achieve that rate, so I'll have to live vicariously through your xTools graphs! Cheers —jameslucas ▄▄▄ ▄ ▄▄▄ ▄▄▄ ▄ 18:48, 6 December 2019 (UTC)
- @JamesLucas:@NoAmCom: It is normally effective to:
- Write a specific regex for the task that can be used in the 'Find and replace' option in AWB.
- Use either a) a database dump with database selection or b) some category selection process to find relevant articles.
- Use pre-parse mode to remove articles that don't apply, with multiple parses if applicable and multiple AWB sessions if required (these can run in the background).
- What remains is then a highly targeted list of articles to manually check with AWB.
- Regards, Sun Creator(talk) 21:07, 6 December 2019 (UTC)
- Previously you did over 10 articles a minute. Not sure what is different now. Sun Creator(talk) 23:00, 6 December 2019 (UTC)
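The find, filter, then review workflow listed above can be sketched outside AWB in a few lines of Python. This is only an illustration of the idea, not how AWB implements pre-parse mode; the rule reuses the 'a en suite' fix from earlier in this thread, and the article titles and texts are made up.

```python
import re

# Hypothetical rule: the 'a en suite' -> 'an en suite' fix discussed above.
RULE = re.compile(r"\ba en suite\b")


def preparse(articles):
    """Return only the titles whose text the rule would actually change."""
    return [title for title, text in articles.items() if RULE.search(text)]


def apply_rule(text):
    """Apply the replacement to one article's text."""
    return RULE.sub("an en suite", text)


# Made-up sample data standing in for a database dump or category listing.
articles = {
    "Hotel A": "Each room has a en suite bathroom.",
    "Hotel B": "Each room has an en suite bathroom.",
    "Asteroid C": "A main-belt asteroid.",
}

# What remains after filtering is the short list to check by hand.
shortlist = preparse(articles)
```

Filtering first means the manual review only ever sees pages where the rule fires, which is the point of the pre-parse step.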
- Thank you! I do follow the basic workflow you outlined. I think there are four major factors that have contributed to the decrease in my edit-rate since the period you cited:
- natural grouping of like content due to alphabetical order – Certain segments of my project are dominated by edits that are easy to confirm. You cited edits from a stretch largely comprising articles on astronomical bodies punctuated at regular intervals by articles on military units of the Civil War era. Manual adjustments were seldom needed, and those instances were relatively easy to spot because they departed from the established rhythm of insertions and deletions. Generally, homogenized segments of the alphabetized list progress faster than highly varied segments.[a]
- striving to allow fewer errors – My project started with virtually no BLPs, and when I suddenly fell into a lot of them, the quality suffered until I wrote code that addressed the new micro-contexts around the comma misuse I was encountering. The new code allows me to make more helpful edits, but it tends to require an extra second of review.
- one runaway expression – I've had an expression among my find-and-replace terms that searches for a hyphen-minus with a space on either side of it and replaces it with an en dash. It gets a lot of hits, and while the vast majority are valid, many require additional manual editing to make things as they should be. I turned off that expression yesterday and found that the process flowed much more smoothly, so maybe I'll leave it off for the foreseeable future (or until I decide to write a more sophisticated version of it).
- more careful review leading to more content read – Things grab my attention as I review my edits. Sometimes I end up reading a bit out of interest. More often I end up gawking at malformed formatting or certain crimes against English. This may lead to 10 seconds of head-shaking or 10 minutes of manual editing. This is an inefficiency I accept as part of my priorities as an editor, but it does occasionally balloon as I work my way through certain segments of my list.[b]
- With that said, there are a few potential improvements to my workflow that I've wished I could implement, so I'll pose them to you:
- Is there a way to get my edit summary to automatically reflect the specific change being made if it isn't a typo correction? You may see that a lot of my edits on BLPs are removing places-of-birth from DOB parentheticals rather than adding more punctuation. I manually paste that edit summary over my default summary each and every time.
- Is there an easy way to ensure that my expressions ignore file names? I could write a new version of the expressions designed to weed out any such hits, but I'm not a pro at such coding and I'd probably end up increasing processing time more than necessary.
- Is there a way to get AWB to stop removing a single space at the end of a paragraph? Those edits drastically increase the amount of scrolling required during edit reviews, and I don't even fully agree that such spaces are something worthy of deliberate removal because a space at the end of a paragraph makes it easier to re-sequence sentences by dragging and dropping.
- Thanks for being willing to share your wisdom! —jameslucas ▄▄▄ ▄ ▄▄▄ ▄▄▄ ▄ 17:41, 7 December 2019 (UTC)
For 1. a regex is required, the idea is to make it robust so you save time later. The following can get you started.
I've set up a test data page which can be used for testing.
For 2. This is about boundaries, which are a hard part of scoping and writing a regex rule. By 'file names', do you mean paths like "\myfolder\file" (Windows) and "/dev/file" (Unix)? If so, you can check the character preceding the first character of the match. Either test for what you don't want, a slash '/' or '\' (the backslash is a metacharacter, so it has to be written '\\'; escaping the forward slash as '\/' is harmless), so the regex starts \b(?<!\\|\/), or test for what you do want, perhaps a space/tab/newline: \b(?<=\s)
If instead you mean something like filename.exe, which has the same format as a domain name, use (?![^\s\.]*\.\w) at the end.
So let's say you replace marzo (Spanish for the month of March) with march: \b([Mm])arzo\b would become \b(?<!\\|\/)([Mm])arzo\b(?![^\s\.]*\.\w), and then it avoids altering 'c:\folder\marzo', '/dev/marzo' and 'marzo.exe'.
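A quick way to check the guarded marzo expression is to run it against the three cases it is meant to skip. The sketch below uses Python's re module rather than AWB's .NET engine; the lookarounds used here behave the same way in both as far as I know, but treat that as an assumption and re-test inside AWB. The function name fix is just for illustration.

```python
import re

# The unguarded rule and the guarded version from the comment above.
PLAIN = re.compile(r"\b([Mm])arzo\b")
GUARDED = re.compile(r"\b(?<!\\|\/)([Mm])arzo\b(?![^\s\.]*\.\w)")


def fix(text, pattern=GUARDED):
    """Replace marzo/Marzo with march/March, keeping the original capital."""
    return pattern.sub(r"\1arch", text)


samples = [
    "en marzo de 2019",     # ordinary prose: should change
    "Llegaron en Marzo.",   # sentence-final full stop: should change
    r"c:\folder\marzo",     # Windows path: should be left alone
    "/dev/marzo",           # Unix path: should be left alone
    "marzo.exe",            # file name with suffix: should be left alone
]
results = [fix(s) for s in samples]
```

One side effect worth knowing: the trailing lookahead also skips a match that is followed directly by a full stop and a letter with no space in between, so text with a missing space after a sentence would be left untouched.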
For 3. If I recall correctly, the space at the end of a paragraph is part of general fixes, so you could un-tick the general fixes option. Sun Creator(talk) 20:20, 7 December 2019 (UTC)
- Thanks again. Responses in order:
- Does this address edit summaries? I have a script that does the edit, but I have to replace the edit summary manually. Sorry if I'm missing something.
- I'm mostly thinking of files used to add pictures to articles. If I encounter
[[File:Pine-tree 2007.JPG|thumb|right|300px|A pine tree near [[Boise]], Idaho is one of the tallest in the world.]], I want to add a comma after 'Idaho' (and my current script does that just fine).[c] But if I encounter
[[File:Boise, Idaho pine tree.png|thumb|right|300px|One of the tallest pine trees in the world]], I want to skip that so as not to break a link. I don't know all the forms the code can take (I'm pretty sure
[[Image: is deprecated but still prevalent), but after reviewing your suggestion, I'm thinking that if I simply focus on the file suffix, I could probably solve 99% of the cases. Basically I'd be trying to exclude the script from altering anything between
\.(jpe*g|JPE*G|png|PNG|gif|GIF|svg|SVG). I haven't used boundaries before, so that should be a good learning experience for me. Thanks for pointing me that way!
- I think most of the general fixes are valid and worth implementing. I'd be hesitant to turn them all off just to avoid the end-of-paragraph space removal. I was wondering if I could turn off just that one operation. (Alternatively, I guess I could request that the operation in question be removed from the general fixes, but I don't know how much traction I'd get with the community.)
- Cheers —jameslucas ▄▄▄ ▄ ▄▄▄ ▄▄▄ ▄ 16:01, 8 December 2019 (UTC)
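The file-suffix idea in the reply above can be tried out with a negative lookahead that looks from the match up to the next pipe, closing bracket or newline for an image suffix. This is a rough sketch in Python, not a tested AWB rule: the comma rule for 'Idaho' is a stand-in, the suffix list is the one given above, and the assumption that a file name never contains '|' or ']' is mine.

```python
import re

# Assumption: a file name runs up to the next '|', ']' or newline and ends in
# one of the image suffixes listed above (matched case-insensitively).
NOT_IN_FILENAME = r"(?![^|\]\n]*\.(?i:jpe?g|png|gif|svg)\b)"

# Hypothetical rule: add a comma after 'Idaho' when another word follows it.
RULE = re.compile(r"\bIdaho\b(?= \w)" + NOT_IN_FILENAME)


def add_comma(text):
    """Insert the comma, skipping matches that sit inside a file name."""
    return RULE.sub("Idaho,", text)


caption = ("[[File:Pine-tree 2007.JPG|thumb|right|300px|A pine tree near "
           "[[Boise]], Idaho is one of the tallest in the world.]]")
filename = ("[[File:Boise, Idaho pine tree.png|thumb|right|300px|"
            "One of the tallest pine trees in the world]]")
infobox = "| image = Boise, Idaho skyline.jpg\n"
```

Because the check keys on the suffix rather than on [[File: or [[Image:, it also covers infobox parameters such as image=. The cost is that caption text which itself mentions a file name before the next pipe would be skipped too.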
I see you are not currently using the normal 'Find and replace'. Here is how I use it.
In AWB, on the Options tab,
tick the "Find and replace" checkbox, then select the Normal settings button.
In a new row, paste the regex above into the "Find" column and put "$1)" (without quotes) in "Replace with"; the replace part is what shows in the edit summary. Tick the checkboxes CaseSensitive, Regex and Enabled, make sure "Add replacements to edit summary" is ticked, then click the OK button to save. Sun Creator(talk) 16:57, 8 December 2019 (UTC)
- I am using the normal settings for find-and-replace, but I don't add replacements to the edit summaries. They're not particularly informative about the reasoning behind the changes, and because the replacements are automatically separated by commas, they're downright confusing. An edit summary like this would be typical:
/* top */[[MOS:COMMA|comma usage]], replaced: (born June 29, 1969 in [[New York City]]) → (born June 29, 1969), replaced: y 18, 2005 t → y 18, 2005, t, Nevada]] → Nevada]], (2). Now, I certainly could write my expressions differently to make the replacements more constrained, but what I really want my edit summaries to contain is the reason for the replacement—a link to MOS:COMMA, a link to MOS:BLPLEAD, or whatever helps other editors understand why a particular edit is a valid one. —jameslucas ▄▄▄ ▄ ▄▄▄ ▄▄▄ ▄ 17:27, 8 December 2019 (UTC)
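The request above, a summary that names the reason rather than the replacement, amounts to attaching a reason to each rule and reporting only the reasons for rules that fired. The sketch below shows that idea in plain Python; it is not an AWB feature or setting, and both patterns are simplified stand-ins for the real expressions.

```python
import re

# Hypothetical rules: (pattern, replacement, reason to show in the summary).
RULES = [
    # Drop a place of birth from a date-of-birth parenthetical.
    (re.compile(r"\((born [^()]*?\d{4}) in [^()]*\)"), r"(\1)",
     "[[MOS:BLPLEAD]]"),
    # Add the comma after the year in a month-day-year date.
    (re.compile(r"(\w+ \d{1,2}, \d{4})(?= \w)"), r"\1,",
     "[[MOS:COMMA|comma usage]]"),
]


def apply_rules(text):
    """Return the edited text and a summary listing each reason once."""
    reasons = []
    for pattern, replacement, reason in RULES:
        text, count = pattern.subn(replacement, text)
        if count and reason not in reasons:
            reasons.append(reason)
    return text, "; ".join(reasons)


bio = "Jane Doe (born June 29, 1969 in [[New York City]]) is a writer."
new_bio, summary = apply_rules(bio)
```

With this shape the summary stays short however many times a rule matches, since each reason is listed once.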
- ^ Some homogenized streaks, however, lead me to edit in shorter sessions due to my dislike of them. For example, despite their being easy to review, I hate the streaks of Mediterranean men's names because they're awash with one-sentence bios of football players.
- ^ I read more frequently when editing articles not about specific asteroids or football players.
- ^ Disclaimer: I have not the slightest clue if there are any pine trees, tall or otherwise, near Boise.
To avoid replacements inside a filename, add the following code to the end of any existing regex. Text in alt= is replaced but not the actual filename.
'File' and 'image' are case-insensitive (so I just learnt), and image= is commonly used in infoboxes, e.g. | image = filename.123
So if you had an Idaho regex "\bIdaho\b" it would become
Let me know if that doesn't work as desired anywhere. I've not used regex with filenames before. Sun Creator(talk) 21:36, 9 December 2019 (UTC)
Leontxo Garcia: Spanish version updated
Hi, Sun Creator!
Just to tell you the Spanish version of this article was updated today, just in case you would like to update the English one as well.
Thank you very much in advance, — Preceding unsigned comment added by 126.96.36.199 (talk) 21:56, 11 December 2019 (UTC)