
Wikipedia:Bots/Requests for approval


BAG member instructions

If you want to run a bot on the English Wikipedia, you must first get it approved. To do so, follow the instructions below to add a request. If you are not familiar with programming, it may be a good idea to ask someone else to run a bot for you rather than running your own.

 Instructions for bot operators

Current requests for approval

Bots in a trial period

FrescoBot 14

Operator: Basilicofresco (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 14:32, Sunday, September 10, 2017 (UTC)

Automatic, Supervised, or Manual: auto

Programming language(s): python

Source code available: standard pywikipedia

Function overview: removing unwanted special characters such as the left-to-right mark, the soft hyphen (SHY) or the no-break space

Links to relevant discussions (where appropriate):

Specifically:

Edit period(s): monthly

Estimated number of pages affected:

  • LTR marks within wikilinks and categories tags = ~2100
  • no-break space = ~4300
  • SHY = 81
  • zero width space = 147
  • line separator = 82

Namespace(s): articles, categories, files. Manually also on templates, help, wikipedia and portal.

Exclusion compliant (Yes/No): yes

Function details:

  • It will remove the unwanted left-to-right mark (U+200E) from wikilinks and category tags. This is currently part of CHECKWIKI 16. This character is invisible, so in order to see it you have to cut and paste, for example, [[Category:2000 in French sport‎]] into this webpage. Sometimes they are cut and pasted in HotCat or other tools and then appear at the end of the category tag. Less often they are also cut and pasted within wikilinks or placed randomly within the page source. The point: there is absolutely no use for them within any wikilink or category tag, they notoriously break bots and other tools, causing problems[1][2][3], and they are completely invisible, so they should be avoided. Maybe they could be removed almost everywhere on the encyclopedia, but removing them within wikilinks and other markup is always safe. There is consensus for this fix and it is not cosmetic, since it actually prevents real problems. I have been running this substitution for years in order to sanitize the text before applying my other fixes, and nobody has objected on the merits; I just had to explain clearly in the edit summary that the bot was removing an unwanted invisible mark. I originally added it along the way in order to avoid mistakes when parsing wikilinks and categories, and some months ago I realized that I never properly asked for explicit approval for it.
  • It will remove the LTR mark (U+200E), the RTL mark (U+200F) and other invisible control characters (from U+2027 to U+202F) in the image name within image markup, category markup and other selected safe locations. Same reasons.
  • It will replace the no-break space character (U+00A0) with a regular space everywhere it is used as a space. It is stored as U+00A0 in the database, but it looks like a regular space in the browser. It is considered really bad practice (MOS:NBSP) and breaks bots and tools. Browsers get confused and tricky problems occur: for example, if you search for "joined Dionysus" on this page with the latest Firefox everything is fine, but with IE11 you will not find anything.
  • It will remove everywhere (manually) the character SHY (U+00AD), a.k.a. the soft hyphen. It is an invisible hyphen. It creates confusion and breaks links (e.g. any­one vs. anyone), bots (e.g. exceptions will not be identified) and tools (it is not the same word). See Soft hyphen (SHY) – a hard problem?. Pretty rare.
  • It will remove everywhere (manually) the zero-width space character (U+200B). The same problems as SHY: History​ of Quebec vs. History of Quebec. Pretty rare as well.
  • It will replace everywhere (manually) the Unicode character "line separator" (U+2028) with a standard newline character or a space. Please note that there is no use for it and it poses the same problems as SHY. (A rough sketch of these substitutions follows this list.)
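For illustration, a minimal Python sketch of the substitutions described in the list above; the character set and the wikilink pattern are simplified stand-ins, not the bot's actual pywikibot code:

    import re

    # Invisible characters to strip inside [[...]] markup: the LTR/RTL marks,
    # the zero-width space, and the U+2027..U+202F control range listed above.
    INVISIBLE = '\u200b\u200e\u200f' + ''.join(chr(c) for c in range(0x2027, 0x2030))

    def clean_link_targets(text):
        # Removing these inside wikilinks and category tags is always safe.
        def strip_invisible(match):
            inner = match.group(1).translate({ord(c): None for c in INVISIBLE})
            return '[[' + inner + ']]'
        return re.sub(r'\[\[(.*?)\]\]', strip_invisible, text)

    def clean_spacing(text):
        text = text.replace('\u00a0', ' ')   # no-break space -> plain space
        text = text.replace('\u00ad', '')    # soft hyphen (SHY) -> removed
        text = text.replace('\u2028', '\n')  # line separator -> newline
        return text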

Discussion

Bottom line: every single fix in the list solves real problems, so they cannot be considered cosmetic. Many bots cannot run safely with these characters around, because exceptions may not be identified and unexpected problems arise. The number of affected pages is not high, so it is definitely worth it. -- Basilicofresco (msg) 14:32, 10 September 2017 (UTC)

How will the bot determine whether these invisible characters are safe to remove, versus them being present to fix things (and therefore should probably be turned into an entity instead of being removed)? For example, LRM to fix cases where adjacent RTL text is causing numbers and punctuation to be displayed as RTL, non-breaking space to prevent separation of numbers and units, SHY to indicate word breaks in tables, and so on. Anomie 02:10, 11 September 2017 (UTC)
(Struck the LRM bit, re-reading the description I see the replacement is proposed only in wikilinks) Anomie 02:13, 11 September 2017 (UTC)
The point is that nobody deliberately places non-breaking spaces using the invisible Unicode special character. All the U+00A0 characters I saw were placed in spots where a &nbsp; is not useful. These characters appear with cut-and-paste jobs from word processors / old browsers / other odd sources full of invisible characters exposed to the user. With some trial edits you will be able to see how (badly) these invisible non-breaking spaces are used. -- Basilicofresco (msg) 04:55, 11 September 2017 (UTC)
There are at least a few people who use a raw NBSP in their signature, which I notice because I routinely see Village pump edits that change them to &nbsp; when someone else is using a non-standard editor that does that substitution. On some platforms it's easy enough to type the character, e.g. I can type Compose-Space-Space to get one ( ). Anomie 20:19, 11 September 2017 (UTC)
I do not plan to run it on talk pages. However, you mean that in rare cases we could find at least a few invisible non-breaking spaces properly used... It is not a big deal, but OK, I can manually check every invisible non-breaking space after a digit in order to decide whether it is better to replace it with a space or an &nbsp;. After all, there are only 188 invisible non-breaking spaces between a digit and [[ or a letter. As I said, SHY will be replaced manually, so I should be able to identify any exception. -- Basilicofresco (msg) 05:56, 12 September 2017 (UTC)
If there are no additional questions I would start a trial run. {{BAG assistance needed}} -- Basilicofresco (msg) 07:35, 20 September 2017 (UTC)
I mean: there is consensus, there are no objections or questions, twelve days have passed and a new dump file is almost ready. I humbly dare to suggest that the time for a test run has come. Please... -- Basilicofresco (msg) 11:43, 23 September 2017 (UTC)
Approved for trial (50 edits, 10 of each type). Headbomb {t · c · p · b} 13:40, 25 September 2017 (UTC)

Mdann52 bot 13

Operator: Mdann52 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 10:35, Monday, September 4, 2017 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): Pywikibot

Source code available: https://github.com/Mdann52/wikipedia/blob/master/iso4bot.py

Function overview: Help clear up the backlog in Category:Articles with missing ISO 4 redirects

Links to relevant discussions (where appropriate): Wikipedia:Bot_requests#ISO_4_redirect_creation_bot

Edit period(s): One time run

Estimated number of pages affected: ~1000

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: From Wikipedia:Bot_requests#ISO_4_redirect_creation_bot:

To help clear up the backlog in Category:Articles with missing ISO 4 redirects, if a bot could

  • Parse every article containing {{Infobox journal}}, retrieve |abbreviation=J. Foo. Some articles will contain multiple infoboxes.
  • If J. Foo. exists and is tagged with {{R from ISO 4}}, create the dotless version (J Foo) containing:
#REDIRECT[[Article containing Infobox journal]]
{{R from ISO 4}}

Thanks! Headbomb {t · c · p · b} 11:57, 31 August 2017 (UTC)
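For illustration, the requested logic could look roughly like the pywikibot/mwparserfromhell sketch below; names and the edit summary are illustrative, and (per the discussion that follows) the dotless redirect is only created when the dotted one already exists and is tagged:

    import mwparserfromhell
    import pywikibot

    site = pywikibot.Site('en', 'wikipedia')

    def create_dotless_redirect(article):
        code = mwparserfromhell.parse(article.text)
        # A page may contain several {{Infobox journal}} transclusions.
        for tpl in code.filter_templates(
                matches=lambda t: t.name.matches('Infobox journal')):
            if not tpl.has('abbreviation'):
                continue
            dotted = str(tpl.get('abbreviation').value).strip()
            dotless = dotted.replace('.', '')
            if not dotted or dotted == dotless:
                continue
            dotted_page = pywikibot.Page(site, dotted)
            # Only act when the dotted redirect exists and is tagged.
            if not dotted_page.exists() or '{{R from ISO 4' not in dotted_page.text:
                continue
            target = pywikibot.Page(site, dotless)
            if target.exists():
                continue  # never overwrite an existing page
            target.text = '#REDIRECT[[%s]]\n{{R from ISO 4}}' % article.title()
            target.save(summary='Creating missing ISO 4 redirect')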

Discussion

Sample edits - here. Mdann52 (talk) 10:40, 4 September 2017 (UTC)

When I saw the function description, I thought you were creating ISO redirects as well as tagging them. Isn't that what the bot request is for? Pinging Headbomb.—CYBERPOWER (Chat) 10:56, 4 September 2017 (UTC)
Nevermind. I didn't read your diffs properly.—CYBERPOWER (Chat) 10:59, 4 September 2017 (UTC)
  • Approved for trial (50 edits). —CYBERPOWER (Chat) 10:59, 4 September 2017 (UTC)


Handling dotted vs dotless abbreviations

@Mdann52: @Headbomb: Three things: First, I believe this bot request was about adding a dotless redirect, like Cell Biochem Biophys, only when the dotted version redirect already exists, like Cell Biochem. Biophys.. The code seems to add dotted versions based only on the abbreviation parameter, which might be too much GIGO: the parameter is incorrect in about 1/8 of cases (there's an effort to fix that, see below). Tokenzero (talk)

Second, there's a long discussion on how to exactly categorize the dotless redirects, which seems to have now settled down. Headbomb, should we use the template {{R from dotless ISO 4}} (currently placing the article in the same category as {{R from ISO 4}}) just in case this rebounds, or just keep things simple?

Third, we now have an automatic tool that computes abbreviations. It has an error rate of ~5%, but it detects virtually all errors made in human-edited abbreviations (the 1/8 garbage, see the list of mismatches). So we could handle both dotted and dotless redirects automatically, by only doing that when the human-edited abbreviation parameter matches the computed abbreviation. This should handle most redirects (eventually all but ~5%, as editors will fix the mismatches), with virtually no GIGO and without introducing any new errors. This would be a bit more complicated and there are a few more corner cases (e.g. the bot should not overwrite pages like Ann. Phys., or any redirects to unexpected pages like Ann Phys; it should find all infobox journals when a page has many - I did that with mwparserfromhell for scraping the list). I could write the code for that and give it here, or just submit my own bot. What do you think? Tokenzero (talk) 09:32, 13 September 2017 (UTC)

I don't personally see a consensus to use {{R from dotless ISO 4}} at all [such a consensus may develop in the future, of course, but I don't see it as better than 50-50 that it will]. I also agree that using the automatic tool to verify abbreviations is the superior approach to what I suggested above. If the infobox abbreviation matches the tool's 'probable abbreviation', the bot should create both dotted and undotted versions, and then null edit the original article. Headbomb {t · c · p · b} 11:22, 13 September 2017 (UTC)
  • {{OperatorAssistanceNeeded}} Any update on this?—CYBERPOWER (Message) 23:37, 18 September 2017 (UTC)
    • @Cyberpower678: Working on this when I can - some issues have come up (as alluded to below), so I'm trying to find the time to make the fixes. Mdann52 (talk) 18:00, 20 September 2017 (UTC)
      • @Mdann52: The bot function specification should be changed, as discussed above, and I believe the easiest way would be if I write and submit my own bot for BRFA, which would replace your bot. Do you agree? If you prefer to make the changes yourself, I can send some code for handling infoboxes cleanly and details on the automatic tool. To clarify the proposed changes in the bot function:
  1. It should add redirects only when the infobox abbreviation matches the one given by the automated tool OR if the dotted version redirect already exists (and is categorized as ISO-4),
  2. It should not replace existing pages unless they are just miscategorized redirects to the page we came from (e.g. it should keep disambiguation pages and dotless redirects to them); a sketch of checks 1 and 2 follows this list,
  3. Perhaps we'll want to change dotless redirects to {{R from dotless ISO 4}}, but there's no consensus on that, just a thing to keep in mind. Tokenzero (talk) 18:53, 20 September 2017 (UTC)
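For illustration, checks 1 and 2 could look roughly like this (a pywikibot sketch; the computed abbreviation is assumed to come from the automated tool):

    import pywikibot

    def abbrevs_match(infobox_abbrev, computed_abbrev):
        # Check 1: only trust the human-edited abbreviation when it agrees
        # with the abbreviation computed by the automated tool.
        norm = lambda s: ' '.join(s.split())
        return norm(infobox_abbrev) == norm(computed_abbrev)

    def safe_to_write(page, article_title):
        # Check 2: never replace an existing page unless it is already a
        # redirect pointing back at the article we came from; this keeps
        # disambiguation pages and redirects to unexpected targets intact.
        if not page.exists():
            return True
        if page.isRedirectPage():
            return page.getRedirectTarget().title() == article_title
        return False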
        • @Tokenzero: either works ok for me - if you wish to take over the task, feel free. Mdann52 (talk) 19:40, 20 September 2017 (UTC)
When I made the original request, the tool didn't exist yet. There's a better way of doing things now, so we should do that. Makes no difference to me who codes it, but Tokenzero could probably code it more quickly as they made the tool the bot would be based on. Headbomb {t · c · p · b} 20:01, 20 September 2017 (UTC)
Ok, I'll take it over then. Since it changes the maintainer, bot account, specification, etc. I think I'll just submit a new BRFA when I'm done with some technicals. Tokenzero (talk) 20:40, 20 September 2017 (UTC)

False positives

Moved from User talk:Mdann52 Hey there -- thanks so much for setting up Mdann52 bot to tag ISO 4 redirects with {{R from ISO 4}}. I wanted to call your attention to a false positive that I recently noticed; I figure you may want to know about these things. Berkeley J. Emp. & Lab. L. was tagged as an ISO 4 redirect, but this is actually the Bluebook abbreviation and should be tagged as {{R from Bluebook}}. I think the mistake occurred because, at the time the redirect was initially tagged with {{R from ISO 4}}, the "abbreviation" field in the main article's infobox (which is used for ISO 4 abbreviations) erroneously contained the Bluebook abbreviation. If the bot is relying solely upon data in the "abbreviation" field in the infobox, and that information is incorrect, then the bot may be creating redirects from incorrect titles. There may not be anything that can be done about it, but I wanted to give you the heads up. Best, -- Notecardforfree (talk) 11:31, 4 September 2017 (UTC)

It is purely relying on the infobox entry, yes, so if these are incorrect, then the wrong redirect will be created. I'm not too sure how I can resolve these false positives - I'll stop the run for now and am open to suggestions.
@Notecardforfree and Headbomb: Mdann52 (talk) 15:30, 4 September 2017 (UTC)
AFAICT, that's fine by me. The error already exists, which means the bot isn't doing anything worse. It'll create a badly categorized redirect, based on a badly categorized redirect. I'll be going through Category:Redirects from ISO 4, and having two such redirects means I'm more likely to catch the error. However, the bot should make sure that {{R from ISO 4}} is present on the dotted redirect before creating the dotless one. Headbomb {t · c · p · b} 15:34, 4 September 2017 (UTC)
(edit conflict) Mdann52 and Headbomb, I think the bot is doing good work, and ultimately this task will save countless hours of human editors' time. I understand that a few false positives will occur now and then, but I think the utility of having the bot perform this function far outweighs any harm that would occur from creating a few false positives every now and then. I didn't mean for my message to throw a wrench in the works; I simply wanted to bring this to your attention in case it was relevant to maintaining the bot. I think that we should have the bot continue to perform this task and then have human editors review for accuracy once they have been created (checking for accuracy will take far less time than creating the redirects/tags). Thanks again for your work with this! Best, -- Notecardforfree (talk) 15:40, 4 September 2017 (UTC)
@Notecardforfree: Not an issue - you've actually pointed out an interesting bug before I noticed (namely, I wasn't checking the template name when I extracted the parameter!). I'm looking into resolving this in the next few days. Mdann52 (talk) 16:37, 8 September 2017 (UTC)

SportsStatsBot

Operator: DatGuy (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search) and co-botop Kees08 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 00:12, Saturday, March 18, 2017 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): Python

Source code available: User:DatBot/footycode

Function overview: Bot automatically updates football (soccer) league tables

Links to relevant discussions (where appropriate): Special:Permalink/770569052#Bot to update tables

Edit period(s): Checks every 30 mins

Estimated number of pages affected: Minimum 2 templates, excluding transclusions. Minimum 53 transclusions.

Exclusion compliant (Yes/No): No

Already has a bot flag (Yes/No): Yes

Function details: A bot that automatically updates tables. The bot takes input from the sites configured at User:SportsStatsBot/footyconfig and edits the templates directly. The only manual step, I believe, would be editing the bot's settings for relegations and promotions. It would also be possible to turn off any of the leagues the bot manages if, in some unusual event, the source updated incorrectly. (A rough sketch of the update loop follows.)
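A rough sketch of the update loop, assuming pywikibot; fetch_and_format() and load_config() are hypothetical helpers standing in for the scraper and the footyconfig parser:

    import time
    import pywikibot

    site = pywikibot.Site('en', 'wikipedia')

    def run_once(config):
        for league in config:
            if not league['enabled']:
                continue  # a league can be switched off if its source misbehaves
            table = fetch_and_format(league['source_url'])  # hypothetical scraper
            template = pywikibot.Page(site, 'Template:' + league['template'])
            if table and table != template.text:
                template.text = table
                template.save(summary='Bot: updating league table')

    while True:
        run_once(load_config())  # hypothetical: parses User:SportsStatsBot/footyconfig
        time.sleep(30 * 60)      # the bot checks every 30 minutes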

Discussion

  • This seems like a very worthwhile task for a bot - I can't see any policy reason for objection. Would the bot automatically shut off 30 minutes after the last game in the season has been played to avoid any errors in the edits regarding promotion/relegation? TheMagikCow (T) (C) 09:43, 18 March 2017 (UTC)
  • I'm not sure whether the page would have SEASON ENDED that I could look for in the html, but I believe it shan't edit if there have been no matches. Dat GuyTalkContribs 11:58, 18 March 2017 (UTC)
  • I'm not sure about looking in the html, but perhaps if the bot has not edited a season in - say 4 weeks - it would stop editing that template until it has been restarted with the updated information? I don't think it's a huge problem, though. TheMagikCow (T) (C) 19:59, 18 March 2017 (UTC)
  • Note to any BAG member reading this: I plan on expanding this to include infoboxes. Should I make a new bot account? Dat GuyTalkContribs 11:58, 18 March 2017 (UTC)
  • {{BAGAssistanceNeeded}} Comments? Dat GuyTalkContribs 18:30, 24 March 2017 (UTC)
    Seems pretty nifty! What templates will the bot be editing, and how many transclusions do they have? If we're talking just a handful of pages then it's no biggie, but if there are hundreds, thousands, then that changes everything. The config page will have to at least be semi-protected, and if the bot is going to affect a lot of pages we may have to consider moving it to .js so only the botop and admins can edit it. Another important thing for this task is to properly handle edit conflicts, otherwise you may overwrite someone else's changes. Your code doesn't seem to do this, but then again as you know I don't speak Python very well :) I would also recommend that this task be exclusion compliant, especially if we're going to be editing in the mainspace.
    Just so you know, I'm off to WMCON tomorrow and won't be back till 3 April, so pre-apologies if I'm not very responsive. No need to wait for me, other BAGgers feel free to take over MusikAnimal talk 04:59, 25 March 2017 (UTC)
    It'll edit the templates themselves. For example, it will edit Template:2016-17 Premier League table and not the pages it is transcluded on. It isn't currently added, but I could make it not edit if there's a conflict. Hope you have a great time in Berlin. Dat GuyTalkContribs 14:26, 25 March 2017 (UTC)
    That template is transcluded 27 times, so editing it will affect 27 pages. This is very important information. Please update the "Estimated number of pages affected" (keyword affected), and make note of all templates the bot will edit. Thank you! MusikAnimal talk 10:02, 26 March 2017 (UTC)
    Well, 53 pages would be a minimum since the bot is currently set up to edit Bundesliga and premier league. I've updated the "estimated number of pages affected." Is this ready for trial? Dat GuyTalkContribs 10:08, 26 March 2017 (UTC)
    I feel that semi-protection on the config page is advisable, to prevent the potential of vandalism. TheMagikCow (T) (C) 17:37, 26 March 2017 (UTC)
    I see 27 transclusions for {{2016–17 Premier League table}} and 26 for {{2016–17 Bundesliga table}}. Not sure if the pages that transclude them overlap, but if not we're up to 53 pages. It is really nice that you can simply add more templates and regex to the config, but the problem I see with that is that you'd need to somehow do testing first. I consider myself quite fluent with regex but I still wouldn't change it without doing a dry run with the bot. Obviously the room for error is much greater when you are not the bot operator, or very good with regex, so maybe a config isn't the best idea? What do you think? MusikAnimal talk 09:47, 29 March 2017 (UTC)

──────────────────────────────────────────────────────────────────────────────────────────────────── I could make an option called 'dryrun' which outputs the result of the dry run to a subpage such as User:SportsStatsBot/dry/[leaguename]x (x = Number of dryrun). Dat GuyTalkContribs 14:31, 29 March 2017 (UTC)

A dry config page and a dry output page sound like really useful testing tools that anyone could employ before passing the config to the live page. If you are willing to code those. —  HELLKNOWZ  ▎TALK 15:53, 29 March 2017 (UTC)
@Hellknowz: It's been implemented. Dat GuyTalkContribs 16:17, 30 March 2017 (UTC)
I like it a lot! Just need to make sure it is well documented how to do a dry run, and encourage it before instructing the bot to update the actual template. On to the next question – what's up with User:SportsStatsBot/nbaconfig? Are we planning on doing NBA as well? MusikAnimal talk 21:40, 5 April 2017 (UTC)
Planning is the key word. Currently, there's enough for the statistics, but I've had difficulty working out how to transition it onto the template and how to determine whether a team has clinched a playoff spot. Dat GuyTalkContribs 13:55, 6 April 2017 (UTC)

Approved for trial (dry run only, one per template). It seems for this bot the dry run functionality is important, so let's do a trial of that first. The other major component missing here is documentation – User:SportsStatsBot currently only states that the account is a bot, nothing more. It would be good to explain what the bot does, and for highly configurable bots like this you should also explain all the available config options, and also how to do a dry run, etc. MusikAnimal talk 01:36, 10 April 2017 (UTC)

Conclusions from mini-run for a week or so (please don't consider that a full trial):
  • I let it run while on vacation, not a very good idea.
  • Bot took content from the template, and put it in the dryrun page. This made the trial effectively useless.
  • I'll change the code so that if the page has not been created, it takes the content from the template page. If the page has already been created, it'll update from its own content, excluding the template.
  • Documentation has started
Thanks, Dat GuyTalkContribs 09:42, 23 April 2017 (UTC)
  • I suppose Trial complete. I tried to debug stuff on the way, and that's why it took such a long time. Everything works well aside from the function for determining when the source was updated. The diff system works by byte count, and the BBC page has "Last updated 13 hours ago" at the bottom. Every time there is a new digit, the bot takes it as a change. If anyone knows how to fix it, it will be appreciated. Dat GuyTalkContribs 10:09, 7 May 2017 (UTC)
    How about parsing the integer out of that text? So with regex grouping you could do Last updated (\d+), which will return the number as a string, which you can parse into an integer. If it is different than what you have stored, then the bot makes the updates. MusikAnimal talk 23:55, 13 May 2017 (UTC)
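    Taken literally, that suggestion is only a few lines (illustrative Python; the helper name is made up):

        import re

        def updated_marker(html):
            # Pull the number out of e.g. "Last updated 13 hours ago"; if it
            # differs from the stored value, re-parse and update the template.
            m = re.search(r'Last updated (\d+)', html)
            return int(m.group(1)) if m else None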
    {{OperatorAssistanceNeeded|D}} SQLQuery me! 04:02, 22 May 2017 (UTC)
    Impossible, since I use response.info()["Content-Length"]. It doesn't get the content of the page directly, but gets specific attributes about it. Also, I tried removing Last updated[^a]*ago with regex in a file, but the length is still different for some reason. I thought about using a module named difflib, but I've never used it before. Dat GuyTalkContribs 17:47, 22 May 2017 (UTC)
    I've changed a bit of the code. I believe it should work now. See [4]. Dat GuyTalkContribs 15:56, 14 June 2017 (UTC)
    Have you tried looking at the other headers? There is an ETag that I think will be updated when the content changes. You could keep track of that instead. Also, where is the dry run page? MusikAnimal talk 16:11, 16 June 2017 (UTC)
    ETag fails. Trying to get hold of kees08 on IRC. If we can't find a way, I'll have to find another site, since it seems like I've exhausted all the options. Dat GuyTalkContribs 10:38, 22 June 2017 (UTC)

──────────────────────────────────────────────────────────────────────────────────────────────────── I've made a pretty simple fix at [5] which should work for all normal template runs and most dry runs. Think it is time for maybe a live run. Dat GuyTalkContribs 21:32, 24 June 2017 (UTC)

I will keep fixing it up, but is there anything missing from the bot documentation page or the user page that I can add? Kees08 (Talk) 01:17, 7 July 2017 (UTC)
@DatGuy and Kees08: Sorry for the very, very long delay! I think we can move forward with a live trial. Let me first make sure I've got this right: Based on the config, the bot would be editing Template:2017 League of Ireland Premier Division table, Template:2016–17 Bundesliga table, Template:2016–17 Premier League table, correct? Next, I think we should put a notice up on these template pages saying they will be automatically updated by a bot (and link to the bot userpage). You might also write a note to the primary maintainers of those templates, so we don't catch them off-guard. They might be willing to help vet the data, too. Let me know when we've done these things, and we'll get a trial going :) MusikAnimal talk 17:18, 13 August 2017 (UTC)
@MusikAnimal: I'd suggest starting with only the Irish one, since it's much less popular than the Premier League (which started two days ago) and the Bundesliga (which hasn't started yet). Also, the BBC have changed their format, so it's either impossible or more difficult to adapt. I haven't tested it yet, since I've had some problems catching Kees due to our different timezones. Dat GuyTalkContribs 14:22, 14 August 2017 (UTC)
Very well then. I guess let me know when you've adapted the code to work with the new format. I saw the bot was still doing test runs and they looked OK (I think), which is why I was ready to start a trial. I would make sure your bot looks for the format it expects, and if it detects some other format, abort entirely rather than try to parse and potentially make incorrect edits. MusikAnimal talk 16:57, 16 August 2017 (UTC)
  • @Cyberpower678 and MusikAnimal: The Airtricity league ends on October 27. That's 7 'match days,' which are sometimes more than one day. That should be about 21 edits, if we start soon and do follow through until the end of the season (one long BRFA, eh?). That might be good for a trial? I don't think we should worry about the Premier League and Bundesliga because they won't be edited by the bot. If we do choose to do another trial of one of the aforementioned leagues, then we could use soccerway. If not, then I'm going to keep trying to catch up to Kees08 and figure out a way to maybe use BBC once more. Dat GuyTalkContribs 17:47, 31 August 2017 (UTC)
  • Approved for extended trial (until October 27). This is probably the longest trial I will ever approve. Let's get started and report your results when the trials end.—CYBERPOWER (Chat) 16:26, 3 September 2017 (UTC)

Yobot 58

Operator: Magioladitis (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 13:40, Saturday, August 19, 2017 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): AWB

Source code available: AWB

Function overview: Remove blank lines between list items.

Links to relevant discussions (where appropriate): Wikipedia talk:Manual of Style/Accessibility#Bot run for removing blank lines in list and Wikipedia:Bots/Requests for approval/BG19bot 9

Edit period(s): Weekly

Estimated number of pages affected: Initially ~120,000

Namespace(s): Articles

Exclusion compliant (Yes/No):

Function details: This is an accessibility issue; see WP:LISTGAP. A list will be generated monthly that includes articles with blank lines between list items. The list is generated using the Checkwiki software. AWB will then be run on the list with general fixes enabled. The latest AWB version added the ability to remove these blank lines. If there are any spaces or tabs on the blank lines, AWB won't fix them; these must be done manually (for now). (An illustrative version of the substitution follows.)
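An illustrative Python version of the substitution (AWB's actual general fix may differ in its details):

    import re

    def remove_list_gaps(text):
        # Collapse a blank line between two list items that use the same
        # marker (*, #, ; or :), per WP:LISTGAP. Blank lines containing
        # stray spaces or tabs do not match, mirroring the AWB limitation
        # noted above.
        return re.sub(r'^([*#:;])(.*)\n\n(?=\1)', r'\1\2\n', text,
                      flags=re.MULTILINE)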

Discussion

I would like to take over a task already approved for Bgwhite and BG19bot. -- Magioladitis (talk) 13:40, 19 August 2017 (UTC)

  • (Commenting as a community member, not BAG member, per my recusal) I strongly support this task. It does not violate COSMETICBOT due to the underlying accessibility issues. To explain this to those unfamiliar with it, currently these list items are read by screen readers as "Start list, List Item #1, End list, Start list, List Item #2, End list, ...". This change makes them read "Start list, List Item #1, List Item #2, ..., End list", which is the proper way for it to be read. Having said that, the regex underlying the particular general fixes relevant to this task should be loaded manually as find-and-replace in AWB and then "general fixes only" should be skipped. This will prevent cosmetic-only edits while still allowing all general fixes to run. ~ Rob13Talk 06:03, 20 August 2017 (UTC)
A couple of questions.
  • Is this new code or identical code from the bot being taken over?
  • Are you running genfixes with this?
CYBERPOWER (Chat) 15:27, 21 August 2017 (UTC)

Cyberpower678 I'll ask Bgwhite to send me his code. I am certain he will have no problem with that. I think Bgwhite was running general fixes too. -- Magioladitis (talk) 15:35, 21 August 2017 (UTC)

I see that your BRFA is almost a copy of his, so if the code is identical and the bot's operating parameters as well, I see no reason not to speedily approve this.—CYBERPOWER (Chat) 15:38, 21 August 2017 (UTC)
Cyberpower678 I would be happy if you do :) -- Magioladitis (talk) 18:15, 21 August 2017 (UTC)
Echoing Cyberpower678 here. If the code is the same and task has approval, I can see no reason not to speedily approve this. If the code is different, I think a trial may be necessary to iron out any issues before open editing. TheMagikCow (T) (C) 08:45, 22 August 2017 (UTC)
@Cyberpower678: I'm not in support of skipping trial for a 100,000+ edit job and have approved a short trial first below - assuming it goes well this should be able to move forward with a standard approval. — xaosflux Talk 12:14, 24 August 2017 (UTC)
Approved for trial (100 edits or 10 days). General community approval was established related to the other BRFA. Trial approved to demonstrate that the implementation of the process is free of technical issues. — xaosflux Talk 12:09, 24 August 2017 (UTC)
Please post link to difs here after running. — xaosflux Talk 12:14, 24 August 2017 (UTC)
  • A user has requested the attention of the operator. Once the operator has seen this message and replied, please deactivate this tag. (user notified) Any update on this?—CYBERPOWER (Message) 23:36, 18 September 2017 (UTC)

Starting soon. Magioladitis (talk) 07:27, 19 September 2017 (UTC)

CitationCleanerBot 2

Operator: Headbomb (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 13:43, Saturday, March 25, 2017 (UTC)

Automatic, Supervised, or Manual: Semi-automated during development, Automatic after

Programming language(s): AWB

Source code available: Upon request. Regex-based

Function overview: Convert bare identifiers to templated instances, applying AWB genfixes along the way (but skip if only cosmetic/minor genfixes are made). This will also have the benefits of standardizing appearance, as well as providing error-flagging and error-tracking. A list of common identifiers is available here, but others exist as well.

Links to relevant discussions (where appropriate): RFC, Wikipedia:Bots/Requests_for_approval/PrimeBOT_13. While the issue of unlinked/raw identifiers wasn't directly addressed, I know of no argument that an unlinked doi:10.1234/whatever is better than a linked one. If ISBNs/PMIDs/RFCs are to be linked (current behaviour) and templated (future behaviour), surely all the other ones should be linked as well.

I have notified the VP about this bot task, as well as other similar ones.

Edit period(s): Every month, after dumps

Estimated number of pages affected: ~5,400 for bare DOIs, probably comparable for the other similar identifiers (e.g. {{PMC}}), and much less for the 'uncommon' identifiers like {{MR}} or {{JFM}}. This will duplicate Wikipedia:Bots/Requests_for_approval/PrimeBOT_13 to a great extent. However, I will initially focus on non-magic words, while I believe PrimeBot_13 will focus on magic word conversions.

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: Because of the great number of identifiers out there, I'll be focusing on "uncommon" identifiers first (more or less defined as <100 instances of the bare identifier). I plan on semi-automatically running the bot while developing the regex, and only automating a run when the error rate for that identifier is 0, or errors are only due to unavoidable GIGO. For less popular identifiers, semi-automatic trialing might very well cover all instances. If no errors are found during the manual trial, I'll consider that code ready for automation in future runs. (An illustrative sketch of the kind of substitution involved appears after the list below.)

However, for the 'major' identifiers (doi, pmc, issn, bibcode, etc), I'd do, assuming BAG is fine with this, an automated trial (1 per 'major' identifier) because doing it all semi-automatically would just take way too much time. So more or less, I'm asking for

  • Indefinite trial (semi-automated mode) to cover 'less popular identifiers'
  • Normal trial (automated mode) to cover 'popular identifiers'
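For a flavour of the substitutions involved, a simplified Python sketch follows; these are not the bot's actual AWB rules, which must also dodge nowiki tags, comments, external links and ref name="..." attributes, and, as discussed below, reliably finding the end of a DOI string is the hard part:

    import re

    BARE_IDS = [
        # Naive end-of-DOI detection; the real corner cases are messier.
        (re.compile(r'\bdoi:\s*(10\.\d{4,9}/\S+?)(?=[\s<|}\]]|$)'), r'{{doi|\1}}'),
        (re.compile(r'\bJSTOR:?\s*(\d+)\b'), r'{{JSTOR|\1}}'),
        (re.compile(r'\bPMID:?\s*(\d+)\b'), r'{{PMID|\1}}'),
    ]

    def template_bare_identifiers(text):
        for pattern, repl in BARE_IDS:
            text = pattern.sub(repl, text)
        return text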

Discussion

@Primefac and Anomie:. Headbomb {talk / contribs / physics / books} 13:56, 25 March 2017 (UTC)

For cases where a particular ISBN/PMC/etc. should not be linked, for whatever reason, will this bot respect "nowiki" around the ISBN/PMC/etc. link? — Carl (CBM · talk) 18:39, 26 March 2017 (UTC)

Unless the "Ignore external/interwiki links, images, nowiki, math, and <!-- -->" option is malfunctioning, I don't see why it wouldn't respect nowiki tags. Headbomb {talk / contribs / physics / books} 18:58, 26 March 2017 (UTC)

This proposal looks like a good and useful idea. Thanks for taking the time to work on it! − Pintoch (talk) 11:14, 11 April 2017 (UTC)

{{BAGAssistanceNeeded}}

I'd rather this task be explicit as to its scope; "identifiers" is too vague. Can you specify exactly which identifiers this will cover? Additional identifiers can always be addressed under a new task as needed. — xaosflux Talk 01:38, 20 April 2017 (UTC)

Pretty much those at User:Headbomb/Sandbox. Focusing on the CS1/2 supported ones initially, then moving on to less common identifiers, if they are actually used in a "bare" format (e.g. a bare INIST:21945937 vs. a linked/templated one). Headbomb {t · c · p · b} 02:47, 20 April 2017 (UTC)
@Headbomb: Based on past issues with overly-broad bot tasks, I try to think about degrees of freedom when I look at a bot task. The more degrees of freedom we have, the harder it is to actually catch every issue. You're asking for a lot of degrees of freedom. We've got code that's never been run on-wiki before, edits being made on multiple different types of citation templates for each identifier, a mostly silent consensus, different types of trials being requested, and an unknown/unspecified number of identifiers being processed. It's probably not a great idea to try to accomplish all that in one approval. Would you be willing to restrict the scope of this approval to a relatively small number of identifiers so we can focus on testing the code and ensuring the community has no issues with this task? In looking at your list, I think a manageable list of identifiers would be as follows: doi, ISBN, ISSN, JSTOR, LCCN, OCLC, PMID. These are likely the identifiers with the most instances; I may have missed a couple other high-use ones that I'm less familiar with. We could handle the rest (including less-used identifiers) in a later approval or approvals. Your thoughts? ~ Rob13Talk 04:09, 3 June 2017 (UTC)

I'm asking for lots of freedom, yes, but in a modular and controlled fashion. I'm fine with restricting myself to the popular identifiers at first, but it will make development a bit more annoying/complicated, since the lesser-used identifiers are the hardest to test on a wider scale. If BAG is comfortable with a possibly slightly higher false-positive rate post-approval (a very marginal increase, basically until someone finds a false positive, if there are any), I'm fine with multiple BRFAs. The only thing I would change in that initial list is that I'd rather have arxiv, bibcode, citeseerX, doi, hdl, ISBN, ISSN, JSTOR, PMID, and PMCID. OCLC/LCCN could be more used than arxiv/bibcode/citeseerx/hdl/PMCID, but they usually appear on different types of articles, which will make troubleshooting a bit trickier. Headbomb {t · c · p · b} 19:18, 6 June 2017 (UTC)

Approved for trial (250 edits). The list you provided is fine. As soon as we get those sorted and approved, I'm happy to quickly handle future BRFAs, so it shouldn't be too time-consuming of a process for you. Roughly 25 edits per identifier you listed above. Please update your task details to reflect the restricted list of identifiers before running the trial. ~ Rob13Talk 19:51, 6 June 2017 (UTC)
{{OperatorAssistanceNeeded}} Any update on this trial? — xaosflux Talk 00:41, 19 June 2017 (UTC)
Still working on the code. I can't nail the DOI part, because I haven't yet found a reliable way to detect the end of a doi string, and I've been focusing on that, rather fruitlessly, since it's the hard part of the bot. I've asked for help with that at the VP. The other identifiers are pretty easy to do, so I'll be working on those shortly. Worst case, I'll exclude DOIs from bot runs and do them semi-automatically. Headbomb {t · c · p · b} 15:49, 19 June 2017 (UTC)
  • [6] 24 edits from the ISSN trial. No issues to report. Headbomb {t · c · p · b} 18:30, 19 June 2017 (UTC)
  • [7] 25 edits from the DOI trial.
    • [8] missed [9]. While I'm planning on taking care of those down the line, right now my brain is a bit fried from all the other corner cases I've dodged. Headbomb {t · c · p · b} 21:39, 19 June 2017 (UTC)
  • [10] 25 edits for the JSTOR trial.
    I do not think that instance of GIGO is a problem; replacing an incorrect mention of JSTOR with a broken template makes it easier to detect the issue. Jo-Jo Eumerus (talk, contributions) 15:35, 22 June 2017 (UTC)
  • [14] 25 edits from the OCLC trial
    • [15], [16] didn't touch an OCLC (filtering issues)
    • [17] could be better, in the sense that it could make use of |oclc=, but that's what CitationCleanerBot 1 would do
    • [18] touched a DOI, because the OCLC was in an external link which the bot is set to avoid. I plan on doing those manually.
    • [19] shouldn't be done, I've yet to find a good solution for this however. (Follow up: This is now fixed most of the time. Corner cases such as <ref name="BARKER-OCLC013456"/> will remain, but they are exceedingly rare). Headbomb {t · c · p · b} 12:43, 24 July 2017 (UTC)
  • [20] 4 from the PMID/PMC trial. I've tested this substantially on my main account, without issues, save for the same corner case as OCLC, which is a bit more common for PMIDs/PMCs than OCLCs, but I've cleaned most of them up manually and very few remain. PMIDs/PMCs are now getting hard to test because very few remain. During my testing, I found that plain PMC<digits> is problematic on its own, as many things other than PMCIDs share that format. PMCID: PMC<digits> is safe and problem-free, as are things like [[Pubmed Center|PMC]]:0123456. I plan to exclude plain PMC<digits> from the bot and do those manually instead, and only take care of the safe ones via bot. Headbomb {t · c · p · b} 01:52, 27 July 2017 (UTC)
  • [21] from the Zbl trial.
    • [22] could be better, but didn't break anything.
    • [23] is GIGO, but again the bot didn't break anything.
    • [24] missed [25], but that's fine.
    • [26] and [27] are borked but have now been fixed.
  • [28] from the JFM trial.
    • [29] is borked, but I took care of it with a comment.
    • [30] is GIGO, but the bot didn't break anything.

Headbomb {t · c · p · b} 17:13, 27 July 2017 (UTC)

Unsafe for an automated bot (at least with my coding skills)

  • MR / LCCN / plain PMC<digits>
  • Deferring to CitationCleanerBot 3: arxiv/bibcode/citeseerx/hdl

Headbomb {t · c · p · b} 02:03, 27 July 2017 (UTC)

{{BAGAssistanceNeeded}} I believe I'm ready for an extended trial, for doi, ISBN, ISSN, JFM, JSTOR, OCLC, PMID, PMCID, and Zbl. Headbomb {t · c · p · b} 17:22, 27 July 2017 (UTC)

Some comments:
  1. Re: [31] (mentioned above), I see this one down the line. Is that something the bot needs? Ideally the bot simply avoids JFM-tagging anything that's not within <ref>, cite, etc., as that's where it's probably going to be operating 99% of the time (e.g., you almost certainly won't encounter "And so it was said in JFM (id) that..." in the middle of normal wikitext, in a paragraph block). It seems odd and out of place to have to stick comments like that in the source otherwise.
  2. Re: GIGO as a whole / and/or this one — is there an easy way to validate these? Like either via their identifier format, an API to hit, or something? Or also just excluding anything you're not certain meets the format? Like it seems unlikely a date is the identifier, or even more generally, anything with slashes for jstor. It might help to avoid false positives / making things worse.
  3. Re: [32] (and other issues related to parsing), it might be safer to parse the source independently as html/loose xml and iterate through it that way. Ref tags are fairly predictable as far as attributes go, so your bot should definitely not apply a cleanup within a "name" attribute (for example), while it should feel safer applying a cleanup knowing it's in the tag content. That should take care of almost all instances where you'd otherwise risk breaking ref tags, which is where the bot is most likely going to be operating. It would therefore be able to be healthily and confidently suspicious when it's attempting to modify something outside of a ref tag.
--slakrtalk / 04:58, 4 August 2017 (UTC)
1. There's no real way of telling AWB to only look within ref tag citations, and that would miss 'further reading' and 'manual refs' bibliography sections, which are often the ones most in need of such bot maintenance. From database scans, that 100.4 Jazz FM article is the only article in need of that comment. This is both so I don't pick it up in database scans in the future, and so the bot doesn't touch it. Every other instance of JFM(:| )\d that does not refer to a JFM identifier can be bypassed by checking for \wJFM.
2. Validation could be done at Help:CS1. It's a long-term project of mine, but validation helps when the identifier structure is known/well defined. I'm not saying those identifiers don't have a well-defined structure, but JFM is a defunct German identifier, and JSTOR can have DOIs as identifiers, which can have slashes in them. I could restrict the bot to purely numerical JSTORs, but in GIGO situations, the crap output often serves to flag the issue.
Actually the formats for JfM (\d{2}\.\d{4}\.\d{2}) and Zbl (\d{4}\.\d{5}) are well-defined. I can do the bad JfM/Zbl identifiers manually. I've updated the code, but since no instances remain, it can't really be tested. But it works in the sandbox [33]. A pre-check along these lines is sketched after this reply. Headbomb {t · c · p · b} 20:09, 4 August 2017 (UTC)
3. I certainly wish there were an easy way to tell the bot not to touch ref name tags. I've bypassed most instances with creative regex, but there's no easy way to avoid them in general with AWB.
Headbomb {t · c · p · b} 12:16, 4 August 2017 (UTC)
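For instance, those two shapes can be pre-checked before templating (illustrative Python; anything that fails is left for manual handling):

    import re

    JFM_RE = re.compile(r'\d{2}\.\d{4}\.\d{2}')
    ZBL_RE = re.compile(r'\d{4}\.\d{5}')

    def looks_valid(identifier, kind):
        # Reject anything that does not match its documented shape, so the
        # bot only templates well-formed JfM/Zbl identifiers.
        pattern = JFM_RE if kind == 'jfm' else ZBL_RE
        return pattern.fullmatch(identifier) is not None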
@Headbomb: For number 3, try the following regex: (?<!\<\s*ref name\s*\=\s*"[^\>]*) . That's a negative lookbehind that doesn't handle the edit if the replacement would occur after the string <ref name=" but before the tag was closed out. ~ Rob13Talk 09:46, 6 August 2017 (UTC)
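A quick way to test that guard in Python is the third-party regex module, since the stdlib re rejects variable-width lookbehind (AWB's .NET engine supports it natively); the OCLC pattern and sample strings are illustrative:

    import regex  # third-party; stdlib re only allows fixed-width lookbehind

    GUARD = r'(?<!<\s*ref name\s*=\s*"[^>]*)'
    pattern = regex.compile(GUARD + r'OCLC\s*(\d+)')

    print(pattern.search('<ref name="BARKER-OCLC013456"/>'))  # None: inside the attribute
    print(pattern.search('OCLC 13456'))                       # matches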
I can try that. I'll test it manually a few times, and then I'd like to proceed to bot trial phase 2. Headbomb {t · c · p · b} 17:11, 15 August 2017 (UTC)
Are you ready for a bot trial?—CYBERPOWER (Around) 06:56, 20 August 2017 (UTC)
I am yes. Headbomb {t · c · p · b} 19:03, 20 August 2017 (UTC)

Phase 2

  • Approved for extended trial (500 edits).—CYBERPOWER (Chat) 15:37, 21 August 2017 (UTC)
    @Cyberpower678: 500000000000000000000000? That's almost a mole of edits. :P --slakrtalk / 02:45, 22 August 2017 (UTC)
    Argh, my 0 key got stuck and I didn't even notice. :p—CYBERPOWER (Chat) 07:18, 22 August 2017 (UTC)
  • A user has requested the attention of the operator. Once the operator has seen this message and replied, please deactivate this tag. (user notified) Any update on this?—CYBERPOWER (Message) 23:36, 18 September 2017 (UTC)
The User:Bibcode Bot revival took a bit of my time recently, as have improvements to User:JL-Bot and User:JCW-CleanerBot for WP:JCW/WP:MCW. But I should be able to give CitationCleanerBot 2 some love in a week or two. It's just down on my list of priorities. Headbomb {t · c · p · b} 00:19, 19 September 2017 (UTC)
Ok. Take your time. I'll revisit in 2 weeks. :-)—CYBERPOWER (Message) 00:43, 19 September 2017 (UTC)

Bots that have completed the trial period

Approved requests

Bots that have been approved for operations after a successful BRFA will be listed here for informational purposes. No other approval action is required for these bots. Recently approved requests can be found here, while old requests can be found in the archives.


Denied requests

Bots that have been denied for operations will be listed here for informational purposes for at least 7 days before being archived. No other action is required for these bots. Older requests can be found in the Archive.

Expired/withdrawn requests

These requests have either expired, as information requested of the operator was not provided, or been withdrawn. These tasks are not authorized to run, but such lack of authorization does not necessarily follow from a finding as to merit. A bot that, having been approved for testing, was not tested by an editor, or one for which the results of testing were not posted, for example, would appear here. Bot requests should not be placed here if there is an active discussion ongoing above. Operators whose requests have expired may reactivate their requests at any time. The following list shows recent requests (if any) that have expired, listed here for informational purposes for at least 7 days before being archived. Older requests can be found in the respective archives: Expired, Withdrawn.