Wikipedia:Bots/Requests for approval/BHGbot 9: Difference between revisions

Browse history interactively

← Previous edit Next edit →

Content deleted Content added

VisualWikitext

Inline

Revision as of 23:44, 1 November 2021

BHGbot 9

New to bots on Wikipedia? Read these primers!

Approval process – How this discussion works
Overview/Policy – What bots are/What they can (or can't) do
Dictionary – Explains bot-related jargon

Operator: BrownHairedGirl (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 00:21, Thursday, August 19, 2021 (UTC)

Function overview: Remove the banner tag {{Cleanup bare URLs}} from articles which no longer have any WP:Bare URLs.

Automatic, Supervised, or Manual: Automatic

Programming language(s): AWB module (C#)

Source code available: ~~will be published once written, and before any trial run. I don't want to spend time coding it unless there is support in principle for this task.~~ Wikipedia:Bots/Requests for approval/BHGbot 9/AWB module

Links to relevant discussions (where appropriate):

Edit period(s): Initial run to clear the backlog, then about weekly

Estimated number of pages affected: Initial run ~1,650 pages. Thereafter a rough guesstimate of ~50 pages per week. updated estimate 20:20, 18 October 2021 (UTC): 1424 pages

Namespace(s): Article, Draft.

Exclusion compliant (Yes/No): Yes

Function details: initial article list to consist of all main- and draft-space transclusions of {{Cleanup bare URLs}}. With each page:

check that the page contains the banner template {{Cleanup bare URLs}}, or one of its many aliases. If not, skip the page.
count the number of {{Bare URL inline}} tags in the page, including aliases
count the number of untagged bare URL refs in the page, i.e. those which match the regex <ref[^>]*?>\s*\[?\s*https?:[^>< \|\[\]]+\s*\]?\s*<\s*/\s*ref
if the total matches of step 2 + step 3 is greater than zero, then skip the page
Optional check for bare URLs not in ref tags:
- Test for existence on the page of any other URLs which are not:
  - wrapped in a {{cite}} tag, or
  - wrapped in {{URL}}, or
  - formatted as [http://www.example.com/foo some-non-space-characters], or
  - the value of a |website=http://www.example.com/foo parameter in any infobox
- if any such URLs exist, then skip the page
remove the banner {{Cleanup bare URLs}}, and save the page with AWB genfixes, using an edit summary of the form
- WP:BHGbot 9: removed {{Cleanup bare URLs}}. This page currently has no bare URLs

Note 1

Step 5 (check for bare URLs not in ref tags) is based on the discussion at User talk:Citation bot/Archive 26#Cleanup tag not removed after problem fixed, where both @AManWithNoPlan and @Headbomb advocated retaining the banner tag if there are any bare URLs anywhere on the page.
I think that this approach is overly cautious, because in practice the {{Cleanup bare URLs}} tag seems to be used overwhelmingly for bare URLs within ref tags. However, I am happy to include this step unless there is consensus to omit it.

Note 2

My estimate of ~1,650 pages in the initial run is based on comparing the 7,362 pages currently transcluding {{Cleanup bare URLs}} with a scan of the 17 August database dump which found 459,013 pages with 1 or more bare URLs in ref tags. That comparison found 1,665 pages transcluding {{Cleanup bare URLs}} but without bare URLs.
If the bot is set to skip pages with bare URLs not in ref tags, the initial run will be significantly less than 1,650 pages, but until I run the bot in pre-parse mode I won't know how much less.

Note 3

Coding the AWB module is not complicated, but testing it and debugging it without a proper development environment is very slow. So I don't want to put in a few hours work without having first checked that the task has approval in principle.

Discussion

Regarding step 2 & step 3, what if you've got a bare URL tag but the URL is no longer bare, and elsewhere you've got an untagged bare URL? Your bot would skip this if it's only relying on counts? Ditto if there's an inline tag for a URL that's no longer bare? ProcrastinatingReader (talk) 00:34, 19 August 2021 (UTC)[reply]

@ProcrastinatingReader: thanks for that observation. I hadn't factored in the case of a ref which has been fixed, but the {{Bare URL inline}} tag has not ben removed. I think that such cases will be rare, and that it will be even more rare to have that oddity and a banner tag {{Cleanup bare URLs}} (without which this bot will reject the page in step 1).

If you like, I can add extra check for such misplaced {{Bare URL inline}} tags, but I would prefer not to do so, simply to avoid adding extra complexity to accommodate a very rare case whose consequence would be a mistaken skip rather than the more serious matter of a mistaken removal. --BrownHairedGirl (talk) • (contribs) 00:57, 19 August 2021 (UTC)[reply]

PS I just ran https://petscan.wmflabs.org/?psid=19858257 to check for main- and draft-space pages which transclude both {{Cleanup bare URLs}} and {{Bare URL inline}}: total 12 pages.

I checked them all for the case you described, and found only one kindof match, on List of gangs in New Zealand. An IP had wrong added[1] {{Bare URL inline}} after </ref>, instead of the correct placement before it. Then reFill filled a bunch of refs,[2] but didn't remove {{Bare URL inline}} because it was not inside the ref tags. I have now fixed[3] that page. --BrownHairedGirl (talk) • (contribs) 01:27, 19 August 2021 (UTC)[reply]

Regarding step 5 and note 1, I don't see why step 5 is necessary. Is there an example of a page with such a URL (of the 'non-ref bare URL' variety) so I can see a valid use case? ProcrastinatingReader (talk) 00:42, 19 August 2021 (UTC)[reply]

@ProcrastinatingReader: Thanks again. I have identified no such cases. I added Step 5 solely out of respect for the objections already made by the two highly experienced and technically skilled editors who raised the issue at User talk:Citation bot/Archive 26#Cleanup tag not removed after problem fixed. I can't see the use cases myself, but I have high regard for their judgement, which is why I am willing to accommodate their concerns unless there is consensus to proceed without Step 5.

Maybe @AManWithNoPlan and/or @Headbomb could comment here? --BrownHairedGirl (talk) • (contribs) 01:04, 19 August 2021 (UTC)[reply]

For bare URLS without ref tags, here are some basic example

According to a report published at at http://www.example.com, 63% of statistics are made up.

==References==
* http://www.example.com

==External links==
* http://www.example.com

Headbomb {t · c · p · b} 01:19, 19 August 2021 (UTC)[reply]

@Headbomb: I get the situation, which is more common in older articles (before incline cites became strongly preferred in ~2007), but is there any evidence that {{Cleanup bare URLs}} is actually used to tag such issues? --BrownHairedGirl (talk) • (contribs) 01:30, 19 August 2021 (UTC)[reply]

Pretty sure that in the several thousand of articles with such bare urls, at least one was tagged with {{Cleanup bare URLs}}. Headbomb {t · c · p · b} 02:02, 19 August 2021 (UTC)[reply]

For example Ciudad del Carmen or Duncan Sandy, which you've yourself tagged with {{Cleanup bare URLs}}. Headbomb {t · c · p · b} 02:09, 19 August 2021 (UTC)[reply]

@Headbomb: in each case I applied the tags because the page had bare inline refs. That was my sole selection criteria. I didn't even glance at the external links.

Are you telling me that having applied the tags for that reason, I can't remove them when that problem is resolved? --BrownHairedGirl (talk) • (contribs) 02:23, 19 August 2021 (UTC)[reply]

When the problem is resolved, yes. But you've asked for cases where step 5 would be necessary, and those are two examples with {{Cleanup bare URLs}} and non-ref bare URLs. Headbomb {t · c · p · b} 02:35, 19 August 2021 (UTC)[reply]

@Headbomb: I fear that we may be talking past each other.

So, just to clarify, my AWB job added the cleanup banner only to pages with bare URLs inside <ref></ref> tags. Same with the hundreds which I have since added manually as I follow around after Citation bot's processing of the lists which I feed it. AIUI, User:GreenC bot/Job 16 also selects only pages with bare URLs inside <ref></ref> tags.

It seems to me that you are saying that the tags should not be removed after the resolution of the the problem which caused their addition, because there is another unresolved issue to which the tag might have been addressed if it was applied by someone else using different criteria, even tho you have not identified any instance of such usage. Is that what you intend? --BrownHairedGirl (talk) • (contribs) 02:58, 19 August 2021 (UTC)[reply]

You may have added {{Cleanup bare URLs}} to pages with bare URLs in refs tags, but the criteria for the removal of {{Cleanup bare URLs}} is the cleanup of all bare urls, not just those in ref tags. Headbomb {t · c · p · b} 06:10, 19 August 2021 (UTC)[reply]

@Headbomb: I can see the logic in that approach, but I think it's too rigid. It will leave a lot of pages inappropriately stuck with the tag because of some external links, which are much less significant than refs.

Let's see what others think. --BrownHairedGirl (talk) • (contribs) 06:47, 19 August 2021 (UTC)[reply]

I dunno... the text of {{Cleanup bare URLs}} and its documentation look like the template is just for bare URLs in references. I wouldn't think we should care too much about other URLs, so I agree with BHG & proc that step 5 would be unnecessary. Enterprisey (talk!) 07:17, 19 August 2021 (UTC)[reply]

Disagree there. The template isn't just for bare URL in ref tags. For example, a reference section with a non-ref tag'd bare external link. Or further reading sections. Those too should be converted to full citations. Or inline external link used as a reference. Likewise, for external links, it's a very high probability that templates like {{Official}} need to be used. It covers all bare urls. Headbomb {t · c · p · b} 07:24, 19 August 2021 (UTC)[reply]

It seems to me that @Enterprisey's view is better supported by the documentation at {{Cleanup bare URLs}}. --BrownHairedGirl (talk) • (contribs) 15:35, 19 August 2021 (UTC)[reply]

there is zilch in the documentation saying that this template is only for bare URL in ref tags. Headbomb {t · c · p · b} 16:15, 19 August 2021 (UTC)[reply]

Since bare URLs seem to be about link rot with regards to citations (WP:BAREURLS), I'm not sure the links in the "External links" section, which often just describe the page or site name, are really covered. But there may be better venues to have this discussion if we can't come to a consensus here. ProcrastinatingReader (talk) 16:20, 19 August 2021 (UTC)[reply]

Barring that, we can just proceed with the automated task with step 5 and see where that gets us. ProcrastinatingReader (talk) 16:22, 19 August 2021 (UTC)[reply]

@ProcrastinatingReader: as I noted in the proposal, I am happy to proceed with step 5 included. It's not my first choice, but better than no cleanup.

If there is some consensus elsewhere to omit step 5, then it will be trivial matter to disable step 5, subject to BRFA approval.

@Headbomb and Enterprisey: are you happy to proceed on that basis? --BrownHairedGirl (talk) • (contribs) 16:32, 19 August 2021 (UTC)[reply]

Possible trial. I may be getting ahead of things here, but if BAG is minded to consider authorising this task with Step 5 included, please can I ask that we start with a trial and go through a few iterations?
If step 5 is involved, it would be very helpful to have multiple sets of eyes scrutinising test cases for false positives and false negatives in the check for bare links elsewhere the page. --BrownHairedGirl (talk) • (contribs) 20:44, 19 August 2021 (UTC)[reply]
- Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Please make sure at least 20 of these 50 edits include a 'step 5' skip. ProcrastinatingReader (talk) 21:06, 19 August 2021 (UTC)[reply]
  @BrownHairedGirl: Can I gently follow up on this BRFA? Do you still plan to go ahead with it? ProcrastinatingReader (talk) 10:10, 18 October 2021 (UTC)[reply]
  @ProcrastinatingReader: Thanks for the nudge ... and for being gentle about it after such a long delay.
  I have been noticing in the last week or so that there is a lot of work for this bot job to do, so I need to get back to work on it. --BrownHairedGirl (talk) • (contribs) 10:56, 18 October 2021 (UTC)[reply]
Trial complete.

@ProcrastinatingReader: I have completed the trial run of 50 edits: see contribs list. Note that there are 51 edits, because #44 is a revert.

The source code is published at Wikipedia:Bots/Requests for approval/BHGbot 9/AWB module.

To test the bot, I made a list of articles transcluding {{Cleanup bare URLs}} which had been edited by Citation bot in one of its last 25k edits, because that concentrates pages likely to have had bare URLs fixed.

For an annotated list of the pages scanned, see Wikipedia:Bots/Requests for approval/BHGbot 9/Article list for trial run 01.

Note that there was one false positive: this edit[4] to Treasurer of the Household. (#46 in the contribs list, #8 in the annotated list of pages scanned). It should have failed the Step5 check, but didn't.

I tracked that problem down to an error in line 47 of the code: I had omitted the "\s+" in this regex: string nonBareURLMatcher = @"\[\s*https?://[^>< \|\[\]]+\s+[^\]]+\]";.

After that bug was fixed, there were 44 further edit, with no further false positives.

I have not yet checked the skipped pages to look for false negatives. --BrownHairedGirl (talk) • (contribs) 14:48, 18 October 2021 (UTC)[reply]

@Headbomb and Enterprisey: I would value your scrutiny of the trial, if you have time. --BrownHairedGirl (talk) • (contribs) 14:51, 18 October 2021 (UTC)[reply]

PS @ProcrastinatingReader asked me to please make sure at least 20 of these 50 edits include a 'step 5' skip.

Maybe I have misunderstood that request, but it seems to me to be self-contradictory: if a page is skipped at step5 (or any other step), it will not be edited.

I assume that the spirit of PR's request was that we should be able to check that Step5 was skipping where needed, so I devised another way of checking that. I hacked the module so that it saves a page which failed step5, but skips everything else: see Wikipedia:Bots/Requests for approval/BHGbot 9/Step5 checker.

I ran that checker in pre-parse mode on the entire set of 235 pages in WP:Bots/Requests for approval/BHGbot 9/Article list for trial run 01.

That found the following four pages with no bare URL inline refs, but which failed Step5:

PR, does that satisfy your concerns? --BrownHairedGirl (talk) • (contribs) 17:17, 18 October 2021 (UTC)[reply]

It was in August so I can't remember exactly what I was thinking, but I think the spirit of that part was to test to make sure step 5 is working. What you've done works.

I'd prefer to review the BRFA all at once, so (since they're pinged) I'll wait a bit for Enterprisey and Headbomb to comment, if they want, before reviewing. ProcrastinatingReader (talk) 16:08, 19 October 2021 (UTC)[reply]

I have left a note[5] for Headbomb on their talk. BrownHairedGirl (talk) • (contribs) 12:15, 21 October 2021 (UTC)[reply]

Updated estimate of number of pages affected I just ran the module in pre-parse mode on all the 7,920 article- and draft-space pages which transclude {{Cleanup bare URLs}}. That produced a total of 1,424 pages from which {{Cleanup bare URLs}} should be removed, listed at WP:Bots/Requests for approval/BHGbot 9/Pre-parsed list for first run. --BrownHairedGirl (talk) • (contribs) 20:17, 18 October 2021 (UTC)[reply]

{{BAGAssistanceNeeded}} @ProcrastinatingReader: the trial was completed 8 days ago. It would be great to have this reviewed, because I would like to get on with removing {{Cleanup bare URLs}} from the near-20% of pages where it is now superfluous. --BrownHairedGirl (talk) • (contribs) 15:09, 26 October 2021 (UTC)[reply]

For [6] isn't the IMDb one technically a bare URL? Seems bot was confused due to the {{better source needed}} being within the ref tags. Similar at [7], where FN23 is malformed, although arguably GIGO. At [8] the ref is kinda a bare URL? (but it would be difficult for a bot to account for) ProcrastinatingReader (talk) 14:35, 29 October 2021 (UTC)[reply]

Thanks for the review, @ProcrastinatingReader. I'll take those points in order, but first please note the first line of WP:are URLs:A bare URL is a URL cited as a reference for some information in an article without any accompanying information about the linked page. As noted in the initial proposal, I coded that as: those which match the regex <ref[^>]*?>\s*\[?\s*https?:[^>< \|\[\]]+\s*\]?\s*<\s*/\s*ref. That regex has worked succesfuly in all three cases.

the IMBD ref in [9] does not fit the technical definition at WP:Bare URLs: a ref which displays no other info about the linked page. The most common situation where a tag makes the ref "not bare" is a tagged dead link (e.g. <ref>http://example.com/foobar {{dead link}}</ref>), and in that case I think it is right to treat it as "not bare", because the only available extra info is that is dead.
In this case, the extra info is that this source should not be used, so again I think it is right to treat it as "not bare", because the fix needed is to find a better source not to fill this ref.
[10] fN 23 <ref>{{Cite web|url=https://www.hattrick.co.uk/Show/Small_Potatoes|title = //www.hattrick.co.uk/Show/Small_Potatoes}}</ref> is not in any sense a bare URL ref. It is a filled cite template, albeit filled wrongly.
[11] <ref>[http://www.cambridge.gov.uk/public/councillors/agenda/2005/0119plan_files/4_1.pdf cambridge.gov.uk] {{webarchive |url=https://web.archive.org/web/20070927171940/http://www.cambridge.gov.uk/public/councillors/agenda/2005/0119plan_files/4_1.pdf |date=27 September 2007 }}</ref> also does not in anyway the fit the definition at WP:Bare URLs. It is filled with lots of stuff, albeit crudely.

It is now two weeks since the trial was completed, and I would very much like to get the bot running. In the last ten days, someone has chased down and manually removed a few hundred superfluous {{Cleanup bare URLs}} tags. I think it is a great pity that someone is putting hours of their time to do a task for which a bot is coded and tested, and I doubt that any manual process is doing it with as high accuracy. --BrownHairedGirl (talk) • (contribs) 23:43, 1 November 2021 (UTC)[reply]

@@ Line 142: / Line 142: @@
 {{t|BAGAssistanceNeeded}} @[[User:ProcrastinatingReader|ProcrastinatingReader]]: the trial was completed 8 days ago. It would be great to have this reviewed, because I would like to get on with removing {{tl|Cleanup bare URLs}} from the near-20% of pages where it is now superfluous. --[[User:BrownHairedGirl|<span style="font-variant:small-caps"><span style="color:#663200;">Brown</span>HairedGirl</span>]] <small>[[User talk:BrownHairedGirl|(talk)]] • ([[Special:Contributions/BrownHairedGirl|contribs]])</small> 15:09, 26 October 2021 (UTC)
 :For [https://en.wikipedia.org/w/index.php?title=World_War_II:_When_Lions_Roared&diff=1050542495&oldid=1049265836&diffmode=source] isn't the IMDb one technically a bare URL? Seems bot was confused due to the {{t|better source needed}} being within the ref tags. Similar at [https://en.wikipedia.org/w/index.php?title=Tommy_Tiernan&diff=1050541311&oldid=1049325563&diffmode=source], where FN23 is malformed, although arguably GIGO. At [https://en.wikipedia.org/w/index.php?title=University_Pitt_Club&diff=prev&oldid=1050541997&diffmode=source#cite_ref-16] the ref is kinda a bare URL? (but it would be difficult for a bot to account for) [[User:ProcrastinatingReader|ProcrastinatingReader]] ([[User talk:ProcrastinatingReader|talk]]) 14:35, 29 October 2021 (UTC)
+::Thanks for the review, @[[User:ProcrastinatingReader|ProcrastinatingReader]].  I'll take those points in order, but first please note the first line of [[WP:are URLs]]:{{tq|A bare URL is a URL cited as a reference for some information in an article without any accompanying information about the linked page|q=y}}.  As noted in the initial proposal, I coded that as: those which match the [[regex]] {{nowrap|<code><nowiki>&lt;ref[^&gt;]*?>\s*\[?\s*https?:[^&gt;&lt; \|\[\]]+\s*\]?\s*&lt;\s*/\s*ref</nowiki></code>}}. That regex has worked succesfuly in all three cases.
+::# the IMBD ref in [https://en.wikipedia.org/w/index.php?title=World_War_II:_When_Lions_Roared&diff=1050542495&oldid=1049265836&diffmode=source] does not fit the technical definition at [[WP:Bare URLs]]: a ref which displays no other info about the linked page.  The most common situation where a tag makes the ref "not bare" is a tagged dead link (e.g. <code><nowiki><ref>http://example.com/foobar {{dead link}}</ref></nowiki></code>), and in that case I think it is right to treat it as "not bare", because the only available extra info is that is dead.<br />In this case, the extra info is that this source should not be used, so again I think it is right to treat it as "not bare", because the fix needed is to find a better source not to fill this ref.
+::# [https://en.wikipedia.org/w/index.php?title=Tommy_Tiernan&diff=1050541311&oldid=1049325563&diffmode=source] fN 23 <code><nowiki><ref>{{Cite web|url=https://www.hattrick.co.uk/Show/Small_Potatoes|title = //www.hattrick.co.uk/Show/Small_Potatoes}}</ref></nowiki></code> is not in any sense a bare URL ref.  It is a filled cite template, albeit filled wrongly.
+::#[https://en.wikipedia.org/w/index.php?title=University_Pitt_Club&diff=prev&oldid=1050541997&diffmode=source#cite_ref-16] <code><nowiki><ref>[http://www.cambridge.gov.uk/public/councillors/agenda/2005/0119plan_files/4_1.pdf cambridge.gov.uk] {{webarchive |url=https://web.archive.org/web/20070927171940/http://www.cambridge.gov.uk/public/councillors/agenda/2005/0119plan_files/4_1.pdf |date=27 September 2007 }}</ref></nowiki></code> also does not in anyway the fit the definition at [[WP:Bare URLs]].  It is filled with lots of stuff, albeit crudely.
+::It is now two weeks since the trial was completed, and I would very much like to get the bot running. In the last ten days, someone has chased down and manually removed a few hundred superfluous {{tl|Cleanup bare URLs}} tags.  I think it is a great pity that someone is putting hours of their time to do a task for which a bot is coded and tested, and I doubt that any manual process is doing it with as high accuracy. --[[User:BrownHairedGirl|<span style="font-variant:small-caps"><span style="color:#663200;">Brown</span>HairedGirl</span>]] <small>[[User talk:BrownHairedGirl|(talk)]] • ([[Special:Contributions/BrownHairedGirl|contribs]])</small> 23:43, 1 November 2021 (UTC)