Wikipedia:Bots/Requests for approval/Merge bot 2
Operator: Wbm1058 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 01:08, Saturday, January 28, 2017 (UTC)
Automatic, Supervised, or Manual: Automatic
Programming language(s): PHP
Source code available:
Function overview: History-merge categories which were moved by User:Cydebot between April 2006 and March 2015
Links to relevant discussions (where appropriate): Wikipedia:Bot requests#Bot for category history merges
Edit period(s): One time run, but the process will be run 3 times or more as needed to clear the work queue
Estimated number of pages affected: over 87,000
Exclusion compliant (Yes/No): No
Adminbot (Yes/No): Yes (needs admin flag set)
Already has a bot flag (Yes/No): Yes
Function details: Use API:Usercontribs to select all Cydebot category-space new page creations from March 2015 back to 2006. This is the beginning of the selection set. In developing and testing the bot, I've already processed the first four items on the list. For each item, check whether the page shown in Cydebot's edit history has mergeable, deleted revisions and is a different page than the page Cydebot created. If so: (1) undelete the page, (2) merge the appropriate history to the current category title, (3) delete the page, and (4) revision-delete the edit summary of Cydebot's move on the destination page, as it's no longer needed for attribution and it prevents the listed users from vanishing.
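As a rough illustration of the selection step, the query parameters for API:Usercontribs might look like the following. This is only a sketch, not the bot's actual PHP code: the function name and the exact start and end timestamps are assumptions chosen to cover "March 2015 back to 2006", and 14 is the namespace number for Category.

```python
from typing import Dict


def cydebot_creations_query(start: str = "2015-03-31T23:59:59Z",
                            end: str = "2006-04-01T00:00:00Z") -> Dict[str, str]:
    """Build list=usercontribs parameters for Cydebot's category page creations.

    The default timestamps are illustrative assumptions, not values from the request.
    """
    return {
        "action": "query",
        "format": "json",
        "list": "usercontribs",
        "ucuser": "Cydebot",
        "ucnamespace": "14",         # Category namespace
        "ucshow": "new",             # page creations only
        "ucprop": "title|timestamp|comment",
        "ucdir": "older",            # walk from ucstart back in time to ucend
        "ucstart": start,
        "ucend": end,
        "uclimit": "max",
    }
```

With `ucdir=older`, `ucstart` has to be the later of the two timestamps, which matches working backwards from 2015.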
Function details expanded by 103.6.159.90 (talk) 08:29, 29 January 2017 (UTC): The bot processes pages in category namespace in which the first edit is by Cydebot. Cydebot's edit summary will include the text "Moved from <CATEGORYOLDNAME>" and identifies the editors of the old category to account for the attribution. (example). This bot will undelete the category mentioned as such (if it has any mergeable edits), and history-merge it into the new category, using Special:MergeHistory. The MergeHistory extension cannot be used for merging parallel histories, so there's no chance of any history mess-up. The bot will then delete the leftover redirect. (MergeHistory produces a redirect when all edits on the source page are merged away; this cannot be suppressed.) As an additional step, the bot shall revdelete Cydebot's edit summary that lists the authors - this is because attribution is no longer needed and it prevents users from vanishing.
Going by the botop's notes at the BOTREQ discussion, the bot is equipped to handle a variety of special situations - categories that underwent multiple moves or back-and-forth moves.
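The edit-summary format described above ("Robot: Moved from Category:Foo. Authors: Author1, Author2") could be parsed roughly as follows. This is a hedged sketch: `MoveSummary` and `parse_move_summary` are hypothetical names, and it assumes author names contain no commas, which may not hold for every account.

```python
import re
from typing import List, NamedTuple, Optional


class MoveSummary(NamedTuple):
    source_title: str
    authors: List[str]


# Non-greedy source match so that titles containing a period (e.g. "St. Louis")
# are still split at the ". Authors: " boundary rather than at the first period.
_MOVED_FROM = re.compile(
    r"^Robot: Moved from (?P<source>.+?)\.(?: Authors: (?P<authors>.*))?$"
)


def parse_move_summary(comment: str) -> Optional[MoveSummary]:
    """Return the old category title and listed authors, or None if the
    comment is not one of Cydebot's "Moved from" summaries."""
    match = _MOVED_FROM.match(comment)
    if match is None:
        return None
    raw_authors = match.group("authors") or ""
    authors = [name.strip() for name in raw_authors.split(",") if name.strip()]
    return MoveSummary(match.group("source"), authors)
```

Summaries that stop after the source title, with no author list, yield an empty `authors` list instead of failing.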
Further details: History merges are done using API:Mergehistory, which I believe is functionally equivalent to Special:MergeHistory.
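For orientation, the four steps in the function details (undelete, merge, delete, revision-delete) map onto four Action API modules. The sketch below only builds the request parameters and is not the bot's code: `plan_history_merge`, the `reason` text, and the `cydebot_revid` argument are assumptions, and each real request would also need a CSRF token, which is omitted here.

```python
from typing import Dict, List


def plan_history_merge(source: str, destination: str,
                       cydebot_revid: int) -> List[Dict[str, str]]:
    """Return the four API requests, in order, for one history merge.

    Tokens are intentionally left out; the reason string is a placeholder.
    """
    reason = "History merge of category moved by Cydebot"  # placeholder wording
    return [
        # (1) restore the deleted revisions of the old category
        {"action": "undelete", "title": source, "reason": reason},
        # (2) merge its history into the current category title
        {"action": "mergehistory", "from": source, "to": destination,
         "reason": reason},
        # (3) delete the redirect left behind by the merge
        {"action": "delete", "title": source, "reason": reason},
        # (4) hide the edit summary of Cydebot's creation edit
        {"action": "revisiondelete", "type": "revision", "target": destination,
         "ids": str(cydebot_revid), "hide": "comment", "reason": reason},
    ]
```

Keeping the plan as plain data makes it easy to log or dry-run a batch before any admin action is taken.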
The bot algorithm sequentially steps through the selected Cydebot contribution history, from March 2015 back to 2006, which is 89,893 or 89,894 pages (I've gotten both results on different test runs, and don't have an explanation for the difference). These pages are then grouped as follows:
- 89893 user contributions
- Found count: 87524 (the request to retrieve the timestamp of the oldest deleted revision was successful)
- Mergeable: 86974 (the timestamp of the oldest deleted revision is earlier than the timestamp of Cydebot's edit, i.e. the timestamp of the selection-set item currently being processed)
- Hist-mergeable: 86213 (report showing the first 1,000) – the page title in Cydebot's edit summary is different than the page title of the selection-set item
- Self-mergeable: 761 (report) – the page title in Cydebot's edit summary is the same as the page title of the selection-set item
- Not mergeable: 550 (report) – the timestamp of the oldest deleted revision is later than the timestamp of Cydebot's edit
- Self-mergeable, no deleted revisions: 15 (report) – I don't believe there is anything that needs to be done with these
- Other no deleted revisions: 2354 (report)
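The grouping above can be read as a small decision procedure. The following is a sketch of that reading, not the bot's implementation: the `Bucket` and `classify` names are hypothetical, timestamps are assumed to be ISO 8601 UTC strings (which compare correctly as text), and a tie between the two timestamps is treated as not mergeable, a case the description does not address.

```python
from enum import Enum
from typing import Optional


class Bucket(Enum):
    HIST_MERGEABLE = "Hist-mergeable"
    SELF_MERGEABLE = "Self-mergeable"
    NOT_MERGEABLE = "Not mergeable"
    SELF_NO_DELETED = "Self-mergeable, no deleted revisions"
    OTHER_NO_DELETED = "Other no deleted revisions"


def classify(dest_title: str, source_title: str, cydebot_ts: str,
             oldest_deleted_ts: Optional[str]) -> Bucket:
    """Place one selection-set item into a bucket.

    oldest_deleted_ts is None when no deleted revision was found.
    """
    same_page = source_title == dest_title
    if oldest_deleted_ts is None:
        return Bucket.SELF_NO_DELETED if same_page else Bucket.OTHER_NO_DELETED
    if oldest_deleted_ts < cydebot_ts:
        return Bucket.SELF_MERGEABLE if same_page else Bucket.HIST_MERGEABLE
    # Oldest deleted revision is not earlier than Cydebot's edit.
    return Bucket.NOT_MERGEABLE
```

The reported counts are consistent with this nesting: 86213 + 761 = 86974 mergeable, 86974 + 550 = 87524 found, and 87524 + 15 + 2354 = 89893 contributions.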
I am prepared to begin processing the 86213 Hist-mergeable items. After these are processed, they will become Not mergeable items (the first four items on the list of 550 are pages I hist-merged during development and testing). However, when I run the bot through these 89,893 pages a second time, more pages will appear on the Hist-mergeable items list (they will be un-deleted by the first run). The second run will process these. Eventually (maybe after 3 or 4 runs) there will be no more of these left and the Hist-mergeable item count will be zero, while the Not mergeable list will have grown from 550 to over 87,000.
I can likely process the 761 "Self-mergeable" items. Rather than hist-merging, they will just need to have some deleted history restored. I haven't coded or tested this piece yet, and would prefer to process these separately, so in the event of problems, items of this type are isolated and consolidated. But I can work on this next if it's preferred that I do everything at once.
As the IP has pointed out, many of the 2354 "Other no deleted revisions" can likely be hist-merged also, but I haven't tested any of these yet, and would prefer to leave them for later processing as well.
These other two shorter work queues aren't going anywhere, and we can come back to them later. I'm ready to do a test run of a small subset of the 86213 Hist-mergeable items as soon as BAG is comfortable with giving me the go-ahead, and my bot has administrator privileges.
Community notifications
Community notifications for a new admin bot have been placed:
- Wikipedia:Bots/Requests for approval/Adminbots
- Wikipedia:Village_pump_(proposals)#New_Adminbot_proposal_-_History_Merge_cleanup_of_another_bot
- Wikipedia:Administrators'_noticeboard#New_Adminbot_proposal_-_History_Merge_cleanup_of_another_bot
- Wikipedia talk:Categories for discussion/Working
- Wikipedia talk:Categories for discussion
Discussion
Please note that this proposed bot does nothing but stuff I've done in the past, and would probably have gotten back to if not for this bot. עוד מישהו Od Mishehu 04:37, 29 January 2017 (UTC)
- The vast majority of the bot's edits should be trivially easy, but with such a massive number of pages to merge, I'm wondering if we'll encounter some situations where the best action (the action that any human admin would perform) would require this to be a CONTEXTBOT. For example, what happens if we move Cat:Foo to Cat:Bar, delete Cat:Bar some time later, and then someone creates a different Cat:Bar (so it's not G4-eligible) afterward? A human would see that a merge isn't needed. What would the bot do? Would it totally ignore the deleted category because Cydebot didn't create the extant one? Log the deleted category for human review? Also, what would it do when the Cydebot edit summary has already been revdeleted: would it just skip that step, or would this issue potentially cause a crash? Meanwhile, Od Mishehu said at WP:BOTR: "Some of the pages with no deleted revisions are the result of a category rename where the source category was changed into something else (a category redirect or disambiguation), and a history merge in those cases should be done." When the last deleted revision is something other than a normal category (it has something more than just text, parent categories, and an XFD or speedy deletion tag), what will the bot do? Not trying to derail anything; I just want to make sure you've addressed everything in your coding. Nyttend (talk) 00:14, 31 January 2017 (UTC)
- (1) I'm looking at a list of Cydebot's page creations, so I might run into something like this:
- 18:01, 21 March 2015 . . N Category:Bar (Robot: Moved from Category:Foo. Authors: Author1, Author2)
- This item will only remain in Cydebot's edit history as long as Category:Bar is not deleted. If someone deletes Cat:Bar some time later, then that item will disappear from Cydebot's edit history. If then someone creates a different Cat:Bar then presumably the history of the old Cat:Bar will remain deleted, and my bot doesn't look at deleted edit histories. If someone else later restores that unrelated history, then that's on them, not my bot.
- (2) When the Cydebot edit summary has already been revdeleted, a warning is returned: "warnings": "code": "revdelete-no-change" – try this in the sandbox (click Make request). I'm not checking for warnings, but I am checking for unexpected "error" returns from my undelete, merge_history, and delete function calls. If any of these return errors, the bot will immediately stop processing, so I can investigate.
- (3) Right, "no deleted revisions" generally means that a human editor has intervened somehow, and humans are less predictable than bots. I've noticed that there are some different scenarios there including category splits and disambiguations. There may be some extra special handling needed for those 2354 "no deleted revisions" items, that's one reason why I'm suggesting we defer processing those until later. I'm not comfortable with those until I take some time for more analysis. wbm1058 (talk) 02:04, 31 January 2017 (UTC)
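The stop-on-error behaviour described in point (2) can be sketched as below. This is an illustration under assumptions rather than the bot's code: `ApiError` and `check_response` are hypothetical names, and the only thing relied on is the standard Action API error envelope, a top-level "error" object with "code" and "info". Warnings are deliberately not inspected, matching the description.

```python
from typing import Any, Dict


class ApiError(RuntimeError):
    """Raised to halt the run when the API reports an error."""


def check_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Return the response unchanged, or raise ApiError if it carries an error.

    Warnings such as revdelete-no-change are ignored on purpose.
    """
    if "error" in response:
        err = response["error"]
        raise ApiError(f"{err.get('code')}: {err.get('info')}")
    return response
```

Wrapping each undelete, merge, and delete call in such a check would let an unexpected failure stop the whole run so the operator can investigate.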
- (3) Maybe I misunderstood, then, because it didn't occur to me that all of these ones would necessarily fall into the "no deleted revisions" camp. (1) Again, I wasn't clear. Will it just ignore the page entirely, or will it have some way of logging it? I'm guessing that it will ignore it (because it won't show up on the list of pages to check, in the first place), but maybe I just don't understand the algorithm properly. (2) Great. And finally, thanks for the helpful responses. Nyttend (talk) 02:52, 31 January 2017 (UTC)
- re (1) On the bot requests page, I briefly considered whether using the Deletedrevs API to retrieve and examine deleted revisions would be necessary, and decided that it was not. So perhaps, "ignore" isn't quite the right word to use, but rather, it's not going out of its way to look for it. If a currently active category wasn't created by Cydebot, that category is not going to be part of the work queue. The only way to make that category appear as if it was created by Cydebot is to restore the deleted history, and my bot isn't going to do that. I'd have to go to extra trouble to look for deleted Cydebot page creations in order to log them.
- A general response regarding whether there are any lurking unusual category "gotchas" that I haven't accounted for: THIS is the list of the first 1,000 pages the bot will do hist-merges on. Scan that list and see whether you can spot any "unusual types" of categories on it. Those more active in category administration may be better at spotting any weird ones than I am. wbm1058 (talk) 03:51, 31 January 2017 (UTC)
- Now I recall that when I used the sandbox to try Deletedrevs, the API returned a notice: "warnings": "deletedrevs": "*": "\"list=deletedrevs\" has been deprecated. Please use \"prop=deletedrevisions\" or \"list=alldeletedrevisions\" instead." Though the documentation doesn't say that it's deprecated. – wbm1058 (talk) 15:18, 31 January 2017 (UTC)
- So, I'll use Deletedrevisions instead of Deletedrevs. wbm1058 (talk) 16:26, 31 January 2017 (UTC)
- Or should I use Alldeletedrevisions? Or do I need to use both? "List all deleted revisions by a user or in a namespace." How about "List all deleted revisions by a user (Cydebot) and in a namespace (Category)"? Not the most straightforward thing to figure out, this API. It might be interesting to see if I can generate a list of all of Cydebot's currently deleted category-space revisions, for the time-window of interest, just for kicks. I don't think I would be spilling any beans if I published such a list. Some of the items on this list would be hist-merged by my bot on its second or later runs. wbm1058 (talk) 17:02, 31 January 2017 (UTC)
- Alldeletedrevisions it is: sandbox Hmm. "Note: Due to miser mode, using adruser and adrnamespace together may result in fewer than adrlimit results returned before continuing; in extreme cases, zero results may be returned." Hopefully I won't be running an "extreme case"; I wouldn't want to get no results. – wbm1058 (talk) 18:10, 31 January 2017 (UTC)
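One way to cope with the miser-mode note is to keep following the API's "continue" block even when a batch comes back empty. The sketch below assumes a caller-supplied `fetch` function (hypothetical) that performs the HTTP request and returns the decoded JSON; `iter_deleted_revisions` is likewise an illustrative name, and the account making the request would need the right to view deleted history.

```python
from typing import Any, Callable, Dict, Iterator


def iter_deleted_revisions(fetch: Callable[[Dict[str, str]], Dict[str, Any]],
                           user: str = "Cydebot",
                           namespace: int = 14) -> Iterator[Dict[str, Any]]:
    """Yield deleted revisions by `user` in `namespace`, following continuation.

    An empty batch is not treated as the end of the results; only a response
    without a "continue" block ends the loop.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "alldeletedrevisions",
        "adruser": user,
        "adrnamespace": str(namespace),
        "adrprop": "timestamp|comment",
        "adrlimit": "max",
    }
    cont: Dict[str, str] = {}
    while True:
        data = fetch({**params, **cont})
        for page in data.get("query", {}).get("alldeletedrevisions", []):
            for rev in page.get("revisions", []):
                yield {"title": page["title"], **rev}
        if "continue" not in data:
            break
        cont = data["continue"]
```

With this loop, a batch of zero results followed by a later non-empty batch still produces the complete list.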
- Here's a link for Cydebot's Deleted user contributions from March 25, 2015 back. We're only interested in the deleted contributions with edit summaries in the form "(Robot: Moved from Category:Foo. Authors: X, Y)". The first one is:
- 07:01, 19 March 2015 . . Category:Calendar dates (Robot: Moved from Category:Dates.
- This one is a dead end. Category:Calendar dates was deleted per Wikipedia:Categories for discussion/Log/2016 March 25. I won't restore both categories in order to hist-merge Category:Dates into Category:Calendar dates, only to then re-delete both categories, unless there is a consensus to do that.
- Most of these are red links. The first blue link I see is:
- 23:32, 1 March 2015 . . Category:Royal Order of the Seraphim (Robot: Moved from Category:Order of the Seraphim.
- Category:Royal Order of the Seraphim was originally created by Cydebot on 1 March 2015 (edit summary above), but then:
- 06:17, 27 August 2015 Cydebot deleted page Category:Royal Order of the Seraphim (Robot - Removing category Royal Order of the Seraphim per CFD at Wikipedia:Categories for discussion/Log/2015 June 22#A few more award categories.)
- (collapsed under more awards: Propose upmerging Category:Royal Order of the Seraphim to Category:Orders of knighthood of Sweden.)
- But, more recently:
- 13:31, 19 December 2015 Chicbyaccident . . (←Created page with 'Seraphim Category:Orders of knighthood awarded to heads of state, consorts and sovereign family members|Seraphim, Royal Order o...')
- This is an example of a dead-end that was covered up by a new page creation. Again, my bot will not see this; it will not restore Category:Order of the Seraphim and the deleted history of Category:Royal Order of the Seraphim in order to perform a history-merge, only to re-delete Category:Order of the Seraphim and the part of Category:Royal Order of the Seraphim that was temporarily restored to do a history-merge. – wbm1058 (talk) 17:27, 2 February 2017 (UTC)
Hmm... I think one of the problems with not checking deleted revisions and comparing deletion logs might be that we could end up with stuff that got separately deleted (who knows; maybe the content on the category page was deletion worthy at one point(?)). If a page was deleted multiple times—and especially if CydeBot wasn't the first or last thing to delete it—its entire history probably shouldn't be restored (the histmerge api has date parameters, probably for this reason). I think you alluded to covering the not-the-last condition, but I'm just trying to make sure we cover the not-the-first portion too in order to avoid something awful being accidentally dredged up. --slakr\ talk / 09:45, 4 February 2017 (UTC)
- Right, so after adding another check to my algorithm, with today's latest test we have:
- 89888 user contributions (5 fewer than the 89893 reported above, perhaps because 5 categories in the list were deleted recently?)
- Found count: 87519 (the request to retrieve the timestamp of the oldest deleted revision was successful)
- Destination has deleted history: 4056 (report)
- Mergeable: 82913 (the timestamp of the oldest deleted revision is earlier than the timestamp of Cydebot's edit, i.e. the timestamp of the selection-set item currently being processed, and the merge destination has no deleted history)
- Hist-mergeable: 82913 (report showing the first 1,000) – the page title in Cydebot's edit summary is different than the page title of the selection-set item
- Self-mergeable: 0 – the page title in Cydebot's edit summary is the same as the page title of the selection-set item. These by definition all have deleted history, so the 761 found by the last test are now a subset of the 4056 above [trapped there so they don't fall through to here]
- Not mergeable: 550 (report) – the timestamp of the oldest deleted revision is later than the timestamp of Cydebot's edit
- Self-mergeable, no deleted revisions: 15 (report) – I don't believe there is anything that needs to be done with these [no change since last test]
- Other no deleted revisions: 2354 (report) [just one change since last test]
- So now instead of 761 "Self-mergeable" items to be deferred for later processing, we have 4056 merge destinations with deleted history (which includes all of the 761) to be deferred for later processing.
- The first example on the list – something I would have missed:
- Deletion log
- 05:40, 24 March 2015 Cydebot deleted page Category:Shipyard Associates of The Wire (Robot - Speedily moving category Shipyard Associates of The Wire to Category:Shipyard associates of The Wire per CFDS.)
- 21:16, 16 September 2010 Ex*** deleted page Category:Shipyard associates of The Wire (Merged to Category:The Wire (TV series) characters per Wikipedia:Categories for discussion/Log/2010 September 8#The Wire characters.)
- 16:22, 8 September 2010 Cydebot deleted page Category:Shipyard Associates of The Wire (Robot - Speedily moving category Shipyard Associates of The Wire to Shipyard associates of The Wire per CFDS.)
- Deleted history of Category:Shipyard Associates of The Wire (9 deleted edits)
- 03:04, 22 March 2015 . . Ko*** (554 bytes) (Nominated for speedy renaming; see Categories for discussion/Speedy. (TW))
- 03:04, 22 March 2015 . . Ko*** (94 bytes)
- 14:56, 14 April 2013 . . Gr*** (32 bytes) (added Category:The Wire characters using HotCat)
- 11:57, 6 September 2010 . . Ta*** (1,896 bytes) (cfr-speedy)
- 18:36, 1 July 2008 . . Io*** m (333 bytes)
- 18:36, 1 July 2008 . . Io*** m (335 bytes)
- 14:17, 22 February 2007 . . Cydebot m (337 bytes) (Robot - Speedily moving category The Wire (TV series characters) to The Wire (TV series) characters per CFD.)
- 21:32, 23 January 2007 . . Co*** (338 bytes)
- 21:28, 23 January 2007 . . Co*** (44 bytes) (←Created page with 'Category:The Wire (TV series characters)')
- Live history of Category:Shipyard associates of The Wire
- 05:39, 24 March 2015 Cydebot . . (94 bytes) (+94) . . (Robot: Moved from Category:Shipyard Associates of The Wire. Authors: Gr***, Ko***)
- Deleted history of Category:Shipyard associates of The Wire (4 deleted edits)
- 09:07, 12 September 2010 . . He*** (703 bytes) (Category:The Wire)
- 18:40, 8 September 2010 . . Mi*** m (681 bytes)
- 18:31, 8 September 2010 . . Mi*** (681 bytes) (cfm)
- 16:22, 8 September 2010 . . Cydebot m (333 bytes) (Robot: Moved from Category:Shipyard Associates of The Wire. Authors: Ta***, Io***, Co***, Cydebot)
- Obviously some more to think about here. My bot would undelete the 9 deleted edits of Category:Shipyard Associates of The Wire and hist-merge them to the single live edit of Category:Shipyard associates of The Wire.
- But it should only hist-merge the three deleted edits from 14 April 2013 – 22 March 2015. The downside of working backwards. We can back-burner these 4056 pages with deleted history.
- Now down to 82,913 on my "good to go" list. Have I filtered out all the trouble yet, or is there another "gotcha" scenario still lurking in these? $64K question. – wbm1058 (talk) 00:29, 5 February 2017 (UTC)
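The "only the three deleted edits" point suggests bounding the merge by the deletion log: keep the deleted revisions made after the previous deletion of the source page and no later than Cydebot's own deletion. The sketch below is one possible way to express that, not something the bot is stated to do; `revisions_to_merge` is a hypothetical name and timestamps are assumed to be ISO 8601 UTC strings.

```python
from typing import List


def revisions_to_merge(deleted_revision_ts: List[str],
                       deletion_log_ts: List[str],
                       cydebot_deletion_ts: str) -> List[str]:
    """Select deleted revisions belonging to the page's last "life" before
    Cydebot's deletion.

    Revisions at or before any earlier deletion, and revisions made after
    Cydebot's deletion, are left out.
    """
    earlier = [ts for ts in deletion_log_ts if ts < cydebot_deletion_ts]
    floor = max(earlier) if earlier else ""  # "" sorts before any timestamp
    return [ts for ts in deleted_revision_ts
            if floor < ts <= cydebot_deletion_ts]
```

Applied to Category:Shipyard Associates of The Wire (deletions on 8 September 2010 and 24 March 2015), this keeps the revisions from 14 April 2013 and 22 March 2015 and drops the six from 2007 to 2010.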
— — — — — — — — — — —
There is more work to be done before this is ready. Look at this example:
- Category:Olympic track and field athletes of Puerto Rico <-- Category:Olympic athletes of Puerto Rico
The deletion log of Category:Olympic athletes of Puerto Rico:
- 23:10, 12 March 2015 G*** O*** deleted page Category:Olympic athletes of Puerto Rico (R3: Recently created, implausible redirect: not convinced that this is plausible, since "athletes" in this country refers to all sportspeople)
- 08:21, 11 March 2015 Cydebot deleted page Category:Olympic athletes of Puerto Rico (Robot - Moving category Olympic athletes of Puerto Rico to Category:Olympic track and field athletes of Puerto Rico per CFD at Wikipedia:Categories for discussion/Log/2015 February 26.)
- 01:41, 5 May 2005 Re*** deleted page Category:Olympic athletes of Puerto Rico (deleted (renamed) as per cfd discussion/vote)
We do not want to restore all 15 deleted edits. No need to restore the 7 edits that were deleted on 5 May 2005, nor the one edit created after 08:21, 11 March 2015.
We just want to restore the 7 deleted edits in between those. This is a more complicated scenario, so I'll add another bucket of "items to skip on the first pass" to dump this into.
I need to check the deletion log. For the first pass at this, I will filter out all the items where there are multiple deletions listed in the log, and process the items where there is only one deletion in the log (Cydebot's). Finding the API for this wasn't easy. I was wondering whether I'd need to use Special:Log directly, and parse the results. But, I finally located API:Logevents, so I'll use that. – wbm1058 (talk) 15:20, 9 February 2017 (UTC)
This is really cool (and likely difficult to fix all of the edge cases). If only Wikipedia's software had allowed actual category page content moves a full decade before they were finally implemented. Good luck! --Cyde Weys 15:52, 10 February 2017 (UTC)
- Thanks, I appreciate the support of Cydebot's operator. This API sandbox log query returns 3 deletions, so I'll skip it for now. This log query returns just one item, with "user": "Cydebot", so we're good to go. Most on the list should be simple cases like this. I'll code this up and do another test run. – wbm1058 (talk) 19:35, 10 February 2017 (UTC)
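The first-pass filter described above (process only pages whose deletion log shows a single deletion, by Cydebot) might be sketched like this. The names `deletion_log_query` and `only_deleted_by` are hypothetical, and the sketch assumes each log entry carries "action" and "user" fields, which requires asking for `type` and `user` in `leprop`.

```python
from typing import Any, Dict, List


def deletion_log_query(title: str) -> Dict[str, str]:
    """Build list=logevents parameters for one page's deletion log."""
    return {
        "action": "query",
        "format": "json",
        "list": "logevents",
        "letype": "delete",
        "letitle": title,
        "leprop": "type|user|timestamp|comment",
        "lelimit": "max",
    }


def only_deleted_by(log_events: List[Dict[str, Any]],
                    bot: str = "Cydebot") -> bool:
    """True when the log holds exactly one deletion and the bot performed it.

    Entries of the delete log that are not deletions (e.g. restores) are
    ignored when counting.
    """
    deletions = [e for e in log_events if e.get("action") == "delete"]
    return len(deletions) == 1 and deletions[0].get("user") == bot
```

A page with three deletions in its log, like the Puerto Rico example, would return False and be left for a later pass.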