Wikipedia talk:WikiProject Copyright Cleanup/Darius Dhlomo Drive

Technical refinement solution

Following on from this conversation:

The mess that keeps on giving

Hi VernoWhitney. I've had the joy of stumbling across some copy/pasted sections of prose from 2004 Olympics articles from our old friend Darius Dhlomo. All of the ones I've checked appear to have been missed out from the infamous mass blanking episode, but they are listed on the near endless and now abandoned clean up pages. I noticed you drew up these lists initially.

I was wondering whether it would be possible to produce a refinement to search out passages of copied prose (which was the main issue anyway)? For example, would it be feasible to get a list of all his edits over 500 bytes that don't contain the phrases "wikitable", "|-", or "*" (for lists)? I know that technically this would be more difficult and time-consuming than the brute "search via byte count" method, but it would pretty much nail down the last vestiges of this grand plagiarism. I've been improving my own programming but frankly I don't even know where to start with building code for something like this. Any ideas? SFB 18:04, 11 March 2013 (UTC)
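For illustration only, the filter described in the comment above could be sketched roughly as follows. This assumes the text added by each edit has already been extracted somehow; the record layout (revision id, bytes added, added text) and the function names are hypothetical and are not taken from any existing CCI tool.

```python
# Marker strings named in the request: table class, table row, list bullet.
MARKERS = ("wikitable", "|-", "*")


def is_prose_candidate(bytes_added, added_text, min_bytes=500):
    """True if the edit adds more than min_bytes and contains none of the markers."""
    if bytes_added <= min_bytes:
        return False
    return not any(marker in added_text for marker in MARKERS)


def prose_candidates(edits, min_bytes=500):
    """Return revision ids of edits that look like added prose.

    `edits` is an iterable of (rev_id, bytes_added, added_text) tuples,
    a made-up layout used only for this sketch.
    """
    return [rev_id for rev_id, size, text in edits
            if is_prose_candidate(size, text, min_bytes)]
```

A plain substring test is deliberately crude: any edit containing an asterisk anywhere is dropped, so it will miss prose edits that happen to include one.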

I don't recall the discussions just at the moment, but judging from the blanking bot's user page it only targeted those pages created by DD, not all of the ones they edited.

Filtering out which edits are likely clean is something I looked into a while ago for a socking CCI-subject, but I never finished it. Once you get a string or strings of which content was actually added, the regular expressions to find templates/tables/lists aren't that hard.

The hard part is turning a diff into that usable string in the first place because it has to handle rearranging paragraphs and things other than simple additions of single isolated blocks of text. As I recall you have to start with fetching the full wikitext of both versions (before the edit and after the edit) of an article from the API, then figure out what you need for your own diff (unless you want to try parsing HTML output of a browser diff instead). I could be missing something, but then it's been over a year since I've even glanced at my notes for that code.
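A minimal sketch of that diff step, assuming the wikitext of both revisions is already in hand as two strings (fetching them from the API is left out). It uses Python's standard `difflib` rather than whatever the unfinished code used, and the function name `added_text` is made up for this example. Lines that were only moved are skipped by checking each added line against the set of all lines in the old revision.

```python
import difflib


def added_text(old_wikitext, new_wikitext):
    """Return the lines that appear in the new revision but nowhere in the old one."""
    old_lines = old_wikitext.splitlines()
    new_lines = new_wikitext.splitlines()
    # Set of stripped old lines, so a rearranged paragraph is not counted as new.
    seen_before = {line.strip() for line in old_lines}

    added = []
    for line in difflib.ndiff(old_lines, new_lines):
        # ndiff prefixes lines unique to the second sequence with "+ ".
        if not line.startswith("+ "):
            continue
        content = line[2:]
        if content.strip() and content.strip() not in seen_before:
            added.append(content)
    return "\n".join(added)
```

This works line by line, so a paragraph that was moved and also lightly reworded would still be reported as added in full; handling that case needs a finer-grained comparison.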

It's something that would be great to have for narrowing down CCIs, but it's not something I'm going to have time to work on again any time soon. VernoWhitney (talk) 18:43, 11 March 2013 (UTC)

Now that your question has got me thinking, though, I may see what I can come up with... VernoWhitney (talk) 19:02, 11 March 2013 (UTC)

Two side notes on this. First, I know that User:Amalthea has been working on trimming down trivial diffs in CCIs; tables would count, so if there's a way to get those scripted out, it could likely knock out a good deal. Granted, what would be left would likely be 90% copyvios, but it's something nonetheless. Second, would there be interest in getting a drive going on his CCI? Great progress was made, but it's stalled with about 6,500 left. Yes, it's the second most, but it's a far cry from 23,000. Wizardman 01:23, 12 March 2013 (UTC)

Well, if we're speaking generally and not DD-specific, then there are going to be some false negatives if tables and lists are excluded across the board (compare the copyrighted list Time's All-TIME 100 Movies versus the verified PD list AFI's 100 Years...100 Movies), but as I said above that part is just applying regular expressions, so it should be fairly easy to toggle an 'include/exclude tables' parameter.
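As a rough illustration of the toggle mentioned above, the sketch below measures how many bytes of an edit's added text remain after optionally discarding table and list lines. The patterns and the names `prose_bytes`, `exclude_tables` and `exclude_lists` are assumptions for this example, not the regular expressions from the original notes, and the template pattern only removes templates that do not contain other templates.

```python
import re

# Table start/end, row separator, cell, or header cell at the start of a line.
TABLE_LINE = re.compile(r"^\s*(\{\||\|\}|\|-|\||!)")
# Bulleted or numbered list item.
LIST_LINE = re.compile(r"^\s*[*#]")
# A template call with no nested templates inside it (may span several lines).
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")


def prose_bytes(added_text, exclude_tables=True, exclude_lists=True):
    """Return the UTF-8 byte count of what is left after stripping markup lines."""
    text = TEMPLATE.sub("", added_text)
    kept = []
    for line in text.splitlines():
        if exclude_tables and TABLE_LINE.match(line):
            continue
        if exclude_lists and LIST_LINE.match(line):
            continue
        kept.append(line)
    return len("\n".join(kept).strip().encode("utf-8"))
```

Calling it with `exclude_tables=False` or `exclude_lists=False` keeps those lines in the count, which is the setting that would matter for a contributor suspected of copying lists such as the Time example.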

Getting back to the CCI at hand, however, I find myself rather short of onwiki time at the moment, so I think I'll stick to looking at the code I have and see if I can actually get something working to appropriately parse a diff and see if I can contribute in that way. Besides that, I've always preferred working on the smaller CCIs before the larger ones, but that's just a personal preference. VernoWhitney (talk) 02:07, 12 March 2013 (UTC)

An ability to search through DD's non-minor edits that did not add table elements would be a very powerful tool in identifying copyright issues. Last time round, I found only one instance where an edit of multiple sentences of prose was original work. VernoWhitney did not have the time to work on such a programmatic solution. Are there any other editors with a programming background who could assist us in this way? SFB 10:20, 14 April 2013 (UTC)

Went and sent a message to User:Legoktm about making a bot to get rid of the trivial table diffs. Hopefully that works out. Wizardman 15:52, 14 April 2013 (UTC)

I ran a test here, and my spot check looked good. If someone else could also do a quick check to make sure everything is ok, I'll run the script against the other pages. Legoktm (talk) 16:05, 14 April 2013 (UTC)

I've glanced at all of them, and they look fine. Hut 8.5 19:00, 14 April 2013 (UTC)

Ok, the script ran through all of the unfinished pages and removed quite a few diffs. I also noticed that there were some diffs that had been revdel'd, but not marked as done yet. Should/can these be marked in some way? Legoktm (talk) 22:46, 14 April 2013 (UTC)

Assuming the revision was revdeled for being a copyright violation, then yes, they ought to be marked as done (and as copyvios). I've come across a number of these myself. Hut 8.5 08:02, 15 April 2013 (UTC)

Ok, but how do you note it if one revision out of 5 diffs for an article has been deleted? Legoktm (talk) 01:35, 17 April 2013 (UTC)

Hmm. The process isn't really designed to handle that situation. Could you post a list of the articles here, or somewhere else? I don't mind checking the remaining diffs. Hut 8.5 08:52, 17 April 2013 (UTC)

Can I get a second opinion

I removed a load of content from Michelle McKeehan presumptively - it looks likely and both sources cited are inaccessible, even through the Internet Wayback Machine. Various people are now reverting me claiming that the fact it was unblanked some years ago means it isn't a copyright violation. I would appreciate it if someone else could take a look without me getting into an edit war. Hut 8.5 23:25, 28 May 2013 (UTC)

I just got a note on my talk page about another questionable inwind check, so we can't use those as word of god. Because of the situation with Darius, we have to presume things are copyvios, unfortunately. They're free to rewrite, but no blanket revert. Wizardman 23:40, 28 May 2013 (UTC)