Wikipedia:Administrators' noticeboard/Incidents/CCI

Review of unblock request and discussion of possible community ban

Darius Dhlomo (talk · contribs) is a prolific editor, with over 163,000 edits since 2005. However, he is currently blocked for multiple copyright violations, and is requesting an unblock. A few different admins have looked at his case, but I believe a community consensus is required, as I believe a permanent ban is possibly warranted. Darius has a long history of ignoring other editors; perusing his talk page history you'll see that many editors have tried to engage him in discussion about some questionable edits over the years, but Darius has never bothered to reply, save for the occasional section or page blanking. Only now, upon facing an indefinite block, does he appear to show the slightest bit of understanding or remorse over his actions, although the warnings have been given to him for several years. While much of his work is commendable, he does appear to want to work in a vacuum, ignoring not just policy but also conventions and consensus that he doesn't agree with. (Note that only about 1 out of every 900 edits he has ever made have been to talk pages!) I have never believed that "vested editors" should be given more leeway than anyone else.

I seek feedback from the wider community here about what to do with Darius Dhlomo. I do not believe he should be unblocked without a more thorough review of his editing history, and not just based on his current talkpage expression of remorse. — Andrwsc (talk · contribs) 17:38, 4 September 2010 (UTC)[reply]

I'm not sure there is any need for this. He is blocked, and I have just declined his latest unblock request and directed him to consider WP:OFFER instead. Beeblebrox (talk) 17:44, 4 September 2010 (UTC)[reply]

Fair enough, but there were comments such as "I'm verging on unblocking" so I wanted to make sure there was actually consensus to do so (or not) instead of the decision being made by a single admin. — Andrwsc (talk · contribs) 20:41, 4 September 2010 (UTC)[reply]

Comment: not familiar with this case, but the way he repeatedly put "damage" in scare quotes in unblock requests makes me disinclined to give him another chance, at least without successful compliance with drastic restrictions (eg 3-month ban on article edits - talk pages only). Rd232 ^talk 17:45, 4 September 2010 (UTC)[reply]
JamesBWatson sums up pretty well at user talk. I wouldn't support an unblock, per Beeblebrox and Rd232 above. --John (talk) 17:50, 4 September 2010 (UTC)[reply]
Comment Why in the world is "Community ban" in this? The editor doesn't get it ... there's no need to jump to that level of drama. As an admin who declined unblock once ... and have actually tried to help them, and even at one point was prepared to support an unblock (but not anymore), as James has said, everything he types just makes it worse. WP:OFFER (talk→ BWilkins ←track) 18:11, 4 September 2010 (UTC)][reply]

Let me add: thanks to NW for removing autoreviewer from them. How dumb of us not to remove it when the first copyright issues arose years ago then never stopped. (talk→ BWilkins ←track) 19:40, 4 September 2010 (UTC)[reply]

Comment At present I haven't got much to add to what I have written on the user's talk page. However, I will just say that I don't think the user should be unblocked, but that I don't see any reason for a community ban. Wikipedia:Standard offer has been presented, and I would leave it at that for now. JamesBWatson (talk) 18:30, 4 September 2010 (UTC)[reply]
See no need for a community ban, he's already rightfully indefinitely blocked, properly referred to WP:OFFER and should be blocked from editing his talk page if he keeps asking to be unblocked. --WGFinley (talk) 15:12, 5 September 2010 (UTC)[reply]

Scale of the problem

What's the scale of the copyright problem here? I've identified these so far: Sammy Korir (2006) Joetta Clark (2007) Canyon Ceman (2008, notice removed by this same editor and still a copyright violation right now), and Phil McMullen (athlete) (2010). I am unable to determine why the 'bot thought that Kamil Damašek was a copyright violation. Is this all that there is? Uncle G (talk) 20:29, 4 September 2010 (UTC)[reply]

We don't know yet. Within the past five weeks, a contributor has removed content from Julie Isphordin (2010), Jerome Drayton (2006), Ron Tabb (2010), Steve Spence (2007), Nina Kuscsik (2009), and Maurcie Damilano (2010). There's a long way to go before we know the scale here. --Moonriddengirl ^(talk) 21:21, 4 September 2010 (UTC)[reply]
- Randomly jumping around his contribs, I've just found four non-vios and one pretty egregious one, from 2006, content still in the article as of today: Jill Sterkel. We've got a CCI going here at Wikipedia:Contributor copyright investigations/Darius Dhlomo, but we haven't completed listing yet. Contribs of this scale challenge our system. --Moonriddengirl ^(talk) 21:36, 4 September 2010 (UTC)[reply]
  - Well just this evening I have started going through the CCI and because I haven't done this before I thought I would start on the ones that seem to be mainly a recitation of results. I have already flagged one definite copyvio and have just come across 3 more articles this, this and this which I am fairly sure are copyvios too. That's four more probables from a fairly small sample. Boissière (talk) 21:50, 4 September 2010 (UTC)[reply]
    - I had my doubts about those edits. I'm very aware of the IAAF's style and the odd full caps name of NOOL was a give away. Copied sources for those three articles are here here and here. Stuff like this is often beyond the realms of search engines, but its location is second nature for perennial time-wasters such as myself. Sillyfolkboy (talk) (edits)^{Join WikiProject Athletics!} 00:57, 5 September 2010 (UTC)[reply]
      - Good going, both of you! (Can we have you full time? :D) So there can be no doubt of backwards copying, I've checked archives of each of those sources. He did indeed paste the content onto Wikipedia. He assures us at his talk page that he's done this is in "No more than fifteen" articles. This is 15 right here. It would be nice to think that will be all, but I am not optimistic. --Moonriddengirl ^(talk) 01:08, 5 September 2010 (UTC)[reply]
        Make that 16. I've just found Ben Plucknett, which was swiped wholesale from the New York Times. Uncle G (talk) 01:48, 5 September 2010 (UTC)[reply]
        
        17: Eileen Coparropa swiped wholesale from The Panama News. Uncle G (talk) 01:58, 5 September 2010 (UTC)[reply]
        
        18: Ragnhild Hveger nicked from an ISHOF profile. Uncle G (talk) 02:11, 5 September 2010 (UTC)[reply]
        
        19: Marina van der Merwe plagiarized from a Coaching Association of Canada press release, of all things. Uncle G (talk) 02:15, 5 September 2010 (UTC)[reply]
        
        20: Edward Liddie lifted wholesale from the 2005 World Judo Championship press pack. Uncle G (talk) 02:22, 5 September 2010 (UTC)[reply]
        
        21: Alison Forman half-inched from a 2000 ABC News profile for the 2000 olympics. Uncle G (talk) 02:26, 5 September 2010 (UTC)[reply]
        
        22: Francisco Hervás nabbed from 2003 FIVB coach profile. Uncle G (talk) 02:34, 5 September 2010 (UTC)[reply]
        
        23: Bert Goedkoop hooked from another 2003 FIVB coach profile. Uncle G (talk) 02:38, 5 September 2010 (UTC)[reply]
        
        24: Jeff Stork ripped from a profile in The Washington Post. Uncle G (talk) 02:41, 5 September 2010 (UTC)[reply]
        
        25: Pete McArdle janked from an 1985 obituary in the New York Times. Uncle G (talk) 02:48, 5 September 2010 (UTC)[reply]
        
        26: Vera Nikolić clipped from a profile by Belgrade Marathon Ltd. Uncle G (talk) 03:04, 5 September 2010 (UTC)[reply]
        
        27: Lynda Blutreich copped from a profile by the Carolina Tar Heels hall of fame. Uncle G (talk) 03:10, 5 September 2010 (UTC)[reply]
        
        28: Lionel Cox popped from a WWW republication of a 1984 book by Graeme Atkinson. Uncle G (talk) 03:26, 5 September 2010 (UTC)[reply]
        
        29: Mike Fibbens ganked from a July 2000 Associated Press article. Uncle G (talk) 03:51, 5 September 2010 (UTC)[reply]
        
        30: Jason Grimes filched from a profile by Maryland Track & Field. Uncle G (talk) 12:00, 5 September 2010 (UTC)[reply]
        
        31: Caren Kemner swiped from a profile in The Washington Post. Uncle G (talk) 12:09, 5 September 2010 (UTC)[reply]
        
        32: Rita Crockett blagged from a profile by CBS Sports. Uncle G (talk) 12:58, 5 September 2010 (UTC)[reply]
        
        33: Philippe Blain misappropriated from another 2003 FIVB coach profile. Uncle G (talk) 13:43, 5 September 2010 (UTC)[reply]
        
        34: Mikiyasu Tanaka purloined from another 2003 FIVB coach profile. Uncle G (talk) 14:01, 5 September 2010 (UTC)[reply]
        
        35: Vassiliki Arvaniti pilfered from a Beach Volleyball Database biography. Uncle G (talk) 16:51, 5 September 2010 (UTC)[reply]

(de-indenting) Any talk to of either an unblock or a community ban is premature until the CCI finishes. At first blush, though, this sure looks pretty grim. Nandesuka (talk) 23:15, 4 September 2010 (UTC)[reply]

I've read much of Darius Dhlomo's editing in my time here. This may sound drastic, but I would say that any article creations with more than four sentences of prose are highly suspect. Any edits adding three or more sentences of prose are also suspect. Believe it or not, despite the edit count, I guarantee that you will not find many edits which fall within this description. Most of Darius' larger edits will be adding tables/templates etc but I believe a small minority of these will yield copyright violations. I expect that these violations with be confined to large edits to biographies and event results articles (like those linked above). Sillyfolkboy (talk) (edits)^{Join WikiProject Athletics!} 01:43, 5 September 2010 (UTC)[reply]
- It doesn't sound drastic at all, unfortunately. I've just gone through about 40 biographies in the CCI list, and there's a definite pattern emerging. Uncle G (talk) 01:48, 5 September 2010 (UTC)[reply]
When one sees a CCI listing section heading that says "articles 9661 through 9664", it's fairly daunting. Any and all help is most welcome. Uncle G (talk) 01:48, 5 September 2010 (UTC)[reply]
- Person has created almost 10000 articles and almost all are cruft. Why not launch a bot to delete them all. 67.122.211.178 (talk) 08:38, 5 September 2010 (UTC)[reply]
  - They may seem like cruft to people not interested in the topic, but I can tell you that there are plenty of people who think its worth writing about former European champions and world record holders such as Vera Nikolić. Sillyfolkboy (talk) (edits)^{Join WikiProject Athletics!} 09:58, 5 September 2010 (UTC)[reply]
    - If they were so important other people would have written about them. How about deleting all the ones that have less than 500 characters of content added by other editors. That would still get most of them, making the remaining problem a lot smaller. Any that are worth having will get recreated by someone else sooner or later. 67.122.211.178 (talk) 17:27, 5 September 2010 (UTC)[reply]
  - To be honest I tend to agree with the bot idea, or at least deleting on sight anything more than a few sentences. Frankly I think the community shouldn't use resources trying to save or even think twice about these articles when the problem is of such a scale... --Mkativerata (talk) 10:24, 5 September 2010 (UTC)[reply]

Ban

Per the above, which appears to show that this person is a serial copyright violator, I propose a community ban. 160,000+ edits or not, we do not need these headaches. So what now, we will have to dig through all of his contributions looking for things he stole wholesale from others? Kindzmarauli (talk) 03:34, 5 September 2010 (UTC)[reply]

Could we get him involved in cleaning up his own copyvios? 67.122.211.178 (talk) 07:23, 5 September 2010 (UTC)[reply]
- Given that they continually denied that there was any problem at all in the face of diffs, could we actually trust him to clean up his own copyvios? Normally I'm all for any help we can get with copyvios, but I'm not seeing any indication that they actually would (could?) help. VernoWhitney (talk) 12:29, 5 September 2010 (UTC)[reply]
- 67.122.211.178, I suggest that you try doing so. Go to User talk:Darius Dhlomo and participate in the discussion of that very thing, there. Uncle G (talk) 12:54, 5 September 2010 (UTC)[reply]
  - He may be able to help identify sources, but my initial idea to have him review these to see which had more than a few sentences was obviously naive. I realized he was obscuring the problem, but I did not realize that he would obscure it so far as to assure us that this happened in no more than 15 articles, when this is so obviously not the case. We can't trust him to help identify issues. :/ --Moonriddengirl ^(talk) 13:49, 5 September 2010 (UTC)[reply]
    - Well he doesn't (looking at his talk page) seem to be able to help with sources all that much either. I was really disappointed with his reply to my questions - instead he seems to be suggesting that it is the fault of the community for not giving him harsh enough warnings and he felt that, despite the ones he got, he was doing a "fine job". It's not as if the notices are unclear about policy :( --Errant [tmorton166] ^(chat!) 14:03, 5 September 2010 (UTC)[reply]

(outdent) So nobody supports the community ban proposal? Kindzmarauli (talk) 23:31, 6 September 2010 (UTC)[reply]

I'd been waiting to see if he was going to be useful in evaluating his articles. Based on his talk page, it doesn't look like it. He has failed to identify several copyvios. At this point, I'm not sure who would be willing to lift his block. --Moonriddengirl ^(talk) 23:51, 6 September 2010 (UTC)[reply]
Right now we have an active indef block, which seems fine for the moment. I don't think the "nobody gave a final warning or shorter block" is a valid excuse for this, but it's something for the admin/CC community to learn from. Users as unresponsive as Darius sometimes require LART to catch on to how serious their problems are. 67.122.211.178 (talk) 03:22, 7 September 2010 (UTC)[reply]

Mass deletion: Give up and start over

This has been raised above, inline, but I wanted to call it out for discussion here. As near as we can tell, everything this person has written that is longer than a few sentences is a copyvio. This means that the well of articles he has created -- barring the ones that are simple lists of data -- are, quite simply, poisoned foundations upon which we're letting others build.

I propose that we delete every article this user has created, with an exception carved out for data-only articles like lists of winners. This will no doubt be upsetting to those who have worked on those articles since, but I don't see any other way to fairly respect the rights of the original content creators and abide by our own policies. It's very likely that there are others willing to step in and create replacement articles afresh, and I'd rather encourage that than continue to build atop a weak foundation. The task being asked of the CCI - verify the copyright status of over six thousand freaking articles - is, quite simply, beyond what anyone should be asked to do. It is a Sisyphean task. So, if you'll permit me another Hellenic analogy, let's cut the Gordian knot and start with a clean slate.

Comments? Nandesuka (talk) 14:18, 5 September 2010 (UTC)[reply]

It's not a Sysiphean task. It's a Herculean task. It's the Augean Stables, to be precise. Uncle G (talk) 14:56, 5 September 2010 (UTC)[reply]
I am loathe to do this under ordinary circumstances, but these circumstances are not ordinary. In addition to the tens of thousands of articles at this CCI, we have dozens of other CCIs, some over a year old, with additional tens of thousands of articles that need review...and where copying is not this blatant. In this circumstance, I'd support mass deletion or at least reduction of articles to a one sentence stub. (Please note: that's what we did with the last comparable CCI ([[1]]), and it still took us a year.) --Moonriddengirl ^(talk) 14:21, 5 September 2010 (UTC)[reply]
- (I should note that the tens of thousands of articles to which I refer are not only the ones he's created, but the ones to which he's substantially contributed: hiding reverts and minor edits, that's 23,197 to be precise. --Moonriddengirl ^(talk) 14:29, 5 September 2010 (UTC))[reply]
  - 23,197 edits or 23,197 individual articles? Uncle G (talk) 14:37, 5 September 2010 (UTC)[reply]
    - Individual articles. Mind-boggling, I know, but to quote exactly: "23197 articles from timestamp 2005-11-09 06:15:32 UTC to timestamp 2010-08-30 22:03:25 UTC." I don't know how many edits that represents, but the first on his full contrib list shows 19 non-minor edits to a single article alone. --Moonriddengirl ^(talk) 14:46, 5 September 2010 (UTC)[reply]
      - When I started going over the biographies some hours ago I started to get a grasp of the scale of the problem here, and I came to much the same conclusion based upon the evidence before me that you, Mkativerata, and Sillyfolkboy all apparently already have: Any flowing prose in an article created by this person was written by someone else. It was either written by a subsequent Wikipedia editor or plagiarized from somebody else's writing by Darius Dhlomo. I thought that the problem wasn't going to get larger.
        However, that article count manages to do exactly that. My perspective on that is that it is of a similar scale as reviewing my contributions (under just this account, not my 'bots or before I had an account). I have, as Uncle G (talk · contribs), touched fewer pages, across all namespaces taken together, than that. Uncle G (talk) 15:23, 5 September 2010 (UTC)[reply]
I'd prefer a different approach, of finding some way in which we can rapidly trim the current CCI listing. Then we can review what is left to see whether we still have an unmanageable problem. Is there some set of criteria that we can mechanistically apply to rapidly eliminate the hundreds of 1-paragraph pretty much data-only stubs that this person has made? There are quite a few of them, and eliminating them I suspect would reduce the size of the problem significantly. Moonriddengirl, what is your view on the possible copyright infringement status of articles such as … spins wheel … Jennifer Whittle for example? Uncle G (talk) 14:37, 5 September 2010 (UTC)[reply]
- Minimal creativity, minimal content. I would regard that as a safe stub. If those couple of sentences were highly idiosyncratic, I'd probably look for a source. :) (Compelled to come back and clarify: I'm not saying that could not be a copyvio; it could, if it copies from another sources and especially if it is one of dozens of articles he's copied from that same source, which would clearly not be a de minimis situation. This is a risk assessment question.) In terms of other alternatives, there is an image-based CCI on which I'm working that is not this scale where most of the images are free of copyright problems. I am mass sorting these to separate out the ones that need review. If we had somebody of great patience who could separate these articles according to "likely to be a problem" and "not at all likely to be a problem", that would help. I had planned to ask the contributor to do that himself, but, as I said above, I'm no longer sure we could trust him with that task. --Moonriddengirl ^(talk) 14:49, 5 September 2010 (UTC)[reply]
  - I agree that this is about risk evaluation, rather than certainty. Uncle G (talk) 15:23, 5 September 2010 (UTC)[reply]
Concur this should be reserved for special circumstances and this clearly qualifies. Looking at the scale of the problem I don't see how a volunteer effort could clean all of that up. Extreme measures for extreme actions so I support letting the bots loose to undo what he has wrought. If some stubs are lost in the process they can always be recreated if they are notable by others. --WGFinley (talk) 15:07, 5 September 2010 (UTC)[reply]
This is an enormous number of articles; do we really think that these are entirely (or almost entirely) copyright violations? I don't know how representative a single selection could be out of thousands of articles, but Swimming at the 1997 Summer Universiade, for example, doesn't seem to be a copyvio (I couldn't find anything for it on google except sites which copied the wikipedia article). GiftigerWunsch [TALK] 15:11, 5 September 2010 (UTC)[reply]
- I'm actually wondering how he hasn't had hundreds of warnings from the Coren searchbot by now... GiftigerWunsch [TALK] 15:15, 5 September 2010 (UTC)[reply]
  - No, I don't think they're entirely or almost entirely copyright violations. I think many of them are harmless charts and tables. I think, though, that the number of articles that are copyright violations will probably number in the hundreds. High hundreds or low hundreds? I don't know. --Moonriddengirl ^(talk) 15:33, 5 September 2010 (UTC)[reply]
  - This very probably is a weakness in the 'bot and in the Google Web approach. Take Mikiyasu Tanaka, for example. Picking some phrases at random from the article (e.g. "Tanaka was sent abroad by the Japan Olympic Committee to study volleyball") and giving them to Google Web doesn't turn up the FIVB profile that it was copied from. But it is a copy, nonetheless. The sentences are in a different order. But they are the same sentences, the only changes being things like exchange of proper nouns for pronouns and the like. (In the original, it is "he was sent abroad by the Japan Olympic Committee to study volleyball".) Uncle G (talk) 15:36, 5 September 2010 (UTC)[reply]
Giftiger , try this [2] for Swimming at the 1997 Summer Universiade. 81.145.247.158 (talk) 15:40, 5 September 2010 (UTC)[reply]
- The fact that I couldn't find that quickly and that it's a single article out of thousands, does not bode well for our chances of being able to manually fix all these copyright problems... GiftigerWunsch [TALK] 15:43, 5 September 2010 (UTC)[reply]
  - Ahem, it's listed as one of the references in the article. 81.145.247.158 (talk) 15:46, 5 September 2010 (UTC)[reply]
    - It would be nice if we could rely on that, but we can't always. :) --Moonriddengirl ^(talk) 15:47, 5 September 2010 (UTC)[reply]
    - There was no clue in the article where Edward Liddie was lifted from, for example. Uncle G (talk) 15:53, 5 September 2010 (UTC)[reply]
Rather than a straightforward mass deletion, may I suggest an element of triage? Some of the articles that this editor created will have been edited by others, and some will be more or less notable than others. If we identify and delete those that are tagged as orphans, unreferenced or tagged for notability would that leave us something more manageable? Ϣere SpielChequers 16:04, 5 September 2010 (UTC)[reply]
- Almost certainly not. I've reviewed a few hundred of the biographies now. Yes, it's less than 5% of the problem, but I was selecting at random, from the list before it was sorted, so I have little suspicion that my sample is biased. Notability is almost never an issue on which these subjects have been challenged or tagged. These are not exactly minor sporting figures and events. Likewise, orphan status would be problematic. Many of these articles are on navigation templates for sports teams, regular sporting competitions, and the like, and are unlikely to be orphans. (Quite a few cross-reference one another, too.) Nor, indeed, is lack of any citations a recurrent issue. Darius Dhlomo has linked almost all of xyr creations to on-line sports databases and the like. As criteria for filtering out the problematic articles, from what I've seen I suspect these won't be useful at all.
  I suggested that we find some filtering criteria, above. I haven't yet come up with any, and Moonriddengirl quite rightly notes, above, that it might not be safe from a copyright perspective to even do that. Even the 1-paragraph stubs might be a mass copying exercise, from some source that we are unaware of. All of us who have reviewed the article set so far seem to have come to the same conclusion, that Darius Dhlomo simply doesn't write original prose, at all, anywhere, even if it's only a couple of sentences to make up a small paragraph. Pick a couple of hundred for yourself, check them for copyright violations, and see what conclusions you draw.
  If you find from doing so some triage criteria that actually work in practice, that would be good news, of course. ☺ Uncle G (talk) 16:29, 5 September 2010 (UTC)[reply]
I'm going to support whatever Moonridden girl thinks is best. I'm convinced we are going to have to take drastic measures here. If she thinks triage would work, fine, but on the other hand, I don't want to give already over-burdened editors doing herculean work in the copyright field even more to do. Dougweller (talk) 16:13, 5 September 2010 (UTC)[reply]
- I don't know what bots can do. There are some good suggestions in this thread for narrowing down the list by presumptively deleting those that are least likely to impact others in the project, but I'm afraid that short of mass deletion the only way to process most of this is going to involve a human being (or two or ten) looking at each article. I would definitely support at this point simply wiping out creative text supplied by this contributor. But it's still going to take a ton of man hours just to review them all. --Moonriddengirl ^(talk) 21:19, 5 September 2010 (UTC)[reply]
  - I think some simple criteria can be defined and then a script can check all the articles against the criteria without humans having to look at them. Defining the criteria would take a little bit of work. Example criterion: find all articles that don't contain text added by humans other than Darius Dhlomo. The "text added by humans" part means ignore edits made by known maintenance bots (interwiki etc), edits to metadata only (like categories), or edits with certain strings in the edit summary indicating various script-assisted edits unlikely to add new human-written text to the article. Deleting those articles might shrink the problem by enough to make manual triage feasible for the remaining ones. 67.122.211.178 (talk) 22:23, 5 September 2010 (UTC)[reply]
How about running a script that identifies all of these articles with no more than 500 characters (or some other number) contributed by editors other than Darius Dhlomo. That might be most of the affected ones and they can then be deleted, making the problem a lot smaller. 67.122.211.178 (talk) 18:02, 5 September 2010 (UTC)[reply]
As per my comment above, I support mass deletion here. As I read the evidence from Uncle G so far (thank you), it's unlikely we can find any safe triage parameters. Although there's no great rush so I'm more than happy to wait for suggestions.--Mkativerata (talk) 19:18, 5 September 2010 (UTC)[reply]
Comment: a skim of some smaller contribs suggest quite a few edits are just editing categories, DEFAULTSORT and the like. Can these be excluded from the listing using some automation? As for creations, I would suggest nuking the lot (perhaps leaving a log which then someone like Article Rescue Squadron can use, if they feel like taking responsibility for checking individual entries. In this case, for "nuking", read "userfying" or "incubating".) Rd232 ^talk 00:15, 6 September 2010 (UTC)[reply]
- I don't think userfying or incubating copyvio articles gets rid of the copyright problems. It probably makes the problems worse. 67.122.211.178 (talk) 05:42, 6 September 2010 (UTC)[reply]
  - Properly userfied (or any incubated) articles are hidden from search engines, and so effectively not really published any more. So a bot can do this for all questionable pages immediately, and then a little more time can be taken to see if there's anything that can be salvaged. Then delete the remaining userfied/incubated pages. Incubated pages are deleted after a time anyway (1 month?). Of course they would need to be tucked away in a subsection or something of the Incubator, to avoid swamping everything else in it. Rd232 ^talk 09:10, 6 September 2010 (UTC)[reply]
    - That is a good point. I think you are right that moving the articles to incubation can be better than leaving them in article space. 75.57.241.73 (talk) 23:18, 8 September 2010 (UTC)[reply]
Suggestion Just have a bot (you can use AWB) mark all of them as CSD: G12 for copyvio. They can then be deleted very easily and quickly. Presumably whatever admin gets to it will check the copyvio status. --Selket ^Talk 00:38, 6 September 2010 (UTC)[reply]
- That would be throwing the baby out with the bathwater. I've been looking at random examples, & either I don't understand the criteria for copyright violations (although the example above for Swimming at the 1997 Summer Universiade was pretty obvious when I found the right revision), or I'm not picking the right examples. Anyone but the most painstaking Admin, when faced with all of those edits tagged as CSD will only examine so many at the beginning before either giving up -- or simply untagging the rest. (And if I understood the proper way to clear those which I don't think are copyvios, I'd offer a hand thinning out this list.) -- llywrch (talk) 02:58, 6 September 2010 (UTC)[reply]
- Selket, we're talking about ten thousand articles, or maybe even 20,000+. It would take admins years to work through that many. 67.122.211.178 (talk) 05:42, 6 September 2010 (UTC)[reply]
- Let's also not forget that {{db-g12}} has a mandatory url parameter; a reviewing admin should simply remove the template if it doesn't have a url parameter. How would we automatically figure out what the url(s) being copyrighted are? GiftigerWunsch [TALK] 08:27, 6 September 2010 (UTC)[reply]
Support deletion of them all. It's better to lose some good contributions than to keep so many copyright violations, and it's nearly impossible to check them one by one. Delete them all and ban the user. Fram (talk) 08:21, 6 September 2010 (UTC)[reply]
Support Mass deletion. There is no criteria that can reasonably exclude the right (or wrong) kind of content here. While the worry of losing hundreds of "our" articles is understandable, in reality we have to remember that articles built on other people's content aren't ours to begin with. MLauba ^(Talk) 08:46, 6 September 2010 (UTC)[reply]
Support mass article deletion. 20,000 over articles is just too many to go through, presuming that most of them are copyright violations. I really don't see a way where we can check them manually one by one. Bejinhan talks 10:24, 6 September 2010 (UTC)[reply]
Mass deletion is a bad idea, people. That would get our article on 1943 robotically deleted. What we need here, if we truly are going down this route (which I'm sure we're all very hesitant about), is a special-case speedy deletion criterion that we can apply, such as "created by Darius Dhlomo, no substantive content edits other than by Darius Dhlomo, and containing actual running prose commentary rather than just raw uncopyrightable names, numbers, and dates". We need community authorization for Moonriddengirl and other administrators to perform speedy deletions under that specialized criterion. Uncle G (talk) 10:54, 6 September 2010 (UTC)[reply]
- It's not a speedy category, but we sort of already have authorization for that, though I don't trot it out under ordinary circumstances. Per policy at Wikipedia:Copyright violations: "If contributors have been shown to have a history of extensive copyright violation, it may be assumed without further evidence that all of their major contributions are copyright violations, and they may be removed indiscriminately." Ordinarily, when I encounter an article at CCI that seems to be a copyvio but I can't prove it, I use the copyvio template on the face and Template:CCId to give interested contributors a week to look at them and offer input. (Keeping in mind that by the time we get to CCI, we have verification of multiple violations of copyright policy; this template and this approach are not supported by policy where there is not a proven history of extensive copyright violation from a contributor.) The problem with this approach here is that all of these articles would be listed by bot at WP:CP, which would totally break the board. If we created a different template for the face that would not be placed at CP, but instead categorized by date, it could be manageable to have a bot tag at least the articles he's created. It still requires human review, which would be time consuming, but we could then delete or stub the ones to which he's added substantial text, remove the tag from those without. It's kind of a cross between the delete them all approach (which, having worked CCIs for some time now I understand in this case) and the legitimate desire not to lose more than we have to. --Moonriddengirl ^(talk) 11:47, 6 September 2010 (UTC)[reply]
  - If you want a category for the created articles to progressively de-populate, I could get Uncle G's major work 'bot to append Category:Articles created by Darius Dhlomo (or some template including it) to everything on VernoWhitney' original list. Would a category with just under ten thousand articles in it be useful? There would be no date or size sorting. Uncle G (talk) 12:58, 6 September 2010 (UTC)[reply]
    - Yes, that would be helpful. I was just working up a template based on {{copyviocore}} and {{CCId}} with a similar notion: User:Moonriddengirl/CCIdf. Not sure if we should blank these articles as we do with {{copyvio}} (if nothing else, it makes it clear that there's a time limit) or take an approach more like {{PROD}}, but since I used {{copyviocore}} it's created from the presumption of blanking. Would something like that be helpful? It's still going to be a ton of work, but it would make the job easier if we delay admin processing for seven days. At that point, we can G6 anything that meets the criteria: extensive creative content added by Darius Dhlomo that cannot be removed without leaving an unusable article (similar to G12). It also allows interested contributors an opportunity to get rid of all creative content added by Darius Dhlomo. But we'd need to do something to flag if the template is removed out of process; in copyright cleanup, that happens quite regularly. People don't want the article deleted, and they don't much seem to care if it's a copyvio or not. --Moonriddengirl ^(talk) 13:13, 6 September 2010 (UTC)[reply]
      - I'm rather taken with Rd232's idea of — well, to be frank — sharing the pain. I forsee three things needed:
        A template for the 'bot to blank the article with (Community discussion needed: Should the 'bot blank the article?) I've got something a bit shorter than User:Moonriddengirl/CCIdf in mind. I'll be bold if you've no objection. I think that we might need two templates, one for the blanking and one for a deletion nomination.
        
        An explanation of the 'bot's task, to be used in all 'bot edit summaries so that people seeing their watchlists light up with a thousand blanked articles have somewhere immediately to go for an explanation.
        
        Instructions for editors on what to do now, to be linked to from the template notice.
      - Additional points: The template notice must be carefully worded. I don't think it fair to have Darius Dhlomo's name come up all over Google. The instructions must be clear that this is a complex task that can end in a multiplicity of outcomes (from {{copyvio}} to simple removal of the template). The instructions must also be clear that editors restoring content shoulder the responsibility for doing so. All of the notices and instructions must be hashed out before the 'bot begins. And we need community attention given to the fact that a 'bot is about to mass-blank some ten thousand articles. (I've updated centralized discussions, and to bring more attention here. I've also put a notice on the 'bot owners' noticeboard.)
        I cannot help you with noticing template removals. But there are other 'bot owners who probably can. There are 'bots that note Proposed Deletion challenges. Uncle G (talk) 14:09, 6 September 2010 (UTC)[reply]
        No objections from me. :) I'm all about getting the work done, however we can best do it. You can change my mock-up directly or build your own, whatever works. --Moonriddengirl ^(talk) 14:34, 6 September 2010 (UTC)[reply]
        I just took a look at User:Moonriddengirl/CCIdf, but I think it's a little impractical, and should be closer to what I suggested below: it should be removed by anyone who feels that they have addressed the copyright concerns. If it remains after a week, it should be deleted as IAR in a similar way to a PROD. Placing a template like this such that admins have to do all the work, would mean this problem is likely to never be solved. Once the majority of the articles are deleted, because no one has challenged their deletion as copyright vios, we can manually check anything that remains to confirm that the users who removed the template because they didn't find / managed to address the copyvios, has been properly addressed. GiftigerWunsch [TALK] 14:39, 6 September 2010 (UTC)[reply]
        
        I don't think the articles should be blanked, either; that's likely to just impede evaluation of the articles. GiftigerWunsch [TALK] 14:41, 6 September 2010 (UTC)[reply]
        
        Note: I have just created an alternative draft version of the template in my userspace; comments are welcome. GiftigerWunsch [TALK] 14:48, 6 September 2010 (UTC)[reply]
- ? Darius didn't create 1943. Is it being suggested to delete all articles Darius has ever touched? I thought we were deleting all Darius creations in order to cut the size of the list of contributions needing review. Maybe do that, and blank/special tag all articles he's touched (excluding identifiable cases of minor changes only, like categories)? Rd232 ^talk 12:06, 6 September 2010 (UTC)[reply]
  - 1943 is on the list of articles that we have. It's on page 24. (Yes, I've been to page 24. I've even tagged 1943 as not a copyright violation.) Mass deletion of everything on the list gets 1943 and many other such articles deleted. We don't need mass deletion, and mass deletion wouldn't be right. What we need is (a) community confirmation that it's an acceptable loss to the project to lose articles such as Paul Easter and Yohann Bernard, (b) community confirmation that we don't trust any running prose content by this editor not to have been copied from somewhere, and (c) community confirmation that it's an acceptable risk to the project to retain articles such as Matías Médici and Franklin Chacón. Uncle G (talk) 12:37, 6 September 2010 (UTC)[reply]
    - I don't think anyone has proposed mass-deleting every article Darius contributed to. Just the ones he created, and (in some versions) just the ones he created that meet certain other criteria (like absence of substantial contributions from other users). So bringing up 1943 is a red herring. 67.122.211.178 (talk) 17:55, 6 September 2010 (UTC)[reply]
Alternative: perhaps there is a slightly less destructive way to go about this. Instead of outright deleting every article the user created, why not PROD them all as being potential copyvios, directing to this discussion, with the added condition that users removing the PROD are asserting that they have looked for copyvios and dealt with any they've found? This could be explained in the prod message. The majority of the articles will probably be left for the PROD to expire, in some cases users who have contributed a decent amount to the articles will deal with the copyvios and remove the prods, and then we're just left with a hopefully much smaller number of articles where the prod has been removed without the copyvios being solved; whichever articles survive can then be checked by those who are willing to help solve this case. Any thoughts? GiftigerWunsch [TALK] 12:23, 6 September 2010 (UTC)[reply]
- It may also be preferable to use a custom template instead of a PROD, so that editors don't get confused that additional conditions have been applied to the PROD, and then articles which still have the template after a set time (7 days? 10?) could be deleted as IAR. GiftigerWunsch [TALK] 12:25, 6 September 2010 (UTC)[reply]
- This is a good idea.
  Also, echoing Rd232, my thought was in fact that we delete (or, per Giftiger, custom-prod) every article Darius created, not every article he touched. Nandesuka (talk) 12:30, 6 September 2010 (UTC)[reply]
  - Not quite. Darius Dhlomo created Albertina Dias, but most of its content was written by Sillyfolkboy. There are plenty like that. Uncle G (talk) 12:58, 6 September 2010 (UTC)[reply]
The sheer number of copyright violations uncovered here makes it necessary to take drastic action—even if that means deleting thousands of articles. As Moonriddengirl says, there is already a process in place to do this, and I trust the people involved with this to handle the deletions in an appropriate way. GiftigerWunsch's suggestion in the above post is reasonable; {{CCId}} (which Moonriddengirl mentioned above) can be used for that purpose. Ucucha 12:34, 6 September 2010 (UTC)[reply]
- Whoops, I hadn't read Moonriddengirl's suggestion above; I guess great minds think alike ;) GiftigerWunsch [TALK] 12:42, 6 September 2010 (UTC)[reply]
  - I guess so. :) I'm posing some more ideas on this a bit higher up. --Moonriddengirl ^(talk) 13:13, 6 September 2010 (UTC)[reply]
    - I think the best thing to do is have a bot deal with every affected article. Initially, this would be page blanking, and replacing with a Darius-related explanatory note template. The copyright problem then goes away immediately, and more time can be taken to rescue articles, with effort drawn from lots of people not normally active on copyright issues. Then, after say 30 days, mass-delete (or possibly mass-incubate, if to a special subsection to avoid swamping the WP:INCUBATOR) any still tagged. Use a bot to notify major contributors of the action, and thus spread the work involved very very widely. Rd232 ^talk 13:22, 6 September 2010 (UTC)[reply]

Mass blanking of ten thousand articles by a 'bot

Just to make this clear, with its own section heading, here's the idea:

A 'bot goes through everything on VernoWhitney's original list of some ten thousand articles, which are the articles that Darius Dhlomo created. It blanks the article and replaces it with a template notice.
- Community discussion required: Should the 'bot blank the article? This is a copyright issue.
I volunteer Uncle G's major work 'bot for the task.
- Community discussion required: This 'bot does not have the 'bot flag. The 'bot flag will stop people's watchlists lighting up with thousands of blanked articles. I'm happy to have the 'bot flagged for the duration of the task. But do we want people not to notice?
The edit summaries of the 'bot link to an explanation of the 'bot's task, which gives people something to look at straightaway to see what's going on and why.
The notice itself links to instructions for editors, with a clear procedure for assessing copyright infringement.
The notice also categorizes all articles into a (hidden) category Category:Articles created by Darius Dhlomo.
- Community discussion required: Category:Mass infringement copyright cleanup or some such instead?
Editors assess the status of the article and act appropriately.
- Community discussion required: Do we put a time limit on this? Do we incubate any articles left blanked after 30 days, per Rd232 above? Or do we just leave them blanked long-term for people to address at leisure?
We provide a streamlined version of {{copyvio}} that is dedicated to this task and that doesn't have all of the overhead of the normal procedure. An article tagged with the special process can go straight to deletion assessment by an administrator without the additional listing overheads and suchlike. But administrators can only delete such articles if they were previously tagged as part of the cleanup effort in the first place. (Vandals don't get to abuse the template in the obvious way.)
- Community discussion required: Could we just re-use the existing speedy deletion notices instead?

None of this is happening immediately. The relevant notices and templates need to be set up before we even think of starting such a 'bot task. This is a proposal, condensed from the above. There are questions yet to be answered. Note that it addresses just under ten thousand articles. There are just over thirteen thousand articles in the cleanup list. If all goes according to plan, this will let the CCI folks reduce the list to just the three thousand or so articles touched by Darius Dhlomo but not created by xem.

The major advantage of this over the mass deletion idea is that it shares the task around ordinary editors, rather than concentrating it in the hands of a handful of administrators. Everyone has the tools to action the next step after a blanking.

Please discuss. Uncle G (talk) 14:53, 6 September 2010 (UTC)[reply]

Support this seems more sensible than mass deletion, may I assume that the categories, tags and external links will be left unblanked as they are unlikely to constitute a copyvio? Also would it be possible to run this using a bot without the bot flag? Like many users I ignore bot edits on my watchlist so I would be unaware of any articles that I'm interested in being blanked. Also is there any chance of a second bot run going through these articles and comparing the last version edited by the copyviolater with the version before it is blanked, as there is no point losing such articles. Ϣere SpielChequers 15:40, 6 September 2010 (UTC)[reply]
- There is some merit to the idea of saving the categories/tags (and sigh, maybe the extlinks too), though that info can be saved elsewhere even if the articles are deleted. 67.122.211.178 (talk) 18:10, 6 September 2010 (UTC)[reply]
- It's either blanking everything or blanking nothing and prepending/appending the template if you want my 'bot to do it. It's only geared for the simplest of content work. I could write something to do more complex editing, but I don't have such ready to hand.
  There's nothing inhibiting a 'bot from running that doesn't have the flag. The flag allows everyone to exclude 'bot-marked edits from various lists, like watchlists and recent changes. Usually 'bots don't make edits that are interesting to recent changes patrollers or people with watchlists. The question is whether that's true in this case. Please discuss.
  As for the last, note that a blanked article is not the end point, but a mid-way point in the process. The whole idea of blanking, rather that deletion, is that we don't lose edit history unnecessarily, and that anyone with the ordinary edit tool can recover content if there turns out to be no copyright violation. Uncle G (talk) 16:38, 6 September 2010 (UTC)[reply]
  - IMO the bot should be flagged (RC patrol has enough to do without dealing with this) but should log all its actions on some special pages that everyone can review. That includes describing its analysis of pages that it then decides not to edit, so the log pages would be more informative than the bot's contrib history. The bot should probably run under a specially made new account for this purpose too. 67.122.211.178 (talk) 22:16, 6 September 2010 (UTC)[reply]
Comment Do you think the bot could identify and tag separately those articles with > 500 edits, or something in that neighborhood? At that point there is likely to be little residual copyvio, so it needs to be looked at differently. Also, would the coren copyviofinderbotthingy (phew) bot be able to check a category, and would it be any use? 69.236.190.48 (talk) 16:08, 6 September 2010 (UTC)[reply]
- If VernoWhitney is capable of coming up with separate lists of such articles to work from, I can certainly tag them differently for each list. I don't know what CorenSearchBot is capable of in respect to scanning old revisions of existing pages in categories. Uncle G (talk) 16:38, 6 September 2010 (UTC)[reply]
- I don't believe that articles with >500 edits necessarily have little residual copyvio. Look at the history of Between Silk and Cyanide, which I just had to mostly-blank because somebody inserted a copyvio in 2006, even though dozens of other editors worked on the article after that. The additional editing expanded the article a lot and smeared the copyvio all over the article, so it was no longer revertable in one lump. There is a comment on the talk page explaining further. I do think articles with substantial text added by human editors (bots edits don't count) other than Darius should be flagged. 67.122.211.178 (talk) 18:10, 6 September 2010 (UTC)[reply]
  - Well, I think that articles with 500 edits generally are large enough they would need human review the most of any of the articles. NativeForeigner ^Talk/_Contribs 18:56, 6 September 2010 (UTC)[reply]
    - Yes. Perhaps the bot could blank those in two edits. It would first detect all the text that had been added by Darius (as opposed to other editors) and mark that text with a font or color change and save it. Then it would blank the whole article and save again. People wanting to restore the article could look at the marked-up version in the history and use that to help separate Darius text from non-Darius text. 67.122.211.178 (talk) 22:29, 6 September 2010 (UTC)[reply]

Prod-like proposal

I thought I'd move my proposal here as it's getting lost in the discussion and I feel it would be beneficial. I propose that all articles that Darius has created be tagged with a template similar to a PROD (I've created a draft here which all are free to edit and comment on), such that the article will be automatically deleted after 7 days (or perhaps longer, depending on consensus). Like a prod, anyone can remove the template, but unlike a prod, in doing so they are asserting that they understand copyright violation, and have thoroughly checked the article and fixed any copyvios. This would be clearly explained on the template.

Any articles which are not checked or which cannot be saved, will still have the template after the time has expired, and will be deleted per WP:IAR. Those articles which survive can then be double-checked to confirm that they are not copyvios.

Hopefully during this process, a large number of editors will have noticed that an article they've contributed to is being sorta-prodded, and will help to remove copyvios and then remove the template. Any articles where no one noticed the sorta-prod deletion are acceptable losses, being deleted per usual PROD rules anyway. GiftigerWunsch [TALK] 15:02, 6 September 2010 (UTC)[reply]

GiftigerWunsch [TALK] 15:02, 6 September 2010 (UTC)[reply]

This makes much more sense then a mass deletion. I support this. elektrik SHOOS 16:57, 6 September 2010 (UTC)[reply]
- By this process, the articles will be automatically deleted unless somebody objects. By the process above, human review is needed, but articles that are not infringements will be salvaged. --Moonriddengirl ^(talk) 16:59, 6 September 2010 (UTC)[reply]
  - On the other hand, manually reviewing thousands of articles requires an enormous amount of time and resources, and alerting those who have contributed to the articles to help sort out the copyright issues by otherwise having them deleted after a fixed period, means that the job will be distributed among many people, and hopefully achieved in less time. GiftigerWunsch [TALK] 17:50, 6 September 2010 (UTC)[reply]
    - On the gripping hand, when the 7 day Proposed Deletion period expires on ten thousand articles all at the same time, we're right back in the same position that we started from. ☺ Uncle G (talk) 18:41, 6 September 2010 (UTC)[reply]

There's not that much different between this and the above proposal, except in the matter of imposing a time limit, and defaulting to automatic deletion. Defaulting to automatic deletion will get screams of outrage from people who find out after the fact, I predict with a fair degree of confidence. (That's in part why I've pointed to this discussion on as many noticeboards as I have. I want to reduce the number of people who find out after the fact.) "Why wasn't I warned before you went off and deleted thousands of articles relevant to my WikiProject? I'm going to abuse an administrator for this!", they'll say. Leaving articles blanked, with just a warning notice on them, to be addressed at somewhat greater leisure, avoids that drama before it starts. It also addresses the concern that people have — that we all have — about not losing articles that aren't copyright violations at all. Uncle G (talk) 18:41, 6 September 2010 (UTC)[reply]

The above proposal is now more concrete. I've boldly updated User:Moonriddengirl/CCIdf and written Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation (for edit summaries) and Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help (for the notice). Please review, discuss, and boldly improve. Uncle G (talk) 18:41, 6 September 2010 (UTC)[reply]

I'm somewhat concerned, though, that "at somewhat greater leisure", considering that we're talking about thousands of articles, is going to mean months, or longer. If the articles are all going to be blanked until such time as an administrator (or at least other editors) manage to check them for copyvios and restore them, what's the difference? If the deletions are found out after the fact, interested parties could request undeletion so that the article could be rechecked for copyvios, the references could be salvaged, the article could be written, or whatever else. GiftigerWunsch [TALK] 19:25, 6 September 2010 (UTC)[reply]

Thousands don't need to take months if everybody pitches in. I haven't seen you at CCI yet ;) —fetch ·comms 02:12, 7 September 2010 (UTC)[reply]

Fetchcomms, rather than looking at 1000's of articles for a few seconds each, I wonder if you could help investigate Rocío Ríos (see section below). That boilerplate text came from somewhere. If we can't establish where, then we can't rely on whatever processes you're proposing to use on the other 10,000 (or 20,000 or 40,000 depending on how you count) affected articles. 67.122.211.178 (talk) 09:13, 7 September 2010 (UTC)[reply]

I like this suggestion. Will discuss more at Nuclear option below. --Moonriddengirl ^(talk) 12:14, 7 September 2010 (UTC)[reply]

When voting is not evil

I'm suggesting something that is "out of the box" to help solve this problem. The outlines of this suggestion are as follows:

We have a real, majority-rules vote on whether to mass-delete all of the articles DD created. Or to adopt Uncle G's bot proposal above.
A username's vote is counted only if that username has reviewed 10 or more of the articles listed at Wikipedia:Contributor copyright investigations/Darius Dhlomo or one of the associated pages.
One can vote more than once (i.e., review 20 articles, you get 2 votes), under different usernames, etc. One doesn't need to be an Admin.
The vote is a simple yes/no. Either we do it or we don't.
At the end of a week or 10 days, the votes are counted.

what I like about this idea is even if the result is to mass-delete these articles, some will have been examined & determined to not be copyvios. (Based on previous discussions on AN/I, this could lead to as many as 100 people participating, which would mean at least 1000 articles examined.) Hopefully, this would give us further information about a more precise filter for where the copyvios are & aren't. Thoughts? (And while the conversation continues, I'll be working thru the list of possible copyvios; I know I can't save all of the non-copyvios, but I know I can save some.) -- llywrch (talk) 17:04, 6 September 2010 (UTC)[reply]

This doesn't sound like a good idea to me. I'm even skeptical that any of us can review an article and determine the absence of copyvio. Even if the article has no text (e.g. just a table), maybe that table was copied from somewhere. And anyway, voting is always evil, and the likelihood of getting 10000 articles manually reviewed reliably is very small. If you've got a criterion like "all info is in a table and there are no strings of more than 5 consecutive english words" we could run a script that finds and lists such articles. That might be quite a lot.
Anyway, reviewing some tiny fraction (10 or 20) of the articles shouldn't give a special say over the rest of them. If you want extra authority over the whole collection of articles, you have to review all of them. Under your voting scheme, voters should also accept (and be assigned) responsibility for any copyvios that later turn up in articles you have declared to be free of copyvios. Otherwise your suggested voting system gives influence-seekers incentive to "review" articles as quickly as they can, and potentially miss a lot of bad stuff. 67.122.211.178 (talk) 18:12, 6 September 2010 (UTC)[reply]

If a human can't properly determine what is a copyvio, then how can one tell a bot how to do it? And the idea of giving "a special say" isn't that: it's to encourage people to actually tackle the problem of cleaning up this mess, rather than the usual process of talking about the problem. (Out of over a dozen posters to this lengthy thread -- of whom three want a mass deletion -- less than half have even left evidence that they reviewed any of the entries at the CCI.) People here appear a lot more eager to tell us what the solution is & expect someone else to do it, than to actually help fix the problem. And fixing the problem isn't hard just tedious, & would be handled quickly enough if enough people spent their time there & not at this DramazBoard. -- llywrch (talk) 21:43, 6 September 2010 (UTC)[reply]

The difficulty of determing what is a copyvio is at the root of the mass deletion proposal. The proposal is basically that if the article was created by Darius, there is an a priori likelihood that it is a copyvio whether we can find the original source or not, so we should delete it in the expectation that someone else will eventually recreate it if the topic is important. As for special say, no, sorry, if you review 20 articles, that's less than 0.1% of the articles affected, and you can't say "well look, I've made a significant dent in this problem" because 0.1% is insignificant. If you review 5000 of the articles, then your argument may have a bit more validity. That's why most of us (I think) are giving high credence to the views of Moonridden girl, because of the enormous amount of time she's spent dealing with this sort of problem. Looking at 20 or 50 or 100 of these Darius spewings doesn't hold a candle compared to that. 67.122.211.178 (talk) 22:01, 6 September 2010 (UTC)[reply]

Yeah, I've looked through about 40 pages he created, and I have to say that, while most are not vios right out of the box, some are, while some are added later. There's no way to figure out any pattern unless you get through at least a couple hundred. I disagree that we should just delete assuming that they're all vios, which they obviously are not. Manual is hard, but not impossible if everyone helps. These articles are of notable people, and shouldn't be deleted on the assumption that someone will eventually create a non-cv version of them. —fetch ·comms 04:28, 7 September 2010 (UTC)[reply]

Not impossible if everyone helps -- but why should people spend their time that way? [S]houldn't be deleted on the assumption that someone will eventually create a non-cv version of them. I don't know what the issue is, we have 3.3 million articles that all got created somehow, and we're talking about a bunch of almost content-free stubs once the presumed copyvios are removed. Maybe it's just me but I don't see much point in getting attached to such articles. Nobody is proposing to salt them from recreation. We got along without them up to when they were created (some fairly recently) and (for most of them) nobody else thought they were interesting enough to edit substantially. I could see some value to keeping the names/references/categories in a list someplace. We already have the names in the CCI report, if that helps. 67.122.211.178 (talk) 08:10, 7 September 2010 (UTC)[reply]

Well, if we just reword all of the stubs and delete the (few) longer articles he's written or expanded with prose (not just standard tables), while keeping them blanked until a rewrite (as the option below outlines), I see no reason why they all need to be removed. Hiding the cv while preserving the content history sounds like a fair compromise. —fetch ·comms 00:14, 8 September 2010 (UTC)[reply]

Triage criteria

Let's talk a little bit about triage criteria. All these points are open to discussion, but initially for this purpose I'll assume:

"Triage" is a process designed to be carried out by a bot or script that examines (in some way) all 20000+ articles that Darius has touched, and labels those that meet given criteria that we're trying to specify here. The script should run with no human intervention and not much human review of the final output. Of course we'd first run it on smaller sample sets and examine the results carefully to tune the criteria. Triage should divide all those articles into several possible categories, such as:
- Articles needing no special attention (Darius's only edits didn't add any significant content)
- Articles presumed copyvio and which should be deleted or blanked without additional attention (all significant content in the article came from Darius)
- Articles presumed containing copyvio but which should get careful attention anyway (e.g. article contains significant content from both Darius and others)
- Articles that the bot isn't sure how to classify, but that a human can probably tell with a quick look.

By "manual edits" I mean edits to an article made manually by human editors. "Human" is specified because edits by bots shouldn't count for this purpose. (A spot check of the Darius-created articles indicates that the majority of the edits in them are probably bot edits). Script-assisted human edits (routine maintenance scripts) mostly shouldn't count either. "Text" means any sequence of more than 5 consecutive words in the article body (not in category or interwiki tags). Here is a simple proposal for criteria and labels:

Articles that contain no text added by Darius (just tables with names and numbers) => no attention needed
Articles that contain text added by Darius and no text added by others => delete
Articles containing text added by both Darius and by others => if text is more than 80% Darius, then delete, else flag for attention
Articles with 2 or more manual edits from editors other than Darius => these may be of interest, flag these in a sample set and study for further ideas.

Any thoughts?

67.122.211.178 (talk) 19:11, 6 September 2010 (UTC)[reply]

I like it, but is it technically feasible to separate these articles? If so, go for it. —fetch ·comms 19:54, 6 September 2010 (UTC)[reply]

Yeah, most of the above should be doable. I may try coding something later today. 67.122.211.178 (talk) 20:17, 6 September 2010 (UTC)[reply]

This sounds really interesting; I'll look forward to seeing what you can come up with. I would, though, want to find some way to exclude those articles that have been cleared already through the CCI. We've got some really good volunteer work going on there. :) --Moonriddengirl ^(talk) 23:16, 6 September 2010 (UTC)[reply]

For the moment, I'm using User:CorenSearchBot/manual as a double-check; it's a bit buggy and not very reliable (I caught two vios it missed), but might help some users. It definitely needs to be reviewed by a human, though, but for articles that already are two sentence stubs, there's really no way it can be a cv if the bot passes the second sentence (as the first is usually changed to be MOS-compliant). If anyone can find a better cv-checker process, however, please tell everyone. The Earwig's tool only does one page at a time and he told me that he doesn't have time to let it process multiple ones at once. —fetch ·comms 00:20, 7 September 2010 (UTC)[reply]

I thought we'd already concluded that a non-finding from the searchbot doesn't really tell us much. We also don't know about possible copying from materials like printed almanacs and magazines, that have never been online so won't show up in any search engines. We're left treating all Darien text additions as vios whether we locate a source or not. Are we still going by that approach? Are there enough different views on this that we should open a discussion section about it? 67.122.211.178 (talk) 01:13, 7 September 2010 (UTC)[reply]

There are a lot of stubs he made that can't be vios. I mean, "X (born [date] in [place]) was a [occupation]" is a standard first sentence; if that was a copyvio, then it'd be pure coincidence. Many of the vios I saw are that he creates a page (most back in 2006), then comes back in 2008 to add the cv in. Now, this isn't the case for every article, as he also creates cvs at the beginning, but deleting everything he created doesn't help. I also have seen several instances of what he has copied being changed over the years so that it is no longer a cv at this point. The only way to do this right is not to take the easy way out and delete all the pages, but rather to separate the more-likely copyvios (excluding category/template/table-only changes and going through the rest). For the issue of print sources, it may be better to stubbify articles to which he has added more than a couple sentences but do not appear to have been taken from Internet sources. —fetch ·comms 02:07, 7 September 2010 (UTC)[reply]

How do we even know that the people in those articles even exist? See trap street. If Darius got a sports almanac and entered info about some fictitious player who the almanac writers invented, that's a copyvio even if no words were copied. I just can't work up much motivation to try to retain articles that nobody other than Darius contributed to. Most are quite unencyclopedic, sort of a phone book about athletic events. 67.122.211.178 (talk) 04:18, 7 September 2010 (UTC)[reply]

I'm not much of a deletionist/inclusionist kind of guy, but I don't think that it's fair to just delete all these potentially useful articles of notable people (to some degree). The issue of fictitious persons in his sources can't really be helped, I guess; we could basically delete every article without a source under that premise. I know that going through the list manually is not desirable, but I'd rather check everything and salvage what we can. —fetch ·comms 04:24, 7 September 2010 (UTC)[reply]

We use an WP:AGF approach to normal contributions, but in Darius's case there's such a rampant pattern of vios that we may be better off treating every one of his contributions as tainted. I asked on his talkpage if he copied from any print sources, but he hasn't responded yet. The articles that don't contain (presumably copied) text inserted by Darius are mostly uninformative stubs, not all that useful compared to just using a search engine. We do in fact now have a policy (being implemented in stages that are still under way) of deleting all unsourced BLP's. 67.122.211.178 (talk) 06:03, 7 September 2010 (UTC)[reply]

Nuclear option

Above I suggested blanking articles and instigating a mass checking effort. But at this point I reckon triage here should involve excluding edits which cannot be copyvios, particularly by virtue of being too small, or merely changing categories etc. Everything else should be presumed copyvio of one form or another, possibly from print sources (and therefore hard to impossible to identify). All the evidence (and Darius' inability to so far point to things which are not copyvios is damning) suggests to me that Darius simply does not write substantive prose. My feeling is he's one of those people who (possibly English not first language?) isn't confident writing, and so virtually always copies with some minor modification. This would explain his ignoring the warnings - he felt simply unable to contribute without doing it in this copyright-violating way. In logical consequence, all prose he's ever written which remains in articles should be deleted as being a copyvio of something or other. This seems to be a situation where it is unreasonable to say "let's see which of these are copyvios, and remove it if proven". "nuke the entire site from orbit. It's the only way to be sure." I'm not even basing my view on the amount of work involved in checking: I'm basing it on the unacceptable likelihood of large numbers of copyvios not being identified, and so retained - especially if the checking is done by people not familiar with copyright checking. So i) bot-blank and tag all affected articles, excluding whatever can be; ii) the tag requires all Darius prose to be removed for the article to be restored. This has the tremendous advantage of simplicity iii) allow a long time to handle blanked articles, say a year. Then delete any that are left. Rd232 ^talk 12:00, 7 September 2010 (UTC)[reply]

I'm beginning to find the sprawling proposals very hard to follow here. :D I agree with you that what cannot be copyvio should be excluded. We have traditionally excluded edits below 100b as likely to be de minimis at worst. It's not perfect, but it's workable. (Workable matters. This is just one CCI. We have several dozen still open, some of which are over a year old.) In this case, I agree that we need to consider that all creative content added by this user is a copyright problem that needs to be removed. I like the tag modification as made by Uncle G (see section above): User:Moonriddengirl/CCIdf. The combination of that tag, Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation and Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help would invite all interested contributors to help out. The only thing we might wish to reconsider is how they can recognize problems. Given that there is a risk of offline sourcing (so far, none found, but some of the sources that have been found didn't show up in a Google search engine), we should presumptively remove or rewrite all of his creative content. Perhaps those who want to help out should be invited just to identify what creative text he added and remove or rewrite it. --Moonriddengirl ^(talk) 12:23, 7 September 2010 (UTC)[reply]

My bad. I guess what I'm really doing here is arguing that the tag directions to editors finding the blanked page should not tell people to check whether there is a copyvio; but to simply to remove (or rewrite) any Darius prose, because any such prose almost certainly is. And it's an error-prone waste of time trying to prove a negative. Beyond that, this is basically the "Mass blanking of ten thousand articles" proposal. Rd232 ^talk 12:57, 7 September 2010 (UTC)[reply]

I agree. I think these two proposals merge well together. --Moonriddengirl ^(talk) 13:08, 7 September 2010 (UTC)[reply]

A lot of what Darius wrote were two-line stubs that probably aren't vios (and if they were, it'd only be the second sentence because the first would have had to be changed to match the MOS) as he uses fairly plain wording everywhere: "X is an [sport] player. Xe won the [medal] with Y country in the Z Olympics. Xyr personal record was [time] at [place].", etc. So, I agree that everything he wrote has the potential to be a vio, but for many of the stubs, just doing a 30-second reword of the only possibly offending sentence or two should be enough to remove any lingering doubt. Until we can run through all those, just keeping it blanked ought to do fine. As long as a query can eliminate the minor diffs (categories, templates, tables, etc. changed only), then we have a lot less to worry about. —fetch ·comms 00:11, 8 September 2010 (UTC)[reply]

Simple proposal

This has three parts:

Skip the articles where he changes less than 200b. He's either adding a row to tables, adding categories, or adding templates.
Everyone stop worrying about this AN/I thread and go check some articles. If there are 23,000 articles on the list, 100 users can go through 230 articles each and that will be that. 230 is not a lot, considering I went through 20 in about three minutes last night, which were ones where about 100kb was changed. If we just skip those, and start at the end, we can eliminate a good portion of the articles as cv-free, and deal with the likely cv ones.
Blank all articles he created, and make a separate list of those to which he added more than 1,000b, which need individual examination.

Otherwise, the triage idea above seems good. —fetch ·comms 19:54, 6 September 2010 (UTC)[reply]

200kb is the size of the entire ANI page. Do you mean 200 bytes? I think that is too many (1 word = approx 6 bytes). 67.122.211.178 (talk) 20:16, 6 September 2010 (UTC)[reply]

My bad. I meant bytes. Fixed accordingly. 200b is around adding a few categories, a medal chart template, or couple of rows to a table. —fetch ·comms 20:47, 6 September 2010 (UTC)[reply]

1 word = 6 bytes so 200b = 30+ words, which can be a pasted sentence. I would not want to accept anything that had more than 4 or 5 consecutive words added, where "consecutive" means e.g. not in separate table cells. 67.122.211.178 (talk) 22:09, 6 September 2010 (UTC)[reply]

Articles to which he has added below 100b are excluded from the listing as minor. If we add them back in, the number of articles we must check jumps from 23197 to 41,108. --Moonriddengirl ^(talk) 22:51, 6 September 2010 (UTC)[reply]

What software are you using to find that? I've started fooling around with some code, but might be duplicating existing effort. 67.122.211.178 (talk) 04:10, 7 September 2010 (UTC)[reply]

It's our CCI tool; you can read about it and access it here. I'm afraid I know zilch about how it works...just that it does. :) --Moonriddengirl ^(talk) 10:40, 7 September 2010 (UTC)[reply]

Should we move this discussion?

Subpaged.

I begin to think we should move this to a newly created project page or set of pages (some such pages have already been created), or an RFC. Otherwise it is going to swamp ANI pretty soon. Part of the discussion should be about technical aspects of proposed bots and scripts, that would be too far in the weeds to clutter ANI with. 67.122.211.178 (talk) 22:48, 6 September 2010 (UTC)[reply]

yes It's already a 1/16th of the page.--intelati^(Call) 22:50, 6 September 2010 (UTC)[reply]

Support enthusiastically. --Moonriddengirl ^(talk) 22:52, 6 September 2010 (UTC)[reply]
Support: move to its own page or series of pages somewhere, this isn't going to get any smaller. Also, as the process continues, this discussion needs to be easily accessible. With ANI's rapid archive rate, it'll be in danger of being archived before its time. GiftigerWunsch [TALK] 22:55, 6 September 2010 (UTC)[reply]
I can't see anyone having an issue with this as long as a moved note is left here, let's just make the move. Where should this be moved to, though? GiftigerWunsch [TALK] 22:57, 6 September 2010 (UTC)[reply]
Wikipedia:Administrators' noticeboard/Mass blanking of ten thousand articles by a 'bot??--intelati^(Call) 23:02, 6 September 2010 (UTC)[reply]

(edit conflict) Usually a subpage of ANI. From the instructions: When moving long threads to a subpage, add a link to the subpage and sign without a timestamp: "Moonriddengirl ^(talk)"; this prevents premature archiving. Move to Wikipedia:Administrators' noticeboard/Incidents/[concise title]." The title here is long and not that descriptive, and I wouldn't want to use the contributor's name as it could be a real name. How about Wikipedia:Administrators' noticeboard/Incidents/CCI? We don't have anything at that title yet. We leave a summary here with ~~~ to prevent early archiving and add {{unresolved}}). --Moonriddengirl ^(talk) 23:03, 6 September 2010 (UTC)[reply]

Breathe deeply

I'm somewhat aghast that the nuclear option, blowing up 10,000 articles by bot, is being discussed so cavalierly. Certainly, this is the worst, most extreme option — to be avoided unless absolutely unnecessary. Having reviewed Darius' talk page — both the current and past versions — I am struck by the fact that he does admit having made serious mistakes but has contended that the number of flagrant copyright violations is relatively minimal in number and that he has offered to help find them and liquidate them.

I wonder why the CorenWhatchamacallit Copyright Bot isn't run over each and every article to which Darius has contributed to flag copyright vios? That's what alerted us of the problem to begin with, did it not? Let the bot check for violations — subject everything to review.

I think the punishment meted out to this editor should be severe, but I don't see why the most draconian corrective measure is being discussed before all corrective options have been exhausted. Bot-check the works and blow up everything that comes back positive for copyright violations... Carrite (talk) 06:12, 7 September 2010 (UTC)[reply]

There seems to have been earlier discussion concluding that bot-checking doesn't help that much and we have to assume it's all tainted. If you are suggesting we rethink that notion, it's probably best to open a new section of this page for thoughts and comments. I personally don't feel very attached to articles with no substantial content contributions from anyone other than Darius. If those articles were so important, other people would have edited them too. I agree about not blowing up the ones with contributions from others. 67.122.211.178 (talk) 07:19, 7 September 2010 (UTC)[reply]

This is not in any way, shape or form, punishment: this is dealing with copyright violation on an unusually vast and barely manageable scale. There is a proposal to blank affected articles with an explanatory note (with variations on the theme), which deals with the problem immediately, allowing anyone interested in the article to deal with the problem. If anything, I have a concern that spreading the copyvio checking so widely risks too many trickier cases being missed by people not normally involved in checking such things. Good instructions will mitigate that, but it's still a worry. Rd232 ^talk 09:05, 7 September 2010 (UTC)[reply]

Carrite, the thing is, 10'000 articles built upon a copyvio aren't ours to take and publish. They aren't even ours to modify: even if the text has been edited out later on to the point there is nothing left of the original, we have created an unauthorized derivative work. Those are not our articles, they're effectively someone else's, and we have no claim on them. There's nothing cavalier about that. MLauba ^(Talk) 09:06, 7 September 2010 (UTC)[reply]

I don't think the "10,000" number is anywhere close to accurate, unless Darius is lying through his teeth. Do we know that the problem is actually this vast? Nor do I have any problem whatsoever blowing up any article found with copyright violations. The question is this: how big is this problem, really? I would suggest that the punishment help mitigate the crime, that for the next six months Darius be limited to editing, with a new account, articles which he created and only articles which he created... With a view to eliminating copyright violations. His work can silently be checked "over his shoulder." At the end of that period, extreme scrutiny should be applied to remaining articles to see if the problem has been fixed or not, and the community can proceed from there based upon findings made at that time. Darius' previous account name should be locked out and a new account name initiated, with edits starting again from zero and no autoreviewed status for multiple years, in my estimation... Current thinking seems to be obsessed with making the problem instantly go away by mass deletion of the good, the bad, and the ugly via automation in one fell swoop. My suggestion is that the culprit be instructed to get to work for half a year fixing his own mess. Carrite (talk) 11:28, 7 September 2010 (UTC)[reply]

With all due respect, copyright violation is not like spelling mistakes, it's not something to be fixed when we get round to it. And you seem not to have heard me when I said this was no about punishment. Finally, there appears to be a consensus that the problem affects an enormous proportion of Darius' substantive prose edits. This is in no way, shape or form a minor issue. Rd232 ^talk 11:41, 7 September 2010 (UTC)[reply]

I'm not saying it's a minor issue and I'm not saying it shouldn't be immediately addressed. And I did hear you when you said this was not about punishment — and I argue that it should be about punishment, with the punishment being the immediate fixing of his mess by the culprit, bearing in mind that Rome wasn't built in a day and that it will take time to ferret out everything... Further, I challenge the assertion that any consensus can be drawn about the scope of the problem until it is systematically studied. See the random sampling below. Expand that process, let's look at this problem scientifically before we go nuclear on it. Carrite (talk) 12:04, 7 September 2010 (UTC)[reply]

Have you looked at the actual copyright investigation subpages? Beginning with Wikipedia:Contributor copyright investigations/Darius Dhlomo and moving through (there's a sidebar above that links to them all), articles that have been checked and cleared are marked

, while articles wherein copyright problems have been found are marked

. The listing of five articles below is significantly smaller sample than has already been evaluated. --Moonriddengirl ^(talk) 12:08, 7 September 2010 (UTC)[reply]

I quickly count 32 green checks and 159 red Xs = 16.75% violation rate. Carrite (talk) 12:19, 7 September 2010 (UTC)[reply]

That would be a much higher outcome than the hundreds I've predicted, but it's possible that contributors are zeroing in on more problematic areas. :/ --Moonriddengirl ^(talk) 12:28, 7 September 2010 (UTC)[reply]

On page 2 I figured out how to let the browser find function do the counting and come up with 7 violations and 100 clean pages = 6.5% violation rate, ~~total now 39/259 = 13.08% violation rate.~~ \Carrite (talk) 12:33, 7 September 2010 (UTC)[reply]

On page 3 it's 5 bad, 98 good = 4.85% violation rate. I need to go back and check my first count mechanically and redo the arithmetic... It looks like a violation rate of under 10%... Carrite (talk) 12:38, 7 September 2010 (UTC)[reply]

% violation rate is meaningless as a proportion of all edits - the problem only applies to substantial prose edits. Most edits (including my sampling on page 3) are not substantial prose; they're adding infoboxes, categories, basic data and the like. Rd232 ^talk 12:53, 7 September 2010 (UTC)[reply]

Darius has no credibility on this issue; he assured us he had done this in no more than 15 articles. We had more than doubled that count before Uncle G stopped counting. Since his block at his talk page, Darius told us that Fabián Roncero was fine; it isn't. He told us that Núria Camón is fine; it isn't. How are we supposed to trust him to identify his copyright violations, much less acknowledge them and address them? Even though I believe that there will probably be hundreds rather than thousands of articles that are a copyright problem by the time the investigation is done, there are still tens of thousands of articles that need review. Having somebody silently check over his shoulder only works when we know that he (a) can and (b) will accurately assist. --Moonriddengirl ^(talk) 11:36, 7 September 2010 (UTC)[reply]

(unindenting) I did a recount, manually counting numbers over 100 since the find feature only counts that high on Safari. I found:

Page 1 — 37 violations, 168 clean pages
Page 2 — 7 violations, 212 clean pages
Page 3 — 5 violations, 98 clean pages
Page 4 — 0 violations, 12 clean pages = 49/539 = 9.09% violation rate

Most of the violations were on the first page. I'm not sure if these things are chronological or not or whether those articles were being critiqued more harshly or what was going on.... The violation rate seems an anomaly for the first page, compared to the next two. What is clear is that we are probably not talking about "10,000" copyright violation articles here, but some substantially lower number in the general range of 5-8% of Darius' total contributed pages. Carrite (talk) 13:05, 7 September 2010 (UTC)[reply]

So, basically, you're saying that you believe my estimated hundreds is low? You may be right. --Moonriddengirl ^(talk) 13:11, 7 September 2010 (UTC)[reply]

There are 13,542 articles in the queue... If you'll accept my premise that the 9.09% rate is somewhat inflated for some unknown reason (learning curve of the editor or more harsh judgment of inspectors) and that the actual rate falls in the 5-8% range, we are talking about between 677 and 1,083 articles with substantial issues, give or take. "Hundreds" is accurate. Carrite (talk) 13:18, 7 September 2010 (UTC)[reply]

I believe that User:VernoWhitney has indicated that the numbering in the queue is inaccurate. We have had some difficulties with putting together the listing because of its scope and the fact that initially we tried to isolate only articles he had created. There are over 23,000 articles listed by our CCI program excluding reverts and minor edits, which makes 5% something in the order of 1,150. --Moonriddengirl ^(talk) 13:23, 7 September 2010 (UTC)[reply]

(Oh, I just noted that above you mentioned not being sure if these things are chronological: they are not. They're listed by size of total contributions, beginning with greater. I would expect more problems in the front end. That's usually the way it goes. --Moonriddengirl ^(talk) 13:25, 7 September 2010 (UTC))[reply]

There are 23,197 total articles, pages 1-10 are articles they created and then the numbering (and order by size) restarts on page 11 for articles they didn't create and just edited. VernoWhitney (talk) 13:36, 7 September 2010 (UTC)[reply]

Thanks. :) More unusual than I knew! --Moonriddengirl ^(talk) 13:39, 7 September 2010 (UTC)[reply]

(unindenting) Okay, so we're seeing a much higher copyright violation rate with long contributions vs. stubs — is that a fair summary? Numbers 1-1000 in size, maybe something like 1 out of ~~six~~ 5 of those are defective, whereas the copyright violation incidence rate falls to what might be considered "normal" levels with shorter contributions (has the question of copyright violation across random WP articles ever been studied? Four or five percent of articles having "problems" would be a pretty reasonable guess, I'd think...).

Anyway, what seems to need to be done is a high-priority manual checking of the top 1000 or so original articles as well as the top 1000 or so content contributions to already-established articles, with maybe some sort of cursory bot-checking of the remaining short and stub articles. Is that a reasonable perspectiive??? Carrite (talk) 16:18, 7 September 2010 (UTC)[reply]

Yes, and quite common for CCIs. I'm not sure what you mean by "normal" levels, though. I don't know if anybody's ever done a random copyvio study of Wikipedia articles. I would kind of hope not, as I'd rather see anybody with that kind of time on their hands trying to help clean them up. :) Again, this is one of dozens of CCIs. We've got more articles than I want to count waiting for review. I don't know that it's reasonable to limit review to 2,000 out of the 23,000+ articles that he's done non-minor edits to. By "top" I assume you mean contribution length: a lot depends on the pattern of the CCI subject. He's got a lot of table and list contribs. Those are high in volume, but low in risk. A single paragraph of creative text from him would worry me a whole lot more than his most prolific contrib, [1982 in athletics (track and field) here] (already cleared). I see that copied content has already been found in article #9514 of the articles he's created. Limiting our checks to the top 1,000 of his articles would stop well short of that. If a bot didn't detect it, the copied content would remain. --Moonriddengirl ^(talk) 17:25, 7 September 2010 (UTC)[reply]

In-depth study of small random sample

I picked 5 random Darius-created articles (using a random number generator) with the idea of investigating them carefully, to gather knowledge that can be used about the rest:

Any help would be appreciated. I'll post my findings here. 67.122.211.178 (talk) 07:00, 7 September 2010 (UTC)[reply]

I think this is a very good way to start, although a bigger sample (say 50 articles) would be helpful. If the copyright violations are chronic, more extreme measures to correct them are implied, whereas if the copyright violations are occasional and sporadic, less draconian measures will probably suffice. Carrite (talk) 11:57, 7 September 2010 (UTC)[reply]

I hope we can spend at least a person-hour on each of the five, whether we find anything or not. We probably can't do that with 50 articles. This is supposed to be a small set of careful investigations aimed at identifying non-obvious problems and figuring out Darius's methods, not a statistical spot check to guess the overall violation frequency. If you want, I can generate a random set of 50 for spot-checking, but that's a different goal. 75.57.241.73 (talk) 20:37, 7 September 2010 (UTC)[reply]

Rocío Ríos

This article is about a Spanish marathon runner. It is predated by a German wikipedia article (linked by interwiki), de:Rocío Ríos, and a Spanish one, es:Rocío Ríos. It mentions:

A resident of Gijón Ríos set her personal best (2:28:20) in the classic distance on October 15, 1995 in San Sebastián.

That "personal best" phrasing sounded a little bit formulaic so I googled it [3] and found a bunch of other articles created by Darius. One of them, Paula Fudge, was created a few months ago and gave a similar description of Fudge's personal best time, but that info wasn't in either of the references cited in the Paula Fudge article, so I asked on Darius's talkpage where the info originally came from. (Rocío Ríos's personal best time is mentioned in her iaaf.org profile in tabular form). There are a bunch of these biographies whose creation was spread out over time, making me wonder if there is a common source like a sports almanac or something like that. I hope Darius gives an answer. 67.122.211.178 (talk) 07:11, 7 September 2010 (UTC)[reply]

Now this is interesting. "Excerpt: Emperatriz Wilson Traba (born January 25, 1966) is a retired female long-distance runner from Cuba. She represented her native country at the 1991 Pan American Games in Havana, Cuba, where she claimed the bronze medal in the women's marathon event behind Mexico's Olga Appell (gold) and compatriot Maribel Durruty (silver) . Wilson set her personal best (2:36:35) in the marathon on December 13, 1992 in Caracas. ..." (Added: But it's ~~possible~~ apparent that book is Wikipedia-derived (date May 2010 and looks like a spam book[4]). WP article Emperatriz Wilson is from January 2010, also created by Darius). 67.122.211.178 (talk) 08:15, 7 September 2010 (UTC)[reply]
- Googlesearch "retired female long distance runner[5] hits tons of wp spam mirrors. Blecch. 67.122.211.178 (talk) 08:26, 7 September 2010 (UTC)[reply]
- Wikipedia:Mirrors and forks/Abc#Books, LLC Uncle G (talk) 16:01, 7 September 2010 (UTC)[reply]
This looks better:[6] "Hara set her personal best, 2:23:48, in the 2007 Osaka Ladies Marathon...." So these probably came from IAAF. 67.122.211.178 (talk) 09:16, 7 September 2010 (UTC)[reply]
- Triage tip: If it says IAAF in the first revision then that's the place to check first. Uncle G (talk) 16:01, 7 September 2010 (UTC)[reply]
  - The IAAF link for Ríos didn't show that text when I tried it last night (and still doesn't). 75.57.241.73 (new address) (talk) 18:16, 7 September 2010 (UTC)[reply]

Most of Darius' articles use that "standard" wording. What struck me at first was the grammatical difference: some articles use commas correctly, but this one (and only a few others that I looked over) is missing a comma in phrases such as "A resident of Gijón Ríos set her personal best". Could this have been translated from a foreign-language source, or maybe it's his own writing, but he's just not a very good writer? I think that an almanac might be possible; did he create this alphabetically or by any other pattern? Usually, one would create all the stubs at once from a single source. I have a feeling, though, that he just started using boilerplate text from other articles (or that he first wrote) and made the stubs all read alike, but no copyvios.
- I never considered the translation point until the interwikis were mentioned. See es translation and de translation: the en version seems like a copyedited translation of the es one. The wording is very close (note the "She is a four-time national champion in the 10,000 metres (1992, 1993, 1996, and 1997), and a three-time national champion in the half marathon (1992, 1994, and 1995)." bit and the combination of the "A resident of Gijón Ríos set her personal best (2:28:20)" bits from eswp. I think that many of these articles are trimmed translations based on some sort of boilerplate organization and presentation of key facts like records and victories, which seems especially likely if he is uncomfortable writing the article (even based off a translation) by himself. —fetch ·comms 00:38, 8 September 2010 (UTC)[reply]
  - The es article doesn't look anything like the en one to my eyes. The comma thing just seems like carelessness. Because of the Hara match above, I think the Rios article (and others like it) came from pages that were once on the IAAF site but currently aren't showing up. The IAAF search function is still (as of last night) not working either. Re interwiki, I'm actually more concerned about Darius's older articles getting translated/copied to other wikipedias, than the other way around. 75.57.241.73 (talk) 02:50, 8 September 2010 (UTC)[reply]

Water polo at the 1975 World Aquatics Championships – Men's tournament

Again uses boilerplate text.[7] Gives a citation to a possible print source, "HistoFINA (Volume II, 1908-2001)". Later volumes of HistoFINA are online[8] but volume II is supposedly out of print ([9] p. 2).

The original author, Jean-Louis Meuret, is deceased (2008).[10]

75.57.241.73 (talk) 19:37, 7 September 2010 (UTC)[reply]

Worldcat lists only one library with the book, Universität Bern (OCLC 603448771). So Darius probably got the info online. 75.57.241.73 (talk) 21:51, 7 September 2010 (UTC)[reply]

Actually, there are several other worldcat records that could use merging[11] but this is still a very uncommon book. 75.57.241.73 (talk) 21:59, 7 September 2010 (UTC)[reply]

Seems fairly common wording to me. Not sure about the foundation of the charts; probably standard WP layout anyway. I don't think there's much chance of finding a cv in any of the "[Sport] at [Event]" articles; there's like two commonly-worded sentences and a list of results. —fetch ·comms 00:41, 8 September 2010 (UTC)[reply]

Ruth Lawanson

Unsourced stub which seems to be copyright clean. Carrite (talk) 11:53, 7 September 2010 (UTC)[reply]

I'm suspicious of the boilerplate phrasing "retired female volleyball player"[12]. Where did the data come from anyway? Will try to find a more recently created one. 75.57.241.73 (talk) 18:31, 7 September 2010 (UTC)[reply]

That seems like "standard" WP wording as wanted by the MOS. There's not really any other way to say it; for a nonretired player, we'd say "X is a female volleyball player from the United States" as well. Very little chance this is a cv, IMO. —fetch ·comms 00:43, 8 September 2010 (UTC)[reply]

What I want to find is where the data came from. This is something like the IAAF example further up. 75.57.241.73 (talk) 02:52, 8 September 2010 (UTC)[reply]

Nordin Wooter

I ran Google searches on five substantial fragments of the article and they all came back clean to Wikipedia or to sources which seem to have drawn from Wikipedia. I made no effort to investigate the statistics in the sidebar box, but this article seems to pass copyright muster. Carrite (talk) 11:44, 7 September 2010 (UTC)[reply]

Triage tip: Always look at the article as created by Darius Dhlomo. Uncle G (talk) 16:04, 7 September 2010 (UTC)[reply]
Carrite, the idea of examining these articles closely is not to give thumbs up or down about copyvios (there's a huge set of CCI pages for that), but to take a few articles and find out everything we can about them, to get a better understanding of Darius's methods. 75.57.241.73 (talk) 18:34, 7 September 2010 (UTC)[reply]

I don't particularly care about research "methods," the important thing is to identify and eliminate copyright violations. It seems that long blocks of prose are the greatest risk and from the two short pieces I looked at, there's no apparent issue. Darius absolutely did NOT rip off everything, it's just a question of quantifying the probable number of problem articles (700 to 1100 a reasonable guess range), figuring out how to find them expeditiously, and liquidating the problem. Carrite (talk) 02:25, 8 September 2010 (UTC)[reply]

I don't think we can "liquidate" the problem until we understand it. I don't think we can understand it without an analysis like this. 75.57.241.73 (talk) 02:43, 8 September 2010 (UTC)[reply]

Athletics at the 1996 Summer Olympics – Women's marathon

Doubt it. I doubt any of these "Athletics at [Date of] [Event] – [Sport]" articles are cvs. —fetch ·comms 02:45, 9 September 2010 (UTC)[reply]

This one appears to be a vio from [13]. 75.57.241.73 (talk) 02:30, 10 September 2010 (UTC)[reply]

Sample size

My back-of-the-envelope estimate, based upon the number of copyright violations found versus the number of articles that I looked at, was that around 10% of articles, just over a thousand, will turn out to be copyright violations. A sample size of five isn't nearly enough. Have ten articles to look at. (Even that's not enough.) Uncle G (talk) 15:57, 7 September 2010 (UTC)[reply]

Were those selected uniformly at random from the whole set of Darius-created articles? If yes, it's surprising that they're all biographies. Anyway the point of selecting the 5 articles wasn't to find vios per se, but to examine them carefully to see if anything could be learned about Darius's methods. So I'd rather keep examining the original 5 for a while before expanding the set. 75.57.241.73 (talk) 18:23, 7 September 2010 (UTC)[reply]
Steve Spence

Y later additions were cleaned two days ago but the very same text introduced at article creation time appears on official bios. Blanked and listed at WP:CP for now. MLauba ^(Talk) 16:20, 7 September 2010 (UTC)[reply]

Leonard Nitz

Y cv of [14]. All the copied content was added word-for-word by Darius as his second edit; the original stub seems to have gotten all the info from the cv source. I have stubbified the article for now, as the original stuff doesn't seem to be a vio. Darius was clearly getting his info from a source, writing a quick stub, and pasting in the rest a few days later, without listing the offending link as a source. —fetch ·comms 00:59, 8 September 2010 (UTC)[reply]

Steffen Radochla

N "turned professional in 2001" is a bit boilerplate-sounding, but it's also a common wording. One-line stub with a short list; no cv found on Google. —fetch ·comms 12:51, 8 September 2010 (UTC)[reply]

Mark Gorski

Y of [15]. —fetch ·comms 12:54, 8 September 2010 (UTC)[reply]

Gerrit de Vries (cyclist)

N Seems OK to me, although the wording is basically like the other stubs. —fetch ·comms 02:22, 9 September 2010 (UTC)[reply]

Lauren Hewitt

Y Of [16]; stubbified. —fetch ·comms 02:22, 9 September 2010 (UTC)[reply]

Japhet Kosgei

N —fetch ·comms 02:26, 9 September 2010 (UTC)[reply]

Lee Naylor (athlete)

Y More froM ABC: [17]. Has been modified by others, but still fundamentally the same wording/structure. —fetch ·comms 02:26, 9 September 2010 (UTC)[reply]

George Mofokeng (athlete)

? Interesting wording. Could not locate source. —fetch ·comms 02:47, 9 September 2010 (UTC)[reply]

Gert Thys

? This has a close copy of the wording in the article, but Darius created it in 2007, while that link was published in August 2009. Some close wording in the last sentence to [18], but that was part of the text of a ruling. I removed most of the text as unsourced information in a BLP anyway. —fetch ·comms 02:38, 9 September 2010 (UTC)[reply]

Technical question

Couple of things; one: has any of the triage stuff been listed (i.e., eliminating edits where he only touched categories, etc.)? Two: can some make a quick list of all articles he made (just page 5 of the CCI for now) where he made at least one edit after initial creation that added more than 500b to an article (ignoring category-only edits, etc. preferably)? This may help to see if he often created a stub first, then pasted in a couple paragraphs later. If this is not technically feasible, I'll just keep looking manually. —fetch ·comms 01:18, 8 September 2010 (UTC)[reply]

1) slightly complicated but doable, I've just been juggling other things. 2) easier, I'll see if I can bang it out. 75.57.241.73 (talk) 02:19, 8 September 2010 (UTC)[reply]

It looks like the CCI report already has this info (the additions are broken out as separate edits). What is it that you're asking for that's not already there? Note the threshold is probably more like 100b than 500b. Even 3-word snippets like "retired female swimmer" is enough to pick up some vios. 75.57.241.73 (talk) 02:24, 8 September 2010 (UTC)[reply]

I want a list of the pages where at least one of the later additions was over 500b (removing the lesser ones). I know it's not perfect; just searching for a possible pattern, seeing how much he may have lifted at once, and then narrow it down from there. If possible, just run through the existing CCI page and get a list of the articles with a diff that says more than 500. —fetch ·comms 02:47, 8 September 2010 (UTC)[reply]

Lifting a three word phrase like "retired female swimmer" isn't technically a copyright violation, although it might be a pointer to a section of lifted text. Carrite (talk) 02:29, 8 September 2010 (UTC)[reply]

Right, finding the same phrase in dozens of articles can point to a common source. 75.57.241.73 (talk) 02:30, 8 September 2010 (UTC)[reply]

Fetchcomms, it looks like articles with that pattern (contribution > 500b on other than the first edit) on page 5 are easy to spot by eyeballing. Do you want something more than that? 75.57.241.73 (talk) 03:06, 8 September 2010 (UTC)[reply]

Meh, alright. I'm probably just getting lazy :P. Working on them now... until I sleep. —fetch ·comms 03:18, 8 September 2010 (UTC)[reply]

Source to watch out for: I found two copyvios so far (he slightly changed it by merging some sentences and reordering them) from http://www.hockey.org.au/index.php – [19] and [20]. Both were in the original creation, it seems, so for field hockey articles, that seems like a source he may have used repeatedly. Is there a way to list all the articles he created that are part of Category:Australian field hockey players, so we can check against this site? —fetch ·comms 03:42, 8 September 2010 (UTC)[reply]

Yeah, give me a few minutes. 75.57.241.73 (talk) 04:10, 8 September 2010 (UTC)[reply]

They are below, feel free to uncollapse the list or move it. I can add links to the 1st rev of each article if that is useful. 75.57.241.73 (talk) 05:04, 8 September 2010 (UTC)[reply]

Fetchcomms, please read Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help. This has already been written down. Uncle G (talk) 11:09, 8 September 2010 (UTC)[reply]

I assume you mean his strategy (I didn't find any links there). That page is very helpful, though hopefully we can add more info as we progress. —fetch ·comms 12:50, 8 September 2010 (UTC)[reply]
- Yes, the strategy. I've already invited everyone to be bold in adding to and improving that page. If you want a list of the created articles, simply go back in the edit history of the first two CCI list pages to before the list was sorted and revamped.
  I've reviewed a few hundred of the biographies. The common creation strategy was for the whole text to be in the first edit. But sometimes there are a few later edits fixing copy and paste errors. As I wrote above, a productive triage approach is to first go back to the latest revision by Darius Dhlomo before someone else (aside from a 'bot) touched the article, and read that. Check the foundation content first, in other words. Uncle G (talk) 13:18, 8 September 2010 (UTC)[reply]

More strategems that you can incorporate into Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help (be bold!):

Articles that cite the Beach Volleyball Database have uniformly proven to be taken from the biographies there. The BVB pages aren't datestamped, unfortunately. But Moonriddengirl and I did some checking with the Wayback Machine to check the relative dates.
If an article cites a "profile" somewhere, it's quite productive to check that profile first. Unfortunately, some profiles pointed to have been removed from the WWW in the intervening years. The Wayback Machine is of some help, here.
As discussed above, if an article cites an IAAF profile, there's a likelihood that the prose came from another article somewhere else on the IAAF site. (I originally skipped a lot of articles that cited IAAF profiles, because I wasn't aware of the other pages.)

Uncle G (talk) 13:32, 8 September 2010 (UTC)[reply]

Field hockey players

Articles from Category:Australian field hockey players created by Darius, per Fetchcomm's request.

collapsed list of 92 players

75.57.241.73 (talk) 04:40, 8 September 2010 (UTC)[reply]

Temporary list of known used sources for Australian field hockey players

I'm just using this for myself right now, but a central list of sources he copies from frequently could be useful in identifying some vios, as Google does not show all of these on the first page or two. —fetch ·comms 15:44, 8 September 2010 (UTC)[reply]

This also indicates usage of the hockey.org.au site before they reorganized and changed links. —fetch ·comms 15:49, 8 September 2010 (UTC)[reply]

Have we alerted the appropriate Wikiprojects and enlisted their help?

As it stands, a casual survey of the articles in question seem to limit the fields to mostly athletics and specific sports within them. Have the appropriate Wikiprojects been contacted and enlisted to try to help? I could see a bot option that is being suggested below to be very effective if there are dedicated members of the affected projects getting involved to help clean up stuff, using a coordination page to drop admin help requests when needed. --MASEM (t) 16:03, 8 September 2010 (UTC)[reply]

Notice of the CCI has been given to WikiProject Athletics and WikiProject Olympics. Some projects are very responsive to these. This CCI's been a little unusual; I don't know if the people who've expressed interest in helping have been pointed to this discussion. It seems like it would be good to link to this discussion from the CCI page; I'll do that now. --Moonriddengirl ^(talk) 16:09, 8 September 2010 (UTC)[reply]

Implementing bot?

At this point, I propose that we go ahead with the following, based on discussions above:

We alter the text at Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help somewhat in line with the proposal at Wikipedia talk:Contributor copyright investigations/Darius Dhlomo/How to help. I do not believe he has copied content in all cases, but I do believe we need to behave as though he has. His original text should in almost all cases be replaced.
User:Uncle G's major work 'bot blanks the articles under investigation (the ~23,000), with edit summary pointing to Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation, replacing the content with the following template: User:Moonriddengirl/CCIdf.
I will personally go through immediately after implementation to make sure that all articles that have already been (a) checked as cleared, (b) cleaned, or (c) blanked are restored to their last condition, coordinating with listings made at the CCI page. No reason to waste the work already done. I'm thrilled that so many people have been willing to help out. :)

The advantages of this: we pull the content from publication immediately, and we invite the wider community to help with cleanup. This could be the most efficient means of addressing a CCI ever, and it may not linger for more than a year as some of our others have done. There is a substantial risk that some of these tags will simply be removed by users who don't care about copyright. I see this routinely at WP:CP. We try to address this at WP:CCI by requiring that only those who have themselves had no copyright issues assist, but this isn't foolproof.

Still to be determined: what then? At what point do we go through the ones still tagged?

Thoughts? --Moonriddengirl ^(talk) 12:55, 8 September 2010 (UTC)[reply]

Go for it. After that, just keep checking manually, I guess. —fetch ·comms 13:25, 8 September 2010 (UTC)[reply]

I realize I'm in the minority by now, so I have to say this--not to get into further debate about it but just to indicate that there are still some of us who feel this way--but I still favor the mass deletion approach over any of these schemes for sucking up massive amounts of community effort cleaning up Darius's mess (plus exposing everyone who touches any of those articles to potential legal liability). The articles aren't for the most part really articles at all. They're more like database dumps that Darius vacuumed from various places into WP article space. None of them are written from secondary sources as our articles are supposed to be. Yes I know some of them are about legitimately notable people. I just don't feel any sense of tragedy that there might exist a notable person someplace in the world who is temporarily not the subject of a WP article, at least til somebody else gets around to writing a real one with real sourcing.
That said, I wonder if we could deploy some additional automation to help with cleanup. Is there some kind of script around that integrates all the diffs from when an article was created, in order to highlight all the text in the last revision, that was originally put in by a particular editor (i.e. Darius)? I can probably write one, but it would surprise me if it hasn't been done already. On the other hand it wouldn't be perfectly accurate. (Update: someone at refdesk mentions User:Cacycle/wikEdDiff which is not what I had in mind, but looks interesting anyway).
Do you want any additional filtering or processing of the 23000 articles? Like maybe for articles creates by people other than Darius, instead of blanking the whole article, the bot could revert to just before Darius's first edit to it. We'd write a different template for articles that got reverted but not blanked. Also I can still attempt some of the triage stuff discussed above, like noticing category-only edits with a script (I just have RL things to do as well). 75.57.241.73 (talk) 13:37, 8 September 2010 (UTC)[reply]

I completely understand favoring the mass deletion approach. I was leaning that way myself when we started. But so many people have been putting their time into this already, and I do not wish to devalue their efforts. Too, this approach has some exciting prospects for future cases like this. Getting assistance at CCI is a challenge; most of them involve thousands of articles, though few are this scale. If we find that this approach actually works, then it may be useful for other similar CCIs down the road...a way to encourage involvement from those members of the community who actually do view these articles. If this leads to finding a new, viable system for these, we might not have dozens of open CCIs with probably hundreds of thousands of articles cumulatively waiting for view. (oi)

I have no idea what automation can do. I'm technologically in the school of "challenged by using my remote control." I don't know of any script that integrates the diffs or how we might process it to automatically revert back to the pre-Darius version, but if those things are possible, they might be good approaches. I already have a notice for talk pages about rolling back CCI articles: User:Moonriddengirl/CCIr. I only use it when there is evidence of copying, but it could be easily modified to this situation.

Do you write scripts? There are several ideas I have for copyright cleanup tools that I would love to see in the works. If you do and you're up for it, come by my talk page. :D (Note, though, that I am technologically clueless. I never know if my ideas are in the realm of "easily accomplished" or "needs a magic wand.") --Moonriddengirl ^(talk) 14:17, 8 September 2010 (UTC)[reply]

The bit about scripts is still very useful, as we should probably focus first on the articles he created, which have a greater likelihood of diffs containing vios rather than just extra categories. —fetch ·comms 14:55, 8 September 2010 (UTC)[reply]

75.57.241.73, blanking the articles now does not preclude deleting them later, if we find that in six months we still have ten thousand blanked articles. But going straight to deletion immediately precludes any other approach. (You would have to get someone else to volunteer to do that, in any event. None of my 'bot accounts have sysop privileges, and I'm not going to do 'bot edits with this account.)

I reiterate my request for everyone to please boldly fix anything in Wikipedia talk:Contributor copyright investigations/Darius Dhlomo/How to help, Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation, and User:Moonriddengirl/CCIdf — which latter I suggest reside somewhere like Wikipedia:Contributor copyright investigations/Darius Dhlomo/Article notice or in the Template: namespace (if we aren't unhappy about Wikipedia mirrors showing the same notice). If we're going to do this, I want those all thoroughly reviewed beforehand. Uncle G (talk) 15:25, 8 September 2010 (UTC)[reply]

What should do about translations of these copyvios in other languages? Is there a way to get a list of all interwikis created after Darius did here? If the en versions are determined to be vios, the other language ones need to go as well. —fetch ·comms 15:39, 8 September 2010 (UTC)[reply]

In the useless comment category, I have no idea. :/ I have once or twice communicated with others Wikis when I know that an article has been copied, but we have never had a practice of doing this. --Moonriddengirl ^(talk) 13:13, 9 September 2010 (UTC)[reply]

FWIW, from the few articles I've looked at, I've seen the following patterns:

Articles of the type "[insert name of country] at the [insert year] [insert competition]" are quite frequently little more than a bit of boilerplate at the top, a formatted list in the middle, & the expected stuff at the end. Probably best examined by hand in case he added any further text -- which is likely a copyvio. (And the subject name does vary a bit.)
Same for individual sports at the Olympics, Pan-American games, etc. Same treatment.
Biographical articles that are not minimal stubs seem to routinely have copyvio material in them. If we can determine the cut-off size for these -- where the visible text is more than "X [insert birth & death information] is a [insert country] [insert athletic specialty]. He/she was active [insert length of career]" -- those could either be safely deleted or stubbified.
My experience confirms Uncle G's estimate that around 10% of his articles are copyright violations; the rest are simply stubs. Deleting the 90% of acceptable -- & likely useful -- articles just to purge this poisonous share is overkill -- unless one believes all stubs are potential maintenance problems & should be deleted. (Not arguing for or against that opinion, but I suspect it is a motivation of some of those who favor mass deletion.) -- llywrch (talk) 16:02, 8 September 2010 (UTC)[reply]

Llywrch, besides maintenance issues, it occurs to me that a more visceral reason I want to delete these articles is WP:DENY. We customarily revert any edits made by banned editors without trying to figure out whether they're good edits or not (occasional exceptions are permitted if a user wants to proxy a particular edit as their own, but that's not supposed to be done as a matter of course). Copyvios introduced as flagrantly as this should be treated like banned edits and undone. If the edit is one that creates a new article, undoing means deleting the article. There are some Wikipedians who somehow equate deleting an article with drowning a kitten, but really, in a case like this (where nobody else has edited the article), we should just think of the deletions as reversions and deal with it. We shouldn't let someone make 10,000 banned edits and have them stay in the encyclopedia.
Also, wanting to maintain high standards for BLP sourcing is in part for WP's neutrality and not just maintainability. OK, these particular articles are about athletes, who tend to not be self-promoters and don't bother me all that much, but if they were mostly about garage bands or motivational speakers or fringe political weirdos, I'd see this incident as someone spewing 10,000 self-serving search magnets into WP article space to bias its content and laughing his head off if they were allowed to stay, and I'd be outraged.
You additionally wrote "[p]eople here appear a lot more eager to tell us what the solution is & expect someone else to do it, than to actually help fix the problem" and that's part of it too. It seems to me that proposing 100's of other editors spend 1000's of hours examining the articles and cleaning the vios is exactly expecting others to fix the problem. It's tragic that someone like Fetchcomms says he's taking time away from his own writing to help preserve this Darius spew. By contrast, nuking all the affected articles with a bot can be done by one person in a few hours. I'm ok with volunteering to implement such a bot myself if that's the decided outcome. The code would have to go through some approval hoops and run from an admin account, so deployment would still require some other people's involvement, but I could write the code and hand it over to a bot op for activation. So I'm willing to personally do (most of) the work of the mass deletion proposal, and implement something that takes care of all 10000 articles in one go. I don't see any preservation proponents offering to personally do most of the work of reviewing all 10000 articles manually.
Finally, you claim that 90% (or whatever) of the articles are vio-free because they don't contain copied text. IANAL but if Darius has copied the dates, names, events, times, etc from 1000's of entries in some copyrighted sports statistics book into WP articles, I'm sure as heck not willing to bet my skateboard on the assertion that those articles are not copyvios even if zero words of actual prose are copied. So I think these Darius-created articles should be treated as 100% copyvio even if they contain no text (just names and numbers copied en masse from wherever into tables and templates). So I don't accept the 90% non-vio figure and I'm not willing to declare a single one of those articles vio-free if I have to accept liability for it if I'm wrong. If others want to do so, that's up to them.
I'm not especially trying at this point to swing the discussion back to bot deletion at this point (I'm used to being in the minority on stuff like this), but just trying to let you know where I'm coming from. I hope that helps. Regards, 75.57.241.73 (talk) 03:16, 9 September 2010 (UTC)[reply]

And all of that is relevant to my volunteered observations exactly how? Or did you intend to respond to someone else's post? -- llywrch (talk) 04:16, 9 September 2010 (UTC)[reply]

You wrote "Deleting the 90% of acceptable -- & likely useful -- articles just to purge this poisonous share is overkill -- unless one believes all stubs are potential maintenance problems & should be deleted." I responded to explain that there are many other reasons besides "all stubs are potential maintenance problems" to believe deletion is not overkill and is the right thing. Anyway, per WP:IINFO, "useful" by itself is not a sufficient reason to keep something. We're looking for documentation from secondary sources, and these articles mostly (entirely?) don't have that. 75.57.241.73 (talk) 04:58, 9 September 2010 (UTC)[reply]

Okay, I see the connection. Despite your appeal to WP:DENY, deleting all of those articles is still overkill. Sanctioning this editor, cleaning up his mess, & moving on will result in a minimal impact which will deny him any attention. Deleting all of those articles will give him far more attention: people will come to the articles, find them deleted, & inquire about what happened to them -- & DD's story will be repeated once again, which will keep attention on him. As for my comment about "useful information", I was referring to the quality of the stubs: many of them clearly have more detail than, for example, "llywrch (1957 -) is an American editor of Wikipedia, who has editted for almost 8 years." There are a lot of stubs in Wikipedia with that much text, which have avoided deletion only because their subject is notable. (Not to imply I am a notable subject, mind you.) -- llywrch (talk) 22:11, 9 September 2010 (UTC)[reply]

I've done some tweaking to Wikipedia:Contributor copyright investigations/Darius Dhlomo/How to help. I hope it's acceptable. I've moved User:Moonriddengirl/CCIdf to Wikipedia:Contributor copyright investigations/Darius Dhlomo/Notice. If there's a better name for it, it can be moved again. That done, are we ready for the bot to start the process, thereby getting wider community involvement? As quickly as possible, I'll clear all the ones that have been checked at the CCI, though the later in the day it gets the longer it will take me. :) (I'm on east coast time; it's 1:00 p.m.ish here.) --Moonriddengirl ^(talk) 17:20, 8 September 2010 (UTC)[reply]

Do we need a bot to do this? According to WP:Administrator "The English Wikipedia has 1,755 administrators as of September 8, 2010." I know some are retired/inactive, but if each administrator looked over 10+/- articles they would all be reviewed within a day. If the article is a copyright violation they can delete it. This would save all of the copyright violation free articles. Of course this would take quite a bit of organization, but it is just an idea. --Alpha Quadrant (talk) 17:27, 8 September 2010 (UTC)[reply]

If they would, we wouldn't. But this has been publicized at both admin noticeboards as well as plenty of points around Wikipedia, and so far we've got nowhere near full admin participation. I suspect we're not going to get even a tenth of them involved. --Moonriddengirl ^(talk) 17:34, 8 September 2010 (UTC)[reply]

If every admin wrote four FAs a year, we'd be well on our way to a better encyclopedia, but there's no way a thousand people will be coaxed into even reviewing one article. Whoever can help, please help. Otherwise, we can't ask any more of users who are already busy in RL. I've personally postponed some writing goals to work with this CCI because I value copyright very seriously, and I realize this is an understaffed area. But I can't speak for others' priorities. —fetch ·comms 18:11, 8 September 2010 (UTC)[reply]

I have made a BOLD edit to Wikipedia:Contributor copyright investigations/Darius Dhlomo/Task explanation per my suggestion above.[21] Note that implementing the suggested change will require some additional code in the bot, which would of course be up to Uncle G. 75.57.241.73 (talk) 21:59, 8 September 2010 (UTC)[reply]
- Seems sensible; as long as Uncle G can make the bot revert to the last "major" (non-category, etc.) change, or it will take ages to get through this list if he added one category in 2006 and then the article was legitimately expanded by someone else. —fetch ·comms 22:28, 8 September 2010 (UTC)[reply]
  - I cannot do any sort of reversion at all without a fair amount more work. As I said, these are very simplistic tools when it comes to content editing. I can append/prepend wikitext to a page, or replace a page entirely with some given wikitext. (These tools' prior content editing tasks have included nothing more hefty than raking sandboxes and creating boilerplate xFD pages.) I could write a tool to do this, I expect. I'm not even going to try to give an estimate on that right now, or even definitely confirm that it's possible. Right now I'm revisiting code that I wrote half a decade ago and updating it to work with the current MediaWiki interfaces. ☺ Then I have to do some testing.
    The first pass of anything that the 'bot does will be the ten thousand creations, simply blanked. (We all agree on that, yes?) More complex work on the remainder after that, the touched but not created articles, we can come back to. I'll probably need a new list off VernoWhitney at that point, anyway. Uncle G (talk) 22:50, 8 September 2010 (UTC)[reply]
    - The API can give the categories directly from the last version of any article[22] and the bot could use that list to categorize the reverted articles, instead of trying to parse the categories out of the article text. So I think the categories aren't a big problem in their own right, though some other templates may be worth scanning for. I agree that it's reasonable to separate the task into blanking the 10000 Darius-created articles first, while deferring til later possibly doing more complicated stuff with the other 13000 articles. That would mean we're back to doing just the 10000 for now, which is OK. (Added: note "reversion" means replacing the entire wikitext with another text, in this case text from an earlier revision of the article). 75.57.241.73 (talk) 23:46, 8 September 2010 (UTC)[reply]

I'm ready to use the list. I've tested the 'bot on Ted Morgan (boxer) (which was a definite copyright violation of this biography). You can see the edit here. That's what's going to happen, and that's what it's going to look like. I might tweak the edit summary a bit. Uncle G (talk) 04:38, 9 September 2010 (UTC)[reply]

I would change the category in the template to something like CCI-DD (for Darius Dhlomo, I don't know that it's appropriate to spew a username directly into articles even in a category) since CCI refers to lots of different incidents/investigations. The edit summary should have the same change. I'd just link CCI-DD in the edit summary to the info page rather than having "what is this bot doing" there. Otherwise, looks reasonable. Can you run it on, say, 5 articles as the next test? 75.57.241.73 (talk) 04:50, 9 September 2010 (UTC)[reply]
- I leave it up to Moonriddengirl and the other CCI regulars as to what the category is. I'd rather not even put "DD" in the edit summary or the article notice, myself. I think it unfair for Darius Dhlomo's name, even abbreviated, to come up all over the WWW as a result of this. Uncle G (talk) 13:32, 9 September 2010 (UTC)[reply]
Also, if you're not going to have the bot make any logs, maybe you could make a new account ("Uncle G's Major Work Bot -- DD CCI task" or some such) just for the purpose of this task. That would make it easier to locate all affected articles by pulling down the special account's contribs.
As a separate matter, I had wondered if we might have a bit more discussion about moving the articles to incubation, rather than leaving them in article space (or deleting them) 75.57.241.73 (talk) 04:53, 9 September 2010 (UTC)[reply]
- Seems logical. I already saw one of Darius' articles graduate from the Incubator, all possible traces of copyvio gone. But we need to delete all of the found vios first, of course. —fetch ·comms 12:50, 9 September 2010 (UTC)[reply]
- The 'bot hasn't really done anything for just over a year, and this is the only task that it's doing. It will be very easy to spot the list of articles from its contributions history. The article list that it's going to be working from is on-wiki in the first place, anyway. It's here (1–5000) and here (5001–9664). Uncle G (talk) 13:32, 9 September 2010 (UTC)[reply]
Just to note: I'm here and ready to start restoring articles that have already been cleared. --Moonriddengirl ^(talk) 13:13, 9 September 2010 (UTC)[reply]
- I'm having a discussion at Wikipedia:Bots/Requests for approval/Uncle G's major work 'bot (q.v.) at the moment. I'm happy to wait a day or so in any event, so that more people in all timezones see the noticeboard and WikiProject notices. (I've just warned the olympics and boxing WikiProjects.)
  If you want something to look at whilst you wait (in addition to the aforementioned 'bot discussion), have a look at Wikipedia:WikiProject External links/Geocities#Locating articles, which I've dumped on you. ☺ Uncle G (talk) 13:32, 9 September 2010 (UTC)[reply]
  - Thanks. :) Sadly, I have no shortage of things to do. :/ My problem is more finding time to do them! --Moonriddengirl ^(talk) 13:40, 9 September 2010 (UTC)[reply]
Okay, I've made a restoration of Ted Morgan (boxer) as a stub. Is this what we should expect? If not, revert/delete my version & replace it with the proper version. (Or revert me if I acted too quickly.) -- llywrch (talk) 22:19, 9 September 2010 (UTC)[reply]
- Well, now it's an unreferenced BLP that may be subject to deletion under some forthcoming BLP cleanup unrelated to this copyright stuff. You might want to transfer some references from old revisions. Anyone currently cleaning up the articles should keep in mind that the bot might wipe them, in which case they can revert the bot. 75.57.241.73 (talk) 23:28, 9 September 2010 (UTC)[reply]
  - List of refs restored. Unsourced BLPs are just about as bad as copyvios, so we need to avoid those as well. —fetch ·comms 02:15, 10 September 2010 (UTC)[reply]

OK, I just seen this on AN a moment ago, so forgive me if this has already been answered, but what is the list of articles are we looking at blanking here? - Neutralhomer • Talk • 04:41, 10 September 2010 (UTC)[reply]
- Revision links are a couple of paragraphs above. ↑ Uncle G (talk) 04:59, 10 September 2010 (UTC)[reply]

Questions

Ok, I've just spent a fair amount of time reading all of this page (and others).

I won't claim to understand the CV stuff, except that I thought that the current practice with CV is delete, and then sort it out.

It doesn't matter if the article is now featured and on some perfect article we "need".

delete and start over.

Not to mention that the rule of thumb with banned users is to delete/revert their contributions.

Wikipedia:BAN#Edits_by_and_on_behalf_of_banned_editors.

I'm just concerned that the blanking will lead to the common editor (who doesn't care about cv, and just is enthusiastic for their favoured info to be on Wikipedia) will shrug their shoulders and revert the blanking.

Not to sideline ANY help, but shouldn't the pages all be protected, at least semi?

Also... I'd suggest oversighting all the edits, but from what I read above, we need them to compare what are his edits (and cv) and what aren't, in the hopes of salvaging?

And not to go the WP:BEANS route, but what's to stop him from block evasion, creating an even further mess of having to now track 20k+ articles for IPs, and other fun stuff, forever....

Further clarification on this would be welcome. - jc37 03:42, 10 September 2010 (UTC)[reply]

Three points:
- It's not that simple. Some articles are 1-paragraph stubs that contain barely more than raw numbers, names, and dates. Some articles have been heavily edited since, by other editors. Some articles had most of their content written by other editors anyway. Some articles were just touched for non-prose-content-related tasks such as (re)categorization. When we are talking about a corpus of twenty-three thousand articles, any "some" is a lot.
- Darius Dhlomo wasn't necessarily acting in bad faith. Xe wouldn't be the first person to just not get that taking other people's prose wholesale, tweaking some pronouns and shuffling the sentences around a bit, is not original writing. It's not that xe has, or had, bad intentions. Clearly xe had good ones, trying to give us articles on olympic althletes and so forth. It is that xe enacted those good intentions entirely wrongly; and has not been particularly forthcoming about either the scale of the problem or specifics of particular articles when questioned.
- There's a fairly clear warning on the template that anyone bulk reverting for the sake of it and reintroducing copyright violations by doing so will be treated as a problem editor. And anyone doing so only has to look at Darius Dhlomo's block log to see where that road leads.
This isn't just a once-off copyright violation, or even a new account that's submitted a couple of copy-and-paste articles. This is Wikipedia:Contributor copyright investigations territory. Things are different, here, not least because processes like Wikipedia:Copyright problems simply don't scale to this number of articles. Even the usual CCI process is having difficulty scaling to this, which is why we're trying this new approach. Uncle G (talk) 04:13, 10 September 2010 (UTC)[reply]

response (Also edit conflicted - I'll try to respond to the other posts (This responds to UncleG's original response) following this.)

To clarify: I wasn't commenting on whether he should be banned - though, from what I read above, indef blocking (at least) seems appropriate. Though, to be honest, regardless of how many possible positive edits he's made, he really seems to be being less than honest about all of this, so trust seems misplaced, at least for now.

And it doesn't matter if the cv was in good faith or not. it's cv - delete/revert. And if we can't trust any of his edits, then send a bot through his edit history and that's that. And I'm presuming that this isn't a case where we can allow for "oops, we missed those cv edits in his edit history".

To touch on your points:

"Some articles are 1-paragraph stubs that contain barely more than raw numbers, names, and dates." - the scale and scope of the cv's precludes caring about the other, possibly "clean" edits. delete and revert, and only restore when assessed as allowable. You've shown repeatedly above that the cv is vast, throughout his edits.

"* This isn't just a once-off copyright violation, or even a new account that's submitted a couple of copy-and-paste articles." - I don't think it matters if he's made one edit or one million edits. If his edits are cv, then they MUST be removed from this encyclopedia. Maybe I'm missing something here, but my understanding was/is that no extenuating circumstances allow for CV material to stay on Wikipedia.

And I get that this is beyond the typical scope. I agree that this should be handled by bot. I'm just saying that too much leeway is being given here for cv material to NOT be removed. WP:AGF has worn out its welcome here. He's lied to you all repeatedly, boldly and point blank. He has cv throughout a massive editing history. There really isn't much left but to remove from the encyclopedia, and then AFTER that, have the CCI experts (and whomever else would help) to try to restore anything salvageable from this mess.

Also, just to be clear, in re-reading my comments above, they sound terse in "voice", I hope you understand that you all have my full empathy in what I am certain is a royal pain and mess that you're trying to sift through. So please don't take anything I'm saying as directed negatively towards any of you. - jc37 04:53, 10 September 2010 (UTC)[reply]

No worries.
Four more points:
- Any administrator deleting 10,000 articles in the normal manner is in for a severe case of repetitive strain injury. Any administrator deleting 10,000 articles with a 'bot is in for massive drama and an arbitration case. (Even just blanking I know is going to cause complaints. Trying to prevent this by letting people know that this is coming ahead of time, is why I've been posting notices all over the place.) And there's no way a Developer is going to touch this without a poll.
- My calculations, and those of others above, are that only 10% of the articles are copyright violations. The problem is that we don't have a mechanical way to determine which 10%. So we're sharing the pain, as it were, of finding out. The trade-off that you're looking at is the outright deletion of ~9,000 good articles for the sake of ~1,000 bad ones. That's something that we all, as people who want to build an encyclopaedia, understandably feel uncomfortable with.
- This step doesn't rule out taking further steps. We're dealing with ten thousand articles out of twenty-three thousand, after all. Even on its own, this isn't the complete solution to the problem. (It shrinks it a little bit, though.) We could, six months from now, if we find that only a small percentage of articles have been reviewed, decide that the process hasn't worked, and choose to take a different route.
- This is all about not concentrating either the decision making or the enactment of the outcome in the hands of one, or a few, administrators. Anyone, even someone without an account, can fix an article if we take this route. If we go down the delete-everything-and-let-the-challenges-come route, we have a small number of administrators having to review the deleted content of ten thousand articles. It's not the case that only administrators are capable of this sort of thing. Aside from the fact that non-administrators scan for, and tag, copyright violations every day, in good faith, let's not forget that most of Wikipedia's content was written by editors who don't even have accounts. The non-administrator editorship is a force to be reckoned with. It does, in the main, operate in good faith. And it may well prove to be capable of demolishing this problem with ease.
Uncle G (talk) 05:30, 10 September 2010 (UTC)[reply]
Some of what I would respond to above is already covered in my "responses to others" below, so I'll leave that there.

And I'm empathetic about concerns about various kinds of "community blowback" (Past dramas involving responses to and from betacommand come immediately to mind). I just think we're past that stage (I think you've done a pretty decent job of communicating that there is an immense problem.) And this should move forward.

"Even on its own, this isn't the complete solution to the problem." - And that's acceptable? To even let one "slip through" when we could prevent it by just mass reverting his edits.... In this case, the baby needs to go out with the bathwater. We'll retrieve the baby afterwards. Nothing else is acceptable, from what I understand. And honestly, I thought that the current system is to revert to some edit just prior to the editor's first edit on an article, and then allow the wiki world to salvage anything after, if there is anything salvageable. So why is that not the case here? And if the editor created the article, blank (in mild cases) and delete in cases of rampant cv. Again, why not here?

I'm not saying that admins should be the only ones doing this. I'm saying that, due to the scope, and due to how the resolution appears to be implemented, that admins might as well do it, since they'll have to meticulously go through and follow up on the bot, and everyone else's edits anyway... I'd like to wish that wasn't true, but, it seems that way to me at least. And when one considers that most admins are less than active (I only just recently came back from a rather extended wikibreak myself), that idea is also daunting. - jc37 05:46, 10 September 2010 (UTC)[reply]

Other responses

(The threading is slightly confusing (for me at least), trying to make some sense of it.) - jc37 05:57, 10 September 2010 (UTC)[reply]

(edit conflict) Blanking seems more viable for a massive list. Better tracked, and only manybe 10% are actually vios. See section just above for a bit more on this.

Is he a sock of a banned user? We only "revert any edits made in defiance of a ban", so if we ban him now, we won't rollback his edits before that.

We can have a list of pages blanked, and just check relatedchanges every so often. I really doubt anyone will bother to remove warning tags for admittedly obscure articles, but they'll get warned and/or blocked if that continues, anyway.

Oh, that would probably kill the servers a couple times over, protection for 10,000+ pages :P. To be serious, it would skew up some tracking lists of semi'd pages and probably isn't needed at this point.

Oversight --> revdel works fine, but for vios, I'd only personally bother with that for major, major vios that have been around since who knows when.

If he creates socks, they will be noticed pretty quickly, I'd imagine. New users and/or IPs removing copyright tags and all is pretty suspicious. If he does that, then just block, etc. IDK what else is needed. He knows that evading blocks is wrong, so if he does it, there goes any chance of an unblock within a few years.

I hope I made sense. It's pretty late where I am right now yawn... —fetch ·comms 04:23, 10 September 2010 (UTC)[reply]

(edit conflict)

jc37, I think the situation is:

WP practice is to remove cv's from displayed articles immediately, but except in very egregious cases, the cv is allowed to stay in the revision history unless the copyright owner requests removal. See WP:CP#Instructions. So if someone pastes too much of a newspaper story into the Elvis Presley article, we just revert it, not delete the article. I guess it's possible now to revdel the revision with the cv, but that may in some cases cause attribution problems for later revisions.
Darius was not banned at the time he created at these articles (whether he's banned now or merely blocked is unclear). So they're not banned edits under a literal interpretation of the BAN policy AFAIK. I've made the case for treating them as banned edits anyway, based on the scale and the fact he had received previous warnings and then (in my opinion) relatively low value of these articles as evidenced by absence of secondary sources and non-involvement of other editors in most of the articles. But it doesn't look like they're being handled that way.
Oversight is usually reserved for bad privacy vios and other extreme circumstances. These copyvios are fairly routine on the scale of things, except that there's so many of them done by one person.

75.57.241.73 (talk) 04:25, 10 September 2010 (UTC)[reply]

responses to the others

WP:BAN says something different now than it did when I last read it. Once upon a time, I seem to recall banned editors (especially when the ban involved dubious or POV content) having their edit histories arbitrarily reverted. So let's drop that facet of my comments above for now.

Part of what I am commenting on is what I kept reading above where it seemed that there was the idea that we shouldn't revert/blank delete his edits because he was a prolific editor, so we should have some other standard for cv from his edits. I hope that's not what was intended to be said, but quite a few were coming across that way.

Another part is that, due to the vast number of contributions, and the vast number of cvs interspersed, I guess I'm just not seeing how we're going to avoid some of the CVs getting lost in the cracks.

And finally, while I AGF, I'm enough of a realist/pragmatist to know that people are going to revert the blanking and not care about the cv concerns. And so unless someone is looking over the shoulder of all the blanked pages, it's going to happen. and if it is, then we might as well delete, and have admins go through the 20k+ pages themselves anyway. This being something that needs doing right. And not according ot the typical "wiki way", which allows for things to be done half way, with the idea that someone else will come along and finish up after you.

So I'm concerned. - does that make more sense? - jc37 05:08, 10 September 2010 (UTC)[reply]

How long will it take?

As a complete outsider looking at his first mass deletion, how long will it take the bot to blank the 10,000 articles? Thanks, and intriguing conversation. --intelati^(Call) 05:45, 10 September 2010 (UTC)[reply]