Wikipedia:Turnitin/RfC

This draft proposal is not active, not live, and not finished. Please don't draw any conclusions from it.

Overview

The community is asked to consider a limited trial evaluation of Turnitin's software--a plagiarism/copyright-violation detection tool--on a selection of articles.

Turnitin is a leading provider of services to schools, universities and professional publications, helping them identify text matches against websites and content databases that may indicate plagiarism. Plagiarism per se is not a problem for Wikipedia, but it often overlaps with copyright violations, which are.

In short, Turnitin has offered to donate services to check all of English Wikipedia through its system, for free, on a non-exclusive basis.

Starting with new articles--which would present fewer technical challenges--a report would periodically run off-Wikipedia, generated on a site hosted on Turnitin's servers. A generic (unbranded) link to that report would be placed on relevant article talk pages.

The report page would also be unbranded (or rather rebranded) as something like "WikipediaCheck", and the only attribution/advertising on that off-Wikipedia page would be a link at the bottom of the page reading "Powered by iThenticate", a sister service run by Turnitin's parent company.

In order to pursue this collaboration Turnitin would have to demonstrate efficacy beyond our current copyvio detection tools in a well-designed trial of their system on a selection of 150 Wikipedia articles.

After a trial of Turnitin's software, promotion of the collaboration would happen through a joint press release with Turnitin and the Foundation, and Turnitin would be free to describe their work with Wikipedia on their website and in promotional material as long as the collaboration was ongoing.

Over time, Turnitin's reports could be woven into our Copyright Investigation workflow and used to assist with tagging articles that should be checked, with the added feature of prioritizing which ones are most suspicious.

Proposal

This RfC is quite limited in scope. What the community is being asked to decide now is: can the community *trial* Turnitin's software on a limited selection of old and new articles to determine whether its services are effective and reliable at identifying possible plagiarism or copyright violations, with minimal false positives? For the duration of the trial, Turnitin would not publicly promote any collaboration, and no copyright template/investigation workflows would be altered. A selection of 75 'old' articles and 75 'new' articles would be run through Turnitin's servers, and the results would be analyzed by editors experienced in copyright investigation and cleanup (as described on the Trial page).

Questions the trial would be designed to answer

Does Turnitin's system effectively screen out false positives created by Wikipedia mirrors or sites that legitimately reuse our content under a compatible license?
Can Turnitin's system work on old as well as new articles?
What 'percent-match' present in a Turnitin report would optimize copyvio detection while minimizing false positives?
Does Turnitin's system improve upon our current investigation tools: CorenSearchBot/MadmanBot? (note that these bots currently only operate on new and not existing articles)
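
As a rough illustration of how the 'percent-match' question above could be examined once reviewers have judged each trial article, here is a minimal sketch. It is not part of the proposed trial design: the record fields, the sample titles and scores, and the choice of F1 as the balancing measure are all assumptions made for illustration, and nothing here reflects Turnitin's actual report format.

```python
from dataclasses import dataclass


@dataclass
class ReviewedArticle:
    title: str            # hypothetical article title
    percent_match: float  # hypothetical match score from a report, 0-100
    is_copyvio: bool      # verdict reached by human reviewers


def evaluate_threshold(articles, threshold):
    """Return (precision, recall, false_positives) when every article
    scoring at or above `threshold` is flagged for investigation."""
    flagged = [a for a in articles if a.percent_match >= threshold]
    true_pos = sum(a.is_copyvio for a in flagged)
    false_pos = len(flagged) - true_pos
    total_copyvio = sum(a.is_copyvio for a in articles)
    precision = true_pos / len(flagged) if flagged else 1.0
    recall = true_pos / total_copyvio if total_copyvio else 1.0
    return precision, recall, false_pos


def best_threshold(articles, candidates):
    """Pick the candidate threshold with the highest F1 score
    (ties go to the lowest threshold listed first)."""
    def f1(threshold):
        p, r, _ = evaluate_threshold(articles, threshold)
        return 2 * p * r / (p + r) if (p + r) else 0.0
    return max(candidates, key=f1)


# Made-up sample data, purely illustrative; not trial results.
sample = [
    ReviewedArticle("Article A", 5, False),
    ReviewedArticle("Article B", 12, False),
    ReviewedArticle("Article C", 35, True),
    ReviewedArticle("Article D", 40, False),  # e.g. a match against a mirror
    ReviewedArticle("Article E", 62, True),
    ReviewedArticle("Article F", 88, True),
]

if __name__ == "__main__":
    for t in range(10, 100, 10):
        p, r, fp = evaluate_threshold(sample, t)
        print(f"threshold {t:>2}: precision={p:.2f} recall={r:.2f} false positives={fp}")
    print("best by F1:", best_threshold(sample, range(10, 100, 10)))
```

A different balancing measure could be substituted if reviewers decide that missed copyvios are costlier than false alarms, or the reverse.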

Wikimedia Foundation perspective

This is a volunteer-created proposal that would be overseen by the community. Still, because of the scope involved and the potential impacts and technical demands, the WMF has looked closely at the proposal. They have spoken with Turnitin about their interest, reviewed the legal issues, and formulated some concerns (below). The WMF is comfortable with the community making the decision about whether or not to pursue a trial. Afterwards they will look at the discussion, and if there is clear community consensus to move forward, they will explore supporting the technical implementation on their end.

In their discussions the WMF identified some potential concerns about which they want to hear the community's opinion:

Would this type of collaboration set a precedent for future for-profit companies to try to benefit from an association with Wikipedia?
Would the relationship between the Foundation and the Wikipedia community be negatively impacted by pursuing this proposal?
Would the collaboration--even if effective--result in a backlog of copyright cleanup work that the community could not handle?
Would the collaboration--even if effective--result in legal or media vulnerabilities that would negatively impact Wikipedia's security and reputation?

RfC

Please record your position in this section. Note that the only question being addressed in this RfC is whether we should have a *trial*; the remaining sections for discussion will merely inform future planning. !Voting for A or B below is a !vote for the trial to proceed. Without a trial there will be no collaboration, so a !vote for C is also a !vote for not pursuing the collaboration at all.

A) A trial of 150 articles from a variety of quality and content areas as described on the Trial page would capably evaluate Turnitin's effectiveness in comparison to other methods; we should proceed with a trial.
B) The trial as described is flawed but can be reasonably improved in time to proceed with it; once corrected, we should proceed with a trial.
C) The trial is inappropriate or inadequate to determine Turnitin's effectiveness, and the situation cannot be remedied; we should not proceed with a trial, and therefore should not proceed with a collaboration with Turnitin.

Support A as nom. Ocaasi^{t | c} 17:40, 6 August 2012 (UTC)

Discussion

Attribution/advertising

A) Turnitin would be appropriately benefiting from this collaboration and the level of attribution they would receive is within past precedents or current practices or otherwise unproblematic.
B) Turnitin would be unduly benefiting from this collaboration and/or the level of attribution they would receive is tantamount to advertising a private company.

Agree with A. Please see my response here. Ocaasi^{t | c} 18:14, 6 August 2012 (UTC)

Backlog

A) Turnitin's system could help us address copyvio issues, and we could control or manage the workload created by a collaboration with them, even if Turnitin identifies more copyright issues than we currently can.
B) We won't know how Turnitin's system would impact us until the trial is complete. Let's revisit this afterwards.
C) Turnitin's system would create an unmanageable backlog and even if it was effective we could not handle the added workload.

Agree with A but also see the reasonableness of B. Please see my response here. Ocaasi^{t | c} 18:14, 6 August 2012 (UTC)

Media concerns

A) Turnitin would help us identify problems that, even if larger than we currently know, are better for us to know about and worth exploring further, even if that results in increased media attention.
B) Turnitin could identify problems that we don't currently have resources to fix, and therefore we should proceed very cautiously with a trial, further implementation, and any related publicity.
C) Turnitin could produce results that would seriously harm our reputation and create media vulnerabilities that are not worth the potential benefit from a collaboration with them.

Agree with A but also see the reasonableness of B. Please see my response here. Ocaasi^{t | c} 18:14, 6 August 2012 (UTC)

Proprietary systems

A) Though proprietary, Turnitin's system is an appropriate use of a closed-source program, and we could always switch to our own or another open-source system if one becomes available.
B) Turnitin's system is not ideal for our circumstances, but it's the best option we have at this point.
C) Turnitin's system is an inappropriate use of proprietary software and even if we can't find a better alternative we should not pursue a collaboration with them.

Agree with A but also see the reasonableness of B. Please see my response here and here. Ocaasi^{t | c} 18:14, 6 August 2012 (UTC)

Overall direction

This section will influence whether there is a second RfC following the trial, or if a successful demonstration by Turnitin will be grounds to slowly introduce their system on articles where they've shown their system to be effective.

A) Turnitin could be a useful partner in our copyvio detection efforts, and I approve of pursuing a collaboration with them, provided they can demonstrate the efficacy of their system.
B) Turnitin may be a useful partner, but there are too many open questions without seeing the trial results in detail; let's revisit the bigger picture following the trial.
C) Turnitin is an inappropriate partner and we should not pursue a collaboration with them, even if their system is effective.

Agree with A but also see the reasonableness of B. Please see the introduction and conclusion of this page. Ocaasi^{t | c} 18:14, 6 August 2012 (UTC)