48% of the edits reported as suspected copyvio required additional follow-up ("page fixed"). On tools.labsdb: select status, count(*) from s51306__copyright_p.copyright_diffs group by status;
The full details of how it will be shown in Special:NewPagesFeed would probably need to be discussed with the community and with the Growth team (MMiller, Roan Kattouw) - however, it is already possible to see an example in the beta test wiki (search for "copyvio"). It is important to note that a tagged page just means an edit may contain copied text; such edits may be OK (CC-BY content from government institutions), a copyright violation (copy & paste from a commercial news service), or promotional content (may be legally OK sometimes, but violates WP:Promo). Eran (talk) 16:07, 15 September 2018 (UTC)
I'm not clear on how this fits in with the CopyPatrol activities. I'd like to discuss this further. Please let me know if this is a good place to have that discussion or if I should open a discussion on your talk page or elsewhere.--S Philbrick(Talk) 18:03, 15 September 2018 (UTC)
Sphilbrick: I think it is relevant to this discussion; can you please elaborate? Thanks, Eran (talk) 19:30, 15 September 2018 (UTC)
I start with a bit of a handicap. While I understand the new pages feed in a very broad sense, I haven't actually worked with it in years and even then had little involvement.
It appears to me that the goal is to give editors who work in the new pages feed a heads-up that there might be a copyvio issue. I've taken a glance at the beta test wiki — I see a few examples related to copyvios. I see that those entries have a link to CopyPatrol. Does this mean that the new pages feed will not be directly testing for copyright issues but will be leaning on the CopyPatrol feed? I checked the links to CopyPatrol and found nothing in each case, which may make sense because those contrived examples aren't really in that report, but I would be interested to know exactly how it works when there is an entry.
The timing is coincidental. I was literally working on a draft of a proposal to consider whether the CopyPatrol tools should be directly making reports to the editors. That's not exactly what's going on here, but it's definitely related.
What training, if any, is being given to the editors who work on the new pages feed? Many reports are quite straightforward, but there are a few subtleties, and I wonder what steps have been taken to respond to false positives.--S Philbrick(Talk) 19:57, 15 September 2018 (UTC)
CopyPatrol is driven by EranBot, with checks done by iThenticate/Turnitin. This BRFA is to send revision IDs with possible violations to the API, which will cause the CopyPatrol links to be shown in the new pages feed. — JJMC89 (T·C) 04:53, 16 September 2018 (UTC)
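For readers unfamiliar with the mechanism, the flow described above can be sketched roughly as follows. This is a minimal illustration, assuming the PageTriage extension's copyvio-tagging API module and its parameters; the real bot may differ in detail:

```python
def build_copyvio_request(revision_id, csrf_token):
    """Build the POST parameters for flagging a revision as a potential
    copyvio via the PageTriage API.

    The module name 'pagetriagetagcopyvio' and its parameters are
    assumptions based on the PageTriage extension; EranBot's actual
    request may differ.
    """
    return {
        "action": "pagetriagetagcopyvio",
        "revid": revision_id,
        "token": csrf_token,  # CSRF token from an authenticated session
        "format": "json",
    }
```

In practice this payload would be POSTed to /w/api.php using the bot's authenticated session; PageTriage then shows the "Potential issues: Copyvio" indicator for that page in the New Pages Feed.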
Directly making reports to the editors - This is a good idea, and actually it was already suggested but never fully defined and implemented - phab:T135301. You are more than welcome to suggest how it should work there (or on my talk page, and I will summarize the discussion on Phabricator).
Thanks for the link to the training material. I have clicked on the link to "school" thinking it would be there, but I now see the material in the tutorial link.
Regarding direct contacts, I'm in a discussion with Diannaa who has some good reasons why it may be a bad idea. I intend to follow up with that and see if some of the objections can be addressed. Discussion is [[User talk:Diannaa#Copyright and new page Patrol|here]].--S Philbrick(Talk) 18:54, 16 September 2018 (UTC)
@Sphilbrick: thanks for the questions, and I'm sorry it's taken me a few days to respond. It looks like ערן has summarized the situation pretty well, but I'll also take a stab. One of the biggest challenges with both the NPP and AfC process is that there are so many pages that need to be reviewed, and there aren't good ways to prioritize which ones to review first. Adding copyvio detection to the New Pages Feed is one of three parts of this project meant to make it easier to find both the best and worst pages to review soonest. Parts 1 and 2 are to add AfC drafts to the New Pages Feed (being deployed this week), and to add ORES scores on predicted issues and predicted class to the feed for both NPP and AfC (being deployed in two weeks). The third part will add an indicator next to any pages that have a revision that shows up in CopyPatrol, and those will say, "Potential issues: Copyvio". Reviewers will then be able to click through to the CopyPatrol page for those revisions, investigate, and address them. The idea is that this way, reviewers will be able to prioritize pages that may have copyvio issues. Here are the full details on this plan. Xaosflux has brought up questions around using the specific term "copyvio", and I will discuss that with the NPP and AfC communities. Regarding training, yes, I think you are bringing up a good point. The two reviewing communities are good at assembling training material, and I expect that they will modify their material as the New Pages Feed changes. I'll also be continually reminding them about that. Does this help clear things up? -- MMiller (WMF) (talk) 20:32, 20 September 2018 (UTC)
User:ערן how will your bot's on-wiki actions be recorded (e.g. will they appear as 'edits', as 'logged actions' (which log?), etc.)? Can you point to an example of where this gets recorded on a test system? — xaosfluxTalk 00:22, 16 September 2018 (UTC)
Xaosflux: For the bot side, it is logged to s51306__copyright_p on tools.labsdb, but this is clearly not an accessible place. It is not logged on-wiki AFAIK; if we do want to log it, this should be done on the extension side. Eran (talk) 18:40, 16 September 2018 (UTC)
I've never commented on a B/RFA before, but I think that another bot detecting copyvios would be great, especially if it had fewer false positives than the current bot. Thanks, L3X1◊distænt write◊ 01:12, 16 September 2018 (UTC)
L3X1: the Page Curation extension defines infrastructure for copyvio bots, so if there are other bots that can detect copyvios, they may be added to this group later. AFAIK the automated tools for copyvio detection are Earwig's copyvio detector and EranBot/CopyPatrol, and in the past there was also CorenSearchBot. The way they work is technically different (one is based on a general-purpose Google search, the other on the Turnitin copyvio service), and they complement each other, with various pros and cons for each. I think EranBot works pretty well (compare Wikipedia:Suspected copyright violations/2016-06-07, for example).
As for the false positives: it is possible to define different thresholds, getting fewer false positives but also missing more true positives. I haven't done a full ROC analysis to tune all the parameters, but the arbitrary criteria actually work pretty well somewhere in the middle ground. Eran (talk) 18:40, 16 September 2018 (UTC)
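To illustrate the precision/recall tradeoff Eran describes, here is a toy decision rule. All numbers and names here are hypothetical, not the bot's real criteria:

```python
def is_suspected_copyvio(match_percent, added_bytes,
                         percent_threshold=50, size_threshold=500):
    """Hypothetical decision rule: flag an edit only when both the
    similarity score and the amount of added text exceed their
    thresholds. Raising the thresholds yields fewer false positives
    but misses more true positives; lowering them does the opposite.
    """
    return match_percent >= percent_threshold and added_bytes >= size_threshold
```

A full ROC analysis would sweep these thresholds and plot the true-positive rate against the false-positive rate; the "middle ground" Eran mentions is a point on that curve chosen by judgment rather than optimization.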
Follow-up from the BOTN discussion: from what has been reviewed so far, the vendor this bot will get results from can check for "copies" but not necessarily "violations of copyrights" (though some copies certainly are also copyvios). As such, I think all labels should be limited to descriptive ones (e.g. "copy detected"), as opposed to accusatory ones (humans should determine whether the legal situation of violating a copyright has occurred). — xaosfluxTalk 01:30, 16 September 2018 (UTC)
@JJMC89: what I'm looking for is where a log of this bot's actions will be kept. As this is editor-managed, it's not unreasonable to think another editor may want to run a similar or backup bot in the future. — xaosfluxTalk 05:14, 16 September 2018 (UTC)
Would it be possible to assign a number of bytes to "large chunk of text"? SQLQuery me! 02:25, 16 September 2018 (UTC)
500 bytes. — JJMC89 (T·C) 04:53, 16 September 2018 (UTC)
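For concreteness, the size check implied above can be sketched as follows. This is a generic illustration (measuring growth in UTF-8 bytes between revisions), not necessarily how EranBot computes it:

```python
# Per the answer above, additions of at least 500 bytes count as a
# "large chunk of text" worth checking.
LARGE_CHUNK_THRESHOLD = 500

def added_bytes(old_text, new_text):
    """Bytes of growth between two revisions, measured in UTF-8.
    Negative growth (removals) counts as zero added text."""
    return max(0, len(new_text.encode("utf-8")) - len(old_text.encode("utf-8")))

def is_large_addition(old_text, new_text):
    """True when the revision grew the page by at least the threshold."""
    return added_bytes(old_text, new_text) >= LARGE_CHUNK_THRESHOLD
```

Note that byte growth of the whole page is a coarse proxy; a real check would more likely measure the size of the specific text added in the diff.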
Procedural note: The components for reading changes, sending data to the third party, and making off-wiki reports alone do not require this BRFA; making changes on the English Wikipedia (i.e. submitting new data to our new pages feed, etc.) is all we really need to be reviewing here. Some of this may have overlap (e.g. what namespaces, text size, etc.); however, there is nothing here blocking the first 3 components alone. — xaosfluxTalk 18:54, 16 September 2018 (UTC)
It looks like phab:T204455 has been closed regarding logging, can you show an example of an action and it making use of this new logging? — xaosfluxTalk 11:02, 2 October 2018 (UTC)
@Xaosflux: If the verbiage issue is resolved, I was wondering if we could move ahead with a trial for this BRFA. The way that PageTriage works is that it won't allow bots to post copyvio data to it unless the bot belongs to the "Copyright violation bots" group. So for the trial, you'll need to add EranBot to the group with whatever expiration time you like. It would be good to have at least a couple of days so that we can make sure everything is running properly on our end as well. Ryan Kaldari (WMF) (talk) 17:35, 4 October 2018 (UTC)
@Xaosflux: Unfortunately, setting up EranBot to monitor a new wiki isn't trivial and might take a while. You can see what the new logs will look like on Beta Labs. And you can see what sort of stuff EranBot flags by looking at CopyPatrol. What do you think about just doing a one day trial on English Wikipedia and having folks take a look at the results? That way it will be tested against more realistic edits anyway. Ryan Kaldari (WMF) (talk) 00:15, 11 October 2018 (UTC)
phab:T206731 created as we do not currently have community control over this access. — xaosfluxTalk 02:14, 11 October 2018 (UTC)
Approved for trial (7 days). I've added the cvb flag for the trial and let the NPP/Reviewers know. Do you have a good one line text that could be added to MediaWiki:Pagetriage-welcome to help explain things and point anyone with errors here? — xaosfluxTalk 18:41, 15 October 2018 (UTC)
I'm not seeing it on betalabs either; how is anyone going to actually make use of this? — xaosfluxTalk 19:32, 15 October 2018 (UTC)
I was guessing it would show in the filters under "potential issues", but there's nothing there. FWIW, "attack" also has no articles, but is still shown there. I think I might be misunderstanding how this works altogether. Natureium (talk) 19:39, 15 October 2018 (UTC)
@Natureium: just regarding the "attack" filter having no pages, that is behaving correctly. It is very rare that a page gets flagged as "attack", because whole pages meant as attack pages are rare. It's much more common for pages to be flagged as "spam", and you can see some of those in the feed. To see some flagged as attack, you can switch to "Articles for Creation" mode and filter to "All" and "attack". -- MMiller (WMF) (talk) 22:52, 15 October 2018 (UTC)
@Xaosflux: thanks for your help here and for adding that editor. Roan Kattouw (WMF) edited it to link to the project page for this work so that people coming across it have some more context. -- MMiller (WMF) (talk) 22:52, 15 October 2018 (UTC)
Regarding the scope of this bot, User:ערן / User:Ryan Kaldari (WMF) the function overview calls for this to run against "newly added text", but the trials suggest it is only running against newly added pages - is this limited to new pages? — xaosfluxTalk 13:37, 17 October 2018 (UTC)
Xaosflux: the bot runs against any added text and reports suspected edits to CopyPatrol and to PageTriage. PageTriage accepts only edits for pages in the NewPagesFeed (i.e. new pages). Eran (talk) 14:43, 17 October 2018 (UTC)
Thanks, I'm not concerned with updates off-wiki (such as to CopyPatrol) for this BRFA; I'm just trying to clarify when activity will actually be made on-wiki. For example, with the page [[Sarkar (soundtrack)]]: it was created on 20181007, and a revision made today (20181017) triggered the bot action. Are you only attempting actions for pages where the creation is within a certain timeframe? — xaosfluxTalk 15:07, 17 October 2018 (UTC)
Xaosflux: No, it doesn't care about the page creation time (page creation time isn't that meaningful for drafts). However, this is viewable only in Special:NewPagesFeed, which is intended for new pages, but I'm not sure what the definition of a new page is for the feed (User:Ryan Kaldari (WMF), do you know?). Eran (talk) 16:32, 17 October 2018 (UTC)
Will these be usable for recent changes or elsewhere, to catch potential copyvios being introduced to 'oldpages' ? — xaosfluxTalk 16:36, 17 October 2018 (UTC)
For now, no, only new pages and drafts. CopyPatrol, however, handles copyvios introduced to old pages. Ryan Kaldari (WMF) (talk) 17:28, 14 November 2018 (UTC)
(Buttinsky) I know very little about the technical aspect of this, so if I need to just pipe down I will, but "No it doesn't care for the page creation time (page creation time isn't that meaningful for drafts)" is one of the main problems that exists with Turnitin-based CopyPatrol. Dozens upon dozens of revisions are flagged as potential CVs even though the diff in question did not add the text that is supposedly a CV; most of the time it seems that if anyone edits a page that has a Wikipedia mirror (or when someone else has wholesale lifted the article), no matter how small the edit, it will be flagged. Most of the 6,000-some cases I've closed have been false positives along those lines, and I think it might be of some use to make the software exclude any cases where the editor has more than 50,000 edits. Thanks, L3X1◊distænt write◊ 02:46, 18 October 2018 (UTC)
L3X1, thank you for the comment. I think this is a good comment that should be addressed and discussed in a separate subsection; I hope this is OK.
Historically, EranBot detects Wikipedia mirrors (see the example in User:EranBot/Copyright/rc/44; look for "Mirror?"), where the intention is also to handle copyvio within Wikipedia. That is, if a user copies content from another article, they should give credit in the edit summary (e.g. a sufficient credit in a summary: "Copied from Main page").
This is a common case, and somewhat different from copying from an external site/book. I think the CopyPatrol interface doesn't show this mirror indication (or other indications, such as CC-BY). So how should we address it:
Does the community want reports on "internal copyvio" (copying within Wikipedia/Wikipedia mirrors) without credits? (If not, this can be disabled, and we will not get any more reports on such edits.)
If the community does want reports for "internal copyvio":
We can add the hints on the CopyPatrol side (Niharika and MusikAnimal) if this isn't done already. (I think it isn't.)
It is up to the community whether we want to have a distinction of labels in the NewPagesFeed.
(based on community input here, this will be tracked technically in phab:T207353)
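To illustrate the "credit in the edit summary" convention Eran mentions, a naive attribution check might look like this. The phrasing patterns are assumptions for illustration; the actual attribution conventions are described at WP:Copying within Wikipedia:

```python
import re

# Hypothetical patterns for common attribution phrasings in edit
# summaries, e.g. "Copied from [[Main page]]" or "Merged content from ...".
CREDIT_PATTERNS = [
    re.compile(r"copied (?:content )?from", re.IGNORECASE),
    re.compile(r"merged (?:content )?from", re.IGNORECASE),
    re.compile(r"split (?:content )?from", re.IGNORECASE),
]

def summary_gives_credit(edit_summary):
    """Return True if the edit summary appears to attribute copied
    text to a source page. A naive sketch: real detection would need
    to handle more phrasings and extract the source page title."""
    return any(p.search(edit_summary) for p in CREDIT_PATTERNS)
```

An internal-copy report could then be suppressed when the flagged revision's summary passes such a check, leaving only uncredited copies for reviewers.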
Though I suspect this will be a minority viewpoint, I don't think detecting copying within Wikipedia is as important as catching copying from external sources. Thanks, L3X1◊distænt write◊ 16:56, 18 October 2018 (UTC)
If these are only page creations, I think this would be useful for finding pages that have been copy-pasted from other articles, because this also requires action. Natureium (talk) 16:59, 18 October 2018 (UTC)
Detecting copying within WP is important, because it is so easily fixed, and fixing it, especially if it can be fixed automatically, prevents people who don't understand the priorities from nominating pages for deletion unnecessarily. Detecting and not reporting mirrors is pretty much essential, and they should be marked, because beginners tend not to realize. DGG ( talk ) 05:53, 23 October 2018 (UTC)
Trial complete. The trial time has expired and along with it the access to inject data to the feed. Please summarize the results of your trial and if you require further trialing for evaluation. — xaosfluxTalk 04:48, 22 October 2018 (UTC)
@Xaosflux: thanks for your help with the trial. We think the trial was successful and that we should be ready to go to production. Below are the results of the trial (along with one caveat that we intend to work on):
Since October 15, the bot flagged 237 pages as potential copyright violations. That is here in the log. One encouraging note is that many of those pages are now red links that have been deleted, showing the association between being flagged by the bot and being a revision with enough violations that deletion is warranted.
Spot checking by our team has shown that almost all pages flagged by the bot in the New Pages Feed are also in CopyPatrol, and vice versa -- as expected.
This test was announced at both NPP talk and AfC talk. There were a couple questions, but no negative feedback. Mostly just confirmation that things were working, and a little bit of positive feedback.
The one issue we noticed is in this Phabricator task, in which we noticed from spot-checking that when a page is moved from Main namespace to Draft namespace (or vice versa), the New Pages Feed links to the CopyPatrol page of the new namespace, whereas the CopyPatrol results are still displayed at the original namespace. This, however, is an issue with the PageTriage integration and the way we are displaying results, not with the bot. We'll address this, but we think the bot is behaving correctly.
@MMiller (WMF): thank you for the note. I've left notes on those pages as well that this appears ready to go live; if there are no issues in ~2 days, it looks like we can call this done. — xaosfluxTalk 20:44, 22 October 2018 (UTC)
Hi Xaosflux -- it's about two days later now, and I don't see any notes about the bot from the affected communities. We want to remove the feature flag for the New Pages Feed to incorporate the bot's results tomorrow, putting this feature in production as a whole. Could you please add the bot back to the copyvio group? Thank you. -- MMiller (WMF) (talk) 18:39, 24 October 2018 (UTC)