Wikipedia:Bots/Requests for approval/ClueBot NG/Pretrial

Pre-trial Discussion

I think one of the main concerns about anti-vandal bots is not about how good they are at detecting vandalism; rather, it's above all about how good they are at not marking legitimate edits as vandalism. Therefore, one main concern from the document might stem from the following: "Estimates given to me for an acceptable false positive rate range from 1% to 0.5%. " Consider, for a moment: a false positive rate of 1.0% means that on average 1 in every 100 legitimate edits would be reverted as vandalism; 0.5% translates to 5 of every 1000 legitimate edits being reverted as vandalism. Keeping in mind that the enwiki edit rate is relatively high, false positives tend to add up. I'm not sure what the magic number is going to be, but my suggestion would be that you should at least keep in mind the idea that as a whole the community seems to prefer dealing with vandalism manually if it avoids labeling new users as vandals. --slakr^\ talk / 06:24, 25 October 2010 (UTC)[reply]

I agree with this completely. One of the aims from the ground up has been to minimize false positives. The key is to realize that even humans can sometimes have false positives, particularly with the borderline edits (and this is the area where Cluebot-NG has a few false positives). Currently existing bots have a significant number of false positives - likely higher than 0.5%. If the 0.5% false positive rate is deemed too high, it can be adjusted at any time. This exact number could be put up for discussion before going live. The program is capable of generating graphs comparing the false positive rate with vandalism detection rate - do you think it would be useful to post these and open up a discussion concerning them? I can also post current lists of false positives in the trial dataset - it may be useful to see that most of them are poor quality or borderline edits, and on nearly all of them the user has contributed a very few times (not just as of the time of the edit, but as of the present). In response to your concerns about new users being labeled as vandals, I'm re-running the trial dataset right now, discarding any data about previous user contributions. I will update this with the results. Crispy1989 (talk) 06:55, 25 October 2010 (UTC)[reply]

Update - I ran the dataset without any prior user information and accuracy was almost the same, with only a slight dip of about 4%. There was no significant effect on false positives.

I know that this discussion is focused more on eliminating false positives rather than increasing ClueBot NG's accuracy, but I've noticed that the bot seems to miss content-removal vandalism more often than the other bots. --Ixfd64 (talk) 20:22, 4 November 2010 (UTC)[reply]

Very interesting. The last 2 anti-vandal bots performed dry runs on their user pages before being approved. This does not require prior permission, AFAIK. And just curious, have you test your program against the datasets provided by the CLEF 2010 LABs?, and if you have, how it has performed in comparison with other approaches? Sole Soul (talk) 18:23, 25 October 2010 (UTC)[reply]
It's difficult to perform a dry run on a userpage that adequately demonstrates the capabilities of the bot. The neural network takes into account not only the raw diff, but information on activity on the page, user activity, and other statistics. Also, the neural network is trained on a dataset of main namespace articles - it wouldn't apply very well to the talk page. It may be possible to set up a simulated environment with a somewhat less functional version by either holding these inputs constant, or removing them from the neural network. This approach differs from all existing ones that I'm aware of in that this approach combines multiple different methods to catch different classes of vandalism. Most existing approaches are either statistics-based, or language-based. This is both. Another key difference is that existing approaches are designed with the mindset of being a research project, and as such, try to maximize overall accuracy without practical considerations. Cluebot-NG has been designed to be practical from the ground up, in terms of speed, and minimizing false positives even if it means a decrease in overall correctness. Crispy1989 (talk) 19:17, 25 October 2010 (UTC)[reply]
What is the accuracy when the false positive rate is capped at 0.1%? Is there a chart somewhere of the catch rate vs false positive setting? Gigs (talk) 21:03, 25 October 2010 (UTC)[reply]
Capped at 0.1%, it currently catches about 40% of vandalism. We're working to improve this by identifying under what circumstances edits are falsely identified as vandalism and correcting them. You can view one recent trial report in whole at [1] . Among other things, it includes the graph you requested: [2] Crispy1989 (talk) 22:02, 25 October 2010 (UTC)[reply]
Thanks, that's just what I was looking for. It does look like the sweet spot is 0.5% at this point. I would feel more comfortable with a 0.1% false positive rate though. Gigs (talk) 22:22, 25 October 2010 (UTC)[reply]
I'll update the graphs as the bot improves and learns. Crispy1989 (talk) 22:46, 25 October 2010 (UTC)[reply]

I would support an anti-vandal bot which could avoid absurd false positives such as these recent reversions by ClueBot [3] and [4] (apparently, any use of the words "sex" or "pussy" in an edit by a non-whitelisted user is sufficient to trigger reversion), provided that the existing anti-vandal bots are decommissioned. Will this the algorithms of this new bot adhere to WP:NOT#CENSORED, and refrain from reverting edits solely because they contain "bad" words? (though questionable language may legitimately be one of the factors which weighs in favour of reversion.) Also, blanking of content is sometimes legitimate; automated restoration of copyright [5] or BLP violations is particularly disruptive. I suggest that blanking in the following situations never be reverted by a bot:

The blanking edit adds a copyright violation template, or {{db-g12}}.
The blanking adds {{db-g10}}.
The blanking only removes content that contains no references in a form the bot can recognize (<ref>, a reference template, or an external link.)
The blanking only removes content (doesn't replace it with something else) on an article that was previously in Category:Living persons (while content removal on BLPs can constitute vandalism, human judgement is required to determine whether this is actually the case; the content may have blatantly violated the sourcing or neutrality requirements of the policy.) Peter Karlsen (talk) 03:31, 26 October 2010 (UTC)[reply]

I also noticed this diff on Wikipedia:Bots/Requests for approval/ClueBot NG/Trial. Since User:Tide rolls is an administrator with nearly 150,000 edits, the identification of this edit for reversion suggests that a whitelisting mechanism is not (currently) implemented. Is this a planned feature, perhaps through the use of Wikipedia:Huggle/Whitelist? Peter Karlsen (talk) 03:49, 26 October 2010 (UTC)[reply]

Cluebot-NG does not use a "blacklist" of any sort. The words that compose an edit are taken into account in two ways; the presence in a set of predefined "word categories", and the result of a naive Bayes classifier (two, actually). The word categories (as opposed to a blacklist) allows the bot to recognize that certain words may be acceptable if similar words are already used in the article. The naive Bayes classifiers recognize if a certain word appears only in vandalism, or if it sometimes appears in good edits. The naive Bayes classifier also provides for instances of normally bad words to be offset by the presence of other words which are not normally found in vandalism (These lists are NOT predefined - they are empirically determined by analyzing the dataset). Additionally, neither of these factors are used independently - they're fed into a neural network along with many other statistics, so the bot can learn from example in what statistical situations a high Bayesian score, or a word's presence in a certain category, is acceptable. Also, the second Bayes classifier uses sets of two words, instead of one, so phrases like "Pussy cat" would be recognized as primarily belonging to good edits (given a large enough data set). As the bot does not use any sort of heuristics, it cannot be programmed to ignore those certain situations that you list. However, these tags, and others, can be added as inputs to the neural network. It should then learn under what circumstances they contribute to an edit being good or bad. I should also note that, of the many false positives I've examined so far, none of them fit into these categories, so it would seem the neural network already does a good job of determining that the edit is not vandalism in these cases. The entire concept relies on the neural network learning complex patterns and making inferences about the data presented to it, and this requires a large dataset. We currently have a dataset of around 30,000 edits, about 20,000 of which are used for training, and the other 10,000 for testing/trialing. We are working on expanding this. About the whitelisting of edits - Yes, this will be implemented when the bot is actually running live. The reason this is not currently active is that we would like to find as many false positives as possible now, before the bot goes into production, so we can work on fine-tuning the statistical parameters of the neural network. Even if an edit would not be reverted in production due to a whitelist, it's useful to train as an example of what should be considered a good edit. The programmatic structure of the bot is modular, and different mechanisms can easily be added - we plan on only adding whitelisting measures post-neural-net (ie, no heuristics that could cause additional false positives). Crispy1989 (talk) 04:37, 26 October 2010 (UTC)[reply]

It sounds like ClueBot NG could greatly outperform traditional Wikipedia anti-vandalism bots, reverting significantly more vandalism with a lower false positive rate. One of the most important issues, however, is exactly where the acceptable false-positive rate is set. I would expect a percentage of false positives no greater than an experienced, careful human editor would produce. 0.1% false positives is probably at least as accurate as most humans could be, and, at 40% of vandalism reverted (based on the discussion above) is still far more effective than existing bots. Given that ClueBot NG would be performing far more reversions in total than DASHBotAV and ClueBot have, setting the false positive rate below the 1% or 0.5% that has heretofore been considered acceptable is particularly important. Peter Karlsen (talk) 05:12, 26 October 2010 (UTC)[reply]

Keep in mind that number of reversions has no bearing on false positive percentage. False positives are measured based on how many legitimate edits it processes, not how many edits it reverts. False positives percentage is (number of legit edits marked as vandalism) / (total legit edits processed) * 100. -- Cobi^(t|c|b) 06:35, 26 October 2010 (UTC)[reply]

Are you saying that if it processes 100,000 edits, 100 of which are vandalism, it will revert 70 of the vandalism edits and 500 legitimate edits? (given 70% catch rate at 0.5% false positive). If so, that's far, far worse than I thought. Gigs (talk) 18:59, 26 October 2010 (UTC)[reply]

That is how false positive rates are measures - This is significantly better than existing bots nonetheless. Ie, if the existing bots are decommissioned and replaced with this, there will be fewer false positives, and much more vandalism caught. Also, vandalism comprises more than 1% of edits, so although that example does show how the rates are calculated, it does not represent reality. Crispy1989 (talk) 19:13, 26 October 2010 (UTC)[reply]

In any case, at a 0.1% false positive rate and 40% of vandalism reverted, ClueBot NG represents an opportunity for significantly improved performance, so that more vandalism is reverted, while events like [6], [7], and [8] become less frequent occurrences. I would suggest that setting the acceptable false positive rate as low as it possibly can be while still reverting a reasonable portion of vandalism increases the probability that the bot will be approved after its trial period (for the same reason, any "final" trial, either dry or with actual reversions, should be conducted with whitelisting enabled.) Peter Karlsen (talk) 19:33, 26 October 2010 (UTC)[reply]

Also, I should mention that the bot is constantly being improved every day. Accuracy will likely be even higher before the final trial. Anyone who feels they have something to add is welcomed to help. The more pertinent statistical information the neural network gets, the better. We've only added the things that we can think of off the tops of our heads. All of this information is configured at run time using configuration files, so new metrics can be easily added. If you have something to add or suggest, stop by irc.cluenet.org #cluebotng . Crispy1989 (talk) 20:40, 26 October 2010 (UTC)[reply]

When do you think you will be ready for a trial and what kind of sample size will you need to verify the method? MBisanz ^talk 05:34, 27 October 2010 (UTC)[reply]

Even its current performance is far superior to current bots. It's up to the BAG when the trial happens. As I said above, accuracy can only get higher, and development won't stop as soon as it goes live. We already have enough of a sample size (around 30,000 edits) to verify that it works very well. This sample size is what the other figures in this discussion are based on. The larger the dataset, the more accurate it will be. Crispy1989 (talk) 05:55, 27 October 2010 (UTC)[reply]

I ran it for a few hours today in dry run mode and exported it's data to User:ClueBot NG/Dry Run -- warning, this is a very large page (about 2500 links). -- Cobi^(t|c|b) 08:29, 28 October 2010 (UTC)[reply]

Generally pretty impressive, imo. I will say that the fourth link I clicked on in a list of 2500 (Christopher Boykin by 68.229.109.100 (talk · contribs) at 2010-10-27 19:37:08 - ANN scored at 0.956627) was a false positive, however. :P Ale_Jrb^talk 11:34, 30 October 2010 (UTC)[reply]

That sort of false positive is very rare, and will be fixed over time as the dataset grows. Part of the algorithm uses a naive Bayes classifier on the inserted text. In the dataset we have, the word "banana" has been used in 4 vandalism edits, and 0 good edits. Since that user made no past contributions to analyze, and there was no data other than the inserted word, this statistic was the only one that contributed to the score. Because our existing dataset is pretty large, false positives like this (and false positives in general) very rarely occur. These can be fixed in one of two ways. The best way is to increase the size of the dataset. If even a single good edit in the dataset included the word "banana", the neural network would put less stock in it. The other way is to increase the minimum number of dataset appearances of a word before it contributes to the Bayesian score. This is currently set at 4 (barely enough for "banana" to trigger it). We are using our best judgement to adjust these parameters to the optimal values, but as I said, the best way is to increase the dataset size. Crispy1989 (talk) 12:04, 30 October 2010 (UTC)[reply]

A couple comments/questions:

I think it would be good to get some wider community input on what an acceptable false positive rate would be.
How do false positives affect the bot's learning ability? Will a higher false positive rate lead to "bad" entries in the dataset that could cause it to "learn" slower (or worse, get worse with time)? Or will the fact that its reverting more actual vandalism, including corner cases, mean that it will learn faster?
A somewhat WP:BEANS-y question - If it looks at users' past edits, is it possible to game the bot? If a user makes a legitimate edit to an article about penises, will that make the bot less likely to detect "penis vandalism" by the same user?

-- Mr.Z-man 19:34, 30 October 2010 (UTC)[reply]

How would I go about getting wider community input for the false positive rate? I agree with this, but so far there have been different acceptable estimates, but no real concensus. My recommendation is somewhere between 0.5% and 0.1%. Anything above 0.5% is probably too high, but close to 0.1%, there's a dropoff in effectiveness.
The bot does not automatically learn. If it did so, regardless of the false positive rate, its performance would simply remain status quo. Over time, we plan on manually growing the dataset, both by using human-reviewed random edits, and by adding specific false positives (and false negatives) to the set.
The bot only looks at general statistics of the user's past edits (number of edits, time frame, number of past edits that were vandalism, number of unique pages edited). It does not process content of past edits. It learns things like, if the user has made two previous edits, and neither were vandalism, it's much less likely for the edit in question to be vandalism than if it's the user's first edit, or if the user has made 2 edits before, both of which were vandalism. (Also note that these scenarios do not alone positively identify an edit as vandalism. They only contribute to the result.)

A note about "penis" vandalism and similar: Currently, the bot's Bayesian database shows "penis" appearing in 157 vandalism edits, and 0 good edits. Because of this, a user simply adding the word "penis" alone to a page would likely be classified as vandalism. However, almost no legitimate edit would add only this word. Bayesian scoring also takes into account other words that are added with it (for example, if "birth" was also added, it appears in 43 good edits and only 15 vandalism edits, so it would bring the score far below threshold). In addition to this, the bot also monitors if words already appear in an article before the edit, so even if the word "penis" alone were added, if it already appeared in the article, the bot may take that into account (however, that may not be enough to have it classified as a good edit - all statistics are taken into account). Also, the bot handles words inside quotes differently, as direct quotes on Wikipedia are allowed to contain many things that direct article text should not. Crispy1989 (talk) 20:20, 30 October 2010 (UTC)[reply]

To me the false positive rate is the key question, since any vandalism reversion is useful. For comparison, what are the false positive rates of the existing anti-vandalism bots? IMHO any bot with a lower false positive rate than currently approved bots can be readily authorised. Rjwilmsi 20:08, 30 October 2010 (UTC)[reply]

If I understood the info from the report.txt that's linked to above, if the bot catches 2100 vandalism edits while having 5 false positives it sounds like it's doing very well. Rjwilmsi 20:11, 30 October 2010 (UTC)[reply]

It's very difficult to estimate the false positive rates of existing bots, because the majority of false positives probably are not even reported. Since existing bots are heuristics-based, they are frequently over-aggressive with many things that usually occur in vandalism, but can sometimes be acceptable. They mitigate this to some extent using a user-whitelist, and not reverting edits made by a user with more than a certain number of contributions. But this means that all or nearly all of the false positives are occurring with new users, who probably don't understand why their edit was classified as vandalism, or how to report it (if they even notice). Just by looking at their logic and considering how many legitimate edits could be misclassified by simple heuristics, it can be seen that existing false positive rates should be significantly higher than the values being considered for this bot.

About the comment on report.txt, you are reading it correctly, but the inference is a little off. The raw numbers of vandalism detected versus false positives are based on the trial dataset, which is about 50/50 vandalism/good edits. If Wikipedia edits followed this same proportion, then saying "2100 vandalism edits reverted for every 5 false positives" would be accurate. However, there are more good edits on Wikipedia than vandalism, so this isn't an accurate way to think about it. That's why I've been discussing false positive rate (which is a percentage of good edits), rather than the ratio, because the false positive rate is independent of the ratio of good edits to vandalism.

On the other hand, a few other things to keep in mind: Of those 5 false positives, one isn't actually a false positive (it's a misclassified edit in the dataset). We are working to correct these few errors by manually reviewing edits. They do not affect training as the very few errors are washed out by all the correctly classified ones, but they can affect calculation of threshold (by making false positive rate seem higher than it actually is). Also, at least one other edit of those 5 false positives would probably be caught by the post-processing filter (using a similar whitelist to existing bots) which will be added in production. Also keep in mind the bot is continually being improved. Many changes have been made since that report was generated. I'll post a new report soon. Crispy1989 (talk) 20:41, 30 October 2010 (UTC)[reply]

Examining the bot's current performance, I've decided to recommend a false positive rate of 0.25%. I also posted the report from a recent dataset trial here. As you can see from the graph, there's a sharp dropoff after about 0.25% false positives. At 0.25% false positives, the score threshold is 95.4%, and 63.7% of vandalism is caught. In reality, the number of false positives will be less than demonstrated here. Looking at the list of false positives, 8 of the 13 aren't even real false positives - they're edits misclassified in the dataset, so the bot is actually identifying these correctly. Since the whole dataset is human-reviewed, this demonstrates that the bot can, under some circumstances, be even more accurate than a human - it can even recognize and ignore errors in its training set. Crispy1989 (talk) 21:52, 30 October 2010 (UTC)[reply]