Want to help out with the Cluebot engine rewrite?

We need help generating the dataset! Look at the dataset contribution interface.

Crispy

C-N

This user has written C compilers, or tweaked C runtime libraries in Assembly language.

This user is a native speaker of the English language.

This user can contribute with a basic level of French.

Hi, I'm Crispy. I haven't done much with Wikipedia (I'm more of a programmer and a sysadmin), but I'm working on a new core engine for ClueBot's vandalism detection, and I'm using this page to document some of the work.

New Cluebot Vandalism Detection Engine

General Structure

Cobi and I are working together on the architecture of the new engine. All of the Wikipedia interface code will remain the same, using Cobi's wikibot.classes PHP code. The core vandalism detection engine, however, has to perform heavy computation, which a scripting language cannot do in a reasonable amount of time.

Core Engine

The engine's core will be a feed-forward artificial neural network trained with back-propagation. It will learn what vandalism is from numerous examples. Currently, the examples are provided by Cluebot's existing vandalism detection heuristics, with a higher-than-normal threshold value. The neural network code is a modified version of annutils.
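To make the idea concrete, here is a minimal sketch of a feed-forward network with one hidden layer trained by back-propagation. It is not the annutils code and does not reflect the real engine's layout: the class name TinyNetwork, the layer sizes, the learning rate, and the toy examples are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNetwork:
    """One hidden layer, sigmoid activations, trained by back-propagation."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # One row of weights per hidden unit; the last entry is its bias.
        self.hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        # Weights from the hidden units to the single output, plus a bias.
        self.output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, inputs):
        """Return (hidden activations, output score between 0 and 1)."""
        acts = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1])
                for row in self.hidden]
        score = sigmoid(sum(w * a for w, a in zip(self.output, acts))
                        + self.output[-1])
        return acts, score

    def train_one(self, inputs, target, rate=0.5):
        """Adjust the weights for one example; return its squared error."""
        acts, score = self.forward(inputs)
        # Output delta: squared-error gradient passed through the sigmoid.
        d_out = (score - target) * score * (1.0 - score)
        # Hidden deltas use the output weights before they are updated.
        d_hidden = [d_out * self.output[j] * acts[j] * (1.0 - acts[j])
                    for j in range(len(acts))]
        for j, a in enumerate(acts):
            self.output[j] -= rate * d_out * a
        self.output[-1] -= rate * d_out
        for j, row in enumerate(self.hidden):
            for i, x in enumerate(inputs):
                row[i] -= rate * d_hidden[j] * x
            row[-1] -= rate * d_hidden[j]
        return (score - target) ** 2


def train(network, examples, epochs=3000, rate=0.5):
    """Make repeated passes over the examples; return summed error per pass."""
    history = []
    for _ in range(epochs):
        history.append(sum(network.train_one(x, t, rate) for x, t in examples))
    return history


# Hypothetical pre-scaled inputs paired with a label: 1.0 = vandalism, 0.0 = not.
EXAMPLES = [
    ([0.9, 0.8], 1.0),
    ([0.8, 0.1], 1.0),
    ([0.1, 0.2], 0.0),
    ([0.0, 0.0], 0.0),
]

net = TinyNetwork(n_inputs=2, n_hidden=3)
history = train(net, EXAMPLES)
```

After training, net.forward(inputs)[1] gives a score between 0 and 1 that can be compared against a threshold, much as the existing heuristics are.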

Neural Network Inputs

The raw text cannot be fed directly into the neural network, so a preprocessor performs many operations on the edit to convert it into scaled floating point values before the ANN can process it.
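As a rough sketch of what such a preprocessor might look like, the code below turns an edit into three values scaled to the range 0 to 1. The features themselves (uppercase ratio, size change, longest repeated-character run) and the scaling constants are assumptions chosen for illustration, not the actual list of inputs used by the engine.

```python
def longest_run(text):
    """Length of the longest run of one repeated character."""
    best = run = 0
    previous = None
    for ch in text:
        run = run + 1 if ch == previous else 1
        previous = ch
        best = max(best, run)
    return best


def clamp01(value):
    """Force a value into the range 0..1."""
    return max(0.0, min(1.0, value))


def edit_to_inputs(old_text, new_text):
    """Convert one edit into a list of floats in the range 0..1."""
    letters = [ch for ch in new_text if ch.isalpha()]
    upper_ratio = (sum(ch.isupper() for ch in letters) / len(letters)
                   if letters else 0.0)
    # Size change mapped so 0.5 means "no change"; +/-2000 characters saturates.
    delta = len(new_text) - len(old_text)
    size_change = clamp01(0.5 + delta / 4000.0)
    # Long runs such as "!!!!!!!!" are squashed so 20 or more gives 1.0.
    run_score = clamp01(longest_run(new_text) / 20.0)
    return [upper_ratio, size_change, run_score]
```

Every value is clamped, so one extreme edit (for example, a page blanking) cannot push an input outside the range the network was trained on.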

Training Set

This type of supervised-learning neural network needs data pairs, each consisting of an edit and whether or not that edit is vandalism. A large number of such pairs is needed to train the network properly. The dataset I am currently testing with consists of Cluebot's current outputs. However, this scheme has a number of problems, primarily that the current Cluebot misses a substantial amount of vandalism and also produces a fair number of false positives. These errors introduce inaccuracies into the dataset, which can cause significant problems in the operation of the neural network. By design, Cluebot's false positive count is significantly lower than the amount of vandalism it misses, but because a neural network needs to be trained on both vandalism and non-vandalism, both types of error pollute the dataset significantly.

Because Cluebot's direct output appears to be unsuitable for reliably training the neural network, I'm reaching out to the Wikipedia community to see if anyone is willing to help generate this dataset. If anyone wants to help (and any help would be greatly appreciated), look at the dataset page. NOTE: This method of contributing to the dataset is now outdated. Please see the section below for more information.

New Dataset Contribution Interface

There have been several complaints about the old system for contributing to the dataset (copy/paste to the dataset pages), so we've designed a new interface that should make it much easier to help out. The new interface displays random revisions with a toolbar at the top for classifying them. To prevent vandals from feeding false data into the dataset, the system is password protected. If you'd like to help, register for a Cluenet account, join the Cluenet IRC Channel, and ask Cobi or Crispy for access to the interface. The interface is here.

Subversion Repository

There is now a Subversion repository hosting the latest versions of the new Cluebot code as well as the updated annutils code. Anyone wanting access to either of these should ask Crispy or Cobi.

Test Interface

There is now a test interface up so people can test the new Cluebot core! Note: This is running on a very preliminary version of the neural network, with a suboptimal dataset. The final result should perform much better, but this gives a general idea of the power behind the new core.

The test interface can be accessed here.

Word Categorization Code Update

Up until now, the Cluebot word categorization code matched words by splitting the text on punctuation and spaces, then matching each word against the list of categories. This approach fails to match words that are not delimited by punctuation or spaces (as often occurs in vandalism). Originally, searching for arbitrary strings in the full text of a revision was thought to be too time-consuming, but I've written some code that can do it very efficiently.
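The page does not say which algorithm the new search code uses. One well-known way to find many words in a text in a single pass is the Aho-Corasick automaton, sketched below as an assumption about how such a search could be done, not as a description of the actual Cluebot code. The function names and the lowercase folding are illustrative choices.

```python
from collections import deque


def build_automaton(words):
    """Build Aho-Corasick tables: transitions, failure links, outputs."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state for the longest matching suffix
    found = [[]]   # found[state] -> words that end at this state
    for word in words:
        state = 0
        for ch in word.lower():
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                found.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        found[state].append(word.lower())
    # A breadth-first pass fills in the failure links level by level.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit words that end at the suffix state as well.
            found[nxt] = found[nxt] + found[fail[nxt]]
    return goto, fail, found


def count_matches(text, automaton):
    """Count every occurrence of every word, delimited or not."""
    goto, fail, found = automaton
    counts = {}
    state = 0
    for ch in text.lower():
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in found[state]:
            counts[word] = counts.get(word, 0) + 1
    return counts
```

The text is scanned once regardless of how many words are in the list, which is what makes full-text searching affordable. For example, count_matches("thisisSPAMspamhere", build_automaton(["spam"])) finds both occurrences even though neither is delimited.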

Now, there will be two different word category lists. The first list will work as it does now, checking for discrete, delimited words. A second list (probably with only a few categories, and not very many words) will be checked against the full text. Words in this list should be chosen manually because they are specifically likely to occur in vandalism or non-vandalism regardless of context. For example, the word "fuck" should be included in the list, but "ass" should not (because it occurs inside words such as "assume").

From this point on, both lists of words and categories will be manually generated and maintained. The lists in use are: Main Words and Full Text Search Words. Each list is a set of "Word:Category" pairs, where each category is assigned a number starting from 1.
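A minimal sketch of reading such a list and counting hits per category for the delimited-word case follows. It assumes one "Word:Category" pair per line with the category written as its number; the real list layout may differ, and the function names and sample words are hypothetical.

```python
import re


def parse_word_list(text):
    """Parse 'Word:Category' lines into a {word: category number} dict."""
    categories = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue  # skip blank or malformed lines
        word, _, number = line.rpartition(":")
        categories[word.lower()] = int(number)
    return categories


def count_delimited(text, categories):
    """Count category hits among words split on punctuation and spaces."""
    totals = {}
    for token in re.split(r"[\W_]+", text.lower()):
        number = categories.get(token)
        if number is not None:
            totals[number] = totals.get(number, 0) + 1
    return totals


# Hypothetical sample list: category 1 = insults, category 2 = article terms.
SAMPLE_LIST = "idiot:1\nstupid:1\nreferences:2\n"
sample_categories = parse_word_list(SAMPLE_LIST)
```

With this sample, count_delimited("You are STUPID, stupid! See references.", sample_categories) counts two hits in category 1 and one in category 2, while "stupidstupid" yields no hits at all, which is the gap the full-text list is meant to cover.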