User:Shirik/CollabRC

CollabRC is the prototype name for a cross-platform tool for monitoring recent changes, currently in development by Shirik. It is designed to take collaboration among vandalism fighters and recent changes patrollers to the next level. It builds on ideas from existing tools such as Twinkle, Huggle, and Igloo, among others, and adds its own collaboration technologies in an effort to optimize, distribute, and automate patrolling work.

CollabRC addresses the problems of the existing tools by promoting active, automated collaboration between patrollers. Through a network known as the collaboration cloud, CollabRC connects active patrollers with several AI bots, which work with the users of the system to help prioritize changes and issue alerts about high-risk changes. The bots learn from the decisions of CollabRC users and apply that knowledge to future decisions. The system still distributes load properly when the bots are absent; the bots are a supplement, not a requirement. This ensures that, should the bots fail for any reason, the system will continue to operate normally. There is no single point of failure in the collaboration cloud, so a failure of any one component should not affect the others.

Motivation

Many have observed that Wikipedia's motto, "the encyclopedia anyone can edit", has a corollary: "the encyclopedia anyone can vandalize". While this is true, Wikipedia continues to succeed because the overwhelming majority of the community wants to benefit it, not vandalize it, and so constructive edits significantly outweigh unconstructive ones. Over time, this has led to the creation of many powerful tools to assist in monitoring recent changes, new pages, and other activity on Wikipedia. These tools have significantly benefited the community, especially in the speed at which pages can be patrolled and reverted. However, one stone has been left relatively unturned: these tools do not try to distribute the load across all patrollers, so patrollers can "step on each other's toes" while monitoring pages. This tends to result in three major problems:

  1. A reduction in efficiency when two patrollers attempt to revert the same change
  2. Changes left unpatrolled because work is improperly distributed (a direct consequence of many patrollers monitoring the same change at the same time)
  3. A loss of potential knowledge that results from not recognizing vandalism patterns on Wikipedia

If these three issues can be addressed, the efficiency and quality of work done by recent changes and new pages patrollers will significantly improve, and, accordingly, so will the quality of the Wikipedia community as a whole. CollabRC attempts to address these issues using a combination of distributed computing, service-oriented architecture, and artificial intelligence.

The collaboration cloud

The collaboration cloud is a network of participants (some human, some bots) made up of two primary kinds of actor: clients and suppliers. Suppliers are primarily bots. They focus on monitoring recent changes and flagging potentially malicious edits for client review. Clients are primarily human, though User:CollabRCBot is a bot that acts as a client. They combine the recent changes feed with supplementary information provided by suppliers to build a list of changes that need to be reviewed. Clients decide among themselves which client(s) should review each edit and then, after review, report back to the suppliers whether or not the edit was determined to be vandalism.

Distributing workloads

CollabRC connects to either a well-known communications server (such as an IRC server) or a distributed-mode tracking system (similar to how BitTorrent operates) to join a group of users into a collaboration group. In addition, CollabRC connects each user to the Wikimedia recent changes IRC feed to obtain an efficient feed of all changes without adversely affecting the web server. When a new change is made, the active clients in the collaboration group decide amongst themselves which client(s) will patrol the change, based on:

  • Risk level of the change
  • Current pending workload of each client
  • Activity level of the collaboration group

Clients that have decided they will patrol the change will add it to a queue for the user. After it has been patrolled, the results are announced back to the collaboration group. If a response is not received in time, the collaboration group assumes that the patroller has been lost and will re-evaluate who should patrol the change. This ensures that the change is not lost. Furthermore, if a patroller has run out of changes to patrol, the system will ensure that the workload given to that user is increased.
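The selection and timeout behaviour described above can be sketched as follows. This is only an illustration under stated assumptions, not CollabRC's actual protocol: the `Client` class, the `reviewers_needed` rule (one to three reviewers depending on risk), and the hash-based tie-break are all hypothetical. The idea shown is that, if every client applies the same deterministic rule to the same inputs, the group can agree on who patrols a change without a central coordinator.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Client:
    name: str
    pending: int = 0  # changes currently waiting in this client's queue


def reviewers_needed(risk: float, group_size: int) -> int:
    """Hypothetical rule: riskier changes get more reviewers (1 to 3),
    capped by the number of active clients in the group."""
    wanted = 1 + int(risk * 2)  # risk in [0, 1] maps to 1..3
    return max(1, min(wanted, group_size))


def choose_patrollers(change_id: str, risk: float, clients: list[Client]) -> list[Client]:
    """Pick the least-loaded clients. Ties are broken by a hash of the
    change id and client name, so every peer computes the same answer."""
    def key(c: Client):
        digest = hashlib.sha256(f"{change_id}:{c.name}".encode()).hexdigest()
        return (c.pending, digest)

    chosen = sorted(clients, key=key)[: reviewers_needed(risk, len(clients))]
    for c in chosen:
        c.pending += 1  # the change joins each chosen client's queue
    return chosen


def reassign_on_timeout(change_id: str, risk: float,
                        clients: list[Client], lost: set[str]) -> list[Client]:
    """If a patroller never reports back, drop it and decide again."""
    remaining = [c for c in clients if c.name not in lost]
    return choose_patrollers(change_id, risk, remaining)
```

Because the least-loaded clients are chosen first, a patroller whose queue has emptied is naturally preferred for the next change, which matches the behaviour described for idle patrollers.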

Learning reversion patterns

As both community quality standards and vandal patterns change, the system must be able to adapt its understanding of high-risk threats. Accordingly, one of the collaborators in the collaboration group can be a bot which automatically analyzes recent changes. Using artificial intelligence, it will classify changes as either having a vandal-like nature or not and broadcast that likelihood to the collaboration group. If vandalism is likely, clients can adapt and show this to the user more quickly, ensuring that the issue is addressed as quickly as possible. Clients may also use this information in determining how many collaborators should inspect the change.

To accommodate learning, every time a user reviews a change, the user's decision on whether or not to revert is recorded. These results are fed back to the collaboration bot: the users' verdicts of vandalism or not vandalism are compared with the bot's decisions, and the AI is adapted to improve accuracy. This is similar in mechanism to how adaptive junk mail filters operate.
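To make the junk mail filter analogy concrete, here is a minimal sketch of a classifier that is updated one patroller verdict at a time. It uses a toy naive Bayes model over whitespace-separated tokens purely for illustration; this is a substitute technique chosen for brevity, not the design CollabRC actually uses, and the `FeedbackFilter` class and its method names are hypothetical.

```python
import math
from collections import Counter


class FeedbackFilter:
    """Toy naive Bayes text classifier trained from patroller verdicts."""

    def __init__(self):
        self.tokens = {True: Counter(), False: Counter()}  # token counts per class
        self.docs = {True: 0, False: 0}                    # verdicts seen per class

    def learn(self, text: str, is_vandalism: bool) -> None:
        """Record one human verdict for the given edit text."""
        self.tokens[is_vandalism].update(text.lower().split())
        self.docs[is_vandalism] += 1

    def score(self, text: str) -> float:
        """Return the estimated probability that the text is vandalism."""
        vocab = len(set(self.tokens[True]) | set(self.tokens[False])) or 1
        total_v = sum(self.tokens[True].values())
        total_ok = sum(self.tokens[False].values())
        # Start from the (smoothed) prior odds, then add per-token evidence.
        log_odds = math.log((self.docs[True] + 1) / (self.docs[False] + 1))
        for tok in text.lower().split():
            p_v = (self.tokens[True][tok] + 1) / (total_v + vocab)
            p_ok = (self.tokens[False][tok] + 1) / (total_ok + vocab)
            log_odds += math.log(p_v / p_ok)
        # Numerically stable logistic function.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1 + e)
```

A client would call `learn` after each review and a supplier would call `score` on incoming changes; an untrained filter returns 0.5, meaning it has no opinion yet.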

When the bot becomes extremely confident that a particular change is vandalism, it will revert it on its own, similar to how ClueBot operates. The penalty in the fitness function for a false "confident positive" is extremely high, so this should only occur in cases where the bot is very certain that the change contains vandal-like behavior.
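The effect of an asymmetric penalty can be illustrated with a small scoring function. The threshold and penalty values below, and the three-way `decide` rule, are assumptions made for this sketch; the source does not give the actual fitness function or its weights.

```python
AUTO_REVERT_THRESHOLD = 0.99   # assumed cutoff for acting without a human
FALSE_CONFIDENT_PENALTY = 100  # assumed cost of auto-reverting a good edit


def decide(score: float) -> str:
    """Map a vandalism likelihood to an action."""
    if score >= AUTO_REVERT_THRESHOLD:
        return "revert"  # bot acts on its own
    if score >= 0.5:
        return "flag"    # broadcast to clients for human review
    return "ignore"


def fitness(predictions) -> int:
    """Score a candidate classifier on (score, was_vandalism) pairs.

    Ordinary mistakes cost 1 point; a wrong automatic revert costs
    FALSE_CONFIDENT_PENALTY points, so candidates that are confidently
    wrong score far worse than candidates that merely flag too much.
    """
    total = 0
    for score, was_vandalism in predictions:
        action = decide(score)
        if action == "revert":
            total += 1 if was_vandalism else -FALSE_CONFIDENT_PENALTY
        elif action == "flag":
            total += 1 if was_vandalism else -1
        else:
            total += 1 if not was_vandalism else -1
    return total
```

With these weights, a single wrong automatic revert cancels out a hundred correct decisions, which pushes the learned behaviour toward flagging for human review whenever there is doubt.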

Development strategy

The system will remain open source and be version controlled using Mercurial. Assistance from others is certainly appreciated; anyone can pull from the repository. Other developers can contact Shirik to have their changes pulled, and those changes will be merged in after code review. This allows a high-quality, yet open, system to be developed.

The Bots

DASHBot

See also the list of commands for the bots

DASHBot-1
DASHBot-1 uses both the CRM114 spam engine and heuristics from this list to classify edits. The first part of the bot can be taught. Only users not in the Whitelist
DASHBot-2
This bot blacklists users whose edits are reverted by Huggle or Twinkle, provided they are not in the Whitelist, and flags their edits. Users are automatically removed from the blacklist after 6 hours. The bot also responds to manual commands, listed here.
DASHBot-3
This bot lists edits flagged by the edit filter, such as edits containing repeating characters.
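As an illustration of the DASHBot-2 behaviour described above (whitelist check, blacklisting on revert, removal after 6 hours), here is a minimal sketch of an expiring blacklist. The `ExpiringBlacklist` class, its method names, and the injectable `clock` are hypothetical, and the choice to restart the six-hour timer on a repeat revert is an assumption; the bot's real implementation is not described in the source.

```python
import time

BLACKLIST_TTL = 6 * 60 * 60  # six hours, in seconds


class ExpiringBlacklist:
    def __init__(self, whitelist=(), clock=time.time):
        self.whitelist = set(whitelist)
        self.clock = clock   # injectable so the expiry can be tested
        self.expires = {}    # username -> expiry timestamp

    def record_revert(self, user: str) -> bool:
        """Called when Huggle or Twinkle reverts an edit by `user`.
        Returns True if the user was blacklisted."""
        if user in self.whitelist:
            return False
        # Assumption: a repeat revert restarts the six-hour timer.
        self.expires[user] = self.clock() + BLACKLIST_TTL
        return True

    def is_blacklisted(self, user: str) -> bool:
        expiry = self.expires.get(user)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self.expires[user]  # lazily drop expired entries
            return False
        return True
```

Entries are removed lazily when they are next looked up, so no background timer is needed.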

ShirikBot

stuff here

CollabRCBot

CollabRCBot is intended to revert edits that the other bots are confident are vandalism.

Feedback

CollabRC is currently in its "motivation feedback" phase. Input on the following topics is strongly desired:

  • Additional desired features
  • Criticism of proposed features
  • Discussion regarding compatibility with existing systems
  • Any other system-level discussion

To give this feedback, open a new section on this page's talk page. Anyone can contribute there. All input is valued, though unfortunately not all features can be accepted immediately.

Plan

The initial plan is being actively developed during the feedback phase. It is drafted and maintained using Extreme Programming's planning game. Details of the plan can be found on CollabRC's planning subpage. The design for the artificial intelligence can be found on the Genetics subpage.