User:Steven (WMF)/Diff Categorizer

From Wikipedia, the free encyclopedia
An example set of categories for User Talk diffs in the Diff Categorizer

The Diff Categorizer is a MediaWiki gadget that lets registered users categorize the content and character of edits to Wikipedia.

Why categorize diffs?

Much of the data and meaning that could be derived from the huge number of edits to Wikipedia can only be extracted by humans, and specifically by humans with experience of how Wikipedia works. Categorizing diffs can serve several purposes, including:

  • Providing another layer of community review to important sets of edits
  • Creating expanded training sets for bots, which helps improve their accuracy and overall usefulness
  • Coding diffs to give those with less Wikipedia experience insight into what community members think

The set of categories (i.e. the codebook for the tool) can be customized in the database for a variety of purposes. While only one set of categories can be used at a time on a wiki, these categories and their values can be changed with very little effort in order to suit the kinds of diffs one might want to know about (e.g. Talk page edits versus Template namespace edits). The examples given in the screenshots provided came from previous studies by the Wikimedia Foundation.
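
As a rough illustration of what a codebook could look like as data, the sketch below keeps one set of categories per kind of diff and marks exactly one of them as active. The category names, their values, and the `CODEBOOKS` and `ACTIVE` names are hypothetical; the tool's actual database schema is not described on this page.

```python
# Hypothetical codebooks. Category names and values are illustrative only
# and are not taken from the tool's real database.
CODEBOOKS = {
    "User talk": {
        "tone": ["positive", "neutral", "negative"],
        "message_type": ["welcome", "warning", "question", "other"],
    },
    "Template": {
        "change_type": ["content", "formatting", "documentation"],
    },
}

# Only one set of categories is in use on a wiki at any time.
ACTIVE = "User talk"


def active_codebook():
    """Return the categories (and allowed values) currently in use."""
    return CODEBOOKS[ACTIVE]


for category, values in active_codebook().items():
    print(category, "->", ", ".join(values))
```

Switching to a different kind of diff would then amount to changing `ACTIVE`, which mirrors the "very little effort" described above.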

How to install

Once it is approved, the Diff Categorizer will be installed like any other gadget:

  1. Log in to Wikipedia. If you don't have an account, you'll have to create one.
  2. Go to your preferences page.
  3. Click the Gadgets tab.
  4. Find "Diff Categorizer" under the "Editing gadgets" heading and check the box beside it.
  5. Click Save at the bottom of the page.
  6. Note that you may need to refresh your cache for any new settings to take effect.

How to use it

A screenshot of the Diff Categorizer on a test wiki.

Once you have the gadget installed, it will only appear when you elect to start categorizing diffs included in the sample (see screenshot). It will not appear on any random diff you view. To start categorizing, just click on the link below:

Important things to note

  1. You can dismiss the Diff Categorizer at any time by clicking the X in the top right.
  2. You can resize the gadget and move it around as necessary. Simply click and drag the gadget to move it. To resize it, click and drag the bottom right corner.
  3. While categorizing a set of diffs, the gadget should remember the size and position of the categorization window on your screen.
  4. When categorizing diffs, you must submit answers for all the categories present in order to save.
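
The last rule above (every category must be answered before saving) can be sketched as a simple completeness check. This is only an illustration: the function names and the example codebook are hypothetical, and the gadget's real validation code is not shown on this page.

```python
def missing_answers(codebook, answers):
    """Return the categories that do not yet have a valid answer."""
    return [
        name
        for name, allowed in codebook.items()
        if answers.get(name) not in allowed
    ]


def can_save(codebook, answers):
    """A categorization can be saved only when nothing is missing."""
    return not missing_answers(codebook, answers)


# Hypothetical codebook and answers, for illustration only.
codebook = {
    "tone": ["positive", "neutral", "negative"],
    "message_type": ["welcome", "warning", "question", "other"],
}
partial = {"tone": "neutral"}
complete = {"tone": "neutral", "message_type": "welcome"}

print(missing_answers(codebook, partial))
print(can_save(codebook, complete))
```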

Data and privacy

The aggregate output produced at the end of a categorization set will be available under the same free licensing as regular Wikipedia content, minus information identifying the categorizers. All data from categorizations is stored on the Toolserver.

The user names and IP addresses of respondents are collected during categorization so that the data can be analysed within the scope of a diff categorization project (e.g. to see how many ratings each individual made, and to track the level of agreement between categorizers). User names and IP addresses will not be made publicly available, will not be used for any other purpose, and will not be transferred to any third party, in accordance with the privacy policy.
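
To make the distinction between the internal analysis and the public output concrete, here is a minimal sketch. It assumes each rating is stored as a (user, diff, category, value) record and uses simple pairwise percent agreement; the record layout, the sample data, and the choice of agreement measure are all assumptions, not a description of how the tool actually analyses its data.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical records: (user name, diff ID, category, chosen value).
ratings = [
    ("Alice", 101, "tone", "neutral"),
    ("Bob", 101, "tone", "neutral"),
    ("Carol", 101, "tone", "negative"),
    ("Alice", 102, "tone", "positive"),
    ("Bob", 102, "tone", "positive"),
]


def ratings_per_user(records):
    """Internal use only: how many ratings each individual made."""
    return Counter(user for user, _, _, _ in records)


def pairwise_agreement(records):
    """Share of categorizer pairs that chose the same value for the
    same diff and category (None if no diff has two or more ratings)."""
    by_item = defaultdict(list)
    for _, diff_id, category, value in records:
        by_item[(diff_id, category)].append(value)
    agree = total = 0
    for values in by_item.values():
        for a, b in combinations(values, 2):
            total += 1
            agree += a == b
    return agree / total if total else None


def public_aggregate(records):
    """Publishable output: counts per (diff, category, value),
    with the identity of the categorizers dropped."""
    return Counter((d, c, v) for _, d, c, v in records)


print(ratings_per_user(ratings))
print(pairwise_agreement(ratings))
print(public_aggregate(ratings))
```

Only the result of `public_aggregate` corresponds to the freely licensed output described above; the per-user counts depend on identifying information and would stay private.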

Special instructions for administrators

The admin panel of the tool, available only to sysops who have installed the gadget

The Diff Categorizer can be administered via its admin panel, available as a gadget to users with administrative privileges. When the gadget is activated, a link to open the panel appears in the navigation toolbox in the left-hand sidebar.

From the admin panel it is possible to configure the currently active selection of diffs that will be presented to the respondents. (Use of the tool requires edit rights on the MediaWiki namespace.) The following actions are available:

  • Load diffs from RecentChanges and generate a random subset that will become the current selection
  • Reconfigure which namespaces the edits should be selected from (which is useful if, for example, you want to generate a dataset only of categorized diffs of Talk or User Talk)
  • Reconfigure the number of diffs that should be loaded from RecentChanges
  • Reconfigure the selection size (how many edits will be selected from the loaded ones)
  • Reconfigure the sample size, which is the number of diffs that will be presented to each respondent
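
The relationship between the three numbers above (diffs loaded, selection size, sample size) can be sketched as two rounds of random sampling. This is an illustration under assumptions: the function names and the record layout are hypothetical, the recent changes are simulated rather than fetched from a wiki, and whether the tool filters by namespace before or after applying the load limit is not stated on this page.

```python
import random


def build_selection(recent_changes, namespaces, load_limit, selection_size, rng):
    """Load up to load_limit diffs from the chosen namespaces, then pick a
    random subset of them to become the current selection."""
    loaded = [rc for rc in recent_changes if rc["ns"] in namespaces][:load_limit]
    return rng.sample(loaded, min(selection_size, len(loaded)))


def sample_for_respondent(selection, sample_size, rng):
    """Pick the diffs that one respondent will be asked to categorize."""
    return rng.sample(selection, min(sample_size, len(selection)))


# Simulated recent changes; in MediaWiki, namespace 3 is "User talk".
recent_changes = [{"revid": i, "ns": i % 4} for i in range(1000)]

rng = random.Random(0)  # seeded only to make the example repeatable
selection = build_selection(
    recent_changes, namespaces={3}, load_limit=200, selection_size=50, rng=rng
)
sample = sample_for_respondent(selection, sample_size=10, rng=rng)

print(len(selection), len(sample))
```

With these settings, 200 User talk diffs are loaded, 50 of them form the current selection, and each respondent sees 10 of those 50.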

Get help and report problems

This gadget is still in active development, so reporting bugs and other problems is a big help! Please feel free to use the talk page here to report issues or ask questions.

Team and credits

The idea for the Diff Categorizer emerged from the research work of the Community Department at the Wikimedia Foundation. Through its own qualitative work on sets of diffs, the three-month summer research project identified a clear need to make better use of the knowledge that only Wikipedians have, by letting them categorize the content of diffs on-wiki. Post-hoc human categorization of edits has the potential not only to enrich research, but also to add a secondary layer of review to the editing process.

The coding and design of the gadget were completed by Anna and Andreas, with funding from the Wikimedia Foundation, and released under the GPL.