A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, edited jointly with the Wikimedia Research Committee and republished as the Wikimedia Research Newsletter.
"Privacy, anonymity, and perceived risk in open collaboration: a study of Tor users and Wikipedians"
This qualitative study, based on interviews with privacy-conscious Wikipedia editors and users of the Tor anonymization software, is an informative examination of the privacy issues that are particular to the work on the radically transparent online encyclopedia. It also tries (largely unsuccessfully, in this reviewer's opinion) to make the case that Wikipedia should relax its restrictions on editing via Tor.
The three authors from Drexel University carried out in-depth, semi-structured interviews with two groups:
12 "Tor users who have also contributed to online projects" (recruited e.g. via the mailing list of the Tor project)
11 "Wikipedia editors who have considered their privacy while editing", including some administrators and Wikimedia Foundation employees
In both groups, the majority (8 in each case) was male.
The goal was "to examine the threats that people perceive when contributing to open collaboration projects and how they maintain their safety and privacy". Interview responses were examined using thematic analysis to identify the most important concepts.
In the first part of their findings, the authors group the types of threats described by the participants into five areas:
"Surveillance/Loss of privacy", i.e. the general "fear that their online communication or activities may be accessed or logged by parties without their knowledge or consent". This more abstract concern was more prevalent in the Tor users group than among the privacy-conscious Wikipedians interviewed for the study.
"Loss of employment/opportunity", such as a potential employer deciding against a job candidate because of negative information gleaned from the candidate's online activity, or a transphobic boss learning that an employee is a transgender person.
"Safety of self/loved ones". Some of the Wikipedians reported "threats of rape, physical assault, and death as reprisals for their contributions to the project".
"Harassment/Intimidation", which was brought up far more often by the Wikipedians (8) than by the Tor users (1). In particular, "Editors who took central positions like administrator or arbitration committee member found that additional authority and responsibility brought with it publicity and vulnerability", including rape and death threats in the case of one female administrator.
"Reputation loss" in general. One participant related that Wikipedians often edit anonymously (i.e. identified only by their IP address) because "they don't want someone to go on a vendetta against them and what's a volunteer hobby for them suddenly turns into something that affects their professional career".
The researchers seem to have struggled a bit to clearly delineate these five threat areas. For instance, there is considerable overlap between the intimidation and safety concerns, and, as the authors point out themselves, "the potential for contributions to controversial topics to be misinterpreted and result in lost opportunities" (the second area) is also related to the more general concern about reputation loss. Nevertheless, for those interested in the privacy threats editors associate with contributing to Wikipedia, this is a very worthwhile read. A thematically related document is the Wikimedia Foundation's 2015 harassment survey (Signpost summary), unfortunately not mentioned in the paper. The WMF survey, while also not designed to be fully representative, covered some of the same ground with vastly more respondents (3,845) than the 23 interviewees in the present study.
Turning to the strategies that the interviewees employ to mitigate these perceived risks, the study identifies "two broad overlapping categories of activities: modifying participation in projects and enacting anonymity." Modifying participation can include refraining from editing certain topics. Under "enacting anonymity", the researchers subsume both "operational approaches that limit others' ability to connect activities with participants' real identities (e.g. maintaining multiple accounts [also known as sockpuppets on Wikipedia])", and technical means such as Tor (for "participating anonymously on the Internet" in general). It is in this section that the paper becomes a bit muddled about the distinction between privacy threats on the internet in general and on Wikipedia in particular. This is particularly unfortunate as the paper seems to have been at least partly motivated by the longstanding discussions about the restrictions on editing Wikipedia via Tor (demands from the Tor community to lift these go back at least a decade), with the authors making the case that Wikipedia is incurring a significant loss of contributions because of these restrictions. There is no doubt that the public edit histories can reveal a lot about a Wikipedian's interests etc. (Or as this reviewer concluded in
a 2008 Wikimania talk that presented several real-life examples of conclusions that can be drawn from a Wikipedia user's editing patterns: "Wikipedia contributors don't just give their time to the project, but pay with their privacy, too.") But the obfuscation of IP addresses that Tor provides is largely irrelevant for this, because editors' IP addresses are not made public anyway, unless they choose to edit while logged out. In an early presentation about the study at the 2015 Chaos Communication Congress (32C3) (slide 33), the authors themselves alluded to this:
"According to Wikipedians most deanonymization is done based on contextual cues. Tor won’t help with this"
But this kind of caveat is missing from the present paper.
(Interestingly though, Wikipedians in the study reported using Tor-like tools outside of Wikipedia, to avoid "being targeted by groups with a history of harassing Wikipedians": "when I'm reading Wikipediocracy or one of the Wikipedia criticism sites, because I know that they scoop up IP addresses, I use an IP obfuscator for that.")
Other recent publications that could not be covered in time for this issue include the items listed below. Contributions are always welcome for reviewing or summarizing newly published research.
"A method for predicting Wikipedia editors' editing interest: based on a factor graph model" From the abstract: "Recruiting or recommending appropriate potential Wikipedia editors to edit a specific Wikipedia entry (or article) can play an important role in improving the quality and credibility of Wikipedia. ... this paper proposes an interest prediction factor graph (IPFG) model, which is characterized by editor's social properties, hyperlinks between Wikipedia entries, the categories of an entry and other important features, to predict an editor's editing interest in types of Wikipedia entries. ... An experiment on a Wikipedia dataset (with different frequencies of data collection) shows that the average prediction accuracy (F1 score) of the IPFG model for data collected quarterly could be up to 0.875, which is about 0.49 higher than that of a collaborative filtering approach."
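For readers less familiar with the F1 score reported in this abstract: it is the harmonic mean of precision and recall, commonly used to evaluate classifiers. A minimal illustrative sketch in Python (the input values here are made up for illustration, not taken from the paper):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with hypothetical values (not the paper's data):
print(round(f1_score(0.9, 0.85), 3))
```

A model whose precision and recall are both high will have an F1 near 1; the harmonic mean penalizes a large gap between the two, which is why it is preferred over a simple average for imbalanced prediction tasks like this one.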
"Stationarity of the inter-event power-law distributions" From the abstract: "We show that even though the probability to start [Wikipedia] editing is conditioned by the circadian 24 hour cycle, the conditional probability for the time interval between successive edits at a given time of the day is independent from the latter. We confirm our findings with the activity of posting on the social network Twitter. Our result suggests there is an intrinsic humankind scheduling pattern: after overcoming the encumbrance to start an activity, there is a robust distribution of new related actions, which does not depend on the time of day."
"Controversy detection in Wikipedia using collective classification" From the abstract: "We hypothesize that intensities of controversy among related pages are not independent; thus, we propose a stacked model which exploits the dependencies among related pages. Our approach improves classification of controversial web pages when compared to a model that examines each page in isolation, demonstrating that controversial topics exhibit homophily."
"WIKIREADING: a novel large scale language understanding task over Wikipedia" From the abstract: "We present WIKIREADING, a large-scale natural language understanding task and publicly-available dataset with 18 million instances. The task is to predict textual values from the structured knowledge base Wikidata by reading the text of the corresponding Wikipedia articles. ... We compare various state-of-the-art DNN [deep neural networks]-based architectures for document classification, information extraction, and question answering."
"A contingency view of transferring and adapting best practices within online communities" From the abstract: "Empirical research on the transfer of a quality-improvement practice between projects within Wikipedia shows that modifications are more helpful if they are introduced after the receiving project has had experience with the imported practice. Furthermore, modifications are more effective if they are introduced by members who have experience in a variety of other projects." From the paper: "We collected the history of CotW [Collaboration of the Week] in 146 Wikiprojects and measured how different types of modifications influenced their success, in terms of the length of time the CotW continued to be used in a project, the amount of work they elicited from project members and the number of unique editors who contributed to them."
"Centrality and content creation in networks – the case of economic topics on German Wikipedia" From the abstract: "We analyze the role of local and global network positions for content contributions to articles belonging to the category “Economy” on the German Wikipedia. Observing a sample of 7635 articles over a period of 153 weeks we measure their centrality both within this category and in the network of over one million Wikipedia articles. Our analysis reveals that an additional link from the observed category is associated with around 140 bytes of additional content and with an increase in the number of authors by 0.5. The relation of links from outside the category to content creation is much weaker. ... We find non-neoclassical themes to be highly prevalent among the top articles."
"A platform for visually exploring the development of Wikipedia articles" From the abstract: "... associated to each [Wikipedia] article are the edit history and talk pages, which together entail its full evolution. These spaces can typically reach thousands of contributions, and it is not trivial to make sense of them by manual inspection. This issue also affects Wikipedians, especially the less experienced ones, and constitutes a barrier for new editor engagement and retention. To address these limitations, Contropedia offers its users unprecedented access to the development of an article, using wiki links as focal points." (Also see previous coverage: "'Contropedia' tool identifies controversial issues within articles", http://www.contropedia.net/, and the following paper:)
"Platform affordances and data practices: The value of dispute on Wikipedia" From the abstract: "... we study how the affordances of Wikipedia are deployed in the production of encyclopedic knowledge and how this can be used to study controversies. The analysis shows how Wikipedia affords unstable encyclopedic knowledge by having mechanisms in place that suggest the continuous (re)negotiation of existing knowledge. We furthermore showcase the use of our open-source software, Contropedia, which can be utilized to study knowledge production on Wikipedia."