A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, edited jointly with the Wikimedia Research Committee and republished as the Wikimedia Research Newsletter.
Cross-language study of conflict on Wikipedia
Have you wondered about differences in the articles on Crimea in the Russian, Ukrainian, and English versions of Wikipedia? A newly published article entitled "Lost in Translation: Contexts, Computing, Disputing on Wikipedia" doesn't address Crimea, but nonetheless offers insight into the editing of contentious articles in multiple language editions through a heavy qualitative examination of Wikipedia articles about Kosovo in the Serbian, Croatian, and English editions.
The authors, Pasko Bilic and Luka Bulian from the University of Zagreb, found the main drivers of conflict and consensus were different group identities in relation to the topic (Kosovo) and to Wikipedia in general. Happily, the authors found the dominant identity among users in all three editions was the "encyclopedic identity," which closely mirrored the rules and policies of Wikipedia (e.g., NPOV) even if the users didn't cite such policies explicitly. (This echoes the result of a similar study regarding political identities of US editors, see previous coverage: "Being Wikipedian is more important than the political affiliation".) Other identities were based largely on language and territorial identity. These identities, however, did not sort cleanly into the different language editions: "language and territory [did] not produce coherent and homogeneous wiki communities in any of the language editions."
The English Wikipedia was seen by many users as providing greater visibility and thus "seem[ed] to offer a forum for both Pro-Serbian and Pro-Albanian viewpoints making it difficult to negotiate a middle path between all of the existing identities and viewpoints." The Arbitration Committee, present in the English edition but not in the Serbian or Croatian editions, may have helped prevent even greater conflict. Enforcement of its decisions seemed generally to lead to greater caution in the editing process.
Another paper by Bilic, published in New Media & Society, looks at the logic behind networked societies and the myth perpetuated by media institutions that there is a center of the social world (as opposed to distributed nodes). The paper goes on to investigate the social processes that contribute to the creation of “mediated centers”, by analyzing the talk pages of English Wikipedia’s In The News (ITN) section.
Undertaking an ethnographic content analysis of ITN talk pages from 2004–2012, Bilic found three issues that were disputed among Wikipedians in their efforts to construct a necessarily temporal section of the encyclopedia. First, editors differentiate between mass media and Wikipedia as a digital encyclopedia, but what constitutes the border between the two is often contested. Second, there was debate between inclusionists and deletionists regarding the criteria for stories making the ITN section. Third, conflict and discussion occurred regarding English Wikipedia’s relevance to a global audience.
The paper provides a good insight into how editors construct the ITN section and how it is positioned on the “thin line between mass media agenda and digital encyclopedia.” It would be interesting to see further research on the tensions between the Wikipedia policies mentioned in the paper (e.g. WP:NOTNEWS, NPOV) and mainstream media trends in light of other studies about Wikipedia’s approach to breaking news coverage.
User hierarchy map: Building Wikipedia's Org Chart
If you were to make an org chart of English Wikipedia, what would it look like? A recent study presented at the 2014 European Conference on Information Systems examines whether the organizational hierarchy of Wikipedia is as flat and egalitarian as previous research and popular media have claimed. The researchers point out that the degree to which Wikipedia’s actual governance model (and those of other peer production communities) reflects egalitarian principles has seldom been comprehensively examined. Furthermore, a growing body of research has shown that Wikipedia has become increasingly bureaucratic along many dimensions, often in response to new community needs. This suggests that Wikipedia has grown more hierarchical, and less flat, over time.
The researchers develop a taxonomy based on technical user rights and the quality assurance, coordination, and conflict resolution tasks commonly associated with those user rights. They use exploratory factor analysis, least squares analysis, and qualitative examination of the user right description pages to distill 19 user rights down to 8 social roles. They assemble these roles into a hierarchy according to their Scope, Granting, Access, and Promotion relationships. For example, in this hierarchy, editors in the Security Force role (checkusers and oversighters) have more power than administrators (sysops and bureaucrats) because being a sysop is an informal prerequisite for checkuser rights, and because oversighters can use the RevisionDelete extension in suppressor mode, blocking access to the content from administrators.
The paper does an excellent job of distilling the complex matrix of technologically mediated power relationships within and across Wikimedia wikis into a relatively simple organizational chart (presented on manuscript page 11). However, other mappings are certainly possible. For example, this analysis excludes the role of bots (and therefore, bot wranglers) within the role ecology. It also does not address the soft power that well-respected veteran community members may wield in some situations.
Extracting machine-readable data from Wiktionary: Yet another research group recognised Wiktionary as a source of "valuable lexical information" and explored conversion of its full content to a machine-readable format, LMF (Lexical Markup Framework). The UBY tools were used as a base, but the results have not been released, probably because the work is still in progress (only the English, French and German Wiktionaries are mentioned), and the authors seem unaware of DBpedia's Wiktionary RDF extraction. The authors find a big obstacle in seemingly innocuous context labels of the kind "archaic term": this diachronicity would force them to split such definitions into separate lexicons by age. By contrast, they believe it wouldn't be hard to map all the formats and tags used by the various Wiktionary editions and unify them, apparently, in a single lexicon. If delivered (and open-sourced), such a map could help the perennial discussion on how to unify Wiktionary data, recently revived by the Wikidata plans.
Wikipedia as a source of proper names in various languages: Another group managed to automatically extract proper names mentioned in articles of Wikipedias in 18 European languages, collating the different transliterations and attributing certain properties like "given name" and "family name" (similar to what Wikidata does, but without using interwiki links). As in the previous work, the conclusion is that LMF is suitable for storing such information, with an extension of the format. The impression is that LMF's viability is being tested in "real life" to refine said theoretical standard, an effort parallel to Wikidata's process of organic growth by trial and error.
"Wikipedia and Machine Translation: killing two birds with one stone". This is a case study of machine-aided translation from one language to another. The researchers had volunteers translate 100 short computer science articles from the Spanish to the Basque Wikipedia, totalling 50,000 words. They used a rule-based machine translation system called Matxin, which was adapted using a collection of Mozilla translations. Volunteers corrected the machine translation output using OmegaT.
Following a long-established Apertium practice, the human corrections were then used as the source for a tool that makes such corrections automatically. The authors claim a 10% increase in accuracy with this tool, but do not report the baseline or the corpus on which it was measured. Additionally, they translated wikilinks using Wikidata, they noted that markup complicated things, and they found that even a not very good machine translation output was still useful for volunteer translators.
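The wikilink step mentioned above can be pictured roughly as follows. This is only a minimal sketch, not the authors' implementation: it assumes a lookup table from Spanish to Basque article titles has already been built (for instance from Wikidata sitelinks), and the table entries, the `ES_TO_EU` name and the `translate_wikilinks` function are all hypothetical placeholders.

```python
import re

# Hypothetical lookup table: Spanish article title -> Basque article title.
# In practice it could be filled from Wikidata sitelinks (eswiki / euwiki);
# the entries below are invented placeholders, not real article titles.
ES_TO_EU = {
    "Título A": "Izenburua A",
    "Título B": "Izenburua B",
}

# Matches [[Target]] and [[Target|label]] (no nested brackets).
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def translate_wikilinks(text, table):
    """Replace link targets found in `table`; leave all other links as they are."""

    def swap(match):
        target = match.group(1).strip()
        label = match.group(2)
        new_target = table.get(target)
        if new_target is None:
            return match.group(0)  # no counterpart known: keep the original link
        if label is None:
            return f"[[{new_target}]]"
        # The visible label is kept as-is; translating it is a separate step.
        return f"[[{new_target}|{label}]]"

    return WIKILINK.sub(swap, text)


sample = "Ver [[Título A]] y [[Título B|otro]] y [[Sin par]]."
print(translate_wikilinks(sample, ES_TO_EU))
```

Links without a known counterpart are left untouched here, so that a human translator can still see and resolve them; whether the study handled missing targets this way is not stated in the paper summary.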
"Knowledge Construction in Wikipedia: A Systemic-Constructivist Analysis": In this study of knowledge construction on Wikipedia, the authors focus on the importance of the social system and social structure in influencing the actions of individuals (Wikipedia editors). They analyze the edit history of the German Wikipedia article on the Fukushima-Daiichi nuclear power plant, arguing that it is a case study of "a regularly occurring situation: the development of new knowledge in a large-scale social setting based on inconsistent information under uncertainty." The authors provide an interesting literature review of what they term a "systemic-constructivist" approach, then discuss the evolution of the Wikipedia article through about 1,200 edits, noting the importance of Wikipedia policies, which were often quoted by the editors. The authors also conducted a survey among the editors of the article to obtain additional information, and asked independent experts to review the article; this review concluded that the German Wikipedia article is of high quality. They note that the experts identified some errors, although unfortunately they do not provide details specific enough for the community to address them. They conclude that the Wikipedia editors were not experts in the field of nuclear power plants, yet were able to produce an article that earned favorable reviews from such experts; this, according to the authors, can be explained through the "systemic-constructivist" approach as validating the importance of the social system and structure of Wikipedia, which guided the amateur editors into producing an expert-level product.
Younger librarians more supportive of Wikipedia: A survey of information literacy librarians shows that they provide little Wikipedia instruction, with about 40% of respondents answering that their schools provide no instruction on Wikipedia, and 80% that they hold no dedicated workshops. Still, the remaining groups (the 60% who do provide some instruction, and the 20% who hold dedicated workshops) suggest that the picture is not so dire, and in fact point to an interesting opportunity for outreach by the Wikipedia Education Programs, which do not usually focus on libraries' instructional programs. Only 3% of respondents indicated that they have students actually edit Wikipedia, and one cited story, about "making edits to lower the quality of an article" and "getting a student blocked", raises the specter of similar incidents in the past (see e.g. previous Signpost coverage of a prominent case at George Mason University), as well as a question of ethics in education with regard to purposefully engaging in vandalism for educational purposes. Unsurprisingly, there was also a negative correlation between librarians' age and their views on Wikipedia. Although the overall majority of respondents were supportive of the idea that librarians need to educate students in digital literacy skills, they were nonetheless opposed to linking to Wikipedia from the pages of their institutions.
"Preparing and publishing Wikipedia articles are a good tool to train project management, teamwork and peer reviewed publishing processes in life sciences": This is the conclusion in the title of a recently published paper from the 2012 "Improving University Teaching" conference by two zoologists from the University of Innsbruck.
"Networked Grounded Theory" analysis of views on the use of Wikipedia in education: A paper describes how a Greek PhD thesis studied the use of Wikipedia in education using the network visualization software Gephi. Empirical data was gathered "from interviews and focus group discussions with students and teachers participating in Wikipedia assignments, from online blog posts expressing students’, instructors’, and Wikipedians’ reflections on the topic and from Wikipedia’s community discussion pages" and analyzed in a grounded theory approach (classifying text statements into codes such as "Need for Wiki Literate Professors", "Valuable Content Added", "You Are Not Listening & Respecting Us" or "Aggressive Community Editors"). Gephi was used to create a visualization of these codes (opinions) and to group them into "communities". Eventually, the author arrived at "Community Resistance, Organization of Intervention, Community Benefit, Educational Benefit, and Acculturation Stress [as] the conceptual blocks of theory for interpreting the utilization of a virtual community in education as an acculturation process."
"Risk factors and control of hospital acquired infections: a comparison between Wikipedia and scientific literature" is a paper published in 2013 which analysed Wikipedia content from November 2010. The authors looked at 15 articles pertaining to hospital-acquired infections (HAIs), of which 8 were B-class and the rest were lower. Some of the articles were, in this reviewer's opinion, only tangentially related, such as necktie. They examined how well Wikipedia's content in 2010 matched the National Institute of Clinical Excellence (NICE) topic on HAIs. NICE writes how-to guides for physicians, while Wikipedians are writing an encyclopedia. The conclusion that Wikipedia is not a good "how-to guide" regarding HAIs was thus not surprising (as one editor observed in a discussion about the paper at WikiProject Medicine: "We are criticised for (somewhere) mentioning or recommending signs reminding about hand-washing routines, ... and for not giving all sorts of detailed guidelines about procedures for the use of catheters and the like by medical staff"). Still, a number of specific errors were also found. Most had already been fixed, and this reviewer has corrected the last few.
How a country's broadband connectivity and Wikipedia coverage are related: In 2011, the Oxford Internet Institute began a project to study the online representation of the Arab world via Wikipedia. The first peer-reviewed paper from this research became available in preprint form at the beginning of 2014. As previously observed by these and other researchers, the density of geotagged Wikipedia articles is highly uneven, and a part of the paper studies its relationship to a country's population, to the number of broadband internet connections in a geographic area, and to Wikipedia's country-level usage statistics over time. Among other things, the authors find that "over three quarters of the variation in geotagged articles was explained by the population of the country, the number of fixed broadband connections and the number of edits emanating from that country." Curiously, the relationship between internet connectivity and Wikipedia coverage was not linear: "those countries with the least and most broadband have more articles than expected, whereas those countries in the middle of the distribution have fewer articles than expected."
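To make the phrase "variation explained" concrete, the following is a minimal sketch of how a share of explained variance (R²) is obtained from a least-squares fit with the three predictors named in the quote. It is not the authors' model: the log transformation, the function names and, above all, the country figures are assumptions invented purely for illustration.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols_r_squared(rows, y):
    """Fit y ~ intercept + predictors via the normal equations; return (coefficients, R^2)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


# Invented toy figures (NOT from the paper):
# (population, fixed broadband connections, edits, geotagged articles)
toy_countries = [
    (1e6, 5e4, 2e3, 300),
    (5e6, 1e5, 1e4, 900),
    (2e7, 3e6, 2e5, 12000),
    (8e7, 2e7, 3e6, 90000),
    (3e7, 2e5, 8e3, 1500),
    (1e7, 2e6, 9e4, 7000),
    (6e7, 1.5e7, 1e6, 40000),
]

# Assumption: all quantities are log-transformed before fitting.
predictors = [tuple(math.log(v) for v in row[:3]) for row in toy_countries]
articles = [math.log(row[3]) for row in toy_countries]

coefficients, r_squared = ols_r_squared(predictors, articles)
print("share of variance explained:", round(r_squared, 3))
```

An R² above 0.75 would correspond to "over three quarters of the variation" being explained. The non-linear pattern the authors describe would show up in the residuals (actual minus fitted values) being positive at both ends of the broadband distribution and negative in the middle.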
Global Atlas: Proper Nouns, From Wikipedia to LMF. Chapter written by Gil FRANCOPOULO, Frédéric MARCOUL, David CAUSSE and Grégory PIPARO.
Alegria I., Cabezon U., Fernandez de Betoño U., Labaka G., Mayor A., Sarasola K. and Zubiaga A.: Wikipedia and Machine Translation: killing two birds with one stone. Workshop on 'Free/open-source language resources for the machine translation of less-resourced languages' at LREC 2014. https://ixa.si.ehu.es/Ixa/Argitalpenak/Artikuluak/1395737124