Wikipedia by numbers
Wikipedia's coverage and conflicts quantified
Researchers at PARC, creators of WikiDashboard (see previous coverage), have announced a new study of English Wikipedia that quantifies its topical coverage. Based on the categories assigned to each article, coverage is sorted into the eleven broad categories used in Wikipedia's Categorical index: Culture and the arts; Geography and places; Health and fitness; History and events; Mathematics and logic; Natural and physical sciences; People and self; Philosophy and thinking; Religion and belief systems; Society and social sciences; and Technology and applied sciences. ("General reference" is excluded.) They report the following approximate breakdown in coverage:
- Culture and the arts: 30%
- People and self: 15%
- Geography and places: 14%
- Society and social sciences: 12%
- History and events: 11%
- Natural and physical sciences: 9%
- Technology and applied sciences: 4%
- Religion and belief systems: 2%
- Health and fitness: 2%
- Mathematics and logic: 1%
- Philosophy and thinking: 1%
The PARC researchers presented a short paper describing their work, "What’s in Wikipedia? Mapping Topics and Conflict Using Socially Annotated Category Structure", at the CHI2009 conference on Human-Computer Interaction, and have published a blog post summarizing the results.
To explain their method of sorting content, the researchers use the example of Albert Einstein. In the January 2008 data dump (the last available full dump of English Wikipedia), the Einstein article had 26 categories. Each category can be broken down according to its proportional relevance to the 11 top-level categories, based on the shortest paths through the category system to each of those top-level categories. So, for example, Einstein's category of "Jewish-American scientists" is most strongly associated with "People and self", but is also part of the "Religion and belief systems" and "Natural and physical sciences" topics. Combining the category weightings for all 26 categories creates a distribution of the Einstein article's proportional relevance to each of the top-level categories. As the paper notes, "Einstein’s topic distribution primarily falls under “People”; however, his roles as both a prominent scientist and social figure are reflected in associations with “Science”, “Society”, “History”, “Philosophy”, “Religion”, and “Culture”. His involvement with the Manhattan Project also leads to associations with “Technology”."
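The per-category weighting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: the category graph `PARENTS` is a made-up toy, the three-item `TOP_LEVEL` set stands in for the full eleven, and the rule that a category's weight is inversely proportional to its shortest-path length is an assumption, since the summary here says only that the weighting is "based on the shortest paths".

```python
from collections import deque

# Hypothetical toy category graph: each category maps to its parent categories.
PARENTS = {
    "Jewish-American scientists": ["American scientists", "Jewish scientists"],
    "American scientists": ["Scientists"],
    "Jewish scientists": ["Scientists", "Jews"],
    "Scientists": ["People and self", "Science occupations"],
    "Science occupations": ["Natural and physical sciences"],
    "Jews": ["Judaism"],
    "Judaism": ["Religion and belief systems"],
}
# Only three of the eleven top-level categories, to keep the toy small.
TOP_LEVEL = {
    "People and self",
    "Natural and physical sciences",
    "Religion and belief systems",
}


def shortest_distances(category):
    """Breadth-first search upward; return {top-level category: edge count}."""
    dist = {category: 0}
    queue = deque([category])
    found = {}
    while queue:
        node = queue.popleft()
        if node in TOP_LEVEL:
            found[node] = dist[node]
            continue  # do not climb past a top-level category
        for parent in PARENTS.get(node, []):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return found


def category_weights(category):
    """Assumed rule: weight is proportional to 1 / shortest-path length."""
    if category in TOP_LEVEL:
        return {category: 1.0}
    found = shortest_distances(category)
    if not found:
        return {}  # category is not connected to any top-level category
    raw = {top: 1.0 / d for top, d in found.items()}
    total = sum(raw.values())
    return {top: w / total for top, w in raw.items()}


def article_distribution(categories):
    """Combine the weights of all of an article's categories, then normalize."""
    totals = {}
    for cat in categories:
        for top, w in category_weights(cat).items():
            totals[top] = totals.get(top, 0.0) + w
    grand = sum(totals.values())
    return {top: w / grand for top, w in totals.items()} if grand else {}


if __name__ == "__main__":
    print(category_weights("Jewish-American scientists"))
    print(article_distribution(["Jewish-American scientists", "Scientists"]))
```

In this toy graph, "Jewish-American scientists" is three steps from "People and self" and four steps from each of the other two topics, so it receives weights of roughly 0.4, 0.3, and 0.3: strongest for "People and self" but shared with the other two, mirroring the example in the text.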
Aggregating this sort of distribution for all articles generates the overall 2008 topic distribution reported above. The researchers also compared the 2008 results to the results from a 2006 dump, in order to measure which topic areas were growing most rapidly. They found that "Natural and physical sciences" and "Culture and the arts" each grew by over 200% in that time, with strong growth also in Philosophy, Mathematics, and History. Surprisingly, they also found that coverage of "Technology and applied sciences" actually appeared to shrink by 6% between 2006 and 2008; this was undoubtedly caused by reorganizations of the category hierarchy rather than an actual net loss of technology content.
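The aggregation and the growth comparison are simple arithmetic, sketched below. The article distributions are invented toy data, and measuring growth on the summed per-topic totals (rather than on each topic's percentage share) is an assumption; the summary here does not say which the paper used.

```python
def aggregate(article_distributions):
    """Sum per-article topic distributions into per-topic totals."""
    totals = {}
    for dist in article_distributions:
        for topic, share in dist.items():
            totals[topic] = totals.get(topic, 0.0) + share
    return totals


def shares(totals):
    """Convert per-topic totals into fractions of overall coverage."""
    grand = sum(totals.values())
    return {topic: value / grand for topic, value in totals.items()}


def growth_percent(old_totals, new_totals):
    """Percentage change per topic between two snapshots."""
    return {
        topic: 100.0 * (new_totals.get(topic, 0.0) - old) / old
        for topic, old in old_totals.items()
        if old > 0
    }


# Invented toy snapshots, one dict per article.
CULTURE = "Culture and the arts"
TECH = "Technology and applied sciences"
dump_2006 = [{CULTURE: 1.0}, {TECH: 1.0}, {TECH: 1.0}]
dump_2008 = [
    {CULTURE: 1.0},
    {CULTURE: 1.0},
    {CULTURE: 1.0},
    {CULTURE: 0.5, TECH: 0.5},  # an article re-categorized between snapshots
    {TECH: 1.0},
]

totals_2006 = aggregate(dump_2006)
totals_2008 = aggregate(dump_2008)

if __name__ == "__main__":
    print(shares(totals_2008))
    print(growth_percent(totals_2006, totals_2008))
```

Note how the re-categorized article moves half of its weight out of the technology topic without any content being deleted, which is the kind of effect the article attributes the apparent shrinkage to.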
Using the topic distribution method in combination with an earlier method they had developed to measure the amount of conflict generated by a particular article (based on dispute tags and reversions), the researchers calculated the amount of conflict in each broad topic area (normalized for the average number of categories per article in different topic areas). They report in the blog summary that "'philosophy' and 'religion' have generated 28% of the conflicts each. This is despite the fact that they were only 1% and 2%" of coverage, respectively. [The post was later edited to 28% "contentious-ness" each, reflecting that the percentages are relative to the number of articles in each category.] As the paper notes, however, this is "normalized conflict"; "People" and "Society and social sciences" had the highest "absolute amount[s]" of conflict.
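The difference between absolute and normalized conflict can be made concrete with a small sketch. Everything here is hypothetical: the conflict scores are invented, and dividing each topic's conflict by its share of articles is a simplification that stands in for the paper's own normalization, which is not reproduced here.

```python
def conflict_by_topic(articles):
    """articles: list of (conflict_score, topic_distribution) pairs.

    Returns (absolute, normalized), where normalized divides each topic's
    conflict by the article weight assigned to that topic (an assumed rule).
    """
    absolute, mass = {}, {}
    for score, dist in articles:
        for topic, share in dist.items():
            absolute[topic] = absolute.get(topic, 0.0) + score * share
            mass[topic] = mass.get(topic, 0.0) + share
    normalized = {t: absolute[t] / mass[t] for t in absolute if mass[t] > 0}
    return absolute, normalized


# Invented toy data: many mildly disputed biographies, one heavily
# disputed philosophy article.
PEOPLE = "People and self"
PHILOSOPHY = "Philosophy and thinking"
toy_articles = [(2.0, {PEOPLE: 1.0}) for _ in range(10)]
toy_articles.append((8.0, {PHILOSOPHY: 1.0}))

absolute, normalized = conflict_by_topic(toy_articles)

if __name__ == "__main__":
    print("absolute:", absolute)
    print("normalized:", normalized)
```

In this toy data the large topic accumulates the most conflict in total, while the small topic is the most contentious per article, which is the same distinction the paper draws between "absolute" and "normalized" conflict.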