Wikipedia:Data mining Wikipedia

Wikipedia's open, crowdsourced content can be data mined.

Articles, pageviews, WikiProject assessments, infoboxes, categorization information and a variety of metadata (such as data on page edits) can be extracted and used for analysis, statistics and the generation of new insights.
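As an illustration of extracting pageview data, the following Python sketch builds a request for the Wikimedia pageviews REST API and sums the daily counts in a decoded response. The helper names (`pageviews_url`, `total_views`, `fetch_total_views`) and the User-Agent value are assumptions made for this example, and only the `items` and `views` fields of the response are relied on.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia.org"):
    """Build a per-article daily pageviews URL (dates as YYYYMMDD strings)."""
    title = quote(article.replace(" ", "_"), safe="")
    return f"{PAGEVIEWS_BASE}/{project}/all-access/user/{title}/daily/{start}/{end}"


def total_views(payload):
    """Sum the 'views' field over the 'items' list of a decoded response."""
    return sum(item["views"] for item in payload.get("items", []))


def fetch_total_views(article, start, end):
    """Download the data and return the total view count (needs network access)."""
    # The User-Agent below is a placeholder; replace it with real contact details.
    request = Request(
        pageviews_url(article, start, end),
        headers={"User-Agent": "example-miner/0.1 (contact: you@example.org)"},
    )
    with urlopen(request) as response:
        return total_views(json.load(response))
```

A call such as `fetch_total_views("Data mining", "20240101", "20240107")` would return the number of views by human users over that week, assuming the service is reachable.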

Natural language processing may be used to process article contents. This page is not about the use of data mining with the intent to improve Wikipedia.


Data mining involves six common classes of tasks:[1]

  • Anomaly detection (Outlier/change/deviation detection) – The identification of unusual data records that might be interesting, or of data errors that require further investigation.
  • Association rule learning (Dependency modelling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
  • Clustering – The task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
    • The Wikipedia Data Mining Project's goal is to discover the internal patterns in a Wikipedia data set and to explore various data mining algorithms. Clustering algorithms can group Wikipedia articles by similarity and organize thousands of data objects into a tree to help people view the content.[2]
  • Classification – The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
  • Regression – Attempts to find a function that models the data with the least error.
  • Summarization – Providing a more compact representation of the data set, including visualization and report generation.
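
The clustering task can be sketched in a few lines of Python. The example below is a minimal illustration, not the method of the Wikipedia Data Mining Project: it represents each article as a bag of words, compares articles by cosine similarity, and merges any two whose similarity reaches a threshold. The function names, the threshold value and the toy article texts are all assumptions, and the result is a flat grouping rather than the tree described above.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(articles, threshold=0.3):
    """Group titles whose texts are linked by similarity >= threshold (single link)."""
    titles = list(articles)
    vectors = {t: term_counts(articles[t]) for t in titles}
    group_of = {t: i for i, t in enumerate(titles)}  # each title starts alone
    for i, a in enumerate(titles):
        for b in titles[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                old, new = group_of[b], group_of[a]
                for t in titles:  # merge b's group into a's group
                    if group_of[t] == old:
                        group_of[t] = new
    groups = {}
    for t in titles:
        groups.setdefault(group_of[t], []).append(t)
    return sorted(groups.values())


# Toy stand-ins for article texts (invented for this example).
sample = {
    "Apple": "apple fruit tree orchard fruit",
    "Pear": "pear fruit tree orchard",
    "Python": "python programming language code",
    "Java": "java programming language code",
}
print(cluster(sample))
```

Raw term counts are used here for brevity; on real article text a weighting scheme such as TF-IDF and a hierarchical algorithm would likely be more appropriate.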


  • Kylin is a system that extracts information from Wikipedia, using infoboxes to automatically create training data for learning relation-specific extractors.[11]
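
The general idea of using infoboxes to create training data can be illustrated with the Python sketch below. It is a loose illustration under simplifying assumptions, not Kylin's actual implementation: it reads simple `| key = value` lines from infobox wikitext and labels every sentence that mentions a value verbatim with the corresponding attribute. The function names and the sample page are hypothetical, and nested templates or wiki links inside values are not handled.

```python
import re


def parse_infobox(wikitext):
    """Return {attribute: value} for simple '| key = value' infobox lines."""
    fields = {}
    for line in wikitext.splitlines():
        match = re.match(r"\s*\|\s*([^=|]+?)\s*=\s*(.+?)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def label_sentences(article_text, infobox):
    """Pair each attribute with the sentences that mention its value verbatim."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", article_text) if s.strip()]
    examples = []
    for attribute, value in infobox.items():
        for sentence in sentences:
            if value.lower() in sentence.lower():
                examples.append((attribute, sentence))
    return examples


# Hypothetical infobox and article text, invented for this example.
wikitext = """{{Infobox settlement
| name = Springfield
| population = 30,720
| country = United States
}}"""
text = "Springfield is a town in the United States. Its population is 30,720. It has a park."

for attribute, sentence in label_sentences(text, parse_infobox(wikitext)):
    print(attribute, "->", sentence)
```

The labeled sentences could then serve as positive examples when training an extractor for each attribute.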

Help and tools

  • Milne, David; Witten, Ian H. (1 January 2013). "An open-source toolkit for mining Wikipedia" (PDF). Artificial Intelligence. 194: 222–239. doi:10.1016/j.artint.2012.06.007.
  • Wikipedia's API
  • Mining Wikipedia For Awesome Data, presentation
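
As a starting point for working with Wikipedia's API, the Python sketch below builds a MediaWiki Action API query for the most recent revisions of a page and flattens the decoded JSON response into rows. The helper names and the User-Agent value are assumptions made for this example; the query parameters (`action=query`, `prop=revisions`, `rvprop`, `rvlimit`, `formatversion=2`) belong to the Action API.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def revisions_url(title, limit=5):
    """Build an Action API URL requesting recent revision metadata for one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_revisions(payload):
    """Flatten a formatversion=2 response into (title, timestamp, user) tuples."""
    rows = []
    for page in payload.get("query", {}).get("pages", []):
        for rev in page.get("revisions", []):
            # 'user' can be absent when the user name has been hidden.
            rows.append((page["title"], rev["timestamp"], rev.get("user")))
    return rows


def fetch_revisions(title, limit=5):
    """Download and flatten revision metadata (needs network access)."""
    # The User-Agent below is a placeholder; replace it with real contact details.
    request = Request(
        revisions_url(title, limit),
        headers={"User-Agent": "example-miner/0.1 (contact: you@example.org)"},
    )
    with urlopen(request) as response:
        return extract_revisions(json.load(response))
```

For large-scale mining, the database dumps are generally a better fit than repeated API requests.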

Legal considerations

Wikipedia and its sister projects (e.g. Wikimedia Commons, Wikisource), which are supported by the Wikimedia Foundation, are hosted on servers (see Wikimedia servers on Meta-Wiki) at a data center in the state of Virginia, with an emergency backup data center in the state of Texas; caching servers are located in the Netherlands and Singapore. The Wikimedia Foundation is a non-profit incorporated in Florida and based in California; the terms of service for all Wikimedia Foundation websites are governed by the laws of the state of California and U.S. federal law.

From within the U.S.

With one exception, data mining of information on Wikipedia performed from within the U.S. is unlikely to be unlawful or a tortious violation of others' rights. The information (text of pages, past revisions, IP addresses) is public, so mining is unlikely to run afoul of U.S. privacy laws, and, at least when mining on Wikipedia, the use is likely to be considered fair use of copyrighted material that does not infringe the rights of the copyright holders (generally, the people who add content to the website). Additionally, privacy laws in the U.S. typically do not protect information for which there is no reasonable expectation of privacy. All contributors, including those who edit from IP addresses without creating an account, agree to the terms of service and irrevocably release their contributions under the CC BY-SA 3.0 and GFDL licenses, and anonymous editors agree that their IP addresses will be recorded; contributors are therefore unlikely to be able to claim a reasonable expectation of privacy.

The issue of jurisdiction on the internet is not well settled in the courts, so data miners could be subject to the jurisdiction of the courts of California (based on the terms of service, especially in any dispute with the Wikimedia Foundation) or of the location(s) of the servers accessed for data mining. The exception is data mining performed from the U.S. on Wikimedia Foundation servers in the Netherlands or Singapore, in which case an injured party could claim protection under the laws of either country. Since the Wikimedia servers in the Netherlands and Singapore are used for caching, this issue can be avoided by mining only from Wikimedia servers in the U.S.

From outside the U.S.

Data mining performed from outside the U.S. may violate local law or the rights of others (which can result in costly lawsuits if discovered). The main consideration is privacy law, which should be taken into account whenever any type of personal information (such as user names and IP addresses) is collected when mining. In the European Union, the General Data Protection Regulation (GDPR) strictly regulates the manner in which personal data may be processed, defining 'personal data' as:

"any information relating to an identified or identifiable natural person; an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person." (Art. 4(1))

This definition appears to cover any type of data mining that profiles edits by IP address. However, the GDPR contains an important exception in Art. 85, which says:

(1) "Member States shall by law reconcile the right to the protection of personal data pursuant to this Regulation with the right to freedom of expression and information, including processing for journalistic purposes and the purposes of academic, artistic or literary expression."
(2) "For processing carried out for journalistic purposes or the purpose of academic artistic or literary expression, Member States shall provide for exemptions or derogations from Chapter II (principles), Chapter III (rights of the data subject), Chapter IV (controller and processor), Chapter V (transfer of personal data to third countries or international organisations), Chapter VI (independent supervisory authorities), Chapter VII (cooperation and consistency) and Chapter IX (specific data processing situations) if they are necessary to reconcile the right to the protection of personal data with the freedom of expression and information."
(3) "Each Member State shall notify to the Commission the provisions of its law which it has adopted pursuant to paragraph 2 and, without delay, any subsequent amendment law or amendment affecting them."

Article 85 of the GDPR requires Member States to enact laws on the subject. As of May 2018, when this section was written, no guide could be found to how Member States have enacted this provision into their national laws.

See also


  1. ^ Fayyad, Usama; Piatetsky-Shapiro, Gregory; Smyth, Padhraic (1996). "From Data Mining to Knowledge Discovery in Databases" (PDF). Retrieved 17 December 2008.
  2. ^ "Wikipedia Data Mining Project". Retrieved 28 January 2017.
  3. ^ "DeepQA Project: FAQ". IBM. Retrieved 11 February 2011.
  4. ^ Zimmer, Ben (17 February 2011). "Is It Time to Welcome Our New Computer Overlords?". The Atlantic. Retrieved 17 February 2011.
  5. ^ "Pantheon". Pantheon. Retrieved 29 January 2017.
  6. ^ Yu, Amy Zhao; Ronen, Shahar; Hu, Kevin; Lu, Tiffany; Hidalgo, César A. (5 January 2016). "Pantheon 1.0, a manually verified dataset of globally famous biographies". Scientific Data. 3: 150075. doi:10.1038/sdata.2015.75. Retrieved 29 January 2017.
  7. ^ Garner, Dwight (14 March 2014). "Who's More Famous Than Jesus?". The New York Times. Retrieved 29 January 2017.
  8. ^ Hidalgo, César A.; Almossawi, Ali. "The Data-Visualization Revolution". Scientific American. Retrieved 29 January 2017.
  9. ^ Lages, José; Patt, Antoine; Shepelyansky, Dima L. (1 March 2016). "Wikipedia Ranking of World Universities". The European Physical Journal B. 89 (3). doi:10.1140/epjb/e2016-60922-0. ISSN 1434-6028. Retrieved 28 January 2017.
  10. ^ "Wikipedia-Mining Algorithm Reveals World's Most Influential Universities". MIT Technology Review. Retrieved 28 January 2017.
  11. ^ Moens, Marie-Francine; Li, Juanzi; Chua, Tat-Seng. Mining User Generated Content. CRC Press. ISBN 9781466557406. Retrieved 28 January 2017.

External links

Further reading