Common Crawl

From Wikipedia, the free encyclopedia
Common Crawl
Type: 501(c)(3) non-profit
Founded: 2007
Headquarters: San Francisco, California, USA; Los Angeles, California, USA
Founder: Gil Elbaz
Key people: Peter Norvig, Nova Spivack, Carl Malamud, Kurt Bollacker, Joi Ito
Website: commoncrawl.org
Available in: English

Common Crawl is a non-profit organization that crawls and archives the web with the goal of making that data freely accessible to everyone.[1] Its crawler respects nofollow and robots.txt policies.[2]

Common Crawl's web archive covers crawls from 2008 to 2013 and consists of hundreds of terabytes of data from several billion web pages.[3] The crawl data is stored in Amazon's Public Data Sets S3 bucket and is freely downloadable.[4][5] Common Crawl also publishes its crawler and an open-source library for processing the data with Hadoop. In late 2013, Common Crawl moved from a custom crawler to the Apache Software Foundation's Nutch crawler.[6]
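Because the bucket is publicly readable, the archive can be listed and downloaded with any standard S3 client. The following minimal Python sketch assumes anonymous access to the public commoncrawl bucket via the boto3 library; the "crawl-data/" prefix is an assumed layout, and prefixes for older (2008-2013) crawls may differ:

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous (unsigned) client: no AWS credentials are needed to read
    # the public commoncrawl bucket.
    s3 = boto3.client("s3", region_name="us-east-1",
                      config=Config(signature_version=UNSIGNED))

    # List top-level crawl prefixes. "crawl-data/" is an assumed layout;
    # individual crawls appear as sub-prefixes under it.
    resp = s3.list_objects_v2(Bucket="commoncrawl",
                              Prefix="crawl-data/",
                              Delimiter="/")
    for prefix in resp.get("CommonPrefixes", []):
        print(prefix["Prefix"])

    # A single archive file could then be fetched to disk; the key below is
    # purely illustrative, not a real object name.
    # s3.download_file("commoncrawl",
    #                  "crawl-data/<crawl-id>/<segment>/example.warc.gz",
    #                  "example.warc.gz")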

References
