Common Crawl is a nonprofit organization that crawls and archives the web with the intent of making the data accessible to everyone. Its crawler respects nofollow and robots.txt policies.
Common Crawl makes available a web archive covering crawls from 2008 to 2013, consisting of hundreds of terabytes of data from several billion web pages. The crawl data is stored in Amazon Web Services' Public Data Sets S3 bucket and is freely downloadable. Common Crawl also publishes an open-source library for processing its data with Hadoop, as well as the source code of its crawler. In late 2013, the organization moved from its custom crawler to the Apache Software Foundation's Nutch crawler.
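Because the archive sits in a public S3 bucket, it can be browsed without AWS credentials. The sketch below, which assumes the boto3 library and the current public bucket name "commoncrawl" with a "crawl-data/" key prefix (both illustrative assumptions, not details from the text above), lists a few objects anonymously:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) S3 client: the crawl data is public,
# so no AWS credentials are required to read it.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a handful of objects under an assumed crawl-data prefix;
# the bucket name and prefix are assumptions for illustration.
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```

Each listed key can then be downloaded with the same client (for example via `s3.download_file`), which is what makes the archive practical to process at scale with tools such as Hadoop.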