Common Crawl
| Type of business | 501(c)(3) non-profit |
|---|---|
| Available in | English |
| Headquarters | San Francisco, California, USA; Los Angeles, California, USA |
| Founder(s) | Gil Elbaz |
| Key people | Peter Norvig, Nova Spivack, Carl Malamud, Kurt Bollacker, Joi Ito |
| Website | commoncrawl.org |
Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.[1][2] Common Crawl's web archive consists of petabytes of data collected since 2011.[3] It generally completes a crawl every month.[4]
Common Crawl was founded by Gil Elbaz.[5] Advisors to the non-profit include Peter Norvig and Joi Ito.[6] The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available.
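As a rough illustration of what respecting robots.txt and nofollow involves, the sketch below uses only the Python standard library. It is not Common Crawl's crawler code: the robots.txt content, the HTML snippet, the URLs, and the helper names `is_allowed` and `FollowableLinks` are all invented for this example, and the user-agent token `CCBot` is used purely for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class FollowableLinks(HTMLParser):
    """Collect href values of <a> tags that are not marked rel="nofollow"."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if href and "nofollow" not in rel_tokens:
            self.links.append(href)


# Hypothetical page fragment, invented for this example.
HTML = '<a href="/a">A</a> <a rel="nofollow" href="/b">B</a>'

collector = FollowableLinks()
collector.feed(HTML)
followable = collector.links
```

A polite crawler would call something like `is_allowed` before fetching each URL and would queue only the links gathered in `followable`. How Common Crawl's actual crawler implements these checks is not described in the sources cited here.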
History
Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012.[7]
The organization began releasing metadata files and the text output of the crawlers alongside .arc files in July of that year.[8] Common Crawl's archives had only included .arc files previously.[8]
In December 2012, blekko donated to Common Crawl the search engine metadata it had gathered from crawls conducted between February and October 2012.[9] The donated data helped Common Crawl "improve its crawl while avoiding spam, porn and the influence of excessive SEO."[9]
In 2013, Common Crawl began using the Apache Software Foundation's Nutch web crawler instead of a custom crawler.[10] Common Crawl switched from using .arc files to .warc files with its November 2013 crawl.[11]
History of Common Crawl data
The following data have been collected from the official Common Crawl Blog.
| Crawl Date | Availability date | Size in TB | Billions of pages | Comments |
|---|---|---|---|---|
| May 2018 | June 2018 | 215 | 2.75 | |
| April 2018 | May 2018 | 230 | 3.1 | |
| March 2018 | March 2018 | 250 | 3.2 | |
| February 2018 | March 2018 | 270 | 3.4 | |
| January 2018 | January 2018 | 270 | 3.4 | |
| December 2017 | December 2017 | 240 | 2.9 | |
| November 2017 | November 2017 | 260 | 3.2 | |
| October 2017 | October 2017 | 300 | 3.65 | |
| September 2017 | September 2017 | 250 | 3.01 | |
| August 2017 | August 2017 | 280 | 3.28 | |
| July 2017 | July 2017 | 240 | 2.89 | |
| June 2017 | July 2017 | 260 | 3.16 | |
| May 2017 | June 2017 | 250 | 2.96 | |
| April 2017 | May 2017 | 250 | 2.94 | |
| March 2017 | April 2017 | 250 | 3.07 | |
| February 2017 | March 2017 | 250 | 3.08 | |
| January 2017 | February 2017 | 250 | 3.14 | |
| December 2016 | December 2016 | - | 2.85 | |
| October 2016 | November 2016 | - | 3.25 | |
| September 2016 | October 2016 | - | 1.72 | |
| August 2016 | September 2016 | - | 1.61 | |
| July 2016 | August 2016 | - | 1.73 | |
| June 2016 | July 2016 | - | 1.23 | |
| May 2016 | June 2016 | - | 1.46 | |
| April 2016 | May 2016 | - | 1.33 | |
| February 2016 | February 2016 | - | 1.73 | |
| November 2015 | December 2015 | 151 | 1.82 | |
| September 2015 | November 2015 | 106 | 1.32 | |
| August 2015 | October 2015 | 149 | 1.84 | |
| July 2015 | August 2015 | 145 | 1.81 | |
| June 2015 | July 2015 | 131 | 1.67 | |
| May 2015 | July 2015 | 159 | 2.05 | |
| April 2015 | May 2015 | 168 | 2.11 | |
| March 2015 | May 2015 | 124 | 1.64 | |
| February 2015 | March 2015 | 145 | 1.9 | |
| January 2015 | March 2015 | 139 | 1.82 | |
| December 2014 | January 2015 | 160 | 2.08 | |
| November 2014 | December 2014 | 135 | 1.95 | |
| October 2014 | November 2014 | 254 | 3.7 | |
| September 2014 | November 2014 | 220 | 2.8 | |
| August 2014 | September 2014 | 200 | 2.8 | |
| July 2014 | August 2014 | 266 | 3.6 | |
| April 2014 | July 2014 | 183 | 2.6 | |
| March 2014 | March 2014 | 223 | 2.8 | First Nutch crawl |
| January 2014 | January 2014 | 148 | 2.3 | Crawls performed monthly |
| November 2013 | November 2013 | 102 | 2 | Data in WARC file format |
| July 2012 | July 2012 | - | - | Data in ARC file format |
| January 2012 | January 2012 | - | - | Public Data Set of Amazon Web Services |
| November 2011 | November 2011 | 40 | 5 | First availability on Amazon |
Norvig Web Data Science Award
In collaboration with SURFsara, Common Crawl sponsors the Norvig Web Data Science Award, a competition open to students and researchers in Benelux.[12][13] The award is named for Peter Norvig, who also chairs its judging committee.[12]
References
- ^ Rosanna Xia (February 5, 2012). "Tech entrepreneur Gil Elbaz made it big in L.A." Los Angeles Times. Retrieved July 31, 2014.
- ^ "Gil Elbaz and Common Crawl". NBC News. April 4, 2013. Retrieved July 31, 2014.
- ^ "So you're ready to get started". Retrieved June 2, 2018.
- ^ Lisa Green (January 8, 2014). "Winter 2013 Crawl Data Now Available". Retrieved June 2, 2018.
- ^ "Startups - Gil Elbaz and Nova Spivack of Common Crawl - TWiST #222". This Week In Startups. January 10, 2012.
- ^ Tom Simonite (January 23, 2013). "A Free Database of the Entire Web May Spawn the Next Google". MIT Technology Review. Retrieved July 31, 2014.
- ^ Jennifer Zaino (March 13, 2012). "Common Crawl To Add New Data In Amazon Web Services Bucket". Semantic Web. Retrieved July 31, 2014.
- ^ a b Jennifer Zaino (July 16, 2012). "Common Crawl Corpus Update Makes Web Crawl Data More Efficient, Approachable For Users To Explore". Semantic Web. Retrieved July 31, 2014.
- ^ a b Jennifer Zaino (December 18, 2012). "Blekko Data Donation Is A Big Benefit To Common Crawl". Semantic Web. Retrieved July 31, 2014.
- ^ Jordan Mendelson (February 20, 2014). "Common Crawl's Move to Nutch". Common Crawl. Retrieved July 31, 2014.
- ^ Jordan Mendelson (November 27, 2013). "New Crawl Data Available!". Common Crawl. Retrieved July 31, 2014.
- ^ a b Lisa Green (November 15, 2012). "The Norvig Web Data Science Award". Common Crawl. Retrieved July 31, 2014.
- ^ "Norvig Web Data Science Award 2014". Dutch Techcentre for Life Sciences. Retrieved July 31, 2014.
External links
- Common Crawl in California, United States
- Common Crawl GitHub Repository with the crawler, libraries and example code
- Common Crawl Discussion Group
- Common Crawl Blog