Common Crawl

Common Crawl
Type: 501(c)(3) non-profit
Founded: 2007
Headquarters: San Francisco, California, USA; Los Angeles, California, USA
Founder(s): Gil Elbaz
Key people: Peter Norvig, Nova Spivack, Carl Malamud, Kurt Bollacker, Joi Ito
Website: commoncrawl.org
Available in: English

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public.[1][2] Common Crawl's web archive consists of hundreds of terabytes of data from several billion webpages.[3] It completes four crawls a year.[4]

Common Crawl was founded in 2007 by Gil Elbaz.[5] Advisors to the non-profit include Peter Norvig and Joi Ito.[3] The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available.
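
As an illustration of what respecting robots.txt and nofollow policies involves, the following is a minimal Python sketch using only the standard library. It is not Common Crawl's crawler code: the robots.txt content, the user-agent name "ExampleBot", the example.com URLs, and the helper names are all hypothetical, chosen only to show the general technique.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class FollowableLinkCollector(HTMLParser):
    """Collect href values of <a> tags that are not marked rel="nofollow"."""

    def __init__(self) -> None:
        super().__init__()
        self.followable: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or be absent entirely.
        rel_tokens = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if href and "nofollow" not in rel_tokens:
            self.followable.append(href)


def followable_links(html: str) -> list[str]:
    """Return the links in an HTML fragment that a polite crawler may follow."""
    collector = FollowableLinkCollector()
    collector.feed(html)
    return collector.followable


robots = build_robots_parser(ROBOTS_TXT)
# Hypothetical user agent and URLs, used only for illustration.
allowed = robots.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = robots.can_fetch("ExampleBot", "https://example.com/private/data.html")
links = followable_links('<a href="/a">A</a><a rel="nofollow" href="/b">B</a>')
```

In this sketch, a crawler would call can_fetch before requesting a page and would queue only the links returned by followable_links.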

History

Amazon Web Services began hosting Common Crawl's archive through its Public Data Sets program in 2012.[6]

In July 2012, the organization began releasing metadata files and the text output of the crawlers alongside .arc files.[7] Previously, Common Crawl's archives had included only .arc files.[7]

In December 2012, blekko donated to Common Crawl the search engine metadata it had gathered from crawls conducted between February and October 2012.[8] The donated data helped Common Crawl "improve its crawl while avoiding spam, porn and the influence of excessive SEO."[8]

In 2013, Common Crawl began using the Apache Software Foundation's Nutch web crawler instead of a custom crawler.[9] With its November 2013 crawl, Common Crawl switched from .arc files to .warc files.[10]

Norvig Web Data Science Award

In collaboration with SURFsara, Common Crawl sponsors the Norvig Web Data Science Award, a competition open to students and researchers in the Benelux.[11][12] The award is named for Peter Norvig, who also chairs its judging committee.[11]

References

  1. ^ Rosanna Xia (February 5, 2012). "Tech entrepreneur Gil Elbaz made it big in L.A.". Los Angeles Times. Retrieved July 31, 2014. 
  2. ^ "Gil Elbaz and Common Crawl". NBC News. April 4, 2013. Retrieved July 31, 2014. 
  3. ^ a b Tom Simonite (January 23, 2013). "A Free Database of the Entire Web May Spawn the Next Google". MIT Technology Review. Retrieved July 31, 2014. 
  4. ^ Russell Brandom (March 1, 2013). "Common Crawl: going after Google on a non-profit budget". Retrieved July 31, 2014. 
  5. ^ "Startups - Gil Elbaz and Nova Spivack of Common Crawl - TWiST #222". This Week In Startups. January 10, 2012. 
  6. ^ Jennifer Zaino (March 13, 2012). "Common Crawl To Add New Data In Amazon Web Services Bucket". Semantic Web. Retrieved July 31, 2014. 
  7. ^ a b Jennifer Zaino (July 16, 2012). "Common Crawl Corpus Update Makes Web Crawl Data More Efficient, Approachable For Users To Explore". Semantic Web. Retrieved July 31, 2014. 
  8. ^ a b Jennifer Zaino (December 18, 2012). "Blekko Data Donation Is A Big Benefit To Common Crawl". Semantic Web. Retrieved July 31, 2014. 
  9. ^ Jordan Mendelson (February 20, 2014). "Common Crawl's Move to Nutch". Common Crawl. Retrieved July 31, 2014. 
  10. ^ Jordan Mendelson (November 27, 2013). "New Crawl Data Available!". Common Crawl. Retrieved July 31, 2014. 
  11. ^ a b Lisa Green (November 15, 2012). "The Norvig Web Data Science Award". Common Crawl. Retrieved July 31, 2014. 
  12. ^ "Norvig Web Data Science Award 2014". Dutch Techcentre for Life Sciences. Retrieved July 31, 2014. 

External links