|Developer(s)||Apache Software Foundation|
|Stable release||1.5.1 and 2.1 / October 5, 2012|
|License||Apache License 2.0|
Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
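The plug-in idea can be illustrated with a minimal sketch. Every name here (`PluginRegistrySketch`, `MediaTypeParser`, `register`, `parse`) is hypothetical and chosen for illustration; this is not Nutch's actual plug-in API, only one way a registry keyed by media type might look.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical names throughout; this is not Nutch's actual plug-in API.
final class PluginRegistrySketch {

    /** A plug-in that turns the raw bytes of one media type into plain text. */
    interface MediaTypeParser {
        String parse(byte[] content);
    }

    private final Map<String, MediaTypeParser> parsers = new HashMap<>();

    /** Registers a parser for a media type such as "text/plain". */
    void register(String mediaType, MediaTypeParser parser) {
        parsers.put(mediaType, parser);
    }

    /** Applies the parser registered for the media type, if there is one. */
    Optional<String> parse(String mediaType, byte[] content) {
        return Optional.ofNullable(parsers.get(mediaType)).map(p -> p.parse(content));
    }

    public static void main(String[] args) {
        PluginRegistrySketch registry = new PluginRegistrySketch();
        // A trivial plug-in: treat the bytes as UTF-8 text.
        registry.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));

        byte[] page = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.parse("text/plain", page));
        // No parser is registered for this type, so the result is empty.
        System.out.println(registry.parse("application/pdf", new byte[0]));
    }
}
```

Keeping the core unaware of concrete parsers means a new media type can be supported by registering one more implementation, without touching the code that calls `parse`.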
The fetcher ("robot" or "web crawler") was written from scratch specifically for this project.
In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities were later spun out into their own subproject, called Hadoop.
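For readers unfamiliar with the MapReduce model, the following is a single-process sketch of its map and reduce phases, applied to counting crawled URLs per host. It is an in-memory illustration only: the class and method names are assumptions, and it uses neither Hadoop's API nor a distributed file system.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the MapReduce model; it does not use Hadoop's API.
final class MapReduceSketch {

    /** Map phase: emit one (host, 1) pair per crawled URL. */
    static List<Map.Entry<String, Integer>> map(List<String> urls) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host != null) {
                pairs.add(Map.entry(host, 1));
            }
        }
        return pairs;
    }

    /** Shuffle and reduce phases: group pairs by key and sum the values per key. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://example.org/a",
                "http://example.org/b",
                "http://example.com/");
        System.out.println(reduce(map(urls)));
    }
}
```

In a real cluster the map calls would run on many machines in parallel and the grouped pairs would be shipped to reducers over the network; the sketch keeps only the data flow.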
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.
- highly scalable and relatively feature-rich crawler
- politeness - the crawler obeys robots.txt rules
- robust and scalable - Nutch can run on a cluster of up to 100 machines
- quality - crawling can be biased to fetch "important" pages first
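Two of the features above, politeness and biasing the crawl toward "important" pages, can be sketched together as a small crawl frontier. This is a simplified illustration under stated assumptions, not Nutch's implementation: the names (`FrontierSketch`, `Candidate`, `offer`, `next`) are hypothetical, the scores are supplied by the caller, the Disallow prefixes are assumed to have already been read from a single host's robots.txt, and user-agent matching and wildcards are ignored.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; names and scoring are illustrative, not Nutch's implementation.
final class FrontierSketch {

    /** A URL waiting to be fetched, with an importance score (higher is fetched sooner). */
    static final class Candidate {
        final String url;
        final double score;

        Candidate(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    // Highest score first, so "important" pages leave the queue before the rest.
    private final PriorityQueue<Candidate> queue =
            new PriorityQueue<>(Comparator.comparingDouble((Candidate c) -> c.score).reversed());
    private final List<String> disallowedPrefixes;

    /** disallowedPrefixes stands in for Disallow rules already read from robots.txt. */
    FrontierSketch(List<String> disallowedPrefixes) {
        this.disallowedPrefixes = new ArrayList<>(disallowedPrefixes);
    }

    /** Politeness check: a path is allowed unless it starts with a disallowed prefix. */
    boolean isAllowed(String url) {
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    /** Adds a URL to the frontier only if the robots rules permit fetching it. */
    boolean offer(String url, double score) {
        return isAllowed(url) && queue.add(new Candidate(url, score));
    }

    /** Returns the highest-scoring URL still queued, or null when the frontier is empty. */
    String next() {
        Candidate c = queue.poll();
        return c == null ? null : c.url;
    }

    public static void main(String[] args) {
        FrontierSketch frontier = new FrontierSketch(List.of("/private/"));
        frontier.offer("http://example.org/about", 0.2);
        frontier.offer("http://example.org/", 0.9);
        frontier.offer("http://example.org/private/notes", 1.0); // rejected by the rule
        System.out.println(frontier.next());
        System.out.println(frontier.next());
    }
}
```

Checking the rules at insertion time rather than at fetch time keeps disallowed URLs out of the queue entirely, so a high score can never cause a forbidden page to be fetched.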
IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. It found that a scale-out system such as Nutch/Lucene could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the Power5.
- Hadoop - Java framework that supports distributed applications running on large clusters
Search engines built with Nutch
- Creative Commons Search - launched 2004, Nutch implementation replaced 2006
- DiscoverEd - Open educational resources search prototype developed by Creative Commons
- Krugle uses Nutch to crawl web pages for code, archives and technically interesting content.
- mozDex (inactive)
- Wikia Search - launched 2008, closed down 2009
- Shoberg, J (October 26, 2006). Building Search Applications with Lucene and Nutch (1st ed.). Apress. p. 350. ISBN 978-1-59059-687-6.
- Building Nutch: Open Source Search (2004) - ACM Queue vol. 2, no. 2
- An article about Nutch (2003) - Search Engine Watch
- Another article about Nutch (2003) - Tech News World