Nutch
| Developer(s) | Apache Software Foundation |
|---|---|
| Stable release | 1.5.1 and 2.1 / October 5, 2012 |
| Development status | Active |
| Written in | Java |
| Operating system | Cross-platform |
| Type | Search Engine |
| License | Apache License 2.0 |
| Website | nutch.apache.org |
Nutch is an effort to build an open-source web search engine in Java, using Lucene for the search and index component.
Features
Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
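The plug-in idea described above can be illustrated with a small sketch. This is not Nutch's actual extension API: the names `MediaTypeParser`, `ParserRegistry` and `PluginDemo` are hypothetical, chosen only to show how a registry might dispatch content to a parser plug-in by media type.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical plug-in contract: turns the raw bytes of one media type into plain text. */
interface MediaTypeParser {
    String parse(byte[] content);
}

/** Hypothetical registry that maps a media type to the plug-in that handles it. */
class ParserRegistry {
    private final Map<String, MediaTypeParser> parsers = new HashMap<>();

    /** Registers a plug-in; media types are compared case-insensitively. */
    void register(String mediaType, MediaTypeParser parser) {
        parsers.put(mediaType.toLowerCase(Locale.ROOT), parser);
    }

    /** Returns the parsed text, or empty when no plug-in handles the media type. */
    Optional<String> parse(String mediaType, byte[] content) {
        MediaTypeParser parser = parsers.get(mediaType.toLowerCase(Locale.ROOT));
        return parser == null ? Optional.empty() : Optional.of(parser.parse(content));
    }
}

class PluginDemo {
    /** Builds a registry with two toy plug-ins (assumptions for illustration). */
    static ParserRegistry defaultRegistry() {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Naive tag stripping, for illustration only; this is not a real HTML parser.
        registry.register("text/html",
            bytes -> new String(bytes, StandardCharsets.UTF_8).replaceAll("<[^>]*>", "").trim());
        return registry;
    }

    public static void main(String[] args) {
        ParserRegistry registry = defaultRegistry();
        byte[] page = "<p>Hello, Nutch</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.parse("text/html", page).orElse("(no parser)"));
        System.out.println(registry.parse("application/pdf", page).orElse("(no parser)"));
    }
}
```

In a design like this, supporting a new media type means registering one more parser rather than changing the crawler core, which is the kind of modularity the paragraph above describes.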
The fetcher ("robot" or "web crawler") was written from scratch specifically for this project.
History
Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.
In June 2003, a successful 100-million-page demonstration system was developed. To meet the multimachine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. The two facilities have been spun out into their own subproject, called Hadoop.
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.[1]
Advantages
Advantages of Nutch over a simple fetcher include:[2]
- a highly scalable and relatively feature-rich crawler
- politeness features, such as obeying robots.txt rules
- robustness and scalability: you can run Nutch on a cluster of 100 machines
- quality: you can bias the crawling to fetch “important” pages first
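Two of the points above, obeying robots.txt and fetching “important” pages first, can be sketched as a toy crawl frontier. This is a minimal illustration under stated assumptions, not Nutch's implementation: the `CrawlFrontier` class, its numeric scores and the simplified `Disallow:` handling are all hypothetical, and real robots.txt processing also covers user-agent groups, `Allow` rules and wildcards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical frontier: skips disallowed paths and hands out high-score paths first. */
class CrawlFrontier {
    /** A URL path waiting to be fetched, with an assumed importance score. */
    static final class Candidate {
        final String path;
        final double score;
        Candidate(String path, double score) { this.path = path; this.score = score; }
    }

    private final List<String> disallowedPrefixes = new ArrayList<>();
    // Highest score is polled first.
    private final PriorityQueue<Candidate> queue =
        new PriorityQueue<>(Comparator.comparingDouble((Candidate c) -> c.score).reversed());

    /** Reads only "Disallow:" lines and ignores everything else (a deliberate simplification). */
    void loadRobotsTxt(String robotsTxt) {
        for (String line : robotsTxt.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.regionMatches(true, 0, "Disallow:", 0, 9)) {
                String prefix = trimmed.substring(9).trim();
                if (!prefix.isEmpty()) disallowedPrefixes.add(prefix);
            }
        }
    }

    /** A path is allowed when it starts with none of the disallowed prefixes. */
    boolean isAllowed(String path) {
        return disallowedPrefixes.stream().noneMatch(path::startsWith);
    }

    /** Queues a path unless the robots rules forbid it; returns whether it was queued. */
    boolean offer(String path, double score) {
        if (!isAllowed(path)) return false;
        return queue.add(new Candidate(path, score));
    }

    /** Returns the highest-scoring queued path, or null when the frontier is empty. */
    String next() {
        Candidate c = queue.poll();
        return c == null ? null : c.path;
    }
}

class FrontierDemo {
    public static void main(String[] args) {
        CrawlFrontier frontier = new CrawlFrontier();
        frontier.loadRobotsTxt("User-agent: *\nDisallow: /private/\n");
        frontier.offer("/about", 0.2);
        frontier.offer("/private/admin", 0.9); // rejected by the Disallow rule
        frontier.offer("/", 1.0);
        frontier.offer("/news", 0.5);
        for (String p = frontier.next(); p != null; p = frontier.next()) {
            System.out.println(p); // prints "/", then "/news", then "/about"
        }
    }
}
```

How the scores are computed is left open here; a crawler might derive them from link analysis or any other importance signal.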
Scalability
IBM Research studied the performance[3] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.[4] The study found that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the Power5.
The ClueWeb09 dataset (used, for example, in TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.[5]
Related projects
- Hadoop - Java framework that supports distributed applications running on large clusters
Search engines built with Nutch
- Creative Commons Search - launched 2004, Nutch implementation replaced 2006[6][7][8]
- DiscoverEd - Open educational resources search prototype developed by Creative Commons
- Krugle uses Nutch to crawl web pages for code, archives and technically interesting content.
- mozDex (inactive)
- Wikia Search - launched 2008, closed down 2009[9][10]
References
- [1] Nutch News
- [2] Using Nutch with Solr
- [3] Scalability of the Nutch search engine
- [4] Base Operating System Provisioning and Bringup for a Commercial Supercomputer
- [5] http://boston.lti.cs.cmu.edu/crawler/crawlerstats.html
- [6] "Our Updated Search". Creative Commons. 2004-09-03.
- [7] "Creative Commons Unique Search Tool Now Integrated into Firefox 1.0". Creative Commons. 2004-11-22.
- [8] "New CC search UI". Creative Commons. 2006-08-02.
- [9] Where can I get the source code for Wikia Search?
- [10] Update on Wikia – doing more of what’s working
Bibliography
- Shoberg, J (October 26, 2006). Building Search Applications with Lucene and Nutch (1st ed.). Apress. p. 350. ISBN 978-1-59059-687-6.
External links
- Official website
- Official wiki
- Building Nutch: Open Source Search (2004), ACM Queue vol. 2, no. 2
- An article about Nutch (2003), Search Engine Watch
- Another article about Nutch (2003), Tech News World

