Nutch
| Developer(s) | Apache Software Foundation |
|---|---|
| Stable release | 1.4 / December 26, 2011 |
| Development status | Active |
| Written in | Java |
| Operating system | Cross-platform |
| Type | Search Engine |
| License | Apache License 2.0 |
| Website | nutch.apache.org |
Nutch is an open source web search engine project that uses Lucene Java for its search and index components.
Features
Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
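To make the plug-in idea concrete, here is a minimal sketch of a registry that picks a parser by media type. The names `PluginSketch`, `MediaParser`, `ParserRegistry`, and `defaultRegistry` are hypothetical and chosen for illustration; they are not Nutch's actual extension-point API, and the two toy parsers are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a plug-in registry; these names are illustrative
// and are not Nutch's real extension-point API.
public class PluginSketch {

    /** A hypothetical extension point: turns raw content into plain text. */
    public interface MediaParser {
        String parse(String rawContent);
    }

    /** Maps a media type such as "text/html" to the plug-in that handles it. */
    public static class ParserRegistry {
        private final Map<String, MediaParser> parsers = new HashMap<>();

        public void register(String mediaType, MediaParser parser) {
            parsers.put(mediaType, parser);
        }

        /** Returns the parsed text, or an empty Optional if no plug-in is registered. */
        public Optional<String> parse(String mediaType, String rawContent) {
            MediaParser parser = parsers.get(mediaType);
            return parser == null ? Optional.empty() : Optional.of(parser.parse(rawContent));
        }
    }

    /** Builds a registry with two toy parsers (assumed for illustration). */
    public static ParserRegistry defaultRegistry() {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/plain", raw -> raw.trim());
        // Naive tag stripping; a real HTML parser would be far more careful.
        registry.register("text/html", raw -> raw.replaceAll("<[^>]*>", "").trim());
        return registry;
    }

    public static void main(String[] args) {
        ParserRegistry registry = defaultRegistry();
        System.out.println(registry.parse("text/html", "<p>Hello, Nutch</p>").orElse("(no parser)"));
        System.out.println(registry.parse("application/pdf", "%PDF").orElse("(no parser)"));
    }
}
```

In this sketch, supporting a new media type means registering one more `MediaParser`; the calling code does not change.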
The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.
History
Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.
In June 2003, a successful 100-million-page demonstration system was developed. To meet the multimachine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two facilities were later spun out into their own subproject, called Hadoop.
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.[1]
Advantages[2]
Nutch has several advantages over a simple fetcher:
- A highly scalable and relatively feature-rich crawler
- Politeness features, such as obeying robots.txt rules
- Robust and scalable: Nutch can run on a cluster of 100 machines
- Quality: crawling can be biased to fetch “important” pages first
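The politeness and quality points above can be sketched as a small crawl frontier that refuses paths covered by a robots.txt `Disallow` rule and hands out the highest-scored path first. This is a simplified illustration, not Nutch's implementation: `FrontierSketch` and its methods are hypothetical names, the scores are assumed inputs, and real robots.txt handling also involves user-agent groups, `Allow` rules, and wildcards.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; the class and method names are illustrative and are
// not taken from Nutch.
public class FrontierSketch {

    /** A URL path waiting to be fetched, with an assumed importance score. */
    public static final class Candidate {
        final String path;
        final double score;

        Candidate(String path, double score) {
            this.path = path;
            this.score = score;
        }
    }

    private final List<String> disallowedPrefixes = new ArrayList<>();

    // Highest score first, so "important" pages are fetched before others.
    private final PriorityQueue<Candidate> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Candidate c) -> c.score).reversed());

    /** Records one "Disallow:" path prefix from a site's robots.txt. */
    public void disallow(String prefix) {
        disallowedPrefixes.add(prefix);
    }

    /**
     * Queues a path unless a Disallow prefix covers it; returns whether it was queued.
     * Rules are checked only at this point, so register them before offering paths.
     */
    public boolean offer(String path, double score) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        queue.add(new Candidate(path, score));
        return true;
    }

    /** Returns the next path to fetch, or null when the frontier is empty. */
    public String next() {
        Candidate c = queue.poll();
        return c == null ? null : c.path;
    }

    public static void main(String[] args) {
        FrontierSketch frontier = new FrontierSketch();
        frontier.disallow("/private/");
        frontier.offer("/about.html", 0.2);
        frontier.offer("/index.html", 0.9);
        frontier.offer("/private/admin.html", 1.0); // rejected by the Disallow rule
        System.out.println(frontier.next()); // /index.html
        System.out.println(frontier.next()); // /about.html
    }
}
```

A priority queue keeps the choice of the next page cheap as the frontier grows. How the scores are computed is left open here.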
Scalability
IBM Research studied the performance[3] of Nutch/Lucene as part of its Commercial Scale Out (CSO) project.[4] It found that a scale-out system such as Nutch/Lucene could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the Power5.
The ClueWeb09 dataset (used, for example, in TREC) was gathered using Nutch, at an average speed of 755.31 documents per second.[5]
Related projects
- Hadoop - Java framework that supports distributed applications running on large clusters
- nutchWAX - Uses Nutch to search a web archive
- Sixearch - An unstructured peer network application that gives users a complementary way to actively and collaboratively share their own document collections
Search engines built with Nutch
- Creative Commons Search - launched 2004, Nutch implementation replaced 2006[6][7][8]
- DiscoverEd - Open educational resources search prototype developed by Creative Commons[9]
- Krugle
- mozDex
- Wikia Search - launched 2008, closed down 2009[10][11]
- search2.net
- Tothego.com
References
- [1] Nutch News
- [2] Using Nutch with Solr
- [3] Scalability of the Nutch search engine
- [4] Base Operating System Provisioning and Bringup for a Commercial Supercomputer
- [5] http://boston.lti.cs.cmu.edu/crawler/crawlerstats.html
- [6] "Our Updated Search". Creative Commons. 2004-09-03. http://creativecommons.org/weblog/entry/4388.
- [7] "Creative Commons Unique Search Tool Now Integrated into Firefox 1.0". Creative Commons. 2004-11-22. http://creativecommons.org/press-releases/entry/5064.
- [8] "New CC search UI". Creative Commons. 2006-08-02. http://creativecommons.org/weblog/entry/6002.
- [9] DiscoverEd home page
- [10] Where can I get the source code for Wikia Search?
- [11] Update on Wikia – doing more of what’s working
Bibliography
- Shoberg, J (October 26, 2006). Building Search Applications with Lucene and Nutch (1st ed.). Apress. pp. 350. ISBN 978-1590596876. http://www.apress.com/book/view/9781590596876.
External links
- Official website
- Official wiki
- Building Nutch: Open Source Search (2004) - ACM Queue vol. 2, no. 2
- An article about Nutch (2003) - Search Engine Watch
- Another article about Nutch (2003) - Tech News World
- Official page of the Hadoop project

