Norconex HTTP Collector
Stable release: 1.1 / August 22, 2013
Norconex HTTP Collector is a web crawler (or spider) initially created for enterprise search integrators and developers. It began as a closed-source project developed by Norconex and was released as open source under the GPLv3 in June 2013.
Norconex HTTP Collector is built entirely in Java. A single Collector installation launches one or more crawler threads, each with its own configuration.
Each step of the crawler life-cycle is configurable and overridable. Developers can provide their own interface implementations for most steps undertaken by the crawler. The default implementations cover a wide range of crawling use cases and are built on stable products such as Apache Tika and Apache Derby. The following figure is a high-level representation of a URL's life-cycle from the crawler's perspective.
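The pattern of swapping in a custom step implementation can be sketched as follows. This is a minimal, self-contained illustration: the `IURLFilter` interface and class names are hypothetical, not the actual Norconex API.

```java
// Hypothetical sketch: each crawler step can be replaced with a custom
// implementation of a small interface. The interface below is
// illustrative only, not the actual Norconex API.
public class CustomFilterSketch {

    interface IURLFilter {
        boolean acceptURL(String url);
    }

    // A filter that keeps only English Wikipedia article URLs.
    static class WikipediaOnlyFilter implements IURLFilter {
        @Override
        public boolean acceptURL(String url) {
            return url.matches("https?://en\\.wikipedia\\.org/wiki/.*");
        }
    }

    public static void main(String[] args) {
        IURLFilter filter = new WikipediaOnlyFilter();
        System.out.println(filter.acceptURL("http://en.wikipedia.org/wiki/Java")); // true
        System.out.println(filter.acceptURL("http://example.com/"));               // false
    }
}
```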
The Importer and Committer modules are separate GPL Java libraries distributed with the Collector.
The Importer module parses incoming documents from their raw form (HTML, PDF, Word, etc.) into a set of extracted metadata and plain-text content. In addition, it provides interfaces to manipulate a document's metadata, transform its content, or filter documents based on their parsed form. While the Collector depends heavily on the Importer module, the latter can be used on its own as a general-purpose document parser.
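The Importer's raw-document-in, metadata-plus-text-out contract can be sketched as below. The `ImportedDoc` class and the toy HTML handling are assumptions for illustration; the real module delegates parsing to robust libraries such as Apache Tika.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the Importer's contract: a raw document goes in,
// extracted metadata and plain-text content come out. The class and the
// naive HTML handling below are illustrative, not the Norconex API.
public class ImporterSketch {

    static class ImportedDoc {
        final Map<String, String> metadata = new HashMap<>();
        String content;
    }

    // Pull the <title> into metadata and strip tags from the content,
    // mimicking (very crudely) what a real parser does.
    static ImportedDoc importDocument(String rawHtml) {
        ImportedDoc doc = new ImportedDoc();
        Matcher m = Pattern.compile("<title>(.*?)</title>").matcher(rawHtml);
        if (m.find()) {
            doc.metadata.put("title", m.group(1));
        }
        doc.content = rawHtml.replaceAll("<[^>]+>", " ").trim();
        return doc;
    }

    public static void main(String[] args) {
        ImportedDoc doc = importDocument(
                "<html><title>Hello</title><body>World</body></html>");
        System.out.println(doc.metadata.get("title")); // Hello
        System.out.println(doc.content);
    }
}
```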
The Committer module is responsible for directing the parsed data to a target repository of choice. Developers can write custom implementations, allowing Norconex HTTP Collector to be used with any search engine or repository. Two committer implementations currently exist, for Apache Solr and Elasticsearch.
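The custom-committer pattern might look like the following. The `ICommitter` interface and its queue/commit method names are assumptions made for this sketch, not the library's actual API; a real implementation would push the queued documents to Solr, Elasticsearch, or another repository instead of holding them in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the committer pattern: a minimal interface a
// custom committer would implement to direct parsed documents to any
// target repository. Illustrative only, not the actual Norconex API.
public class CommitterSketch {

    interface ICommitter {
        void queueAdd(String reference, String content, Map<String, String> metadata);
        void commit();
    }

    // Toy committer that accumulates document references in memory
    // instead of sending them to a search engine.
    static class InMemoryCommitter implements ICommitter {
        final List<String> committed = new ArrayList<>();
        private final List<String> queue = new ArrayList<>();

        @Override
        public void queueAdd(String reference, String content, Map<String, String> metadata) {
            queue.add(reference);
        }

        @Override
        public void commit() {
            committed.addAll(queue);
            queue.clear();
        }
    }

    public static void main(String[] args) {
        InMemoryCommitter committer = new InMemoryCommitter();
        committer.queueAdd("http://example.com/a", "some text", Collections.emptyMap());
        committer.commit();
        System.out.println(committer.committed.size()); // 1
    }
}
```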
Java Standard Edition 6.0 or higher is required. The Collector runs on any platform supporting Java.
While Norconex HTTP Collector can be configured programmatically, it also supports XML configuration files, which are parsed with Apache Velocity. Velocity directives permit variable substitution and configuration re-use across different Collector installations.
<httpcollector id="Minimal Config HTTP Collector">
  <crawlers>
    <crawler id="Minimal Config Wikipedia 1-page Crawl">
      <startURLs>
        <url>http://en.wikipedia.org/wiki/Alice%27s_Adventures_in_Wonderland</url>
      </startURLs>
      <maxDepth>0</maxDepth>
      <delay default="5000" />
      <httpURLFilters>
        <filter class="com.norconex.collector.http.filter.impl.RegexURLFilter"
                onMatch="include">
          http://en\.wikipedia\.org/wiki/.*
        </filter>
      </httpURLFilters>
      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.tagger.impl.KeepOnlyTagger"
                  fields="title,keywords,description"/>
        </postParseHandlers>
      </importer>
      <committer class="com.norconex.committer.impl.FileSystemCommitter">
        <directory>./minimunCrawlFiles</directory>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>
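Because configuration files pass through Apache Velocity, values can be factored out as variables. The sketch below shows the idea; the variable names (`${startURL}`, `${crawlDelay}`) are hypothetical and would be defined externally, e.g. in a separate variables file shared across installations.

```xml
<!-- Hypothetical sketch: Velocity variables let the same configuration
     be re-used across installations. ${startURL} and ${crawlDelay} are
     assumed to be supplied from an external variables file. -->
<httpcollector id="Shared Config Collector">
  <crawlers>
    <crawler id="Parameterized Crawl">
      <startURLs>
        <url>${startURL}</url>
      </startURLs>
      <delay default="${crawlDelay}" />
    </crawler>
  </crawlers>
</httpcollector>
```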