Carrot2

Carrot²
	Web search results clustered using Carrot2's Lingo algorithm.
Developer(s)	Carrot Search
Stable release	4.6.0 / May 2024
Repository	github.com/carrot2/carrot2/
Written in	Java
Operating system	Cross-platform
Type	Text mining and cluster analysis
License	BSD license
Website	search.carrot2.org

Carrot²^[1] is an open source search results clustering engine.^[2] It can automatically cluster small collections of documents, e.g. search results or document abstracts, into thematic categories. Carrot² is written in Java and distributed under the BSD license.

History

The initial version of Carrot² was implemented in 2001 by Dawid Weiss as part of his MSc thesis to validate the applicability of the STC clustering algorithm to clustering search results in Polish.^[3] In 2003, a number of other search results clustering algorithms were added, including Lingo,^[4] a novel text clustering algorithm designed specifically for clustering search results. Although the source code of Carrot² had been available since 2002, version 1.0 was not officially released until 2006. Later that year, version 2.0 followed with an improved user interface and an extended tool set. In 2009, version 3.0 brought significant improvements in clustering quality, a simplified API and a new GUI application, based on the Eclipse Rich Client Platform, for tuning clustering. In 2020, version 4.0.0 further simplified the API, cleaned up the code and removed the desktop Workbench. Version 4.1.0, released in 2021, brought the Workbench back as a web-based application.

Carrot² releases
Release	Release Date	Major changes and new features
4.6.0	May 2024	Dependency updates, build system improvements.
4.5.2	November 2023	Dependency updates, build system improvements.
4.5.1	May 2023	Dependency updates, minor bug fixes.
4.5.0	November 2022	Dependency updates, bug fixes.
4.4.3	August 2022	Dependency updates, bug fixes to STC and stemming infrastructure.
4.4.0, 4.4.1, 4.4.2	December 2021	Security fixes and dependency updates.
4.3.0	July 2021	Minor API changes and bug fixes. Improvements to the workbench (DCS search frontend).
4.2.0, 4.2.1	March 2021	Improvements to JSON dictionaries and the workbench. Bug fixes.
4.1.0	January 2021	Web-based Workbench. JSON dictionaries and new filtering options. API polishing.
4.0.0	July 2020	API changes and simplifications across the codebase. Removal of deprecated technologies and tools. New documentation and code cleanups.
3.16.2	September 2019	Update third party libraries (security-related issues).
3.16.1	January 2019	Update of JS visualizations. Migration of Microsoft Bing API v5 to v7.
3.16.0	May 2018	An overhaul of Java 9+ compatibility issues. Workbench compatibility for Ubuntu distros. Document source updates and removals of non-functional document sources.
3.15.1	March 2017	A bugfix for .NET release that could result in unchecked I/O exceptions on inaccessible current working directory.
3.15.0	October 2016	Bing API V2 to V5 transition. Upgrade of third party dependencies. Internal cosmetics.
3.14.0	September 2016	Workbench improvements (high DPI support, Mac OS X improvements, bug fixes). PubMed switching to HTTPS. Other minor improvements.
3.13.0	July 2016	Servlet API bug fixes, Workbench bug fixes, removed Google document source, fixed language codes for a few languages.
3.12.0	February 2016	Upgrade of Morfologik Polish dictionary, infrastructural changes and adjustments allowing C2 to operate under more strict security manager policies.
3.11.0	October 2015	Upgrade of Apache Lucene, bug fixes and a rollup of changes from 3.10.x minors.
3.10.4	October 2015	Upgrade of Morfologik library.
3.10.3	August 2015	Repackaged Google Guava to avoid conflicts in Solr.
3.10.2	July 2015	Minor fixes to the Workbench (Arabic cluster display).
3.10.1	May 2015	Aduna visualization dropped from MacOS distribution. Minor fixes to the Workbench.
3.10.0	May 2015	Visualization updates. Bug fixes. Library dependency updates.
3.9.4	November 2014	FoamTree update. New attributes for multilingual clustering. Visualization fixes.
3.9.3	July 2014	FoamTree update. Infrastructure fixes and tweaks (jflex, sonatype repository URLs).
3.9.2	April 2014	Bug fix to FoamTree HTML5.
3.9.1	April 2014	Bug fixes, upgrades of HTML5 visualizations.
3.9.0	February 2014	HTML5 visualizations replacing flash, library dependencies update, bugfixes.
3.8.1	October 2013	Bug fixes, minor tweaks to functionality.
3.8.0	July 2013	Bug fixes, library dependency updates.
3.7.1	May 2013	Minor bug fixes (3.7.0 maintenance release).
3.7.0	April 2013	Infrastructure changes to the core (string IDs), better Solr integration XSLT, Workbench tweaks for larger inputs, updated dependencies.
3.6.3	April 2013	Minor bug fixes and improvements: customization of Solr adapter XSLT, Workbench tweaks for larger inputs, updated dependencies.
3.6.2	November 2012	Minor bug fixes and improvements.
3.6.1	August 2012	Minor bug fixes.
3.6.0	June 2012	Infrastructural changes, refactorings and bug fixes.
3.5.3	December 2011	Infrastructure updates resulting from migration to GitHub. Workbench update to SWT 3.7.1.
3.5.2	September 2011	Ajax support in Document Clustering Server, Bing document source improved, Workbench improvements, bug fixes.
3.5.1	June 2011	Bug fixes, visualization integration improvements, support for Yahoo BOSS API removed.
3.5.0	May 2011	FoamTree visualization, bisecting k-means clustering, resource management improvements.
3.4.3	March 2011	Distribution to Maven central repository.
3.4.2	October 2010	Bug fixes.
3.4.1	September 2010	Solr 1.4.x compatibility package, bug fixes.
3.4.0	August 2010	.NET API for calling Carrot² clustering.
3.3.0	April 2010	Significant scalability improvements in the STC clustering algorithm.
3.2.0	March 2010	Experimental support for clustering Arabic and Korean content, command line application for clustering in batch mode, LGPL-licensed dependencies removed.
3.1.0	September 2009	Experimental support for clustering Chinese content, search results clustering plugin for Apache Solr.
3.0.1	March 2009	Document Clustering Workbench available for Mac OS X.
3.0.0	January 2009	Document Clustering Workbench added for easy experimenting with Carrot² clustering, radically simplified Java API, search results clustering web application re-implemented, user manual^[5] available.
2.1.0	August 2007	Document Clustering Server added for exposing clustering as a REST service.
2.0.0	September 2006	New user interface of the search results clustering web application.
1.0.0	January 2006	First official release, binaries available on SourceForge.
0.0.0	since 2002	Incubation releases, source code available on SourceForge.

Architecture

Carrot² 4.0 is predominantly a Java programming library with public APIs for management of language-specific resources, algorithm configuration and execution. An HTTP/REST component (the document clustering server) is provided for interoperability with other languages.
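
As an illustration of how a client in any language might talk to such an HTTP/REST component, the following minimal Java sketch builds (but does not send) a JSON clustering request. The endpoint address, the JSON field names (language, algorithm, documents, snippet) and the class name DcsRequestSketch are assumptions made for this example and are not taken from this article; they should be checked against the document clustering server's own documentation.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.List;
import java.util.stream.Collectors;

final class DcsRequestSketch {
  // Assumption: address of a locally running document clustering server.
  static final String ENDPOINT = "http://localhost:8080/service/cluster";

  /** Escapes backslashes and double quotes only; control characters are not handled. */
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  /** Builds the JSON body. The field names are assumptions, not a confirmed schema. */
  static String body(String language, String algorithm, List<String> snippets) {
    String docs = snippets.stream()
        .map(s -> "{\"snippet\":\"" + escape(s) + "\"}")
        .collect(Collectors.joining(","));
    return "{\"language\":\"" + escape(language) + "\","
        + "\"algorithm\":\"" + escape(algorithm) + "\","
        + "\"documents\":[" + docs + "]}";
  }

  /** Wraps the JSON body in a POST request; nothing is sent over the network here. */
  static HttpRequest request(String json) {
    return HttpRequest.newBuilder(URI.create(ENDPOINT))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(json))
        .build();
  }

  public static void main(String[] args) {
    String json = body("English", "Lingo", List.of("data mining", "text mining tools"));
    HttpRequest req = request(json);
    System.out.println(req.method() + " " + req.uri());
    System.out.println(json);
  }
}
```

Sending the request would take one further call to java.net.http.HttpClient; it is left out so that the sketch runs without a server.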

Clustering algorithms

Carrot² offers a few document clustering algorithms that place emphasis on the quality of cluster labels:

Lingo:^[4] a clustering algorithm based on singular value decomposition
STC:^[6] Suffix Tree Clustering
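
The idea behind phrase-based clustering such as STC can be illustrated with a toy Java sketch: documents that share a phrase form a candidate group (a "base cluster"), and the shared phrase doubles as a readable label. This is not Carrot²'s implementation. It swaps the suffix tree for a plain map of word n-grams, and it omits the scoring and merging of overlapping base clusters that the published algorithm performs. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SharedPhraseSketch {
  /** Maps every word n-gram of 1..maxLen words to the indices of documents containing it. */
  static Map<String, Set<Integer>> phraseIndex(List<String> docs, int maxLen) {
    Map<String, Set<Integer>> index = new TreeMap<>();
    for (int d = 0; d < docs.size(); d++) {
      String text = docs.get(d).trim().toLowerCase(Locale.ROOT);
      if (text.isEmpty()) {
        continue; // nothing to index for an empty document
      }
      String[] words = text.split("\\s+");
      for (int start = 0; start < words.length; start++) {
        StringBuilder phrase = new StringBuilder();
        // Grow the phrase one word at a time, recording each prefix.
        for (int len = 1; len <= maxLen && start + len <= words.length; len++) {
          if (len > 1) {
            phrase.append(' ');
          }
          phrase.append(words[start + len - 1]);
          index.computeIfAbsent(phrase.toString(), k -> new TreeSet<>()).add(d);
        }
      }
    }
    return index;
  }

  /** Keeps only phrases shared by at least minDocs documents (the toy "base clusters"). */
  static Map<String, Set<Integer>> baseClusters(List<String> docs, int maxLen, int minDocs) {
    Map<String, Set<Integer>> result = phraseIndex(docs, maxLen);
    result.values().removeIf(ids -> ids.size() < minDocs);
    return result;
  }

  public static void main(String[] args) {
    List<String> docs = List.of(
        "suffix tree clustering of search results",
        "clustering search results with phrases",
        "java primitive collections");
    baseClusters(docs, 3, 2)
        .forEach((phrase, ids) -> System.out.println(phrase + " -> " + ids));
  }
}
```

With the three sample documents, the first two share the phrase "search results" (along with its individual words), while the third shares nothing and stays unclustered.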

Spin-offs

Carrot Search

Carrot Search,^[7] a commercial spin-off of the Carrot² project, continues the development of Carrot². It also offers a real-time text clustering algorithm^[8] compliant with the Carrot² framework, as well as text mining consulting services based on open source and proprietary software.

Carrot Search Labs

Carrot² gave rise to a number of independent open source projects released under the umbrella of Carrot Search Labs.^[9] The following projects are or were published as part of this initiative:

Randomized Testing: a JUnit test runner with built-in utilities that make every test run slightly different (randomized). It also includes an Ant task for running JUnit tests on parallel JVMs, with load balancing and other features.
High Performance Primitive Collections for Java (HPPC): Lists, Sets, Maps and other collections of primitives for Java tuned for highest performance and memory efficiency.
SmartSprites: fully automatic maintenance of CSS sprites; no tedious copying and pasting to the CSS when adding or changing sprited images.

Discontinued projects:

jSuffixArrays: Several Java implementations of the Suffix Array data structure with different performance and memory characteristics.
JUnitBenchmarks: A set of extensions for turning JUnit4 tests into performance micro-benchmarks with GC monitoring, time variance measurement and simple graphical visualizations.

References

[1] Carrot2 Project, Stanisław Osiński, Dawid Weiss. "Carrot2 - Open Source Search Results Clustering Engine".
[2] Carrot² search results clustering demo.
[3] Dawid Weiss: A Clustering Interface for Web Search Results in Polish and English. MSc thesis. Poznan University of Technology, Poznań, Poland, 2001.
[4] Stanisław Osiński, Dawid Weiss: A Concept-Driven Algorithm for Clustering Search Results. IEEE Intelligent Systems, vol. 20, no. 3, May/June 2005, pp. 48–54.
[5] "Carrot2".
[6] Oren Zamir, Oren Etzioni: Web Document Clustering: A Feasibility Demonstration. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 46–54.
[7] Carrot Search s.c. "Carrot Search: document clustering and visualization software".
[8] Carrot Search s.c. "Carrot Search: Lingo3G: Text Document Clustering Engine".
[9] Carrot Search s.c. "Carrot Search Labs".
