Original author(s): Radim Řehůřek
Developer(s): RaRe Technologies
Initial release: 2009
Stable release: 3.3.0 / 3 February 2018
Written in: Python
Operating system: Linux, Windows, macOS
Type: Information retrieval
License: LGPL

Gensim is a robust open-source vector space modeling and topic modeling toolkit implemented in Python. It uses NumPy, SciPy and optionally Cython for performance. Gensim is specifically designed to handle large text collections, using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages that only target batch and in-memory processing.
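The streaming idea described above can be sketched in plain Python. This is only an illustration of the pattern, not Gensim's own code: the class `StreamedCorpus` and the function `document_frequencies` are hypothetical names introduced here, and the tokenizer is a deliberately naive whitespace split.

```python
from collections import Counter
from typing import Iterable, Iterator, List


class StreamedCorpus:
    """Yield one tokenized document at a time instead of loading them all."""

    def __init__(self, lines: Iterable[str]) -> None:
        # `lines` could be an open file handle; a small list is used in the demo.
        self.lines = lines

    def __iter__(self) -> Iterator[List[str]]:
        for line in self.lines:
            # Naive tokenizer (assumption): lowercase, split on whitespace.
            yield line.lower().split()


def document_frequencies(corpus: Iterable[List[str]]) -> Counter:
    """Update per-term document counts incrementally, one document at a time."""
    df: Counter = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # count each term at most once per document
    return df


docs = [
    "Human machine interface",
    "Graph of trees",
    "Graph minors survey",
]
df = document_frequencies(StreamedCorpus(docs))
print(df["graph"])  # 2
```

Because only one document is held in memory at a time, the same loop would work unchanged on a file far larger than the available RAM.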

Main features

Gensim includes implementations of tf-idf, random projections, word2vec and doc2vec algorithms,[1] hierarchical Dirichlet processes (HDP), latent semantic analysis (LSA, LSI, SVD) and latent Dirichlet allocation (LDA), including distributed parallel versions.[2]
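As a rough illustration of the first algorithm in that list, the sketch below computes tf-idf weights from scratch in plain Python. It is not Gensim's implementation: the function name `tfidf`, the choice of a base-2 logarithm and the unit-length normalization are assumptions made for this example, and other weighting variants are common.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per document, scaled to unit length."""
    n = len(docs)

    # Document frequency: in how many documents does each term occur?
    df: Counter = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # Assumed weighting: raw term count times log2(N / df).
        weights = {t: c * math.log2(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm == 0.0:
            vectors.append({})  # every term occurs in every document
        else:
            vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = [
    ["graph", "trees"],
    ["graph", "minors", "survey"],
    ["human", "interface"],
]
vectors = tfidf(docs)
# "trees" occurs in one document, "graph" in two, so "trees" gets more weight.
print(vectors[0]["trees"] > vectors[0]["graph"])  # True
```

Terms that occur in fewer documents receive a higher weight, which is why "trees" outweighs "graph" in the first document.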

Some of the online algorithms in Gensim were also published in the 2011 PhD dissertation Scalability of Semantic Analysis in Natural Language Processing by Radim Řehůřek, the creator of Gensim.[3]

Uses of Gensim

Gensim has been used and cited in over 800 commercial and academic applications, in a diverse array of disciplines from medicine to insurance claim analysis to patent search.[4][5] The software has been covered in several news articles, podcasts and interviews since 2009.[6][7][8]

Free and commercial support

The open-source code is developed and hosted on GitHub,[9] and public support forums are maintained on Google Groups[10] and Gitter.[11]

Gensim is commercially supported by its developer, RaRe Technologies, which also provides student mentorships and academic thesis projects for Gensim via its Student Incubator programme.[12]


External links