Biomedical text mining

From Wikipedia, the free encyclopedia

Biomedical text mining (also known as BioNLP) refers to text mining applied to texts and literature of the biomedical and molecular biology domain. It is a relatively recent research field at the intersection of natural language processing, bioinformatics, medical informatics and computational linguistics.

Interest in text mining and information extraction strategies applied to the biomedical and molecular biology literature is growing because of the increasing number of electronically available publications stored in databases such as PubMed.

Main applications

The main developments in this area have been related to the identification of biological entities in free text (named entity recognition), such as protein and gene names as well as chemical compounds and drugs,[1] the association of gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature, and the automatic extraction of protein interactions and of associations between proteins and functional concepts (e.g. gene ontology terms). Information extraction and text mining technology has also been applied to the extraction of kinetic parameters from text and to the subcellular location of proteins. These methods have further been explored to extract information related to biological processes and diseases.[2]
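
As a rough illustration of what named entity recognition in free text involves, the simplest possible approach is exact lookup against a term list. The sketch below is only that: the lexicon, its entries and the function name `tag_entities` are hypothetical, and it does not represent the method of any tool named in this article, which typically rely on curated resources and machine learning.

```python
import re

# Hypothetical toy lexicon mapping surface forms to entity types.
LEXICON = {
    "BRCA1": "gene",
    "p53": "protein",
    "aspirin": "drug",
}

def tag_entities(text, lexicon=LEXICON):
    """Return (start, end, surface form, type) for each lexicon term found in text."""
    # Try longer terms first so a long name wins over a shorter one it contains.
    terms = sorted(lexicon, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in terms)
    # Lookarounds keep matches from starting or ending inside a longer token.
    pattern = re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)")
    return [
        (m.start(), m.end(), m.group(1), lexicon[m.group(1)])
        for m in pattern.finditer(text)
    ]

if __name__ == "__main__":
    for hit in tag_entities("BRCA1 interacts with p53."):
        print(hit)
```

Exact, case-sensitive matching of this kind misses spelling variants and ambiguous abbreviations, which is one reason name recognition is treated as a research problem in its own right.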


  • PIE the search - PIE (Protein Interaction information Extraction) the search is a web service to extract PPI-relevant articles from MEDLINE.
  • ARIANA - Adaptive Robust and Integrative Analysis for Finding Novel Associations (ARIANA) is a context-specific, modular and scalable system that uses PubMed and is able to capture direct and indirect associations among biomedical concepts (concepts are derived from MeSH).
  • PubTator - PubTator is a machine-assisted annotation system that provides a web-based semantic search service for genes, diseases, and chemicals.
  • KLEIO - an advanced information retrieval system providing knowledge enriched searching for biomedicine.
  • FACTA+ - a MEDLINE search engine for finding associations between biomedical concepts. The FACTA+ Visualizer helps intuitive understanding of FACTA+ search results through graphical visualization of the results.[3]
  • U-Compare - U-Compare is an integrated text mining/natural language processing system based on the UIMA Framework, with an emphasis on components for biomedical text mining.[4]
  • TerMine - a term management system that identifies key terms in biomedical and other text types.
  • PLAN2L — Extraction of gene regulation relations, protein-protein interactions, mutations, ranked associations and cellular and developmental process associations for genes and proteins of the plant Arabidopsis from abstracts and full text articles.
  • MEDIE - an intelligent search engine to retrieve biomedical correlations from MEDLINE, based on indexing by Natural Language Processing and Text Mining techniques.[5]
  • AcroMine - an acronym dictionary which can be used to find distinct expanded forms of acronyms from MEDLINE.[6]
  • AcroMine Disambiguator - Disambiguates abbreviations in biomedical text with their correct full forms.[7]
  • GENIA tagger - Analyses biomedical text and outputs base forms, part-of-speech tags, chunk tags, and named entity tags.
  • NEMine - Recognises gene/protein names in text.
  • Yeast MetaboliNER - Recognizes yeast metabolite names in text.
  • Smart Dictionary Lookup - machine learning-based gene/protein name lookup.
  • TPX - A concept-assisted search and navigation tool for biomedical literature analyses - runs on PubMed/PMC and can be configured, on request, to run on local literature repositories too.[8]
  • Chilibot — A tool for finding relationships between genes or gene products.
  • EBIMed - EBIMed is a web application that combines Information Retrieval and Extraction from Medline.[9]
  • FABLE — A gene-centric text-mining search engine for MEDLINE
  • GOAnnotator - an online tool that uses semantic similarity for verification of electronic protein annotations using GO terms automatically extracted from literature.
  • GoPubMed — retrieves PubMed abstracts for your search query, then detects ontology terms from the Gene Ontology and Medical Subject Headings in the abstracts and allows the user to browse the search results by exploring the ontologies and displaying only papers mentioning selected terms, their synonyms or descendants.
  • Anne O'Tate - Retrieves sets of PubMed records, using a standard PubMed interface, and analyzes them, arranging the content of PubMed record fields (MeSH, author, journal, words from titles and abstracts, and others) in order of frequency.
  • Information Hyperlinked Over Proteins (iHOP):[10] "A network of concurring genes and proteins extends through the scientific literature touching on phenotypes, pathologies and gene function. iHOP provides this network as a natural way of accessing millions of PubMed abstracts. By using genes and proteins as hyperlinks between sentences and abstracts, the information in PubMed can be converted into one navigable resource, bringing all advantages of the internet to scientific literature research."
  • LitInspector — Gene and signal transduction pathway data mining in PubMed abstracts.
  • NextBio - Life sciences search engine with text mining functionality that uses PubMed abstracts and clinical trials to return concepts relevant to the query based on a number of heuristics including ontology relationships, journal impact, publication date, and authorship.
  • The Neuroscience Information Framework (NIF) — A neuroscience research hub with a search engine specifically tailored for neuroscience, direct access to over 180 databases, and curated resources. Built as part of the NIH Blueprint for Neuroscience Research.
  • PubAnatomy — An interactive visual search engine that provides new ways to explore relationships among Medline literature, text mining results, anatomical structures, gene expression and other background information.
  • PubGene - Displays co-occurrence networks of gene and protein symbols as well as MeSH, GO, PubChem and interaction terms (such as "binds" or "induces") as these appear in MEDLINE records (that is, PubMed titles and abstracts).
  • Reflect — Reflect is a free service that tags gene, protein, and small molecule names in any web page within a few seconds. Clicking on a tagged term opens a small popup showing summary information.
  • Whatizit - Whatizit identifies molecular biology terms and links them to publicly available databases.[11]
  • XTractor - Discovering newer scientific relations across PubMed abstracts. A tool to obtain manually annotated, expert-curated relationships for proteins, diseases, drugs and biological processes as they are published in PubMed.
  • Medical Abstract - Medical Abstract is an aggregator of medical journal abstracts from PubMed.
  • MuGeX — MuGeX is a tool for finding disease specific mutation-gene pairs.
  • MedCase - MedCase is an experimental tool from the Faculties of Veterinary Medicine and Computer Science in Cluj-Napoca, designed as a homeostatic serving system with natural language support for medical applications.
  • BeCAS — BeCAS is a web application, API and widget for biomedical concept identification, able to annotate free text and PubMed abstracts.
  • @Note2 - A workbench for Biomedical Text Mining (including Information Retrieval, Named Entity Recognition and Relation Extraction plugins).
  • tagtog — A Biomedical Text Mining web framework. Collaborative tool for assisted annotation and corpus creation. Users can train Machine Learning models for automatic extraction of entities and relations (e.g. gene mentions or mutations) from abstracts and full text articles. Users can also use dictionaries to handle synonyms and easily map the data extracted to any database.[12]
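
Several entries in the list above describe finding associations between concepts, or displaying co-occurrence networks of terms found in MEDLINE records. A minimal sketch of the underlying counting step is given below, assuming that the terms in each record have already been identified. The function name `cooccurrence_counts` and the example input are hypothetical, and the listed tools may rank or filter associations quite differently.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Count how many documents mention each unordered pair of terms.

    `docs` is an iterable of term lists, one list per record (assumed to be
    the output of an earlier entity recognition step).
    """
    counts = Counter()
    for terms in docs:
        # De-duplicate within a record and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(terms)), 2):
            counts[pair] += 1
    return counts

if __name__ == "__main__":
    records = [
        ["TP53", "MDM2"],
        ["MDM2", "TP53", "BRCA1"],
        ["BRCA1"],
    ]
    for pair, n in cooccurrence_counts(records).most_common():
        print(pair, n)
```

The resulting pair counts can be read as weighted edges of a network whose nodes are the terms. Raw counts favour frequently mentioned terms, so a practical system would usually normalise them in some way.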

Conferences at which BioNLP research is presented

BioNLP is presented at a variety of meetings:

See also

External links


  1. ^ M Krallinger, F Leitner, O Rabal, M Vazquez, J Oyarzabal and A Valencia, Overview of the chemical compound and drug name recognition (CHEMDNER) task. Proceedings of the Fourth BioCreative Challenge Evaluation Workshop vol. 2. 6-37.
  2. ^ Krallinger, M; Leitner, F; Valencia, A (2010). "Analysis of Biological Processes and Diseases Using Text Mining Approaches". Bioinformatics Methods in Clinical Research. Methods in Molecular Biology 593. pp. 341–82. doi:10.1007/978-1-60327-194-3_16. ISBN 978-1-60327-193-6. PMID 19957157.
  3. ^ Tsuruoka Y, Tsujii J and Ananiadou S (2008). "FACTA: a text search engine for finding associated biomedical concepts". Bioinformatics 24 (21): 2559–2560. doi:10.1093/bioinformatics/btn469. PMC 2572701. PMID 18772154. 
  4. ^ Kano Y, Baumgartner Jr WA, McCrohon L, Ananiadou S, Cohen KB, Hunter L and Tsujii J (2009). "U-Compare: share and compare text mining tools with UIMA". Bioinformatics 25 (15): 1997–1998. doi:10.1093/bioinformatics/btp289. PMC 2712335. PMID 19414535. 
  5. ^ Miyao Y, Ohta T, Masuda K, Tsuruoka Y, Yoshida K, Ninomiya T and Tsujii J (2006). "Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases". Proceedings of COLING-ACL 2006. pp. 1017–1024. 
  6. ^ Okazaki N and Ananiadou S (2006). "Building an abbreviation dictionary using a term recognition approach". Bioinformatics 22 (24): 3089–3095. doi:10.1093/bioinformatics/btl534. PMID 17050571. 
  7. ^ Okazaki N, Ananiadou S and Tsujii J (2010). "Building a high-quality sense inventory for improved abbreviation disambiguation". Bioinformatics 26 (9): 1246–1253. doi:10.1093/bioinformatics/btq129. PMC 2859134. PMID 20360059. 
  8. ^ Thomas Joseph, Vangala G Saipradeep, Ganesh Sekar Venkat Raghavan, Rajgopal Srinivasan, Aditya Rao, Sujatha Kotte & Naveen Sivadasan (2012). "TPX: Biomedical literature search made easy". Bioinformation 8 (12): 578–580. doi:10.6026/97320630008578. PMC 3398782. PMID 22829734. 
  9. ^ Rebholz-Schuhmann D, Kirsch H, Arregui M, Gaudan S, Riethoven M and Stoehr P (2007). "EBIMed—text crunching to gather facts for proteins from Medline". Bioinformatics 23 (2): e237–e244. doi:10.1093/bioinformatics/btl302. PMID 17237098. 
  10. ^ Hoffmann R, Valencia A (September 2005). "Implementing the iHOP concept for navigation of biomedical literature". Bioinformatics 21 (Suppl 2): ii252–8. doi:10.1093/bioinformatics/bti1142. PMID 16204114. 
  11. ^ Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A (November 2008). "Text processing through Web services: calling Whatizit". Bioinformatics 24 (2): 296–298. doi:10.1093/bioinformatics/btm557. PMID 18006544. 
  12. ^ Cejuela J, McQuilton P, Ponting L, Marygold S, Stefancsik R, Millburn G, Rost B, FlyBase Consortium (2014). "tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles". Database 2014. doi:10.1093/database/bau033. PMID 24715220.