National Centre for Text Mining

National Centre for Text Mining (NaCTeM)
Established: 2004
Parent institution: School of Computer Science, University of Manchester
Academic affiliation: University of Manchester
Location: Manchester, United Kingdom
Director: Prof. Sophia Ananiadou

The National Centre for Text Mining (NaCTeM)[1] is a publicly funded text mining (TM) centre. It was established to provide support, advice, and information on TM technologies and to disseminate information from the larger TM community, while also providing tailored services and tools in response to the requirements of the United Kingdom academic community.

The software tools and services which NaCTeM supplies allow researchers to apply text mining techniques to problems within their specific areas of interest – examples of these tools are highlighted below. In addition to providing services, the Centre is also involved in, and makes significant contributions to, the text mining research community both nationally and internationally in initiatives such as Europe PubMed Central.

The Centre is located in the Manchester Institute of Biotechnology and is operated and organized by the University of Manchester School of Computer Science. NaCTeM contributes expertise in natural language processing and information extraction, including named-entity recognition and extraction of complex relationships (or events) that hold between named entities, along with parallel and distributed data mining systems in biomedical and clinical applications.


TerMine is a domain-independent method for automatic term recognition that can be used to locate the most important terms in a document and rank them automatically.[2]

AcroMine finds all known expanded forms of an acronym as they have appeared in Medline entries. Conversely, it can be used to find possible acronyms for an expanded form as they have previously appeared in Medline, and it disambiguates them.[3]

Medie is an intelligent search engine for semantic retrieval of sentences containing biomedical correlations from Medline abstracts.[4]

FACTA+ is a Medline search engine for finding associations between biomedical concepts.[5]

FACTA+ Visualizer is a web application that aids in understanding FACTA+ search results through intuitive graphical visualisation.[6]

KLEIO is a faceted semantic information retrieval system over Medline abstracts.

Europe PMC EvidenceFinder helps users to explore facts that involve entities of interest within the full text articles of the Europe PubMed Central database.[7]

EUPMC Evidence Finder for Anatomical entities with meta-knowledge is similar to the Europe PMC EvidenceFinder, allowing exploration of facts involving anatomical entities within the full text articles of the Europe PubMed Central database. Facts can be filtered according to various aspects of their interpretation (e.g., negation, certainty level, novelty).

Info-PubMed provides information and graphical representation of biomedical interactions extracted from Medline using deep semantic parsing technology. This is supplemented with a term dictionary consisting of over 200,000 protein/gene names and identification of disease types and organisms.

Clinical Trial Protocols (ASCOT) is an efficient, semantically-enhanced search application, customised for clinical trial documents.[8]

History of Medicine (HOM) is a semantic search system over historical medical document archives.
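
The two lookup directions described for AcroMine above (acronym to expanded forms, and expanded form to acronyms) can be illustrated with a minimal Python sketch. This is not NaCTeM code and does not call any AcroMine interface: the class name AcronymIndex, its methods, and the sample counts are hypothetical, and ranking candidates by how often they were observed is an assumption standing in for AcroMine's actual disambiguation.

```python
from collections import defaultdict


class AcronymIndex:
    """Toy two-way index between acronyms and expanded forms (hypothetical)."""

    def __init__(self):
        # acronym -> {expanded form -> number of times observed}
        self._expansions = defaultdict(lambda: defaultdict(int))
        # expanded form -> {acronym -> number of times observed}
        self._acronyms = defaultdict(lambda: defaultdict(int))

    def add(self, acronym, expansion, count=1):
        """Record that `acronym` was seen standing for `expansion`."""
        key = expansion.lower()  # expanded forms are matched case-insensitively
        self._expansions[acronym][key] += count
        self._acronyms[key][acronym] += count

    def expand(self, acronym):
        """Return known expanded forms, most frequently observed first."""
        forms = self._expansions.get(acronym, {})
        return sorted(forms, key=lambda f: (-forms[f], f))

    def abbreviate(self, expansion):
        """Return known acronyms for an expanded form, most frequent first."""
        found = self._acronyms.get(expansion.lower(), {})
        return sorted(found, key=lambda a: (-found[a], a))


index = AcronymIndex()
# Made-up counts for illustration only; they are not taken from Medline.
index.add("ER", "estrogen receptor", 5)
index.add("ER", "endoplasmic reticulum", 3)
index.add("ER", "emergency room", 1)

print(index.expand("ER"))
print(index.abbreviate("Endoplasmic Reticulum"))
```

Keeping both mappings in step inside a single add method is what lets either direction be queried without rescanning the source text.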


BioLexicon – a large-scale terminological resource for the biomedical domain.[9]

GENIA – a collection of reference materials for the development of biomedical text mining systems.

GREC – a semantically annotated corpus of Medline abstracts intended for training IE systems and/or resources which are used to extract events from biomedical literature.[10]

Metabolite and Enzyme Corpus – a corpus of Medline abstracts annotated by experts with metabolite and enzyme names.

Anatomy Corpora – a collection of corpora manually annotated with fine-grained, species-independent anatomical entities, to facilitate the development of text mining systems that can carry out detailed and comprehensive analyses of biomedical scientific text.[11][12]

Meta-knowledge corpus – an enrichment of the GENIA Event corpus, in which events are enriched with various levels of information pertaining to their interpretation. The aim is to allow systems to be trained that can distinguish events that represent factual information from those that represent experimental analyses, definite information from speculated information, etc.[13]
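
To make the meta-knowledge idea above concrete, the following minimal Python sketch shows events that carry interpretation attributes, together with a filter over those attributes. The Event class, its field names, and the attribute values are hypothetical simplifications chosen for illustration; they are not the corpus's actual annotation scheme or file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A toy event with interpretation attributes (hypothetical schema)."""
    trigger: str
    participants: tuple
    negated: bool = False
    certainty: str = "certain"  # assumed values: "certain", "speculated"
    source_type: str = "fact"   # assumed values: "fact", "analysis"


def select(events, **criteria):
    """Keep the events whose attributes equal every given criterion."""
    return [
        event for event in events
        if all(getattr(event, name) == value for name, value in criteria.items())
    ]


# Invented examples; none of these come from the corpus itself.
events = [
    Event("activates", ("protein A", "gene B")),
    Event("inhibits", ("protein C", "gene D"), certainty="speculated"),
    Event("binds", ("protein E", "protein F"), negated=True, source_type="analysis"),
]

definite_facts = select(events, negated=False, certainty="certain", source_type="fact")
print([event.trigger for event in definite_facts])
```

Storing each aspect of interpretation as its own attribute means that any combination of them, such as speculated but not negated, can be requested without changing the filter.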


Argo – The objective of the Argo project is to develop a workbench for analysing (primarily annotating) textual data. The workbench, which is accessed as a web application, supports the combination of elementary text-processing components to form comprehensive processing workflows. It provides functionality to manually intervene in the otherwise automatic process of annotation by correcting or creating new annotations, and facilitates user collaboration by providing sharing capabilities for user-owned resources. Argo benefits users such as text-analysis designers by providing an integrated environment for the development of processing workflows; annotators/curators by providing manual annotation functionalities supported by automatic pre-processing and post-processing; and developers by providing a workbench for testing and evaluating text analytics.

Big Mechanism – Big mechanisms are large, explanatory models of complicated systems in which interactions have important causal effects. Whilst the collection of big data is increasingly automated, the creation of big mechanisms remains a largely human effort, which is becoming increasingly challenging owing to the fragmentation and distribution of knowledge. The ability to automate the construction of big mechanisms could have a major impact on scientific research. This project is one of several that make up the Big Mechanism programme, funded by DARPA; its aim is to assemble an overarching big mechanism from the literature and prior experiments and to utilise this for the probabilistic interpretation of new patient panomics data. The project aims to integrate machine reading of the cancer literature with probabilistic reasoning across cancer claims using specially designed ontologies, computational modelling of cancer mechanisms (pathways), automated hypothesis generation to extend knowledge of the mechanisms, and a 'Robot Scientist' that performs experiments to test the hypotheses. A repetitive cycle of text mining, modelling, experimental testing, and worldview updating is intended to lead to increased knowledge about cancer mechanisms.

COPIOUS – This project aims to produce a knowledge repository of Philippine biodiversity by combining the domain-relevant expertise and resources of Philippine partners with the text mining-based big data analytics of the University of Manchester's National Centre for Text Mining. The repository will be a synergy of different types of information, e.g., taxonomic, occurrence, ecological, biomolecular, biochemical, thus providing users with a comprehensive view on species of interest that will allow them to (1) carry out predictive analysis on species distributions, and (2) investigate potential medicinal applications of natural products derived from Philippine species.

Europe PMC Project – This is a collaboration with the Text-Mining group at the European Bioinformatics Institute (EBI) and Mimas (data centre), forming a work package in the Europe PubMed Central project (formerly UKPMC) hosted and coordinated by the British Library. Europe PMC, as a whole, forms a European version of the PubMed Central paper repository, in collaboration with the National Institutes of Health (NIH) in the United States. Europe PMC is funded by a consortium of key biomedical research funders. NaCTeM's contribution to this major project is the application of text mining solutions to enhance information retrieval and knowledge discovery. As such, it applies technology developed in other NaCTeM projects on a large scale and in a prominent resource for the biomedicine community.

Mining Biodiversity – This project aims to transform the Biodiversity Heritage Library (BHL) into a next-generation social digital library resource to facilitate the study and discussion (via social media integration) of legacy science documents on biodiversity by a worldwide community and to raise awareness of the changes in biodiversity over time in the general public. The project integrates novel text mining methods, visualisation, crowdsourcing and social media into the BHL. The resulting digital resource will provide fully interlinked and indexed access to the full content of BHL library documents, via semantically enhanced and interactive browsing and searching capabilities, allowing users to locate precisely the information of interest to them in an easy and efficient manner.

Mining for Public Health – This project aims to conduct novel research in text mining and machine learning to transform the way in which evidence-based public health (EBPH) reviews are conducted. The aims of the project are to develop new unsupervised text mining methods for deriving term similarities, to support screening while searching in EBPH reviews, and to develop new algorithms for ranking and visualising meaningful associations of multiple types in a dynamic and iterative manner. These newly developed methods will be evaluated in EBPH reviews, based on implementation of a pilot, to ascertain the level of transformation in EBPH reviewing.


  1. Ananiadou S (2007). "The National Centre for Text Mining: A Vision for the Future". Ariadne (53).
  2. Frantzi, K., Ananiadou, S. and Mima, H. (2007). "Automatic recognition of multi-word terms". International Journal of Digital Libraries. 3 (2): 117–132.
  3. Okazaki N, Ananiadou S (2006). "Building an abbreviation dictionary using a term recognition approach". Bioinformatics. 22 (24): 3089–95. doi:10.1093/bioinformatics/btl534. PMID 17050571.
  4. Miyao, Y., Ohta, T., Masuda, K., Tsuruoka, Y., Yoshida, K., Ninomiya, T. and Tsujii, J. (2006). "Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases". Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. pp. 1017–1024. doi:10.3115/1220175.1220303.
  5. Tsuruoka Y, Tsujii J, Ananiadou S (2008). "FACTA: a text search engine for finding associated biomedical concepts". Bioinformatics. 24 (21): 2559–60. doi:10.1093/bioinformatics/btn469. PMC 2572701. PMID 18772154.
  6. Tsuruoka, Y; Miwa, M; Hamamoto, K; Tsujii, J; Ananiadou, S (2011). "Discovering and visualizing indirect associations between biomedical concepts". Bioinformatics. 27 (13): i111–9. doi:10.1093/bioinformatics/btr214.
  7. The Europe PMC Consortium (2014). "Europe PMC: a full-text literature database for the life sciences and platform for innovation". Nucleic Acids Research. 43 (D1): D1042–D1048. doi:10.1093/nar/gku1061. PMC 4383902. PMID 25378340.
  8. Korkontzelos, I., Mu, T. and Ananiadou, S. (2012). "ASCOT: a text mining-based web-service for efficient search and assisted creation of clinical trials". BMC Medical Informatics and Decision Making. 12 (Suppl 1): S3. doi:10.1186/1472-6947-12-S1-S3.
  9. Thompson, P., McNaught, J., Montemagni, S., Calzolari, N., del Gratta, R., Lee, V., Marchi, S., Monachini, M., Pezik, P., Quochi, V., Rupp, C. J., Sasaki, Y., Venturi, G., Rebholz-Schuhmann, D. and Ananiadou, S. (2011). "The BioLexicon: a large-scale terminological resource for biomedical text mining". BMC Bioinformatics. 12: 397. doi:10.1186/1471-2105-12-397. PMC 3228855. PMID 21992002.
  10. Thompson, P., Iqbal, S. A., McNaught, J. and Ananiadou, S. (2009). "Construction of an annotated corpus to support biomedical information extraction". BMC Bioinformatics. 10: 349. doi:10.1186/1471-2105-10-349.
  11. Pyysalo, S., Ohta, T., Miwa, M., Cho, H. -C., Tsujii, J. and Ananiadou, S. (2012). "Event extraction across multiple levels of biological organization". Bioinformatics. 28 (18): i575–i581. doi:10.1093/bioinformatics/bts407. PMC 3436834. PMID 22962484.
  12. Pyysalo, S. & Ananiadou, S. (2014). "Anatomical Entity Mention Recognition at Literature Scale". Bioinformatics. 30 (6): 868–875. doi:10.1093/bioinformatics/btt580. PMC 3957068. PMID 24162468.
  13. Thompson, P., Nawaz, R., McNaught, J. and Ananiadou, S. (2011). "Enriching a biomedical event corpus with meta-knowledge annotation". BMC Bioinformatics. 12: 393. doi:10.1186/1471-2105-12-393. PMC 3222636. PMID 21985429.

External links