Biomedical text mining

From Wikipedia, the free encyclopedia
Revision as of 18:37, 4 October 2018

Biomedical text mining (also known as BioNLP) refers to the methods and study of how text mining may be applied to texts and literature of the biomedical and molecular biology domains. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies developed through studies in this field are frequently applied to the biomedical and molecular biology literature available through services such as PubMed.

Challenges

Applying text mining approaches to biomedical text presents specific challenges common to the domain.

Availability of annotated text data

Large annotated corpora used in the development and training of general-purpose text mining methods (e.g., sets of movie dialogue,[1] product reviews,[2] or Wikipedia article text) are not specific to biomedical language. While they may provide evidence of general text properties such as parts of speech, they rarely contain concepts of interest to biologists or clinicians. Development of new methods to identify features specific to biomedical documents therefore requires assembly of specialized corpora.[3] Resources designed to aid in building new biomedical text mining methods have been developed through the Informatics for Integrating Biology and the Bedside (i2b2) challenges[4][5][6] and by biomedical informatics researchers.[7][8] Text mining researchers frequently combine these corpora with the controlled vocabularies and ontologies available through the National Library of Medicine's Unified Medical Language System (UMLS) and Medical Subject Headings (MeSH).

Supporting clinical needs

Biomedical text mining applications developed for clinical use should ideally reflect the needs and demands of clinicians.[3] This is a concern in environments where clinical decision support is expected to be informative and accurate.

Working with other clinical systems

New text mining systems must be interoperable with existing standards, electronic medical records, and databases.[3] Methods for interfacing with clinical systems such as LOINC have been developed[9] but require extensive organizational effort to implement and maintain.[10][11]

Patient privacy

Text mining systems operating with private medical data must respect its security and ensure it is rendered anonymous where appropriate.[12][13][14]
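
As a rough illustration of rendering text anonymous, the sketch below masks a few identifier patterns in a clinical note using regular expressions. The patterns, the "MRN" label format, and the placeholder tags are assumptions made for this example; systems used in practice typically combine rules, dictionaries, and trained models, and a handful of patterns like these would not be sufficient on their own.

```python
import re

# Illustrative patterns only; the identifier formats and placeholder
# tags below are assumptions for this example, not a standard.
PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "[DATE]"),    # ISO-style dates
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),   # 10-digit phone numbers
    (re.compile(r"\bMRN[:\s]*\d+\b"), "[MRN]"),          # hypothetical record number label
]

def mask_identifiers(note):
    """Replace every match of each pattern with its placeholder tag."""
    for pattern, tag in PATTERNS:
        note = pattern.sub(tag, note)
    return note

note = "Seen on 2018-10-04, MRN: 123456. Callback 555-867-5309."
masked = mask_identifiers(note)
print(masked)
```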

Processes

Specific text mining subtasks are of particular concern when processing biomedical text.

Named entity recognition

Developments in biomedical text mining have included methods for the identification of biological entities (named entity recognition), such as protein and gene names, as well as chemical compounds and drugs.[15]
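
One simple way to picture entity identification is a dictionary lookup, sketched below. The lexicon, its labels, and the function name are hypothetical and chosen only for illustration; the systems evaluated in shared tasks generally rely on curated vocabularies and statistical or neural models rather than a short word list.

```python
import re

# Hypothetical toy lexicon; real systems draw on curated vocabularies
# or trained models rather than a hand-written list.
LEXICON = {
    "BRCA1": "GENE",
    "p53": "PROTEIN",
    "aspirin": "CHEMICAL",
    "tumor necrosis factor": "PROTEIN",
}

def find_entities(text, lexicon=LEXICON):
    """Return (start, end, matched text, label) for each lexicon term found."""
    # Put longer terms first so multi-word names are preferred over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    labels = {term.lower(): label for term, label in lexicon.items()}
    return [
        (m.start(), m.end(), m.group(0), labels[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]

sentence = "Aspirin did not alter BRCA1 expression or p53 activity."
entities = find_entities(sentence)
for start, end, surface, label in entities:
    print(start, end, surface, label)
```

Case-insensitive matching is a simplification here; gene and protein symbols are often case-sensitive, which is one reason dictionary lookup alone tends to be unreliable.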

Relationship discovery

Biomedical documents describe connections between concepts, whether they are interactions between biomolecules, events occurring in sequence over time (i.e., temporal relationships), or causal relationships. Text mining methods may perform relation discovery to identify these connections, often in concert with named entity recognition.[16]
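
A minimal sketch of one basic approach, sentence-level co-occurrence, is shown below: two entities mentioned in the same sentence are treated as candidates for a relationship. The entity list and example text are invented for illustration, and co-occurrence alone says nothing about the type or direction of a relationship, so this should be read as a starting point rather than a complete relation discovery method.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity list; in practice it would come from a
# named entity recognition step.
ENTITIES = ["BRCA1", "RAD51", "TP53", "MDM2"]

def cooccurrence_counts(text, entities=ENTITIES):
    """Count the sentences in which each unordered pair of entities co-occurs."""
    counts = Counter()
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        present = sorted(
            e for e in entities
            if re.search(r"(?<!\w)" + re.escape(e) + r"(?!\w)", sentence)
        )
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

abstract = (
    "BRCA1 interacts with RAD51 during DNA repair. "
    "MDM2 binds TP53 and promotes its degradation. "
    "Loss of BRCA1 impairs RAD51 recruitment."
)
pairs = cooccurrence_counts(abstract)
for pair, n in pairs.most_common():
    print(pair, n)
```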

Applications

Text mining applications in the biomedical field include computational approaches to assist with studies in protein docking,[17] protein interactions,[18][19] and protein-disease associations.[20]

Gene cluster identification

Methods for determining the association of gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature have been developed.[21]

Protein interactions

Automatic extraction of protein interactions and of associations between proteins and functional concepts (e.g., Gene Ontology terms) has been explored.[citation needed] Information extraction and text mining methods have also been applied to the extraction of kinetic parameters and of the subcellular locations of proteins from text.

Protein-disease associations

Text mining enables an unbiased evaluation of protein-disease relationships within a vast quantity of unstructured textual data.[22]

Applications of phrase mining to disease associations

A text mining study assembled a collection of 709 core extracellular matrix proteins and associated proteins based on two databases: MatrixDB (matrixdb.univ-lyon1.fr) and UniProt. This set of proteins had a manageable size and a rich body of associated information, making it suitable for the application of text mining tools. The researchers conducted phrase-mining analysis to cross-examine individual extracellular matrix proteins across the biomedical literature concerned with six categories of cardiovascular diseases. They used a phrase-mining pipeline, Context-aware Semantic Online Analytical Processing (CaseOLAP),[23] to semantically score all 709 proteins according to their Integrity, Popularity, and Distinctiveness. The study validated existing relationships and pointed to previously unrecognized biological processes in cardiovascular pathophysiology.[20]

Software tools

Search engines

Search engines designed to retrieve biomedical literature relevant to a user-provided query frequently rely upon text mining approaches. Publicly available tools specific to research literature include PubMed search, Europe PubMed Central search, GeneView,[24] and APSE.[25] Similarly, search engines and indexing systems specific to biomedical data have been developed, including DataMed[26] and OmicsDI.[27]

Some search engines, such as Essie,[28] OncoSearch,[29] PubGene[30][31] and GoPubMed,[32] were previously public but have since been discontinued, rendered obsolete, or integrated into commercial products.
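
The retrieval step behind a literature search can be sketched with a small inverted index and TF-IDF scoring, as below. This is a generic textbook technique, not a description of how any of the tools named above work; the document titles, function names, and the particular IDF smoothing are assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to a {doc_id: term frequency} posting dictionary."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def search(query, index, n_docs):
    """Rank documents by the sum of TF-IDF weights over the query terms."""
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Smoothed inverse document frequency (one of several common variants).
        idf = math.log(n_docs / len(postings)) + 1.0
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]

# Invented example titles, for illustration only.
docs = {
    "doc1": "Text mining of protein interactions in cancer",
    "doc2": "Protein docking guided by literature evidence",
    "doc3": "Gene expression in cardiovascular disease",
}
index = build_index(docs)
results = search("protein interactions", index, len(docs))
print(results)
```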

See also

References

  1. ^ Danescu-Niculescu-Mizil, Cristian; Lee, Lillian (2011). "Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs". CMCL '11.
  2. ^ McAuley, Julian; Leskovec, Jure (2013-10-12). "Hidden factors and hidden topics: understanding rating dimensions with review text". ACM: 165–172. doi:10.1145/2507157.2507163. ISBN 9781450324090.
  3. ^ a b c Ohno-Machado, Lucila; Nadkarni, Prakash; Johnson, Kevin (2013). "Natural language processing: algorithms and tools to extract computable information from EHRs and from the biomedical literature". Journal of the American Medical Informatics Association. 20 (5): 805–805. doi:10.1136/amiajnl-2013-002214. ISSN 1067-5027. PMC 3756279. PMID 23935077.
  4. ^ Uzuner, Özlem; South, Brett R; Shen, Shuying; DuVall, Scott L (2011). "2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text". Journal of the American Medical Informatics Association. 18 (5): 552–556. doi:10.1136/amiajnl-2011-000203. ISSN 1067-5027. PMC 3168320. PMID 21685143.
  5. ^ Sun, Weiyi; Rumshisky, Anna; Uzuner, Ozlem (2013). "Evaluating temporal relations in clinical text: 2012 i2b2 Challenge". Journal of the American Medical Informatics Association. 20 (5): 806–813. doi:10.1136/amiajnl-2013-001628. ISSN 1067-5027. PMC 3756273. PMID 23564629.
  6. ^ Stubbs, Amber; Kotfila, Christopher; Uzuner, Özlem (2015). "Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1". Journal of Biomedical Informatics. 58: S11–S19. doi:10.1016/j.jbi.2015.06.007. ISSN 1532-0464. PMC 4989908. PMID 26225918.
  7. ^ Albright, Daniel; Lanfranchi, Arrick; Fredriksen, Anwen; Styler, William F; Warner, Colin; Hwang, Jena D; Choi, Jinho D; Dligach, Dmitriy; Nielsen, Rodney D (2013). "Towards comprehensive syntactic and semantic annotations of the clinical narrative". Journal of the American Medical Informatics Association. 20 (5): 922–930. doi:10.1136/amiajnl-2012-001317. ISSN 1067-5027. PMC 3756257. PMID 23355458.
  8. ^ Bada, Michael; Eckert, Miriam; Evans, Donald; Garcia, Kristin; Shipley, Krista; Sitnikov, Dmitry; Baumgartner, William A; Cohen, K; Verspoor, Karin (2012). "Concept annotation in the CRAFT corpus". BMC Bioinformatics. 13 (1): 161. doi:10.1186/1471-2105-13-161. ISSN 1471-2105. PMC 3476437. PMID 22776079.
  9. ^ Vandenbussche, Pierre-Yves; Cormont, Sylvie; André, Christophe; Daniel, Christel; Delahousse, Jean; Charlet, Jean; Lepage, Eric (2013). "Implementation and management of a biomedical observation dictionary in a large healthcare information system". Journal of the American Medical Informatics Association. 20 (5): 940–946. doi:10.1136/amiajnl-2012-001410. ISSN 1067-5027. PMC 3756262. PMID 23635601.
  10. ^ Jannot, Anne-Sophie; Zapletal, Eric; Avillach, Paul; Mamzer, Marie-France; Burgun, Anita; Degoulet, Patrice (2017). "The Georges Pompidou University Hospital Clinical Data Warehouse: A 8-years follow-up experience". International Journal of Medical Informatics. 102: 21–28. doi:10.1016/j.ijmedinf.2017.02.006. ISSN 1386-5056.
  11. ^ Levy, Brian. "Health Care's Semantics Challenge". www.fortherecordmag.com. Great Valley Publishing Company. Retrieved 2018-10-04.
  12. ^ Goodwin, Linda K.; Prather, Jonathan C. (2002). "Protecting patient privacy in clinical data mining". Journal of healthcare information management: JHIM. 16 (4): 62–67. ISSN 1099-811X. PMID 12365302.
  13. ^ Tucker, Katherine; Branson, Janice; Dilleen, Maria; Hollis, Sally; Loughlin, Paul; Nixon, Mark J.; Williams, Zoë (2016). "Protecting patient privacy when sharing patient-level data from clinical trials". BMC Medical Research Methodology. 16 (S1). doi:10.1186/s12874-016-0169-4. ISSN 1471-2288. PMC 4943495. PMID 27410040.
  14. ^ Graves, Stuart (2013). "Confidentiality, electronic health records, and the clinician". Perspectives in Biology and Medicine. 56 (1): 105–125. doi:10.1353/pbm.2013.0003. ISSN 1529-8795. PMID 23748530.
  15. ^ Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A. "Overview of the chemical compound and drug name recognition (CHEMDNER) task" (PDF). Proceedings of the Fourth BioCreative Challenge Evaluation Workshop. 2: 6–37.
  16. ^ Rodriguez-Esteban, Raul (2009-12-24). "Biomedical Text Mining and Its Applications". PLoS Computational Biology. 5 (12): e1000597. doi:10.1371/journal.pcbi.1000597. ISSN 1553-7358. PMC 2791166. PMID 20041219.
  17. ^ Badal VD, Kundrotas PJ, Vakser IA (December 2015). "Text Mining for Protein Docking". PLoS Computational Biology. 11 (12): e1004630. doi:10.1371/journal.pcbi.1004630. PMC 4674139. PMID 26650466.
  18. ^ Papanikolaou N, Pavlopoulos GA, Theodosiou T, Iliopoulos I (March 2015). "Protein-protein interaction predictions using text mining methods". Methods. 74: 47–53. doi:10.1016/j.ymeth.2014.10.026. PMID 25448298.
  19. ^ Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, et al. (January 2017). "The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible". Nucleic Acids Research. 45 (D1): D362–D368. doi:10.1093/nar/gkw937. PMC 5210637. PMID 27924014.
  20. ^ a b Liem DA, Murali S, Sigdel D, Shi Y, Wang X, Shen J, et al. (October 2018). "Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease". American Journal of Physiology. Heart and Circulatory Physiology. 315 (4): H910–H924. doi:10.1152/ajpheart.00175.2018. PMID 29775406.
  21. ^ Kankar, P.; Adak, S.; Sarkar, A.; Murari, K.; Sharma, G. (2002-04-11), "MedMeSH Summarizer: Text Mining for Gene Clusters", Proceedings of the 2002 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, pp. 548–565, doi:10.1137/1.9781611972726.32, ISBN 9780898715170, retrieved 2018-10-02
  22. ^ Krallinger M, Leitner F, Valencia A (2010). "Analysis of biological processes and diseases using text mining approaches". Methods in Molecular Biology. 593: 341–82. doi:10.1007/978-1-60327-194-3_16. ISBN 978-1-60327-193-6. PMID 19957157.
  23. ^ Tao, Fangbo; Zhuang, Honglei; Yu, Chi Wang; Wang, Qi; Cassidy, Taylor; Kaplan, Lance; Voss, Clare; Han, Jiawei (2016). "Multi-Dimensional, Phrase-Based Summarization in Text Cubes" (PDF).
  24. ^ Thomas, P.; Starlinger, J.; Vowinkel, A.; Arzt, S.; Leser, U. (2012-06-12). "GeneView: a comprehensive semantic search engine for PubMed". Nucleic Acids Research. 40 (W1): W585–W591. doi:10.1093/nar/gks563. ISSN 0305-1048. PMC 3394277. PMID 22693219.
  25. ^ Brown, Peter; Zhou, Yaoqi (2017-09-06). "Biomedical literature: Testers wanted for article search tool". Nature. 549 (7670): 31. doi:10.1038/549031c. ISSN 0028-0836. PMID 28880292.
  26. ^ Ohno-Machado, Lucila; Sansone, Susanna-Assunta; Alter, George; Fore, Ian; Grethe, Jeffrey; Xu, Hua; Gonzalez-Beltran, Alejandra; Rocca-Serra, Philippe; Gururaj, Anupama E. (2017-05-26). "Finding useful data across multiple biomedical data repositories using DataMed". Nature Genetics. 49 (6): 816–819. doi:10.1038/ng.3864. ISSN 1546-1718. PMID 28546571.
  27. ^ Perez-Riverol, Yasset; Bai, Mingze; da Veiga Leprevost, Felipe; Squizzato, Silvano; Park, Young Mi; Haug, Kenneth; Carroll, Adam J; Spalding, Dylan; Paschall, Justin (2017-05-09). "Discovering and linking public omics data sets using the Omics Discovery Index". Nature Biotechnology. 35 (5): 406–409. doi:10.1038/nbt.3790. ISSN 1087-0156. PMC 5831141. PMID 28486464.
  28. ^ Ide, N. C.; Loane, R. F.; Demner-Fushman, D. (2007-05-01). "Essie: A Concept-based Search Engine for Structured Biomedical Text". Journal of the American Medical Informatics Association. 14 (3): 253–263. doi:10.1197/jamia.m2233. ISSN 1067-5027. PMC 2244877. PMID 17329729.
  29. ^ Lee, Hee-Jin; Dang, Tien Cuong; Lee, Hyunju; Park, Jong C. (2014-05-09). "OncoSearch: cancer gene search engine with literature evidence". Nucleic Acids Research. 42 (W1): W416–W421. doi:10.1093/nar/gku368. ISSN 1362-4962. PMC 4086113. PMID 24813447.
  30. ^ Jenssen, Tor-Kristian; Lægreid, Astrid; Komorowski, Jan; Hovig, Eivind (2001). "A literature network of human genes for high-throughput analysis of gene expression". Nature Genetics. 28 (1): 21–8. doi:10.1038/ng0501-21. PMID 11326270.
  31. ^ Masys, Daniel R. (2001). "Linking microarray data to the literature". Nature Genetics. 28 (1): 9–10. doi:10.1038/ng0501-9. PMID 11326264.
  32. ^ Doms, Andreas; Schroeder, Michael (2005-07-01). "GoPubMed: exploring PubMed with the Gene Ontology". Nucleic Acids Research. 33 (Web Server issue): W783–786. doi:10.1093/nar/gki470. ISSN 1362-4962. PMC 1160231. PMID 15980585.

Further reading

Conferences

External links