Biomedical text mining: Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 22:26, 6 October 2018

Biomedical text mining (also known as biomedical natural language processing or BioNLP) refers to the methods and study of how text mining may be applied to texts and literature of the biomedical and molecular biology domains. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies developed through studies in this field are frequently applied to the biomedical and molecular biology literature available through services such as PubMed.

Challenges

Applying text mining approaches to biomedical text presents specific challenges common to the domain.

Availability of annotated text data

Large annotated corpora used in the development and training of general purpose text mining methods (e.g., sets of movie dialogue,[1] product reviews,[2] or Wikipedia article text) are not specific to biomedical language. While they may provide evidence of general text properties such as parts of speech, they rarely contain concepts of interest to biologists or clinicians. Development of new methods to identify features specific to biomedical documents therefore requires assembly of specialized corpora.[3] Resources designed to aid in building new biomedical text mining methods have been developed through the Informatics for Integrating Biology and the Bedside (i2b2) challenges[4][5][6] and by biomedical informatics researchers.[7][8] Text mining researchers frequently combine these corpora with the controlled vocabularies and ontologies available through the National Library of Medicine's Unified Medical Language System (UMLS) and Medical Subject Headings (MeSH).

Uncertainty

Biomedical literature contains statements about observations that may not be statements of fact. This text may express uncertainty or skepticism about claims. Without specific adaptations, text mining approaches designed to identify claims within text may mis-characterize these "hedged" statements as facts.[9]

Supporting clinical needs

Biomedical text mining applications developed for clinical use should ideally reflect the needs and demands of clinicians.[3] This is a concern in environments where clinical decision support is expected to be informative and accurate.

Working with other clinical systems

New text mining systems must be interoperable with existing standards, electronic medical records, and databases.[3] Methods for interfacing with clinical systems such as LOINC have been developed[10] but require extensive organizational effort to implement and maintain.[11][12]

Patient privacy

Text mining systems operating with private medical data must respect its security and ensure it is rendered anonymous where appropriate.[13][14][15]
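
As a minimal illustration of rendering text anonymous, the sketch below masks a few surface patterns (dates, phone numbers, and record numbers) with placeholder tags. The patterns and the `deidentify` function name are assumptions made for this example. Real de-identification systems, such as those evaluated in the i2b2 challenges, cover many more categories of protected health information and do not rely on regular expressions alone.

```python
import re

# Hypothetical patterns for illustration only; not a complete PHI inventory.
PHI_PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("MRN", re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE)),
]


def deidentify(text: str) -> str:
    """Replace each matched span with a bracketed category tag."""
    for label, pattern in PHI_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Seen on 03/14/2012, MRN: 448812. Call 555-123-4567 to follow up."
print(deidentify(note))
```

Pattern matching of this kind misses names, addresses, and free-form identifiers, so it should be read as a starting point rather than a safeguard.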

Processes

Specific text mining subtasks are of particular concern when processing biomedical text.

Named entity recognition

Developments in biomedical text mining have incorporated identification of biological entities with named entity recognition, or NER. Names and identifiers for biomolecules such as proteins and genes,[16] chemical compounds and drugs,[17] and disease names[18] have all been used as entities. Most entity recognition methods are supported by pre-defined linguistic features or vocabularies, though methods incorporating deep learning and word embeddings have also been successful at biomedical NER.[19]
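
The vocabulary-supported approach can be sketched as a dictionary lookup. The example below is a toy under stated assumptions: the `VOCABULARY` entries, the type labels, and the `find_entities` helper are all invented for illustration, and production systems typically add normalization, disambiguation, and statistical or neural models on top of such matching.

```python
import re

# Toy vocabulary; the entries and type labels are assumptions for illustration.
VOCABULARY = {
    "tp53": "GENE",
    "brca1": "GENE",
    "aspirin": "CHEMICAL",
    "breast cancer": "DISEASE",
}


def find_entities(text: str):
    """Return (start, end, surface form, type) for each vocabulary match."""
    # Longest terms first so multi-word names win over shorter overlaps.
    terms = sorted(VOCABULARY, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )
    return [
        (m.start(), m.end(), m.group(0), VOCABULARY[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Mutations in BRCA1 and TP53 are linked to breast cancer."
for entity in find_entities(sentence):
    print(entity)
```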

Relationship discovery

Biomedical documents describe connections between concepts, whether they are interactions between biomolecules, events occurring subsequently over time (i.e., temporal relationships), or causal relationships. Text mining methods may perform relation discovery to identify these connections, often in concert with named entity recognition.[20]
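
One simple baseline for relation discovery is sentence-level co-occurrence: two entities mentioned in the same sentence are counted as a candidate pair. The sketch below assumes a fixed, hypothetical entity list (`ENTITIES`) in place of a real named entity recognition step, and it does not determine the type or direction of a relationship.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity list; in practice this would come from an NER step.
ENTITIES = ["BRCA1", "RAD51", "TP53", "MDM2"]


def cooccurrence_pairs(text: str) -> Counter:
    """Count unordered entity pairs that appear in the same sentence."""
    counts = Counter()
    # Naive sentence split on terminal punctuation; adequate for a sketch.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        present = sorted(
            {e for e in ENTITIES if re.search(rf"\b{re.escape(e)}\b", sentence)}
        )
        counts.update(combinations(present, 2))
    return counts


abstract = (
    "BRCA1 interacts with RAD51 during repair. "
    "MDM2 binds TP53. "
    "RAD51 foci depend on BRCA1."
)
print(cooccurrence_pairs(abstract))
```

Pairs that recur across many sentences are stronger candidates for a genuine relationship, though co-occurrence alone cannot distinguish an interaction from a negated or speculative mention.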

Hedge cue detection

The challenge of identifying uncertain or "hedged" statements has been addressed through hedge cue detection in biomedical literature.[9]
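
A lexicon lookup gives a rough sense of what a hedge cue detector does. Note that this is a deliberately simpler technique than the conditional random field approach of the cited work: the cue list below is a small assumption made for the example, and the sketch does not attempt to resolve the scope of each cue.

```python
import re

# Small illustrative cue list; an assumption, not drawn from a published lexicon.
HEDGE_CUES = [
    "may", "might", "could", "suggest", "suggests",
    "possibly", "appears to", "likely",
]

# Longer cues are tried first so that "suggests" is preferred over "suggest".
CUE_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(c) for c in sorted(HEDGE_CUES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def hedge_cues(sentence: str):
    """Return the hedge cues found in a sentence, in order of appearance."""
    return [m.group(0).lower() for m in CUE_PATTERN.finditer(sentence)]


def is_hedged(sentence: str) -> bool:
    """True when the sentence contains at least one cue from the list."""
    return bool(hedge_cues(sentence))


print(hedge_cues("These results suggest that the kinase may regulate apoptosis."))
print(is_hedged("The kinase phosphorylates the substrate."))
```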

Claim detection

Multiple researchers have developed methods to identify specific scientific claims from literature.[21][22]

Corpora

The following table lists a selection of biomedical text corpora and their contents. These items include annotated corpora, sources of biomedical research literature, and resources frequently used as vocabulary and/or ontology references, such as MeSH. Items marked "freely available" can be downloaded from a publicly accessible location and used without registration or prior agreement.

Biomedical Text Corpora

| Corpus Name | Authors or Group | Contents | Freely Available | Citation |
| --- | --- | --- | --- | --- |
| 2006 i2b2 Deidentification and Smoking Challenge | i2b2 | 889 de-identified medical discharge summaries annotated for patient identification and smoking status features. | No | [23][24] |
| 2008 i2b2 Obesity Challenge | i2b2 | 1,237 de-identified medical discharge summaries annotated for presence or absence of comorbidities of obesity. | No | [25] |
| 2009 i2b2 Medication Challenge | i2b2 | 1,243 de-identified medical discharge summaries annotated for names and details of medications, including dosage, mode, frequency, duration, reason, and presence in a list or narrative structure. | No | [26][27] |
| 2010 i2b2 Relations Challenge | i2b2 | Medical discharge summaries annotated for medical problems, tests, treatments, and the relations among these concepts. Only a subset of these data records are available for research use due to IRB limitations. | No | [4] |
| 2011 i2b2 Coreference Challenge | i2b2 | 978 de-identified medical discharge summaries, progress notes, and other clinical reports annotated with concepts and coreferences. Includes the ODIE corpus. | No | [28] |
| 2012 i2b2 Temporal Relations Challenge | i2b2 | 310 de-identified medical discharge summaries annotated for events and temporal relations. | No | [5] |
| 2014 i2b2 De-identification Challenge | i2b2 | 1,304 de-identified longitudinal medical records annotated for protected health information (PHI). | No | [29] |
| 2014 i2b2 Heart Disease Risk Factors Challenge | i2b2 | 1,304 de-identified longitudinal medical records annotated for risk factors for coronary artery disease. | No | [30] |
| AIMed | Bunescu et al. | 200 abstracts annotated for protein–protein interactions, as well as negative example abstracts containing no protein–protein interactions. | Yes | [31] |
| BioCreAtIvE 1 | BioCreAtIvE | 15,000 sentences (10,000 training and 5,000 test) annotated for protein and gene names. 1,000 full text biomedical research articles annotated with protein names and Gene Ontology terms. | Yes | [32] |
| BioCreAtIvE 2 | BioCreAtIvE | 15,000 sentences (10,000 training and 5,000 test, different from the first corpus) annotated for protein and gene names. 542 abstracts linked to EntrezGene identifiers. A variety of research articles annotated for features of protein–protein interactions. | Yes | [33] |
| BioInfer | Pyysalo et al. | 1,100 sentences from biomedical research abstracts annotated for relationships, named entities, and syntactic dependencies. | No | [34] |
| BioScope | Vincze et al. | 1,954 clinical reports, 9 papers, and 1,273 abstracts annotated for linguistic scope and terms denoting negation or uncertainty. | Yes | [35] |
| CRAFT | Verspoor et al. | 97 full-text biomedical publications annotated with linguistic structures and biological concepts. | Yes | [36] |
| GENIA Corpus | GENIA Project | 1,999 biomedical research abstracts on the topics "human", "blood cells", and "transcription factors", annotated for parts of speech, syntax, terms, events, relations, and coreferences. | Yes | [37][38] |
| FamPlex | Bachman et al. | Protein names and families linked to unique identifiers. Includes affix sets. | Yes | [39] |
| IEPA | Ding et al. | 486 sentences from biomedical research abstracts annotated for pairs of co-occurring chemicals, including proteins. | No | [40] |
| Learning Language in Logic (LLL) | Nédellec et al. | 77 sentences from research articles about the bacterium Bacillus subtilis, annotated for protein–gene interactions. | Yes | [41] |
| ODIE Corpus | Savova et al. | 180 clinical notes annotated with 5,992 coreference pairs. | No | [42] |
| OHSUMED | Hersh et al. | 348,566 biomedical research abstracts and indexing information from MEDLINE, including MeSH (as of 1991). | Yes | [43] |
| PMC Open Access Subset | PubMed Central | More than 2 million research articles, updated weekly. | Yes | [44] |
| Yapex | Franzén et al. | 200 biomedical research abstracts annotated with protein names. | No | [45] |

Applications

Figure: Flowchart of an example text mining protocol used in a study of protein-protein complexes, or protein docking.[46]

Text mining applications in the biomedical field include computational approaches to assist with studies in protein docking,[47] protein interactions,[48][49] and protein-disease associations.[50]

Gene cluster identification

Methods for determining the association of gene clusters obtained by microarray experiments with the biological context provided by the corresponding literature have been developed.[51]

Protein interactions

Automatic extraction of protein interactions[52] and of associations between proteins and functional concepts (e.g., Gene Ontology terms) has been explored.[citation needed] Information extraction and text mining methods have also been applied to extracting kinetic parameters and the subcellular locations of proteins from text.

Protein-disease associations

Text mining enables an unbiased evaluation of protein-disease relationships within a vast quantity of unstructured textual data.[53]

Applications of phrase mining to disease associations

A text mining study assembled a collection of 709 core extracellular matrix proteins and associated proteins based on two databases: MatrixDB (matrixdb.univ-lyon1.fr) and UniProt. This set of proteins had a manageable size and a rich body of associated information, making it suitable for the application of text mining tools. The researchers conducted phrase-mining analysis to cross-examine individual extracellular matrix proteins across the biomedical literature concerned with six categories of cardiovascular diseases. Using a phrase-mining pipeline, Context-aware Semantic Online Analytical Processing (CaseOLAP),[54] they semantically scored all 709 proteins according to their Integrity, Popularity, and Distinctiveness. The study validated existing relationships and pointed to previously unrecognized biological processes in cardiovascular pathophysiology.[50]

Software tools

Search engines

Search engines designed to retrieve biomedical literature relevant to a user-provided query frequently rely upon text mining approaches. Publicly available tools specific to research literature include PubMed search, Europe PubMed Central search, GeneView,[55] and APSE.[56] Similarly, search engines and indexing systems specific to biomedical data have been developed, including DataMed[57] and OmicsDI.[58]
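
PubMed search can also be queried programmatically through the NCBI E-utilities ESearch endpoint. The sketch below only constructs a query URL and performs no network request. The `build_pubmed_query` helper is a hypothetical name chosen for this example, and the query term is arbitrary.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(term: str, max_results: int = 20) -> str:
    """Build an ESearch URL that asks for matching PubMed IDs as JSON."""
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": term,          # query string, Boolean operators allowed
        "retmax": max_results, # maximum number of IDs to return
        "retmode": "json",     # response format
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


print(build_pubmed_query("protein docking AND text mining", max_results=5))
```

Fetching the resulting URL with any HTTP client should return a list of PubMed identifiers. Anyone doing so should consult NCBI's usage guidelines on request rates and API keys.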

Some search engines, such as Essie,[59] OncoSearch,[60] PubGene,[61][62] and GoPubMed[63] were previously public but have since been discontinued, rendered obsolete, or integrated into commercial products.

Medical record analysis systems

Electronic medical records (EMRs) and electronic health records (EHRs) are collected by clinical staff in the course of diagnosis and treatment. Though these records generally include structured components with predictable formats and data types, the remainder of the reports are often free-text. Numerous complete systems and tools have been developed to analyse these free-text portions.[64] The MedLEE system was originally developed for analysis of chest radiology reports but later extended to other report topics.[65] The clinical Text Analysis and Knowledge Extraction System, or cTAKES, annotates clinical text using a dictionary of concepts.[66] The CLAMP system offers similar functionality with a user-friendly interface.[67]

References

  1. ^ Danescu-Niculescu-Mizil, Cristian; Lee, Lillian (2011). Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs. pp. 76–87. ISBN 978-1-932432-95-4.
  2. ^ McAuley, Julian; Leskovec, Jure (2013-10-12). Hidden factors and hidden topics: understanding rating dimensions with review text. ACM. pp. 165–172. doi:10.1145/2507157.2507163. ISBN 978-1-4503-2409-0.
  3. ^ a b c Ohno-Machado L, Nadkarni P, Johnson K (2013). "Natural language processing: algorithms and tools to extract computable information from EHRs and from the biomedical literature". Journal of the American Medical Informatics Association. 20 (5): 805. doi:10.1136/amiajnl-2013-002214. PMC 3756279. PMID 23935077.
  4. ^ a b Uzuner Ö, South BR, Shen S, DuVall SL (2011). "2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text". Journal of the American Medical Informatics Association. 18 (5): 552–6. doi:10.1136/amiajnl-2011-000203. PMC 3168320. PMID 21685143.
  5. ^ a b Sun W, Rumshisky A, Uzuner O (2013). "Evaluating temporal relations in clinical text: 2012 i2b2 Challenge". Journal of the American Medical Informatics Association. 20 (5): 806–13. doi:10.1136/amiajnl-2013-001628. PMC 3756273. PMID 23564629.
  6. ^ Stubbs A, Kotfila C, Uzuner Ö (December 2015). "Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1". Journal of Biomedical Informatics. 58 Suppl: S11-9. doi:10.1016/j.jbi.2015.06.007. PMC 4989908. PMID 26225918.
  7. ^ Albright D, Lanfranchi A, Fredriksen A, Styler WF, Warner C, Hwang JD, Choi JD, Dligach D, Nielsen RD, Martin J, Ward W, Palmer M, Savova GK (2013). "Towards comprehensive syntactic and semantic annotations of the clinical narrative". Journal of the American Medical Informatics Association. 20 (5): 922–30. doi:10.1136/amiajnl-2012-001317. PMC 3756257. PMID 23355458.
  8. ^ Bada M, Eckert M, Evans D, Garcia K, Shipley K, Sitnikov D, Baumgartner WA, Cohen KB, Verspoor K, Blake JA, Hunter LE (July 2012). "Concept annotation in the CRAFT corpus". BMC Bioinformatics. 13 (1): 161. doi:10.1186/1471-2105-13-161. PMC 3476437. PMID 22776079.
  9. ^ a b Agarwal S, Yu H (December 2010). "Detecting hedge cues and their scope in biomedical text with conditional random fields". Journal of Biomedical Informatics. 43 (6): 953–61. doi:10.1016/j.jbi.2010.08.003. PMC 2991497. PMID 20709188.
  10. ^ Vandenbussche PY, Cormont S, André C, Daniel C, Delahousse J, Charlet J, Lepage E (2013). "Implementation and management of a biomedical observation dictionary in a large healthcare information system". Journal of the American Medical Informatics Association. 20 (5): 940–6. doi:10.1136/amiajnl-2012-001410. PMC 3756262. PMID 23635601.
  11. ^ Jannot AS, Zapletal E, Avillach P, Mamzer MF, Burgun A, Degoulet P (June 2017). "The Georges Pompidou University Hospital Clinical Data Warehouse: A 8-years follow-up experience". International Journal of Medical Informatics. 102: 21–28. doi:10.1016/j.ijmedinf.2017.02.006. PMID 28495345.
  12. ^ Levy, Brian. "Health Care's Semantics Challenge". www.fortherecordmag.com. Great Valley Publishing Company. Retrieved 2018-10-04.
  13. ^ Goodwin LK, Prather JC (2002). "Protecting patient privacy in clinical data mining". Journal of Healthcare Information Management. 16 (4): 62–7. PMID 12365302.
  14. ^ Tucker K, Branson J, Dilleen M, Hollis S, Loughlin P, Nixon MJ, Williams Z (July 2016). "Protecting patient privacy when sharing patient-level data from clinical trials". BMC Medical Research Methodology. 16 Suppl 1 (S1): 77. doi:10.1186/s12874-016-0169-4. PMC 4943495. PMID 27410040.
  15. ^ Graves S (2013). "Confidentiality, electronic health records, and the clinician". Perspectives in Biology and Medicine. 56 (1): 105–25. doi:10.1353/pbm.2013.0003. PMID 23748530.
  16. ^ Leser U, Hakenberg J (2005-01-01). "What makes a gene name? Named entity recognition in the biomedical literature". Briefings in Bioinformatics. 6 (4): 357–369. doi:10.1093/bib/6.4.357. ISSN 1467-5463.
  17. ^ Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A. "Overview of the chemical compound and drug name recognition (CHEMDNER) task" (PDF). Proceedings of the Fourth BioCreative Challenge Evaluation Workshop. 2: 6–37.
  18. ^ Jimeno A, Jimenez-Ruiz E, Lee V, Gaudan S, Berlanga R, Rebholz-Schuhmann D (April 2008). "Assessment of disease named entity recognition on a corpus of annotated sentences". BMC Bioinformatics. 9 Suppl 3 (Suppl 3): S3. doi:10.1186/1471-2105-9-s3-s3. PMC 2352871. PMID 18426548.
  19. ^ Habibi M, Weber L, Neves M, Wiegandt DL, Leser U (July 2017). "Deep learning with word embeddings improves biomedical named entity recognition". Bioinformatics. 33 (14): i37–i48. doi:10.1093/bioinformatics/btx228. PMC 5870729. PMID 28881963.
  20. ^ Rodriguez-Esteban R (December 2009). "Biomedical text mining and its applications". PLoS Computational Biology. 5 (12): e1000597. doi:10.1371/journal.pcbi.1000597. PMC 2791166. PMID 20041219.
  21. ^ Blake C (April 2010). "Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles". Journal of Biomedical Informatics. 43 (2): 173–89. doi:10.1016/j.jbi.2009.11.001. PMID 19900574.
  22. ^ Alamri, Abdulaziz; Stevenson, Mark (2015). "Automatic identification of potentially contradictory claims to support systematic reviews". 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE. doi:10.1109/bibm.2015.7359808. ISBN 978-1-4673-6799-8.
  23. ^ Uzuner O, Luo Y, Szolovits P (2007-09-01). "Evaluating the state-of-the-art in automatic de-identification". Journal of the American Medical Informatics Association. 14 (5): 550–63. doi:10.1197/jamia.m2444. PMC 1975792. PMID 17600094.
  24. ^ Uzuner O, Goldstein I, Luo Y, Kohane I (2008-01-01). "Identifying patient smoking status from medical discharge records". Journal of the American Medical Informatics Association. 15 (1): 14–24. doi:10.1197/jamia.m2408. PMC 2274873. PMID 17947624.
  25. ^ Uzuner O (2009). "Recognizing obesity and comorbidities in sparse data". Journal of the American Medical Informatics Association. 16 (4): 561–70. doi:10.1197/jamia.M3115. PMC 2705260. PMID 19390096.
  26. ^ Uzuner O, Solti I, Xia F, Cadag E (2010). "Community annotation experiment for ground truth generation for the i2b2 medication challenge". Journal of the American Medical Informatics Association. 17 (5): 519–23. doi:10.1136/jamia.2010.004200. PMC 2995684. PMID 20819855.
  27. ^ Uzuner O, Solti I, Cadag E (2010). "Extracting medication information from clinical text". Journal of the American Medical Informatics Association. 17 (5): 514–8. doi:10.1136/jamia.2010.003947. PMC 2995677. PMID 20819854.
  28. ^ Uzuner O, Bodnari A, Shen S, Forbush T, Pestian J, South BR (2012). "Evaluating the state of the art in coreference resolution for electronic medical records". Journal of the American Medical Informatics Association. 19 (5): 786–91. doi:10.1136/amiajnl-2011-000784. PMC 3422835. PMID 22366294.
  29. ^ Stubbs, Amber; Uzuner, Özlem (2015). "Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus". Journal of Biomedical Informatics. 58 Suppl: S20–29. doi:10.1016/j.jbi.2015.07.020. ISSN 1532-0480. PMC 4978170. PMID 26319540.
  30. ^ Stubbs, Amber; Uzuner, Özlem (2015). "Annotating risk factors for heart disease in clinical narratives for diabetic patients". Journal of Biomedical Informatics. 58 Suppl: S78–91. doi:10.1016/j.jbi.2015.05.009. ISSN 1532-0480. PMC 4978180. PMID 26004790.
  31. ^ Bunescu, Razvan; Ge, Ruifang; Kate, Rohit J.; Marcotte, Edward M.; Mooney, Raymond J.; Ramani, Arun K.; Wong, Yuk Wah (2005). "Comparative experiments on learning information extractors for proteins and their interactions". Artificial Intelligence in Medicine. 33 (2): 139–155. doi:10.1016/j.artmed.2004.07.016. ISSN 0933-3657. PMID 15811782.
  32. ^ Hirschman, Lynette; Yeh, Alexander; Blaschke, Christian; Valencia, Alfonso (2005). "Overview of BioCreAtIvE: critical assessment of information extraction for biology". BMC Bioinformatics. 6 Suppl 1: S1. doi:10.1186/1471-2105-6-S1-S1. ISSN 1471-2105. PMC 1869002. PMID 15960821.
  33. ^ Krallinger, Martin; Morgan, Alexander; Smith, Larry; Leitner, Florian; Tanabe, Lorraine; Wilbur, John; Hirschman, Lynette; Valencia, Alfonso (2008). "Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge". Genome Biology. 9 (Suppl 2): S1. doi:10.1186/gb-2008-9-s2-s1. ISSN 1465-6906. PMC 2559980. PMID 18834487.
  34. ^ Pyysalo, Sampo; Ginter, Filip; Heimonen, Juho; Björne, Jari; Boberg, Jorma; Järvinen, Jouni; Salakoski, Tapio (2007). "BioInfer: a corpus for information extraction in the biomedical domain". BMC Bioinformatics. 8 (1): 50. doi:10.1186/1471-2105-8-50. ISSN 1471-2105. PMC 1808065. PMID 17291334.
  35. ^ Vincze V, Szarvas G, Farkas R, Móra G, Csirik J (November 2008). "The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes". BMC Bioinformatics. 9 Suppl 11 (Suppl 11): S9. doi:10.1186/1471-2105-9-s11-s9. PMC 2586758. PMID 19025695.
  36. ^ Verspoor K, Cohen KB, Lanfranchi A, Warner C, Johnson HL, Roeder C, Choi JD, Funk C, Malenkiy Y, Eckert M, Xue N, Baumgartner WA, Bada M, Palmer M, Hunter LE (August 2012). "A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools". BMC Bioinformatics. 13 (1): 207. doi:10.1186/1471-2105-13-207. PMC 3483229. PMID 22901054.
  37. ^ Kim, J.-D.; Ohta, T.; Tateisi, Y.; Tsujii, J. (2003-07-03). "GENIA corpus--a semantically annotated corpus for bio-textmining". Bioinformatics. 19 (Suppl 1): i180–i182. doi:10.1093/bioinformatics/btg1023. ISSN 1367-4803.
  38. ^ "GENIA Project". www.geniaproject.org. Retrieved 2018-10-06.
  39. ^ Bachman, John A.; Gyori, Benjamin M.; Sorger, Peter K. (2018-06-28). "FamPlex: a resource for entity recognition and relationship resolution of human protein families and complexes in biomedical text mining". BMC Bioinformatics. 19 (1). doi:10.1186/s12859-018-2211-5. ISSN 1471-2105. PMC 6022344. PMID 29954318.
  40. ^ Ding, J.; Berleant, D.; Nettleton, D.; Wurtele, E. (2001). "Mining MEDLINE: Abstracts, Sentences, or Phrases?". Biocomputing 2002. World Scientific. doi:10.1142/9789812799623_0031. ISBN 9789810247775.
  41. ^ "LLLchallenge". genome.jouy.inra.fr. Retrieved 2018-10-06.
  42. ^ Savova GK, Chapman WW, Zheng J, Crowley RS (2011). "Anaphoric relations in the clinical narrative: corpus creation". Journal of the American Medical Informatics Association. 18 (4): 459–65. doi:10.1136/amiajnl-2011-000108. PMC 3128403. PMID 21459927.
  43. ^ Hersh, William; Buckley, Chris; Leone, T. J.; Hickam, David (1994), "OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research", SIGIR ’94, Springer London, pp. 192–201, doi:10.1007/978-1-4471-2099-5_20, ISBN 9783540198895, retrieved 2018-10-06
  44. ^ "Open Access Subset". www.ncbi.nlm.nih.gov. Retrieved 2018-10-06.
  45. ^ Franzén, Kristofer; Eriksson, Gunnar; Olsson, Fredrik; Asker, Lars; Lidén, Per; Cöster, Joakim (2002). "Protein names and how to find them". International Journal of Medical Informatics. 67 (1–3): 49–61. doi:10.1016/s1386-5056(02)00052-7. ISSN 1386-5056.
  46. ^ Badal VD, Kundrotas PJ, Vakser IA (December 2015). "Text Mining for Protein Docking". PLoS Computational Biology. 11 (12): e1004630. doi:10.1371/journal.pcbi.1004630. PMC 4674139. PMID 26650466.
  47. ^ Badal VD, Kundrotas PJ, Vakser IA (December 2015). "Text Mining for Protein Docking". PLoS Computational Biology. 11 (12): e1004630. doi:10.1371/journal.pcbi.1004630. PMC 4674139. PMID 26650466.
  48. ^ Papanikolaou N, Pavlopoulos GA, Theodosiou T, Iliopoulos I (March 2015). "Protein-protein interaction predictions using text mining methods". Methods. 74: 47–53. doi:10.1016/j.ymeth.2014.10.026. PMID 25448298.
  49. ^ Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, Santos A, Doncheva NT, Roth A, Bork P, Jensen LJ, von Mering C (January 2017). "The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible". Nucleic Acids Research. 45 (D1): D362–D368. doi:10.1093/nar/gkw937. PMC 5210637. PMID 27924014.
  50. ^ a b Liem DA, Murali S, Sigdel D, Shi Y, Wang X, Shen J, Choi H, Caufield JH, Wang W, Ping P, Han J (October 2018). "Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease". American Journal of Physiology. Heart and Circulatory Physiology. 315 (4): H910–H924. doi:10.1152/ajpheart.00175.2018. PMID 29775406.
  51. ^ Kankar P, Adak S, Sarkar A, Murari K, Sharma G (11 April 2002). MedMeSH summarizer: text mining for gene clusters. In Proceedings of the 2002 SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics. pp. 548–565. CiteSeerX 10.1.1.215.6230. doi:10.1137/1.9781611972726.32. ISBN 978-0-89871-517-0.
  52. ^ Pyysalo, Sampo; Airola, Antti; Heimonen, Juho; Björne, Jari; Ginter, Filip; Salakoski, Tapio (2008). "Comparative analysis of five protein-protein interaction corpora". BMC Bioinformatics. 9 (Suppl 3): S6. doi:10.1186/1471-2105-9-s3-s6. ISSN 1471-2105. PMC 2349296. PMID 18426551.
  53. ^ Krallinger M, Leitner F, Valencia A (2010). "Analysis of biological processes and diseases using text mining approaches". Methods in Molecular Biology. 593: 341–82. doi:10.1007/978-1-60327-194-3_16. ISBN 978-1-60327-193-6. PMID 19957157.
  54. ^ Tao F, Zhuang H, Yu CW, Wang Q, Cassidy T, Kaplan LR, Voss CR, Han J (2016). "Multi-Dimensional, Phrase-Based Summarization in Text Cubes" (PDF). IEEE Data Eng. Bull. 39 (3): 74–84.
  55. ^ Thomas P, Starlinger J, Vowinkel A, Arzt S, Leser U (July 2012). "GeneView: a comprehensive semantic search engine for PubMed". Nucleic Acids Research. 40 (Web Server issue): W585-91. doi:10.1093/nar/gks563. PMC 3394277. PMID 22693219.
  56. ^ Brown P, Zhou Y (September 2017). "Biomedical literature: Testers wanted for article search tool". Nature. 549 (7670): 31. doi:10.1038/549031c. PMID 28880292.
  57. ^ Ohno-Machado L, Sansone SA, Alter G, Fore I, Grethe J, Xu H, Gonzalez-Beltran A, Rocca-Serra P, Gururaj AE, Bell E, Soysal E, Zong N, Kim HE (May 2017). "Finding useful data across multiple biomedical data repositories using DataMed". Nature Genetics. 49 (6): 816–819. doi:10.1038/ng.3864. PMID 28546571.
  58. ^ Perez-Riverol Y, Bai M, da Veiga Leprevost F, Squizzato S, Park YM, Haug K, et al. (May 2017). "Discovering and linking public omics data sets using the Omics Discovery Index". Nature Biotechnology. 35 (5): 406–409. doi:10.1038/nbt.3790. PMC 5831141. PMID 28486464.
  59. ^ Ide NC, Loane RF, Demner-Fushman D (2007-05-01). "Essie: a concept-based search engine for structured biomedical text". Journal of the American Medical Informatics Association. 14 (3): 253–63. doi:10.1197/jamia.m2233. PMC 2244877. PMID 17329729.
  60. ^ Lee HJ, Dang TC, Lee H, Park JC (July 2014). "OncoSearch: cancer gene search engine with literature evidence". Nucleic Acids Research. 42 (Web Server issue): W416-21. doi:10.1093/nar/gku368. PMC 4086113. PMID 24813447.
  61. ^ Jenssen TK, Laegreid A, Komorowski J, Hovig E (May 2001). "A literature network of human genes for high-throughput analysis of gene expression". Nature Genetics. 28 (1): 21–8. doi:10.1038/ng0501-21. PMID 11326270.
  62. ^ Masys DR (May 2001). "Linking microarray data to the literature". Nature Genetics. 28 (1): 9–10. doi:10.1038/ng0501-9. PMID 11326264.
  63. ^ Doms A, Schroeder M (July 2005). "GoPubMed: exploring PubMed with the Gene Ontology". Nucleic Acids Research. 33 (Web Server issue): W783-6. doi:10.1093/nar/gki470. PMC 1160231. PMID 15980585.
  64. ^ Wang Y, Wang L, Rastegar-Mojarad M, Moon S, Shen F, Afzal N, Liu S, Zeng Y, Mehrabi S, Sohn S, Liu H (January 2018). "Clinical information extraction applications: A literature review". Journal of Biomedical Informatics. 77: 34–49. doi:10.1016/j.jbi.2017.11.011. PMC 5771858. PMID 29162496.
  65. ^ Friedman C (1997). "Towards a comprehensive medical language processing system: methods and issues". Proceedings: 595–9. PMC 2233560. PMID 9357695.
  66. ^ Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, Chute CG (2010). "Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications". Journal of the American Medical Informatics Association. 17 (5): 507–13. doi:10.1136/jamia.2009.001560. PMC 2995668. PMID 20819853.
  67. ^ Soysal E, Wang J, Jiang M, Wu Y, Pakhomov S, Liu H, Xu H (November 2017). "CLAMP - a toolkit for efficiently building customized clinical natural language processing pipelines". Journal of the American Medical Informatics Association. 25 (3): 331–336. doi:10.1093/jamia/ocx132. PMID 29186491.
