Text mining: Difference between revisions

Revision as of 10:59, 28 January 2008

Text mining, sometimes called text data mining, is the process of deriving high-quality information from text. High-quality information is typically derived by devising patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
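The three stages described above (structuring, pattern derivation, interpretation) can be illustrated with a minimal Python sketch. The stop-word list, the function names `structure` and `document_frequency`, and the three-sentence corpus are all assumptions made up for this illustration; real systems use far richer linguistic processing.

```python
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def structure(text):
    """Structuring step: lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def document_frequency(documents):
    """Pattern step: count in how many documents each term occurs."""
    df = Counter()
    for doc in documents:
        # set() ensures a term is counted once per document.
        df.update(set(structure(doc)))
    return df

# Invented toy corpus.
corpus = [
    "The gene regulates expression of the protein.",
    "Expression of the gene is high in the tissue.",
    "The review praised the movie.",
]

df = document_frequency(corpus)

# Interpretation step: terms shared by several documents hint at a common topic.
shared_terms = sorted(term for term, count in df.items() if count > 1)
print(shared_terms)
```

Here the first two documents share the terms "expression" and "gene", which a later clustering or categorization step could use to group them apart from the third.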

History

Labour-intensive manual text-mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to advance swiftly during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (over 80%) is currently stored as text, text mining is believed to have high potential commercial value. Increasing attention is being paid to multilingual data mining: the ability to gain information across languages and to cluster similar items from different linguistic sources according to their meaning.

Sentiment analysis

Sentiment analysis may, for example, involve analyzing movie reviews to estimate how favorable a review is toward a movie.^[1] Such an analysis may require a labeled data set or a labeling of the affectiveness of words. A resource for the affectiveness of words has been built for WordNet.^[2]
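As a rough illustration of the word-labeling approach, the following Python sketch scores a review by summing the polarity of its words. The `AFFECT` lexicon is a hypothetical, hand-made table invented for this example; it is not the WordNet-based resource mentioned above, and this is a simple lexicon lookup rather than the machine-learning classification named in the title of the cited paper.

```python
import re

# Hypothetical affect lexicon (word -> polarity), invented for illustration.
AFFECT = {
    "great": 1, "brilliant": 1, "enjoyable": 1,
    "dull": -1, "boring": -1, "awful": -1,
}

def review_score(review):
    """Sum the polarity of every word in the review; unknown words count as 0."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return sum(AFFECT.get(token, 0) for token in tokens)

def classify(review):
    """Map the summed score to a coarse label."""
    score = review_score(review)
    if score > 0:
        return "favorable"
    if score < 0:
        return "unfavorable"
    return "neutral"

print(classify("A brilliant and enjoyable film."))
print(classify("Dull plot, awful acting."))
```

A lookup of this kind ignores negation and context ("not brilliant" still scores as positive), which is one reason a labeled data set and a trained classifier may be preferred.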

Applications

Recently, text mining has been receiving attention in many areas.

Security applications

One of the largest text mining applications that exists is probably the classified ECHELON surveillance system. Additionally, many text mining software packages such as AeroText, Attensity and Expert System are marketed towards security applications, particularly analysis of plain text sources such as internet news.

Biomedical applications

A range of applications of text mining to the biomedical literature has been described.^[3] One example is PubGene, which combines biomedical text mining with network visualization as an Internet service.^[4]

Software and applications

Research and development departments of major companies, including IBM and Microsoft, are researching text mining techniques and developing programs to further automate the mining and analysis processes. Companies working in search and indexing are also investigating text mining software as a way to improve their results.

Academic applications

Text mining is important to publishers who hold large databases of information that must be indexed for retrieval. This is particularly true in scientific disciplines, in which highly specific information is often contained within written text. Initiatives have therefore been taken, such as Nature's proposal for an Open Text Mining Interface (OTMI) and NIH's common Journal Publishing Document Type Definition (DTD), that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.

Academic institutions have also become involved in the text mining initiative:

UK: The National Centre for Text Mining, a collaborative effort between the Universities of Manchester and Liverpool funded by the Joint Information Systems Committee (JISC) and two of the UK Research Councils, provides customised tools and research facilities and offers advice to the academic community. Initially focused on text mining in the biological and biomedical sciences, its research has since expanded into the social sciences.

USA: The School of Information at the University of California, Berkeley is developing a program called BioText to assist bioscience researchers in text mining and analysis.

Commercial software and applications

AeroText - provides a suite of text mining applications for content analysis. Content used can be in multiple languages.
AITellU - provides a range of text mining services and applications based on advanced artificial intelligence techniques.
Anderson Analytics - provider of text analytics and content analysis especially as it relates to consumer behavior.
Attensity - suite of text mining solutions that includes search, statistical and NLP based technologies for a variety of industries.
Autonomy - suite of text mining, clustering and categorization solutions for a variety of industries.
Carabao Language Kit - suite of components for text mining, categorization, sense disambiguation, idiom extraction, and named entity recognition, with tools to add a new language or edit existing one(s).
Clarabridge - text mining and categorization applications for customer, healthcare, and investigative analytics.
Clearforest - text mining software to extract meaning from various forms of textual information. (Clearforest was sold to Reuters)
Cortex Intelligence - text mining for Competitive Intelligence with Named Entity Recognition.
ConceptMine - concept-based text mining, document comparison and efficient indexing.
Crossminder - text mining company enabling multilingual searches and searches through semantic approximation.
Endeca Technologies - provides software to analyze and cluster unstructured text.
Evolutionary Software, Inc. - develops software for automatic extraction from financial reports and provides text mining consulting services.
Expert System S.p.A. - suite of semantic technologies and products for developers and knowledge managers.
Fair Isaac - leading provider of decision management solutions powered by advanced analytics (includes text analytics).
IBM TAKMI - research prototype.
IBM OmniFind Analytics Edition - commercial text mining software.
Inxight - provider of text analytics, search, and unstructured visualization technologies. (Inxight was sold to Business Objects, which was in turn sold to SAP AG in 2007)
Island Data - Real-time market intelligence from unstructured customer feedback.
Jane16 - Free online text analysis site, extraction of subject and summary of text.
Linguamatics - Intelligence from text with real-time, agile NLP.
MagentA Technology - provides software to recognise, organise, search and analyse large volumes of textual data.
Nstein Technologies - provider of text analytics, and asset/web content management technologies (media, e-publishing, online publishing).
Pertinence Mining - Automatic text summarization tools in many languages.
PolyAnalyst - commercial text mining software.
RapidMiner/YALE - open-source data and text mining software for scientific and commercial use.
SAS Enterprise Miner - commercial text mining software.
SPSS - provider of SPSS Text Analysis for Surveys, Text Mining for Clementine, LexiQuest Mine and LexiQuest Categorize, commercial text analytics software that can be used in conjunction with SPSS Predictive Analytics Solutions.
Swotti - Provides text mining search portal, services and applications based on Web 3.0.
TEMIS - software vendor providing information discovery solutions for the information intelligence needs of business corporations.
TextAnalyst - commercial text mining software
Textalyser - an online text analysis tool for generating text analysis statistics of web pages and other texts.
Topicalizer - an online text analysis tool for generating text analysis statistics of web pages and other texts.
Zoomhive - Text Mining, Self-structuring Asynchronous Learning Network, by means of a multi-keyword extraction algorithm (patent pending).
Zoomix - Self learning Text Mining software (patent pending).
XmlMiner - Combined text, data and structure mining for mixed data sources in XML.

Open-source software and applications

GATE - natural language processing and language engineering tool.
YALE/RapidMiner with its Word Vector Tool plugin - data and text mining software.
Pimiento - a text-mining application framework written in Java.

Implications

Until recently, websites most often used text-based lexical searches; in other words, users could find documents only by the words that happened to occur in them. Text mining may allow searches to be answered directly by the semantic web; users may be able to search for content based on its meaning and context, rather than just by a specific word.
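The limitation of purely lexical search can be seen in a small Python sketch of an inverted index, a common data structure for word-based retrieval. The function names `build_index` and `lexical_search` and the two sample documents are assumptions invented for this illustration.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(doc_id)
    return index

def lexical_search(index, query):
    """Return ids of documents that contain every word of the query."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Invented sample documents with the same meaning but no words in common.
docs = {
    1: "Cheap car for sale",
    2: "Affordable automobile available now",
}
index = build_index(docs)

print(lexical_search(index, "car"))
print(lexical_search(index, "automobile"))
```

Both documents describe the same thing, yet a query for "car" finds only the first and a query for "automobile" only the second, because the index knows nothing about meaning. A search based on meaning and context would be expected to return both.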

Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, by using software that extracts specific facts about businesses and individuals from news reports, large datasets can be built to facilitate social network analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis.

Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material.
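One common way to learn such characteristics is a naive Bayes classifier over word counts. The text above does not say which technique particular filters use, so the following Python sketch is only an assumed example; the training messages and the function names `train` and `classify` are invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(messages):
    """messages: list of (text, label) pairs, with label 'spam' or 'ham'.

    Assumes at least one message of each label is present.
    """
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary

def classify(model, text):
    """Return the label with the higher log-probability under naive Bayes."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        # Log prior of the label.
        score = math.log(label_counts[label] / total_messages)
        n_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (n_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data.
model = train([
    ("Buy cheap pills now", "spam"),
    ("Cheap offer, buy today", "spam"),
    ("Meeting agenda for Monday", "ham"),
    ("Lunch on Monday?", "ham"),
])

print(classify(model, "cheap pills today"))
print(classify(model, "agenda for lunch"))
```

With this toy training set the first message is labeled spam and the second ham; a practical filter would need far more training data and additional features such as headers and links.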

Notes

[1] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan (2002). "Thumbs up? Sentiment Classification using Machine Learning Techniques" (PDF). Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 79–86.
[2] Alessandro Valitutti, Carlo Strapparava, Oliviero Stock (2005). "Developing Affective Lexical Resources" (PDF). PsychNology Journal. 2 (1): 61–83.
[3] K. Bretonnel Cohen & Lawrence Hunter (2008). "Getting Started in Text Mining" (PDF). PLoS Computational Biology. 4 (1): e20. doi:10.1371/journal.pcbi.0040020.
[4] Tor-Kristian Jenssen, Astrid Lægreid, Jan Komorowski & Eivind Hovig (2001). "A literature network of human genes for high-throughput analysis of gene expression". Nature Genetics. 28: 21–28. doi:10.1038/ng0501-21.
- Summary: Daniel R. Masys (2001). "Linking microarray data to the literature". Nature Genetics. 28: 9–10. doi:10.1038/ng0501-9.
- Web site: http://www.pubgene.org/

References

Ronen Feldman and James Sanger, The Text Mining Handbook, Cambridge University Press, ISBN 9780521836579
Anne Kao and Steve R. Poteet (editors), Natural Language Processing and Text Mining, Springer, ISBN 184628175X
Manu Konchady, Text Mining Application Programming (Programming Series), Charles River Media, ISBN 1584504609
M. Ikonomakis, S. Kotsiantis, V. Tampakas, Text Classification Using Machine Learning Techniques, WSEAS Transactions on Computers, Issue 8, Volume 4, August 2005, pp. 966-974 (http://www.math.upatras.gr/~esdlab/en/members/kotsiantis/Text%20Classification%20final%20journal.pdf)

External links

http://www.itl.nist.gov/iaui/894.02/related_projects/muc/ MUC
http://projects.ldc.upenn.edu/ace/ ACE (LDC)
http://www.itl.nist.gov/iad/894.01/tests/ace/ ACE (NIST)
http://www.arts-humanities.net/text_mining (Discussion group text mining)
Text Analysis Portal for Research (TAPoR)
http://textanalytics.wikidot.com/ Text Analytics Wiki


@@ Line 33: / Line 33: @@
 ===Security applications===
 One of the largest text mining applications that exists is probably the classified [[ECHELON]] surveillance system.  Additionally, many text mining software packages such as [[AeroText]], [[Attensity]] and [[Expert System]] are marketed towards security applications, particularly analysis of plain text sources such as internet news.
+=== Biomedical applications ===
+A range of applications of text mining of the biomedical literature has been described.<ref>{{Cite journal
+ | author = K. Bretonnel Cohen & Lawrence Hunter
+ | title = Getting Started in Text Mining
+ | journal = [[PLoS Computational Biology]]
+ | month = January
+ | year = [[2008]]
+ | volume = 4
+ | issue = 1
+ | pages = e20
+ | doi = 10.1371/journal.pcbi.0040020
+ | url = http://compbiol.plosjournals.org/archive/1553-7358/4/1/pdf/10.1371_journal.pcbi.0040020-L.pdf
+}}</ref>
+One example is [[PubGene]] that combines biomedical text mining with network visualization as an Internet service.<ref>{{Cite journal
+ | author = Tor-Kristian Jenssen, Astrid Lægreid, Jan Komorowski1 & Eivind Hovig
+ | title = A literature network of human genes for high-throughput analysis of gene expression
+ | journal = [[Nature Genetics]]
+ | volume = 28
+ | pages = 21&ndash;28
+ | year = [[2001]]
+ | doi = 10.1038/ng0501-21
+ | url = http://www.nature.com/ng/journal/v28/n1/abs/ng0501_21.html
+}}
+* Summary: {{Cite journal
+ | author = Daniel R. Masys
+ | title = Linking microarray data to the literature
+ | journal = [[Nature Genetics]]
+ | volume = 28
+ | pages = 9&ndash;10
+ | year = [[2001]]
+ | doi = 10.1038/ng0501-9
+}}
+ * Web site: http://www.pubgene.org/</ref>
 ===Software and applications===