Unstructured data (or unstructured information) is information that either has no pre-defined data model or is not organized in a pre-defined manner. It is typically text-heavy, but may also contain data such as dates, numbers, and facts. The resulting irregularities and ambiguities make it harder for traditional computer programs to interpret than data stored in fielded form in databases or annotated (semantically tagged) in documents.
In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80–90% of all potentially usable business information may originate in unstructured form. This rule of thumb is not based on primary or any quantitative research, but is nonetheless accepted by some.
IDC and EMC project that data will grow to 40 zettabytes by 2020, a 50-fold increase from the beginning of 2010. Computerworld states that unstructured information might account for more than 70%–80% of all data in organizations.
Issues with terminology
The term is imprecise for several reasons:
- Structure, while not formally defined, can still be implied.
- Data with some form of structure may still be characterized as unstructured if its structure is not helpful for the processing task at hand.
- Unstructured information might have some structure (semi-structured) or even be highly structured but in ways that are unanticipated or unannounced.
Dealing with unstructured data
Techniques such as data mining, text analytics, and noisy-text analytics provide different methods to find patterns in, or otherwise interpret, this information. Common techniques for structuring text usually involve manual tagging with metadata or part-of-speech tagging for further text mining-based structuring. The Unstructured Information Management Architecture (UIMA) provides a common framework for processing this information to extract meaning and create structured data about it.
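As a rough illustration of creating structured data from free text, the following Python sketch pulls dates and dollar amounts out of a sentence with regular expressions. The patterns, the `extract_fields` helper, and the sample sentence are all assumptions made for this example; they are not part of UIMA or of any product named in this article, and real text analytics needs far more robust handling.

```python
import re

# Hypothetical patterns chosen for this example only.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")        # ISO-style dates
AMOUNT_RE = re.compile(r"\$(\d+(?:,\d{3})*(?:\.\d+)?)")     # e.g. $1,250.50


def extract_fields(text):
    """Return a dict of structured fields found in free text."""
    dates = [m.group(0) for m in DATE_RE.finditer(text)]
    amounts = [
        float(m.group(1).replace(",", ""))  # drop thousands separators
        for m in AMOUNT_RE.finditer(text)
    ]
    return {"dates": dates, "amounts": amounts}


note = "Invoice sent on 2012-12-11 for $1,250.50; reminder due 2013-01-15."
record = extract_fields(note)
print(record)
```

The resulting dictionary is a small fielded record that a database or spreadsheet could store, whereas the original sentence is not.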
Software that creates machine-processable structure exploits the linguistic, auditory, and visual structure inherent in all forms of human communication. Algorithms can infer this inherent structure from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery.
Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, analog data, images, files, and unstructured text such as the body of an e-mail message, Web page, or word-processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. in files or documents, ...) that themselves have structure. Such objects are thus a mix of structured and unstructured data, but collectively they are still referred to as "unstructured data".
For example, an HTML web page is tagged, but HTML mark-up typically serves solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page. XHTML tagging does allow machine processing of elements, although it typically does not capture or convey the semantic meaning of tagged terms.
Since unstructured data commonly occurs in electronic documents, a content or document management system that can categorize entire documents is often preferred over transferring and manipulating data from within the documents. Document management thus provides a means of imposing structure on document collections.
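The point about HTML mark-up can be seen in a short Python sketch that uses the standard library's `html.parser` module. The `TagAndTextCollector` class and the sample page are hypothetical names invented for this illustration. The parser recovers the tag structure easily, but the tags only say how the text is laid out, not what it means.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Collect tag names and text fragments from an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record every opening tag in document order.
        self.tags.append(tag)

    def handle_data(self, data):
        # Keep non-empty text fragments found between tags.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


page = (
    "<html><body><h1>Quarterly report</h1>"
    "<p>Revenue rose <b>12%</b> in March.</p></body></html>"
)
collector = TagAndTextCollector()
collector.feed(page)
collector.close()

print(collector.tags)
print(collector.text_parts)
```

Nothing in the `b` tag indicates that "12%" is a revenue growth figure. Recovering that meaning still requires analysis of the text itself.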
Search engines have become popular tools for indexing and searching through such data, especially text.
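A minimal sketch of the indexing idea, assuming a simple inverted index that maps each word to the set of documents containing it. The function names, the tokenization rule, and the sample documents are assumptions for illustration; production search engines add ranking, stemming, and much more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents.
docs = {
    "memo1": "Meeting notes: the health records audit is due Friday.",
    "mail7": "Please attach the audit video to the e-mail.",
    "page3": "Health tips and records of past events.",
}
index = build_index(docs)
print(sorted(search(index, "health records")))
```

The index is itself a structured artifact built from unstructured text, which is what makes lookups fast without rereading every document.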
Several commercial solutions are available for analyzing and understanding unstructured data for business applications. These include products from companies such as ZL Technologies, SAS, Provalis Research, Inxight and SPSS, as well as more specialized offerings such as Attensity, Clarabridge and Sysomos, which focus on analyzing unstructured social media data.
References
- Structure, Models and Meaning: Is "unstructured" data merely unmodeled?, Intelligent Enterprise, March 1, 2005.
- Structuring Unstructured Data, Forbes, April 5, 2007.
- Christopher C. Shilakes and Julie Tylman, "Enterprise Information Portals", Merrill Lynch, 16 November 1998.
- Holzinger, A., Stocker, C., Ofner, B., Prohaska, G., Brabenetz, A. & Hofmann-Wellenhof, R. 2013. Combining HCI, Natural Language Processing, and Knowledge Discovery - Potential of IBM Content Analytics as an assistive technology in the biomedical domain. Springer Lecture Notes in Computer Science LNCS 7947. Heidelberg, Berlin, New York: Springer, pp. 13-24.
- Unstructured Data and the 80 Percent Rule, Clarabridge Bridgepoints, 2008 Q3.
- Today’s Challenge in Government: What to do with Unstructured Information and Why Doing Nothing Isn’t An Option, Noel Yuhanna, Principal Analyst, Forrester Research, November 2010.
- XP support deadline haunts IT execs, Computerworld, October 2010.
- New Digital Universe Study Reveals Big Data Gap: Less Than 1% of World’s Data is Analyzed; Less Than 20% is Protected, EMC Press Release, December 2012.
See also
- Semi-structured data
- Data mining
- Pattern recognition, clustering
- Noisy text
- General Architecture for Text Engineering