Unstructured data

From Wikipedia, the free encyclopedia

Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or does not fit well into relational tables. Unstructured information is typically text-heavy, but may also contain data such as dates, numbers, and facts. The resulting irregularities and ambiguities make it harder for traditional computer programs to interpret than data stored in fielded form in databases or annotated (semantically tagged) in documents.
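As a rough illustration of dates and numbers embedded in text-heavy data, the following Python sketch pulls them out of a free-text note with regular expressions. The note, the ISO-style date format, and the names `NOTE` and `extract_fields` are assumptions made for this example, not part of any standard method.

```python
import re

# Hypothetical free-text note; the content is invented for illustration.
NOTE = (
    "Met with the supplier on 2010-11-03. They quoted 1250 units "
    "at 4.75 each, delivery by 2011-01-15."
)

# Assumption: dates appear in YYYY-MM-DD form; real text is far less regular.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def extract_fields(text):
    """Return the dates and standalone numbers found in free text."""
    dates = DATE_RE.findall(text)
    # Blank out the dates first so their digits are not counted as numbers.
    remainder = DATE_RE.sub(" ", text)
    numbers = [float(n) for n in NUMBER_RE.findall(remainder)]
    return {"dates": dates, "numbers": numbers}


record = extract_fields(NOTE)
print(record)
```

The sketch recovers the values but not their roles: nothing in the output says which number is a quantity and which is a price, which is the kind of ambiguity that fielded data avoids.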

The term is imprecise for several reasons:

  1. structure, while not formally defined, can still be implied;
  2. data with some form of structure may still be characterized as unstructured if its structure is not helpful for the desired processing task; and
  3. unstructured information might have some structure (semi-structured) or even be highly structured, but in ways that are unanticipated or unannounced.
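An e-mail message is one way to picture the third point: its headers are fielded, while its body is free text. The Python sketch below separates the two using the standard library `email` package. The message itself and the addresses in it are invented for illustration.

```python
from email import message_from_string

# Hypothetical message; all names and addresses are invented.
RAW = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Contract renewal\n"
    "Date: Wed, 03 Nov 2010 09:30:00 +0000\n"
    "\n"
    "Hi Bob, the vendor agreed to renew in January if we confirm by Friday.\n"
)

msg = message_from_string(RAW)

# The headers behave like named fields in a record.
headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}

# The body is an undifferentiated string; who agreed to what, and by when,
# is not exposed as separate fields.
body = msg.get_payload()

print(headers)
print(body)
```

The headers can be queried directly, whereas the commitments described in the body would need further analysis before a program could act on them.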

Software that creates machine-processable structure exploits the linguistic, auditory, and visual structure that is inherent in all forms of human communication.[1] This inherent structure can be inferred from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery.

Examples of "unstructured data" may include books, journals, documents, metadata, health records, audio, video, files, and unstructured text such as the body of an e-mail message, Web page, or word processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. files or documents) that themselves have structure. Such objects are thus a mix of structured and unstructured data, but collectively they are still referred to as "unstructured data".[2]

For example, an HTML web page is tagged, but HTML mark-up is typically designed solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page. XHTML tagging does allow machine processing of elements, although it typically does not capture or convey the semantic meaning of tagged terms.
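The point about HTML mark-up can be sketched in Python with the standard library `html.parser` module. The page fragment and the class name `TagAndTextCollector` are assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, invented for illustration.
PAGE = (
    "<html><body><h1>Quarterly report</h1>"
    "<p>Revenue rose <b>12%</b> in Q3.</p></body></html>"
)


class TagAndTextCollector(HTMLParser):
    """Record the start tags and the text chunks of a page separately."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Keep only chunks that contain something besides whitespace.
        if data.strip():
            self.text.append(data.strip())


collector = TagAndTextCollector()
collector.feed(PAGE)
collector.close()

print(collector.tags)
print(collector.text)
```

The tags say that "12%" should be rendered in bold inside a paragraph; they do not say that it is a revenue growth figure for a particular quarter. That meaning still has to be inferred from the text.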

In 1998, Merrill Lynch cited estimates that as much as 80% of all potentially usable business information originates in unstructured form.[3] Such estimates may not be based on primary research, but they are nonetheless widely accepted.[4] In 2010, analysts estimated that data would grow 800% over the following five years.[5] Unstructured information is reported to account for more than 70–80% of all data in organizations and to be growing 10 to 50 times faster than structured data.[6]

Dealing with unstructured data

Data mining, text analytics, and noisy text analytics are different methods used to find patterns in, or otherwise interpret, this information. Common techniques for structuring text usually involve manual tagging with metadata or part-of-speech tagging as a basis for further text mining-based structuring. UIMA (Unstructured Information Management Architecture) provides a common framework for processing this information to extract meaning and create structured data about the information.
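The two structuring techniques named above, manual metadata tagging and part-of-speech tagging, can be sketched together in Python. This is a deliberately naive illustration: the suffix rules, the `TaggedDocument` class, and the helper names are all invented here, and the rules stand in for the trained models that real part-of-speech taggers use. It does not reflect the UIMA API.

```python
import re
from dataclasses import dataclass, field

# Toy suffix rules, invented for illustration. Real part-of-speech taggers
# rely on trained statistical models and on sentence context.
SUFFIX_RULES = [
    ("ing", "VERB"),
    ("ed", "VERB"),
    ("ly", "ADV"),
    ("tion", "NOUN"),
    ("s", "NOUN"),
]


@dataclass
class TaggedDocument:
    text: str
    metadata: dict = field(default_factory=dict)  # manually supplied tags
    tokens: list = field(default_factory=list)    # (word, guessed tag) pairs


def guess_tag(word):
    """Guess a coarse tag from the word ending; 'UNK' if no rule applies."""
    lower = word.lower()
    for suffix, tag in SUFFIX_RULES:
        if lower.endswith(suffix) and len(lower) > len(suffix) + 1:
            return tag
    return "UNK"


def tag_document(text, metadata):
    """Attach manual metadata and per-word tag guesses to a piece of text."""
    words = re.findall(r"[A-Za-z]+", text)
    tokens = [(w, guess_tag(w)) for w in words]
    return TaggedDocument(text, dict(metadata), tokens)


doc = tag_document(
    "Analysts quickly reviewed the documentation",
    {"source": "email", "year": 2010},
)
print(doc.metadata)
print(doc.tokens)
```

Even this crude pass turns a plain string into a record with queryable fields, which is the general shape of output that later text mining steps build on.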

Several commercial solutions are available for analyzing and understanding unstructured data for business applications. These include products from companies such as SAS, IxReveal, Inxight, and SPSS, as well as more specialized offerings such as Attensity360 and Sysomos, the latter of which focuses on analyzing unstructured social media data.

Notes

  1. "Structure, Models and Meaning: Is 'unstructured' data merely unmodeled?", Intelligent Enterprise, March 1, 2005.
  2. "Structuring Unstructured Data", Forbes, April 5, 2007.
  3. Christopher C. Shilakes and Julie Tylman, "Enterprise Information Portals", Merrill Lynch, November 16, 1998.
  4. "Unstructured Data and the 80 Percent Rule", Clarabridge Bridgepoints, 2008 Q3.
  5. Noel Yuhanna (Principal Analyst, Forrester Research), "Today’s Challenge in Government: What to do with Unstructured Information and Why Doing Nothing Isn’t An Option", November 2010.
  6. Computerworld article, October 2010.
