AeroText

AeroText
Developer(s) Lockheed Martin, Rocket Software
Stable release 5.1
Operating system Windows, Solaris, Linux
Type Data Mining
License Proprietary
Website Rocket Software AeroText

AeroText is a suite of text mining applications that are used for content analysis. Content used can be in multiple languages.

AeroText was developed at the Integrated Systems and Solutions division of Lockheed Martin Corporation, a U.S. defense contractor. Rocket Software acquired AeroText from Lockheed Martin on June 5, 2008,[1] and continues to develop and support it.[2]

History

Originally developed for the U.S. intelligence community (Department of Defense), AeroText has become one of the leading products for information extraction and link analysis, and it is often integrated into other systems.

Functionality

AeroText converts unstructured information into structured information. The user can define the parameters of both.

AeroText output is normalized and stored as templates in the solution’s cache. The Run Time Integration Toolkit (RIT) can deliver that output to existing systems in a variety of forms through RIT modules. Wrappers for XML and the DARPA Agent Markup Language (DAML) are also provided, which makes the solution flexible enough to be used in other domains. For instance, the solution was presented to the National Institutes of Health’s Biomedical Computing Interest Group (BCIG) in April 2002, where it was shown to be applicable to the biomedical domain.

“AeroText is data-independent, which means it does not rely on or have a bias towards a particular domain, document type, document source, or natural language” (Haser and Childs, 2002). Sample target applications include automatic database generation, document routing, browsing, summarization, enhanced full-text search, and targeted document search, in addition to link analysis. The solution’s multilingual utility is also a strength. The technology can also support format standards such as DAML (Kogut and Holmes), which aid in law enforcement activities.

The current 5.x release consists of a set of components that carry out integration and data mining tasks. The Integrated Development Environment (IDE) is perhaps the most important component, as it provides the rule development, modification, and coordination capabilities: “a complete environment to build, test, and analyze linguistic knowledge bases” (Kogut and Holmes). This graphical interface includes not only object-oriented editors and rule wizards but also visual tools for analyzing extracted data, debugging linguistic data, and analyzing performance (AeroText). As a result, customized logic domains are available.

The Instance Based Run-Time Engine carries out the extraction on input documents by applying a Knowledge Base (see below). According to the company, “an Instance is defined as the creation of a single Document Object in the AeroText Application Program Interface (API).” The engine is available through Java, C, or COM APIs and has wrappers for XML and DAML.

The Run Time Integration Toolkit (RIT) helps to deploy AeroText by minimizing the need for integration code and provides for the integration of AeroText output into existing systems through the use of RIT modules.

The Corpus Analyzer clusters documents based on entity and conceptual similarities between documents.
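
As a rough illustration of clustering by entity similarity, the sketch below groups documents whose extracted entity sets overlap. The Jaccard measure, the 0.5 threshold, the greedy strategy, and all function names are assumptions made for this example; the sources do not describe the Corpus Analyzer's actual algorithm.

```python
def jaccard(a, b):
    """Share of entities common to both sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_entities(docs, threshold=0.5):
    """Greedy single-pass clustering: a document joins the first cluster
    whose seed document is similar enough, else it starts a new cluster."""
    clusters = []  # each cluster is a list of document ids; the first id is the seed
    for doc_id, entities in docs.items():
        for cluster in clusters:
            if jaccard(entities, docs[cluster[0]]) >= threshold:
                cluster.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters


# Hypothetical entity sets, as an extraction step might produce them.
docs = {
    "doc1": {"AAA Corporation", "ZZZ Inc.", "Tampa"},
    "doc2": {"AAA Corporation", "ZZZ Inc."},
    "doc3": {"NIH", "BCIG"},
}
clusters = cluster_by_entities(docs)
```

Here the first two documents share two of three entities and fall into one cluster, while the third shares none and forms its own.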

The Answer Key Editor creates an information store for scoring by assigning “an Answer Key that corresponds to a specific collection of documents” (AeroText). This Key objectively measures the accuracy of the extraction process. The scoring capability is integrated into the development environment, enabling the developer to identify and analyze extraction errors in large sets of data during the development process.

Much of the solution’s technology is provided within the company’s Knowledge Bases (KBs). English serves as the core KB and provides linguistically driven rules covering nearly 100 entity types used for extraction. KBs are also available for the Arabic, Chinese (simplified and traditional), Spanish, and Indonesian (including Melagu) languages. A KB Compiler is used to convert “linguistic data files into an efficient run-time knowledge base” (Kogut and Holmes).

AeroText’s solution components are available separately or as one of two product bundles. The Standard bundle includes the IDE, Instance-based Run-Time Engine, Core English Knowledge Base, and the Customization Tool. The Professional bundle includes the Standard components as well as the Corpus Analyzer and the Answer Key Editor.

AeroText can handle any textual input, as the Instance Based Run-Time Engine supports both ASCII and Unicode text.

AeroText’s main focus is information extraction, which includes both named entity extraction and intrasource link analysis. “AeroText information extraction technology is designed for natural language text” (AeroText, 2003). The company has organized its capabilities into several groupings. For information extraction, entities (persons, organizations, places, etc.), key phrases (time expressions, money amounts, etc.), and grammatical phrases (verb phrases, etc.) can all be extracted. For link analysis, the solution provides entity coreference (resolution of multiple mentions of the same entity, including pronouns), entity associations (identification of relationships), event extraction (who, what, when, where), topic categorization (subject matter determinations), temporal resolution (resolution of time expressions, etc.), and location resolution (identification of a particular place, which can be tied to GIS). Additionally, the company’s BlockFinder can be used to understand textual tables (Haser and Childs, 2002).
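
To make temporal resolution concrete, the sketch below turns a relative time expression into a date range anchored on a known document date. The regular expression, the function name, and the restriction to phrases of the form "within N days" are assumptions made for illustration; this is not AeroText's rule syntax or API.

```python
import re
from datetime import date, timedelta

# Hypothetical pattern: only handles phrases of the form "within N days".
WITHIN_DAYS = re.compile(r"within\s+(\d+)\s+days?", re.IGNORECASE)


def resolve_within(text, anchor):
    """Return (start, end) dates for a 'within N days' phrase, or None."""
    match = WITHIN_DAYS.search(text)
    if match is None:
        return None
    return anchor, anchor + timedelta(days=int(match.group(1)))


anchor = date(2002, 2, 28)  # document date, assumed to be extracted already
window = resolve_within(
    "AAA Corporation will acquire Tampa-based ZZZ Inc. within 60 days.", anchor
)
```

With the anchor date of February 28, 2002, the phrase resolves to a window ending on April 29, 2002.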

The solution gains its flexibility and broad range of applicability from the fact that the system is based on manually crafted rules. These rules are used to perform both entity extraction and intrasource link analysis. While individual modules tend to be highly subject-matter specific, the solution can be easily modified to handle the requirements of a different domain. Therefore, in order to use the solution, “an AeroText specialist must generate a set of extraction rules. These rules describe for AeroText how to identify and structure the information to be extracted. In effect, they create fairly abstract templates that describe all the different ways a concept can be expressed in the target language” (Noble, b). These rules not only extract the information from the text, but also specify how the information should be structured within event records (Noble, a). Haser and Childs (2002) explain that the fundamental components of the solution include features, elements, templates, packages, rulebases, and caches.

These terms are explained using the following example: “Feb. 28, 2002 AAA Corporation will acquire Tampa-based ZZZ Inc. within 60 days.”

  • A feature is “a list of terms that represents a common idea based on meaning or grammar,” e.g., ‘inc.’ and ‘corp.’ are business designations {CorpDesignator}.
  • An element is “a set of regular expressions that allow binding of information to matched text”; for instance, “FEB” and “February” both refer to the second month (month = “2”).
  • A template is “a frame with slots used to hold extracted text and sometimes related information.” A time template, for example, would include a “text” field as well as “StartDate” and “EndDate” fields.
  • A package is “a set of rules, similar to elements, but with associated actions that fill template slots with extracted information.” The example above would have Time, Organization, and Location templates into which extracted information could be organized.
  • A rulebase is “a collection of packages that are activated at the appropriate time during a processing sequence.” This example would have the Time and Organization templates feed into an Acquisition template.
  • A cache provides “a virtual bin for storing extracted information.”
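
The component definitions above can be sketched in a few lines of Python applied to the example sentence. This is only an analogy: the variable names, the regular expressions, and the dictionary used as a cache are assumptions made for illustration and do not reflect AeroText's actual rule language or data structures.

```python
import re
from dataclasses import dataclass

# Feature: a list of terms sharing one idea (here, corporate designators).
CORP_DESIGNATOR = ["Corporation", "Corp.", "Inc."]

# Element: a regular expression that binds a value to the matched text.
MONTHS = {"feb": 2, "february": 2}
MONTH_ELEMENT = re.compile(r"\b(Feb|February)\b\.?", re.IGNORECASE)


@dataclass
class OrganizationTemplate:
    """Template: a frame whose slots hold extracted text."""
    text: str


def organization_package(sentence):
    """Package: a rule plus an action that fills template slots."""
    designators = "|".join(re.escape(term) for term in CORP_DESIGNATOR)
    rule = re.compile(r"\b[A-Z]+ (?:" + designators + ")")
    return [OrganizationTemplate(m.group(0)) for m in rule.finditer(sentence)]


sentence = ("Feb. 28, 2002 AAA Corporation will acquire "
            "Tampa-based ZZZ Inc. within 60 days.")

# Cache: a bin that stores extracted information by kind.
cache = {"entities": organization_package(sentence)}
month = MONTHS[MONTH_ELEMENT.search(sentence).group(1).lower()]
```

A rulebase would correspond to running several such packages in a fixed order, for example Time and Organization before a hypothetical Acquisition package that consumes their output.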

An entities cache stores times, organizations, and other such information, while an events cache can store event information, such as acquisitions. A high-level overview of how the solution is set up is provided by the adjacent figure. Given a test document, a knowledge engineer produces the answer key of expected output, while the knowledge base engine uses pre-packaged and user-developed rules to extract the entities and relationships from the text. These two outputs are compared and scored. If changes need to be made, the knowledge engineer creates additional rules or makes other enhancements to the knowledge base (which in turn updates the knowledge base engine).
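
The compare-and-score step can be illustrated with a small sketch. Precision, recall, and F1 are common measures for extraction accuracy, but the sources do not state which metrics AeroText reports, so both the metrics and the (type, text) representation below are assumptions.

```python
def score(extracted, answer_key):
    """Compare extracted items with the answer key; return precision, recall, F1."""
    extracted, answer_key = set(extracted), set(answer_key)
    correct = len(extracted & answer_key)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(answer_key) if answer_key else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical (type, text) pairs for the example sentence.
answer_key = [
    ("Organization", "AAA Corporation"),
    ("Organization", "ZZZ Inc."),
    ("Location", "Tampa"),
]
extracted = [
    ("Organization", "AAA Corporation"),
    ("Organization", "ZZZ Inc."),
]
precision, recall, f1 = score(extracted, answer_key)
```

In this made-up case every extracted item is correct, but the location is missed, so recall falls below precision; a knowledge engineer would respond by adding or adjusting a location rule and scoring again.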

Further reading

Haser, Tom and Childs, Lois (2002). “Drug Discovery through Information Extraction Technology.” Presentation at NIH BCIG. April 18, 2002. Online. http://www.altum.com/bcig/events/seminars/502002_04.pdf and http://www.altum.com/bcig/events/seminars/2002_04.htm Accessed January 9, 2006.

Hill, Ryan (2005). Lockheed Martin Signs NetMap Analytics as Authorized Distributor of AeroText™ Information Extraction Software. August 3, 2005. Online. http://www.netmapanalytics.com/press/AeroText.pdf Accessed January 9, 2006. Now available from http://web.archive.org/web/20060410180934/http://www.netmapanalytics.com/press/AeroText.pdf.

KMWorld. KMWorld Buyers Guide: Lockheed Martin Corporation. Online. http://www.kmworld.com/buyersGuide/ReadCompany.aspx?CategoryID=77&CompanyID=17

Kogut, Paul and Holmes, William. AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages. Online. http://semannot2001.aifb.uni-karlsruhe.de/positionpapers/AeroDAML3.pdf

Mordoff, Keith (2004). Lockheed Martin’s NEW AeroText™ Version 4.0 Helps Users Tackle Data Overload, Pinpoint Critical Information. April 14, 2005. Online. http://www.lockheedmartin.com/data/assets/10586.pdf

Noble, David (a). Fusion of Open Source Information. Online. http://www.ebrinc.com/files/Noble_Fusion.pdf

Noble, David (b). Structuring Open Source Information to Support Intelligence Analysis. Online. http://www.ebrinc.com/files/Noble_Structuring.pdf

Roberts, Gregory (2003). AeroText™ Products: Executive Summary Information. Online. http://www.lockheedmartin.com/data/assets/3504.pdf

Taylor, Sarah M. (2004). “Information Extraction Tools: Deciphering Human Language.” IT Professional. Vol. 6, no. 6, pp. 28-34. November/December 2004. Online. http://ieeexplore.ieee.org/iel5/6294/30282/01390870.pdf?tp=&arnumber=1390870&isnumber=30282.
