VTD-XML

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 67.164.81.9 (talk) at 21:56, 20 October 2007 (→‎VTD-XML as an XML Editor/Eraser). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

VTD-XML
Developer(s): XimpleWare
Stable release
Operating system: Portable
Type: XML parser/indexer/slicer/editor library
License: GPL and proprietary license
Website: vtd-xml.sourceforge.net


Virtual Token Descriptor for eXtensible Markup Language (VTD-XML) refers to a collection of efficient XML processing technologies centered around a non-extractive XML parsing technique called Virtual Token Descriptor (VTD). Depending on the perspective, VTD-XML can be viewed as one of the following:

  • An XML parser
  • A native XML indexer or a file format that uses binary to enhance the text XML
  • An incremental XML content modifier
  • An XML slicer/splitter/assembler
  • An XML editor/eraser

VTD-XML is developed by XimpleWare and dual-licensed under the GPL and a proprietary license. It was originally written in Java and is now also available in C and C#.

Basic Concept

Non-Extractive Parsing

Traditionally, a lexical analyzer represents tokens (the small, indivisible units of character values) as discrete string objects. This approach is known as "extractive" parsing. In contrast, "non-extractive" tokenization requires that the source text be kept intact, and uses offsets and lengths to describe the tokens.
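
As a minimal sketch of the idea (the class and method names below are hypothetical and are not part of VTD-XML), a non-extractive tokenizer can describe each token as an offset/length pair over the untouched source bytes, and build a string only when a caller asks for one. Whitespace-separated words stand in for XML tokens here to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of non-extractive tokenization; not VTD-XML code.
final class NonExtractiveSketch {

    /** Records each whitespace-separated token as an {offset, length} pair. */
    static List<int[]> tokenize(byte[] source) {
        List<int[]> tokens = new ArrayList<>();
        int start = -1; // offset of the token being scanned, or -1 if none
        for (int i = 0; i <= source.length; i++) {
            boolean boundary = i == source.length || isSpace(source[i]);
            if (!boundary && start < 0) {
                start = i;
            } else if (boundary && start >= 0) {
                tokens.add(new int[] {start, i - start});
                start = -1;
            }
        }
        return tokens;
    }

    private static boolean isSpace(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    /** Extracts a string only when a caller actually needs one. */
    static String text(byte[] source, int[] token) {
        return new String(source, token[0], token[1], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] source = "alpha beta  gamma".getBytes(StandardCharsets.UTF_8);
        for (int[] t : tokenize(source)) {
            System.out.println(t[0] + "+" + t[1] + " " + text(source, t));
        }
    }
}
```

The source array is never copied or modified; the token list holds only integers.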

Virtual Token Descriptor

Virtual Token Descriptor (VTD) applies the concept of non-extractive parsing to XML processing. A VTD record is a 64-bit integer that encodes the offset, length, token type and nesting depth of a token in an XML document. Because all VTD records share the same fixed length of 64 bits, they are intended to be referenced using index values.
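
The following sketch shows one way four such fields could be packed into a single 64-bit integer. The text above does not state the field widths, so the layout used here (30-bit offset, 20-bit length, 8-bit depth, 4-bit token type, 2 spare bits) is an assumption for illustration, and the class name is hypothetical.

```java
// Hypothetical bit layout for illustration only (an assumption, not a spec):
//   bits 60-63 token type | bits 52-59 depth | bits 32-51 length
//   bits 30-31 unused     | bits 0-29 offset
final class VtdRecordSketch {
    static final int TYPE_SHIFT = 60;
    static final int DEPTH_SHIFT = 52;
    static final int LENGTH_SHIFT = 32;

    static long pack(int type, int depth, int length, int offset) {
        check(type, 4);
        check(depth, 8);
        check(length, 20);
        check(offset, 30);
        return ((long) type << TYPE_SHIFT)
                | ((long) depth << DEPTH_SHIFT)
                | ((long) length << LENGTH_SHIFT)
                | (long) offset;
    }

    static int type(long record)   { return (int) (record >>> TYPE_SHIFT) & 0xF; }
    static int depth(long record)  { return (int) (record >>> DEPTH_SHIFT) & 0xFF; }
    static int length(long record) { return (int) (record >>> LENGTH_SHIFT) & 0xFFFFF; }
    static int offset(long record) { return (int) record & 0x3FFFFFFF; }

    // Rejects values that do not fit in the given number of bits.
    private static void check(int value, int bits) {
        if (value < 0 || value >= (1 << bits)) {
            throw new IllegalArgumentException(value + " does not fit in " + bits + " bits");
        }
    }

    public static void main(String[] args) {
        // Records live in a plain long[] and are referred to by array index.
        long[] records = { pack(0, 0, 13, 1), pack(2, 1, 5, 20) };
        System.out.println("record 1: offset=" + offset(records[1])
                + " length=" + length(records[1]) + " depth=" + depth(records[1]));
    }
}
```

Storing records in a `long[]` means a whole document's tokens occupy one contiguous array instead of one object per token.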

Location Cache

Location Caches (LC) build on top of VTD records to provide efficient random access. Organized as tables on a per-depth basis, LC entries model an XML document hierarchy consisting entirely of elements. An LC entry is a 64-bit integer encoding a pair of 32-bit values: the upper 32 bits point to the VTD record for the corresponding element; the lower 32 bits point to the first child index value in the LC table one level deeper.
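
A minimal sketch of that entry layout, assuming a hypothetical class name and assuming that -1 in the lower half marks an element with no child elements (the text above does not say how that case is represented):

```java
// Hypothetical illustration of an LC entry; not VTD-XML code.
final class LocationCacheSketch {
    /** Assumed marker for "this element has no child element". */
    static final int NO_CHILD = -1;

    /** Upper 32 bits: index of the element's VTD record.
     *  Lower 32 bits: index of its first child in the next-level LC table. */
    static long pack(int vtdIndex, int firstChildIndex) {
        return ((long) vtdIndex << 32) | (firstChildIndex & 0xFFFFFFFFL);
    }

    static int vtdIndex(long entry) {
        return (int) (entry >>> 32);
    }

    static int firstChildIndex(long entry) {
        return (int) entry; // keeps the low 32 bits, so -1 round-trips
    }

    public static void main(String[] args) {
        long[] levelOne = { pack(1, 0), pack(9, NO_CHILD) };
        for (long e : levelOne) {
            System.out.println("vtd=" + vtdIndex(e) + " firstChild=" + firstChildIndex(e));
        }
    }
}
```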

Benefits

Overview

Virtually all the core benefits of VTD-XML are inherent to non-extractive parsing, whose key characteristics include:

  • The source XML text is kept intact in memory and undecoded.
  • The internal representation of VTD-XML is inherently persistent.
  • VTD-XML abolishes object-oriented modeling of the hierarchical representation.

Combining those characteristics opens the door to thinking of XML purely as syntax (bits, bytes, offsets, lengths, and fragments) instead of as a serialization of objects. This view should become an increasingly powerful way to think about XML/SOA applications.

VTD-XML as a Parser

When used in parsing mode, VTD-XML is a general-purpose, high-performance XML parser that aims to combine the random access of DOM with the speed of SAX. The key technical characteristics are:

  • Performance-wise, VTD-XML typically outperforms SAX (with NULL content handler) by 100%, while still providing full random access and built-in XPath support.
  • In terms of memory usage, VTD-XML typically consumes 1.3 to 1.5 times the size of the XML document, which is around one fifth of the memory used by DOM.
  • Applications written in VTD-XML are usually much shorter and cleaner than their DOM or SAX versions.

VTD-XML as an Indexer

Because of the inherent persistence of VTD-XML, developers can choose to write the internal representation of a parsed XML document to disk, and later load it back into memory to avoid repetitive parsing. To this end, XimpleWare has introduced VTD+XML as a binary packaging format combining VTD, LC and the XML text. It can typically be viewed in one of the following two ways:

  • A native XML index that eliminates the cost of re-parsing while retaining all the benefits of XML.
  • A binary XML format that uses binary data to enhance the processing of the XML text. It is a file format that is human readable and backward compatible with XML.

VTD-XML as an XML Content Modifier

Because the XML text is kept intact and undecoded by VTD-XML, an application that intends to modify the content of an XML document only needs to modify the portions of the content most relevant to the changes. This is in stark contrast with DOM, SAX or StAX parsing, which incur the cost of parsing and re-serialization no matter how small the changes are.
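
The idea can be sketched with plain byte arrays: given the offset and length of a token, the updated document is the untouched bytes before it, the new content, and the untouched bytes after it. The class and method names are hypothetical, and this is not how XMLModifier is implemented; it only illustrates why nothing outside the changed range needs to be re-serialized.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of an offset-based update; not VTD-XML code.
final class SpliceSketch {

    /** Replaces doc[offset, offset + length) with the replacement bytes. */
    static byte[] replaceRange(byte[] doc, int offset, int length, byte[] replacement) {
        ByteArrayOutputStream out =
                new ByteArrayOutputStream(doc.length - length + replacement.length);
        out.write(doc, 0, offset);                        // bytes before the token, copied as-is
        out.write(replacement, 0, replacement.length);    // new content
        out.write(doc, offset + length, doc.length - offset - length); // bytes after the token
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] doc = "<price>39</price>".getBytes(StandardCharsets.UTF_8);
        byte[] updated = replaceRange(doc, 7, 2, "200".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(updated, StandardCharsets.UTF_8));
    }
}
```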

VTD-XML as an XML Slicer/Splitter/Assembler

An application based on VTD-XML can also use offsets and lengths to address tokens or element fragments. This allows XML documents to be manipulated like arrays of bytes.

  • As a slicer, VTD-XML can "slice" off a token or an element fragment from an XML document, then insert it back into another location in the same document, or into a different document.
  • As a splitter, VTD-XML can split the sub-elements of an XML document and write each into a separate XML document.
  • As an assembler, VTD-XML can "cut" chunks out of multiple XML documents and assemble them into a new XML document.
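
The slicing and assembling described above reduce to byte-range copies once the offset and length of a fragment are known. The following is a minimal sketch with hypothetical names; in a real application the offsets would come from the parser rather than being hard-coded.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of slicing and assembling by offset; not VTD-XML code.
final class SliceSketch {

    /** Copies the fragment doc[offset, offset + length) out of a document. */
    static byte[] slice(byte[] doc, int offset, int length) {
        return Arrays.copyOfRange(doc, offset, offset + length);
    }

    /** Returns a new document with the fragment inserted at byte position 'at'. */
    static byte[] insert(byte[] doc, int at, byte[] fragment) {
        byte[] out = new byte[doc.length + fragment.length];
        System.arraycopy(doc, 0, out, 0, at);
        System.arraycopy(fragment, 0, out, at, fragment.length);
        System.arraycopy(doc, at, out, at + fragment.length, doc.length - at);
        return out;
    }

    public static void main(String[] args) {
        byte[] source = "<a><b/><c/></a>".getBytes(StandardCharsets.UTF_8);
        byte[] target = "<x></x>".getBytes(StandardCharsets.UTF_8);
        byte[] fragment = slice(source, 3, 4);        // the <b/> element
        byte[] assembled = insert(target, 3, fragment);
        System.out.println(new String(assembled, StandardCharsets.UTF_8));
    }
}
```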

VTD-XML as an XML Editor/Eraser

Used as an editor/eraser, VTD-XML can directly overwrite selected tokens in the underlying byte content of the XML text, provided that the token is at least as long as the intended new content. The benefit of this approach is that the application can immediately reuse the original VTD and LC. In contrast, when using VTD-XML to incrementally update an XML document, an application needs to reparse the updated document before it can process it.
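
A minimal sketch of such an in-place overwrite follows. The names are hypothetical, and padding the unused tail of the token with spaces is an assumption made here for illustration; the text above does not say how leftover bytes are handled. Because no byte after the token moves, offsets recorded for the rest of the document remain valid.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of in-place editing; not VTD-XML code.
final class OverwriteSketch {

    /**
     * Overwrites doc[offset, offset + length) with content, padding the rest
     * of the range with spaces (an assumed policy). Returns false and leaves
     * the document unchanged if the content does not fit.
     */
    static boolean overwrite(byte[] doc, int offset, int length, byte[] content) {
        if (content.length > length) {
            return false;
        }
        System.arraycopy(content, 0, doc, offset, content.length);
        Arrays.fill(doc, offset + content.length, offset + length, (byte) ' ');
        return true;
    }

    public static void main(String[] args) {
        byte[] doc = "<p>12345</p>".getBytes(StandardCharsets.UTF_8);
        boolean ok = overwrite(doc, 3, 5, "42".getBytes(StandardCharsets.UTF_8));
        System.out.println(ok + " " + new String(doc, StandardCharsets.UTF_8));
    }
}
```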

Non-blocking, Incremental XPath Evaluation

Weaknesses

Areas of Applications

API Overview

As of Version 2.2, the Java and C# versions of VTD-XML consist of the following classes:

  • VTDGen (VTD generator) is the class that encapsulates the main parsing, index loading and index writing functions.
  • VTDNav (VTD Navigator) is the class that (1) encapsulates XML, VTD, and hierarchical info, (2) contains various navigation methods, (3) performs various comparisons between VTD records and strings, and (4) converts VTD records to primitive data types.
  • AutoPilot is a class containing functions that perform node-level iteration and XPath evaluation.
  • XMLModifier is a class that offers incremental update capability, such as delete, insert and update.

Code Sample

/* In this Java program, we demonstrate how to use XMLModifier to incrementally
 * update a simple XML purchase order. We also use VTDGen's parseFile to
 * simplify programming.
 */

import com.ximpleware.*;
import java.io.*;

public class Update {
    // The VTD-XML calls below throw several checked exceptions (navigation,
    // XPath and modification errors); main simply propagates them.
    public static void main(String[] argv) throws Exception {
        // open a file, read the content into a byte array and parse it
        // (the second argument turns on namespace awareness)
        VTDGen vg = new VTDGen();
        if (vg.parseFile("oldpo.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            XMLModifier xm = new XMLModifier(vn);

            // replace every item whose partNum is 872-AA with <something/>
            ap.selectXPath("/purchaseOrder/items/item[@partNum='872-AA']");
            while (ap.evalXPath() != -1) {
                xm.remove();
                xm.insertBeforeElement("<something/>\n");
            }

            // set every USPrice below 40 to 200
            ap.selectXPath("/purchaseOrder/items/item/USPrice[.<40]/text()");
            int i;
            while ((i = ap.evalXPath()) != -1) {
                xm.updateToken(i, "200");
            }

            // write the updated document to a new file
            FileOutputStream fos = new FileOutputStream(new File("newpo.xml"));
            try {
                xm.output(fos);
            } finally {
                fos.close();
            }
        }
    }
}

Method

VTD maintains a 64-bit integer for each node. This integer encodes the distance from the beginning of the document (start offset), the length of the associated node (token length), the type of the node, and the nesting depth. Thus the VTD part of VTD-XML acts as a table of contents for an individual XML document; the Location Cache builds on top of it to index the element hierarchy.

The VTD-XML tool (parser and indexer) was originally written in Java but is also available in C and C#.

Benefits and Drawbacks

This approach can be seen as a hybrid of Binary XML and ordinary XML. Binary data is used to facilitate random access and speed up processing, but the binary data does not carry the information in the XML file; it is merely a supplement. VTD-XML does not reduce the verbosity of XML as Binary XML does, but it retains the other benefits of Binary XML and can be generated using application-specific integrated circuits (ASICs).

VTD-XML is an improvement over DOM in memory size[1] and over SAX in speed[2]. Additionally, the VTD-XML parser offers random access, which is difficult to achieve with SAX.

Notes