Document-oriented database

From Wikipedia, the free encyclopedia

A document-oriented database is a computer program designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" (or "document store") has grown[1] with the use of the term NoSQL itself. In contrast to relational databases and their notions of "Relations" (or "Tables"), these systems are designed around an abstract notion of a "Document".

Documents

The central concept of a document-oriented database is that Documents, in largely the usual English sense, contain vast amounts of data which can usefully be made available. Document-oriented database implementations differ widely in detail and functionality. Most accept documents in a variety of forms, and encapsulate them in a standardized internal format, while extracting at least some specific data items that are then associated with the document.

A trivial example would be scanning paper documents, extracting the title, author, and date from them either by OCR or by having a human locate and enter them, and storing each document as a row in a four-column relational table, the columns being author, title, date, and a blob full of page images. Some document-oriented databases do essentially the same thing, but with PDF files (which may or may not contain text rather than images of text).

Today much more can be accomplished, and an effective document-oriented database must extract and manage a great deal more information about the documents it manages. Fortunately, documents are usually now available in more usable forms. A great deal of publishing is done in HTML, XML, TeX, or systems that can at least export or convert to those. Many other documents in the real world are emails, which also have a moderate amount of metadata available explicitly in their headers. In such cases a document database has access not just to images but to actual words, phrases, paragraph boundaries, and descriptive labels indicating the significance of parts of the text ("footnote", "chapter", "author name", etc.), and can make all that available for searching, statistical analysis, data mining, and other uses. Even when data is not in high-value forms such as these, modern document-oriented databases can often extract meaningful components via heuristic and other methods.

In a non-document database, there is generally a very small range of fields, many or most of which can only occur in extremely limited contexts, and which are generally required in those contexts. For example, a "person" record might consist of first and last names, address, city, country, work phone, home phone, and so on. Importantly, none of those fields has much internal structure, or repeats. In traditional databases, any repeatable field(s) require creating a separate table, in which multiple records refer back to the record they relate to in the original table, via a "key". Likewise, a traditional database is unaware of any structure within a given field, since fields are limited to a few atomic datatypes such as integers, dates, and strings.
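The point about repeatable fields can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumed names: the `person` and `phone` tables, their columns, and the sample values are hypothetical and are not taken from any particular system.

```python
import sqlite3

# Hypothetical schema for illustration; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        id         INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL,
        city       TEXT
    );
    -- A repeatable field (phone numbers) needs its own table,
    -- whose rows refer back to the person row via a key.
    CREATE TABLE phone (
        person_id INTEGER NOT NULL REFERENCES person(id),
        kind      TEXT NOT NULL,
        number    TEXT NOT NULL
    );
""")
conn.execute(
    "INSERT INTO person (id, first_name, last_name, city) "
    "VALUES (1, 'Bob', 'Smith', 'Leeds')"
)
conn.executemany(
    "INSERT INTO phone (person_id, kind, number) VALUES (?, ?, ?)",
    [(1, "work", "555-0100"), (1, "home", "555-0101")],
)

# Reassembling one person's phone numbers requires a join on the key.
rows = conn.execute(
    "SELECT p.first_name, ph.kind, ph.number "
    "FROM person AS p JOIN phone AS ph ON ph.person_id = p.id "
    "ORDER BY ph.kind"
).fetchall()
print(rows)
```

Each field holds a single atomic value, so the database has no notion of structure inside `number` or `city`; anything that repeats has to be moved into a second table and joined back.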

Documents, in contrast, are structured in ways accessible to humans as well as computers. They are characterized by extremely frequent re-use of small components (words and phrases, but also component types such as "paragraph" or "footnote"), and by very free mixture of those types, as compared to the mixtures allowed in traditional databases. Hamlet is a document, consisting of structural units such as acts, scenes, speeches, attributions, stage directions, and notes. An entry in one's smart-phone address book is a "document" but only barely so, resembling a single record in a relational or similar database far more.

Almost any format can be used for the extracted metadata, for example XML, YAML, JSON, or BSON. The document itself, however, is usually also stored, at least as a blob in its original format, which may be XML, PDF, a proprietary or binary word-processor format, or "plain text". The functionality of the database depends largely on the format in which documents reach it and on the database's ability to extract specific data from that format.
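As a rough sketch of this arrangement, the following Python fragment keeps an email message as an unmodified blob and stores a few header fields extracted from it as JSON. The message text, the `extract_metadata` helper, and the layout of `record` are all assumptions made for illustration; real systems choose their own fields and storage formats.

```python
import json
from email import message_from_bytes

# Hypothetical sample message; addresses and text are made up for illustration.
raw = (
    b"From: bob@example.org\r\n"
    b"To: alice@example.org\r\n"
    b"Subject: Draft of the paper\r\n"
    b"Date: Mon, 06 Jan 2014 10:00:00 +0000\r\n"
    b"\r\n"
    b"The draft is attached.\r\n"
)


def extract_metadata(blob: bytes) -> dict:
    """Pull a few explicit header fields out of an email message."""
    msg = message_from_bytes(blob)
    return {name.lower(): msg[name] for name in ("From", "To", "Subject", "Date")}


# The original bytes are kept untouched; the metadata is stored alongside as JSON.
record = {
    "original": raw,
    "metadata_json": json.dumps(extract_metadata(raw), sort_keys=True),
}
print(record["metadata_json"])
```

Keeping the original bytes means the metadata can be re-extracted later if the extraction rules improve.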

Documents inside a document-oriented database are similar, in some ways, to records or rows in relational databases, but they have vastly more internal structure (the extent to which the database itself is aware of that structure, and can use it, varies). Documents, particularly in XML, TeX, and other high-end formats, often adhere to a formal schema; but many documents do not, or if they do, the schema is not explicit. For example, the following is a document:

 <Article>
   <Author>
     <FirstName>Bob</FirstName>
     <Surname>Smith</Surname>
   </Author>
   <Abstract>This paper concerns....</Abstract>
   <Section n="1">
     <Title>Introduction</Title>
     <Para>...</Para>
   </Section>
 </Article>

A second document, even of the same genre and schema, may have a far different number and arrangement of sections, paragraphs, and the like; it may have multiple co-authors; it may have much other metadata such as copyright or publication information, bibliographic references to other documents (in the same or other databases, or in no database at all), and so on.

Two such documents typically share many structural elements with one another, but each may also have elements the other does not. Unlike a relational database where every record contains the identical sequence of fields (a few of which may be empty or hold missing value indicators), document structures generally allow for an unbounded number of hierarchically organized components, with extensive repetition. It would be absurd, for example, to design a database with a table for "sections" that tried to provide as many fields as the number of paragraphs in the longest section one will ever see (not to mention the many other kinds of document components that appear within sections). Even if one did, naming fields in a relation something like "p1", "p2",... does not, so far as the database is concerned, indicate that those fields have anything to do with one another, or belong in a certain meaningful order. In order to avoid confusion with the quite different notion of database "fields", document databases may refer to the parts of documents as "components" or "elements".
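A minimal sketch of how software can handle such variation, using Python's standard xml.etree.ElementTree module: two articles of the same genre, with different numbers of authors, sections, and paragraphs, are summarized by the same function. The second document, the `Copyright` element, and the `summarize` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

doc_a = """<Article>
  <Author><FirstName>Bob</FirstName><Surname>Smith</Surname></Author>
  <Abstract>This paper concerns....</Abstract>
  <Section n="1"><Title>Introduction</Title><Para>...</Para></Section>
</Article>"""

# Hypothetical second document: two authors, extra metadata, more sections.
doc_b = """<Article>
  <Author><FirstName>Ann</FirstName><Surname>Jones</Surname></Author>
  <Author><FirstName>Bob</FirstName><Surname>Smith</Surname></Author>
  <Copyright>2014</Copyright>
  <Section n="1"><Title>Background</Title><Para>...</Para><Para>...</Para></Section>
  <Section n="2"><Title>Method</Title><Para>...</Para></Section>
</Article>"""


def summarize(xml_text):
    """List the authors and, per section, the title and paragraph count."""
    root = ET.fromstring(xml_text)
    authors = [
        f"{a.findtext('FirstName')} {a.findtext('Surname')}"
        for a in root.findall("Author")
    ]
    sections = [
        (s.findtext("Title"), len(s.findall("Para")))
        for s in root.findall("Section")
    ]
    return {"authors": authors, "sections": sections}


print(summarize(doc_a))
print(summarize(doc_b))
```

Repeated components come back as ordered lists of whatever length the document happens to have, with no fixed set of "p1", "p2" style fields.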

Documents, however, often conform to formal schemas which constrain just what classes of components are allowed, and where. TeX provides a wide range of components, though authors can create their own as well. The many established schemas for use with XML are similar, but authors can also create or use a formal schema in a schema language such as DTD, XSD, Relax NG, or Schematron. Among the most widely used schemas are JATS for technical journals, the Text Encoding Initiative for literary works, DocBook for computer systems manuals, and HTML for Web publication.

Some of the most popular Web sites are document databases. Examples include the many collections of articles at pubmed.gov or at major journal publishers, Wikipedia and its kin, and even search engines (though many of those store links to indexed documents, rather than the full documents themselves).

Keys and retrieval

Documents may be addressed in the database via a unique key that represents that document. This key is often a simple string, a URI, or a path. The key can be used to retrieve the document from the database. Typically, the database retains an index on the key to speed up document retrieval. The most primitive document databases may do little more than that. However, modern document-oriented databases provide far more, because they extract and index all kinds of metadata, and usually also the entire data content, of the documents. Such databases offer a query language that allows the user to retrieve documents based on their content. For example, a user may want to retrieve all the documents whose date falls within some range, or that contain a citation to another document. The set of query APIs or query language features available, as well as the expected performance of the queries, varies significantly from one implementation to the next.
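The two kinds of access described above, lookup by key and retrieval by content, can be sketched with a toy in-memory store. The class `TinyDocumentStore`, its methods, and the sample keys are hypothetical; no real product's API is being shown, and a real system would use proper indexes rather than the linear scan used here.

```python
from datetime import date


class TinyDocumentStore:
    """Toy in-memory store; every name here is hypothetical."""

    def __init__(self):
        # key -> document; the dict plays the role of the index on the key
        self._by_key = {}

    def put(self, key, body, published, cites=()):
        self._by_key[key] = {"body": body, "published": published, "cites": set(cites)}

    def get(self, key):
        """Retrieve a document directly by its unique key."""
        return self._by_key[key]

    def find(self, published_from=None, published_to=None, cites=None):
        """Return the keys of documents matching the given content criteria."""
        matches = []
        for key, doc in self._by_key.items():
            if published_from is not None and doc["published"] < published_from:
                continue
            if published_to is not None and doc["published"] > published_to:
                continue
            if cites is not None and cites not in doc["cites"]:
                continue
            matches.append(key)
        return sorted(matches)


store = TinyDocumentStore()
store.put("/articles/a", "First article", date(2013, 5, 1))
store.put("/articles/b", "Second article", date(2014, 2, 10), cites=["/articles/a"])
store.put("/articles/c", "Third article", date(2014, 9, 3),
          cites=["/articles/a", "/articles/b"])

print(store.get("/articles/b")["body"])
print(store.find(published_from=date(2014, 1, 1), published_to=date(2014, 12, 31)))
print(store.find(cites="/articles/b"))
```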

Organization

Implementations offer a variety of ways of organizing documents, including notions of:

  • Collections
  • Tags
  • Non-visible Metadata
  • Directory hierarchies
  • Buckets
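Two of these notions, collections and tags, can be illustrated with a small sketch in which each is a mapping from a name to a set of document keys. The function `file_document` and all keys, collection names, and tags are assumptions made for the example; actual implementations differ in how they model these groupings.

```python
from collections import defaultdict

# Hypothetical bookkeeping structures, not taken from any real implementation.
collections = defaultdict(set)  # collection name -> document keys
tags = defaultdict(set)         # tag -> document keys


def file_document(key, collection, doc_tags):
    """Place a document in one collection and attach any number of tags."""
    collections[collection].add(key)
    for tag in doc_tags:
        tags[tag].add(key)


file_document("/articles/smith-intro", "articles", ["draft", "xml"])
file_document("/articles/jones-method", "articles", ["xml"])
file_document("/mail/0001", "mail", ["draft"])

# Set intersection answers "which drafts are in the articles collection?"
drafts_in_articles = collections["articles"] & tags["draft"]
print(sorted(drafts_in_articles))
```

In this sketch a document belongs to exactly one collection but may carry many tags, which is one common way of distinguishing the two.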

Implementations

| Name | Publisher | License | Language | Notes | RESTful API |
|---|---|---|---|---|---|
| ArangoDB | triAGENS | Apache License 2.0 | C, C++ & Javascript | A distributed multi model, high-performance document store and graph database. | Yes [2] |
| BaseX | BaseX Team | BSD License | Java, XQuery | Support for XML, JSON and binary formats; client-/server based architecture; concurrent structural and full-text searches and updates; REST APIs. | Yes |
| Cassandra | Apache Software Foundation | Apache License | Java | JSON over HTTP | Yes |
| Cloudant | Cloudant, Inc. | Proprietary | Erlang, Java, Scala, and C | Distributed database service based on BigCouch, the company's open source fork of the Apache-backed CouchDB project. | Yes |
| Clusterpoint | Clusterpoint Ltd. | Free community license / Commercial[3] | C++ | Schema-free, document-oriented database management system platform with server based data storage, full text search engine functionality, information ranking for search relevance and clustering. | Yes |
| Couchbase Server | Couchbase, Inc. | Apache License | Erlang and C | Distributed NoSQL Document Database. | Yes [4] |
| CouchDB | Apache Software Foundation | Apache License | Erlang | JSON over REST/HTTP with Multi-Version Concurrency Control and limited ACID properties. Uses map and reduce for views and queries.[5] | Yes [6] |
| eXist | eXist, [2] | LGPL | XQuery, Java | XML over REST/HTTP, WebDAV, Lucene Fulltext search, validation, versioning, clustering, triggers, URL rewriting, collections, ACLS, XQuery Update | Yes [7] |
| FleetDB | FleetDB | MIT License | Clojure | A JSON-based schema-free database optimized for agile development. | (unknown) |
| Informix | IBM | Proprietary | Various (Compatible with MongoDB API) | RDBMS with JSON, replication, sharding and ACID compliance | (unknown) |
| Inquire | Infodata Systems, Inc. | Proprietary | unknown | In the mid-80's this was the dominant document-oriented commercial database, widely successful. The company seems to have gone out of business in 2005. | (unknown) |
| Lotus Notes | IBM | Proprietary | LotusScript, Java, Lotus @Formula | | (unknown) |
| MarkLogic | MarkLogic Corporation | Free Developer license or Commercial | REST, Java, XQuery, XSLT, C++ | Distributed document-oriented database with Multi-Version Concurrency Control, integrated Full text search and ACID-compliant transaction semantics | Yes |
| MongoDB | MongoDB, Inc | GNU AGPL v3.0 for the DBMS, Apache 2 License for the client drivers[8] | C++ | Document database with replication and sharding | Optional [9] |
| MUMPS Database[10] | | Proprietary and Affero GPL[11] | MUMPS | Commonly used in health applications. | (unknown) |
| OrientDB | Orient Technologies | Apache License | Java | JSON over HTTP | Yes |
| RavenDB | Hibernating Rhinos LTD | Proprietary and modified Affero GPL[12] | C#, JavaScript | | Yes |
| Redis | | BSD License | ANSI C | Key-value store supporting lists and sets with binary-safe protocol | (unknown) |
| RethinkDB | | GNU AGPL for the DBMS, Apache 2 License for the client drivers | C++ | | (unknown) |
| Rocket U2 | Rocket Software | Proprietary | | UniData, UniVerse | Yes (Beta) |
| Sqrrl Enterprise | sqrrl | Proprietary | Java | Distributed, real-time database featuring cell-level security and massive scalability. | Yes |


XML database implementations

Further information: XML database

Most XML databases are document-oriented databases.
