Data cube

In computer programming contexts, a data cube (or datacube) is a multi-dimensional ("n-D") array of values. Typically, the term datacube is applied in contexts where these arrays are massively larger than the hosting computer's main memory; examples include multi-terabyte or multi-petabyte data warehouses and time series of image data.

The data cube is used to represent data along some measure of interest. Even though it is called a 'cube', it can be 1-dimensional, 2-dimensional, 3-dimensional, or higher-dimensional. Each dimension represents an attribute (axis) along which the data is organized, whereas the cells of the cube hold the facts of interest.
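
As a minimal illustration of the distinction between dimensions and cells, the following Python sketch models a small 3-D cube as a mapping from coordinate tuples to values. The axis names (product, region, quarter), the helper `cell`, and all figures are invented for this example; real datacube systems use far more compact array storage.

```python
from itertools import product as cartesian

# Hypothetical axes of a 3-D cube; each axis lists its coordinate values.
axes = {
    "product": ["apples", "pears"],
    "region": ["north", "south"],
    "quarter": ["Q1", "Q2"],
}

# One cell per coordinate combination; the values are made up.
values = [10, 12, 7, 9, 4, 6, 3, 5]
cube = dict(zip(cartesian(*axes.values()), values))

def cell(product, region, quarter):
    """Return the fact stored at one coordinate of the cube."""
    return cube[(product, region, quarter)]

n_dims = len(axes)    # number of dimensions (axes)
n_cells = len(cube)   # number of cells (2 * 2 * 2)
```

Here the three axes are the dimensions, and each of the eight cells holds one fact, for instance `cell("pears", "north", "Q2")`.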

History

Multi-dimensional arrays have long been familiar in programming languages. Fortran offers 1-D arrays and arrays of arrays, which allows the construction of higher-dimensional arrays. APL supports n-D arrays with a rich set of operations. All these have in common that arrays must fit into main memory and are available only while the particular program maintaining them (such as image processing software) is running.

A series of data exchange formats supports storage and transmission of datacube-like data, often tailored towards particular application domains. Examples include MDX for statistical (in particular, business) data, Hierarchical Data Format for general scientific data, and TIFF for imagery.

In 1992, computer scientist Peter Baumann introduced management of massive datacubes with high-level user functionality combined with an efficient software architecture.[1] Datacube operations include subset extraction, processing, fusion, and in general queries in the spirit of data manipulation languages like SQL.

Some years later, the datacube concept was applied to time-varying business data by Jim Gray et al.[2] and by Venky Harinarayan, Anand Rajaraman and Jeff Ullman;[3] both papers rank among the top 500 most cited computer science articles over a 25-year period.[4]

Around that time, a working group on Multi-Dimensional Databases ("Arbeitskreis Multi-Dimensionale Datenbanken") was established at the German Gesellschaft für Informatik.[5][6]

Datacube Inc. was an image processing company that sold hardware and software applications for the PC market in 1996, although its products did not address datacubes as such.

The EarthServer initiative has established geo data cube service requirements.[7]

Standardization

In 2018, the ISO SQL database language was being extended with datacube functionality as "SQL -- Part 15: Multi-dimensional arrays (SQL/MDA)".[8]

Web Coverage Processing Service (WCPS) is a geo datacube analytics language issued by the Open Geospatial Consortium in 2008. In addition to the common datacube operations, the language is aware of the semantics of space and time and supports both regular and irregular grid datacubes, based on the concept of coverage data.

An industry standard for querying business datacubes, originally developed by Microsoft, is MultiDimensional eXpressions (MDX).

Implementation

Many high-level computer languages treat data cubes and other large arrays as single entities distinct from their contents. These languages, of which APL, IDL, NumPy, PDL, and S-Lang are examples, allow the programmer to manipulate complete film clips and other data en masse with simple expressions derived from linear algebra and vector mathematics. Some languages (such as PDL) distinguish between a list of images and a data cube, while many (such as IDL) do not.
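
The following NumPy sketch illustrates what manipulating a clip "en masse" can look like. The clip contents, its size, and the axis order (time, y, x) are assumptions chosen for this example, not a convention prescribed by any of the languages named above.

```python
import numpy as np

# Hypothetical film clip: 5 frames of 4x6 grayscale pixels, axes (t, y, x).
clip = np.arange(5 * 4 * 6, dtype=float).reshape(5, 4, 6)

# Whole-array expressions: no explicit loops over frames or pixels.
brighter = clip * 1.5 + 10.0        # scale and offset every pixel at once
mean_frame = clip.mean(axis=0)      # average over time, giving one 4x6 image
frame_diff = clip[1:] - clip[:-1]   # change between consecutive frames
```

Each statement operates on the whole array as a single entity; the per-pixel loops are carried out inside the library.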

Array DBMSs (Database Management Systems) offer a data model which generically supports definition, management, retrieval, and manipulation of n-dimensional datacubes. This database category was pioneered by the rasdaman system, starting in 1994.[9]

Applications

Multi-dimensional arrays can meaningfully represent spatio-temporal sensor, image, and simulation data, but also statistical data whose dimensions are not necessarily spatial or temporal. Generally, any kind of axis can be combined with any other into a datacube.

Mathematics

In mathematics, a one-dimensional array corresponds to a vector and a two-dimensional array to a matrix; more generally, a tensor may be represented as an n-dimensional data cube.

Science and Engineering

For a time sequence of color images, the array is generally four-dimensional, with the dimensions representing image X and Y coordinates, time, and RGB (or other color space) color plane. For example, the EarthServer initiative[10] unites data centers from different continents offering 3-D x/y/t satellite image time series and 4-D x/y/z/t weather data for retrieval and server-side processing through the Open Geospatial Consortium WCPS geo datacube query language standard.
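
A small NumPy sketch of such a four-dimensional array follows. The axis order (time, y, x, color plane), the array size, and the pixel values are assumptions made for illustration; other orderings are equally valid.

```python
import numpy as np

# Hypothetical clip: 3 time steps, 2 rows (y), 4 columns (x), 3 color planes.
T, Y, X, C = 3, 2, 4, 3
cube = np.zeros((T, Y, X, C), dtype=np.uint8)
cube[..., 0] = 200            # fill the red plane everywhere
cube[1, :, :, 2] = 50         # add blue to the second time step only

frame = cube[1]                   # one color image, shape (Y, X, C)
red_over_time = cube[:, :, :, 0]  # one color plane, shape (T, Y, X)
pixel_series = cube[:, 0, 3, :]   # one pixel through time, shape (T, C)
```

Fixing an index along one axis reduces the dimensionality by one, so the same cube yields single images, single color planes, or per-pixel time series depending on which axes are fixed.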

A data cube is also used in the field of imaging spectroscopy, since a spectrally-resolved image is represented as a three-dimensional volume.

Business Intelligence

In Online analytical processing (OLAP), data cubes are a common arrangement of business data suitable for analysis from different perspectives through operations like slicing, dicing, pivoting, and aggregation.
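
The operations named above can be sketched in plain Python on a toy cube. The fact table, the axis names, and the helper functions `slice_cube`, `dice_cube`, and `aggregate` are all hypothetical and only illustrate the idea; OLAP engines implement these operations very differently.

```python
from collections import defaultdict

# Hypothetical fact table: (product, region, quarter) -> units sold.
facts = {
    ("apples", "north", "Q1"): 10, ("apples", "north", "Q2"): 12,
    ("apples", "south", "Q1"): 7, ("apples", "south", "Q2"): 9,
    ("pears", "north", "Q1"): 4, ("pears", "north", "Q2"): 6,
    ("pears", "south", "Q1"): 3, ("pears", "south", "Q2"): 5,
}
AXES = ("product", "region", "quarter")

def slice_cube(cube, axis, value):
    """Slice: fix one dimension to a single value, dropping that axis."""
    i = AXES.index(axis)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}

def dice_cube(cube, **allowed):
    """Dice: keep only cells whose coordinates lie in the allowed sets."""
    return {
        k: v for k, v in cube.items()
        if all(k[AXES.index(a)] in vals for a, vals in allowed.items())
    }

def aggregate(cube, keep):
    """Aggregate: sum over every axis not listed in `keep`."""
    idx = [AXES.index(a) for a in keep]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[i] for i in idx)] += v
    return dict(totals)

q1 = slice_cube(facts, "quarter", "Q1")
north_apples = dice_cube(facts, product={"apples"}, region={"north"})
by_region = aggregate(facts, keep=("region",))
```

In this sketch, pivoting amounts to choosing a different axis order for the result, for example `aggregate(facts, keep=("quarter", "product"))`.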

References

  1. Language Support for Raster Image Manipulation in Databases, Peter Baumann, April 1992, Int. Workshop on Graphics Modeling, Visualization in Science & Technology, 1992
  2. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, Hamid Pirahesh, January 1997, Data Mining and Knowledge Discovery: Volume 1 Issue 1, 1997
  3. Harinarayan, Venky; Rajaraman, Anand; Ullman, Jeffrey D. (1996). "Implementing Data Cubes Efficiently". doi:10.1145/233269.233333.
  4. 500 Most Cited Computer Science Articles (501–600), CiteSeer. 12 June 2009. Retrieved 21 March 2017.
  5. Der GI-Arbeitskreis Multidimensionale Datenbanken stellt sich vor, Peter Baumann, Wolfgang Lehner, 1997, Datenbank Rundbrief Volume 19, 1997, http://dblp.uni-trier.de/db/journals/gidr/gidr19.html#BaumannL97
  6. Rückblick auf den GI-Arbeitskreis Multidimensionale Datenbanken, Peter Baumann, 1999, Datenbank Rundbrief Volume 23, 1999, http://dblp.uni-trier.de/db/journals/gidr/gidr23.html#Baumann99
  7. "The Database Manifesto". www.earthserver.eu. Retrieved 2017-09-21.
  8. "ISO/IEC DIS 9075-15 Information technology -- Database languages -- SQL -- Part 15: Multi-dimensional arrays (SQL/MDA)". Retrieved 2018-05-27.
  9. "Management of Multidimensional Discrete Data" (PDF). www.vldb.org. Retrieved 2017-09-21.
  10. "EarthServer - Big Datacube Analytics at Your Fingertips". www.earthserver.eu. Retrieved 2017-03-31.