
Apache Arrow

From Wikipedia, the free encyclopedia


Apache Arrow
Developer(s): Apache Software Foundation
Initial release: October 10, 2016 (2016-10-10)
Stable release: v0.8.0[1] / December 17, 2017 (2017-12-17)
Repository: https://github.com/apache/arrow
Written in: C++, Java, Python
Type: Data analytics, machine learning algorithms
License: Apache License 2.0
Website: arrow.apache.org

Apache Arrow is an open source development platform for columnar in-memory data structures and processing. It defines a language-independent physical memory layout, enabling zero-copy, zero-deserialization interchange of flat and nested columnar data among a variety of systems, such as Python, R, Apache Spark, ODBC protocols, and proprietary systems that use the open source components. Apache Arrow complements on-disk columnar data formats such as Apache Parquet and Apache ORC in that it organizes data for efficient in-memory processing by CPUs and GPUs.
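A minimal sketch of this interchange, using the project's Python bindings (pyarrow). The names shown (pa.table, pa.ipc.new_stream, pa.ipc.open_stream) come from later pyarrow releases and are assumptions relative to the 0.8 release described in this revision:

    import pyarrow as pa

    # Build a columnar, in-memory Arrow table from Python data.
    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

    # Write the table in Arrow's IPC stream format; a reader in another
    # process or language can map these bytes without deserialization.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    buf = sink.getvalue()

    # Read it back; the record batches reference the buffer rather than
    # being copied row by row.
    reader = pa.ipc.open_stream(buf)
    roundtrip = reader.read_all()
    assert roundtrip.equals(table)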

In addition to the columnar in-memory data format, Arrow provides native computation functions for Arrow data, called Arrow kernels. Arrow kernels are discrete, thread-model-agnostic execution operations that allow for SIMD- and GPU-accelerated versions of basic in-memory analytics functionality. Examples of Arrow kernels include sorting, computing unique values, dictionary encoding, and hash tables.
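A sketch of two such kernels, dictionary encoding and unique values, via pyarrow. The pyarrow.compute module postdates the 0.8 release, so these calls stand in for the kernels described rather than reproducing the era's API:

    import pyarrow as pa
    import pyarrow.compute as pc

    arr = pa.array(["apple", "banana", "apple", "cherry", "banana"])

    # Dictionary encoding: repeated strings become small integer indices
    # into a dictionary of the distinct values.
    encoded = arr.dictionary_encode()
    print(encoded.indices)     # 0, 1, 0, 2, 1
    print(encoded.dictionary)  # ["apple", "banana", "cherry"]

    # Unique values, computed by a vectorized kernel over the column.
    print(pc.unique(arr))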

The open-source project to develop Apache Arrow includes contributors from many different projects, including Apache Calcite, Apache HBase, Apache Impala, Apache Kudu, Apache Parquet, Apache Phoenix, Apache Spark, Dremio, Ibis, and pandas.

The GPU Open Analytics Initiative includes a number of projects that have standardized on Arrow for on-GPU processing.[2]

Features

  • Queries can read only the needed column values into memory, allowing more data to fit in CPU or GPU RAM than with row-oriented layouts
  • Dictionary encoding makes more efficient use of RAM
  • Arrow buffers can be accessed via pointers, providing zero-copy access to data[3] (see the first sketch following this list)
  • As a common data format, multiple processes can access the same Arrow buffers without serializing or deserializing the data
  • Supported types include scalars (timestamp, time, date, UTF8 string, Boolean, decimal, float, double), complex types (struct, list, map), and advanced types (sparse union, dense union); the second sketch following this list illustrates nested types
  • Arrow is capable of representing fully materialized and decoded/decompressed Parquet data[4]
  • All array slots are accessible in constant time, with complexity growing linearly in the nesting level
  • Any relative type can have null slots.
  • Arrays are immutable once created. Implementations can provide APIs to mutate an array, but applying mutations will require a new array data structure to be built.
  • Arrays are relocatable (e.g. for RPC/transient storage) without pointer swizzling. Another way of putting this is that contiguous memory regions can be migrated to a different address space (e.g. via a memcpy-type of operation) without altering their contents.
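A minimal sketch of pointer-level, zero-copy buffer access using pyarrow. The names shown (Array.buffers, Array.to_numpy) are from later pyarrow releases and may differ from the 0.8 API described in this revision:

    import pyarrow as pa

    arr = pa.array([1, 2, 3, 4], type=pa.int64())

    # An Arrow array is backed by raw buffers: a validity bitmap
    # (None here, since there are no nulls) and a values buffer.
    validity, values = arr.buffers()
    print(values.address, values.size)  # raw pointer and byte length

    # Other libraries can view the same memory without copying it;
    # here NumPy wraps the values buffer directly.
    view = arr.to_numpy(zero_copy_only=True)
    print(view)  # [1 2 3 4]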
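A second sketch constructing the nested (complex) types named above, again using the later pyarrow API as an assumed stand-in for the era's bindings:

    import pyarrow as pa

    # list<int64>: each slot holds a variable-length list of integers.
    lists = pa.array([[1, 2], [], [3, 4, 5]], type=pa.list_(pa.int64()))

    # struct<x: int64, y: string>: each slot holds two named child values.
    structs = pa.array(
        [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}],
        type=pa.struct([("x", pa.int64()), ("y", pa.string())]),
    )

    print(lists.type)    # list<item: int64>
    print(structs.type)  # struct<x: int64, y: string>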

Comparisons

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in memory.[5] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.[6] The Arrow and Parquet projects include libraries that allow for reading and writing data between the two formats.[7]
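A sketch of this interchange using pyarrow's Parquet module, per reference [7]. The file name is hypothetical, and the API shown is from later pyarrow releases:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1.0, 2.0, 3.0], "y": ["a", "b", "c"]})

    # Write the in-memory Arrow table to the on-disk Parquet format...
    pq.write_table(table, "example.parquet")

    # ...and read it back, decoding Parquet's encodings into Arrow memory.
    restored = pq.read_table("example.parquet")
    assert restored.equals(table)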

References

  1. ^ "Github releases".
  2. ^ "GOAI".
  3. ^ "Apache Arrow In Theory and In Practice".
  4. ^ "Apache Arrow Memory Layout".
  5. ^ "Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory".
  6. ^ "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?".
  7. ^ "PyArrow: Reading and Writing the Apache Parquet Format".