
Apache Arrow

From Wikipedia, the free encyclopedia


Apache Arrow
Developer(s): Apache Software Foundation
Initial release: October 10, 2016 (2016-10-10)
Stable release: v0.8.0[1] / December 17, 2017 (2017-12-17)
Repository: https://github.com/apache/arrow
Written in: C++, Java, Python
Type: Data analytics, machine learning algorithms
License: Apache License 2.0
Website: arrow.apache.org

Apache Arrow is an open source development platform for columnar in-memory data structures and processing. It defines a language-independent physical memory layout, enabling zero-copy, zero-deserialization interchange of flat and nested columnar data among a variety of systems, such as Python, R, Apache Spark, ODBC protocols, and proprietary systems that use the open source components. Apache Arrow complements on-disk columnar data formats such as Apache Parquet and Apache ORC in that it organizes data for efficient in-memory processing by CPUs and GPUs.
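A minimal sketch of this interchange, using the project's Python bindings (pyarrow). The names shown (pa.table, pa.ipc.new_stream, pa.ipc.open_stream) come from later pyarrow releases and are assumptions relative to the 0.8 release described in this revision:

    import pyarrow as pa

    # Build a columnar, in-memory Arrow table from Python data.
    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

    # Write the table in Arrow's IPC stream format; a reader in another
    # process or language can map these bytes without deserialization.
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    buf = sink.getvalue()

    # Read it back; the record batches reference the buffer rather than
    # being copied row by row.
    reader = pa.ipc.open_stream(buf)
    roundtrip = reader.read_all()
    assert roundtrip.equals(table)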

In addition to the columnar in-memory data format, Arrow provides native computation functions for Arrow data, called Arrow kernels. Arrow kernels are discrete, thread-model-agnostic execution operations that allow for SIMD- and GPU-accelerated versions of basic in-memory analytics functionality. Examples of Arrow kernels include sorting, computing unique values, dictionary encoding, and hash tables.
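A sketch of two such kernels, dictionary encoding and unique values, via pyarrow. The pyarrow.compute module postdates the 0.8 release, so these calls stand in for the kernels described rather than reproducing the era's API:

    import pyarrow as pa
    import pyarrow.compute as pc

    arr = pa.array(["apple", "banana", "apple", "cherry", "banana"])

    # Dictionary encoding: repeated strings become small integer indices
    # into a dictionary of the distinct values.
    encoded = arr.dictionary_encode()
    print(encoded.indices)     # 0, 1, 0, 2, 1
    print(encoded.dictionary)  # ["apple", "banana", "cherry"]

    # Unique values, computed by a vectorized kernel over the column.
    print(pc.unique(arr))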

The open-source project to develop Apache Arrow includes contributors from many different projects, including Apache Calcite, Apache HBase, Apache Impala, Apache Kudu, Apache Parquet, Apache Phoenix, Apache Spark, Dremio, Ibis, and pandas.

The GPU Open Analytics Initiative includes a number of projects that have standardized on Arrow for on-GPU processing.[2]

Features

  • Queries can read only the needed column values into memory, allowing more data to fit in CPU or GPU RAM than with row-oriented layouts
  • Dictionary encoding makes more efficient use of RAM
  • Arrow buffers can be accessed via pointers, providing zero-copy access to data[3] (see the first sketch following this list)
  • As a common data format, multiple processes can access the same Arrow buffers without serializing or deserializing the data
  • Supported types include scalars (timestamp, time, date, UTF8 string, Boolean, decimal, float, double), complex types (struct, list, map), and advanced types (sparse union, dense union); the second sketch following this list illustrates nested types
  • Arrow is capable of representing fully materialized and decoded/decompressed Parquet data[4]
  • All array slots are accessible in constant time, with complexity growing linearly in the nesting level
  • Any relative type can have null slots.
  • Arrays are immutable once created. Implementations can provide APIs to mutate an array, but applying mutations will require a new array data structure to be built.
  • Arrays are relocatable (e.g. for RPC/transient storage) without pointer swizzling. Another way of putting this is that contiguous memory regions can be migrated to a different address space (e.g. via a memcpy-type of operation) without altering their contents.
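A minimal sketch of pointer-level, zero-copy buffer access using pyarrow. The names shown (Array.buffers, Array.to_numpy) are from later pyarrow releases and may differ from the 0.8 API described in this revision:

    import pyarrow as pa

    arr = pa.array([1, 2, 3, 4], type=pa.int64())

    # An Arrow array is backed by raw buffers: a validity bitmap
    # (None here, since there are no nulls) and a values buffer.
    validity, values = arr.buffers()
    print(values.address, values.size)  # raw pointer and byte length

    # Other libraries can view the same memory without copying it;
    # here NumPy wraps the values buffer directly.
    view = arr.to_numpy(zero_copy_only=True)
    print(view)  # [1 2 3 4]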
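A second sketch constructing the nested (complex) types named above, again using the later pyarrow API as an assumed stand-in for the era's bindings:

    import pyarrow as pa

    # list<int64>: each slot holds a variable-length list of integers.
    lists = pa.array([[1, 2], [], [3, 4, 5]], type=pa.list_(pa.int64()))

    # struct<x: int64, y: string>: each slot holds two named child values.
    structs = pa.array(
        [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}],
        type=pa.struct([("x", pa.int64()), ("y", pa.string())]),
    )

    print(lists.type)    # list<item: int64>
    print(structs.type)  # struct<x: int64, y: string>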

Comparisons

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in memory.[5] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.[6] The Arrow and Parquet projects include libraries that allow for reading and writing data between the two formats.[7]
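A sketch of this interchange using pyarrow's Parquet module, per reference [7]. The file name is hypothetical, and the API shown is from later pyarrow releases:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": [1.0, 2.0, 3.0], "y": ["a", "b", "c"]})

    # Write the in-memory Arrow table to the on-disk Parquet format...
    pq.write_table(table, "example.parquet")

    # ...and read it back, decoding Parquet's encodings into Arrow memory.
    restored = pq.read_table("example.parquet")
    assert restored.equals(table)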

References

  1. ^ "Github releases".
  2. ^ "GOAI".
  3. ^ "Apache Arrow In Theory and In Practice".
  4. ^ "Apache Arrow Memory Layout".
  5. ^ "Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory".
  6. ^ "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?".
  7. ^ "PyArrow: Reading and Writing the Apache Parquet Format".