Apache Druid

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Brylie (talk | contribs) at 07:01, 14 April 2016 (Updated stable release). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Druid
Original author(s): Eric Tschetter, Fangjin Yang
Developer(s): The Druid community
Stable release: 0.9.0 / 13 April 2016
Repository
Written in: Java
Operating system: Cross-platform
Type: Distributed, real-time, column-oriented data store
License: Apache License 2.0
Website: druid.io

Druid is an open-source, column-oriented, distributed data store written in Java. It is designed to ingest large quantities of event data quickly and to make that data immediately available to queries.[1] Data served this way is sometimes referred to as real-time data.

On the developer Q&A site Stack Overflow, Druid is described as "open-source infrastructure for real-time exploratory analytics on large datasets."[2] It is designed to ingest event/log data, chunking and compressing that data into column-based queryable segments.[3]
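The chunking of events into column-based segments can be illustrated with a minimal Python sketch. It assumes that events arrive as dictionaries sharing the same fields and carrying an integer `timestamp` in epoch seconds. The function name `build_segments` and the hourly chunk size are illustrative choices rather than part of Druid, and compression is omitted.

```python
def build_segments(events, chunk_seconds=3600):
    """Group row-oriented events into time chunks, storing each chunk by column.

    Assumption: every event has the same keys and an integer "timestamp".
    """
    segments = {}
    for event in events:
        # Truncate the timestamp to the start of its time chunk.
        start = event["timestamp"] - event["timestamp"] % chunk_seconds
        columns = segments.setdefault(start, {})
        # Pivot the row: append each field to that field's column list.
        for name, value in event.items():
            columns.setdefault(name, []).append(value)
    return segments


events = [
    {"timestamp": 7200, "page": "home", "clicks": 3},
    {"timestamp": 7260, "page": "docs", "clicks": 1},
    {"timestamp": 10800, "page": "home", "clicks": 2},
]
segments = build_segments(events)
```

In this sketch the first two events fall into the same hourly chunk, so a query that reads only the `clicks` column of that chunk never has to touch the `page` values.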

Architecture[4]

[Figure: Architecture of the Druid cluster]

Fully deployed, Druid runs as a cluster of specialized nodes to support a fault-tolerant architecture where data is stored redundantly and there are multiple members of each node type.[5] In addition, the cluster includes external dependencies for coordination (Apache ZooKeeper), storage of metadata (MySQL), and a deep storage facility (e.g., HDFS, Amazon S3, or Apache Cassandra).

Data Ingestion

Data is ingested by Druid directly through its real-time nodes, or batch-loaded into historical nodes from a deep storage facility. Real-time nodes accept JSON-formatted data from a streaming datasource. Batch-loaded data formats can be JSON, CSV, or TSV. Real-time nodes temporarily store and serve data in real time, but eventually push the data to the deep storage facility, from which it is loaded into historical nodes. Historical nodes hold the bulk of data in the cluster.

Real-time nodes chunk data into segments, and are designed to frequently move these segments out to deep storage. To maintain cluster awareness of the location of data, these nodes must interact with MySQL to update metadata about the segments, and with Apache ZooKeeper to monitor their transfer.
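The hand-off described above can be sketched in Python as follows. This is a simplified model, not Druid's implementation: plain dictionaries stand in for the deep storage facility and the metadata tables, the class and method names are hypothetical, the ZooKeeper-based monitoring of transfers is left out, and dropping the real-time buffer immediately on hand-off is a simplifying assumption.

```python
class RealtimeNode:
    """Buffers incoming events per segment and holds them until hand-off."""

    def __init__(self, deep_storage, metadata):
        self.buffer = {}                  # segment_id -> list of events
        self.deep_storage = deep_storage  # dict standing in for HDFS, S3, etc.
        self.metadata = metadata          # dict standing in for the metadata tables

    def ingest(self, segment_id, event):
        self.buffer.setdefault(segment_id, []).append(event)

    def hand_off(self, segment_id):
        # Push the segment to deep storage, then record it in the metadata.
        rows = self.buffer.pop(segment_id)
        self.deep_storage[segment_id] = rows
        self.metadata[segment_id] = {"rows": len(rows), "in_deep_storage": True}


class HistoricalNode:
    """Loads published segments from deep storage and keeps them for serving."""

    def __init__(self, deep_storage):
        self.deep_storage = deep_storage
        self.segments = {}

    def load(self, segment_id):
        self.segments[segment_id] = list(self.deep_storage[segment_id])


deep_storage, metadata = {}, {}
realtime = RealtimeNode(deep_storage, metadata)
realtime.ingest("seg-1", {"page": "home"})
realtime.ingest("seg-1", {"page": "docs"})
realtime.hand_off("seg-1")

historical = HistoricalNode(deep_storage)
for segment_id in metadata:
    historical.load(segment_id)
```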

Query Management

Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
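The broker's role of routing a query and merging partial results can be sketched as a scatter-gather step. The following Python example is a minimal illustration under stated assumptions: the `DataNode` and `Broker` classes, the segment identifiers, and the sum-by-dimension query are all hypothetical, and the broker here discovers segment locations by inspecting nodes directly rather than through a coordination service.

```python
from collections import Counter


class DataNode:
    """A historical or real-time node holding some segments."""

    def __init__(self, segments):
        self.segments = segments  # segment_id -> list of row dicts

    def query(self, segment_ids, dimension, metric):
        # Compute a partial result over the locally held segments only.
        partial = Counter()
        for segment_id in segment_ids:
            for row in self.segments[segment_id]:
                partial[row[dimension]] += row[metric]
        return partial


class Broker:
    def __init__(self, nodes):
        self.nodes = nodes  # node name -> DataNode

    def locate(self, segment_ids):
        """Map each node name to the requested segments it will serve."""
        plan = {}
        for segment_id in segment_ids:
            for name, node in self.nodes.items():
                if segment_id in node.segments:
                    plan.setdefault(name, []).append(segment_id)
                    break  # read each segment from one node only
        return plan

    def query(self, segment_ids, dimension, metric):
        merged = Counter()
        for name, local_ids in self.locate(segment_ids).items():
            # Counter.update adds the partial sums into the running total.
            merged.update(self.nodes[name].query(local_ids, dimension, metric))
        return dict(merged)


historical = DataNode({
    "2016-04-13_0": [{"page": "home", "clicks": 3}],
    "2016-04-13_1": [{"page": "docs", "clicks": 1}],
})
realtime = DataNode({
    "2016-04-14_0": [{"page": "home", "clicks": 2}, {"page": "docs", "clicks": 6}],
})
broker = Broker({"historical-1": historical, "realtime-1": realtime})
result = broker.query(
    ["2016-04-13_0", "2016-04-13_1", "2016-04-14_0"], "page", "clicks"
)
```

Each node returns only a small per-dimension summary, so the broker merges a few partial results instead of moving raw rows across the cluster.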

Cluster Management

Operations relating to data management in historical nodes are overseen by coordinator nodes, which are the prime users of the MySQL metadata tables. Apache ZooKeeper is used to register all nodes, manage certain aspects of inter-node communication, and provide for leader elections.
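One data-management task a coordinator might perform is deciding which historical nodes should hold each segment so that data is stored redundantly. The Python sketch below shows a simple least-loaded placement policy. The policy, the function name `assign_segments`, and the default of two replicas are assumptions made for illustration; the source does not describe how Druid's coordinator actually balances segments.

```python
def assign_segments(segment_ids, node_names, replicas=2):
    """Assign each segment to `replicas` distinct nodes, least-loaded first."""
    if replicas > len(node_names):
        raise ValueError("not enough nodes for the requested replica count")
    load = {name: 0 for name in node_names}
    assignment = {}
    for segment_id in segment_ids:
        # Sort by current load; ties break by name so the result is deterministic.
        targets = sorted(node_names, key=lambda name: (load[name], name))[:replicas]
        for name in targets:
            load[name] += 1
        assignment[segment_id] = targets
    return assignment


assignment = assign_segments(
    ["seg-a", "seg-b", "seg-c"], ["hist-1", "hist-2", "hist-3"]
)
```

With three segments, three nodes, and two replicas, every node ends up holding two segments, and the loss of any single node leaves one copy of each segment available.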

Features

  • Low latency (real-time) data ingestion
  • Arbitrary slice and dice data exploration
  • Sub-second analytic queries
  • Approximate and exact computations

History

The project was started to power the analytics product of Metamarkets. The first line of code was committed by Eric Tschetter to a private GitHub repository in March 2011. He converted the first Metamarkets customer to it by the end of March 2011, and all customers were converted by mid-May. Metamarkets subsequently decided to invest more in the system and hired Fangjin Yang in September 2011.[6] The project was open-sourced under the GPL in October 2012,[7][8] and moved to an Apache License in February 2015.[9][10]

A number of organizations and companies, including Netflix[11] and Yahoo,[12] have integrated Druid into their backend technology. In July 2015, Yahoo engineering published a blog post titled "Complementing Hadoop at Yahoo: Interactive Analytics with Druid", which described the company's increasing use of Druid and called it "ideal for powering interactive, user-facing, analytic applications."

In October 2015, the commercial company Imply launched to provide enterprise-level support and professional services for Druid.[13]

References

  1. ^ Hemsoth, Nicole. "Druid Summons Strength in Real-Time", Datanami, 8 November 2012
  2. ^ Stack Overflow shorthand tag description
  3. ^ Monash, Curt. "Metamarkets Druid Overview", Monash Research, 16 June 2012
  4. ^ Druid Project Documentation
  5. ^ Yang, Fangjin; Tschetter, Eric; Léauté, Xavier; Ray, Nelson; Merlino, Gian; Ganguli, Deep. "Druid: A Real-time Analytical Data Store", Metamarkets, retrieved 6 February 2014
  6. ^ "Hadoop Creator Yahoo Finds It's Not Enough, Turns to Druid -- ADTmag". adtmag.com. Retrieved 2015-08-04.
  7. ^ Tschetter, Eric. "Introducing Druid", Druid.io, 24 October 2012
  8. ^ Higginbotham, Stacey. "Metamarkets open sources Druid, its in-memory database", GigaOM, 24 October 2012
  9. ^ Harris, Derrick (2015-02-20). "The Druid real-time database moves to an Apache license". Retrieved 2015-08-04.
  10. ^ "Druid Gets Open Source-ier Under the Apache License". Retrieved 2015-08-04.
  11. ^ Bae, Jae Hyeon; Yuan, Danny; Tonse, Sudhir. "Announcing Suro: Backbone of Netflix's Data Pipeline", Netflix, 9 December 2013
  12. ^ Iranmanesh, Reza; Chandrashekar, Srikalyan. "Pushing the limits of Realtime Analytics using Druid", Slideshare, 19 July 2014
  13. ^ Novet, Jordan. "Imply launches with $2M to commercialize the Druid open-source data store", Venturebeat, 19 October 2015