Druid (open-source data store)
|Original author(s)||Eric Tschetter, Fangjin Yang|
|Developer(s)||The Druid Community|
|Stable release||0.6.171 / 21 January 2015|
|Type||distributed, real-time, column-oriented data store|
|License||Apache License 2.0|
Druid is a column-oriented open-source distributed data store written in Java. Druid is designed to quickly ingest massive quantities of time-series data, making that data immediately available to queries. This is sometimes referred to as real-time data.
On the developer Q&A site Stack Overflow, Druid is described as "open-source infrastructure for real-time exploratory analytics on large datasets." It is designed to ingest time-series data, chunking and compressing that data into column-based queryable segments.
Fully deployed, Druid runs as a cluster of specialized nodes to support a fault-tolerant architecture where data is stored redundantly and there are multiple members of each node type. In addition, the cluster includes external dependencies for coordination (Apache ZooKeeper), storage of metadata (MySQL), and a deep storage facility (e.g., HDFS, Amazon S3, or Apache Cassandra).
Data is ingested by Druid directly through its real-time nodes, or batch-loaded into historical nodes from a deep storage facility. Real-time nodes accept JSON-formatted data from a streaming datasource. Batch-loaded data formats can be JSON, CSV, or TSV. Real-time nodes temporarily store and serve data in real time, but eventually push the data to the deep storage facility, from which it is loaded into historical nodes. Historical nodes hold the bulk of data in the cluster.
Real-time nodes chunk data into segments, and are designed to frequently move these segments out to deep storage. To maintain cluster awareness of the location of data, these nodes must interact with MySQL to update metadata about the segments, and with Apache ZooKeeper to monitor their transfer.
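The chunking of time-series events into column-based segments can be illustrated with a minimal Python sketch. This is not Druid's implementation: the hourly granularity, the function names `segment_key` and `build_segments`, and the sample events are all assumptions chosen for illustration, and the sketch assumes every event carries the same fields.

```python
from collections import defaultdict
from datetime import datetime, timezone


def segment_key(timestamp_iso):
    """Return the hour bucket (an assumed segment granularity) for an event."""
    ts = datetime.fromisoformat(timestamp_iso).astimezone(timezone.utc)
    return ts.strftime("%Y-%m-%dT%H:00Z")


def build_segments(events):
    """Group row-oriented events by hour and store each group column by column.

    Assumes all events share the same field names, so that the lists in
    each segment stay aligned by row position.
    """
    segments = defaultdict(lambda: defaultdict(list))
    for event in events:
        columns = segments[segment_key(event["timestamp"])]
        for name, value in event.items():
            columns[name].append(value)
    return {key: dict(cols) for key, cols in segments.items()}


# Hypothetical JSON-style events, as a real-time node might receive them.
events = [
    {"timestamp": "2015-01-21T10:05:00+00:00", "page": "home", "clicks": 3},
    {"timestamp": "2015-01-21T10:40:00+00:00", "page": "about", "clicks": 1},
    {"timestamp": "2015-01-21T11:02:00+00:00", "page": "home", "clicks": 7},
]
segments = build_segments(events)
```

Here each segment covers one hour of data, and a query over the `clicks` column only has to read that column's list rather than every field of every event.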
Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
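The broker's role of routing a query to the nodes that hold the relevant segments and merging their partial results can be sketched as follows. This is a simplified illustration in Python rather than Druid's actual code; the node names, the `SEGMENT_LOCATIONS` and `NODE_DATA` tables, and the functions `query_node` and `broker_query` are hypothetical.

```python
from collections import Counter

# Hypothetical cluster state: which node serves each segment.
SEGMENT_LOCATIONS = {
    "2015-01-21T10:00Z": "historical-1",
    "2015-01-21T11:00Z": "historical-2",
    "2015-01-21T12:00Z": "realtime-1",
}

# Hypothetical (page, clicks) rows held by each node, keyed by segment.
NODE_DATA = {
    "historical-1": {"2015-01-21T10:00Z": [("home", 3), ("about", 1)]},
    "historical-2": {"2015-01-21T11:00Z": [("home", 7)]},
    "realtime-1": {"2015-01-21T12:00Z": [("about", 2), ("home", 1)]},
}


def query_node(node, segment_ids):
    """Stand-in for a data node: sum clicks per page over its segments."""
    partial = Counter()
    for segment_id in segment_ids:
        for page, clicks in NODE_DATA[node][segment_id]:
            partial[page] += clicks
    return partial


def broker_query(segment_ids):
    """Route segments to their nodes, then merge the partial results."""
    by_node = {}
    for segment_id in segment_ids:
        by_node.setdefault(SEGMENT_LOCATIONS[segment_id], []).append(segment_id)

    merged = Counter()
    for node, node_segments in by_node.items():
        merged.update(query_node(node, node_segments))  # adds counts per page
    return dict(merged)


result = broker_query(sorted(SEGMENT_LOCATIONS))
```

Because click sums are additive, the broker can combine the per-node totals without seeing the underlying rows. Aggregations that are not additive would need a different merge step.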
Operations relating to data management in historical nodes are overseen by coordinator nodes, which are the prime users of the MySQL metadata tables. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communications, and provide for leader elections.
- Time series queries
- TopN queries
- GroupBy queries
- Join queries are not supported as of this writing
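As an illustration of the first query type, a time series query can be expressed as a JSON object. The Python sketch below only builds and serializes such an object and does not contact a cluster. The field names follow the JSON query format described in the Druid project documentation as understood here and should be checked against the documentation for the release in use; the data source `page_events`, the metric `clicks`, and the helper `timeseries_query` are hypothetical.

```python
import json


def timeseries_query(data_source, interval, metric):
    """Build a time series query that sums one metric per hour.

    The data source and metric names are supplied by the caller and are
    purely illustrative.
    """
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "hour",
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": "total_" + metric, "fieldName": metric}
        ],
    }


query = timeseries_query(
    "page_events", "2015-01-21T00:00/2015-01-22T00:00", "clicks"
)
body = json.dumps(query, indent=2)
print(body)
```

The serialized body is what a client would submit to a broker node, which then forwards the query to the data nodes holding the requested interval.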
The project was started to power the analytics product of Metamarkets. The first line of code was committed by Eric Tschetter to a private GitHub repository in March 2011. He converted the first Metamarkets customer to it by the end of March 2011, and all customers were converted by mid-May. Metamarkets subsequently decided to invest more in the system and hired Fangjin Yang in September 2011. With Yang on the development team, feature development of Druid accelerated. The project was open-sourced under the GPL in October 2012. Since then, a number of organizations and companies, including Netflix and Yahoo, have integrated Druid into their backend technology.
- Hemsoth, Nicole. "Druid Summons Strength in Real-Time", datanami, 08 November 2012
- Stack Overflow shorthand tag description
- Monash, Curt. "Metamarkets Druid Overview", DBMS2, 16 June 2012
- Druid Project Documentation
- Yang, Fangjin; Tschetter, Eric; Léauté, Xavier; Ray, Nelson; Merlino, Gian; Ganguli, Deep. "Druid: A Real-time Analytical Data Store", Metamarkets, retrieved 6 February 2014
- Tschetter, Eric. "Introducing Druid", Druid.io, 24 October 2012
- Higginbotham, Stacey. "Metamarkets open sources Druid, its in-memory database", GigaOM, 24 October 2012
- Bae, Jae Hyeon; Yuan, Danny; Tonse, Sudhir. "Announcing Suro: Backbone of Netflix's Data Pipeline", Netflix, 9 December 2013
- Iranmanesh, Reza; Chandrashekar, Srikalyan. "Pushing the limits of Realtime Analytics using Druid", Slideshare, 19 July 2014