Data stream management system
A Data stream management system (DSMS) is a computer program to manage continuous data streams. It is similar to a database management system (DBMS), which is, however, designed for static data in conventional databases. A DSMS also offers a flexible query processing so that the information need can be expressed using queries. However, in contrast to a DBMS, a DSMS executes a continuous query that is not only performed once, but is permanently installed. Therefore, the query is continuously executed until it is explicitly uninstalled. Since most DSMS are data-driven, a continuous query produces new results as long as new data arrive at the system. This basic concept is similar to Complex event processing so that both technologies are partially coalescing.
One of the most important features of a DSMS is the possibility to handle potentially infinite and rapidly changing data streams by offering a flexible processing at the same time, although there are only limited resources like a limited main memory. The following table provides various principles of DSMS and compares them to traditional DBMS.
|Database management system (DBMS)||Data stream management system (DSMS)|
|Persistent data (relations)||volatile data streams|
|Random access||Sequential access|
|One-time queries||Continuous queries|
|(theoretically) unlimited secondary storage||limited main memory|
|Only the current state is relevant||Consideration of the order of the input|
|relatively low update rate||potentially extremely high update rate|
|Little or no time requirements||Real-time requirements|
|Assumes exact data||Assumes outdated/inaccurate data|
|Plannable query processing||Variable data arrival and data characteristics|
Processing and streaming models
One of the biggest challenges for a DSMS is to handle potentially infinite data streams using a fixed amount of memory and no random access to the data. There are different approaches to limit the amount of data in one pass, which can be divided into two classes. For the one hand, there are compression techniques that try to summarize the data and for the other hand there are window techniques that try to portion the data into (finite) parts.
The idea behind compression techniques is to maintain only a synopsis of the data, but not all (raw) data points of the data stream. The algorithms range from selecting random data points called sampling to summarization using histograms, wavelets or sketching. One simple example of a compression is the continuous calculation of an average. Instead of memorizing each data point, the synopsis only holds the sum and the number of items. The average can be calculated by dividing the sum by the number. However, it should be mentioned that synopses cannot reflect the data accurately. Thus, a processing that is based on synopses may produce inaccurate results.
Instead of using synopses to compress the characteristics of the whole data streams, window techniques only look on a portion of the data. This approach is motivated by the idea that only the most recent data are relevant. Therefore, a window continuously cuts out a part of the data stream, e.g. the last ten data stream elements, and only considers these elements during the processing. There are different kinds of such windows like sliding windows that are similar to FIFO lists or tumbling windows that cuts out disjoints parts. Furthermore, the windows can also be differentiated into element based to consider, e.g. the last ten elements, or time based windows, e.g. to consider the last ten seconds of data. Additionally, there are also different approaches to implement windows. There are, for example, approaches that use timestamps or time intervals for system-wide windows or buffer-based windows for each single processing-step.
Since there are a lot of prototypes, there is no standardized architecture. However, most DSMS are based on the query processing in DBMS by using languages to express queries, which are translated into a plan of operators. These plans can be optimized and are executed. A query processing often consists of the following steps.
Formulation of continuous queries
The formulation of queries is mostly done using declarative languages like SQL in DBMS. Since there are no standardized query languages to express continuous queries, there are a lot of languages and variations. However, most of them are oriented to SQL like the Continuous Query Language (CQL), StreamSQL or EPL. There are also graphical approaches where each processing step is a box and the processing flow is expressed by arrows between the boxes.
The language strongly depends on the processing model. For example, if windows are used for the processing, the definition of the window has to be expressed. In StreamSQL, a query with a sliding window for the last 10 elements looks like follows:
SELECT AVG(price) FROM examplestream [SIZE 10 ADVANCE 1 TUPLES] WHERE VALUE > 100.0
This stream continuously calculates the average value of "price" of the last 10 tuples, but only considers those tuples for the average calculation where price is greater than 100.0.
In the next step, the declarative query is translated into a logical query plan. A query plan is a directed graph where the nodes are operators and the edges describe the processing flow. Each operator within the query plan encapsulates the semantic of a specific operation like filtering or aggregation. In DSMS that process relational data streams, the operators are equal or similar to the operators of the Relational algebra, so that there are operators for selection, projection, join, or set operations. This operator concept allows the very flexible and versatile processing of a DSMS.
Optimization of queries
The logical query plan can be optimized, which strongly depends on the streaming model. The basic concepts for optimizing continuous queries are equal to those from database systems. If there are relational data streams and the logical query plan is based on relational operators from the Relational algebra, a query optimizer can use the algebraic equivalences to optimize the plan. These may be, for example, to push selection operators down to the sources, because they are not so computationally intensive like join operators.
Furthermore, there are also cost-based optimization techniques like in DBMS, where a query plan with the lowest costs is chosen from different equivalent query plans. One example is to choose the order of two successive join operators. In DBMS this decision is mostly done by certain statistics of the involved databases. But, since the data of a data streams is unknown in advance, there are no such statistics in a DSMS. However, it is possible to observe a data stream for a certain time to obtain some statistics. Using these statistics, the query can also be optimized later. So, in contrast to a DBMS, some DSMS allows to optimize the query even during runtime. Therefore, a DSMS needs some plan migration strategies to replace a running query plan with a new one.
Transformation of queries
Since a logical operator is only responsible for the semantics of an operation but does not consists of any algorithms, the logical query plan must be transformed into an executable counterpart. This is called a physical query plan. The distinction between a logical and a physical operator plan allows more than one implementation for the same logical operator. The join, for example, is logically the same, although it can be implemented by different algorithms like a Nested loop join or a Sort-merge join. Notice, these algorithms also strongly depend on the used stream and processing model. Finally, the query is available as a physical query plan.
Execution of queries
Since the physical query plan consists of executable algorithms, it can be directly executed. For this, the physical query plan is installed into the system. The bottom of the graph (of the query plan) is connected to the incoming sources, which can be everything like connectors to sensors. The top of the graph is connected to the outgoing sinks, which may be for example a visualization. Since most DSMS are data-driven, a query is executed by pushing the incoming data element from the source through the query plan to the sink. Each time when the data element passes an operator, the operator performs its specific operation on the data element and forwards the result to all successive operators.
Data Stream Management Systems
- STREAM 
- AURORA, StreamBase Systems, Inc.
- TelegraphCQ 
- PIPES, webMethods Business Events
- InfoSphere Streams
- Arasu, A., et. al. STREAM: The Stanford Data Stream Management System. Technical Report. 2004, Stanford InfoLab.
- Abadi; et al. "Aurora: A Data Stream Management System". SIGMOD 2003. CiteSeerX: 10.1.1.67.8671.
- Chandrasekaran, S. et al, "TelegraphCQ: Continuous Dataflow Processing for an Uncertain World." CIDR 2003.
- Chen, J. et al, "NiagaraCQ: A Scalable Continuous Query System for Internet Databases." SIGMOD 2000.
- Aggarwal, Charu C. (2007). Data Streams: Models and Algorithms. New York: Springer. ISBN 978-0-387-47534-9.
- Golab, Lukasz; Özsu, M. Tamer (2010). Data Stream Management. Waterloo, USA: Morgan and Claypool. ISBN 978-1-608-45272-9.
- Using Data Stream Management Systems for Traffic Analysis: A Case Study, last visited 2013-01-10
- STREAM: Stanford Stream Data Manager, last visited 2013-01-10
- NiagaraST: A Research Data Stream Management System at Portland State University, last visited 2013-01-10
- Odysseus: An open source Java based framework for Data Stream Management Systems, last visited 2013-01-10
- Processing Flows of Information: From Data Stream to Complex Event Processing - Survey article on Data Stream and Complex Event Processing Systems, last visited 2013-01-10
- StreamSQL reference, last visited 2013-01-10
- Stream processing with SQL - Introduction to streaming data management with SQL