User:Waury/sandbox
Developer(s) | HPI, HU Berlin, TU Berlin |
---|---|
Stable release | 0.4 / January 13, 2014 |
Preview release | 0.5-SNAPSHOT |
Written in | Java, Scala |
Operating system | Cross-platform |
Type | Distributed Computing |
License | Apache License 2.0 |
Website | stratosphere |
Stratosphere is an open-source software framework for processing large data sets on clusters. It is developed by the Hasso Plattner Institute, HU Berlin, TU Berlin[1] and individual contributors. It is licensed under the Apache License 2.0.
Architecture
Front ends
Stratosphere provides Java and Scala APIs for writing programs.[2] The internals of Stratosphere are implemented in Java.
Runtime
Operators
Similar to MapReduce and Apache Hadoop, the Stratosphere runtime provides second-order functions called operators (referred to in publications as PACTs, short for PArallelization ContracTs) that process records with user-defined functions (UDFs).[3]
Function name | Description |
---|---|
MAP | Receives one record stream and applies the UDF to each record. |
REDUCE | Receives one record stream and applies the UDF to each group of records with the same key. |
CO-GROUP | Receives two record streams and applies the UDF to the records from both streams that share a key. |
CROSS | Receives two record streams and applies the UDF to each element of their Cartesian product. |
JOIN | Receives two record streams and applies the UDF to each pair of records produced by an equi-join on the keys. |
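The semantics in the table can be illustrated with a minimal single-machine sketch in plain Python. This is not the Stratosphere API: the function names (`map_op`, `reduce_op`, and so on), the UDF signatures, and the sample data are hypothetical, and the choice to call the CO-GROUP UDF for keys present on only one side is an assumption of this sketch.

```python
from collections import defaultdict
from itertools import product


def group_by_key(records, key):
    """Group records by key(record), keeping keys in first-seen order."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return groups


def map_op(records, udf):
    # MAP: call the UDF once per record.
    return [udf(r) for r in records]


def reduce_op(records, key, udf):
    # REDUCE: call the UDF once per group of records sharing a key.
    return [udf(k, group) for k, group in group_by_key(records, key).items()]


def cogroup_op(left, right, key, udf):
    # CO-GROUP: call the UDF once per key with the groups from both inputs.
    # Assumption of this sketch: a key on one side only gets an empty group.
    lg, rg = group_by_key(left, key), group_by_key(right, key)
    keys = list(lg) + [k for k in rg if k not in lg]
    return [udf(k, lg.get(k, []), rg.get(k, [])) for k in keys]


def cross_op(left, right, udf):
    # CROSS: call the UDF once per pair in the Cartesian product.
    return [udf(l, r) for l, r in product(left, right)]


def join_op(left, right, key, udf):
    # JOIN: call the UDF once per pair of records with equal keys.
    rg = group_by_key(right, key)
    return [udf(l, r) for l in left for r in rg.get(key(l), [])]


# Hypothetical sample data: (customer, amount) and (customer, city).
orders = [("alice", 30), ("bob", 5), ("alice", 12)]
cities = [("alice", "Berlin"), ("carol", "Potsdam")]
first = lambda record: record[0]

totals = reduce_op(orders, first, lambda k, g: (k, sum(v for _, v in g)))
joined = join_op(orders, cities, first, lambda o, c: (o[0], o[1], c[1]))
sizes = cogroup_op(orders, cities, first, lambda k, l, r: (k, len(l), len(r)))
```

Here `totals` sums the amounts per customer, `joined` pairs each order with the matching city, and `sizes` reports how many records each input contributes per key.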
The runtime contains different primitives to execute these operators. Currently, there are two join algorithms implemented in Stratosphere: Sort-Merge join and Hybrid Hash Join. The REDUCE operator uses an external sort.[3]
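To make the difference between the two join families concrete, the following is a simplified in-memory sketch in plain Python, not Stratosphere's implementation. The hash join shown here is the basic build-and-probe variant; a hybrid hash join additionally partitions the inputs and spills partitions to disk when the build side does not fit in memory, which this sketch omits. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Build a hash table on one input, then probe it with the other."""
    table = defaultdict(list)
    for l in left:                       # build phase
        table[key(l)].append(l)
    out = []
    for r in right:                      # probe phase
        for l in table.get(key(r), []):
            out.append((l, r))
    return out


def sort_merge_join(left, right, key):
    """Sort both inputs by key, then merge runs of equal keys."""
    left, right = sorted(left, key=key), sorted(right, key=key)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and key(left[i_end]) == kl:
                i_end += 1
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            out.extend((l, r) for l in left[i:i_end] for r in right[j:j_end])
            i, j = i_end, j_end
    return out


# Hypothetical sample data keyed on the first tuple field.
left_input = [(2, "b"), (1, "a"), (2, "c")]
right_input = [(2, "x"), (3, "y"), (1, "z"), (2, "w")]
by_first = lambda record: record[0]

hashed = hash_join(left_input, right_input, by_first)
merged = sort_merge_join(left_input, right_input, by_first)
```

Both functions produce the same set of pairs, only in a different order: the hash join follows the order of the probe input, while the sort-merge join emits pairs in key order.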
Iterations
In addition to the five second-order functions, Stratosphere also provides two types of iterations: bulk and incremental.[4][5]
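The difference between the two iteration types can be sketched in plain Python using connected components as an example. This sketch assumes the usual reading of the terms: a bulk iteration recomputes the whole partial solution in every step, while an incremental iteration keeps a solution set and a workset and only revisits elements affected by the last change. The function names, the algorithm choice, and the sample graph are hypothetical and are not taken from Stratosphere's API.

```python
def adjacency(vertices, edges):
    """Undirected adjacency sets for the given vertices and edges."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def bulk_components(vertices, edges):
    """Bulk style: recompute every vertex label in each step."""
    neighbors = adjacency(vertices, edges)
    labels = {v: v for v in vertices}
    steps = 0
    while True:
        steps += 1
        new_labels = {
            v: min([labels[v]] + [labels[n] for n in neighbors[v]])
            for v in vertices
        }
        if new_labels == labels:         # fixpoint reached
            return labels, steps
        labels = new_labels


def incremental_components(vertices, edges):
    """Incremental style: only vertices in the workset propagate labels."""
    neighbors = adjacency(vertices, edges)
    solution = {v: v for v in vertices}  # solution set, updated in place
    workset = set(vertices)              # vertices whose label changed
    while workset:
        next_workset = set()
        for v in workset:
            for n in neighbors[v]:
                if solution[v] < solution[n]:
                    solution[n] = solution[v]
                    next_workset.add(n)
        workset = next_workset
    return solution


# Hypothetical sample graph with two components: {1, 2, 3} and {4, 5}.
vertices = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (4, 5)]

bulk_labels, bulk_steps = bulk_components(vertices, edges)
incremental_labels = incremental_components(vertices, edges)
```

Both variants label each vertex with the smallest vertex id in its component. The incremental variant stops touching a region of the graph as soon as its labels are stable, which is why it tends to do less work when only a small part of the solution changes per step.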
Stratosphere Compiler
In contrast to MapReduce-style frameworks, the pipeline is not fixed to a Map step followed by an optional Reduce step. A Stratosphere job can be an arbitrary directed acyclic graph (DAG) composed of multiple operators, bulk iterations, incremental iterations, data sources and data sinks. This enables optimizations such as choosing a join strategy at runtime based on metadata, available system resources and compiler hints provided by the user.
The result of optimizing a Stratosphere program is a job graph that can be executed on the cluster.
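As a rough illustration of these two ideas, the sketch below models a job as a DAG and picks a join strategy with a toy rule. It is plain Python and does not reflect Stratosphere's actual optimizer: the node names, the `choose_join_strategy` function, its parameters, and the size threshold are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each node maps to the list of nodes it reads from.
# Unlike a fixed Map-then-Reduce pipeline, any acyclic shape is allowed.
job = {
    "orders_source": [],
    "customers_source": [],
    "filter_orders": ["orders_source"],               # a MAP
    "join": ["filter_orders", "customers_source"],    # a JOIN
    "sum_per_city": ["join"],                         # a REDUCE
    "sink": ["sum_per_city"],
}

# One valid execution order: every node appears after all of its inputs.
execution_order = list(TopologicalSorter(job).static_order())


def choose_join_strategy(left_rows, right_rows, memory_rows, hint=None):
    """Toy rule (an assumption of this sketch, not Stratosphere's logic).

    A user hint wins. Otherwise use a hash join when the smaller input
    is estimated to fit in memory, and a sort-merge join when it is not.
    """
    if hint in ("hash", "sort-merge"):
        return hint
    if min(left_rows, right_rows) <= memory_rows:
        return "hash"
    return "sort-merge"


small_side_fits = choose_join_strategy(10_000_000, 50_000, memory_rows=100_000)
nothing_fits = choose_join_strategy(10_000_000, 5_000_000, memory_rows=100_000)
user_override = choose_join_strategy(10, 10, memory_rows=100, hint="sort-merge")
```

A real optimizer would weigh many more factors, such as existing sort orders and data distribution, but the sketch shows how size estimates and user hints can feed into the decision.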
Nephele execution engine
Nephele is the low-level parallel execution engine of Stratosphere. It handles cluster resource allocation as well as in-memory and network communication.[6]
File systems
Stratosphere supports HDFS, HBase, Avro, Amazon S3, JDBC and local file systems as data sources and data sinks for Stratosphere jobs.
References
1. "Stratosphere.eu Website". Retrieved 27 November 2013.
2. "Stratosphere Intro (Java and Scala Interface)".
3. Battré, Dominic; Ewen, Stephan; Hueske, Fabian; Kao, Odej; Markl, Volker; Warneke, Daniel (2010). "Nephele/PACTs". Proceedings of the 1st ACM symposium on Cloud computing. pp. 119–130. doi:10.1145/1807128.1807148. ISBN 9781450300360.
4. Ewen, Stephan; Schelter, Sebastian; Tzoumas, Kostas; Warneke, Daniel; Markl, Volker (2013). "Iterative parallel data processing with stratosphere: an inside look": 1053–1056. doi:10.1145/2463676.2463693.
5. Ewen, Stephan; Tzoumas, Kostas; Kaufmann, Moritz; Markl, Volker (July 2012). "Spinning fast iterative data flows". Proceedings of the VLDB Endowment. 5 (11): 1268–1279. doi:10.14778/2350229.2350245. Retrieved 3 December 2013.
6. Warneke, Daniel; Kao, Odej (2009). "Nephele". Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers. pp. 1–10. doi:10.1145/1646468.1646476. ISBN 9781605587141.