User:Waury/sandbox

Stratosphere
Developer(s): HPI, HU Berlin, TU Berlin
Stable release: 0.4 / January 13, 2014
Preview release: 0.5-SNAPSHOT
Written in: Java, Scala
Operating system: Cross-platform
Type: Distributed computing
License: Apache License 2.0
Website: stratosphere.eu

Stratosphere is an open-source software framework for processing large data sets on clusters. It is developed by the Hasso Plattner Institute, HU Berlin, TU Berlin[1] and individual contributors. It is licensed under the Apache License 2.0.

Architecture

Front ends

Stratosphere provides Java and Scala APIs for writing programs.[2] The internals of Stratosphere are implemented in Java.

Runtime

Operators

Similar to MapReduce or Apache Hadoop, the Stratosphere runtime provides second-order functions called operators (referred to in publications as PACTs, short for PArallelization ContracTs). Each operator processes records by applying a user-defined function (UDF) to them.[3]

Function name: Description
MAP: Receives one record stream and applies the UDF to each record.
REDUCE: Receives one record stream and applies the UDF to all records with the same key.
CO-GROUP: Receives two record streams and applies the UDF to all records with the same key.
CROSS: Receives two record streams and applies the UDF to each element of their Cartesian product.
JOIN: Receives two record streams and applies the UDF to each record of an equi-join on the keys.
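
The semantics of the five operators can be illustrated with a small single-machine sketch in plain Python. This is not the Stratosphere API: the function names (`map_op`, `reduce_op`, `cogroup_op`, `cross_op`, `join_op`), the tuple-based records and the sample data are all assumptions made for illustration, and the real runtime executes these operators in parallel across a cluster.

```python
from collections import defaultdict
from itertools import product


def group_by_key(records, key):
    """Group records into a dict of key -> list of records."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return groups


def map_op(records, udf):
    # MAP: call the UDF once per record.
    return [udf(rec) for rec in records]


def reduce_op(records, key, udf):
    # REDUCE: call the UDF once per key, with all records sharing that key.
    groups = group_by_key(records, key)
    return [udf(k, groups[k]) for k in sorted(groups)]


def cogroup_op(left, right, key, udf):
    # CO-GROUP: call the UDF once per key, with the matching group from each
    # input (either group may be empty).
    lg, rg = group_by_key(left, key), group_by_key(right, key)
    return [udf(k, lg.get(k, []), rg.get(k, [])) for k in sorted(lg.keys() | rg.keys())]


def cross_op(left, right, udf):
    # CROSS: call the UDF once per pair of the Cartesian product.
    return [udf(a, b) for a, b in product(left, right)]


def join_op(left, right, key, udf):
    # JOIN: call the UDF once per pair of records with equal keys (equi-join).
    rg = group_by_key(right, key)
    return [udf(a, b) for a in left for b in rg.get(key(a), [])]


def first_field(record):
    """Hypothetical key selector: the first tuple field is the key."""
    return record[0]


# Hypothetical sample data.
orders = [("alice", 10), ("bob", 5), ("alice", 7)]
cities = [("alice", "Berlin"), ("carol", "Potsdam")]

doubled = map_op(orders, lambda r: (r[0], r[1] * 2))
totals = reduce_op(orders, first_field, lambda k, g: (k, sum(v for _, v in g)))
counts = cogroup_op(orders, cities, first_field, lambda k, a, b: (k, len(a), len(b)))
pairs = cross_op(orders, cities, lambda a, b: (a[0], b[0]))
joined = join_op(orders, cities, first_field, lambda a, b: (a[0], a[1], b[1]))
```

Note the difference between CO-GROUP and JOIN in this sketch: CO-GROUP invokes the UDF for every key that appears in either input, while JOIN invokes it only for pairs of records whose keys match.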

The runtime contains different primitives to execute these operators. Two join algorithms are currently implemented in Stratosphere: sort-merge join and hybrid hash join. The REDUCE operator uses an external sort.[3]
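
The two join strategies can be contrasted with a minimal in-memory sketch. It is written in plain Python and is not taken from the Stratosphere code base; the function names and sample data are hypothetical. The hash join shown here is the simple variant that keeps its whole table in memory, whereas a hybrid hash join additionally partitions its inputs so that partitions which do not fit in memory can be written to disk.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Build a hash table on the left input, then probe it with the right."""
    table = defaultdict(list)
    for rec in left:
        table[key(rec)].append(rec)
    return [(a, b) for b in right for a in table.get(key(b), [])]


def sort_merge_join(left, right, key):
    """Sort both inputs by key, then merge runs of equal keys."""
    left, right = sorted(left, key=key), sorted(right, key=key)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and key(left[i_end]) == kl:
                i_end += 1
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            # Emit every pairing of the two runs.
            out.extend((a, b) for a in left[i:i_end] for b in right[j:j_end])
            i, j = i_end, j_end
    return out


def by_id(record):
    """Hypothetical key selector: the first tuple field is the join key."""
    return record[0]


# Hypothetical sample data.
users = [(2, "b"), (1, "a"), (2, "c")]
visits = [(2, "x"), (3, "y"), (1, "z"), (2, "w")]

hashed = hash_join(users, visits, by_id)
merged = sort_merge_join(users, visits, by_id)
```

Both functions produce the same set of pairs, in a different order. Which one is cheaper depends on factors such as input sizes, available memory and whether an input is already sorted.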

Iterations

In addition to the five second-order functions, Stratosphere provides two types of iterations: bulk and incremental.[4][5]
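
The difference between the two iteration types can be sketched with a connected-components computation in plain Python. This is an illustration under assumptions, not Stratosphere code: the function names and the graph are hypothetical. The bulk variant recomputes the full result in every pass, while the incremental variant keeps a solution set and a workset that contains only the vertices whose label changed.

```python
def neighbours(edges):
    """Build an undirected adjacency map from a list of edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def bulk_iteration(adj, max_iterations=100):
    """Recompute the label of every vertex in every pass."""
    labels = {v: v for v in adj}
    for _ in range(max_iterations):
        new_labels = {
            v: min([labels[v]] + [labels[n] for n in adj[v]]) for v in adj
        }
        if new_labels == labels:  # fixpoint reached
            break
        labels = new_labels
    return labels


def incremental_iteration(adj):
    """Update a solution set, driven by a workset of changed vertices."""
    solution = {v: v for v in adj}
    workset = set(adj)
    while workset:
        next_workset = set()
        for v in workset:
            for n in adj[v]:
                if solution[v] < solution[n]:
                    solution[n] = solution[v]  # update only what changed
                    next_workset.add(n)
        workset = next_workset
    return solution


# Hypothetical graph with two components: {1, 2, 3} and {4, 5}.
graph = neighbours([(1, 2), (2, 3), (4, 5)])
bulk_labels = bulk_iteration(graph)
incremental_labels = incremental_iteration(graph)
```

Both variants label each vertex with the smallest vertex id in its component. In the incremental variant the workset shrinks as parts of the graph converge, so later passes touch less data.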

Stratosphere Compiler

Unlike MapReduce-style frameworks, Stratosphere does not fix the pipeline to a Map step followed by an optional Reduce step. A Stratosphere job can be an arbitrary directed acyclic graph (DAG) composed of multiple operators, bulk iterations, incremental iterations, data sources and data sinks. This enables optimizations such as choosing a join strategy at runtime based on metadata, available system resources and compiler hints provided by the user.
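
As a rough illustration, a job can be modelled as a DAG that maps each operator to the operators it reads from, and a strategy can be picked from size estimates and hints. The sketch below is plain Python, not the Stratosphere compiler: the operator names, the `choose_join_strategy` function and its decision rule are all hypothetical and far simpler than a real cost-based optimizer.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: two sources feed a join, whose result is reduced and written.
# Each entry maps an operator to the set of operators it consumes input from.
job = {
    "source_orders": set(),
    "source_users": set(),
    "map_orders": {"source_orders"},
    "join": {"map_orders", "source_users"},
    "reduce": {"join"},
    "sink": {"reduce"},
}


def choose_join_strategy(left_bytes, right_bytes, memory_bytes, hint=None):
    """Toy rule, not Stratosphere's cost model: honour a user hint, otherwise
    use a hash join when the smaller input fits in memory, else sort-merge."""
    if hint is not None:
        return hint
    if min(left_bytes, right_bytes) <= memory_bytes:
        return "hybrid_hash_join"
    return "sort_merge_join"


# One valid execution order: every operator appears after all of its inputs.
order = list(TopologicalSorter(job).static_order())

# Assumed size estimates, in bytes.
strategy = choose_join_strategy(
    left_bytes=50_000_000, right_bytes=2_000_000, memory_bytes=10_000_000
)
```

With these assumed estimates the smaller input fits in memory, so the toy rule selects the hash-based strategy.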

The compiler turns the optimized Stratosphere program into a job graph that can be executed on the cluster.

Nephele execution engine

Nephele is the low-level parallel execution engine of Stratosphere. It handles cluster resource allocation as well as in-memory and network communication.[6]

File systems

Stratosphere supports HDFS, HBase, Avro, Amazon S3, JDBC and local file systems as data sources and data sinks for Stratosphere jobs.

References

  1. "Stratosphere.eu Website". Retrieved 27 November 2013.
  2. "Stratosphere Intro (Java and Scala Interface)".
  3. Battré, Dominic; Ewen, Stephan; Hueske, Fabian; Kao, Odej; Markl, Volker; Warneke, Daniel (2010). "Nephele/PACTs". Proceedings of the 1st ACM symposium on Cloud computing. pp. 119–130. doi:10.1145/1807128.1807148. ISBN 9781450300360.
  4. Ewen, Stephan; Schelter, Sebastian; Tzoumas, Kostas; Warneke, Daniel; Markl, Volker (2013). "Iterative parallel data processing with stratosphere: an inside look". pp. 1053–1056. doi:10.1145/2463676.2463693.
  5. Ewen, Stephan; Tzoumas, Kostas; Kaufmann, Moritz; Markl, Volker (July 2012). "Spinning fast iterative data flows". Proceedings of the VLDB Endowment. 5 (11): 1268–1279. doi:10.14778/2350229.2350245. Retrieved 3 December 2013.
  6. Warneke, Daniel; Kao, Odej (2009). "Nephele". Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers. pp. 1–10. doi:10.1145/1646468.1646476. ISBN 9781605587141.

Category:Free software programmed in Java Category:Cloud computing Category:Cloud infrastructure Category:Free software for cloud computing