
Apache Spark

From Wikipedia, the free encyclopedia


Spark
Developer(s): Apache Software Foundation, UC Berkeley
Stable release: v0.9.0 / February 2, 2014 (2014-02-02)
Repository
Written in: Scala, Java, Python
Operating system: Linux, Mac OS, Windows
Type: data analytics, machine learning algorithms
License: Apache License 2.0
Website: http://spark.incubator.apache.org/

Apache Spark is an open-source[1] data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS).[2] However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.[3] Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.[4]
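
A minimal sketch of this load-cache-query pattern in Scala (SparkContext, textFile, cache, filter and count are part of the public Spark API; the master URL, HDFS path and filter strings are illustrative placeholders, not taken from the article):

    import org.apache.spark.SparkContext

    object CacheExample {
      def main(args: Array[String]) {
        // Connect to a cluster; "local[2]" runs Spark locally with two worker threads.
        val sc = new SparkContext("local[2]", "CacheExample")

        // Load a text file (hypothetical HDFS path) and keep it in cluster memory.
        val logs = sc.textFile("hdfs://namenode/path/to/logs").cache()

        // Repeated queries reuse the cached in-memory dataset instead of re-reading from disk.
        val errors = logs.filter(_.contains("ERROR")).count()
        val warnings = logs.filter(_.contains("WARN")).count()

        println("errors: " + errors + ", warnings: " + warnings)
        sc.stop()
      }
    }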

Spark has been an Apache Incubator project of the Apache Software Foundation since June 2013[5] and has received code contributions from large companies that use Spark, including Yahoo! and Intel,[6] as well as from smaller companies and startups such as Conviva,[7] Quantifind,[8] ClearStory Data,[9] Ooyala[10] and many more.[11] By October 2013, over 90 individual developers, representing 25 different companies, had contributed code to Spark.[12] Prior to joining the Apache Incubator, versions 0.7 and earlier were licensed under the BSD License.[1]

Features

  • Java, Scala, and Python APIs.
  • Proven scalability to 100 nodes in the research lab[13] and 80 nodes in production at Yahoo!.[14]
  • Ability to cache datasets in memory for interactive data analysis: extract a working set, cache it, query it repeatedly.
  • Interactive command line interface (in Scala or Python) for low-latency data exploration at scale.
  • Higher-level library for stream processing, through Spark Streaming (see the sketch after this list).
  • Higher-level libraries for machine learning and graph processing that, because of Spark's distributed memory-based architecture, are ten times as fast as the Hadoop disk-based Apache Mahout and scale better than Vowpal Wabbit.[15]
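
Spark Streaming follows the same programming model as the batch API. The following is a minimal sketch in Scala (StreamingContext and socketTextStream are part of the public Spark Streaming API; the host, port, one-second batch interval and word-count logic are illustrative assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object StreamingSketch {
      def main(args: Array[String]) {
        val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")
        // Process incoming data in one-second micro-batches.
        val ssc = new StreamingContext(conf, Seconds(1))

        // Hypothetical source: lines of text arriving on a local TCP socket.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Word count over each batch, printed to the driver's console.
        val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }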

External links

  • Spark Homepage
  • Shark - A large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data or queries.
  • Spark Streaming - A component of Spark that extends core Spark functionality to allow for real-time analysis of streaming data.
  • How companies are using Spark

References

  1. ^ a b "Spark FAQ". apache.org. Apache Software Foundation. Retrieved 10 October 2013.
  2. ^ Figure showing Spark in relation to other open-source Software projects including Hadoop
  3. ^ Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion (June 2013). "Shark: SQL and Rich Analytics at Scale" (PDF).
  4. ^ Matei Zaharia. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications. Invited talk at the NIPS 2011 Big Learning Workshop: Algorithms, Systems, and Tools for Learning at Scale.
  5. ^ a b Email thread archive: [RESULT] [VOTE] Apache Spark for the Incubator
  6. ^ Cade Metz (June 19, 2013). "Spark: Open Source Superstar Rewrites Future of Big Data". wired.com.
  7. ^ Dilip Joseph (December 27, 2011). "Using Spark and Hive to process BigData at Conviva".
  8. ^ Erich Nachbar. Running Spark In Production. Spark use-cases session at AMP Camp One, August 2012, UC Berkeley.
  9. ^ Beyond Hadoop MapReduce: Interactive Analytic Insights Using Spark - Abstract of talk given by ClearStory Data CEO Sharmila Shahani-Mulligan about using Spark
  10. ^ Evan Chan (June 2013). "Fast Spark Queries on In-Memory Datasets".
  11. ^ Spark, Shark, and BDAS In the News
  12. ^ The Growing Spark Community
  13. ^ Zaharia, Matei; Chowdhury, Mosharaf (25 April 2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (PDF). USENIX.
  14. ^ Feng, Andy (23 July 2013). "Spark and Hadoop at Yahoo: Brought to you by YARN" (PDF). University of California, Berkeley. Retrieved 11 October 2013.
  15. ^ Sparks, Evan; Talwalkar, Ameet (6 August 2013). "Spark Meetup: MLbase, Distributed Machine Learning with Spark". slideshare.net. Spark User Meetup, San Francisco, California. Retrieved 10 February 2014.