
Apache Spark

From Wikipedia, the free encyclopedia


Spark
Developer(s): Apache Software Foundation, UC Berkeley
Stable release: v0.9.0 / February 2, 2014 (2014-02-02)
Repository
Written in: Scala, Java, Python
Operating system: Linux, Mac OS, Windows
Type: data analytics, machine learning algorithms
License: Apache License 2.0
Website: http://spark.incubator.apache.org/

Apache Spark is an open-source[1] data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS).[2] However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.[3] Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.[4]
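
A minimal sketch of this load-cache-query pattern in Scala (SparkContext, textFile, cache, filter and count are part of the public Spark API; the master URL, HDFS path and filter strings are illustrative placeholders, not taken from the article):

    import org.apache.spark.SparkContext

    object CacheExample {
      def main(args: Array[String]) {
        // Connect to a cluster; "local[2]" runs Spark locally with two worker threads.
        val sc = new SparkContext("local[2]", "CacheExample")

        // Load a text file (hypothetical HDFS path) and keep it in cluster memory.
        val logs = sc.textFile("hdfs://namenode/path/to/logs").cache()

        // Repeated queries reuse the cached in-memory dataset instead of re-reading from disk.
        val errors = logs.filter(_.contains("ERROR")).count()
        val warnings = logs.filter(_.contains("WARN")).count()

        println("errors: " + errors + ", warnings: " + warnings)
        sc.stop()
      }
    }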

Spark has been an Apache Incubator project of the Apache Software Foundation since June 2013[5] and has received code contributions from large companies that use Spark, including Yahoo! and Intel,[6] as well as from smaller companies and startups such as Conviva,[7] Quantifind,[8] ClearStory Data,[9] Ooyala[10] and many more.[11] By October 2013, over 90 individual developers, representing 25 different companies, had contributed code to Spark.[12] Prior to joining the Apache Incubator, versions 0.7 and earlier were licensed under the BSD License.[1]

Features

  • Java, Scala, and Python APIs.
  • Proven scalability to 100 nodes in the research lab[13] and 80 nodes in production at Yahoo!.[14]
  • Ability to cache datasets in memory for interactive data analysis: extract a working set, cache it, query it repeatedly.
  • Interactive command line interface (in Scala or Python) for low-latency data exploration at scale.
  • Higher-level library for stream processing, through Spark Streaming (see the sketch after this list).
  • Higher-level libraries for machine learning and graph processing that, because of Spark's distributed memory-based architecture, are ten times as fast as the Hadoop disk-based Apache Mahout and scale better than Vowpal Wabbit.[15]
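
Spark Streaming follows the same programming model as the batch API. The following is a minimal sketch in Scala (StreamingContext and socketTextStream are part of the public Spark Streaming API; the host, port, one-second batch interval and word-count logic are illustrative assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object StreamingSketch {
      def main(args: Array[String]) {
        val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingSketch")
        // Process incoming data in one-second micro-batches.
        val ssc = new StreamingContext(conf, Seconds(1))

        // Hypothetical source: lines of text arriving on a local TCP socket.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Word count over each batch, printed to the driver's console.
        val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }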

External links

  • Spark Homepage
  • Shark - A large-scale data warehouse system for Spark designed to be compatible with Apache Hive. It can execute HiveQL queries up to 100 times faster than Hive without any modification to the existing data or queries.
  • Spark Streaming - A component of Spark that extends core Spark functionality to allow for real-time analysis of streaming data.
  • How companies are using Spark

References

  1. ^ a b "Spark FAQ". apache.org. Apache Software Foundation. Retrieved 10 October 2013.
  2. ^ Figure showing Spark in relation to other open-source Software projects including Hadoop
  3. ^ Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion (June 2013). "Shark: SQL and Rich Analytics at Scale" (PDF).
  4. ^ Matei Zaharia. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications. Invited talk at the NIPS 2011 Big Learning Workshop: Algorithms, Systems, and Tools for Learning at Scale.
  5. ^ a b Email thread archive: [RESULT] [VOTE] Apache Spark for the Incubator
  6. ^ Cade Metz (June 19, 2013). "Spark: Open Source Superstar Rewrites Future of Big Data". wired.com.
  7. ^ Dilip Joseph (December 27, 2011). "Using Spark and Hive to process BigData at Conviva".
  8. ^ Erich Nachbar. Running Spark In Production. Spark use-cases session at AMP Camp One, August 2012, UC Berkeley.
  9. ^ Beyond Hadoop MapReduce: Interactive Analytic Insights Using Spark - Abstract of talk given by ClearStory Data CEO Sharmila Shahani-Mulligan about using Spark
  10. ^ Evan Chan (June 2013). "Fast Spark Queries on In-Memory Datasets".
  11. ^ Spark, Shark, and BDAS In the News
  12. ^ The Growing Spark Community
  13. ^ Zaharia, Matei; Chowdhury, Mosharaf (25 April 2012). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (PDF). USENIX.
  14. ^ Feng, Andy (23 July 2013). "Spark and Hadoop at Yahoo: Brought to you by YARN" (PDF). University of California, Berkeley. Retrieved 11 October 2013.
  15. ^ Sparks, Evan; Talwalkar, Ameet (6 August 2013). "Spark Meetup: MLbase, Distributed Machine Learning with Spark". slideshare.net. Spark User Meetup, San Francisco, California. Retrieved 10 February 2014.