Cascading (software): Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 09:27, 11 January 2016

Cascading
Stable release: 3.0
Written in: Java
License: Apache License
Website: http://www.cascading.org/

Cascading is a software abstraction layer for Apache Hadoop. It is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs. Cascading is open source and available under the Apache License. Commercial support is available from Concurrent, Inc.[1]

Cascading was originally authored by Chris Wensel, who later founded Concurrent, Inc.[2] Cascading is being actively developed by the community[citation needed] and a number of add-on modules are available.[3]

Architecture

To use Cascading, Apache Hadoop must also be installed, and the Hadoop job .jar must contain the Cascading .jars. Cascading consists of a data processing API, integration API, process planner and process scheduler.

Cascading leverages the scalability of Hadoop but abstracts standard data processing operations away from the underlying map and reduce tasks.[4][better source needed] Developers use Cascading to create a .jar file that describes the required processes. It follows a ‘source-pipe-sink’ paradigm: data is captured from sources, passes through reusable ‘pipes’ that perform data analysis processes, and the results are stored in output files or ‘sinks’. Pipes are created independently of the data they will process. Once a pipe is tied to data sources and sinks, the combination is called a ‘flow’. Flows can be grouped into a ‘cascade’, in which the process scheduler ensures that a given flow does not execute until all of its dependencies are satisfied. Pipes and flows can be reused and reordered to support different business needs.[5]
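
The source-pipe-sink idea can be illustrated with a minimal plain-Java sketch. This is not the Cascading API: the names `SourcePipeSinkSketch`, `Pipe`, `Flow` and `runCascade` are hypothetical, data is held in memory rather than on a Hadoop cluster, and a flow's dependency is assumed to be satisfied as soon as its source exists. It only shows pipes defined independently of data, flows binding a pipe to a source and a sink, and a cascade that delays a flow until the flow producing its input has run.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative model only: none of these types belong to the Cascading library.
final class SourcePipeSinkSketch {

    /** A pipe is a reusable transformation, defined without reference to any data. */
    interface Pipe extends Function<List<String>, List<String>> {}

    /** A flow is a pipe tied to a named source and a named sink. */
    static final class Flow {
        final String source;
        final Pipe pipe;
        final String sink;

        Flow(String source, Pipe pipe, String sink) {
            this.source = source;
            this.pipe = pipe;
            this.sink = sink;
        }
    }

    /** Keeps only the records that contain the text "ERROR". */
    static final Pipe ERRORS_ONLY = lines ->
            lines.stream().filter(l -> l.contains("ERROR")).collect(Collectors.toList());

    /** Replaces the records with a single record holding their count. */
    static final Pipe COUNT = lines -> Arrays.asList(String.valueOf(lines.size()));

    /**
     * Runs a cascade of flows against an in-memory store (name -> records).
     * A flow runs only once its source exists, either because it was supplied
     * as input or because an earlier flow wrote it as a sink.
     */
    static Map<String, List<String>> runCascade(List<Flow> flows,
                                                Map<String, List<String>> store) {
        Map<String, List<String>> data = new LinkedHashMap<>(store);
        List<Flow> pending = new ArrayList<>(flows);
        while (!pending.isEmpty()) {
            Flow ready = null;
            for (Flow f : pending) {
                if (data.containsKey(f.source)) {
                    ready = f;
                    break;
                }
            }
            if (ready == null) {
                throw new IllegalStateException("A flow has an unsatisfied dependency");
            }
            data.put(ready.sink, ready.pipe.apply(data.get(ready.source)));
            pending.remove(ready);
        }
        return data;
    }

    public static void main(String[] args) {
        Map<String, List<String>> store = new LinkedHashMap<>();
        store.put("raw", Arrays.asList("INFO start", "ERROR disk", "ERROR net", "INFO stop"));

        // Listed out of order on purpose: the counting flow reads the sink of the filtering flow.
        List<Flow> cascade = Arrays.asList(
                new Flow("errors", COUNT, "errorCount"),
                new Flow("raw", ERRORS_ONLY, "errors"));

        Map<String, List<String>> result = runCascade(cascade, store);
        System.out.println(result.get("errors"));     // prints [ERROR disk, ERROR net]
        System.out.println(result.get("errorCount")); // prints [2]
    }
}
```

Because `ERRORS_ONLY` and `COUNT` carry no reference to a particular data set, the same pipes could be bound to other sources and sinks to form different flows, which is the kind of reuse described above.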

Developers write the code in a JVM-based language and do not need to learn MapReduce. The resulting program can be regression tested and integrated with external applications like any other Java application.[6]

Cascading is most often used for ad targeting, log file analysis, bioinformatics, machine learning, predictive analytics, web content mining, and extract, transform and load (ETL) applications.[7]

Uses of Cascading

Cascading was cited by SD Times in 2011 as one of the top five most powerful Hadoop projects,[8][unreliable source?] has been described as a major open source project relevant to bioinformatics,[9][unreliable source?] and is covered in Hadoop: The Definitive Guide by Tom White.[10] The project is also widely cited in presentations, conference proceedings and Hadoop user group meetings as a useful tool for working with Hadoop.[11][12][13][14]

  • MultiTool on Amazon Web Services was developed using Cascading.[15]
  • LogAnalyzer for Amazon CloudFront was developed using Cascading.[16]
  • BackType[17] - social analytics platform
  • Etsy[18] - marketplace
  • FlightCaster[19] - predicting flight delays
  • Ion Flux[20] - analyzing DNA sequence data
  • RapLeaf[21] - personalization and recommendation systems
  • Razorfish[22] - digital advertising

Other users are listed on the cascading.org site.

Domain-Specific Languages Built on Cascading

  • PyCascading[23] - by Twitter, available on GitHub
  • Cascading.jruby[24] - developed by Gregoire Marabout, available on GitHub
  • Cascalog[25] - authored by Nathan Marz, available on GitHub
  • Scalding[26] - by Twitter, available on GitHub

References

  1. ^ Cascading support page
  2. ^ Concurrent, Inc.
  3. ^ Cascading modules
  4. ^ Blog post by Etsy describing their use of Cascading with Hadoop
  5. ^ Cascading User Guide
  6. ^ Concurrent product page
  7. ^ Concurrent home page
  8. ^ Handy, Alex (1 June 2011). "The top five most powerful Hadoop projects". SD Times. Retrieved 26 October 2013.
  9. ^ Taylor, Ronald (21 December 2010). "An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics". BioMed Central. Springer Science+Business Media. Retrieved 26 October 2013.
  10. ^ White, Tom (2010). Hadoop: The Definitive Guide. O’Reilly Media, Inc. pp. 539–549.
  11. ^ Nathan, Paco (19 July 2010). "Getting Started on Hadoop". Presentation for the SV Cloud Computing Meetup.
  12. ^ Guijarro, Julio; Loughran, Steve; Castagna, Paolo (2008). "Hadoop and beyond". HP Labs, Bristol, UK.
  13. ^ Cross, Bradford (26 March 2010). "Flightcaster_HUG". Presentation at the Bay Area Hadoop Users’ Group.
  14. ^ Curtin, Christopher (June 2010). "NoSQL, Hadoop and Cascading".
  15. ^ Cascading.Multitool on AWS
  16. ^ LogAnalyzer for Amazon CloudFront
  17. ^ BackType blog
  18. ^ Blog post by Etsy describing their use of Cascading with Hadoop
  19. ^ FlightCaster
  20. ^ Ion Flux
  21. ^ RapLeaf Blog (archived 1 February 2011 at the Wayback Machine)
  22. ^ Razorfish
  23. ^ [1]
  24. ^ Cascading.jruby
  25. ^ Cascalog
  26. ^ Scalding