H2O (software)

H2O
Original author(s): 0xdata
Developer(s): H2O.ai
Stable release: Shannon (version 3.0.0.12) / May 25, 2015
Preview release: Bleeding Edge (version 3.1.0.3028) / May 31, 2015
Development status: Active
Written in: Java, Python, and R[1][2][3]
Operating system: Linux, Mac OS, and Microsoft Windows
Platform: Apache Hadoop Distributed File System; Amazon EC2, Google Compute Engine, and Microsoft Azure
Available in: English
Type: Big data analytics, machine learning, statistical learning theory[4]
License: Open source
Alexa rank: 606,809[5]
Website: 0xdata.com
Standard(s): Databricks certified on Spark[3]
As of 1 June 2015

H2O is open-source software for big-data analysis. It is produced by the start-up H2O.ai (formerly 0xdata), which launched in 2011 in Silicon Valley. The speed and flexibility of H2O allow users to fit hundreds or thousands of potential models as part of discovering patterns in data. With H2O, users can throw models at data to find usable information. Using H2O, Cisco estimates 20,000 models each month of its customers' propensities to buy, while Google fits different models for each client according to the time of day.[6]

H2O's mathematical core is developed under the leadership of Arno Candel. After H2O was rated as the best "open-source Java machine learning project" by GitHub's programming members, Candel was named to the first class of "Big Data All Stars" by Fortune in 2014.[7] The firm's scientific advisors are experts on statistical learning theory and mathematical optimization.

The H2O software can be called from the statistical package R and other environments. It is used for exploring and analyzing datasets held in cloud computing systems and in the Apache Hadoop Distributed File System, as well as on the conventional operating systems Linux, Mac OS, and Microsoft Windows. The H2O software is written in Java, Python, and R. Its graphical user interface is compatible with four popular browsers: Chrome, Safari, Firefox, and Internet Explorer.

H2O

The H2O project aims to develop an analytical interface for cloud computing, providing users with intuitive tools for data analysis.[1]

Leadership

Figure: Cliff Click (left) and SriSatish Ambati (right), the co-founders of H2O.ai, speak at an event for H2O.ai (0xdata). (Photograph by H2O.ai released under Creative Commons BY 2.0 license.[1])

H2O's chief executive, SriSatish Ambati, had helped to start Platfora, a big-data firm that develops software for the Apache Hadoop distributed file system.[8] Ambati was frustrated with the performance of the R programming language on large data sets and started the development of H2O software with encouragement from John Chambers,[2] who created the S programming language at Bell Labs and who is a member of R's core team (which leads the development of R).[2][9][10]

Ambati co-founded 0xdata with Cliff Click, who serves as the chief technical officer of H2O. Click helped to write the HotSpot Server Compiler and worked with Azul Systems to construct a big-data Java virtual machine (JVM).[11]

Mathematical leadership is provided by Dr. Arno Candel, who has the title "physicist and hacker". Candel was a founding engineer at Skytree, where he implemented methods for machine learning, before he developed the mathematical core of H2O. After H2O was rated as the best "open-source Java machine learning project" by GitHub's programming members, Candel (with 19 others) was named to the first class of "Big Data All Stars" by Fortune.[7]

Scientific advisory council

Stanford University professor Trevor J. Hastie serves as an advisor to H2O.ai.

H2O's Scientific Advisory Council lists three mathematical scientists, all of whom are professors at Stanford University.[12] Stephen P. Boyd is an expert in convex minimization and its applications in statistics and electrical engineering.[13] Robert Tibshirani, a collaborator with Bradley Efron on bootstrapping,[14] is an expert on generalized additive models and statistical learning theory.[15][16] Trevor Hastie, a collaborator of John Chambers on S,[10] is an expert on generalized additive models and statistical learning theory.[15][16]

H2O.ai: A Silicon Valley start-up

The software is open-source and freely distributed. The company receives fees for providing customer service and customized extensions. In November 2014, its twenty clients included Cisco, eBay, Nielsen, and PayPal, according to VentureBeat.[2] The speed and flexibility of H2O allow users to fit hundreds or thousands of potential models as part of discovering patterns in data. With H2O, users can throw models at data to find usable information, according to Tye Rattenbury at Trifacta. Using H2O, Cisco estimates 20,000 models each month of its customers' propensities to buy, while Google fits different models for each client according to the time of day.[6]

Mining of big data

Big datasets are too large to be analyzed using traditional software like R. The H2O software provides data structures and methods suitable for big data.

H2O allows users to analyze and visualize whole sets of data without using the Procrustean strategy of studying only a small subset with a conventional statistical package.[2] H2O's statistical repertoire includes generalized linear models and K-means clustering.[17]
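As an illustration of one method named in this section, the following is a minimal sketch of K-means clustering (Lloyd's algorithm) in plain Python. It is not H2O's implementation or API: the function name `kmeans`, its parameters, the choice of the first k points as starting centers, and the toy data are all assumptions made for this example.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm on a list of equal-length numeric tuples (illustrative only)."""
    # Deterministic start for this example: use the first k points as centers.
    centers = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers


# Toy data (hypothetical): two well-separated groups in the plane.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers = sorted(kmeans(data, k=2))
```

On this toy data the two centers settle at the means of the two groups. A production system would typically use a more careful initialization and a convergence test rather than a fixed number of iterations.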

Iterative methods for real-time problems

H2O uses iterative methods that provide quick answers using all of the client's data. When a client cannot wait for an optimal solution, the client can interrupt the computations and use an approximate solution.[1]
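The idea of interrupting an iterative computation and keeping the current approximation can be sketched as follows. This is only an illustration under stated assumptions, not H2O's code: it uses plain gradient descent on a least-squares line fit, and the names `gradient_descent` and `estimate_after`, the step size, and the toy data are hypothetical.

```python
import itertools


def gradient_descent(xs, ys, step=0.05):
    """Yield successively refined (slope, intercept) estimates for y = slope*x + intercept."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    while True:
        # Gradient of the mean squared error, computed over all data points.
        g_slope = sum(2 * (slope * x + intercept - y) * x for x, y in zip(xs, ys)) / n
        g_intercept = sum(2 * (slope * x + intercept - y) for x, y in zip(xs, ys)) / n
        slope -= step * g_slope
        intercept -= step * g_intercept
        yield slope, intercept


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # toy data lying exactly on y = 2x + 1


def estimate_after(budget):
    """Stop the iteration after `budget` steps and return the current estimate."""
    return next(itertools.islice(gradient_descent(xs, ys), budget - 1, None))


early = estimate_after(20)    # interrupted early: an approximate answer
late = estimate_after(2000)   # allowed to run longer: close to (2, 1)
```

Because every step uses the full data set, an estimate taken at any point reflects all of the data; stopping early trades accuracy for time.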

In its approach to deep learning,[2][17][18] H2O divides all the data into subsets and then analyzes each subset simultaneously using the same method. These processes are combined to estimate parameters by using the Hogwild scheme,[19] a parallel stochastic gradient method.[20] These methods allow H2O to provide answers that use all the client's data, rather than throwing away most of it and analyzing a subset with conventional software.
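The lock-free update pattern behind Hogwild can be sketched with Python threads: the data is split into shards, each worker runs stochastic gradient descent on its own shard, and all workers write to one shared weight vector without any locking. This is a toy illustration, not H2O's deep-learning code. The linear model, the names `make_data`, `sgd_worker` and `hogwild_fit`, the step size, and the noise-free synthetic data are assumptions; in standard CPython the threads are interleaved rather than truly parallel, so the sketch shows only the structure of the scheme.

```python
import random
import threading

TRUE_WEIGHTS = [2.0, -3.0, 0.5]  # hypothetical target used to build the toy data


def make_data(n, seed):
    """Generate n rows (x, y) with y = TRUE_WEIGHTS . x and no noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in TRUE_WEIGHTS]
        y = sum(w * xi for w, xi in zip(TRUE_WEIGHTS, x))
        rows.append((x, y))
    return rows


def sgd_worker(shard, weights, step, epochs):
    """Run SGD over one shard, updating the shared weights with no locking."""
    for _ in range(epochs):
        for x, y in shard:
            error = sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xj in enumerate(x):
                weights[j] -= step * error * xj  # unsynchronized write


def hogwild_fit(rows, n_workers=4, step=0.1, epochs=5):
    weights = [0.0] * len(TRUE_WEIGHTS)  # one vector shared by every thread
    shards = [rows[i::n_workers] for i in range(n_workers)]  # split data into subsets
    threads = [
        threading.Thread(target=sgd_worker, args=(shard, weights, step, epochs))
        for shard in shards
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return weights


fitted = hogwild_fit(make_data(1000, seed=1))
```

Workers may occasionally overwrite one another's updates; the scheme accepts such collisions in exchange for avoiding synchronization, and on this toy problem the fitted weights still end up close to `TRUE_WEIGHTS`.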

Software

Programming languages

The H2O software is written in three programming languages: Java (6 or later), Python (2.7.x), and R (3.0.0 or later).[2][3]

Operating systems

The H2O software can be run on conventional operating systems: Microsoft Windows (7 or later), Mac OS X (10.9 or later), and Linux (Ubuntu 12.04; RHEL/CentOS 6 or later).[3] It also runs on big-data systems, particularly the Apache Hadoop Distributed File System (HDFS), in several popular distributions: Cloudera (5.1 or later), MapR (3.0 or later), and Hortonworks (HDP 2.1 or later). It also operates in cloud computing environments, for example Amazon EC2, Google Compute Engine, and Microsoft Azure. The H2O Sparkling Water software is Databricks-certified on Apache Spark.[3]

Graphical user interface and browsers

Its graphical user interface is compatible with four browsers (unless specified, in their latest versions as of 1 June 2015): Chrome, Safari, Firefox, and Internet Explorer (IE10).[3]

Notes

  1. Harris (2012)
  2. Novet (2014)
  3. "Recommended systems for H2O" (HTML). 0xdata.com. H2O.ai. May 2015.
  4. Hardy (2014)
  5. "How popular is 0xdata.com?". Alexa.com. 1 June 2015.
  6. Woodie, Alex (9 February 2015). "The Rise of Predictive Modeling Factories". Datanami: Big data, big analytics, big insights. Retrieved 2 June 2015.
  7. Hackett (2014)
  8. Gage (2013)
  9. ACM honors Dr. John M. Chambers of Bell Labs with the 1998 ACM Software System Award for creating "S System" software, ACM press release, March 29, 1999. Accessed 8 December 2008.
  10. J. Chambers and T. Hastie, Statistical Models in S, Wadsworth/Brooks Cole, 1991.
  11. Schuster, Werner (10 January 2014). "Cliff Click on in-memory processing, 0xdata H20, efficient low latency Java and GCs" (HTML). InfoQ. Retrieved 2 June 2015.
  12. "About". 0xdata. 2015.
  13. Boyd, Stephen P.; Vandenberghe, Lieven (2004). Convex optimization (HTML). Cambridge University Press. ISBN 978-0-521-83378-3. Retrieved October 15, 2011. (Free download of PDF of corrected 7th printing, 2009)
  14. Efron, Bradley; Tibshirani, Robert (1994). An Introduction to the Bootstrap. Chapman & Hall/CRC. ISBN 978-0-412-04231-7.
  15. Hastie, T. J.; Tibshirani, R. J. (1990). Generalized additive models. Chapman & Hall/CRC. ISBN 978-0-412-34390-2.
  16. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2011). The Elements of Statistical Learning (HTML) (second ed.). Retrieved 15 June 2012. (Free download of 10th printing, June 2013)
  17. Aiello, Spencer; Kraljevic, Tom; Maj, Petr; with contributions from the 0xdata team (2015). "h2o: R Interface for H2O". The Comprehensive R Archive Network (CRAN), Contributed Packages (The R Project for Statistical Computing) (3.0.0.12).
  18. Tripathi, Rashmi; Kumari, Vandana; Patel, Sunil; Singh, Yashbir; Varadwaj, Pritish (2015). "Prediction of lncRNA using Deep Learning Approach". International Conference on Advances in Biotechnology (BioTech). Proceedings: 138-142. Singapore: Global Science and Technology Forum.
  19. Description of the iterative method for computing maximum-likelihood estimates for a generalized linear model.
  20. Recht, Benjamin; Re, Christopher; Wright, Stephen; Niu, Feng (2011). Shawe-Taylor, J.; Zemel, R.S.; Bartlett, P.L.; Pereira, F.; Weinberger, K.Q., eds. "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent" (PDF). Advances in Neural Information Processing Systems (Curran Associates, Inc.) 24: 693–701.

Coordinates: 37°25′07″N 122°05′44″W (37.418687°N 122.095642°W)