Apache Hive

Developer(s)	Contributors
Stable release	1.2.1 / June 27, 2015
Repository	git.apache.org/hive.git
Written in	Java
Operating system	Cross-platform
Type	Data warehouse
License	Apache License 2.0
Website	hive.apache.org

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.^[2] While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix.^[3]^[4] Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.^[5]

Features

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL^[6] with schema on read, and transparently converts queries to MapReduce, Apache Tez^[7] and Spark jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it provides indexes, including bitmap indexes.^[8]
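
The two ideas in this paragraph, schema on read and the conversion of a query into map and reduce steps, can be illustrated with a small standalone sketch. It is not Hive code: the sample lines, the SCHEMA list and the function names are all assumptions made for illustration. It applies a schema only when raw lines are read, then computes the equivalent of a SUM grouped by user as a map phase, a shuffle and a reduce phase.

```python
from collections import defaultdict

# Raw lines as they might sit in a file; no schema is stored with them.
RAW_LINES = [
    "alice\t/home\t120",
    "bob\t/cart\t45",
    "alice\t/cart\t30",
    "malformed line",
]

# Hypothetical table schema, applied only when the data is read.
SCHEMA = [("user", str), ("page", str), ("millis", int)]


def read_with_schema(line):
    """Parse one raw line against SCHEMA; fields that do not fit become None."""
    parts = line.split("\t")
    row = {}
    for position, (name, cast) in enumerate(SCHEMA):
        try:
            row[name] = cast(parts[position])
        except (IndexError, ValueError):
            row[name] = None
    return row


def map_phase(rows):
    """Emit (user, millis) pairs, like the map side of a GROUP BY on user."""
    for row in rows:
        if row["user"] is not None and row["millis"] is not None:
            yield row["user"], row["millis"]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


rows = [read_with_schema(line) for line in RAW_LINES]
totals = reduce_phase(shuffle(map_phase(rows)))
print(totals)
```

The malformed line is not rejected when it is stored; it only yields missing fields once the schema is applied at read time, and the aggregation then skips it.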

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.^[9]

Hive supports four file formats: TEXTFILE,^[10] SEQUENCEFILE, ORC^[11] and RCFILE.^[12]^[13]^[14] Apache Parquet can be read via a plugin in versions later than 0.10 and natively starting with 0.13.^[15]^[16]

Other features of Hive include:

Indexing to accelerate queries; index types include compaction and, as of 0.10, bitmap indexes, with more index types planned.
Different storage types such as plain text, RCFile, HBase, ORC, and others.
Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
Operating on compressed data stored in the Hadoop ecosystem using algorithms including DEFLATE, BWT, Snappy, etc.
Built-in user-defined functions (UDFs) for manipulating dates and strings and for other data-mining tasks. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.
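
The bitmap indexes mentioned in the list above rest on a general technique that can be shown in a few lines. The following is a minimal sketch of that technique, not Hive's implementation: the column data and helper names are hypothetical. Each distinct column value gets a bitmap with one bit per row, so a filter on two columns reduces to a bitwise AND.

```python
# Hypothetical column values, one per row, in row order.
COUNTRY = ["US", "DE", "US", "FR", "DE", "US"]
DEVICE = ["web", "web", "app", "app", "web", "web"]


def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def matching_rows(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


country_idx = build_bitmap_index(COUNTRY)
device_idx = build_bitmap_index(DEVICE)

# A filter such as country = 'US' AND device = 'web' becomes one bitwise AND.
hits = country_idx["US"] & device_idx["web"]
print(matching_rows(hits))
```

Bitmaps of this kind tend to suit columns with few distinct values, since one bitmap is kept per value.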

HiveQL

While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multitable inserts and create table as select, but only offers basic support for indexes. Also, HiveQL lacks support for materialized views and offers only limited subquery support.^[17]^[18] Transactions were likewise unsupported until release 0.14, which made insert, update, and delete available with full ACID functionality.^[19]
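
The multitable insert extension lets a single pass over a source feed several destination tables. A rough sketch of that idea, with hypothetical data and function names and no claim to mirror Hive's execution, is:

```python
# Hypothetical source rows standing in for a table.
PAGE_VIEWS = [
    {"user": "alice", "country": "US", "millis": 120},
    {"user": "bob", "country": "DE", "millis": 45},
    {"user": "carol", "country": "US", "millis": 900},
]


def multi_insert(source, targets):
    """Scan source once, appending each row to every target whose predicate accepts it.

    Returns the number of source rows read, to make the single pass visible.
    """
    rows_read = 0
    for row in source:
        rows_read += 1
        for predicate, table in targets:
            if predicate(row):
                table.append(row)
    return rows_read


us_views = []
slow_views = []

rows_read = multi_insert(
    PAGE_VIEWS,
    [
        (lambda r: r["country"] == "US", us_views),
        (lambda r: r["millis"] > 100, slow_views),
    ],
)
print(rows_read, len(us_views), len(slow_views))
```

Running two separate inserts would read the source twice; here each row is read once and routed to every destination it qualifies for.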

Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.^[20]
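
To make the directed acyclic graph concrete, the sketch below orders a small, made-up plan so that each job runs only after the jobs it depends on. The stage names and the PLAN dictionary are assumptions for illustration, and the scheduling uses Python's standard graphlib module (Python 3.9 or later) rather than anything from Hive.

```python
from graphlib import TopologicalSorter

# Hypothetical plan for a query that joins two tables and then aggregates.
# Each key is a stage; its value is the set of stages it depends on.
PLAN = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}


def schedule(plan):
    """Return waves of stages; every stage in a wave has all of its inputs done."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = schedule(PLAN)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

Stages that land in the same wave have no dependency on each other, so in principle they could be submitted concurrently.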

References

[1] Apache Hive Download News
[2] Venner, Jason (2009). Pro Hadoop. Apress. ISBN 978-1-4302-1942-2.
[3] Use Case Study of Hive/Hadoop
[4] OSCON Data 2011, Adrian Cockcroft, "Data Flow at Netflix" on YouTube
[5] Amazon Elastic MapReduce Developer Guide
[6] HiveQL Language Manual
[7] Apache Tez
[8] Working with Students to Improve Indexing in Apache Hive
[9] Lam, Chuck (2010). Hadoop in Action. Manning Publications. ISBN 1-935182-19-6.
[10] Optimising Hadoop and Big Data with Text and Hive
[11] LanguageManual ORC
[12] Faster Big Data on Hadoop with Hive and RCFile
[13] Facebook's Petabyte Scale Data Warehouse using Hive and Hadoop
[14] Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang and Zhiwei Xu. "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems" (PDF).
[15] "Parquet". 18 Dec 2014. Archived from the original on 2 February 2015. Retrieved 2 February 2015.
[16] Massie, Matt (21 August 2013). "A Powerful Big Data Trio: Spark, Parquet and Avro". http://zenfractal.com/. Archived from the original on 2 February 2015. Retrieved 2 February 2015.
[17] White, Tom (2010). Hadoop: The Definitive Guide. O'Reilly Media. ISBN 978-1-4493-8973-4.
[18] Hive Language Manual
[19] ACID and Transactions in Hive
[20] Hive A Warehousing Solution Over a MapReduce Framework

External links

Official website
The Free Hive Book (CC by-nc licensed)
Hive A Warehousing Solution Over a MapReduce Framework - Original paper presented by Facebook at VLDB 2009
Using Apache Hive With Amazon Elastic MapReduce (Part 1) and Part 2 on YouTube, presented by an AWS Engineer
Using hive + cassandra + shark. A hive cassandra cql storage handler.
Major Technical Advancements in Apache Hive, Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson, Owen O’Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee and Xiaodong Zhang, SIGMOD 2014
Apache Hive Wiki

