Stable release: 0.13.1 (June 6, 2014)
License: Apache License 2.0
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL with schema on read, and it transparently converts queries to MapReduce jobs, Apache Tez jobs and, in the future, Spark jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it provides indexes, including bitmap indexes.
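The two ideas in this paragraph, schema on read and conversion of a query into map and reduce steps, can be illustrated with a small Python sketch. It is not Hive's implementation: the table layout, the `SCHEMA` list, and all function names are hypothetical, and the sketch assumes that a field which fails to convert becomes `None`. It mimics what a query such as `SELECT country, COUNT(*) FROM users GROUP BY country` might do over a tab-separated text file.

```python
from collections import defaultdict

# Raw file contents as they might sit in HDFS: plain tab-separated text
# with no schema attached to the data itself.
RAW_LINES = [
    "1\talice\tUS",
    "2\tbob\tDE",
    "3\tcarol\tUS",
    "not-a-number\tdave\tFR",
]

# Hypothetical table schema, applied only when the data is read.
SCHEMA = [("user_id", int), ("name", str), ("country", str)]


def parse_row(line):
    """Apply SCHEMA to one raw line; fields that fail to convert become None."""
    row = {}
    for (column, cast), raw in zip(SCHEMA, line.split("\t")):
        try:
            row[column] = cast(raw)
        except ValueError:
            row[column] = None
    return row


def map_phase(lines):
    """Emit (country, 1) for every row, like the map side of a GROUP BY."""
    for line in lines:
        row = parse_row(line)
        yield row["country"], 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_country(lines):
    """Run the three phases in sequence over the raw lines."""
    return reduce_phase(shuffle(map_phase(lines)))
```

Calling `count_by_country(RAW_LINES)` counts the malformed fourth line as well, because only its `user_id` field fails to parse; the data is never validated at load time, which is the essence of schema on read. Real query plans are considerably more elaborate than this three-function pipeline.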
Other features of Hive include:
- Indexing to accelerate queries. Index types include compaction and, as of 0.10, bitmap indexes; more index types are planned.
- Different storage types such as plain text, RCFile, HBase, ORC, and others.
- Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
- Operating on compressed data stored in the Hadoop ecosystem, using algorithms such as gzip, bzip2, and Snappy.
- Built-in user-defined functions (UDFs) to manipulate dates and strings, along with other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
- SQL-like queries (HiveQL), which are implicitly converted into MapReduce jobs.
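The bitmap indexes mentioned in the list above rest on a general technique that a short Python sketch can make concrete. This is an illustration of the idea only, not Hive's index format; the column data and the function names are hypothetical. Each distinct column value gets a bitmap in which bit i is set when row i holds that value, so a conjunction of equality predicates reduces to a bitwise AND.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row_number, value in enumerate(values):
        index[value] |= 1 << row_number
    return dict(index)


def matching_rows(bitmap):
    """Decode a bitmap back into an ascending list of row numbers."""
    rows = []
    row_number = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_number)
        bitmap >>= 1
        row_number += 1
    return rows


# Two hypothetical columns of the same six-row table.
countries = ["US", "DE", "US", "FR", "DE", "US"]
statuses = ["active", "active", "inactive", "active", "inactive", "active"]

country_index = build_bitmap_index(countries)
status_index = build_bitmap_index(statuses)

# WHERE country = 'US' AND status = 'active' becomes a single bitwise AND.
us_active = country_index["US"] & status_index["active"]
print(matching_rows(us_active))  # prints [0, 5]
```

Bitmap indexes of this kind tend to suit columns with few distinct values, since one bitmap is kept per value.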
While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multitable inserts and create table as select, but offers only basic support for indexes. HiveQL also lacks support for transactions and materialized views, and it provides only limited subquery support. There are plans to add support for insert, update, and delete with full ACID functionality.
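The benefit of a multitable insert is that the source is scanned once while several destinations are filled, each with its own filter and column selection. The Python sketch below models that behavior under stated assumptions: the `page_views` rows, the target names, and the `multi_table_insert` helper are all hypothetical and stand in for tables and HiveQL clauses rather than reproducing them.

```python
def multi_table_insert(source_rows, targets):
    """Scan source_rows once and route each row to every target that accepts it.

    targets maps a target name to a (predicate, projection) pair: the predicate
    plays the role of a WHERE clause, the projection that of a SELECT list.
    """
    outputs = {name: [] for name in targets}
    for row in source_rows:
        for name, (predicate, projection) in targets.items():
            if predicate(row):
                outputs[name].append(projection(row))
    return outputs


# Hypothetical source table.
page_views = [
    {"user": "alice", "country": "US", "ms": 120},
    {"user": "bob", "country": "DE", "ms": 950},
    {"user": "carol", "country": "US", "ms": 870},
]

result = multi_table_insert(
    page_views,
    {
        "us_views": (lambda r: r["country"] == "US", lambda r: (r["user"], r["ms"])),
        "slow_views": (lambda r: r["ms"] > 500, lambda r: (r["user"], r["country"])),
    },
)
```

Here the row for `carol` lands in both `us_views` and `slow_views`, yet `page_views` is iterated only once. Running two separate insert statements would read the source twice.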
- Nucleon Database Manager supports Apache Hive.
- Official website
- The Free Hive Book (CC by-nc licensed)
- Hive A Warehousing Solution Over a MapReduce Framework - Original paper presented by Facebook at VLDB 2009
- Using Apache Hive With Amazon Elastic MapReduce (Part 1) and Part 2 on YouTube, presented by an AWS Engineer
- Using hive + cassandra + shark. A hive cassandra cql storage handler.
- Major Technical Advancements in Apache Hive, Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson, Owen O’Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee and Xiaodong Zhang, SIGMOD 2014