Apache Hive From 5000 Feet


Apache Hive is an abstraction on top of data stored in HDFS that allows querying the data with a familiar SQL-like language called HiveQL (Hive Query Language). Hive was developed at Facebook to give data analysts this SQL-like access to their data. Hive has a limited set of commands and is similar to basic SQL (advanced SQL options are not supported), but it provides nice functionality on top of this limited syntax. Hive is intended to provide data-warehouse-like functionality on top of Hadoop and helps define structure for unstructured big data. It provides features for querying analytical data using HiveQL.
Hive is not suitable for OLTP (since it is based on Hadoop and its queries have high latency), but it is good for running analytical batch jobs on high-volume data that is historical or immutable. A typical use case includes copying unstructured data to HDFS, running multiple cycles of Map/Reduce to process the data and store (semi-)structured data files on HDFS, and then using Hive to define and apply the structure. Once the structure is defined, data analysts point to these data files (or load data from them) and run various types of queries to analyze the data and find hidden statistical patterns in it 🙂
Hive closely resembles SQL and, at a high level, supports the following types of activities:
  1. Creation and Removal of Databases
  2. Creation, Alteration and Removal of Managed/External Tables within (or outside of) these databases
  3. Insertion/Loading of data into these Tables
  4. Selection of data using HiveQL, including the capability of performing Join operations
  5. Use of aggregation functions in Select queries, which most of the time trigger Map/Reduce jobs in the background
  6. Structured output on query (i.e. Schema on Read): the structure is applied when data is retrieved, not when it is loaded
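
To make the Schema on Read idea in the last item a bit more concrete, here is a minimal sketch in plain Python. It is not Hive code and does not use any Hive API; the sample data, the `SCHEMA` list and the function names are all made up for illustration. The raw text is stored untouched, and column names and types are applied only when the data is read. Values that do not fit the declared type become `None`, which roughly mirrors how Hive returns NULL for fields it cannot parse in a text-backed table.

```python
import csv
import io

# Raw, tab-separated text as it might sit in a file on HDFS.
# Nothing validates it when it is written (hypothetical sample data).
RAW_DATA = (
    "alice\t2021-01-03\t3\n"
    "bob\t2021-01-04\tnot_a_number\n"
    "carol\t2021-01-04\t5\n"
)

# Hypothetical schema, declared separately from the data,
# in the spirit of a table definition laid over existing files.
SCHEMA = [("user", str), ("day", str), ("clicks", int)]


def read_with_schema(raw_text, schema):
    """Apply the schema while reading; values that do not parse become None."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text), delimiter="\t"):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                row[name] = None  # stand-in for a NULL column value
        rows.append(row)
    return rows


def total_clicks_per_day(rows):
    """A small aggregation, comparable to SUM(clicks) ... GROUP BY day."""
    totals = {}
    for row in rows:
        if row["clicks"] is not None:
            totals[row["day"]] = totals.get(row["day"], 0) + row["clicks"]
    return totals


rows = read_with_schema(RAW_DATA, SCHEMA)
print(total_clicks_per_day(rows))
```

Note that the malformed row is not rejected when the data is stored. It only shows up as a missing value at query time, which is the practical trade-off of applying the schema on read instead of on write.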