Loading Data in Pig Using HCatalog

June 25, 2015

HCatalog is an extension of Hive. In a nutshell, it exposes the schema information held in the Hive Metastore so that applications outside of Hive can use it.

The objective of HCatalog is to hold the following types of information about the data in HDFS:

Location of the data
Metadata about the data (e.g., its schema)

This means that the scripts and MapReduce jobs that act on the data do not need to worry about the location of the files or the schema of the data. HCatalog exposes open APIs that other tools (like Teradata Aster) can use to benefit from it.
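For contrast, here is a rough sketch of what a Pig script has to spell out when it reads the files directly, without HCatalog. The HDFS path, the delimiter, and the column names (symbol, price, trade_date) are hypothetical and only illustrate the idea. They are not taken from an actual table definition.

```pig
-- Without HCatalog the script itself must know where the files live
-- and how each record is laid out.
-- Hypothetical path, delimiter, and columns, for illustration only.
exch_log = LOAD '/data/exchange_log'
           USING PigStorage(',')
           AS (symbol:chararray, price:double, trade_date:chararray);

-- Print the schema that was declared by hand above.
DESCRIBE exch_log;
```

If the files are moved or a column is added, every script written this way has to be edited. This is the kind of coupling that HCatalog is meant to remove.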

In Pig, we can use HCatalog as follows to load data without specifying its location:

$ pig -useHCatalog
.  .  .
grunt> exch_log = LOAD 'Exchange_Log' USING org.apache.hive.hcatalog.pig.HCatLoader();

To store data in an already created Hive table, we can use a command similar to the following:

grunt> STORE exch_log INTO 'Exchange_Log' USING org.apache.hive.hcatalog.pig.HCatStorer();

The table schema must be defined in Hive before you try to store the data, and the relation being stored should contain only fields that are present in the Hive table.
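As a minimal sketch of that last point, the script below drops a working field before storing, so that the relation carries only fields the target table defines. The column names (symbol, price), the working field is_large, and the target table Exchange_Log_Large are assumptions made for illustration. The target table is assumed to exist in Hive already, with just those two columns.

```pig
-- Run inside a session started with: pig -useHCatalog
exch_log = LOAD 'Exchange_Log' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Hypothetical columns: assumes the Hive table has 'symbol' and 'price'.
-- 'is_large' is a working field that exists only in this script.
flagged = FOREACH exch_log GENERATE symbol, price,
                                    (price > 100.0 ? 1 : 0) AS is_large;
large   = FILTER flagged BY is_large == 1;

-- Project back to the fields the target table defines before storing.
to_store = FOREACH large GENERATE symbol, price;

-- 'Exchange_Log_Large' is a hypothetical table, assumed to be created in Hive
-- beforehand with columns (symbol, price).
STORE to_store INTO 'Exchange_Log_Large' USING org.apache.hive.hcatalog.pig.HCatStorer();
```

Running DESCRIBE on to_store before the STORE is a quick way to compare the relation's fields with the table definition in Hive.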

Note the presence of “hive” in the package name before hcatalog, which reflects the fact that HCatalog is part of a bigger project called Hive :-). HCatalog’s package changed from org.apache.hcatalog (earlier releases) to org.apache.hive.hcatalog (recent releases).

Columbia, MD 21045, USA

Vipul Pathak
