Loading an HDFS Data File in Pig


emp = LOAD '/path/to/data/file/on/hdfs/Employees.txt' USING PigStorage(' ') AS (
    emp_id: int,
    name: chararray,
    joining_date: datetime,
    department: int,
    salary: float,
    mgr_id: int,
    residence: bag {
        b: (addr1: chararray, addr2: chararray, city: chararray)
    });

The alias for the data in the file "Employees.txt" is emp, and using emp we can refer to individual fields and perform operations such as filtering and grouping. Note that the text file doesn't have named columns (though it is structured); we have defined a schema on top of the file using the LOAD command. Every field name and relation name must start with a letter and can contain only A-Z, a-z, 0-9, and underscore characters. Relation and field names are case sensitive, but in Pig 0.14.0 keywords like LOAD and DUMP are case insensitive.
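As a quick sketch of what "referring to fields via the alias" looks like (the salary threshold here is illustrative, and the field names follow the schema defined above):

```
-- keep only employees above a (hypothetical) salary threshold
high_paid = FILTER emp BY salary > 50000.0F;

-- group the result by department and count each group
by_dept = GROUP high_paid BY department;
dept_counts = FOREACH by_dept GENERATE group AS department, COUNT(high_paid) AS emp_count;

DUMP dept_counts;
```

Each statement defines a new relation from an earlier one, which is the usual Pig Latin dataflow style.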

The parameter to LOAD is actually a resource locator: it can be an HDFS file location, an HBase table, a JDBC connection, or even a web service URL. A load/store function extends Pig's LoadFunc class and encapsulates Hadoop's InputFormat; it gives the LOAD operation a hint about the data and its storage format. TextLoader, PigStorage, and HCatLoader are examples of such functions.
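For instance, loading from an HBase table instead of an HDFS file only changes the locator and the load function. This is a sketch: the table name `employees` and column family `info` are hypothetical, and HBase must be configured for the cluster:

```
-- hypothetical HBase table 'employees' with a column family 'info';
-- '-loadKey true' also emits the row key as the first field
emp_hbase = LOAD 'hbase://employees'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:name info:salary', '-loadKey true')
    AS (row_key: bytearray, name: chararray, salary: chararray);
```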

If we LOAD the data into a relation using TextLoader(), no structure is defined and each line is loaded as plain text (unstructured). On the other hand, if we use PigStorage(), we can load the data in a structured form and even define the schema using the AS clause.
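A minimal sketch contrasting the two loaders on the same file (the path is the illustrative one from above, and the three-field schema is a simplified assumption):

```
-- TextLoader: each input line becomes a single chararray field
raw = LOAD '/path/to/data/file/on/hdfs/Employees.txt'
    USING TextLoader() AS (line: chararray);

-- PigStorage: each line is split on the delimiter, and a schema applies
structured = LOAD '/path/to/data/file/on/hdfs/Employees.txt'
    USING PigStorage(' ')
    AS (emp_id: int, name: chararray, salary: float);
```

With TextLoader() any parsing must be done later (e.g. with STRSPLIT), whereas PigStorage() hands you typed fields up front.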
