Pig

  • The EMP and DEPT tables are popular among Oracle users because they are handy for quickly trying out new queries. Oracle also has a DUAL table that is useful for evaluating expressions, for example “SELECT (SYSDATE + 1/24) AS OneHourFromNow FROM DUAL”. These tables don’t exist in Hive, but we can create …

    Read more →

  • Loading an HDFS Data File in Pig

    emp = LOAD '/path/to/data/file/on/hdfs/Employees.txt' [ USING PigStorage(' ') ] AS (
        emp_id: INT,
        name: CHARARRAY,
        joining_date: DATETIME,
        department: INT,
        salary: FLOAT,
        mgr_id: INT,
        residence: BAG { b: (addr1: CHARARRAY, addr2: CHARARRAY, city: CHARARRAY) }
    );

    The alias for the data in the file “Employees.txt” is emp, and using emp, …

    Read more →

  • Writing EvalFunc UDF in Pig

    UDFs (user-defined functions) are Pig’s way of extending its functionality. There are two types of UDFs that we can write in Pig: evaluate functions (which extend the EvalFunc base class) and load/store functions (which extend the LoadFunc or StoreFunc base class). Here we will develop an evaluate UDF step by step. Let’s start by conceptualizing a UDF (named VowelCount) …

    Read more →

  • -- emp  = LOAD 'Employees.txt' … Data in the text file resembles the “EMP” table in Oracle
    -- dept = LOAD 'Dept.txt' … Data in the text file resembles the “DEPT” table in Oracle
    -- Filter the data in emp to only those rows whose job is CLERK.
    Filtered_Emp = FILTER emp BY (job == 'CLERK');
    -- Supports …

    Read more →

  • Pig Data Types

    Simple: INT and FLOAT are 32-bit signed numeric data types, backed by java.lang.Integer and java.lang.Float
    Simple: LONG and DOUBLE are 64-bit signed numeric Java data types
    Simple: CHARARRAY (Unicode, backed by java.lang.String)
    Simple: BYTEARRAY (bytes / blob, backed by Pig’s DataByteArray class, which wraps byte[])
    Simple: BOOLEAN (“true” or “false”, case insensitive)
    Simple: DATETIME …

    Read more →

  • Pig Statements
    -- The LOAD command loads the data.
    -- Every placeholder such as “A_Rel” and “Filter_A” is called an alias. Aliases are useful
    -- for holding the relation returned by a Pig statement. Aliases are relations (not variables).
    A_Rel = LOAD '/hdfs/path/to/file' [AS (col_1[: type], col_2[: type], col_3[: type], …)] ;
    -- Record set returned by …

    Read more →

  • Pig is a high-level data flow language developed at Yahoo. Pig programs are translated into lower-level instructions supported by the underlying execution engine. Pig is designed for performing complex operations quickly.

    Read more →
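
To give a feel for how the pieces listed above fit together, here is a minimal Pig Latin sketch of a complete data flow: load a file with a typed schema, filter it, group it, and aggregate. The file path, the comma delimiter, and the exact column list are assumptions made up for illustration; they are not taken from any of the posts.

```pig
-- Hypothetical path, delimiter, and schema, for illustration only.
emp = LOAD '/hypothetical/path/Employees.txt' USING PigStorage(',')
      AS (emp_id: int, name: chararray, job: chararray,
          department: int, salary: float);

-- Keep only the rows whose job is CLERK.
clerks = FILTER emp BY job == 'CLERK';

-- Group the clerks by department; each group holds a bag named "clerks".
by_dept = GROUP clerks BY department;

-- For each department, count the clerks and average their salaries.
dept_stats = FOREACH by_dept GENERATE
    group AS department,
    COUNT(clerks) AS num_clerks,
    AVG(clerks.salary) AS avg_salary;

-- Print the result to the console.
DUMP dept_stats;
```

Each alias (emp, clerks, by_dept, dept_stats) names a relation, so every statement consumes one relation and produces the next. Replacing DUMP with a STORE statement would write the result to a file instead of printing it.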