Pig Latin
-

emp = LOAD '/path/to/data/file/on/hdfs/Employees.txt'
      [ USING PigStorage(' ') ]
      AS ( emp_id: INT, name: CHARARRAY, joining_date: DATETIME, department: INT, salary: FLOAT, mgr_id: INT,
           residence: BAG { b:(addr1: CHARARRAY, addr2: CHARARRAY, city: CHARARRAY) } );
The alias for the data in the file “Employees.txt” is emp, and using emp,
-

UDFs (User Defined Functions) are the way to extend Pig’s functionality. There are two types of UDFs that we can write in Pig: Evaluate functions, which extend the EvalFunc base class, and Load/Store functions, which extend the LoadFunc and StoreFunc base classes respectively. Here we will develop an Evaluate UDF step by step. Let’s start by conceptualizing a UDF (named VowelCount).
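Before wiring anything into Pig, it can help to pin down the core logic of VowelCount in plain Java. The sketch below is only an assumption about what such a UDF might compute (the number of a, e, i, o, u characters in a string, ignoring case); the method name countVowels and the null-in, null-out behaviour are illustrative choices, not taken from the original post. In an actual Evaluate UDF this logic would typically be called from the exec(Tuple) method of a class extending EvalFunc<Integer>, which is left out here so that the example compiles without the Pig jar.

```java
import java.util.Locale;

// Illustrative sketch only: the counting logic a VowelCount UDF might wrap.
// The class layout and method name are hypothetical, not from the original post.
final class VowelCount {

    // Assumption: only the five basic Latin vowels are counted.
    private static final String VOWELS = "aeiou";

    // Returns the number of vowels in text, or null when text is null
    // (assumed here to mirror the common "null in, null out" UDF convention).
    static Integer countVowels(String text) {
        if (text == null) {
            return null;
        }
        int count = 0;
        // Lower-case once so the comparison is case-insensitive.
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            if (VOWELS.indexOf(c) >= 0) {
                count++;
            }
        }
        return count;
    }
}
```

Keeping the counting logic in a static method like this makes it easy to unit-test separately from Pig; the UDF class itself would then only need to read the first field of the input tuple and delegate to it.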
-
-- emp = LOAD 'Employees.txt' … Data in text file resembles the "EMP" table in Oracle
-- dept = LOAD 'Dept.txt' … Data in text file resembles the "DEPT" table in Oracle
-- Filter data in emp to only those whose job is Clerk.
Filtered_Emp = FILTER emp BY (job == 'CLERK');
-- Supports
-

Simple: INT and FLOAT are 32-bit signed numeric datatypes backed by java.lang.Integer and java.lang.Float
Simple: LONG and DOUBLE are 64-bit signed numeric Java datatypes
Simple: CHARARRAY (Unicode, backed by java.lang.String)
Simple: BYTEARRAY (bytes / blob, backed by Pig’s DataByteArray class, which wraps byte[])
Simple: BOOLEAN (“true” or “false”, case sensitive)
Simple: DATETIME
-
Pig Statements
-- The LOAD command loads the data.
-- Every placeholder like "A_Rel" and "Filter_A" is called an alias, and aliases are useful
-- for holding the relation returned by Pig statements. Aliases are relations (not variables).
A_Rel = LOAD '/hdfs/path/to/file' [AS (col_1[: type], col_2[: type], col_3[: type], …)] ;
-- Record set returned by