Sample Pig Statements Demonstrating Many Functions



Pig Statements

-- The LOAD command loads the data.
-- Names such as "A_Rel" and "Filter_A" are called aliases. An alias holds the
-- relation returned by a Pig statement; aliases are relations, not variables.
-- Square brackets in the template below mark optional parts and are not typed literally.
A_Rel = LOAD '/hdfs/path/to/file' [AS (col_1[: type], col_2[: type], col_3[: type], ...)] ;
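-- A concrete form of the template above, as a sketch: the alias A_Example, the comma
-- delimiter, and the column types are assumptions made for illustration only.
A_Example = LOAD '/hdfs/path/to/file' USING PigStorage(',')
            AS (col_1: chararray, col_2: int, col_3: double);
DESCRIBE A_Example;  -- prints the declared schema; no job is run for DESCRIBE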

-- The record set returned by a statement is called a "relation".
-- FILTER applies a condition to a relation (data set / record set).
-- SQL users can think of it as a WHERE clause.
-- The assignment operator is =    The equality comparison operator is ==
-- String literals take single quotes in Pig Latin.
Filter_A = FILTER A_Rel BY (col_1 == 'New York');

-- The DUMP command prints the content of a relation on screen. Every row in a
-- relation is called a "tuple". DUMP also triggers the underlying MapReduce job.
DUMP Filter_A;

-- Group By operation in Pig Latin is as easy as applying the GROUP command
-- The grouped result will be stored in Alias "Grp_A"
Grp_A = GROUP A_Rel BY col_1;  -- Grp_A is an alias.

-- As noted above, DUMP prints the content of Grp_A on screen.
-- Each tuple of Grp_A holds one value of the grouping column, followed by a bag
-- (an unordered collection of tuples) holding every tuple that has that value.
--
-- The first field is named "group" and holds the value of the grouping column,
-- col_1 in this case. The second field is named after the grouped relation
-- (A_Rel in this case) and holds the bag of all tuples with that col_1 value.
--
-- Like:   group        | A_Rel
--         -------------|---------------------------
--         col_1_val_a  | { (tuple 1), (tuple 4) }
--         col_1_val_b  | { (tuple 2), (tuple 3) }
--         -------------|---------------------------
DUMP Grp_A;  -- Until a DUMP (or STORE), Pig only builds the logical plan.
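
-- The bag in each group can be passed to an aggregate function. A minimal sketch:
-- the alias Cnt_A and the field name row_count are illustrative names only.
-- COUNT skips tuples whose first field is null; COUNT_STAR counts every tuple.
Cnt_A = FOREACH Grp_A GENERATE group, COUNT(A_Rel) AS row_count;
DUMP Cnt_A;  -- one tuple per distinct col_1 value: (col_1 value, number of tuples)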

-- Joining two relations is straightforward: write JOIN, then each relation (r2 and r3 here)
-- followed by BY and either a column name or a zero-based column position ($0, $3, ...).
r2 = LOAD '/path/to/another/data/set' ;
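-- If the table definition is stored in HCatalog. Depending on the setup, the loader may need
-- its fully qualified class name (org.apache.hive.hcatalog.pig.HCatLoader in recent Hive releases).
r3 = LOAD 'table_3' USING HCatLoader();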
Joined_C = JOIN r2 BY $0, r3 BY emp_id;
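
-- After a JOIN, a field name that exists in both inputs is disambiguated with the
-- relation::field form. A sketch with hypothetical inputs: the paths, aliases, and
-- schemas below are assumptions for illustration, not part of the data sets above.
emp  = LOAD '/hdfs/path/to/emp'  USING PigStorage(',') AS (emp_id: int, name: chararray);
dept = LOAD '/hdfs/path/to/dept' USING PigStorage(',') AS (emp_id: int, dept_name: chararray);
Emp_Dept = JOIN emp BY emp_id, dept BY emp_id;
Emp_Flat = FOREACH Emp_Dept GENERATE emp::emp_id AS emp_id, emp::name, dept::dept_name;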

Statements can refer to the fields of the relation (set, or record set) behind an alias. For example, a FILTER statement names an alias and uses one of its fields in an expression: "filtered_sals = FILTER salaries BY income > 70000;". Here salaries is an alias pointing to a relation, and income is a field within that relation.
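
A minimal sketch of that FILTER statement in context follows. The file path, the comma delimiter, and the name field are assumptions added for illustration; only salaries, income, and filtered_sals come from the text above.

-- Declare a schema so that the income field can be referenced by name
salaries = LOAD '/hdfs/path/to/salaries' USING PigStorage(',') AS (name: chararray, income: int);
-- Keep only the tuples whose income field exceeds 70000
filtered_sals = FILTER salaries BY income > 70000;
DUMP filtered_sals;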

 
