Filtering and Limiting Data in Pig


-- emp  = LOAD 'Employees.txt' ... Data in text file resembles the "EMP" table in Oracle
-- dept = LOAD 'Dept.txt' ........ Data in text file resembles the "DEPT" table in Oracle
-- Keep only the tuples in emp whose job is 'CLERK'. String comparison with ==
-- is case-sensitive, so 'Clerk' or 'clerk' would not match.
Filtered_Emp = FILTER emp BY (job == 'CLERK');
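
-- Conditions can be combined with AND and OR. A minimal sketch, assuming emp
-- also carries a numeric deptno field as Oracle's EMP table does (the LOAD
-- schema is not shown here, so deptno and the relation names are hypothetical).
Clerks_In_Dept_20 = FILTER emp BY (job == 'CLERK') AND (deptno == 20);
Clerks_Or_Salesmen = FILTER emp BY (job == 'CLERK') OR (job == 'SALESMAN');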

-- FILTER also supports regular expressions through the MATCHES keyword.
-- Patterns use Java regular expression syntax and must match the whole value,
-- which is why the pattern below is wrapped in leading and trailing '.*'.
-- A condition can be negated with NOT, but NOT must come before the column
-- name: NOT ename MATCHES ... is valid, ename NOT MATCHES ... is invalid.
-- Inside a character class '|' is a literal pipe, not alternation, so the
-- class for "A or a" is written [Aa].
Emp_Names_Without_A = FILTER emp BY (NOT ename MATCHES '.*[Aa].*');
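
-- Because MATCHES tests the whole value, a prefix test only needs a trailing
-- '.*'. A small sketch (the relation name is hypothetical) that keeps employees
-- whose name starts with an uppercase S:
Emp_Names_Starting_With_S = FILTER emp BY (ename MATCHES 'S.*');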

-- The LIMIT operator works like TOP in many RDBMS queries,
-- e.g. TOP 20 in a SQL query is equivalent to LIMIT 20 here.
-- Pig returns at most 20 tuples (fewer if the relation is smaller),
-- with no guarantee about which 20.
Any_4_Employees = LIMIT emp 4;
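
-- To control which tuples LIMIT returns, sort the relation first. A sketch,
-- assuming emp has a numeric sal field as Oracle's EMP table does (sal and the
-- relation names are hypothetical, since the LOAD schema is not shown here).
Emp_By_Salary = ORDER emp BY sal DESC;
Top_4_Paid_Employees = LIMIT Emp_By_Salary 4;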
