Programming

  • Ingesting Log4J Application Logs Into HDFS Using Flume In Realtime

    Log file analysis is a popular use case in the Big Data world. Log files contain evidence of the historical events that an application witnessed in its execution environment. Monitoring applications try to find traces of the actual events that happened during program execution. Several analysis use cases are possible, from simply counting occurrences of some event to more specific processing.

    Read more →
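The simplest use case named in the excerpt above, counting occurrences of an event, can be sketched in a few lines. This is only an illustration and not the article's Flume pipeline: the line layout (date, time, level, logger, message) and the sample lines are assumptions, and `count_levels` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time> <LEVEL> <logger> - <message>".
# A real Log4J PatternLayout may differ; adjust the pattern to match it.
LEVEL_PATTERN = re.compile(r"^\S+\s+\S+\s+(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")


def count_levels(lines):
    """Count log lines per level, skipping lines that do not match the layout."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample data; the last line mimics a stack-trace continuation.
sample = [
    "2015-06-01 10:00:01,120 INFO  com.example.App - Started",
    "2015-06-01 10:00:02,004 ERROR com.example.App - Connection refused",
    "2015-06-01 10:00:03,310 INFO  com.example.App - Retrying",
    "    at com.example.App.connect(App.java:42)",
]

level_counts = count_levels(sample)
```

Anchoring the pattern at the start of the line keeps continuation lines such as stack traces from being counted as events of their own.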

  • Loading Data in Pig Using HCatalog

    HCatalog is an extension of Hive; in a nutshell, it exposes the schema information in the Hive Metastore so that applications outside Hive can use it. The objective of HCatalog is to hold the following types of information about the data in HDFS: the location of the data; metadata about the data (e.g.

    Read more →

  • Writing Simple UDF in Hive

    There are a few types of UDFs that we can write in Hive: functions that act on each column value passed to them, e.g. SELECT Length(name) FROM Customer; specific functions written for a specific data type; generic functions written to work with more than one data type (GenericUDF); functions that act on a group

    Read more →

  • Loading an HDFS Data File in Pig

    emp = LOAD '/path/to/data/file/on/hdfs/Employees.txt' [ USING PigStorage(' ') ] AS (
        emp_id: INT,
        name: CHARARRAY,
        joining_date: DATETIME,
        department: INT,
        salary: FLOAT,
        mgr_id: INT,
        residence: BAG { b:(addr1: CHARARRAY, addr2: CHARARRAY, city: CHARARRAY) }
    );

    The alias for the data in the file "Employees.txt" is emp, and using emp,

    Read more →

  • -- emp = LOAD 'Employees.txt' … Data in text file resembles the "EMP" table in Oracle
    -- dept = LOAD 'Dept.txt' … Data in text file resembles the "DEPT" table in Oracle
    -- Filter data in emp to only those whose job is Clerk.
    Filtered_Emp = FILTER emp BY (job == 'CLERK');
    -- Supports

    Read more →

  • Pig Statements
    -- The LOAD command loads the data.
    -- Every placeholder like "A_Rel" and "Filter_A" is called an alias, and aliases are useful
    -- for holding the relation returned by Pig statements. Aliases are relations (not variables).
    A_Rel = LOAD '/hdfs/path/to/file' [AS (col_1[: type], col_2[: type], col_3[: type], …)] ;
    -- Record set returned by

    Read more →