Apache Pig at a High Level


Pig is a high-level data flow language developed at Yahoo. Pig programs are translated into lower-level instructions supported by the underlying execution engine (e.g. MapReduce programs). Pig is designed to perform complex operations (e.g. joins) efficiently.

Pig has two main components: Pig Latin (the scripting language) and the interpreter that executes it.
Pig Latin supports the following execution modes:

  1. Script Mode: Pig runs a script file containing Pig Latin statements.
  2. Grunt Shell Mode: an interactive mode, where we start the Grunt shell and enter Pig Latin statements interactively.
  3. Embedded Mode: using the PigServer Java class, we can execute Pig queries from within a Java program.

Grunt Modes

  • Local: pig -x local [script_name.pig]
  • MapReduce: pig [-x mapreduce] [hdfs://…/script_name.pig]
  • Tez: pig -x tez [script_name.pig]

Pig statements are executed by the Pig interpreter after validation. The interpreter collects the instructions in a script or statement into a logical plan; only when a STORE, DUMP, or ILLUSTRATE statement is encountered is the logical plan converted into a physical plan and executed. Internally, Pig uses the MapReduce or Tez execution engine, so everything a Pig script does is eventually converted into MapReduce (or Tez) jobs.
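A minimal sketch of this lazy evaluation (the file name and schema below are hypothetical):

```pig
-- These statements only build up the logical plan; nothing runs yet
users  = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int);
adults = FILTER users BY age >= 18;

-- DUMP forces the interpreter to compile the logical plan into a
-- physical plan and submit it as MapReduce (or Tez) jobs
DUMP adults;
```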

Comments
Multiline comments are enclosed in C-style markers (/* ... */). Single-line comments are marked with --, which is more of a SQL-like commenting style.
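Both styles in a short (hypothetical) script:

```pig
/* C-style multiline comment:
   load a file and keep only the first field */
raw   = LOAD 'input.txt' AS (f1:chararray, f2:int);
first = FOREACH raw GENERATE f1;  -- SQL-style single-line comment
```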

The design philosophy of the Pig tool is often compared to the pig (the animal):

  • Pigs eat anything: Pig (the tool) can process any kind of data (structured or unstructured, with or without metadata, nested/relational/key-value).
  • Pigs can live anywhere: Pig (the tool) initially supported only MapReduce, but it isn’t designed for a single execution engine. Pig also works with Tez, another execution engine on the Hadoop platform.
  • Pigs are domestic animals: like the pig (the animal), Pig (the tool) is adjustable and modifiable by users. Pig supports a wide variety of user-defined code, in the form of UDFs, stream commands, custom partitioners, etc.
  • Pigs fly: Pig scripts execute quickly, and Pig evolves in a way that keeps it fast rather than letting it become heavyweight.

Install Pig by downloading it from pig.apache.org and extracting the archive (tar -xvf ). Define the Pig installation directory in the environment variable PIG_PREFIX and add $PIG_PREFIX/bin to the PATH variable. Also define the HADOOP_PREFIX environment variable. Start Pig’s Grunt shell using the pig command; since HADOOP_PREFIX is defined, Pig will be able to connect to the Hadoop cluster seamlessly.
All HDFS commands work from inside the Grunt shell; e.g. we can directly issue copyFromLocal, lsr, cd, or pwd without prefixing them with “! hadoop”.
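A sample Grunt session illustrating this (the paths and file names are hypothetical):

```
grunt> pwd
grunt> copyFromLocal data.txt /user/hadoop/data.txt
grunt> cd /user/hadoop
grunt> ls
grunt> cat data.txt
```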
