Writing EvalFunc UDF in Pig



UDFs (User Defined Functions) are Pig's mechanism for extending its functionality. There are two types of UDFs that we can write in Pig:

  • Evaluate functions (extend the EvalFunc base class)
  • Load/Store functions (extend the LoadFunc and StoreFunc base classes)

Here we will develop an Evaluate UDF step by step. Let's start by conceptualizing a UDF (named VowelCount) that takes a string and returns an integer: the number of vowels in that string.
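Before involving Pig at all, the intended behavior can be pinned down in plain Java. The sketch below is an assumption about the contract, not the UDF itself: the class name VowelCountSketch and the method countVowels are hypothetical, only a, e, i, o, u are treated as vowels (case-insensitively), and a null input yields null.

```java
// Hypothetical helper, not part of Pig: describes what VowelCount should compute.
class VowelCountSketch {

    // Returns the number of vowels (a, e, i, o, u; case-insensitive),
    // or null when there is no string to inspect.
    static Integer countVowels(String input) {
        if (input == null) {
            return null;
        }
        int count = 0;
        for (char c : input.toLowerCase().toCharArray()) {
            // indexOf returns -1 when c is not one of the five vowels
            if ("aeiou".indexOf(c) >= 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countVowels("Vipul Pathak")); // prints 4
        System.out.println(countVowels("rhythm"));       // prints 0
        System.out.println(countVowels(null));           // prints null
    }
}
```

Returning an Integer rather than an int leaves room for the "no answer" case, which is the same choice the UDF makes when its input is missing.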


Including the Pig 0.14.0 JAR files in pom.xml

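<properties>
    <hadoopVersion>1.2.1</hadoopVersion>
    <pigVersion>0.14.0</pigVersion>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.pig</groupId>
        <artifactId>pig</artifactId>
        <version>${pigVersion}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>${hadoopVersion}</version>
    </dependency>
</dependencies>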
.  .  .
.  .  .

Implementing UDF

package com.pig.ni.action.assignments.udf;

import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;

import java.io.IOException;


public class VowelCount extends EvalFunc<Integer> {

    @Override
    public Integer exec(Tuple inputVal) throws IOException {

        // We will check the input string for these characters ...
        String[] setOfVowels = new String[]{"a", "e", "i", "o", "u"};

        // Validate Input Value ...
        if (inputVal == null ||
            inputVal.size() == 0 ||
            inputVal.get(0) == null) {

            // Warn the user and skip this record. Returning null fails
            // only this UDF call, whereas throwing an exception would
            // fail the entire task ...
            super.warn("Inappropriate parameter, either value missing or null. " +
                       "Skipping ...",
                       PigWarning.SKIP_UDF_CALL_FOR_NULL);
            return null;
        }

        // Count vowels in this string ...
        final String inputString = ((String) inputVal.get(0)).toLowerCase();
        int vowelCount = 0;
        int recentlyReportedIndex = 0;

        try {

            // Check each vowel ...
            for(String thisVowel : setOfVowels) { // Find "a" in "Vipul Pathak"

                recentlyReportedIndex = 0;

                // Keep counting until -1 is returned by indexOf() method ...
                while (true){

                    // Where is next vowel located ?
                    recentlyReportedIndex = inputString.indexOf(thisVowel, recentlyReportedIndex);

                    // Not found (-1 is failure to find) ..... Break the loop
                    if (recentlyReportedIndex < 0) {
                        break;

                    } else {
                        // Vowel found
                        ++vowelCount;

                        // Next search for this vowel should be after this index.
                        ++recentlyReportedIndex;
                    }
                }

            }
        } catch (Exception e) {
            throw new IOException("VowelCount: Fatal error condition. Aborting execution: ", e);
        }

        return vowelCount;

    }

}
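The comment in the validation branch distinguishes returning null from throwing an exception. The sketch below imitates that difference in plain Java, with no Pig classes involved. Everything in it is hypothetical: the names NullVersusThrow, Udf, LENIENT, STRICT and applyToAll are made up, the sample names are invented, and the functions return the string length only to keep the example short. The loop is a simplified stand-in for a task feeding records to a UDF one at a time, not a model of Pig's actual execution engine.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NullVersusThrow {

    // Stand-in for the shape of a UDF's exec() method (hypothetical, not a Pig type).
    interface Udf {
        Integer exec(String input) throws IOException;
    }

    // Returns null for bad input, so only that one record yields no value.
    static final Udf LENIENT = input -> input == null ? null : input.length();

    // Throws for bad input, which aborts the whole loop in applyToAll.
    static final Udf STRICT = input -> {
        if (input == null) {
            throw new IOException("null input");
        }
        return input.length();
    };

    // Imitates a task: applies the function to every record in order.
    static List<Integer> applyToAll(Udf udf, List<String> records) throws IOException {
        List<Integer> results = new ArrayList<>();
        for (String record : records) {
            results.add(udf.exec(record));
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("SMITH", null, "ALLEN");
        try {
            System.out.println(applyToAll(LENIENT, records)); // prints [5, null, 5]
            System.out.println(applyToAll(STRICT, records));  // throws before printing
        } catch (IOException e) {
            System.out.println("task failed: " + e.getMessage()); // prints task failed: null input
        }
    }
}
```

With the lenient function the records after the bad one are still processed; with the strict one, a single bad record discards the work for all of them. That is the trade-off the UDF resolves in favor of returning null.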

Using the UDF

$ pig
grunt> REGISTER /Users/vpathak/Data/VowelCount/target/VowelCount-1.jar;
grunt> DEFINE VC com.pig.ni.action.assignments.udf.VowelCount;
.  .  .
grunt> emp = LOAD 'hdfs://localhost:9000/user/vpathak/Data/emp/emp.txt' USING PigStorage(',') AS (emp_no: INT, ename: CHARARRAY, mgr_id: INT, job: CHARARRAY, salary: INT);
grunt> emp_processed = FOREACH emp GENERATE emp_no, VC(ename) AS TotalVowels;
grunt>

This is about the simplest kind of UDF that we can write in Pig: it takes a single value from each input tuple and returns a single value.
