UDFs (User Defined Functions) are Pig's mechanism for extending its built-in functionality. Two kinds of UDF that we can write in Pig are:
- Evaluate functions (extend the EvalFunc base class)
- Load/Store functions (extend the LoadFunc and StoreFunc base classes, respectively)
Here we will develop an Evaluate UDF step by step. Let's start by conceptualizing a UDF (named VowelCount) that returns an integer count of the vowels in a string.
Including Pig Jar files for Pig 0.14.0 in pom.xml
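<properties>
    <hadoopVersion>1.2.1</hadoopVersion>
    <pigVersion>0.14.0</pigVersion>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.pig</groupId>
        <artifactId>pig</artifactId>
        <version>${pigVersion}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-core</artifactId>
        <version>${hadoopVersion}</version>
    </dependency>
    . . .
    . . .
</dependencies>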
Implementing UDF
package com.pig.ni.action.assignments.udf;

import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;
import java.io.IOException;

// The type parameter declares the UDF's return type to Pig.
public class VowelCount extends EvalFunc<Integer> {

    @Override
    public Integer exec(Tuple inputVal) throws IOException {
        // We will check the input string for these characters ...
        String[] setOfVowels = new String[]{"a", "e", "i", "o", "u"};

        // Validate Input Value ...
        if (inputVal == null ||
            inputVal.size() == 0 ||
            inputVal.get(0) == null) {
            // Emit a warning for the user and skip this record.
            // Throwing an exception would fail the entire task, while
            // returning null only fails this UDF call ...
            warn("Inappropriate parameter, either value missing or null. " +
                 "Skipping ...",
                 PigWarning.SKIP_UDF_CALL_FOR_NULL);
            return null;
        }

        // Count vowels in this string (case-insensitive) ...
        final String inputString = ((String) inputVal.get(0)).toLowerCase();
        int vowelCount = 0;
        int recentlyReportedIndex = 0;
        try {
            // Check each vowel ...
            for (String thisVowel : setOfVowels) { // Find "a" in "Vipul Pathak"
                recentlyReportedIndex = 0;
                // Keep counting until -1 is returned by indexOf() method ...
                while (true) {
                    // Where is the next occurrence of this vowel located?
                    recentlyReportedIndex = inputString.indexOf(thisVowel, recentlyReportedIndex);
                    // Not found (-1 is failure to find) ... Break the loop
                    if (recentlyReportedIndex < 0) {
                        break;
                    } else {
                        // Vowel found
                        ++vowelCount;
                        // Next search for this vowel should start after this index.
                        ++recentlyReportedIndex;
                    }
                }
            }
        } catch (Exception e) {
            throw new IOException("VowelCount: Fatal error condition. Aborting execution: ", e);
        }
        return vowelCount;
    }
}
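The counting logic inside exec() does not depend on Pig, so it can be sanity-checked on its own before packaging the jar. The following is only a sketch in plain Java: the class VowelCountSketch and the helper countVowels are hypothetical names introduced here for illustration, not part of the UDF or of the Pig API. It counts vowels in a single pass instead of one indexOf() scan per vowel, and returns null for null input to mirror how the UDF skips such records.

```java
// Hypothetical stand-alone sketch; not part of the Pig API or of the UDF above.
class VowelCountSketch {

    // Counts a, e, i, o, u (case-insensitive) in a single pass over the string.
    // Returns null for null input, mirroring the UDF's "skip, don't fail" behaviour.
    static Integer countVowels(String input) {
        if (input == null) {
            return null;
        }
        int count = 0;
        for (char c : input.toLowerCase().toCharArray()) {
            // indexOf returns -1 when c is not one of the five vowels
            if ("aeiou".indexOf(c) >= 0) {
                ++count;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countVowels("Vipul Pathak")); // prints 4
        System.out.println(countVowels(""));             // prints 0
        System.out.println(countVowels(null));           // prints null
    }
}
```

For the same input string, this should give the same count as the nested loop in exec(), which makes it a convenient reference when testing the UDF.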
Using the UDF
$ pig
grunt> REGISTER /Users/vpathak/Data/VowelCount/target/VowelCount-1.jar;
grunt> DEFINE VC com.pig.ni.action.assignments.udf.VowelCount;
. . .
grunt> emp = LOAD 'hdfs://localhost:9000/user/vpathak/Data/emp/emp.txt' USING PigStorage(',') AS (emp_no: INT, ename: CHARARRAY, mgr_id: INT, job: CHARARRAY, salary: INT);
grunt> emp_processed = FOREACH emp GENERATE emp_no, VC(ename) AS TotalVowels;
grunt>
This is about the simplest kind of UDF that we can write in Pig.