Writing Simple UDF in Hive


.

There are a few type of UDFs that we can write in Hive.

  • Functions that act on each column value passed to it, e.g. Select Length(name) From Customer
    • Specific functions written for a specific data type
    • Generic functions written to working with more than one data type (GenericUDF)
  • Functions that act on a group of values and emit a single result, e.g. Select AVG(salary) From Employees
  • Functions that act on a combined values and explode it generate multiple rows

Here we will focus on the specific functions that act on each column value (shown above in bold). There is another article that discusses Generic UDFs.

We will write here a UDF that will set the width of the column and output by either padding spaces or by trimming the output to a specified size. As a first step, we will add dependencies in the pom.xml for Hive and Hadoop –

    
      org.apache.hadoop
      hadoop-core
      ${hadoopVersion}
    

    
      org.apache.hive
      hive-exec
      ${hiveVersion}
    

Here you can define the versions of Hive and Hadoop as per your wish. I defined these properties as follows –

  
    1.2.1
    0.14.0
  

Next, I ran the following command to download Hive, Hadoop and other dependencies before I create the eclipse project and write the code –

$ mvn  clean  compile

Once the dependencies are downloaded, I created a Java project and wrote the following Java class for the Hive UDF –

SetSize


package com.hive.in.action.assignments;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.io.Text;

@Description(
        name     = "SetSizeStr",
        value    = "_FUNC_(String columnNameToPad, int desiredSize) : Pad or Cut the string to make it the desired size.",
        extended = "Example:\n" +
                   "    SELECT _FUNC_(name, 35) FROM customers;\n"
)
/**
 * This UDF acts on values of a single String column and and return one output value
 * per input column value it receives.
 *
 * @author vpathak
 */
public final class SetSize extends UDF {

    // This method is called by Hive for each column value.
    public Text evaluate(final Text inText, int nSize) {

        // Validate input ...
        if (inText == null) {
            return null;
        }

        // Extract the string from the Text object and calculate its length ...
        String sReturnText = inText.toString();
        int nLen = sReturnText.trim().length();

        // Cut or Pad the string ...
        if (nLen > nSize) {
            sReturnText = sReturnText.substring(0, nSize);
        } else {
            StringBuffer strBuff = new StringBuffer(sReturnText.trim());
            for (int nCtr = 0; nCtr < (nSize-nLen); ++nCtr) {
                strBuff.append(' ');
            }
            sReturnText = strBuff.toString();
        }

        // Return the new string ...
        return new Text(sReturnText);
    }
}

The User Defined Function that we want to write to extend the functionality of Hive should be a class that extends from Hive’s UDF class. Hive executes the evaluate() function of the class based on the matching data types. This means we can override evaluate() and implement multiple variants of it. For example, if evaluate is implemented with (Text, int) and (int, int) and at runtime, we call SetSize(“Sample”, 10), then Hive will call the (Text, int) version of evaluate(). However, an evaluate method should return a value and the return type can’t be void.

I built the Jar of the Java class by using the following Maven command –

$ mvn  clean  install

Upon successful completion of the command, the target sub-directory contains the SetSize-1.jar file.

To use the UDF in Hive, we must first Add the Jar file and then create a function definition in the Hive like this –

$ hive
.  .  .
hive> ADD JAR file:///Users/vpathak/Data/Technical/Hive/UDFs/SetSize/target/SetSize-1.jar;
hive> CREATE TEMPORARY FUNCTION SetSize AS 'com.hive.in.action.assignments.SetSize';

Like we can see here that the Java code we wrote here is associated with a temporary function name SetSize in Hive. The Description annotation that we decorated our class with, contains the usage information about our UDF. But the usage information doesn’t contain the name of our UDF. Instead it says- _FUNC_. However, once we create a temporary function, Hive will use that name (SetSize) instead of _FUNC_. This allows the same implementation to be used by different function name, since the user create a function name for their choice to use the functionality.

So if we use the DESCRIBE FUNCTION  or DESCRIBE FUNCTION EXTENDED  command, Hive will display the usage information but will use the SetSize function name instead of _FUNC_, like this –

hive> DESCRIBE FUNCTION EXTENDED SetSize;
OK

SetSize(String columnNameToPad, int desiredSize) : Pad or Cut the string to make it the desired size.
Synonyms: setsize
Example:
    SELECT SetSize(name, 35) FROM customers;

Time taken: 0.031 seconds, Fetched: 5 row(s)
hive>

Generic UDFs are discussed here.

.

One response to “Writing Simple UDF in Hive”

  1. […] Specific functions written for a specific data type (simple UDFs) […]

    Like

Leave a reply to Writing a Generic UDF in Hive | Vipul Pathak Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.