The following are the values that are set by default when we don't set them explicitly –
Job.setInputFormatClass: TextInputFormat (Outputs the byte offset of each line as the Key (LongWritable) and the line content as the Value (Text))
Job.setMapperClass: Mapper (Default mapper, called IdentityMapper in MRv1, which simply outputs each Key/Value pair as-is)
Job.setMapOutputKeyClass: LongWritable (The type of the key in the intermediate Mapper output)
Job.setMapOutputValueClass: Text (The type of the value in the intermediate Mapper output)
Job.setPartitionerClass: HashPartitioner (Partitioner class that evaluates each record and decides which Reducer it is sent to)
Job.setReducerClass: Reducer (Default Reducer, similar to IdentityyReducer in MRv1, which outputs its input unchanged)
Job.setNumReduceTasks: 1 (By default a single Reducer processes the output from all Mappers)
Job.setOutputKeyClass: LongWritable (The type of the key in the Job's output created by the Reducer)
Job.setOutputValueClass: Text (The type of the value in the Job's output created by the Reducer)
Job.setOutputFormatClass: TextOutputFormat (Writes each record on a separate line, using a tab character to separate key and value)
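To make these defaults concrete, here is a minimal driver sketch (using the new org.apache.hadoop.mapreduce API) that sets every one of the above values explicitly. The class name, job name, and the use of args[0]/args[1] for the input/output paths are placeholders for illustration; removing any of the set* calls below leaves the job's behaviour unchanged because they only restate the defaults.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class DefaultsDemoDriver {                            // hypothetical class name
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "defaults-demo"); // hypothetical job name
        job.setJarByClass(DefaultsDemoDriver.class);

        // Everything below simply restates the defaults listed above.
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(Mapper.class);                    // identity mapper
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setPartitionerClass(HashPartitioner.class);
        job.setReducerClass(Reducer.class);                  // identity reducer
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}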
Similarly, the following settings are the defaults for the Streaming API when we don't set them explicitly (the command below spells each default out) –
# MyScript.sh can be any Unix command or even a Java class;
# -file ships it to each Mapper node. Note that the generic -D
# option must come before the Streaming-specific options.
$ export HADOOP_TOOLS=$HADOOP_HOME/share/hadoop/tools
$ hadoop jar $HADOOP_TOOLS/lib/hadoop-streaming-*.jar \
    -D stream.map.input.field.separator=\t \
    -input MyInputDirectory \
    -output MyOutputDirectory \
    -mapper /user/ubuntu/MyScript.sh \
    -inputformat org.apache.hadoop.mapred.TextInputFormat \
    -partitioner org.apache.hadoop.mapred.lib.HashPartitioner \
    -numReduceTasks 1 \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -outputformat org.apache.hadoop.mapred.TextOutputFormat \
    -io text \
    -file /user/ubuntu/MyScript.sh
Unlike with the Java API, the mapper has to be specified explicitly. Also, the Streaming API uses the old MapReduce API classes (org.apache.hadoop.mapred). The file-packaging options -file or -archives only need to be added when we want to run our own script that is not already present on all cluster nodes. Unix pipes are not supported as the mapper, e.g. "cat dataFile.txt | wc" will not work.