The following are the values that are set by default when we don't set them explicitly –
Job.setInputFormatClass: TextInputFormat (Outputs the byte offset of each line as the Key (LongWritable) and the line content as the Value (Text))
Job.setMapperClass: Mapper (Default mapper, called IdentityMapper in MRv1, which simply outputs each Key/Value pair as-is)
Job.setMapOutputKeyClass: LongWritable (The type of the key in the intermediate Mapper output)
Job.setMapOutputValueClass: Text (The type of the value in the intermediate Mapper output)
Job.setPartitionerClass: HashPartitioner (Partitioner class that evaluates each record and decides which Reducer it is sent to)
Job.setReducerClass: Reducer (Default Reducer, similar to IdentityyReducer in MRv1, which outputs its input unchanged)
Job.setNumReduceTasks: 1 (By default a single Reducer processes the output from all Mappers)
Job.setOutputKeyClass: LongWritable (The type of the key in the Job's output created by the Reducer)
Job.setOutputValueClass: Text (The type of the value in the Job's output created by the Reducer)
Job.setOutputFormatClass: TextOutputFormat (Writes each record on a separate line, using a tab character to separate key and value)
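To make these defaults concrete, here is a minimal driver sketch (using the new org.apache.hadoop.mapreduce API) that sets every one of the above values explicitly. The class name, job name, and the use of args[0]/args[1] for the input/output paths are placeholders for illustration; removing any of the set* calls below leaves the job's behaviour unchanged because they only restate the defaults.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class DefaultsDemoDriver {                            // hypothetical class name
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "defaults-demo"); // hypothetical job name
        job.setJarByClass(DefaultsDemoDriver.class);

        // Everything below simply restates the defaults listed above.
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(Mapper.class);                    // identity mapper
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setPartitionerClass(HashPartitioner.class);
        job.setReducerClass(Reducer.class);                  // identity reducer
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}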
Similarly, the following settings are the defaults for the Streaming API when we don't set them explicitly (the command below spells each default out) –
# MyScript.sh can be any Unix command or even a Java class;
# -file ships it to each Mapper node. Note that the generic -D
# option must come before the Streaming-specific options.
$ export HADOOP_TOOLS=$HADOOP_HOME/share/hadoop/tools
$ hadoop jar $HADOOP_TOOLS/lib/hadoop-streaming-*.jar \
    -D stream.map.input.field.separator=\t \
    -input MyInputDirectory \
    -output MyOutputDirectory \
    -mapper /user/ubuntu/MyScript.sh \
    -inputformat org.apache.hadoop.mapred.TextInputFormat \
    -partitioner org.apache.hadoop.mapred.lib.HashPartitioner \
    -numReduceTasks 1 \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -outputformat org.apache.hadoop.mapred.TextOutputFormat \
    -io text \
    -file /user/ubuntu/MyScript.sh
Unlike with the Java API, the mapper has to be specified explicitly. Also, the Streaming API uses the old MapReduce API classes (org.apache.hadoop.mapred). The file-packaging options -file or -archives only need to be added when we want to run our own script that is not already present on all cluster nodes. Unix pipes are not supported as the mapper, e.g. "cat dataFile.txt | wc" will not work.