How to Write a Map Only Job in Hadoop MapReduce

Saturday, November 25, 2023

In a Hadoop MapReduce job you generally write both a map function and a reduce function: the map function generates (key, value) pairs and the reduce function aggregates them. You can, however, write a job that has only the map function and skips the reducer entirely. Such a job is known as a Mapper only job in Hadoop MapReduce.

Mapper only job in Hadoop

You may have a scenario where you only need to generate (key, value) pairs and no aggregation is required. In that case you can write a job with only a map function. A typical example is converting a file to a binary file format like SequenceFile or to a columnar file format like Parquet.

Refer to How to Read And Write SequenceFile in Hadoop to see how to convert a text file to a sequence file using a mapper only job.

Note that in a regular MapReduce job the output of the mappers is written to the local disk of the node running the map task rather than to HDFS. In a Mapper only job the map output is the final output of the job and is written directly to HDFS, which is one of the differences between a regular MapReduce job and a Mapper only job in Hadoop.
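One visible consequence of the map tasks writing the final output is the name of the files in the output directory. With the default file output formats of the `org.apache.hadoop.mapreduce` API, a Mapper only job produces files named part-m-NNNNN, whereas a job with reducers produces part-r-NNNNN. The snippet below is only a plain Java illustration of that naming convention; `OutputNameSketch` and `outputFileName` are hypothetical names and are not part of Hadoop.

```java
import java.util.Locale;

class OutputNameSketch {
    // Builds a name in the style of Hadoop's default output files:
    // "part-" + task type ("m" or "r") + "-" + five digit task id.
    static String outputFileName(int numReduceTasks, int taskId) {
        // With zero reducers the map tasks write the final output themselves.
        String taskType = (numReduceTasks == 0) ? "m" : "r";
        return String.format(Locale.ROOT, "part-%s-%05d", taskType, taskId);
    }

    public static void main(String[] args) {
        System.out.println(outputFileName(0, 0)); // mapper only job: part-m-00000
        System.out.println(outputFileName(2, 1)); // job with two reducers: part-r-00001
    }
}
```

So if the output directory of your job contains part-m files, that is a quick sign the job ran without a reduce phase.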

Writing Mapper only job

To write a mapper only job you need to set the number of reducers to zero. You can do that by adding job.setNumReduceTasks(0); in your driver class.

For example:

@Override
public int run(String[] args) throws Exception {
 Configuration conf = getConf();
 Job job = Job.getInstance(conf, "TestClass");
 job.setJarByClass(getClass());
 job.setMapperClass(TestMapper.class);
 // Set the number of reduce tasks to zero to make this a mapper only job
 job.setNumReduceTasks(0);
 .....
 .....

}

Another way to run a Mapper only job is to pass the configuration parameter on the command line. The parameter to use is mapreduce.job.reduces. Note that before Hadoop 2 the parameter was mapred.reduce.tasks, which is now deprecated.

For example:

hadoop jar /path/to/jar ClasstoRun -D mapreduce.job.reduces=0 /input/path /output/path
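Keep in mind that the -D option is only honoured when the driver is launched through ToolRunner (or otherwise uses GenericOptionsParser), as a driver that implements Tool and calls getConf() normally is. The generic options are then copied into the Configuration before run() receives the remaining arguments. The following is a simplified plain Java sketch of that idea, not Hadoop's actual parser; `GenericOptionsSketch` and `parse` are hypothetical names, and a plain Map stands in for the Configuration object.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GenericOptionsSketch {
    // Copies every "-D key=value" pair into conf and returns the remaining arguments.
    static List<String> parse(String[] args, Map<String, String> conf) {
        List<String> remaining = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if ("-D".equals(args[i]) && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2);
                if (kv.length == 2) {
                    conf.put(kv[0], kv[1]);
                }
            } else {
                remaining.add(args[i]);
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        String[] cli = {"-D", "mapreduce.job.reduces=0", "/input/path", "/output/path"};
        Map<String, String> conf = new HashMap<>();
        List<String> paths = parse(cli, conf);
        // "1" mirrors Hadoop's default value for mapreduce.job.reduces
        int reduces = Integer.parseInt(conf.getOrDefault("mapreduce.job.reduces", "1"));
        System.out.println("reducers=" + reduces + ", mapper only=" + (reduces == 0));
        System.out.println("remaining args=" + paths);
    }
}
```

In this sketch only the input and output paths are left for the driver, while the reducer count travels through the configuration.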

Mapper only job runs faster

In a job that has reducers, the output of the map tasks is partitioned and sorted on keys, and then sent across the network to the nodes where the reducers are running. A Mapper only job avoids this whole sort and shuffle phase, which makes it faster.
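To make the difference concrete, here is a small plain Java sketch of what happens to the map output in each case. It is an in-memory illustration only and leaves out spilling to disk, merging and the network transfer. `ShuffleSketch` and its methods are hypothetical names; the partition formula is the one used by Hadoop's default HashPartitioner.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class ShuffleSketch {
    // Same formula as Hadoop's default HashPartitioner.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Mapper only job: records are written out in the order the mapper emitted them.
    static List<Map.Entry<String, Integer>> mapOnly(List<Map.Entry<String, Integer>> mapOutput) {
        return new ArrayList<>(mapOutput);
    }

    // Job with reducers: every record is assigned to a partition (one per reducer),
    // then each partition is sorted by key.
    static List<List<Map.Entry<String, Integer>>> partitionAndSort(
            List<Map.Entry<String, Integer>> mapOutput, int numReduceTasks) {
        List<List<Map.Entry<String, Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> kv : mapOutput) {
            partitions.get(partitionFor(kv.getKey(), numReduceTasks)).add(kv);
        }
        for (List<Map.Entry<String, Integer>> partition : partitions) {
            partition.sort(Map.Entry.comparingByKey());
        }
        return partitions;
    }

    // Small helper to build a (key, value) pair.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new SimpleImmutableEntry<>(key, value);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput =
                Arrays.asList(kv("b", 1), kv("a", 2), kv("c", 3), kv("a", 4));
        // prints: mapper only: [b=1, a=2, c=3, a=4]
        System.out.println("mapper only: " + mapOnly(mapOutput));
        // prints: with 2 reducers: [[b=1], [a=2, a=4, c=3]]
        System.out.println("with 2 reducers: " + partitionAndSort(mapOutput, 2));
    }
}
```

The mapper only path does no extra work per record, while the path with reducers has to partition and sort every record before any reducer can start on it.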

That's all for this topic How to Write a Map Only Job in Hadoop MapReduce. If you have any doubts or suggestions, please drop a comment. Thanks!
