Thursday, November 9, 2023

How to Compress Intermediate Map Output in Hadoop

To speed up a MapReduce job, it helps to compress the intermediate map output in Hadoop.

The output of the map phase is-

  1. Written to local disk on the node where the map task runs.
  2. Transferred over the network to the reducers, which usually run on different nodes, as their input.

Thus compressing the map output helps in both-

  1. Saving storage (and reducing disk I/O) while storing the map output.
  2. Reducing the amount of data transferred to the reducers.

It is better to use a fast compressor like Snappy, LZO or LZ4 to compress map output in Hadoop, since a higher compression ratio generally means more time spent compressing. Moreover, whether the compressed output is splittable does not matter for intermediate map output, because it is consumed by the reducers and is never divided into input splits.
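The speed versus ratio trade-off can be seen with a small standalone sketch. Snappy, LZO and LZ4 are not part of the JDK, so this example uses java.util.zip.Deflater as a stand-in and compares its fastest level with its strongest level. The class name CompressionTradeoffSketch, the helper methods and the synthetic word-count style data are assumptions made for illustration; actual sizes and timings depend on your data and JVM, and they say nothing specific about the Hadoop codecs themselves.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class CompressionTradeoffSketch {

    // Compresses the input at the given Deflater level and returns the compressed size in bytes.
    public static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end(); // release native resources
        return total;
    }

    // Builds repetitive "word<TAB>1" records, loosely resembling word-count map output (hypothetical data).
    public static byte[] sampleMapOutput(int records) {
        String[] words = {"hadoop", "map", "reduce", "shuffle", "codec"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append(words[i % words.length]).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleMapOutput(200_000);
        int[] levels = {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION};
        for (int level : levels) {
            long start = System.nanoTime();
            int size = compressedSize(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level=%d raw=%d compressed=%d time=%dus%n",
                    level, data.length, size, micros);
        }
    }
}
```

Running main prints the raw size, compressed size and elapsed time for each level. For intermediate map output, the time column usually matters more than squeezing out the last few bytes, which is why a fast codec tends to be the better fit.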

Configuration parameters for compressing map output

You can set configuration parameters for the whole cluster so that all the jobs running on the cluster compress the map output. You can also opt to do it for individual MapReduce jobs.

For example, if you want to set Snappy as the compression format for the map output at the cluster level, set the following properties in mapred-site.xml:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

If you want to set it for individual jobs, set the following properties within your MapReduce program-

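import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Enable compression of the intermediate map output and use Snappy as the codec
conf.setBoolean("mapreduce.map.output.compress", true);
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");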

That's all for the topic How to Compress Intermediate Map Output in Hadoop. If you have any doubts or suggestions, please drop a comment. Thanks!
