Friday, June 8, 2018

How MapReduce Works in Hadoop

In the post Word Count MapReduce Program in Hadoop, a word count MapReduce program was written in Java. In this post, using that program as a reference, we'll see how MapReduce works in the Hadoop framework and how data is processed in the map and reduce tasks respectively.

Map and Reduce tasks in Hadoop

Within a MapReduce job there are two separate tasks: the map task and the reduce task.

Map task- A MapReduce job splits the input dataset into independent chunks, known as input splits in Hadoop, which are processed by the map tasks in a completely parallel manner. The Hadoop framework creates a separate map task for each input split.

Reduce task- The output of the map tasks is sorted by the Hadoop framework and then becomes the input to the reduce tasks.

Map and Reduce inputs and outputs

The Hadoop MapReduce framework operates exclusively on <key, value> pairs. In a MapReduce job, the input to the map function is a set of <key, value> pairs and the output is also a set of <key, value> pairs. The output <key, value> pair may have a different type from the input <key, value> pair.

<K1, V1> -> map -> <K2, V2>

The output from the map tasks is sorted by the Hadoop framework. MapReduce guarantees that the input to every reducer is sorted by key. Input and output of the reduce task can be represented as follows.

<K2, list(V2)> -> reduce -> <K3, V3>

How map task works in Hadoop

Now let's see how the map task works, using the word count program referenced above as an example.

As already stated, both the input and the output of the map function are <key, value> pairs. For the word count program the input file is read line by line, and every line is passed to the map function as a <key, value> pair.

The word count program uses TextInputFormat, which is the default InputFormat. In this format the key is the byte offset, within the file, of the beginning of the line, while the value is the content of the line.

Let’s say you have a file wordcount.txt with the following content.

Hello wordcount MapReduce Hadoop program.
This is my first MapReduce program.

Each line will be passed to the map function in the following format.

<0, Hello wordcount MapReduce Hadoop program.>
<42, This is my first MapReduce program.>

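These offsets can be derived with a few lines of plain Java: each line's key is the previous line's offset plus its length plus one for the newline character. This is a standalone sketch, not Hadoop code.

```java
public class OffsetDemo {
    public static void main(String[] args) {
        String[] lines = {
            "Hello wordcount MapReduce Hadoop program.",
            "This is my first MapReduce program."
        };
        // The first line of a file starts at byte offset 0
        long offset = 0;
        for (String line : lines) {
            System.out.println("<" + offset + ", " + line + ">");
            offset += line.length() + 1; // +1 for the newline character
        }
    }
}
```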
In the map function the line is split on whitespace and each word is written to the context along with the value 1.

private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) 
        throws IOException, InterruptedException {
    // Splitting the line on whitespace
    String[] stringArr = value.toString().split("\\s+");
    for (String str : stringArr) {
        word.set(str);
        context.write(word, one);
    }
}

So the output from the map function for the two lines will be as follows.

Line 1 <key, value> output

(Hello, 1) 
(wordcount, 1) 
(MapReduce, 1)
(Hadoop, 1)
(program., 1)

Line 2 <key, value> output

(This, 1) 
(is, 1) 
(my, 1) 
(first, 1) 
(MapReduce, 1) 
(program., 1)

Shuffling and sorting by the Hadoop framework

The output of the map function doesn't become the input of the reduce function directly. It goes through shuffling and sorting by the Hadoop framework, in which the data is sorted and grouped by key.

After this internal processing the data will be in the following format, which is the input to the reduce function.

<Hadoop, (1)>
<Hello, (1)>
<MapReduce, (1, 1)>
<This, (1)>
<first, (1)>
<is, (1)>
<my, (1)>
<program., (1, 1)>
<wordcount, (1)>
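The shuffle-and-sort step above can be simulated in plain Java with a sorted map that collects the map outputs by key. This is only a sketch of what the framework does, not Hadoop's actual implementation; note that the sorted order (uppercase before lowercase) matches the byte-wise comparison used for Hadoop's Text keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ShuffleDemo {
    public static void main(String[] args) {
        // Map-phase output: one (word, 1) pair per word, in emission order
        String[] mapOutputKeys = {
            "Hello", "wordcount", "MapReduce", "Hadoop", "program.",
            "This", "is", "my", "first", "MapReduce", "program."
        };
        // TreeMap keeps keys sorted, mimicking the framework's sort;
        // the per-key lists mimic the grouping of values
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String key : mapOutputKeys) {
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(1);
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println("<" + e.getKey() + ", " + e.getValue() + ">");
        }
    }
}
```

Java prints each list as [1, 1] rather than (1, 1), but the grouping and the key order are the same as shown above.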

How reduce task works in Hadoop

As we just saw, the input to the reduce task is in the format <key, list(values)>. In the reduce function we iterate over the list of values for each key and add them up, which gives the count for that key.

private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context) 
        throws IOException, InterruptedException {
  int sum = 0;
  // Add up all the ones emitted for this key
  for (IntWritable val : values) {
    sum += val.get();
  }
  result.set(sum);
  context.write(key, result);
}

The key and the sum of its values are written to the context; those <key, value> pairs are the output of the reduce function.

Hadoop      1
Hello       1
MapReduce   2
This        1
first       1
is          1
my          1
program.    2
wordcount   1
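The whole reduce phase can likewise be simulated in plain Java by summing the grouped values for each key. Again, this is a standalone sketch of the logic, not Hadoop code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ReduceDemo {
    public static void main(String[] args) {
        // Grouped, sorted reducer input produced by the shuffle phase
        SortedMap<String, List<Integer>> input = new TreeMap<>();
        input.put("Hadoop", Arrays.asList(1));
        input.put("Hello", Arrays.asList(1));
        input.put("MapReduce", Arrays.asList(1, 1));
        input.put("This", Arrays.asList(1));
        input.put("first", Arrays.asList(1));
        input.put("is", Arrays.asList(1));
        input.put("my", Arrays.asList(1));
        input.put("program.", Arrays.asList(1, 1));
        input.put("wordcount", Arrays.asList(1));

        // One "reduce call" per key: sum the list of 1s
        for (Map.Entry<String, List<Integer>> e : input.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}
```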

That's all for this topic How MapReduce Works in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!

