Thursday, December 28, 2023

Counters in Hadoop MapReduce Job

If you run a MapReduce job you will see a lot of counters displayed on the console once the job finishes (you can also check the counters through the web UI while the job is running). These counters in Hadoop MapReduce give a lot of statistical information about the executed job. Apart from giving you information about the tasks, these counters also help you diagnose problems in a MapReduce job and improve its performance.

For example, counters for spilled records and memory usage give you an indication of how your MapReduce job is performing.
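
Counters are primarily meant to be read on the console or in the web UI, but they can also be read programmatically in the driver. Here is a minimal sketch, assuming job is your org.apache.hadoop.mapreduce.Job instance and the job has already finished; Counter and CounterGroup also come from the org.apache.hadoop.mapreduce package.

// In the driver, after job.waitForCompletion(true)
for (CounterGroup group : job.getCounters()) {
  System.out.println(group.getDisplayName());
  for (Counter counter : group) {
    System.out.println("\t" + counter.getDisplayName() + "=" + counter.getValue());
  }
}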


Types of counters in Hadoop

There are two types of counters in Hadoop MapReduce.

  1. Built-In Counters
  2. User-Defined Counters or Custom counters

Built-In Counters in MapReduce

The Hadoop framework has some built-in counters which give information pertaining to-

  1. File system, like bytes read and bytes written.
  2. MapReduce job, like launched map and reduce tasks.
  3. MapReduce task, like map input records and combiner output records.

These built-in counters are grouped based on the type of information they provide and are represented by Enum classes in the Hadoop framework. Following is the list of the counter groups and the corresponding Enum class names.

  1. File System Counters – org.apache.hadoop.mapreduce.FileSystemCounter
  2. Job Counters – org.apache.hadoop.mapreduce.JobCounter
  3. Map-Reduce Framework Counters – org.apache.hadoop.mapreduce.TaskCounter
  4. File Input Format Counters – org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
  5. File Output Format Counters – org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
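
These enum constants can be passed to the findCounter() method to look up a specific counter from code. A minimal sketch, assuming the job has finished and job is the driver's Job instance:

// In the driver, after the job completes
Counters counters = job.getCounters();
// Job counter - number of launched map tasks
long launchedMaps = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
// Task counter - records read by all the maps
long mapInputRecords = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
System.out.println("Launched maps=" + launchedMaps + ", map input records=" + mapInputRecords);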

File System Counters in MapReduce

File system counters are repeated for each type of file system, with each entry prefixed by the file system name. For example, FILE: Number of bytes read and HDFS: Number of bytes read.

  • Number of bytes read- Displays the number of bytes read by the file system for both Map and Reduce tasks.
  • Number of bytes written- Displays the number of bytes written by the file system for both Map and Reduce tasks.
  • Number of read operations- Displays the number of read operations by both Map and Reduce tasks.
  • Number of large read operations- Displays the number of large read operations (for example, traversing the directory tree) for both Map and Reduce tasks.
  • Number of write operations- Displays the number of write operations by both Map and Reduce tasks.
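
Because these counters are repeated per file system, looking one of them up in code takes the file system scheme as an extra argument. A minimal sketch using the FileSystemCounter enum listed above, assuming the job's data is on HDFS and the job has finished:

// In the driver, after the job completes
long hdfsBytesRead = job.getCounters()
    .findCounter("HDFS", FileSystemCounter.BYTES_READ).getValue();
System.out.println("HDFS: Number of bytes read=" + hdfsBytesRead);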

Job Counters in MapReduce

These counters give information about the whole job rather than at the task level.

  • Launched map tasks- Displays the number of launched map tasks.
  • Launched reduce tasks- Displays the number of launched reduce tasks.
  • Launched uber tasks- Displays the number of tasks launched as uber tasks.
  • Data-local map tasks- Displays the number of mappers run on the same node where the input data they have to process resides.
  • Rack-local map tasks- Displays the number of mappers run on a node on the same rack as the input data they have to process.
  • Map in uber tasks- Displays the number of maps run as uber tasks.
  • Reduce in uber tasks- Displays the number of reducers run as uber tasks.
  • Total time spent by all map tasks- Total time in milliseconds spent running all the launched map tasks.
  • Total time spent by all reduce tasks- Total time in milliseconds spent running all the launched reduce tasks.
  • Failed map tasks- Displays the number of map tasks that failed.
  • Failed reduce tasks- Displays the number of reduce tasks that failed.
  • Failed uber tasks- Displays the number of uber tasks that failed.
  • Killed map tasks- Displays the number of killed map tasks.
  • Killed reduce tasks- Displays the number of killed reduce tasks.

Map-Reduce Framework Counters

These counters collect information about the running task.

  • Map input records– Displays the number of records processed by all the maps in the MR job.
  • Map output records– Displays the number of output records produced by all the maps in the MR job.
  • Map skipped records– Displays the number of records skipped by all the maps.
  • Map output bytes– Displays the number of bytes produced by all the maps in the MR job.
  • Map output materialized bytes– Displays the Map output bytes written to the disk.
  • Reduce input groups– Displays the number of key groups processed by all the Reducers.
  • Reduce shuffle bytes– Displays the number of bytes of Map output copied to Reducers in shuffle process.
  • Reduce input records– Displays the number of input records processed by all the Reducers.
  • Reduce output records– Displays the number of output records produced by all the Reducers.
  • Reduce skipped records– Displays the number of records skipped by Reducer.
  • Input split bytes– Displays the number of bytes of input split information read by the tasks.
  • Combine input records– Displays the number of input records processed by combiner.
  • Combine output records– Displays the number of output records produced by combiner.
  • Spilled Records– Displays the number of records spilled to the disk by all the map and reduce tasks.
  • Shuffled Maps– Displays the number of map output files transferred during shuffle process to nodes where reducers are running.
  • Failed Shuffles– Displays the number of map output files failed during shuffle.
  • Merged Map outputs– Displays the number of map outputs merged after map output is transferred.
  • GC time elapsed– Displays the garbage collection time in milliseconds.
  • CPU time spent– Displays the CPU processing time spent in milliseconds.
  • Physical memory snapshot– Displays the total physical memory used in bytes.
  • Virtual memory snapshot– Displays the total virtual memory used in bytes.
  • Total committed heap usage– Displays the total amount of heap memory available in bytes.
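
The Spilled Records counter is one of the more useful ones for performance tuning; if it is much larger than Map output records, map outputs are being spilled to disk multiple times, and increasing the sort buffer (the mapreduce.task.io.sort.mb property) may help. A minimal sketch computing that spill ratio from the TaskCounter enum, assuming the job produced some map output:

// In the driver, after the job completes
Counters counters = job.getCounters();
long spilled = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
long mapOutput = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
// A ratio well above 1 means records are spilled and re-merged repeatedly
System.out.println("Spill ratio=" + ((double) spilled / mapOutput));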

File Input Format Counters in MapReduce

  • Bytes Read– Displays the bytes read by Map tasks using the specified Input format.

File Output Format Counters in MapReduce

  • Bytes Written– Displays the bytes written by Map and Reduce tasks using the specified Output format.

User defined counters in MapReduce

You can also create user defined counters in Hadoop using a Java enum. The name of the enum becomes the counter group's name whereas each field of the enum is a counter name.

You can increment these counters in the mapper or reducer based on some logic, which will help you with debugging. User defined counters are aggregated across all the mappers or reducers by the Hadoop framework and displayed as a single total.

User defined counter example

Suppose you have data in the following format, where in some records the sales data is missing.

Item1 345 zone-1
Item1  zone-2
Item3 654 zone-2
Item2 231 zone-3

Now you want to determine the number of records where sales data is missing, to get a picture of how much skew the missing fields introduce into your analysis.

MapReduce code

package org.netjs;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SalesCalc extends Configured implements Tool {    
  enum Sales {
    SALES_DATA_MISSING
  }
  // Mapper
  public static class SalesMapper extends Mapper<LongWritable, Text, Text, IntWritable>{
    private Text item = new Text();
    private IntWritable sales = new IntWritable();
    public void map(LongWritable key, Text value, Context context) 
        throws IOException, InterruptedException {
      // Splitting the line on tab
      String[] salesArr = value.toString().split("\t");
      item.set(salesArr[0]);
                
      // Sales field may be empty or missing altogether
      if (salesArr.length > 1 && !salesArr[1].trim().equals("")) {
        sales.set(Integer.parseInt(salesArr[1].trim()));
      } else {
        // incrementing counter
        context.getCounter(Sales.SALES_DATA_MISSING).increment(1);
        sales.set(0);
      }
      context.write(item, sales);
    }
  }
    
  // Reducer
  public static class TotalSalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) 
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }      
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    int exitFlag = ToolRunner.run(new SalesCalc(), args);
    System.exit(exitFlag);
  }
    
  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    Job job = Job.getInstance(conf, "SalesCalc");
    job.setJarByClass(getClass());
    job.setMapperClass(SalesMapper.class);    
    job.setReducerClass(TotalSalesReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }
}

In the counters displayed for the MapReduce job you can see the user defined counter showing the number of records where sales data is missing.

org.netjs.SalesCalc$Sales
        SALES_DATA_MISSING=1
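
The same value can be read programmatically in the driver once the job finishes, which is handy if you want to log a warning or fail the job when too many records are bad. A minimal sketch, reusing the Sales enum from the example above by replacing the last line of the run() method:

boolean success = job.waitForCompletion(true);
// Reading the user defined counter aggregated by the framework
long missing = job.getCounters().findCounter(Sales.SALES_DATA_MISSING).getValue();
if (missing > 0) {
  System.err.println("Records with missing sales data: " + missing);
}
return success ? 0 : 1;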

That's all for this topic Counters in Hadoop MapReduce Job. If you have any doubt or any suggestions to make please drop a comment. Thanks!



Related Topics

  1. ToolRunner and GenericOptionsParser in Hadoop
  2. Chaining MapReduce Job in Hadoop
  3. Predefined Mapper And Reducer Classes in Hadoop
  4. Input Splits in Hadoop
  5. Data Locality in Hadoop

You may also like-

  1. HDFS Commands Reference List
  2. NameNode, DataNode And Secondary NameNode in HDFS
  3. Sequence File in Hadoop
  4. Parquet File Format in Hadoop
  5. Apache Avro Format in Hadoop
  6. How to Create Ubuntu Bootable USB
  7. Java Exception Handling Interview Questions
  8. Ternary Operator in Java