OutputCommitter in Hadoop

In a distributed environment like Hadoop framework it is important to track the success or failure of all the tasks running on different nodes. The whole job should be marked as successfully finished or aborted based on the success (or failure of any task) of all the tasks. To ensure that Hadoop framework uses commit protocol and the class used for the purpose is known as OutputCommitter in Hadoop.

OutputCommitter class in Hadoop is an abstract class and its concrete implementation is the FileOutputCommitter class.

As per Hadoop docs-

FileOutputCommitter- An OutputCommitter that commits files specified in job output directory i.e. ${mapreduce.output.fileoutputformat.outputdir}.

Tasks performed by OutputCommitter

The Hadoop MapReduce framework relies on the OutputCommitter of the job to do the following tasks-

Setup the job during initialization. For example, create the temporary output directory for the job during the initialization of the job.
Job cleanup after the job completion. For example, remove the temporary output directory after the job completion.
Setup the task temporary output.
Check whether a task needs a commit. This is to avoid the commit procedure if a task does not need commit.
Commit of the task output.
Discard the task commit.

Methods in OutputCommitter class in Hadoop (For MR 2 API)

abortJob(JobContext jobContext, org.apache.hadoop.mapreduce.JobStatus.State state)- For aborting an unsuccessful job's output. Note that this is invoked for jobs with final runstate as JobStatus.FAILED or JobStatus.KILLED. This is called from the application master process for the entire job. This may be called multiple times.
abortTask(TaskAttemptContext taskContext)- Discard the task output. This is called from a task's process to clean up a single task's output that can not yet been committed. This may be called multiple times for the same task, but for different task attempts.
commitJob(JobContext jobContext)- For committing job's output after successful job completion. That is when job clean up also happens. Note that this is invoked for jobs with final runstate as SUCCESSFUL. This is called from the application master process for the entire job. This is guaranteed to only be called once.
commitTask(TaskAttemptContext taskContext)- To promote the task's temporary output to final output location. If needsTaskCommit(TaskAttemptContext) returns true and this task is the task that the AM determines finished first, this method is called to commit an individual task's output. This is to mark that tasks output as complete.
needsTaskCommit(TaskAttemptContext taskContext)- Check whether task needs a commit. This is called from each individual task's process that will output to HDFS, and it is called just for that task.
setupJob(JobContext jobContext)- For the framework to setup the job output during initialization. This is called from the application master process for the entire job. This will be called multiple times, once per job attempt.
setupTask(TaskAttemptContext taskContext)- Sets up output for the task. This is called from each individual task's process that will output to HDFS, and it is called just for that task. This may be called multiple times for the same task, but for different task attempts.

Reference- https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/OutputCommitter.html

That's all for this topic OutputCommitter in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!

>>>Return to Hadoop Framework Tutorial Page

Related Topics

You may also like-

Tech Tutorials

Thursday, November 30, 2023