Friday, November 24, 2023

Input Splits in Hadoop

When a MapReduce job is run to process input data, one of the things the Hadoop framework does is divide the input data into smaller chunks; these chunks are referred to as input splits in Hadoop.

For each input split, Hadoop creates one map task to process the records in that input split. That is how parallelism is achieved in the Hadoop framework. For example, if a MapReduce job calculates that the input data is divided into 8 input splits, then 8 mappers will be created to process those input splits.

How input splits are calculated

Input splits are created by the InputFormat class used in the MapReduce job. Actually, the implementation for calculating splits for the input files is in the FileInputFormat class, which is the base class for all file-based implementations of InputFormat.

InputFormat Class

public abstract class InputFormat<K, V> {
  // Logically splits the job's input, returning one InputSplit per map task
  public abstract List<InputSplit> getSplits(JobContext context) throws IOException, InterruptedException;

  // Creates the RecordReader that turns a split into (key, value) records
  public abstract RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException;
}

FileInputFormat method for computing splits

If you want to see how the split size is calculated, you can have a look at the computeSplitSize method in the org.apache.hadoop.mapreduce.lib.input.FileInputFormat class.

protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}
Here the values of minSize and maxSize are taken from the following configuration parameters-
  • mapreduce.input.fileinputformat.split.minsize- The minimum size chunk that map input should be split into. Default value is 0.
  • mapreduce.input.fileinputformat.split.maxsize- The maximum size chunk that map input should be split into. Default value is Long.MAX_VALUE.

So you can see that, with the default values, the input split size is equal to the HDFS block size.
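As a rough illustration, the sketch below shows a driver where these parameters are set through FileInputFormat's setMinInputSplitSize() and setMaxInputSplitSize() helper methods. The class name and input path are placeholders used only for this example; the size values are just one possible choice.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "split size demo");

    // Hypothetical input path, used only for illustration
    FileInputFormat.addInputPath(job, new Path("/user/input"));

    // With the defaults (minSize = 0, maxSize = Long.MAX_VALUE) and a
    // 128 MB block size, computeSplitSize() returns
    // max(0, min(Long.MAX_VALUE, 128 MB)) = 128 MB, i.e. one split per block.

    // Forcing a smaller maximum split size, e.g. 64 MB, makes
    // computeSplitSize() return max(0, min(64 MB, 128 MB)) = 64 MB,
    // so each 128 MB block yields two input splits (and two map tasks).
    FileInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024L);

    // Alternatively, raising the minimum split size above the block size
    // makes a split span more than one block (at the cost of data locality).
    // FileInputFormat.setMinInputSplitSize(job, 256 * 1024 * 1024L);
  }
}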

Input split is a logical split

We have already seen how input data is divided into input splits in Hadoop. Further, each split is divided into records (key-value pairs), and the map task processes each of these records.

These input splits and records are logical; they don't store or contain the actual data. They just refer to the data that is stored as blocks in HDFS. You can have a look at the InputSplit class to verify that. InputSplit represents the data to be processed by an individual Mapper.

public abstract class InputSplit {
  // Size of the split in bytes, so that input splits can be sorted by size
  public abstract long getLength() throws IOException, InterruptedException;

  // Names of the nodes where the split's data would be local
  public abstract String[] getLocations() throws IOException, InterruptedException;

  @Evolving
  public SplitLocationInfo[] getLocationInfo() throws IOException {
    return null;
  }
}

As you can see, the InputSplit class has the length of the input split in bytes and a set of storage locations where the actual data is stored. These storage locations help the Hadoop framework spawn map tasks as close to the data as possible in order to take advantage of the data locality optimization.
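To see that a split is only metadata, you can ask an InputFormat for its splits and print their lengths and locations. A minimal sketch follows; the class name and input path are placeholders, and running it needs a reachable file system with some input data.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ListSplits {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "list splits");
    // Hypothetical input path, used only for illustration
    FileInputFormat.addInputPath(job, new Path("/user/input/logs.txt"));

    // Job implements JobContext, so it can be passed to getSplits()
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    for (InputSplit split : splits) {
      // Each split only reports a length and host names; no file data is copied
      System.out.println(split.getLength() + " bytes on "
          + String.join(",", split.getLocations()));
    }
  }
}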

Input split vs HDFS blocks

As already stated, an input split is a logical representation of the data stored in HDFS blocks, whereas the data of a file is physically stored in HDFS blocks. Now, you may wonder why these two representations of the data are needed and why MapReduce can't directly process the data stored in the blocks.

HDFS breaks the input file into blocks of 128 MB (as per the default configuration), with no consideration for the records residing in those blocks. So what if there is a record that spans a block boundary, i.e. a record that starts in one block and ends in another block? A map task processing that HDFS block would only be able to read the portion of the record within the block it is processing, rather than the whole record.

That's where the logical representation of data used in input splits helps. Input splits take care of the logical record boundaries: using the starting record in the block and the byte offset, the record reader can get the complete record even if it spans the block boundary.
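As a rough sketch of that idea, loosely modeled on what Hadoop's LineRecordReader does (simplified, not the actual implementation): a reader positioned at the start of a split skips the first partial line, because the previous split's reader will read it, and keeps reading past the split's end until it finishes the last line, so a record spanning a block boundary is read in full by exactly one mapper. The class name and the process() handler below are hypothetical.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

// Simplified, hypothetical reading logic for line records in a split;
// loosely modeled on LineRecordReader, not the actual Hadoop source.
public class BoundaryAwareReader {

  void readSplit(long start, long end, LineReader in) throws IOException {
    long pos = start;
    Text line = new Text();
    if (start != 0) {
      // Not at the start of the file: the first (possibly partial) line
      // belongs to the previous split's reader, so skip it.
      pos += in.readLine(line);
    }
    // Keep reading while the record *starts* before the split end; the last
    // record may run past 'end' into the next block and is still read in full.
    while (pos < end) {
      int bytesRead = in.readLine(line);
      if (bytesRead == 0) {
        break; // end of stream
      }
      pos += bytesRead;
      process(line);
    }
  }

  // Hypothetical handler for one complete record
  void process(Text record) {
    System.out.println(record);
  }
}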


That's all for this topic Input Splits in Hadoop. If you have any doubt or any suggestions to make please drop a comment. Thanks!

>>>Return to Hadoop Framework Tutorial Page


Related Topics

  1. Speculative Execution in Hadoop
  2. How MapReduce Works in Hadoop
  3. What is SafeMode in Hadoop
  4. Data Compression in Hadoop
  5. YARN in Hadoop

You may also like-

  1. How to Compress Intermediate Map Output in Hadoop
  2. NameNode, DataNode And Secondary NameNode in HDFS
  3. HDFS Commands Reference List
  4. How to Configure And Use LZO Compression in Hadoop
  5. Capacity Scheduler in YARN
  6. How to Run a Shell Script From Java Program
  7. Enum Type in Java
  8. Switch-Case Statement in Java