Tuesday, December 5, 2023

Replica Placement Policy in Hadoop Framework

HDFS, as the name says, is a distributed file system designed to store large files. A large file is divided into blocks of a defined size (128 MB by default) and these blocks are stored across machines in a cluster. The blocks of the file are also replicated for reliability and fault tolerance. For better reliability the Hadoop framework has a well-defined replica placement policy.
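As a quick worked example of that division (a minimal sketch; 128 MB is the HDFS default block size, configurable through dfs.blocksize, and the 600 MB file size is made up for illustration):

public class BlockCount {
    public static void main(String[] args) {
        // HDFS default block size is 128 MB (configurable via dfs.blocksize)
        long blockSize = 128L * 1024 * 1024;
        long fileSize = 600L * 1024 * 1024; // a 600 MB file, for illustration

        long fullBlocks = fileSize / blockSize;     // number of full blocks
        long lastBlockBytes = fileSize % blockSize; // last block may be smaller
        System.out.println("Full blocks: " + fullBlocks);                // 4
        System.out.println("Final block: " + lastBlockBytes + " bytes"); // 88 MB
    }
}

So a 600 MB file is stored as four full 128 MB blocks plus one smaller 88 MB block, and each of those blocks is replicated independently.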

Rack aware replica placement policy

Large HDFS instances run on a cluster of computers that is commonly spread across many racks, so rack awareness is also part of the replica placement policy in Hadoop.

If two nodes placed in different racks have to communicate, that communication has to go through switches.

If machines are on the same rack, the network bandwidth between those machines is generally greater than the network bandwidth between machines on different racks.
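Hadoop learns which rack a node belongs to through a configured topology mapping (typically a script or table on the NameNode that maps a node's address to a rack path such as /rack1). The sketch below only models that idea; the host names and rack ids are made up, and this is not Hadoop's actual topology mechanism:

import java.util.HashMap;
import java.util.Map;

public class RackMapping {
    // Hypothetical host-to-rack table; in a real cluster this mapping
    // usually comes from a topology script configured for the NameNode.
    private static final Map<String, String> HOST_TO_RACK = new HashMap<>();
    static {
        HOST_TO_RACK.put("datanode1.example.com", "/rack1");
        HOST_TO_RACK.put("datanode2.example.com", "/rack1");
        HOST_TO_RACK.put("datanode3.example.com", "/rack2");
    }

    static String rackOf(String host) {
        // Unknown hosts fall back to a default rack
        return HOST_TO_RACK.getOrDefault(host, "/default-rack");
    }

    static boolean onSameRack(String hostA, String hostB) {
        return rackOf(hostA).equals(rackOf(hostB));
    }

    public static void main(String[] args) {
        // Same rack: higher bandwidth, no inter-rack switch hop
        System.out.println(onSameRack("datanode1.example.com", "datanode2.example.com")); // true
        // Different racks: traffic goes through inter-rack switches
        System.out.println(onSameRack("datanode1.example.com", "datanode3.example.com")); // false
    }
}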

HDFS replica placement policy

Taking rack awareness and fault tolerance into consideration, the replica placement policy followed by the Hadoop framework is as follows-

For the default case, when the replication factor is three

  1. Put one replica on the machine where the client application (the application writing the file) is running, if the client is on a DataNode. Otherwise choose a random DataNode for storing the replica.
  2. Store another replica on a node in a different (remote) rack.
  3. Store the last replica on a different node in the same remote rack.

In case the replication factor is greater than 3, the policy described above is followed for the first 3 replicas. From replica number 4 onward, node locations are determined randomly while keeping the number of replicas per rack below the upper limit, which is basically (replicas - 1) / racks + 2. For example, with 10 replicas spread over 3 racks the limit is (10 - 1) / 3 + 2 = 5 replicas per rack (using integer division). A sketch of this placement logic is shown below.
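The following is an illustration only, not Hadoop's actual BlockPlacementPolicyDefault implementation; the Node record, the choose() method and the example cluster are all made up for the sketch, and it assumes there are always enough eligible nodes:

import java.util.*;
import java.util.function.Predicate;

public class PlacementSketch {
    // A DataNode stand-in: just a host name and the rack it sits on
    record Node(String host, String rack) {}

    // Upper limit on replicas per rack, as described above
    static int maxReplicasPerRack(int replicas, int racks) {
        return (replicas - 1) / racks + 2;
    }

    static List<Node> choose(List<Node> cluster, Node writer, int replicas) {
        Random rnd = new Random();
        List<Node> chosen = new ArrayList<>();
        Map<String, Integer> perRack = new HashMap<>();
        int racks = (int) cluster.stream().map(Node::rack).distinct().count();
        int cap = maxReplicasPerRack(replicas, racks);

        // 1. Local node if the writer runs on a DataNode, else a random node
        Node first = cluster.contains(writer) ? writer
                : cluster.get(rnd.nextInt(cluster.size()));
        add(chosen, perRack, first);

        // 2. A node on a different (remote) rack
        Node second = pick(cluster, rnd,
                n -> !n.rack().equals(first.rack()) && !chosen.contains(n));
        add(chosen, perRack, second);

        // 3. A different node on the same remote rack as the second replica
        add(chosen, perRack, pick(cluster, rnd,
                n -> n.rack().equals(second.rack()) && !chosen.contains(n)));

        // 4+. Random nodes, respecting the per-rack upper limit
        while (chosen.size() < replicas) {
            add(chosen, perRack, pick(cluster, rnd,
                    n -> !chosen.contains(n)
                            && perRack.getOrDefault(n.rack(), 0) < cap));
        }
        return chosen;
    }

    static void add(List<Node> chosen, Map<String, Integer> perRack, Node n) {
        chosen.add(n);
        perRack.merge(n.rack(), 1, Integer::sum);
    }

    // Picks a random node satisfying the condition (assumes one exists)
    static Node pick(List<Node> cluster, Random rnd, Predicate<Node> ok) {
        List<Node> candidates = cluster.stream().filter(ok).toList();
        return candidates.get(rnd.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("dn1", "/rack1"), new Node("dn2", "/rack1"),
                new Node("dn3", "/rack2"), new Node("dn4", "/rack2"),
                new Node("dn5", "/rack3"), new Node("dn6", "/rack3"));
        // Writer runs on dn1, so the first replica stays on dn1
        System.out.println(choose(cluster, cluster.get(0), 3));
    }
}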

HDFS Replication pipelining

While replicating blocks across DataNodes, HDFS uses pipelining. Rather than the client writing to all the chosen DataNodes, the data is pipelined from one DataNode to the next.

For the default replication factor of 3, the replication pipelining works as follows-

The NameNode prepares a list of DataNodes that will host the replicas of a block. The client retrieves this list of 3 DataNodes from the NameNode and writes to the first DataNode in the list. The first DataNode starts receiving the data in portions, writes each portion to its local storage and then transfers that portion to the second DataNode in the list. The second DataNode follows the same procedure: it writes the portion to its local storage and then transfers the portion to the third DataNode in the list.
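A simplified sketch of that chained write is shown below. The DataNode class here is a stand-in invented for the example; the real HDFS write path streams packets over the network with acknowledgements flowing back up the pipeline:

import java.util.ArrayList;
import java.util.List;

public class PipelineSketch {
    // A stand-in for a DataNode that stores a portion locally and
    // forwards it to the next DataNode in the pipeline, if any.
    static class DataNode {
        final String name;
        final DataNode next; // next node in the pipeline, or null
        final List<byte[]> localStorage = new ArrayList<>();

        DataNode(String name, DataNode next) {
            this.name = name;
            this.next = next;
        }

        void receive(byte[] portion) {
            localStorage.add(portion);     // write to local storage first
            if (next != null) {
                next.receive(portion);     // then forward to the next node
            }
        }
    }

    public static void main(String[] args) {
        // Build a 3-node pipeline: dn1 -> dn2 -> dn3
        DataNode dn3 = new DataNode("dn3", null);
        DataNode dn2 = new DataNode("dn2", dn3);
        DataNode dn1 = new DataNode("dn1", dn2);

        // The client writes only to the first DataNode; the data is
        // pipelined from one DataNode to the next.
        for (int i = 0; i < 4; i++) {
            dn1.receive(new byte[64 * 1024]); // 64 KB portions, for illustration
        }
        System.out.println("dn3 stored " + dn3.localStorage.size() + " portions"); // 4
    }
}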

For a replication factor of 3, the following figure shows the placement of replicas.

[Figure: HDFS replica placement policy]

Reference: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Replication

That's all for this topic Replica Placement Policy in Hadoop Framework. If you have any doubt or any suggestions to make please drop a comment. Thanks!

>>>Return to Hadoop Framework Tutorial Page


Related Topics

  1. What is Hadoop Distributed File System (HDFS)
  2. NameNode, DataNode And Secondary NameNode in HDFS
  3. What is SafeMode in Hadoop
  4. File Write in HDFS - Hadoop Framework Internal Steps
  5. Data Locality in Hadoop

You may also like-

  1. What is Big Data
  2. Installing Hadoop on a Single Node Cluster in Pseudo-Distributed Mode
  3. How to Compress MapReduce Job Output in Hadoop
  4. YARN in Hadoop
  5. Speculative Execution in Hadoop
  6. Installing Ubuntu Along With Windows
  7. Java Stream flatMap() Method
  8. How to Create Immutable Class in Java
