Wednesday, June 6, 2018

File Write in HDFS - Hadoop Framework Internal Steps

In this post we’ll see the internal steps within the Hadoop framework when a file is written in HDFS.

Writing file in HDFS - Initial step

When a client application wants to create a file in HDFS, it calls the create() method on DistributedFileSystem, which in turn calls the create() method of DFSClient. Within the Hadoop framework it is the DFSClient class that communicates with the NameNode and DataNodes.

Since the metadata about the file is stored in the NameNode, the first communication is with the NameNode to record the file in its namespace. The NameNode performs checks such as verifying the client's permissions and that the file does not already exist. If those checks pass, the metadata about the file is stored.

For creating the output stream, the FSDataOutputStream and DFSOutputStream classes are used. The client streams file data to FSDataOutputStream, which wraps an instance of the DFSOutputStream class. The streamed file data is cached internally by the DFSOutputStream class and divided into packets of 64 KB. These packets are enqueued into a queue called dataQueue, which is actually a Java LinkedList.
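The packet-splitting step above can be sketched in a few lines. This is a simplified illustration, not the actual Hadoop source; the class and method names here are made up for the example, but the 64 KB packet size and the LinkedList-backed dataQueue match what the post describes.

```java
import java.util.LinkedList;
import java.util.Queue;

// Simplified sketch (not actual Hadoop source) of how DFSOutputStream
// cuts buffered file data into fixed-size packets and enqueues them
// on an internal dataQueue backed by a Java LinkedList.
public class PacketQueueSketch {
    static final int PACKET_SIZE = 64 * 1024; // 64 KB per packet

    // Split the buffered file data into packets and enqueue them.
    static Queue<byte[]> enqueuePackets(byte[] fileData) {
        Queue<byte[]> dataQueue = new LinkedList<>();
        for (int offset = 0; offset < fileData.length; offset += PACKET_SIZE) {
            int len = Math.min(PACKET_SIZE, fileData.length - offset);
            byte[] packet = new byte[len];
            System.arraycopy(fileData, offset, packet, 0, len);
            dataQueue.add(packet);
        }
        return dataQueue;
    }

    public static void main(String[] args) {
        // 150 KB of data -> two full 64 KB packets plus one 22 KB packet
        byte[] fileData = new byte[150 * 1024];
        System.out.println("packets=" + enqueuePackets(fileData).size()); // packets=3
    }
}
```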

Getting DataNode information for writing to the HDFS blocks

There is another class, DataStreamer, which connects to the NameNode and retrieves a new block ID and block locations. Remember that for the default replication factor of 3, the NameNode will send a list of 3 DataNodes for storing each block of the file. These DataNodes form a pipeline to stream block data from one DataNode to another.

The DataStreamer thread picks up packets from the dataQueue and streams them to the first DataNode in the pipeline, which stores the packets and also forwards them to the second DataNode, which in turn stores them and forwards the packets to the third DataNode in the pipeline.
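The store-and-forward behavior of the pipeline can be modeled very simply. In this sketch (illustrative only, not Hadoop code) each DataNode is just a list that records the packets it has stored before the packet moves on to the next node.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch (not actual Hadoop source) of the write pipeline:
// the packet reaches the first DataNode, which stores it and forwards
// it to the next DataNode, and so on down the pipeline.
public class PipelineSketch {
    // Each "DataNode" here is just a list recording received packets.
    static void streamPacket(String packet, List<List<String>> pipeline) {
        for (List<String> dataNode : pipeline) {
            dataNode.add(packet); // store locally, then forward to the next
        }
    }

    public static void main(String[] args) {
        List<List<String>> pipeline = List.of(
                new ArrayList<>(), new ArrayList<>(), new ArrayList<>());
        streamPacket("pkt-1", pipeline);
        // every DataNode in the pipeline now holds the packet
    }
}
```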

There is another queue in the DFSOutputStream class known as ackQueue. After sending a packet to the first DataNode in the pipeline, the DataStreamer thread moves that packet from the dataQueue to the ackQueue. Only when that packet has been written to all the DataNodes in the pipeline is it removed from the ackQueue.
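The movement between the two queues can be sketched as follows. This is an illustrative simplification, not the actual Hadoop implementation; the method names are invented for the example, but the queue handoff mirrors the description above.

```java
import java.util.LinkedList;

// Simplified sketch (not actual Hadoop source) of the two queues in
// DFSOutputStream: a packet moves from dataQueue to ackQueue when it is
// sent to the first DataNode, and leaves ackQueue only after every
// DataNode in the pipeline has acknowledged it.
public class AckQueueSketch {
    final LinkedList<String> dataQueue = new LinkedList<>();
    final LinkedList<String> ackQueue = new LinkedList<>();

    // Send the next packet into the pipeline: move dataQueue -> ackQueue.
    String sendNextPacket() {
        String packet = dataQueue.removeFirst();
        ackQueue.addLast(packet);
        return packet;
    }

    // All DataNodes acknowledged: the packet can finally be dropped.
    void ackReceivedFromAllDataNodes(String packet) {
        ackQueue.remove(packet);
    }
}
```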

DataNode failure scenario while writing file in HDFS

In case any of the DataNodes in the pipeline fails, the pipeline is closed and the packets in the ackQueue that are not yet acknowledged are removed from the ackQueue and added to the front of the dataQueue so that the DataStreamer thread can pick them up again.

Then a new pipeline is created excluding the failed DataNode, and the packets are written to the remaining two DataNodes.
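The recovery step, moving unacknowledged packets back to the front of the dataQueue, can be sketched like this (again an illustration, not the real Hadoop code; the class name is invented):

```java
import java.util.LinkedList;

// Simplified sketch (not actual Hadoop source) of pipeline recovery:
// on a DataNode failure, every packet still waiting in ackQueue is
// pushed back to the FRONT of dataQueue so the DataStreamer resends it
// through the rebuilt pipeline.
public class PipelineRecoverySketch {
    static void requeueUnacked(LinkedList<String> dataQueue,
                               LinkedList<String> ackQueue) {
        // Prepend in reverse so the original packet order is preserved.
        while (!ackQueue.isEmpty()) {
            dataQueue.addFirst(ackQueue.removeLast());
        }
    }
}
```

Prepending (rather than appending) matters: the unacknowledged packets belong to earlier positions in the block than anything still waiting in the dataQueue, so they must be resent first.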

When each DataNode in the pipeline has completed writing the block locally, it also notifies the NameNode of its block storage. For the scenario stated above, where the block is replicated to only two DataNodes, the NameNode will know after getting the block reports that the replication factor of 3 is not maintained, so it will ensure that a replica is created on another DataNode.
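The NameNode's under-replication check boils down to a simple comparison, sketched below. This is an illustrative simplification of the logic described above, not the actual NameNode code.

```java
import java.util.List;

// Simplified sketch (not actual Hadoop logic): after DataNodes report a
// block, the NameNode compares the number of reported replicas with the
// target replication factor and schedules new replicas on other
// DataNodes when the block is under-replicated.
public class ReplicationCheckSketch {
    static final int REPLICATION_FACTOR = 3; // HDFS default

    static int replicasToSchedule(List<String> reportingDataNodes) {
        return Math.max(0, REPLICATION_FACTOR - reportingDataNodes.size());
    }
}
```

For the failure scenario above, only two DataNodes report the block, so one more replica is scheduled.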

[Figure: HDFS data flow for file write in HDFS]

That's all for this topic File Write in HDFS - Hadoop Framework Internal Steps. If you have any doubt or any suggestions to make please drop a comment. Thanks!


Related Topics

  1. File Read in HDFS - Hadoop Framework Internal Steps
  2. What is HDFS
  3. HDFS Commands Reference List
  4. HDFS High Availability
  5. NameNode, DataNode And Secondary NameNode in HDFS

You may also like-

  1. Introduction to Hadoop Framework
  2. Replica Placement Policy in Hadoop Framework
  3. What is SafeMode in Hadoop
  4. Data Compression in Hadoop
  5. Fair Scheduler in YARN
  6. Uber Mode in Hadoop
  7. Difference Between Abstract Class And Interface in Java
  8. Writing File in Java