Wednesday, June 6, 2018

File Read in HDFS - Hadoop Framework Internal Steps

In this post we’ll see what happens internally within the Hadoop framework when a file is read in HDFS.

Reading file in HDFS

Within the Hadoop framework it is the DFSClient class which communicates with the NameNode and DataNodes. The instance of DFSClient is created by DistributedFileSystem, which is the FileSystem implementation class for HDFS.

When a client application has to read a file it calls the open() method on DistributedFileSystem, which in turn calls the open() method of DFSClient. DFSClient creates an instance of DFSInputStream, which communicates with the NameNode.
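From the application's point of view, all of this happens behind the standard FileSystem API. A minimal sketch of reading an HDFS file in Java is shown below (the file path and the cluster configuration are assumptions for illustration; running it needs the Hadoop client libraries and a reachable NameNode):

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HDFSFileRead {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS in the configuration should point to the NameNode,
    // e.g. hdfs://localhost:9000
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    InputStream in = null;
    try {
      // open() returns an FSDataInputStream wrapping a DFSInputStream
      in = fs.open(new Path("/user/test/input.txt"));
      // stream the file contents to standard output, 4 KB at a time
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```

Note that the client code never deals with blocks or DataNodes directly; the DFSInputStream underneath the returned stream handles all of that.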

DFSInputStream connects to the NameNode and gets the locations of the first few blocks of the file. Note that the default replication factor is 3, so for every block the NameNode sends information about the 3 DataNodes storing that block. In the list sent by the NameNode, the DataNodes for each block are also ordered by their proximity to the client. So the client application will try to read data from a local DataNode first rather than a remote DataNode.
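This block-to-DataNode mapping held by the NameNode can also be inspected from client code through FileSystem.getFileBlockLocations(). A rough sketch (the file path is an assumption for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/test/input.txt");
    FileStatus status = fs.getFileStatus(file);
    // ask the NameNode for the block locations of the whole file
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // with replication factor 3, each block lists 3 DataNode hosts
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}
```

For a file larger than one block you would see one line per block, each with the hosts holding that block's replicas.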

Reading blocks from DataNodes

Once the list of blocks is retrieved, the client application calls read() on the wrapper stream FSDataInputStream. In turn the wrapped DFSInputStream, which already has the list of DataNodes, connects to the nearest DataNode storing the first block of the file and starts streaming data to the client. DFSInputStream follows the same procedure for all the blocks in the list: connect to the DataNode storing that block, stream the data, then disconnect from the DataNode.

Since the NameNode sends the locations of only the first few blocks of the file, DFSInputStream also communicates with the NameNode again to get the DataNode information for the next set of blocks. This process continues until all the blocks of the file are read.

Once all the blocks are read and streamed to the client, the client calls close() on the stream.

In this architecture the client connects directly to the DataNodes and gets the data from them; no data flows through the NameNode.

In case of any error while reading a block, another DataNode storing a replica of the same block is tried. That is where replication helps.

Figure: HDFS data flow for file read in HDFS

That's all for this topic File Read in HDFS - Hadoop Framework Internal Steps. If you have any doubt or any suggestions to make please drop a comment. Thanks!

>>>Return to Hadoop Framework Tutorial Page

Related Topics

  1. File Write in HDFS - Hadoop Framework Internal Steps
  2. What is HDFS
  3. HDFS Federation in Hadoop Framework
  4. HDFS High Availability
  5. NameNode, DataNode And Secondary NameNode in HDFS

You may also like-

  1. Installing Hadoop on a Single Node Cluster in Pseudo-Distributed Mode
  2. What is SafeMode in Hadoop
  3. Data Locality in Hadoop
  4. How to Configure And Use LZO Compression in Hadoop
  5. YARN in Hadoop
  6. How HashMap Internally Works in Java
  7. Installing Ubuntu Along With Windows
  8. How to Read File From The Last Line in Java