Wednesday, June 6, 2018

File Read in HDFS - Hadoop Framework Internal Steps

In this post we’ll see what happens internally within the Hadoop framework when a file is read in HDFS.

Reading file in HDFS

Within the Hadoop framework it is the DFSClient class which communicates with the NameNode and DataNodes. The instance of DFSClient is created by DistributedFileSystem, which is the FileSystem implementation class for HDFS.

When a client application has to read a file it calls the open() method on DistributedFileSystem, which in turn calls the open() method of DFSClient. DFSClient creates an instance of DFSInputStream, which communicates with the NameNode.
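From the application's point of view, all of this happens behind the standard FileSystem API. A minimal sketch of reading an HDFS file in Java is shown below (the file path and the cluster configuration are assumptions for illustration; running it needs the Hadoop client libraries and a reachable NameNode):

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HDFSFileRead {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS in the configuration should point to the NameNode,
    // e.g. hdfs://localhost:9000
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    InputStream in = null;
    try {
      // open() returns an FSDataInputStream wrapping a DFSInputStream
      in = fs.open(new Path("/user/test/input.txt"));
      // stream the file contents to standard output, 4 KB at a time
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```

Note that the client code never deals with blocks or DataNodes directly; the DFSInputStream underneath the returned stream handles all of that.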

DFSInputStream connects to the NameNode and gets the locations of the first few blocks of the file. Note that the default replication factor is 3, so for every block the NameNode sends information about the 3 DataNodes storing that block. In the list sent by the NameNode, the DataNodes for each block are also ordered by their proximity to the client. So the client application will try to read data from a local DataNode first rather than a remote DataNode.
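This block-to-DataNode mapping held by the NameNode can also be inspected from client code through FileSystem.getFileBlockLocations(). A rough sketch (the file path is an assumption for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/test/input.txt");
    FileStatus status = fs.getFileStatus(file);
    // ask the NameNode for the block locations of the whole file
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // with replication factor 3, each block lists 3 DataNode hosts
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}
```

For a file larger than one block you would see one line per block, each with the hosts holding that block's replicas.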

Reading blocks from DataNodes

Once the list of blocks is retrieved, the client application calls read() on the wrapper stream FSDataInputStream. In turn the wrapped DFSInputStream, which already has the list of DataNodes, connects to the nearest DataNode storing the first block of the file and starts streaming data to the client. DFSInputStream follows the same procedure for all the blocks in the list: connect to the DataNode storing that block, stream the data, then disconnect from the DataNode.

Since the NameNode sends the locations of only the first few blocks of the file, DFSInputStream also communicates with the NameNode again to get the DataNode information for the next set of blocks. This process continues until all the blocks of the file are read.

Once all the blocks are read and streamed to the client, the client calls close() on the stream.

In this architecture the client connects directly to the DataNodes and gets the data from them; no data flows through the NameNode.

In case of any error while reading a block, another DataNode storing a replica of the same block is tried. That is where replication helps.

Figure: HDFS data flow for file read in HDFS

That's all for this topic File Read in HDFS - Hadoop Framework Internal Steps. If you have any doubt or any suggestions to make please drop a comment. Thanks!

>>>Return to Hadoop Framework Tutorial Page

Related Topics

  1. File Write in HDFS - Hadoop Framework Internal Steps
  2. What is HDFS
  3. HDFS Federation in Hadoop Framework
  4. HDFS High Availability
  5. NameNode, DataNode And Secondary NameNode in HDFS

You may also like-

  1. Installing Hadoop on a Single Node Cluster in Pseudo-Distributed Mode
  2. What is SafeMode in Hadoop
  3. Data Locality in Hadoop
  4. How to Configure And Use LZO Compression in Hadoop
  5. YARN in Hadoop
  6. How HashMap Internally Works in Java
  7. Installing Ubuntu Along With Windows
  8. How to Read File From The Last Line in Java