Thursday, November 16, 2023

YARN in Hadoop

YARN (Yet Another Resource Negotiator) is the cluster resource management and job scheduling layer of Hadoop. YARN was introduced in Hadoop 2.x to address the scalability issues in MRv1. It also decouples resource management from data processing, making it possible for other distributed data processing engines to run on a Hadoop cluster.

This YARN tutorial gives an insight into the Apache Hadoop YARN architecture, how resource management is done in YARN, and what advantages YARN provides over classic MapReduce version 1 (MRv1).

Problems with Hadoop MRv1

  • Scalability issues- In MapReduce 1 there were two processes responsible for job execution. JobTracker was responsible for resource management, scheduling jobs and monitoring the progress of tasks, whereas TaskTracker was responsible for running the assigned map and reduce tasks. This over-dependence on the single JobTracker process created performance bottlenecks and scalability issues in large clusters running a lot of applications.
  • Cluster resource utilization- In Hadoop 1.0 a fixed number of map slots and reduce slots was configured statically on each node.
    Because of this static configuration, the slots were not interchangeable. If more map slots were needed on a node where all the map slots were in use, free reduce slots on that node could not be used as map slots, and vice versa.
  • Tightly coupled with MapReduce- The Hadoop 1.0 design was tightly coupled with the batch-processing MapReduce framework. It did not abstract resource management and storage (HDFS) in a way that would let other processing engines (like Spark or Tez) run on the Hadoop cluster.
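
The slot problem described above can be illustrated with a small toy model. This is only a sketch: the class `SlotNode` and its method are hypothetical names made up for illustration and are not part of any Hadoop API.

```python
from dataclasses import dataclass


@dataclass
class SlotNode:
    """Toy model of an MRv1 worker node with statically configured slots (hypothetical)."""
    map_slots: int
    reduce_slots: int
    maps_running: int = 0
    reduces_running: int = 0

    def try_start(self, kind: str) -> bool:
        """Start a task only if a free slot of the same kind exists."""
        if kind == "map" and self.maps_running < self.map_slots:
            self.maps_running += 1
            return True
        if kind == "reduce" and self.reduces_running < self.reduce_slots:
            self.reduces_running += 1
            return True
        # A free slot of the other kind cannot be borrowed.
        return False


node = SlotNode(map_slots=2, reduce_slots=2)
# Three map tasks arrive; the node has only two map slots.
started = [node.try_start("map") for _ in range(3)]
free_reduce_slots = node.reduce_slots - node.reduces_running
print(started, free_reduce_slots)
```

The third map task is rejected even though both reduce slots on the node are idle, which is the under-utilization that the static configuration causes.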

Hadoop YARN architecture

In Hadoop YARN the functionalities of resource management and job scheduling/monitoring are split into separate daemons. There is a global ResourceManager to manage the cluster resources and a per-application ApplicationMaster to manage the application's tasks.

The major components of YARN in Hadoop are as follows-

  • ResourceManager- This is the master process in YARN. It accepts jobs, schedules them and allocates the required resources to them. Two components of the ResourceManager perform these tasks.
    • Scheduler- The Scheduler is responsible for allocating resources to the various running applications.
    • ApplicationsManager- The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
  • NodeManager- The NodeManager daemon runs on each cluster node. NodeManager is responsible for the containers running on its node, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.
  • ApplicationMaster- The ApplicationMaster process is application specific, which means that for a MapReduce job there will be a MapReduce-specific ApplicationMaster and for a Spark job there will be a Spark-specific ApplicationMaster.
    ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.
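
To make the division of work between NodeManager and Scheduler more concrete, here is a minimal sketch in which nodes report their usage and a scheduler places container requests based on those reports. `NodeReport` and `ToyScheduler` are hypothetical names, only memory is tracked, and the first-fit placement is an assumption chosen for simplicity; the real YARN schedulers are far more elaborate.

```python
from dataclasses import dataclass, field


@dataclass
class NodeReport:
    """What a NodeManager tells the ResourceManager about its node (toy version)."""
    node_id: str
    total_mb: int
    used_mb: int

    @property
    def free_mb(self) -> int:
        return self.total_mb - self.used_mb


@dataclass
class ToyScheduler:
    """Toy stand-in for the Scheduler; it knows only the latest report per node."""
    reports: dict = field(default_factory=dict)

    def heartbeat(self, report: NodeReport) -> None:
        """Record the most recent usage report from a node."""
        self.reports[report.node_id] = report

    def allocate(self, memory_mb: int):
        """Return the id of the first node with enough free memory, else None."""
        for node_id in sorted(self.reports):
            report = self.reports[node_id]
            if report.free_mb >= memory_mb:
                report.used_mb += memory_mb
                return node_id
        return None


scheduler = ToyScheduler()
scheduler.heartbeat(NodeReport("node1", total_mb=4096, used_mb=3584))
scheduler.heartbeat(NodeReport("node2", total_mb=4096, used_mb=1024))

# An ApplicationMaster asks for two containers of 2048 MB each.
placements = [scheduler.allocate(2048), scheduler.allocate(2048)]
print(placements)
```

The first request lands on node2 because node1 has only 512 MB free. The second request gets nothing, since no node has 2048 MB left after the first allocation.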

How YARN addresses the issues in Hadoop 1.x

We already discussed the problems in MRv1: limited scalability, sub-optimal resource utilization and a design tightly coupled with MapReduce processing. Now that you have a good idea of the YARN architecture, let's see how YARN addresses these issues.

  • Scalability- In YARN, cluster resource management is done by the ResourceManager and application tasks are managed by the ApplicationMaster, rather than having a single JobTracker process handle all these tasks. That makes YARN more scalable.
  • Resource utilization- In Hadoop YARN there are no pre-configured map and reduce slots. Instead there is a concept of containers, which are negotiated by the ApplicationMaster as per the needs of the submitted application.
    The ApplicationMaster is application specific, so a MapReduce ApplicationMaster will request containers to run a MapReduce job, whereas a Spark ApplicationMaster will request containers for running Spark tasks.
  • Loosely coupled design- YARN is designed so that cluster resource management and job scheduling are decoupled from the data processing component. In a Hadoop cluster the storage layer, the resource management and job scheduling layer, and the distributed data processing applications are now separate, independent layers. That makes it possible to run applications other than MapReduce on a Hadoop cluster.
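
The container idea can be sketched as a single pool of node resources that any kind of task may draw from. The class `ContainerNode`, the request names and the sizes below are all hypothetical values chosen for illustration, not YARN defaults or YARN API calls.

```python
from dataclasses import dataclass


@dataclass
class ContainerNode:
    """Toy model of a node whose resources are one shared pool (hypothetical)."""
    total_mb: int
    total_vcores: int
    used_mb: int = 0
    used_vcores: int = 0

    def try_allocate(self, mb: int, vcores: int) -> bool:
        """Grant a container if enough memory and vcores remain, whatever the task type."""
        if (self.used_mb + mb <= self.total_mb
                and self.used_vcores + vcores <= self.total_vcores):
            self.used_mb += mb
            self.used_vcores += vcores
            return True
        return False


node = ContainerNode(total_mb=8192, total_vcores=4)

# Requests from two different ApplicationMasters: (label, memory in MB, vcores).
requests = [
    ("mapreduce-map", 1024, 1),
    ("mapreduce-map", 1024, 1),
    ("mapreduce-map", 1024, 1),
    ("spark-executor", 4096, 1),
]
granted = [label for label, mb, vcores in requests if node.try_allocate(mb, vcores)]
remaining_mb = node.total_mb - node.used_mb
print(granted, remaining_mb)
```

All four requests are granted here: three map tasks run side by side because nothing is reserved for reduce tasks, and a Spark task shares the same node.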

Application flow in Hadoop YARN

Once an application is submitted, its first stop is the ResourceManager. The Scheduler component of the ResourceManager schedules the application to run, whereas the ApplicationsManager component negotiates the first container for the application, in which the application-specific ApplicationMaster starts running.

It is the responsibility of the ApplicationMaster to negotiate more resources from the ResourceManager for running the application's tasks. These containers will be allocated on different nodes in the cluster.

The ApplicationMaster communicates with the NodeManagers running on the nodes where the containers are allocated, in order to launch the containers on those nodes. Each NodeManager monitors its containers' resource usage (CPU, memory, disk, network) and reports it to the ResourceManager/Scheduler.
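
The steps above can be traced with a short simulation that only records who talks to whom and in what order. The function `run_application`, the node names and the round-robin placement are assumptions made for this sketch; it does not call any real YARN client API.

```python
def run_application(app_name: str, task_count: int, nodes: list) -> list:
    """Return the ordered messages of a toy YARN application run (hypothetical)."""
    events = []

    # 1. The client submits the application to the ResourceManager.
    events.append(f"client -> ResourceManager: submit {app_name}")

    # 2. The ApplicationsManager negotiates the first container, which hosts
    #    the application-specific ApplicationMaster.
    am_node = nodes[0]
    events.append(f"ApplicationsManager: first container on {am_node}")
    events.append(f"NodeManager@{am_node}: launch ApplicationMaster for {app_name}")

    # 3. The ApplicationMaster negotiates one container per task and asks the
    #    NodeManager of the chosen node to launch it (round-robin placement
    #    is an assumption of this sketch).
    for task in range(task_count):
        node = nodes[(task + 1) % len(nodes)]
        events.append(f"ApplicationMaster -> Scheduler: request container for task {task}")
        events.append(f"ApplicationMaster -> NodeManager@{node}: launch task {task}")

    return events


events = run_application("wordcount", task_count=2, nodes=["node1", "node2", "node3"])
for event in events:
    print(event)
```

Each submitted application goes through the same sequence with its own ApplicationMaster, so submitting a second application would produce a second, independent trace.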

The following image shows the communication among the ResourceManager, NodeManager and ApplicationMaster processes. Here two applications are submitted, one MapReduce and one Spark, so two application-specific ApplicationMaster processes are started.

Hadoop YARN flow

That's all for this topic, YARN in Hadoop. If you have any doubts or suggestions, please drop a comment. Thanks!
