Saturday, November 18, 2023

Capacity Scheduler in YARN

In the post YARN in Hadoop we have already seen that the scheduler component of the ResourceManager is responsible for allocating resources to the running jobs. The scheduler component is pluggable in Hadoop, and two commonly used options are the capacity scheduler and the fair scheduler. This post talks about the capacity scheduler in YARN, its benefits, and how it can be configured in a Hadoop cluster.


YARN Capacity scheduler

Capacity scheduler in YARN allows multi-tenancy of the Hadoop cluster, where multiple users can share one large cluster.

If every organization runs its own private cluster, resource utilization is poor. An organization may provision enough resources in the cluster to meet its peak demand, but that peak may occur only occasionally, leaving the resources underused the rest of the time.

Thus sharing a cluster among organizations is a more cost-effective idea. However, organizations are concerned about sharing a cluster because they worry that they may not get enough resources at the time of peak utilization. The CapacityScheduler in YARN mitigates that concern by giving each organization capacity guarantees.

Capacity scheduler in YARN functionality

Capacity scheduler in Hadoop works on the concept of queues. Each organization gets its own dedicated queue with a percentage of the total cluster capacity for its own use. For example, if there are two organizations sharing the cluster, one organization may be given 60% of the cluster capacity whereas the other organization is given 40%.

On top of that, to provide further control and predictability over the sharing of resources, the CapacityScheduler supports hierarchical queues. An organization can further divide its allocated cluster capacity into separate sub-queues for separate sets of users within the organization.

Capacity scheduler is also flexible and allows free resources to be allocated to any queue beyond its capacity. This provides elasticity for the organizations in a cost-effective manner. When the queue to which those resources actually belong sees increased demand, the resources are allocated back to it as they are released by the other queues.

Capacity scheduler in YARN configuration

To configure the ResourceManager to use the CapacityScheduler, set the following property in the conf/yarn-site.xml:

yarn.resourcemanager.scheduler.class- org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler 
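As a rough sketch, that setting would appear in yarn-site.xml as a property element like the one below. The enclosing configuration element is assumed here as the usual root of a Hadoop configuration file; any other properties already in your file would stay alongside it.

```xml
<configuration>
  <!-- Tell the ResourceManager to use the CapacityScheduler -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
</configuration>
```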
To set up queues in the CapacityScheduler, you need to make changes in the etc/hadoop/capacity-scheduler.xml configuration file.

The CapacityScheduler has a predefined queue called root. All queues in the system are children of the root queue.

Setting up further queues- Configure property yarn.scheduler.capacity.root.queues with a list of comma-separated child queues.

Setting up sub-queues within a queue- Configure property yarn.scheduler.capacity.<queue-path>.queues
Here queue-path is the full path of the queue's hierarchy, starting at root, with . (dot) as the delimiter.

Capacity of the queue- Configure property yarn.scheduler.capacity.<queue-path>.capacity
Queue capacity is provided in percentage (%). The sum of capacities for all queues, at each level, must be equal to 100. Applications in the queue may consume more resources than the queue’s capacity if there are free resources, providing elasticity.

Capacity scheduler queue configuration example

Suppose there are two child queues under root, XYZ and ABC. XYZ further divides its queue into technology and marketing. XYZ is given 60% of the cluster capacity and ABC is given 40%.

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>XYZ,ABC</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.XYZ.queues</name>
  <value>technology,marketing</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.XYZ.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.ABC.capacity</name>
  <value>40</value>
</property>

If you want to limit the elasticity for applications in a queue, set its maximum capacity. For example, restricting XYZ's elasticity to 80% ensures that it doesn't use more than 80% of the total cluster capacity even if resources are available. In other words, ABC always has 20% of the cluster to start with immediately.

<property>
  <name>yarn.scheduler.capacity.root.XYZ.maximum-capacity</name>
  <value>80</value>
</property>

For the two sub-queues of XYZ, you want to allocate 70% of XYZ's allocated capacity to technology and 30% to marketing.

<property>
  <name>yarn.scheduler.capacity.root.XYZ.technology.capacity</name>
  <value>70</value>
</property>

<property>
  <name>yarn.scheduler.capacity.root.XYZ.marketing.capacity</name>
  <value>30</value>
</property>
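
Putting the fragments above together, the complete capacity-scheduler.xml for this example might look roughly like the following sketch. The enclosing configuration element is assumed as the usual root of a Hadoop configuration file, and the percentages in the comments are worked out from the values above: a sub-queue's share of the cluster is its own capacity multiplied by its parent's capacity. A real file would normally carry other scheduler properties as well.

```xml
<configuration>
  <!-- Top-level queues under root; their capacities must sum to 100 -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>XYZ,ABC</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.XYZ.capacity</name>
    <value>60</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.ABC.capacity</name>
    <value>40</value>
  </property>

  <!-- XYZ may grow into free resources, but never beyond 80% of the cluster -->
  <property>
    <name>yarn.scheduler.capacity.root.XYZ.maximum-capacity</name>
    <value>80</value>
  </property>

  <!-- Sub-queues of XYZ; their capacities must also sum to 100 -->
  <property>
    <name>yarn.scheduler.capacity.root.XYZ.queues</name>
    <value>technology,marketing</value>
  </property>
  <!-- technology: 70% of XYZ's 60% = 42% of the cluster -->
  <property>
    <name>yarn.scheduler.capacity.root.XYZ.technology.capacity</name>
    <value>70</value>
  </property>
  <!-- marketing: 30% of XYZ's 60% = 18% of the cluster -->
  <property>
    <name>yarn.scheduler.capacity.root.XYZ.marketing.capacity</name>
    <value>30</value>
  </property>
</configuration>
```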

Reference: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html

That's all for this topic Capacity Scheduler in YARN. If you have any doubts or suggestions, please drop a comment. Thanks!
