Hadoop Yarn Architecture

Hadoop YARN Introduction

YARN is the main component of Hadoop v2.0. YARN helps to open up Hadoop by allowing to process and run data for batch processing, stream processing, interactive processing and graph processing which are stored in HDFS. In this way, It helps to run different types of distributed applications other than MapReduce.

In the YARN architecture, the processing layer is separated from the resource management layer. To create a split between the application manager and resource manager was the Job tracker’s responsibility in the version of Hadoop 1.0. YARN allows the data stored in HDFS (Hadoop Distributed File System) to be processed and run by various data processing engines such as batch processing, stream processing, interactive processing, graph processing and many more. Thus the efficiency of the system is increased with the use of YARN. The processing of the application is scheduled in YARN through its different components. Many different kinds of resources are also progressively allocated for optimum utilization. YARN helps a lot in the proper usage of the available resources, which is very necessary for the processing of a high volume of data.

Why YARN?

MapReduce performs functions of Resource Management and Processing. Hadoop v1.0 is also known as MapReduce Version 1 (MRV1). There was only a single master for Job Tracker. 

In the previous version of Hadoop that is Hadoop version 1.0, which is also known as MapReduce version 1 (MRV1) use to perform both the task of process and resource management by itself. It has a job tracker module that is responsible for everything. Hence it is the single master that allocates resources for applications, performs scheduling for demand and also monitors the jobs of processing in the system. Hadoop version 1.0 reduces tasks & assigns maps on several sub-processes which is called Task Trackers. Task Tracker also reports the progress of processes in a periodical manner. But the main issue is not that, the problem is this design of a single master for all, resulting in bottlenecking issue. Also, the computational resource utilization was inefficient. Thus scalability became an issue with this version of Hadoop. But on the bright side, this issue is resolved by YARN, a vital core component in its successor Hadoop version 2.0 which was introduced in the year 2012 by Yahoo and Hortonworks. The basic idea behind this relief is separating MapReduce from Resource Management and Job scheduling instead of a single master. Thus, YARN is now responsible for Job scheduling and Resource Management. 

 In Hadoop 2.0, The concept of Application Master and Resource Manager was introduced by YARN. Across the cluster of Hadoop, the utilization of resources is monitored by the Resource Manager. 

There are some features of YARN because of which it got very famous, which are:

  1. Multi-tenancy: YARN has allowed access to multiple data processing engines such as batch processing engine, stream processing engine, interactive processing engine, graph processing engine and much more. This has given the benefit of multi-tenancy to the company.
  2. Cluster Utilization: Clusters are utilized in an optimized way because clusters are used dynamically in Hadoop with the help of YARN.
  3. Compatibility: YARN is also compatible with the first version of Hadoop, i.e. Hadoop 1.0, because it uses the existing map-reduce apps. So YARN can also be used with Hadoop 1.0.
  4. Scalability: Thousands of clusters and nodes are allowed by the scheduler in Resource Manager of YARN to be managed and extended by Hadoop.

The main components of YARN architecture include:

  • Client: It submits jobs.
  • Resource Manager: It is the master daemon of YARN and is responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding node manager and allocates resources for the completion of the request accordingly. It has two major components:
    • Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, means it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as Capacity Scheduler and Fair Scheduler to partition the cluster resources.
    • ApplicationMaster: Manages the life-cycle of a job by directing the NodeManager to create or destroy a container for a job. There is only one ApplicationMaster for a job.
  • Node Manager: It take care of individual node on Hadoop cluster and manages application and workflow of that particular node. Its primary job is to keep-up with the Node Manager. It monitors resource usage, performs log management and also kills a container based on directions from the resource manager. It is also responsible for creating the container process and start it on the request of Application master.
  • Application Master: An application is a single job submitted to a framework. The application manager is responsible for negotiating resources with the resource manager, tracking the status and monitoring progress of a single application. The application master requests the container from the node manager by sending a Container Launch Context(CLC) which includes everything an application needs to run. Once the application is started, it sends the health report to the resource manager from time-to-time.
  • Container: It is a collection of physical resources such as RAM, CPU cores and disk on a single node. The containers are invoked by Container Launch Context(CLC) which is a record that contains information such as environment variables, security tokens, dependencies etc.
This image is used from https://data-flair.training. Just for personal blog i am using this..

The YARN application workflow

Now, take a look at the following sequence diagram that describes the YARN application workflow and also explains how container allocation is done for an application via the ApplicationMaster:

Leave a comment

Design a site like this with WordPress.com
Get started