
Next Generation Hadoop: High Availability for YARN

Arinto Murdopo
KTH Royal Institute of Technology
Hanstavägen 49 - 1065A, 164 53 Kista, Sweden
[email protected]

Jim Dowling
Swedish Institute of Computer Science
Isafjordsgatan 22, 164 40 Kista, Sweden
[email protected]

ABSTRACT
Hadoop is one of the most widely adopted cluster computing frameworks for big data processing, but it is not free from limitations. Computer scientists and engineers are continuously making efforts to eliminate those limitations and improve Hadoop. One of these improvements is YARN, which eliminates the scalability limitation of the first-generation MapReduce. However, YARN still suffers from an availability limitation: the single point of failure in the YARN resource-manager. In this paper we propose an architecture to solve YARN's availability limitation. The novelty of this architecture lies in its stateless failure model, which enables multiple YARN resource-managers to run concurrently and maintain high availability. MySQL Cluster (NDB) is proposed as the storage technology in our architecture. Furthermore, we implemented a proof-of-concept of the proposed architecture. The evaluations show that the proof-of-concept increases the availability of YARN. In addition, NDB is shown to have the highest throughput compared to the storage systems proposed by Apache (ZooKeeper and HDFS). Finally, the evaluations show that NDB achieves linear scalability, making it suitable for our proposed stateless failure model.

Categories and Subject Descriptors
D.4.7 [Operating Systems]: Distributed Systems, Batch Processing Systems

General Terms
Big Data, Storage Management

1. INTRODUCTION
Big data has become widespread across industries, especially web companies. It has reached the petabyte scale and will keep increasing in the upcoming years. Traditional storage systems, such as regular file systems and relational databases, are not designed to handle data of this magnitude; scalability is their main issue in handling big data. This situation has led to the development of several cluster computing frameworks that handle big data effectively.

One of the widely adopted cluster computing frameworks commonly used by web companies is Hadoop (http://hadoop.apache.org/). It mainly consists of the Hadoop Distributed File System (HDFS) [11] to store the data. On top of HDFS, a MapReduce framework inspired by Google's MapReduce [1] was developed to process the stored data. Although Hadoop has arguably become the standard solution for managing big data, it is not free from limitations, and these limitations have triggered significant efforts from academia and industry to improve Hadoop. Cloudera tried to reduce the availability limitation of HDFS using NameNode replication [9]. KTHFS solved the HDFS availability limitation by utilizing MySQL Cluster to make the HDFS NameNode stateless [12]. The scalability of MapReduce has become a prominent limitation: MapReduce has reached a scalability limit of 4000 nodes. To solve this limitation, the open source community proposed the next-generation MapReduce called YARN (Yet Another Resource Negotiator) [8]. From the enterprise world, Corona was released by Facebook to overcome the aforementioned scalability limitation [2]. Another limitation is Hadoop's inability to perform fine-grained resource sharing between multiple computation frameworks. Mesos tries to solve this limitation by implementing a distributed two-level scheduling mechanism called resource offers [3].

However, few solutions have addressed the availability limitation of the MapReduce framework. When a MapReduce JobTracker failure occurs, the corresponding application cannot continue, reducing MapReduce's availability. The current YARN architecture does not solve this availability limitation: the ResourceManager, the JobTracker equivalent in YARN, remains a single point of failure. The open source community has recently started to work on this issue, but no final and proven solution is available yet (https://issues.apache.org/jira/browse/YARN-128). The current proposal from the open source community is to use ZooKeeper [4] or HDFS as a persistent storage for the ResourceManager's states. Upon failure, the ResourceManager is recovered using the stored states.

Solving this availability limitation would bring YARN into a cloud-ready state: YARN could be executed in the cloud, for example on Amazon EC2, and be resistant to the failures that often happen there.



In this report, we present a new architecture for YARN. The main goal of the new architecture is to solve the aforementioned availability limitation in YARN. This architecture provides a better alternative to the existing ZooKeeper-based architecture, since it eliminates the potential scalability limitation caused by ZooKeeper's relatively limited throughput.

To achieve the desired availability, the new architecture utilizes a distributed in-memory database called MySQL Cluster (NDB) (http://www.mysql.com/products/cluster/) to persist the ResourceManager states. NDB automatically replicates the stored data across different NDB data nodes to ensure high availability. Moreover, NDB is able to handle up to 1.8 million write queries per second [5].

This report is organized as follows. Section 2 presents the existing YARN architecture, its availability limitation and the solution proposed by Apache. The proposed architecture is presented in Section 3. Section 4 presents our evaluation of the availability and scalability of the proposed architecture. Related work on improving the availability of cluster computing frameworks is presented in Section 5. We conclude this report and propose future work in Section 6.

2. YARN ARCHITECTURE
This section explains the current YARN architecture, YARN's availability limitation, and Apache's proposed solution to overcome the limitation.

2.1 Architecture Overview
YARN's main goal is to provide more flexibility than Hadoop in terms of the data processing frameworks that can be executed on top of it [7]. It is equipped with a generic distributed application framework and resource-management components. Therefore, YARN supports not only MapReduce but also other data processing frameworks such as Apache Giraph, Apache Hama and Spark.

In addition, YARN aims to solve the scalability limitation of the original implementation of Apache's MapReduce [6]. To achieve this, YARN splits the MapReduce job-tracker's responsibilities of application scheduling, resource management and application monitoring into separate processes or daemons. The new processes that handle the job-tracker's responsibilities are the resource-manager, which handles global resource management and job scheduling, and the application-master, which is responsible for job monitoring, job life-cycle management and resource negotiation with the resource-manager. Each submitted job corresponds to one application-master process. Furthermore, YARN converts the original MapReduce task-tracker into the node-manager, which manages task execution in YARN's unit of resource, called a container.

Figure 1: YARN Architecture

Figure 1 shows the current YARN architecture. The resource-manager has three core components:

1. Scheduler, which schedules submitted jobs based on a specific policy and the available resources. The policy is pluggable, which means we can implement our own scheduling policy for our YARN deployment; YARN currently provides three policies to choose from: the fair-scheduler, the FIFO-scheduler and the capacity-scheduler. Regarding the available resources, the scheduler should ideally consider CPU, memory, disk and other computing resources when scheduling; however, current YARN only supports memory as the resource factor during scheduling.

2. Resource-tracker, which handles the management of computing nodes. "Computing nodes" in this context means nodes that run a node-manager process and have computing resources. The management tasks include registering new nodes, handling requests from invalid or decommissioned nodes, and processing the nodes' heartbeats. The resource-tracker works closely with the node-liveness-monitor (NMLivenessMonitor class), which keeps track of live and dead computing nodes based on their heartbeats, and the node-list-manager (NodesListManager class), which stores the lists of valid and excluded computing nodes based on the YARN configuration files.

3. Applications-manager, which maintains the collection of user-submitted jobs and a cache of completed jobs. It is the entry point for clients to submit their jobs.

In YARN, clients submit jobs through the applications-manager, and the submission triggers the scheduler to try to schedule the job. When the job is scheduled, the resource-manager allocates a container and launches a corresponding application-master. The application-master takes over and processes the job by splitting it into smaller tasks, requesting additional containers from the resource-manager, launching them with the help of the node-managers, assigning the tasks to the available containers, and keeping track of the job's progress. Clients learn the job's progress by polling the application-master at an interval defined in the YARN configuration. When the job is completed, the application-master cleans up its working state.
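
To make this submission path concrete, the sketch below shows how a client could obtain a new application id from the applications-manager and submit a job programmatically. It is an illustration only, not code from this project: the class and package names follow the later Hadoop 2.x client API, which differs slightly from the 2012-era trunk used in this work, and the application-master's container launch context and resource request are omitted.

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitExample {
    public static void main(String[] args) throws Exception {
        // Connect to the resource-manager configured in yarn-site.xml.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the applications-manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("example-job");

        // The container launch context and resource request for the
        // application-master would be set on the context here.

        // Submission hands the context to the applications-manager; the
        // scheduler then allocates a container for the application-master.
        ApplicationId appId = yarnClient.submitApplication(context);
        System.out.println("Submitted application " + appId);

        yarnClient.stop();
    }
}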

2.2 Availability Limitation in YARN


Although YARN solves the scalability limitation of the original MapReduce, it still suffers from an availability limitation: the single-point-of-failure nature of the resource-manager. This section explains why the YARN resource-manager is a single point of failure.

Referring to Figure 1, container and task failures are handled by the node-manager. When a container fails or dies, the node-manager detects the failure event, launches a new container to replace the failed one, and restarts the task execution in the new container.

In the event of an application-master failure, the resource-manager detects the failure and starts a new instance of the application-master in a new container. The ability to recover the associated job state depends on the application-master implementation. The MapReduce application-master is able to recover the state, but this is not enabled by default. Besides the resource-manager, the associated client also reacts to the failure: the client contacts the resource-manager to locate the new application-master's address.

Upon failure of a node-manager, the resource-manager updates its list of available node-managers. The application-master should recover the tasks that were running on the failed node-manager, but this depends on the application-master implementation. The MapReduce application-master has the additional capability to recover the failed tasks and to blacklist node-managers that fail often.

Failure of the resource-manager is severe, since clients cannot submit new jobs and existing running jobs cannot negotiate and request new containers. Existing node-managers and application-masters try to reconnect to the failed resource-manager, and the job progress is lost when they are unable to reconnect. This loss of job progress will likely frustrate the engineers and data scientists who use YARN, because typical production jobs that run on top of YARN are expected to have long running times, typically in the order of a few hours. Furthermore, this limitation prevents YARN from being used efficiently in cloud environments (such as Amazon EC2), since node failures happen often in such environments.

2.3 Proposed Solution from Apache
To tackle this availability issue, Apache proposed a recovery failure model using ZooKeeper- or HDFS-based persistent storage (https://issues.apache.org/jira/browse/YARN-128). The proposed recovery failure model is transparent to clients, which means clients do not need to re-submit their jobs. In this model, the resource-manager saves the relevant information upon job submission.

This information currently includes the application-identification-number, the application-submission-context and the list of application-attempts. An application-submission-context contains information related to the job submission, such as the application name, the user who submitted the job, and the amount of requested resources. An application-attempt represents each resource-manager attempt to run a job by creating a new application-master process. The saved information related to an application-attempt consists of the attempt identification number and the details of the first allocated container, such as the container identification number, the container node, the requested resources and the job priority.

Upon restart, the resource-manager reloads the saved information and restarts all node-managers and application-masters. This restart mechanism does not retain the jobs that are currently executing in the cluster: in the worst case, all progress is lost and the jobs are started from the beginning. To minimize this effect, a new application-master should be designed to read the states of the previous application-master that executed under the failed resource-manager. For example, the MapReduce application-master handles this case by storing its progress in another process called the job-history-server; upon restart, a new application-master obtains the job progress from the job-history-server.

The main drawback of this model is the downtime needed to start a new resource-manager process when the old one fails. If the downtime is too long, all processes reach their time-outs and clients need to re-submit their jobs to the new resource-manager. Furthermore, HDFS is not suitable for storing lots of small data items (in this case, the application states and the application-attempts). ZooKeeper is suitable for the current data size, but it is likely to introduce problems when the amount of stored data increases, since ZooKeeper is designed to store typically small configuration data.

3. YARN WITH HIGH AVAILABILITY
In this section we explain our proposed failure model and architecture for solving the YARN availability limitation, as well as the implementation of the proposal.

3.1 Stateless Failure Model
We propose a stateless failure model, in which all the necessary information and state used by the resource-manager are stored in persistent storage. Based on our observation, this information includes:

1. Application-related information, such as the application-id, application-submission-context and application-attempts.

2. Resource-related information, such as the list of node-managers and the available resources.

Figure 2: Stateless Failure Model

Figure 2 shows the architecture of the stateless failure model. Since all the necessary information is stored in persistent storage, it is possible to have more than one resource-manager running at the same time. All of the resource-managers share the information through the storage, and none of them holds the information in memory.

When a resource-manager fails, the other resource-managers can easily take over its work, since all the needed state is stored in the storage. Clients, node-managers and application-masters need to be modified so that they can point to a new resource-manager upon failure.

To achieve high availability through this failure model, we need a storage system that satisfies the following requirements:

1. The storage should be highly available, i.e. it should not have a single point of failure.

2. The storage should be able to handle high read and write rates for small data items (at most a few kilobytes each), since this failure model needs to perform very frequent reads and writes to the storage.

ZooKeeper and HDFS satisfy the first requirement, but not the second: ZooKeeper is not designed as a persistent data store, and HDFS is not designed to handle high read and write rates for small data. We need another storage technology, and MySQL Cluster (NDB) fits these requirements. Section 3.2 explains NDB in more detail.

Figure 3: YARN with High Availability Architecture

Figure 3 shows the high-level diagram of the proposed architecture; NDB is introduced to store the resource-manager states.

3.2 MySQL Cluster (NDB)
MySQL Cluster (NDB) is a scalable, distributed, in-memory database. It is designed for availability, which means there is no single point of failure in an NDB cluster. Furthermore, it complies with the ACID transactional properties. Horizontal scalability is achieved by automatic data sharding based on a user-defined partition key. The latest benchmark from Oracle shows that MySQL Cluster version 7.2 achieves horizontal scalability: when the number of data nodes is increased 15 times, the throughput increases 13.63 times [5].

Column            Type
id                int
clustertimestamp  bigint
submittime        bigint
appcontext        varbinary(13900)

Table 1: Properties of application state

Regarding performance, NDB has fast read and write rates: the aforementioned benchmark [5] shows that a 30-node NDB cluster supports 19.5 million writes per second. NDB supports fine-grained locking, which means only the affected rows are locked during a transaction, so updates on two different rows in the same table can be executed concurrently. Both SQL and NoSQL interfaces are supported, which makes NDB highly flexible with respect to users' needs and requirements.

3.3 NDB Storage Module
As a proof-of-concept of our proposed architecture, we designed and implemented an NDB storage module for the YARN resource-manager. Due to limited time, the recovery failure model was used in our implementation. In this report, we refer to the NDB-based proof-of-concept as YARN-NDB.

3.3.1 Database Design
We designed two NDB tables to store application states and their corresponding application-attempts, called applicationstate and attemptstate. Table 1 shows the columns of the applicationstate table. id is a running number and is unique only within a resource-manager. clustertimestamp is the timestamp at which the corresponding resource-manager was started. When more than one resource-manager runs at a time (as in the stateless failure model), we need to differentiate the applications that run among them; therefore, the primary key of this table is the pair (id, clustertimestamp). appcontext is a serialized ApplicationSubmissionContext object, and thus its type is varbinary.

The columns of the attemptstate table are shown in Table 2. applicationid and clustertimestamp are foreign keys to the applicationstate table. attemptid is the id of an attempt, and mastercontainer contains serialized information about the first container assigned to the corresponding application-master. The primary key of this table is (attemptid, applicationid, clustertimestamp).

To enhance table performance in terms of read and write throughput, key partitioning was used (http://dev.mysql.com/doc/refman/5.5/en/partitioning-key.html). Both tables were partitioned by applicationid and clustertimestamp. With this technique, NDB locates the desired data without contacting NDB's location resolver service, which makes access faster than with non-partitioned NDB tables.



Column            Type
attemptid         int
applicationid     int
clustertimestamp  bigint
mastercontainer   varbinary(13900)

Table 2: Properties of attempt state

Figure 4: NDB Storage Unit Test Flowchart

3.3.2 Integration with Resource-Manager
We developed YARN-NDB using ClusterJ (http://dev.mysql.com/doc/ndbapi/en/mccj.html) over two development iterations, based on patches released by Apache. The first YARN-NDB implementation is based on the YARN-128.full-code.5 patch on top of the Hadoop trunk dated 11 November 2012. The second implementation (https://github.com/arinto/hadoop-common) is based on the YARN-231-2 patch (https://issues.apache.org/jira/browse/YARN-231) on top of the Hadoop trunk dated 23 December 2012. In this report, we refer to the second implementation of YARN-NDB unless otherwise specified. The NDB storage module in YARN-NDB has the same functionality as Apache YARN's HDFS and ZooKeeper storage modules, such as adding and deleting application states and attempts.
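
As an illustration of how such a module can talk to NDB through ClusterJ, the sketch below maps the applicationstate table of Table 1 onto an annotated Java interface and persists one row. The interface, helper method and connection properties shown here are illustrative assumptions, not the actual YARN-NDB module code.

import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.Column;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;
import java.util.Properties;

public class NdbStateStoreSketch {

    // ClusterJ maps this interface onto the applicationstate table (Table 1).
    @PersistenceCapable(table = "applicationstate")
    public interface ApplicationState {
        @PrimaryKey
        @Column(name = "id")
        int getId();
        void setId(int id);

        @PrimaryKey
        @Column(name = "clustertimestamp")
        long getClusterTimestamp();
        void setClusterTimestamp(long ts);

        @Column(name = "submittime")
        long getSubmitTime();
        void setSubmitTime(long t);

        // Serialized ApplicationSubmissionContext, stored as varbinary.
        @Column(name = "appcontext")
        byte[] getAppContext();
        void setAppContext(byte[] ctx);
    }

    public static void storeApplication(int appId, long clusterTs,
                                         long submitTime, byte[] serializedCtx) {
        Properties props = new Properties();
        // Connection settings below are assumptions for this sketch.
        props.setProperty("com.mysql.clusterj.connectstring", "localhost:1186");
        props.setProperty("com.mysql.clusterj.database", "yarn");

        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();

        ApplicationState row = session.newInstance(ApplicationState.class);
        row.setId(appId);
        row.setClusterTimestamp(clusterTs);
        row.setSubmitTime(submitTime);
        row.setAppContext(serializedCtx);

        session.persist(row); // single-row transactional write to NDB
        session.close();
    }
}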

Furthermore, we developed a unit test module for the storage module. Figure 4 shows the flowchart of this unit test module. In this module, three MapReduce jobs are submitted to YARN-NDB. The first job finishes its execution before a resource-manager fails. The second job is successfully submitted and scheduled, hence an application-master is launched, but no container is allocated. The third job is successfully submitted but not yet scheduled. These three jobs represent three different scenarios at the moment a resource-manager fails.

Restarting a resource-manager is achieved by connecting the existing application-masters and node-managers to the new resource-manager. All application-master and node-manager processes are rebooted by the new resource-manager, and all unfinished jobs are re-executed with a new application-attempt.

4. EVALUATION
We designed two types of evaluation in this project. The first evaluation tested whether the NDB storage module works as expected. The second evaluation investigated and compared the throughput of ZooKeeper, HDFS and NDB when storing YARN's application state.

4.1 NDB Storage Module Evaluation

4.1.1 Unit Test

This evaluation used the unit test class explained in Section 3.3.2. It was performed on a single-node NDB cluster, i.e. two NDB data-node processes on one node, running on a computer with 4 GB of RAM and an Intel dual-core i3 CPU at 2.40 GHz. We changed ClusterJ's Java properties file accordingly to point to our single-node NDB cluster.

The unit test class was executed using Maven and NetBeans, and the result was positive. We tested consistency by executing the unit test class several times, and the tests always passed.

4.1.2 Actual Resource-Manager Failure Test
In this evaluation, we used the Swedish Institute of Computer Science (SICS) cluster. Each node in the SICS cluster had 30 GB of RAM and two six-core AMD Opteron processors at 2.6 GHz, which could effectively run 12 threads without significant context-switching overhead. Ubuntu 11.04 with Linux kernel 2.6.38-12-server was installed as the operating system, and Java(TM) SE Runtime Environment (JRE) version 1.6.0 was used as the Java runtime environment.

NDB was deployed as a 6-node cluster and YARN-NDB was configured in a single-node setting. We executed the pi and bbp examples that come with the Hadoop distribution. In the middle of the pi and bbp execution, we terminated the resource-manager process using the Linux kill command. A new resource-manager with the same address and port was started three seconds after the old one had been terminated.

We observed that the currently running job finished properly, which means the resource-manager was correctly restarted. Several connection-retry attempts to contact the resource-manager by node-managers, application-masters and MapReduce clients were observed. To check for consistency, we submitted a new job to the new resource-manager, and the new job finished correctly. We repeated this experiment several times and observed the same results, i.e. the new resource-manager was successfully restarted and correctly took over the killed resource-manager's roles.

4.2 NDB Performance Evaluation
We utilised the same set of SICS machines as in the evaluation of Section 4.1.2. NDB was deployed on the same 6-node cluster, and ZooKeeper was deployed on three SICS nodes with the maximum memory for each ZooKeeper process set to 5 GB of RAM. HDFS was also deployed on three SICS nodes and used the same 5 GB maximum memory configuration as ZooKeeper.


Figure 5: zkndb Architecture

4.2.1 zkndb Framework
We developed the zkndb framework (https://github.com/4knahs/zkndb) to benchmark storage systems effectively with minimum effort. Figure 5 shows the architecture of the zkndb framework. The framework consists of three main packages:

1. The storage package, which contains the load generator (StorageImpl), configurable in terms of the number of reads and writes per time unit.

2. The metrics package, which contains the metric parameters (MetricsEngine), for example write or read requests and acknowledgements. This package also contains the metrics logging mechanism (ThroughputEngineImpl).

3. The benchmark package, which contains the benchmark applications and manages benchmark executions.

The zkndb framework offers flexibility in integrating new storage technologies, defining new metrics, and storing benchmark results. To integrate a new storage technology, framework users implement the storage interface in the storage package. A new metric can be developed by implementing the metric interface in the metrics package, and a new metrics logging mechanism can be designed by implementing the throughput-engine interface in the same package. The data produced by ThroughputEngineImpl were further processed by our custom scripts for analysis. For this evaluation, three storage implementations were added to the framework: NDB, HDFS and ZooKeeper.
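
The sketch below illustrates the kind of storage abstraction this pluggable design implies: one implementation per storage system, exercised by the load generator. The interface and method names are assumptions made for illustration; the actual interfaces are defined in the zkndb storage and metrics packages in the repository.

// Illustrative zkndb-style storage abstraction; names are assumptions.
public interface BenchmarkStorage {

    // Open connections or sessions to the underlying store.
    void start() throws Exception;

    // Write one application record (identifier plus serialized state).
    void writeApplicationState(long applicationId, byte[] state) throws Exception;

    // Read one application record back by its identifier.
    byte[] readApplicationState(long applicationId) throws Exception;

    // Release resources.
    void stop() throws Exception;
}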

4.2.2 Benchmark Implementation in zkndb
For ZooKeeper and HDFS, we ported YARN's storage module implementations based on the YARN-128.full-code.5 patch (https://issues.apache.org/jira/browse/YARN-128) into our benchmark. The first iteration of YARN-NDB's NDB storage module was ported into our zkndb NDB storage implementation.

Each data-write into the storage consisted of an application identification and application state information. The application identification was a Java long data type with a size of eight bytes.


Figure 6: zkndb Throughput Benchmark Result for 8 Threads and 1 Minute of Benchmark Execution (y-axis: completed requests/s; x-axis: workload type; series: ZooKeeper, NDB, HDFS)

The application state information was an array of random bytes with a length of 53 bytes; this length was determined by observing the actual application state information stored when executing YARN-NDB jobs. Each data-read consisted of reading an application identification and its corresponding application state information.

Three types of workload were used in our experiment:

1. Read-intensive: one set of data was written into the database, and zkndb repeatedly read the written data.

2. Write-intensive: no reads were performed; zkndb repeatedly wrote new sets of data to different locations.

3. Read-write balanced: reads and writes were performed alternately.

Furthermore, we varied the offered load by configuring the number of threads that accessed the database for reading and writing. To maximize throughput, no delay was configured between consecutive reads and writes. We compared the throughput of ZooKeeper, HDFS and NDB for equal numbers of threads and workload types. In addition, the scalability of each storage system was investigated by increasing the number of threads while keeping the other configurations unchanged.
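
A write-intensive worker of the kind described above could look like the following sketch: each thread writes an eight-byte identifier together with a 53-byte random state in a tight loop and counts completed requests, which are summed over all threads and divided by the run time to obtain completed requests per second. The Store interface is a stand-in for a zkndb storage implementation; this is an illustration, not the actual zkndb benchmark code.

import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

public class WriteIntensiveWorker implements Runnable {

    // Minimal stand-in for a zkndb storage implementation (illustrative only).
    public interface Store {
        void write(long applicationId, byte[] state) throws Exception;
    }

    private final Store store;
    private final AtomicLong completedRequests;
    private final Random random = new Random();
    private volatile boolean running = true;

    public WriteIntensiveWorker(Store store, AtomicLong completedRequests) {
        this.store = store;
        this.completedRequests = completedRequests;
    }

    public void shutdown() {
        running = false;
    }

    @Override
    public void run() {
        long nextId = Thread.currentThread().getId() << 32; // distinct key range per thread
        while (running) {
            byte[] state = new byte[53]; // size observed for a real application state
            random.nextBytes(state);
            try {
                store.write(nextId++, state); // no delay between writes, as in the benchmark
                completedRequests.incrementAndGet();
            } catch (Exception e) {
                // a failed request is simply not counted
            }
        }
    }
}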

4.2.3 Throughput Benchmark Result
Figure 6 shows the throughput benchmark results for eight threads and one minute of execution, with the three workload types and the three storage implementations: ZooKeeper, NDB and HDFS.

For all three workload types, NDB had the highest throughput compared to ZooKeeper and HDFS. These results can be attributed to the nature of NDB as a high-performance persistent storage that is capable of handling high read and write request rates.


Figure 7: Scalability Benchmark Results for Read-Intensive Workload (y-axis: completed requests/s; x-axis: number of threads: 4, 8, 12, 16, 24, 36; series: ZooKeeper, NDB, HDFS)

Referring to the error bars in Figure 6, NDB shows a large deviation between its average and its lowest value during the experiment. This large deviation could be attributed to infrequent intervention from the NDB management process to recalculate the data index for fast access.

Interestingly, ZooKeeper's throughput was stable across all workload types. This stability can be attributed to ZooKeeper's behaviour of linearizing incoming requests, which causes read and write requests to have approximately the same execution time. Another possible explanation for ZooKeeper's throughput stability is the implementation of YARN's ZooKeeper storage module, whose code could make the read and write execution times equal.

As expected, HDFS had the lowest throughput for all workload types. HDFS' low throughput may be attributed to NameNode-locking overhead and an inefficient data access pattern when processing many small files. Each time HDFS receives a read or write request, the NameNode needs to acquire a lock on the file path so that HDFS can return a valid result; acquiring locks frequently increases data access time and hence decreases throughput. The inefficient data access pattern in HDFS is due to splitting the data to fit HDFS blocks and to data replication. Furthermore, the need to write data to disk further decreases HDFS' throughput, as observed in the write-intensive and read-write balanced workloads.

4.2.4 Scalability Benchmark Result
Figure 7 shows the increase in throughput as we increased the number of threads for the read-intensive workload. All of the storage implementations increased their throughput as the number of threads increased. NDB had the highest increase compared to HDFS and ZooKeeper: for NDB, doubling the number of threads increased the throughput by a factor of 1.69, which is close to linear scalability.

The same trend was observed for the write-intensive workload, as shown in Figure 8. NDB still had the highest increase in throughput compared to HDFS and ZooKeeper; for NDB, doubling the number of threads increased the throughput by a factor of 1.67.

Figure 8: Scalability Benchmark Results for Write-Intensive Workload (y-axis: completed requests/s; x-axis: number of threads: 4, 8, 12, 16, 24, 36; series: ZooKeeper, NDB, HDFS)

On the other hand, HDFS performed very poorly for this workload: the highest throughput achieved by HDFS with 36 threads was only 534.92 requests per second. This poor performance can be attributed to the same reasons explained in Section 4.2.3, namely NameNode-locking overhead and an inefficient data access pattern for small files.

5. RELATED WORK

5.1 Corona
Corona [2] introduces a new process called the cluster-manager to take over the cluster management functions of the MapReduce job-tracker. The main purpose of the cluster-manager is to keep track of the amount of free resources and to manage the nodes in the cluster. Corona utilizes push-based scheduling, i.e. the cluster-manager pushes the allocated resources back to the job-tracker after it receives resource requests. Furthermore, Corona claims low scheduling latency since no periodic heartbeat is involved in resource scheduling. Although Corona solves the MapReduce scalability limitation, it has a single point of failure in the cluster-manager, so the MapReduce availability limitation is still present.

5.2 KTHFS
KTHFS [12] solves the scalability and availability limitations of the HDFS NameNode. The filesystem metadata of the HDFS NameNodes is stored in NDB, making the HDFS NameNodes fully stateless. By being stateless, more than one HDFS NameNode can run simultaneously, and the failure of a NameNode can easily be mitigated by the remaining live NameNodes. Furthermore, KTHFS has linear throughput scalability: throughput can be increased by adding HDFS NameNodes or NDB DataNodes. KTHFS inspired the use of NDB to solve the YARN availability limitation.

5.3 Mesos
Mesos [3] is a resource management platform that enables commodity-cluster sharing between different cluster computing frameworks. Cluster utilization is improved by this sharing mechanism. Mesos has several master processes with roles similar to the YARN resource-manager. The availability of Mesos is achieved by having several stand-by master processes ready to replace a failed active master process. Mesos utilizes ZooKeeper to monitor the group of master processes, and upon a master process failure, ZooKeeper performs leader election to choose the new active master process. Reconstruction of state is performed by the newly active master process; this reconstruction mechanism may introduce a significant delay when the state is large.

5.4 Apache HDFS-1623
Apache utilizes a failover recovery model to solve the HDFS NameNode single-point-of-failure limitation [9, 10]. In this solution, additional HDFS NameNodes are introduced as standby NameNodes. The active NameNode writes all changes to the file system namespace into a write-ahead log in persistent storage. This is likely to introduce overhead when storing data, and the magnitude of the overhead depends on the choice of storage system. The solution supports automatic failover, but its complexity increases due to the additional processes that act as failure detectors. These failure detectors trigger the automatic failover mechanism when they detect NameNode failures.

6. CONCLUSION AND FUTURE WORK
We have presented an architecture for a highly available cluster computing management framework. The proposed architecture incorporates a stateless failure model into existing Apache YARN. To realize high availability and the stateless failure model, MySQL Cluster (NDB) was proposed as the storage technology for the necessary state information.

As a proof-of-concept, we implemented Apache YARN's recovery failure model using NDB (YARN-NDB) and developed the zkndb benchmark framework to test it. The availability and scalability of the implementation were examined using a unit test, an actual resource-manager failure test, and throughput benchmark experiments. The results showed that YARN-NDB is better in terms of throughput and ability to scale than the existing ZooKeeper- and HDFS-based solutions.

For future work, we plan to further develop YARN-NDB with a fully stateless failure model. As a first step, a more detailed analysis of the resource-manager states is needed. After the states have been analysed, we plan to re-design the database to accommodate the additional state information identified by the analysis. In addition, modifications to the YARN-NDB code are needed to remove this information from memory and always access NDB when the information is needed. Next, we will evaluate the throughput and overhead of the new implementation. Finally, after the new implementation passes these evaluations, YARN-NDB should be deployed on a significantly larger cluster with real-world workloads to check its actual scalability. The resulting YARN-NDB is expected to run well in cloud environments and handle node failures properly.

7. ACKNOWLEDGEMENTS
The authors would like to thank our partner Mario Almeida for his contribution to the project. We would also like to thank our colleagues Umit Cavus Buyuksahin, Strahinja Lazetic and Vasiliki Kalavri for providing feedback throughout this project. Additionally, we would like to thank our EMDC friends Muhammad Anis uddin Nasir, Emmanouil Dimogerontakis, Maria Stylianou and Mudit Verma for their continuous support throughout the report writing process.

8. REFERENCES
[1] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008.

[2] Facebook. Under the hood: Scheduling MapReduce jobs more efficiently with Corona, Nov. 2012. Retrieved November 18, 2012 from http://on.fb.me/109FHPD.

[3] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: a platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI'11, page 22, Berkeley, CA, USA, 2011. USENIX Association.

[4] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: wait-free coordination for internet-scale systems. In USENIX ATC, volume 10, 2010.

[5] M. Keep. MySQL Cluster 7.2 GA released, delivers 1 BILLION queries per minute, Apr. 2012. Retrieved November 18, 2012 from http://dev.mysql.com/tech-resources/articles/mysql-cluster-7.2-ga.html.

[6] A. C. Murthy. The next generation of Apache Hadoop MapReduce, Feb. 2011. Retrieved November 18, 2012 from http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/.

[7] A. C. Murthy. Introducing Apache Hadoop YARN, Aug. 2012. Retrieved November 11, 2012 from http://hortonworks.com/blog/introducing-apache-hadoop-yarn/.

[8] A. C. Murthy, C. Douglas, M. Konar, O. O'Malley, S. Radia, S. Agarwal, and V. KV. Architecture of next generation Apache Hadoop MapReduce framework. Retrieved November 18, 2012 from https://issues.apache.org/jira/secure/attachment/12486023/MapR.

[9] A. Myers. High availability for the Hadoop Distributed File System (HDFS), Mar. 2012. Retrieved November 18, 2012 from http://bit.ly/ZT1xIc.

[10] S. Radia. High availability framework for HDFS NN, Feb. 2011. Retrieved January 4, 2012 from https://issues.apache.org/jira/browse/HDFS-1623.

[11] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, May 2010.

[12] M. Wasif. A distributed namespace for a distributed file system, 2012. Retrieved November 18, 2012 from http://kth.diva-portal.org/smash/record.jsf?searchId=1&pid=diva2:548037.