research article a replication-based mechanism …downloads.hindawi.com › journals › mpe ›...

Research ArticleA Replication-Based Mechanism for Fault Tolerance inMapReduce Framework

Yang Liu and Wei Wei

College of Information Science and Engineering Henan University of Technology Zhengzhou 450001 China

Correspondence should be addressed to Wei Wei nsyncw126com

Received 20 October 2014 Accepted 31 January 2015

Academic Editor Hui-Huang Hsu

Copyright copy 2015 Y Liu and W Wei This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

MapReduce is a programmingmodel and an associated implementation for processing and generating large data sets with a paralleldistributed algorithm on a cluster In cloud environment node and task failure are no longer accidental but a common featureof large-scale systems Current rescheduling-based fault tolerance method in MapReduce framework failed to fully consider thelocation of distributed data and the computation and storage overhead of rescheduling failure tasksThus a single node failure willincrease the completion time dramatically In this paper a replication-based mechanism is proposed which takes both task andnode failure into consideration Experimental results show that compared with default mechanism inHadoop ourmechanism cansignificantly improve the performance at failure time with more than 30 decreasing in execution time

1 Introduction

MapReduce is an emerging programming paradigm thathas gained more and more popularity due to its abilityto support complex tasks execution in a scalable way [12] MapReduce is used for processing large data sets byparallelizing the processing on a large number of distributednodes Data is stored in splits that are processed by separatetasks Processing is done in two phases map and reduceA MapReduce application is implemented in terms of twofunctions that correspond to these two phases A mapfunction processes input data expressed in terms of key-valuepairs and produces an output also in the form of key-valuepairs A reduce function picks the output of themap functionsand produces results It is shown that many applications canbe implemented using this programming model [1]

This popularity is also shown by the appearance of open-source implementations like Hadoop that is now extensivelyadopted by Yahoo and many other companies [3] Hadoop[4] is an open-source software framework implemented usingJava and is designed to support data-intensive applicationsexecuted on large distributed systems It is a project of theApache Software Foundation and is a very popular software

tool due in part to it being open-source Yahoo has con-tributed to about 80 of the main core of Hadoop but manyother large technology organizations have used or are cur-rently using Hadoop such as Facebook Twitter LinkedInand others [5]TheHadoop framework is comprised of manydifferent projects but two of the main ones are the HadoopDistributed File System (HDFS) and MapReduce Both theinitial input and the final output of a Hadoop MapReduceapplication are normally stored in HDFS [3] which is similarto the Google File System [6]

However in large-scale distributed computing environ-ment such as cloud environment node and task failure are nolonger accidental but a common feature Research discoveredthat failure has a significant impact on system performancein large-scale systems [4] Every year in a cluster 1 to 5hard disks will be scrapped up to 20 racks and 3 routers willgo down and servers will go down at least twice with 2to 4 scrap rate each year It shows that failure also occursdaily even in a distributed system with up to ten thousandsuper reliable servers (MTBF of 30 years) [5] For the cloudenvironment consisting of a large number of inexpensivecomputers node and task failure become a more frequentand wide spread problem which must be handled by some

Hindawi Publishing CorporationMathematical Problems in EngineeringVolume 2015 Article ID 408921 7 pageshttpdxdoiorg1011552015408921

2 Mathematical Problems in Engineering

effective fault tolerance method It is difficult to achieve 100fault tolerance because there aremany physical circumstancesthat just can not be planned for but the goal of fault toleranceis to plan for all common failures [7] In managing faulttolerance it is important to eliminate Single Points of Failure(SPOF) which are single elements of the system that whenthey fail they can bring down the whole system [8]

As a result fault tolerance is as important in the design ofthe original MapReduce as in Hadoop Specifically a MapRe-duce job is a unit of work that consists of the input data amap and a reduce function and configuration informationHadoop breaks the input data in splits Each split is processedby a map task which Hadoop prefers to run on one ofthe nodes where the split is stored (HDFS replicates thesplits automatically for fault tolerance) Map tasks write theiroutput to local disk which is not fault tolerance Howeverif the output is lost as when the machine crashes the maptask is simply executed again on another computing nodeThe outputs of all map tasks are then merged and sortedby an operation called shuffle This kind of rescheduling isinefficient and can be improved in many ways

Since the MapReduce-based programs generate a lotof intermediate data which is critical for completing thejob this paper views failover of intermediate data as anecessary component of MapReduce framework specificallytargeting andminimizing the effect of tasks and nodes failureon performance metrics such as job completion time Wepropose new design techniques for a new fault tolerancemechanism implement these techniqueswithinHadoop andexperimentally evaluate the prototype system

2 Related Work

MapReduce is a programming model and an associatedimplementation for processing and generating big data [9]It is initially designed for parallel processing of big datausing mass cheap server clusters and putting scalability andsystem availability on the prior position Within GoogleCompany more than 20 PB of data is produced every dayand 400 PB every month Yahoo adopts Hadoop an open-source MapReduce framework Facebook uses it to processdata and generate reports while Amazon Company useselastic MapReduce framework for large amount of data-intensive tasks [10] MapReduce has obvious advantage overother programming models like MPI and a single taskfailure does not affect the execution of other tasks becauseof the independence between mapper and reducer taskIt has drawn a lot of attention for its benefits in simpleprogramming data distribution and fault tolerance whichhas been widely used in many areas like data mining andmachine learning

Google Company pointed out that in a cloud environ-ment with averagely 268 nodes each MapReduce job isaccompanied with failure of five nodes [11] On the otherhand large amount of data is usually accompanied with datainconsistency or even data loss and incorrect data record willresult in task failure or even failure of the entire job MapRe-duce uses a rescheduling-based fault tolerance mechanism

to ensure the correct execution of the failed task Sincethe rescheduling failed to fully consider the location ofdistributed data in the scenario of node failure all thecompleted tasks on the failed node will start over whichshows severely low efficiency If the failure detection timeoutin Hadoop is set to 10 minutes (the default value) a singlefailure will cause at least 50 increase in completion time[12] If each input split contains one bad record the entireMapReduce job will have a 100 runtime overhead which isnot acceptable for those users with rigorous SLA require-ments So it clearly shows the need for effective algorithmsthat can reduce delays caused by these failures [13]

In [14] tests show that in seven types of cluster withdifferent MTBF (Mean Time between Failures) MapReducejob with three replicas can achieve better performance thanthat with one replica because more replicas can reduce thechances of data migration when rescheduling jobs at failuretime Reference [15] discussed an alternative fault tolerancescheme the state-based Stream MapReduce (SMR) which issuitable for handling continuous data streaming applicationswith real-time requirements such as financial and stockdataThe key feature is low-overhead deterministic executionwhich reduces the amount of persistently stored informationReference [16] proposed a method to replicate intermediatedata to the reducer but thismethodwill produce a large num-ber of IO operations consume a lot of network bandwidthand only support recovery for single node failure Reference[17] proposed a method to improve performance of fault tol-erance by replicating data copies Reference [18] presented anintelligent scheduling system for web service which consid-ers both the requirements of different service requests and thecircumstances of the computing infrastructure which con-sists of various resources Reference [19] described the prior-ity of fairness efficiency and the balance between benefit andfairness respectively and then recompiled the CloudSim andsimulated three task scheduling algorithms on the basis ofextended CloudSim respectively

We proposed a replication-based mechanism for faulttolerance in MapReduce framework which can significantlyreduce the average completion time of jobsUnlike traditionalfault tolerance mechanism it will reschedule tasks on thefailed node to another available node without starting overagain but reconstruct intermediate results quickly from thecheckpoint file The preliminary experiments show thatunder a failure condition it outperforms default mechanismwith more than 30 increasing in performance and incursonly up to 7 overhead

3 Algorithms

31 MapReduce Programming Model In MapReduce pro-grammingmodel the calculation process is decomposed intotwo main phases namely the mapper stage and reducerstage For one piece of input data the reducer stage onlystarts when the mapper stage is completed A MapReducejob includes119872mapper tasks and 119877 reducer tasks In mapperstage multiple mapper tasks run in parallel and one mappertask will read an input split and perform a mapper function

Mathematical Problems in Engineering 3

Output file

Input split 0

Input split 1

Input split 2

Input split 3

Input split 4

Input split 5Mapper

MapperMapper

MapperMapper

Mapper

ReducerWorker node

Worker node

Output fileWorker node

Intermediate

Intermediate

Intermediate

Intermediate

Intemediate

Intermediate

Worker node

Reducer

Worker node

Input Map stage Shuffle stage Reduce stage Output

Figure 1 Execution process of MapReduce model

where mapper tasks are independent of each other Mappertasks will produce a large number of intermediate results inlocal storage Before reducer function is called the systemwill classify the generated intermediate result and shuffleresult with the same key to reducers A reducer task willexecute a reduce function and generates an output file andeventually aMapReduce jobwill generate119877 output fileswhichcan be merged to get the final result When programmingdevelopers need to write a mapper and a reducer function

map (input data) 997888rarr (key119895 value119895) | 119895 = 1 119896

reduce (key [value1 value119898]) 997888rarr (key final value) (1)

The MapReduce model is shown in Figure 1Node and task failure are prone to happen during

a MapReduce job execution process When a node failsMapReduce will move all mapper tasks on the failed node toavailable nodes This kind of rescheduling method is simplebut often introduces a lot of time cost thus for users withhigh responsiveness requirements its negative influence isnot acceptable

32 Improved Rescheduling Algorithm In the paper areplication-based mechanism in MapReduce framework isproposed which uses a checkpoint-based active replicationmethod to provide better performance Both node and taskfailures are considered and the introduced delay is signifi-cantly decreased thus the overall performance is improved

In our mechanism two kinds of files are introducednamely the local checkpoint file and the global index fileThese files are created before the execution of one mappertask The local checkpoint file is responsible for recordingthe progress of current task avoiding reexecution from thebeginning in case of task failure If local task failure happenedone node can continue failed task using information in

local checkpoint file And the global index file is responsiblefor recording the characteristics of the current job therebyhelping in reconstructing the intermediate results acrossnodes to reduce reexecution time The global index file canbe implemented as a checkpoint file saved in HDFS and canbe accessible in case of node failure

The algorithm is divided into two parts one part workson master node and the other works on worker nodes Inaddition as the master node is important it is necessary tomaintain multiple fully consistent ldquohot backuprdquo to ensureseamless migration when a fault occurs The two parts ofalgorithm are shown as follows

The Algorithm on Master Node

(1) The master node preassigns mapper and reducer taskto different worker nodes

(2) Choose K replicas for each worker node

(3) Wait for results from all worker nodes

(a) If all results are receivedmerge these results andmark job as completed

(b) Or go to 3 and keep waiting

(4) Periodically send probing packets to all worker nodes

(a) If all worker nodes respond then go to 4 andkeep probing

(b) Or if one node does not respond in given timeinterval then mark the node as failed

(i) Get the worker ID and all unfinished taskson the node

(ii) Put all unfinished mapper tasks into globalqueue and reschedule them on availablereplica nodes


(iii) If there are failed tasks on the failednode then reschedule these tasks on replicanodes which have intermediate resultswithout reexecuting these mapper tasks

(c) If one node has finished all tasks then reassignother unexecuted tasks to the node

The Algorithm on Worker Node(1) Check the type of the given task

(a) If it is a mapper task check whether it is a newtask or a reexecution of failed task(i) If it is a new task initialize and execute it(ii) If it is a reexecution of local failed task then

get its progress from the local checkpointfile and continue execution

(iii) If it is a reexecution of the failed taskfrom other nodes then read global indexfile for the task and rapidly reconstructintermediate result using information inglobal index file

(b) If it is a reducer task check whether it is anew task a reexecution of the failed task orunexecuted task from other nodes(i) If it is a new task initialize and execute it(ii) If it is a reexecution of the failed task from

other nodes read intermediate data fromthe local disk and execute it

(iii) If it is a new assigned unexecuted task fromother nodes then read intermediate datafrom a given worker node and execute it

(2) Create a local checkpoint file and a global index filefor the given mapper task

(3) Start the mapper task

(a) When memory buffer of the mapper node isfull dump intermediate data into local filesAfter dumping finished record the positionof the input stream and mapper ID (Positionmapper ID) into local checkpoint file

(b) According to the location distribution of inputkey-value pairs which contributes to outputkey-value pairs two different strategies areemployed to record these distributions in theglobal index file(i) For input key-value pairs producing out-

put record their position (T1 offset) intoglobal index file which means only pairsin these offsets need to be processed inreexecution Here T1 means this is the type1 record

(ii) Or for input pairs which give no outputrecord the range as (T2 offset1 offset2)which means the input pairs between off-set1 and offset2 have no output and can beskipped in reexecution T2means this is thetype 2 record

(4) When a mapper task finished shuffle and send theintermediate results to corresponding reducer nodesCopy the intermediate data needed by local reducertask to replica nodes then notify the completion ofmapper task to master node and delete the localcheckpoint file and the global index file

In our algorithm when a task failure occurs the corre-sponding node simply reads the checkpoint file saved in thelocal disk restore task status to the checkpoint and reload theintermediate results generated before failure so reexecutionis avoided

When a node failure occurs the scheduler on the masternode is responsible for rescheduling the interrupted mappertasks to available replica nodes which can quickly constructthe intermediate results of failed tasks using the global indexfile to reduce the reexecution time greatly

Note if tasks and nodes failed before checkpoint theprogress will continue from the recent checkpoint And ifthe action of saving checkpoint failed the progress willstart from the recent checkpoint again The simple failoverstrategy is of low cost and is effective in distributed computingenvironment The frequency of saving checkpoint should becarefully chosen a high frequency will provide lower costfor failover and higher checkpoint saving cost for normalrunning while a low frequency is in the opposite case Aftercareful adjusting the frequency value in our experiment is setto one checkpoint per 105 key-value pairs

If node failure happens in the reducer stage the taskson the failed node are rescheduled to any available replicanodes The needed intermediate results have been copied tothe replica node when mapper tasks finished thus there isno need to repeat the mapper tasks on the failed node sothat overall completion time of the MapReduce job is greatlyreduced

33 SystemAnalysis For a givenmapper task119898 the questionis for a given input range in an input split how to properlyplace the kind of records (1198791or 1198792) in global index file tominimize storage overhead For the given input range set thetotal number of key-value pairs as 119873119898 and the number ofoutput key-value pairs as 1198731015840119898 The size of type 1 and type 2records is 1198711and 1198712

1198712 = 21198711 (2)

Set 1198781198981 and 1198781198982 as the storage overhead in type 1 andtype 2 record in the input range Set 119881119898 as the count of inputsubranges which give no output and the119874119894th input key-valuepair produces the 119894th output key-value pairs Consider

119881119898 =

1198731015840119898

sum

119894=2

0 119874119894 minus 119874119894minus1 = 1

1 119874119894 minus 119874119894minus1 gt 1

(3)


We have

1198781198981 = 11987111198731015840119898

1198781198982 = 1198712119881119898 = 21198711119881119898

(4)

Set 119877119898 as the decision we make when 119877119898 = 1 we storetype 1 record to global index file or if 119877119898 = 2 we use type2 record Then we can decide which kind of record to useaccording to equation as follows

119877119898 =

1 1198731015840119898 lt 2119881119898

2 1198731015840119898 ge 2119881119898

(5)

4 Experimental Results

We validate our algorithm on Hadoop cluster which isimplemented as a patch to Hadoop The comparison ismeasured from the performance and the overhead aspectThe performance of the algorithm is evaluated by delay Theimplementation of the algorithm is based on Hadoop 0201Java 16 andHDFS file systemwith data block size of 256MBThe underlying infrastructure is a 20-node HP blade clusterand each node is equipped with quad-core Xeon 26GHzCPU 8Gmemory 320Gharddrive and twopieces ofGigabitNICs Each node is configured to hold four Xen virtualmachines thus it has 80 virtual nodes And we schedule40 nodes for Hadoop cluster with default fault tolerancemechanism and 40 nodes for cluster with our mechanismFor each cluster one node is deployed as themaster node andthe remaining 39 nodes are deployed as the worker node Asingle worker node can simultaneously run twomapper tasksand one reducer task In the experiment we run a typical filterjob which is to find out certain entries from huge amountsof data This kind of job is computationally intensive and hasless intermediate resultsWeuse 12millionEnglishweb pageswith an average page size of 1MB for test By adjusting thesplit size one mapper handles an average of approximately120M input data and each node is assigned with an averageof about 250 mapper tasks

According to the distribution of query words in inputdata there are three kinds of MapReduce jobs used in theexperiment the aggregated job the sparse job and the mixedjob In an aggregated job the locations of query words aregathered in the target data in a sparse job the locations ofthe querywords aremore dispersed in a hybrid job the abovetwo situations coexist

The processing time of MapReduce job is usually influ-enced by factors as below (1) the size of the data setwhere each mapper task will deal with a subset of data(split) (2) the computation performance of worker nodes(3) MapReduce job type (4) the number of bad recordsin each split where each record will lead to restarting oftask (5) the number of failed worker nodes By adjustingthese factors the effectiveness and the overhead of given faulttolerance algorithms can be evaluated and compared in thesame criterion

In scenario with only task failure the comparison of jobcompletion time of two mechanisms is shown in Figure 2

0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000

Failed count per 100 tasks

Tim

e (m

in)

HadoopHadoop with RFM

Figure 2 Comparison of execution time with task failure

0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000

Number of failed nodes

Tim

e (m

in)


Figure 3 Comparison of execution time with node failure

with ourmechanismmarked as RFM (replication-based faulttolerancemechanism)The 119909-axis is the task error probabilityin the form of error tasks number per 100 tasks with the 119910-axis being the total completion time There is no upper limiton the errors number of eachmapper task It can be observedthat along with the increasing of probability of errors theexecution time of the job with REF is significantly decreasedcompared to that of default mechanism

In scenario with only node failure the comparison ofMapReduce job completion times is shown in Figure 3 The119909-axis is the number of failed nodes with the 119910-axis beingthe total completion time It can be observed that along withmore failed nodes our mechanism can significantly decreasethe reexecution time It is because the default mechanism ofHadoopwill start all failed task from the beginning on replicanodes while our mechanism gets more tasks finished in thesame period of time by continuing the mapper tasks quicklyand efficiently

In scenario with both task and node failure the compari-son of job completion times is shown in Figure 4The119909-axis isthe probability of task failure and node failure in the form offailed tasksnodes count per 100 tasksnodes with 119910-axis asthe total completion time It is shown that under the jointinfluence of task and node failure the relation between thejob execution time and the failure probability is nearly linear


0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000

Failed count per 100 tasksnodes

Tim

e (m

in)


Figure 4 Comparison of execution time with task and node failure

0 1 2 3 4 5 6 7 8 9 100

051

152

253

35


Net

wor

k tr

ansfe

r cos

t (By

tes)


times104

Figure 5 Comparison of average network overhead with nodefailure

in our mechanism and the overhead is mainly introduced bytask migration and restoring

Moreover the overhead introduced by our mechanismis low In case of task failure the network overhead can beignored since there is no extra network transferring In caseof node failure the extra network overhead of ourmechanismis mainly from the active replication of the global index fileFigure 5 shows the network overhead which is introduced bynode failure The 119909-axis is the number of failure nodes andthe 119910-axis is the average network overhead It is shown thatcompared with the network overhead of default mechanismin Hadoop the network overhead of our mechanism issignificantly low

In the task failure scenario the storage overhead is thesize of the local checkpoint file which is negligible due tothe limited position information stored in it Figure 6 showsthe comparison of storage overhead of three different typesof MapReduce job under the scenario of 10 failed nodes It isshown that the increased storage overhead of our mechanismis mainly from global index file which is low compared withoriginal intermediate results

Type I task Type II task Type III task0

05

1

15

2

25

Stor

age c

ost (

Byte

s)


times104

Figure 6 Comparison of average storage overhead with nodefailure

5 Conclusions

The paper proposed a replication-based mechanism for faulttolerance in MapReduce framework which is fully imple-mented and tested onHadoop Experimental results show theruntime performance can be improved by more than 30 inHadoop thus our mechanism is suitable for multiple types ofMapReduce job and can greatly reduce the overall completiontime under the condition of task and node failures

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This paper is supported by the National Natural and ScienceFoundation of China (nos 61003052 61103007) Natural Sci-ence Research Plan of the Education Department of HenanProvince (nos 2010A520008 13A413001 and 14A520018)Henan Provincial Key Scientific and Technological Plan (no102102210025) Program forNewCentury Excellent Talents ofMinistry of Education of China (no NCET-12-0692) NaturalScience Research Plan of Henan University of Technology(2014JCYJ04) and Doctor Foundation of Henan Universityof Technology (nos 2012BS011 2013BS003)

References

[1] J Dean and S Ghemawat ldquoMapReduce simplified data pro-cessingon large clustersrdquo in Proceedings of the 6th SymposiumonOperating Systems Design amp Implementation vol 3 pp 102ndash1112004

[2] U Hoelzle and L A Barroso The Datacenter as a ComputerAn Introduction to the Design of Warehouse-Scale MachinesMorgan and Claypool Publishers 2009

[3] T White Hadoop The Definitive Guide OrsquoReilly 2009[4] Apache Hadoop httphadoopapacheorg


[5] D Borthakur The Hadoop Distributed File System Architec-ture and Design httphadoopapacheorgdocsr0180hdfsdesignpdf

[6] S Ghemawat H Gobioff and S T Leung ldquoThe Google filesystemrdquo SIGOPS Operating Systems Review vol 37 no 5 pp29ndash43 2003

[7] B Selic ldquoFault tolerance techniques for distributed systemsrdquohttpzhscribdcomdoc37243421Fault-Tolerance-Techniques-for-Distributed-Systems

[8] F Wang J Qiu J Yang B Dong X Li and Y Li ldquoHadoop highavailability through metadata replicationrdquo in Proceedings ofthe 1st International Workshop on Cloud Data Management(CloudDB rsquo09) pp 37ndash44 ACM November 2009

[9] J Dean and SGhemawat ldquoMapReduce simplified data process-ing on large clustersrdquo Communications of the ACM vol 51 no1 pp 107ndash113 2008

[10] Amazon ElasticMapreduce httpawsamazoncomelasticma-preduce

[11] JDean ldquoExperienceswithMapReduce an abstraction for large-scale computationrdquo in Proceedings of the Internet Conference onParallel Architectures and Computation Techniques (PACT rsquo06)vol 8 pp 5ndash10 Seattle Wash USA 2006

[12] S Y Ko I Hoque B Cho and I Gupta ldquoOn availabilityof intermediatedata in cloud computationsrdquo in Proceedings ofthe The USENIX Workshop onHot Topics in Operating Systems(HotOS rsquo09) vol 5 pp 32ndash38 Ascona Switzerland 2009

[13] J-A Quiane-Ruiz C Pinkel J Schad and J Dittrich ldquoRAFTingMapReduce fast recovery on the RAFTrdquo in Proceedings of theIEEE 27th International Conference on Data Engineering (ICDErsquo11) vol 3 pp 589ndash600 IEEE Hannover Germany April 2011

[14] J Hui Q Kan S Xian-He and L Ying ldquoPerformance underfailures of mapreduce applicationsrdquo in IEEEACM InternationalSymposium on Cluster Cloud and Grid Computing pp 608ndash609Newport Beach Calif USA 2011

[15] A Martin T Knauth S Creutz et al ldquoLow-overhead faulttolerance for high-throughput data processing systemsrdquo inProceedings of the 31st International Conference on DistributedComputing Systems (ICDCS rsquo11) pp 689ndash699 MinneapolisMinn USA June 2011

[16] S Y Ko I Hoque B Cho and I Gupta ldquoMaking cloudintermediate data fault-tolerantrdquo in Proceedings of the 1st ACMSymposium on Cloud Computing vol 1 pp 181ndash192 Indianapo-lis Ind USA June 2010

[17] Q Zheng ldquoImproving mapreduce fault tolerance in the cloudrdquoin Proceedings of the IEEE International Symposium on ParallelDistributed ProcessingWorkshops and Phd Forum (IPDPSW rsquo10)vol 1 pp 1ndash6 New York NY USA 2010

[18] J Liu X G Luo B N Li X M Zhang and F Zhang ldquoAnintelligent job scheduling system for web service in cloudcomputingrdquo TELKOMNIKA Indonesian Journal of ElectricalEngineering vol 11 no 12 pp 2956ndash2961 2013

[19] S Hong S-p Chen J Chen and G Kai ldquoResearch andsimulation of task scheduling algorithm in cloud computingrdquoTelkomnika vol 11 no 11 pp 1923ndash1931 2013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of


Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of


Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of


Mathematical PhysicsAdvances in

Complex AnalysisJournal of


OptimizationJournal of


CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of


Operations ResearchAdvances in

Journal of


Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences


The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Algebra

Discrete Dynamics in Nature and Society



Decision SciencesAdvances in

Discrete MathematicsJournal of


Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of


effective fault tolerance method It is difficult to achieve 100fault tolerance because there aremany physical circumstancesthat just can not be planned for but the goal of fault toleranceis to plan for all common failures [7] In managing faulttolerance it is important to eliminate Single Points of Failure(SPOF) which are single elements of the system that whenthey fail they can bring down the whole system [8]

As a result fault tolerance is as important in the design ofthe original MapReduce as in Hadoop Specifically a MapRe-duce job is a unit of work that consists of the input data amap and a reduce function and configuration informationHadoop breaks the input data in splits Each split is processedby a map task which Hadoop prefers to run on one ofthe nodes where the split is stored (HDFS replicates thesplits automatically for fault tolerance) Map tasks write theiroutput to local disk which is not fault tolerance Howeverif the output is lost as when the machine crashes the maptask is simply executed again on another computing nodeThe outputs of all map tasks are then merged and sortedby an operation called shuffle This kind of rescheduling isinefficient and can be improved in many ways

Since the MapReduce-based programs generate a lotof intermediate data which is critical for completing thejob this paper views failover of intermediate data as anecessary component of MapReduce framework specificallytargeting andminimizing the effect of tasks and nodes failureon performance metrics such as job completion time Wepropose new design techniques for a new fault tolerancemechanism implement these techniqueswithinHadoop andexperimentally evaluate the prototype system

2 Related Work

MapReduce is a programming model and an associatedimplementation for processing and generating big data [9]It is initially designed for parallel processing of big datausing mass cheap server clusters and putting scalability andsystem availability on the prior position Within GoogleCompany more than 20 PB of data is produced every dayand 400 PB every month Yahoo adopts Hadoop an open-source MapReduce framework Facebook uses it to processdata and generate reports while Amazon Company useselastic MapReduce framework for large amount of data-intensive tasks [10] MapReduce has obvious advantage overother programming models like MPI and a single taskfailure does not affect the execution of other tasks becauseof the independence between mapper and reducer taskIt has drawn a lot of attention for its benefits in simpleprogramming data distribution and fault tolerance whichhas been widely used in many areas like data mining andmachine learning

Google Company pointed out that in a cloud environ-ment with averagely 268 nodes each MapReduce job isaccompanied with failure of five nodes [11] On the otherhand large amount of data is usually accompanied with datainconsistency or even data loss and incorrect data record willresult in task failure or even failure of the entire job MapRe-duce uses a rescheduling-based fault tolerance mechanism

to ensure the correct execution of the failed task Sincethe rescheduling failed to fully consider the location ofdistributed data in the scenario of node failure all thecompleted tasks on the failed node will start over whichshows severely low efficiency If the failure detection timeoutin Hadoop is set to 10 minutes (the default value) a singlefailure will cause at least 50 increase in completion time[12] If each input split contains one bad record the entireMapReduce job will have a 100 runtime overhead which isnot acceptable for those users with rigorous SLA require-ments So it clearly shows the need for effective algorithmsthat can reduce delays caused by these failures [13]

In [14] tests show that in seven types of cluster withdifferent MTBF (Mean Time between Failures) MapReducejob with three replicas can achieve better performance thanthat with one replica because more replicas can reduce thechances of data migration when rescheduling jobs at failuretime Reference [15] discussed an alternative fault tolerancescheme the state-based Stream MapReduce (SMR) which issuitable for handling continuous data streaming applicationswith real-time requirements such as financial and stockdataThe key feature is low-overhead deterministic executionwhich reduces the amount of persistently stored informationReference [16] proposed a method to replicate intermediatedata to the reducer but thismethodwill produce a large num-ber of IO operations consume a lot of network bandwidthand only support recovery for single node failure Reference[17] proposed a method to improve performance of fault tol-erance by replicating data copies Reference [18] presented anintelligent scheduling system for web service which consid-ers both the requirements of different service requests and thecircumstances of the computing infrastructure which con-sists of various resources Reference [19] described the prior-ity of fairness efficiency and the balance between benefit andfairness respectively and then recompiled the CloudSim andsimulated three task scheduling algorithms on the basis ofextended CloudSim respectively

We proposed a replication-based mechanism for faulttolerance in MapReduce framework which can significantlyreduce the average completion time of jobsUnlike traditionalfault tolerance mechanism it will reschedule tasks on thefailed node to another available node without starting overagain but reconstruct intermediate results quickly from thecheckpoint file The preliminary experiments show thatunder a failure condition it outperforms default mechanismwith more than 30 increasing in performance and incursonly up to 7 overhead

3 Algorithms

31 MapReduce Programming Model In MapReduce pro-grammingmodel the calculation process is decomposed intotwo main phases namely the mapper stage and reducerstage For one piece of input data the reducer stage onlystarts when the mapper stage is completed A MapReducejob includes119872mapper tasks and 119877 reducer tasks In mapperstage multiple mapper tasks run in parallel and one mappertask will read an input split and perform a mapper function


Output file

Input split 0

Input split 1

Input split 2

Input split 3

Input split 4

Input split 5Mapper

MapperMapper

MapperMapper

Mapper

ReducerWorker node

Worker node


Intermediate

Intermediate

Intermediate

Intermediate

Intemediate

Intermediate

Worker node

Reducer

Worker node













































1198712 = 21198711 (2)


119881119898 =

1198731015840119898

sum

119894=2

0 119874119894 minus 119874119894minus1 = 1

1 119874119894 minus 119874119894minus1 gt 1

(3)


We have

1198781198981 = 11987111198731015840119898

1198781198982 = 1198712119881119898 = 21198711119881119898

(4)


119877119898 =

1 1198731015840119898 lt 2119881119898

2 1198731015840119898 ge 2119881119898

(5)






0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)



0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)







0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)



0 1 2 3 4 5 6 7 8 9 100

051

152

253

35


Net

wor

k tr

ansfe

r cos

t (By

tes)


times104






05

1

15

2

25

Stor

age c

ost (

Byte

s)


times104


5 Conclusions




Acknowledgments


References



























Volume 2014




Journal of











Journal of


Function Spaces






Algebra










Output file

Input split 0

Input split 1

Input split 2

Input split 3

Input split 4

Input split 5Mapper

MapperMapper

MapperMapper

Mapper

ReducerWorker node

Worker node


Intermediate

Intermediate

Intermediate

Intermediate

Intemediate

Intermediate

Worker node

Reducer

Worker node













































1198712 = 21198711 (2)


119881119898 =

1198731015840119898

sum

119894=2

0 119874119894 minus 119874119894minus1 = 1

1 119874119894 minus 119874119894minus1 gt 1

(3)


We have

1198781198981 = 11987111198731015840119898

1198781198982 = 1198712119881119898 = 21198711119881119898

(4)


119877119898 =

1 1198731015840119898 lt 2119881119898

2 1198731015840119898 ge 2119881119898

(5)






0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)



0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)







0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)



0 1 2 3 4 5 6 7 8 9 100

051

152

253

35


Net

wor

k tr

ansfe

r cos

t (By

tes)


times104






05

1

15

2

25

Stor

age c

ost (

Byte

s)


times104


5 Conclusions




Acknowledgments


References



























Volume 2014




Journal of











Journal of


Function Spaces






Algebra































1198712 = 21198711 (2)


119881119898 =

1198731015840119898

sum

119894=2

0 119874119894 minus 119874119894minus1 = 1

1 119874119894 minus 119874119894minus1 gt 1

(3)


We have

1198781198981 = 11987111198731015840119898

1198781198982 = 1198712119881119898 = 21198711119881119898

(4)


119877119898 =

1 1198731015840119898 lt 2119881119898

2 1198731015840119898 ge 2119881119898

(5)






0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)



0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)







0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)



0 1 2 3 4 5 6 7 8 9 100

051

152

253

35


Net

wor

k tr

ansfe

r cos

t (By

tes)


times104






05

1

15

2

25

Stor

age c

ost (

Byte

s)


times104


5 Conclusions




Acknowledgments


References



























Volume 2014




Journal of











Journal of


Function Spaces






Algebra










We have

1198781198981 = 11987111198731015840119898

1198781198982 = 1198712119881119898 = 21198711119881119898

(4)


119877119898 =

1 1198731015840119898 lt 2119881119898

2 1198731015840119898 ge 2119881119898

(5)






0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)



0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)







0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)



0 1 2 3 4 5 6 7 8 9 100

051

152

253

35


Net

wor

k tr

ansfe

r cos

t (By

tes)


times104






05

1

15

2

25

Stor

age c

ost (

Byte

s)


times104


5 Conclusions




Acknowledgments


References



























Volume 2014




Journal of











Journal of


Function Spaces






Algebra










0 1 2 3 4 5 6 7 8 9 100

500

1000

1500

2000

2500

3000


Tim

e (m

in)



0 1 2 3 4 5 6 7 8 9 100

051

152

253

35


Net

wor

k tr

ansfe

r cos

t (By

tes)


times104






05

1

15

2

25

Stor

age c

ost (

Byte

s)


times104


5 Conclusions




Acknowledgments


References



























Volume 2014




Journal of











Journal of


Function Spaces






Algebra
































Volume 2014




Journal of











Journal of


Function Spaces






Algebra









research article a replication-based mechanism …downloads.hindawi.com › journals › mpe ›...

Documents