
MR-COF: A Genetic MapReduce Configuration Optimization Framework

Chao Liu1,2, Deze Zeng1(✉), Hong Yao1, Chengyu Hu1, Xuesong Yan1, and Yuanyuan Fan1

1 Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan 430074, China
[email protected]
2 Services Computing Technology and System Lab and Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan 430074, China

Abstract. Hadoop/MapReduce has emerged as a de facto programming framework for exploiting cloud computing resources. Hadoop has many configuration parameters, some of which are crucial to the performance of MapReduce jobs. In practice, these parameters are usually set to default or inappropriate values, which severely limits system performance (e.g., execution time). It is therefore essential, but also challenging, to investigate how to tune these parameters automatically to optimize MapReduce job performance. In this paper, we propose an automatic MapReduce configuration optimization framework named MR-COF. By monitoring and analyzing runtime behavior, the framework adopts a cost-based performance prediction model that predicts MapReduce job performance. In addition, we design a genetic search algorithm that iteratively tunes the parameters to find the best configuration. Testbed-based experimental results show that the average MapReduce job performance increases by 35 % with MR-COF compared to the default configuration.

Keywords: MapReduce · Massive data processing · Parameter configuration · Performance optimization · Search algorithm

© Springer International Publishing Switzerland 2015. G. Wang et al. (Eds.): ICA3PP 2015, Part IV, LNCS 9531, pp. 344–357, 2015. DOI: 10.1007/978-3-319-27140-8_24

1 Introduction

In recent years, massive data processing applications (e.g., web indexing and searching; enterprise and scientific data processing) have become increasingly popular. Traditional parallel programming techniques are constrained by their development complexity, scalability, and flexibility; therefore, they cannot meet the growing requirements of large-scale data processing. To exploit bulk cloud computing resources, the MapReduce programming model [1] has emerged as a promising technology for big data processing. Hadoop [2] is an open source implementation of MapReduce characterized by its programming simplicity, scalability, and fault tolerance. Consequently, it has been widely studied in the academic and business communities. However, some recent studies show that Hadoop/MapReduce suffers from performance and cost-effectiveness problems, especially when a job occupies intensive hardware resources.


For example, Pavlo et al. [3] showed that a MapReduce program is 2 to 30 times slower than a parallel database program with the same functionality in the same medium-scale cluster environment.

Obviously, the values of the configuration parameters in Hadoop have a significant, even crucial, influence on job performance and system efficiency. Therefore, it is important to know how to adjust these parameters so as to improve MapReduce job performance [4, 5]. For example, consider the configuration parameter mapred.tasktracker.map.tasks.maximum, which controls the number of concurrent map tasks on each task node. Setting this parameter to a small value can result in low CPU utilization, while a large value may lead to resource competition and job performance degradation. Hadoop administrators and users normally use the default parameter settings or manually adjust the values of a few parameters based on their experience. However, there is no one-size-fits-all solution: the default settings are generally not the best for most MapReduce jobs, and manual adjustment tends to be inefficient or even error-prone. To tackle this issue, pioneering researchers and engineers have carried out preliminary research on the automatic optimization of Hadoop parameters from different angles. For example, Babu et al. [6] proposed a system-level code-rewriting approach that automatically adjusts Hadoop settings by adding a functional module. However, this approach has two main disadvantages. First, some parameters depend on the application characteristics and on the resources available in Hadoop, so the lack of an accurate cost model makes it hard to achieve optimized results at the system level. Second, modifying the underlying system code is complex, making the system difficult to manage and maintain effectively.

Furthermore, parameter tuning in Hadoop is time consuming because a large number of configuration parameters (over 100) must be configured. To solve this problem, we present in this paper a genetic MapReduce Configuration Optimization Framework (MR-COF) for massive data processing applications. MR-COF adopts a dynamic monitoring mechanism to profile the runtime behavior of MapReduce jobs. In addition, a cost-based performance prediction model is developed and incorporated. MR-COF then constantly adjusts the parameters through a heuristic search strategy to enhance MapReduce job performance in Hadoop.

The rest of this paper is organized as follows. Section 2 discusses work related to MapReduce performance optimization. The system architecture and key mechanisms of MR-COF for MapReduce parameter optimization are presented in Sect. 3. Section 4 describes the experimental environment and presents the performance evaluation. Finally, we conclude this work in Sect. 5.

2 Related Works

Because the basic implementation of the MapReduce model has many deficiencies, researchers have conducted many optimization studies from different perspectives to improve MapReduce job performance. In this section, we summarize some recent work on MapReduce optimization from three aspects: usability optimization, process optimization, and parameter configuration optimization.


MapReduce Usability Optimization: A number of techniques have been proposed to support SQL semantics and thus enhance the usability of MapReduce programming [7–10]. Pig Latin [7], designed by Yahoo, is a dataflow programming language on top of MapReduce that adopts declarative SQL-style query concepts to provide data manipulation primitives such as projection and join. Sawzall [8] is a scripting language used for Google MapReduce applications; it provides an output primitive, emit, which transmits data to an external aggregator (e.g., Sum, Average). Hive [9, 10] is an open source data warehousing solution developed by Facebook. Hive supports a SQL subset, provides complex types (e.g., maps, lists), and offers HiveQL, a declarative query language. Queries written in HiveQL can therefore be compiled into MapReduce jobs and run in Hadoop environments.

MapReduce Process Optimization: The basic MapReduce framework forces the output data of each map and reduce task to be written to local files, and the tasks of the next phase must read that data back from disk. This process can degrade performance across a series of consecutive MapReduce jobs. The performance optimization of the MapReduce process itself has therefore gained attention in the research community [11, 12]. Yang et al. [11] proposed the Map-Reduce-Merge model, which adds a Merge phase to MapReduce that can efficiently merge partitioned and sorted data from two different reducer outputs into one. The Map-Join-Reduce [12] system improves and extends the MapReduce runtime by adding a Join phase to support complex data analyses on large clusters while avoiding frequent checkpoints and exchanges of intermediate results.

MapReduce Parameter Configuration Optimization: It has been shown that MapReduce parameter configuration optimization is time consuming because the number of configuration parameters exceeds 100 [13]. Moreover, different parameters have different effects on the performance of massive data processing applications, and some of them are interdependent. In recent years, a number of studies have concentrated on optimizing MapReduce configuration parameters.

From the perspective of Hadoop job performance prediction, a variety of performance models have been proposed. For example, Shi et al. [14] proposed MRTuner, an overall MapReduce job optimization tool that uses a Production-Transmission-Consumption model to analyze the parallel execution of MapReduce tasks. MRONLINE [15] is an online tuning system that provides fine-grained control of Hadoop configuration parameters and supports different settings for different tasks. A regression-based model is proposed in [16]; it can predict the performance of massive data processing jobs running on large-scale Hadoop clusters through data sampling and job execution on a small number of nodes. Zhang et al. [17] proposed a MapReduce job performance model based on automatic resource allocation and deduction, applicable to estimating completion time under varied input data and cluster resources. Yigitbasi et al. [18] studied a Support Vector Regression (SVR) model for automatically tuning Hadoop cluster configuration parameters. In [19], the time cost model of a MapReduce job is represented as a weighted linear combination of a set of non-linear functions.

In terms of search schemes for finding optimal Hadoop configuration parameters, many parameter optimization search strategies for the Hadoop cloud have been studied.


Gunther [20] is a search-based automatic tuning tool that uses a heuristic algorithm to identify optimal Hadoop configuration parameters. Herodotou et al. [21] proposed Starfish, a cost-based self-tuning system that uses a subspace random search method to find approximately optimal parameter configurations through enumeration.

Different methods can also be applied to resource usage statistics for Hadoop jobs. To maximize MapReduce job performance while minimizing cost, a statistical signature generation model is proposed in [22], aiming to optimize MapReduce job resource provisioning in the cloud. The optimization method includes two components. First, an RS (Resource Set) Maximizer calculates the optimal configuration parameters to fully utilize the resources. Second, an RS Sizer determines the set of resources required to balance cost and performance. This method improves the provisioning capability of Hadoop jobs by accounting for the resource consumption of jobs. Wang et al. [23] proposed MRPerf, a simulator that captures setup information such as node and storage capacity, network topology, data layout, and the application's I/O characteristics; this information is used to predict application performance and improve environment settings in MapReduce. Verma et al. [24] proposed SimMR, another simulation environment for MapReduce clusters, which comprises three components: a trace generator, a simulator, and a scheduling policy.

3 Design of MR-COF

3.1 System Overview

The overall architecture of MR-COF is shown in Fig. 1. MR-COF is an automatic optimizer based on performance prediction that tunes MapReduce configuration parameters. The design of MR-COF mainly consists of three parts: the runtime monitoring and analysis module (MAM), the performance prediction module (PPM), and the configuration parameter optimization module (POM). MAM monitors and statistically analyzes the running information of a MapReduce job and writes the results to a profile. PPM is responsible for predicting the job performance under the current configuration parameters according to the MAM output file. Based on the estimated job completion time, POM then adjusts the configuration parameters using a genetic search algorithm.

The working process of the MR-COF system is as follows. First, the client submits a MapReduce job to the Hadoop environment through the command interface; the Hadoop Distributed File System (HDFS) persistently stores programs, input and output data, and configuration files. Second, MR-COF starts MAM and transmits its output information to PPM to estimate the job completion time. Third, POM iteratively searches for better configuration parameters until the termination condition is satisfied, and the result is sent back to the client, allowing the client to rerun the MapReduce job with the optimized configuration parameters.
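To make the final rerun step concrete, the sketch below shows how a client could resubmit a job with the parameter set returned by POM, using the old-style JobConf API of Hadoop 0.20.x (the version used in Sect. 4). The class name and the particular parameter values are illustrative assumptions, not output of MR-COF.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical client-side rerun: apply the configuration returned by POM
// and resubmit. Mapper/reducer classes are omitted; a real job would set
// them with conf.setMapperClass(...) and conf.setReducerClass(...).
public class OptimizedRerun {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OptimizedRerun.class);
        conf.setJobName("job-rerun-optimized");

        // Example values inside the Table 2 search space (illustrative only).
        conf.setInt("mapred.reduce.tasks", 20);
        conf.setInt("io.sort.factor", 90);
        conf.setInt("io.sort.mb", 250);
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.75f);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);   // blocks until the job completes
    }
}
```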


3.2 MapReduce Monitoring and Analysis

In MR-COF, the MAM module monitors and statistically analyzes the data flow information and the execution time of map and reduce tasks while a job is running. The data flow information consists of the sizes, in bytes, of the data generated by each processing phase during the execution of a MapReduce job; for example, an intermediate result created by a map task may flow to a reduce task as input data. The execution time information includes all the time spent in each phase of a map or reduce task during the execution of a MapReduce job. The statistical operation mainly computes the aggregate or average of the data flow and execution time information.

The MAM module integrates Btrace [25], a dynamic monitoring tool, to collect statistical information from the map and reduce tasks on each work node. MAM can generate an approximate monitoring and analysis profile from this feedback by controlling the Btrace proxy switch on each work node online and by sampling MapReduce tasks. The master node can then predict the MapReduce job performance and search for optimized configuration parameters. Details of the performance model and the parameter configuration optimization algorithm are given in Sects. 3.3 and 3.4.

The MAM process flow can be divided into five steps. (1) The Btrace script monitoring code is inserted into the MapReduce program on the master node.

Fig. 1. MR-COF system architecture (client jobs are submitted to Hadoop MapReduce/HDFS; MAM, PPM, and POM cooperate to produce the optimized configuration)


(2) The modified MapReduce program is distributed to each work node for processing. (3) The job begins to execute map tasks; the MAM module dynamically monitors and collects the data flow information and execution time of each phase of the map tasks and then aggregates that information. (4) The job begins to execute reduce tasks, and the MAM module similarly monitors, collects, and aggregates the information of each reduce task. (5) All the monitoring information is written to a profile and finally merged to generate the MapReduce job monitoring and analysis files.

Taking the dynamic monitoring of reduce tasks as an example, we describe how the MAM module inserts Btrace monitoring functions at the points where a job or task state changes (e.g., reduce task start and end times). The process can be divided into four steps. (1) The intermediate map output data must be copied before the reduce tasks are performed; therefore, the reduceTask_shuffle monitoring point is inserted before the reduceCopier() method. (2) The merge and sort operations are launched as soon as the full input data has been fetched; thus, the reduceTask_merge monitoring point is inserted after the copyPhase.complete() method in the ReduceTask class. (3) The system then executes the reduce.run() method after completing the sort operation, so the reduceTask_reducer monitoring point is inserted after the sortPhase.complete() method. (4) Finally, the output results are written back to the underlying distributed file system; therefore, the reduceTask_writeDFS monitoring point is inserted before the mapreduce.RecordWriter() method.
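As a concrete illustration of such a script, the Btrace probe below (a minimal sketch, not MR-COF's actual monitoring code) timestamps the entry and return of ReduceTask.run() in Hadoop 0.20.2; the reduceTask_shuffle, reduceTask_merge, reduceTask_reducer, and reduceTask_writeDFS points described above would hook the corresponding interior methods in the same way.

```java
import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

// Minimal BTrace sketch (assumed BTrace 1.x API): print wall-clock
// timestamps when a reduce task starts and finishes. MR-COF's real probes
// would instead target reduceCopier(), copyPhase.complete(), etc.
@BTrace
public class ReduceTaskProbe {
    @OnMethod(clazz = "org.apache.hadoop.mapred.ReduceTask",
              method = "run", location = @Location(Kind.ENTRY))
    public static void onReduceStart() {
        println(strcat("reduceTask start: ", str(timeMillis())));
    }

    @OnMethod(clazz = "org.apache.hadoop.mapred.ReduceTask",
              method = "run", location = @Location(Kind.RETURN))
    public static void onReduceEnd() {
        println(strcat("reduceTask end: ", str(timeMillis())));
    }
}
```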

3.3 Cost-Based Performance Prediction Model

In this section, we present the cost-based performance model used to estimate the execution time of a MapReduce job. First, we describe the basic parameters of the performance model (Table 1).

The basic parameters can be classified into three types: input and output (I/O) parameters, cluster configuration parameters, and program parameters. The I/O parameters consist of inputBytes, mapOutputBytes, combineOutputBytes, and outputBytes. The cluster configuration parameters include numNodes and chunkBytes. The program parameters contain numReducers.

Table 1. Performance model parameters

inputBytes: the number of bytes in the job input data
mapOutputBytes: the number of bytes in all map task output data
combineOutputBytes: the total number of bytes of all map task output data after applying the Combine function; if no Combine function is applied, this equals mapOutputBytes
outputBytes: the number of bytes in the job output data
numNodes: the number of nodes in the Hadoop environment
chunkBytes: the number of bytes in a data chunk
numReducers: the number of reduce tasks actually executed


Next, we explain the time cost of each phase of a MapReduce job and define the terms used to derive the cost-based performance model.

Startup Time: The startup time, denoted as Tstartup, is related to the CPU time and the disk and network I/O time of job execution, and depends on the computing environment. For example, when a user submits a MapReduce job, it first incurs the job startup cost TjobStartup. Following job initialization, the task allocation process starts and incurs the task startup cost TtaskStartup. In general, the data scale of massive data-processing applications is far larger than one chunk (chunkBytes); therefore, the startup time is usually negligible because the processing time dominates.

Job Processing Time: The processing overheads of the map phase and the reduce phase constitute the job processing time. The processing overhead of the map phase, denoted as Tmap, can be divided into five parts: TreadDFS, TMapper, Tsort, Tcombiner, and Tspill, referring respectively to the overhead of reading a byte from HDFS, executing the Mapper function on the byte just read, sorting a byte, executing the Combiner function on a byte, and spilling a byte to the local file system. The processing overhead of the reduce phase, denoted as Treduce, consists of four parts: Tshuffle, Tmerge, TReducer, and TwriteDFS, referring respectively to the overhead of reading a byte of the intermediate Mapper output transmitted over the network, sorting and merging a byte from the previous step, executing the Reducer function on a byte, and writing a byte of the resulting data to HDFS.

$$
\begin{aligned}
T ={}& T_{jobStartup} + T_{taskStartup} \cdot \frac{inputBytes/chunkBytes + numReducers}{numNodes} \\
&+ (T_{readDFS} + T_{Mapper}) \cdot \frac{inputBytes}{numNodes}
 + (T_{sort} + T_{combiner}) \cdot \frac{mapOutputBytes}{numNodes} \\
&+ T_{spill} \cdot \frac{combineOutputBytes}{numNodes}
 + (T_{shuffle} + T_{merge} + T_{Reducer}) \cdot \frac{mapOutputBytes}{numReducers} \\
&+ T_{writeDFS} \cdot \frac{outputBytes}{numReducers}
\end{aligned} \tag{1}
$$

The performance prediction of a MapReduce job can be represented as the total time from the job's beginning to the completion of job processing. According to the above analysis, the execution time of a MapReduce job includes TjobStartup, TtaskStartup, Tmap, and Treduce. Therefore, we propose the cost-based MapReduce performance prediction model shown in Eq. (1).

In Eq. (1), different terms imply different costs. TjobStartup and TtaskStartup involve CPU, disk, and network costs. TMapper, Tsort, Tcombiner, Tmerge, and TReducer mainly refer to CPU cost. Tspill and TwriteDFS are primarily disk I/O costs. Finally, Tshuffle covers the network I/O cost.
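Operationally, Eq. (1) is a weighted sum of byte counts divided by the available parallelism. The sketch below transcribes it directly in Java, assuming the per-byte phase costs have already been profiled by MAM; the class and field names mirror the paper's notation and are not MR-COF's actual code.

```java
// Direct transcription of Eq. (1). The t* fields are per-byte (or per-event,
// for the startup terms) costs measured by MAM for one job profile.
public class CostModel {
    double tJobStartup, tTaskStartup;                       // startup costs
    double tReadDFS, tMapper, tSort, tCombiner, tSpill;     // map-phase costs
    double tShuffle, tMerge, tReducer, tWriteDFS;           // reduce-phase costs

    double predict(double inputBytes, double mapOutputBytes,
                   double combineOutputBytes, double outputBytes,
                   double chunkBytes, int numNodes, int numReducers) {
        // number of map tasks (inputBytes/chunkBytes) plus reduce tasks
        double numTasks = inputBytes / chunkBytes + numReducers;
        return tJobStartup
             + tTaskStartup * numTasks / numNodes
             + (tReadDFS + tMapper) * inputBytes / numNodes
             + (tSort + tCombiner) * mapOutputBytes / numNodes
             + tSpill * combineOutputBytes / numNodes
             + (tShuffle + tMerge + tReducer) * mapOutputBytes / numReducers
             + tWriteDFS * outputBytes / numReducers;
    }
}
```

POM can then score any candidate configuration by recomputing the quantities that configuration affects (e.g., chunkBytes, numReducers) and calling predict().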


We use MRPerf [23], a lightweight Hadoop simulator, to support the establishment of a discrete-time simulation model of MapReduce job execution. The structure of the performance prediction module is shown in Fig. 2. The execution flow for job performance prediction is as follows. First, a MapReduce job with a fixed input data size and the default parameter configuration is actually executed; the PPM uses sampling to gather statistics on the time cost of each task processing phase by monitoring and analyzing the job. Second, the PPM is built from these statistics. Third, the virtual job profile of the same MapReduce job under a modified parameter configuration is deduced by combining the PPM with the MRPerf simulator. Finally, the time cost of the MapReduce job under the altered configuration is calculated.

3.4 Automatic Parameter Configuration Optimization Algorithm

In this section, we first give formal definitions of the parameter configuration set, the MapReduce job, and the optimization objective. Then, we propose the automatic parameter optimization algorithm implemented in POM.

Definition 1. Suppose that S is the parameter configuration set, which is composed of pairs of parameter names and attribute values. The value range of each attribute can be expressed as a one-dimensional vector $\overrightarrow{S[i]}$, whose length represents the number of possible values for the i-th parameter.

Definition 2. A MapReduce job J can be seen as a quadruple of the MapReduce program p, the data d to be processed, the resource set R of the running environment, and the specified system parameter configuration set S, i.e., J = <p, d, R, S>. In this paper, we assume that the data is sampled at a fixed size and the resources of the job's running environment are unchanged; therefore, d and R can both be viewed as constants.

Fig. 2. The PPM structure (a job profile with dataflow statistics and per-phase cost statistics feeds the cost-based performance model and the MRPerf simulator, which produce a virtual job profile for the optimized job configuration file)


Definition 3. For a given MapReduce program p, we aim to find a parameter configuration set Si, according to the specified search schema and restriction conditions, such that the job execution time T is approximately the shortest. The optimized configuration set is expressed as Sopt, i.e.,

$$S_{opt} = \operatorname*{arg\,min}_{S_i \in C} T_p$$

where C is the value space of the parameter configuration set.

Our optimization objective is to find an optimized parameter configuration from the finite vector space of parameter values so that the performance of the MapReduce job is approximately optimal. The genetic algorithm (GA) is a heuristic global search algorithm that simulates the process of biological evolution. With a proper fitness function, GA can effectively avoid falling into local optima; therefore, GA is widely used in combinatorial optimization problems [26]. In our MR-COF system, a search algorithm that optimizes MapReduce job parameter configurations is proposed based on GA and integrated into the POM. The algorithm is shown in Fig. 3.

The evaluation function of an individual Ci in the population, fitness(Ci), is defined as 1/JobCompletionTimei, and the distance() function calculates the difference between the average fitness of the current population and that of the previous generation. Empirically, we set T = 20 and μ = 0.05. The algorithm terminates when the convergence condition is satisfied. Because quicksort is applied to sort the N individuals by fitness in each iteration of the while loop, each sort costs O(N log N). With T iterations, the overall time complexity of the parameter configuration optimization algorithm is O(N · T · log N).

Fig. 3. Parameter configuration-optimization algorithm
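Since the algorithm itself appears only as Fig. 3 in the original layout, the sketch below reconstructs it from the surrounding description: an individual is a vector of indices into the per-parameter value lists of Table 2, fitness comes from the PPM as 1/JobCompletionTime, the population is re-sorted by fitness each generation (the O(N log N) step), and the search stops after T = 20 generations or when the change in average fitness (distance()) falls below μ. The population size, crossover operator, and mutation rate are our own illustrative assumptions.

```java
import java.util.*;

// Sketch of the POM search; Predictor stands in for the PPM.
public class GeneticSearch {
    interface Predictor { double completionTime(int[] genes); }

    static final int GENERATIONS = 20;   // T in the paper
    static final double MU = 0.05;       // convergence threshold (mu)
    static final Random RNG = new Random();

    static int[] search(int[] geneSizes, int popSize, Predictor ppm) {
        int[][] pop = new int[popSize][];
        for (int i = 0; i < popSize; i++) pop[i] = randomIndividual(geneSizes);
        double prevAvg = Double.NaN;
        for (int gen = 0; gen < GENERATIONS; gen++) {
            // sort best-first: ascending completion time = descending fitness
            Arrays.sort(pop, Comparator.comparingDouble(
                    (int[] g) -> ppm.completionTime(g)));
            double avg = 0;
            for (int[] g : pop) avg += 1.0 / ppm.completionTime(g);
            avg /= popSize;
            // distance() below mu (relative form; the paper leaves the
            // exact form of distance() to Fig. 3)
            if (!Double.isNaN(prevAvg) && Math.abs(avg - prevAvg) < MU * prevAvg)
                break;
            prevAvg = avg;
            // keep the better half, refill with crossover + mutation
            for (int i = popSize / 2; i < popSize; i++) {
                int[] a = pop[RNG.nextInt(popSize / 2)];
                int[] b = pop[RNG.nextInt(popSize / 2)];
                pop[i] = crossover(a, b);
                mutate(pop[i], geneSizes);
            }
        }
        return pop[0];   // best individual found
    }

    static int[] randomIndividual(int[] sizes) {
        int[] g = new int[sizes.length];
        for (int i = 0; i < g.length; i++) g[i] = RNG.nextInt(sizes[i]);
        return g;
    }

    static int[] crossover(int[] a, int[] b) {
        int cut = RNG.nextInt(a.length);            // single-point crossover
        int[] c = a.clone();
        System.arraycopy(b, cut, c, cut, b.length - cut);
        return c;
    }

    static void mutate(int[] g, int[] sizes) {
        if (RNG.nextDouble() < 0.1) {               // illustrative mutation rate
            int i = RNG.nextInt(g.length);
            g[i] = RNG.nextInt(sizes[i]);
        }
    }
}
```

Under the Table 2 encoding, for example, dfs.block.size contributes a gene with 8 possible values ([64, 512] in steps of 64) and mapred.reduce.tasks one with 10 ([5, 50] in steps of 5).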


4 Performance Evaluation

4.1 Experimental Environment

The experiments are run in an in-house cloud environment comprising eight homogeneous machine nodes. One node is the master node, deploying the NameNode and JobTracker; the other seven nodes are slaves, deploying the DataNode and TaskTracker. Each node has the same software and hardware settings, as follows.

Hardware Settings: The CPU is a 2 × Intel Xeon E5520 at 2.26 GHz with four cores and an 8 MB L3 cache per processor. The memory is 16 GB DDR3 with a 1066 MHz FSB frequency. The SAS disk has a 146 GB capacity. Finally, each node is equipped with 2 × 1000 M Ethernet cards.

Software Settings: The software installation includes RHEL 5.1 (kernel version 2.6.18-128.el5), Java 1.6.0_18, and Hadoop 0.20.2.

Selected Parameters: The 10 parameters tuned by the MR-COF optimizer are listed in Table 2. The other parameters are set to their default values.

4.2 Accuracy of Performance Prediction Model

To evaluate the prediction accuracy of MR-COF, we use the default configuration and compare the actual execution time with MR-COF's predicted time for three different MapReduce jobs while varying the size of the input data. We run each experiment three times and record the average time. The results for Sort, WordCount, and Grep jobs with 1 GB, 5 GB, and 10 GB of input data are shown in Fig. 4. In general, the relative differences between the predicted and actual execution times range from 6 % to 13 %, indicating that the accuracy is within an acceptable range.

Figure 5 compares the actual and predicted execution times for the Map and Reduce breakdown phases, respectively, using a WordCount MapReduce job with 1 GB of data as an example.

Table 2. Descriptions of the 10 selected parameters (default value; value range: step)

dfs.block.size: default 64 MB; range [64, 512], step 64
mapred.reduce.tasks: default 1; range [5, 50], step 5
mapred.tasktracker.map.tasks.maximum: default 2; range [2, 10], step 2
mapred.tasktracker.reduce.tasks.maximum: default 2; range [2, 10], step 2
io.sort.factor: default 10; range [10, 100], step 10
io.sort.mb: default 100; range [100, 300], step 50
io.sort.record.percent: default 0.05; range [0.05, 0.15], step 0.02
io.sort.spill.percent: default 0.8; range [0.2, 0.8], step 0.1
mapred.job.shuffle.input.buffer.percent: default 0.7; range [0.7, 0.8], step 0.01
mapred.job.shuffle.merge.percent: default 0.66; range [0.66, 0.8], step 0.01


As seen in Fig. 5(a), the predicted time is fairly close to the actual execution time in each Map breakdown phase, except for the Spill operation, whose predicted time differs significantly from the actual time; the main reason is that our cost-based MapReduce performance prediction model does not consider the actual disk I/O overhead. In Fig. 5(b), we can also observe that the difference between the predicted and actual execution times of the Reduce phases is negligible.

4.3 Performance of Configuration Parameter Optimization

To verify MR-COF's effect on job performance, we compare the execution times of Sort, WordCount, and Grep jobs under the default and optimized configurations. For example, with 10 GB of input data, the time spent on a Sort job with the optimized configuration decreases by 41 % compared to the default configuration (Fig. 6).

Table 3 shows the optimized parameter configuration results for the three jobs with 1 GB of input data; these configurations are found by the heuristic algorithm proposed in Sect. 3.4.

Fig. 4. Total execution times for three jobs ((a) Sort, (b) WordCount, (c) Grep) from the actual run and as predicted by MR-COF, for input data sizes of 1, 5, and 10 GB

Fig. 5. Map phase and Reduce phase execution time breakdown ((a) Map: readDFS, Mapper, Sort, Combine, Spill; (b) Reduce: Shuffle, Merge, Reducer, writeDFS) for a 1 GB WordCount MapReduce job from the actual run and as predicted by MR-COF


We can observe that the optimized parameter values for Grep differ from those for Sort and WordCount. The reason is that Grep is CPU-intensive, while Sort and WordCount are data-intensive; for example, Sort and WordCount perform more in-memory sorting than Grep. Therefore, Sort and WordCount use larger values than Grep for the sort-related parameters (i.e., io.sort.mb and io.sort.record.percent).

Fig. 6. Performance comparison of Sort, WordCount, and Grep jobs (execution time versus input data size of 1, 5, and 10 GB, default versus optimized configuration)

Table 3. Optimized parameter configuration comparison of Sort, WordCount, and Grep

dfs.block.size: Sort 256, WordCount 256, Grep 320
mapred.reduce.tasks: Sort 15, WordCount 20, Grep 40
mapred.tasktracker.map.tasks.maximum: Sort 4, WordCount 4, Grep 8
mapred.tasktracker.reduce.tasks.maximum: Sort 4, WordCount 4, Grep 6
io.sort.factor: Sort 80, WordCount 90, Grep 30
io.sort.mb: Sort 200, WordCount 250, Grep 150
io.sort.record.percent: Sort 0.15, WordCount 0.12, Grep 0.06
io.sort.spill.percent: Sort 0.8, WordCount 0.8, Grep 0.6
mapred.job.shuffle.input.buffer.percent: Sort 0.78, WordCount 0.75, Grep 0.74
mapred.job.shuffle.merge.percent: Sort 0.68, WordCount 0.68, Grep 0.66

5 Conclusions

In Hadoop, a large number of parameters can affect the performance of MapReduce jobs. In this paper, we presented MR-COF, a genetic MapReduce parameter configuration optimization framework. The framework comprises three main modules: the runtime monitoring and analysis module, the performance prediction module, and the configuration parameter optimization module. We proposed a cost-based job performance prediction model and designed a genetic parameter configuration optimization algorithm. We conducted extensive experiments using three types of massive data analysis applications: Sort, WordCount, and Grep. The experimental results show that MR-COF achieves good prediction accuracy and that the optimized configuration substantially improves job execution performance compared with the default configuration.


Acknowledgments. This paper is supported by the China National Natural Science Foundation under grants Nos. 61272470, 61305087, 61402425, 61440060, 41404076, and 61501412; the China Postdoctoral Science Foundation funded project under grant No. 2014M562086; the Key Projects of Hubei Provincial Natural Science Foundation under grant No. 2015CFA065; and the Fundamental Research Funds for the Central Universities, China University of Geosciences, Wuhan, under grant No. CUGL130233.

References

1. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
2. Dittrich, J., Quiané-Ruiz, J.A.: Efficient big data processing in Hadoop MapReduce. Proc. VLDB Endowment 5(12), 2014–2015 (2012)
3. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: The ACM SIGMOD International Conference on Management of Data, pp. 165–178. ACM Press (2009)
4. Liu, C., Jin, H., Jiang, W., Hai, L.: Research on performance optimization approach of data-intensive application with MapReduce. J. Wuhan Univ. Technol. 32(20), 36–41 (2010)
5. Jiang, D., Ooi, B.C., Shi, L., Wu, S.: The performance of MapReduce: an in-depth study. Proc. VLDB Endowment 3(1), 472–483 (2010)
6. Babu, S.: Towards automatic optimization of MapReduce programs. In: The 1st ACM Symposium on Cloud Computing, pp. 137–142. ACM Press (2010)
7. Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-foreign language for data processing. In: The ACM SIGMOD International Conference on Management of Data, pp. 1099–1110. ACM Press (2008)
8. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.: Interpreting the data: parallel analysis with Sawzall. Sci. Program. 13(4), 277–298 (2005)
9. Thusoo, A., Sarma, J.S., Jain, N., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endowment 2(2), 1626–1629 (2009)
10. Thusoo, A., Sarma, J.S., Jain, N., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive - a petabyte scale data warehouse using Hadoop. In: The 26th IEEE International Conference on Data Engineering, pp. 996–1005. IEEE Press (2010)
11. Yang, H., Dasdan, A., Hsiao, R.L., Parker, D.S.: Map-reduce-merge: simplified relational data processing on large clusters. In: The ACM SIGMOD International Conference on Management of Data, pp. 1029–1040. ACM Press (2008)
12. Jiang, D., Tung, A., Chen, G.: Map-Join-Reduce: toward scalable and efficient data analysis on large clusters. IEEE Trans. Knowl. Data Eng. 23, 1299–1311 (2011)
13. White, T.: Hadoop: The Definitive Guide. O'Reilly Media, Sebastopol (2010)
14. Shi, J., Zou, J., Lu, J., Cao, Z., Li, S., Wang, C.: MRTuner: a toolkit to enable holistic optimization for MapReduce jobs. Proc. VLDB Endowment 7(13), 1–12 (2014)
15. Li, M., Zeng, L., Meng, S., Tan, J., Zhang, L., Butt, A.R., Fuller, N.: MRONLINE: MapReduce online performance tuning. In: The 23rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 165–176. ACM Press (2014)
16. Tian, F., Chen, K.: Towards optimal resource provisioning for running MapReduce programs in public clouds. In: The IEEE International Conference on Cloud Computing, pp. 155–162. IEEE Press (2011)
17. Zhang, Z., Cherkasova, L., Loo, B.T.: Parameterizable benchmarking framework for designing a MapReduce performance model. Concurrency Comput. Pract. Experience 26(12), 2005–2026 (2014)
18. Yigitbasi, N., Willke, T.L., Liao, G., Epema, D.: Towards machine learning-based auto-tuning of MapReduce. In: The 21st IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, pp. 11–20. IEEE Press (2013)
19. Chen, K., Powers, J., Guo, S., Tian, F.: CRESP: towards optimal resource provisioning for MapReduce computing in public clouds. IEEE Trans. Parallel Distrib. Syst. 25(6), 1403–1412 (2014)
20. Liao, G., Datta, K., Willke, T.L.: Gunther: search-based auto-tuning of MapReduce. In: Wolf, F., Mohr, B., an Mey, D. (eds.) Euro-Par 2013. LNCS, vol. 8097, pp. 406–419. Springer, Heidelberg (2013)
21. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: a self-tuning system for big data analytics. In: The Conference on Innovative Data Systems Research, vol. 11, pp. 261–272. ACM Press (2011)
22. Kambatla, K., Pathak, A., Pucha, H.: Towards optimizing Hadoop provisioning in the cloud. In: The 1st USENIX Workshop on Hot Topics in Cloud Computing, pp. 156–172. ACM Press (2009)
23. Wang, G., Butt, A.R., Pandey, P., Gupta, K.: A simulation approach to evaluating design decisions in MapReduce setups. In: The IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, pp. 1–11. IEEE Press (2009)
24. Verma, A., Cherkasova, L., Campbell, R.H.: Play it again, SimMR! In: The IEEE International Conference on Cluster Computing, pp. 253–261. IEEE Press (2011)
25. BTrace: A Dynamic Instrumentation Tool for Java. http://kenai.com/projects/btrace
26. Srinivas, M., Patnaik, L.M.: Genetic algorithms: a survey. Computer 27(6), 17–26 (1994)
