On Traffic-Aware Partition and Aggregation in MapReduce for Big Data Applications

Huan Ke, Student Member, IEEE, Peng Li, Member, IEEE, Song Guo, Senior Member, IEEE, and Minyi Guo, Senior Member, IEEE

Abstract—The MapReduce programming model simplifies large-scale data processing on commodity clusters by exploiting parallel map tasks and reduce tasks. Although many efforts have been made to improve the performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which plays a critical role in performance enhancement. Traditionally, a hash function is used to partition intermediate data among reduce tasks, which, however, is not traffic-efficient because network topology and the data size associated with each key are not taken into consideration. In this paper, we study how to reduce the network traffic cost of a MapReduce job by designing a novel intermediate data partition scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator can reduce the merged traffic from multiple map tasks. A decomposition-based distributed algorithm is proposed to deal with the large-scale optimization problem for big data applications, and an online algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

1 INTRODUCTION

MapReduce [1] [2] [3] has emerged as the most popular computing framework for big data processing due to its simple programming model and automatic management of parallel execution. MapReduce and its open source implementation Hadoop [4] [5] have been adopted by leading companies, such as Yahoo!, Google and Facebook, for various big data applications, such as machine learning [6] [7] [8], bioinformatics [9] [10] [11], and cyber-security [12] [13].

MapReduce divides a computation into two main phases, namely map and reduce, which in turn are carried out by several map tasks and reduce tasks, respectively. In the map phase, map tasks are launched in parallel to convert the original input splits into intermediate data in the form of key/value pairs. These key/value pairs are stored on the local machine and organized into multiple data partitions, one per reduce task. In the reduce phase, each reduce task fetches its own share of data partitions from all map tasks to generate the final result. There is a shuffle step between the map and reduce phases. In this step, the data produced by the map phase are ordered, partitioned and transferred to the appropriate machines executing the reduce phase. The resulting all-to-all communication pattern from map tasks to reduce tasks generates a great volume of network traffic, imposing a serious constraint on the efficiency of data analytic applications. For example, with tens of thousands of machines, data shuffling accounts for 58.6% of the cross-pod traffic and amounts to over 200

H. Ke, P. Li and S. Guo are with the School of Computer Science and Engineering, the University of Aizu, Japan. Email: {m5172105, pengli, sguo}@u-aizu.ac.jp
M. Guo is with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. Email: [email protected]

petabytes in total in the analysis of SCOPE jobs [14]. For shuffle-heavy MapReduce tasks, the high traffic could incur considerable performance overhead, up to 30-40%, as shown in [15].

By default, intermediate data are shuffled according to a hash function [16] in Hadoop, which can lead to large network traffic because it ignores network topology and the data size associated with each key. As shown in Fig. 1, we consider a toy example with two map tasks and two reduce tasks, where the intermediate data of three keys K1, K2, and K3 are denoted by rectangle bars under each machine. If the hash function assigns the data of K1 and K3 to reducer 1, and K2 to reducer 2, a large amount of traffic will go through the top switch. To tackle this problem incurred by the traffic-oblivious partition scheme, we take both task locations and the data size associated with each key into account in this paper. By assigning keys with larger data sizes to reduce tasks closer to map tasks, network traffic can be significantly reduced. In the same example, if we assign K1 and K3 to reducer 2, and K2 to reducer 1, as shown in Fig. 1(b), the data transferred through the top switch will be significantly reduced.

To further reduce network traffic within a MapReduce job, we consider aggregating data with the same keys before sending them to remote reduce tasks. Although a similar function, called combiner [17], has already been adopted by Hadoop, it operates immediately after a map task, solely on the data that task generates, failing to exploit data aggregation opportunities among multiple tasks on different machines. As shown in the example of Fig. 2(a), in the traditional scheme, two map tasks individually send data of key K1 to the reduce task. If we aggregate the data with the same key before sending it over the top switch, as shown in Fig. 2(b), the network traffic will be reduced.


Fig. 1. Two MapReduce partition schemes: (a) traditional hash partition; (b) traffic-aware partition.
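To make the effect of the partition choice concrete, the following small sketch (with purely illustrative data volumes and per-unit transfer costs that are not taken from Fig. 1) evaluates the shuffle cost of a size-oblivious assignment and a traffic-aware one under the cost model used throughout the paper, i.e., data size times distance.

```python
# Toy comparison of a size-oblivious (hash-style) and a traffic-aware
# key-to-reducer assignment. Volumes and per-unit transfer costs are
# illustrative only; they are not the values behind Fig. 1.

data = {"map1": {"K1": 2, "K2": 9, "K3": 1},     # data[mapper][key] = volume
        "map2": {"K1": 10, "K2": 2, "K3": 8}}
cost = {"map1": {"red1": 1, "red2": 3},          # cost[mapper][reducer] per unit
        "map2": {"red1": 3, "red2": 1}}          # (cross-switch paths cost more)

def shuffle_cost(assignment):
    """Total traffic cost = sum over mappers and keys of volume times the
    per-unit cost to the reducer assigned to that key."""
    return sum(vol * cost[m][assignment[k]]
               for m, keys in data.items() for k, vol in keys.items())

hash_style    = {"K1": "red1", "K2": "red2", "K3": "red1"}   # ignores data sizes
traffic_aware = {"K1": "red2", "K2": "red1", "K3": "red2"}   # keeps large volumes local

print(shuffle_cost(hash_style), shuffle_cost(traffic_aware))   # 86 vs. 42
```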

In this paper, we jointly consider data partition and aggregation for a MapReduce job with the objective of minimizing the total network traffic. In particular, we propose a distributed algorithm for big data applications by decomposing the original large-scale problem into several subproblems that can be solved in parallel. Moreover, an online algorithm is designed to deal with data partition and aggregation in a dynamic manner. Finally, extensive simulation results demonstrate that our proposals can significantly reduce network traffic cost in both offline and online cases.

The rest of the paper is organized as follows. In Section 2, we review recent related work. Section 3 presents the system model. Section 4 develops a mixed-integer linear programming model for the network traffic minimization problem. Sections 5 and 6 propose the distributed and online algorithms, respectively, for this problem. The experiment results are discussed in Section 7. Finally, Section 8 concludes the paper.

2 RELATED WORK

Most existing work focuses on MapReduce performance improvement by optimizing its data transmission. Blanca et al. [18] have investigated the question of whether optimizing network usage can lead to better system performance, and found that high network utilization and low network congestion should be achieved simultaneously for a job with good performance.

Fig. 2. Two schemes of intermediate data transmission in the shuffle phase: (a) without global aggregation; (b) with global aggregation.

Palanisamy et al. [19] have presented Purlieus, a MapReduce resource allocation system, to enhance the performance of MapReduce jobs in the cloud by locating intermediate data on the local machines or close-by physical machines. This locality awareness reduces the network traffic generated in the shuffle phase in the cloud data center. However, little work has studied how to optimize the network performance of the shuffle process, which generates large amounts of data traffic in MapReduce jobs. A critical factor for the network performance in the shuffle phase is the intermediate data partition. The default scheme adopted by Hadoop is hash-based partition, which can yield unbalanced loads among reduce tasks due to its unawareness of the data size associated with each key. To overcome this shortcoming, Ibrahim et al. [20] have developed a fairness-aware key partition approach that keeps track of the distribution of intermediate keys' frequencies and guarantees a fair distribution among reduce tasks. Meanwhile, Liya et al. [21] have designed an algorithm to schedule operations based on the key distribution of intermediate key/value pairs to improve the load balance. Lars et al. [22] have proposed and evaluated two effective load balancing approaches to data skew handling for MapReduce-based entity resolution. Unfortunately, all of the above work focuses on load balance at reduce tasks, ignoring the network traffic during the shuffle phase.

In addition to data partition, many efforts have been made on local aggregation, in-mapper combining and in-network aggregation to reduce network traffic within MapReduce jobs.


Condie et al. [23] have introduced a combiner function that reduces the amount of data to be shuffled and merged to reduce tasks. Lin and Dyer [24] have proposed an in-mapper combining scheme by exploiting the fact that mappers can preserve state across the processing of multiple input key/value pairs and defer emission of intermediate data until all input records have been processed. Both proposals are constrained to a single map task, ignoring the data aggregation opportunities across multiple map tasks. Costa et al. [25] have proposed a MapReduce-like system to decrease the traffic by pushing aggregation from the edge into the network. However, it can only be applied to network topologies in which servers are directly linked to other servers, which is of limited practical use.

Different from existing work, we investigate network traffic reduction within MapReduce jobs by jointly exploiting traffic-aware intermediate data partition and data aggregation among multiple map tasks.

3 SYSTEM MODEL

MapReduce is a programming model based on two primitives: a map function and a reduce function. The former processes key/value pairs 〈k, v〉 and produces a set of intermediate key/value pairs 〈k′, v′〉. Intermediate key/value pairs are merged and sorted based on the intermediate key k′ and provided as input to the reduce function. A MapReduce job is executed over a distributed system composed of a master and a set of workers. The input is divided into chunks that are assigned to map tasks. The master schedules map tasks on the workers, taking data locality into account. The output of the map tasks is divided into as many partitions as the number of reducers of the job. Entries with the same intermediate key must be assigned to the same partition to guarantee the correctness of the execution. All the intermediate key/value pairs of a given partition are sorted and sent to the worker running the corresponding reduce task. The default scheduling of reduce tasks does not take any data locality constraint into consideration. As a result, the amount of data that has to be transferred through the network in the shuffle process may be significant.
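As a concrete illustration of the two primitives, the following minimal in-memory sketch uses word count (the same application used later in Section 5.1); it is only a toy model of the map, shuffle-by-key, and reduce flow, not Hadoop code.

```python
from collections import defaultdict

def map_fn(_, line):
    """map: <k, v> -> list of intermediate <k', v'> pairs."""
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    """reduce: <k', [v']> -> final <k', result> pair."""
    return key, sum(values)

def run_job(records):
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):          # map phase
            groups[k2].append(v2)            # shuffle: group by intermediate key
    return [reduce_fn(k, vs) for k, vs in groups.items()]  # reduce phase

print(run_job([(0, "big data map reduce"), (1, "map reduce")]))
```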

In this paper, we consider a typical MapReduce job on a large cluster consisting of a set N of machines. We let d_{xy} denote the distance between two machines x and y, which represents the cost of delivering a unit of data. When the job is executed, two types of tasks, i.e., map and reduce, are created. The sets of map and reduce tasks are denoted by M and R, respectively, and they are already placed on machines. The input data are divided into independent chunks that are processed by map tasks in parallel. The generated intermediate results, in the form of key/value pairs, may be shuffled and sorted by the framework, and are then fetched by reduce tasks to produce the final results. We let P denote the set of keys contained in the intermediate results, and m_i^p denote the data volume of key/value pairs with key p ∈ P generated by mapper i ∈ M.

A set of δ aggregators are available to process the intermediate results before they are sent to reducers. These aggregators can be placed on any machine, and one per machine is enough for data aggregation if adopted. The data reduction ratio of an aggregator is denoted by α, which can be obtained via profiling before job execution.

The cost of delivering a certain amount of traffic over a network link is evaluated by the product of data size and link distance. Our objective in this paper is to minimize the total network traffic cost of a MapReduce job by jointly considering aggregator placement and intermediate data partition. All symbols and variables used in this paper are summarized in Table 1.

TABLE 1
Notations and Variables

Notation       Description
N              a set of physical machines
d_{xy}         distance between two machines x and y
M              a set of map tasks in the map layer
R              a set of reduce tasks in the reduce layer
A              a set of nodes in the aggregation layer
P              a set of intermediate keys
A_i            a set of neighbors of mapper i ∈ M
δ              maximum number of aggregators
m_i^p          data volume of key p ∈ P generated by mapper i ∈ M
φ(u)           the machine containing node u
x_{ij}^p       binary variable denoting whether mapper i ∈ M sends data of key p ∈ P to node j ∈ A
f_{ij}^p       traffic of key p ∈ P from mapper i ∈ M to node j ∈ A
I_j^p          input data of key p ∈ P on node j ∈ A
M_j            a set of neighboring nodes of j ∈ A in the map layer
O_j^p          output data of key p ∈ P on node j ∈ A
α              data reduction ratio of an aggregator
α_j            data reduction ratio of node j ∈ A
z_j            binary variable indicating whether an aggregator is placed on machine j ∈ N
y_k^p          binary variable denoting whether data of key p ∈ P are processed by reducer k ∈ R
g_{jk}^p       network traffic of key p ∈ P from node j ∈ A to reducer k ∈ R
z_j^p          an auxiliary variable
ν_j^p          Lagrangian multiplier
m_i^p(t)       value of m_i^p at time slot t
α_j(t)         value of α_j at time slot t
Ψ_{jj′}        cost of migrating an aggregator from machine j to j′
Φ_{kk′}(·)     cost of migrating intermediate data from reducer k to k′
C_M(t)         total migration cost at time slot t

4 PROBLEM FORMULATION

In this section, we formulate the network traffic minimization problem. To facilitate our analysis, we construct an auxiliary graph with a three-layer structure, as shown in Fig. 3. The given placement of mappers and reducers applies in the map layer and the reduce layer, respectively. In the aggregation layer, we create a potential aggregator at each machine, which can aggregate data from all mappers. Since a single potential aggregator is sufficient at each machine, we also use N to denote the set of all potential aggregators. In addition, we create a shadow node for each mapper on its residential machine.


Fig. 3. Three-layer model for the network traffic minimization problem.

In contrast with potential aggregators, each shadow node can receive data only from its corresponding mapper on the same machine. It mimics the case in which the generated intermediate results are delivered to a reducer directly without going through any aggregator. All nodes in the aggregation layer are maintained in set A. Finally, the output data of the aggregation layer are sent to the reduce layer. Each edge (u, v) in the auxiliary graph is associated with a weight d_{φ(u)φ(v)}, where φ(u) denotes the machine containing node u.
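To make the construction concrete, the sketch below builds the aggregation layer for a small hypothetical instance: one potential aggregator per machine plus one shadow node per mapper, with every edge weighted by the distance between the machines hosting its endpoints. The machine names, task placement, and distances are illustrative assumptions, not values from the paper.

```python
# Build the aggregation layer of the auxiliary graph described above: one
# potential aggregator per machine and one shadow node per mapper. Machine
# names, task placement and distances are illustrative assumptions.

machines = ["n1", "n2", "n3"]
mapper_machine = {"m1": "n1", "m2": "n1", "m3": "n2"}    # phi(.) for mappers
reducer_machine = {"r1": "n2", "r2": "n3"}               # phi(.) for reducers
d = {("n1", "n2"): 2, ("n1", "n3"): 4, ("n2", "n3"): 2}  # inter-machine distance

def dist(a, b):
    """Distance between two machines (0 on the same machine)."""
    return 0 if a == b else d.get((a, b), d.get((b, a)))

# Aggregation-layer nodes, each mapped to the machine hosting it.
agg_nodes = {f"agg@{n}": n for n in machines}                              # potential aggregators
agg_nodes.update({f"shadow@{m}": mm for m, mm in mapper_machine.items()})  # shadow nodes

# Weighted edges: mapper -> aggregation node, and aggregation node -> reducer.
edges = {}
for mpr, m_mach in mapper_machine.items():
    for node, n_mach in agg_nodes.items():
        # a shadow node accepts traffic only from its own mapper
        if node.startswith("shadow@") and node != f"shadow@{mpr}":
            continue
        edges[(mpr, node)] = dist(m_mach, n_mach)        # weight d_{phi(u)phi(v)}
for node, n_mach in agg_nodes.items():
    for red, r_mach in reducer_machine.items():
        edges[(node, red)] = dist(n_mach, r_mach)

print(len(agg_nodes), "aggregation-layer nodes;", len(edges), "weighted edges")
```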

To formulate the traffic minimization problem, we first consider the data forwarding between the map layer and the aggregation layer. We define a binary variable x_{ij}^p as follows:

  x_{ij}^p = 1, if mapper i ∈ M sends data of key p ∈ P to node j ∈ A; 0, otherwise.

Since all data generated in the map layer should be sent to nodes in the aggregation layer, we have the following constraint for x_{ij}^p:

  ∑_{j∈A_i} x_{ij}^p = 1, ∀i ∈ M, p ∈ P,    (1)

where A_i denotes the set of neighbors of mapper i in the aggregation layer.

We let f_{ij}^p denote the traffic of key p ∈ P from mapper i ∈ M to node j ∈ A, which can be calculated by:

  f_{ij}^p = x_{ij}^p m_i^p, ∀i ∈ M, j ∈ A_i, p ∈ P.    (2)

The input data of node j ∈ A can be calculated by summing up all incoming traffic, i.e.,

  I_j^p = ∑_{i∈M_j} f_{ij}^p, ∀j ∈ A, p ∈ P,    (3)

where M_j denotes the set of j's neighbors in the map layer. The corresponding output data of node j ∈ A are:

  O_j^p = α_j I_j^p, ∀j ∈ A, p ∈ P,    (4)

where α_j = α if node j is a potential aggregator; otherwise, i.e., if node j is a shadow node, α_j = 1.

We further define a binary variable z_j for aggregator placement, i.e.,

  z_j = 1, if a potential aggregator j ∈ N is activated for data aggregation; 0, otherwise.

Since the total number of aggregators is constrained by δ, we have:

  ∑_{j∈N} z_j ≤ δ.    (5)

The relationship between x_{ij}^p and z_j can be represented by:

  x_{ij}^p ≤ z_j, ∀j ∈ N, i ∈ M_j, p ∈ P.    (6)

In other words, if a potential aggregator j ∈ N is not activated for data aggregation, i.e., z_j = 0, no data should be forwarded to it, i.e., x_{ij}^p = 0.

Finally, we define a binary variable y_k^p to describe the intermediate data partition at reducers, i.e.,

  y_k^p = 1, if data of key p ∈ P are processed by reducer k ∈ R; 0, otherwise.

Since the intermediate data with the same key will be processed by a single reducer, we have the constraint:

  ∑_{k∈R} y_k^p = 1, ∀p ∈ P.    (7)

The network traffic of key p ∈ P from node j ∈ A to reducer k ∈ R can be calculated by:

  g_{jk}^p = O_j^p y_k^p, ∀j ∈ A, k ∈ R, p ∈ P.    (8)

With the objective of minimizing the total cost of network traffic within the MapReduce job, the problem can be formulated as:

  min ∑_{p∈P} ( ∑_{i∈M} ∑_{j∈A_i} f_{ij}^p d_{ij} + ∑_{j∈A} ∑_{k∈R} g_{jk}^p d_{jk} )
  subject to: (1)-(8).

Note that the formulation above is a mixed-integer nonlinear programming (MINLP) problem. By applying a linearization technique, we transform it into a mixed-integer linear programming (MILP) problem that can be solved by existing mathematical tools. Specifically, we replace the nonlinear constraint (8) with the following linear ones:

  0 ≤ g_{jk}^p ≤ O_j^p, ∀j ∈ A, k ∈ R, p ∈ P,    (9)

  O_j^p − (1 − y_k^p) Ō_j^p ≤ g_{jk}^p ≤ O_j^p, ∀j ∈ A, k ∈ R, p ∈ P,    (10)


where the constant Ō_j^p = α_j ∑_{i∈M_j} m_i^p is an upper bound of O_j^p. The MILP formulation after linearization is:

  min ∑_{p∈P} ( ∑_{i∈M} ∑_{j∈A_i} f_{ij}^p d_{ij} + ∑_{j∈A} ∑_{k∈R} g_{jk}^p d_{jk} )
  subject to: (1)-(7), (9), and (10).
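For moderate instance sizes, the MILP above can be written almost verbatim with an off-the-shelf modeling layer. The sketch below uses the open-source PuLP package with its bundled CBC solver and toy input data (both are our assumptions; the paper only mentions solvers such as CPLEX); constraint labels in the comments refer to (1)-(10).

```python
# A compact sketch of the linearized MILP, written with the open-source PuLP
# modeling package and its bundled CBC solver. All input data are toy values.
import pulp

M = ["m1", "m2"]                    # mappers
R = ["r1", "r2"]                    # reducers
N = ["n1", "n2"]                    # machines = potential aggregators
P = ["k1", "k2"]                    # intermediate keys
A = N + [f"s_{i}" for i in M]       # aggregation layer: aggregators + shadow nodes
Ai = {i: N + [f"s_{i}"] for i in M}                 # neighbors of mapper i
Mj = {j: (M if j in N else [j[2:]]) for j in A}     # mappers feeding node j
m = {("m1", "k1"): 10, ("m1", "k2"): 2, ("m2", "k1"): 1, ("m2", "k2"): 9}
d_map = {(i, j): (0 if j == f"s_{i}" else 1) for i in M for j in A}   # d_ij
d_red = {(j, k): 2 for j in A for k in R}                             # d_jk
alpha = {j: (0.5 if j in N else 1.0) for j in A}    # shadow nodes do not aggregate
delta = 1                                           # max number of aggregators

x = pulp.LpVariable.dicts("x", [(i, j, p) for i in M for j in Ai[i] for p in P], cat="Binary")
y = pulp.LpVariable.dicts("y", [(p, k) for p in P for k in R], cat="Binary")
z = pulp.LpVariable.dicts("z", N, cat="Binary")
g = pulp.LpVariable.dicts("g", [(j, k, p) for j in A for k in R for p in P], lowBound=0)

# f_ij^p = x_ij^p * m_i^p (constraint (2)) is substituted directly; O_j^p is a
# linear expression in x (constraints (3)-(4)), with constant upper bound Obar.
O = {(j, p): alpha[j] * pulp.lpSum(x[(i, j, p)] * m[(i, p)] for i in Mj[j])
     for j in A for p in P}
Obar = {(j, p): alpha[j] * sum(m[(i, p)] for i in Mj[j]) for j in A for p in P}

prob = pulp.LpProblem("traffic_cost", pulp.LpMinimize)
prob += (pulp.lpSum(x[(i, j, p)] * m[(i, p)] * d_map[(i, j)]
                    for i in M for j in Ai[i] for p in P)
         + pulp.lpSum(g[(j, k, p)] * d_red[(j, k)] for j in A for k in R for p in P))
for i in M:
    for p in P:
        prob += pulp.lpSum(x[(i, j, p)] for j in Ai[i]) == 1            # (1)
prob += pulp.lpSum(z[j] for j in N) <= delta                            # (5)
for j in N:
    for i in M:
        for p in P:
            prob += x[(i, j, p)] <= z[j]                                # (6)
for p in P:
    prob += pulp.lpSum(y[(p, k)] for k in R) == 1                       # (7)
for j in A:
    for k in R:
        for p in P:
            prob += g[(j, k, p)] <= O[(j, p)]                                   # (9)
            prob += g[(j, k, p)] >= O[(j, p)] - (1 - y[(p, k)]) * Obar[(j, p)]  # (10)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], pulp.value(prob.objective))
```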

Theorem 1. The traffic-aware partition and aggregation problem is NP-hard.

Proof: To prove the NP-hardness of our network traffic optimization problem, we prove the NP-completeness of its decision version by reducing the set cover problem to it in polynomial time.

The set cover problem: given a set U = {x_1, x_2, . . . , x_n}, a collection of m subsets S = {S_1, S_2, . . . , S_m} with S_j ⊆ U, 1 ≤ j ≤ m, and an integer K, the set cover problem seeks a collection C such that |C| ≤ K and ∪_{i∈C} S_i = U.

Fig. 4. A graph instance.

For each x_i ∈ U, we create a mapper M_i that generates only one key/value pair. All key/value pairs will be sent to a single reducer whose distance from each mapper is more than 2. For each subset S_j, we create a potential aggregator A_j with distance 1 to the reducer. If x_i ∈ S_j, we set the distance between M_i and A_j to 1; otherwise, their distance is greater than 1. The aggregation ratio is defined to be 1. The constructed instance of our problem is illustrated in Fig. 4. Given K aggregators, we look for a placement such that the total traffic cost is no greater than 2n. It is easy to see that a solution of the set cover problem generates a solution of our problem with cost 2n. Conversely, when we have a solution of our problem with cost 2n, each mapper must send its result to an aggregator at distance 1, which forms a solution of the corresponding set cover problem.

5 DISTRIBUTED ALGORITHM DESIGN

The problem above can be solved by standard techniques such as branch-and-bound, and by fast off-the-shelf solvers, e.g., CPLEX, for moderate-sized input. An additional challenge arises in dealing with MapReduce jobs for big data. In such a job, there are hundreds or even thousands of keys, each of which is associated with a set of variables (e.g., x_{ij}^p and y_k^p) and constraints (e.g., (1) and (7)) in our formulation, leading to a large-scale optimization problem that can hardly be handled by existing algorithms and solvers in practice.

In this section, we develop a distributed algorithm to solve the problem on multiple machines in a parallel manner. Our basic idea is to decompose the original large-scale problem into several distributively solvable subproblems that are coordinated by a high-level master problem. To achieve this objective, we first introduce an auxiliary variable z_j^p such that our problem can be equivalently formulated as:

  min ∑_{p∈P} ( ∑_{i∈M} ∑_{j∈A_i} f_{ij}^p d_{ij} + ∑_{j∈A} ∑_{k∈R} g_{jk}^p d_{jk} )
  subject to: x_{ij}^p ≤ z_j^p, ∀j ∈ N, i ∈ M_j, p ∈ P,    (11)
              z_j^p = z_j, ∀j ∈ N, p ∈ P,    (12)
              (1)-(5), (7), (9), and (10).

The corresponding Lagrangian is as follows:

  L(ν) = ∑_{p∈P} C^p + ∑_{j∈N} ∑_{p∈P} ν_j^p (z_j − z_j^p)
       = ∑_{p∈P} C^p + ∑_{j∈N} ∑_{p∈P} ν_j^p z_j − ∑_{j∈N} ∑_{p∈P} ν_j^p z_j^p
       = ∑_{p∈P} ( C^p − ∑_{j∈N} ν_j^p z_j^p ) + ∑_{j∈N} ∑_{p∈P} ν_j^p z_j,    (13)

where the ν_j^p are Lagrangian multipliers and C^p is given as

  C^p = ∑_{i∈M} ∑_{j∈A_i} f_{ij}^p d_{ij} + ∑_{j∈A} ∑_{k∈R} g_{jk}^p d_{jk}.

Given ν_j^p, the dual decomposition results in two sets of subproblems: intermediate data partition and aggregator placement. The subproblem of data partition for each key p ∈ P is:

  SUB_DP: min ( C^p − ∑_{j∈N} ν_j^p z_j^p )
  subject to: (1)-(4), (7), (9), (10), and (11).

These subproblems for different keys can be solved on multiple machines in a distributed, parallel manner. The subproblem of aggregator placement can be simply written as:

  SUB_AP: min ∑_{j∈N} ∑_{p∈P} ν_j^p z_j   subject to: (5).

The values of ν_j^p are updated by solving the following master problem:

  min L(ν) = ∑_{p∈P} C^p + ∑_{j∈N} ∑_{p∈P} ν_j^p z_j − ∑_{j∈N} ∑_{p∈P} ν_j^p z_j^p
  subject to: ν_j^p ≥ 0, ∀j ∈ A, p ∈ P,    (14)

where C^p, z_j^p and z_j are the optimal solutions returned by the subproblems.


Algorithm 1 Distributed Algorithm
1: set t = 1, and ν_j^p (j ∈ A, p ∈ P) to arbitrary nonnegative values;
2: for t < T do
3:   distributively solve the subproblems SUB_DP and SUB_AP on multiple machines in a parallel manner;
4:   update the values of ν_j^p with the gradient method (15), and send the results to all subproblems;
5:   set t = t + 1;
6: end for

Since the objective function of the master problem is differentiable, it can be solved by the following gradient method:

  ν_j^p(t+1) = [ ν_j^p(t) + ξ ( z_j(ν_j^p(t)) − z_j^p(ν_j^p(t)) ) ]^+,    (15)

where t is the iteration index, ξ is a positive step size, and [·]^+ denotes the projection onto the nonnegative orthant.

In summary, the resulting distributed procedure is given in Algorithm 1.
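Structurally, Algorithm 1 is a standard dual-decomposition loop. The sketch below spells out only the multiplier update (15); solve_sub_dp and solve_sub_ap are dummy placeholders standing in for the per-key SUB_DP problems and the SUB_AP problem (in practice each SUB_DP call would be dispatched to a different machine). It is a structural sketch under those assumptions, not the authors' implementation.

```python
# Structural sketch of Algorithm 1 (dual decomposition with the subgradient
# update (15)). solve_sub_dp and solve_sub_ap are dummy placeholders for the
# per-key SUB_DP problems and the SUB_AP problem.

def solve_sub_dp(p, nu_p):
    """Placeholder for SUB_DP of key p: returns (subproblem cost, z_j^p values)."""
    return 0.0, {j: 0.0 for j in nu_p}          # dummy: no aggregator requested

def solve_sub_ap(nu):
    """Placeholder for SUB_AP: choose at most delta aggregators minimizing
    sum_j (sum_p nu[j][p]) * z_j."""
    return {j: 0.0 for j in nu}                 # dummy: activate no aggregator

def distributed_algorithm(keys, nodes, T=100, xi=0.01):
    nu = {j: {p: 0.0 for p in keys} for j in nodes}     # Lagrangian multipliers
    for _ in range(T):
        zp = {}
        for p in keys:                                   # in practice: in parallel
            _, zp[p] = solve_sub_dp(p, {j: nu[j][p] for j in nodes})
        z = solve_sub_ap(nu)
        for j in nodes:                                  # subgradient step (15)
            for p in keys:
                nu[j][p] = max(0.0, nu[j][p] + xi * (z[j] - zp[p][j]))
    return nu

print(distributed_algorithm(["k1", "k2"], ["n1", "n2"], T=3))
```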

5.1 Network Traffic Traces

In this section, we verify that our distributed algorithm can be applied in practice using real traces on a cluster consisting of 5 virtual machines, each with 1 GB memory and a 2 GHz CPU. Our network topology is based on a three-tier architecture: an access tier, an aggregation tier and a core tier (Fig. 6). The access tier is made up of cost-effective Ethernet switches connecting rack VMs. The access switches are connected via Ethernet to a set of aggregation switches, which in turn are connected to a layer of core switches. An inter-rack link is the most contentious resource, as all the VMs hosted on a rack transfer data across the link to the VMs on other racks. Our VMs are distributed in three different racks, and the map and reduce tasks are scheduled as in Fig. 6. For example, rack 1 consists of nodes 1 and 2; mappers 1 and 2 are scheduled on node 1 and reducer 1 is scheduled on node 2. The intermediate data forwarded between mappers and reducers must be transferred across the network. The hop distances between mappers and reducers are shown in Fig. 6; e.g., mapper 1 and reducer 2 have a hop distance of 6.

We tested the real network traffic cost in Hadoop using a real data source from the latest dump files in Wikimedia (http://dumps.wikimedia.org/enwiki/latest/). In the meantime, we executed our distributed algorithm using the same data source for comparison. Since our distributed algorithm relies on a known aggregation ratio α, we conducted experiments to evaluate it in the Hadoop environment. Fig. 5 shows the parameter α for different input scales. It turns out to be stable as the input size increases, and thus we use the average aggregation ratio 0.35 for our trace.
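One simple way to profile α before launching the job, assuming a summation-style combine function as in WordCount, is to run the combiner over a sample of map output and take the size ratio. The helper below is a hypothetical illustration of such a measurement, not the procedure actually used to produce Fig. 5.

```python
from collections import defaultdict

def profile_alpha(sample_pairs):
    """Estimate alpha = (size after combining) / (size before combining) on a
    sample of intermediate (key, value) pairs, using per-key summation as the
    combine function (as in WordCount). The pair count is used as a size proxy."""
    combined = defaultdict(int)
    for key, value in sample_pairs:
        combined[key] += value
    return len(combined) / max(1, len(sample_pairs))

sample = [("data", 1), ("map", 1), ("data", 1), ("reduce", 1), ("data", 1)]
print(profile_alpha(sample))   # 3 distinct keys out of 5 pairs -> 0.6
```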

To evaluate the experiment performance, we chose the WordCount application in Hadoop.

Fig. 5. Data reduction ratio (%) versus input data size.

First of all, we tested inputs of 213.44 MB, 213.40 MB, 213.44 MB, 213.41 MB and 213.42 MB for five map tasks to generate the corresponding outputs, which turn out to be 174.51 MB, 177.92 MB, 176.21 MB, 177.17 MB and 176.19 MB, respectively. Based on these outputs, the optimal solution is to place an aggregator on node 1 and to assign intermediate data according to the traffic-aware partition scheme. Since mappers 1 and 2 are scheduled on node 1, their outputs can be aggregated before being forwarded to reducers. We list the sizes of the outputs after aggregation and the final intermediate data distribution between reducers in Table 2. For example, the aggregated data size on node 1 is 139.66 MB, of which 81.17 MB is destined for reducer 1 and 58.49 MB for reducer 2.

Fig. 6. A small example.

The data size and hop distance for all intermediate data transfers obtained in the optimal solution are shown in Fig. 6 and Table 2. Finally, we get the network traffic cost as follows:

  81.17 × 2 + 58.49 × 6 + 96.17 × 4 + 80.04 × 6 + 98.23 × 6 + 78.94 × 2 + 94.17 × 6 + 82.02 × 0 = 2690.48.
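The same figure can be reproduced directly from Table 2 and the hop distances of Fig. 6; the short check below multiplies each transferred volume by its hop count, with the volume-to-hop pairing inferred from the expression above.

```python
# Reproduce the traffic cost above from the volumes in Table 2 (MB) and the
# hop distances of Fig. 6. The pairing of volumes with hop counts is inferred
# from the expression above (per node, reducer 1 followed by reducer 2).
transfers = [
    (81.17, 2), (58.49, 6),   # node 1 (aggregated output) -> reducer 1, reducer 2
    (96.17, 4), (80.04, 6),   # node 3 (mapper 3)          -> reducer 1, reducer 2
    (98.23, 6), (78.94, 2),   # node 4 (mapper 4)          -> reducer 1, reducer 2
    (94.17, 6), (82.02, 0),   # node 5 (mapper 5)          -> reducer 1, reducer 2
]
print(round(sum(volume * hops for volume, hops in transfers), 2))   # 2690.48
```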

Since our aggregator is placed on node 1, the outputs of mapper 1 and mapper 2 are merged into 139.66 MB.


TABLE 2
Real Cost vs. Ideal Cost

                     Node 1                 Node 2   Node 3      Node 4      Node 5
                     mapper 1    mapper 2   —        mapper 3    mapper 4    mapper 5
before aggregation   174.51 MB   177.92 MB  —        176.21 MB   177.17 MB   176.19 MB
after aggregation    139.66 MB              —        176.21 MB   177.17 MB   176.19 MB
reducer 1            81.17 MB               —        96.17 MB    98.23 MB    94.17 MB
reducer 2            58.49 MB               —        80.04 MB    78.94 MB    82.02 MB
real cost            2690.48
ideal cost           2673.49

The intermediate data from all mappers are transferred according to the traffic-aware partition scheme. We obtain a total network cost of 2690.48 in the real Hadoop environment, while the ideal network cost obtained from the optimal solution is 2673.49. The two turn out to be very close to each other, which indicates that our distributed algorithm can be applied in practice.

6 ONLINE ALGORITHM

Until now, we have taken the data size m_i^p and the data aggregation ratio α_j as input to our algorithms. To get their values, we need to wait for all mappers to finish before starting reduce tasks, or estimate them via profiling on a small set of data. In practice, map and reduce tasks may partially overlap in execution to increase system throughput, and it is difficult to estimate system parameters at high accuracy for big data applications. These issues motivate us to design an online algorithm that dynamically adjusts data partition and aggregation during the execution of map and reduce tasks.

In this section, we divide the execution of a MapReduce job into several time slots, each with a length of several minutes or an hour. We let m_i^p(t) and α_j(t) denote the parameters collected at time slot t, with no assumption about their distributions. As the job is running, an existing data partition and aggregation scheme may no longer be optimal under the current m_i^p(t) and α_j(t). To reduce traffic cost, we may need to migrate an aggregator from machine j to j′ with a migration cost Ψ_{jj′}. Meanwhile, the key assignment among reducers is adjusted. When we let reducer k′ process the data with key p instead of reducer k, which is currently in charge of this key, we use the function Φ_{kk′}( ∑_{τ=1}^{t} ∑_{j∈A} ∑_{k∈R} g_{jk}^p(τ) ) to denote the cost of migrating all intermediate data received by reducers so far. The total migration cost can be calculated by:

  C_M(t) = ∑_{k,k′∈R} ∑_{p∈P} y_k^p(t−1) y_{k′}^p(t) Φ_{kk′}( ∑_{τ=1}^{t} ∑_{j∈A} ∑_{k∈R} g_{jk}^p(τ) ) + ∑_{j,j′∈N} z_j(t−1) z_{j′}(t) Ψ_{jj′}.    (16)
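Read as code, (16) charges Φ for every key whose responsible reducer changes, evaluated on the intermediate data that the reducers have accumulated so far, plus Ψ for every aggregator that leaves its machine. The helper below is an illustrative transcription under that reading, with the constant Φ = 5 and Ψ = 6 used later in the simulations as defaults.

```python
def migration_cost(y_prev, y_cur, z_prev, z_cur, accumulated,
                   phi=lambda data: 5.0, psi=6.0):
    """Illustrative transcription of Eq. (16). y_prev/y_cur map each key to the
    reducer responsible for it in the previous/current slot; z_prev/z_cur are
    the sets of machines hosting an aggregator; accumulated[p] is the
    intermediate data of key p received by reducers so far. phi(.) is the
    per-key data-migration cost (a constant 5 in the simulations) and psi the
    per-aggregator migration cost (6 in the simulations)."""
    data_cost = sum(phi(accumulated[p]) for p in y_cur if y_cur[p] != y_prev[p])
    agg_cost = psi * len(z_prev - z_cur)   # aggregators that left their machine
    return data_cost + agg_cost

print(migration_cost({"k1": "r1", "k2": "r2"}, {"k1": "r2", "k2": "r2"},
                     {"n1"}, {"n3"}, {"k1": 40.0, "k2": 10.0}))   # 5.0 + 6.0 = 11.0
```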

Our objective is to minimize the overall cost of traffic and migration over a time interval [1, T], i.e.,

  min ∑_{t=1}^{T} ( C_M(t) + ∑_{p∈P} C^p(t) ),
  subject to: (1)-(7), (9), (10), and (16), ∀t = 1, . . . , T.

Algorithm 2 Online Algorithm
1: t = 1 and t̂ = 1;
2: solve the OPT_ONE_SHOT problem for t = 1;
3: while t ≤ T do
4:   if ∑_{τ=t̂}^{t} ∑_{p∈P} C^p(τ) > γ C_M(t) then
5:     solve the following optimization problem:
         min ∑_{p∈P} C^p(t)
         subject to: (1)-(7), (9), and (10), for time slot t;
6:     if the solution indicates a migration event then
7:       conduct migration according to the new solution;
8:       t̂ = t;
9:       update C_M(t);
10:    end if
11:  end if
12:  t = t + 1;
13: end while

An intuitive method to solve the problem above is to divide it into T one-shot optimization problems:

  OPT_ONE_SHOT: min C_M(t) + ∑_{p∈P} C^p(t)
  subject to: (1)-(7), (9), (10), and (16), for time slot t.

Unfortunately, solving the above one-shot optimization in each time slot based on the information collected in the previous time slot would be far from optimal, because it may lead to frequent migration events. Moreover, the coupled objective function due to C_M(t) introduces additional challenges in distributed algorithm design.

In this section, we design an online algorithm whose basic idea is to postpone the migration operation until the cumulative traffic cost exceeds a threshold. As shown in Algorithm 2, we let t̂ denote the time of the last migration operation, and we obtain an initial solution by solving the OPT_ONE_SHOT problem.


In each of the following time slots, we check whether the accumulative traffic cost, i.e., ∑_{τ=t̂}^{t} ∑_{p∈P} C^p(τ), is greater than γ times C_M(t). If it is, we solve an optimization problem with the objective of minimizing the traffic cost, as shown in line 5. We conduct the migration operation according to the optimization results and update C_M(t) accordingly, as shown in lines 6 to 10. Note that the optimization problem in line 5 can be solved using the distributed algorithm developed in the last section.
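In code, the controller of Algorithm 2 reduces to a small loop that accumulates the per-slot traffic cost since the last migration and re-optimizes only when the sum exceeds γ times the migration cost. The skeleton below is a sketch of that control flow; reoptimize, observe_cost, and migration_cost are placeholders (the first standing in for the distributed algorithm of Section 5), and c_m_init plays the role of C_M(0), set to 300 in the simulations.

```python
# Skeleton of Algorithm 2. reoptimize(t) stands in for the distributed
# algorithm of Section 5; observe_cost(sol, t) returns the traffic cost
# sum_p C^p(t) under the current plan; migration_cost(old, new) evaluates
# Eq. (16). All three are placeholders supplied by the caller.

def online_controller(T, gamma, reoptimize, observe_cost, migration_cost,
                      c_m_init=300.0):
    solution = reoptimize(1)                  # initial OPT_ONE_SHOT solution
    c_m = c_m_init                            # C_M, updated only at migrations
    accumulated = 0.0                         # traffic cost since last migration
    for t in range(1, T + 1):
        accumulated += observe_cost(solution, t)
        if accumulated > gamma * c_m:         # threshold test (line 4)
            new_solution = reoptimize(t)      # line 5
            if new_solution != solution:      # migration event (lines 6-10)
                c_m = migration_cost(solution, new_solution)
                solution, accumulated = new_solution, 0.0
    return solution
```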

7 PERFORMANCE EVALUATION

In this section, we conduct extensive simulations to evaluate the performance of our proposed distributed algorithm DA by comparing it with the following two schemes.

• HNA: Hash-based partition with No Aggregation. As the default method in Hadoop, it applies traditional hash partitioning to the intermediate data, which are transferred to reducers without going through aggregators.

• HRA: Hash-based partition with Random Aggregation. It adds a random aggregator placement algorithm on top of traditional Hadoop. By randomly placing aggregators in the shuffle phase, it aims to reduce the network traffic cost compared to the traditional method in Hadoop. A sketch of both baselines is given after this list.
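For reference, the two baselines can be stated in a few lines: HNA is just the default hash partitioner with no aggregation layer, and HRA keeps hash partitioning but activates up to δ aggregators on randomly chosen machines. The sketch below is our reading of these baselines, not the authors' exact implementation.

```python
import random

def hna_partition(key, reducers):
    """HNA: Hadoop-style hash partitioning of a key, with no aggregation layer."""
    return reducers[hash(key) % len(reducers)]

def hra_place_aggregators(machines, delta, seed=None):
    """HRA: keep hash partitioning, but activate up to delta aggregators on
    randomly chosen machines."""
    rng = random.Random(seed)
    return set(rng.sample(machines, min(delta, len(machines))))

print(hna_partition("K1", ["r1", "r2"]),
      hra_place_aggregators(["n1", "n2", "n3"], 2, seed=0))
```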

To the best of our knowledge, we are the first to propose a scheme that exploits both aggregator placement and traffic-aware partitioning. All simulation results are averaged over 30 random instances.

7.1 Simulation Results of Offline Cases

We first evaluate the performance gap between our proposed distributed algorithm and the optimal solution obtained by solving the MILP formulation. Due to the high computational complexity of the MILP formulation, we consider small-scale problem instances with 10 keys in this set of simulations. Each key is associated with a random data size within [1, 50]. There are 20 mappers and 2 reducers on a cluster of 20 machines. The parameter α is set to 0.5. The distance between any two machines is randomly chosen within [1, 60].

As shown in Fig. 7, the performance of our distributed algorithm is very close to the optimal solution. Although the network traffic cost increases as the number of keys grows for all algorithms, the performance advantage of our proposed algorithm over the other two schemes becomes larger. When the number of keys is set to 10, the default algorithm HNA has a cost of 5.0 × 10^4 while the optimal solution is only 2.7 × 10^4, a 46% traffic reduction.

We then consider large-scale problem instances and compare the performance of our distributed algorithm with the other two schemes. We first describe a default simulation setting with a number of parameters, and then study the performance by changing one parameter while fixing the others. We consider a MapReduce job with 100 keys; the other parameters are the same as above.

Fig. 7. Network traffic cost versus number of keys, from 1 to 10.

Fig. 8. Network traffic cost versus number of keys, from 1 to 100.

As shown in Fig. 8, the network traffic cost is an increasing function of the number of keys, from 1 to 100, under all algorithms. In particular, when the number of keys is set to 100, the network traffic of the HNA algorithm is about 3.4 × 10^5, while the traffic cost of our algorithm is only 1.7 × 10^5, a reduction of 50%. In contrast to HRA and HNA, the curve of DA increases slowly because most map outputs are aggregated and the traffic-aware partition chooses closer reduce tasks for each key/value pair, both of which are beneficial to network traffic reduction in the shuffle phase.

We then study the performance of the three algorithms under different values of α in Fig. 9 by changing its value from 0.2 to 1.0. A small value of α indicates a higher aggregation efficiency for the intermediate data. We observe that network traffic increases with the growth of α under both DA and HRA. In particular, when α is 0.2, DA achieves the lowest traffic cost of 1.1 × 10^5. On the other hand, the network traffic of HNA stays stable because it does not conduct data aggregation.

Fig. 9. Network traffic cost versus data reduction ratio α.

Fig. 10. Network traffic cost versus number of aggregators.

The effect of the number of available aggregators on network traffic is investigated in Fig. 10. We change the aggregator number from 0 to 6, and observe that DA always outperforms the other two algorithms, and that network traffic decreases under both HRA and DA. In particular, when the number of aggregators is 6, the network traffic of the HRA algorithm is 2.2 × 10^5, while DA's cost is only 1.5 × 10^5, a 26.7% improvement. That is because aggregators are beneficial to intermediate data reduction in the shuffle process. Similar to Fig. 9, the performance of HNA appears as a horizontal line because it is not affected by the number of available aggregators.

We study the influence of the number of map tasks by increasing the mapper number from 0 to 60. As shown in Fig. 11, we observe that DA always achieves the lowest traffic cost, as expected, because it jointly optimizes data partition and aggregation. Moreover, as the mapper number increases, the network traffic of all algorithms increases.

Fig. 11. Network traffic cost versus number of map tasks.

Fig. 12. Network traffic cost versus number of reduce tasks.

We show the network traffic cost under different numbers of reduce tasks in Fig. 12. The number of reducers is changed from 1 to 6. We observe that the highest network traffic is achieved when there is only one reduce task under all algorithms. That is because all key/value pairs may be delivered to the only reducer, which may be located far away, leading to a large amount of network traffic due to the many-to-one communication pattern. As the number of reduce tasks increases, the network traffic decreases because more reduce tasks share the load of intermediate data. In particular, DA assigns key/value pairs to the closest reduce task, leading to the least network traffic. When the number of reduce tasks is larger than 3, the decrease in network traffic slows down because the capability of sharing intermediate data among reducers has been fully exploited.

The effect of the number of machines is investigated in Fig. 13 by changing the number of physical nodes from 10 to 60. We observe that the network traffic of all algorithms increases as the number of nodes grows. Furthermore, the HRA algorithm performs much worse than the other two algorithms under all settings.

7.2 Simulation Results of Online Cases

We then evaluate the performance of the proposed algorithm under online cases by comparing it with two other schemes: OHRA and OHNA, which are online extensions of HRA and HNA, respectively. The default number of mappers is 20 and the number of reducers is 5. The maximum number of aggregators is set to 4, and we also vary it to examine its impact. Key/value pairs with random data sizes within [1, 100] are generated randomly in different slots. The total number of physical machines is set to 10, and the distance between any two machines is randomly chosen within [1, 60]. Meanwhile, the default parameter α is set to 0.5. The migration costs Φ_{kk′} and Ψ_{jj′} are defined as constants 5 and 6, respectively. The initial migration cost C_M(0) is defined as 300 and γ is set to 1000. All simulation results are averaged over 30 random instances.

Fig. 13. Network traffic cost versus number of machines.

Fig. 14. Network traffic cost versus size of the time interval T.

We first study the performance of all algorithms under the default network setting in Fig. 14. We observe that network traffic increases at the beginning and then tends to be stable under our proposed online algorithm. The network traffic of OHRA and OHNA always stays stable because OHNA follows the same hash partition scheme with no global aggregation in every time slot. OHRA introduces a slight migration cost because Ψ_{jj′} is only 6. Our proposed online algorithm keeps updating the migration cost C_M(t) and executes the distributed algorithm in different time slots, which incurs some migration cost in this process.

The influence of the number of keys on network traffic is studied in Fig. 15. We observe that our online algorithm performs much better than the other two algorithms. In particular, when the number of keys is 50, the network traffic of the online algorithm is about 2 × 10^5 while the traffic of OHNA is almost 3.1 × 10^5, a reduction of about 35%.

In Fig. 16, we compare the performance of the three algorithms under different values of α. The larger α is, the lower the aggregation efficiency of the intermediate data. We observe that network traffic increases under our online algorithm and OHRA. However, OHNA is not affected by the parameter α because no data aggregation is conducted. When α is 1, all algorithms have similar performance because α = 1 means no data aggregation.

Fig. 15. Network traffic cost versus number of keys.

Fig. 16. Network traffic cost versus data reduction ratio α.

On the other hand, our online algorithm outperforms OHRA and OHNA under the other settings due to the joint optimization of traffic-aware partition and global aggregation.

We investigate the performance of the three algorithms under different numbers of aggregators in Fig. 17. We observe that the online algorithm outperforms the other two schemes. When the number of aggregators is 6, the network traffic of the OHNA algorithm is 2.8 × 10^5 while our online algorithm has a network traffic of 1.7 × 10^5, an improvement of 39%. As the number of aggregators increases, it becomes more beneficial to aggregate intermediate data, reducing the amount of data in the shuffle process. However, when the number of aggregators is set to 0, which means no global aggregation, OHRA has the same network traffic as OHNA, and our online algorithm still achieves the lowest cost.

Fig. 17. Network traffic cost versus number of aggregators.

8 CONCLUSION

In this paper, we study the joint optimization of intermediate data partition and aggregation in MapReduce to minimize the network traffic cost for big data applications. We propose a three-layer model for this problem and formulate it as a mixed-integer nonlinear problem, which is then transformed into a linear form that can be solved by mathematical tools. To deal with the large-scale formulation due to big data, we design a distributed algorithm to solve the problem on multiple machines. Furthermore, we extend our algorithm to handle the MapReduce job in an online manner when some system parameters are not given. Finally, we conduct extensive simulations to evaluate our proposed algorithm under both offline and online cases. The simulation results demonstrate that our proposals can effectively reduce network traffic cost under various network settings.

ACKNOWLEDGMENT

This work was partially sponsored by the National Natural Science Foundation of China (NSFC) (No. 61261160502, No. 61272099), the Program for Changjiang Scholars and Innovative Research Team in University (IRT1158, PCSIRT), and the Shanghai Innovative Action Plan (No. 13511504200). S. Guo is the corresponding author.

REFERENCES

[1] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[2] W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang, “Map task scheduling in MapReduce with data locality: Throughput and heavy-traffic optimality,” in INFOCOM, 2013 Proceedings IEEE. IEEE, 2013, pp. 1609–1617.
[3] F. Chen, M. Kodialam, and T. Lakshman, “Joint scheduling of processing and shuffle phases in MapReduce systems,” in INFOCOM, 2012 Proceedings IEEE. IEEE, 2012, pp. 1143–1151.
[4] Y. Wang, W. Wang, C. Ma, and D. Meng, “Zput: A speedy data uploading approach for the Hadoop distributed file system,” in Cluster Computing (CLUSTER), 2013 IEEE International Conference on. IEEE, 2013, pp. 1–5.
[5] T. White, Hadoop: The Definitive Guide. O’Reilly Media, Inc., 2009.
[6] S. Chen and S. W. Schlosser, “Map-reduce meets wider varieties of applications,” Intel Research Pittsburgh, Tech. Rep. IRP-TR-08-05, 2008.
[7] J. Rosen, N. Polyzotis, V. Borkar, Y. Bu, M. J. Carey, M. Weimer, T. Condie, and R. Ramakrishnan, “Iterative MapReduce for large scale machine learning,” arXiv preprint arXiv:1303.3517, 2013.
[8] S. Venkataraman, E. Bodzsar, I. Roy, A. AuYoung, and R. S. Schreiber, “Presto: distributed machine learning and graph processing with sparse matrices,” in Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 2013, pp. 197–210.
[9] A. Matsunaga, M. Tsugawa, and J. Fortes, “CloudBLAST: Combining MapReduce and virtualization on distributed resources for bioinformatics applications,” in eScience, 2008. eScience’08. IEEE Fourth International Conference on. IEEE, 2008, pp. 222–229.
[10] J. Wang, D. Crawl, I. Altintas, K. Tzoumas, and V. Markl, “Comparison of distributed data-parallelization patterns for big data analysis: A bioinformatics case study,” in Proceedings of the Fourth International Workshop on Data Intensive Computing in the Clouds (DataCloud), 2013.
[11] R. Liao, Y. Zhang, J. Guan, and S. Zhou, “CloudNMF: A MapReduce implementation of nonnegative matrix factorization for large-scale biological datasets,” Genomics, Proteomics & Bioinformatics, vol. 12, no. 1, pp. 48–51, 2014.
[12] G. Mackey, S. Sehrish, J. Bent, J. Lopez, S. Habib, and J. Wang, “Introducing map-reduce to high end computing,” in Petascale Data Storage Workshop, 2008. PDSW’08. 3rd. IEEE, 2008, pp. 1–6.
[13] W. Yu, G. Xu, Z. Chen, and P. Moulema, “A cloud computing based architecture for cyber security situation awareness,” in Communications and Network Security (CNS), 2013 IEEE Conference on. IEEE, 2013, pp. 488–492.
[14] J. Zhang, H. Zhou, R. Chen, X. Fan, Z. Guo, H. Lin, J. Y. Li, W. Lin, J. Zhou, and L. Zhou, “Optimizing data shuffling in data-parallel computation by understanding user-defined functions,” in Proceedings of the 7th Symposium on Networked Systems Design and Implementation (NSDI), San Jose, CA, USA, 2012.
[15] F. Ahmad, S. Lee, M. Thottethodi, and T. Vijaykumar, “MapReduce with communication overlap,” pp. 608–620, 2013.
[16] H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, “Map-reduce-merge: simplified relational data processing on large clusters,” in Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. ACM, 2007, pp. 1029–1040.
[17] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears, “Online aggregation and continuous query support in MapReduce,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010, pp. 1115–1118.
[18] A. Blanca and S. W. Shin, “Optimizing network usage in MapReduce scheduling.”
[19] B. Palanisamy, A. Singh, L. Liu, and B. Jain, “Purlieus: locality-aware resource allocation for MapReduce in a cloud,” in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011, p. 58.
[20] S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, and L. Qi, “LEEN: Locality/fairness-aware key partitioning for MapReduce in the cloud,” in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on. IEEE, 2010, pp. 17–24.
[21] L. Fan, B. Gao, X. Sun, F. Zhang, and Z. Liu, “Improving the load balance of MapReduce operations based on the key distribution of pairs,” arXiv preprint arXiv:1401.0355, 2014.
[22] S.-C. Hsueh, M.-Y. Lin, and Y.-C. Chiu, “A load-balanced MapReduce algorithm for blocking-based entity-resolution with multiple keys,” Parallel and Distributed Computing 2014, p. 3, 2014.
[23] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears, “MapReduce online,” in NSDI, vol. 10, no. 4, 2010, p. 20.
[24] J. Lin and C. Dyer, “Data-intensive text processing with MapReduce,” Synthesis Lectures on Human Language Technologies, vol. 3, no. 1, pp. 1–177, 2010.
[25] P. Costa, A. Donnelly, A. I. Rowstron, and G. O’Shea, “Camdoop: Exploiting in-network aggregation for big data applications,” in NSDI, vol. 12, 2012, pp. 3–3.

Huan Ke is a graduate student in the Department of Computer Science at the University of Aizu. Her research interests include cloud computing, big data, networking, and RFID systems. She received her bachelor's degree from Huazhong University of Science and Technology.


Peng Li received his BS degree from Huazhong University of Science and Technology, China, in 2007, and the MS and PhD degrees from the University of Aizu, Japan, in 2009 and 2012, respectively. He is currently an Associate Professor at the University of Aizu, Japan. His research interests include network modeling, cross-layer optimization, network coding, cooperative communications, cloud computing, smart grid, and performance evaluation of wireless and mobile networks for reliable, energy-efficient, and cost-effective communications. He is a member of the IEEE.

Song Guo (M’02-SM’11) received the PhD degree in computer science from the University of Ottawa, Canada in 2005. He is currently a Full Professor at the School of Computer Science and Engineering, the University of Aizu, Japan. His research interests are mainly in the areas of wireless communication and mobile computing, cyber-physical systems, data center networks, cloud computing and networking, big data, and green computing. He has published over 250 papers in refereed journals and conferences in these areas and received three IEEE/ACM best paper awards. Dr. Guo currently serves as Associate Editor of IEEE Transactions on Parallel and Distributed Systems, Associate Editor of IEEE Transactions on Emerging Topics in Computing for the track of Computational Networks, and on the editorial boards of many others. He has also been on the organizing and technical committees of numerous international conferences. Dr. Guo is a senior member of the IEEE and the ACM.

Minyi Guo received the BSc and ME degrees in computer science from Nanjing University, China, and the PhD degree in computer science from the University of Tsukuba, Japan. He is currently Zhiyuan Chair Professor and head of the Department of Computer Science and Engineering, Shanghai Jiao Tong University (SJTU), China. Before joining SJTU, Dr. Guo was a professor and department chair of the School of Computer Science and Engineering, University of Aizu, Japan. Dr. Guo received the National Science Fund for Distinguished Young Scholars from NSFC in 2007, and was supported by the “1000 Recruitment Program of China” in 2010. His present research interests include parallel/distributed computing, compiler optimizations, embedded systems, pervasive computing, and cloud computing. He has more than 250 publications in major journals and international conferences in these areas, including the IEEE Transactions on Parallel and Distributed Systems, the IEEE Transactions on Nanobioscience, the IEEE Transactions on Computers, the ACM Transactions on Autonomous and Adaptive Systems, INFOCOM, IPDPS, ICS, ISCA, HPCA, SC, WWW, etc. He received 5 best paper awards from international conferences. He was on the editorial board of the IEEE Transactions on Parallel and Distributed Systems and the IEEE Transactions on Computers. Dr. Guo is a senior member of the IEEE, and a member of the ACM and CCF.