Map Reduce (Part 2)
Databases 2 (VU) (706.711 / 707.030)
Mark Kroll
Institute of Interactive Systems and Data Science, Graz University of Technology
Nov. 19, 2018
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 19, 2018 1 / 38
Outline
1 Brief Recap: MapReduce
2 Maximizing Parallelism
3 Performance Issues with MapReduce
4 Central-Limit Theorem
Slides are partially based on:
- Slides "Mining Massive Datasets" by Jure Leskovec
- Slides "Tutorial: MapReduce Theory and Practice of Data-intensive Applications" by Pietro Michiardi
- "The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce" by J. Lin, 2009
- "A Study of Skew in MapReduce Applications" by Kwon, 2011
- "Limitations and Challenges of HDFS and MapReduce" by Weets et al., 2015
MapReduce vs. Traditional Parallel Programming
getting parallelism not from supercomputers (e.g. HPC) but from computing clusters = large collections of commodity hardware
- easy and cheap to replace
- yet, components do fail → redundancy, which is provided by a new form of file system: the "distributed file system (DFS)"
freeing the developer from devoting attention to managing system-level details, for example,
- synchronization primitives, inter-process communication, data transfer, etc.
A New Paradigm
MapReduce:
- allows for processing and generating large data sets with a parallel, distributed algorithm on a cluster infrastructure
- is a framework for implementing divide & conquer algorithms in an extremely scalable way
- moves code/algorithm close to the data to minimize data movement
A MapReduce Computation
Input: a set of (key, value) pairs
e.g. key is the filename, value is a single line in the file, if the file is too large to fit in memory
A MapReduce Computation
Map(k, v) → (k′, v′)*
takes a (k, v) pair and outputs a set of (k′, v′) pairs
there is one Map call (mapper) for each (k, v) pair
A MapReduce Computation
Reduce(k′, (v′)*) → (k′′, v′′)*
all values v′ with the same key k′ are reduced together and processed in v′ order
there is one Reduce call (reducer) for each unique k′
A MapReduce Computation: An Example
Big Document: "Star Wars is an American epic space opera franchise centered on a film series created by George Lucas. The film series has spawned an extensive media franchise called the Expanded Universe including books, television series, computer and video games, and comic books. These supplements to the two film trilogies…"

Map:
(Star, 1) (Wars, 1) (is, 1) (an, 1) (American, 1) (epic, 1) (space, 1) (opera, 1) (franchise, 1) (centered, 1) (on, 1) (a, 1) (film, 1) (series, 1) (created, 1) (by, 1) …

Group by key:
(Star, 1) (Star, 1) | (Wars, 1) (Wars, 1) | (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) | (film, 1) (film, 1) (film, 1) | (franchise, 1) | (series, 1) (series, 1) …

Reduce:
(Star, 2) (Wars, 2) (a, 6) (film, 3) (franchise, 1) (series, 2) …
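The word-count pipeline can be written as a single-process sketch (not a distributed implementation; the names map_fn, reduce_fn and map_reduce are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # key: filename, value: one line of the file; emit (word, 1) per occurrence
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # all values for one key are summed; one call per unique key
    yield (key, sum(values))

def map_reduce(inputs):
    # inputs: iterable of (filename, line) pairs
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))  # the "group by key" step
    return dict(
        out
        for key, group in groupby(intermediate, key=itemgetter(0))
        for out in reduce_fn(key, (v for _, v in group))
    )

counts = map_reduce([("doc.txt", "Star Wars is a film series"),
                     ("doc.txt", "a film a franchise")])
```

In a real framework the sort/group step is the shuffle phase performed by the system between the Map and Reduce tasks.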
General Thoughts on Parallelism
to maximize parallelism, we could think of
- using one Reduce task for each reducer, i.e. a single key and its associated value list
- and executing each Reduce task at a different compute node
yet, this plan is typically not the best one …
General Thoughts on Parallelism
→ since there is overhead associated with each task created
- tasks need to be set up, etc.
- so map tasks should take at least a minute to execute to pay off …
want to keep the number of Reduce tasks lower than the number of different keys
- do not want to create a task for a key with a "short" value list
there are often far more keys than there are compute nodes,
- e.g. count words from Wikipedia or from the Web
General Thoughts on Parallelism
so how many Map and Reduce tasks?
if you have M map tasks + R reduce tasks, a rule of thumb is:
- make M much larger than the number of nodes in the cluster
- one DFS chunk per map task is common
- this improves dynamic load balancing and speeds up recovery from worker failures
usually R is smaller than M, because output is spread across R files
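A back-of-the-envelope calculation of this rule of thumb (all numbers are hypothetical: 1 TB of input, 128 MB DFS chunks, 50 worker nodes):

```python
import math

# hypothetical cluster and input sizes
input_size_mb = 1024 * 1024   # 1 TB of input
chunk_size_mb = 128           # common DFS chunk size
num_nodes = 50                # worker nodes

M = math.ceil(input_size_mb / chunk_size_mb)  # one map task per DFS chunk
R = 2 * num_nodes                             # R: e.g. a small multiple of the node count

# M (8192 here) is much larger than the number of nodes; R is smaller than M
```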
1st Refinement: Combiners
if a Reduce function is associative and commutative (e.g. addition or multiplication), the values can be combined in any order with the same result
- Commutative: x ◦ y = y ◦ x
- Associative: (x ◦ y) ◦ z = x ◦ (y ◦ z)
→ we can push some of the reducers' work to the Map tasks
in that way the output of the Map task is "combined" before grouping and sorting (instead of emitting (w, 1), (w, 1), …)
- it is still necessary to do grouping and aggregation and to pass the result to the Reduce tasks
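A minimal sketch of a map-side combiner for word count (map_with_combiner is an illustrative name; frameworks such as Hadoop instead let you register a separate combiner class that runs on the map output):

```python
from collections import Counter

def map_with_combiner(filename, line):
    # emit (word, local_count) instead of one (word, 1) pair per occurrence;
    # legal here because the reduce function (addition) is commutative and associative
    return sorted(Counter(line.split()).items())

pairs = map_with_combiner("doc.txt", "a film a film a")
# only 2 pairs cross the network instead of 5
```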
So instead of . . .
(the same word-count pipeline as before: Map emits one (w, 1) pair per word occurrence, all pairs are grouped by key, and Reduce sums the counts)
Combine results already at the Map phase
Map:
(Star, 1) (Wars, 1) (is, 1) … | (American, 1) (epic, 1) (space, 1) … | (franchise, 1) (centered, 1) (on, 1) … | (film, 1) (series, 1) (created, 1) …

Combiner (per map task):
(Star, 2) … | (Wars, 2) (a, 6) … | (a, 3) … | (film, 3) (franchise, 1) (series, 2) …

Group by key:
(Star, 2) (Wars, 2) (a, 6) (a, 3) (a, 4) … | (film, 3) (franchise, 1) (series, 2) …

Reduce:
(Star, 2) (Wars, 2) (a, 13) (film, 3) (franchise, 1) (series, 2) …
2nd Refinement: Partition Function
controls the partitioning of the keys of the intermediate map outputs
the reduce step needs to ensure that records with the same intermediate key end up at the same worker
the key (or a subset of the key) is used to derive the partition, typically by a hash function
- (total # of partitions) = (# of reduce tasks)
this controls which of the R reduce tasks an intermediate key is sent to for reduction
2nd Refinement: Partition Function
The idea: we want to control how keys get partitioned:
by default the system uses a hash function, e.g.,
- hash(key) mod R
it is sometimes useful to override this hash function, e.g.,
- hash(hostname(URL)) mod R
- to ensure URLs from the same host end up in the same output file
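A sketch of both partition functions in Python terms (R and the URLs are hypothetical; note that Python's built-in string hash is only stable within one process run, which is enough for this illustration):

```python
from urllib.parse import urlparse

R = 8  # number of reduce tasks (hypothetical)

def default_partition(key):
    # the system default: hash(key) mod R
    return hash(key) % R

def host_partition(url):
    # custom partitioner: route by hostname so all URLs of one
    # host reach the same reduce task, hence the same output file
    return hash(urlparse(url).hostname) % R

p1 = host_partition("https://tugraz.at/courses/dbase2")
p2 = host_partition("https://tugraz.at/staff/rkern")
# same host -> same partition
```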
Input Data Skew
Data skew occurs naturally in many applications, for example,
- PageRank is a link analysis algorithm that assigns a weight to each vertex in a graph by iteratively aggregating the weights of its inbound neighbors. This application can thus exhibit skew if the graph includes nodes with a large number of incoming edges.
as a consequence of the input data being skewed, a small number of mappers often takes significantly longer to complete than the rest,
- thus blocking the progress of the entire job and leading to poor parallelism.
this is called the Stragglers Problem since
- in MapReduce, the reducers cannot start until all the mappers have finished → a few stragglers can have a large impact on the overall end-to-end running time
Stragglers can have a large impact
The slowest map task (first one from the top) takes more than twice as long to complete as the second-slowest map task, thereby killing the parallelism effect. (Fig. from Kwon, 2011)
Stragglers can have a large impact
similarly, a small number of long-running reducers can significantly delay the completion of a MapReduce job
in addition to long running times, the stragglers phenomenon has implications for cluster utilization
- most of the cluster is idle while a submitted job waits for the last mapper or reducer to complete
2 Reasons for the Stragglers Problem
first, idiosyncrasies of machines in a large cluster, i.e. slowness of the "long-running" job could be due to faulty hardware, network congestion, or the node simply being busy, etc.
- nicely handled by speculative execution, which is implemented in Hadoop
- → Hadoop speculates that something is wrong with the "long-running" task and runs a clone task on another node; multiple instances of the same mapper or reducer are redundantly executed in parallel (subject to the availability of cluster resources)
2 Reasons for the Stragglers Problem
second, the distribution of running times for mappers and reducers is highly skewed
- there is often significant variation in the lengths of the value lists for different keys
- frequent words (e.g. articles, pronouns) vs. rarely occurring words (e.g. person names)
take the word count example: due to the distribution of terms in natural language, some reducers will have more work than others
- this yields potentially large differences in running times, independent of hardware idiosyncrasies
Distribution of Terms in Natural Language
well-known observation that word occurrences in natural language, to a first approximation, follow a Zipfian distribution
- a family of related discrete power-law probability distributions
- named after the American linguist George Kingsley Zipf
other (naturally occurring) phenomena can be similarly characterized
- website popularity, sizes of craters on the moon, sizes of German cities, economic power of countries, …
loosely formulated, Zipfian distributions are those where a few elements are exceedingly common, while there is a long tail of rare events
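This head-heavy shape can be made concrete with a finite Zipfian distribution over ranks 1..N, where p(k) ∝ 1/k^α (N and α are illustrative values here, matching the α = 2.5 used in the skew example later in the slides):

```python
# Finite Zipfian distribution over ranks 1..N: p(k) proportional to 1 / k**alpha
N, alpha = 100_000, 2.5

weights = [k ** -alpha for k in range(1, N + 1)]
Z = sum(weights)                     # normalization constant
p = [w / Z for w in weights]         # probability of the k-th most common element

head_mass = sum(p[:10])              # mass of the 10 most common elements
tail_mass = sum(p[1000:])            # mass of the long tail beyond rank 1000
```

For α = 2.5 the ten most common elements carry well over 95% of the probability mass, while the tail beyond rank 1000 carries almost none: exactly the "few exceedingly common elements, long tail of rare events" pattern.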
Power-law (Zipf) random variable
[Figure: probability mass function of a Zipf random variable for α = 2.0 and α = 3.0; linear scale (left) and log-log scale (right)]
resemblance to a hyperbola; the higher α, the steeper the curve (log-log plot on the right-hand side)
Tackling Input Data Skew
We need
- to distribute skewed (power-law) input data onto a number of Reduce tasks / compute nodes
We want that
- the distribution of the key lengths inside the Reduce tasks / compute nodes should be approximately normal
We also want that
- the variance of these distributions should be smaller than the original variance
- → if the variance is small, efficient load balancing is possible
The Setup
each Reduce task receives a number of keys (= a number of reducers)
- the total number of values to process is the sum of the number of values over all keys
- the average number of values that a Reduce task processes is the average of the number of values over all keys
equivalently, each compute node receives a number of Reduce tasks
- the sum and average for a compute node are the sum and average over all Reduce tasks of that node
How should we distribute keys to Reduce tasks?
Natural ideas might include:
uniformly at random?
calculate the capacity of a single Reduce task; add keys until capacity is reached, . . . ?
Let’s take a step back:
We are averaging over a skewed distribution …
Are there laws that describe how the averages of sufficiently large samples drawn from a probability distribution behave?
- in other words, how are the averages of samples of a random variable (r.v.) distributed?
→ Central-Limit Theorem
In a nutshell
The Central-Limit Theorem describes the distribution of the arithmetic mean of sufficiently large samples of independent and identically distributed random variables:
- the means are normally distributed
- the mean of the new distribution equals the mean of the original distribution
- the variance of the new distribution equals σ²/n, where σ² is the variance of the original distribution
→ thus, we keep the mean and reduce the variance
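A quick numerical sketch of this behavior, averaging over a deliberately skewed (exponential) source distribution; the sample size n = 30 and the number of samples are arbitrary illustrative choices:

```python
import random
from statistics import mean, variance

random.seed(0)  # deterministic run
n, num_samples = 30, 2000

# a skewed (non-normal) source distribution: exponential with mean 1, variance 1
draws = [random.expovariate(1.0) for _ in range(n * num_samples)]

# the arithmetic mean of each sample of size n
averages = [mean(draws[i * n:(i + 1) * n]) for i in range(num_samples)]

# CLT: the averages keep the mean (~1) and shrink the variance by a factor of n
mean_of_averages = mean(averages)
var_of_averages = variance(averages)
```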
Central-Limit Theorem
Theorem: suppose X1, …, Xn are independent and identically distributed r.v. with expectation µ and variance σ². Let Yn be the r.v. defined as:

Yn = (1/n) · Σ_{i=1}^{n} Xi

then for large n, Yn is approximately normal with mean µ and variance σ²/n; formally, the CDF Fn(y) of the standardized mean Zn = (Yn − µ) / (σ/√n) tends to the CDF of a standard normal r.v. for n → ∞:

lim_{n→∞} Fn(y) = (1/√(2π)) · ∫_{−∞}^{y} e^{−x²/2} dx
Central-Limit Theorem
[Figure: distributions of sample averages; left: µ = 0.5, σ² = 0.08333; right: µ = 0.499, σ² = 0.00270 — the mean is preserved while the variance shrinks]
practically, it is possible to replace Fn(y) with a normal distribution for n > 30
we should always average over at least 30 values
(interactive demo: http://onlinestatbook.com/stat_sim/sampling_dist/index.html)
Central-Limit Theorem
one assumption made in statistics courses is that the populations we work with are normally distributed
- which is often unrealistic with real-world data
yet, the use of an appropriate sample size and the central limit theorem helps us get around the problem of data from populations that are not normal
→ this theorem has many practical applications in statistics
- such as those involving hypothesis testing or confidence intervals
Skewed Input Data
[Figure: histogram of key sizes (log scale), sum = 196524, µ = 1.965, σ² = 243.245]
Zipfian distribution: #words = 100000, α = 2.5
Reducing Skew
we can reduce the impact of skew by using fewer Reduce tasks than there are reducers (keys)
if keys are sent randomly to Reduce tasks, we average over the value-list lengths
- → we average over the total time for each Reduce task (applying the Central-Limit Theorem)
- we should make sure that the sample size is large enough (n > 30)
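A small simulation of this idea (all sizes hypothetical; Pareto-distributed value-list lengths stand in for the Zipfian key sizes): keys with skewed value-list lengths are assigned uniformly at random to R Reduce tasks, and the per-task averages end up far less spread out than the per-key lengths:

```python
import random
from statistics import mean, variance

random.seed(1)
num_keys, R = 100_000, 1000   # ~100 keys per Reduce task, well above n > 30

# skewed value-list length per key (Pareto: a power law with finite variance)
lengths = [random.paretovariate(2.5) for _ in range(num_keys)]

# send each key to a uniformly random Reduce task
task_total = [0.0] * R
task_count = [0] * R
for length in lengths:
    t = random.randrange(R)
    task_total[t] += length
    task_count[t] += 1

# average value-list length inside each Reduce task
task_avgs = [tot / cnt for tot, cnt in zip(task_total, task_count)]
```

As the Central-Limit Theorem predicts, the per-task averages keep the per-key mean while their variance drops by roughly the number of keys per task.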
Reducers → Reduce Tasks
[Figure: left: key sizes, sum = 196524, µ = 1.965, σ² = 243.245; right: per-task key sizes, sum = 196524, µ = 196.524, σ² = 25136.428]
# Reducers (# Keys) = 100000 → # Reduce Tasks = 1000
Key Averages per Task
[Figure: left: per-task key sizes, sum = 196524, µ = 196.524, σ² = 25136.428; right: per-task key averages (log scale), µ = 1.958, σ² = 1.886]
→ we thus keep the mean and reduce the variance
Reducing Skew
we can further reduce the skew by using more Reduce tasks than there are compute nodes
long Reduce tasks might occupy a compute node fully
several shorter Reduce tasks are executed sequentially at a single compute node
- → we average over the total time for each compute node (again applying the Central-Limit Theorem)
- again, we should make sure that the sample size is large enough (n > 30)
Reduce Tasks → Compute Nodes
[Figure: per-node key sizes, sum = 196524, µ = 19652.400, σ² = 242116.267; per-node key averages, µ = 1.976, σ² = 0.030]
# Reduce Tasks = 1000 → # Compute Nodes = 10
Other Areas to Improve MapReduce Performance
File System
- higher throughput, reliability, replication efficiency
Scheduling
- data/task affinity, scheduling multiple MR apps
Failure Detection
- failure characterization, failure detector and prediction
Security
- data privacy, distributed result/error checking
Scientific Computing
- iterative MapReduce, numerical processing
The End
Today:
- Parallelism (and how to maximize it)
- optimizing computation performance,
  - e.g. handling input data skew …
Next: MapReduce (Part 3)
- suitable problem settings for the MapReduce paradigm
- Hadoop Ecosystem