Map Reduce (Part 2)
Databases 2 (VU) (706.711 / 707.030)
Mark Kroll
Institute of Interactive Systems and Data Science, Graz University of Technology
Nov. 19, 2018
Mark Kroll (ISDS, TU Graz) MapReduce Nov. 19, 2018 1 / 38
Outline
1 Brief Recap: MapReduce
2 Maximizing Parallelism
3 Performance Issues with MapReduce
4 Central-Limit Theorem
Slides are partially based on:
- Slides "Mining Massive Datasets" by Jure Leskovec
- Slides "Tutorial: MapReduce Theory and Practice of Data-intensive Applications" by Pietro Michiardi
- "The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce" by J. Lin, 2009
- "A Study of Skew in MapReduce Applications" by Kwon, 2011
- "Limitations and Challenges of HDFS and MapReduce" by Weets et al., 2015
MapReduce vs. Traditional Parallel Programming
getting parallelism not from supercomputers (e.g. HPC) but from computing clusters = large collections of commodity hardware
- easy and cheap to replace
- yet, components do fail → redundancy, which is provided by a new form of file system: the "distributed file system (DFS)"
freeing the developer from devoting attention to managing system-level details, for example,
- synchronization primitives, inter-process communication, data transfer, etc.
A New Paradigm
MapReduce:
- allows for processing and generating large data sets with a parallel, distributed algorithm on a cluster infrastructure
- is a framework for implementing divide & conquer algorithms in an extremely scalable way
- moves code/algorithm close to the data to minimize data movement
A MapReduce Computation
Input: a set of (key, value) pairs
e.g. key is the filename, value is a single line in the file, if the file is too large to fit in memory
A MapReduce Computation
Map(k, v) → (k′, v′)*
takes a (k, v) pair and outputs a set of (k′, v′) pairs
there is one Map call (mapper) for each (k, v) pair
A MapReduce Computation
Reduce(k′, (v′)*) → (k′′, v′′)*
all values v′ with the same key k′ are reduced together and processed in v′ order
there is one Reduce call (reducer) for each unique k′
A MapReduce Computation: An Example
Big Document: "Star Wars is an American epic space opera franchise centered on a film series created by George Lucas. The film series has spawned an extensive media franchise called the Expanded Universe including books, television series, computer and video games, and comic books. These supplements to the two film trilogies…"

Map:
(Star, 1) (Wars, 1) (is, 1) (an, 1) (American, 1) (epic, 1) (space, 1) (opera, 1) (franchise, 1) (centered, 1) (on, 1) (a, 1) (film, 1) (series, 1) (created, 1) (by, 1) …

Group by key:
(Star, 1) (Star, 1) | (Wars, 1) (Wars, 1) | (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) | (film, 1) (film, 1) (film, 1) | (franchise, 1) | (series, 1) (series, 1) …

Reduce:
(Star, 2) (Wars, 2) (a, 6) (film, 3) (franchise, 1) (series, 2) …
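The word-count pipeline can be written as a single-process sketch (not a distributed implementation; the names map_fn, reduce_fn and map_reduce are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # key: filename, value: one line of the file; emit (word, 1) per occurrence
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # all values for one key are summed; one call per unique key
    yield (key, sum(values))

def map_reduce(inputs):
    # inputs: iterable of (filename, line) pairs
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))  # the "group by key" step
    return dict(
        out
        for key, group in groupby(intermediate, key=itemgetter(0))
        for out in reduce_fn(key, (v for _, v in group))
    )

counts = map_reduce([("doc.txt", "Star Wars is a film series"),
                     ("doc.txt", "a film a franchise")])
```

In a real framework the sort/group step is the shuffle phase performed by the system between the Map and Reduce tasks.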
General Thoughts on Parallelism
to maximize parallelism, we could think of
- using one Reduce task for each reducer, i.e. a single key and its associated value list
- and executing each Reduce task at a different compute node
yet, this plan is typically not the best one …
General Thoughts on Parallelism
→ since there is overhead associated with each task created
- tasks need to be set up, etc.
- so map tasks should take at least a minute to execute to pay off …
want to keep the number of Reduce tasks lower than the number of different keys
- do not want to create a task for a key with a "short" value list
there are often far more keys than there are compute nodes,
- e.g. count words from Wikipedia or from the Web
General Thoughts on Parallelism
so how many Map and Reduce tasks?
if you have M map tasks + R reduce tasks, a rule of thumb is:
- make M much larger than the number of nodes in the cluster
- one DFS chunk per map task is common
- this improves dynamic load balancing and speeds up recovery from worker failures
usually R is smaller than M, because output is spread across R files
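A back-of-the-envelope calculation of this rule of thumb (all numbers are hypothetical: 1 TB of input, 128 MB DFS chunks, 50 worker nodes):

```python
import math

# hypothetical cluster and input sizes
input_size_mb = 1024 * 1024   # 1 TB of input
chunk_size_mb = 128           # common DFS chunk size
num_nodes = 50                # worker nodes

M = math.ceil(input_size_mb / chunk_size_mb)  # one map task per DFS chunk
R = 2 * num_nodes                             # R: e.g. a small multiple of the node count

# M (8192 here) is much larger than the number of nodes; R is smaller than M
```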
1st Refinement: Combiners
if a Reduce function is associative and commutative (e.g. addition or multiplication), the values can be combined in any order with the same result
- Commutative: x ◦ y = y ◦ x
- Associative: (x ◦ y) ◦ z = x ◦ (y ◦ z)
→ we can push some of the reducers' work to the Map tasks
in that way the output of the Map task is "combined" before grouping and sorting (instead of emitting (w, 1), (w, 1), …)
- it is still necessary to do grouping and aggregation and to pass the result to the Reduce tasks
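A minimal sketch of a map-side combiner for word count (map_with_combiner is an illustrative name; frameworks such as Hadoop instead let you register a separate combiner class that runs on the map output):

```python
from collections import Counter

def map_with_combiner(filename, line):
    # emit (word, local_count) instead of one (word, 1) pair per occurrence;
    # legal here because the reduce function (addition) is commutative and associative
    return sorted(Counter(line.split()).items())

pairs = map_with_combiner("doc.txt", "a film a film a")
# only 2 pairs cross the network instead of 5
```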
So instead of . . .
(the same word-count pipeline as before: Map emits one (w, 1) pair per word occurrence, all pairs are grouped by key, and Reduce sums the counts)
Combine results already at the Map phase
Map:
(Star, 1) (Wars, 1) (is, 1) … | (American, 1) (epic, 1) (space, 1) … | (franchise, 1) (centered, 1) (on, 1) … | (film, 1) (series, 1) (created, 1) …

Combiner (per map task):
(Star, 2) … | (Wars, 2) (a, 6) … | (a, 3) … | (film, 3) (franchise, 1) (series, 2) …

Group by key:
(Star, 2) (Wars, 2) (a, 6) (a, 3) (a, 4) … | (film, 3) (franchise, 1) (series, 2) …

Reduce:
(Star, 2) (Wars, 2) (a, 13) (film, 3) (franchise, 1) (series, 2) …
2nd Refinement: Partition Function
controls the partitioning of the keys of the intermediate map outputs
the reduce step needs to ensure that records with the same intermediate key end up at the same worker
the key (or a subset of the key) is used to derive the partition, typically by a hash function
- (total # of partitions) = (# of reduce tasks)
this controls which of the R reduce tasks an intermediate key is sent to for reduction
2nd Refinement: Partition Function
The idea: we want to control how keys get partitioned:
by default the system uses a hash function, e.g.,
- hash(key) mod R
it is sometimes useful to override this hash function, e.g.,
- hash(hostname(URL)) mod R
- to ensure URLs from the same host end up in the same output file
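A sketch of both partition functions in Python terms (R and the URLs are hypothetical; note that Python's built-in string hash is only stable within one process run, which is enough for this illustration):

```python
from urllib.parse import urlparse

R = 8  # number of reduce tasks (hypothetical)

def default_partition(key):
    # the system default: hash(key) mod R
    return hash(key) % R

def host_partition(url):
    # custom partitioner: route by hostname so all URLs of one
    # host reach the same reduce task, hence the same output file
    return hash(urlparse(url).hostname) % R

p1 = host_partition("https://tugraz.at/courses/dbase2")
p2 = host_partition("https://tugraz.at/staff/rkern")
# same host -> same partition
```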
Input Data Skew
Data skew occurs naturally in many applications, for example,
- PageRank is a link analysis algorithm that assigns a weight to each vertex in a graph by iteratively aggregating the weights of its inbound neighbors. This application can thus exhibit skew if the graph includes nodes with a large number of incoming edges.
as a consequence of the input data being skewed, a small number of mappers often takes significantly longer to complete than the rest,
- thus blocking the progress of the entire job and leading to poor parallelism.
this is called the Stragglers Problem since
- in MapReduce, the reducers cannot start until all the mappers have finished → a few stragglers can have a large impact on the overall end-to-end running time
Stragglers can have a large impact
The slowest map task (first one from the top) takes more than twice as long to complete as the second-slowest map task, thereby killing the parallelism effect. (Fig. from Kwon, 2011)
Stragglers can have a large impact
similarly, a small number of long-running reducers can significantly delay the completion of a MapReduce job
in addition to long running times, the stragglers phenomenon has implications for cluster utilization
- most of the cluster is idle while a submitted job waits for the last mapper or reducer to complete
2 Reasons for the Stragglers Problem
first, idiosyncrasies of machines in a large cluster, i.e. slowness of the "long-running" job could be due to faulty hardware, network congestion, or the node simply being busy, etc.
- nicely handled by speculative execution, which is implemented in Hadoop
- → Hadoop speculates that something is wrong with the "long-running" task and runs a clone task on another node; multiple instances of the same mapper or reducer are redundantly executed in parallel (subject to the availability of cluster resources)
2 Reasons for the Stragglers Problem
second, the distribution of running times for mappers and reducers is highly skewed
- there is often significant variation in the lengths of the value lists for different keys
- frequent words (e.g. articles, pronouns) vs. rarely occurring words (e.g. person names)
take the word count example: due to the distribution of terms in natural language, some reducers will have more work than others
- this yields potentially large differences in running times, independent of hardware idiosyncrasies
Distribution of Terms in Natural Language
well-known observation that word occurrences in natural language, to a first approximation, follow a Zipfian distribution
- a family of related discrete power-law probability distributions
- named after the American linguist George Kingsley Zipf
other (naturally occurring) phenomena can be similarly characterized
- website popularity, sizes of craters on the moon, sizes of German cities, economic power of countries, …
loosely formulated, Zipfian distributions are those where a few elements are exceedingly common, while there is a long tail of rare events
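This head-heavy shape can be made concrete with a finite Zipfian distribution over ranks 1..N, where p(k) ∝ 1/k^α (N and α are illustrative values here, matching the α = 2.5 used in the skew example later in the slides):

```python
# Finite Zipfian distribution over ranks 1..N: p(k) proportional to 1 / k**alpha
N, alpha = 100_000, 2.5

weights = [k ** -alpha for k in range(1, N + 1)]
Z = sum(weights)                     # normalization constant
p = [w / Z for w in weights]         # probability of the k-th most common element

head_mass = sum(p[:10])              # mass of the 10 most common elements
tail_mass = sum(p[1000:])            # mass of the long tail beyond rank 1000
```

For α = 2.5 the ten most common elements carry well over 95% of the probability mass, while the tail beyond rank 1000 carries almost none: exactly the "few exceedingly common elements, long tail of rare events" pattern.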
Power-law (Zipf) random variable
[Figure: probability mass function of a Zipf random variable for α = 2.0 and α = 3.0; linear scale (left) and log-log scale (right)]
resemblance to a hyperbola; the higher α, the steeper the curve (log-log plot on the right-hand side)
Tackling Input Data Skew
We need
- to distribute skewed (power-law) input data onto a number of Reduce tasks / compute nodes
We want that
- the distribution of the key lengths inside the Reduce tasks / compute nodes should be approximately normal
We also want that
- the variance of these distributions should be smaller than the original variance
- → if the variance is small, efficient load balancing is possible
The Setup
each Reduce task receives a number of keys (= a number of reducers)
- the total number of values to process is the sum of the number of values over all keys
- the average number of values that a Reduce task processes is the average of the number of values over all keys
equivalently, each compute node receives a number of Reduce tasks
- the sum and average for a compute node are the sum and average over all Reduce tasks of that node
How should we distribute keys to Reduce tasks?
Natural ideas might include:
uniformly at random?
calculate the capacity of a single Reduce task; add keys until capacity is reached, . . . ?
Let’s take a step back:
We are averaging over a skewed distribution …
Are there laws that describe how the averages of sufficiently large samples drawn from a probability distribution behave?
- in other words, how are the averages of samples of a random variable (r.v.) distributed?
→ Central-Limit Theorem
In a nutshell
The Central-Limit Theorem describes the distribution of the arithmetic mean of sufficiently large samples of independent and identically distributed random variables:
- the means are normally distributed
- the mean of the new distribution equals the mean of the original distribution
- the variance of the new distribution equals σ²/n, where σ² is the variance of the original distribution
→ thus, we keep the mean and reduce the variance
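A quick numerical sketch of this behavior, averaging over a deliberately skewed (exponential) source distribution; the sample size n = 30 and the number of samples are arbitrary illustrative choices:

```python
import random
from statistics import mean, variance

random.seed(0)  # deterministic run
n, num_samples = 30, 2000

# a skewed (non-normal) source distribution: exponential with mean 1, variance 1
draws = [random.expovariate(1.0) for _ in range(n * num_samples)]

# the arithmetic mean of each sample of size n
averages = [mean(draws[i * n:(i + 1) * n]) for i in range(num_samples)]

# CLT: the averages keep the mean (~1) and shrink the variance by a factor of n
mean_of_averages = mean(averages)
var_of_averages = variance(averages)
```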
Central-Limit Theorem
Theorem: suppose X1, …, Xn are independent and identically distributed r.v. with expectation µ and variance σ². Let Yn be the r.v. defined as:

Yn = (1/n) · Σ_{i=1}^{n} Xi

then for large n, Yn is approximately normal with mean µ and variance σ²/n; formally, the CDF Fn(y) of the standardized mean Zn = (Yn − µ) / (σ/√n) tends to the CDF of a standard normal r.v. for n → ∞:

lim_{n→∞} Fn(y) = (1/√(2π)) · ∫_{−∞}^{y} e^{−x²/2} dx
Central-Limit Theorem
[Figure: distributions of sample averages; left: µ = 0.5, σ² = 0.08333; right: µ = 0.499, σ² = 0.00270 — the mean is preserved while the variance shrinks]
practically, it is possible to replace Fn(y) with a normal distribution for n > 30
we should always average over at least 30 values
(interactive demo: http://onlinestatbook.com/stat_sim/sampling_dist/index.html)
Central-Limit Theorem
one assumption made in statistics courses is that the populations we work with are normally distributed
- which is often unrealistic with real-world data
yet, the use of an appropriate sample size and the central limit theorem helps us get around the problem of data from populations that are not normal
→ this theorem has many practical applications in statistics
- such as those involving hypothesis testing or confidence intervals
Skewed Input Data
[Figure: histogram of key sizes (log scale), sum = 196524, µ = 1.965, σ² = 243.245]
Zipfian distribution: #words = 100000, α = 2.5
Reducing Skew
we can reduce the impact of skew by using fewer Reduce tasks than there are reducers (keys)
if keys are sent randomly to Reduce tasks, we average over the value-list lengths
- → we average over the total time for each Reduce task (applying the Central-Limit Theorem)
- we should make sure that the sample size is large enough (n > 30)
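A small simulation of this idea (all sizes hypothetical; Pareto-distributed value-list lengths stand in for the Zipfian key sizes): keys with skewed value-list lengths are assigned uniformly at random to R Reduce tasks, and the per-task averages end up far less spread out than the per-key lengths:

```python
import random
from statistics import mean, variance

random.seed(1)
num_keys, R = 100_000, 1000   # ~100 keys per Reduce task, well above n > 30

# skewed value-list length per key (Pareto: a power law with finite variance)
lengths = [random.paretovariate(2.5) for _ in range(num_keys)]

# send each key to a uniformly random Reduce task
task_total = [0.0] * R
task_count = [0] * R
for length in lengths:
    t = random.randrange(R)
    task_total[t] += length
    task_count[t] += 1

# average value-list length inside each Reduce task
task_avgs = [tot / cnt for tot, cnt in zip(task_total, task_count)]
```

As the Central-Limit Theorem predicts, the per-task averages keep the per-key mean while their variance drops by roughly the number of keys per task.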
Reducers → Reduce Tasks
[Figure: left: key sizes, sum = 196524, µ = 1.965, σ² = 243.245; right: per-task key sizes, sum = 196524, µ = 196.524, σ² = 25136.428]
# Reducers (# Keys) = 100000 → # Reduce Tasks = 1000
Key Averages per Task
[Figure: left: per-task key sizes, sum = 196524, µ = 196.524, σ² = 25136.428; right: per-task key averages (log scale), µ = 1.958, σ² = 1.886]
→ we thus keep the mean and reduce the variance
Reducing Skew
we can further reduce the skew by using more Reduce tasks than there are compute nodes
long Reduce tasks might occupy a compute node fully
several shorter Reduce tasks are executed sequentially at a single compute node
- → we average over the total time for each compute node (again applying the Central-Limit Theorem)
- again, we should make sure that the sample size is large enough (n > 30)
Reduce Tasks → Compute Nodes
[Figure: per-node key sizes, sum = 196524, µ = 19652.400, σ² = 242116.267; per-node key averages, µ = 1.976, σ² = 0.030]
# Reduce Tasks = 1000 → # Compute Nodes = 10
Other Areas to Improve MapReduce Performance
File System
- higher throughput, reliability, replication efficiency
Scheduling
- data/task affinity, scheduling multiple MR apps
Failure Detection
- failure characterization, failure detector and prediction
Security
- data privacy, distributed result/error checking
Scientific Computing
- iterative MapReduce, numerical processing
The End
Today:
- Parallelism (and how to maximize it)
- optimizing computation performance,
  - e.g. handling input data skew …
Next: MapReduce (Part 3)
- suitable problem settings for the MapReduce paradigm
- Hadoop Ecosystem