Map Reduce (Part 2)
Databases 2 (VU) (706.711 / 707.030)
Mark Kröll, Institute of Interactive Systems and Data Science, Graz University of Technology
Nov. 19, 2018



Outline

1 Brief Recap: MapReduce

2 Maximizing Parallelism

3 Performance Issues with MapReduce

4 Central-Limit Theorem

Slides are partially based on:

“Mining Massive Datasets” slides by Jure Leskovec

“Tutorial: MapReduce Theory and Practice of Data-intensive Applications” slides by Pietro Michiardi

“The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce” by J. Lin, 2009

“A Study of Skew in MapReduce Applications” by Kwon et al., 2011

“Limitations and Challenges of HDFS and MapReduce” by Weets et al., 2015


MapReduce vs. Traditional Parallel Programming

getting parallelism not from supercomputers (e.g. HPC) but from computing clusters = large collections of commodity hardware

- easy and cheap to replace
- yet, components do fail → redundancy, which is provided by a new form of file system: the “distributed file system” (DFS)

freeing the developer from devoting attention to managing system-level details, for example,

- synchronization primitives, inter-process communication, data transfer, etc.


A New Paradigm

MapReduce:

- allows for processing and generating large data sets with a parallel, distributed algorithm on a cluster infrastructure
- is a framework for implementing divide & conquer algorithms in an extremely scalable way
- moves the code/algorithm close to the data to minimize data movement


A MapReduce Computation

Input: a set of (key, value) pairs

e.g. the key is a filename and the value is a single line of that file (a line-level split is used when the file is too large to fit in memory)


A MapReduce Computation

Map(k, v) → (k′, v′)*

takes a (k, v) pair and outputs a set of (k′, v′) pairs

there is one Map call (mapper) for each (k, v) pair


A MapReduce Computation

Reduce(k′, (v′)*) → (k′′, v′′)*

all values v′ with the same key k′ are reduced together and processed in v′ order

there is one Reduce call (reducer) for each unique k′
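These two signatures can be sketched as a toy, single-process word count (the names `map_fn`, `reduce_fn` and the driver `run_mapreduce` are hypothetical; a real framework distributes the map, grouping, and reduce steps across the cluster):

```python
from collections import defaultdict

def map_fn(key, value):
    # key: e.g. a filename; value: one line of text.
    # Emit one (word, 1) pair per token.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: every count emitted for that word.
    yield (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Toy stand-in for the framework: map every record,
    # group the intermediate pairs by key, then reduce per key.
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    result = {}
    for k2, vs in groups.items():
        for k3, v3 in reduce_fn(k2, vs):
            result[k3] = v3
    return result
```

Running it on a single record, e.g. `run_mapreduce([("doc", "Star Wars is an epic Star")], map_fn, reduce_fn)`, yields one count per distinct word.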


A MapReduce Computation: An Example

Big Document (input):
“Star Wars is an American epic space opera franchise centered on a film series created by George Lucas. The film series has spawned an extensive media franchise called the Expanded Universe including books, television series, computer and video games, and comic books. These supplements to the two film trilogies …”

Map output:
(Star, 1) (Wars, 1) (is, 1) (an, 1) (American, 1) (epic, 1) (space, 1) (opera, 1) (franchise, 1) (centered, 1) (on, 1) (a, 1) (film, 1) (series, 1) (created, 1) (by, 1) . . .

Group by key:
(Star, 1) (Star, 1) (Wars, 1) (Wars, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (film, 1) (film, 1) (film, 1) (franchise, 1) (series, 1) (series, 1) . . .

Reduce output:
(Star, 2) (Wars, 2) (a, 6) (film, 3) (franchise, 1) (series, 2) . . .


General Thoughts on Parallelism

to maximize parallelism, we could think of

- using one Reduce task for each reducer, i.e. a single key and its associated value list
- and executing each Reduce task at a different compute node

yet, this plan is typically not the best one . . .


General Thoughts on Parallelism

→ since there is overhead associated with each task created

- tasks need to be set up, etc.
- so map tasks should take at least a minute to execute to pay off . . .

we want to keep the number of Reduce tasks lower than the number of different keys

- we do not want to create a task for a key with a “short” value list

there are often far more keys than there are compute nodes,

- e.g. count the words of Wikipedia or of the Web


General Thoughts on Parallelism

so how many Map and Reduce tasks?

if you have M map tasks + R reduce tasks, a rule of thumb is:

- make M much larger than the number of nodes in the cluster
- one DFS chunk per map task is common
- this improves dynamic load balancing and speeds up recovery from worker failures

usually R is smaller than M, because the output is spread across R files


1st Refinement: Combiners

if a Reduce function is associative and commutative (e.g. addition or multiplication), the values can be combined in any order, with the same result

- Commutative: x ◦ y = y ◦ x
- Associative: (x ◦ y) ◦ z = x ◦ (y ◦ z)

→ we can push some of the reducers’ work to the Map tasks; this way the output of a Map task is “combined” before grouping and sorting (instead of emitting (w, 1), (w, 1), . . . )

- it is still necessary to do grouping and aggregation and to pass the result to the Reduce tasks
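For the word count, such a local pre-aggregation can be sketched as follows (the helper name `combine` is hypothetical; the trick is valid only because addition is commutative and associative):

```python
from collections import Counter

def combine(pairs):
    # Locally pre-aggregate one mapper's (word, 1) output before the
    # shuffle. Correct only for commutative + associative reduce
    # functions such as addition: the order of summation is irrelevant.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())
```

The Reduce function itself stays unchanged; it merely receives shorter value lists, e.g. `combine([("a", 1), ("b", 1), ("a", 1), ("a", 1)])` collapses three pairs for “a” into a single `("a", 3)`.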


So instead of . . .

the plain word-count pipeline from before:

Map: (Star, 1) (Wars, 1) (is, 1) (an, 1) . . .

Group by key: (Star, 1) (Star, 1) (Wars, 1) (Wars, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (a, 1) (film, 1) (film, 1) (film, 1) (franchise, 1) (series, 1) (series, 1) . . .

Reduce: (Star, 2) (Wars, 2) (a, 6) (film, 3) (franchise, 1) (series, 2) . . .


Combine results already at the Map phase

Map (per mapper):
mapper 1: (Star, 1) (Wars, 1) (is, 1) . . .
mapper 2: (American, 1) (epic, 1) (space, 1) . . .
mapper 3: (franchise, 1) (centered, 1) (on, 1) . . .
mapper 4: (film, 1) (series, 1) (created, 1) . . .

Combiner (per mapper, before the shuffle):
(Star, 2) . . . | (Wars, 2) (a, 6) . . . | (a, 3) . . . | (film, 3) (franchise, 1) (series, 2) . . .

Group by key:
(Star, 2) (Wars, 2) (a, 6) (a, 3) (a, 4) . . . (film, 3) (franchise, 1) (series, 2) . . .

Reduce:
(Star, 2) (Wars, 2) (a, 13) (film, 3) (franchise, 1) (series, 2) . . .


2nd Refinement: Partition Function

controls the partitioning of the keys of the intermediate map outputs

the reduce step needs to ensure that records with the same intermediate key end up at the same worker

the key (or a subset of the key) is used to derive the partition, typically by a hash function

- (total # of partitions) = (# of reduce tasks)

this controls which of the R reduce tasks an intermediate key is sent to for reduction


2nd Refinement: Partition Function

The idea: we want to control how keys get partitioned:

by default the system uses a hash function, e.g.

- hash(key) mod R

it is sometimes useful to override this hash function, e.g.

- hash(hostname(URL)) mod R
- to ensure URLs from the same host end up in the same output file
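A sketch of the default and the host-based partitioners (the function names and R = 8 are illustrative; note that Python's string hashes are salted per process, so the concrete partition numbers differ between runs, but all URLs of one host still land in the same partition within a run):

```python
from urllib.parse import urlparse

R = 8  # number of reduce tasks (illustrative)

def default_partition(key, r=R):
    # default scheme: hash(key) mod R
    return hash(key) % r

def host_partition(url, r=R):
    # override: hash(hostname(URL)) mod R, so every URL of a given
    # host is routed to the same reduce task / output file
    return hash(urlparse(url).hostname) % r
```

For example, `host_partition("http://example.com/a")` and `host_partition("http://example.com/b?q=1")` always agree, whereas `default_partition` would usually scatter the two URLs.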


Input Data Skew

Data skew occurs naturally in many applications, for example,

- PageRank is a link analysis algorithm that assigns weights to each vertex in a graph by iteratively aggregating the weights of its inbound neighbors. This application can thus exhibit skew if the graph includes nodes with a large number of incoming edges.

as a consequence of the input data being skewed, a small number of mappers often takes significantly longer to complete than the rest,

- thus blocking the progress of the entire job and leading to poor parallelism.

this is called the Stragglers Problem since

- in MapReduce, the reducers cannot start until all the mappers have finished → a few stragglers can have a large impact on the overall end-to-end running time


Stragglers can have a large impact

The slowest map task (first one from the top) takes more than twice as long to complete as the second-slowest map task, thereby killing the parallelism effect. (Fig. from Kwon et al., 2011)


Stragglers can have a large impact

similarly, a small number of long-running reducers can significantly delay the completion of a MapReduce job

in addition to long running times, the stragglers phenomenon has implications for cluster utilization

- most of the cluster is idle while a submitted job waits for the last mapper or reducer to complete


2 Reasons for the Stragglers Problem

first, idiosyncrasies of machines in a large cluster, i.e. the slowness of the “long-running” task could be due to faulty hardware or network congestion, or the node could simply be busy, etc.

- nicely handled by speculative execution, which is implemented in Hadoop
- → Hadoop speculates that something is wrong with the “long-running” task and runs a clone task on another node; multiple instances of the same mapper or reducer are redundantly executed in parallel (subject to the availability of cluster resources)


2 Reasons for the Stragglers Problem

second, the distribution of running times of mappers and reducers is highly skewed

- there is often significant variation in the lengths of the value lists of different keys
- frequent words (e.g. articles, pronouns) vs. rarely occurring words (e.g. person names)

take the word-count example: due to the distribution of terms in natural language, some reducers will have more work than others

- this yields potentially large differences in running times, independent of hardware idiosyncrasies


Distribution of Terms in Natural Language

it is a well-known observation that word occurrences in natural language, to a first approximation, follow a Zipfian distribution

- a family of related discrete power-law probability distributions
- named after the American linguist George Kingsley Zipf

other (naturally occurring) phenomena can be similarly characterized

- website popularity, sizes of craters on the moon, sizes of German cities, economic power of countries, . . .

loosely formulated, Zipfian distributions are those where a few elements are exceedingly common, followed by a long tail of rare events
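This head-heavy behavior is easy to simulate; the sketch below (function name and parameters illustrative) draws word ranks with probability proportional to k^(−α):

```python
import random
from collections import Counter

def zipf_draws(n_ranks, alpha, n_draws, seed=42):
    # Draw ranks 1..n_ranks with P(rank k) proportional to k**(-alpha),
    # i.e. a truncated Zipf distribution.
    rng = random.Random(seed)
    ranks = list(range(1, n_ranks + 1))
    weights = [k ** -alpha for k in ranks]
    return rng.choices(ranks, weights=weights, k=n_draws)

counts = Counter(zipf_draws(n_ranks=1000, alpha=2.5, n_draws=10000))
```

With α = 2.5, roughly three quarters of all draws fall on the single most frequent rank, which is exactly the value-list skew the reducer for that key would face.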


Power-law (Zipf) random variable

[Figure: probability mass function of a Zipf random variable for differing α values (α = 2.0, α = 3.0); linear scale on the left, log-log scale on the right]

resemblance to a hyperbola; the higher α, the steeper (log-log plot on the right-hand side)


Tackling Input Data Skew

We need

- to distribute the skewed (power-law) input data over a number of Reduce tasks / compute nodes

We want that

- the distribution of the key lengths inside the Reduce tasks / compute nodes is approximately normal

We also want that

- the variance of these distributions is smaller than the original variance
- → if the variance is small, efficient load balancing is possible


The Setup

each Reduce task receives a number of keys (= a number of reducers)

- the total number of values to process is the sum of the number of values over all keys
- the average number of values that a Reduce task processes is the average of the number of values over all keys

equivalently, each compute node receives a number of Reduce tasks

- the sum and average for a compute node are the sum and average over all Reduce tasks of that node


How should we distribute keys to Reduce tasks?

Natural ideas might include:

uniformly at random?

calculate the capacity of a single Reduce task; add keys until capacity is reached, . . . ?

Let’s take a step back:

We are averaging over a skewed distribution . . .

Are there laws that describe how the averages of sufficiently large samples drawn from a probability distribution behave?

- in other words, how are the averages of samples of a random variable (r.v.) distributed?

→ Central-Limit Theorem


In a nutshell

The Central-Limit Theorem describes the distribution of the arithmetic mean of sufficiently large samples of independent and identically distributed random variables

- the means are normally distributed
- the mean of the new distribution equals the mean of the original distribution
- the variance of the new distribution equals σ²/n, where σ² is the variance of the original distribution

→ thus, we keep the mean and reduce the variance
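A quick simulation of this effect (illustrative numbers, using a skewed exponential population with µ = 1 and σ² = 1): averaging n = 30 values at a time keeps the mean near µ while shrinking the variance toward σ²/n:

```python
import random
import statistics

def means_of_samples(n, trials, seed=7):
    # Draw `trials` samples of size n from a skewed population
    # (exponential with mean 1 and variance 1) and return the
    # arithmetic mean of each sample.
    rng = random.Random(seed)
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

avgs = means_of_samples(n=30, trials=5000)
# the mean of the averages stays near 1; their variance is near 1/30
```

Even though the underlying population is far from normal, the histogram of `avgs` already looks bell-shaped at n = 30.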


Central-Limit Theorem

Theorem. Suppose $X_1, \ldots, X_n$ are independent and identically distributed r.v. with expectation $\mu$ and variance $\sigma^2$. Let $Y_n$ be the r.v. defined as

$$Y_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

For large $n$, the CDF $F_n(y)$ of $Y_n$ is approximately the CDF of a normal r.v. with mean $\mu$ and variance $\sigma^2/n$:

$$F_n(y) \approx \frac{1}{\sqrt{2\pi\sigma^2/n}} \int_{-\infty}^{y} e^{-\frac{(x-\mu)^2}{2\sigma^2/n}} \, dx$$


Central-Limit Theorem

[Figure: histograms of sample averages; left: µ = 0.5, σ² = 0.08333; right: µ = 0.499, σ² = 0.00270]

practically, it is possible to replace Fn(y) with a normal distribution for n > 30

we should always average over at least 30 values

http://onlinestatbook.com/stat_sim/sampling_dist/index.html


Central-Limit Theorem

one assumption made in statistics courses is that the populations we work with are normally distributed

- which is often unrealistic with real-world data

yet, the use of an appropriate sample size and the central limit theorem help us get around the problem of data from populations that are not normal

→ this theorem has many practical applications in statistics

- such as those involving hypothesis testing or confidence intervals


Skewed Input Data

[Figure: key-size histogram (log scale); sum = 196524, µ = 1.965, σ² = 243.245]

Zipfian distribution: # words = 100000, α = 2.5


Reducing Skew

we can reduce the skew impact by using fewer Reduce tasks than there are reducers

if keys are sent randomly to Reduce tasks, we average over the value-list lengths

- → we average over the total time of each Reduce task (applying the Central-Limit Theorem)
- we should make sure that the sample size is large enough (n > 30)
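The idea can be sketched as follows (synthetic loads and the helper name are illustrative): randomly spreading many skewed keys over fewer Reduce tasks leaves the per-task average value-list length with far less variance than the raw per-key lengths:

```python
import random
import statistics

def random_assignment(key_loads, n_tasks, seed=3):
    # Send each key (represented by its value-list length) to a
    # uniformly random Reduce task.
    rng = random.Random(seed)
    tasks = [[] for _ in range(n_tasks)]
    for load in key_loads:
        tasks[rng.randrange(n_tasks)].append(load)
    return tasks

# one very heavy key plus many light ones -- a crude stand-in for Zipf
loads = [5000] + [1] * 9999
tasks = random_assignment(loads, n_tasks=100)
task_avgs = [statistics.fmean(t) for t in tasks if t]
```

With 10000 keys spread over 100 tasks, each task averages over roughly 100 value-list lengths, so the variance of `task_avgs` is far below the variance of the raw `loads`, as the Central-Limit Theorem predicts.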


Reducers → Reduce Tasks

[Figure: left: key size per reducer: sum = 196524, µ = 1.965, σ² = 243.245; right: key size per Reduce task: sum = 196524, µ = 196.524, σ² = 25136.428]

# Reducers (# keys) = 100000 → # Reduce tasks = 1000


Key Averages per Task

[Figure: left: task key size: sum = 196524, µ = 196.524, σ² = 25136.428; right: task key averages: µ = 1.958, σ² = 1.886]

→ we thus keep the mean and reduce the variance


Reducing Skew

we can further reduce the skew by using more Reduce tasks than there are compute nodes

long Reduce tasks might occupy a compute node fully

several shorter Reduce tasks are executed sequentially at a single compute node

- → we average over the total time of each compute node (again applying the Central-Limit Theorem)
- again, we should make sure that the sample size is large enough (n > 30)


Reduce Tasks → Compute Nodes

[Figure: node key size: sum = 196524, µ = 19652.400, σ² = 242116.267; node key averages: µ = 1.976, σ² = 0.030]

# Reduce tasks = 1000 → # compute nodes = 10


Other Areas to Improve MapReduce Performance

File System

- higher throughput, reliability, replication efficiency

Scheduling

- data/task affinity, scheduling multiple MR apps

Failure Detection

- failure characterization, failure detectors and prediction

Security

- data privacy, distributed result/error checking

Scientific Computing

- iterative MapReduce, numerical processing


The End

Today:

- parallelism (and how to maximize it)
- optimizing computation performance,
- e.g. handling input data skew . . .

Next: MapReduce (Part 3)

- suitable problem settings for the MapReduce paradigm
- Hadoop Ecosystem