
Page 1: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Distributed Computing with

Apache Hadoop

Introduction to MapReduce

Konstantin V. Shvachko

Birmingham Big Data Science Group October 19, 2011

Page 2: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Computing

• The history of computing started a long time ago

• Fascination with numbers

– Vast universe with simple strict rules

– Computing devices

– Crunch numbers

• The Internet

– Universe of words, fuzzy rules

– Different type of computing

– Understand meaning of things

– Human thinking

– Errors & deviations are a part of the study


Computer History Museum, San Jose

Page 3: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Words vs. Numbers

• In 1997 IBM built the Deep Blue supercomputer

– Played chess against world champion G. Kasparov

– The human race was defeated

– Strict rules for chess

– Fast, deep analysis of the current state

– Still numbers


• In 2011 IBM built the Watson computer to

play Jeopardy!

– Questions and hints in human terms

– Analysis of texts from libraries and the

Internet

– Human champions defeated

Page 4: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Big Data

• Computations that need the power of many computers

– Large datasets: hundreds of TBs, PBs

– Or use of thousands of CPUs in parallel

– Or both

• Cluster as a computer


What is a PB?

1 KB = 1000 Bytes

1 MB = 1000 KB

1 GB = 1000 MB

1 TB = 1000 GB

1 PB = 1000 TB

???? = 1000 PB (answer: 1 EB, an exabyte)

Page 5: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Examples – Science

• Fundamental physics: Large Hadron Collider (LHC)

– Smashing high-energy protons moving at nearly the speed of light

– 1 PB of event data per sec, most filtered out

– 15 PB of data per year

– 150 computing centers around the World

– 160 PB of disk + 90 PB of tape storage

• Math: Big Numbers

– The 2-quadrillionth (2·10^15) digit of π is 0

– pure CPU workload

– 12 days of cluster time

– 208 years of CPU-time on a cluster with 7600 CPU cores

• Big Data – Big Science


Page 6: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Examples – Web

• Search engine Webmap

– Map of the Internet

– 2008 @ Yahoo, 1500 nodes, 5 PB raw storage

• Internet Search Index

– Traditional application

• Social Network Analysis

– Intelligence

– Trends


Page 7: Distributed Computing with Apache Hadoop. Introduction to MapReduce

The Sorting Problem

• Classic in-memory sorting

– Complexity: number of comparisons

• External sorting

– Cannot load all data in memory

– 16 GB RAM vs. 200 GB file

– Complexity: + disk IOs (bytes read or written)

• Distributed sorting

– Cannot load data on a single server

– 12 drives * 2 TB = 24 TB disk space vs. 200 TB data set

– Complexity: + network transfers

              Worst        Average      Space
Bubble Sort   O(n²)        O(n²)        In-place
Quicksort     O(n²)        O(n log n)   In-place
Merge Sort    O(n log n)   O(n log n)   Double

Page 8: Distributed Computing with Apache Hadoop. Introduction to MapReduce

What do we do?

• Need a lot of computers

• How to make them work together?


Page 9: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Hadoop

• Apache Hadoop is an ecosystem of

tools for processing “Big Data”

• Started in 2005 by D. Cutting and M. Cafarella

• Consists of two main components, providing a unified cluster view:

1. HDFS – a distributed file system

– File system API connecting thousands of drives

2. MapReduce – a framework for distributed computations

– Splitting jobs into parts executable on one node

– Scheduling and monitoring of job execution

• Used everywhere today: becoming the standard for distributed computing

• Hadoop is an open source project


Page 10: Distributed Computing with Apache Hadoop. Introduction to MapReduce

MapReduce

• MapReduce

– 2004 Jeffrey Dean, Sanjay Ghemawat. Google.

– “MapReduce: Simplified Data Processing on Large Clusters”

• Computational model

– What is a computational model?

• Turing machine, Java

– Split large input data into small enough pieces, process in parallel

• Execution framework

– Compilers, interpreters

– Scheduling, Processing, Coordination

– Failure recovery


Page 11: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Functional Programming

• Map a higher-order function

– applies a given function to each element of a list

– returns the list of results

• Map( f(x), X[1:n] ) → [ f(X[1]), …, f(X[n]) ]

• Example. Map( x², [0,1,2,3,4,5] ) = [0,1,4,9,16,25]


Page 12: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Functional Programming: reduce


• Reduce / fold a higher-order function

– Iterates given function over a list of elements

– Applies function to previous result and current element

– Return single result

• Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15
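For illustration (an addition, not from the original deck), the same two higher-order functions expressed in modern Java (8+) streams:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapFoldDemo {
    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(0, 1, 2, 3, 4, 5);

        // Map: apply x -> x*x to every element, collect the list of results
        List<Integer> squares =
            xs.stream().map(x -> x * x).collect(Collectors.toList());
        System.out.println(squares); // [0, 1, 4, 9, 16, 25]

        // Reduce / fold: iterate (+) over the list, starting from 0
        int sum = xs.stream().reduce(0, (x, y) -> x + y);
        System.out.println(sum); // 15
    }
}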


Page 13: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Functional Programming


• Reduce( x * y, [0,1,2,3,4,5] ) = ?


Page 14: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Functional Programming


• Reduce( x * y, [0,1,2,3,4,5] ) = 0 (the leading 0 zeroes out the entire product)


Page 15: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Example: Sum of Squares

• Composition of

– a map followed by

– a reduce applied to the results of the map

• Example.

– Map( x², [1,2,3,4,5] ) = [1,4,9,16,25]

– Reduce( x + y, [1,4,9,16,25] ) = ((((1 + 4) + 9) + 16) + 25) = 55

• Map easily parallelizable

– Compute x² for 1,2,3 on one node and for 4,5 on another

• Reduce notoriously sequential

– Need all squares at one node to compute the total sum.

Square Pyramidal Number:

1² + 2² + … + n² = n(n+1)(2n+1) / 6
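As a quick check of the formula (an added note): for n = 5 it gives 5·6·11 / 6 = 55, matching the Reduce result above.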

Page 16: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Computational Model

• MapReduce is a Parallel Computational Model

• Map-Reduce algorithm = job

• Operates with key-value pairs: (k, V)

– Primitive types, Strings or more complex Structures

• Map-Reduce job input and output is a list of pairs {(k, V)}

• An MR job is defined by 2 functions:

• map: (k1, v1) → {(k2, v2)}

• reduce: (k2, {v2}) → {(k3, v3)}


Page 17: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Job Workflow

[Diagram: the words "dogs", "like", "cats" are mapped to consonant/vowel counts (C,3)(V,1), (C,2)(V,2), (C,3)(V,1), which are then reduced to the totals (C,8) and (V,4).]

Page 18: Distributed Computing with Apache Hadoop. Introduction to MapReduce

The Algorithm

Map(null, word)
  nC = Consonants(word)
  nV = Vowels(word)
  Emit("Consonants", nC)
  Emit("Vowels", nV)

Reduce(key, {n1, n2, …})
  nRes = n1 + n2 + …
  Emit(key, nRes)
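A minimal Java sketch of this Map function, mirroring the WordMean code shown later (the class name and the letters-only assumption are illustrative, not from the deck):

public static class VowelConsonantMapper
    extends Mapper<Object, Text, Text, LongWritable> {

  private final static Text CONSONANTS = new Text("Consonants");
  private final static Text VOWELS = new Text("Vowels");

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      String word = itr.nextToken().toLowerCase();
      long nV = 0;
      for (char ch : word.toCharArray())
        if ("aeiou".indexOf(ch) >= 0) nV++;
      long nC = word.length() - nV; // assumes words contain letters only
      context.write(CONSONANTS, new LongWritable(nC));
      context.write(VOWELS, new LongWritable(nV));
    }
  }
}

The Reduce side is the same summing reducer used in the WordMean example below.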

Page 19: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Computation Framework

• Two virtual clusters: HDFS and MapReduce

– Physically tightly coupled. Designed to work together

• Hadoop Distributed File System. View data as files and directories

• MapReduce is a Parallel Computation Framework

– Job scheduling and execution framework


Page 20: Distributed Computing with Apache Hadoop. Introduction to MapReduce

HDFS Architecture Principles

• The name space is a hierarchy of files and directories

• Files are divided into blocks (typically 128 MB)

• Namespace (metadata) is decoupled from data

– Fast namespace operations, not slowed down by data streaming

• Single NameNode keeps the entire name space in RAM

• DataNodes store data blocks on local drives

• Blocks are replicated on 3 DataNodes for redundancy and availability
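As an illustration (the property names are the pre-Hadoop-2.0 ones and are an assumption here, not from the deck), these defaults surface as plain client-side configuration:

import org.apache.hadoop.conf.Configuration;

public class HdfsDefaults {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Block size: pre-2.0 name "dfs.block.size" (later renamed "dfs.blocksize")
    conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB
    // Replication: each block is stored on 3 DataNodes
    conf.setInt("dfs.replication", 3);
    System.out.println(conf.get("dfs.replication")); // prints 3
  }
}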


Page 21: Distributed Computing with Apache Hadoop. Introduction to MapReduce

MapReduce Framework

• Job Input is a file or a set of files in a distributed file system (HDFS)

– Input is split into blocks of roughly the same size

– Blocks are replicated to multiple nodes

– Block holds a list of key-value pairs

• Map task is scheduled to one of the nodes containing the block

– Map task input is node-local

– Map task result is node-local

• Map task results are grouped: one group per reducer

– Each group is sorted

• Reduce task is scheduled to a node

– Reduce task transfers the targeted groups from all mapper nodes

– Computes and stores results in a separate HDFS file

• Job output is a set of files in HDFS, with #files = #reducers


Page 22: Distributed Computing with Apache Hadoop. Introduction to MapReduce

MapReduce Example: Mean

• Mean

• Input: large text file

• Output: average length of words in the file µ

• Example: µ({dogs, like, cats}) = 4

µ = (1/n) Σ_{i=1}^{n} x_i

Page 23: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Mean Mapper

• Map input is the set of words {w} in the partition

– Key = null Value = w

• Map computes

– Number of words in the partition

– Total length of the words ∑length(w)

• Map output

– <“count”, #words>

– <“length”, #totalLength>


Map(null, w)
  Emit("count", 1)
  Emit("length", length(w))

Page 24: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Single Mean Reducer

• Reduce input

– {<key, {value}>}, where

– key = “count”, “length”

– value is an integer

• Reduce computes

– Total number of words: N = sum of all “count” values

– Total length of words: L = sum of all “length” values

• Reduce Output

– <“count”, N>

– <“length”, L>

• The result

– µ = L / N

Reduce(key, {n1, n2, …})
  nRes = n1 + n2 + …
  Emit(key, nRes)

Analyze()
  read("part-r-00000")
  print("mean = " + L/N)

Page 25: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Mean: Mapper, Reducer


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordMean {

  private final static Text COUNT_KEY = new Text("count");
  private final static Text LENGTH_KEY = new Text("length");
  private final static LongWritable ONE = new LongWritable(1);

  public static class WordMeanMapper
      extends Mapper<Object, Text, Text, LongWritable> {

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        context.write(LENGTH_KEY, new LongWritable(word.length()));
        context.write(COUNT_KEY, ONE);
      }
    }
  }

  public static class WordMeanReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {

    public void reduce(Text key, Iterable<LongWritable> values,
        Context context) throws IOException, InterruptedException {
      long sum = 0; // long, not int: totals can overflow an int
      for (LongWritable val : values)
        sum += val.get();
      context.write(key, new LongWritable(sum));
    }
  }

  . . . . . . . . . . . . . . . .

Page 26: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Mean: main()


. . . . . . . . . . . . . . . .

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  String[] otherArgs =
      new GenericOptionsParser(conf, args).getRemainingArgs();
  if (otherArgs.length != 2) {
    System.err.println("Usage: wordmean <in> <out>");
    System.exit(2);
  }
  Job job = new Job(conf, "word mean");
  job.setJarByClass(WordMean.class);
  job.setMapperClass(WordMeanMapper.class);
  job.setCombinerClass(WordMeanReducer.class); // local pre-aggregation
  job.setReducerClass(WordMeanReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(LongWritable.class);
  job.setNumReduceTasks(1); // single reducer: one output file
  FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
  Path outputpath = new Path(otherArgs[1]);
  FileOutputFormat.setOutputPath(job, outputpath);
  boolean result = job.waitForCompletion(true);
  analyzeResult(outputpath);
  System.exit(result ? 0 : 1);
}

. . . . . . . . . . . . . . . .

Page 27: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Mean: analyzeResult()


. . . . . . . . . . . . . . . .

private static void analyzeResult(Path outDir) throws IOException {
  FileSystem fs = FileSystem.get(new Configuration());
  Path reduceFile = new Path(outDir, "part-r-00000");
  if (!fs.exists(reduceFile)) return;
  long count = 0, length = 0;
  BufferedReader in =
      new BufferedReader(new InputStreamReader(fs.open(reduceFile)));
  try {
    String line;
    while ((line = in.readLine()) != null) {
      StringTokenizer st = new StringTokenizer(line);
      String key = st.nextToken();
      String value = st.nextToken();
      if (key.equals("count")) count = Long.parseLong(value);
      else if (key.equals("length")) length = Long.parseLong(value);
    }
  } finally {
    in.close();
  }
  double average = (double) length / count;
  System.out.println("The mean is: " + average);
}

} // end WordMean

Page 28: Distributed Computing with Apache Hadoop. Introduction to MapReduce

MapReduce Implementation

• A single master JobTracker shepherds the distributed herd of TaskTrackers

1. Job scheduling and resource allocation

2. Job monitoring and job lifecycle coordination

3. Cluster health and resource tracking

• Job is defined

– Program: myJob.jar file

– Configuration: conf.xml

– Input, output paths

• JobClient submits the job to the JobTracker

– Calculates and creates splits based on the input

– Writes myJob.jar and conf.xml to HDFS


Page 29: Distributed Computing with Apache Hadoop. Introduction to MapReduce

MapReduce Implementation

• JobTracker divides the job into tasks: one map task per split.

– Assigns a TaskTracker for each task, collocated with the split

• TaskTrackers execute tasks and report status to the JobTracker

– TaskTracker can run multiple map and reduce tasks

– Map and Reduce Slots

• Failed attempts reassigned to other TaskTrackers

• Job execution status and results reported back to the client

• Scheduler lets many jobs run in parallel


Page 30: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Example: Standard Deviation

• Standard deviation

• Input: large text file

• Output: standard deviation σ of word lengths

• Example: σ({dogs, like, cats}) = 0

• How many jobs?

σ = sqrt( (1/n) Σ_{i=1}^{n} (x_i − µ)² )

Page 31: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Standard Deviation: Hint

σ² = (1/n) Σ_{i=1}^{n} (x_i − µ)²
   = (1/n) Σ x_i² − (2µ/n) Σ x_i + µ²
   = (1/n) Σ x_i² − µ²

So a single MR job suffices: have the mapper also emit the squared lengths, and compute σ = sqrt( (1/n) Σ x_i² − µ² ) at the end.

Page 32: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Standard Deviation Mapper

• Map input is the set of words {w} in the partition

– Key = null Value = w

• Map computes

– Number of words in the partition

– Total length of the words ∑length(w)

– The sum of lengths squared ∑ length(w)²

• Map output

– <“count”, #words>

– <“length”, #totalLength>

– <“squared”, #sumLengthSquared>

Map(null, w)
  Emit("count", 1)
  Emit("length", length(w))
  Emit("squared", length(w)²)
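A minimal Java sketch of this mapper, extending the WordMean mapper above by one extra pair per word (the class and key names are illustrative, not from the deck):

public static class WordStdDevMapper
    extends Mapper<Object, Text, Text, LongWritable> {

  private final static Text COUNT_KEY = new Text("count");
  private final static Text LENGTH_KEY = new Text("length");
  private final static Text SQUARED_KEY = new Text("squared");
  private final static LongWritable ONE = new LongWritable(1);

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      long len = itr.nextToken().length();
      context.write(COUNT_KEY, ONE);
      context.write(LENGTH_KEY, new LongWritable(len));
      context.write(SQUARED_KEY, new LongWritable(len * len)); // ∑ length²
    }
  }
}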

Page 33: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Standard Deviation Reducer

• Reduce input

– {<key, {value}>}, where

– key = “count”, “length”, “squared”

– value is an integer

• Reduce computes

– Total number of words: N = sum of all “count” values

– Total length of words: L = sum of all “length” values

– Sum of length squares: S = sum of all “squared” values

• Reduce Output

– <“count”, N>

– <“length”, L>

– <“squared”, S>

• The result

– µ = L / N

– σ = sqrt(S / N − µ²)


Reduce(key, {n1, n2, …})
  nRes = n1 + n2 + …
  Emit(key, nRes)

Analyze()
  read("part-r-00000")
  print("mean = " + L/N)
  print("std.dev = " + sqrt(S/N − (L/N)*(L/N)))

Page 34: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Combiner, Partitioner

• Combiners perform local aggregation before the shuffle & sort phase

– Optimization to reduce data transfers during shuffle

– In the Mean example, cuts the transfer down to two pairs per map task

• Partitioners assign intermediate (map) key-value pairs to reducers

– Responsible for dividing up the intermediate key space

– Not used with single Reducer

[Diagram: Input → Map → Combiner → Partitioner → Shuffle & Sort → Reduce → Output]

Page 35: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Distributed Sorting

• Sort a dataset that cannot be stored entirely on one node

• Input:

– Set of files with 100-byte records

– The first 10 bytes of each record are the key; the rest is the value

• Output:

– Ordered list of files: f1, … fN

– Each file fi is sorted, and

– If i < j, then for any keys k ∈ fi and r ∈ fj: k ≤ r

– Concatenation of files in the given order must form a completely sorted record set


Page 36: Distributed Computing with Apache Hadoop. Introduction to MapReduce


Naïve MapReduce Sorting

• If the output could be stored on one node

• The input to any Reducer is always sorted by key

– Shuffle sorts Map outputs

• One identity Mapper and one identity Reducer would do the trick

– Identity: <k,v> → <k,v>

[Diagram: the input words "dogs", "like", "cats" pass through an identity Map; the shuffle sorts them; the identity Reduce outputs "cats", "dogs", "like".]
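In the new Hadoop API the base Mapper and Reducer classes already implement the identity functions, so the naïve sort needs no custom code. A minimal job setup sketch (an addition, not from the deck; goes inside a main() like WordMean's):

Job job = new Job(conf, "naive sort");
job.setMapperClass(Mapper.class);   // identity map: <k,v> -> <k,v>
job.setReducerClass(Reducer.class); // identity reduce
job.setNumReduceTasks(1);           // single reducer: one sorted output file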

Page 37: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Naïve Sorting: Multiple Maps

• Multiple identity Mappers and one identity Reducer – same result

– Does not work for multiple Reducers

[Diagram: three identity Mappers feed a single identity Reducer, which outputs the sorted sequence "cats", "dogs", "like".]

Page 38: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Sorting: Generalization

• Define a hash function, such that

– h: {k} → [1,N]

– Preserves the order: k ≤ s → h(k) ≤ h(s)

– h(k) is a fixed-size prefix of the string k (e.g., its first 2 bytes)

• Identity Mapper

• With a specialized Partitioner

– Computes the hash h(k) of the key and assigns <k,v> to reducer R_h(k)

• Identity Reducer

– Number of reducers is N: R1, …, RN

– Input to Ri is all pairs <k,v> with h(k) = i

– Ri is an identity reducer, which writes its output to HDFS file fi

– The choice of hash function guarantees that keys in fi are less than keys in fj if i < j

• The algorithm was implemented to win Gray’s Terasort Benchmark in 2008
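A sketch of such an order-preserving partitioner (illustrative, and a simplification of the actual TeraSort partitioner), assuming keys are at least 2 bytes long:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns a key to reducer R_h(k) using its first 2 bytes, so that
// k <= s implies h(k) <= h(s): partitions come out globally ordered.
public class PrefixPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    byte[] b = key.getBytes();
    int prefix = ((b[0] & 0xff) << 8) | (b[1] & 0xff); // unsigned 16-bit prefix
    // Scale [0, 65536) onto [0, numPartitions), preserving order
    return (int) ((long) prefix * numPartitions / 65536);
  }
}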


Page 39: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Undirected Graphs

• “A Discipline of Programming” E. W. Dijkstra. Ch. 23.

– Good old classics

• A graph is defined by V = {v}, E = {<v,w> | v,w ∈ V}

• Undirected graph: E is symmetrical, that is <v,w> ∈ E ≡ <w,v> ∈ E

• Different representations of E

1. Set of pairs

2. <v, {direct neighbors}>

3. Adjacency matrix

• From 1 to 2 in one MR job

– Identity Mapper

– Combiner = Reducer

– Reducer joins values for each vertex
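A minimal sketch of that joining reducer, assuming each edge <v,w> arrives keyed by vertex, once as (v, w) and once as (w, v) (the class name and the comma-separated value format are illustrative):

public static class AdjacencyReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text vertex, Iterable<Text> neighbors, Context context)
      throws IOException, InterruptedException {
    StringBuilder adj = new StringBuilder();
    for (Text w : neighbors) { // join all direct neighbors of this vertex
      if (adj.length() > 0) adj.append(',');
      adj.append(w.toString());
    }
    context.write(vertex, new Text(adj.toString())); // <v, {neighbors}>
  }
}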


Page 40: Distributed Computing with Apache Hadoop. Introduction to MapReduce

Connected Components

• Partition set of nodes V into disjoint subsets V1, …, VN

– V = V1 U … U VN

– No paths using E from Vi to Vj if i ≠ j

– Gi = <Vi, Ei >

• Representation of connected component

– key = min{Vi}

– value = Vi

• Chain of MR jobs

• Initial data representation

– E is partitioned into sets of records (blocks)

– <v,w> ∈ E → <min(v,w), {v,w}> = <k, C>


Page 41: Distributed Computing with Apache Hadoop. Introduction to MapReduce

MR Connected Components

• Mapper / Reducer Input

– {<k, C>}, where C is a subset of V, k = min(C)

• Mapper

• Reducer

• Iterate. Stop when stabilized

Map({<k, C>})
  For all pairs <ki, Ci> and <kj, Cj>:
    if Ci ∩ Cj ≠ ∅ then
      C = Ci ∪ Cj
      Emit(min(C), C)

Reduce(k, {C1, C2, …})
  resC = C1 ∪ C2 ∪ …
  Emit(k, resC)
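A minimal Java sketch of the Reduce step (names are illustrative; components are carried as comma-separated vertex-id lists):

import java.util.TreeSet; // plus the Hadoop imports from WordMean above

public static class ComponentReducer
    extends Reducer<LongWritable, Text, LongWritable, Text> {

  @Override
  public void reduce(LongWritable key, Iterable<Text> components,
      Context context) throws IOException, InterruptedException {
    TreeSet<Long> resC = new TreeSet<Long>(); // sorted union C1 ∪ C2 ∪ …
    for (Text c : components)
      for (String id : c.toString().split(","))
        resC.add(Long.parseLong(id));
    StringBuilder sb = new StringBuilder();
    for (long id : resC) {
      if (sb.length() > 0) sb.append(',');
      sb.append(id);
    }
    // the key of a component is its minimum vertex id
    context.write(new LongWritable(resC.first()), new Text(sb.toString()));
  }
}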

Page 42: Distributed Computing with Apache Hadoop. Introduction to MapReduce

The End
