TRANSCRIPT
Distributed Computing with
Apache Hadoop
Introduction to MapReduce
Konstantin V. Shvachko
Birmingham Big Data Science Group October 19, 2011
Computing
• The history of computing started a long time ago
• Fascination with numbers
– Vast universe with simple strict rules
– Computing devices
– Crunch numbers
• The Internet
– Universe of words, fuzzy rules
– Different type of computing
– Understand meaning of things
– Human thinking
– Errors & deviations are a part of the study
Computer History Museum, San Jose
Words vs. Numbers
• In 1997 IBM built the Deep Blue supercomputer
– Played a chess match with the champion G. Kasparov
– The human race was defeated
– Strict rules for chess
– Fast, deep analysis of the current state
– Still numbers
• In 2011 IBM built Watson computer to
play Jeopardy
– Questions and hints in human terms
– Analysis of texts from library and the
Internet
– Human champions defeated
Big Data
• Computations that need the power of many computers
– Large datasets: hundreds of TBs, PBs
– Or use of thousands of CPUs in parallel
– Or both
• Cluster as a computer
What is a PB?
1 KB = 1000 Bytes
1 MB = 1000 KB
1 GB = 1000 MB
1 TB = 1000 GB
1 PB = 1000 TB
1 EB (exabyte) = 1000 PB
Examples – Science
• Fundamental physics: Large Hadron Collider (LHC)
– Smashing high-energy protons at nearly the speed of light
– 1 PB of event data per sec, most filtered out
– 15 PB of data per year
– 150 computing centers around the World
– 160 PB of disk + 90 PB of tape storage
• Math: Big Numbers
– The 2 quadrillionth (10^15) digit of π is 0
– pure CPU workload
– 12 days of cluster time
– 208 years of CPU-time on a cluster with 7600 CPU cores
• Big Data – Big Science
Examples – Web
• Search engine Webmap
– Map of the Internet
– 2008 @ Yahoo, 1500 nodes, 5 PB raw storage
• Internet Search Index
– Traditional application
• Social Network Analysis
– Intelligence
– Trends
The Sorting Problem
• Classic in-memory sorting
– Complexity: number of comparisons
• External sorting
– Cannot load all data in memory
– 16 GB RAM vs. 200 GB file
– Complexity: + disk IOs (bytes read or written)
• Distributed sorting
– Cannot load data on a single server
– 12 drives * 2 TB = 24 TB disk space vs. 200 TB data set
– Complexity: + network transfers
Algorithm     Worst       Average     Space
Bubble Sort   O(n^2)      O(n^2)      In-place
Quicksort     O(n^2)      O(n log n)  In-place
Merge Sort    O(n log n)  O(n log n)  Double
What do we do?
• Need a lot of computers
• How do we make them work together?
Hadoop
• Apache Hadoop is an ecosystem of
tools for processing “Big Data”
• Started in 2005 by D. Cutting and M. Cafarella
• Consists of two main components providing a unified cluster view:
1. HDFS – a distributed file system
– File system API connecting thousands of drives
2. MapReduce – a framework for distributed computations
– Splitting jobs into parts executable on one node
– Scheduling and monitoring of job execution
• Today used everywhere: Becoming a standard of distributed computing
• Hadoop is an open source project
MapReduce
• MapReduce
– 2004 Jeffrey Dean, Sanjay Ghemawat. Google.
– “MapReduce: Simplified Data Processing on Large Clusters”
• Computational model
– What is a computational model?
• E.g. the Turing machine, or Java
– Split large input data into small enough pieces, process in parallel
• Execution framework
– Compilers, interpreters
– Scheduling, Processing, Coordination
– Failure recovery
Functional Programming
• Map: a higher-order function
– applies a given function to each element of a list
– returns the list of results
• Map( f(x), X[1:n] ) -> [ f(X[1]), …, f(X[n]) ]
• Example. Map( x^2, [0,1,2,3,4,5] ) = [0,1,4,9,16,25]
Functional Programming: reduce
• Reduce / fold: a higher-order function
– Iterates a given function over a list of elements
– Applies the function to the previous result and the current element
– Returns a single result
• Example. Reduce( x + y, [0,1,2,3,4,5] ) = (((((0 + 1) + 2) + 3) + 4) + 5) = 15
• Reduce( x * y, [0,1,2,3,4,5] ) = 0
– The leading element 0 turns the whole product into 0
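These map and reduce primitives translate directly into Java streams; a minimal sketch (the class name FunctionalDemo is illustrative):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

public class FunctionalDemo {

    // Map: apply a function to each element, return the list of results
    static List<Integer> squares(List<Integer> xs) {
        return xs.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // Reduce / fold: combine the previous result with the current element
    static int reduce(List<Integer> xs, BinaryOperator<Integer> f, int identity) {
        return xs.stream().reduce(identity, f);
    }

    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(0, 1, 2, 3, 4, 5);
        System.out.println(squares(xs));                    // [0, 1, 4, 9, 16, 25]
        System.out.println(reduce(xs, Integer::sum, 0));    // 15
        // The leading 0 annihilates the product
        System.out.println(reduce(xs, (x, y) -> x * y, 1)); // 0
    }
}
```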
Example: Sum of Squares
• Composition of
– a map followed by
– a reduce applied to the results of the map
• Example.
– Map( x^2, [1,2,3,4,5] ) = [1,4,9,16,25]
– Reduce( x + y, [1,4,9,16,25] ) = ((((1 + 4) + 9) + 16) + 25) = 55
• Map easily parallelizable
– Compute x2 for 1,2,3 on one node and for 4,5 on another
• Reduce notoriously sequential
– Need all squares at one node to compute the total sum.
Square Pyramid Number
1^2 + 2^2 + … + n^2 = n(n+1)(2n+1) / 6
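The closed form can be cross-checked against the map-then-reduce composition; a small sketch (class and method names are illustrative):

```java
import java.util.stream.IntStream;

public class SquarePyramid {

    // Sum of squares via map (x -> x*x) followed by reduce (+)
    static long sumOfSquares(int n) {
        return IntStream.rangeClosed(1, n).mapToLong(x -> (long) x * x).sum();
    }

    // Closed form: n(n+1)(2n+1) / 6
    static long closedForm(int n) {
        return (long) n * (n + 1) * (2L * n + 1) / 6;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 100; n++)
            if (sumOfSquares(n) != closedForm(n))
                throw new AssertionError("mismatch at n = " + n);
        System.out.println(sumOfSquares(5)); // 55
    }
}
```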
Computational Model
• MapReduce is a Parallel Computational Model
• Map-Reduce algorithm = job
• Operates with key-value pairs: (k, V)
– Primitive types, Strings or more complex Structures
• Map-Reduce job input and output is a list of pairs {(k, V)}
• An MR job is defined by 2 functions
• map: (k1; v1) → {(k2; v2)}
• reduce: (k2; {v2}) → {(k3; v3)}
Job Workflow
[Diagram: the words dogs, like, cats flow through Map tasks emitting consonant/vowel counts (C, 3)(V, 1), (C, 2)(V, 2), (C, 3)(V, 1); Reduce sums them to (C, 8) and (V, 4)]
The Algorithm
Map ( null, word)
nC = Consonants(word)
nV = Vowels(word)
Emit(“Consonants”, nC)
Emit(“Vowels”, nV)
Reduce(key, {n1, n2, …})
nRes = n1 + n2 + …
Emit(key, nRes)
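The same algorithm can be simulated outside Hadoop in plain Java to trace the data flow; a sketch with illustrative names (the real job would use Mapper/Reducer classes like those shown later for the Mean example):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ConsonantVowelCount {

    // Map phase: for each word emit ("Consonants", nC) and ("Vowels", nV)
    static List<Map.Entry<String, Integer>> map(String word) {
        int nV = 0;
        for (char c : word.toLowerCase().toCharArray())
            if ("aeiou".indexOf(c) >= 0) nV++;
        int nC = word.length() - nV;
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(Map.entry("Consonants", nC));
        out.add(Map.entry("Vowels", nV));
        return out;
    }

    // Reduce phase: sum all values emitted under the same key
    static Map<String, Integer> run(List<String> words) {
        Map<String, Integer> result = new HashMap<>();
        for (String w : words)
            for (Map.Entry<String, Integer> kv : map(w))
                result.merge(kv.getKey(), kv.getValue(), Integer::sum);
        return result;
    }

    public static void main(String[] args) {
        // Totals for {dogs, like, cats}: Consonants = 8, Vowels = 4
        System.out.println(run(List.of("dogs", "like", "cats")));
    }
}
```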
Computation Framework
• Two virtual clusters: HDFS and MapReduce
– Physically tightly coupled. Designed to work together
• Hadoop Distributed File System. View data as files and directories
• MapReduce is a Parallel Computation Framework
– Job scheduling and execution framework
HDFS Architecture Principles
• The name space is a hierarchy of files and directories
• Files are divided into blocks (typically 128 MB)
• Namespace (metadata) is decoupled from data
– Fast namespace operations, not slowed down by data streaming
• Single NameNode keeps the entire name space in RAM
• DataNodes store data blocks on local drives
• Blocks are replicated on 3 DataNodes for redundancy and availability
MapReduce Framework
• Job Input is a file or a set of files in a distributed file system (HDFS)
– Input is split into blocks of roughly the same size
– Blocks are replicated to multiple nodes
– Block holds a list of key-value pairs
• Map task is scheduled to one of the nodes containing the block
– Map task input is node-local
– Map task result is node-local
• Map task results are grouped: one group per reducer
– Each group is sorted
• Reduce task is scheduled to a node
– Reduce task transfers the targeted groups from all mapper nodes
– Computes and stores results in a separate HDFS file
• Job Output is a set of files in HDFS. With #files = #reducers
Map Reduce Example: Mean
• Mean
• Input: large text file
• Output: average length of words in the file µ
• Example: µ({dogs, like, cats}) = 4
µ = (1/n) ∑ xᵢ
Mean Mapper
• Map input is the set of words {w} in the partition
– Key = null Value = w
• Map computes
– Number of words in the partition
– Total length of the words ∑length(w)
• Map output
– <“count”, #words>
– <“length”, #totalLength>
Map (null, w)
Emit(“count”, 1)
Emit(“length”, length(w))
Single Mean Reducer
• Reduce input
– {<key, {value}>}, where
– key = “count”, “length”
– value is an integer
• Reduce computes
– Total number of words: N = sum of all “count” values
– Total length of words: L = sum of all “length” values
• Reduce Output
– <“count”, N>
– <“length”, L>
• The result
– µ = L / N
Reduce(key, {n1, n2, …})
nRes = n1 + n2 + …
Emit(key, nRes)
Analyze ()
read(“part-r-00000”)
print(“mean = ” + L/N)
Mean: Mapper, Reducer
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordMean {
  private final static Text COUNT_KEY = new Text("count");
  private final static Text LENGTH_KEY = new Text("length");
  private final static LongWritable ONE = new LongWritable(1);

  public static class WordMeanMapper
      extends Mapper<Object, Text, Text, LongWritable> {
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        context.write(LENGTH_KEY, new LongWritable(word.length()));
        context.write(COUNT_KEY, ONE);
      }
    }
  }

  public static class WordMeanReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterable<LongWritable> values,
        Context context) throws IOException, InterruptedException {
      long sum = 0;  // long, not int: totals on big inputs can overflow int
      for (LongWritable val : values)
        sum += val.get();
      context.write(key, new LongWritable(sum));
    }
  }
. . . . . . . . . . . . . . . .
Mean: main()
. . . . . . . . . . . . . . . .
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs =
        new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordmean <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word mean");
    job.setJarByClass(WordMean.class);
    job.setMapperClass(WordMeanMapper.class);
    job.setReducerClass(WordMeanReducer.class);
    job.setCombinerClass(WordMeanReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setNumReduceTasks(1);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    Path outputpath = new Path(otherArgs[1]);
    FileOutputFormat.setOutputPath(job, outputpath);
    boolean result = job.waitForCompletion(true);
    analyzeResult(outputpath);
    System.exit(result ? 0 : 1);
  }
. . . . . . . . . . . . . . . .
Mean: analyzeResult()
. . . . . . . . . . . . . . . .
  private static void analyzeResult(Path outDir) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path reduceFile = new Path(outDir, "part-r-00000");
    if (!fs.exists(reduceFile)) return;
    long count = 0, length = 0;
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(reduceFile)));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        StringTokenizer st = new StringTokenizer(line);
        String key = st.nextToken();
        String value = st.nextToken();
        if (key.equals("count")) count = Long.parseLong(value);
        else if (key.equals("length")) length = Long.parseLong(value);
      }
    } finally {
      in.close();
    }
    double average = (double) length / count;
    System.out.println("The mean is: " + average);
  }
} // end WordMean
MapReduce Implementation
• A single master JobTracker shepherds the distributed herd of TaskTrackers
1. Job scheduling and resource allocation
2. Job monitoring and job lifecycle coordination
3. Cluster health and resource tracking
• Job is defined
– Program: myJob.jar file
– Configuration: conf.xml
– Input, output paths
• JobClient submits the job to the JobTracker
– Calculates and creates splits based on the input
– Writes myJob.jar and conf.xml to HDFS
MapReduce Implementation
• JobTracker divides the job into tasks: one map task per split.
– Assigns a TaskTracker for each task, collocated with the split
• TaskTrackers execute tasks and report status to the JobTracker
– TaskTracker can run multiple map and reduce tasks
– Map and Reduce Slots
• Failed attempts reassigned to other TaskTrackers
• Job execution status and results reported back to the client
• Scheduler lets many jobs run in parallel
Example: Standard Deviation
• Standard deviation
• Input: large text file
• Output: standard deviation σ of word lengths
• Example: σ({dogs, like, cats}) = 0
• How many jobs are needed?

σ = sqrt( (1/n) ∑ (xᵢ − µ)² )
Standard Deviation: Hint
σ² = (1/n) ∑ (xᵢ − µ)²
   = (1/n) ∑ xᵢ² − (2µ/n) ∑ xᵢ + µ²
   = (1/n) ∑ xᵢ² − µ²
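This one-pass identity (σ² = (1/n) ∑ xᵢ² − µ²) can be verified numerically; a small sketch with illustrative names:

```java
public class VarianceIdentity {

    // Two-pass definition: sigma^2 = (1/n) * sum((x - mu)^2)
    static double twoPass(double[] x) {
        double mu = 0;
        for (double v : x) mu += v;
        mu /= x.length;
        double var = 0;
        for (double v : x) var += (v - mu) * (v - mu);
        return var / x.length;
    }

    // One-pass form used by the MapReduce job: sigma^2 = S/n - mu^2
    static double onePass(double[] x) {
        double sum = 0, sumSq = 0;
        for (double v : x) { sum += v; sumSq += v * v; }
        double mu = sum / x.length;
        return sumSq / x.length - mu * mu;
    }

    public static void main(String[] args) {
        double[] lengths = {4, 4, 4};          // word lengths of dogs, like, cats
        System.out.println(twoPass(lengths));  // 0.0
        System.out.println(onePass(lengths));  // 0.0
    }
}
```

The one-pass form is what makes a single MapReduce job sufficient: the mapper only needs the running count, sum, and sum of squares.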
Standard Deviation Mapper
• Map input is the set of words {w} in the partition
– Key = null Value = w
• Map computes
– Number of words in the partition
– Total length of the words ∑length(w)
– The sum of lengths squared ∑length(w)²
• Map output
– <“count”, #words>
– <“length”, #totalLength>
– <“squared”, #sumLengthSquared>
Map (null, w)
Emit(“count”, 1)
Emit(“length”, length(w))
Emit(“squared”, length(w)²)
Standard Deviation Reducer
• Reduce input
– {<key, {value}>}, where
– key = “count”, “length”, “squared”
– value is an integer
• Reduce computes
– Total number of words: N = sum of all “count” values
– Total length of words: L = sum of all “length” values
– Sum of length squares: S = sum of all “squared” values
• Reduce Output
– <“count”, N>
– <“length”, L>
– <“squared”, S>
• The result
– µ = L / N
– σ = sqrt(S / N − µ²)
Reduce(key, {n1, n2, …})
nRes = n1 + n2 + …
Emit(key, nRes)
Analyze ()
read(“part-r-00000”)
print(“mean = ” + L/N)
print(“std.dev = ” +
sqrt(S/N − L*L/(N*N)))
Combiner, Partitioner
• Combiners perform local aggregation before the shuffle & sort phase
– Optimization to reduce data transfers during shuffle
– In the Mean example a combiner cuts each mapper’s output down to just two pairs
• Partitioners assign intermediate (map) key-value pairs to reducers
– Responsible for dividing up the intermediate key space
– Not used with single Reducer
[Diagram: Input → Map → Combiner → Partitioner → Shuffle & sort → Reduce → Output]
Distributed Sorting
• Sort a dataset, which cannot be entirely stored on one node.
• Input:
– Set of files. 100 byte records.
– The first 10 bytes of each record is the key and the rest is the value.
• Output:
– Ordered list of files: f1, … fN
– Each file fi is sorted, and
– If i < j then for any keys k ∈ fi and r ∈ fj (k ≤ r)
– Concatenation of files in the given order must form a completely sorted record set
Naïve MapReduce Sorting
• If the output could be stored on one node
• The input to any Reducer is always sorted by key
– Shuffle sorts Map outputs
• One identity Mapper and one identity Reducer would do the trick
– Identity: <k,v> → <k,v>
[Diagram: one identity Map and one identity Reduce; the input dogs, like, cats leaves the shuffle sorted as cats, dogs, like]
Naïve Sorting: Multiple Maps
• Multiple identity Mappers and one identity Reducer – same result
– Does not work for multiple Reducers
[Diagram: several identity Map tasks feeding a single identity Reduce; the output is still the sorted sequence cats, dogs, like]
Sorting: Generalization
• Define a hash function, such that
– h: {k} → [1,N]
– Preserves the order: k ≤ s → h(k) ≤ h(s)
– h(k) is a fixed-size prefix of string k (the first 2 bytes)
• Identity Mapper
• With a specialized Partitioner
– Computes the hash of the key h(k) and assigns <k,v> to reducer Rh(k)
• Identity Reducer
– Number of reducers is N: R1, …, RN
– Inputs for Ri are all pairs that have key h(k) = i
– Ri is an identity reducer, which writes output to HDFS file fi
– Hash function choice guarantees that
keys from fi are less than keys from fj if i < j
• The algorithm was implemented to win Gray’s Terasort Benchmark in 2008
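An order-preserving partition function of this kind can be sketched in plain Java. This illustrates only the prefix idea: the actual TeraSort winner sampled the keys to build balanced ranges rather than assuming a uniform key distribution, and the class name here is illustrative.

```java
public class PrefixPartitioner {

    // h(k): treat the first 2 bytes of the key as a 16-bit unsigned prefix,
    // then scale it to one of numReducers buckets.
    // Because the prefix respects byte order, k <= s implies h(k) <= h(s).
    static int partition(byte[] key, int numReducers) {
        int prefix = ((key[0] & 0xFF) << 8) | (key[1] & 0xFF); // 0 .. 65535
        return prefix * numReducers / 65536;
    }

    public static void main(String[] args) {
        byte[] low  = {0x00, 0x10};
        byte[] mid  = {0x7F, 0x00};
        byte[] high = {(byte) 0xFF, (byte) 0xFF};
        System.out.println(partition(low, 4));  // 0
        System.out.println(partition(mid, 4));  // 1
        System.out.println(partition(high, 4)); // 3
    }
}
```

Concatenating the reducer outputs R1, …, RN in bucket order then yields a fully sorted record set.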
Undirected Graphs
• “A Discipline of Programming” E. W. Dijkstra. Ch. 23.
– Good old classics
• A graph is defined by V = {v}, E = {<v,w> | v,w ∈ V}
• Undirected graph: E is symmetrical, that is <v,w> ∈ E ≡ <w,v> ∈ E
• Different representations of E
1. Set of pairs
2. <v, {direct neighbors}>
3. Adjacency matrix
• From 1 to 2 in one MR job
– Identity Mapper
– Combiner = Reducer
– Reducer joins values for each vertex
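The conversion from representation 1 to representation 2 can be simulated in plain Java, with the per-vertex union playing the reducer's role (class and method names are illustrative):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class AdjacencyLists {

    // From representation 1 (set of pairs) to representation 2 (<v, {neighbors}>).
    // For an undirected graph each edge <v,w> contributes to both v's and w's
    // list; the per-key union mirrors the reducer joining values per vertex.
    static Map<Integer, Set<Integer>> toAdjacency(List<int[]> edges) {
        Map<Integer, Set<Integer>> adj = new TreeMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
        }
        return adj;
    }

    public static void main(String[] args) {
        List<int[]> edges = List.of(new int[]{1, 2}, new int[]{2, 3}, new int[]{1, 3});
        System.out.println(toAdjacency(edges)); // {1=[2, 3], 2=[1, 3], 3=[1, 2]}
    }
}
```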
Connected Components
• Partition set of nodes V into disjoint subsets V1, …, VN
– V = V1 U … U VN
– No paths using E from Vi to Vj if i ≠ j
– Gi = <Vi, Ei >
• Representation of connected component
– key = min{Vi}
– value = Vi
• Chain of MR jobs
• Initial data representation
– E is partitioned into sets of records (blocks)
– <v,w> ∈ E → <min(v,w), {v,w}> = <k, C>
MR Connected Components
• Mapper / Reducer Input
– {<k, C>}, where C is a subset of V, k = min(C)
• Mapper
• Reducer
• Iterate. Stop when stabilized
Map {<k, C>}
For all <ki, Ci> and <kj, Cj>
if Ci ∩ Cj ≠ ∅ then
C = Ci U Cj
Emit(min(C), C)
Reduce(k, {C1, C2, …})
resC = C1 U C2 U …
Emit(k, resC)
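The whole iterate-until-stable chain can be simulated in plain Java for small graphs; a sketch assuming every component fits in memory (names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

public class ConnectedComponents {

    // One round = group sets by min(C) and union them (reduce), then merge any
    // overlapping sets (map). Iterate until the <min(C), C> pairs stabilize.
    static List<TreeSet<Integer>> components(List<int[]> edges) {
        // Initial representation: <v,w> -> {v, w}
        List<TreeSet<Integer>> sets = new ArrayList<>();
        for (int[] e : edges)
            sets.add(new TreeSet<>(List.of(e[0], e[1])));

        boolean changed = true;
        while (changed) {
            changed = false;
            // Reduce step: union all sets sharing the same key min(C)
            TreeMap<Integer, TreeSet<Integer>> byMin = new TreeMap<>();
            for (TreeSet<Integer> c : sets)
                byMin.computeIfAbsent(c.first(), k -> new TreeSet<>()).addAll(c);
            // Map step: merge any two groups that intersect
            List<TreeSet<Integer>> next = new ArrayList<>(byMin.values());
            for (int i = 0; i < next.size(); i++)
                for (int j = i + 1; j < next.size(); j++)
                    if (!Collections.disjoint(next.get(i), next.get(j))) {
                        next.get(i).addAll(next.get(j));
                        next.remove(j--);
                        changed = true;
                    }
            if (next.size() != sets.size()) changed = true;
            sets = next;
        }
        return sets;
    }

    public static void main(String[] args) {
        // Edges 1-2 and 2-3 form one component; 5-6 another
        List<int[]> edges = List.of(new int[]{1, 2}, new int[]{2, 3}, new int[]{5, 6});
        System.out.println(components(edges)); // [[1, 2, 3], [5, 6]]
    }
}
```

In the real MR chain each round is a separate job, so the number of rounds (not the component sizes) drives the cost.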
The End