COSC 6397
Big Data Analytics
Introduction to Map Reduce (I)
Edgar Gabriel
Spring 2014
Recommended Literature
• Original MapReduce paper by Google:
http://research.google.com/archive/mapreduce-osdi04.pdf
• Fantastic resource for tutorials, examples etc:
http://www.coreservlets.com/
Map Reduce Programming Model
• Input key/value pairs → output a set of key/value pairs
• Map
– Input pair → intermediate key/value pair
– (k1, v1) → list(k2, v2)
• Reduce
– Input: one key and all associated intermediate values
– (k2, list(v2)) → list(v3)
• Reduce stage starts when the final map task is done
Task Granularity and Pipelining
• Using many tasks means
– Minimal time for fault recovery
– Shuffling can be pipelined with map execution
– Better load balancing
Slide based on Dan Weld’s class at U. Washington (who in turn made his slides based on those by Jeff Dean, Sanjay Ghemawat, Google, Inc.)
Map Reduce Framework
• Takes care of distributed processing and coordination
• Scheduling
– Jobs are broken down into smaller chunks called tasks
– Scheduler assigns tasks to available resources
• Task localization with data
– Framework strives to place tasks on the nodes that host
the segment of data to be processed by that task
– Code is moved to data, not data to code
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Major Components
• User Components:
– Mapper
– Reducer
– Combiner (Optional)
– Partitioner (Optional) (Shuffle)
– Writable(s) (Optional)
• System Components:
– Master
– Input Splitter*
– Output Committer*
(* You can use your own if you really want!)
Image source: http://www.ibm.com/developerworks/java/library/l-hadoop-3/index.html
Notes
• Mappers and Reducers are typically single threaded and
deterministic
– Determinism allows for restarting of failed tasks, or speculative execution
– All are independent of each other and can run on an arbitrary number of nodes
• Mappers/Reducers run entirely independent of each other
– In Hadoop, they run in separate JVMs
Input Splitter
• Is responsible for splitting input into multiple chunks
• These chunks are then used as input for your mappers
• Splits on logical boundaries
• Multiple splitters are available in Hadoop
– Break up data by line, fixed-size chunk, etc.
– An application can also write its own splitter (see the sketch below)
Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”
http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx
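For illustration only, a minimal sketch of a custom split policy using the newer mapreduce API; the class name is hypothetical. It reuses TextInputFormat but declares files non-splittable, so every file becomes exactly one split (and therefore one map task):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical example: keep each input file in a single split instead of
// breaking it up along block/line boundaries.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split -> one map task per input file
    }
}
It would be registered with job.setInputFormatClass(WholeFileTextInputFormat.class), just like the standard input formats shown later.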
Mapper
• Reads in input pair <K,V> (a section as split by the input
splitter)
• Outputs a pair <K’, V’>
• Example: for Word Count with the following input: “The
teacher went to the store. The store was closed; the store
opens in the morning. The store opens at 9am.”
• The output would be:
– <The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1> <the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store, 1> <opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1> <opens, 1> <at, 1> <9am, 1>
Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”
http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx
Reducer
• Accepts the Mapper output, and collects values on the key
– All inputs with the same key must go to the same reducer!
• Input is typically sorted; output is written out exactly as is
• For our example, the reducer input would be:
– <The, 1> <teacher, 1> <went, 1> <to, 1> <the, 1> <store, 1> <the, 1> <store, 1> <was, 1> <closed, 1> <the, 1> <store, 1> <opens, 1> <in, 1> <the, 1> <morning, 1> <the, 1> <store, 1> <opens, 1> <at, 1> <9am, 1>
• The output would be (treating “The” and “the” as the same word for simplicity):
– <The, 6> <teacher, 1> <went, 1> <to, 1> <store, 4> <was, 1> <closed, 1> <opens, 2> <in, 1> <morning, 1> <at, 1> <9am, 1>
Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”
http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx
Combiner
• Combiner is an intermediate reducer
• Optional
• Reduces output from each mapper
• Reduces bandwidth and limits data size for sorting
• Cannot change the type of its input
– Input types must be the same as output types
Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”
http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx
Output Committer
• Is responsible for taking the reduce output, and
committing it to a file
• Typically, this committer needs a corresponding input
splitter (so that another job can read the input)
• Again, the built-in ones are usually good enough, unless you need to output a special kind of file
Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”
http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx
Partitioner (Shuffler)
• Decides which pairs are sent to which reducer
• Default is simply:
– Key.hashCode() % numOfReducers
• User can override to:
– Provide (more) uniform distribution of load between
reducers
– Some values might need to be sent to the same reducer
• Ex.: to compute the relative frequency of a pair of words <W1, W2>, you would need to make sure that all pairs starting with word W1 are sent to the same reducer (a sketch follows below)
– Binning of results
Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”
http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx
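As an illustration of the word-pair case above, here is a minimal sketch of a custom partitioner; the class name and the assumption that keys have the form "W1 W2" are mine, not the slides'. It hashes only the first word, so every pair starting with W1 lands on the same reducer:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: keys are assumed to have the form "W1 W2".
public class FirstWordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String firstWord = key.toString().split("\\s+")[0];
        // mask the sign bit so the partition number is never negative
        return (firstWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
It would be enabled in the driver with job.setPartitionerClass(FirstWordPartitioner.class).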
Master
• Responsible for scheduling & managing jobs
• Scheduled computation should be close to the data if
possible
– Bandwidth is expensive! (and slow)
– This relies on a Distributed File System (GFS / HDFS)!
• If a task fails to report progress (such as reading input, writing output, etc.), crashes, or its machine goes down, it is assumed to be stuck; it is killed and the step is re-launched (with the same input)
• The Master is handled by the framework; no user code is necessary
Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”
http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx
Master (II)
• HDFS can replicate data to be local if necessary for scheduling (discussed later in the course)
• Because our nodes are (or at least should be)
deterministic
– The Master can restart failed nodes
• Nodes should have no side effects!
– If a node is executing the last step and is completing slowly, the master can launch a second copy of that node
• The first copy to complete wins; any other runs are then killed
Slide based on a talk by Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters”
http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012/2.1-MapReduce.pptx
Word Count
Map Reduce word count code sample
• Implement Mapper
– Input is text
– Tokenize the text and emit each token with a count of 1: <token, 1>
• Implement Reducer
– Sum up the counts for each token
– Write out the result to the output file
• Configure the Job
– Specify Input, Output, Mapper, Reducer and Combiner
• Run the job
Mapper
• Mapper Class
– Class has 4 Java Generics parameters
– (1) input key (2) input value (3) output key (4) output
value
• Input and output utilize Hadoop’s I/O framework (org.apache.hadoop.io)
• The map() method receives a Context object, which is used to:
– Write output
– Create your own counters
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Mapper code in Hadoop
public static class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // split the input line into tokens
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);   // emit <token, 1>
        }
    }
}
Writables
• Types that can be serialized / deserialized to a stream
– The framework will serialize your data before writing it to disk
• Input and output key/value classes are required to implement this interface
• Users can implement the interface and use their own types for their input/output/intermediate values (see the sketch below)
• There are defaults for basic types, e.g.:
– Strings: Text
– Integers: IntWritable
– Long: LongWritable
– Float: FloatWritable
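A minimal sketch of a user-defined type; the class itself (a 2-D point, e.g. usable as a value in the k-means example later) is an assumption for illustration:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical value type: a 2-D point that the framework can serialize.
public class PointWritable implements Writable {
    private double x;
    private double y;

    public PointWritable() { }                         // no-arg constructor required by Hadoop
    public PointWritable(double x, double y) { this.x = x; this.y = y; }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeDouble(x);
        out.writeDouble(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        x = in.readDouble();
        y = in.readDouble();
    }

    public double getX() { return x; }
    public double getY() { return y; }
}
Types used as keys additionally have to implement WritableComparable, since keys are sorted during the shuffle.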
Reducer
• Similar to the Mapper – a generic class with four types
(1) input key (2) input value (3) output key (4) output value
– The output types of map functions must match the input types
of reduce function
– In this case Text and IntWritable
• Map/Reduce framework groups key-value pairs produced by
mapper by key
• For each key there is a set of one or more values
• Input into a reducer is sorted by key
– Known as Shuffle and Sort
• The reduce function accepts key -> set of values and outputs key-value pairs
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Reducer code in Hadoop
public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();          // add up all counts for this key
        }
        result.set(sum);
        context.write(key, result);    // emit <word, total count>
    }
}
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    job.setInputFormatClass(TextInputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
• Job class
– Encapsulates information about a job
– Controls execution of the job
• A job is packaged within a jar file
– Hadoop Framework distributes the jar on your behalf
– Needs to know which jar file to distribute
– The easiest way to specify the jar that your job resides in is by calling job.setJarByClass, e.g.:
job.setJarByClass(getClass());
– Hadoop will locate the jar file that contains the provided class
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
job.setInputFormatClass(TextInputFormat.class);
TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
• Specify input data
• Can be a file or a directory
– Directory is converted to a list of files as an input
– Input is specified by implementation of InputFormat - in this case TextInputFormat
– Responsible for creating splits and a record reader
– Controls input types of key-value pairs, in this case LongWritable and Text
– File is broken into lines, mapper will receive 1 line at a time
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
• OutputFormat defines specification for outputting data from
Map/Reduce job
– The word count job utilizes an implementation of OutputFormat – TextOutputFormat
• Define output path where reducer should place its output
– If path already exists then the job will fail
– Each reducer task writes to its own file
– By default a job is configured to run with a single reducer
– Writes key-value pair as plain text
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
• Specify the output key and value types for both mapper and
reducer functions
– Often they are the same type
– If the types differ, then use:
job.setMapOutputKeyClass(...);
job.setMapOutputValueClass(...);
• job.waitForCompletion(true)
– Submits and waits for completion
– The boolean parameter flag specifies whether output should be
written to console
– If the job completes successfully ‘true’ is returned, otherwise
‘false’ is returned
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Combiner
• Combine data per Mapper task to reduce amount of
data transferred to reduce phase
• Reducer can very often serve as a combiner
– Only works if reducer’s output key-value pair types are
the same as mapper’s output types
• Combiners are not guaranteed to run
– Optimization only
– Not for critical logic
• Add to the driver code:
job.setCombinerClass(IntSumReducer.class);
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
K-means in MapReduce
• Mapper:
– K = datapoint ID, V = datapoint coordinates
– Calculate the distance from the datapoint to each centroid
– Determine the closest cluster
– Write K', V' with K' = cluster ID and V' = coordinates of the point
• Reducer:
– Each reducer gets all entries associated with one cluster
– Calculate the sum of all point coordinates per cluster
– Recalculate the cluster mean (a combined sketch of mapper and reducer follows below)
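A condensed sketch of the two steps just described; the class names, the assumed input layout (one 2-D point per line as "id,x,y"), and the loadCentroids() helper are illustrative assumptions, not part of the slides (imports are omitted, as in the word count example):
// Sketch only: assumes points stored as text lines "id,x,y" and a
// hypothetical loadCentroids() helper (e.g. reading from the distributed cache).
public static class KMeansMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centroids;                     // [clusterId][dimension]

    protected void setup(Context context) {
        centroids = loadCentroids(context.getConfiguration());
    }

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        double x = Double.parseDouble(parts[1]);
        double y = Double.parseDouble(parts[2]);
        int closest = 0;
        double best = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dx = x - centroids[c][0], dy = y - centroids[c][1];
            double d = dx * dx + dy * dy;             // squared Euclidean distance
            if (d < best) { best = d; closest = c; }
        }
        // K' = cluster ID, V' = coordinates of the point
        context.write(new IntWritable(closest), new Text(x + "," + y));
    }
}

public static class KMeansReducer
        extends Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable clusterId, Iterable<Text> points, Context context)
            throws IOException, InterruptedException {
        double sumX = 0, sumY = 0;
        long count = 0;
        for (Text p : points) {
            String[] xy = p.toString().split(",");
            sumX += Double.parseDouble(xy[0]);
            sumY += Double.parseDouble(xy[1]);
            count++;
        }
        // new centroid = mean of all points assigned to this cluster
        context.write(clusterId, new Text((sumX / count) + "," + (sumY / count)));
    }
}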
K-means in MapReduce
• One iteration of the k-means algorithm per MapReduce job
• Multiple iterations within one job are not supported by MapReduce
-> Hadoop provides a way to specify a sequence of jobs with dependencies
• How to distribute the cluster centroids to all map tasks?
-> Hadoop distributed cache (a driver sketch follows below)
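One common pattern is sketched below, using the same 2014-era API as the word count driver; the paths, maxIterations, converged() check, and KMeansDriver class are assumptions for illustration. Each iteration is a separate job, and the current centroid file is shipped to every task through the distributed cache:
// Hypothetical driver loop: one MapReduce job per k-means iteration.
Configuration conf = new Configuration();
Path centroids = new Path("centroids/iter0");          // assumed initial centroids file

for (int i = 0; i < maxIterations; i++) {
    Job job = new Job(conf, "k-means iteration " + i);
    job.setJarByClass(KMeansDriver.class);
    job.setMapperClass(KMeansMapper.class);
    job.setReducerClass(KMeansReducer.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);

    // make the current centroids available to every map task
    DistributedCache.addCacheFile(centroids.toUri(), job.getConfiguration());

    FileInputFormat.addInputPath(job, new Path("points"));
    Path out = new Path("centroids/iter" + (i + 1));
    FileOutputFormat.setOutputPath(job, out);            // output path must not exist yet

    if (!job.waitForCompletion(true)) System.exit(1);

    centroids = out;                                      // feed result into the next iteration
    if (converged(centroids)) break;                      // hypothetical convergence test
}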
K-means clustering
• The initial cluster centroids have a huge impact on the number of iterations of the k-means algorithm
• Especially important due to the complexity of handling multiple iterations in MapReduce
• A pre-clustering algorithm such as Canopy Clustering can be used to determine the initial cluster centroids
Canopy clustering
• Start with a list of data points and two distance thresholds T1 > T2 for processing
– Select any point (random) from the list to form a canopy
center
– Calculate distance to all other points in the list
– Put all the points which fall within the distance threshold
of T1 into a canopy
– Remove from the original list all the points which fall within the threshold of T2. These points are excluded from being the center of a new canopy.
– Repeat these steps until the original list is empty (a sequential sketch follows below).
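A small sequential sketch of this procedure; the double[] point representation, the thresholds T1 and T2, and the distance() helper are assumptions made for illustration (imports from java.util omitted):
// Sketch: canopy selection with a loose threshold T1 and a tight threshold T2 (T2 < T1).
List<double[]> remaining = new ArrayList<>(points);     // working copy of the data points
List<List<double[]>> canopies = new ArrayList<>();
Random rng = new Random();

while (!remaining.isEmpty()) {
    // pick a random remaining point as the canopy center
    double[] center = remaining.get(rng.nextInt(remaining.size()));
    List<double[]> canopy = new ArrayList<>();
    Iterator<double[]> it = remaining.iterator();
    while (it.hasNext()) {
        double[] p = it.next();
        double d = distance(center, p);                  // hypothetical distance helper
        if (d < T1) canopy.add(p);                       // within loose threshold: joins the canopy
        if (d < T2) it.remove();                         // within tight threshold: cannot seed a new canopy
    }
    canopies.add(canopy);
}
The resulting canopy centers (or canopy means) can then serve as the initial centroids for k-means.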