TRANSCRIPT
MapReduce: Design Patterns
A.A. 2020/2021
Fabiana Rossi
Master's Degree (Laurea Magistrale) in Computer
Engineering - 2nd year
Macroarea di Ingegneria
Dipartimento di Ingegneria Civile e Ingegneria Informatica
The reference Big Data stack
[Stack diagram] Layers, bottom to top: Resource Management, Data Storage, Data Processing, High-level Interfaces; a cross-cutting Support / Integration layer spans them all.
Main reference for this lecture
D. Miner and A. Shook, MapReduce Design Patterns: Building
Effective Algorithms and Analytics for Hadoop and Other Systems.
O'Reilly Media, 2012.
MapReduce
• Fit your solution into the framework of map and
reduce
• In some situations this might be challenging
– MapReduce can be a constraint
– but it provides clear boundaries for what you can and cannot do
• Figuring out how to solve a problem with constraints
requires
– cleverness
– a change in thinking!
MapReduce
• MapReduce is a framework
– Fit your solution into the framework of map and reduce
– Can be challenging in some situations
• Need to take the algorithm and break it into
filter/aggregate steps
– Filter becomes part of the map function
– Aggregate becomes part of the reduce function
• Sometimes we need multiple MapReduce stages
• MapReduce is not a solution to every problem, not
even every parallel problem
• It makes sense when:
– Files are very large and are rarely updated
– We need to iterate over all the files to generate some
interesting property of the data in those files
MapReduce Design Pattern
What is a MapReduce design pattern?
• It is a template for solving a common and general data
manipulation problem with MapReduce.
• Inspired by "Design Patterns: Elements of Reusable Object-
Oriented Software" by the Gang of Four
A pattern:
• is a general approach for solving a problem
• is not specific to a domain (e.g., text processing, graph analysis)
A design pattern allows you:
• to use tried and true design principles
• to build better software
Hands-on Hadoop (our pre-configured Docker image)
Hadoop with Dockers
• Create a small network named hadoop_network with one namenode (master) and 3 datanodes (slaves).
• We will interact with the master node, exchanging files through the volume mounted in /data
$ docker network create --driver bridge hadoop_network
$ docker run -t -i -p 9864:9864 -d --network=hadoop_network --name=slave1 effeerre/hadoop
$ docker run -t -i -p 9863:9864 -d --network=hadoop_network --name=slave2 effeerre/hadoop
$ docker run -t -i -p 9862:9864 -d --network=hadoop_network --name=slave3 effeerre/hadoop
$ docker run -t -i -p 9870:9870 -p 8088:8088 --network=hadoop_network --name=master -v $PWD/hddata:/data effeerre/hadoop
Hadoop with Dockers
• Before we start, we need to initialize our environment
• On the master node:
$ hdfs namenode -format
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
• The WebUI tells us if everything is working properly:
– HDFS: http://localhost:9870/
– MapReduce Master: http://localhost:8088/
Hadoop with Dockers
How to remove the containers
• stop and delete the namenode and datanodes
$ docker kill master slave1 slave2 slave3
$ docker rm master slave1 slave2 slave3
• remove the network
$ docker network rm hadoop_network
A simplified view of MapReduce
• Mappers are applied to all input key-value pairs, to generate an
arbitrary number of intermediate pairs
• Reducers are applied to all intermediate values associated with the
same intermediate key
• Between the map and reduce phase lies a barrier that involves a
large distributed sort and group by
A more detailed view of MapReduce
• Combiner: an optimization that
applies the reduce function early,
on the map node
– Hadoop does not guarantee
how many times it will call it
• Partitioner: when there are
multiple reducers, it divides the
keys into partitions, one
assigned to each reducer
– A custom partitioner can be
used to control how keys are
assigned to reducers, e.g., to
balance load or to guarantee
properties such as total
ordering (see the sketch below)
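A minimal sketch of how these two components plug into a job; the partitioner below mirrors the behavior of Hadoop's default HashPartitioner, and all class names are illustrative rather than taken from the lecture's code:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/* Illustrative partitioner: assigns each key to a reducer by hashing it */
public class MyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the partition index is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

In the driver, assuming job is an org.apache.hadoop.mapreduce.Job, the wiring would be job.setCombinerClass(Reduce.class) (safe only if the reduce operation is associative and commutative) and job.setPartitionerClass(MyPartitioner.class).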
Job in MapReduce
• A MapReduce (i.e., Java) program, referred
to as a job, consists of:
• Code for Map and Reduce packaged together
• Configuration parameters (where the input lies,
where the output should be stored)
• Input data set, stored on the underlying distributed
file system
• Applications typically implement the Mapper
and Reducer interfaces to provide the map
and reduce methods. They form the core of a
MapReduce job.
Job MapReduce: Input
• InputFormat describes the input-specification for a
MapReduce job.
• The default behavior of file-based InputFormat
implementations (typically sub-classes of
FileInputFormat) is to split the input into logical
InputSplit instances based on the total size (in
bytes) of the input files.
• The FileSystem blocksize of the input files is treated
as an upper bound for input splits.
• The Hadoop MapReduce framework spawns one
map task for each InputSplit generated by the
InputFormat for the job.
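As a hedged sketch, a driver can influence how FileInputFormat slices the input; the bounds below are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        job.setInputFormatClass(TextInputFormat.class);
        /* One map task is spawned per InputSplit; these bounds constrain
           the split sizes computed by FileInputFormat */
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB
    }
}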
Job MapReduce: Output
• OutputFormat describes the output-
specification for a MapReduce job.
• Output files are stored in a FileSystem.
• TextOutputFormat is the default OutputFormat.
Mapper and Reducer
public class Map extends Mapper<Object, Text, Text, IntWritable> {
    public void map(Object key, Text value, Context context) {
        ...
    }
}
Context object: it allows the Mapper/Reducer to interact with the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output.
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) {
        ...
    }
}
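A hedged completion of these skeletons, following the classic word count logic (not a verbatim copy of the lecture's WordCount.java; in a real file the two classes would typically be static nested classes of the job class):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/* Map: emits (word, 1) for every token of the input line */
public class Map extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

/* Reduce: sums the partial counts received for each word */
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}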
Job MapReduce: Example
/* Create and configure a new MapReduce Job */
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);

/* Map function */
job.setMapperClass(Mapper.class);

/* Reduce function */
job.setReducerClass(Reducer.class);
job.setNumReduceTasks(2);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
...
This is only an excerpt of WordCount.java
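The elided part of the driver typically continues along these lines (a hedged sketch, assuming the usual word count key/value types):

/* Set the types of the final output key/value pairs */
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

/* Set input and output paths on the distributed file system */
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

/* Submit the job and wait for its completion */
System.exit(job.waitForCompletion(true) ? 0 : 1);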
Design Pattern: Number Summarizations
• Goal: compute some numerical aggregate value
(count, maximum, average, ...) over a set of values
• Structure:
– Mapper: it outputs keys that consist of each field to group by,
and values consisting of any pertinent numerical items
– Combiner: (optional) it can greatly reduce the number of
intermediate key/value pairs to be sent across the network,
but it works well only with associative and commutative
operations (see the sketch after this list)
– Partitioner: (optional) it can better distribute key/value pairs
across the reduce tasks
– Reducer: The reducer receives a set of numerical values
and applies the aggregation function
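A hedged sketch of a combiner for a max summarization: max is associative and commutative, so the same class can serve as both combiner and reducer. For an average, by contrast, the combiner would have to emit partial (sum, count) pairs, since an average of partial averages is wrong.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/* Computes the maximum of the values seen for a key; usable both as
   combiner (on partial map output) and as reducer (on the full data) */
public class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new IntWritable(max));
    }
}

In the driver: job.setCombinerClass(MaxReducer.class); job.setReducerClass(MaxReducer.class);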
[Figure: structure of the Number Summarizations pattern]
Design Pattern: Number Summarizations
Examples:
• Word count, record count
– Count the number of occurrences of each word
• Min/Max
– Compute the max temperature per region
• Average/Median/Standard Deviation
– Average the number of requests per page per Web site
• Inverted Index Summarization:
– The inverted index pattern is commonly used to generate an
index from a data set to allow for faster searches or data
enrichment capabilities.
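A hedged sketch of the inverted index summarization (class names are illustrative; duplicate document ids are not removed, for brevity):

import java.io.IOException;
import java.util.StringJoiner;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/* Map: emits (word, documentId) for every token of the input line */
public class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the name of the input file as a (simplistic) document identifier
        docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, docId);
        }
    }
}

/* Reduce: concatenates the ids of the documents that contain each word */
public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringJoiner docs = new StringJoiner(",");
        for (Text v : values) {
            docs.add(v.toString());
        }
        context.write(key, new Text(docs.toString()));
    }
}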
WordCount: Example
Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce
Output:
fabiana 1
goodbye 1
hello 5
john 1
mapreduce 1
mike 1
world 1
Summarization: Example
• Goal: compute the average word length by initial letter
Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce
Output:
g 7.0
m 6.5
w 5.0
f 7.0
h 5.0
j 4.0
Summarization: Example
• Goal: compute the average word length by initial
letter
• Check: AverageWordLengthByInitialLetter.java
public void map(Object key, Text value, Context context) {
    String line = value.toString().toLowerCase();
    /* Emit length by initial letter */
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        initialLetter.set(word.substring(0, 1));
        length.set(word.length());
        context.write(initialLetter, length);
    }
}
This is only an excerpt
Summarization: Example
• Goal: compute the average word length by initial
letter
• Check: AverageWordLengthByInitialLetter.java
public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    int count = 0;
    for (IntWritable val : values) {
        sum += val.get();
        count++;
    }
    average.set(((float) sum / (float) count));
    context.write(key, average);
}
This is only an excerpt
Design Pattern: Filtering
• Goal: filter out records that are not of interest and
keep the others.
• An application of filtering is sampling
– Sampling can be used to get a smaller, yet representative,
data set
• Structure:
– Mapper: filters data (it does most of the work)
– Reduce: may simply be the identity, if the job does not
produce an aggregation on filtered data
[Figure: structure of the Filtering pattern]
Design Pattern: Filtering
Use cases:
• Closer view of data: to extract records that have something in
common or something of interest (e.g., same event-date, same
user id)
• Tracking a thread of events: extract a thread of consecutive
events as a case study from a larger data set.
• Distributed grep
• Simple random sampling: take a simple random sample of the
data set
– use a filter with an evaluation function that randomly returns
true or false (see the sketch after this list)
• Remove low scoring data
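A hedged sketch of the simple random sampling filter (the configuration key filter.sampling.rate is illustrative; with job.setNumReduceTasks(0), the map output is written directly to HDFS):

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/* Keeps each record with a configurable probability */
public class SamplingMapper extends Mapper<Object, Text, NullWritable, Text> {
    private final Random random = new Random();
    private float rate;

    protected void setup(Context context) {
        // "filter.sampling.rate" is an illustrative configuration key
        rate = context.getConfiguration().getFloat("filter.sampling.rate", 0.01f);
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        if (random.nextFloat() < rate) {
            context.write(NullWritable.get(), value);  // keep the record
        }
    }
}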
Filtering: Example
• Goal: implement a distributed version of grep
• grep is a command-line utility for searching plain-text data sets for lines that match a regular expression
Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce
Pattern: good
Output:
hello world goodbye
Filtering: Example
• Goal: implement a distributed version of grep
• Check: DistributedGrep.java
public static class GrepMapper
        extends Mapper<Object, Text, NullWritable, Text> {

    private Pattern pattern = null;

    public void setup(Context context) throws ... {
        pattern = Pattern.compile( ... );
    }

    public void map(Object key, Text value, Context context) ... {
        Matcher matcher = pattern.matcher(value.toString());
        if (matcher.find()) {
            context.write(NullWritable.get(), value);
        }
    }
}
This is only an excerpt
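The elided argument of Pattern.compile is typically read from the job configuration; a hedged sketch, where the key name grep.regex is illustrative:

/* In the driver: */
conf.set("grep.regex", "good");

/* In GrepMapper.setup: */
pattern = Pattern.compile(context.getConfiguration().get("grep.regex"));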
Design Pattern: Distinct
• Special case of filter pattern
• Goal: filter out records that look like another record in
the data set
• Structure:
– Mapper: it takes each record and extracts the data fields for
which we want unique values. The mapper outputs the
record as the key, and null as the value
– Reduce: it groups the nulls together by key. We then simply
output the key. Because each key is grouped together, the
output data set is guaranteed to be unique.
• Examples:
– Retrieve the list of words, with no repetition, in a document
Distinct: Example
• Goal: retrieve the list of words, with no repetitions, in
a document
• Check: DistinctWords.java
public void map(Object key, Text value, Context context) ... {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, NullWritable.get());
    }
}
...
public void reduce(Text key, Iterable<NullWritable> values, Context context) ... {
    context.write(key, NullWritable.get());
}
This is only an excerpt
Design Pattern: Data Organization
• Goal: combine and organize data in a more complex
data structure.
• This pattern includes several pattern sub-categories:
– structure to hierarchical pattern (e.g., denormalization)
– partitioning and binning patterns
– total order sorting patterns
– shuffling patterns
Design Pattern: Structure to Hierarchical
• Goal: create new records from data stored in very
different structures.
– This pattern follows the denormalization principles of big
data stores
• Structure:
– We might need to combine data from multiple data sources
(use MultipleInputs; see the sketch after this list)
– Map: it associates the data to be aggregated with the same key
(e.g., the root of the hierarchical record). Each record can be
enriched with a label to identify its source.
– Reduce: it creates the hierarchical structure from the list of
received data items
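A hedged sketch of the MultipleInputs wiring mentioned in the first point (assuming job, topicsFile and itemsFile are defined in the driver; TopicMapper and ItemMapper are illustrative names for the two tagging mappers):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

/* Each source gets its own mapper, which labels records with their origin
   before they meet in the same reduce call */
MultipleInputs.addInputPath(job, new Path(topicsFile),
        TextInputFormat.class, TopicMapper.class);
MultipleInputs.addInputPath(job, new Path(itemsFile),
        TextInputFormat.class, ItemMapper.class);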
[Figure: structure of the Structure to Hierarchical pattern]
Structure to Hierarchical: Example
Topics:
1::Movies
2::Football teams
3::Software
Items:
1::Star Wars
1::Mad Max
1::Creed
2::Roma
2::Juventus
3::Autocad
3::Eclipse
3::IntelliJ
3::Microsoft Office
3::Linux
3::Google Chrome
Output (for topic 3):
{
"topic":"Software",
"items":["Autocad","Eclipse","IntelliJ",
"Microsoft Office","Linux","Google Chrome"]
}
Structure to Hierarchical: Example
• Goal: create a JSON structure for each topic, which
contains the list of its items
– Two inputs are provided: the list of topics and the list of items
• Check: TopicItemsHierarchy.java
public void map(Object key, Text value, Context context) ... {
    String line = value.toString();
    String[] parts = line.split("::");
    if (parts.length != 2)
        return;
    String id = parts[0];
    String content = parts[1];
    outKey.set(id);
    outValue.set(valuePrefix + content);
    context.write(outKey, outValue);
}
This is only an excerpt
Structure to Hierarchical: Example
• Check: TopicItemsHierarchy.java
public void reduce(Text key, Iterable<Text> values, Context context) ... {
    Topic topic = new Topic();
    for (Text t : values) {
        String value = t.toString();
        if (ValueType.TOPIC.equals(discriminate(value))) {
            topic.setTopic(getContent(value));
        } else if (ValueType.ITEM.equals(discriminate(value))) {
            topic.addItem(getContent(value));
        }
    }
    /* Serialize topic */
    String serializedTopic = gson.toJson(topic);
    context.write(new Text(serializedTopic), NullWritable.get());
}
This is only an excerpt
Design Pattern: Partitioning
• Goal: move the records into categories (i.e., shards,
partitions, or bins) without regard to the order
of records.
• Structure:
– Map: in most cases, the identity mapper can be used.
– Partitioner: it will determine which reducer to send each
record to; each reducer corresponds to a particular partition
– Reduce: in most cases, the identity reducer can be used
– All you have to define is the function that determines
to which partition a record goes.
[Figure: structure of the Partitioning pattern]
Partitioning: Example
• Goal: group dates by year. In this case, a year
represents a partition
• Check: PartitionDatesByYear.java
public static class DatePartitioner extends Partitioner<IntWritable, Text> {
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        return (key.get() - CONFIG_INITIAL_YEAR) % numPartitions;
    }
}
This is only an excerpt
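In the driver, the partitioner takes effect only if the number of reduce tasks matches the number of partitions; a hedged sketch, where NUM_YEARS is an illustrative constant:

/* One reducer, and hence one output file, per year in the dataset */
job.setPartitionerClass(DatePartitioner.class);
job.setNumReduceTasks(NUM_YEARS);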
Two-stage MapReduce
• As map-reduce calculations get more complex, break
them down into stages
– Output of one stage = input to next stage
• Intermediate output may be useful for different
outputs too, so you can get some reuse
– Intermediate records can be saved in the data store, forming
a materialized view
• Early stages of map-reduce operations often
represent the heaviest amount of data access, so
building and saving their output once as a basis for many
downstream uses saves a lot of work (see the sketch below)
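A hedged sketch of chaining two jobs, where the first stage's output directory becomes the second stage's input (paths and job names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path intermediate = new Path("/tmp/stage1-output");

        /* Stage 1: its output is materialized on HDFS and can be reused */
        Job stage1 = Job.getInstance(conf, "stage-1");
        FileInputFormat.addInputPath(stage1, new Path(args[0]));
        FileOutputFormat.setOutputPath(stage1, intermediate);
        if (!stage1.waitForCompletion(true))
            System.exit(1);

        /* Stage 2: consumes the intermediate records of stage 1 */
        Job stage2 = Job.getInstance(conf, "stage-2");
        FileInputFormat.addInputPath(stage2, intermediate);
        FileOutputFormat.setOutputPath(stage2, new Path(args[1]));
        System.exit(stage2.waitForCompletion(true) ? 0 : 1);
    }
}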
Design Pattern: Total Order Sorting
• Sort all the records of the data set
– Sorting in a parallel manner is not easy.
• Observe:
– each individual reducer will sort its data by key, but
unfortunately, this sorting is not global across all data.
• Goal: we want to have a total order sorting where, if
you concatenate the output files, the records are
sorted.
• Sorted data has a number of useful properties:
– Sorted by time, it can provide a timeline view on the data
– Finding things in a sorted data set can be done with binary
search
– Some databases can bulk load data faster if the data is
sorted on the primary key or index column
Design Pattern: Total Order Sorting
• This pattern has two phases (jobs):
– an analyze phase that determines the ranges, and the order
phase that actually sorts the data.
Analyze Phase: identify the data set slices
• Map: it does a simple random sampling
• Reduce: only one reducer will be used. It collects the sort keys
and slices them into the data range boundaries
Order Phase: order the dataset
• Map: similar to the mapper function of the analyze phase, but
the record itself is stored as the value
• Partition: it loads up the partition file and routes data according to
the partitions
– Hadoop provides an implementation: TotalOrderPartitioner
• Reduce: it is the identity function; the number of reducers needs
to be equal to the number of partitions
Total Order Sorting: Example
• Goal: order the dataset
– We rely on the TotalOrderPartitioner class
– Slightly different implementation of Analyze and Order Phases
• Check: TotalOrdering.java
• Observe the driver, which defines the chain of MapReduce jobs
/* **** Job #1: Analyze phase **** */
Job sampleJob = Job.getInstance(conf, "TotalOrdering");

/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);
...
if (isCompletedCorrecty(sampleJob)) {
This is only an excerpt
Total Order Sorting: Example
/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");

/* Map: identity function; Reduce: emits only the key */
orderJob.setMapperClass(Mapper.class);
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);

/* Partitioner */
orderJob.setPartitionerClass(TotalOrderPartitioner.class);

/* Define the dataset sampling strategy to identify partition bounds */
InputSampler.writePartitionFile(orderJob, new InputSampler.RandomSampler(.3, 10));
}
This is only an excerpt
Order Phase (1)
/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");
orderJob.setJarByClass(TotalOrdering.class);

/* Map: identity function, outputs the key/value pairs in the SequenceFile */
orderJob.setMapperClass(Mapper.class);

/* Reduce: identity function */
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);
This is only an excerpt of main in TotalOrdering.java
Order Phase (2)
/* Set input and output files: the input is the previous job's output */
orderJob.setInputFormatClass(SequenceFileInputFormat.class);

orderJob.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(orderJob.getConfiguration(), partitionFile);

InputSampler.writePartitionFile(orderJob,
        new InputSampler.RandomSampler(.3, 10));
This is only an excerpt of main in TotalOrdering.java
Analyze Phase (1)
public static class AnalyzePhaseMapper extends Mapper<Object, Text, Text, Text> {
    ...
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        outkey.set(value.toString());
        context.write(outkey, value);
    }
}
/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);

/* Set input and output files */
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);
Design Pattern: Shuffling
• Goal: we want to shuffle our dataset, to randomize
our records (e.g., to improve anonymity)
• Structure:
– Map: it emits the record as the value along with a random
key
– Reduce: the reducer sorts the random keys, further
randomizing the data
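A hedged sketch of the pattern (class names are illustrative):

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/* Map: a random key scatters records across reducers */
public class ShuffleMapper extends Mapper<Object, Text, IntWritable, Text> {
    private final Random random = new Random();
    private final IntWritable randomKey = new IntWritable();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        randomKey.set(random.nextInt());
        context.write(randomKey, value);
    }
}

/* Reduce: drops the random key and emits the records */
public class ShuffleReducer extends Reducer<IntWritable, Text, Text, NullWritable> {
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(value, NullWritable.get());
        }
    }
}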