MapReduce: Design Patterns A.A. 2017/18
Matteo Nardelli
Master's Degree in Computer Engineering - 2nd year
Università degli Studi di Roma “Tor Vergata”, Dipartimento di Ingegneria Civile e Ingegneria Informatica
The reference Big Data stack
[Figure: the reference Big Data stack] Layers: Resource Management, Data Storage, Data Processing, High-level Interfaces; Support / Integration alongside the stack
Main reference for this lecture: D. Miner and A. Shook, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O'Reilly Media, 2012.
MapReduce is a Framework
• Fit your solution into the framework of map and reduce
• In some situations this might be challenging
– MapReduce can be a constraint
– it provides clear boundaries for what you can and cannot do
• Figuring out how to solve a problem with constraints requires
– cleverness
– a change in thinking!
MapReduce Design Pattern
What is a MapReduce design pattern?
• It is a template for solving a common and general data manipulation problem with MapReduce
• Inspired by "Design Patterns: Elements of Reusable Object-Oriented Software" by the Gang of Four
A pattern:
• is a general approach for solving a problem
• is not specific to a domain (e.g., text processing, graph analysis)
A design pattern allows you:
• to use tried and true design principles
• to build better software
MapReduce Design Pattern
• MapReduce is a framework, not a tool
– Fit your solution into the framework of map and reduce
– Can be challenging in some situations
• Need to take the algorithm and break it into filter/aggregate steps
– Filter becomes part of the map function
– Aggregate becomes part of the reduce function
• Sometimes we need multiple MapReduce stages
• MapReduce is not a solution to every problem, not even to every parallel problem
• It makes sense when:
– Files are very large and are rarely updated
– We need to iterate over all the files to generate some interesting property of the data in those files
HDFS with Docker
• Create a small network named hadoop_network with one namenode (master) and three datanodes (slaves)
• We will interact on the master node, exchanging files through the volume mounted in /data
$ docker network create --driver bridge hadoop_network
$ docker run -t -i -p 50075:50075 -p 50061:50060 -d --network=hadoop_network --name=slave1 matnar/hadoop
$ docker run -t -i -p 50076:50075 -p 50062:50060 -d --network=hadoop_network --name=slave2 matnar/hadoop
$ docker run -t -i -p 50077:50075 -p 50063:50060 -d --network=hadoop_network --name=slave3 matnar/hadoop
$ docker run -t -i -p 50070:50070 -p 50060:50060 -p 50030:50030 -p 8088:8088 -p 19888:19888 --network=hadoop_network --name=master -v $PWD/hddata:/data matnar/hadoop
HDFS with Docker
• Before we start, we need to initialize our environment
• On the master node:
$ hdfs namenode -format
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh
• The WebUI tells us if everything is working properly:
– HDFS: http://localhost:50070/
– MapReduce Master: http://localhost:8088/
HDFS with Docker
How to remove the containers
• stop and delete the namenode and datanodes
$ docker kill master slave1 slave2 slave3
$ docker rm master slave1 slave2 slave3
• remove the network
$ docker network rm hadoop_network
A simplified view of MapReduce
• Mappers are applied to all input key-value pairs, to generate an arbitrary number of intermediate pairs
• Reducers are applied to all intermediate values associated with the same intermediate key
• Between the map and reduce phases lies a barrier that involves a large distributed sort and group by
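As a minimal sketch of these two roles, consider the classic word count (class names here are hypothetical, not code from this lecture's repository): the mapper emits intermediate (word, 1) pairs and, after the sort/group-by barrier, the reducer receives each word together with all of its counts.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

    // Mapper: applied to every input pair, emits intermediate (word, 1) pairs
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: invoked once per intermediate key, after the sort/group-by barrier
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}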
A more detailed view of MapReduce
• Combiner: an optimization that applies a local, reduce-like aggregation on the map node before the shuffle
– Hadoop does not guarantee how many times it will call it (zero, one, or many times)
• Partitioner: when there are multiple reducers, it divides the keys into partitions, one assigned to each reducer
– A custom partitioner can be used to control how keys are passed to the reducers, e.g., to balance load or to guarantee properties such as total ordering
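A hedged driver fragment showing where both hooks are registered (SumReducer and YearPartitioner are hypothetical class names; the setter methods are the standard Hadoop Job API):

// Inside the driver; usual Hadoop imports assumed
Job job = Job.getInstance(conf, "summarization");
job.setCombinerClass(SumReducer.class);         // may run zero, one, or many times
job.setPartitionerClass(YearPartitioner.class); // decides which reducer gets each key
job.setNumReduceTasks(4);                       // one partition per reducer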
Design Pattern: Numerical Summarizations
• Goal: compute some numerical aggregate value (count, maximum, average, ...) over a set of values
• Structure:
– Mapper: it outputs keys that consist of the fields to group by, and values consisting of any pertinent numerical items
– Combiner: (optional) it can greatly reduce the number of intermediate key/value pairs to be sent across the network, but it works well only with associative and commutative operations
– Partitioner: (optional) it can better distribute key/value pairs across the reduce tasks
– Reducer: it receives a set of numerical values and applies the aggregation function
Design Pattern: Numerical Summarizations
Examples:
• Word count, record count
– Count the number of requests to *.dicii.uniroma2.it/*
• Min/Max
– Compute the max temperature per region
• Average/Median/Standard Deviation
– Average the number of requests per page per Web site
• Inverted Index Summarization
– The inverted index pattern is commonly used to generate an index from a data set, to allow faster searches or data enrichment
Summarization: Example
• Goal: compute the average word length by initial letter
• Check: AverageWordLengthByInitialLetter.java
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString().toLowerCase();
    /* Emit length by initial letter */
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        initialLetter.set(word.substring(0, 1));
        length.set(word.length());
        context.write(initialLetter, length);
    }
}
This is only an excerpt
Summarization: Example
• Goal: compute the average word length by initial letter
• Check: AverageWordLengthByInitialLetter.java
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    int count = 0;
    for (IntWritable val : values) {
        sum += val.get();
        count++;
    }
    average.set((float) sum / (float) count);
    context.write(key, average);
}
This is only an excerpt
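A hedged sketch of a driver that could wire these two functions together (the mapper/reducer class names and the path handling are assumptions, not the actual file contents). Note that, per the combiner caveat above, the average is not associative: this reducer cannot simply be reused as a combiner; a combiner would have to emit partial (sum, count) pairs instead.

// Hypothetical driver fragment; usual Hadoop imports assumed
Job job = Job.getInstance(conf, "avg-word-length");
job.setJarByClass(AverageWordLengthByInitialLetter.class);
job.setMapperClass(AverageWordLengthMapper.class);   // hypothetical class name
job.setReducerClass(AverageWordLengthReducer.class); // hypothetical class name
// no combiner: averaging partial averages is not associative/commutative
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);       // map output: (letter, length)
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);        // reduce output: (letter, average)
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);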
Design Pattern: Filtering
• Goal: filter out records that are not of interest and keep the others
• An application of filtering is sampling
– Sampling can be used to get a smaller, yet representative, data set
• Structure:
– Mapper: filters data (it does most of the work)
– Reducer: may simply be the identity, if the job does not produce an aggregation on the filtered data
• Examples:
– Grep Web logs for requests to *dicii.uniroma2.it/*
– Find in the Web logs the URLs accessed by 160.80.85.34
– Locate all the files that contain the words ‘Apple’ and ‘Jobs’
Design Pattern: Filtering
Use cases:
• Closer view of data: extract records that have something in common or something of interest (e.g., same event date, same user id)
• Tracking a thread of events: extract a thread of consecutive events as a case study from a larger data set
• Distributed grep
• Simple random sampling: use a filter with an evaluation function that randomly returns true or false (a sketch follows this list)
• Remove low-scoring data
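A hedged sketch of such a sampling filter (the class name and the configuration key are assumptions): each record is kept with a configurable probability.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SampleMapper extends Mapper<Object, Text, NullWritable, Text> {
    private double rate;
    private final Random rnd = new Random();

    protected void setup(Context context) {
        // "sampling.rate" is a hypothetical configuration key
        rate = context.getConfiguration().getDouble("sampling.rate", 0.01);
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        if (rnd.nextDouble() < rate) {
            // the evaluation function returned true: keep the record
            context.write(NullWritable.get(), value);
        }
    }
}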
Filtering: Example
• Goal: implement a distributed version of grep
• Check: DistributedGrep.java
public static class GrepMapper extends Mapper<Object, Text, NullWritable, Text> {
private Pattern pattern = null;
public void setup(Context context) throws ... {
    pattern = Pattern.compile( ... );
}

public void map(Object key, Text value, Context context) ... {
    Matcher matcher = pattern.matcher(value.toString());
    if (matcher.find()) {
        context.write(NullWritable.get(), value);
    }
}
}
This is only an excerpt
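The elided arguments above would typically travel through the job configuration; a hedged sketch (the key name "grep.pattern" is an assumption):

// In the driver:
conf.set("grep.pattern", ".*dicii\\.uniroma2\\.it/.*");
// In GrepMapper.setup(), the elided call could then be:
pattern = Pattern.compile(context.getConfiguration().get("grep.pattern"));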
Design Pattern: Distinct
• Special case of the filter pattern
• Goal: filter out records that look like another record in the data set
• Structure:
– Mapper: it takes each record and extracts the data fields for which we want unique values. The mapper outputs the record as the key, and null as the value
– Reducer: it groups the nulls together by key. We then simply output the key. Because each key is grouped together, the output data set is guaranteed to be unique
• Examples:
– Retrieve the list of unique visitors of a website
Distinct: Example
• Goal: retrieve the list of words, with no repetitions, in a document
• Check: DistinctWords.java

public void map(Object key, Text value, Context context) ... {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, NullWritable.get());
    }
}
...
public void reduce(Text key, Iterable<NullWritable> values, Context context) ... {
    context.write(key, NullWritable.get());
}
This is only an excerpt
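Since this reducer only reproduces its key, its operation is trivially associative and commutative, so it could also be registered as a combiner to shrink the shuffle (a hedged suggestion; the class name is hypothetical and this wiring is not shown in the original excerpt):

// Hypothetical driver fragment
job.setCombinerClass(DistinctReducer.class); // de-duplicate locally on each map node
job.setReducerClass(DistinctReducer.class);  // de-duplicate globally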
Design Pattern: Data Organization
• Goal: combine and organize data into a more complex data structure
• This pattern includes several sub-categories:
– structure-to-hierarchical patterns (e.g., denormalization)
– partitioning and binning patterns
– total order sorting patterns
– shuffling patterns
Design Pattern: Structure to Hierarchical
• Goal: create new records from data stored in very different structures
– This pattern follows the denormalization principles of big data stores
• Structure:
– We might need to combine data from multiple data sources (use MultipleInputs)
– Map: it associates the data to be aggregated with the same key (e.g., the root of the hierarchical record). Each value can be enriched with a label identifying its source
– Reduce: it creates the hierarchical structure from the list of received data items
Structure to Hierarchical: Example
• Goal: create a JSON structure for a topic, which contains the list of its items
– Two inputs are provided: the list of topics and the list of items
• Check: TopicItemsHierarchy.java

public void map(Object key, Text value, Context context) ... {
    String line = value.toString();
    String[] parts = line.split("::");
    if (parts.length != 2)
        return;
    String id = parts[0];
    String content = parts[1];
    outKey.set(id);
    outValue.set(valuePrefix + content);
    context.write(outKey, outValue);
}
This is only an excerpt
Structure to Hierarchical: Example
• Check: TopicItemsHierarchy.java
public void reduce(Text key, Iterable<Text> values, Context context) ... {
    Topic topic = new Topic();
    for (Text t : values) {
        String value = t.toString();
        if (ValueType.TOPIC.equals(discriminate(value))) {
            topic.setTopic(getContent(value));
        } else if (ValueType.ITEM.equals(discriminate(value))) {
            topic.addItem(getContent(value));
        }
    }
    /* Serialize topic */
    String serializedTopic = gson.toJson(topic);
    context.write(new Text(serializedTopic), NullWritable.get());
}
This is only an excerpt
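The two labeled inputs mentioned in the pattern structure would be wired with MultipleInputs in the driver; a hedged sketch (mapper class names and path variables are assumptions):

// Hypothetical driver fragment; usual Hadoop imports assumed
MultipleInputs.addInputPath(job, new Path(topicsPath),
        TextInputFormat.class, TopicMapper.class); // labels values as topics
MultipleInputs.addInputPath(job, new Path(itemsPath),
        TextInputFormat.class, ItemMapper.class);  // labels values as items
job.setReducerClass(HierarchyReducer.class);       // assembles the JSON record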
Design Pattern: Partitioning
• Goal: move the records into categories (i.e., shards, partitions, or bins) without regard to the order of records
• Structure:
– Map: in most cases, the identity mapper can be used
– Partitioner: it determines which reducer to send each record to; each reducer corresponds to a particular partition
– Reduce: in most cases, the identity reducer can be used
– All you have to define is the function that determines which partition a record goes to
Design Pattern: Partitioning
• Examples:
– Discretize continuous values: partitioning data into bins allows you to group continuous values in the same partition
– Partition pruning by category
– Sharding
Partitioning: Example
• Goal: group dates by year; in this case a year represents a partition
• Check: PartitionDatesByYear.java

public static class DatePartitioner extends Partitioner<IntWritable, Text> {
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // one partition per year, assuming numPartitions covers the year range
        return (key.get() - CONFIG_INITIAL_YEAR) % numPartitions;
    }
}
This is only an excerpt
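For completeness, a hedged driver fragment (the year range is an assumption): the number of reduce tasks must cover the range of years, so that each year maps to its own partition.

// Hypothetical driver fragment
job.setPartitionerClass(DatePartitioner.class);
job.setNumReduceTasks(10); // e.g., 10 consecutive years from CONFIG_INITIAL_YEAR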
Two-stage MapReduce
• As map-reduce calculations get more complex, break them down into stages
– Output of one stage = input to the next stage
• Intermediate output may be useful for different outputs too, so you can get some reuse
– Intermediate records can be saved in the data store, forming a materialized view
• Early stages of map-reduce operations often represent the heaviest amount of data access, so building and saving them once as a basis for many downstream uses saves a lot of work
Example of Two-stage MapReduce
• Compare sales of products for each month in 2011 to the prior year
• 1st stage: produce records showing the sales for a single product in a single month of the year
• 2nd stage: produce the result for a single product by comparing one month’s results with the same month in the prior year
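A hedged sketch of how the two stages could be chained in a single driver (job names, class names, and paths are assumptions): the output directory of stage 1 is both a reusable materialized view and the input of stage 2.

// Hypothetical two-stage driver; inside main(String[] args) throws Exception,
// usual Hadoop imports assumed
Path monthlySales = new Path("/data/monthly-sales"); // intermediate, reusable output

Job stage1 = Job.getInstance(conf, "monthly-sales");
stage1.setMapperClass(SalesRecordMapper.class);      // hypothetical
stage1.setReducerClass(MonthlySalesReducer.class);   // hypothetical
FileInputFormat.addInputPath(stage1, new Path(args[0]));
FileOutputFormat.setOutputPath(stage1, monthlySales);
if (!stage1.waitForCompletion(true)) System.exit(1);

Job stage2 = Job.getInstance(conf, "year-over-year-comparison");
stage2.setMapperClass(ProductMonthMapper.class);     // hypothetical
stage2.setReducerClass(YearComparisonReducer.class); // hypothetical
FileInputFormat.addInputPath(stage2, monthlySales);  // stage 1 output as input
FileOutputFormat.setOutputPath(stage2, new Path(args[1]));
System.exit(stage2.waitForCompletion(true) ? 0 : 1);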
Design Pattern: Total Order Sorting
• Sort all the records of the data set
– Sorting in a parallel manner is not easy
• Observe: each individual reducer will sort its data by key, but unfortunately this sorting is not global across all data
• Goal: we want a total order sorting where, if you concatenate the output files, the records are sorted
• Sorted data has a number of useful properties:
– Sorted by time, it can provide a timeline view on the data
– Finding things in a sorted data set can be done with binary search
– Some databases can bulk load data faster if the data is sorted on the primary key or index column
Design Pattern: Total Order Sorting
• This pattern has two phases (jobs): an analyze phase that determines the ranges, and an order phase that actually sorts the data
Analyze Phase: identify the data set slices
• Map: it does a simple random sampling
• Reduce: only one reducer is used. It collects the sort keys and slices them into the data range boundaries
Order Phase: order the data set
• Map: similar to the mapper of the analyze phase, but the record itself is stored as the value
• Partition: it loads up the partition file and routes data according to the partitions
– Hadoop provides an implementation: TotalOrderPartitioner
• Reduce: it is the identity function; the number of reducers needs to be equal to the number of partitions
Total Order Sorting: Example
• Goal: order the dataset
– We rely on the TotalOrderPartitioner class
– Slightly different implementation of the Analyze and Order phases
• Check: TotalOrdering.java
• Observe the driver, which defines the chain of MapReduce jobs
/* **** Job #1: Analyze phase **** */
Job sampleJob = Job.getInstance(conf, "TotalOrdering");
/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);
...
if (isCompletedCorrecty(sampleJob)) {
This is only an excerpt
Total Order Sorting: Example
/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");
/* Map: identity function; Reduce: emits only the key */
orderJob.setMapperClass(Mapper.class);
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);

/* Partitioner */
orderJob.setPartitionerClass(TotalOrderPartitioner.class);

/* Define the dataset sampling strategy to identify partition bounds */
InputSampler.writePartitionFile(orderJob,
        new InputSampler.RandomSampler(.3, 10));
}
This is only an excerpt
Design Pattern: Shuffling
• Goal: we want to shuffle our dataset, to randomize our records (e.g., to improve anonymity)
• Structure:
– Map: it emits the record as the value, along with a random key
– Reduce: the reducer sorts the random keys, further randomizing the data
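A hedged sketch of the whole pattern (class names are assumptions): the mapper attaches a random key, the shuffle sorts by that key, and the reducer simply strips it.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ShuffleSketch {

    // emits each record under a random key
    public static class RandomKeyMapper extends Mapper<Object, Text, IntWritable, Text> {
        private final Random rnd = new Random();
        private final IntWritable randomKey = new IntWritable();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            randomKey.set(rnd.nextInt());
            context.write(randomKey, value);
        }
    }

    // drops the random key; the framework has already reordered the records
    public static class StripKeyReducer extends Reducer<IntWritable, Text, Text, NullWritable> {
        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(v, NullWritable.get());
            }
        }
    }
}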