
Page 1: MapReduce: Design Patterns

A.A. 2020/2021 - Fabiana Rossi

Laurea Magistrale in Ingegneria Informatica (MSc in Computer Engineering), 2nd year
Macroarea di Ingegneria
Dipartimento di Ingegneria Civile e Ingegneria Informatica

Page 2: The reference Big Data stack

[Figure: the reference Big Data stack - layers from bottom to top: Resource Management, Data Storage, Data Processing, High-level Interfaces, with Support / Integration as a cross-cutting vertical layer]

Page 3: Main reference for this lecture

D. Miner and A. Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O'Reilly Media, 2012.

Page 4: MapReduce

• Fit your solution into the framework of map and reduce
• In some situations this might be challenging
  – MapReduce can be a constraint
  – it provides clear boundaries for what you can and cannot do
• Figuring out how to solve a problem with constraints requires
  – cleverness
  – a change in thinking!

Page 5: MapReduce

• MapReduce is a framework
  – Fit your solution into the framework of map and reduce
  – This can be challenging in some situations
• You need to take the algorithm and break it into filter/aggregate steps
  – Filter becomes part of the map function
  – Aggregate becomes part of the reduce function
• Sometimes we need multiple MapReduce stages
• MapReduce is not a solution to every problem, not even to every parallel problem
• It makes sense when:
  – Files are very large and are rarely updated
  – We need to iterate over all the files to compute some interesting property of the data in those files

Page 6: MapReduce Design Pattern

What is a MapReduce design pattern?
• It is a template for solving a common and general data manipulation problem with MapReduce.
• Inspired by "Design Patterns: Elements of Reusable Object-Oriented Software" by the Gang of Four

A pattern:
• is a general approach for solving a problem
• is not specific to a domain (e.g., text processing, graph analysis)

A design pattern allows you:
• to use tried and true design principles
• to build better software

Page 7: Hands-on Hadoop (our pre-configured Docker image)

Page 8: Hadoop with Docker

• Create a small network named hadoop_network with one namenode (master) and 3 datanodes (slaves)
• We will interact with the master node, exchanging files through the volume mounted in /data

$ docker network create --driver bridge hadoop_network
$ docker run -t -i -p 9864:9864 -d --network=hadoop_network --name=slave1 effeerre/hadoop
$ docker run -t -i -p 9863:9864 -d --network=hadoop_network --name=slave2 effeerre/hadoop
$ docker run -t -i -p 9862:9864 -d --network=hadoop_network --name=slave3 effeerre/hadoop
$ docker run -t -i -p 9870:9870 -p 8088:8088 --network=hadoop_network --name=master -v $PWD/hddata:/data effeerre/hadoop

Page 9: Hadoop with Docker

• Before we start, we need to initialize our environment
• On the master node:

$ hdfs namenode -format
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh

• The WebUI tells us if everything is working properly:
  – HDFS: http://localhost:9870/
  – MapReduce Master: http://localhost:8088/

Page 10: Hadoop with Docker

How to remove the containers:
• stop and delete the namenode and datanodes
• remove the network

$ docker kill master slave1 slave2 slave3
$ docker rm master slave1 slave2 slave3
$ docker network rm hadoop_network

Page 11: A simplified view of MapReduce

• Mappers are applied to all input key-value pairs to generate an arbitrary number of intermediate pairs
• Reducers are applied to all intermediate values associated with the same intermediate key
• Between the map and reduce phases lies a barrier that involves a large distributed sort and group-by

Page 12: A more detailed view of MapReduce

• Combiner: an optimization that anticipates the reduce function on the map node, aggregating map output locally before it crosses the network
  – Hadoop does not provide a guarantee of how many times it will call the combiner
• Partitioner: when there are multiple reducers, it divides the keys into partitions that are assigned to the reducers
  – A custom partitioner can be used to control how keys are passed to the reducers, e.g., to balance load or to guarantee properties such as total ordering
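As a concrete illustration of combiner wiring: a word-count job can reuse its reducer as a combiner, because summing counts is associative and commutative. This is a minimal driver sketch under that assumption, using the class names of the standard Hadoop WordCount example:

// Sketch: reusing the reducer as a combiner in a word-count driver.
// IntSumReducer sums the IntWritable values of each key; since the sum is
// associative and commutative, it is safe to run it as a combiner too.
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);  // may run zero, one, or many times
job.setReducerClass(IntSumReducer.class);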

Page 13: Job in MapReduce

• A MapReduce (i.e., Java) program, referred to as a job, consists of:
  – code for map and reduce packaged together
  – configuration parameters (where the input lies, where the output should be stored)
  – the input data set, stored on the underlying distributed file system
• Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. They form the core of a MapReduce job.

Page 14: Job MapReduce: Input

• InputFormat describes the input specification for a MapReduce job.
• The default behavior of file-based InputFormat implementations (typically sub-classes of FileInputFormat) is to split the input into logical InputSplit instances based on the total size (in bytes) of the input files.
• The FileSystem blocksize of the input files is treated as an upper bound for input splits.
• The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

Page 15: Job MapReduce: Output

• OutputFormat describes the output specification for a MapReduce job.
• Output files are stored in a FileSystem.
• TextOutputFormat is the default OutputFormat.

Page 16: Mapper and Reducer

public class Map extends Mapper<Object, Text, Text, IntWritable> {
    public void map(Object key, Text value, Context context) {
        ...
    }
}

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        ...
    }
}

Context object: allows the Mapper/Reducer to interact with the Hadoop system. It includes configuration data for the job as well as interfaces which allow it to emit output.

Page 17: Job MapReduce: Example

/* Create and configure a new MapReduce Job */
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);

/* Map function */
job.setMapperClass(Mapper.class);

/* Reduce function */
job.setReducerClass(Reducer.class);
job.setNumReduceTasks(2);

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

...

This is only an excerpt of WordCount.java

Page 18: Design Pattern: Number Summarizations

• Goal: compute some numerical aggregate value (count, maximum, average, ...) over a set of values
• Structure:
  – Mapper: it outputs keys that consist of the fields to group by, and values consisting of any pertinent numerical items
  – Combiner (optional): it can greatly reduce the number of intermediate key/value pairs to be sent across the network, but works well only with associative and commutative operations
  – Partitioner (optional): it can better distribute key/value pairs across the reduce tasks
  – Reducer: it receives a set of numerical values and applies the aggregation function
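A subtle consequence of the associativity requirement: an average cannot be combined directly, because an average of averages is wrong in general. A common workaround, sketched here with illustrative names (CountAverageTuple is a hypothetical custom Writable, not from the course code), is to carry partial (sum, count) pairs that combine associatively and to divide only in the final reducer:

// Sketch: making an average combinable by carrying (sum, count) pairs.
// CountAverageTuple is a hypothetical custom Writable with sum/count fields.
public static class AverageCombiner
        extends Reducer<Text, CountAverageTuple, Text, CountAverageTuple> {

    private final CountAverageTuple result = new CountAverageTuple();

    public void reduce(Text key, Iterable<CountAverageTuple> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        long count = 0;
        for (CountAverageTuple t : values) {  // merge partial aggregates
            sum += t.getSum();
            count += t.getCount();
        }
        result.setSum(sum);
        result.setCount(count);
        context.write(key, result);           // still a partial pair, not an average
    }
}
// The final reducer performs the same merge, then emits sum / count.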

Page 19: Design Pattern: Number Summarizations

[Figure: structure of the Number Summarization pattern]

Page 20: Design Pattern: Number Summarizations

Examples:
• Word count, record count
  – Count the number of occurrences of each word
• Min/Max
  – Compute the max temperature per region
• Average/Median/Standard Deviation
  – Average the number of requests per page per Web site
• Inverted index summarization
  – The inverted index pattern is commonly used to generate an index from a data set, to allow for faster searches or data enrichment capabilities.

Page 21: WordCount: Example

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Output:
fabiana 1
goodbye 1
hello 5
john 1
mapreduce 1
mike 1
world 1
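For reference, a minimal mapper/reducer pair that produces the output above, modeled on the classic Hadoop WordCount example (a sketch, not the course's WordCount.java verbatim):

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);  // emit (word, 1) for every token
        }
    }
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();          // add up the 1s emitted for this word
        }
        result.set(sum);
        context.write(key, result);    // emit (word, total count)
    }
}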

Page 22: Summarization: Example

• Goal: compute the average word length by initial letter

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Output:
f 7.0
g 7.0
h 5.0
j 4.0
m 6.5
w 5.0

Page 23: Summarization: Example

• Goal: compute the average word length by initial letter
• Check: AverageWordLengthByInitialLetter.java

public void map(Object key, Text value, Context context) {
    String line = value.toString().toLowerCase();
    /* Emit length by initial letter */
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
        String word = itr.nextToken();
        initialLetter.set(word.substring(0, 1));
        length.set(word.length());
        context.write(initialLetter, length);
    }
}

This is only an excerpt

Page 24: Summarization: Example

• Goal: compute the average word length by initial letter
• Check: AverageWordLengthByInitialLetter.java

public void reduce(Text key, Iterable<IntWritable> values, Context context) {
    int sum = 0;
    int count = 0;
    for (IntWritable val : values) {
        sum += val.get();
        count++;
    }
    average.set((float) sum / (float) count);
    context.write(key, average);
}

This is only an excerpt

Page 25: Design Pattern: Filtering

• Goal: filter out records that are not of interest and keep the others.
• An application of filtering is sampling
  – Sampling can be used to get a smaller, yet representative, data set
• Structure:
  – Mapper: filters data (it does most of the work)
  – Reducer: may simply be the identity, if the job does not produce an aggregation on filtered data

Page 26: Design Pattern: Filtering

[Figure: structure of the Filtering pattern]

Page 27: Design Pattern: Filtering

Use cases:
• Closer view of data: extract records that have something in common or something of interest (e.g., same event date, same user id)
• Tracking a thread of events: extract a thread of consecutive events as a case study from a larger data set
• Distributed grep
• Simple random sampling of the data set
  – use a filter with an evaluation function that randomly returns true or false (see the sketch below)
• Remove low-scoring data
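A minimal sketch of the random-sampling filter; the configuration key filter.percentage and the class name are illustrative assumptions, not from the course code:

// Sketch: random-sampling mapper; keeps each record with a given probability.
public static class SRSMapper extends Mapper<Object, Text, NullWritable, Text> {
    private Random rand;
    private double percentage;

    protected void setup(Context context) {
        rand = new Random();
        // "filter.percentage" is an illustrative configuration key
        percentage = context.getConfiguration().getDouble("filter.percentage", 0.01);
    }

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        if (rand.nextDouble() < percentage) {
            context.write(NullWritable.get(), value);  // the record survives the filter
        }
    }
}

Since no aggregation is needed, such a job can run map-only by setting job.setNumReduceTasks(0).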

Page 28: Filtering: Example

• Goal: implement a distributed version of grep
• grep is a command-line utility for searching plain-text data sets for lines that match a regular expression

Input:
hello world goodbye
hello fabiana
hello john
hello mike
hello mapreduce

Output (searching for "good"):
hello world goodbye

Page 29: Filtering: Example

• Goal: implement a distributed version of grep
• Check: DistributedGrep.java

public static class GrepMapper extends Mapper<Object, Text, NullWritable, Text> {

    private Pattern pattern = null;

    public void setup(Context context) throws ... {
        pattern = Pattern.compile( ... );
    }

    public void map(Object key, Text value, Context context) ... {
        Matcher matcher = pattern.matcher(value.toString());
        if (matcher.find()) {
            context.write(NullWritable.get(), value);
        }
    }
}

This is only an excerpt

Page 30: Design Pattern: Distinct

• Special case of the filter pattern
• Goal: filter out records that look like another record in the data set
• Structure:
  – Mapper: it takes each record and extracts the data fields for which we want unique values. The mapper outputs the record as the key, and null as the value
  – Reducer: it groups the nulls together by key. We then simply output the key. Because each key is grouped together, the output data set is guaranteed to be unique
• Examples:
  – Retrieve the list of words, with no repetition, in a document

Page 31: Distinct: Example

• Goal: retrieve the list of words, with no repetitions, in a document
• Check: DistinctWords.java

public void map(Object key, Text value, Context context) ... {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, NullWritable.get());
    }
}

...

public void reduce(Text key, Iterable<NullWritable> values, Context context) ... {
    context.write(key, NullWritable.get());
}

This is only an excerpt

Page 32: Design Pattern: Data Organization

• Goal: combine and organize data in a more complex data structure.
• This pattern includes several sub-categories:
  – structure-to-hierarchical pattern (e.g., denormalization)
  – partitioning and binning patterns
  – total order sorting patterns
  – shuffling patterns

Page 33: Design Pattern: Structure to Hierarchical

• Goal: create new records from data stored in very different structures.
  – This pattern follows the denormalization principles of big data stores
• Structure:
  – We might need to combine data from multiple data sources (use MultipleInputs; see the sketch below)
  – Map: it associates the data to be aggregated with the same key (e.g., the root of the hierarchical record). Each datum can be enriched with a label that identifies its source.
  – Reduce: it creates the hierarchical structure from the list of received data items
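A minimal sketch of the MultipleInputs wiring; the paths and mapper class names are illustrative placeholders (the course example is TopicItemsHierarchy.java on the following pages):

// Sketch: one job reading two sources, each through its own tagging mapper.
MultipleInputs.addInputPath(job, new Path(topicsPath),
        TextInputFormat.class, TopicMapper.class);  // emits (id, "T" + topic name)
MultipleInputs.addInputPath(job, new Path(itemsPath),
        TextInputFormat.class, ItemMapper.class);   // emits (id, "I" + item name)
job.setReducerClass(HierarchyReducer.class);        // rebuilds one record per id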

Page 34: Design Pattern: Structure to Hierarchical

[Figure: structure of the Structure to Hierarchical pattern]

Page 35: Structure to Hierarchical: Example

Input - topics:
1::Movies
2::Football teams
3::Software

Input - items:
1::Star Wars
1::Mad Max
1::Creed
2::Roma
2::Juventus
3::Autocad
3::Eclipse
3::IntelliJ
3::Microsoft Office
3::Linux
3::Google Chrome

Output (one JSON record per topic), e.g.:
{
  "topic": "Software",
  "items": ["Autocad", "Eclipse", "IntelliJ", "Microsoft Office", "Linux", "Google Chrome"]
}

Page 36: Structure to Hierarchical: Example

• Goal: create a JSON structure for each topic, containing the list of its items
  – Two inputs are provided: the list of topics and the list of items
• Check: TopicItemsHierarchy.java

public void map(Object key, Text value, Context context) ... {
    String line = value.toString();
    String[] parts = line.split("::");
    if (parts.length != 2)
        return;
    String id = parts[0];
    String content = parts[1];
    outKey.set(id);
    outValue.set(valuePrefix + content);
    context.write(outKey, outValue);
}

This is only an excerpt

Page 37: Structure to Hierarchical: Example

• Check: TopicItemsHierarchy.java

public void reduce(Text key, Iterable<Text> values, Context context) ... {
    Topic topic = new Topic();
    for (Text t : values) {
        String value = t.toString();
        if (ValueType.TOPIC.equals(discriminate(value))) {
            topic.setTopic(getContent(value));
        } else if (ValueType.ITEM.equals(discriminate(value))) {
            topic.addItem(getContent(value));
        }
    }

    /* Serialize topic */
    String serializedTopic = gson.toJson(topic);
    context.write(new Text(serializedTopic), NullWritable.get());
}

This is only an excerpt

Page 38: Design Pattern: Partitioning

• Goal: move the records into categories (i.e., shards, partitions, or bins) without caring about the order of records.
• Structure:
  – Map: in most cases, the identity mapper can be used
  – Partitioner: it determines which reducer each record is sent to; each reducer corresponds to a particular partition
  – Reduce: in most cases, the identity reducer can be used
  – All you have to define is the function that determines which partition a record goes to

Page 39: Design Pattern: Partitioning

[Figure: structure of the Partitioning pattern]

Page 40: Partitioning: Example

• Goal: group dates by year; in this case a year represents a partition
• Check: PartitionDatesByYear.java

public static class DatePartitioner extends Partitioner<IntWritable, Text> {

    public int getPartition(IntWritable key, Text value, int numPartitions) {
        return (key.get() - CONFIG_INITIAL_YEAR) % numPartitions;
    }
}

This is only an excerpt
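For the partitioner to take effect, the driver must register it and align the number of reducers with the number of partitions; a short sketch, where numberOfYears is an illustrative variable:

// Sketch: wiring the custom partitioner; one reducer per year partition.
job.setPartitionerClass(DatePartitioner.class);
job.setNumReduceTasks(numberOfYears);  // e.g., lastYear - CONFIG_INITIAL_YEAR + 1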

Page 41: Two-stage MapReduce

• As MapReduce computations get more complex, break them down into stages
  – Output of one stage = input to the next stage (see the sketch below)
• Intermediate output may be useful for different outputs too, so you can get some reuse
  – Intermediate records can be saved in the data store, forming a materialized view
• Early stages of MapReduce operations often represent the heaviest amount of data access, so building and saving them once as a basis for many downstream uses saves a lot of work
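A minimal sketch of chaining two jobs through an intermediate path; the paths and job names are illustrative (the TotalOrdering driver on the following pages follows the same shape):

// Sketch: two-stage chaining; stage 2 reads what stage 1 wrote.
Path intermediate = new Path("/tmp/stage1-output");   // illustrative path

Job stage1 = Job.getInstance(conf, "stage 1");
// ... configure mapper/reducer of stage 1 ...
FileOutputFormat.setOutputPath(stage1, intermediate);
if (!stage1.waitForCompletion(true)) {
    System.exit(1);                                   // abort if stage 1 failed
}

Job stage2 = Job.getInstance(conf, "stage 2");
// ... configure mapper/reducer of stage 2 ...
FileInputFormat.addInputPath(stage2, intermediate);   // stage 1 output = stage 2 input
FileOutputFormat.setOutputPath(stage2, new Path("/tmp/final-output"));
System.exit(stage2.waitForCompletion(true) ? 0 : 1);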

Page 42: Design Pattern: Total Order Sorting

• Sort all the records of the data set
  – Sorting in a parallel manner is not easy
• Observe: each individual reducer will sort its data by key, but unfortunately, this sorting is not global across all data
• Goal: we want a total order sorting where, if you concatenate the output files, the records are sorted
• Sorted data has a number of useful properties:
  – Sorted by time, it can provide a timeline view of the data
  – Finding things in a sorted data set can be done with binary search
  – Some databases can bulk load data faster if the data is sorted on the primary key or index column

Page 43: Design Pattern: Total Order Sorting

• This pattern has two phases (jobs): an analyze phase that determines the ranges, and an order phase that actually sorts the data.

Analyze phase: identify the data set slices
• Map: it does a simple random sampling
• Reduce: only one reducer is used. It collects the sort keys and slices them into the data range boundaries

Order phase: order the dataset
• Map: similar to the mapper of the analyze phase, but the record itself is stored as the value
• Partition: it loads the partition file and routes data according to the partitions
  – Hadoop provides an implementation: TotalOrderPartitioner
• Reduce: it is the identity function; the number of reducers needs to be equal to the number of partitions

Page 44: Total Order Sorting: Example

• Goal: order the dataset
  – We rely on the TotalOrderPartitioner class
  – Slightly different implementation of the analyze and order phases
• Check: TotalOrdering.java
• Observe the driver, which defines the chain of MapReduce jobs

/* **** Job #1: Analyze phase **** */
Job sampleJob = Job.getInstance(conf, "TotalOrdering");

/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);
...
if (isCompletedCorrecty(sampleJob)) {

This is only an excerpt

Page 45: Total Order Sorting: Example

/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");

/* Map: identity function; Reduce: emits only the key */
orderJob.setMapperClass(Mapper.class);
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);

/* Partitioner */
orderJob.setPartitionerClass(TotalOrderPartitioner.class);

/* Define the dataset sampling strategy to identify partition bounds */
InputSampler.writePartitionFile(orderJob, new InputSampler.RandomSampler(.3, 10));

This is only an excerpt

Page 46: Order Phase (1)

/* **** Job #2: Ordering phase **** */
Job orderJob = Job.getInstance(conf, "TotalOrderSortingStage");
orderJob.setJarByClass(TotalOrdering.class);

/* Map: identity function; outputs the key/value pairs of the SequenceFile */
orderJob.setMapperClass(Mapper.class);

/* Reduce: identity function */
orderJob.setReducerClass(OrderingPhaseReducer.class);
orderJob.setNumReduceTasks(10);

This is only an excerpt of main in TotalOrdering.java

Page 47: Order Phase (2)

/* Set input and output files: the input is the previous job's output */
orderJob.setInputFormatClass(SequenceFileInputFormat.class);

orderJob.setPartitionerClass(TotalOrderPartitioner.class);
TotalOrderPartitioner.setPartitionFile(orderJob.getConfiguration(), partitionFile);

InputSampler.writePartitionFile(orderJob, new InputSampler.RandomSampler(.3, 10));

This is only an excerpt of main in TotalOrdering.java

Page 48: Analyze Phase (1)

public static class AnalyzePhaseMapper extends Mapper<Object, Text, Text, Text> {
    ...
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        outkey.set(value.toString());
        context.write(outkey, value);
    }
}

/* Map: samples data; Reduce: identity function */
sampleJob.setMapperClass(AnalyzePhaseMapper.class);
sampleJob.setNumReduceTasks(0);

/* Set input and output files */
sampleJob.setOutputFormatClass(SequenceFileOutputFormat.class);

Page 49: Design Pattern: Shuffling

• Goal: we want to shuffle our dataset, to randomize our records (e.g., to improve anonymity)
• Structure:
  – Map: it emits the record as the value, along with a random key
  – Reduce: the reducer sorts the random keys, further randomizing the data
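A minimal sketch of the shuffling pattern; the class names and the choice of a random int key are illustrative assumptions:

// Sketch: anonymizing shuffle; random keys scatter records across reducers.
public static class ShuffleMapper extends Mapper<Object, Text, IntWritable, Text> {
    private final Random rand = new Random();
    private final IntWritable randomKey = new IntWritable();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        randomKey.set(rand.nextInt());  // a random key destroys the input order
        context.write(randomKey, value);
    }
}

public static class ShuffleReducer extends Reducer<IntWritable, Text, Text, NullWritable> {
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(value, NullWritable.get());  // drop the key, keep the record
        }
    }
}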