hadoop and mapreduce · “introduction to data science” outline • big data and google file...

www.helsinki.fi

Hadoop and MapReduce

Guest Lecturer: Jiaheng Lu

Homepage: https://www.cs.helsinki.fi/u/jilu/

Autumn 2017

17.9.2017 1

Big Data Framework

“Introduction to Data Science”

https://www.cs.helsinki.fi/u/jilu/

www.helsinki.fi

Outline

• Big data and Google File System (GFS)

• Hadoop and HDFS

• MapReduce and examples

• Hands-on exercise on table join

• Questions and answers for quiz

www.helsinki.fi

• One Big challenge in the era of Big Data:

• How to efficiently handle big data?

• Make big data divided

• Hadoop, GFS, MapReduce

• Make big data small

• FM Sketch, Count Sketch, Count Min Sketch

17.9.2017 3

Matemaattis-luonnontieteellinen tiedekunta /

Iso tiedonhallinta/

Jiaheng Lu

Two ways to handle big data

www.helsinki.fi

• One Big challenge in the era of Big Data:

• How to efficiently handle big data?

• Make big data divided

• Hadoop, GFS, MapReduce

• Make big data small

• FM Sketch, Count Sketch, Count Min Sketch

17.9.2017 4


Iso tiedonhallinta/

Jiaheng Lu

Two ways to handle big data

this lecture

To appear in

“Introduction to Big

´Data Management”

www.helsinki.fi

The Google File System(GFS)

A scalable distributed file system for large

distributed data intensive applications

MapReduce Bigtable

Google File System

www.helsinki.fi

GFS: Introduction

Shares many same goals as previous

distributed file systems

performance, scalability, reliability, etc

GFS design has been driven by four key

observations of Google application

workloads and technological environment

www.helsinki.fi

Intro: Observations

•1. Component failures are the norm

constant monitoring, error detection, fault tolerance

and automatic recovery are integral to the system

•2. Huge files (by traditional standards)

Multi GB files are common

I/O operations and blocks sizes must be revisited

www.helsinki.fi

Intro: Observations (Contd)

• 3. Most files are mutated by appending new data

This is the focus of performance optimization and atomicity

guarantees

• 4. Co-designing the applications and APIs

benefits overall system by increasing flexibility

www.helsinki.fi

The Design

Cluster consists of a single master and multiple

chunkservers and is accessed by multiple clients

www.helsinki.fi

The Master

Maintains all file system metadata.

names space, access control info, file to chunk mappings, chunk

(including replicas) location, etc.

Periodically communicates with chunkservers in HeartBeat

messages to give instructions and check state

www.helsinki.fi

Chunkservers

Files are broken into chunks. Each chunk has

a globally unique 64-bit chunk-handle.

handle is assigned by the master at chunk creation

Chunk size is 64 MB

Each chunk is replicated on 3 (default)

servers

www.helsinki.fi

GFS paper

• More information on data update and performance of

GFS, read the original paper:

• http://static.googleusercontent.com/media/research.g

oogle.com/en//archive/bigtable-osdi06.pdf

2017/9/17 12

http://static.googleusercontent.com/media/research.google.com/en/archive/bigtable-osdi06.pdf

www.helsinki.fi

Outline

• Google File System (GFS)

• Hadoop and HDFS




www.helsinki.fi

What is Hadoop?

• Apache top level project, open-source

implementation of frameworks for reliable, scalable,

distributed computing and data storage.

www.helsinki.fi

Hadoop’s Developers

2005: Doug Cutting and Michael J. Cafarella developed

Hadoop to support distribution for the Nutch search

engine project.

The project was funded by Yahoo.

2006: Yahoo gave the project to Apache Software

Foundation.

www.helsinki.fi

Some Hadoop Milestones

• 2008 - Hadoop Wins Terabyte Sort Benchmark (sorted 1 terabyte of

data in 209 seconds, compared to previous record of 297 seconds)

• 2010 - Hadoop's Hbase, Hive and Pig subprojects completed,

adding more computational power to Hadoop framework

• 2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha.

- Ambari, Cassandra, Mahout have been added

• 2016 - Hadoop 3.0.0 Alpha-1

www.helsinki.fi

Google Origins

2003

2004

2006

www.helsinki.fi 17.9.2017 18


Iso tiedonhallinta/

Jiaheng Lu

www.helsinki.fi

• Hadoop Common - libraries and utilities

• Hadoop Distributed File System (HDFS) – a distributed

file-system

• Hadoop YARN – a resource-management platform,

scheduling

• Hadoop MapReduce – a programming model for large

scale data processing

17.9.2017 19


Iso tiedonhallinta/

Jiaheng Lu

The Basic Hadoop Components

www.helsinki.fi

• Single NameNode - a master server that manages the file

system namespace and regulates access to files by clients.

•

• Multiple DataNodes – typically one per node in the cluster.

Functions: Manage storage, serving read/write requests from

clients

17.9.2017 20


Iso tiedonhallinta/

Jiaheng Lu

Original HDFS Design

www.helsinki.fi

Unique features of HDFS

• Failure tolerant - data is duplicated across multiple

DataNodes to protect against machine failures.

• Scalability - data transfers happen directly with the

DataNodes so your read/write capacity scales fairly well

with the number of DataNodes

21

www.helsinki.fi

HDFS Architecture

22

www.helsinki.fi

• Watch two videos on Hadoop

• https://www.youtube.com/watch?v=9s-vSeWej1U

• https://www.youtube.com/watch?v=4DgTLaFNQq0

17.9.2017 23


Iso tiedonhallinta/

Jiaheng Lu

www.helsinki.fi

Outline


• Hadoop and HDFS




MapReduce: Insight

Consider the problem of counting the number of frequency of each word in a large collection of documents

( Trump)

( Donald Trump)

(Trump Clinton)

(USA President)

(Donald Trump)

(President election)

( Donald, 2)

(election, 1)

(Clinton, 1)

( Trump, 4)

( USA, 1)

www.helsinki.fi

Simple example: Word count

Mapper(1-2)

Mapper(3-4)

Mapper(5-6)

Reducer(A-G)

Reducer(H-N)

Reducer(O-U)

Each mapper

receives some

of documents

as input

1

( Trump)

( Donald Trump)

(Trump Clinton)

(USA President)

(Donald Trump)


www.helsinki.fi


Mapper(1-2)

Mapper(3-4)

Mapper(5-6)

Reducer(A-G)

Reducer(H-N)

Reducer(O-U)

( Trump)

( Donald Trump)

(Trump Clinton)


(USA President)

(Donald Trump)

Each mapper

receives some

of documents

as input

Mappers

process the

KV-pairs.

1 2

( Trump, 1)

( Donald, 1), (Trump, 1)

( President,1),(election, 1)

( Trump, 1), (Clinton, 1)

( Donald,1),(Trump, 1)

( USA, 1), (President, 1)

www.helsinki.fi


Mapper(1-2)

Mapper(3-4)

Mapper(5-6)

Reducer(A-G)

Reducer(H-N)

Reducer(O-U)

( Trump)

( Donald Trump)

(Trump Clinton)


(USA President)

(Donald Trump)

Each mapper

receives some

of documents

as input

Mappers

process the

KV-pairs.

1 2

( Trump, 1)

( Donald, 1)

(election, 1)

(Clinton, 1)

(Trump, 1)

(President, 1)

Each KV-pair output

by the mapper is sent

to the reducer

3

( Trump, 1)

( Trump, 1)

( President,1)

( USA, 1)

( Donald,1)

www.helsinki.fi


Mapper(1-2)

Mapper(3-4)

Mapper(5-6)

Reducer(A-G)

Reducer(H-N)

Reducer(O-U)

( Trump)

( Donald Trump)

(Trump Clinton)


(USA President)

(Donald Trump)

Each mapper

receives some

of documents

as input

Mappers

process the

KV-pairs.

1 2

( Trump, 1)

( Donald, 1)

(election, 1)

(Clinton, 1)

(Trump, 1)

(President, 1)

Each KV-pair output


to the reducer

3

( Trump, 1) ( Trump, 1)

( President,1)

( USA, 1)

( Donald,1)

The reducers

sort their input

by key

4

www.helsinki.fi


Mapper(1-2)

Mapper(3-4)

Mapper(5-6)

Reducer(A-G)

Reducer(H-N)

Reducer(O-U)

( Trump)

( Donald Trump)

(Trump Clinton)


(USA President)

(Donald Trump)

Each mapper

receives some

of documents

as input

Mappers

process the

KV-pairs.

1 2

( Donald, 2)

(election, 1)

(Clinton, 1)

(President, 2)

Each KV-pair output


to the reducer

3

( Trump, 4)

( USA, 1)

The reducers

sort their input

by key

4 The reducers

process their

input

5

www.helsinki.fi

MapReduce dataflow

31

Mapper

Mapper

Mapper

Mapper

Reducer

Reducer

Reducer

Reducer

Input data

Outp

ut

data

"The Shuffle"

Intermediate

(key,value) pairs

Pseudo-code

map(String input_key, String input_value):

// input_key: document name

// input_value: document contents

for each word w in input_value:

EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):

// output_key: a word

// output_values: a list of counts

int result = 0;

for each v in intermediate_values:

result += ParseInt(v);

Emit(AsString(result));

MapReduce: Example

MapReduce in Parallel: Example

www.helsinki.fi

Common mistakes:

Use static variables

• Don't use static shared variables for mappers

• After map + reduce return, they should remember nothing about

the processed data!

35University of Pennsylvania

HashMap h = new HashMap();

map(key, value) {

if (h.contains(key)) {

h.add(key,value);

emit(key, "X");

}

}

Wrong!

www.helsinki.fi

Common mistakes:

Do your own I/O

• Don't try to do your own I/O!

• Don't try to read from, or write to, files in the file system

• The MapReduce framework does all the I/O for you:

‒ All the incoming data will be fed as arguments to map and reduce

‒ Any data your functions produce should be output via emit


map(key, value) {

File foo =

new File("xyz.txt");

while (true) {

s = foo.readLine();

...

}

} Wrong!

www.helsinki.fi

Common mistakes:

Too much data on the same key

• Mapper must not map too much data to the same key

• In particular, don't map everything to the same key!!

• Otherwise the reduce worker will be overwhelmed!

• It's okay if some reduce workers have more work than others


map(key, value) {

emit("FOO", key + " " + value);

}

Wrong!

www.helsinki.fi

Designing MapReduce

algorithms

• Key decision: What should be done by map, and what by

reduce?

• map can do something to each individual key-value pair, but

it can't look at other key-value pairs

• reduce can aggregate data; it can look at multiple values, as long

as map has mapped them to the same (intermediate) key

‒ Example: Count the number of words, add up the total cost, ...


www.helsinki.fi

More details on the MapReduce

data flow

39

Data partitions

by key

Map computation

partitions

Reduce

computation

partitions

Redistribution

by output’s key("shuffle")

Coordinator

www.helsinki.fi

Some additional details

• To make this work, we need a few more parts in Hadoop

HDFS system

• The file system (distributed across all nodes):

• Stores the inputs, outputs, and temporary results

• The driver program (executes on one node):

• Specifies where to find the inputs, the outputs

• Specifies what mapper and reducer to use

• Can customize behavior of the execution

• The runtime system (controls nodes):

• Supervises the execution of tasks

40

Java MapReduce code

on Apache Hadoop 2.7.2

www.helsinki.fi

MapReduce Program

• A MapReduce program consists of the following 3

parts :

• Driver → main (would trigger the map and reduce

methods)

• Mapper

• Reducer

• It is better to include the map reduce and main

methods in 3 different classes

2017/9/17 42

www.helsinki.fi

Mapper

• public static class TokenizerMapper

• extends Mapper<Object, Text, Text, IntWritable>{

• private final static IntWritable one = new IntWritable(1);

• private Text word = new Text();

• public void map(Object key, Text value, Context context

• ) throws IOException, InterruptedException {

• StringTokenizer itr = new StringTokenizer(value.toString());

• while (itr.hasMoreTokens()) {

• word.set(itr.nextToken());

• context.write(word, one);

• }

• }

• }

2017/9/17 43

www.helsinki.fi

Mapper





• public void map(Object key, Text value, Context context






• }

• }

• }

2017/9/17 44

Interface

Mapper<K1,V1,K2,V2> , the

first pair is the input key/value

pair, the second is the output

key/value pair

www.helsinki.fi

Mapper





• public void map (Object key, Text value, Context context) throws IOException, InterruptedException {





• }

• }

• }

2017/9/17 45

Keys are the position in the file,

and values are the line of text.

Context emits the output.

www.helsinki.fi

Reducer

• public static class IntSumReducer

• extends Reducer<Text,IntWritable,Text,IntWritable> {

• private IntWritable result = new IntWritable();

• public void reduce(Text key, Iterable<IntWritable> values,

• Context context


• int sum = 0;

• for (IntWritable val : values) {

• sum += val.get();

• }

• result.set(sum);

• context.write(key, result);

• }

• }

2017/9/17 46

www.helsinki.fi

Main function

• public static void main(String[] args) throws Exception {

• Configuration conf = new Configuration();

• Job job = Job.getInstance(conf, "word count");

• job.setJarByClass(WordCount.class);

• job.setMapperClass(TokenizerMapper.class);

• job.setCombinerClass(IntSumReducer.class);

• job.setReducerClass(IntSumReducer.class);

• job.setOutputKeyClass(Text.class);

• job.setOutputValueClass(IntWritable.class);

• FileInputFormat.addInputPath(job, new Path(args[0]));

• FileOutputFormat.setOutputPath(job, new Path(args[1]));

• System.exit(job.waitForCompletion(true) ? 0 : 1);

• }

2017/9/17 47

Given the Mapper and Reducer

code, the main() starts the

MapReduce running.

www.helsinki.fi

Main function













• }

2017/9/17 48

Configurations are specified by

resources. A resource contains

a set of name/value pairs as

XML data.

www.helsinki.fi

Main function













• }

2017/9/17 49

Normally the user creates the

application, describes various

facets of the job via Job and

then submits the job and

monitor its progress.

www.helsinki.fi

Main function













• }

2017/9/17 50

CombinerClass is a

mini reducer running in

a single Mapper node.

www.helsinki.fi

Combiner class

• Combiner class "mini-reduce"

• machine A emits <the, 1>, <the, 1>

• machine B emits <the, 1>.

• a Combiner on machine A emits <the, 2>. This value,

along with the <the, 1> from machine B will both go

to the Reducer node

• We have now saved bandwidth, but preserved the

computation.

2017/9/17 51

www.helsinki.fi

• Watch a video

• https://www.youtube.com/watch?v=bcjSe0xCHbE

•

2017/9/17 52

https://www.youtube.com/watch?v=bcjSe0xCHbE

www.helsinki.fi

Outline


• Hadoop Eco-system

• MapReduce and examples (with a video)



www.helsinki.fi

Hands-on exercise on

MapReduce

• Write one executable MapReduce programs to perform

the table inner-join in the exercise

A B

1 ab

1 cd

4 ef

A C

1 b

2 d

4 c

Table x Table y

A B C

1 ab b

1 cd b

4 ef c

Output

www.helsinki.fi

Hands-on exercise on

MapReduce

• Download the instructions of the exercise at

• https://www.cs.helsinki.fi/u/jilu/dataset/HadoopExerci

ses.pdf

• Read the instruction to install Hadoop on Ukko

• Download the dataset

https://www.cs.helsinki.fi/u/jilu/dataset/HadoopExercises.pdf

www.helsinki.fi

Reduce-side join

• Map

• output <key, value>

• key>>join key, value>>tagged with data source

• Reduce

• do a full cross-product of values

• output the combination results

www.helsinki.fi

Example

A B

1 ab

1 cd

4 ef

A C

1 b

2 d

4 c

table x

table y

map()

map()

1

4

key

x ab

x cd

x ef

value

1

2

4

key

y b

y d

y c

valuetag

join key

shuffle()1

key

x ab

x cd

y b

valuelist

2 y d

4x ef

y c

reduce()

A B C

1 ab b

1 cd b

4 ef c

output

1

www.helsinki.fi

Outline


• Hadoop Eco-system

• MapReduce and examples (with a video)



www.helsinki.fi

Google File System is scalable, distributed file system

on inexpensive commodity hardware that provides:

A. Fault Tolerance

B. High Aggregate Performance

C. ACID transaction model

D. Failure detection on replicas

17.9.2017 59


Iso tiedonhallinta/

Jiaheng Lu

Question 1

www.helsinki.fi

Google File System is scalable, distributed file system

on inexpensive commodity hardware that provides:

A. Fault Tolerance

B. High Aggregate Performance

C. ACID transaction model

D. Failure detection on replicas

17.9.2017 60


Iso tiedonhallinta/

Jiaheng Lu

Question 1

www.helsinki.fi

What are the assumptions in designing Google File Systems?

A. The system is built from many inexpensive commodity

components.

B. The workloads have very frequent updating operations.

C. The stringent response time requirements for an individual

read or write are not the primary designing goal.

D. The workload consists of both large streaming reads and

small random reads.

17.9.2017 61


Iso tiedonhallinta/

Jiaheng Lu

Question 2

www.helsinki.fi

What are the assumptions in designing Google File Systems?

A. The system is built from many inexpensive commodity

components.

B. The workloads have very frequent updating operations.

C. The stringent response time requirements for an individual

read or write are not the primary designing goal.

D. The workload consists of both large streaming reads and

small random reads.

17.9.2017 62


Iso tiedonhallinta/

Jiaheng Lu

Question 2

www.helsinki.fi

3. What is the chunk size in GFS ?

A. 16MB

B. 32MB

C. 64 MB

D. 128MB

17.9.2017 63


Iso tiedonhallinta/

Jiaheng Lu

Question 3

www.helsinki.fi

3. What is the chunk size in GFS ?

A. 16MB

B. 32MB

C. 64 MB

D. 128MB

17.9.2017 64


Iso tiedonhallinta/

Jiaheng Lu

Question 3

www.helsinki.fi

4. Which are the mistakes on MapReduce programs?

A. Using the static shared variables for mappers

B. Map too much data to the same key

C. Write the own I/O codes

D. Always map all data to the same key

17.9.2017 65


Iso tiedonhallinta/

Jiaheng Lu

Question 4

www.helsinki.fi

Which are the mistakes on MapReduce programs?

A. Using the static shared variables for mappers

B. Map too much data to the same key

C. Write the own I/O codes

D. Always map all data to the same key

17.9.2017 66


Iso tiedonhallinta/

Jiaheng Lu

Question 4

www.helsinki.fi

Which are the typical application scenarios for a

MapReduce program?

A. Perform the matrix multiplication and other

complicated computing operations

B. Run machine learning algorithms with many

iterations

C. Compute the inverted indices

D. Summarize the number of pages crawled per host

on Internet

17.9.2017 67


Iso tiedonhallinta/

Jiaheng Lu

Question 5

www.helsinki.fi

Which are the typical application scenarios for a

MapReduce program?

A. Perform the matrix multiplication and other

complicated computing operations

B. Run machine learning algorithms with many

iterations

C. Compute the inverted indices

D. Summarize the number of pages crawled per host

on Internet

17.9.2017 68


Iso tiedonhallinta/

Jiaheng Lu

Question 5

www.helsinki.fi

MapReduce is an abstraction to hide the following

messy details of parallelization, including:

A. fault-tolerance

B. data distribution

C. high performance

D. load balancing

17.9.2017 69


Iso tiedonhallinta/

Jiaheng Lu

Question 6

www.helsinki.fi

MapReduce is an abstraction to hide the following

messy details of parallelization, including:

A. fault-tolerance

B. data distribution

C. high performance

D. load balancing

17.9.2017 70


Iso tiedonhallinta/

Jiaheng Lu

Question 6

www.helsinki.fi

Which are the correct statements on the workflow of MapReduce program?

A. The intermediate key/value pairs produced by the Map function are buffered in memory and periodically, these buffered pairs are written to local disk.

B. Master node is responsible for forwarding the location of the buffered pairs on local disk to the reduce works.

C. A reduce worker uses remote procedure calls to read the buffered data from the local disks of map workers.

D. When a reduce worker read partial of intermediate data, it start to sort it by the intermediate keys so that the same keys are grouped together.

17.9.2017 71


Iso tiedonhallinta/

Jiaheng Lu

Question 7

www.helsinki.fi

Which are the correct statements on the workflow of MapReduce program?

A. The intermediate key/value pairs produced by the Map function are buffered in memory and periodically, these buffered pairs are written to local disk.

B. Master is responsible for forwarding the location of the buffered pairs on local disk to the reduce workers.

C. A reduce worker uses remote procedure calls to read the buffered data from the local disks of map workers.

D. When a reduce worker read partial of intermediate data, it start to sort it by the intermediate keys so that the same keys are grouped together.

17.9.2017 72


Iso tiedonhallinta/

Jiaheng Lu

Question 7

www.helsinki.fi

Which are the correct statements on the functions of

Mapper and Reducer?

A. Each Mapper can do something to each individual

key-value pair.

B. Each Mapper can look at key-value pairs of other

mappers.

C. Each Reducer can aggregate data.

D. Each Reduce can look at multiple values from other

reducers.

17.9.2017 73


Iso tiedonhallinta/

Jiaheng Lu

Question 8

www.helsinki.fi

Which are the correct statements on the functions of

Mapper and Reducer?

A. Each Mapper can do something to each individual

key-value pair.

B. Each Mapper can look at key-value pairs of other

mappers.

C. Each Reducer can aggregate data.

D. Each Reduce can look at multiple values from other

reducers.

17.9.2017 74


Iso tiedonhallinta/

Jiaheng Lu

Question 8

www.helsinki.fi

What are the purposes of Combine Function?

A. The Combine function is executed on each

machine that performs a reduce task.

B. Typically the same code is used to implement both

the combine and the reduce functions.

C. The output of a combiner function is written to an

intermediate file that will be sent to a reduce task.

D. Partial combining can significantly speed up certain

of MapReduce operations.

17.9.2017 75


Iso tiedonhallinta/

Jiaheng Lu

Question 9

www.helsinki.fi

What are the purposes of Combine Function?

A. The Combine function is executed on each

machine that performs a reduce task.

B. Typically the same code is used to implement both

the combine and the reduce functions.

C. The output of a combiner function is written to an

intermediate file that will be sent to a reduce task.

D. Partial combining can significantly speed up certain

of MapReduce operations.

17.9.2017 76


Iso tiedonhallinta/

Jiaheng Lu

Question 9

www.helsinki.fi

Limitations of Hadoop

• Latency, slow processing speed

• No Real-time Data Processing

• Not fit for small files

2017/9/17 77

www.helsinki.fi

• Hadoop is an open-source platform for big data processing

• MapReduce is a programming framework to process big

data

• More information on big data management, join the course

“Introduction to big data management”:

• https://courses.helsinki.fi/DATA14002/119122647

17.9.2017 78


Iso tiedonhallinta/

Jiaheng Lu

Summary

https://courses.helsinki.fi/DATA14002/119122647

hadoop and mapreduce · “introduction to data science” outline • big data and google file...

Documents