basics of cloud computing –lecture 5 mapreduce algorithms...mapreduce algorithms satish srirama...

34
Basics of Cloud Computing – Lecture 5 MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation Commons Attribution 3.0 License)

Upload: others

Post on 16-Oct-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Basics of Cloud Computing – Lecture 5

MapReduce Algorithms

Satish Srirama

Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Commons Attribution 3.0 License)

Page 2: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Outline

• MapReduce algorithms

• How to write MR algorithms

11.03.2014 Satish Srirama 2/34

Page 3: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

combinecombine combine combine

ba 1 2 c 9 a c5 2 b c7 8

mapmap map map

k1 k2 k3 k4 k5 k6v1 v2 v3 v4 v5 v6

ba 1 2 c c3 6 a c5 2 b c7 8

ba 1 2 c 9 a c5 2 b c7 8

partition partition partition partition

Shuffle and Sort: aggregate values by keys

reduce reduce reduce

a 1 5 b 2 7 c 2 9 8

r1 s1 r2 s2 r3 s3

c 2 3 6 8

11.03.2014 Satish Srirama 3

Page 4: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Hadoop Usage Patterns

• Extract, transform, and load (ETL)

– Perform aggregations, transformation, normalizations on the data (e.g. Log files) and load into RDBMS/ data mart

• Reporting and analytics

– Run ad-hoc queries, analytics and data mining operations – Run ad-hoc queries, analytics and data mining operations on large data

• Data processing pipelines

• Machine learning & Graph algorithms

– Implement machine learning algorithms on huge data sets

– Traverse large graphs and data sets, building models and classifiers

11.03.2014 Satish Srirama 4/34

Page 5: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

MapReduce Examples

• Distributed Grep

• Count of URL Access Frequency

• Reverse Web-Link Graph

• Term-Vector per Host • Term-Vector per Host

• Inverted Index

• Distributed Sort

11.03.2014 Satish Srirama 5/34

Page 6: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

MapReduce Jobs

• Tend to be very short, code-wise

– IdentityReducer is very common

• “Utility” jobs can be composed

• Represent a data flow, more so than a • Represent a data flow, more so than a

procedure

11.03.2014 Satish Srirama 6/34

Page 7: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Count of URL Access Frequency

• Processing web access logs

• Very similar to word count

• Map

– processes logs of web page requests and outputs – processes logs of web page requests and outputs

<URL, 1>

• Reduce

– adds together all values

– emits a <URL, total count> pairs

11.03.2014 Satish Srirama 7/34

Page 8: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Distributed Grep

• Map

– Emits a line if it matches a supplied pattern

• Reduce

– Identity function – Identity function

– Just copies the supplied intermediate data to the

output

11.03.2014 Satish Srirama 8/34

Page 9: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Reverse Web-Link Graph

• Map

– Outputs <target, source> pairs

– for each link to a target URL found in a page

named source.named source.

• Reduce

– Concatenates the list of all source URLs

– Returns <target, list(source)>

11.03.2014 Satish Srirama 9/34

Page 10: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Sort: Inputs

• A set of files, one value per line.

• Mapper key is file name, line number

• Mapper value is the contents of the line

11.03.2014 Satish Srirama 10/34

Page 11: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Sort Algorithm

• Takes advantage of reducer properties:

– (key, value) pairs are processed in order by key;

reducers are themselves ordered

• Mapper: Identity function for value• Mapper: Identity function for value

(k, v) -> (v, _)

• Reducer: Identity function (k’, _) -> (k’, “”)

11.03.2014 Satish Srirama 11/34

Page 12: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Sort: The Trick

• (key, value) pairs from mappers are sent to a

particular reducer based on hash(key)

• Must pick the hash function for your data such

that that

– K1 < K2 => hash(K1) < hash(K2)

• Used as a test of Hadoop’s raw speed

11.03.2014 Satish Srirama 12/34

Page 13: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Inverted Index: Inputs

• A set of files containing lines of text

• Mapper key is file name, line number

• Mapper value is the contents of the line• Mapper value is the contents of the line

11.03.2014 Satish Srirama 13/34

Page 14: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Inverted Index Algorithm

• Mapper: For each word in (file, words),

map to (word, file)

• Reducer: Identity function• Reducer: Identity function

11.03.2014 Satish Srirama 14/34

Page 15: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Index MapReduce

• map(pageName, pageText):

foreach word w in pageText:

emit Intermediate(w, pageName);

Done

• reduce(word, values):

foreach pageName in values:

AddToOutputList(pageName);

Done

emitFinal(FormattedPageListForWord);

11.03.2014 Satish Srirama 15/34

Page 16: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Index: Data Flow

This page

contains

so much of

text

Page A A map output

This : A

page : A

contains : A

so : A

much : A

of : A

Reduced output

This : A, B

page : A, B

too : B

11.03.2014 Satish Srirama

This page

too

contains

some text

Page B

of : A

text : A

B map output

This : B

page : B

too : B

contains : B

some : B

text : B

too : B

contains : A, B

so : A

much : A

of : A

text : A, B

some : B

16/34

Page 17: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Let us focus much bigger problemsLet us focus much bigger problems

11.03.2014 Satish Srirama 17/34

Page 18: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Managing Dependencies

• Remember: Mappers run in isolation

– You have no idea in what order the mappers run

– You have no idea on what node the mappers run

– You have no idea when each mapper finishes

• Tools for synchronization:• Tools for synchronization:

– Ability to hold state in reducer across multiple key-value pairs

– Sorting function for keys

– Partitioner

– Cleverly-constructed data structures

11.03.2014 Satish Srirama 18/34

Page 19: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Motivating Example

• Term co-occurrence matrix for a text collection

– M = N x N matrix (N = vocabulary size)

– Mij: number of times i and j co-occur in some context (for concreteness, let’s say context = sentence)(for concreteness, let’s say context = sentence)

• Why?

– Distributional profiles as a way of measuring semantic distance

– Semantic distance useful for many language processing tasks

“You shall know a word by the company it keeps” (Firth, 1957)

e.g., Mohammad and Hirst (EMNLP, 2006)

11.03.2014 Satish Srirama 19/34

Page 20: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

MapReduce: Large Counting Problems

• Term co-occurrence matrix for a text collection= specific instance of a large counting problem

– A large event space (number of terms)

– A large number of events (the collection itself)– A large number of events (the collection itself)

– Goal: keep track of interesting statistics about the events

• Basic approach

– Mappers generate partial counts

– Reducers aggregate partial counts

How do we aggregate partial counts efficiently?

11.03.2014 Satish Srirama 20/34

Page 21: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

First Try: “Pairs”

• Each mapper takes a sentence:

– Generate all co-occurring term pairs

– For all pairs, emit (a, b) → count

• Reducers sums up counts associated with • Reducers sums up counts associated with

these pairs

• Use combiners!

11.03.2014 Satish Srirama 21/34

Page 22: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

“Pairs” Analysis

• Advantages

– Easy to implement, easy to understand

• Disadvantages

– Lots of pairs to sort and shuffle around (upper – Lots of pairs to sort and shuffle around (upper

bound?)

11.03.2014 Satish Srirama 22/34

Page 23: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Another Try: “Stripes”

• Idea: group together pairs into an associative array(a, b) → 1

(a, c) → 2

(a, d) → 5

(a, e) → 3

(a, f) → 2

a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

• Each mapper takes a sentence:

– Generate all co-occurring term pairs

– For each term, emit a → { b: countb, c: countc, d: countd … }

• Reducers perform element-wise sum of associative arrays a → { b: 1, d: 5, e: 3 }

a → { b: 1, c: 2, d: 2, f: 2 }

a → { b: 2, c: 2, d: 7, e: 3, f: 2 }+

11.03.2014 Satish Srirama 23/34

Page 24: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

“Stripes” Analysis

• Advantages

– Far less sorting and shuffling of key-value pairs

– Can make better use of combiners

• Disadvantages• Disadvantages

– More difficult to implement

– Underlying object is more heavyweight

– Fundamental limitation in terms of size of event

space

11.03.2014 Satish Srirama 24/34

Page 25: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Cluster size: 38 cores

Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3), which

contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)

11.03.2014 Satish Srirama 25

Page 26: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Conditional Probabilities

• How do we compute conditional probabilities

from counts?

∑==

)',(count

),(count

)(count

),(count)|(

BA

BA

A

BAABP

• How do we do this with MapReduce?

∑'

)',(count)(countB

BAA

11.03.2014 Satish Srirama 26/34

Page 27: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

P(B|A): “Pairs”

(a, b1) → 3

(a, b2) → 12

(a, b3) → 7

(a, b4) → 1

(a, *) → 32

(a, b1) → 3 / 32

(a, b2) → 12 / 32

(a, b3) → 7 / 32

(a, b4) → 1 / 32

Reducer holds this value in memory

• For this to work:– Must emit extra (a, *) for every bn in mapper

– Must make sure all a’s get sent to same reducer (use Partitioner)

– Must make sure (a, *) comes first (define sort order)

11.03.2014 Satish Srirama 27/34

Page 28: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

P(B|A): “Stripes”

• Easy!

– One pass to compute (a, *)

a → {b1:3, b2 :12, b3 :7, b4 :1, … }

– One pass to compute (a, *)

– Another pass to directly compute P(B|A)

11.03.2014 Satish Srirama 28/34

Page 29: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Synchronization in Hadoop

• Approach 1: turn synchronization into an ordering problem– Sort keys into correct order of computation

– Partition key space so that each reducer gets the appropriate set of partial results

– Hold state in reducer across multiple key-value pairs to perform computationcomputation

– Illustrated by the “pairs” approach

• Approach 2: construct data structures that “bring the pieces together”– Each reducer receives all the data it needs to complete the

computation

– Illustrated by the “stripes” approach

11.03.2014 Satish Srirama 29/34

Page 30: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Issues and Tradeoffs

• Number of key-value pairs

– Object creation overhead

– Time for sorting and shuffling pairs across the network

• Size of each key-value pair• Size of each key-value pair

– De/serialization overhead

• Combiners make a big difference!

– RAM vs. disk and network

– Arrange data to maximize opportunities to aggregate

partial results

11.03.2014 Satish Srirama 30/34

Page 31: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Complex Data Types in Hadoop

• How do you implement complex data types?

• The easiest way:– Encoded it as Text, e.g., (a, b) = “a:b”

– Use regular expressions to parse and extract data

– Works, but pretty hack-ish– Works, but pretty hack-ish

• The hard way:– Define a custom implementation of

WritableComprable

– Must implement: readFields, write, compareTo

– Computationally efficient, but slow for rapid prototyping

11.03.2014 Satish Srirama 31/34

Page 32: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

This week in lab

• MapReduce for data analysis

• Writing better MapReduce algorithms

11.03.2014 Satish Srirama 32/34

Page 33: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

Next Lecture

• Platform as a Service

– Google AppEngine

11.03.2014 Satish Srirama 33/34

Page 34: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms...MapReduce Algorithms Satish Srirama Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creation

References

• J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI'04: Sixth Symposium on Operating System Design and Implementation, Dec, 2004.

• Data-Intensive Text Processing with MapReduce• Data-Intensive Text Processing with MapReduce

Authors: Jimmy Lin and Chris Dyer

http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

Pages 50-57: Pairs and Stripes problem

11.03.2014 Satish Srirama 34/34