
Page 1: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Data-Intensive Computing for Text Analysis CS395T / INF385T / LIN386M

University of Texas at Austin, Fall 2011

Lecture 3 September 8, 2011

Matt Lease

School of Information

University of Texas at Austin

ml at ischool dot utexas dot edu

Jason Baldridge

Department of Linguistics

University of Texas at Austin

Jasonbaldridge at gmail dot com

Page 2: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Acknowledgments

Course design and slides derived from Jimmy Lin’s cloud computing courses at the University of Maryland, College Park

Some figures courtesy of the following excellent Hadoop books (order yours today!)

• Chuck Lam’s Hadoop In Action (2010)

• Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)

Page 3: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Today’s Agenda

• Review

• Toward MapReduce “design patterns”

– Building block: preserving state across calls

– In-Map & In-Mapper combining (vs. combiners)

– Secondary sorting (via value-to-key Conversion)

– Pairs and Stripes

– Order Inversion

• Group Work (examples)

– Interlude: scaling counts, TF-IDF

Page 4: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Review

Page 5: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

MapReduce: Recap

Required:

map ( K1, V1 ) → list ( K2, V2 )

reduce ( K2, list(V2) ) → list ( K3, V3)

All values with the same key are reduced together

Optional:

partition (K2, N) → Rj maps K2 to some reducer Rj in [1..N]

Often a simple hash of the key, e.g., hash(k’) mod n

Divides up key space for parallel reduce operations

combine ( K2, list(V2) ) → list ( K2, V2 )

Mini-reducers that run in memory after the map phase

Used as an optimization to reduce network traffic

The execution framework handles everything else…
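These signatures can be illustrated with a toy, single-process simulation (plain Python; names like run_mapreduce are mine, not Hadoop's, and real Hadoop distributes each phase across machines):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer, combiner=None):
    """Toy MapReduce: map, optional combine, shuffle-and-sort
    (group values by key), then reduce -- all in one process."""
    intermediate = defaultdict(list)
    for k1, v1 in records:                     # map ( K1, V1 ) -> list ( K2, V2 )
        for k2, v2 in mapper(k1, v1):
            intermediate[k2].append(v2)
    if combiner is not None:                   # combine ( K2, list(V2) ) -> list ( K2, V2 )
        combined = defaultdict(list)
        for k, vs in intermediate.items():
            for k2, v2 in combiner(k, vs):
                combined[k2].append(v2)
        intermediate = combined
    output = []
    for k2 in sorted(intermediate):            # shuffle and sort: aggregate by key
        output.extend(reducer(k2, intermediate[k2]))
    return output

def wc_map(docid, text):                       # word count mapper
    for w in text.split():
        yield (w, 1)

def wc_reduce(term, values):                   # word count reducer
    yield (term, sum(values))

print(run_mapreduce([("d1", "a b a"), ("d2", "b c")], wc_map, wc_reduce))
# → [('a', 2), ('b', 2), ('c', 1)]
```

Here the reducer also works as a combiner (summing is associative), so passing `combiner=wc_reduce` gives the same answer with less "network" traffic.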

Page 6: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

“Everything Else”

The execution framework handles everything else…

Scheduling: assigns workers to map and reduce tasks

"Data distribution": moves processes to data

Synchronization: gathers, sorts, and shuffles intermediate data

Errors and faults: detects worker failures and restarts

Limited control over data and execution flow

All algorithms must be expressed in m, r, c, p

You don’t know:

Where mappers and reducers run

When a mapper or reducer begins or finishes

Which input a particular mapper is processing

Which intermediate key a particular reducer is processing

Page 7: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

[Figure: MapReduce data flow. Mappers consume (k1, v1) … (k6, v6) and emit keyed counts (b 1, a 2 | c 3, c 6 | a 5, c 2 | b 7, c 8); combiners locally aggregate (e.g. c 3 and c 6 combine to c 9); partitioners assign keys to reducers. Shuffle and sort then aggregates values by key, so the three reducers receive a → [1, 5], b → [2, 7], c → [2, 9, 8] and emit (r1, s1), (r2, s2), (r3, s3).]

Page 8: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Shuffle and Sort

[Figure: shuffle and sort internals. On the map side, output fills an in-memory circular buffer; sorted spills are written to disk, optionally run through the Combiner, and merged into partitioned intermediate files on disk. On the reduce side, each Reducer fetches its partition from this mapper and from other mappers, merging inputs (again optionally combining) while other reducers do the same.]

Page 9: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Shuffle and 2 Sorts

As map emits values, local sorting runs in tandem (1st sort)

Combine is optionally called 0..N times for local aggregation on sorted (K2, list(V2)) tuples (more sorting of output)

Partition determines which (logical) reducer Rj each key will go to

Node’s TaskTracker tells JobTracker it has keys for Rj

JobTracker determines node to run Rj based on data locality

When local map/combine/sort finishes, the node sends data to Rj’s node

Rj’s node iteratively merges inputs from map nodes as they arrive (2nd sort)

For each (K, list(V)) tuple in merged output, call reduce(…)

Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 178

Page 10: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Scalable Hadoop Algorithms: Themes

Avoid object creation

Inherently costly operation

Garbage collection

Avoid buffering

Limited heap size

Works for small datasets, but won’t scale!

• Yet… we’ll talk about patterns involving buffering…

Page 11: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Importance of Local Aggregation

Ideal scaling characteristics:

Twice the data, twice the running time

Twice the resources, half the running time

Why can’t we achieve this?

Synchronization requires communication

Communication kills performance

Thus… avoid communication!

Reduce intermediate data via local aggregation

Combiners can help

Page 12: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Tools for Synchronization

Cleverly-constructed data structures

Bring partial results together

Sort order of intermediate keys

Control order in which reducers process keys

Partitioner

Control which reducer processes which keys

Preserving state in mappers and reducers

Capture dependencies across multiple keys and values

Page 13: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Secondary Sorting

MapReduce sorts input to reducers by key

Values may be arbitrarily ordered

What if we want to sort values too?

E.g., k → (k1,v1), (k1,v3), (k1,v4), (k1,v8)…

Solutions?

Swap key and value to sort by value?

What if we use (k,v) as a joint key (and change nothing else)?

Page 14: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Secondary Sorting: Solutions

Solution 1: Buffer values in memory, then sort

Tradeoffs?

Solution 2: "Value-to-key conversion" design pattern

Form composite intermediate key: (k, v1)

Let execution framework do the sorting

Preserve state across multiple key-value pairs

…how do we make this happen?

Page 15: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Secondary Sorting (Lin 57, White 241)

Create composite key: (k,v)

Define a Key Comparator to sort via both

Possibly not needed in some cases (e.g. strings & concatenation)

Define a partition function based only on the (original) key

All pairs with same key should go to same reducer

Multiple keys may still go to the same reduce node; how do you know when the key changes across invocations of reduce()?

• i.e. assume you want to do something with all values associated with a given key (e.g. print all on the same line, with no other keys)

Preserve state in the reducer across invocations

reduce() will be called separately for each pair, but we need to track the current key so we can detect when it changes

Hadoop also provides Group Comparator
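A minimal single-process sketch of value-to-key conversion plus state preservation (Python; a plain tuple sort stands in for the framework's composite-key sort, and we assume the partitioner keeps each original key on one reducer):

```python
# Value-to-key conversion: move the value into a composite key (k, v)
# and let the framework's sort order the values for free; the
# partitioner must use only the original k so all (k, *) pairs
# arrive at the same reducer.
pairs = [("k1", 8), ("k1", 3), ("k2", 5), ("k1", 1)]
composite = sorted((k, v) for k, v in pairs)   # the framework's sort

# Reducer preserving state across invocations: reduce() fires once per
# composite key, so we track the current k to detect when it changes.
lines, current_key, current_vals = [], None, []
for k, v in composite:
    if current_key is not None and k != current_key:
        lines.append((current_key, current_vals))  # key changed: flush
        current_vals = []
    current_key = k
    current_vals.append(v)
if current_key is not None:
    lines.append((current_key, current_vals))      # flush the last key

print(lines)  # [('k1', [1, 3, 8]), ('k2', [5])]
```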

Page 16: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Preserving State in Hadoop

Mapper object

configure

map

close

state

one object per task

Reducer object

configure

reduce

close

state

one call per input

key-value pair

one call per

intermediate key

API initialization hook

API cleanup hook

Page 17: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Combiner Design

Combiners and reducers share same method signature

Sometimes, reducers can serve as combiners

Often, not…

Remember: combiners are optional optimizations

Should not affect algorithm correctness

May be run 0, 1, or multiple times

Page 18: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

“Hello World”: Word Count

Map(String docid, String text):

for each word w in text:

Emit(w, 1);

Reduce(String term, Iterator<Int> values):

int sum = 0;

for each v in values:

sum += v;

Emit(term, sum);

map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )

Page 19: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

“Hello World”: Word Count

Map(String docid, String text):

for each word w in text:

Emit(w, 1);

Reduce(String term, Iterator<Int> values):

int sum = 0;

for each v in values:

sum += v;

Emit(term, sum);

Combiner?

map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )

Page 20: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

MapReduce Algorithm Design

Page 21: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Design Pattern for Local Aggregation

"In-mapper combining"

Fold the functionality of the combiner into the mapper, including preserving state across multiple map calls

Advantages

Speed

Why is this faster than actual combiners?

• Construction/deconstruction, serialization/deserialization

• Guarantee and control use

Disadvantages

Buffering! Explicit memory management required

• Can use disk-backed buffer, based on # items or bytes in memory

• What if multiple mappers are running on the same node? Do we know?

Potential for order-dependent bugs
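A sketch of the pattern: a mapper object that holds a dict of partial counts across map() calls and emits only at the end (Python stand-ins for Hadoop's configure/map/close lifecycle; class and method names are mine):

```python
from collections import defaultdict

class InMapperCombiningWordCount:
    """Word-count mapper that folds the combiner into the mapper by
    preserving a dict of partial counts across map() calls."""
    def __init__(self):                 # ~ Hadoop's configure() hook
        self.counts = defaultdict(int)
    def map(self, docid, text):
        for w in text.split():
            self.counts[w] += 1         # buffer instead of emitting (w, 1)
    def close(self):                    # ~ Hadoop's close() hook: emit once
        return sorted(self.counts.items())

m = InMapperCombiningWordCount()
m.map("d1", "a b a")
m.map("d2", "b a")
print(m.close())  # [('a', 3), ('b', 2)]
```

Only one pair per distinct word leaves the mapper, instead of one per token; the price is the explicit memory management discussed above.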

Page 22: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

“Hello World”: Word Count

Map(String docid, String text):

for each word w in text:

Emit(w, 1);

Reduce(String term, Iterator<Int> values):

int sum = 0;

for each v in values:

sum += v;

Emit(term, sum);

Combine = reduce

map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer )

Page 23: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Word Count: in-map combining

Are combiners still needed?

Page 24: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Word Count: in-mapper combining

Are combiners still needed?

Page 25: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Example 2: Compute the Mean (v1)

Why can’t we use reducer as combiner?

Page 26: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Example 2: Compute the Mean (v2)

Why doesn’t this work?

Page 27: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Example 2: Compute the Mean (v3)

Page 28: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Computing the Mean:

in-mapper combining

Are combiners still needed?
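The v1–v3 code appeared as images in the slides; the essential fix can be reconstructed as a small sketch (the (sum, count) encoding is the standard remedy): averaging partial means is wrong, but partial (sum, count) pairs combine safely.

```python
# Mean of means is wrong: mean([1, 2]) = 1.5 and mean([3, 4, 5]) = 4.0
# average to 2.75, but the true mean of [1, 2, 3, 4, 5] is 3.0.
# Fix: combiners emit (sum, count) partial pairs, which ARE associative.

def combine(partials):
    """Merge (sum, count) pairs; usable as both combiner and reducer core."""
    s = sum(p[0] for p in partials)
    c = sum(p[1] for p in partials)
    return (s, c)

# Two mappers' locally combined output for the same key:
p1 = combine([(1, 1), (2, 1)])           # (3, 2)
p2 = combine([(3, 1), (4, 1), (5, 1)])   # (12, 3)

# The reducer merges partials and only divides at the very end
total_sum, total_count = combine([p1, p2])
print(total_sum / total_count)  # 3.0
```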

Page 29: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Example 3: Term Co-occurrence

Term co-occurrence matrix for a text collection

M = N x N matrix (N = vocabulary size)

Mij: number of times i and j co-occur in some context

(for concreteness, let’s say context = sentence)

Why?

Distributional profiles as a way of measuring semantic distance

Semantic distance useful for many language processing tasks

Page 30: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

MapReduce: Large Counting Problems

Term co-occurrence matrix for a text collection

= specific instance of a large counting problem

A large event space (number of terms)

A large number of observations (the collection itself)

Goal: keep track of interesting statistics about the events

Basic approach

Mappers generate partial counts

Reducers aggregate partial counts

How do we aggregate partial counts efficiently?

Page 31: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Approach 1: “Pairs”

Each mapper takes a sentence:

Generate all co-occurring term pairs

For all pairs, emit (a, b) → count

Reducers sum up counts associated with these pairs

Use combiners!

Page 32: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Pairs: Pseudo-Code
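The pseudo-code on this slide was an image that did not survive the transcript; a minimal single-process rendering of the pairs approach (function names mine, emit-per-ordered-pair semantics assumed):

```python
from collections import defaultdict
from itertools import permutations

def pairs_map(sentence):
    """Emit ((a, b), 1) for every ordered pair of co-occurring terms."""
    words = sentence.split()
    for a, b in permutations(words, 2):
        yield ((a, b), 1)

# Shuffle groups by the pair key; the reducer just sums the 1s
counts = defaultdict(int)
for sentence in ["a b c", "a b"]:
    for pair, one in pairs_map(sentence):
        counts[pair] += one

print(counts[("a", "b")])  # 2 ("a" and "b" co-occur in both sentences)
```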

Page 33: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

“Pairs” Analysis

Advantages

Easy to implement, easy to understand

Disadvantages

Lots of pairs to sort and shuffle around (upper bound?)

Not many opportunities for combiners to work

Page 34: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Another Try: “Stripes”

Idea: group together pairs into an associative array

Each mapper takes a sentence:

Generate all co-occurring term pairs

For each term a, emit a → { b: count_b, c: count_c, d: count_d, … }

Reducers perform element-wise sum of associative arrays

(a, b) → 1
(a, c) → 2
(a, d) → 5
(a, e) → 3
(a, f) → 2

becomes: a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

Element-wise sum:
a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
= a → { b: 2, c: 2, d: 7, e: 3, f: 2 }

Page 35: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Stripes: Pseudo-Code
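As with pairs, this slide's pseudo-code was an image; a minimal single-process rendering of the stripes approach (function names mine):

```python
from collections import defaultdict, Counter

def stripes_map(sentence):
    """For each term a, emit one stripe: a -> {neighbor: count}."""
    words = sentence.split()
    for i, a in enumerate(words):
        yield (a, Counter(w for j, w in enumerate(words) if j != i))

# Reduce: element-wise sum of all stripes emitted for the same key
stripes = defaultdict(Counter)
for sentence in ["a b c", "a b"]:
    for a, stripe in stripes_map(sentence):
        stripes[a].update(stripe)   # Counter.update adds counts element-wise

print(dict(stripes["a"]))  # {'b': 2, 'c': 1}
```

One key-value pair per term per sentence crosses the "network", instead of one per pair, which is why stripes shuffle far less data.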

Page 36: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

“Stripes” Analysis

Advantages

Far less sorting and shuffling of key-value pairs

Can make better use of combiners

Disadvantages

More difficult to implement

Underlying object more heavyweight

Fundamental limitation in terms of size of event space

• Buffering!

Page 37: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Cluster size: 38 cores

Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)

Page 38: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Page 39: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Relative Frequencies

How do we estimate relative frequencies from counts?

Why do we want to do this?

How do we do this with MapReduce?

f(B|A) = count(A, B) / count(A) = count(A, B) / Σ_B′ count(A, B′)

Page 40: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

f(B|A): “Stripes”

Easy!

One pass to compute (a, *)

Another pass to directly compute f(B|A)

a → { b1: 3, b2: 12, b3: 7, b4: 1, … }

Page 41: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

f(B|A): “Pairs”

For this to work:

Must emit extra (a, *) for every b_n in mapper

Must make sure all a’s get sent to same reducer (use partitioner)

Must make sure (a, *) comes first (define sort order)

Must hold state in reducer across different key-value pairs

(a, b1) → 3

(a, b2) → 12

(a, b3) → 7

(a, b4) → 1

(a, *) → 23

(a, b1) → 3 / 23

(a, b2) → 12 / 23

(a, b3) → 7 / 23

(a, b4) → 1 / 23

Reducer holds this value in memory

Page 42: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

“Order Inversion”

Common design pattern

Computing relative frequencies requires marginal counts

But marginal cannot be computed until you see all counts

Buffering is a bad idea!

Trick: getting the marginal counts to arrive at the reducer before the joint counts

Optimizations

Apply in-memory combining pattern to accumulate marginal counts

Should we apply combiners?
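A compact sketch of the trick (Python; toy counts, and a plain tuple sort standing in for Hadoop's custom sort comparator and partitioner):

```python
# Pairs emitted by a mapper for left word "a", including the special
# (a, '*') marginal. In Hadoop a custom sort comparator must place '*'
# first and the partitioner must key on the left word only; here a
# plain sort suffices because '*' precedes letters in ASCII.
emitted = [(("a", "b1"), 3), (("a", "b2"), 12), (("a", "*"), 15)]
emitted.sort()   # the marginal (a, '*') now comes first

marginal = None
freqs = {}
for (a, b), count in emitted:
    if b == "*":
        marginal = count                 # state held across reduce() calls
    else:
        freqs[(a, b)] = count / marginal # marginal already seen: no buffering

print(freqs)  # {('a', 'b1'): 0.2, ('a', 'b2'): 0.8}
```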

Page 43: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Synchronization: Pairs vs. Stripes

Approach 1: turn synchronization into an ordering problem

Sort keys into correct order of computation

Partition key space so that each reducer gets the appropriate set

of partial results

Hold state in reducer across multiple key-value pairs to perform

computation

Illustrated by the ―pairs‖ approach

Approach 2: construct data structures that bring partial

results together

Each reducer receives all the data it needs to complete the

computation

Illustrated by the ―stripes‖ approach

Page 44: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Recap: Tools for Synchronization

Cleverly-constructed data structures

Bring data together

Sort order of intermediate keys

Control order in which reducers process keys

Partitioner

Control which reducer processes which keys

Preserving state in mappers and reducers

Capture dependencies across multiple keys and values

Page 45: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Issues and Tradeoffs

Number of key-value pairs

Object creation overhead

Time for sorting and shuffling pairs across the network

Size of each key-value pair

De/serialization overhead

Local aggregation

Opportunities to perform local aggregation vary

Combiners make a big difference

Combiners vs. in-mapper combining

RAM vs. disk vs. network

Page 46: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Group Work (Examples)

Page 47: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

How many distinct words in the document collection start with each letter?

Note: "types" vs. "tokens"

Task 5

Page 48: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

How many distinct words in the document collection start with each letter?

Note: "types" vs. "tokens"

Ways to make more efficient?

Task 5

Mapper<String,String String,String>

Map(String docID, String document)

for each word in document

emit (first character, word)

Page 49: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

How many distinct words in the document collection start with each letter?

Note: "types" vs. "tokens"

Ways to make more efficient?

Task 5

Mapper<String,String String,String>

Map(String docID, String document)

for each word in document

emit (first character, word)

Reducer<String,String String,Integer>

Reduce(String letter, Iterator<String> words):

set of words = empty set;

for each word

add word to set

emit(letter, size of word set)
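A runnable single-process sketch of this task (toy corpus mine; the set is what turns tokens into types):

```python
from collections import defaultdict

docs = {"d1": "apple ant bee apple", "d2": "ant cat"}

# Map: emit (first character, word); shuffle groups words by letter
by_letter = defaultdict(list)
for docid, document in docs.items():
    for word in document.split():
        by_letter[word[0]].append(word)

# Reduce: count DISTINCT words (types, not tokens) per letter
distinct = {letter: len(set(words)) for letter, words in by_letter.items()}
print(distinct)  # {'a': 2, 'b': 1, 'c': 1}
```

"apple" appears twice and "ant" twice (four tokens), but the letter "a" maps to only two types.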

Page 50: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

How many distinct words in the document collection start with each letter?

How to use in-mapper combining and a separate combiner

Tradeoffs

Task 5b

Mapper<String,String String,String>

Map(String docID, String document)

for each word in document

emit (first character, word)

Page 51: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

How many distinct words in the document collection start with each letter?

How to use in-mapper combining and a separate combiner

Tradeoffs?

Task 5b

Mapper<String,String String,String>

Map(String docID, String document)

for each word in document

emit (first character, word)

Combiner<String,String String,String>

Combine(String letter, Iterator<String> words):

set of words = empty set;

for each word

add word to set

for each word in set

emit(letter, word)

Page 52: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Task 6: find median document length

Page 53: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Task 6: find median document length

Mapper<K1,V1 Integer,Integer>

Map(K1 xx, V1 xx)

10,000 / N times

emit( length(generateRandomDocument()), 1)

Page 54: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Task 6: find median document length

conf.setNumReduceTasks(1)

Problems with this solution?

Mapper<K1,V1 Integer,Integer>

Map(K1 xx, V1 xx)

10,000 / N times

emit( length(generateRandomDocument()), 1)

Reducer<Integer,Integer Integer,V3>

Reduce(Integer length, Iterator<Integer> values):

static list lengths = empty list;

for each value

append length to lengths

Close() { output median of lengths }
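A single-process sketch of this single-reducer solution (toy lengths mine; the unbounded buffering on one node is one of the problems the slide asks about):

```python
# Single-reducer median (conf.setNumReduceTasks(1)): the reducer
# buffers every sampled length and only emits in close().
lengths = []

def reduce_call(length, values):   # stand-in for Reduce(...)
    for _ in values:               # one entry per sampled document
        lengths.append(length)

def close():                       # stand-in for Close()
    lengths.sort()                 # keys already arrive sorted in Hadoop
    return lengths[len(lengths) // 2]   # upper median for even counts

for length, vals in [(10, [1, 1]), (20, [1]), (30, [1, 1])]:
    reduce_call(length, vals)
print(close())  # 20
```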

Page 55: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Interlude: Scaling counts

Many applications require counts of words in some context.

E.g. information retrieval, vector-based semantics

Counts from frequent words like "the" can overwhelm the signal from content words such as "stocks" and "football"

Two strategies for combating high-frequency words:

Use a stop list that excludes them

Scale the counts so that high-frequency words are downweighted.

Page 56: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Interlude: Scaling counts, TF-IDF

TF-IDF, or term frequency–inverse document frequency, is a standard way of scaling.

Inverse document frequency for a term t is the ratio of the number of documents in the collection (N) to the number of documents containing t (df_t):

idf_t = N / df_t

TF-IDF is just the term frequency times the idf:

tf-idf_t,d = tf_t,d × idf_t
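A small sketch of the computation on a toy corpus (using the ratio form of idf described above; log-scaling the idf is a common variant):

```python
docs = {"d1": "the cat sat", "d2": "the dog", "d3": "the cat"}

# Document frequency: number of documents containing each term
N = len(docs)
df = {}
for text in docs.values():
    for term in set(text.split()):      # set: count each doc at most once
        df[term] = df.get(term, 0) + 1

def tf_idf(term, doc_text):
    tf = doc_text.split().count(term)   # term frequency in this document
    return tf * (N / df[term])          # idf as the plain ratio N / df_t

print(tf_idf("the", "the cat sat"))  # 1.0: "the" is in every doc, so downweighted
print(tf_idf("cat", "the cat sat"))  # 1.5
```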


Page 58: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Interlude: Scaling counts using DF

Recall the word co-occurrence counts task from the earlier

slides.

mij represents the number of times word j has occurred in the

neighborhood of word i.

The row mi gives a vector profile of word i that we can use for

tasks like determining word similarity (e.g. using cosine distance)

Words like ―the‖ will tend to have high counts that we want to scale

down so they don’t dominate this computation.

The counts in m_ij can be scaled down using df_j. Let's create a transformed matrix S where:

s_ij = m_ij / df_j

Page 59: Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)

Task 7

Compute S, the co-occurrence counts scaled by document frequency.

• First: do the simplest mapper

• Then: simplify things for the reducer
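One plausible reading of Task 7, scaling each count by the neighbor's document frequency as s_ij = m_ij / df_j, can be sketched on toy data (the exact scaling formula on the original slide was an image, so this form, and the data, are assumptions):

```python
from collections import Counter

# Toy co-occurrence counts m[i][j] and document frequencies df[j]
m = {"stocks": Counter({"the": 10, "market": 4}),
     "football": Counter({"the": 8, "goal": 3})}
df = {"the": 10, "market": 2, "goal": 1}

# Scale each count by the neighbor's document frequency: s_ij = m_ij / df_j
s = {i: {j: c / df[j] for j, c in row.items()} for i, row in m.items()}
print(s["stocks"])  # {'the': 1.0, 'market': 2.0}
```

High-df "the" is knocked down to 1.0 while the rarer "market" is boosted, which is exactly the effect the interlude motivates.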