ch. 3 lin and dyer’s text pages 43-73 (39-69)

Designing MapReduce Algorithms
Ch. 3 Lin and Dyer’s text

http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
Pages 43-73 (39-69)


Word count: local aggregation inside the mapper, as opposed to an external combiner, which the Hadoop framework does NOT guarantee to run
◦ A combiner may not work all the time: what if we wanted the word “mean” instead of the word “count”? A mean of partial means is not the overall mean, so we may have to adjust the <k,v> types at the output of map (e.g. emit (sum, count) pairs)

Word co-occurrence (matrix)
◦ Very important, since many (many) problems are expressed and solved using matrices
◦ Pairs and stripes approaches
◦ And a comparison of these two methods: p. 60 (56)
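The pairs and stripes approaches can be contrasted with a small in-process simulation. This is only a sketch, not Hadoop code: the function names, the `window` parameter, and the two toy documents are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def neighbors(words, i, window=2):
    """Words within `window` positions of words[i], excluding position i itself."""
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    return [words[j] for j in range(lo, hi) if j != i]

def map_pairs(doc, window=2):
    """Pairs approach: emit ((w, u), 1) for every co-occurring pair."""
    words = doc.split()
    for i, w in enumerate(words):
        for u in neighbors(words, i, window):
            yield (w, u), 1

def map_stripes(doc, window=2):
    """Stripes approach: emit (w, {u: count, ...}), one associative array per occurrence of w."""
    words = doc.split()
    for i, w in enumerate(words):
        yield w, Counter(neighbors(words, i, window))

def reduce_pairs(emitted):
    """Sum the counts for each (w, u) key."""
    totals = Counter()
    for key, count in emitted:
        totals[key] += count
    return dict(totals)

def reduce_stripes(emitted):
    """Element-wise sum of the stripes that share the same word w."""
    totals = defaultdict(Counter)
    for w, stripe in emitted:
        totals[w].update(stripe)
    return {w: dict(stripe) for w, stripe in totals.items()}

docs = ["a b c", "a b"]  # hypothetical toy corpus
pairs = reduce_pairs(kv for d in docs for kv in map_pairs(d))
stripes = reduce_stripes(kv for d in docs for kv in map_stripes(d))
```

Both approaches produce the same counts. Pairs emits many small key-value pairs, while stripes emits fewer but larger values, one row of the co-occurrence matrix per key.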

Improvements

First version: simplistic counts
Then “relative frequency” instead of counts
◦ What is relative frequency? A ratio instead of absolute counts
◦ f(wj | wi) = N(wi, wj) / ∑w’ N(wi, w’)
◦ For example, if the word “dog” co-occurred with “food” 23 times, and “dog” co-occurred with all words 460 times, then the relative frequency is 23/460 = 1/20 = 0.05
◦ Also, the 460 could come from many mappers, many documents over the entire corpus.
◦ These co-occurrences from every mapper are delivered to the “corresponding reducer” with a special key
◦ This is delivered as a special key item <(wi, *), count> as the first <k,v> pair
◦ The magic is that the reducer processes <(wi, *), count> first, so the marginal is known before any joint count <(wi, wj), count> arrives

Co-Occurrence

Key | Value | Reducer operation/compute | Result
(dog, *) | [200, 350, 650] (one per mapper with combiner) | Marginal | ∑w’ N(wi, w’) = 1200
(dog, bark) | 60 | Relative frequency | <(dog, bark), 0.05>
(dog, cat) | 12 | Relative frequency | <(dog, cat), 0.01>
(dog, food) | 600 | Relative frequency | <(dog, food), 0.5>
… | … | |

At the reducer: Blue: reducer1/Orange: reducer 2

Key | Value | Reducer operation | Result
(tiger, *) | [100, 300, 600] | Compute marginal | ∑w’ N(wi, w’) = 1000
(tiger, cub) | 10 | Compute R.freq | <(tiger, cub), 10/1000>
(tiger, hunt) | 100 | Compute R.freq | <(tiger, hunt), 100/1000>
(tiger, prey) | 200 | Compute R.freq | <(tiger, prey), 200/1000>
… | | |

4 different reducers

Emitting a special key-value pair for each co-occurring word pair in the mapper to capture its contribution to the marginal.

Controlling the sort order of the intermediate key so that the key-value pairs representing the marginal contributions are processed by the reducer before any of the pairs representing the joint word co-occurrence counts.

Defining a custom partitioner to ensure that all pairs with the same left word are shuffled to the same reducer.

Preserving state across multiple keys in the reducer to first compute the marginal based on the special key-value pairs and then dividing the joint counts by the marginal to arrive at the relative frequencies.

Requirements
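The four requirements above can be sketched as a single-process simulation. This is a hedged illustration rather than Hadoop code: `mapper`, `partition`, `sort_key`, `reducer`, `run`, the toy hash, and the observed pairs are all assumed names and data.

```python
from itertools import groupby

def mapper(pairs):
    """For each co-occurring pair, emit the joint count and a (w, '*') marginal contribution."""
    for w, u in pairs:
        yield (w, "*"), 1
        yield (w, u), 1

def partition(key, num_reducers):
    """Custom partitioner: route on the left word only, so (w, *) and (w, u) meet."""
    left = key[0]
    return sum(left.encode()) % num_reducers  # deterministic toy hash (assumption)

def sort_key(key):
    """Sort order: by left word, with '*' ahead of every real right word."""
    left, right = key
    return (left, right != "*", right)

def reducer(sorted_items):
    """Keeps the marginal as state across keys; it is set before the joint counts arrive."""
    marginal = 0
    for key, group in groupby(sorted_items, key=lambda kv: kv[0]):
        total = sum(count for _, count in group)
        if key[1] == "*":
            marginal = total            # special key: remember the marginal
        else:
            yield key, total / marginal  # joint count divided by the marginal

def run(pairs, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for key, value in mapper(pairs):
        buckets[partition(key, num_reducers)].append((key, value))
    result = {}
    for bucket in buckets:
        bucket.sort(key=lambda kv: sort_key(kv[0]))  # stand-in for the framework sort
        result.update(reducer(bucket))
    return result

# Hypothetical observations: "dog" co-occurs 4 times in total, "tiger" once.
observed = [("dog", "food")] * 2 + [("dog", "bark"), ("dog", "cat"), ("tiger", "cub")]
freqs = run(observed)
```

Here `freqs[("dog", "food")]` is 2/4. The reducer never buffers the joint counts, because the sort order delivers the marginal first.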

Let’s generalize this:
<(var34, left), value>
<(var34, right), value>
<(var34, middle), value>
all delivered to the same reducer. What can you do with this?

The reducer can compute “middle(left’s value, right’s value)” and emit <var34, computedValue>

Some more: <KEY complex object, VALUE complex object>
You can do anything you want for the function: “KEY.operation” on “VALUE.data”

Therein lies the power of MR.

Text word count
Text co-occurrence: pairs and stripes
Numerical data processing with most math functions
How about sensor data?
Consider m sensors sending out readings rx at various times ty, resulting in a large volume of data of the format:
(t1; m1; r80521)
(t1; m2; r14209)
(t1; m3; r76042)
...
(t2; m1; r21823)
(t2; m2; r66508)
(t2; m3; r98347)
Suppose you wanted to know the readings grouped by sensor: how could you process the above to get that info?
Use MR to do that: <m1, (t1; r80521)> etc.
But what if you wanted that sorted by time t, which is a part of the value?

Problem discussed so far
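Re-keying the sensor readings by sensor id can be sketched as follows. This is a minimal simulation under stated assumptions: `mapper` and `shuffle` are illustrative names, and `shuffle` only stands in for the framework's grouping step.

```python
from collections import defaultdict

# The six sample records from the slide, as (time, sensor, reading) tuples.
records = [
    ("t1", "m1", "r80521"), ("t1", "m2", "r14209"), ("t1", "m3", "r76042"),
    ("t2", "m1", "r21823"), ("t2", "m2", "r66508"), ("t2", "m3", "r98347"),
]

def mapper(record):
    """Re-key each reading by sensor id: (t, m, r) -> <m, (t, r)>."""
    t, m, r = record
    yield m, (t, r)

def shuffle(emitted):
    """Stand-in for the framework's shuffle: group values by key."""
    grouped = defaultdict(list)
    for key, value in emitted:
        grouped[key].append(value)
    return grouped

by_sensor = shuffle(kv for rec in records for kv in mapper(rec))
```

Each reducer then sees all readings for one sensor, but a real framework makes no promise about the order of the values within a key, which is the problem raised in the last bullet.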

Solution 1: Let the reducer do the in-memory sorting: memory bottleneck

Solution 2: Move the value to be sorted into the key, and modify the shuffler and partitioner

In the latter, the “secondary sorting” is left to the framework, and it excels at doing this anyway.

So solution 2 is the preferred approach.
Lesson: Let the framework do what it is good at, and don’t try to move that work into your code; if you do, you will be regressing to the “usual” coding practices and the ensuing disadvantages

In memory sorting vs. value-key conversion
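Solution 2 (value-to-key conversion) can be sketched with the sensor data as a single-process simulation. The names `mapper`, `partition`, and `run`, the toy hash, and the shuffled sample records are assumptions; the `sort` call stands in for the sorting that the framework performs.

```python
from itertools import groupby

# Hypothetical readings, deliberately out of time order.
records = [
    ("t2", "m1", "r21823"), ("t1", "m2", "r14209"),
    ("t1", "m1", "r80521"), ("t2", "m2", "r66508"),
]

def mapper(record):
    """Value-to-key conversion: the time moves from the value into a composite key."""
    t, m, r = record
    yield (m, t), r

def partition(key, num_reducers):
    """Partition on the sensor id only, so one sensor's readings share a reducer."""
    sensor = key[0]
    return sum(sensor.encode()) % num_reducers  # deterministic toy hash (assumption)

def run(records, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for rec in records:
        for key, value in mapper(rec):
            buckets[partition(key, num_reducers)].append((key, value))
    out = {}
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])  # stand-in for the framework sort on (m, t)
        # Group on the sensor id alone; the values now stream in time order.
        for sensor, group in groupby(bucket, key=lambda kv: kv[0][0]):
            out[sensor] = [(key[1], value) for key, value in group]
    return out

readings = run(records)
```

The reducer side holds no sort buffer of its own. Note that the toy time labels compare as strings here; real timestamps would need a properly ordered type.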

Reduce-side join is intuitive but inefficient
Map-side join requires a simple merge of the respective input files and appropriate sorting by the MR framework

In-memory joins can be done for smaller data.

We will NOT discuss this in detail, since there are other solutions such as Hive and HBase available for warehouse data. We will look into these later.

Relational Joins/warehouse data
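For orientation only, a reduce-side join and an in-memory join can be sketched on two toy relations. The relations `users` and `orders`, the tags "U" and "O", and all function names are hypothetical; the dictionary grouping stands in for shuffle and sort.

```python
from collections import defaultdict

# Hypothetical toy relations, both keyed on a user id.
users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]

def map_tagged(rows, tag):
    """Tag each tuple with its source relation and emit it under the join key."""
    for key, payload in rows:
        yield key, (tag, payload)

def reduce_join(key, tagged_values):
    """Separate the two sides for one key, then cross them (inner join)."""
    left = [p for tag, p in tagged_values if tag == "U"]
    right = [p for tag, p in tagged_values if tag == "O"]
    for l in left:
        for r in right:
            yield key, l, r

def reduce_side_join(users, orders):
    grouped = defaultdict(list)  # stand-in for shuffle and sort
    for key, value in list(map_tagged(users, "U")) + list(map_tagged(orders, "O")):
        grouped[key].append(value)
    return [row for key in sorted(grouped) for row in reduce_join(key, grouped[key])]

def in_memory_join(small, large):
    """In-memory join: load the smaller relation into a dict, stream the larger one.
    Assumes the join key is unique in `small`."""
    lookup = dict(small)
    return [(key, lookup[key], payload) for key, payload in large if key in lookup]

joined = reduce_side_join(users, orders)
```

The reduce-side version ships every tuple of both relations through the shuffle, which is where its cost comes from; the in-memory version avoids the shuffle but only works when one side fits in memory.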
