learning with hadoop – a case study on mapreduce based data mining

Learning with Hadoop – A case study on MapReduce

based Data MiningEvan Xiang, HKUST

1

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

2

Introduction to Hadoop

Hadoop Map/Reduce is

a java based software framework for easily writing applications

which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware

in a reliable, fault-tolerant manner.

3

Job submission node

Slave node

TaskTracker DataNode

HDFS master

JobTracker NameNode

Slave node


Slave node


Client

Hadoop Cluster Architecture

4From Jimmy Lin’s slides

Hadoop HDFS

5

Hadoop Cluster Rack Awareness

6

Hadoop Development Cycle

Hadoop ClusterYou

1. Scp data to cluster2. Move data into HDFS

3. Develop code locally

4. Submit MapReduce job4a. Go back to Step 3

5. Move data out of HDFS6. Scp data from cluster


Divide and Conquer

“Work”

w1 w2 w3

r1 r2 r3

“Result”

“worker” “worker” “worker”

Partition

Combine


High-level MapReduce pipeline

9

Detailed Hadoop MapReduce data flow

10




11

Word Count with MapReduce

1one 1

1two 1

1fish 2

one fish, two fishDoc 1

2red 1

2blue 1

2fish 2

red fish, blue fishDoc 2

3cat 1

3hat 1

cat in the hatDoc 3

1fish 4

1one 11two 1

2red 1

3cat 12blue 1

3hat 1

Shuffle and Sort: aggregate values by keys

Map

Reduce





13

14

Calculating document pairwise similarity

Trivial Solution

load each vector o(N) times load each term o(dft

2) times

scalable and efficient solutionfor large collections

Goal

From Jimmy Lin’s slides

15

Better Solution

Load weights for each term onceEach term contributes o(dft

2) partial scores

Each term contributes only if appears in


16

reduce

Decomposition

Load weights for each term onceEach term contributes o(dft

2) partial scores

Each term contributes only if appears in

map


17

Standard Indexing

tokenize

tokenize

tokenize

tokenize

combine

combine

combine

doc

doc

doc

doc

posting list

posting list

posting list

Shuffling

group values by: terms

(a) Map (b) Shuffle (c) Reduce


Inverted Indexing with MapReduce

1one 1

1two 1

1fish 2

one fish, two fishDoc 1

2red 1

2blue 1

2fish 2

red fish, blue fishDoc 2

3cat 1

3hat 1

cat in the hatDoc 3

1fish 2 2 2

1one 11two 1

2red 1

3cat 12blue 1

3hat 1

Shuffle and Sort: aggregate values by keys

Map

Reduce


19

Indexing (3-doc toy collection)Clinton

Barack

Cheney

Obama

Indexing

2

1

1

1

1

ClintonObamaClinton 1

1

ClintonCheney

ClintonBarackObama

ClintonObamaClinton

ClintonCheney

ClintonBarackObama


20

Pairwise Similarity(a) Generate pairs (b) Group pairs (c) Sum pairs

Clinton

Barack

Cheney

Obama

2

1

1

1

1

1

1

2

2

1

11

2

2 2

2

1

13

1

How to deal with the long list?





21

PageRank

PageRank – an information propagation model

Intensive access of neighborhood list

22

PageRank with MapReduce

n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]

n2 n4 n3 n5 n1 n2 n3n4 n5

n2 n4n3 n5n1 n2 n3 n4 n5

n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]

Map

Reduce

How to maintain the graph structure?





24

K-Means Clustering

25

K-Means Clustering with MapReduce

26

Mapper_iMapper_i-1 Mapper_i+1

Reducer_i Reducer_i+1Reducer_i-1

How to set the initial centroids is very important!Usually we set the centroids using Canopy Clustering.

Each Mapper loads a set of data samples,

and assign each sample to a nearest centroid

Each Mapper needs to keep a copy of

centroids

1 3 42

1 32

1 42

1 3 42

4

3

23 4

3 2 4

[McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]

https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering




27

Matrix Factorization for Link Prediction

In this task, we observe a sparse matrix X R∈ m×n with entries xij. Let R = {(i,j,r): r = xij, where xij ≠0} denote the set of observed links in the system. In order to predict the unobserved links in X, we model the users and the items by a user factor matrix U R∈ k×m and an item factor matrix V R∈ k×n. The goal is to approximate the link matrix X via multiplying the factor matrix U and V, which can be learnt by minimizing:

28

Given X and V, updating U:

Similarly, given X and U, we can alternatively update V

Solving Matrix Factorization via Alternative Least Squares

29

Xm

n

ui

Vk

n

A

k k

k

k

k

k

b

k

MapReduce for ALS

30

Mapper_i

Reducer_i

Mapper_i

Reducer_i

Group rating data in X using

for item j

Group features in V

using for item j

Align ratings and features for item j, and make a copy of Vj for

each observe xij

Stage 1 Stage 2

Rating for item j

Features for item j

i-1 Vj

i Vj

i+1 Vj

Group rating data in X using

for user i

i Vj

i Vj+2

i+1 Vj

Standard ALS:Calculate A and b,

and update Ui




31

Cluster Coefficient

32

In graph mining, a clustering coefficient is a measure of degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex in a graph quantifies how close its neighbors are to being a clique (complete graph), which is used to determine whether a graph is a small-world network.

How to maintain the Tier-2 neighbors?

[D. J. Watts and Steven Strogatz (June 1998).

"Collective dynamics of 'small-world' networks".

Nature 393 (6684): 440–442]

Cluster Coefficient with MapReduce

33

Mapper_i

Reducer_i

Mapper_i

Reducer_i

Stage 1 Stage 2

Calculate the cluster coefficient

BFS based method need three stages, but actually we only need two!

Resource Entries to ML labsMahout

Apache’s scalable machine learning librariesJimmy Lin’s Lab

iSchool at the University of MarylandJimeng Sun & Yan Rong’s Collections

IBM TJ Watson Research CenterEdward Chang & Yi Wang

Google Beijing

34

http://mahout.apache.org/

http://www.umiacs.umd.edu/~jimmylin/

http://www.dasfa.net/wiki/index.php?title=Jimeng_Sun

http://yanrong.info/

http://infolab.stanford.edu/~echang/

http://cxwangyi.blogspot.com/

Advanced Topics in Machine Learning with MapReduce

35

Probabilistic Graphical models

Gradient based optimization methods

Graph Mining

Others…

Some Advanced TipsDesign your algorithm with a divide and

conquer manner

Make your functional units loosely dependent

Carefully manage your memory and disk storage

Discussions…

36




37

Q&A

Why not MPI? Hadoop is Cheap in everything…D.P.T.H…

What’s the advantages of Hadoop? Scalability!

How do you guarantee the model equivalence? Guarantee equivalent/comparable function logics

How can you beat “large memory” solution? Clever use of Sequential Disk Access

38

learning with hadoop – a case study on mapreduce based data mining

Documents