learning with hadoop – a case study on mapreduce based data mining

38
Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1

Upload: lawrence-emerson

Post on 02-Jan-2016

20 views

Category:

Documents


0 download

DESCRIPTION

Learning with Hadoop – A case study on MapReduce based Data Mining. Evan Xiang, HKUST. Outline. Hadoop Basics Case Study Word Count Pairwise Similarity PageRank K-Means Clustering Matrix Factorization Cluster Coefficient Resource Entries to ML labs Advanced Topics Q&A. - PowerPoint PPT Presentation

TRANSCRIPT

Learning with Hadoop – A case study on MapReduce

based Data MiningEvan Xiang, HKUST

1

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

2

Introduction to Hadoop

Hadoop Map/Reduce is

a java based software framework for easily writing applications

which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware

in a reliable, fault-tolerant manner.

3

Job submission node

Slave node

TaskTracker DataNode

HDFS master

JobTracker NameNode

Slave node

TaskTracker DataNode

Slave node

TaskTracker DataNode

Client

Hadoop Cluster Architecture

4From Jimmy Lin’s slides

Hadoop HDFS

5

Hadoop Cluster Rack Awareness

6

Hadoop Development Cycle

Hadoop ClusterYou

1. Scp data to cluster2. Move data into HDFS

3. Develop code locally

4. Submit MapReduce job4a. Go back to Step 3

5. Move data out of HDFS6. Scp data from cluster

7From Jimmy Lin’s slides

Divide and Conquer

“Work”

w1 w2 w3

r1 r2 r3

“Result”

“worker” “worker” “worker”

Partition

Combine

8From Jimmy Lin’s slides

High-level MapReduce pipeline

9

Detailed Hadoop MapReduce data flow

10

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

11

Word Count with MapReduce

1one 1

1two 1

1fish 2

one fish, two fishDoc 1

2red 1

2blue 1

2fish 2

red fish, blue fishDoc 2

3cat 1

3hat 1

cat in the hatDoc 3

1fish 4

1one 11two 1

2red 1

3cat 12blue 1

3hat 1

Shuffle and Sort: aggregate values by keys

Map

Reduce

12From Jimmy Lin’s slides

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

13

14

Calculating document pairwise similarity

Trivial Solution

load each vector o(N) times load each term o(dft

2) times

scalable and efficient solutionfor large collections

Goal

From Jimmy Lin’s slides

15

Better Solution

Load weights for each term onceEach term contributes o(dft

2) partial scores

Each term contributes only if appears in

From Jimmy Lin’s slides

16

reduce

Decomposition

Load weights for each term onceEach term contributes o(dft

2) partial scores

Each term contributes only if appears in

map

From Jimmy Lin’s slides

17

Standard Indexing

tokenize

tokenize

tokenize

tokenize

combine

combine

combine

doc

doc

doc

doc

posting list

posting list

posting list

Shuffling

group values by: terms

(a) Map (b) Shuffle (c) Reduce

From Jimmy Lin’s slides

Inverted Indexing with MapReduce

1one 1

1two 1

1fish 2

one fish, two fishDoc 1

2red 1

2blue 1

2fish 2

red fish, blue fishDoc 2

3cat 1

3hat 1

cat in the hatDoc 3

1fish 2 2 2

1one 11two 1

2red 1

3cat 12blue 1

3hat 1

Shuffle and Sort: aggregate values by keys

Map

Reduce

18From Jimmy Lin’s slides

19

Indexing (3-doc toy collection)Clinton

Barack

Cheney

Obama

Indexing

2

1

1

1

1

ClintonObamaClinton 1

1

ClintonCheney

ClintonBarackObama

ClintonObamaClinton

ClintonCheney

ClintonBarackObama

From Jimmy Lin’s slides

20

Pairwise Similarity(a) Generate pairs (b) Group pairs (c) Sum pairs

Clinton

Barack

Cheney

Obama

2

1

1

1

1

1

1

2

2

1

11

2

2 2

2

1

13

1

How to deal with the long list?

From Jimmy Lin’s slides

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

21

PageRank

PageRank – an information propagation model

Intensive access of neighborhood list

22

PageRank with MapReduce

n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]

n2 n4 n3 n5 n1 n2 n3n4 n5

n2 n4n3 n5n1 n2 n3 n4 n5

n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]

Map

Reduce

How to maintain the graph structure?

From Jimmy Lin’s slides

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

24

K-Means Clustering

25

K-Means Clustering with MapReduce

26

Mapper_iMapper_i-1 Mapper_i+1

Reducer_i Reducer_i+1Reducer_i-1

How to set the initial centroids is very important!Usually we set the centroids using Canopy Clustering.

Each Mapper loads a set of data samples,

and assign each sample to a nearest centroid

Each Mapper needs to keep a copy of

centroids

1 3 42

1 32

1 42

1 3 42

4

3

23 4

3 2 4

[McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

27

Matrix Factorization for Link Prediction

In this task, we observe a sparse matrix X R∈ m×n with entries xij. Let R = {(i,j,r): r = xij, where xij ≠0} denote the set of observed links in the system. In order to predict the unobserved links in X, we model the users and the items by a user factor matrix U R∈ k×m and an item factor matrix V R∈ k×n. The goal is to approximate the link matrix X via multiplying the factor matrix U and V, which can be learnt by minimizing:

28

Given X and V, updating U:

Similarly, given X and U, we can alternatively update V

Solving Matrix Factorization via Alternative Least Squares

29

Xm

n

ui

Vk

n

A

k k

k

k

k

k

b

k

MapReduce for ALS

30

Mapper_i

Reducer_i

Mapper_i

Reducer_i

Group rating data in X using

for item j

Group features in V

using for item j

Align ratings and features for item j, and make a copy of Vj for

each observe xij

Stage 1 Stage 2

Rating for item j

Features for item j

i-1 Vj

i Vj

i+1 Vj

Group rating data in X using

for user i

i Vj

i Vj+2

i+1 Vj

Standard ALS:Calculate A and b,

and update Ui

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

31

Cluster Coefficient

32

In graph mining, a clustering coefficient is a measure of degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex in a graph quantifies how close its neighbors are to being a clique (complete graph), which is used to determine whether a graph is a small-world network.

How to maintain the Tier-2 neighbors?

[D. J. Watts and Steven Strogatz (June 1998).

"Collective dynamics of 'small-world' networks".

Nature 393 (6684): 440–442]

Cluster Coefficient with MapReduce

33

Mapper_i

Reducer_i

Mapper_i

Reducer_i

Stage 1 Stage 2

Calculate the cluster coefficient

BFS based method need three stages, but actually we only need two!

Resource Entries to ML labsMahout

Apache’s scalable machine learning librariesJimmy Lin’s Lab

iSchool at the University of MarylandJimeng Sun & Yan Rong’s Collections

IBM TJ Watson Research CenterEdward Chang & Yi Wang

Google Beijing

34

Advanced Topics in Machine Learning with MapReduce

35

Probabilistic Graphical models

Gradient based optimization methods

Graph Mining

Others…

Some Advanced TipsDesign your algorithm with a divide and

conquer manner

Make your functional units loosely dependent

Carefully manage your memory and disk storage

Discussions…

36

OutlineHadoop Basics

Case StudyWord CountPairwise SimilarityPageRankK-Means ClusteringMatrix FactorizationCluster Coefficient

Resource Entries to ML labsAdvanced TopicsQ&A

37

Q&A

Why not MPI? Hadoop is Cheap in everything…D.P.T.H…

What’s the advantages of Hadoop? Scalability!

How do you guarantee the model equivalence? Guarantee equivalent/comparable function logics

How can you beat “large memory” solution? Clever use of Sequential Disk Access

38