apache mahout feb 13, 2012 shannon quinn

76
Apache Mahout Feb 13, 2012 Shannon Quinn Cloud Computing CS 15-319

Upload: erek

Post on 24-Feb-2016

57 views

Category:

Documents


0 download

DESCRIPTION

Cloud Computing CS 15-319. Apache Mahout Feb 13, 2012 Shannon Quinn. MapReduce Review. Scalable programming model Map phase Shuffle Reduce phase MapReduce Implementations Google Hadoop. chunks. C0. C1. C2. C3. Map Phase. M0. M1. M2. M3. mappers. IO0. IO1. IO2. IO3. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Apache Mahout Feb 13,  2012 Shannon Quinn

Apache MahoutFeb 13, 2012

Shannon Quinn

Cloud ComputingCS 15-319

Page 2: Apache Mahout Feb 13,  2012 Shannon Quinn

MapReduce Review

C0 C1 C2 C3

M0 M1 M2 M3

IO0 IO1 IO2 IO3

R0 R1

FO0 FO1

chunks

mappers

Reducers

Map

Pha

seR

educ

e Ph

ase Shuffling Data

Scalable programming modelMap phaseShuffleReduce phaseMapReduce Implementations

GoogleHadoop

Figure from lecture 6: MapReduce

Page 3: Apache Mahout Feb 13,  2012 Shannon Quinn

MapReduce Review

C0 C1 C2 C3

M0 M1 M2 M3

IO0 IO1 IO2 IO3

R0 R1

FO0 FO1

chunks

mappers

Reducers

Map

Pha

seR

educ

e Ph

ase Shuffling Data

Scalable programming modelMap phaseShuffleReduce phaseMapReduce Implementations

GoogleHadoop

Figure from lecture 6: MapReduce

This is our focus!

Page 4: Apache Mahout Feb 13,  2012 Shannon Quinn

Apache Mahout

A scalable machine learning library

Page 5: Apache Mahout Feb 13,  2012 Shannon Quinn

Apache Mahout

A scalable machine learning libraryBuilt on Hadoop

Page 6: Apache Mahout Feb 13,  2012 Shannon Quinn

Apache Mahout

A scalable machine learning libraryBuilt on HadoopPhilosophy of Mahout (and Hadoop by proxy)

Page 7: Apache Mahout Feb 13,  2012 Shannon Quinn

What does Mahout do?

Page 8: Apache Mahout Feb 13,  2012 Shannon Quinn

Recommendation

Page 9: Apache Mahout Feb 13,  2012 Shannon Quinn

Classification

Page 10: Apache Mahout Feb 13,  2012 Shannon Quinn

Clustering

Page 11: Apache Mahout Feb 13,  2012 Shannon Quinn

Other Mahout Algorithms

Dimensionality Reduction

Regression

Evolutionary Algorithms

Page 12: Apache Mahout Feb 13,  2012 Shannon Quinn

Mahout

1. Recommendation

2. Classification

3. Clustering

Page 13: Apache Mahout Feb 13,  2012 Shannon Quinn

Recommendation Overview

Help users find items they might like based on historical preferences

Page 14: Apache Mahout Feb 13,  2012 Shannon Quinn

Recommendation Overview

Mathematically

Page 15: Apache Mahout Feb 13,  2012 Shannon Quinn

Recommendation Overview

Alice

Bob

Peter

5 1 4

? 2 5

4 3 2 *based on example bySebastian Schelter

Page 16: Apache Mahout Feb 13,  2012 Shannon Quinn

Recommendation Overview

5 1 4

- 2 5

4 3 2 *based on example bySebastian Schelter

Page 17: Apache Mahout Feb 13,  2012 Shannon Quinn

Recommendation Overview

5 1 4

? 2 5

4 3 2

Bob

*based on example bySebastian Schelter

Page 18: Apache Mahout Feb 13,  2012 Shannon Quinn

Recommendation Overview

5 1 4

1.5 2 5

4 3 2

Bob

*based on example bySebastian Schelter

Page 19: Apache Mahout Feb 13,  2012 Shannon Quinn

Recommendation in Mahout

1st Map phase: process input(Alice, Matrix, 5)

(Alice, Alien, 1)

(Alice, Inception, 4)

(Bob, Alien, 2)

<Alice, (Matrix, 5)>

<Alice, (Alien, 1)>

<Alice, (Inception, 4)>

<Bob, (Alien, 2)>

*based on example bySebastian Schelter

Page 20: Apache Mahout Feb 13,  2012 Shannon Quinn

Recommendation in Mahout

1st Map phase: process input

1st Reduce phase: list by user

(Alice, Matrix, 5)

(Alice, Alien, 1)

(Alice, Inception, 4)

(Bob, Alien, 2)

<Alice, (Matrix, 5)>

<Alice, (Alien, 1)>

<Alice, (Inception, 4)>

<Bob, (Alien, 2)>

<Alice, (Matrix, 5);(Alien, 1);…>

<Bob, (Alien, 2);(Inception, 5);…>

<Alice, (Matrix, 5)>

<Alice, (Alien, 1)>

<Alice, (Inception, 4)>

<Bob, (Alien, 2)>

*based on example bySebastian Schelter

Page 21: Apache Mahout Feb 13,  2012 Shannon Quinn

Recommendation in Mahout

2nd Map phase: Emit co-occurred ratings<Matrix;Alien, (5;1)>

<Matrix;Inception, (5;4)>

<Alien;Inception, (1;4)>

<Alien;Inception, (2;5)>

<Alice, (Matrix, 5);(Alien, 1);…>

<Bob, (Alien, 2);(Inception, 5);…>

*based on example bySebastian Schelter

Page 22: Apache Mahout Feb 13,  2012 Shannon Quinn

Recommendation in Mahout

2nd Map phase: Emit co-occurred ratings

2nd Reduce phase: Compute similarities

<Matrix;Alien, (5;1)>

<Matrix;Inception, (5;4)>

<Alien;Inception, (1;4)>

<Alien;Inception, (2;5)>

<Alice, (Matrix, 5);(Alien, 1);…>

<Bob, (Alien, 2);(Inception, 5);…>

<Matrix;Alien, (-0.47)>

<Matrix;Inception, (0.47)>

<Matrix;Alien, (5;1)><Matrix;Alien, (4;3)>

<Matrix;Inception, (5;4)><Matrix;Inception, (4;2)>

*based on example bySebastian Schelter

Page 23: Apache Mahout Feb 13,  2012 Shannon Quinn

Mahout

1. Recommendation

2. Classification

3. Clustering

Page 24: Apache Mahout Feb 13,  2012 Shannon Quinn

Classification Overview

Assigning data to discrete categories

Page 25: Apache Mahout Feb 13,  2012 Shannon Quinn

Classification Overview

Assigning data to discrete categoriesTrain a model on labeled data

Spam Not spam

Page 26: Apache Mahout Feb 13,  2012 Shannon Quinn

Classification Overview

Assigning data to discrete categoriesTrain a model on labeled dataRun the model on new, unlabeled data

Spam Not spam

?

Page 27: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes Example

Page 28: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes Example

Prob (token | label) =

Page 29: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes ExampleNot spam

Page 30: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes Example

President Obama’s Nobel Prize Speech

Not spam

Page 31: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes ExampleSpam

Page 32: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes ExampleSpam

Spam email content

Page 33: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes Example

Page 34: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes Example

“Order a trial Adobe chicken dailyEAB-List new summer savings, welcome!”

Page 35: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes in Mahout

Complex!

Page 36: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes in Mahout

Complex!Training1. Read the features

Page 37: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes in Mahout

Complex!Training1. Read the features2. Calculate per-document statistics

Page 38: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes in Mahout

Complex!Training1. Read the features2. Calculate per-document statistics3. Normalize across categories

Page 39: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes in Mahout

Complex!Training1. Read the features2. Calculate per-document statistics3. Normalize across categories4. Calculate normalizing factor of each label

Page 40: Apache Mahout Feb 13,  2012 Shannon Quinn

Naïve Bayes in Mahout

Complex!Training1. Read the features2. Calculate per-document statistics3. Normalize across categories4. Calculate normalizing factor of each label

TestingClassification

Page 41: Apache Mahout Feb 13,  2012 Shannon Quinn

Other Classification Algorithms

Stochastic Gradient Descent

Page 42: Apache Mahout Feb 13,  2012 Shannon Quinn

Other Classification Algorithms

Stochastic Gradient DescentSupport Vector Machines

Page 43: Apache Mahout Feb 13,  2012 Shannon Quinn

Other Classification Algorithms

Stochastic Gradient DescentSupport Vector MachinesRandom Forests

Page 44: Apache Mahout Feb 13,  2012 Shannon Quinn

Mahout

1. Recommendation

2. Classification

3. Clustering

Page 45: Apache Mahout Feb 13,  2012 Shannon Quinn

Clustering Overview

Grouping unstructured data

Page 46: Apache Mahout Feb 13,  2012 Shannon Quinn

Clustering Overview

Grouping unstructured dataSmall intra-cluster distance

Page 47: Apache Mahout Feb 13,  2012 Shannon Quinn

Clustering Overview

Grouping unstructured dataSmall intra-cluster distanceLarge inter-cluster distance

Page 48: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering Example

Page 49: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering Example

Page 50: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering Example

Page 51: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering Example

Page 52: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering Example

Page 53: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering Example

Page 54: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering Example

Page 55: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering Example

Page 56: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering Example

Page 57: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering Example

Cats

Dogs

Page 58: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering in Mahout

+

C0 C1 C2 C3

M0 M1 M2 M3

IO0 IO1 IO2 IO3

R0 R1

FO0 FO1

chunks

mappers

Reducers

Map

Pha

seR

educ

e Ph

ase Shuffling Data

Figure from lecture 6: MapReduce

Page 59: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering in Mahout

Assume: # clusters <<< # points

Page 60: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering in Mahout

Assume: # clusters <<< # points

M0 M1 M2 M3

Page 61: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering in Mahout

Assume: # clusters <<< # points

M0 M1 M2 M3

<clusterID, observation>

R0 R1

Page 62: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering in Mahout

Map phase: assign cluster IDs(x1, y1) , centroid1, centroid2, …

(x2, y2) , centroid1, centroid2, …

(x3, y3) , centroid1, centroid2, …

(x4, y4) , centroid1, centroid2, …

<cluster_n, (x1, y1)>

<cluster_m, (x2, y2)>

<cluster_i, (x3, y3)>

<cluster_k, (x4, y4)>

Page 63: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering in Mahout

Map phase: assign cluster IDs

Reduce phase: reset centroids

(x1, y1) , centroid1, centroid2, …

(x2, y2) , centroid1, centroid2, …

(x3, y3) , centroid1, centroid2, …

(x4, y4) , centroid1, centroid2, …

<cluster_n, (x1, y1)>

<cluster_m, (x2, y2)>

<cluster_i, (x3, y3)>

<cluster_k, (x4, y4)>

<cluster_n, (x1, y1)>

<cluster_m, (x2, y2)>

<cluster_i, (x3, y3)>

<cluster_k, (x4, y4)>

<cluster_n, centroid_n>

<cluster_m, centroid_m>

<cluster_i, centroid_i>

<cluster_k, centroid_k>

Page 64: Apache Mahout Feb 13,  2012 Shannon Quinn

K-Means Clustering in Mahout

Important notes--maxIter--convergenceDeltamethod

Page 65: Apache Mahout Feb 13,  2012 Shannon Quinn

Other Clustering Algorithms

Latent Dirichlet AllocationTopic models

Page 66: Apache Mahout Feb 13,  2012 Shannon Quinn

Other Clustering Algorithms

Latent Dirichlet AllocationTopic models

Fuzzy K-MeansPoints are assigned multiple clusters

Page 67: Apache Mahout Feb 13,  2012 Shannon Quinn

Other Clustering Algorithms

Latent Dirichlet AllocationTopic models

Fuzzy K-MeansPoints are assigned multiple clusters

Canopy clusteringFast approximations of clusters

Page 68: Apache Mahout Feb 13,  2012 Shannon Quinn

Other Clustering Algorithms

Latent Dirichlet AllocationTopic models

Fuzzy K-MeansPoints are assigned multiple clusters

Canopy clusteringFast approximations of clusters

Spectral clusteringTreat points as a graph

Page 69: Apache Mahout Feb 13,  2012 Shannon Quinn

Other Clustering Algorithms

Latent Dirichlet AllocationTopic models

Fuzzy K-MeansPoints are assigned multiple clusters

Canopy clusteringFast approximations of clusters

Spectral clusteringTreat points as a graph

K-Means & Eigencuts

Page 70: Apache Mahout Feb 13,  2012 Shannon Quinn

Mahout in Summary

Page 71: Apache Mahout Feb 13,  2012 Shannon Quinn

Mahout in Summary

Scalable library

Page 72: Apache Mahout Feb 13,  2012 Shannon Quinn

Mahout in Summary

Scalable libraryThree primary areas of focus

Page 73: Apache Mahout Feb 13,  2012 Shannon Quinn

Mahout in Summary

Scalable libraryThree primary areas of focusOther algorithms

Page 74: Apache Mahout Feb 13,  2012 Shannon Quinn

Mahout in Summary

Scalable libraryThree primary areas of focusOther algorithmsAll in your friendly neighborhood MapReduce

Page 75: Apache Mahout Feb 13,  2012 Shannon Quinn

Mahout in Summary

Scalable libraryThree primary areas of focusOther algorithmsAll in your friendly neighborhood MapReducehttp://mahout.apache.org/

Page 76: Apache Mahout Feb 13,  2012 Shannon Quinn

Thank you!