Machine Learning with MapReduce


Page 1: Machine Learning with MapReduce. K-Means Clustering 3

Machine Learning with MapReduce

Page 2: Machine Learning with MapReduce. K-Means Clustering 3
Page 3: Machine Learning with MapReduce. K-Means Clustering 3

K-Means Clustering


Page 4: Machine Learning with MapReduce. K-Means Clustering 3

How to MapReduce K-Means?

• Given K, assign the first K random points to be the initial cluster centers

• Assign subsequent points to the closest cluster using the supplied distance measure

• Compute the centroid of each cluster and iterate the previous step until the cluster centers converge within delta

• Run a final pass over the points to cluster them for output
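A minimal single-machine sketch of this loop, assuming Euclidean distance and the first K points as initial centers (the class and method names are illustrative, not from the slides):

import java.util.*;

// K-means sketch: assign points to the nearest center, recompute
// centroids, and stop once no center moves more than delta.
public class KMeansSketch {
    static double distance(double[] a, double[] b) {        // Euclidean distance (assumed measure)
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static int closest(double[] x, double[][] centers) {    // index of the nearest center
        int best = 0;
        for (int c = 1; c < centers.length; c++)
            if (distance(x, centers[c]) < distance(x, centers[best])) best = c;
        return best;
    }

    static double[][] cluster(double[][] points, int k, double delta) {
        double[][] centers = Arrays.copyOfRange(points, 0, k);  // first K points as initial centers (assumes points.length >= k)
        double shift = Double.MAX_VALUE;
        while (shift > delta) {
            double[][] sums = new double[k][points[0].length];
            int[] counts = new int[k];
            for (double[] x : points) {                      // assignment step
                int c = closest(x, centers);
                counts[c]++;
                for (int i = 0; i < x.length; i++) sums[c][i] += x[i];
            }
            shift = 0;
            for (int c = 0; c < k; c++) {                    // update step: centroid of each cluster
                if (counts[c] == 0) continue;                // empty cluster keeps its old center
                for (int i = 0; i < sums[c].length; i++) sums[c][i] /= counts[c];
                shift = Math.max(shift, distance(centers[c], sums[c]));
                centers[c] = sums[c];
            }
        }
        return centers;
    }
}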

Page 5: Machine Learning with MapReduce. K-Means Clustering 3

K-Means Map/Reduce Design

• Driver
– Runs multiple iteration jobs using mapper+combiner+reducer
– Runs final clustering job using only mapper

• Mapper
– Configure: Single file containing encoded Clusters
– Input: File split containing encoded Vectors
– Output: Vectors keyed by nearest cluster

• Combiner
– Input: Vectors keyed by nearest cluster
– Output: Cluster centroid vectors keyed by “cluster”

• Reducer (singleton)
– Input: Cluster centroid vectors
– Output: Single file containing Vectors keyed by cluster

Page 6: Machine Learning with MapReduce. K-Means Clustering 3

Mapper – the mapper has the k centers in memory.

Input: key-value pair (each input data point x).

Find the index of the closest of the k centers (call it iClosest).

Emit: (key, value) = (iClosest, x)

Reducer(s) – Input: (key, value), where key = index of a center and value = iterator over the input data points closest to that center.

At each key, run through the iterator and average all the corresponding input data points.

Emit: (index of center, new center)
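A framework-agnostic sketch of this mapper/reducer pair, in plain Java rather than the Hadoop API; "emit" is modeled by return values, and all names are illustrative assumptions:

import java.util.*;

// One k-means iteration expressed as map and reduce functions.
public class KMeansMapReduceSketch {
    // Mapper: k centers are held in memory; each point is keyed by its nearest center.
    static Map.Entry<Integer, double[]> map(double[] x, double[][] centers) {
        int iClosest = 0;
        for (int c = 1; c < centers.length; c++)
            if (dist(x, centers[c]) < dist(x, centers[iClosest])) iClosest = c;
        return Map.entry(iClosest, x);      // Emit (iClosest, x)
    }

    // Reducer: average all points assigned to one center index (assumes at least one point).
    static double[] reduce(int centerIndex, Iterator<double[]> points) {
        double[] sum = null;
        int n = 0;
        while (points.hasNext()) {
            double[] x = points.next();
            if (sum == null) sum = new double[x.length];
            for (int i = 0; i < x.length; i++) sum[i] += x[i];
            n++;
        }
        for (int i = 0; i < sum.length; i++) sum[i] /= n;
        return sum;                         // Emit (centerIndex, new center)
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;                           // squared distance suffices for comparisons
    }
}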

Page 7: Machine Learning with MapReduce. K-Means Clustering 3

Improved Version: Calculate partial sums in mappers

Mapper – the mapper has the k centers in memory. Running through one input data point at a time (call it x), it finds the index of the closest of the k centers (call it iClosest) and accumulates the sum of the inputs, segregated into K groups depending on which center is closest. (Each partial sum should carry a count of its points so that averages can be formed downstream.)

Emit: (partial sums) or Emit: (index, partial sum)

Reducer – accumulates the partial sums and emits the new centers, with or without the index, matching the mapper's output format.
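A sketch of this variant; pairing each sum with a point count is an added detail that the downstream averaging needs, and the class layout is an assumption for illustration:

import java.util.*;

// Mapper-side accumulation for k-means: instead of emitting every point,
// keep one running (sum, count) per center and emit only those at the end.
public class KMeansPartialSums {
    final double[][] centers;   // k centers held in memory
    final double[][] sums;      // per-center running sums
    final int[] counts;         // per-center point counts (needed to average later)

    KMeansPartialSums(double[][] centers, int dim) {
        this.centers = centers;
        this.sums = new double[centers.length][dim];
        this.counts = new int[centers.length];
    }

    void accumulate(double[] x) {            // called once per input point
        int iClosest = 0;
        for (int c = 1; c < centers.length; c++)
            if (dist(x, centers[c]) < dist(x, centers[iClosest])) iClosest = c;
        counts[iClosest]++;
        for (int i = 0; i < x.length; i++) sums[iClosest][i] += x[i];
    }

    // Reducer side: merge the partial (sum, count) pairs for one index, then divide.
    static double[] mergeAndAverage(List<double[]> partialSums, List<Integer> partialCounts) {
        double[] total = new double[partialSums.get(0).length];
        int n = 0;
        for (int p = 0; p < partialSums.size(); p++) {
            for (int i = 0; i < total.length; i++) total[i] += partialSums.get(p)[i];
            n += partialCounts.get(p);
        }
        for (int i = 0; i < total.length; i++) total[i] /= n;
        return total;                        // the new center for this index
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}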

Page 8: Machine Learning with MapReduce. K-Means Clustering 3

EM-Algorithm

Page 9: Machine Learning with MapReduce. K-Means Clustering 3

What is MLE?

• Given
– A sample X = {X1, …, Xn}
– A vector of parameters θ

• We define
– Likelihood of the data: P(X | θ)
– Log-likelihood of the data: L(θ) = log P(X | θ)

• Given X, find

\theta_{ML} = \arg\max_{\theta} L(\theta)

Page 10: Machine Learning with MapReduce. K-Means Clustering 3

MLE (cont)

• Often we assume that the Xi are independent and identically distributed (i.i.d.):

\theta_{ML} = \arg\max_{\theta} L(\theta)
            = \arg\max_{\theta} \log P(X_1, \ldots, X_n \mid \theta)
            = \arg\max_{\theta} \log \prod_{i=1}^{n} P(X_i \mid \theta)
            = \arg\max_{\theta} \sum_{i=1}^{n} \log P(X_i \mid \theta)

• Depending on the form of p(x | θ), solving this optimization problem can be easy or hard.

Page 11: Machine Learning with MapReduce. K-Means Clustering 3

An easy case

• Assuming
– A coin has a probability p of being heads, 1 − p of being tails.
– Observation: we toss the coin N times, and the result is a set of Hs and Ts, with m Hs.

• What is the value of p based on MLE, given the observation?

Page 12: Machine Learning with MapReduce. K-Means Clustering 3

An easy case (cont)

L(\theta) = \log P(X \mid \theta) = \log p^{m}(1-p)^{N-m} = m \log p + (N-m) \log(1-p)

\frac{dL}{dp} = \frac{d\,(m \log p + (N-m)\log(1-p))}{dp} = \frac{m}{p} - \frac{N-m}{1-p} = 0

\Rightarrow p = m/N

Page 13: Machine Learning with MapReduce. K-Means Clustering 3

Basic setting in EM

• X is a set of data points: observed data.
• θ is a parameter vector.
• EM is a method to find θML where

\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)

• Calculating P(X | θ) directly is hard.
• Calculating P(X, Y | θ) is much simpler, where Y is “hidden” data (or “missing” data).

Page 14: Machine Learning with MapReduce. K-Means Clustering 3

The basic EM strategy

• Z = (X, Y)
– Z: complete data (“augmented data”)
– X: observed data (“incomplete” data)
– Y: hidden data (“missing” data)

Page 15: Machine Learning with MapReduce. K-Means Clustering 3

The log-likelihood function

• L is a function of θ, while holding X constant:

L(\theta) = L(X \mid \theta) = P(X \mid \theta)

l(\theta) = \log L(\theta) = \log P(X \mid \theta)
          = \sum_{i=1}^{n} \log P(x_i \mid \theta)
          = \sum_{i=1}^{n} \log \sum_{y_i} P(x_i, y_i \mid \theta)

Page 16: Machine Learning with MapReduce. K-Means Clustering 3

The iterative approach for MLE

\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} l(\theta) = \arg\max_{\theta} \sum_{i=1}^{n} \log \sum_{y_i} p(x_i, y_i \mid \theta)

In many cases, we cannot find the solution directly.

An alternative is to find a sequence \theta^{0}, \theta^{1}, \ldots, \theta^{t}, \ldots s.t.

l(\theta^{0}) \le l(\theta^{1}) \le \ldots \le l(\theta^{t}) \le \ldots

Page 17: Machine Learning with MapReduce. K-Means Clustering 3

l(\theta) - l(\theta^{t}) = \log P(X \mid \theta) - \log P(X \mid \theta^{t})

= \sum_{i=1}^{n} \log \sum_{y_i} P(x_i, y_i \mid \theta) - \sum_{i=1}^{n} \log \sum_{y_i'} P(x_i, y_i' \mid \theta^{t})

= \sum_{i=1}^{n} \log \frac{\sum_{y_i} P(x_i, y_i \mid \theta)}{\sum_{y_i'} P(x_i, y_i' \mid \theta^{t})}

= \sum_{i=1}^{n} \log \sum_{y_i} \frac{P(x_i, y_i \mid \theta^{t})}{\sum_{y_i'} P(x_i, y_i' \mid \theta^{t})} \cdot \frac{P(x_i, y_i \mid \theta)}{P(x_i, y_i \mid \theta^{t})}

= \sum_{i=1}^{n} \log \sum_{y_i} P(y_i \mid x_i, \theta^{t}) \, \frac{P(x_i, y_i \mid \theta)}{P(x_i, y_i \mid \theta^{t})}

\ge \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^{t}) \log \frac{P(x_i, y_i \mid \theta)}{P(x_i, y_i \mid \theta^{t})}    (Jensen’s inequality)

= \sum_{i=1}^{n} E_{P(y_i \mid x_i, \theta^{t})}\left[ \log \frac{P(x_i, y_i \mid \theta)}{P(x_i, y_i \mid \theta^{t})} \right]

Page 18: Machine Learning with MapReduce. K-Means Clustering 3

Jensen’s inequality

If f is a convex function, then f(E[g(x)]) \le E[f(g(x))]

If f is a concave function, then f(E[g(x)]) \ge E[f(g(x))]

log is a concave function, so E[\log p(x)] \le \log E[p(x)]
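A quick numerical check (an illustrative example, not from the slides): let x take the values 1 and 9 with probability 1/2 each. Then

\log E[x] = \log 5 \approx 1.609, \qquad E[\log x] = \tfrac{1}{2}(\log 1 + \log 9) \approx 1.099,

so \log E[x] \ge E[\log x], as concavity requires.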

Page 19: Machine Learning with MapReduce. K-Means Clustering 3

Maximizing the lower bound

\theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y_i \mid x_i, \theta^{t})}\left[ \log \frac{p(x_i, y_i \mid \theta)}{p(x_i, y_i \mid \theta^{t})} \right]

= \arg\max_{\theta} \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^{t}) \log \frac{P(x_i, y_i \mid \theta)}{P(x_i, y_i \mid \theta^{t})}

= \arg\max_{\theta} \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^{t}) \log P(x_i, y_i \mid \theta)    (the denominator does not depend on θ and can be dropped)

= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y_i \mid x_i, \theta^{t})}[\log P(x_i, y_i \mid \theta)]

The Q function

Page 20: Machine Learning with MapReduce. K-Means Clustering 3

The Q-function

• Define the Q-function (a function of θ):

– Y is a random vector.
– X = (x1, x2, …, xn) is a constant (vector).
– θt is the current parameter estimate and is a constant (vector).
– θ is the normal variable (vector) that we wish to adjust.

• The Q-function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and θt:

Q(\theta; \theta^{t}) = E_{P(Y \mid X, \theta^{t})}[\log P(X, Y \mid \theta)]
= \sum_{Y} P(Y \mid X, \theta^{t}) \log P(X, Y \mid \theta)
= \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^{t}) \log P(x_i, y_i \mid \theta)

Page 21: Machine Learning with MapReduce. K-Means Clustering 3

The inner loop of the EM algorithm

• E-step: calculate

Q(\theta; \theta^{t}) = \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^{t}) \log P(x_i, y_i \mid \theta)

• M-step: find

\theta^{(t+1)} = \arg\max_{\theta} Q(\theta; \theta^{t})

Page 22: Machine Learning with MapReduce. K-Means Clustering 3

L(θ) is non-decreasing at each iteration

• The EM algorithm will produce a sequence \theta^{0}, \theta^{1}, \ldots, \theta^{t}, \ldots

• It can be proved that

l(\theta^{0}) \le l(\theta^{1}) \le \ldots \le l(\theta^{t}) \le \ldots
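Filling in the step the slide leaves implicit: the Jensen lower bound from Page 17 equals Q(\theta; \theta^{t}) - Q(\theta^{t}; \theta^{t}), so

l(\theta^{t+1}) - l(\theta^{t}) \ge Q(\theta^{t+1}; \theta^{t}) - Q(\theta^{t}; \theta^{t}) \ge 0,

where the second inequality holds because \theta^{t+1} maximizes Q(\cdot\,; \theta^{t}) and \theta^{t} is a feasible choice.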

Page 23: Machine Learning with MapReduce. K-Means Clustering 3

The inner loop of the Generalized EM algorithm (GEM)

• E-step: calculate

Q(\theta; \theta^{t}) = \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^{t}) \log P(x_i, y_i \mid \theta)

• M-step: instead of the full maximization \theta^{(t+1)} = \arg\max_{\theta} Q(\theta; \theta^{t}), find any \theta^{t+1} such that

Q(\theta^{t}; \theta^{t}) \le Q(\theta^{t+1}; \theta^{t})

Page 24: Machine Learning with MapReduce. K-Means Clustering 3

Recap of the EM algorithm

Page 25: Machine Learning with MapReduce. K-Means Clustering 3

Idea #1: find θ that maximizes the likelihood of training data

\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)

Page 26: Machine Learning with MapReduce. K-Means Clustering 3

Idea #2: find the θt sequence

No analytical solution → iterative approach: find a sequence \theta^{0}, \theta^{1}, \ldots, \theta^{t}, \ldots s.t.

l(\theta^{0}) \le l(\theta^{1}) \le \ldots \le l(\theta^{t}) \le \ldots

Page 27: Machine Learning with MapReduce. K-Means Clustering 3

Idea #3: find θt+1 that maximizes a tight lower bound of l(\theta) - l(\theta^{t})

l(\theta) - l(\theta^{t}) \ge \sum_{i=1}^{n} E_{P(y_i \mid x_i, \theta^{t})}\left[ \log \frac{P(x_i, y_i \mid \theta)}{P(x_i, y_i \mid \theta^{t})} \right]    (a tight lower bound)

Page 28: Machine Learning with MapReduce. K-Means Clustering 3

Idea #4: find θt+1 that maximizes the Q function

\theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y_i \mid x_i, \theta^{t})}\left[ \log \frac{p(x_i, y_i \mid \theta)}{p(x_i, y_i \mid \theta^{t})} \right]    (lower bound of l(\theta) - l(\theta^{t}))

= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y_i \mid x_i, \theta^{t})}[\log P(x_i, y_i \mid \theta)]    (the Q function)

Page 29: Machine Learning with MapReduce. K-Means Clustering 3

The EM algorithm

• Start with an initial estimate, θ0

• Repeat until convergence
– E-step: calculate

Q(\theta; \theta^{t}) = \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^{t}) \log P(x_i, y_i \mid \theta)

– M-step: find

\theta^{(t+1)} = \arg\max_{\theta} Q(\theta; \theta^{t})
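To make the E/M loop concrete, here is a minimal sketch of EM for a 1-D mixture of two Gaussians, where the hidden y_i is the component that generated x_i. The example data and all names in it are illustrative additions, not something the slides prescribe:

// EM for a two-component 1-D Gaussian mixture.
// E-step computes P(y_i | x_i, theta^t); M-step maximizes Q in closed form.
public class GaussianMixtureEM {
    public static void main(String[] args) {
        double[] x = {0.1, -0.4, 0.3, 5.2, 4.8, 5.5, 0.0, 5.1};
        double w = 0.5, mu0 = 0.0, mu1 = 1.0, var0 = 1.0, var1 = 1.0;  // theta^0
        for (int t = 0; t < 50; t++) {
            // E-step: responsibility r[i] = P(y_i = 1 | x_i, theta^t)
            double[] r = new double[x.length];
            for (int i = 0; i < x.length; i++) {
                double p0 = (1 - w) * gauss(x[i], mu0, var0);
                double p1 = w * gauss(x[i], mu1, var1);
                r[i] = p1 / (p0 + p1);
            }
            // M-step: re-estimate weight, means, and variances (argmax of Q for this model)
            double n1 = 0, s0 = 0, s1 = 0;
            for (int i = 0; i < x.length; i++) { n1 += r[i]; s0 += (1 - r[i]) * x[i]; s1 += r[i] * x[i]; }
            double n0 = x.length - n1;
            w = n1 / x.length; mu0 = s0 / n0; mu1 = s1 / n1;
            double v0 = 0, v1 = 0;
            for (int i = 0; i < x.length; i++) {
                v0 += (1 - r[i]) * (x[i] - mu0) * (x[i] - mu0);
                v1 += r[i] * (x[i] - mu1) * (x[i] - mu1);
            }
            var0 = v0 / n0; var1 = v1 / n1;
        }
        System.out.printf("w=%.3f mu0=%.3f mu1=%.3f%n", w, mu0, mu1);
    }

    static double gauss(double x, double mu, double var) {
        return Math.exp(-(x - mu) * (x - mu) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
    }
}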

Page 30: Machine Learning with MapReduce. K-Means Clustering 3

Important classes of EM problem

• Products of multinomial (PM) models
• Exponential families
• Gaussian mixture
• …

Page 31: Machine Learning with MapReduce. K-Means Clustering 3

Probabilistic Latent Semantic Analysis (PLSA)

• PLSA is a generative model for the co-occurrence of documents d∈D={d1,…,dD} and terms w∈W={w1,…,wW}, where each co-occurrence is associated with a latent variable z∈Z={z1,…,zZ}.

• The generative process is:

[Graphical model: a document d is drawn with probability P(d), a latent topic z is drawn with P(z|d), and a word w is drawn with P(w|z).]

Page 32: Machine Learning with MapReduce. K-Means Clustering 3

Model

• The generative process can be expressed by:

P(d, w) = P(d) P(w \mid d), \quad \text{where } P(w \mid d) = \sum_{z \in Z} P(w \mid z) P(z \mid d)

Two independence assumptions:
1) Each pair (d, w) is assumed to be generated independently, corresponding to the ‘bag-of-words’ assumption.
2) Conditioned on z, words w are generated independently of the specific document d.

Page 33: Machine Learning with MapReduce. K-Means Clustering 3

Model

• Following the likelihood principle, we determine P(z), P(d|z), and P(w|z) by maximizing the log-likelihood function

L = \sum_{d \in D} \sum_{w \in W} n(d, w) \log P(d, w)

where n(d, w) is the number of co-occurrences of d and w (the observed data; z is the unobserved data), and

P(d, w) = \sum_{z \in Z} P(w \mid z) P(z \mid d) P(d) = \sum_{z \in Z} P(w \mid z) P(d \mid z) P(z)

The two factorizations connect the parameterizations P(d), P(z|d), P(w|z) and P(z), P(d|z), P(w|z).

Page 34: Machine Learning with MapReduce. K-Means Clustering 3

Maximum-likelihood

• Definition
– We have a density function P(x|Θ) that is governed by the set of parameters Θ; e.g., P might be a set of Gaussians and Θ could be the means and covariances.
– We also have a data set X={x1,…,xN}, supposedly drawn from this distribution P, and assume these data vectors are i.i.d. with P.
– Then the likelihood function is:

P(X \mid \Theta) = \prod_{i=1}^{N} P(x_i \mid \Theta) = L(\Theta \mid X)

– The likelihood is thought of as a function of the parameters Θ where the data X is fixed. Our goal is to find the Θ that maximizes L. That is:

\Theta^{*} = \arg\max_{\Theta} L(\Theta \mid X)

Page 35: Machine Learning with MapReduce. K-Means Clustering 3

Jensen’s inequality

\log \sum_{j} a_j \, g(j) \ge \sum_{j} a_j \log g(j)

provided a_j \ge 0, \ \sum_{j} a_j = 1, \ g(j) \ge 0

Page 36: Machine Learning with MapReduce. K-Means Clustering 3

Estimation using EM

\max L = \max \sum_{d \in D} \sum_{w \in W} n(d, w) \log \sum_{z \in Z} P(z) P(w \mid z) P(d \mid z)

difficult!!!

Idea: start with a guess \theta^{t}, compute an easily computed lower bound B(\theta; \theta^{t}) to the log-likelihood, and maximize the bound instead.

By Jensen’s inequality:

\log \sum_{z} P(z) P(w \mid z) P(d \mid z) = \log \sum_{z} P(z \mid d, w) \frac{P(z) P(w \mid z) P(d \mid z)}{P(z \mid d, w)} \ge \sum_{z} P(z \mid d, w) \log \frac{P(z) P(w \mid z) P(d \mid z)}{P(z \mid d, w)}

so

\max B = \max \sum_{d \in D} \sum_{w \in W} n(d, w) \sum_{z} \left[ \log P(z) P(w \mid z) P(d \mid z) - \log P(z \mid d, w) \right] P(z \mid d, w)

Page 37: Machine Learning with MapReduce. K-Means Clustering 3

(1) Solve P(w|z)

• We introduce a Lagrange multiplier λ with the constraint that ∑w P(w|z) = 1, and solve the following equation:

\frac{\partial}{\partial P(w \mid z)} \left[ \sum_{d \in D} \sum_{w \in W} n(d, w) \sum_{z} [\log P(z) P(w \mid z) P(d \mid z) - \log P(z \mid d, w)] P(z \mid d, w) + \lambda \left( \sum_{w \in W} P(w \mid z) - 1 \right) \right] = 0

\Rightarrow \frac{\sum_{d \in D} n(d, w) P(z \mid d, w)}{P(w \mid z)} + \lambda = 0

\Rightarrow P(w \mid z) = -\frac{\sum_{d \in D} n(d, w) P(z \mid d, w)}{\lambda}

Since \sum_{w \in W} P(w \mid z) = 1, we get \lambda = -\sum_{w' \in W} \sum_{d \in D} n(d, w') P(z \mid d, w'), and therefore

P(w \mid z) = \frac{\sum_{d \in D} n(d, w) P(z \mid d, w)}{\sum_{w' \in W} \sum_{d \in D} n(d, w') P(z \mid d, w')}

Page 38: Machine Learning with MapReduce. K-Means Clustering 3

(2) Solve P(d|z)

• We introduce a Lagrange multiplier λ with the constraint that ∑d P(d|z) = 1, and get the following result:

P(d \mid z) = \frac{\sum_{w \in W} n(d, w) P(z \mid d, w)}{\sum_{d' \in D} \sum_{w \in W} n(d', w) P(z \mid d', w)}

Page 39: Machine Learning with MapReduce. K-Means Clustering 3

(3) Solve P(z)

• We introduce a Lagrange multiplier λ with the constraint that ∑z P(z) = 1, and solve the following equation:

\frac{\partial}{\partial P(z)} \left[ \sum_{d \in D} \sum_{w \in W} n(d, w) \sum_{z} [\log P(z) P(w \mid z) P(d \mid z) - \log P(z \mid d, w)] P(z \mid d, w) + \lambda \left( \sum_{z} P(z) - 1 \right) \right] = 0

\Rightarrow \frac{\sum_{d \in D} \sum_{w \in W} n(d, w) P(z \mid d, w)}{P(z)} + \lambda = 0

\Rightarrow P(z) = -\frac{\sum_{d \in D} \sum_{w \in W} n(d, w) P(z \mid d, w)}{\lambda}

Since \sum_{z} P(z) = 1, we get \lambda = -\sum_{z} \sum_{d \in D} \sum_{w \in W} n(d, w) P(z \mid d, w) = -\sum_{d \in D} \sum_{w \in W} n(d, w), and therefore

P(z) = \frac{\sum_{d \in D} \sum_{w \in W} n(d, w) P(z \mid d, w)}{\sum_{d \in D} \sum_{w \in W} n(d, w)}

Page 40: Machine Learning with MapReduce. K-Means Clustering 3

(4) Solve P(z|d,w) - 1

• We introduce Lagrange multipliers λd,w with the constraint that ∑z P(z|d,w) = 1, and solve the following equation:

\frac{\partial}{\partial P(z \mid d, w)} \left[ \sum_{d \in D} \sum_{w \in W} n(d, w) \sum_{z} [\log P(z) P(w \mid z) P(d \mid z) - \log P(z \mid d, w)] P(z \mid d, w) + \sum_{d, w} \lambda_{d, w} \left( \sum_{z} P(z \mid d, w) - 1 \right) \right] = 0

\Rightarrow n(d, w) [\log P(z) P(w \mid z) P(d \mid z) - \log P(z \mid d, w) - 1] + \lambda_{d, w} = 0

\Rightarrow P(z \mid d, w) = P(z) P(w \mid z) P(d \mid z) \, e^{\lambda_{d, w} / n(d, w) - 1}

Since \sum_{z} P(z \mid d, w) = 1, the factor e^{\lambda_{d, w} / n(d, w) - 1} = 1 / \sum_{z'} P(z') P(w \mid z') P(d \mid z'), and therefore

P(z \mid d, w) = \frac{P(z) P(w \mid z) P(d \mid z)}{\sum_{z'} P(z') P(w \mid z') P(d \mid z')}

Page 41: Machine Learning with MapReduce. K-Means Clustering 3

(4) Solve P(z|d,w) - 2

The same result follows directly from Bayes’ rule, using the assumption that w and d are conditionally independent given z:

P(z \mid d, w) = \frac{P(d, w, z)}{P(d, w)} = \frac{P(w, d \mid z) P(z)}{P(d, w)} = \frac{P(w \mid z) P(d \mid z) P(z)}{\sum_{z' \in Z} P(w \mid z') P(d \mid z') P(z')}

Page 42: Machine Learning with MapReduce. K-Means Clustering 3

The final update equations

• E-step:

P(z \mid d, w) = \frac{P(w \mid z) P(d \mid z) P(z)}{\sum_{z' \in Z} P(w \mid z') P(d \mid z') P(z')}

• M-step:

P(w \mid z) = \frac{\sum_{d \in D} n(d, w) P(z \mid d, w)}{\sum_{w' \in W} \sum_{d \in D} n(d, w') P(z \mid d, w')}

P(d \mid z) = \frac{\sum_{w \in W} n(d, w) P(z \mid d, w)}{\sum_{d' \in D} \sum_{w \in W} n(d', w) P(z \mid d', w)}

P(z) = \frac{\sum_{d \in D} \sum_{w \in W} n(d, w) P(z \mid d, w)}{\sum_{d \in D} \sum_{w \in W} n(d, w)}

Page 43: Machine Learning with MapReduce. K-Means Clustering 3

Coding Design

• Variables:
– double[][] p_dz_n // p(d|z), |D|*|Z|
– double[][] p_wz_n // p(w|z), |W|*|Z|
– double[] p_z_n // p(z), |Z|

• Running Processing:
1. Read the dataset from file:
   ArrayList<DocWordPair> doc; // all the docs
   DocWordPair – (word_id, word_frequency_in_doc)
2. Parameter initialization: assign each element of p_dz_n, p_wz_n, and p_z_n a random double value, satisfying ∑d p_dz_n[d][z] = 1, ∑w p_wz_n[w][z] = 1, and ∑z p_z_n[z] = 1
3. Estimation (iterative processing):
   1. Update p_dz_n, p_wz_n, and p_z_n
   2. Calculate the log-likelihood function and stop when |log-likelihood − old_log-likelihood| < threshold
4. Output p_dz_n, p_wz_n, and p_z_n
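Step 3.2 uses the log-likelihood from Page 33; a minimal sketch under the variable names above, assuming (purely for illustration) a dense |D|×|W| count matrix n in place of the sparse ArrayList:

// L = sum_{d,w} n(d,w) * log( sum_z p_z_n[z] * p_wz_n[w][z] * p_dz_n[d][z] )
class PlsaLogLikelihood {
    static double logLikelihood(int[][] n, double[][] p_dz_n, double[][] p_wz_n, double[] p_z_n) {
        double L = 0;
        for (int d = 0; d < n.length; d++) {
            for (int w = 0; w < n[d].length; w++) {
                if (n[d][w] == 0) continue;          // only observed co-occurrences contribute
                double pdw = 0;                      // P(d,w) = sum_z P(z) P(w|z) P(d|z)
                for (int z = 0; z < p_z_n.length; z++)
                    pdw += p_z_n[z] * p_wz_n[w][z] * p_dz_n[d][z];
                L += n[d][w] * Math.log(pdw);
            }
        }
        return L;
    }
}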

Page 44: Machine Learning with MapReduce. K-Means Clustering 3

Coding Design

• Update p_dz_n

For each doc d {
  For each word w included in d {
    denominator = 0;
    numerator = new double[Z];
    For each topic z {
      numerator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
      denominator += numerator[z];
    } // end for each topic z
    For each topic z {
      P_z_condition_d_w = numerator[z] / denominator;        // E-step: P(z|d,w)
      numerator_p_dz_n[d][z] += tfwd * P_z_condition_d_w;    // tfwd = n(d,w), the frequency of w in d
      denominator_p_dz_n[z] += tfwd * P_z_condition_d_w;
    } // end for each topic z
  } // end for each word w included in d
} // end for each doc d

For each doc d {
  For each topic z {
    p_dz_n_new[d][z] = numerator_p_dz_n[d][z] / denominator_p_dz_n[z];
  } // end for each topic z
} // end for each doc d

Page 45: Machine Learning with MapReduce. K-Means Clustering 3

Coding Design

• Update p_wz_n

For each doc d {
  For each word w included in d {
    denominator = 0;
    numerator = new double[Z];
    For each topic z {
      numerator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
      denominator += numerator[z];
    } // end for each topic z
    For each topic z {
      P_z_condition_d_w = numerator[z] / denominator;        // E-step: P(z|d,w)
      numerator_p_wz_n[w][z] += tfwd * P_z_condition_d_w;
      denominator_p_wz_n[z] += tfwd * P_z_condition_d_w;
    } // end for each topic z
  } // end for each word w included in d
} // end for each doc d

For each word w {
  For each topic z {
    p_wz_n_new[w][z] = numerator_p_wz_n[w][z] / denominator_p_wz_n[z];
  } // end for each topic z
} // end for each word w

Page 46: Machine Learning with MapReduce. K-Means Clustering 3

Coding Design

• Update p_z_n

For each doc d {
  For each word w included in d {
    denominator = 0;
    numerator = new double[Z];
    For each topic z {
      numerator[z] = p_dz_n[d][z] * p_wz_n[w][z] * p_z_n[z];
      denominator += numerator[z];
    } // end for each topic z
    For each topic z {
      P_z_condition_d_w = numerator[z] / denominator;        // E-step: P(z|d,w)
      numerator_p_z_n[z] += tfwd * P_z_condition_d_w;
    } // end for each topic z
    denominator_p_z_n += tfwd;   // scalar: total count, i.e. sum over all (d,w) of n(d,w)
  } // end for each word w included in d
} // end for each doc d

For each topic z {
  p_z_n_new[z] = numerator_p_z_n[z] / denominator_p_z_n;
} // end for each topic z

Page 47: Machine Learning with MapReduce. K-Means Clustering 3

Apache Mahout

Industrial Strength Machine Learning

GraphLab

Page 48: Machine Learning with MapReduce. K-Means Clustering 3

Current Situation

• Large volumes of data are now available
• Platforms now exist to run computations over large datasets (Hadoop, HBase)
• Sophisticated analytics are needed to turn data into information people can use
• Active research community and proprietary implementations of “machine learning” algorithms
• The world needs scalable implementations of ML under open license - ASF

Page 49: Machine Learning with MapReduce. K-Means Clustering 3

History of Mahout

• Summer 2007
– Developers needed scalable ML
– Mailing list formed

• Community formed
– Apache contributors
– Academia & industry
– Lots of initial interest

• Project formed under Apache Lucene
– January 25, 2008

Page 50: Machine Learning with MapReduce. K-Means Clustering 3

Current Code Base

• Matrix & Vector library
– Memory-resident sparse & dense implementations

• Clustering
– Canopy
– K-Means
– Mean Shift

• Collaborative Filtering
– Taste

• Utilities
– Distance Measures
– Parameters

Page 51: Machine Learning with MapReduce. K-Means Clustering 3

Others?

• Naïve Bayes
• Perceptron
• PLSI/EM
• Genetic Programming
• Dirichlet Process Clustering
• Clustering Examples
• Hama (Incubator) for very large arrays