
Clustering and the EM Algorithm

Susanna Ricco

CPS 271, 25 October 2007

Material borrowed from: Lise Getoor, Andrew Moore, Tom Dietterich, Sebastian Thrun, Rich Maclin, ...

(and Ron Parr)

Unsupervised Learning

Supervised Learning
Given data in the form <x, y>, y is the target to learn.

- Good news: Easy to tell if our algorithm is giving the right answer.

Unsupervised Learning
Given data in the form <x> without any explicit target.

- Bad news: How do we define “good performance”?
- Good news: We can use our results for more than just predicting y.

Unsupervised Learning is Model Learning

Goal
Produce a global summary of the data.

How?
Assume the data are sampled from an underlying model with easily summarized properties.

Why?

- Filter out noise
- Data compression

Good Clusters
Want points in a cluster to be:

1. as similar as possible to other points in the same cluster
2. as different as possible from points in another cluster

Warning: The definitions of similar and different depend on the specific application.

We’ve already seen a lot of ways to measure distance between two data points.

Types of Clustering Algorithms

Hierarchical methods: e.g., hierarchical agglomerative clustering

Partition-based methods: e.g., K-means

Probabilistic model-based methods: e.g., learning mixture models

Spectral methods: I’m not going to talk about these

Hierarchical Clustering

Build a hierarchy of nested clusters, either gradually

- merging similar clusters (agglomerative method), or
- dividing loose superclusters (divisive method).

The result is displayed as a dendrogram showing the sequence of merges or splits.

Agglomerative Hierarchical Clustering

Initialize Ci = {x(i)} for i ∈ [1, n].

While more than one cluster is left:

1. Let Ci, Cj be the pair of clusters that minimizes D(Ci, Cj)
2. Ci = Ci ∪ Cj
3. Remove Cj from the list of clusters
4. Store the current clusters

Measuring Distance
What is D(Ci, Cj)?

- Single link method: D(Ci, Cj) = min{ d(x, y) : x ∈ Ci, y ∈ Cj }
- Complete link method: D(Ci, Cj) = max{ d(x, y) : x ∈ Ci, y ∈ Cj }
- Average link method: D(Ci, Cj) = (1 / (|Ci| |Cj|)) ∑_{x ∈ Ci} ∑_{y ∈ Cj} d(x, y)
- Centroid measure: D(Ci, Cj) = d(ci, cj), where ci and cj are the cluster centroids
- Ward's measure: D(Ci, Cj) = ∑_{u ∈ Ci ∪ Cj} d(u, c_{i∪j}) − ∑_{x ∈ Ci} d(x, ci) − ∑_{y ∈ Cj} d(y, cj), the increase in within-cluster scatter caused by merging Ci and Cj (here c_{i∪j} is the centroid of the merged cluster)
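These linkage criteria are implemented in SciPy; a minimal sketch, with the library choice and the toy data as illustrative assumptions rather than anything the lecture prescribes:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(5, 2)),   # tight group near (0, 0)
               rng.normal(3.0, 0.3, size=(5, 2))])  # tight group near (3, 3)

Z = linkage(X, method='single')                     # also: 'complete', 'average', 'centroid', 'ward'
labels = fcluster(Z, t=2, criterion='maxclust')     # cut the dendrogram into 2 clusters
print(labels)

Each row of Z records one merge, so it encodes the full dendrogram produced by the agglomerative procedure above.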

Result

[Figure, from Elman (2004): hierarchical clustering dendrogram of hidden-unit activation patterns in response to different words. The similarity between words and groups of words is reflected in the tree structure; items that are closer are joined further down the tree. The leaves separate into VERBS (always intransitive, sometimes transitive, always transitive) and NOUNS (animates: animals, humans; inanimates: food, breakables). Distance is shown on an arbitrary scale.]

Elman, J.L. (2004). An alternative view of the mental lexicon. Trends in Cognitive Sciences, 8(7), 301-306.

Divisive Hierarchical Clustering

Begin with one single cluster and split to form smaller clusters.

It can be difficult to choose potential splits:

- Monolithic methods split based on the values of a single variable
- Polythetic methods consider all variables together

Less popular than agglomerative methods.

Partition-based Clustering

Pick some number of clusters K.

Assign each point x(i) to a single cluster Ck so that SCORE(C, D) is minimized/maximized.

- (What is the score function?)

Total number of possible allocations: K^n

Use iterative improvement instead of an intractable exhaustive search.
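To get a feel for how quickly K^n grows, a quick check (the numbers are chosen arbitrarily):

# K**n possible assignments of n points to K labeled clusters
K = 3
for n in (10, 20, 50, 100):
    print(n, K ** n)

Already for n = 100 and K = 3 this is roughly 5 * 10^47 allocations, far beyond exhaustive search.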

The K-Means Algorithm

A popular partition-based clustering algorithm, with the score function given by

SCORE(C, D) = ∑_{k=1}^{K} ∑_{x ∈ Ck} d(x, ck)

where

ck = (1/nk) ∑_{x ∈ Ck} x

and

d(x, y) = ||x − y||².

Pseudo-code for K-Means

1. Initialize K cluster centers ck.

2. For each x(i), assign the cluster with the closest center:
   x(i) is assigned to k = arg min_k d(x(i), ck).

3. For each cluster, recompute the center:
   ck = (1/nk) ∑_{x ∈ Ck} x

4. Check convergence (have the cluster centers moved?).

5. If not converged, go to 2.
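A minimal NumPy sketch of this loop; the function and variable names, tolerance, and random initialization are illustrative choices, not part of the lecture:

import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]        # step 1: initialize centers
    for _ in range(max_iter):
        # step 2: assign each point to the closest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        # steps 4-5: stop once the centers stop moving
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers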

K-Means Example

(The example frames are borrowed from Lise Getoor's partition-based clustering slides. Those slides phrase the objective as a generic score function Score(C, D) = f(wc(C), bc(C)), where wc(C) = ∑_{k=1}^{K} ∑_{x ∈ Ck} d(x, ck) is the within-cluster sum-of-squares, to be minimized, and bc(C) = ∑_{1 ≤ j < k ≤ K} d(cj, ck) is the between-cluster distance, to be maximized, with ck the centroid of cluster Ck.)

[Figure: eight unlabeled points x(1), ..., x(8).]
Original unlabeled data.

[Figure: three centers c1, c2, c3 placed among the points.]
Pick initial centers randomly.

[Figure: each point joined to its nearest center.]
Assign points to nearest cluster.

[Figure: centers moved to the means of their assigned points.]
Recompute cluster centers.

[Figure: points reassigned to the now-nearest centers.]
Reassign points to nearest clusters.

[Figure: centers recomputed once more; the assignment no longer changes.]
Recompute cluster centers.

Understanding K-Means

Time complexity per iteration?

Does algorithm terminate?

Does algorithm converge to global optimum?

K-Means Convergence

Model the data as drawn from spherical Gaussians centered at the cluster centers. Then

log P(data | assignments) = const − (1/2) ∑_{k=1}^{K} ∑_{x ∈ Ck} ||x − ck||².

- How does this change when we reassign a point?
- How does this change when we recompute the means?

Monotonic improvement + a finite number of possible assignments = convergence.

Demo

http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Variations on K-Means

What if we don’t know K? Allow merging or splitting of clusters using heuristics.

What if means don’t make sense? Use k-medoids instead.

Mixture Models

Assume the data are generated by the following procedure:

1. Pick one of K components according to P(zk). This selects a (hidden) class label zk.

2. Generate a data point by sampling from p(x | zk).

This gives the probability distribution of a single point

p(x(i)) = ∑_{k=1}^{K} P(zk) p(x(i) | zk)

where p(x | zk) can be any distribution (Gaussian, Poisson, exponential, etc.).
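A minimal sketch of this two-step generative procedure for a 1-D Gaussian mixture; the mixing weights and component parameters are made-up illustration values:

import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.5, 0.3, 0.2])                  # P(z_k); must sum to 1
means = np.array([-2.0, 0.0, 3.0])
stds = np.array([0.5, 1.0, 0.8])

def sample_point():
    k = rng.choice(len(weights), p=weights)          # step 1: pick a hidden component z_k
    return rng.normal(means[k], stds[k]), k          # step 2: sample x from p(x | z_k)

X, Z = zip(*(sample_point() for _ in range(1000)))   # data and their (normally hidden) labels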

Gaussian Mixture Model (GMM)

The most common mixture model is the Gaussian mixture model:

p(x | zk) = N(x | μk, Σk)

With this model, the likelihood of the data becomes

p(D) = ∏_{i=1}^{N} ∑_{k=1}^{K} P(zk) p(x(i) | zk; μk, Σk).
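Evaluating this likelihood (in log form, for numerical stability) takes a few lines with SciPy; a sketch on made-up parameters and data, not something the lecture prescribes:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, weights, means, covs):
    # log p(D) = sum_i log sum_k P(z_k) N(x_i | mu_k, Sigma_k)
    log_terms = np.stack([np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
                          for w, m, c in zip(weights, means, covs)], axis=1)
    return logsumexp(log_terms, axis=1).sum()

X = np.array([[0.0, 0.0], [2.9, 3.1], [0.2, -0.1]])
print(gmm_log_likelihood(X, weights=[0.5, 0.5],
                         means=[np.zeros(2), np.array([3.0, 3.0])],
                         covs=[np.eye(2), np.eye(2)]))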

LDA and GMMs

LDA

- Built models p(x | zk) and P(zk) using maximum likelihood given our training data.
- Used these models to compute P(zk | x) to classify new query points.

Clustering with GMMs

- Want to find P(zk) and p(x | zk) to learn the underlying model and find clusters.
- Want to compute P(zk | x) for each point in the training set to assign points to clusters.
- Can we use maximum likelihood to infer both the model and the assignments?
- This requires solving a non-linear system of equations; there is no efficient analytic solution.

Problem: Missing Labels

If we knew the assignments, we could learn the component models easily.

- We did this to train an LDA.

If we knew the component models, we could estimate the most likely assignments easily.

- This is just classification.

Solution: The Expectation Maximization (EM) Algorithm

We deal with missing labels by alternating between two steps:

1. Expectation: Fix the model and estimate the missing labels.

2. Maximization: Fix the missing labels (or a distribution over the missing labels) and find the model that maximizes the expected log-likelihood of the data.

Simple Example

Labeled Data

Clusters correspond to “grades in class”.

Model to learn:

P(A) = 1/2
P(B) = μ
P(C) = 2μ
P(D) = 1/2 − 3μ

Training data:

a people got an A
b people got a B
c people got a C
d people got a D

What is the maximum likelihood estimate for μ?

Simple Example

Labeled Data

Likelihood:

P(a, b, c, d | μ) = K (1/2)^a (μ)^b (2μ)^c (1/2 − 3μ)^d

log P(a, b, c, d | μ) = log K + a log(1/2) + b log μ + c log(2μ) + d log(1/2 − 3μ)

∂/∂μ log P(a, b, c, d | μ) = b/μ + 2c/(2μ) − 3d/(1/2 − 3μ)

For the MLE, set ∂/∂μ log P = 0 and solve for μ to get

μ = (b + c) / (6(b + c + d))
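A quick symbolic check of this derivation with SymPy; the symbol names are mine, and the lecture itself does not assume any particular software:

import sympy as sp

a, b, c, d, mu = sp.symbols('a b c d mu', positive=True)
logL = (a * sp.log(sp.Rational(1, 2)) + b * sp.log(mu)
        + c * sp.log(2 * mu) + d * sp.log(sp.Rational(1, 2) - 3 * mu))
sol = sp.solve(sp.Eq(sp.diff(logL, mu), 0), mu)
print(sp.simplify(sol[0]))   # (b + c)/(6*(b + c + d))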

Simple Example

Hidden Labels

What if we only know that there are h “high grades”? (Exact labels are missing.)

Now how do we find the maximum likelihood estimate of µ?

1. Expectation: Fix μ and infer the expected values of a and b:

a = (1/2) / (1/2 + μ) · h,   b = μ / (1/2 + μ) · h

since we know a/b = (1/2)/μ and a + b = h.

2. Maximization: Fix these fractions a and b and compute the maximum likelihood μ as before:

μ_new = (b + c) / (6(b + c + d)).

3. Repeat.
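A minimal sketch of this alternation in Python; the counts h, c, d and the starting μ are made-up numbers:

def em_grades(h, c, d, mu=0.1, n_iter=50):
    for _ in range(n_iter):
        # E step: split the h "high grades" into expected counts of A's and B's
        a = h * 0.5 / (0.5 + mu)
        b = h * mu / (0.5 + mu)
        # M step: maximum likelihood estimate of mu given the (expected) counts
        mu = (b + c) / (6.0 * (b + c + d))
    return mu

print(em_grades(h=14, c=6, d=10))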

Formal Setup for General EM Algorithm

Let D = {x(1), ..., x(n)} be the n observed data vectors.

Let Z = {z(1), ..., z(n)} be the n values of the hidden variables (i.e., the cluster labels).

Log-likelihood of the observed data given the model:

L(θ) = log p(D | θ) = log ∑_Z p(D, Z | θ)

Note: both θ and Z are unknown.

Fun with Jensen’s Inequality

Let Q(Z) be any distribution over the hidden variables:

log P(D | θ) = log ∑_Z Q(Z) p(D, Z | θ) / Q(Z)

  ≥ ∑_Z Q(Z) log [ p(D, Z | θ) / Q(Z) ]

  = ∑_Z Q(Z) log p(D, Z | θ) + ∑_Z Q(Z) log (1 / Q(Z))

  = F(Q, θ)
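A tiny numeric check of this bound for a single observation with a binary hidden variable; the joint probabilities are made-up numbers:

import numpy as np

p_xz = np.array([0.12, 0.28])            # made-up joint p(x, z=0), p(x, z=1)
log_px = np.log(p_xz.sum())              # log p(x) = log sum_z p(x, z)

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet([1.0, 1.0])        # an arbitrary distribution Q(z)
    F = np.sum(q * np.log(p_xz / q))     # F(Q, theta) = sum_z Q(z) log[p(x, z | theta)/Q(z)]
    assert F <= log_px + 1e-12           # Jensen: F is a lower bound on log p(x)

q_post = p_xz / p_xz.sum()               # Q(z) = p(z | x) attains the bound with equality
print(np.isclose(np.sum(q_post * np.log(p_xz / q_post)), log_px))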

General EM Algorithm

Alternate between steps until convergence:

E step:

- Maximize F with respect to Q, keeping θ fixed.
- Solution: Q_{k+1} = p(Z | D, θ_k)

M step:

- Maximize F with respect to θ, keeping Q fixed.
- Solution:

θ_{k+1} = arg max_θ F(Q_{k+1}, θ) = arg max_θ ∑_Z p(Z | D, θ_k) log p(D, Z | θ)

General EM Algorithm in English

Alternate steps until model parameters don’t change much:

E step: Estimate the distribution over labels given the current fixed model.

M step: Choose new parameters for the model to maximize the expected log-likelihood of the observed data and hidden variables.

Convergence

The EM algorithm converges because:

- During the E step, we make F(Q_{k+1}, θ_k) = log P(D | θ_k).
- During the M step, we choose θ_{k+1} to increase F.
- Recall that F is a lower bound: F(Q_{k+1}, θ_{k+1}) ≤ log P(D | θ_{k+1}).
- This implies log P(D | θ_k) ≤ log P(D | θ_{k+1}).
- This implies convergence! (Why?)

Notes

Things to remember:

- There is often a closed form for both the E and M steps.
- You must specify a stopping criterion.
- Complexity depends on the number of iterations and the time to compute the E and M steps.
- It may (will) converge to a local optimum.

Example: EM for GMM

[Figures, borrowed from Andrew Moore's "Clustering with Gaussian Mixtures" slides: EM fitting a mixture of Gaussians to 2-D data. The E step computes the expected class memberships of all data points for each class; the M step computes the maximum-likelihood parameters given those membership distributions.]

Initial model parameters.
After first iteration.
After second iteration.
After third iteration.
After fourth iteration.
After fifth iteration.
After sixth iteration.
After convergence.
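A compact NumPy/SciPy sketch of EM for a GMM along these lines; the initialization, the small covariance regularizer, and the names are my own choices rather than the lecture's:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    weights = np.full(K, 1.0 / K)
    means = X[rng.choice(n, size=K, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities r[i, k] = P(z_k | x_i) under the current model
        dens = np.stack([w * multivariate_normal.pdf(X, mean=m, cov=c)
                         for w, m, c in zip(weights, means, covs)], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate weights, means, covariances from the responsibilities
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (r[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return weights, means, covs, r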

Relation to K-Means

Similarities
K-means uses a GMM with:

- covariance Σ = I (fixed)
- uniform P(zk) (fixed)
- unknown means

and alternates between estimating labels and recomputing the unknown model parameters.

Difference
K-means makes a “hard” assignment to a cluster during the E step.

How to Pick K?

Do we want to pick the K that maximizes likelihood?

Other options:

- Cross-validation
- Add a complexity penalty to the objective function (see the sketch below)
- Prior knowledge

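One common complexity penalty is the BIC; a brief sketch using scikit-learn's GaussianMixture on made-up data (the library and the data are illustrative assumptions, not part of the lecture):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(100, 2))
               for m in ([0, 0], [3, 3], [0, 4])])      # three well-separated blobs

bics = {}
for K in range(1, 7):
    gmm = GaussianMixture(n_components=K, random_state=0).fit(X)
    bics[K] = gmm.bic(X)          # -2 log-likelihood + (number of free parameters) * log(n)
best_K = min(bics, key=bics.get)  # smallest BIC wins
print(best_K, bics)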

Summary

Clustering: Infer assignments to hidden variables and hidden model parameters simultaneously.

EM Algorithm: A powerful, popular, general method for doing this.

EM Applications:

- Image segmentation
- SLAM
- Estimating motion models for tracking
- Hidden Markov models
- etc.