GMM, EM, K-Means, Max. Likelihood (Lecture Notes)



Gaussian Mixture Models (GMM)

and Expectation-Maximization (EM)

(and K-Means too!)

Source Material for Lecture

http://www.autonlab.org/tutorials/gmm.html

http://research.microsoft.com/~cmbishop/talks.htm

Review: The Gaussian Distribution

• Multivariate Gaussian

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathsf T} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}$$

with mean \(\boldsymbol{\mu}\) and covariance \(\boldsymbol{\Sigma}\)

Bishop, 2003

Isotropic (circularly symmetric) if covariance is diag(k,k,...,k)

Likelihood Function

• Data set \(\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}\)

• Assume observed data points generated independently

$$p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$$

• Viewed as a function of the parameters, this is known as the likelihood function

Bishop, 2003

Maximum Likelihood

• Set the parameters by maximizing the likelihood function

• Equivalently maximize the log likelihood

Bishop, 2003

Maximum Likelihood Solution

• Maximizing w.r.t. the mean gives the sample mean

$$\boldsymbol{\mu}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$$

• Maximizing w.r.t. the covariance gives the sample covariance

$$\boldsymbol{\Sigma}_{\text{ML}} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\text{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\text{ML}})^{\mathsf T}$$

Bishop, 2003
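The sketch below (not from the lecture; the function name gaussian_mle and the test parameters are my own) shows these two estimates in NumPy, assuming the data are stored as an (N, D) array:

```python
import numpy as np

def gaussian_mle(X):
    """Maximum-likelihood estimates for a multivariate Gaussian.

    X is an (N, D) array of observations; returns (mean, covariance),
    where the covariance uses the 1/N (maximum-likelihood) normalization."""
    N = X.shape[0]
    mu = X.mean(axis=0)                      # sample mean
    centered = X - mu
    sigma = (centered.T @ centered) / N      # sample covariance (divide by N, not N-1)
    return mu, sigma

# Example: estimate parameters from 1000 draws of a known 2-D Gaussian
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=1000)
mu_hat, cov_hat = gaussian_mle(X)
print(mu_hat, cov_hat, sep="\n")
```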


Bias of Maximum Likelihood

• Consider the expectations of the maximum likelihood estimates under the Gaussian distribution:

$$\mathbb{E}[\boldsymbol{\mu}_{\text{ML}}] = \boldsymbol{\mu}, \qquad \mathbb{E}[\boldsymbol{\Sigma}_{\text{ML}}] = \frac{N-1}{N}\,\boldsymbol{\Sigma}$$

• The maximum likelihood solution systematically under-estimates the covariance

• This is an example of over-fitting

Bishop, 2003

Intuitive Explanation of Over-fitting

Bishop, 2003

Unbiased Variance Estimate

• Clearly we can remove the bias by using

$$\widetilde{\boldsymbol{\Sigma}} = \frac{N}{N-1}\,\boldsymbol{\Sigma}_{\text{ML}} = \frac{1}{N-1} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\text{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\text{ML}})^{\mathsf T}$$

since this gives \(\mathbb{E}[\widetilde{\boldsymbol{\Sigma}}] = \boldsymbol{\Sigma}\)

• Arises naturally in a Bayesian treatment

• For an infinite data set the two expressions are equal

Bishop, 2003
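As a quick numerical check of the bias claim, the following snippet (my own illustration, not part of the slides) estimates the variance of many small samples both ways; with N = 5 the divide-by-N estimate should average roughly (N-1)/N = 0.8 of the true variance, while the divide-by-(N-1) estimate should be close to unbiased:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 100_000
samples = rng.normal(loc=0.0, scale=1.0, size=(trials, N))   # true variance = 1

ml_var = samples.var(axis=1, ddof=0)        # divide by N   (maximum likelihood)
unbiased_var = samples.var(axis=1, ddof=1)  # divide by N-1 (unbiased)

print("mean ML estimate:      ", ml_var.mean())        # roughly 0.8
print("mean unbiased estimate:", unbiased_var.mean())  # roughly 1.0
```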

Comments

Gaussians are well understood and easy to estimate.

However, they are unimodal, and thus cannot be used to represent inherently multimodal datasets.

Fitting a single Gaussian to a multimodal dataset is likely to give a mean value in an area with low probability, and to overestimate the covariance.

Old Faithful Data Set

[Figure: Old Faithful eruption data, duration of eruption (minutes) vs. time between eruptions (minutes). Bishop, 2004]

Some Bio Assay data

[Figure: scatter plot of bio-assay data on two unnamed axes ("some axis" vs. "some other axis"). Copyright © 2001, Andrew W. Moore]


Idea: Use a Mixture of Gaussians

• Linear super-position of Gaussians

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

• Normalization and positivity require

$$\sum_{k=1}^{K} \pi_k = 1, \qquad 0 \le \pi_k \le 1$$

• Can interpret the mixing coefficients as prior probabilities

Bishop, 2003
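A minimal sketch of evaluating this superposition in Python, assuming SciPy's multivariate_normal for the component densities (the function name gmm_density and the example parameters are my own, not the lecture's):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_k pi_k * N(x | mu_k, Sigma_k) at a point x."""
    return sum(
        w * multivariate_normal.pdf(x, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    )

# Example: a 2-D mixture of 3 Gaussians (parameters chosen arbitrarily)
weights = [0.5, 0.3, 0.2]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.5])]

print(gmm_density(np.array([1.0, 1.0]), weights, means, covs))
```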

Example: Mixture of 3 Gaussians

Bishop, 2004

Sampling from the Gaussian

• To generate a data point:

– first pick one of the components with probability πk

– then draw a sample from that component

• Repeat these two steps for each new data point

Bishop, 2003

Sampling from a GMM

• There are k components. The i'th component is called ωi

• Component i has an associated mean vector μi

• Each component generates data from a Gaussian with mean μi and covariance matrix σ²I

Assume that each datapoint is generated according to the following recipe:

1. Pick a component at random. Choose component i with probability P(ωi).

2. Datapoint x ~ N(μi, σ²I)

Copyright © 2001, Andrew W. Moore

Sampling from a General GMM

• There are k components. The i'th component is called ωi

• Component i has an associated mean vector μi

• Each component generates data from a Gaussian with mean μi and covariance matrix Σi

Assume that each datapoint is generated according to the following recipe:

1. Pick a component at random. Choose component i with probability P(ωi).

2. Datapoint x ~ N(μi, Σi)

Copyright © 2001, Andrew W. Moore
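The two-step recipe translates directly into code. The following sketch (my own, assuming NumPy; sample_gmm and the example parameters are not from the slides) picks a component for each datapoint and then draws from that component's Gaussian:

```python
import numpy as np

def sample_gmm(n_samples, weights, means, covs, rng=None):
    """Draw samples from a Gaussian mixture by following the two-step recipe:
    pick a component with probability P(component), then draw from that Gaussian."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.asarray(weights)
    # Step 1: choose a component index for every sample
    components = rng.choice(len(weights), size=n_samples, p=weights)
    # Step 2: draw from the chosen component's Gaussian
    X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in components])
    return X, components

# Example: 500 points from a 3-component mixture (made-up parameters)
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0]), np.array([-4.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.5, 0.5])]
X, labels = sample_gmm(500, [0.5, 0.3, 0.2], means, covs, rng=np.random.default_rng(1))
print(X.shape, np.bincount(labels))
```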

GMM describing assay data

Copyright © 2001, Andrew W. Moore

GMM density function

Note: now we have a continuous estimate of the density, so can estimate a value at any point.

Also, could draw constant-probability contours if we wanted to.

Fitting the Gaussian Mixture

• We wish to invert this process – given the data set, find the corresponding parameters:

– mixing coefficients

– means

– covariances

• If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster

• Problem: the data set is unlabelled

• We shall refer to the labels as latent (= hidden) variables

Bishop, 2003

Maximum Likelihood for the GMM

• The log likelihood function takes the form

$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$

• Note: the sum over components appears inside the log

• There is no closed-form solution for maximum likelihood

• However, with labeled data, the story is different

Bishop, 2003


Labeled vs Unlabeled Data

[Figure: the same data set shown labeled (points colored by component) and unlabeled. Bishop, 2003]

Easy to estimate params (do each color separately)

Hard to estimate params (we need to assign colors)

Side-Trip : Clustering using K-means

K-means is a well-known method of clustering data.

Determines location of clusters (cluster centers), as well as which data points are “owned” by which cluster.

Motivation: K-means may give us some insight into how to label data points by which cluster they come from (i.e. determine ownership or membership)

K-means and Hierarchical Clustering

Andrew W. Moore

Associate Professor

School of Computer Science

Carnegie Mellon University

www.cs.cmu.edu/~awm

[email protected]

412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Some Data

This could easily be modeled by a Gaussian Mixture (with 5 components)

But let's look at a satisfying, friendly and infinitely popular alternative…

K-means

1. Ask user how many clusters they'd like. (e.g. k=5)

2. Randomly guess k cluster Center locations

3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints)

4. Each Center finds the centroid of the points it owns…

5. …and jumps there

6. …Repeat until terminated!
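A compact NumPy sketch of this loop (my own illustration; the function name kmeans and the choice of random datapoints for initialization are assumptions, not the lecture's code). Termination is detected by the Centers no longer moving, which matches step 6:

```python
import numpy as np

def kmeans(X, k, n_iters=100, rng=None):
    """Plain k-means: alternate (1) assign each point to its nearest Center,
    (2) move each Center to the centroid of the points it owns."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 2 of the recipe: random initial Center locations (here, random datapoints)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: ownership = index of the closest Center for every datapoint
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        owners = dists.argmin(axis=1)
        # Steps 4-5: each Center jumps to the centroid of the points it owns
        new_centers = np.array([
            X[owners == j].mean(axis=0) if np.any(owners == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # step 6: stop when nothing moves
            break
        centers = new_centers
    return centers, owners

# Example usage on random 2-D data
X = np.random.default_rng(0).normal(size=(200, 2))
centers, owners = kmeans(X, k=5)
print(centers)
```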

K-means Start

Example generated by Dan Pelleg’s super-duper fast K-means system:

Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999, (KDD99) (available on www.autonlab.org/pap.html)

K-means continues…

K-means continues…


K-means continues…

K-means continues…

K-means continues…

K-means continues…

K-means continues…

K-means continues…


K-means terminates

K-means Questions

• What is it trying to optimize?

• Are we sure it will terminate?

• Are we sure it will find an optimal clustering?

• How should we start it?

• How could we automatically choose the number of centers?

….we’ll deal with these questions over the next few slides

Distortion

Given..

• an encoder function: ENCODE : ℜ^m → [1..k]

• a decoder function: DECODE : [1..k] → ℜ^m

Define…

$$\text{Distortion} = \sum_{i=1}^{R} \bigl(\mathbf{x}_i - \text{DECODE}[\text{ENCODE}(\mathbf{x}_i)]\bigr)^2$$

We may as well write DECODE\([j] = \mathbf{c}_j\), so

$$\text{Distortion} = \sum_{i=1}^{R} \bigl(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\bigr)^2$$
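For concreteness, the distortion of a given set of Centers and ownership assignment can be computed as follows (a sketch assuming NumPy; the names distortion, centers and owners are mine):

```python
import numpy as np

def distortion(X, centers, owners):
    """Distortion = sum_i || x_i - c_{ENCODE(x_i)} ||^2, where owners[i] = ENCODE(x_i)."""
    return np.sum(np.linalg.norm(X - centers[owners], axis=1) ** 2)

# Example on tiny made-up data: two centers, four points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 4.9]])
centers = np.array([[0.05, 0.0], [5.05, 4.95]])
owners = np.array([0, 0, 1, 1])
print(distortion(X, centers, owners))
```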

The Minimal Distortion

What properties must centers c1, c2, …, ck have when distortion is minimized?

$$\text{Distortion} = \sum_{i=1}^{R} \bigl(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\bigr)^2$$

The Minimal Distortion (1)

What properties must centers c1, c2, …, ck have when distortion is minimized?

(1) xi must be encoded by its nearest center

….why?

At the minimal distortion,

$$\mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} = \arg\min_{\mathbf{c}_j \in \{\mathbf{c}_1, \dots, \mathbf{c}_k\}} \lVert \mathbf{x}_i - \mathbf{c}_j \rVert^2$$

Otherwise distortion could be reduced by replacing ENCODE[xi] by the nearest center.

The Minimal Distortion (2)

What properties must centers c1, c2, …, ck have when distortion is minimized?

(2) The partial derivative of Distortion with respect to each center location must be zero.

Writing OwnedBy(cj) = the set of records owned by Center cj, we have

$$\text{Distortion} = \sum_{i=1}^{R} \bigl(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\bigr)^2 = \sum_{j=1}^{k} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j)^2$$

$$\frac{\partial\,\text{Distortion}}{\partial \mathbf{c}_j} = \frac{\partial}{\partial \mathbf{c}_j} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j)^2 = -2 \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j) = 0 \quad \text{(for a minimum)}$$

Thus, at a minimum:

$$\mathbf{c}_j = \frac{1}{|\text{OwnedBy}(\mathbf{c}_j)|} \sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \mathbf{x}_i$$

At the minimum distortion

What properties must centers c1, c2, …, ck have when distortion is minimized?

(1) xi must be encoded by its nearest center

(2) Each Center must be at the centroid of points it owns.

Improving a suboptimal configuration…

What can be changed for centers c1, c2, …, ck when distortion is not minimized?

(1) Change the encoding so that xi is encoded by its nearest center

(2) Set each Center to the centroid of the points it owns.

There's no point applying either operation twice in succession.

But it can be profitable to alternate.

…And that's K-means!

Easy to prove this procedure will terminate in a state at which neither (1) nor (2) changes the configuration. Why?


There are only a finite number of ways of partitioning R records into k groups.

So there are only a finite number of possible configurations in which all Centers are the centroids of the points they own.

If the configuration changes on an iteration, it must have improved the distortion.

So each time the configuration changes, it must go to a configuration it's never been to before.

So if it tried to go on forever, it would eventually run out of configurations.

Will we find the optimal configuration?

• Not necessarily.

• Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here…)

Trying to find good optima

• Idea 1: Be careful about where you start

• Idea 2: Do many runs of k-means, each from a different random start configuration

• Many other ideas floating around.

Neat trick:

Place first center on top of a randomly chosen datapoint.

Place second center on the datapoint that's as far away as possible from the first center.

…

Place j'th center on the datapoint that's as far away as possible from the closest of Centers 1 through j-1.

…
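A sketch of that trick in NumPy (my own code; the name farthest_point_init is an assumption). Note this is the greedy farthest-first heuristic described above, not the randomized k-means++ rule:

```python
import numpy as np

def farthest_point_init(X, k, rng=None):
    """First center is a random datapoint; each later center is the datapoint
    farthest from its closest already-chosen center."""
    rng = np.random.default_rng() if rng is None else rng
    centers = [X[rng.integers(len(X))]]               # first center: random datapoint
    for _ in range(1, k):
        # distance from every point to its closest chosen center
        d = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2),
            axis=1,
        )
        centers.append(X[np.argmax(d)])               # farthest such point becomes a center
    return np.array(centers)

# Example: pick 5 starting centers for the kmeans() sketch shown earlier
X = np.random.default_rng(0).normal(size=(200, 2))
print(farthest_point_init(X, k=5))
```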


Common uses of K-means

• Often used as an exploratory data analysis tool

• In one-dimension, a good way to quantize real-valued variables into k non-uniform buckets

• Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)

• Also used for choosing color palettes on old fashioned graphical display devices!

• Used to initialize clusters for the EM algorithm!!!

Back to Estimating GMMs

Recall: Maximum Likelihood for the GMM

• The log likelihood function takes the form

• Note: sum over components appears inside the log

• There is no closed form solution for maximum likelihood

• However, with labeled data, the story is different

Bishop, 2003

SHOW THIS ON THE BOARD!!

Latent Variable View of EM

• Binary latent variables describing which component generated each data point

• If we knew the values for the latent variables, we would maximize the complete-data log likelihood

which gives a trivial closed-form solution (fit each component to the corresponding set of data points)

• We don’t know the values of the latent variables

• However, for given parameter values we can compute the expected values of the latent variables

Bishop, 2003
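For reference, in Bishop's usual notation with a 1-of-K binary coding \(z_{nk}\) for the latent variables (the slide's own formula was a figure, so this is a reconstruction of the standard expression), the complete-data log likelihood referred to above is:

$$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$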

EM – Latent Variable Viewpoint

• To compute the expected values of the latent variables, given an observation x, we need to know their distribution.

• P(Z|X) = P(X,Z) / P(X) - distribution of latent variables given x

• Recall:

p(x,z)

Expected Value of Latent Variable

Bishop, 2003

FOR INTERPRETATION OF THIS QUANTITY WE AGAIN ARE GOING TO THE BOARD!!


Expected Complete-Data Log Likelihood

• Suppose we make a guess for the parameter values (means, covariances and mixing coefficients)

• Use these to evaluate the responsibilities

• Consider the expected complete-data log likelihood

$$\mathbb{E}_{\mathbf{Z}}\bigl[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})\bigr] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$$

where the responsibilities \(\gamma(z_{nk})\) (the ownership weights) are computed using the current parameter values

• We are implicitly 'filling in' latent variables with our best guess

• Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results

Bishop, 2003

Expected Complete-Data Log Likelihood

• To summarize what we just did, we replaced z_nk, an unknown discrete value (0 or 1),

• with γ(z_nk), a known continuous value between 0 and 1

Bishop, 2003

Posterior Probabilities

• We can think of the mixing coefficients as prior probabilities for the components

• For a given value of x we can evaluate the corresponding posterior probabilities, called responsibilities

• These are given from Bayes' theorem by

$$\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$

Bishop, 2003
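In code, the Bayes'-theorem computation of the responsibilities looks like this (a sketch assuming SciPy; the function name responsibilities is mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, weights, means, covs):
    """Posterior probability (responsibility) of each component for each point:
    gamma_nk = pi_k N(x_n|mu_k,Sigma_k) / sum_j pi_j N(x_n|mu_j,Sigma_j)."""
    # unnormalized posteriors: one column per component
    dens = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    return dens / dens.sum(axis=1, keepdims=True)
```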

Posterior Probabilities (colour coded)

Bishop, 2003

EM Algorithm for GMM

E step (ownership weights):

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$

M step (means, covariances, mixing probabilities), with \(N_k = \sum_{n} \gamma(z_{nk})\):

$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, \mathbf{x}_n, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathsf T}, \qquad \pi_k = \frac{N_k}{N}$$
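Putting the E and M steps together gives the following sketch of EM for a GMM (my own illustration in NumPy/SciPy, not the lecture's code; the random initialization and the small ridge added to each covariance are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, rng=None):
    """EM for a Gaussian mixture: alternate the E step (responsibilities) and the
    M step (weighted means, covariances and mixing weights)."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    # Simple initialization: random datapoints as means, identity covariances, uniform weights
    means = X[rng.choice(N, size=K, replace=False)]
    covs = np.array([np.eye(D)] * K)
    weights = np.full(K, 1.0 / K)

    for _ in range(n_iters):
        # E step: ownership weights gamma (N x K)
        dens = np.column_stack([
            weights[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: weighted sample means, covariances, and mixing probabilities
        Nk = gamma.sum(axis=0)                          # effective counts
        means = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            covs[k] += 1e-6 * np.eye(D)                 # keep covariance well-conditioned
        weights = Nk / N

    return weights, means, covs

# Example usage on any (N, 2) array X, e.g. data from the sampling sketch above
X = np.random.default_rng(2).normal(size=(300, 2))
w, mu, C = em_gmm(X, K=3)
print(w, mu, sep="\n")
```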

Copyright © 2001, Andrew W. Moore

Gaussian Mixture Example: Start


Copyright © 2001, Andrew W. Moore

After first iteration

Copyright © 2001, Andrew W. Moore

After 2nd iteration

Copyright © 2001, Andrew W. Moore

After 3rd iteration

Copyright © 2001, Andrew W. Moore

After 4th iteration

Copyright © 2001, Andrew W. Moore

After 5th iteration

Copyright © 2001, Andrew W. Moore

After 6th iteration


Copyright © 2001, Andrew W. Moore

After 20th iteration

Homework

4) Change your code for generating N points from a single multivariate Gaussian to instead generate N points from a mixture of Gaussians. Assume K Gaussians, each of which is specified by a mixing parameter 0<=p_i<=1, a 2x1 mean vector mu_i, and a 2x2 covariance matrix C_i.

5) Write code to perform the K-means algorithm, given a set of N data points and a number K of desired clusters. You can either start the algorithm with random cluster centers, or else try something smarter that you can think up.

6) Write a program to estimate the parameters of a mixture of Gaussians. At the heart of it will be code that looks a lot like the code you wrote last time to estimate MLE parameters of a multivariate Gaussian, except now you are computing weighted sample means, weighted sample covariances, and the mixing parameters for each of K Gaussian components [this is the "M" step of EM]. You will also estimate the ownership weights for each point in the "E" step. Starting with N sample points and a number K of desired Gaussian components, use EM to estimate the K mixing weights p_i, K 2x1 mean vectors mu_i, and K 2x2 covariance matrices C_i. To initialize, you could use a random start (say, a random selection of K mean vectors, identity matrices for each covariance, and 1/K for each mixing weight p_i). Or, you could first run k-means to find a more appropriate set of cluster centers to use as initial mean vectors.