
CSC 311: Introduction to Machine Learning
Lecture 10 - k-Means and EM Algorithm

Roger Grosse Chris Maddison Juhan Bae Silviu Pitis

University of Toronto, Fall 2020


Overview

In the last two lectures, we covered PCA, Autoencoders, and Matrix Factorization, all unsupervised learning algorithms.

- Each algorithm can be used to approximate high-dimensional data using some lower-dimensional form.

Those methods made an interesting assumption: that the data depends on some latent variables that are never observed. Such models are called latent variable models.

- For PCA, these correspond to the code vectors (representation).

Today's lecture:

- First, introduce K-means, a simple algorithm for clustering, i.e. grouping data points into clusters
- Then, reformulate clustering as a latent variable model and apply the EM algorithm


Clustering

Sometimes the data form clusters, where samples within a cluster are similar to each other, and samples in different clusters are dissimilar.

Such a distribution is multimodal, since it has multiple modes, or regions of high probability mass.

Grouping data points into clusters, with no observed labels, is called clustering. It is an unsupervised learning technique.

E.g. clustering machine learning papers based on topic (deep learning, Bayesian models, etc.)

- But topics are never observed (unsupervised).


Clustering problem

Assume the data {x^(1), ..., x^(N)} lives in a Euclidean space, x^(n) ∈ R^D.

Assume each data point belongs to one of K clusters.

Assume the data points from the same cluster are similar, i.e. close in Euclidean distance.

How can we identify those clusters, i.e. the data points that belong to each cluster? Let's formulate this as an optimization problem.


K-means Objective

[Figure: a cloud of data points x^(n) grouped around cluster centers m_k, with assignments r_k^(n) = 1 for the assigned cluster]

K-means Objective: Find cluster centers {m_k}_{k=1}^K and assignments {r^(n)}_{n=1}^N to minimize the sum of squared distances of data points {x^(n)} to their assigned centers.

- Data sample n = 1, ..., N: x^(n) ∈ R^D (observed)
- Cluster center k = 1, ..., K: m_k ∈ R^D (not observed)
- Responsibilities (cluster assignment for sample n): r^(n) ∈ R^K, 1-of-K encoding (not observed)


K-means Objective

K-means Objective: Find cluster centers {m_k}_{k=1}^K and assignments {r^(n)}_{n=1}^N to minimize the sum of squared distances of data points {x^(n)} to their assigned centers.

Mathematically:

min_{{m_k}, {r^(n)}} J({m_k}, {r^(n)}) = min_{{m_k}, {r^(n)}} Σ_{n=1}^N Σ_{k=1}^K r_k^(n) ‖m_k − x^(n)‖²

where r_k^(n) = I[x^(n) is assigned to cluster k], i.e. r^(n) = [0, ..., 1, ..., 0]^T.

Finding an optimal solution is an NP-hard problem!
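To make the objective concrete, here is a minimal NumPy sketch of evaluating J given centers and one-hot assignments (the function name and array conventions are ours, not from the slides):

```python
import numpy as np

def kmeans_objective(X, means, R):
    """J({m_k}, {r^(n)}): sum of squared distances from each point to its assigned center.

    X: (N, D) data, means: (K, D) centers, R: (N, K) one-hot assignment matrix.
    """
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
    return float((R * dists).sum())
```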


K-means Objective

Optimization problem:

min_{{m_k}, {r^(n)}} Σ_{n=1}^N Σ_{k=1}^K r_k^(n) ‖m_k − x^(n)‖²

where each term r_k^(n) ‖m_k − x^(n)‖² is the distance between x^(n) and its assigned cluster center.

Since r_k^(n) = I[x^(n) is assigned to cluster k], i.e. r^(n) = [0, ..., 1, ..., 0]^T, the inner sum is over K terms but only one of them is non-zero.

E.g. say sample x^(n) is assigned to cluster k = 3; then r^(n) = [0, 0, 1, 0, ...] and

Σ_{k=1}^K r_k^(n) ‖m_k − x^(n)‖² = ‖m_3 − x^(n)‖²


How to Optimize? Alternating Minimization

Optimization problem:

min_{{m_k}, {r^(n)}} Σ_{n=1}^N Σ_{k=1}^K r_k^(n) ‖m_k − x^(n)‖²

The problem is hard when minimizing jointly over the parameters {m_k}, {r^(n)}.

But note that if we fix one and minimize over the other, the problem becomes easy.

This doesn't guarantee the same solution, though!


How to Optimize? Alternating Minimization

Optimization problem:

min_{{m_k}, {r^(n)}} Σ_{n=1}^N Σ_{k=1}^K r_k^(n) ‖m_k − x^(n)‖²

Note:

- If we fix the centers {m_k}, then we can easily find the optimal assignments {r^(n)} for each sample n:

  min_{r^(n)} Σ_{k=1}^K r_k^(n) ‖m_k − x^(n)‖²

- Assign each point to the cluster with the nearest center:

  r_k^(n) = 1 if k = argmin_j ‖x^(n) − m_j‖², and 0 otherwise

- E.g. if x^(n) is assigned to cluster k̂, then r^(n) = [0, 0, ..., 1, ..., 0]^T, with only the k̂-th entry equal to 1.
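This assignment step can be written compactly; a minimal NumPy sketch (names are illustrative, not from the course code):

```python
import numpy as np

def assignment_step(X, means):
    """Optimal one-hot responsibilities for fixed centers: nearest-center assignment."""
    dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
    k_hat = dists.argmin(axis=1)                                    # index of the nearest center
    R = np.zeros_like(dists)
    R[np.arange(len(X)), k_hat] = 1.0                               # 1-of-K encoding
    return R
```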


Alternating Minimization

Likewise, if we fix the assignments {r^(n)}, then we can easily find the optimal centers {m_k}.

- Set each cluster's center to the average of its assigned data points. For l = 1, 2, ..., K:

  0 = ∂/∂m_l Σ_{n=1}^N Σ_{k=1}^K r_k^(n) ‖m_k − x^(n)‖² = 2 Σ_{n=1}^N r_l^(n) (m_l − x^(n))

  ⟹ m_l = Σ_n r_l^(n) x^(n) / Σ_n r_l^(n)

Let's alternate between minimizing J({m_k}, {r^(n)}) with respect to {m_k} and {r^(n)}.

This is called alternating minimization.
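A one-line sketch of this refitting update (our own helper, assuming no cluster ends up empty):

```python
import numpy as np

def refitting_step(X, R):
    """Optimal centers for fixed responsibilities: m_l = sum_n r_l^(n) x^(n) / sum_n r_l^(n)."""
    # Works for hard (one-hot) assignments and, later in the lecture, for soft responsibilities too.
    return (R.T @ X) / R.sum(axis=0)[:, None]
```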


K-means Algorithm

High-level overview of the algorithm:

Initialization: randomly initialize cluster centers.

The algorithm iteratively alternates between two steps:

- Assignment step: Assign each data point to the closest cluster
- Refitting step: Move each cluster center to the mean of the data assigned to it

[Figure: assignments (left) and refitted means (right)]


Figure from Bishop. Simple demo: http://syskall.com/kmeans.js/


The K-means Algorithm

Initialization: Set the K cluster means m_1, ..., m_K to random values.

Repeat until convergence (until assignments do not change):

- Assignment: Optimize J w.r.t. {r}: each data point x^(n) is assigned to the nearest center,

  k̂^(n) = argmin_k ‖m_k − x^(n)‖²,

  with responsibilities (1-hot or 1-of-K encoding)

  r_k^(n) = I[k̂^(n) = k] for k = 1, ..., K

- Refitting: Optimize J w.r.t. {m}: each center is set to the mean of the data assigned to it,

  m_k = Σ_n r_k^(n) x^(n) / Σ_n r_k^(n)
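Putting the two steps together, here is a minimal runnable sketch of the full loop (a toy implementation under our own naming, not the course's reference code):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal K-means: X is an (N, D) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    # Initialization: pick K distinct data points as the initial cluster means.
    means = X[rng.choice(N, size=K, replace=False)].astype(float)
    assignments = np.full(N, -1)

    for _ in range(max_iters):
        # Assignment step: assign each point to its nearest center.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        new_assignments = dists.argmin(axis=1)
        if np.array_equal(new_assignments, assignments):
            break  # assignments did not change => converged (at least a local minimum)
        assignments = new_assignments
        # Refitting step: move each center to the mean of its assigned points.
        for k in range(K):
            members = X[assignments == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                means[k] = members.mean(axis=0)
    return means, assignments
```

Because the objective is non-convex (see the local-minima slide below), a common practical safeguard is to rerun with several seeds and keep the solution with the lowest J.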


K-means for Vector Quantization

Figure from Bishop

Given an image, construct a "dataset" of pixels, represented by their RGB pixel intensities.

Run k-means and replace each pixel by its cluster center.
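A rough sketch of the quantization idea, reusing the kmeans function sketched earlier (the H×W×3 image layout and helper names are assumptions on our part):

```python
import numpy as np

def quantize_image(image, K):
    """Vector quantization: replace each pixel's RGB value with its cluster center."""
    H, W, C = image.shape
    pixels = image.reshape(-1, C).astype(float)   # "dataset" of pixels, one row per pixel
    means, assignments = kmeans(pixels, K)        # kmeans() is the sketch from the previous section
    return means[assignments].reshape(H, W, C)
```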


K-means for Image Segmentation

Given an image, construct a "dataset" of pixels, represented by their RGB pixel intensities and grid locations.

Run k-means (with some modifications) to get superpixels.


Questions about K-means

Why does the update set m_k to the mean of the assigned points?

What if we used a different distance measure?

How can we choose the best distance?

How to choose K?

Will it converge?

Hard cases – unequal spreads, non-circular spreads, in-between points


Why K-means Converges

The K-means algorithm reduces the cost at each iteration.

- Whenever an assignment is changed, the sum of squared distances J of data points from their assigned cluster centers is reduced.
- Whenever a cluster center is moved, J is reduced.

Test for convergence: if the assignments do not change in the assignment step, we have converged (to at least a local minimum).

This will always happen after a finite number of iterations, since the number of possible cluster assignments is finite.

[Figure: K-means cost function after each assignment step (blue) and refitting step (red). The algorithm has converged after the third refitting step.]


Local Minima

The objective J is non-convex (so coordinate descent on J is not guaranteed to converge to the global minimum).

There is nothing to prevent k-means from getting stuck at local minima.

We could try many random starting points.

[Figure: a bad local optimum]


Soft K-means

Instead of making hard assignments of data points to clusters, we can make soft assignments. One cluster may have a responsibility of 0.7 for a datapoint and another may have a responsibility of 0.3.

- This allows a cluster to use more information about the data in the refitting step.
- How do we decide on the soft assignments?
- We already saw this in multi-class classification: 1-of-K encoding vs. softmax assignments.


Soft K-means Algorithm

Initialization: Set the K means {m_k} to random values.

Repeat until convergence (measured by how much J changes):

- Assignment: Each data point n is given a soft "degree of assignment" to each cluster mean k, based on the responsibilities

  r_k^(n) = exp[−β ‖m_k − x^(n)‖²] / Σ_j exp[−β ‖m_j − x^(n)‖²]

  ⟹ r^(n) = softmax(−β {‖m_k − x^(n)‖²}_{k=1}^K)

- Refitting: The model parameters (the means) are adjusted to match the sample means of the datapoints they are responsible for:

  m_k = Σ_n r_k^(n) x^(n) / Σ_n r_k^(n)
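A minimal sketch of soft K-means under these updates (our own function; β is the only extra knob):

```python
import numpy as np

def soft_kmeans(X, K, beta, max_iters=100, seed=0):
    """Soft K-means: responsibilities are a softmax over negative scaled squared distances."""
    rng = np.random.default_rng(seed)
    N, _ = X.shape
    means = X[rng.choice(N, size=K, replace=False)].astype(float)

    for _ in range(max_iters):
        # Assignment: r[n, k] proportional to exp(-beta * ||x^(n) - m_k||^2)
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        logits = -beta * dists
        logits -= logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
        R = np.exp(logits)
        R /= R.sum(axis=1, keepdims=True)
        # Refitting: each mean is the responsibility-weighted average of the data.
        means = (R.T @ X) / R.sum(axis=0)[:, None]
    return means, R
```

For very large β each row of R approaches one-hot, consistent with the β → ∞ remark below.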


Questions about Soft K-means

Some remaining issues

How to set β?

Clusters with unequal weight and width?

These aren't straightforward to address with K-means. Instead, in the sequel, we'll reformulate clustering using a generative model.

As β →∞, soft k-Means becomes k-Means! (Exercise)


A Generative View of Clustering

Next: probabilistic formulation of clustering

We need a sensible measure of what it means to cluster the data well

- This makes it possible to judge different methods
- It may help us decide on the number of clusters

An obvious approach is to imagine that the data was produced by a generative model.

- Then we adjust the model parameters using maximum likelihood, i.e. to maximize the probability that it would produce exactly the data we observed.


The Generative Model

We'll be working with the following generative model for data D. Assume a datapoint x is generated as follows:

- Choose a cluster z from {1, ..., K} such that p(z = k) = π_k
- Given z, sample x from a Gaussian distribution N(x | µ_z, I)

This can also be written:

p(z = k) = π_k
p(x | z = k) = N(x | µ_k, I)
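For concreteness, a small sketch of sampling from this generative process (function and argument names are ours):

```python
import numpy as np

def sample_from_model(pi, mu, n_samples, seed=0):
    """Ancestral sampling: z ~ Categorical(pi), then x | z = k ~ N(mu_k, I)."""
    rng = np.random.default_rng(seed)
    K, D = mu.shape
    z = rng.choice(K, size=n_samples, p=pi)           # latent cluster for each sample
    x = mu[z] + rng.standard_normal((n_samples, D))   # add unit-variance Gaussian noise
    return z, x
```

In the clustering setting we would observe only x, never the latent z.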


Clusters from Generative Model

This defines a joint distribution p(z, x) = p(z) p(x | z) with parameters {π_k, µ_k}_{k=1}^K.

The marginal of x is given by p(x) = Σ_z p(z, x).

p(z = k | x) can be computed using Bayes' rule,

p(z = k | x) = p(x | z = k) p(z = k) / p(x),

and tells us the probability that x came from the k-th cluster.


The Generative Model

500 points drawn from a mixture of 3 Gaussians.

a) Samples from p(x | z) b) Samples from the marginal p(x) c) Responsibilities p(z |x)


Maximum Likelihood with Latent Variables

How should we choose the parameters {π_k, µ_k}_{k=1}^K?

Maximum likelihood principle: choose the parameters to maximize the likelihood of the observed data.

We don't observe the cluster assignments z; we only see the data x.

Given data D = {x^(n)}_{n=1}^N, choose the parameters to maximize:

log p(D) = Σ_{n=1}^N log p(x^(n))

We can find p(x) by marginalizing out z:

p(x) = Σ_{k=1}^K p(z = k, x) = Σ_{k=1}^K p(z = k) p(x | z = k)


Gaussian Mixture Model (GMM)

What is p(x)?

p(x) = Σ_{k=1}^K p(z = k) p(x | z = k) = Σ_{k=1}^K π_k N(x | µ_k, I)

This distribution is an example of a Gaussian Mixture Model (GMM), and the π_k are known as the mixing coefficients.

In general, we would have a different covariance for each cluster, i.e. p(x | z = k) = N(x | µ_k, Σ_k). For this lecture, we assume Σ_k = I for simplicity.

If we allow arbitrary covariance matrices, GMMs are universal approximators of densities (if you have enough Gaussians). Even diagonal GMMs are universal approximators.


Visualizing a Mixture of Gaussians – 1D Gaussians

If you fit a Gaussian to data:

Now, we are trying to fit a GMM (with K = 2 in this example):

[Slide credit: K. Kutulakos]


Visualizing a Mixture of Gaussians – 2D Gaussians


Fitting GMMs: Maximum Likelihood

Maximum likelihood objective:

log p(D) = Σ_{n=1}^N log p(x^(n)) = Σ_{n=1}^N log ( Σ_{k=1}^K π_k N(x^(n) | µ_k, I) )

How would you optimize this w.r.t. the parameters {π_k, µ_k}?

- No closed-form solution when we set the derivatives to 0
- Difficult because of the sum inside the log

One option: gradient ascent. Can we do better?

Can we have a closed form update?


Maximum Likelihood

Observation: if we knew z^(n) for every x^(n) (i.e. our dataset was D_complete = {(z^(n), x^(n))}_{n=1}^N), the maximum likelihood problem is easy:

log p(D_complete) = Σ_{n=1}^N log p(z^(n), x^(n))
                  = Σ_{n=1}^N [ log p(x^(n) | z^(n)) + log p(z^(n)) ]
                  = Σ_{n=1}^N Σ_{k=1}^K I[z^(n) = k] ( log N(x^(n) | µ_k, I) + log π_k )


Maximum Likelihood

log p(D_complete) = Σ_{n=1}^N Σ_{k=1}^K I[z^(n) = k] ( log N(x^(n) | µ_k, I) + log π_k )

We have been optimizing something similar for Naive Bayes classifiers.

By maximizing log p(D_complete), we would get:

µ̂_k = Σ_{n=1}^N I[z^(n) = k] x^(n) / Σ_{n=1}^N I[z^(n) = k]   (the class means)

π̂_k = (1/N) Σ_{n=1}^N I[z^(n) = k]   (the class proportions)


Maximum Likelihood

We haven't observed the cluster assignments z^(n), but we can compute p(z^(n) | x^(n)) using Bayes' rule.

Conditional probability (using Bayes' rule) of z given x:

p(z = k | x) = p(z = k) p(x | z = k) / p(x)
             = p(z = k) p(x | z = k) / Σ_{j=1}^K p(z = j) p(x | z = j)
             = π_k N(x | µ_k, I) / Σ_{j=1}^K π_j N(x | µ_j, I)
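A small sketch of this posterior computation for the identity-covariance case, working in log space for numerical stability (names are ours):

```python
import numpy as np

def posterior_responsibilities(x, pi, mu):
    """p(z = k | x) for a GMM with identity covariances, for a single data point x."""
    # log N(x | mu_k, I) up to an additive constant, which cancels in the normalization.
    log_lik = -0.5 * ((x[None, :] - mu) ** 2).sum(axis=1)
    log_post = np.log(pi) + log_lik
    log_post -= log_post.max()          # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```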


Maximum Likelihood

log p(D_complete) = Σ_{n=1}^N Σ_{k=1}^K I[z^(n) = k] ( log N(x^(n) | µ_k, I) + log π_k )

We don't know the cluster assignments I[z^(n) = k], but we know their expectation E[ I[z^(n) = k] | x^(n) ] = p(z^(n) = k | x^(n)).

If we plug in r_k^(n) = p(z^(n) = k | x^(n)) for I[z^(n) = k], we get:

Σ_{n=1}^N Σ_{k=1}^K r_k^(n) ( log N(x^(n) | µ_k, I) + log π_k )

This is still easy to optimize! The solution is similar to what we have seen:

µ̂_k = Σ_{n=1}^N r_k^(n) x^(n) / Σ_{n=1}^N r_k^(n)        π̂_k = Σ_{n=1}^N r_k^(n) / N

Note: this only works if we treat r_k^(n) = π_k N(x^(n) | µ_k, I) / Σ_{j=1}^K π_j N(x^(n) | µ_j, I) as fixed.


How Can We Fit a Mixture of Gaussians?

This motivates the Expectation-Maximization (EM) algorithm, which alternates between two steps:

1. E-step: Compute the posterior probabilities r_k^(n) = p(z^(n) = k | x^(n)) given our current model, i.e. how much we think each cluster is responsible for generating a datapoint.

2. M-step: Use the equations on the last slide to update the parameters, assuming the r_k^(n) are held fixed: change the parameters of each Gaussian to maximize the probability that it would generate the data it is currently responsible for.

[Figure: datapoints shared between two Gaussians, with soft responsibilities such as .95/.05 and .5/.5]


EM Algorithm for GMM

Initialize the means µ̂_k and mixing coefficients π̂_k.

Iterate until convergence:

- E-step: Evaluate the responsibilities r_k^(n) given current parameters:

  r_k^(n) = p(z^(n) = k | x^(n)) = π̂_k N(x^(n) | µ̂_k, I) / Σ_{j=1}^K π̂_j N(x^(n) | µ̂_j, I)
          = π̂_k exp{−(1/2) ‖x^(n) − µ̂_k‖²} / Σ_{j=1}^K π̂_j exp{−(1/2) ‖x^(n) − µ̂_j‖²}

- M-step: Re-estimate the parameters given current responsibilities:

  µ̂_k = (1/N_k) Σ_{n=1}^N r_k^(n) x^(n),    π̂_k = N_k / N,    with N_k = Σ_{n=1}^N r_k^(n)

- Evaluate the log-likelihood and check for convergence:

  log p(D) = Σ_{n=1}^N log ( Σ_{k=1}^K π̂_k N(x^(n) | µ̂_k, I) )
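Putting the E- and M-steps together, a compact runnable sketch for the identity-covariance GMM (our own toy implementation, not the course's reference code):

```python
import numpy as np

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a GMM with identity covariances, following the E-step / M-step updates above."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, size=K, replace=False)].astype(float)  # initialize means at data points
    pi = np.full(K, 1.0 / K)                                    # uniform mixing coefficients
    log_liks = []

    for _ in range(n_iters):
        # E-step: responsibilities r[n, k] = p(z^(n) = k | x^(n)), computed in log space.
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)        # (N, K)
        log_joint = np.log(pi)[None, :] - 0.5 * sq_dists - 0.5 * D * np.log(2 * np.pi)
        log_max = log_joint.max(axis=1, keepdims=True)
        r = np.exp(log_joint - log_max)
        log_px = np.log(r.sum(axis=1, keepdims=True)) + log_max               # log p(x^(n))
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate means and mixing coefficients from the responsibilities.
        Nk = r.sum(axis=0)                 # effective number of points in each cluster
        mu = (r.T @ X) / Nk[:, None]
        pi = Nk / N

        # Track log p(D) = sum_n log sum_k pi_k N(x^(n) | mu_k, I) to monitor convergence.
        log_liks.append(float(log_px.sum()))
    return pi, mu, log_liks
```

The tracked log-likelihood should be non-decreasing across iterations, which is a useful sanity check on an implementation.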


What just happened: A review

The maximum likelihood objective Σ_{n=1}^N log p(x^(n)) was hard to optimize.

The complete-data likelihood objective was easy to optimize:

Σ_{n=1}^N log p(z^(n), x^(n)) = Σ_{n=1}^N Σ_{k=1}^K I[z^(n) = k] ( log N(x^(n) | µ_k, I) + log π_k )

We don't know the z^(n)'s (they are latent), so we replaced I[z^(n) = k] with the responsibilities r_k^(n) = p(z^(n) = k | x^(n)).

That is: we replaced I[z^(n) = k] with its expectation under p(z^(n) | x^(n)) (E-step).


What just happened: A review

We ended up with the expected complete-data log-likelihood:

Σ_{n=1}^N E_{p(z^(n) | x^(n))}[ log p(z^(n), x^(n)) ] = Σ_{n=1}^N Σ_{k=1}^K r_k^(n) ( log N(x^(n) | µ_k, I) + log π_k )

which we maximized over the parameters {π_k, µ_k}_k (M-step).

The EM algorithm alternates between:

- The E-step: computing the r_k^(n) = p(z^(n) = k | x^(n)) (i.e. the expectations E[ I[z^(n) = k] | x^(n) ]) given the current model parameters π_k, µ_k
- The M-step: updating the model parameters π_k, µ_k to optimize the expected complete-data log-likelihood


Relation to k-Means

The K-Means Algorithm:

1. Assignment step: Assign each data point to the closest cluster
2. Refitting step: Move each cluster center to the average of the data assigned to it

The EM Algorithm:

1. E-step: Compute the posterior probability over z given our current model
2. M-step: Maximize the probability that it would generate the data it is currently responsible for

Can you find the similarities between the soft k-Means algorithm and the EM algorithm with shared covariance (1/β) I?

Both rely on alternating optimization methods and can suffer from bad local optima.


Further Discussion

We assumed the covariance of each Gaussian was I to simplify the math. This assumption can be removed, allowing clusters to have different spatial extents. The resulting algorithm is still very simple.

Possible problems with the maximum likelihood objective:

- Singularities: arbitrarily large likelihood when a Gaussian explains a single point with variance shrinking to zero
- Non-convexity

EM is more general than what was covered in this lecture. Here, the EM algorithm is used to find the optimal parameters under a GMM.


GMM Recap

A probabilistic view of clustering: each cluster corresponds to a different Gaussian.

Model using latent variables.

General approach: we can replace the Gaussian with other distributions (continuous or discrete).

More generally, mixture models are very powerful models, i.e. universal distribution approximators.

Optimization is done using the EM algorithm.
