Expectation-Maximization Nuno Vasconcelos ECE Department, UCSD


Page 1: Expectation-Maximization

Expectation-Maximization

Nuno Vasconcelos, ECE Department, UCSD

Page 2: Expectation-Maximization

Recall
last class, we will have "Cheetah Day"
what:
• 4 teams, average of 6 people
• each team will write a report on the 4 cheetah problems
• each team will give a presentation on one of the problems
I am waiting to hear on the teams

Page 3: Expectation-Maximization

Plan for today
we have been talking about mixture models
last time we introduced the basics of EM
today we study the application of EM for ML estimation of mixture parameters
next class:
• proof that EM maximizes the likelihood of the incomplete data

Page 4: Expectation-Maximization

mixture model
two types of random variables:
• Z – hidden state variable
• X – observed variable
observations are sampled with a two-step procedure:
• a state (class) is sampled from the distribution of the hidden variable: P_Z(z) → z_i
• an observation is drawn from the class-conditional density for the selected state: P_X|Z(x|z_i) → x_i
(figure: the hidden state z_i selects among the class-conditional densities P_X|Z(x|0), P_X|Z(x|1), …, P_X|Z(x|K))
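To make the two-step procedure concrete, here is a minimal sketch in Python (not from the lecture; the three-component 1-D Gaussian mixture and all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical mixture: C = 3 Gaussian components in 1-D
pi = np.array([0.5, 0.3, 0.2])     # component "weights" P_Z(z = c)
mu = np.array([-2.0, 0.0, 3.0])    # class-conditional means
sigma = np.array([0.5, 1.0, 0.8])  # class-conditional standard deviations

def sample_mixture(n):
    # step 1: sample the hidden state z_i from P_Z(z)
    z = rng.choice(len(pi), size=n, p=pi)
    # step 2: draw x_i from the class-conditional density P_X|Z(x|z_i)
    x = rng.normal(mu[z], sigma[z])
    return x, z

x, z = sample_mixture(1000)  # in practice only x is observed; z stays hidden
```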

Page 5: Expectation-Maximization

mixture model
the sample consists of pairs (x_i, z_i): D = {(x_1,z_1), …, (x_n,z_n)}, but we never get to see the z_i
the pdf of the observed data is

$$P_X(x) = \sum_{c=1}^{C} \pi_c \, P_{X|Z}(x|c)$$

where C is the number of mixture components, π_c is the component "weight", and P_X|Z(x|c) is the c-th "mixture component"
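In code, this observed-data density is just the weighted sum of the component densities; a small sketch, reusing the hypothetical Gaussian components above:

```python
import numpy as np
from scipy.stats import norm

def mixture_pdf(x, pi, mu, sigma):
    # P_X(x) = sum_c pi_c * P_X|Z(x|c), here with Gaussian components
    x = np.atleast_1d(x)[:, None]             # shape (n, 1)
    comps = norm.pdf(x, loc=mu, scale=sigma)  # shape (n, C)
    return comps @ pi                         # shape (n,)
```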

Page 6: Expectation-Maximization

The basics of EM
as usual, we start from an iid sample D = {x_1, …, x_N}
the goal is to find the parameters Ψ* that maximize the likelihood with respect to D
the set D_c = {(x_1,z_1), …, (x_N,z_N)} is called the complete data
the set D = {x_1, …, x_N} is called the incomplete data

Page 7: Expectation-Maximization

Learning with incomplete data (EM)
the basic idea is quite simple
1. start with an initial parameter estimate Ψ^(0)
2. E-step: given the current parameters Ψ^(i) and the observations in D, "guess" what the values of the z_i are
3. M-step: with the new z_i we have a complete data problem; solve this problem for the parameters, i.e. compute Ψ^(i+1)
4. go to 2.
this can be summarized as a loop that alternates between an E-step, which fills in the class assignments z_i, and an M-step, which estimates the parameters (see the sketch below)
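A minimal sketch of this loop in Python (generic structure only, not code from the lecture; `e_step` and `m_step` stand for the model-specific computations derived in the following slides):

```python
def em(x, psi0, e_step, m_step, n_iters=100):
    """Generic EM loop: alternate between filling in assignments and re-estimating parameters."""
    psi = psi0                 # 1. start with an initial parameter estimate Psi^(0)
    for _ in range(n_iters):   # (a convergence test on the likelihood could replace the fixed count)
        h = e_step(x, psi)     # 2. E-step: "guess" the hidden assignments given Psi^(i)
        psi = m_step(x, h)     # 3. M-step: solve the resulting complete-data problem for Psi^(i+1)
    return psi
```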

Page 8: Expectation-Maximization

Classification-maximization
C-step:
• given estimates Ψ^(i) = {Ψ_1^(i), …, Ψ_C^(i)}
• determine z_i by the BDR
• split the training set according to the labels z_i:
  D_1 = {x_i | z_i = 1}, D_2 = {x_i | z_i = 2}, …, D_C = {x_i | z_i = C}
M-step:
• as before, determine the parameters of each class independently

Page 9: Expectation-Maximization

For Gaussian mixtures
C-step:
• determine the labels z_i by the BDR (see the expressions sketched below)
• split the training set according to the labels z_i:
  D_1 = {x_i | z_i = 1}, D_2 = {x_i | z_i = 2}, …, D_C = {x_i | z_i = C}
M-step:
• re-estimate the parameters of each class independently from its D_j (see below)
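In standard notation (where $\mathcal{G}(x;\mu,\Sigma)$ denotes the Gaussian density with mean $\mu$ and covariance $\Sigma$), the C-step and M-step expressions are

$$z_i = \arg\max_j \left[\, \log \mathcal{G}\!\left(x_i; \mu_j, \Sigma_j\right) + \log \pi_j \,\right]$$

and the per-class ML estimates

$$\pi_j = \frac{|D_j|}{n}, \qquad \mu_j = \frac{1}{|D_j|}\sum_{x_i \in D_j} x_i, \qquad \Sigma_j = \frac{1}{|D_j|}\sum_{x_i \in D_j} (x_i - \mu_j)(x_i - \mu_j)^T$$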

Page 10: Expectation-Maximization

K-means
when the covariances are identity and the priors are uniform:
C-step:
• assign each point to its closest mean, i.e. split the training set according to the labels z_i:
  D_1 = {x_i | z_i = 1}, D_2 = {x_i | z_i = 2}, …, D_C = {x_i | z_i = C}
M-step:
• recompute the mean of each D_j
this is the K-means algorithm, aka the generalized Lloyd algorithm, aka the LBG algorithm in the vector quantization literature:

• “assign points to the closest mean; recompute the means”
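A minimal K-means sketch along these lines (a generic illustration, not the lecture's code; empty clusters are not handled):

```python
import numpy as np

def kmeans(x, C, n_iters=50, seed=0):
    """K-means as CM with identity covariances and uniform priors; x has shape (n, d)."""
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), size=C, replace=False)]  # initialize means at random data points
    for _ in range(n_iters):
        # C-step: assign each point to the closest mean
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (n, C) squared distances
        z = d2.argmin(axis=1)
        # M-step: recompute each mean from the points assigned to it
        mu = np.stack([x[z == j].mean(axis=0) for j in range(C)])
    return mu, z
```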

Page 11: Expectation-Maximization

The Q function is defined as
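(written here in the standard form, which matches the description below)

$$Q\!\left(\Psi; \Psi^{(n)}\right) = E_{Z}\!\left[\, \log P_{X,Z}(D, Z; \Psi) \;\middle|\; X = D;\ \Psi^{(n)} \,\right]$$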

and is a bit tricky:
• it is the expected value of the likelihood with respect to the complete data (joint X and Z)
• given that we observed the incomplete data (X = D)
• note that the likelihood is a function of Ψ (the parameters that we want to determine)
• but to compute the expected value we need to use the parameter values from the previous iteration (because we need a distribution for Z|X)

the EM algorithm is, therefore, as follows

Page 12: Expectation-Maximization

Expectation-maximization
E-step:
• given estimates Ψ^(n) = {Ψ_1^(n), …, Ψ_C^(n)}
• compute the expected log-likelihood of the complete data
M-step:
• find the parameter set that maximizes this expected log-likelihood
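Written out in standard form, the two steps are:

E-step: compute $Q(\Psi; \Psi^{(n)}) = E_Z\!\left[\log P_{X,Z}(D, Z; \Psi) \mid X = D;\ \Psi^{(n)}\right]$

M-step: set $\Psi^{(n+1)} = \arg\max_{\Psi} Q\!\left(\Psi; \Psi^{(n)}\right)$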

let's make this more concrete by looking at the mixture case

Page 13: Expectation-Maximization

Expectation-maximization
to derive an EM algorithm you need to do the following:
1. write down the likelihood of the COMPLETE data
2. E-step: write down the Q function, i.e. its expectation given the observed data
3. M-step: solve the maximization, deriving a closed-form solution if there is one

Page 14: Expectation-Maximization

EM for mixtures (step 1)
the first thing we always do in an EM problem is
• compute the likelihood of the COMPLETE data
a very neat trick to use when z is discrete (classes):
• instead of using z in {1, 2, ..., C}
• use a binary vector of size equal to the number of classes
• what was z = j in the z in {1, 2, ..., C} notation now becomes z = (0, …, 0, 1, 0, …, 0)^T, with a 1 in position j

Page 15: Expectation-Maximization

EM for mixtures (step 1)
we can now write the complete data likelihood as a product over the components (see the expressions below)
for example, if z = k in the z in {1, 2, ..., C} notation, it reduces to the k-th factor
the advantage is that the log-likelihood becomes LINEAR in the components z_j !!!
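Written out in the standard assignment-vector notation, the expressions are

$$P_{X,Z}(x, z) = \prod_{j=1}^{C} \left[ \pi_j\, P_{X|Z}(x|j; \Psi_j) \right]^{z_j}$$

which, for $z = k$ (i.e. $z_k = 1$ and all other $z_j = 0$), reduces to $\pi_k\, P_{X|Z}(x|k; \Psi_k)$, and whose log,

$$\log P_{X,Z}(x, z) = \sum_{j=1}^{C} z_j \left[ \log \pi_j + \log P_{X|Z}(x|j; \Psi_j) \right],$$

is linear in the components $z_j$.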

Page 16: Expectation-Maximization

The assignment vector trick
this is similar to something that we used already: the Bernoulli random variable

$$P_Z(z) = \begin{cases} p, & z = 1 \\ 1-p, & z = 0 \end{cases}$$

can be written as

$$P_Z(z) = p^{z}(1-p)^{1-z}, \qquad z \in \{0, 1\}$$

or, using $z \in \left\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\}$ instead of $z \in \{0, 1\}$, as

$$P_Z(z) = p^{z_1}(1-p)^{z_2}$$

Page 17: Expectation-Maximization

EM for mixtures (step 1)
for the complete iid dataset D_c = {(x_1,z_1), …, (x_N,z_N)} the likelihood factors over the samples,
and the complete data log-likelihood is
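(in the notation of the previous slides)

$$\log P_{X,Z}(D_c; \Psi) = \sum_{i=1}^{N} \sum_{j=1}^{C} z_{ij} \left[ \log \pi_j + \log P_{X|Z}(x_i|j; \Psi_j) \right]$$

where $z_{ij}$ is the $j$-th component of the assignment vector $z_i$.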

note that any term which does not depend on z simply becomes a constant for the expectation that we have to compute in the E-step

Page 18: Expectation-Maximization

Expectation-maximization
to derive an EM algorithm you need to do the following:
1. write down the likelihood of the COMPLETE data
2. E-step: write down the Q function, i.e. its expectation given the observed data
3. M-step: solve the maximization, deriving a closed-form solution if there is one
important E-step advice:
• do not compute terms that you do not need
• at the end of the day we only care about the parameters
• terms of Q that do not depend on the parameters are useless, e.g. in
  Q = f(z, Ψ) + log(sin z)
  the expected value of log(sin z) appears to be difficult, and is completely unnecessary, since it is dropped in the M-step

Page 19: Expectation-Maximization

EM for mixtures (step 2)
once we have the complete data likelihood, we take its expectation given the observed data;
i.e. to compute the Q function we only need to compute the expected values E[z_ij | x_i; Ψ^(n)]
note that this expectation can only be computed because we use Ψ^(n)
note that the Q function will be a function of both Ψ and Ψ^(n)

Page 20: Expectation-Maximization

EM for mixtures (step 2)
since z_ij is binary and only depends on x_i, its expected value is the posterior probability that x_i belongs to class j
the E-step reduces to computing the posterior probability of each point under each class!
defining h_ij as this posterior, the Q function follows by substitution (see below)
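In the notation used so far, the standard expressions are

$$E\!\left[z_{ij} \mid x_i; \Psi^{(n)}\right] = P\!\left(z_{ij} = 1 \mid x_i; \Psi^{(n)}\right) \;\equiv\; h_{ij} = P_{Z|X}\!\left(j \mid x_i; \Psi^{(n)}\right)$$

and, substituting h_ij for z_ij in the complete data log-likelihood,

$$Q\!\left(\Psi; \Psi^{(n)}\right) = \sum_{i=1}^{N} \sum_{j=1}^{C} h_{ij} \left[ \log \pi_j + \log P_{X|Z}(x_i|j; \Psi_j) \right]$$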

Page 21: Expectation-Maximization

Expectation-maximization
to derive an EM algorithm you need to do the following:
1. write down the likelihood of the COMPLETE data
2. E-step: write down the Q function, i.e. its expectation given the observed data
3. M-step: solve the maximization, deriving a closed-form solution if there is one

Page 22: Expectation-Maximization

EM vs CM
let's compare this with the CM algorithm
• the C-step assigns each point to the class of largest posterior
• the E-step assigns the point to all classes, with weight given by the posterior
for this reason, EM is said to make "soft assignments"
• it does not commit to any of the classes (unless the posterior is one for that class), i.e. it is less greedy
• it no longer partitions the space into rigid cells; the boundaries are now soft

Page 23: Expectation-Maximization

EM vs CM
what about the M-steps?
• for CM, the updates use the hard assignments
• for EM, the updates use the soft assignments h_ij
these are the same if we threshold the h_ij to make, for each i, max_j h_ij = 1 and all other h_ij = 0
the M-steps are the same up to the difference of assignments

Page 24: Expectation-Maximization

EM for Gaussian mixtures
in summary:
• CM = EM + hard assignments
• CM is a special case, and cannot be better
let's look at the special case of Gaussian mixtures
E-step:
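In standard form, this is the posterior (responsibility) of component j for point x_i, with $\mathcal{G}(x;\mu,\Sigma)$ the Gaussian density:

$$h_{ij} = \frac{\pi_j^{(n)}\, \mathcal{G}\!\left(x_i; \mu_j^{(n)}, \Sigma_j^{(n)}\right)}{\sum_{c=1}^{C} \pi_c^{(n)}\, \mathcal{G}\!\left(x_i; \mu_c^{(n)}, \Sigma_c^{(n)}\right)}$$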

Page 25: Expectation-Maximization

M-step for Gaussian mixtures
M-step:
important note:
• in the M-step, the optimization must be subject to whatever constraints may hold
• in particular, we always have the constraint Σ_j π_j = 1
• as usual, we introduce a Lagrangian

Page 26: Expectation-Maximization

M-step for Gaussian mixtures
Lagrangian:
setting the derivatives to zero:

Page 27: Expectation-Maximization

M-step for Gaussian mixtures
this leads to the update equations
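In standard form, the updates are

$$\pi_j = \frac{1}{N}\sum_{i=1}^{N} h_{ij}, \qquad \mu_j = \frac{\sum_i h_{ij}\, x_i}{\sum_i h_{ij}}, \qquad \Sigma_j = \frac{\sum_i h_{ij}\,(x_i - \mu_j)(x_i - \mu_j)^T}{\sum_i h_{ij}}$$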

comparing to those of CM, they are the same up to hard vs. soft assignments.
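Combining the Gaussian E-step above with these updates, a compact illustrative implementation (a sketch assuming NumPy/SciPy, not the lecture's code) might look like:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, C, n_iters=100, seed=0):
    """EM for a Gaussian mixture: soft assignments (E-step) + weighted ML updates (M-step)."""
    n, d = x.shape
    rng = np.random.default_rng(seed)
    pi = np.full(C, 1.0 / C)                      # uniform initial weights
    mu = x[rng.choice(n, size=C, replace=False)]  # means initialized at random data points
    Sigma = np.stack([np.cov(x.T) + 1e-6 * np.eye(d)] * C)

    for _ in range(n_iters):
        # E-step: h_ij = posterior of component j given x_i
        h = np.stack([pi[j] * multivariate_normal.pdf(x, mu[j], Sigma[j]) for j in range(C)], axis=1)
        h /= h.sum(axis=1, keepdims=True)

        # M-step: weighted versions of the usual Gaussian ML estimates
        nj = h.sum(axis=0)                        # effective number of points per component
        pi = nj / n
        mu = (h.T @ x) / nj[:, None]
        for j in range(C):
            xc = x - mu[j]
            Sigma[j] = (h[:, j, None] * xc).T @ xc / nj[j] + 1e-6 * np.eye(d)

    return pi, mu, Sigma, h
```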

Page 28: Expectation-Maximization

Expectation-maximization
note that the procedure is the same for all mixtures:
1. write down the likelihood of the COMPLETE data
2. E-step: write down the Q function, i.e. its expectation given the observed data
3. M-step: solve the maximization, deriving a closed-form solution if there is one

Page 29: Expectation-Maximization

Expectation-maximization
E.g. for a mixture of exponential distributions

$$P_X(x) = \sum_{c=1}^{C} \pi_c \, \lambda_c e^{-\lambda_c x}$$

1. E-step: write down the Q function, i.e. its expectation given the observed data

$$h_{ij} = P_{Z|X}(j \mid x_i) = \frac{\pi_j \lambda_j e^{-\lambda_j x_i}}{\sum_{c=1}^{C} \pi_c \lambda_c e^{-\lambda_c x_i}}$$

2. M-step: solve the maximization, deriving a closed-form solution if there is one

Page 30: Expectation-Maximization

M-step for exponential mixtures
M-step:

$$\Psi^{(n+1)} = \arg\max_{\Psi} \sum_{ij} h_{ij} \log\!\left( \pi_j \lambda_j e^{-\lambda_j x_i} \right) = \arg\min_{\Psi} \sum_{ij} h_{ij} \left( \lambda_j x_i - \log \lambda_j - \log \pi_j \right)$$

the Lagrangian is

$$L = \sum_{ij} h_{ij} \left( \lambda_j x_i - \log \lambda_j - \log \pi_j \right) + \kappa \left( \sum_j \pi_j - 1 \right)$$

Page 31: Expectation-Maximization

M-step for exponential mixtures

$$L = \sum_{ij} h_{ij} \left( \lambda_j x_i - \log \lambda_j - \log \pi_j \right) + \kappa \left( \sum_j \pi_j - 1 \right)$$

and has its minimum at

$$\frac{\partial L}{\partial \lambda_k} = \sum_{i} h_{ik}\left( x_i - \frac{1}{\lambda_k} \right) = 0 \;\;\Rightarrow\;\; \lambda_k = \frac{\sum_i h_{ik}}{\sum_i h_{ik}\, x_i}$$

$$\frac{\partial L}{\partial \pi_k} = -\sum_{i} \frac{h_{ik}}{\pi_k} + \kappa = 0 \;\;\Rightarrow\;\; \pi_k = \frac{\sum_i h_{ik}}{\kappa}$$

$$\frac{\partial L}{\partial \kappa} = \sum_j \pi_j - 1 = 0 \;\;\Rightarrow\;\; \kappa = \sum_{ij} h_{ij} \;\;\Rightarrow\;\; \pi_k = \frac{\sum_i h_{ik}}{\sum_{ij} h_{ij}}$$
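These updates translate directly into code; a minimal sketch (hypothetical helper consistent with the formulas above, assuming NumPy):

```python
import numpy as np

def m_step_exponential(x, h):
    """M-step for a mixture of exponentials: x has shape (n,), responsibilities h have shape (n, C)."""
    nj = h.sum(axis=0)                       # sum_i h_ik, for each component k
    lam = nj / (h * x[:, None]).sum(axis=0)  # lambda_k = sum_i h_ik / sum_i h_ik x_i
    pi = nj / nj.sum()                       # pi_k = sum_i h_ik / sum_ij h_ij
    return pi, lam
```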

Page 32: Expectation-Maximization

EM algorithm
note, however, that EM is much more general than this recipe for mixtures: it can be applied to any problem where we have observed and hidden random variables
here is a very simple example:
• X observed Gaussian variable, X ~ N(µ,1)
• Z hidden exponential variable
• it is known that Z is independent of X
• sample D = {x_1, …, x_n} of iid observations from X
note that the assumption of independence does not really make sense (why?). how does this affect EM?

Page 33: Expectation-Maximization

Example
toy model: X iid, Z iid, X_i ~ N(µ,1), Z_i ~ λe^{-λz}, X independent of Z
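A sketch of the derivation in this notation (the intermediate steps are filled in here, not taken verbatim from the slides): the complete data log-likelihood separates into an X term and a Z term,

$$\log P_{X,Z}(D, z; \mu, \lambda) = \sum_i \log \mathcal{G}(x_i; \mu, 1) + \sum_i \left( \log \lambda - \lambda z_i \right),$$

and, since Z is independent of X, $E\!\left[z_i \mid x_i; \lambda^{(n)}\right] = E\!\left[z_i; \lambda^{(n)}\right] = 1/\lambda^{(n)}$, so

$$Q\!\left(\mu, \lambda; \mu^{(n)}, \lambda^{(n)}\right) = \sum_i \log \mathcal{G}(x_i; \mu, 1) + n \log \lambda - \frac{n\,\lambda}{\lambda^{(n)}}.$$

Maximizing over $\mu$ gives the sample mean; maximizing over $\lambda$ gives $\lambda^{(n+1)} = \lambda^{(n)}$, so EM never moves $\lambda$ away from its initial value.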

Page 34: Expectation-Maximization

Example
this makes sense:
• since the hidden variables Z are independent of the observed X
• the ML estimate of µ is always the same: the sample mean, with no dependence on the z_i
• the ML estimate of λ is always the initial estimate λ^(0): since the observations are independent of the z_i, we have no information on what λ should be other than the initial guess
note that it is the model that does not make sense, not the EM solution

Page 35: Expectation-Maximization