The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010


Page 1: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

The EM algorithm

LING 572

Fei Xia

Week 10: 03/09/2010

Page 2: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

What is EM?

• EM stands for “expectation maximization”.

• A parameter estimation method: it falls into the general framework of maximum-likelihood estimation (MLE).

• The general form was given in (Dempster, Laird, and Rubin, 1977), although the essence of the algorithm had appeared earlier in various forms.

Page 3: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Outline

• MLE

• EM
  – Basic concepts
  – Main ideas of EM

• Additional slides:
  – EM for PM models
  – An example: the forward-backward algorithm

Page 4: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

MLE

Page 5: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

What is MLE?

• Given
  – A sample X = {X1, …, Xn}
  – A vector of parameters θ

• We define
  – Likelihood of the data: P(X | θ)
  – Log-likelihood of the data: L(θ) = log P(X | θ)

• Given X, find

  θ_ML = argmax_θ L(θ)

Page 6: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

MLE (cont)

• Often we assume that the Xi's are independent and identically distributed (i.i.d.).

• Depending on the form of P(X | θ), solving this optimization problem can be easy or hard.

  θ_ML = argmax_θ L(θ)
       = argmax_θ log P(X | θ)
       = argmax_θ log P(X1, …, Xn | θ)
       = argmax_θ log Π_i P(Xi | θ)
       = argmax_θ Σ_i log P(Xi | θ)

Page 7: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

An easy case

• Assume
  – A coin has a probability p of being heads and 1−p of being tails.
  – Observation: we toss the coin N times; the result is a sequence of Hs and Ts, with m Hs.

• What is the value of p based on MLE, given the observation?

Page 8: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

An easy case (cont)**

  L(θ) = log P(X | θ) = log p^m (1−p)^(N−m) = m log p + (N−m) log(1−p)

  dL/dp = d( m log p + (N−m) log(1−p) )/dp = m/p − (N−m)/(1−p) = 0

  ⇒ p = m/N
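The closed-form answer p = m/N can be checked numerically by maximizing the log-likelihood over a grid of candidate values. This is a small sanity-check sketch (not part of the original slides); the counts N = 10 and m = 7 are made up for illustration.

    import numpy as np

    # Hypothetical observation: N tosses, m heads.
    N, m = 10, 7

    # Log-likelihood L(p) = m*log(p) + (N-m)*log(1-p).
    def log_likelihood(p):
        return m * np.log(p) + (N - m) * np.log(1 - p)

    # Evaluate on a grid of candidate p values and take the maximizer.
    grid = np.linspace(0.001, 0.999, 999)
    p_hat = grid[np.argmax(log_likelihood(grid))]

    print(p_hat)   # ~0.7
    print(m / N)   # 0.7, the closed-form MLE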

Page 9: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

EM: basic concepts

Page 10: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Basic setting in EM

• X is a set of data points: observed data.
• θ is a parameter vector.
• EM is a method to find θ_ML where

  θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ)

• Calculating P(X | θ) directly is hard.
• Calculating P(X, Y | θ) is much simpler, where Y is “hidden” data (or “missing” data).

Page 11: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

The basic EM strategy

• Z = (X, Y)
  – Z: complete data (“augmented data”)
  – X: observed data (“incomplete” data)
  – Y: hidden data (“missing” data)

Page 12: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

The “missing” data Y

• Y need not necessarily be missing in the practical sense of the word.

• It may just be a conceptually convenient technical device to simplify the calculation of P(X | θ).

• There could be many possible Ys.

Page 13: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Examples of EM

               HMM                 PCFG              MT                             Coin toss
X (observed)   sentences           sentences         parallel data                  head-tail sequences
Y (hidden)     state sequences     parse trees       word alignments                coin id sequences
θ              a_ij, b_ijk         P(A → B C)        t(f | e), d(a_j | j, l, m), …  p1, p2, λ
Algorithm      forward-backward    inside-outside    IBM models                     N/A

Page 14: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

The EM algorithm

• Consider a set of starting parameters

• Use these to “estimate” the missing data

• Use “complete” data to update parameters

• Repeat until convergence
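The loop above can be written as a small, generic skeleton. This is a sketch rather than code from the lecture; e_step and m_step are hypothetical callbacks that a concrete model (HMM, PCFG, …) would supply.

    def em(x, theta0, e_step, m_step, max_iter=100, tol=1e-6):
        """Generic EM loop: alternate E- and M-steps until convergence.

        e_step(x, theta) -> (expected sufficient statistics, log-likelihood)
        m_step(stats)    -> new parameters maximizing the Q function
        """
        theta, prev_ll = theta0, float("-inf")
        for _ in range(max_iter):
            stats, ll = e_step(x, theta)    # "estimate" the missing data
            theta = m_step(stats)           # update parameters from the "complete" data
            if ll - prev_ll < tol:          # the log-likelihood is non-decreasing
                break
            prev_ll = ll
        return theta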

Page 15: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Highlights

• General algorithm for missing data problems

• Requires “specialization” to the problem at hand

• Examples of EM:
  – Forward-backward algorithm for HMMs
  – Inside-outside algorithm for PCFGs
  – EM in the IBM MT models

Page 16: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

EM: main ideas

Page 17: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Idea #1: find θ that maximizes the likelihood of training data

  θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ)

Page 18: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Idea #2: find the θt sequence

No analytical solution ⇒ use an iterative approach: find a sequence

  θ^0, θ^1, …, θ^t, …

such that

  l(θ^0) ≤ l(θ^1) ≤ … ≤ l(θ^t) ≤ …

Page 19: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Idea #3: find θ^(t+1) that maximizes a tight lower bound of l(θ) − l(θ^t)

  l(θ) − l(θ^t) ≥ Σ_{i=1..n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]

The right-hand side is a tight lower bound.

Page 20: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Idea #4: find θ^(t+1) that maximizes the Q function**

  θ^(t+1) = argmax_θ Σ_{i=1..n} E_{P(y | x_i, θ^t)} [ log ( P(x_i, y | θ) / P(x_i, y | θ^t) ) ]    ← lower bound of l(θ) − l(θ^t)

          = argmax_θ Σ_{i=1..n} E_{P(y | x_i, θ^t)} [ log P(x_i, y | θ) ]                          ← the Q function

Page 21: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

The Q-function**

• Define the Q-function (a function of θ):

  – θ^t is the current parameter estimate, treated as a constant (vector).
  – θ is the variable (vector) that we wish to adjust.

• The Q function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and θ^t.

  Q(θ, θ^t) = E_{P(Y | X, θ^t)} [ log P(X, Y | θ) ]
            = Σ_{i=1..n} E_{P(y | x_i, θ^t)} [ log P(x_i, y | θ) ]
            = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)

Page 22: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Calculating model expectation in MaxEnt


  E_p[f_j] = Σ_{x∈X, y∈Y} p(x, y) f_j(x, y)
           = Σ_{x∈X, y∈Y} p(x) p(y | x) f_j(x, y)
           ≈ Σ_{x∈X} Σ_{y∈Y} p̃(x) p(y | x) f_j(x, y)
           = Σ_{x∈S} Σ_{y∈Y} p̃(x) p(y | x) f_j(x, y)
           = (1/N) Σ_{i=1..N} Σ_{y∈Y} p(y | x_i) f_j(x_i, y)
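The last line translates directly into code. A minimal sketch (not from the slides), where p_y_given_x(x) returns a dict of P(y | x) under the current model and f_j is one feature function; both names are hypothetical placeholders.

    def model_expectation(samples, p_y_given_x, f_j):
        """Approximate E_p[f_j] = (1/N) * sum_i sum_y p(y | x_i) * f_j(x_i, y)."""
        total = 0.0
        for x in samples:                           # x_1 .. x_N
            for y, prob in p_y_given_x(x).items():  # sum over possible y
                total += prob * f_j(x, y)
        return total / len(samples)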

Page 23: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

The EM algorithm

• Start with an initial estimate θ^0.

• Repeat until convergence
  – E-step: calculate

      Q(θ, θ^t) = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)

  – M-step: find

      θ^(t+1) = argmax_θ Q(θ, θ^t)
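To make the E-step and M-step concrete, here is a small sketch of EM for the coin-toss column in the table on page 13: each observed sequence was generated by one of two coins (head probabilities p1 and p2), the coin identity is the hidden y, and λ is the probability of picking coin 1. The code and the counts are illustrative only, not from the lecture.

    import numpy as np

    # Each pair: (number of heads, number of tosses) for one observed sequence (made up).
    data = [(8, 10), (9, 10), (2, 10), (3, 10), (7, 10)]

    lam, p1, p2 = 0.5, 0.6, 0.4          # initial estimates

    def seq_lik(h, n, p):
        return p ** h * (1 - p) ** (n - h)

    for _ in range(50):
        # E-step: posterior probability that each sequence came from coin 1.
        post = np.array([lam * seq_lik(h, n, p1) /
                         (lam * seq_lik(h, n, p1) + (1 - lam) * seq_lik(h, n, p2))
                         for h, n in data])
        heads = np.array([h for h, n in data], dtype=float)
        tosses = np.array([n for h, n in data], dtype=float)

        # M-step: re-estimate parameters from expected counts.
        lam = post.mean()
        p1 = (post * heads).sum() / (post * tosses).sum()
        p2 = ((1 - post) * heads).sum() / ((1 - post) * tosses).sum()

    print(lam, p1, p2)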

Page 24: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Important classes of EM problem

• Products of multinomial (PM) models
• Exponential families
• Gaussian mixtures
• …

Page 25: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Summary

• EM falls into the general framework of maximum-likelihood estimation (MLE).

• Calculating P(X | θ) directly is hard, so we introduce Y (“hidden” data) and calculate P(X, Y | θ) instead.

• Start with initial estimate, repeat E-step and M-step until convergence.

• Closed-form solutions for some classes of models.

Page 26: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Strengths of EM

• Numerical stability: in every iteration of the EM algorithm, it increases the likelihood of the observed data.

• EM handles parameter constraints gracefully.

Page 27: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Problems with EM

• Convergence can be very slow on some problems and is intimately related to the amount of missing information.

• It is guaranteed to improve the probability of the training corpus, which is different from directly reducing errors.

• It is not guaranteed to reach a global maximum (it can get stuck at local maxima, saddle points, etc.)

  ⇒ The initial estimate is important.

Page 28: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Additional slides

Page 29: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

The EM algorithm for PM models

Page 30: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

PM models

  p(x, y | θ) = Π_{i∈Ω_1} p_i^{C_i(x,y,Ω_1)} × Π_{i∈Ω_2} p_i^{C_i(x,y,Ω_2)} × … × Π_{i∈Ω_m} p_i^{C_i(x,y,Ω_m)}

where (Ω_1, …, Ω_m) is a partition of all the parameters, and for any j

  Σ_{i∈Ω_j} p_i = 1.

(C_i(x, y, Ω_j) is the number of times parameter p_i ∈ Ω_j is used in generating (x, y).)

Page 31: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

HMM is a PM

  p(x, y | θ) = Π_{i,j} p(s_j | s_i)^{c(s_i → s_j, x, y)} × Π_{i,j,k} p(w_k | s_i, s_j)^{c(s_i → s_j, w_k, x, y)}

where

  a_ij  = p(s_j | s_i)        (transition probabilities)
  b_ijk = p(w_k | s_i, s_j)   (emission probabilities)

and the exponents count, in (x, y), the transitions from s_i to s_j and the transitions from s_i to s_j that emit w_k.

Page 32: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

PCFG

• PCFG: each sample point (x, y):
  – x is a sentence
  – y is a possible parse tree for that sentence.

  P(x, y | θ) = Π_{i=1..n} P(A_i → α_i | A_i)

  e.g.  P(x, y | θ) = P(S → NP VP | S) × P(NP → Jim | NP) × P(VP → sleeps | VP)
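As a tiny illustration (my own, not from the slides), the parse-tree probability is just a product over the probabilities of the rules used in the parse; the rule table below is hypothetical.

    # Hypothetical rule probabilities p(A -> alpha | A).
    rules = {("S", ("NP", "VP")): 1.0,
             ("NP", ("Jim",)): 0.3,
             ("VP", ("sleeps",)): 0.2}

    def tree_prob(tree):
        """tree: the list of (lhs, rhs) rules used in the parse."""
        p = 1.0
        for lhs, rhs in tree:
            p *= rules[(lhs, rhs)]
        return p

    # P(x, y | theta) for the parse of "Jim sleeps":
    print(tree_prob([("S", ("NP", "VP")), ("NP", ("Jim",)), ("VP", ("sleeps",))]))  # 0.06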

Page 33: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

PCFG is a PM

  p(x, y | θ) = Π_{A→α} p(A → α)^{c(A→α, x, y)}

  with the constraint, for every nonterminal A:  Σ_α p(A → α) = 1

Page 34: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Q-function for PM

  Q(θ, θ^t) = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) log P(x_i, y | θ)

            = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) Σ_k Σ_j C_j(x_i, y, Ω_k) log p_j

            = ( Σ_{i=1..n} Σ_y P(y | x_i, θ^t) Σ_j C_j(x_i, y, Ω_1) log p_j ) + … +
              ( Σ_{i=1..n} Σ_y P(y | x_i, θ^t) Σ_j C_j(x_i, y, Ω_m) log p_j )

The first bracketed term is Q_1(θ, θ^t): the Q function decomposes into one term per partition Ω_k, and each term can be maximized separately.

Page 35: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Maximizing the Q function

Maximize

  Q_1(θ, θ^t) = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) Σ_j C_j(x_i, y, Ω_1) log p_j

subject to the constraint

  Σ_j p_j = 1.

Use Lagrange multipliers:

  Q̂(θ, θ^t) = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) Σ_j C_j(x_i, y, Ω_1) log p_j − λ ( Σ_j p_j − 1 )

Page 36: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Optimal solution

  Q̂(θ, θ^t) = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) Σ_j C_j(x_i, y, Ω_1) log p_j − λ ( Σ_j p_j − 1 )

  ∂Q̂(θ, θ^t)/∂p_j = Σ_{i=1..n} Σ_y P(y | x_i, θ^t) C_j(x_i, y, Ω_1) / p_j − λ = 0

  ⇒ p_j = (1/λ) Σ_{i=1..n} Σ_y P(y | x_i, θ^t) C_j(x_i, y, Ω_1)

where 1/λ is a normalization factor and the double sum is an expected count.
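In code, this M-step is just count normalization within each partition. A minimal sketch (my own, assuming the E-step has already produced the expected counts):

    def m_step(expected_counts):
        """expected_counts: {partition: {param: expected count}} from the E-step."""
        new_params = {}
        for partition, counts in expected_counts.items():
            total = sum(counts.values())        # the normalization factor (lambda)
            new_params[partition] = {p: c / total for p, c in counts.items()}
        return new_params

    # Example: expected counts 3.2 and 0.8 in one partition -> probabilities 0.8 and 0.2.
    print(m_step({"Omega_1": {"p1": 3.2, "p2": 0.8}}))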

Page 37: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

PCFG example

• Calculate expected counts

• Update parameters

Page 38: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

EM: An example

Page 39: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Forward-backward algorithm

• HMM is a PM (product of multinomials) model.

• The forward-backward algorithm is a special case of the EM algorithm for PM models.

• X (observed data): each data point is an observation sequence O_{1,T}.

• Y (hidden data): the corresponding state sequence X_{1,T}.

• θ (parameters): a_ij, b_ijk, π_i.

Page 40: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Expected counts

  count(s_i → s_j) = Σ_Y P(Y | X, θ) · count(s_i → s_j, X, Y)
                   = Σ_{X_{1,T+1}} P(X_{1,T+1} | O_{1,T}, θ) · count(s_i → s_j, O_{1,T}, X_{1,T+1})
                   = Σ_{t=1..T} P(X_t = i, X_{t+1} = j | O_{1,T})

Page 41: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Expected counts (cont)

  count(s_i → s_j, w_k) = Σ_Y P(Y | X, θ) · count(s_i → s_j, w_k, X, Y)
                        = Σ_{X_{1,T+1}} P(X_{1,T+1} | O_{1,T}, θ) · count(s_i → s_j, w_k, O_{1,T}, X_{1,T+1})
                        = Σ_{t=1..T} P(X_t = i, X_{t+1} = j, o_t = w_k | O_{1,T})
                        = Σ_{t: o_t = w_k} P(X_t = i, X_{t+1} = j | O_{1,T})

(count(s_i → s_j, w_k) is the number of transitions from s_i to s_j that emit w_k.)

Page 42: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

The inner loop for forward-backward algorithm

Given an input sequence and an HMM (S, Σ, Π, A, B):

1. Calculate the forward probability:
   • Base case:       α_i(1) = π_i
   • Recursive case:  α_j(t+1) = Σ_i α_i(t) a_ij b_{ij o_t}

2. Calculate the backward probability:
   • Base case:       β_i(T+1) = 1
   • Recursive case:  β_i(t) = Σ_j a_ij b_{ij o_t} β_j(t+1)

3. Calculate the expected counts:

   p_t(i, j) = α_i(t) a_ij b_{ij o_t} β_j(t+1) / Σ_{m=1..N} α_m(t) β_m(t)

4. Update the parameters:

   a_ij = Σ_{t=1..T} p_t(i, j) / Σ_{j=1..N} Σ_{t=1..T} p_t(i, j)

   b_ijk = Σ_{t: o_t = w_k} p_t(i, j) / Σ_{t=1..T} p_t(i, j)
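A compact sketch of this inner loop in Python (my own illustration, not the lecture's code), for a single observation sequence and the arc-emission parameterization above; obs is a list of symbol indices o_1..o_T.

    import numpy as np

    def forward_backward_step(pi, A, B, obs):
        """One forward-backward (EM) iteration for an arc-emission HMM.

        pi: (N,) initial probabilities; A: (N, N) transitions a_ij;
        B: (N, N, M) arc emissions b_ijk; obs: list of T symbol indices.
        Returns the re-estimated (A, B).
        """
        N, T = len(pi), len(obs)

        # 1. Forward probabilities: alpha[i, t] corresponds to alpha_i(t+1) above.
        alpha = np.zeros((N, T + 1))
        alpha[:, 0] = pi
        for t in range(T):
            for j in range(N):
                alpha[j, t + 1] = np.sum(alpha[:, t] * A[:, j] * B[:, j, obs[t]])

        # 2. Backward probabilities: beta[i, t] corresponds to beta_i(t+1) above.
        beta = np.zeros((N, T + 1))
        beta[:, T] = 1.0
        for t in range(T - 1, -1, -1):
            for i in range(N):
                beta[i, t] = np.sum(A[i, :] * B[i, :, obs[t]] * beta[:, t + 1])

        # 3. Expected counts p_t(i, j).
        p = np.zeros((T, N, N))
        for t in range(T):
            denom = np.sum(alpha[:, t] * beta[:, t])
            for i in range(N):
                p[t, i, :] = alpha[i, t] * A[i, :] * B[i, :, obs[t]] * beta[:, t + 1] / denom

        # 4. Update the parameters from the expected counts.
        new_A = p.sum(axis=0) / p.sum(axis=(0, 2))[:, None]
        new_B = np.zeros_like(B)
        for k in range(B.shape[2]):
            mask = np.array([o == k for o in obs])
            new_B[:, :, k] = p[mask].sum(axis=0) / p.sum(axis=0)
        return new_A, new_B

Repeating forward_backward_step until P(O) = Σ_i α_i(T+1) stops improving gives the overall EM loop from page 14.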

Page 43: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

HMM

• An HMM is a tuple (S, Σ, Π, A, B):
  – A set of states S = {s1, s2, …, sN}.
  – A set of output symbols Σ = {w1, …, wM}.
  – Initial state probabilities Π = {π_i}.
  – State transition probabilities: A = {a_ij}.
  – Symbol emission probabilities: B = {b_ijk}.

• State sequence: X1 … X_{T+1}

• Output sequence: o1 … oT

Page 44: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Forward probability

The probability of producing O_{1,t−1} while ending up in state s_i:

  α_i(t) = P(O_{1,t−1}, X_t = i)   (definition)

Page 45: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Calculating forward probability

Initialization:  α_i(1) = π_i

Induction:

  α_j(t+1) = P(O_{1,t}, X_{t+1} = j)
           = Σ_i P(O_{1,t−1}, X_t = i, o_t, X_{t+1} = j)
           = Σ_i P(O_{1,t−1}, X_t = i) · P(o_t, X_{t+1} = j | O_{1,t−1}, X_t = i)
           = Σ_i P(O_{1,t−1}, X_t = i) · P(o_t, X_{t+1} = j | X_t = i)
           = Σ_i α_i(t) a_ij b_{ij o_t}

Page 46: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Backward probability

• The probability of producing the sequence O_{t,T}, given that at time t we are in state s_i:

  β_i(t) = P(O_{t,T} | X_t = i)   (definition)

Page 47: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Calculating backward probability

Initialization:  β_i(T+1) = 1

Induction:

  β_i(t) = P(O_{t,T} | X_t = i)
         = Σ_j P(o_t, O_{t+1,T}, X_{t+1} = j | X_t = i)
         = Σ_j P(o_t, X_{t+1} = j | X_t = i) · P(O_{t+1,T} | X_t = i, X_{t+1} = j, o_t)
         = Σ_j P(o_t, X_{t+1} = j | X_t = i) · P(O_{t+1,T} | X_{t+1} = j)
         = Σ_j a_ij b_{ij o_t} β_j(t+1)

Page 48: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Calculating the prob of the observation

  P(O) = Σ_{i=1..N} α_i(T+1)

  P(O) = Σ_{i=1..N} π_i β_i(1)

  P(O) = Σ_{i=1..N} P(O, X_t = i) = Σ_{i=1..N} α_i(t) β_i(t)    (for any t)

Page 49: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Estimating parameters

• The probability of traversing a certain arc at time t, given O (denoted p_t(i, j) in M&S):

  p_t(i, j) = P(X_t = i, X_{t+1} = j | O)
            = P(X_t = i, X_{t+1} = j, O) / P(O)
            = α_i(t) a_ij b_{ij o_t} β_j(t+1) / Σ_{m=1..N} α_m(t) β_m(t)

Page 50: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

The probability of being in state i at time t, given O:

  γ_i(t) = P(X_t = i | O) = Σ_{j=1..N} P(X_t = i, X_{t+1} = j | O) = Σ_{j=1..N} p_t(i, j)

Page 51: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Expected counts

Sum over the time index:

• Expected # of transitions from state i to j in O:

  Σ_{t=1..T} p_t(i, j)

• Expected # of transitions from state i in O:

  Σ_{t=1..T} γ_i(t) = Σ_{t=1..T} Σ_{j=1..N} p_t(i, j)

Page 52: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Update parameters

  π̂_i = expected frequency in state i at time 1 = γ_i(1)

  â_ij = (expected # of transitions from state i to j) / (expected # of transitions from state i)
       = Σ_{t=1..T} p_t(i, j) / Σ_{t=1..T} Σ_{j=1..N} p_t(i, j)

  b̂_ijk = (expected # of transitions from state i to j with w_k observed) / (expected # of transitions from state i to j)
        = Σ_{t: o_t = w_k} p_t(i, j) / Σ_{t=1..T} p_t(i, j)

Page 53: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Final formulae

  p_t(i, j) = α_i(t) a_ij b_{ij o_t} β_j(t+1) / Σ_{m=1..N} α_m(t) β_m(t)

  â_ij = Σ_{t=1..T} p_t(i, j) / Σ_{j=1..N} Σ_{t=1..T} p_t(i, j)

  b̂_ijk = Σ_{t: o_t = w_k} p_t(i, j) / Σ_{t=1..T} p_t(i, j)

Page 54: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Summary

• The EM algorithm
  – An iterative approach
  – L(θ) is non-decreasing at each iteration
  – An optimal solution in the M-step exists for many classes of problems.

• The EM algorithm for PM models
  – Simpler formulae
  – Three special cases:
    • Inside-outside algorithm
    • Forward-backward algorithm
    • IBM models for MT

Page 55: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

Relations among the algorithms

• The generalized EM
  – The EM algorithm
    • PM models: inside-outside, forward-backward, IBM models
    • Gaussian mixtures

Page 56: The EM algorithm LING 572 Fei Xia Week 10: 03/09/2010

The EM algorithm for PM models

// for each iteration

// for each training example xi

// for each possible y

// for each parameter

// for each parameter
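The comment skeleton above lost its code in extraction. Here is a minimal sketch of what such a loop might look like (my own reconstruction, not the original listing); posterior, counts, and hidden_values are hypothetical helpers that a concrete PM model (HMM, PCFG, …) would supply.

    def em_for_pm(data, theta0, posterior, counts, hidden_values, num_iter=20):
        """EM for a PM model.

        posterior(y, x, theta) -> P(y | x, theta)
        counts(x, y)           -> {(partition, param): count of that param in (x, y)}
        hidden_values(x)       -> iterable of possible hidden values y for x
        theta                  -> {(partition, param): probability}
        """
        theta = dict(theta0)
        for _ in range(num_iter):                         # for each iteration
            expected = {key: 0.0 for key in theta}
            for x in data:                                # for each training example x_i
                for y in hidden_values(x):                # for each possible y
                    w = posterior(y, x, theta)
                    for key, c in counts(x, y).items():   # for each parameter
                        expected[key] += w * c
            totals = {}                                   # M-step: normalize each partition
            for (part, _), v in expected.items():         # for each parameter
                totals[part] = totals.get(part, 0.0) + v
            theta = {key: v / totals[key[0]] for key, v in expected.items()}
        return theta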