The EM algorithm
LING 572
Fei Xia
Week 10: 03/09/2010
What is EM?
• EM stands for “expectation maximization”.
• A parameter estimation method: it falls into the general framework of maximum-likelihood estimation (MLE).
• The general form was given in (Dempster, Laird, and Rubin, 1977), although the essence of the algorithm had appeared previously in various forms.
Outline
• MLE
• EM
  – Basic concepts
  – Main ideas of EM
• Additional slides:
  – EM for PM models
  – An example: the forward-backward algorithm
MLE
What is MLE?
• Given
  – A sample X = {X1, …, Xn}
  – A vector of parameters θ
• We define
  – Likelihood of the data: P(X | θ)
  – Log-likelihood of the data: L(θ) = log P(X | θ)
• Given X, find
$$\theta_{ML} = \arg\max_{\theta} L(\theta)$$
MLE (cont)
• Often we assume that the Xi are independent and identically distributed (i.i.d.)
• Depending on the form of P(X | θ), solving this optimization problem can be easy or hard.
$$
\begin{aligned}
\theta_{ML} &= \arg\max_{\theta} L(\theta) \\
            &= \arg\max_{\theta} \log P(X \mid \theta) \\
            &= \arg\max_{\theta} \log P(X_1, \ldots, X_n \mid \theta) \\
            &= \arg\max_{\theta} \log \prod_{i} P(X_i \mid \theta) \\
            &= \arg\max_{\theta} \sum_{i} \log P(X_i \mid \theta)
\end{aligned}
$$
An easy case
• Assuming
  – A coin has a probability p of being heads, 1-p of being tails.
  – Observation: We toss a coin N times, and the result is a set of Hs and Ts, and there are m Hs.
• What is the value of p based on MLE, given the observation?
An easy case (cont)
$$
L(\theta) = \log P(X \mid \theta) = \log p^{m}(1-p)^{N-m} = m \log p + (N-m)\log(1-p)
$$
$$
\frac{dL}{dp} = \frac{d\,(m \log p + (N-m)\log(1-p))}{dp} = \frac{m}{p} - \frac{N-m}{1-p} = 0
$$
$$
\Rightarrow \quad p = m/N
$$
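As a quick numeric check (not from the slides), evaluating the log-likelihood on a grid confirms the maximum sits at p = m/N:

```python
import numpy as np

# Log-likelihood of m heads in N tosses, as a function of p
N, m = 10, 3
p = np.linspace(0.001, 0.999, 999)
loglik = m * np.log(p) + (N - m) * np.log(1 - p)

print(p[np.argmax(loglik)])   # ~0.3, i.e. m/N
```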
EM: basic concepts
Basic setting in EM
• X is a set of data points: observed data.
• θ is a parameter vector.
• EM is a method to find θML where
$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$
• Calculating P(X | θ) directly is hard.
• Calculating P(X,Y | θ) is much simpler, where Y is “hidden” data (or “missing” data).
The basic EM strategy
• Z = (X, Y)
  – Z: complete data (“augmented data”)
  – X: observed data (“incomplete” data)
  – Y: hidden data (“missing” data)
The “missing” data Y
• Y need not necessarily be missing in the practical sense of the word.
• It may just be a conceptually convenient technical device to simplify the calculation of P(X | θ).
• There could be many possible Ys.
Examples of EM
              HMM                PCFG             MT                         Coin toss
X (observed)  sentences          sentences        parallel data              head-tail sequences
Y (hidden)    state sequences    parse trees      word alignments            coin id sequences
θ             aij, bijk          P(A→BC)          t(f|e), d(aj|j,l,m), …     p1, p2, λ
Algorithm     forward-backward   inside-outside   IBM models                 N/A
The EM algorithm
• Consider a set of starting parameters
• Use these to “estimate” the missing data
• Use “complete” data to update parameters
• Repeat until convergence
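Schematically, the loop above might look like this in Python (a sketch; the function names and the fixed iteration count are illustrative, and real implementations stop when the log-likelihood change is small):

```python
def em(x, theta, e_step, m_step, n_iter=100):
    """Generic EM loop: repeatedly complete the data with the current
    parameters, then re-estimate the parameters from the completed data."""
    for _ in range(n_iter):
        expected_stats = e_step(x, theta)  # "estimate" the missing data Y
        theta = m_step(expected_stats)     # update parameters from "complete" data
    return theta
```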
Highlights
• General algorithm for missing data problems
• Requires “specialization” to the problem at hand
• Examples of EM:
  – Forward-backward algorithm for HMMs
  – Inside-outside algorithm for PCFGs
  – EM in the IBM MT models
EM: main ideas
Idea #1: find θ that maximizes the likelihood of training data
$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$
Idea #2: find the θt sequence
No analytical solution → iterative approach: find a sequence
$$\theta^0, \theta^1, \ldots, \theta^t, \ldots$$
s.t.
$$l(\theta^0) \le l(\theta^1) \le \cdots \le l(\theta^t) \le \cdots$$
Idea #3: find θt+1 that maximizes a tight lower bound of l(θ) − l(θt)
$$
l(\theta) - l(\theta^t) \;\ge\; \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\left[\log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}\right]
$$
The right-hand side is the tight lower bound.
Idea #4: find θt+1 that maximizes the Q function
$$
\begin{aligned}
\theta^{(t+1)} &= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\left[\log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}\right]
&&\text{(the lower bound of } l(\theta) - l(\theta^t)\text{)} \\
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\left[\log P(x_i, y \mid \theta)\right]
&&\text{(the Q function)}
\end{aligned}
$$
The Q-function
• Define the Q-function (a function of θ):
$$
Q(\theta, \theta^t) = E_{P(Y \mid X, \theta^t)}[\log P(X, Y \mid \theta)]
= \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}[\log P(x_i, y \mid \theta)]
= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)
$$
  – θt is the current parameter estimate, a constant (vector).
  – θ is the free variable (vector) that we wish to adjust.
• The Q-function is the expected value of the complete-data log-likelihood log P(X,Y | θ) with respect to Y, given X and θt.
Calculating model expectation in MaxEnt
$$
\begin{aligned}
E_p[f_j] &= \sum_{x \in X,\, y \in Y} p(x, y)\, f_j(x, y) \\
&= \sum_{x \in X,\, y \in Y} p(x)\, p(y \mid x)\, f_j(x, y) \\
&\approx \sum_{x \in X,\, y \in Y} \tilde{p}(x)\, p(y \mid x)\, f_j(x, y) \\
&= \sum_{x \in S} \sum_{y \in Y} \tilde{p}(x)\, p(y \mid x)\, f_j(x, y) \\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{y \in Y} p(y \mid x_i)\, f_j(x_i, y)
\end{aligned}
$$
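A sketch of the last line in Python, assuming hypothetical helpers cond_prob(y, x) for p(y | x) and feature_j(x, y) for fj(x, y):

```python
def model_expectation(xs, ys, cond_prob, feature_j):
    """Approximate E_p[f_j]: average over the N training examples,
    summing over all possible y weighted by p(y | x_i)."""
    return sum(cond_prob(y, x) * feature_j(x, y)
               for x in xs for y in ys) / len(xs)
```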
The EM algorithm
• Start with initial estimate, θ0
• Repeat until convergence:
  – E-step: calculate
    $$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)$$
  – M-step: find
    $$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$$
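For a concrete instance, here is a sketch of EM for the coin-toss column of the earlier table: two biased coins, the coin identity behind each toss sequence is hidden, and the parameters are λ, p1, p2. The data, initial values, and function name are illustrative:

```python
import numpy as np

def em_two_coins(heads, n, n_iter=50):
    """EM for a two-coin mixture: each sequence of n tosses was generated
    by coin 1 (with prob lam) or coin 2; the coin identity y is hidden.
    heads[i] = number of heads observed in sequence i."""
    heads = np.asarray(heads, dtype=float)
    lam, p1, p2 = 0.6, 0.7, 0.4               # arbitrary initial estimate
    for _ in range(n_iter):
        # E-step: posterior P(y = coin1 | x_i, theta_t); the binomial
        # coefficients cancel in the ratio, so they are omitted
        l1 = lam * p1**heads * (1 - p1)**(n - heads)
        l2 = (1 - lam) * p2**heads * (1 - p2)**(n - heads)
        w = l1 / (l1 + l2)
        # M-step: re-estimate the parameters from the expected counts
        lam = w.mean()
        p1 = (w * heads).sum() / (w * n).sum()
        p2 = ((1 - w) * heads).sum() / ((1 - w) * n).sum()
    return lam, p1, p2

print(em_two_coins([9, 8, 2, 1, 9], n=10))
```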
Important classes of EM problem
• Products of multinomial (PM) models
• Exponential families
• Gaussian mixtures
• …
Summary
• EM falls into the general framework of maximum-likelihood estimation (MLE).
• Calculating P(X | θ) directly is hard, so we introduce Y (“hidden” data) and calculate P(X,Y | θ) instead.
• Start with initial estimate, repeat E-step and M-step until convergence.
• Closed-form solutions for some classes of models.
Strengths of EM
• Numerical stability: in every iteration of the EM algorithm, it increases the likelihood of the observed data.
• EM handles parameter constraints gracefully.
Problems with EM
• Convergence can be very slow on some problems and is intimately related to the amount of missing information.
• It is guaranteed to improve the likelihood of the training corpus, which is different from directly reducing errors.
• It is not guaranteed to reach a global maximum (it can get stuck at a local maximum, a saddle point, etc.), so the initial estimate is important.
Additional slides
The EM algorithm for PM models
PM models
$$
P(x, y \mid \Theta) = \prod_{j=1}^{m} \prod_{p_i \in \Omega_j} p_i^{\,c_i(x,\, y,\, \Omega_j)}
$$
where Θ = (Ω1, …, Ωm) is a partition of all the parameters, and for any j,
$$\sum_{p_i \in \Omega_j} p_i = 1$$
HMM is a PM
$$
P(x, y \mid \theta) = \prod_{i,j} p(s_j \mid s_i)^{\,c(s_i \to s_j,\, x,\, y)} \cdot \prod_{i,j,k} p(w_k \mid s_i, s_j)^{\,c(s_i \xrightarrow{w_k} s_j,\, x,\, y)}
$$
where $a_{ij} = p(s_j \mid s_i)$ and $b_{ijk} = p(w_k \mid s_i, s_j)$.
PCFG
• PCFG: each sample point is a pair (x, y):
  – x is a sentence
  – y is a possible parse tree for that sentence.
$$
P(x, y \mid \theta) = \prod_{i=1}^{n} P(A_i \to \alpha_i \mid A_i)
$$
For example:
$$
P(x, y \mid \theta) = P(S \to NP\ VP \mid S) \cdot P(NP \to Jim \mid NP) \cdot P(VP \to sleeps \mid VP)
$$
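Since P(x, y | θ) is a product of rule probabilities, a tree's log-probability is a sum of log rule probabilities. A tiny illustrative sketch (the rule names and probabilities are made up):

```python
import math

def tree_log_prob(rules_used, rule_probs):
    """log P(x, y) = sum of log rule probabilities over the derivation."""
    return sum(math.log(rule_probs[r]) for r in rules_used)

rule_probs = {"S -> NP VP": 1.0, "NP -> Jim": 0.2, "VP -> sleeps": 0.3}
print(tree_log_prob(["S -> NP VP", "NP -> Jim", "VP -> sleeps"], rule_probs))
```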
PCFG is a PM
$$
P(x, y \mid \theta) = \prod_{A \to \alpha} p(A \to \alpha)^{\,c(A \to \alpha,\, x,\, y)},
\qquad \forall A: \sum_{\alpha} p(A \to \alpha) = 1
$$
Q-function for PM
$$
\begin{aligned}
Q(\theta, \theta^t) &= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta) \\
&= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log \Big( \prod_{j} p_{1j}^{\,C_{1j}(x_i, y)} \Big) + \cdots + \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log \Big( \prod_{j} p_{kj}^{\,C_{kj}(x_i, y)} \Big) \\
&= \Big( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C_{1j}(x_i, y) \log p_{1j} \Big) + \cdots + \Big( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C_{kj}(x_i, y) \log p_{kj} \Big)
\end{aligned}
$$
Q(θ, θt) decomposes into one term per multinomial Ω1, …, Ωk, so each multinomial can be maximized separately.
Maximizing the Q function
Maximize
$$
Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C(j, x_i, y) \log p_j
$$
subject to the constraint
$$\sum_{j} p_j = 1$$
Use Lagrange multipliers:
$$
\hat{Q}(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C(j, x_i, y) \log p_j + \lambda \Big(1 - \sum_{j} p_j\Big)
$$
Optimal solution
$$
\frac{\partial \hat{Q}(\theta, \theta^t)}{\partial p_j} = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C(j, x_i, y) / p_j \;-\; \lambda = 0
$$
$$
\Rightarrow \quad p_j = \frac{1}{\lambda} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C(j, x_i, y)
$$
Here 1/λ is a normalization factor and the sum is the expected count of j.
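The optimal solution says each parameter equals its expected count divided by a normalization factor. A sketch of that M-step in Python (the dict/list data structures are my own, not from the slides):

```python
def pm_m_step(expected_counts, partitions):
    """p_j = expected count of j, divided by the total expected count of
    the multinomial (the partition Omega) that j belongs to."""
    new_probs = {}
    for omega in partitions:                            # one multinomial per partition
        total = sum(expected_counts[j] for j in omega)  # the Lagrange multiplier lambda
        for j in omega:
            new_probs[j] = expected_counts[j] / total
    return new_probs

# e.g. the PCFG rules rewriting S form one multinomial:
print(pm_m_step({"S -> NP VP": 3.2, "S -> VP": 0.8},
                [["S -> NP VP", "S -> VP"]]))
```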
PCFG example
• Calculate expected counts
• Update parameters
EM: An example
Forward-backward algorithm
• An HMM is a PM (product of multinomials) model.
• The forward-backward algorithm is a special case of the EM algorithm for PM models.
• X (observed data): each data point is an output sequence O1,T.
• Y (hidden data): a state sequence X1,T+1.
• θ (parameters): aij, bijk, πi.
Expected counts
$$
\begin{aligned}
count(s_i \to s_j) &= \sum_{Y} P(Y \mid X, \theta^t) \cdot count(s_i \to s_j,\, X,\, Y) \\
&= \sum_{X_1^{T+1}} P(X_1^{T+1} \mid O_1^T, \theta^t) \cdot count(s_i \to s_j,\, O_1^T,\, X_1^{T+1}) \\
&= \sum_{t=1}^{T} P(X_t = i,\, X_{t+1} = j \mid O_1^T, \theta^t)
\end{aligned}
$$
Expected counts (cont)
$$
\begin{aligned}
count(s_i \xrightarrow{w_k} s_j) &= \sum_{Y} P(Y \mid X, \theta^t) \cdot count(s_i \xrightarrow{w_k} s_j,\, X,\, Y) \\
&= \sum_{X_1^{T+1}} P(X_1^{T+1} \mid O_1^T, \theta^t) \cdot count(s_i \xrightarrow{w_k} s_j,\, O_1^T,\, X_1^{T+1}) \\
&= \sum_{t=1}^{T} P(X_t = i,\, X_{t+1} = j \mid O_1^T, \theta^t)\, \delta(o_t, w_k)
\end{aligned}
$$
The inner loop for forward-backward algorithm
Given an input sequence O1,T and an HMM (S, Σ, Π, A, B):

1. Calculate the forward probability:
   • Base case: $\alpha_i(1) = \pi_i$
   • Recursive case: $\alpha_j(t+1) = \sum_i \alpha_i(t)\, a_{ij}\, b_{ijo_t}$

2. Calculate the backward probability:
   • Base case: $\beta_i(T+1) = 1$
   • Recursive case: $\beta_i(t) = \sum_j a_{ij}\, b_{ijo_t}\, \beta_j(t+1)$

3. Calculate the expected counts:
   $$\gamma_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$

4. Update the parameters:
   $$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{j=1}^{N} \sum_{t=1}^{T} \gamma_{ij}(t)}
   \qquad
   \hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$
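Below is a minimal NumPy sketch of one pass of this inner loop under the slides' arc-emission conventions (A[i,j] = aij, B[i,j,k] = bijk, obs is a list of symbol indices o1…oT). The single-sequence setup and names are mine, and a production implementation would rescale or work in log space to avoid underflow:

```python
import numpy as np

def forward_backward_step(pi, A, B, obs):
    """One EM iteration for an arc-emission HMM (illustrative sketch)."""
    N, T = len(pi), len(obs)
    # 1. forward: column t holds alpha_i(t+1) of the slides (0-based time)
    alpha = np.zeros((N, T + 1))
    alpha[:, 0] = pi                              # base case alpha_i(1) = pi_i
    for t in range(T):
        for j in range(N):
            alpha[j, t + 1] = np.sum(alpha[:, t] * A[:, j] * B[:, j, obs[t]])
    # 2. backward: column t holds beta_i(t+1) of the slides
    beta = np.zeros((N, T + 1))
    beta[:, T] = 1.0                              # base case beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):
        for i in range(N):
            beta[i, t] = np.sum(A[i, :] * B[i, :, obs[t]] * beta[:, t + 1])
    # 3. expected counts gamma[i, j, t] = P(X_t = i, X_{t+1} = j | O)
    p_obs = alpha[:, T].sum()                     # P(O) = sum_m alpha_m(T+1)
    gamma = np.zeros((N, N, T))
    for t in range(T):
        for i in range(N):
            gamma[i, :, t] = (alpha[i, t] * A[i, :] * B[i, :, obs[t]]
                              * beta[:, t + 1]) / p_obs
    # 4. M-step: normalize the expected counts
    arc = gamma.sum(axis=2)                       # expected i -> j transitions
    pi_new = gamma[:, :, 0].sum(axis=1)           # expected frequency at t = 1
    A_new = arc / arc.sum(axis=1, keepdims=True)
    B_new = np.zeros_like(B)
    for t in range(T):
        B_new[:, :, obs[t]] += gamma[:, :, t]     # emissions of each symbol
    B_new /= arc[:, :, None]
    return pi_new, A_new, B_new, p_obs
```

Repeating this step with the updated (Π, A, B) until p_obs stops increasing gives the full training loop.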
HMM
• An HMM is a tuple (S, Σ, Π, A, B):
  – A set of states S = {s1, s2, …, sN}.
  – A set of output symbols Σ = {w1, …, wM}.
  – Initial state probabilities Π = {πi}.
  – State transition probabilities: A = {aij}.
  – Symbol emission probabilities: B = {bijk}.
• State sequence: X1 … XT+1
• Output sequence: o1 … oT
Forward probability
The probability of producing O1,t-1 while ending up in state si:
$$\alpha_i(t) \overset{def}{=} P(O_{1,t-1},\, X_t = i)$$
Calculating forward probability
Initialization: $\alpha_i(1) = \pi_i$
Induction:
$$
\begin{aligned}
\alpha_j(t+1) &= P(O_{1,t},\, X_{t+1} = j) \\
&= \sum_i P(O_{1,t-1},\, X_t = i,\, o_t,\, X_{t+1} = j) \\
&= \sum_i P(O_{1,t-1},\, X_t = i) \cdot P(o_t,\, X_{t+1} = j \mid O_{1,t-1},\, X_t = i) \\
&= \sum_i P(O_{1,t-1},\, X_t = i) \cdot P(o_t,\, X_{t+1} = j \mid X_t = i) \\
&= \sum_i \alpha_i(t)\, a_{ij}\, b_{ijo_t}
\end{aligned}
$$
Backward probability
• The probability of producing the sequence Ot,T, given that at time t we are in state si:
$$\beta_i(t) \overset{def}{=} P(O_{t,T} \mid X_t = i)$$
Calculating backward probability
Initialization: $\beta_i(T+1) = 1$
Induction:
$$
\begin{aligned}
\beta_i(t) &= P(O_{t,T} \mid X_t = i) \\
&= \sum_j P(o_t,\, O_{t+1,T},\, X_{t+1} = j \mid X_t = i) \\
&= \sum_j P(o_t,\, X_{t+1} = j \mid X_t = i) \cdot P(O_{t+1,T} \mid X_t = i,\, X_{t+1} = j,\, o_t) \\
&= \sum_j P(o_t,\, X_{t+1} = j \mid X_t = i) \cdot P(O_{t+1,T} \mid X_{t+1} = j) \\
&= \sum_j a_{ij}\, b_{ijo_t}\, \beta_j(t+1)
\end{aligned}
$$
Calculating the prob of the observation
$$
P(O) = \sum_{i=1}^{N} \pi_i\, \beta_i(1) = \sum_{i=1}^{N} \alpha_i(T+1)
$$
and, for any t,
$$
P(O) = \sum_{i=1}^{N} P(O,\, X_t = i) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)
$$
Estimating parameters
• The probability of traversing a certain arc at time t, given O (denoted by pt(i, j) in M&S):
$$
\gamma_{ij}(t) = P(X_t = i,\, X_{t+1} = j \mid O, \theta^t)
= \frac{P(X_t = i,\, X_{t+1} = j,\, O \mid \theta^t)}{P(O \mid \theta^t)}
= \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}
$$
• The probability of being in state i at time t, given O:
$$
\gamma_i(t) = P(X_t = i \mid O) = \sum_{j=1}^{N} P(X_t = i,\, X_{t+1} = j \mid O) = \sum_{j=1}^{N} \gamma_{ij}(t)
$$
Expected counts
Sum over the time index:
• Expected # of transitions from state i to j in O:
  $$\sum_{t=1}^{T} \gamma_{ij}(t)$$
• Expected # of transitions from state i in O:
  $$\sum_{t=1}^{T} \gamma_i(t) = \sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)$$
Update parameters
$$\hat{\pi}_i = \text{expected frequency in state } i \text{ at time } t{=}1 = \gamma_i(1)$$
$$
\hat{a}_{ij} = \frac{\text{expected \# of transitions from state } i \text{ to } j}{\text{expected \# of transitions from state } i}
= \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{j=1}^{N} \sum_{t=1}^{T} \gamma_{ij}(t)}
$$
$$
\hat{b}_{ijk} = \frac{\text{expected \# of transitions from } i \text{ to } j \text{ with } w_k \text{ observed}}{\text{expected \# of transitions from } i \text{ to } j}
= \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}
$$
Final formulae
$$
\gamma_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}
$$
$$
\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{j=1}^{N} \sum_{t=1}^{T} \gamma_{ij}(t)}
\qquad
\hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}
$$
Summary
• The EM algorithm
  – An iterative approach
  – L(θ) is non-decreasing at each iteration
  – An optimal solution in the M-step exists for many classes of problems.
• The EM algorithm for PM models
  – Simpler formulae
  – Three special cases:
    • Inside-outside algorithm
    • Forward-backward algorithm
    • IBM models for MT
Relations among the algorithms
[Diagram: the generalized EM contains the EM algorithm; the EM algorithm contains Gaussian mixtures and the EM algorithm for PM models; the PM case in turn covers the inside-outside algorithm, the forward-backward algorithm, and the IBM models.]
The EM algorithm for PM models
• for each iteration:
  – for each training example xi:
    • for each possible y: accumulate the expected count P(y | xi, θt) · C(j, xi, y) for each parameter pj
  – for each parameter pj: renormalize the expected count to get the new pj