The EM algorithm
LING 572
Fei Xia
Week 10: 03/09/2010
What is EM?
• EM stands for “expectation maximization”.
• A parameter estimation method: it falls into the general framework of maximum-likelihood estimation (MLE).
• The general form was given in (Dempster, Laird, and Rubin, 1977), although the essence of the algorithm had appeared previously in various forms.
Outline
• MLE
• EM
  – Basic concepts
  – Main ideas of EM
• Additional slides:
  – EM for PM models
  – An example: the forward-backward algorithm
MLE
What is MLE?
• Given
  – A sample X = {X1, …, Xn}
  – A vector of parameters θ
• We define
  – Likelihood of the data: P(X | θ)
  – Log-likelihood of the data: L(θ) = log P(X | θ)
• Given X, find
$$\theta_{ML} = \arg\max_{\theta} L(\theta)$$
MLE (cont)
• Often we assume that the Xi are independent and identically distributed (i.i.d.)
• Depending on the form of P(X | θ), solving this optimization problem can be easy or hard.
$$
\begin{aligned}
\theta_{ML} &= \arg\max_{\theta} L(\theta) \\
            &= \arg\max_{\theta} \log P(X \mid \theta) \\
            &= \arg\max_{\theta} \log P(X_1, \ldots, X_n \mid \theta) \\
            &= \arg\max_{\theta} \log \prod_{i} P(X_i \mid \theta) \\
            &= \arg\max_{\theta} \sum_{i} \log P(X_i \mid \theta)
\end{aligned}
$$
An easy case
• Assuming
  – A coin has a probability p of being heads, 1-p of being tails.
  – Observation: We toss a coin N times, and the result is a set of Hs and Ts, and there are m Hs.
• What is the value of p based on MLE, given the observation?
An easy case (cont)
$$
L(\theta) = \log P(X \mid \theta) = \log p^{m}(1-p)^{N-m} = m \log p + (N-m)\log(1-p)
$$
$$
\frac{dL}{dp} = \frac{d\,(m \log p + (N-m)\log(1-p))}{dp} = \frac{m}{p} - \frac{N-m}{1-p} = 0
$$
$$
\Rightarrow \quad p = m/N
$$
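As a quick numeric check (not from the slides), evaluating the log-likelihood on a grid confirms the maximum sits at p = m/N:

```python
import numpy as np

# Log-likelihood of m heads in N tosses, as a function of p
N, m = 10, 3
p = np.linspace(0.001, 0.999, 999)
loglik = m * np.log(p) + (N - m) * np.log(1 - p)

print(p[np.argmax(loglik)])   # ~0.3, i.e. m/N
```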
EM: basic concepts
Basic setting in EM
• X is a set of data points: observed data.
• θ is a parameter vector.
• EM is a method to find θML where
$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$
• Calculating P(X | θ) directly is hard.
• Calculating P(X,Y | θ) is much simpler, where Y is “hidden” data (or “missing” data).
The basic EM strategy
• Z = (X, Y)
  – Z: complete data (“augmented data”)
  – X: observed data (“incomplete” data)
  – Y: hidden data (“missing” data)
The “missing” data Y
• Y need not necessarily be missing in the practical sense of the word.
• It may just be a conceptually convenient technical device to simplify the calculation of P(X | θ).
• There could be many possible Ys.
Examples of EM
              HMM                PCFG             MT                         Coin toss
X (observed)  sentences          sentences        parallel data              head-tail sequences
Y (hidden)    state sequences    parse trees      word alignments            coin id sequences
θ             aij, bijk          P(A→BC)          t(f|e), d(aj|j,l,m), …     p1, p2, λ
Algorithm     forward-backward   inside-outside   IBM models                 N/A
The EM algorithm
• Consider a set of starting parameters
• Use these to “estimate” the missing data
• Use “complete” data to update parameters
• Repeat until convergence
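Schematically, the loop above might look like this in Python (a sketch; the function names and the fixed iteration count are illustrative, and real implementations stop when the log-likelihood change is small):

```python
def em(x, theta, e_step, m_step, n_iter=100):
    """Generic EM loop: repeatedly complete the data with the current
    parameters, then re-estimate the parameters from the completed data."""
    for _ in range(n_iter):
        expected_stats = e_step(x, theta)  # "estimate" the missing data Y
        theta = m_step(expected_stats)     # update parameters from "complete" data
    return theta
```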
Highlights
• General algorithm for missing data problems
• Requires “specialization” to the problem at hand
• Examples of EM:
  – Forward-backward algorithm for HMMs
  – Inside-outside algorithm for PCFGs
  – EM in the IBM MT models
EM: main ideas
Idea #1: find θ that maximizes the likelihood of training data
$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$
Idea #2: find the θt sequence
No analytical solution → iterative approach: find a sequence
$$\theta^0, \theta^1, \ldots, \theta^t, \ldots$$
s.t.
$$l(\theta^0) \le l(\theta^1) \le \cdots \le l(\theta^t) \le \cdots$$
Idea #3: find θt+1 that maximizes a tight lower bound of l(θ) − l(θt)
$$
l(\theta) - l(\theta^t) \;\ge\; \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\left[\log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}\right]
$$
The right-hand side is the tight lower bound.
Idea #4: find θt+1 that maximizes the Q function
$$
\begin{aligned}
\theta^{(t+1)} &= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\left[\log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}\right]
&&\text{(the lower bound of } l(\theta) - l(\theta^t)\text{)} \\
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\left[\log P(x_i, y \mid \theta)\right]
&&\text{(the Q function)}
\end{aligned}
$$
The Q-function
• Define the Q-function (a function of θ):
$$
Q(\theta, \theta^t) = E_{P(Y \mid X, \theta^t)}[\log P(X, Y \mid \theta)]
= \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}[\log P(x_i, y \mid \theta)]
= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)
$$
  – θt is the current parameter estimate, a constant (vector).
  – θ is the free variable (vector) that we wish to adjust.
• The Q-function is the expected value of the complete-data log-likelihood log P(X,Y | θ) with respect to Y, given X and θt.
Calculating model expectation in MaxEnt
$$
\begin{aligned}
E_p[f_j] &= \sum_{x \in X,\, y \in Y} p(x, y)\, f_j(x, y) \\
&= \sum_{x \in X,\, y \in Y} p(x)\, p(y \mid x)\, f_j(x, y) \\
&\approx \sum_{x \in X,\, y \in Y} \tilde{p}(x)\, p(y \mid x)\, f_j(x, y) \\
&= \sum_{x \in S} \sum_{y \in Y} \tilde{p}(x)\, p(y \mid x)\, f_j(x, y) \\
&= \frac{1}{N} \sum_{i=1}^{N} \sum_{y \in Y} p(y \mid x_i)\, f_j(x_i, y)
\end{aligned}
$$
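A sketch of the last line in Python, assuming hypothetical helpers cond_prob(y, x) for p(y | x) and feature_j(x, y) for fj(x, y):

```python
def model_expectation(xs, ys, cond_prob, feature_j):
    """Approximate E_p[f_j]: average over the N training examples,
    summing over all possible y weighted by p(y | x_i)."""
    return sum(cond_prob(y, x) * feature_j(x, y)
               for x in xs for y in ys) / len(xs)
```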
The EM algorithm
• Start with initial estimate, θ0
• Repeat until convergence:
  – E-step: calculate
    $$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta)$$
  – M-step: find
    $$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$$
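For a concrete instance, here is a sketch of EM for the coin-toss column of the earlier table: two biased coins, the coin identity behind each toss sequence is hidden, and the parameters are λ, p1, p2. The data, initial values, and function name are illustrative:

```python
import numpy as np

def em_two_coins(heads, n, n_iter=50):
    """EM for a two-coin mixture: each sequence of n tosses was generated
    by coin 1 (with prob lam) or coin 2; the coin identity y is hidden.
    heads[i] = number of heads observed in sequence i."""
    heads = np.asarray(heads, dtype=float)
    lam, p1, p2 = 0.6, 0.7, 0.4               # arbitrary initial estimate
    for _ in range(n_iter):
        # E-step: posterior P(y = coin1 | x_i, theta_t); the binomial
        # coefficients cancel in the ratio, so they are omitted
        l1 = lam * p1**heads * (1 - p1)**(n - heads)
        l2 = (1 - lam) * p2**heads * (1 - p2)**(n - heads)
        w = l1 / (l1 + l2)
        # M-step: re-estimate the parameters from the expected counts
        lam = w.mean()
        p1 = (w * heads).sum() / (w * n).sum()
        p2 = ((1 - w) * heads).sum() / ((1 - w) * n).sum()
    return lam, p1, p2

print(em_two_coins([9, 8, 2, 1, 9], n=10))
```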
Important classes of EM problem
• Products of multinomial (PM) models
• Exponential families
• Gaussian mixtures
• …
Summary
• EM falls into the general framework of maximum-likelihood estimation (MLE).
• Calculating P(X | θ) directly is hard, so we introduce Y (“hidden” data) and calculate P(X,Y | θ) instead.
• Start with initial estimate, repeat E-step and M-step until convergence.
• Closed-form solutions for some classes of models.
Strengths of EM
• Numerical stability: in every iteration of the EM algorithm, it increases the likelihood of the observed data.
• EM handles parameter constraints gracefully.
Problems with EM
• Convergence can be very slow on some problems and is intimately related to the amount of missing information.
• It is guaranteed to improve the likelihood of the training corpus, which is different from directly reducing errors.
• It is not guaranteed to reach a global maximum (it can get stuck at a local maximum, a saddle point, etc.), so the initial estimate is important.
Additional slides
The EM algorithm for PM models
PM models
$$
P(x, y \mid \Theta) = \prod_{j=1}^{m} \prod_{p_i \in \Omega_j} p_i^{\,c_i(x,\, y,\, \Omega_j)}
$$
where Θ = (Ω1, …, Ωm) is a partition of all the parameters, and for any j,
$$\sum_{p_i \in \Omega_j} p_i = 1$$
HMM is a PM
$$
P(x, y \mid \theta) = \prod_{i,j} p(s_j \mid s_i)^{\,c(s_i \to s_j,\, x,\, y)} \cdot \prod_{i,j,k} p(w_k \mid s_i, s_j)^{\,c(s_i \xrightarrow{w_k} s_j,\, x,\, y)}
$$
where $a_{ij} = p(s_j \mid s_i)$ and $b_{ijk} = p(w_k \mid s_i, s_j)$.
PCFG
• PCFG: each sample point is a pair (x, y):
  – x is a sentence
  – y is a possible parse tree for that sentence.
$$
P(x, y \mid \theta) = \prod_{i=1}^{n} P(A_i \to \alpha_i \mid A_i)
$$
For example:
$$
P(x, y \mid \theta) = P(S \to NP\ VP \mid S) \cdot P(NP \to Jim \mid NP) \cdot P(VP \to sleeps \mid VP)
$$
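Since P(x, y | θ) is a product of rule probabilities, a tree's log-probability is a sum of log rule probabilities. A tiny illustrative sketch (the rule names and probabilities are made up):

```python
import math

def tree_log_prob(rules_used, rule_probs):
    """log P(x, y) = sum of log rule probabilities over the derivation."""
    return sum(math.log(rule_probs[r]) for r in rules_used)

rule_probs = {"S -> NP VP": 1.0, "NP -> Jim": 0.2, "VP -> sleeps": 0.3}
print(tree_log_prob(["S -> NP VP", "NP -> Jim", "VP -> sleeps"], rule_probs))
```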
PCFG is a PM
$$
P(x, y \mid \theta) = \prod_{A \to \alpha} p(A \to \alpha)^{\,c(A \to \alpha,\, x,\, y)},
\qquad \forall A: \sum_{\alpha} p(A \to \alpha) = 1
$$
Q-function for PM
$$
\begin{aligned}
Q(\theta, \theta^t) &= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log P(x_i, y \mid \theta) \\
&= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log \Big( \prod_{j} p_{1j}^{\,C_{1j}(x_i, y)} \Big) + \cdots + \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \log \Big( \prod_{j} p_{kj}^{\,C_{kj}(x_i, y)} \Big) \\
&= \Big( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C_{1j}(x_i, y) \log p_{1j} \Big) + \cdots + \Big( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C_{kj}(x_i, y) \log p_{kj} \Big)
\end{aligned}
$$
Q(θ, θt) decomposes into one term per multinomial Ω1, …, Ωk, so each multinomial can be maximized separately.
Maximizing the Q function
Maximize
$$
Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C(j, x_i, y) \log p_j
$$
subject to the constraint
$$\sum_{j} p_j = 1$$
Use Lagrange multipliers:
$$
\hat{Q}(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C(j, x_i, y) \log p_j + \lambda \Big(1 - \sum_{j} p_j\Big)
$$
Optimal solution
$$
\frac{\partial \hat{Q}(\theta, \theta^t)}{\partial p_j} = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C(j, x_i, y) / p_j \;-\; \lambda = 0
$$
$$
\Rightarrow \quad p_j = \frac{1}{\lambda} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C(j, x_i, y)
$$
Here 1/λ is a normalization factor and the sum is the expected count of j.
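The optimal solution says each parameter equals its expected count divided by a normalization factor. A sketch of that M-step in Python (the dict/list data structures are my own, not from the slides):

```python
def pm_m_step(expected_counts, partitions):
    """p_j = expected count of j, divided by the total expected count of
    the multinomial (the partition Omega) that j belongs to."""
    new_probs = {}
    for omega in partitions:                            # one multinomial per partition
        total = sum(expected_counts[j] for j in omega)  # the Lagrange multiplier lambda
        for j in omega:
            new_probs[j] = expected_counts[j] / total
    return new_probs

# e.g. the PCFG rules rewriting S form one multinomial:
print(pm_m_step({"S -> NP VP": 3.2, "S -> VP": 0.8},
                [["S -> NP VP", "S -> VP"]]))
```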
PCFG example
• Calculate expected counts
• Update parameters
EM: An example
Forward-backward algorithm
• An HMM is a PM (product of multinomials) model.
• The forward-backward algorithm is a special case of the EM algorithm for PM models.
• X (observed data): each data point is an output sequence O1,T.
• Y (hidden data): a state sequence X1,T+1.
• θ (parameters): aij, bijk, πi.
Expected counts
$$
\begin{aligned}
count(s_i \to s_j) &= \sum_{Y} P(Y \mid X, \theta^t) \cdot count(s_i \to s_j,\, X,\, Y) \\
&= \sum_{X_1^{T+1}} P(X_1^{T+1} \mid O_1^T, \theta^t) \cdot count(s_i \to s_j,\, O_1^T,\, X_1^{T+1}) \\
&= \sum_{t=1}^{T} P(X_t = i,\, X_{t+1} = j \mid O_1^T, \theta^t)
\end{aligned}
$$
Expected counts (cont)
$$
\begin{aligned}
count(s_i \xrightarrow{w_k} s_j) &= \sum_{Y} P(Y \mid X, \theta^t) \cdot count(s_i \xrightarrow{w_k} s_j,\, X,\, Y) \\
&= \sum_{X_1^{T+1}} P(X_1^{T+1} \mid O_1^T, \theta^t) \cdot count(s_i \xrightarrow{w_k} s_j,\, O_1^T,\, X_1^{T+1}) \\
&= \sum_{t=1}^{T} P(X_t = i,\, X_{t+1} = j \mid O_1^T, \theta^t)\, \delta(o_t, w_k)
\end{aligned}
$$
The inner loop for forward-backward algorithm
Given an input sequence O1,T and an HMM (S, Σ, Π, A, B):

1. Calculate the forward probability:
   • Base case: $\alpha_i(1) = \pi_i$
   • Recursive case: $\alpha_j(t+1) = \sum_i \alpha_i(t)\, a_{ij}\, b_{ijo_t}$

2. Calculate the backward probability:
   • Base case: $\beta_i(T+1) = 1$
   • Recursive case: $\beta_i(t) = \sum_j a_{ij}\, b_{ijo_t}\, \beta_j(t+1)$

3. Calculate the expected counts:
   $$\gamma_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$

4. Update the parameters:
   $$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{j=1}^{N} \sum_{t=1}^{T} \gamma_{ij}(t)}
   \qquad
   \hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$
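Below is a minimal NumPy sketch of one pass of this inner loop under the slides' arc-emission conventions (A[i,j] = aij, B[i,j,k] = bijk, obs is a list of symbol indices o1…oT). The single-sequence setup and names are mine, and a production implementation would rescale or work in log space to avoid underflow:

```python
import numpy as np

def forward_backward_step(pi, A, B, obs):
    """One EM iteration for an arc-emission HMM (illustrative sketch)."""
    N, T = len(pi), len(obs)
    # 1. forward: column t holds alpha_i(t+1) of the slides (0-based time)
    alpha = np.zeros((N, T + 1))
    alpha[:, 0] = pi                              # base case alpha_i(1) = pi_i
    for t in range(T):
        for j in range(N):
            alpha[j, t + 1] = np.sum(alpha[:, t] * A[:, j] * B[:, j, obs[t]])
    # 2. backward: column t holds beta_i(t+1) of the slides
    beta = np.zeros((N, T + 1))
    beta[:, T] = 1.0                              # base case beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):
        for i in range(N):
            beta[i, t] = np.sum(A[i, :] * B[i, :, obs[t]] * beta[:, t + 1])
    # 3. expected counts gamma[i, j, t] = P(X_t = i, X_{t+1} = j | O)
    p_obs = alpha[:, T].sum()                     # P(O) = sum_m alpha_m(T+1)
    gamma = np.zeros((N, N, T))
    for t in range(T):
        for i in range(N):
            gamma[i, :, t] = (alpha[i, t] * A[i, :] * B[i, :, obs[t]]
                              * beta[:, t + 1]) / p_obs
    # 4. M-step: normalize the expected counts
    arc = gamma.sum(axis=2)                       # expected i -> j transitions
    pi_new = gamma[:, :, 0].sum(axis=1)           # expected frequency at t = 1
    A_new = arc / arc.sum(axis=1, keepdims=True)
    B_new = np.zeros_like(B)
    for t in range(T):
        B_new[:, :, obs[t]] += gamma[:, :, t]     # emissions of each symbol
    B_new /= arc[:, :, None]
    return pi_new, A_new, B_new, p_obs
```

Repeating this step with the updated (Π, A, B) until p_obs stops increasing gives the full training loop.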
HMM
• An HMM is a tuple (S, Σ, Π, A, B):
  – A set of states S = {s1, s2, …, sN}.
  – A set of output symbols Σ = {w1, …, wM}.
  – Initial state probabilities Π = {πi}.
  – State transition probabilities: A = {aij}.
  – Symbol emission probabilities: B = {bijk}.
• State sequence: X1 … XT+1
• Output sequence: o1 … oT
Forward probability
The probability of producing O1,t-1 while ending up in state si:
$$\alpha_i(t) \overset{def}{=} P(O_{1,t-1},\, X_t = i)$$
Calculating forward probability
Initialization: $\alpha_i(1) = \pi_i$
Induction:
$$
\begin{aligned}
\alpha_j(t+1) &= P(O_{1,t},\, X_{t+1} = j) \\
&= \sum_i P(O_{1,t-1},\, X_t = i,\, o_t,\, X_{t+1} = j) \\
&= \sum_i P(O_{1,t-1},\, X_t = i) \cdot P(o_t,\, X_{t+1} = j \mid O_{1,t-1},\, X_t = i) \\
&= \sum_i P(O_{1,t-1},\, X_t = i) \cdot P(o_t,\, X_{t+1} = j \mid X_t = i) \\
&= \sum_i \alpha_i(t)\, a_{ij}\, b_{ijo_t}
\end{aligned}
$$
Backward probability
• The probability of producing the sequence Ot,T, given that at time t we are in state si:
$$\beta_i(t) \overset{def}{=} P(O_{t,T} \mid X_t = i)$$
Calculating backward probability
Initialization: $\beta_i(T+1) = 1$
Induction:
$$
\begin{aligned}
\beta_i(t) &= P(O_{t,T} \mid X_t = i) \\
&= \sum_j P(o_t,\, O_{t+1,T},\, X_{t+1} = j \mid X_t = i) \\
&= \sum_j P(o_t,\, X_{t+1} = j \mid X_t = i) \cdot P(O_{t+1,T} \mid X_t = i,\, X_{t+1} = j,\, o_t) \\
&= \sum_j P(o_t,\, X_{t+1} = j \mid X_t = i) \cdot P(O_{t+1,T} \mid X_{t+1} = j) \\
&= \sum_j a_{ij}\, b_{ijo_t}\, \beta_j(t+1)
\end{aligned}
$$
Calculating the prob of the observation
$$
P(O) = \sum_{i=1}^{N} \pi_i\, \beta_i(1) = \sum_{i=1}^{N} \alpha_i(T+1)
$$
and, for any t,
$$
P(O) = \sum_{i=1}^{N} P(O,\, X_t = i) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)
$$
Estimating parameters
• The probability of traversing a certain arc at time t, given O (denoted by pt(i, j) in M&S):
$$
\gamma_{ij}(t) = P(X_t = i,\, X_{t+1} = j \mid O, \theta^t)
= \frac{P(X_t = i,\, X_{t+1} = j,\, O \mid \theta^t)}{P(O \mid \theta^t)}
= \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}
$$
• The probability of being in state i at time t, given O:
$$
\gamma_i(t) = P(X_t = i \mid O) = \sum_{j=1}^{N} P(X_t = i,\, X_{t+1} = j \mid O) = \sum_{j=1}^{N} \gamma_{ij}(t)
$$
Expected counts
Sum over the time index:
• Expected # of transitions from state i to j in O:
  $$\sum_{t=1}^{T} \gamma_{ij}(t)$$
• Expected # of transitions from state i in O:
  $$\sum_{t=1}^{T} \gamma_i(t) = \sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)$$
Update parameters
$$\hat{\pi}_i = \text{expected frequency in state } i \text{ at time } t{=}1 = \gamma_i(1)$$
$$
\hat{a}_{ij} = \frac{\text{expected \# of transitions from state } i \text{ to } j}{\text{expected \# of transitions from state } i}
= \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{j=1}^{N} \sum_{t=1}^{T} \gamma_{ij}(t)}
$$
$$
\hat{b}_{ijk} = \frac{\text{expected \# of transitions from } i \text{ to } j \text{ with } w_k \text{ observed}}{\text{expected \# of transitions from } i \text{ to } j}
= \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}
$$
Final formulae
$$
\gamma_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{ijo_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}
$$
$$
\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{j=1}^{N} \sum_{t=1}^{T} \gamma_{ij}(t)}
\qquad
\hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}
$$
Summary
• The EM algorithm
  – An iterative approach
  – L(θ) is non-decreasing at each iteration
  – An optimal solution in the M-step exists for many classes of problems.
• The EM algorithm for PM models
  – Simpler formulae
  – Three special cases:
    • Inside-outside algorithm
    • Forward-backward algorithm
    • IBM models for MT
Relations among the algorithms
[Diagram: the generalized EM contains the EM algorithm; the EM algorithm contains Gaussian mixtures and the EM algorithm for PM models; the PM case in turn covers the inside-outside algorithm, the forward-backward algorithm, and the IBM models.]
The EM algorithm for PM models
• for each iteration:
  – for each training example xi:
    • for each possible y: accumulate the expected count P(y | xi, θt) · C(j, xi, y) for each parameter pj
  – for each parameter pj: renormalize the expected count to get the new pj