lecture9 xing
TRANSCRIPT
Eric Xing © Eric Xing @ CMU, 2006-2010 1
Machine Learning
Mixture Model, HMM, and Expectation Maximization
Eric Xing
Lecture 9, August 14, 2010
Reading:
Eric Xing © Eric Xing @ CMU, 2006-2010 2
Gaussian Discriminative Analysis (fully observed, recap)
Data log-likelihood:
    ℓ(θ; D) = log ∏_n p(z_n, x_n) = log ∏_n p(z_n | π) p(x_n | z_n, μ, σ)
            = ∑_n log ∏_k π_k^{z_n^k} + ∑_n log ∏_k N(x_n; μ_k, σ)^{z_n^k}
            = ∑_n ∑_k z_n^k log π_k − ∑_n ∑_k z_n^k (x_n − μ_k)²/2σ² + C
MLE:
    π̂_{k,MLE} = argmax_π ℓ(θ; D)
    μ̂_{k,MLE} = argmax_μ ℓ(θ; D)  ⇒  μ̂_{k,MLE} = ∑_n z_n^k x_n / ∑_n z_n^k
    σ̂_{k,MLE} = argmax_σ ℓ(θ; D)
Class posterior (Bayes rule), used for inference over z:
    p(z_n^k = 1 | x_n, π, μ, σ) = π_k (2πσ²)^{-1/2} exp{−(x_n − μ_k)²/2σ²} / ∑_{k'} π_{k'} (2πσ²)^{-1/2} exp{−(x_n − μ_{k'})²/2σ²}
What if we do not know z_n?
(Graphical model: z_i → x_i, plate over N.)
Eric Xing © Eric Xing @ CMU, 2006-2010 3
Clustering
Eric Xing © Eric Xing @ CMU, 2006-2010 4
Unobserved Variables A variable can be unobserved (latent) because:
it is an imaginary quantity meant to provide some simplified and abstract view of the data generation process, e.g., speech recognition models, mixture models …
it is a real-world object and/or phenomenon, but difficult or impossible to measure, e.g., the temperature of a star, causes of a disease, evolutionary ancestors …
it is a real-world object and/or phenomenon, but sometimes wasn't measured because of faulty sensors, or was measured through a noisy channel, etc., e.g., traffic radio, aircraft signal on a radar screen.
Discrete latent variables can be used to partition/cluster data into sub-groups (mixture models, forthcoming).
Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc., later lectures).
Eric Xing © Eric Xing @ CMU, 2006-2010 5
Mixture Models A density model p(x) may be multi-modal. We may be able to model it as a mixture of uni-modal
distributions (e.g., Gaussians). Each mode may correspond to a different sub-population
(e.g., male and female).
Eric Xing © Eric Xing @ CMU, 2006-2010 6
Gaussian Mixture Models (GMMs) Consider a mixture of K Gaussian components:
Z is a latent class indicator vector:
    p(z_n) = multi(z_n : π) = ∏_k (π_k)^{z_n^k}
X is a conditional Gaussian variable with a class-specific mean/covariance:
    p(x_n | z_n^k = 1, μ, Σ) = (2π)^{-m/2} |Σ_k|^{-1/2} exp{ -½ (x_n − μ_k)^T Σ_k^{-1} (x_n − μ_k) }
The likelihood of a sample:
    p(x_n | π, μ, Σ) = ∑_k p(z^k = 1 | π) p(x_n | z^k = 1, μ, Σ)
                     = ∑_{z_n} ∏_k [ (π_k)^{z_n^k} N(x_n : μ_k, Σ_k)^{z_n^k} ]
                     = ∑_k π_k N(x_n | μ_k, Σ_k)
where π_k is the mixture proportion and N(x_n | μ_k, Σ_k) the mixture component.
(Graphical model: Z → X.)
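As a quick illustration of the mixture density above, here is a minimal numpy sketch (not part of the original slides; the function names and example parameters are my own) that evaluates p(x) = ∑_k π_k N(x | μ_k, Σ_k):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def gmm_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))

# Illustrative two-component mixture in 2D
pis = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), pis, mus, Sigmas))
```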
Eric Xing © Eric Xing @ CMU, 2006-2010 7
Gaussian Mixture Models (GMMs) Consider a mixture of K Gaussian components:
This model can be used for unsupervised clustering. This model (fit by AutoClass) has been used to discover new kinds of stars in
astronomical data, etc.
    p(x_n | π, μ, Σ) = ∑_k π_k N(x_n | μ_k, Σ_k)
    (mixture proportion π_k, mixture component N(x_n | μ_k, Σ_k))
Eric Xing © Eric Xing @ CMU, 2006-2010 8
Learning mixture models Given data
Likelihood:
    L(π, μ, Σ; D) = ∏_n p(x_n | π, μ, Σ) = ∏_n ∑_k π_k N(x_n | μ_k, Σ_k)
    (π*, μ*, Σ*) = argmax_{π,μ,Σ} L(π, μ, Σ; D)
Eric Xing © Eric Xing @ CMU, 2006-2010 9
Why is Learning Harder? In fully observed iid settings, the log likelihood decomposes
into a sum of local terms.
With latent variables, all the parameters become coupled together via marginalization
    ℓ_c(θ; D) = log p(x, z | θ) = log p(z | θ_z) + log p(x | z, θ_x)
    ℓ(θ; D) = log ∑_z p(x, z | θ) = log ∑_z p(z | θ_z) p(x | z, θ_x)
Eric Xing © Eric Xing @ CMU, 2006-2010 10
Recall MLE for completely observed data
Data log-likelihood:
    ℓ(θ; D) = log ∏_n p(z_n, x_n) = log ∏_n p(z_n | π) p(x_n | z_n, μ, σ)
            = ∑_n log ∏_k π_k^{z_n^k} + ∑_n log ∏_k N(x_n; μ_k, σ)^{z_n^k}
            = ∑_n ∑_k z_n^k log π_k − ∑_n ∑_k z_n^k (x_n − μ_k)²/2σ² + C
MLE:
    π̂_{k,MLE} = argmax_π ℓ(θ; D)
    μ̂_{k,MLE} = argmax_μ ℓ(θ; D)  ⇒  μ̂_{k,MLE} = ∑_n z_n^k x_n / ∑_n z_n^k
    σ̂_{k,MLE} = argmax_σ ℓ(θ; D)
What if we do not know z_n?
Toward the EM algorithm
(Graphical model: z_i → x_i, plate over N.)
Eric Xing © Eric Xing @ CMU, 2006-2010 11
Recall K-means Start:
"Guess" the centroid µk and coveriance Σk of each of the K clusters
Loop For each point n=1 to N,
compute its cluster label:
For each cluster k=1:K
)()(minarg )()(1)()( tkn
tk
Ttknk
tn xxz µµ −Σ−= −
∑∑=+
nt
n
n nt
ntk kz
xkz),(
),()(
)()(
δδ
µ 1 ...)( =Σ +1tk
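For concreteness, a minimal numpy sketch of this loop (not from the slides; function and variable names are mine). It uses plain Euclidean distance, i.e., it assumes identity covariances instead of the per-cluster Σ_k above:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: hard assignment to the nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]          # initial "guess" of centroids
    for _ in range(n_iter):
        # assignment step: label each point with its closest centroid
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
        z = d2.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4.0])
centroids, labels = kmeans(X, K=2)
```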
Eric Xing © Eric Xing @ CMU, 2006-2010 12
Expectation-Maximization Start:
"Guess" the centroid µk and coveriance Σk of each of the K clusters
Loop
Eric Xing © Eric Xing @ CMU, 2006-2010 13
─ Expectation step: computing the expected value of the sufficient statistics of the hidden variables (i.e., z) given current est. of the parameters (i.e., π and µ).
Here we are essentially doing inference
    τ_n^{k(t)} = ⟨z_n^k⟩_{q^{(t)}} = p(z_n^k = 1 | x_n, μ^{(t)}, Σ^{(t)}) = π_k^{(t)} N(x_n | μ_k^{(t)}, Σ_k^{(t)}) / ∑_i π_i^{(t)} N(x_n | μ_i^{(t)}, Σ_i^{(t)})
(E-step; graphical model: Z_n → X_n, plate over N.)
Eric Xing © Eric Xing @ CMU, 2006-2010 14
─ Maximization step: re-estimate the parameters given the current expected values of the hidden variables.
This is isomorphic to MLE except that the hidden variables are replaced by their expectations (in general they will be replaced by their corresponding "sufficient statistics").
M-step (graphical model: Z_n → X_n, plate over N):
    π_k* = argmax_π ⟨ℓ_c(θ)⟩,  ∂⟨ℓ_c(θ)⟩/∂π_k = 0  s.t. ∑_k π_k = 1  ⇒  π_k^{(t+1)} = ∑_n τ_n^{k(t)} / N = ⟨n_k⟩ / N
    μ_k* = argmax_μ ⟨ℓ_c(θ)⟩  ⇒  μ_k^{(t+1)} = ∑_n τ_n^{k(t)} x_n / ∑_n τ_n^{k(t)}
    Σ_k* = argmax_Σ ⟨ℓ_c(θ)⟩  ⇒  Σ_k^{(t+1)} = ∑_n τ_n^{k(t)} (x_n − μ_k^{(t+1)})(x_n − μ_k^{(t+1)})^T / ∑_n τ_n^{k(t)}
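Putting the E-step and M-step together, here is a minimal numpy/scipy sketch of EM for a Gaussian mixture (not from the slides; the names, initialization, and the small regularization term are my own assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """EM for a Gaussian mixture: E-step computes responsibilities tau,
    M-step re-estimates pi, mu, Sigma from the expected sufficient statistics."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: tau[n, k] = p(z_n^k = 1 | x_n, theta^t)
        tau = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                               for k in range(K)])
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted MLE with z_n^k replaced by its expectation tau[n, k]
        Nk = tau.sum(axis=0)
        pi = Nk / N
        mu = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, tau

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4.0])
pi, mu, Sigma, tau = em_gmm(X, K=2)
```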
Eric Xing © Eric Xing @ CMU, 2006-2010 15
How is EM derived? A mixture of K Gaussians:
Z is a latent class indicator vector:
    p(z_n) = multi(z_n : π) = ∏_k (π_k)^{z_n^k}
X is a conditional Gaussian variable with a class-specific mean/covariance:
    p(x_n | z_n^k = 1, μ, Σ) = (2π)^{-m/2} |Σ_k|^{-1/2} exp{ -½ (x_n − μ_k)^T Σ_k^{-1} (x_n − μ_k) }
The likelihood of a sample:
    p(x_n | π, μ, Σ) = ∑_k p(z^k = 1 | π) p(x_n | z^k = 1, μ, Σ) = ∑_k π_k N(x_n | μ_k, Σ_k)
The "complete" likelihood:
    p(x_n, z_n^k = 1 | π, μ, Σ) = p(z_n^k = 1 | π) p(x_n | z_n^k = 1, μ, Σ) = π_k N(x_n | μ_k, Σ_k)
But this is itself a random variable! Not good as an objective function:
    p(x_n, z_n | π, μ, Σ) = ∏_k [ π_k N(x_n | μ_k, Σ_k) ]^{z_n^k}
(Graphical model: Z_n → X_n, plate over N.)
Eric Xing © Eric Xing @ CMU, 2006-2010 16
How is EM derived? The complete log likelihood:
    ℓ_c(θ; D) = log ∏_n p(z_n, x_n) = log ∏_n p(z_n | π) p(x_n | z_n, μ, σ)
              = ∑_n ∑_k z_n^k log π_k − ∑_n ∑_k z_n^k (x_n − μ_k)²/2σ² + C
The expected complete log likelihood:
    ⟨ℓ_c(θ; x, z)⟩ = ∑_n ⟨log p(z_n | π)⟩_{p(z|x)} + ∑_n ⟨log p(x_n | z_n, μ, Σ)⟩_{p(z|x)}
                   = ∑_n ∑_k ⟨z_n^k⟩ log π_k − ½ ∑_n ∑_k ⟨z_n^k⟩ ( (x_n − μ_k)^T Σ_k^{-1} (x_n − μ_k) + log |Σ_k| ) + C
We maximize ⟨ℓ_c(θ)⟩ iteratively using the above procedure.
(Graphical model: Z_n → X_n, plate over N.)
Eric Xing © Eric Xing @ CMU, 2006-2010 17
Compare: K-means The EM algorithm for mixtures of Gaussians is like a "soft
version" of the K-means algorithm. In the K-means “E-step” we do hard assignment:
    z_n^{(t)} = argmin_k (x_n − μ_k^{(t)})^T Σ_k^{-1} (x_n − μ_k^{(t)})
In the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1:
    μ_k^{(t+1)} = ∑_n δ(z_n^{(t)}, k) x_n / ∑_n δ(z_n^{(t)}, k)
whereas EM uses the soft weights τ_n^{k(t)} = ⟨z_n^k⟩_{q^{(t)}}:
    μ_k^{(t+1)} = ∑_n τ_n^{k(t)} x_n / ∑_n τ_n^{k(t)}
Eric Xing © Eric Xing @ CMU, 2006-2010 18
Theory underlying EM What are we doing?
Recall that according to MLE, we intend to learn the model parameter that would maximize the likelihood of the data.
But we do not observe z, so computing
is difficult!
What shall we do?
    ℓ(θ; D) = log ∑_z p(x, z | θ) = log ∑_z p(z | θ_z) p(x | z, θ_x)
Eric Xing © Eric Xing @ CMU, 2006-2010 19
Complete & Incomplete Log Likelihoods Complete log likelihood
Let X denote the observable variable(s), and Z denote the latent variable(s). If Z could be observed, then
    ℓ_c(θ; x, z) ≜ log p(x, z | θ)
Usually, optimizing ℓ_c() given both z and x is straightforward (c.f. MLE for fully observed models). Recall that in this case the objective (e.g., for MLE) decomposes into a sum of factors, and the parameter for each factor can be estimated separately.
But given that Z is not observed, ℓ_c() is a random quantity and cannot be maximized directly.
Incomplete log likelihood
With z unobserved, our objective becomes the log of a marginal probability:
    ℓ(θ; x) = log p(x | θ) = log ∑_z p(x, z | θ)
This objective won't decouple.
Eric Xing © Eric Xing @ CMU, 2006-2010 20
Expected Complete Log Likelihood
For any distribution q(z), define the expected complete log likelihood:
    ⟨ℓ_c(θ; x, z)⟩_q ≜ ∑_z q(z | x) log p(x, z | θ)
This is a deterministic function of θ, and it is linear in ℓ_c() --- so it inherits its factorizability.
Does maximizing this surrogate yield a maximizer of the likelihood? By Jensen's inequality:
    ℓ(θ; x) = log p(x | θ)
            = log ∑_z p(x, z | θ)
            = log ∑_z q(z | x) p(x, z | θ) / q(z | x)
            ≥ ∑_z q(z | x) log [ p(x, z | θ) / q(z | x) ]
    ⇒  ℓ(θ; x) ≥ ⟨ℓ_c(θ; x, z)⟩_q + H_q
Eric Xing © Eric Xing @ CMU, 2006-2010 21
Lower Bounds and Free Energy For fixed data x, define a functional called the free energy:
    F(q, θ) ≜ ∑_z q(z | x) log [ p(x, z | θ) / q(z | x) ] ≤ ℓ(θ; x)
The EM algorithm is coordinate-ascent on F:
    E-step:  q^{t+1} = argmax_q F(q, θ^t)
    M-step:  θ^{t+1} = argmax_θ F(q^{t+1}, θ)
Eric Xing © Eric Xing @ CMU, 2006-2010 22
E-step: maximization of expected ℓ_c w.r.t. q
Claim:
    q^{t+1} = argmax_q F(q, θ^t) = p(z | x, θ^t)
This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g. to perform classification).
Proof (easy): this setting attains the bound ℓ(θ; x) ≥ F(q, θ):
    F(p(z | x, θ^t), θ^t) = ∑_z p(z | x, θ^t) log [ p(x, z | θ^t) / p(z | x, θ^t) ]
                          = ∑_z p(z | x, θ^t) log p(x | θ^t)
                          = log p(x | θ^t) = ℓ(θ^t; x)
Can also show this result using variational calculus or the fact that
    ℓ(θ; x) − F(q, θ) = KL( q || p(z | x, θ) )
Eric Xing © Eric Xing @ CMU, 2006-2010 23
E-step ≡ plug in posterior expectation of latent variables
Without loss of generality, assume that p(x, z | θ) is a generalized exponential family distribution:
    p(x, z | θ) = (1/Z(θ)) h(x, z) exp{ ∑_i θ_i f_i(x, z) }
Special case: if p(X | Z) are GLIMs, then f_i(x, z) = η_i^T(z) ξ_i(x).
The expected complete log likelihood under q^{t+1} = p(z | x, θ^t) is
    ⟨ℓ_c(θ; x, z)⟩_{q^{t+1}} = ∑_z q(z | x, θ^t) log p(x, z | θ) = ∑_i θ_i ⟨f_i(x, z)⟩_{q(z|x,θ^t)} − A(θ) + const.
    → for GLIMs: ⟨ℓ_c(θ; x, z)⟩_{q^{t+1}} = ∑_i η_i^T( ⟨z⟩_{q(z|x,θ^t)} ) ξ_i(x) − A(θ) + const.
where A(θ) = log Z(θ) and the constant collects terms that do not depend on θ.
Eric Xing © Eric Xing @ CMU, 2006-2010 24
M-step: maximization of expected lc w.r.t. θ Note that the free energy breaks into two terms:
    F(q, θ) = ∑_z q(z | x) log [ p(x, z | θ) / q(z | x) ]
            = ∑_z q(z | x) log p(x, z | θ) − ∑_z q(z | x) log q(z | x)
            = ⟨ℓ_c(θ; x, z)⟩_q + H_q
The first term is the expected complete log likelihood (energy) and the second term, which does not depend on θ, is the entropy.
Thus, in the M-step, maximizing with respect to θ for fixed q we only need to consider the first term:
    θ^{t+1} = argmax_θ ⟨ℓ_c(θ; x, z)⟩_{q^{t+1}} = argmax_θ ∑_z q^{t+1}(z | x) log p(x, z | θ)
Under the optimal q^{t+1}, this is equivalent to solving a standard MLE of the fully observed model p(x, z | θ), with the sufficient statistics involving z replaced by their expectations w.r.t. p(z | x, θ).
Eric Xing © Eric Xing @ CMU, 2006-2010 25
Summary: EM Algorithm A way of maximizing likelihood function for latent variable
models. Finds MLE of parameters when the original (hard) problem can be broken up into two (easy) pieces:
1. Estimate some “missing” or “unobserved” data from observed data and current parameters.
2. Using this “complete” data, find the maximum likelihood parameter estimates.
Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:
    E-step:  q^{t+1} = argmax_q F(q, θ^t)
    M-step:  θ^{t+1} = argmax_θ F(q^{t+1}, θ)
In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making the bound tight (equal to the likelihood).
Eric Xing © Eric Xing @ CMU, 2006-2010 26
From static to dynamic mixture models
Dynamic mixture (HMM): a chain of hidden states Y1 → Y2 → Y3 → … → YT, each emitting an observation Xt.
Static mixture: a single hidden label Y1 emitting X1, replicated over N samples.
The sequence (observations): speech signal, sequence of dice rolls, …
The underlying source (hidden states): phonemes, the dice being used, …
Eric Xing © Eric Xing @ CMU, 2006-2010 27
Chromosomes of tumor cell:
Predicting Tumor Cell States
Eric Xing © Eric Xing @ CMU, 2006-2010 28
Copy number profile for chromosome 1 from the 600 MPE cell line
Copy number profile for chromosome 8 from the COLO320 cell line
60-70 fold amplification of the CMYC region
Copy number profile for chromosome 8 in the MDA-MB-231 cell line
deletion
DNA Copy number aberration types in breast cancer
Eric Xing © Eric Xing @ CMU, 2006-2010 29
A real CGH run
Eric Xing © Eric Xing @ CMU, 2006-2010 30
Hidden Markov Model
Observation space
    Alphabetic set: C = {c_1, c_2, …, c_K};  Euclidean space: R^d
Index set of hidden states
    I = {1, 2, …, M}
Transition probabilities between any two states
    p(y_t^j = 1 | y_{t-1}^i = 1) = a_{i,j},  or  p(y_t | y_{t-1}^i = 1) ~ Multinomial(a_{i,1}, a_{i,2}, …, a_{i,M}), ∀ i ∈ I.
Start probabilities
    p(y_1) ~ Multinomial(π_1, π_2, …, π_M)
Emission probabilities associated with each state
    p(x_t | y_t^i = 1) ~ Multinomial(b_{i,1}, b_{i,2}, …, b_{i,K}), ∀ i ∈ I,
    or in general:  p(x_t | y_t^i = 1) ~ f(· | θ_i), ∀ i ∈ I.
(Graphical model: y1 → y2 → … → yT with emissions x1, x2, …, xT; equivalently a state automaton over the M states.)
Eric Xing © Eric Xing @ CMU, 2006-2010 31
The Dishonest Casino
A casino has two dice: Fair die
P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Loaded die
P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2
Casino player switches back-&-forth between fair and loaded die once every 20 turns
Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with fair die, maybe with loaded die)
4. Highest number wins $2
Eric Xing © Eric Xing @ CMU, 2006-2010 32
FAIR ⇄ LOADED, with transition probabilities 0.95 (stay in the same state) and 0.05 (switch) in each direction.
P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10,  P(6|L) = 1/2
The Dishonest Casino Model
Eric Xing © Eric Xing @ CMU, 2006-2010 33
Puzzles Regarding the Dishonest Casino
GIVEN: A sequence of rolls by the casino player
1245526462146146136136661664661636616366163616515615115146123562344
QUESTION How likely is this sequence, given our model of how the casino
works? This is the EVALUATION problem in HMMs
What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs
How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs
Eric Xing © Eric Xing @ CMU, 2006-2010 34
Joint Probability
1245526462146146136136661664661636616366163616515615115146123562344
Eric Xing © Eric Xing @ CMU, 2006-2010 35
Probability of a Parse Given a sequence x = x1……xT
and a parse y = y1, ……, yT, To find how likely is the parse:
(given our HMM and the sequence)
p(x, y) = p(x1 … xT, y1, …, yT)   (joint probability)
        = p(y1) p(x1 | y1) p(y2 | y1) p(x2 | y2) … p(yT | yT-1) p(xT | yT)
        = p(y1) p(y2 | y1) … p(yT | yT-1) × p(x1 | y1) p(x2 | y2) … p(xT | yT)
Marginal probability:
    p(x) = ∑_y p(x, y) = ∑_{y1} ∑_{y2} … ∑_{yT} π_{y1} ∏_{t=2}^T a_{y_{t-1}, y_t} ∏_{t=1}^T p(x_t | y_t)
Posterior probability:
    p(y | x) = p(x, y) / p(x)
(Graphical model: y1 → y2 → … → yT with emissions x1, x2, …, xT.)
Eric Xing © Eric Xing @ CMU, 2006-2010 36
Example: the Dishonest Casino Let the sequence of rolls be:
x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4
Then, what is the likelihood of y = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair?
(say initial probs a_{0,Fair} = ½, a_{0,Loaded} = ½)
½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair)
    = ½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 = 5.21 × 10^-9
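The arithmetic above is easy to check with a few lines of code. Below is a small sketch (not from the slides; the dictionary-based representation is my own) that evaluates p(x, y) for the dishonest casino model:

```python
# Dishonest casino HMM from the slides: states F(air) and L(oaded)
a = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.05, 'L': 0.95}}   # transition probs
b = {'F': {r: 1/6 for r in range(1, 7)},
     'L': {**{r: 1/10 for r in range(1, 6)}, 6: 1/2}}             # emission probs
start = {'F': 0.5, 'L': 0.5}                                      # initial probs (as assumed above)

def parse_prob(x, y):
    """p(x, y) = p(y1) p(x1|y1) * prod_t p(yt|y_{t-1}) p(xt|yt)."""
    p = start[y[0]] * b[y[0]][x[0]]
    for t in range(1, len(x)):
        p *= a[y[t - 1]][y[t]] * b[y[t]][x[t]]
    return p

x = [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]
print(parse_prob(x, ['F'] * 10))   # ~5.21e-9, matching the slide
print(parse_prob(x, ['L'] * 10))   # ~7.9e-10
```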
Eric Xing © Eric Xing @ CMU, 2006-2010 37
Example: the Dishonest Casino So, the likelihood the die is fair in all this run
is just 5.21 × 10^-9
OK, but what is the likelihood of
y = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded?
½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded)
    = ½ × (1/10)^8 × (1/2)^2 × (0.95)^9 = 0.00000000078781176215 = 0.79 × 10^-9
Therefore, it is after all 6.59 times more likely that the die is fair all the way, than that it is loaded all the way.
Eric Xing © Eric Xing @ CMU, 2006-2010 38
Example: the Dishonest Casino Let the sequence of rolls be:
x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6
Now, what is the likelihood y = F, F, …, F?
    ½ × (1/6)^10 × (0.95)^9 = 5.21 × 10^-9, same as before
What is the likelihood y = L, L, …, L?
    ½ × (1/10)^4 × (1/2)^6 × (0.95)^9 = 0.00000049238235134735 ≈ 5 × 10^-7
So, it is 100 times more likely the die is loaded
Eric Xing © Eric Xing @ CMU, 2006-2010 39
Three Main Questions on HMMs
1. Evaluation
    GIVEN an HMM M, and a sequence x,
    FIND P(x | M)
    ALGO: Forward
2. Decoding
    GIVEN an HMM M, and a sequence x,
    FIND the sequence y of states that maximizes, e.g., P(y | x, M), or the most probable subsequence of states
    ALGO: Viterbi, Forward-backward
3. Learning
    GIVEN an HMM M, with unspecified transition/emission probs., and a sequence x,
    FIND parameters θ = (π_i, a_ij, η_ik) that maximize P(x | θ)
    ALGO: Baum-Welch (EM)
Eric Xing © Eric Xing @ CMU, 2006-2010 40
Applications of HMMs
Some early applications of HMMs:
    finance (but we never saw them)
    speech recognition
    modelling ion channels
In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.
Some current applications of HMMs to biology:
    mapping chromosomes
    aligning biological sequences
    predicting sequence structure
    inferring evolutionary relationships
    finding genes in DNA sequence
Eric Xing © Eric Xing @ CMU, 2006-2010 41
The Forward Algorithm We want to calculate P(x), the likelihood of x, given the HMM
Sum over all possible ways of generating x:
    p(x) = ∑_y p(x, y) = ∑_{y1} ∑_{y2} … ∑_{yT} π_{y1} ∏_{t=2}^T a_{y_{t-1}, y_t} ∏_{t=1}^T p(x_t | y_t)
To avoid summing over an exponential number of paths y, define
    α_t^k ≜ P(x1, …, x_t, y_t^k = 1)   (the forward probability)
The recursion:
    α_t^k = p(x_t | y_t^k = 1) ∑_i α_{t-1}^i a_{i,k}
    P(x) = ∑_k α_T^k
Eric Xing © Eric Xing @ CMU, 2006-2010 42
The Forward Algorithm – derivation
Compute the forward probability:
    α_t^k = P(x1, …, x_{t-1}, x_t, y_t^k = 1)
          = ∑_{y_{t-1}} P(x1, …, x_{t-1}, y_{t-1}, x_t, y_t^k = 1)
          = ∑_{y_{t-1}} P(x1, …, x_{t-1}, y_{t-1}) P(y_t^k = 1 | y_{t-1}, x1, …, x_{t-1}) P(x_t | y_t^k = 1, y_{t-1}, x1, …, x_{t-1})
          = ∑_{y_{t-1}} P(x1, …, x_{t-1}, y_{t-1}) P(y_t^k = 1 | y_{t-1}) P(x_t | y_t^k = 1)
          = P(x_t | y_t^k = 1) ∑_i P(x1, …, x_{t-1}, y_{t-1}^i = 1) P(y_t^k = 1 | y_{t-1}^i = 1)
          = P(x_t | y_t^k = 1) ∑_i α_{t-1}^i a_{i,k}
(Chain rule: P(A, B, C) = P(A) P(B | A) P(C | A, B).)
Eric Xing © Eric Xing @ CMU, 2006-2010 43
The Forward Algorithm
We can compute α_t^k for all k, t, using dynamic programming!
Initialization:
    α_1^k = P(x1, y_1^k = 1) = P(x1 | y_1^k = 1) P(y_1^k = 1) = P(x1 | y_1^k = 1) π_k
Iteration:
    α_t^k = P(x_t | y_t^k = 1) ∑_i α_{t-1}^i a_{i,k}
Termination:
    P(x) = ∑_k α_T^k
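A minimal numpy sketch of these three steps (not from the slides; the array layout and names are mine), applied to the dishonest casino model:

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[t, k] = p(x_1..x_t, y_t = k). pi: (K,), A: (K, K) with A[i, j] = a_{i,j},
    B: (K, V) with B[k, v] = p(x_t = v | y_t = k). x is a list of integer symbols."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, x[0]]                       # initialization
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)   # iteration
    return alpha, alpha[-1].sum()                    # termination: p(x) = sum_k alpha[T, k]

# Dishonest casino: state 0 = Fair, 1 = Loaded; symbols 0..5 encode rolls 1..6
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
x = [r - 1 for r in [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]]
alpha, px = forward(x, pi, A, B)   # alpha[0] = [0.0833, 0.0500], as in the tables below
```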
Eric Xing © Eric Xing @ CMU, 2006-2010 44
The Backward Algorithm
We want to compute P(y_t^k = 1 | x), the posterior probability distribution of the state at the t-th position, given x.
We start by computing
    P(y_t^k = 1, x) = P(x1, …, x_t, y_t^k = 1, x_{t+1}, …, xT)
                    = P(x1, …, x_t, y_t^k = 1) P(x_{t+1}, …, xT | x1, …, x_t, y_t^k = 1)
                    = P(x1, …, x_t, y_t^k = 1) P(x_{t+1}, …, xT | y_t^k = 1)
                    = α_t^k β_t^k
    Forward: α_t^k     Backward: β_t^k ≜ P(x_{t+1}, …, xT | y_t^k = 1)
The recursion:
    β_t^k = ∑_i a_{k,i} p(x_{t+1} | y_{t+1}^i = 1) β_{t+1}^i
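And a matching sketch of the backward recursion, under the same assumed array layout as the forward sketch above:

```python
import numpy as np

def backward(x, A, B):
    """beta[t, k] = p(x_{t+1}, ..., x_T | y_t = k), with beta[T, k] = 1."""
    T, K = len(x), A.shape[0]
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        # beta[t, k] = sum_i a_{k,i} * p(x_{t+1} | y_{t+1} = i) * beta[t+1, i]
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

# Same casino encoding as in the forward sketch
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
x = [r - 1 for r in [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]]
beta = backward(x, A, B)
```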
Eric Xing © Eric Xing @ CMU, 2006-2010 45
Example:
(Dishonest casino model as above; x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4.)
    α_t^k = P(x_t | y_t^k = 1) ∑_i α_{t-1}^i a_{i,k}
    β_t^k = ∑_i a_{k,i} P(x_{t+1} | y_{t+1}^i = 1) β_{t+1}^i
Eric Xing © Eric Xing @ CMU, 2006-2010 46
Alpha (actual), columns (Fair, Loaded), t = 1..10:
    0.0833  0.0500
    0.0136  0.0052
    0.0022  0.0006
    0.0004  0.0001
    0.0001  0.0000
    0.0000  0.0000
    0.0000  0.0000
    0.0000  0.0000
    0.0000  0.0000
    0.0000  0.0000
Beta (actual), columns (Fair, Loaded), t = 1..10:
    0.0000  0.0000
    0.0000  0.0000
    0.0000  0.0000
    0.0000  0.0000
    0.0001  0.0001
    0.0007  0.0006
    0.0045  0.0055
    0.0264  0.0112
    0.1633  0.1033
    1.0000  1.0000
(Dishonest casino model as above; x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4; recursions as on the previous slide.)
Eric Xing © Eric Xing @ CMU, 2006-2010 47
Alpha (logs), columns (Fair, Loaded), t = 1..10:
     -2.4849   -2.9957
     -4.2969   -5.2655
     -6.1201   -7.4896
     -7.9499   -9.6553
     -9.7834  -10.1454
    -11.5905  -12.4264
    -13.4110  -14.6657
    -15.2391  -15.2407
    -17.0310  -17.5432
    -18.8430  -19.8129
Beta (logs), columns (Fair, Loaded), t = 1..10:
    -16.2439  -17.2014
    -14.4185  -14.9922
    -12.6028  -12.7337
    -10.8042  -10.4389
     -9.0373   -9.7289
     -7.2181   -7.4833
     -5.4135   -5.1977
     -3.6352   -4.4938
     -1.8120   -2.2698
      0         0
(Dishonest casino model as above; x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4; recursions as on the previous slides.)
Eric Xing © Eric Xing @ CMU, 2006-2010 48
What is the probability of a hidden state prediction?
Eric Xing © Eric Xing @ CMU, 2006-2010 49
Posterior decoding
We can now calculate
    p(y_t^k = 1 | x) = p(y_t^k = 1, x) / p(x) = α_t^k β_t^k / p(x)
Then, we can ask: what is the most likely state at position t of sequence x?
    k*_t = argmax_k p(y_t^k = 1 | x)
Note that this is an MPA of a single hidden state; what if we want an MPA of a whole hidden state sequence?
Posterior Decoding:
    { y_t^{k*_t} = 1 : t = 1, …, T }
This is different from the MPA of a whole sequence of hidden states. This can be understood as bit error rate vs. word error rate.
Example: MPA of X? MPA of (X, Y)?
    x  y  P(x,y)
    0  0  0.35
    0  1  0.05
    1  0  0.30
    1  1  0.30
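A short sketch of posterior (forward-backward) decoding, assuming the forward() and backward() functions from the earlier sketches are in scope (names are mine, not from the slides):

```python
import numpy as np

def posterior_decode(x, pi, A, B):
    """Pick k*_t = argmax_k p(y_t = k | x) independently at each position,
    using gamma[t, k] = alpha[t, k] * beta[t, k] / p(x)."""
    alpha, px = forward(x, pi, A, B)    # forward sketch above
    beta = backward(x, A, B)            # backward sketch above
    gamma = alpha * beta / px
    return gamma.argmax(axis=1), gamma
```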
Eric Xing © Eric Xing @ CMU, 2006-2010 50
Viterbi decoding GIVEN x = x1, …, xT, we want to find y = y1, …, yT, such that
P(y|x) is maximized: y* = argmax_y P(y|x) = argmax_y P(y, x)
Let
    V_t^k = max_{y1,…,y_{t-1}} P(x1, …, x_{t-1}, y1, …, y_{t-1}, x_t, y_t^k = 1)
          = probability of the most likely sequence of states ending at state y_t = k
The recursion:
    V_t^k = p(x_t | y_t^k = 1) max_i a_{i,k} V_{t-1}^i
(using p(x1, …, x_t, y1, …, y_t) = π_{y1} b_{y1,x1} a_{y1,y2} b_{y2,x2} … a_{y_{t-1},y_t} b_{y_t,x_t})
(Trellis figure: states 1, …, K unrolled over positions x1, x2, x3, …, xN.)
Underflows are a significant problem: these numbers become extremely small. Solution: take the logs of all values:
    V_t^k = log p(x_t | y_t^k = 1) + max_i ( log a_{i,k} + V_{t-1}^i )
Eric Xing © Eric Xing @ CMU, 2006-2010 51
Computational Complexity and implementation details What is the running time, and space required, for Forward,
and Backward?
Time: O(K²N); Space: O(KN).
Useful implementation techniques to avoid underflows:
    Viterbi: sum of logs
    Forward/Backward: rescaling at each position by multiplying by a constant
    α_t^k = p(x_t | y_t^k = 1) ∑_i α_{t-1}^i a_{i,k}
    β_t^k = ∑_i a_{k,i} p(x_{t+1} | y_{t+1}^i = 1) β_{t+1}^i
    V_t^k = p(x_t | y_t^k = 1) max_i a_{i,k} V_{t-1}^i
Eric Xing © Eric Xing @ CMU, 2006-2010 52
(Homework!)
Learning HMM Given x = x1…xN for which the true state path y = y1…yN is
known,
Define:Aij = # times state transition i→j occurs in yBik = # times state i in y emits k in x
We can show that the maximum likelihood parameters θ are:
What if y is continuous? We can treat it as N×T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian …
    a_ij^ML = #(i→j transitions in y) / #(i→· transitions in y) = ∑_n ∑_{t=2}^T y_{n,t-1}^i y_{n,t}^j / ∑_n ∑_{t=2}^T y_{n,t-1}^i = A_ij / ∑_{j'} A_ij'
    b_ik^ML = #(state i emits k in x) / #(state i emits anything) = ∑_n ∑_{t=1}^T y_{n,t}^i x_{n,t}^k / ∑_n ∑_{t=1}^T y_{n,t}^i = B_ik / ∑_{k'} B_ik'
where the training data are { (x_{n,t}, y_{n,t}) : t = 1, …, T, n = 1, …, N } and the complete log likelihood is
    ℓ(θ; x, y) = log p(x, y) = log ∏_n ( p(y_{n,1}) ∏_{t=2}^T p(y_{n,t} | y_{n,t-1}) ∏_{t=1}^T p(x_{n,t} | y_{n,t}) )
(Homework!)
Eric Xing © Eric Xing @ CMU, 2006-2010 53
Unsupervised ML estimation Given x = x1…xN for which the true state path y = y1…yN is
unknown,
EXPECTATION MAXIMIZATION
0. Start with our best guess of a model M and parameters θ.
1. Estimate A_ij, B_ik in the training data — how? by the expected counts given below (homework).
2. Update θ according to Aij , Bik Now a "supervised learning" problem
3. Repeat 1 & 2, until convergence
This is called the Baum-Welch Algorithm
We can get to a provably more (or equally) likely parameter set θ each iteration
Expected counts for step 1:
    A_ij = ∑_n ∑_t ⟨y_{n,t-1}^i y_{n,t}^j⟩ ,    B_ik = ∑_n ∑_t ⟨y_{n,t}^i⟩ x_{n,t}^k
Eric Xing © Eric Xing @ CMU, 2006-2010 54
The Baum-Welch algorithm
The complete log likelihood:
    ℓ_c(θ; x, y) = log p(x, y) = log ∏_n ( p(y_{n,1}) ∏_{t=2}^T p(y_{n,t} | y_{n,t-1}) ∏_{t=1}^T p(x_{n,t} | y_{n,t}) )
The expected complete log likelihood:
    ⟨ℓ_c(θ; x, y)⟩ = ∑_n ∑_i ⟨y_{n,1}^i⟩_{p(y_{n,1}|x_n)} log π_i
                   + ∑_n ∑_{t=2}^T ∑_{i,j} ⟨y_{n,t-1}^i y_{n,t}^j⟩_{p(y_{n,t-1},y_{n,t}|x_n)} log a_{i,j}
                   + ∑_n ∑_{t=1}^T ∑_{i,k} x_{n,t}^k ⟨y_{n,t}^i⟩_{p(y_{n,t}|x_n)} log b_{i,k}
EM
The E step:
    γ_{n,t}^i = ⟨y_{n,t}^i⟩ = p(y_{n,t}^i = 1 | x_n)
    ξ_{n,t}^{i,j} = ⟨y_{n,t-1}^i y_{n,t}^j⟩ = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 | x_n)
The M step ("symbolically" identical to MLE):
    π_i^ML = ∑_n γ_{n,1}^i / N
    a_ij^ML = ∑_n ∑_{t=2}^T ξ_{n,t}^{i,j} / ∑_n ∑_{t=1}^{T-1} γ_{n,t}^i
    b_ik^ML = ∑_n ∑_{t=1}^T γ_{n,t}^i x_{n,t}^k / ∑_n ∑_{t=1}^T γ_{n,t}^i
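A compact single-sequence Baum-Welch sketch tying the E and M steps together (not from the slides; names and the random initialization are mine). It uses unscaled α/β, so for long sequences one would add the rescaling mentioned earlier:

```python
import numpy as np

def baum_welch(x, K, V, n_iter=50, seed=0):
    """Unsupervised EM for an HMM: forward-backward gives gamma and xi,
    then the parameters are re-estimated from normalized expected counts."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    A = rng.random((K, K)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((K, V)); B /= B.sum(axis=1, keepdims=True)
    T = len(x)
    for _ in range(n_iter):
        # E-step: forward-backward recursions
        alpha = np.zeros((T, K)); beta = np.ones((T, K))
        alpha[0] = pi * B[:, x[0]]
        for t in range(1, T):
            alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
        px = alpha[-1].sum()
        gamma = alpha * beta / px                               # gamma[t, i] = p(y_t = i | x)
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              (B[:, x[1:]].T * beta[1:])[:, None, :]) / px      # xi[t, i, j] = p(y_t=i, y_{t+1}=j | x)
        # M-step: normalized expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros((K, V))
        for v in range(V):
            B[:, v] = gamma[np.array(x) == v].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

# Toy run on the short casino sequence (symbols 0..5 encode rolls 1..6)
pi, A, B = baum_welch([r - 1 for r in [1, 2, 1, 5, 6, 2, 1, 6, 2, 4]], K=2, V=6)
```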
Eric Xing © Eric Xing @ CMU, 2006-2010 55
Summary Modeling hidden transitional trajectories (in discrete state
space, such as cluster label, DNA copy number, dice-choice, etc.) underlying observed sequence data (discrete, such as dice outcomes; or continuous, such as CGH signals)
Useful for parsing, segmenting sequential data Important HMM computations:
The joint likelihood of a parse and the data can be written as a product of local terms (i.e., initial prob, transition prob, emission prob.)
Computing the marginal likelihood of the observed sequence: forward algorithm
Predicting a single hidden state: forward-backward
Predicting an entire sequence of hidden states: Viterbi
Learning HMM parameters: an EM algorithm known as Baum-Welch