Ch 9. Mixture Models and EM
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/
Contents
9.1 K-means Clustering
9.2 Mixtures of Gaussians
  9.2.1 Maximum likelihood
  9.2.2 EM for Gaussian mixtures
9.3 An Alternative View of EM
  9.3.1 Gaussian mixtures revisited
  9.3.2 Relation to K-means
  9.3.3 Mixtures of Bernoulli distributions
9.4 The EM Algorithm in General
9.1 K-means Clustering (1/3)
The problem of identifying groups, or clusters, of data points in a multidimensional space: partitioning the data set into some number K of clusters
Cluster: a group of data points whose inter-point distances are small compared with the distances to points outside of the cluster
Goal: an assignment of data points to clusters such that the sum of the squares of the distances of each data point to its closest prototype vector (the center of its cluster) is a minimum
9.1 K-means Clustering (2/3)
Two-stage optimization of the distortion measure J
In the 1st stage: minimize J with respect to the r_nk, keeping the μ_k fixed
In the 2nd stage: minimize J with respect to the μ_k, keeping the r_nk fixed
μ_k is then the mean of all of the data points assigned to cluster k (the updates are written out below)
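For reference, the distortion measure and the two closed-form updates as given in PRML (Eqs. 9.1-9.4):

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2

Stage 1 (minimize over the assignments): r_{nk} = 1 if k = \arg\min_j \lVert x_n - \mu_j \rVert^2, and r_{nk} = 0 otherwise

Stage 2 (minimize over the centers): \mu_k = \frac{\sum_n r_{nk} \, x_n}{\sum_n r_{nk}}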
9.1 K-means Clustering (3/3)
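To make the two-stage procedure concrete, here is a minimal NumPy sketch; the synthetic data and the choice K = 2 are illustrative assumptions, not from the slides:

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    # X: (N, D) data matrix; K: number of clusters
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # init centers at random data points
    for _ in range(n_iters):
        # Stage 1: assign each point to its nearest center (minimize J w.r.t. r_nk)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        r = d2.argmin(axis=1)
        # Stage 2: move each center to the mean of its assigned points (minimize J w.r.t. mu_k)
        new_mu = np.array([X[r == k].mean(axis=0) if (r == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged: assignments no longer move the centers
            break
        mu = new_mu
    return mu, r

# Illustrative usage on synthetic 2-D data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 4.0])
centers, assignments = kmeans(X, K=2)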
9.2 Mixtures of Gaussians (1/3): a formulation of Gaussian mixtures in terms of discrete latent variables
The Gaussian mixture distribution can be written as a linear superposition of Gaussians:
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x|\mu_k, \Sigma_k) …(*)
An equivalent formulation of the Gaussian mixture involves an explicit latent variable: a binary random variable z having a 1-of-K representation, with the usual graphical representation of a mixture model
The marginal distribution of x is then a Gaussian mixture of the form (*); for every observed data point x_n there is a corresponding latent variable z_n (the latent-variable equations are given below)
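The latent-variable formulation referred to above, as given in PRML (Eqs. 9.10-9.12):

p(z) = \prod_{k=1}^{K} \pi_k^{z_k}, \qquad p(x|z) = \prod_{k=1}^{K} \mathcal{N}(x|\mu_k, \Sigma_k)^{z_k}

p(x) = \sum_{z} p(z) \, p(x|z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x|\mu_k, \Sigma_k)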
9.2 Mixtures of Gaussians (2/3)
γ(z_k) can also be viewed as the responsibility that component k takes for explaining the observation x (its formula is given below)
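The responsibility is the posterior probability of z_k = 1 given x, obtained from Bayes' theorem (PRML Eq. 9.13):

\gamma(z_k) \equiv p(z_k = 1 | x) = \frac{\pi_k \, \mathcal{N}(x|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x|\mu_j, \Sigma_j)}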
9.2 Mixtures of Gaussians (3/3)
Generating random samples distributed according to the Gaussian mixture model: generate a value for z, denoted ẑ, from the marginal distribution p(z), and then generate a value for x from the conditional distribution p(x|ẑ) (sketched in code below)
a. The three states of z, corresponding to the three components of the mixture, are depicted in red, green, and blue
b. The corresponding samples from the marginal distribution p(x)
c. The same samples, in which the colors represent the value of the responsibilities γ(z_nk) associated with each data point, illustrating the responsibilities by evaluating the posterior probability for each component in the mixture distribution from which this data set was generated
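A short sketch of this ancestral sampling scheme in NumPy; the three-component parameters here are illustrative placeholders, not the ones behind the slide's figure:

import numpy as np

rng = np.random.default_rng(1)
# Illustrative 3-component 2-D mixture (placeholder parameters)
pis = np.array([0.5, 0.3, 0.2])
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])
covs = np.array([np.eye(2)] * 3)

def sample_mixture(n):
    # Ancestral sampling: first z-hat ~ p(z), then x ~ p(x | z-hat)
    z = rng.choice(len(pis), size=n, p=pis)
    x = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return z, x

z, x = sample_mixture(500)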
9.2.1 Maximum likelihood (1/3)
Graphical representation of a Gaussian mixture model for a set of N i.i.d. data points {x_n}, with corresponding latent points {z_n}
The log of the likelihood function:
\ln p(X|\pi,\mu,\Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n|\mu_k, \Sigma_k) \right\} …(*1)
9.2.1 Maximum likelihood (2/3)
For simplicity, consider a Gaussian mixture whose components have covariance matrices given by \Sigma_k = \sigma_k^2 I
Suppose that one of the components of the mixture model has its mean \mu_j exactly equal to one of the data points, so that \mu_j = x_n
This data point will contribute a term to the likelihood function of the form
\mathcal{N}(x_n|x_n, \sigma_j^2 I) = \frac{1}{(2\pi\sigma_j^2)^{D/2}}
Once there are at least two components in the mixture, one of the components can have a finite variance and therefore assign finite probability to all of the data points, while the other component can shrink onto one specific data point and thereby contribute an ever-increasing additive value to the log likelihood: a severe over-fitting problem (a singularity)
Over-fitting when one of the Gaussian components collapses onto a specific data point
With a single Gaussian fitted to the data set D = {x_1, …, x_N}, a collapse onto one data point does not blow up the likelihood: the multiplicative factors contributed by the other data points go to zero exponentially fast, so the likelihood goes to zero rather than infinity. In a mixture, by contrast, with \Sigma_j = \sigma_j^2 I and \mu_j = x_n,

\lim_{\sigma_j \to 0} \mathcal{N}(x_n|\mu_j, \sigma_j^2 I) = \lim_{\sigma_j \to 0} \frac{1}{(2\pi\sigma_j^2)^{D/2}} \exp\left\{ -\frac{\lVert x_n - \mu_j \rVert^2}{2\sigma_j^2} \right\} = \infty

and this divergent term appears additively inside the sum over components, while another component keeps the probabilities of the remaining data points finite, so the log likelihood grows without bound.
9.2.1 Maximum likelihood (3/3)
Over-fitting problem: an example of over-fitting in a maximum likelihood approach; this problem does not occur in the case of a Bayesian approach. In applying maximum likelihood to Gaussian mixture models, there should be steps to avoid finding such pathological solutions and instead seek local maxima of the likelihood function that are well behaved
Identifiability problem: a K-component mixture will have a total of K! equivalent solutions, corresponding to the K! ways of assigning K sets of parameters to K components
Difficulty of maximizing the log likelihood function: the presence of the summation over k inside the logarithm gives no closed-form solution, unlike the single Gaussian case
9.2.2 EM for Gaussian mixtures (1/4)
I. Assign some initial values for the means, covariances, and mixing coefficients
II. Expectation or E step
• Use the current values of the parameters to evaluate the posterior probabilities, or responsibilities
III. Maximization or M step
• Use the results of II to re-estimate the means, covariances, and mixing coefficients
It is common to run the K-means algorithm in order to find suitable initial values
• The covariance matrices ← the sample covariances of the clusters found by the K-means algorithm
• The mixing coefficients ← the fractions of data points assigned to the respective clusters
9.2.2 EM for Gaussian mixtures (2/4)
Given a Gaussian mixture model, the goal is to maximize the likelihood function with respect to the parameters
1. Initialize the means μ_k, covariances Σ_k, and mixing coefficients π_k
2. E step
3. M step
4. Evaluate the log likelihood and check for convergence; if not converged, return to step 2 (the equations for each step are given below)
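The equations for these steps, as given in PRML (Eqs. 9.23-9.28):

E step:
\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n|\mu_j, \Sigma_j)}

M step:
\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T
\pi_k^{new} = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})

Log likelihood:
\ln p(X|\mu,\Sigma,\pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n|\mu_k, \Sigma_k) \right\}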
9.2.2 EM for Gaussian mixtures (3/4)
\ln p(X|\pi,\mu,\Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n|\mu_k, \Sigma_k) \right\} …(*2)
Setting the derivatives of (*2) with respect to the means of the Gaussian components to zero gives
\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n
a weighted mean of all of the points in the data set:
- each data point is weighted by the corresponding posterior probability, the responsibility γ(z_nk)
- the denominator N_k is the effective number of points associated with the corresponding component
Setting the derivatives of (*2) with respect to the covariances of the Gaussian components to zero gives
\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^T
9.2.2 EM for Gaussian mixtures (4/4)
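A compact NumPy/SciPy sketch of the full EM loop for a Gaussian mixture, following the four steps above; this is an illustrative implementation, not code from the slides, and the random-data-point initialization stands in for the K-means initialization the slides recommend:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialize means at random data points, covariances to identity, mixing to uniform
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # 2. E step: responsibilities gamma(z_nk)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])          # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M step: re-estimate means, covariances, mixing coefficients
        Nk = gamma.sum(axis=0)                               # effective counts N_k
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
        # 4. Log likelihood under the parameters used in this E step; check convergence
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma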
9.3 An Alternative View of EM
In maximizing the log likelihood function, the summation over k prevents the logarithm from acting directly on the joint distribution
Instead, the log likelihood function for the complete data set {X, Z} is straightforward. In practice, since we are not given the complete data set, we consider instead its expected value Q under the posterior distribution p(Z|X, Θ) of the latent variables
General EM
1. Choose an initial setting for the parameters Θ^old
2. E step: evaluate p(Z|X, Θ^old)
3. M step: evaluate Θ^new given by
   Θ^new = argmax_Θ Q(Θ, Θ^old),  where  Q(Θ, Θ^old) = Σ_Z p(Z|X, Θ^old) ln p(X, Z|Θ)
4. If the convergence criterion is not satisfied, let Θ^old ← Θ^new and return to step 2 (a generic code sketch of this loop follows)
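A generic skeleton of this loop in Python; e_step and m_step are hypothetical placeholders to be supplied per model (e.g. by the Gaussian-mixture steps above), not named functions from the slides:

def general_em(theta, e_step, m_step, max_iters=100, tol=1e-6):
    # e_step(theta) -> representation of the posterior p(Z|X, theta)
    # m_step(posterior) -> (theta_new, log_likelihood), maximizing Q(theta, theta_old)
    prev_ll = float("-inf")
    for _ in range(max_iters):
        posterior = e_step(theta)        # 2. E step
        theta, ll = m_step(posterior)    # 3. M step
        if abs(ll - prev_ll) < tol:      # 4. convergence criterion
            break
        prev_ll = ll
    return theta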
9.3.1 Gaussian mixtures revisited (1/2)
Maximizing the likelihood for the complete data {X, Z}: the logarithm acts directly on the Gaussian distribution, giving a much simpler solution to the maximum likelihood problem; the maximization with respect to a mean or a covariance is exactly as for a single Gaussian (closed form)
p(X, Z|\mu,\Sigma,\pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \, \mathcal{N}(x_n|\mu_k, \Sigma_k)^{z_{nk}}

\ln p(X, Z|\mu,\Sigma,\pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(x_n|\mu_k, \Sigma_k) \right\}

(cf. the incomplete-data log likelihood \ln p(X|\mu,\Sigma,\pi) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n|\mu_k, \Sigma_k))
9.3.1 Gaussian mixtures revisited (2/2)
Unknown latent variables: consider the expectation of the complete-data log likelihood with respect to the posterior distribution of the latent variables
Posterior distribution (cf. (9.10), (9.11)):
p(Z|X,\mu,\Sigma,\pi) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(x_n|\mu_k, \Sigma_k) \right]^{z_{nk}}

The expected value of the indicator variable under this posterior distribution:
E[z_{nk}] = \gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n|\mu_j, \Sigma_j)}

The expected value of the complete-data log likelihood function:
E_Z[\ln p(X, Z|\mu,\Sigma,\pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \ln \mathcal{N}(x_n|\mu_k, \Sigma_k) \right\} …(*3)
9.3.2 Relation to K-means
K-means performs a hard assignment of data points to clusters (each data point is associated uniquely with one cluster), whereas EM makes a soft assignment based on the posterior probabilities
K-means can be derived as a particular limit of EM for Gaussian mixtures with shared covariance εI: as ε goes to zero, the terms for which ||x_n − μ_k||² is largest go to zero most quickly; hence the responsibilities γ(z_nk) all go to zero except for the component whose mean is closest to x_n, whose responsibility goes to unity
Thus maximizing the expected complete-data log likelihood is equivalent to minimizing the distortion measure J for K-means
(In elliptical K-means, the covariance is also estimated.)
p(x|\mu_k, \Sigma_k) = \frac{1}{(2\pi\epsilon)^{D/2}} \exp\left\{ -\frac{1}{2\epsilon} \lVert x - \mu_k \rVert^2 \right\}

\gamma(z_{nk}) = \frac{\pi_k \exp\{ -\lVert x_n - \mu_k \rVert^2 / 2\epsilon \}}{\sum_j \pi_j \exp\{ -\lVert x_n - \mu_j \rVert^2 / 2\epsilon \}}

E_Z[\ln p(X, Z|\mu,\Sigma,\pi)] \to -\frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_n - \mu_k \rVert^2 + \text{const} \quad (\epsilon \to 0)
9.3.3 Mixtures of Bernoulli distributions
9.3.3 Mixtures of Bernoulli distributions (1/2)
Single component: consider a set of D binary variables x_i, i = 1, …, D (e.g. x = (1, 0, 0, 1, 1)), each governed by a Bernoulli distribution with parameter μ_i:
p(x|\mu) = \prod_{i=1}^{D} \mu_i^{x_i} (1 - \mu_i)^{1 - x_i}
E[x] = \mu, \qquad \mathrm{Cov}[x] = \mathrm{diag}\{ \mu_i (1 - \mu_i) \}

Mixture: with \mu = (\mu_1, …, \mu_K) and \pi = (\pi_1, …, \pi_K):
p(x|\mu,\pi) = \sum_{k=1}^{K} \pi_k \, p(x|\mu_k), \qquad p(x|\mu_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}
E[x] = \sum_{k=1}^{K} \pi_k \mu_k
\mathrm{Cov}[x] = \sum_{k=1}^{K} \pi_k \left\{ \Sigma_k + \mu_k \mu_k^T \right\} - E[x] E[x]^T, \qquad \Sigma_k = \mathrm{diag}\{ \mu_{ki} (1 - \mu_{ki}) \}
(note that the individual variables x_i are independent, given μ_k)

* Because the covariance matrix Cov[x] is no longer diagonal, the mixture distribution can capture correlations between the variables, unlike a single Bernoulli distribution.
9.3.3 Mixtures of Bernoulli distributions (2/2)
Log likelihood:
\ln p(X|\mu,\pi) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, p(x_n|\mu_k) \right\}

Complete-data log likelihood (z_n = (z_{n1}, …, z_{nK})^T is a binary indicator variable):
\ln p(X, Z|\mu,\pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \sum_{i=1}^{D} \left[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln(1 - \mu_{ki}) \right] \right\}

(E step)
\gamma(z_{nk}) = E[z_{nk}] = \frac{\pi_k \, p(x_n|\mu_k)}{\sum_{j=1}^{K} \pi_j \, p(x_n|\mu_j)}
E_Z[\ln p(X, Z|\mu,\pi)] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \left\{ \ln \pi_k + \sum_{i=1}^{D} \left[ x_{ni} \ln \mu_{ki} + (1 - x_{ni}) \ln(1 - \mu_{ki}) \right] \right\}

(M step)
\mu_k = \bar{x}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}), \qquad \pi_k = \frac{N_k}{N}

* In contrast to the mixture of Gaussians, there are no singularities in which the likelihood goes to infinity (a code sketch of these steps follows).
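A short NumPy sketch of these E and M steps for binary data; the densities are computed in log space for numerical stability, and the initialization is an illustrative choice, not from the slides:

import numpy as np

def em_bernoulli_mixture(X, K, n_iters=50, seed=0):
    # X: (N, D) binary data matrix
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = rng.uniform(0.25, 0.75, size=(K, D))   # illustrative init, kept away from 0/1
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E step: log p(x_n | mu_k), summed over the D Bernoulli factors
        log_p = X @ np.log(mu).T + (1 - X) @ np.log(1 - mu).T   # (N, K)
        log_w = np.log(pi) + log_p
        log_w -= log_w.max(axis=1, keepdims=True)               # stabilize before exp
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)               # responsibilities
        # M step: weighted means and mixing fractions
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        mu = np.clip(mu, 1e-6, 1 - 1e-6)                        # avoid log(0) next round
        pi = Nk / N
    return pi, mu, gamma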
9.3.3 Mixtures of Bernoulli distributions (2a/2)
N = 600 digit images: a mixture of K = 3 Bernoulli distributions fitted by 10 EM iterations; the parameters μ_k for each of the three components are shown, compared with a single multivariate Bernoulli fit
The analysis of Bernoulli mixtures can be extended to discrete variables having M > 2 states (Exercise 9.19)
9.4 The EM Algorithm in General (1/3)
Direct optimization of p(X|θ) is difficult, while optimization of the complete-data likelihood p(X, Z|θ) is significantly easier
Decomposition of the likelihood p(X|θ):
\ln p(X|\theta) = \mathcal{L}(q,\theta) + \mathrm{KL}(q\|p)
where
\mathcal{L}(q,\theta) = \sum_Z q(Z) \ln \left\{ \frac{p(X, Z|\theta)}{q(Z)} \right\}, \qquad \mathrm{KL}(q\|p) = -\sum_Z q(Z) \ln \left\{ \frac{p(Z|X,\theta)}{q(Z)} \right\}
(substitute \ln p(X, Z|\theta) = \ln p(Z|X,\theta) + \ln p(X|\theta) into the expression for \mathcal{L}(q,\theta))

Since \mathrm{KL}(q\|p) \ge 0, \mathcal{L}(q,\theta) is a lower bound on \ln p(X|\theta): \mathcal{L}(q,\theta) \le \ln p(X|\theta)
9.4 The EM Algorithm in General (2/3)
(E step) The lower bound L(q, θ^old) is maximized with respect to q(Z) while holding θ^old fixed. Since ln p(X|θ^old) does not depend on q(Z), L(q, θ^old) is largest when KL(q||p) vanishes, i.e. when q(Z) equals the posterior distribution p(Z|X, θ^old)
(M step) q(Z) is held fixed and the lower bound L(q, θ) is maximized with respect to θ, giving θ^new. The lower bound increases (unless it is already at a maximum), and because the updated θ makes KL(q||p) greater than zero, the increase in the log likelihood function is greater than the increase in the lower bound
In the M step, the quantity being maximized is the expectation of the complete-data log likelihood:
\mathcal{L}(q,\theta) = \sum_Z p(Z|X,\theta^{old}) \ln p(X, Z|\theta) - \sum_Z p(Z|X,\theta^{old}) \ln p(Z|X,\theta^{old}) = Q(\theta,\theta^{old}) + \text{const}
(where q(Z) is set to p(Z|X,\theta^{old}))
c.f. - 9.3 An Alternative View of EM
In maximizing the log likelihood function, the summation over k prevents the logarithm from acting directly on the joint distribution
Instead, the log likelihood function for the complete data set {X, Z} is straightforward. In practice, since we are not given the complete data set, we consider instead its expected value Q under the posterior distribution p(Z|X, Θ) of the latent variables
General EM
1. Choose an initial setting for the parameters Θ^old
2. E step: evaluate p(Z|X, Θ^old)
3. M step: evaluate Θ^new given by
   Θ^new = argmax_Θ Q(Θ, Θ^old),  where  Q(Θ, Θ^old) = Σ_Z p(Z|X, Θ^old) ln p(X, Z|Θ)
4. If the convergence criterion is not satisfied, let Θ^old ← Θ^new and return to step 2
9.4 The EM Algorithm in General (3/3)
Start with an initial parameter value θ^old
In the first E step, evaluating the posterior distribution over latent variables gives rise to a lower bound L(q, θ^old) whose value equals the log likelihood at θ^old (blue curve)
Note that the bound makes a tangential contact with the log likelihood at θ^old, so that both curves have the same gradient
For mixture components from the exponential family, this bound is a convex function
In the M step, the bound is maximized, giving the value θ^new, which gives a larger value of the log likelihood than θ^old
The subsequent E step constructs a bound that is tangential at θ^new (green curve)
9.4 The EM Algorithm in General (3a/3)
EM can also be used to maximize the posterior distribution p(θ|X) over parameters
Optimize the RHS alternately w.r.t. q and θ: the optimization w.r.t. q is the same E step; the M step requires only a small modification through the introduction of the prior term ln p(θ)
p(\theta|X) = p(\theta, X) / p(X), \qquad \ln p(\theta|X) = \ln p(\theta, X) - \ln p(X)
By the decomposition (9.70):
\ln p(\theta|X) = \mathcal{L}(q,\theta) + \mathrm{KL}(q\|p) + \ln p(\theta) - \ln p(X) \ge \mathcal{L}(q,\theta) + \ln p(\theta) + \text{const}
9.4 The EM Algorithm in General (3b/3)
For complex problems, the E step or the M step, or both, may be intractable: intractable M step → generalized EM (GEM), expectation conditional maximization (ECM); intractable E step → partial E step
GEM: instead of maximizing L(q, θ) w.r.t. θ, it seeks to change the parameters so as to increase its value
ECM: makes several constrained optimizations within each M step. For instance, the parameters are partitioned into groups, and the M step is broken down into multiple steps, each of which involves optimizing one of the subsets with the remainder held fixed
Partial (or incremental) E step: (Note) for any given θ, there is a unique maximum of L(q, θ) w.r.t. q. Since L(q*, θ) = ln p(X|θ), a θ* at the global maximum of L(q, θ) also globally maximizes ln p(X|θ); any algorithm that converges to the global maximum of L(q, θ) will find a value of θ that is also a global maximum of the log likelihood ln p(X|θ)
Each E or M step of a partial E step algorithm increases the value of L(q, θ), and if the algorithm converges to a local (or global) maximum of L(q, θ), this corresponds to a local (or global) maximum of the log likelihood function ln p(X|θ)
9.4 The EM Algorithm in General (3c/3)
(Incremental EM) For a Gaussian mixture, suppose data point x_m is updated, with old and new values of the responsibilities γ^old(z_mk) and γ^new(z_mk) obtained in the E step
In the M step, the means are updated incrementally, as shown below
Both the E step and the M step take a fixed time, independent of the total number of data points. Because the parameters are revised after each data point, rather than waiting until the whole data set has been processed, this incremental version can converge faster than the batch version.
Batch updates (9.17), (9.18):
\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n \quad (9.17), \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}) \quad (9.18)

Incremental updates:
\mu_k^{new} = \mu_k^{old} + \left( \frac{\gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})}{N_k^{new}} \right) (x_m - \mu_k^{old})
N_k^{new} = N_k^{old} + \gamma^{new}(z_{mk}) - \gamma^{old}(z_{mk})
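A minimal sketch of one such incremental M-step update in NumPy, assuming the counts Nk and means mu are already maintained; the variable names are illustrative:

import numpy as np

def incremental_m_step(x_m, gamma_old_m, gamma_new_m, Nk, mu):
    # x_m: (D,) the updated data point
    # gamma_old_m, gamma_new_m: (K,) old/new responsibilities of x_m
    # Nk: (K,) effective counts N_k; mu: (K, D) component means (updated in place)
    delta = gamma_new_m - gamma_old_m            # change in responsibility per component
    Nk += delta                                  # N_k^new = N_k^old + delta_k
    mu += (delta / Nk)[:, None] * (x_m - mu)     # move each mean toward/away from x_m
    return Nk, mu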