CS9840a Learning and Computer Vision
Prof. Olga Veksler
Lecture 11: Unsupervised Learning, EM
Today
• New Topic: Unsupervised Learning
  • supervised vs. unsupervised learning
  • nonparametric unsupervised learning = clustering
    • proximity measures
    • criterion functions
    • k-means
  • brief intro to Bayesian decision theory
    • need this for parametric supervised learning
  • parametric unsupervised learning
    • Expectation Maximization (EM)
Supervised vs. Unsupervised Learning
• Up to now we considered the supervised learning scenario, where we have
  1. samples x1,…, xn
  2. a class label yi for each sample xi
  • this is also called learning with a teacher
• In this lecture we consider the unsupervised learning scenario, where we are given only
  1. samples x1,…, xn
  • this is also called learning without a teacher
  • we do not split the data into training and test sets
  • it is harder to say how good the results are, compared to the supervised case
Unsupervised Learning
• Data is not labeled
• Parametric Approach
  • assume a parametric distribution of the data
  • estimate the parameters of this distribution
• Non-parametric Approach
  • group the data into clusters
  • samples inside each cluster are more similar than samples across clusters
  • each cluster (hopefully) says something about the categories (classes) present in the data
Why Unsupervised Learning?
• Unsupervised learning is harder
  • How do we know if the results are meaningful? No answer labels are available.
    • let an expert look at the results (external evaluation)
    • define an objective function on the clustering (internal evaluation)
• We nevertheless need it, because
  1. labeling large datasets is very costly (speech recognition)
    • sometimes we can label only a few examples by hand
  2. we may have no idea what, or how many, classes there are (data mining)
  3. we may want to use clustering to gain some insight into the structure of the data before designing a classifier
    • clustering as data description
Clustering
• Seek “natural” clusters in the data
• What is a good clustering?
  • internal (within-cluster) distances should be small
  • external (between-cluster) distances should be large
• Clustering is a way to discover new categories (classes)
What we Need for Clustering
1. A proximity measure: either a dissimilarity d or a similarity s (large d corresponds to small s, and large s to small d)
  • similarity measure s(xi, xk): large if xi and xk are similar
  • dissimilarity (or distance) measure d(xi, xk): small if xi and xk are similar
2. A criterion function to evaluate a clustering (to tell a good clustering from a bad one)
3. An algorithm to compute a clustering
  • usually by optimizing the criterion function
How Many Clusters?
• 3 clusters or 2 clusters?
• Possible approaches:
  • fix the number of clusters to k
  • find the best clustering according to the criterion function (the number of clusters may vary)
Proximity Measures
• A good proximity measure is application dependent
  • clusters should be invariant under transformations “natural” to the problem
  • for object recognition, we should have invariance to rotation: a rotated object stays at a small distance from the original
  • for character recognition, no invariance to rotation: rotating a “6” gives a “9”, which should be at a large distance
Distance (dissimilarity) Measures
• Euclidean distance
  $d(x_i, x_j) = \left( \sum_{k=1}^{d} \left( x_i^{(k)} - x_j^{(k)} \right)^2 \right)^{1/2}$
  • translation invariant
• Manhattan (city block) distance
  $d(x_i, x_j) = \sum_{k=1}^{d} \left| x_i^{(k)} - x_j^{(k)} \right|$
  • translation invariant
  • approximation to Euclidean distance, cheaper to compute
• Chebyshev distance
  $d(x_i, x_j) = \max_{1 \le k \le d} \left| x_i^{(k)} - x_j^{(k)} \right|$
  • approximation to Euclidean distance, cheapest to compute
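A minimal NumPy sketch of these three distances (illustrative code, not from the slides; the function names are my own):

import numpy as np

def euclidean(xi, xj):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((xi - xj) ** 2))

def manhattan(xi, xj):
    # sum of absolute coordinate differences; cheaper than Euclidean
    return np.sum(np.abs(xi - xj))

def chebyshev(xi, xj):
    # largest absolute coordinate difference; cheapest of the three
    return np.max(np.abs(xi - xj))

xi, xj = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(xi, xj), manhattan(xi, xj), chebyshev(xi, xj))  # 5.0 7.0 4.0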
Similarity Measures
• Cosine similarity
  $s(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|}$
  • smaller angle gives larger similarity
  • scale invariant measure
  • popular in text retrieval
• Correlation coefficient
  • popular in image processing
  $s(x_i, x_j) = \frac{\sum_{k=1}^{d} \left( x_i^{(k)} - \bar{x}_i \right) \left( x_j^{(k)} - \bar{x}_j \right)}{\left[ \sum_{k=1}^{d} \left( x_i^{(k)} - \bar{x}_i \right)^2 \, \sum_{k=1}^{d} \left( x_j^{(k)} - \bar{x}_j \right)^2 \right]^{1/2}}$
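The two similarity measures in NumPy (a sketch under the same caveats; note that the correlation coefficient is just cosine similarity applied to mean-centered vectors):

import numpy as np

def cosine_similarity(xi, xj):
    # s(xi, xj) = xi^T xj / (||xi|| ||xj||); scale invariant
    return (xi @ xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))

def correlation(xi, xj):
    # center each vector by its own mean, then compare directions
    ci, cj = xi - xi.mean(), xj - xj.mean()
    return (ci @ cj) / np.sqrt(np.sum(ci ** 2) * np.sum(cj ** 2))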
Feature Scale
• Old problem: how to choose the appropriate relative scale for the features?
  • [length (in meters or cm?), weight (in grams or kg?)]
• In supervised learning, we can normalize to zero mean, unit variance with no problems
• In clustering this is more problematic: if the variance in the data is due to cluster presence, then normalizing the features is not a good thing
  (figure: clusters before normalization vs. after normalization)
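The zero-mean, unit-variance normalization mentioned above, as a small sketch (whether to apply it in clustering is exactly the judgment call this slide warns about):

import numpy as np

def standardize(X):
    # X: n x d data matrix; rescale each feature to zero mean, unit variance.
    # Caution: if a feature's variance comes from cluster structure,
    # this rescaling can wash that structure out.
    return (X - X.mean(axis=0)) / X.std(axis=0)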
Criterion Functions for Clustering
• Have samples x1,…, xn
• Suppose we partitioned the samples into c subsets D1,…, Dc
• There are approximately $c^n / c!$ distinct partitions
• Can define a criterion function J(D1,…, Dc) which measures the quality of a partitioning D1,…, Dc
• Then the clustering problem is well defined
  • the optimal clustering is the partition which optimizes the criterion function
Iterative Optimization Algorithms
• Now we have both a proximity measure and a criterion function; we need an algorithm to find the optimal clustering
• Exhaustive search is impossible, since there are approximately $c^n / c!$ possible partitions
• Usually some iterative algorithm is used:
  1. find a reasonable initial partition
  2. repeat: move samples from one group to another s.t. the objective function J is improved
  (figure: moving samples improves J from 777,777 to 666,666)
K‐means Clustering
• Probably the most famous clustering algorithm
• Objective function:
  $J_{SSE} = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - \mu_i\|^2$
• can work for some other objective functions, but not proven to work as well
• Fix the number of clusters to k (c = k)
• Iterative clustering algorithm
  • moves many samples from one cluster to another in a smart way
K‐means Clustering (example with k = 3)
1. Initialize
  • pick k cluster centers arbitrarily
  • assign each example to the closest center
2. Compute the sample mean of each cluster
3. Reassign all samples to the closest mean
4. If clusters changed at step 3, go to step 2
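These four steps translate directly into NumPy (a minimal sketch, using random samples as the arbitrary initial centers; it assumes no cluster ever becomes empty):

import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. initialize: pick k cluster centers arbitrarily
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    while True:
        # 1b/3. (re)assign each sample to the closest center/mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4. if no sample changed cluster, stop
        if labels is not None and np.array_equal(labels, new_labels):
            return centers, labels
        labels = new_labels
        # 2. compute the sample mean of each cluster
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])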
K‐means Clustering
• Consider steps 2 and 3 of the algorithm:
  2. compute the sample mean of each cluster
  3. reassign all samples to the closest mean
• The objective $J_{SSE} = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - \mu_i\|^2$ is the sum of squared errors within the clusters
• If we represent clusters by their old means, the error has gotten smaller after reassignment
K‐means Clustering
3. reassign all samples to the closest mean
• If we represent clusters by their old means, the error has gotten smaller
• However, we represent clusters by their new means, and the mean is always the smallest-error representative of a cluster: for any z,
  $\frac{\partial}{\partial z} \sum_{x \in D_i} \|x - z\|^2 = \frac{\partial}{\partial z} \sum_{x \in D_i} \left( \|x\|^2 - 2 x^T z + \|z\|^2 \right) = \sum_{x \in D_i} \left( 2z - 2x \right) = 0 \;\Rightarrow\; z = \frac{1}{n_i} \sum_{x \in D_i} x$
• so switching from the old means to the new means can only decrease the error further
K‐means Clustering
• We just proved that by doing steps 2 and 3, the objective function goes down
  • we found a “smart” move which decreases the objective function
• The algorithm converges after a finite number of iterations
• However, it is not guaranteed to find a global minimum
  (figure: a 2-means run gets stuck in a local minimum instead of the global minimum of $J_{SSE}$)
K‐means Clustering
• Finding the optimum of $J_{SSE}$ is NP‐hard
• In practice, k‐means usually performs well
• Very efficient
• Its solution can be used as a starting point for other clustering algorithms
• Many variants and improvements of k‐means clustering exist
Bayesian Decision Theory
• Assumes we know the probability distributions of the categories
  • almost never the case in real life!
  • nevertheless useful, since other cases can be reduced to this one after some work
• Does not even need training data
• Can design an optimal classifier
Example: Fish Sorting
• A respected fish expert says that
  • salmon length has distribution N(5, 1)
  • sea bass length has distribution N(10, 4)
• Recall that $N(\mu, \sigma^2)$ has density
  $p(l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(l - \mu)^2}{2\sigma^2} \right)$
• Thus the class conditional densities are
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(l - 5)^2}{2} \right), \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} \exp\left( -\frac{(l - 10)^2}{2 \cdot 4} \right)$
Likelihood Function
• The class conditional densities are
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(l - 5)^2}{2} \right), \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} \exp\left( -\frac{(l - 10)^2}{8} \right)$
• Fix the length l and let the fish class vary: we get the likelihood function $p(l \mid \text{class})$
  • it is not a density and not a probability mass function
Likelihood vs. Class Conditional Density
(figure: $p(l \mid \text{class})$ for both classes as a function of length, with a mark at length 7)
• Suppose a fish has length 7. How do we classify it?
ML (Maximum Likelihood) Classifier
• Intuitively, we want to choose salmon if
  $\Pr[\text{length} = 7 \mid \text{salmon}] > \Pr[\text{length} = 7 \mid \text{bass}]$
• However, since length is a continuous r.v.,
  $\Pr[\text{length} = 7 \mid \text{salmon}] = \Pr[\text{length} = 7 \mid \text{bass}] = 0$
• Instead, we choose the class which maximizes the likelihood
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(l - 5)^2}{2} \right), \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} \exp\left( -\frac{(l - 10)^2}{8} \right)$
• ML classifier: if p(l | salmon) > p(l | bass), classify as salmon, else classify as bass
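Plugging in the two densities gives the ML rule directly (a sketch; consistent with the boundary at 6.70 below, a fish of length 7 comes out as bass):

import numpy as np

def p_salmon(l):
    # class conditional density N(5, 1)
    return np.exp(-(l - 5) ** 2 / 2) / np.sqrt(2 * np.pi)

def p_bass(l):
    # class conditional density N(10, 4)
    return np.exp(-(l - 10) ** 2 / 8) / (2 * np.sqrt(2 * np.pi))

l = 7.0
print("salmon" if p_salmon(l) > p_bass(l) else "bass")  # prints "bass"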
Interval Justification
• Why compare densities? Consider a small interval $B = (7 - \varepsilon, 7 + \varepsilon)$ around the observation:
  $\Pr[l \in B \mid \text{salmon}] \approx 2\varepsilon \, p(7 \mid \text{salmon}), \qquad \Pr[l \in B \mid \text{bass}] \approx 2\varepsilon \, p(7 \mid \text{bass})$
• Since $p(7 \mid \text{bass}) > p(7 \mid \text{salmon})$, we choose the class (bass) which is more likely to have given the observation
Decision Boundary
• classify as salmon if length < 6.70, classify as sea bass otherwise
(figure: likelihoods with the decision boundary at length 6.70)
Priors
• The prior comes from prior knowledge; no data has been seen yet
• Suppose a fish expert says: in the fall, there are twice as many salmon as sea bass
• Priors for our fish sorting problem:
  • P(salmon) = 2/3
  • P(bass) = 1/3
• With the addition of the prior to our model, how should we classify a fish of length 7?
How Does the Prior Change the Decision Boundary?
• Without priors: boundary at 6.70 (salmon to the left, sea bass to the right)
• How should this change with priors P(salmon) = 2/3, P(bass) = 1/3?
  (figure: where does the boundary move? intuitively, it should move so that more fish are classified as salmon)
Bayes Decision Rule
1. Have likelihood functions
  • p(length | salmon)
  • p(length | bass)
2. Have priors P(salmon) and P(bass)
• Question: having observed a fish of a certain length, do we classify it as salmon or bass?
• Natural idea:
  • salmon if P(salmon | length) > P(bass | length)
  • bass if P(bass | length) > P(salmon | length)
Posterior
• P(salmon | length) and P(bass | length) are called posterior distributions, because the data (length) has been revealed (post data)
• How to compute posteriors? Bayes rule:
  $P(\text{salmon} \mid \text{length}) = \frac{p(\text{salmon}, \text{length})}{p(\text{length})} = \frac{p(\text{length} \mid \text{salmon}) \, P(\text{salmon})}{p(\text{length})}$
• Similarly:
  $P(\text{bass} \mid \text{length}) = \frac{p(\text{length} \mid \text{bass}) \, P(\text{bass})}{p(\text{length})}$
MAP (Maximum a Posteriori) Classifier
• Decide salmon if $P(\text{salmon} \mid \text{length}) > P(\text{bass} \mid \text{length})$, else bass; equivalently, compare
  $\frac{p(\text{length} \mid \text{salmon}) \, P(\text{salmon})}{p(\text{length})} \;\text{ vs. }\; \frac{p(\text{length} \mid \text{bass}) \, P(\text{bass})}{p(\text{length})}$
• Since $p(\text{length})$ is common to both sides, this reduces to comparing
  $p(\text{length} \mid \text{salmon}) \, P(\text{salmon}) \;\text{ vs. }\; p(\text{length} \mid \text{bass}) \, P(\text{bass})$
Back to Fish Sorting
• Likelihoods:
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(l - 5)^2}{2} \right), \qquad p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} \exp\left( -\frac{(l - 10)^2}{8} \right)$
• Priors: P(salmon) = 2/3, P(bass) = 1/3
• Solve the inequality
  $\frac{2}{3} \cdot \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(l - 5)^2}{2} \right) > \frac{1}{3} \cdot \frac{1}{2\sqrt{2\pi}} \exp\left( -\frac{(l - 10)^2}{8} \right)$
• New decision boundary: classify as salmon if length < 7.18, as sea bass otherwise (it was 6.70 without the prior)
• The new boundary makes sense: we expect to see more salmon
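The inequality can be checked numerically: taking logs of both sides and simplifying gives a quadratic in l (my own derivation, consistent with the boundary the slide reports):

import numpy as np

# ln[(2/3) p(l|salmon)] = ln[(1/3) p(l|bass)] simplifies to
# ln 4 - (l-5)^2/2 + (l-10)^2/8 = 0, i.e. 3 l^2 - 20 l - 8 ln 4 = 0
roots = np.roots([3.0, -20.0, -8.0 * np.log(4.0)])
print(roots)  # positive root ~7.18 (the other, ~-0.51, is not a valid length)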
More on Posterior
  $\underbrace{P(c \mid l)}_{\text{posterior (our goal)}} = \frac{\overbrace{p(l \mid c)}^{\text{likelihood (given)}} \; \overbrace{P(c)}^{\text{prior (given)}}}{\underbrace{p(l)}_{\text{normalizing factor}}}$
• We often do not even need the normalizing factor p(l) for classification, since it does not depend on the class c. If we do need it, it follows from the law of total probability:
  $p(l) = p(l \mid \text{salmon}) \, P(\text{salmon}) + p(l \mid \text{bass}) \, P(\text{bass})$
• Notice this formula consists of likelihoods and priors, which are given
More on Priors
• The prior comes from prior knowledge; no data has been seen yet
• If there is a reliable source of prior knowledge, it should be used
• Some problems cannot even be solved reliably without a good prior
• However, the prior alone is not enough; we still need the likelihood
  • P(salmon) = 2/3, P(sea bass) = 1/3
  • if I don’t let you see the data, but ask you to guess, will you choose salmon or sea bass?
More on MAP Classifier
  $\underbrace{P(c \mid l)}_{\text{posterior}} \propto \underbrace{p(l \mid c)}_{\text{likelihood}} \; \underbrace{P(c)}_{\text{prior}}$
• Do not care about p(l) when maximizing P(c | l)
• If P(salmon) = P(bass) (uniform prior), the MAP classifier becomes the ML classifier: $P(c \mid l) \propto p(l \mid c)$
• If for some observation l, p(l | salmon) = p(l | bass), then this observation is uninformative and the decision is based solely on the prior: $P(c \mid l) \propto P(c)$
Justification for MAP Classifier
• The MAP classifier decides salmon if $P(\text{salmon} \mid l) > P(\text{bass} \mid l)$, else bass
• For any particular l, the probability of error is
  $\Pr[\text{error} \mid l] = \begin{cases} P(\text{bass} \mid l) & \text{if we decide salmon} \\ P(\text{salmon} \mid l) & \text{if we decide bass} \end{cases}$
• Thus the MAP classifier is optimal for each individual l
Justification for MAP Classifier
• We are interested in minimizing the error not just for one l; we really want to minimize the average error over all l:
  $\Pr[\text{error}] = \int p(\text{error}, l) \, dl = \int \Pr[\text{error} \mid l] \, p(l) \, dl$
• If $\Pr[\text{error} \mid l]$ is as small as possible for every l, the integral is as small as possible
• But the Bayes (MAP) rule makes $\Pr[\text{error} \mid l]$ as small as possible
• Thus the MAP classifier minimizes the probability of error
Parametric Supervised Learning
• Supervised parametric learning:
  • have m classes
  • have samples x1,…, xn, each of class 1, 2,…, m
  • suppose Di holds the samples from class i
  • the probability distribution for class i is $p_i(x \mid \theta_i)$
(figure: class densities $p_1(x \mid \theta_1)$ and $p_2(x \mid \theta_2)$)
Parametric Supervised Learning
• Use the ML method to estimate the parameters $\theta_i$
  • find $\theta_i$ which maximizes the likelihood function $F(\theta_i)$:
    $p(D_i \mid \theta_i) = \prod_{x \in D_i} p(x \mid \theta_i) = F(\theta_i)$
  • or, equivalently, find $\theta_i$ which maximizes the log likelihood $l(\theta_i)$:
    $l(\theta_i) = \ln p(D_i \mid \theta_i) = \sum_{x \in D_i} \ln p(x \mid \theta_i)$
  $\hat{\theta}_1 = \arg\max_{\theta_1} \ln p(D_1 \mid \theta_1), \qquad \hat{\theta}_2 = \arg\max_{\theta_2} \ln p(D_2 \mid \theta_2)$
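For 1-D Gaussian class models, maximizing the log likelihood gives the sample mean and (biased) sample variance; a small sketch with synthetic data (the fish parameters are reused purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
D1 = rng.normal(5, 1, size=200)   # samples known to be from class 1
D2 = rng.normal(10, 2, size=200)  # samples known to be from class 2

def fit_gaussian_ml(D):
    # ML estimates for a 1-D Gaussian: sample mean and sample variance
    mu = D.mean()
    return mu, ((D - mu) ** 2).mean()

theta1, theta2 = fit_gaussian_ml(D1), fit_gaussian_ml(D2)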
Parametric Supervised Learning
• Now the distributions are fully specified
• Can classify an unknown sample using the MAP (Maximum a Posteriori) classifier
(figure: fitted densities $p_1(x \mid \hat{\theta}_1)$ and $p_2(x \mid \hat{\theta}_2)$)
Parametric Unsupervised Learning
• Expectation Maximization (EM)
  • one of the most useful statistical methods
  • oldest version in 1958 (Hartley)
  • seminal paper in 1977 (Dempster et al.)
  • can also be used when some samples are missing features
Parametric Unsupervised Learning
• Assume the data was generated by a model with a known shape but unknown parameters: $p(x \mid \theta)$
• Advantages of having a model
  • gives a meaningful way to cluster data
    • adjust the parameters of the model to maximize the probability that the model produced the observed data
  • can sensibly measure whether a clustering is good
    • compute the likelihood of the data induced by the clustering
  • can compare 2 clustering algorithms
    • which one gives the higher likelihood of the observed data?
Parametric Unsupervised Learning
• In unsupervised learning, no one tells us the true classes of the samples. We still know that we
  • have m classes
  • have samples x1,…, xn, each of unknown class
  • the probability distribution for class i is $p_i(x \mid \theta_i)$
• Can we determine the classes and the parameters simultaneously?
Example: MRI Brain Segmentation
(figure: MRI brain image and its segmentation; picture from M. Leventon)
• In an MRI brain image, different brain tissues have different intensities
• Know that the brain has 6 major types of tissues
• Each type of tissue can be modeled by a Gaussian $N(\mu_i, \sigma_i^2)$ reasonably well; the parameters $\mu_i, \sigma_i^2$ are unknown
• Segmenting (classifying) the brain image into the different tissue classes is very useful
  • don’t know which image pixel corresponds to which tissue (class)
  • don’t know the parameters of each $N(\mu_i, \sigma_i^2)$
Mixture Density Model
• Model the data with a mixture density
  $p(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j) \, P(c_j)$
  • $p(x \mid c_j, \theta_j)$ are the component densities, $P(c_j)$ are the mixing parameters
  • where $\theta = \{\theta_1, \ldots, \theta_m\}$ and $P(c_1) + P(c_2) + \cdots + P(c_m) = 1$
• To generate a sample from distribution $p(x \mid \theta)$:
  • first select class j with probability $P(c_j)$
  • then generate x according to the probability law $p(x \mid c_j, \theta_j)$
(figure: three component densities $p(x \mid c_1, \theta_1)$, $p(x \mid c_2, \theta_2)$, $p(x \mid c_3, \theta_3)$)
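The two-stage generative recipe, in code (a sketch; the particular mixing and component parameters below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
alphas = [0.2, 0.3, 0.5]                        # mixing parameters P(c_j)
mus, sigmas = [0.0, 6.0, 7.0], [1.0, 2.0, 2.5]  # hypothetical 1-D components

def sample_mixture(n):
    # first select component j with probability P(c_j), ...
    js = rng.choice(len(alphas), size=n, p=alphas)
    # ... then generate x according to p(x | c_j, theta_j)
    return rng.normal(np.take(mus, js), np.take(sigmas, js))

X = sample_mixture(500)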
Example: Gaussian Mixture Density
• Mixture of 3 Gaussians:
  $p_1(x) = N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right), \quad p_2(x) = N\left( \begin{bmatrix} 6 \\ 6 \end{bmatrix}, \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix} \right), \quad p_3(x) = N\left( \begin{bmatrix} 7 \\ 7 \end{bmatrix}, \begin{bmatrix} 6 & 0 \\ 0 & 6 \end{bmatrix} \right)$
  $p(x) = 0.2\,p_1(x) + 0.3\,p_2(x) + 0.5\,p_3(x)$
Mixture Density
  $p(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j) \, P(c_j)$
• $P(c_1), \ldots, P(c_m)$ can be known or unknown
• Suppose we know how to estimate $\theta_1, \ldots, \theta_m$ and $P(c_1), \ldots, P(c_m)$
• Can “break apart” the mixture $p(x \mid \theta)$ for classification
• To classify a sample x, use MAP estimation, that is, choose the class i which maximizes
  $P(c_i \mid x, \theta_i) \propto \underbrace{p(x \mid c_i, \theta_i)}_{\text{probability of component } i \text{ to generate } x} \; \underbrace{P(c_i)}_{\text{probability of component } i}$
ML Estimation for Mixture Density
  $p(x \mid \alpha, \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j) \, P(c_j) = \sum_{j=1}^{m} \alpha_j \, p(x \mid c_j, \theta_j)$
• Use Maximum Likelihood estimation for the mixture density
• Need to estimate
  • $\theta = \{\theta_1, \ldots, \theta_m\}$
  • $\alpha_1 = P(c_1), \ldots, \alpha_m = P(c_m)$; write $\alpha = \{\alpha_1, \ldots, \alpha_m\}$
• As in the supervised case, form the log likelihood function
  $l(\alpha, \theta) = \ln p(D \mid \alpha, \theta) = \sum_{k=1}^{n} \ln p(x_k \mid \alpha, \theta) = \sum_{k=1}^{n} \ln \sum_{j=1}^{m} \alpha_j \, p(x_k \mid c_j, \theta_j)$
ML Estimation for Mixture Density
  $l(\alpha, \theta) = \sum_{k=1}^{n} \ln \sum_{j=1}^{m} \alpha_j \, p(x_k \mid c_j, \theta_j)$
• Need to maximize $l(\alpha, \theta)$ with respect to $\alpha$ and $\theta$
• $l(\alpha, \theta)$ is hard to maximize directly
  • partial derivatives with respect to $\alpha$, $\theta$ usually give a “coupled” nonlinear system of equations
  • usually a closed form solution cannot be found
• We could use the gradient ascent method
  • in general, it is not the greatest method to use; it should only be used as a last resort
• There is a better algorithm, called EM
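Even when $l(\alpha, \theta)$ is hard to maximize in closed form, it is easy to evaluate, which is useful for monitoring EM later. A sketch for 1-D Gaussian components (logsumexp keeps the inner sum over components numerically stable):

import numpy as np
from scipy.special import logsumexp

def gauss_logpdf(x, mu, var):
    # log of the 1-D Gaussian density N(mu, var)
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def mixture_loglik(X, alphas, mus, variances):
    # l(alpha, theta) = sum_k ln sum_j alpha_j p(x_k | c_j, theta_j)
    logp = np.stack([np.log(a) + gauss_logpdf(X, mu, v)
                     for a, mu, v in zip(alphas, mus, variances)])
    return logsumexp(logp, axis=0).sum()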
Mixture Density
• Before EM, let’s look at the mixture density again:
  $p(x \mid \alpha, \theta) = \sum_{j=1}^{m} \alpha_j \, p(x \mid c_j, \theta_j)$
• Say we know how to estimate $\theta_1, \ldots, \theta_m$ and $\alpha_1, \ldots, \alpha_m$
  • then estimating the class of x is easy with MAP: maximize $\alpha_i \, p(x \mid c_i, \theta_i)$
• Conversely, suppose we know the class of the samples x1,…, xn
  • then, just as in the supervised case, estimating $\theta_1, \ldots, \theta_m$ and $\alpha_1, \ldots, \alpha_m$ is easy:
    $\hat{\theta}_i = \arg\max_{\theta_i} \ln p(D_i \mid \theta_i), \qquad \hat{\alpha}_i = \frac{|D_i|}{n}$
• This is an example of a chicken‐and‐egg problem
• The EM algorithm’s solution is to add “hidden” variables
Expectation Maximization Algorithm
• EM is an algorithm for ML parameter estimation when the data has missing values. It is used when
  1. the data is incomplete (has missing values)
    • some features are missing for some samples due to data corruption, partial survey responses, etc.
  2. the data X is complete, but $p(X \mid \theta)$ is hard to optimize, while introducing certain hidden variables U, whose values are missing, makes the “complete” likelihood function $p(X, U \mid \theta)$ easier to optimize
    • this case is what makes EM useful for mixture density estimation, and is the subject of our lecture today
• Notice that after we introduce the artificial (hidden) variables U with missing values, case 2 is completely equivalent to case 1
EM: Hidden Variables for Mixture Density
  $p(x \mid \theta) = \sum_{j=1}^{m} \alpha_j \, p(x \mid c_j, \theta_j)$
• For simplicity, assume the component densities are
  $p(x \mid c_j, \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu_j)^2}{2\sigma^2} \right)$
  • assume for now that the variance $\sigma^2$ is known
  • need to estimate $\theta = \{\mu_1, \ldots, \mu_m\}$
• If we knew which sample came from which component, the ML parameter estimation would be easy
• To get this easier problem, introduce hidden variables which indicate which component each sample belongs to
EM: Hidden Variables for Mixture Density
• For $1 \le i \le n$ and $1 \le k \le m$, define hidden variables $z_i^{(k)}$:
  $z_i^{(k)} = \begin{cases} 1 & \text{if sample } x_i \text{ was generated by component } k \\ 0 & \text{otherwise} \end{cases}$
• The $z_i^{(k)}$ are indicator random variables: they indicate which Gaussian component generated sample $x_i$
• Let $z_i = \{z_i^{(1)}, \ldots, z_i^{(m)}\}$, the indicator r.v. corresponding to $x_i$
• Conditioned on $z_i$, the distribution of $x_i$ is Gaussian:
  $p(x_i \mid z_i, \theta) \sim N(\mu_k, \sigma^2)$, where k is s.t. $z_i^{(k)} = 1$
EM: Joint Likelihood
• Let $z_i = \{z_i^{(1)}, \ldots, z_i^{(m)}\}$, and $Z = \{z_1, \ldots, z_n\}$
• The complete likelihood is
  $p(X, Z \mid \theta) = p(x_1, \ldots, x_n, z_1, \ldots, z_n \mid \theta) = \prod_{i=1}^{n} p(x_i, z_i \mid \theta) = \prod_{i=1}^{n} p(x_i \mid z_i, \theta) \, p(z_i \mid \theta)$
  • $p(x_i \mid z_i, \theta)$ is the Gaussian part of the model; $p(z_i \mid \theta)$ involves the mixing parameters
• If we observed Z, the log likelihood $\ln p(X, Z \mid \theta)$ would be trivial to maximize with respect to $\theta$ and the mixing parameters
• The problem is that the values of Z are missing, since we made Z up (that is, Z is hidden)
EM Derivation
• Instead of maximizing $\ln p(X, Z \mid \theta)$, the idea behind EM is to maximize some function of $\ln p(X, Z \mid \theta)$, usually its expected value conditioned on X:
  $E_{Z \mid X} \left[ \ln p(X, Z \mid \theta) \right]$
  • common trick: integrate out the unknown variables
  • the expectation is with respect to the missing data Z
    • that is, with respect to the density $p(Z \mid X, \theta)$
  • however, $\theta$ is our ultimate goal; we don’t know $\theta$!
EM Algorithm
• The EM solution is to iterate:
  1. start with initial parameters $\theta^{(0)}$
  2. iterate the following two steps until convergence:
    E. compute the expectation of the log likelihood with respect to the current estimate $\theta^{(t)}$ and X; call it $Q(\theta \mid \theta^{(t)})$:
      $Q(\theta \mid \theta^{(t)}) = E_{Z \mid X, \theta^{(t)}} \left[ \ln p(X, Z \mid \theta) \right]$
    M. maximize $Q(\theta \mid \theta^{(t)})$:
      $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$
EM Algorithm
• It can be proven that the EM algorithm converges to a local maximum of the log‐likelihood $\ln p(X \mid \theta)$
• Why is it better than gradient ascent?
  • convergence of EM is usually significantly faster: in the beginning, very large steps are made (that is, the likelihood function increases rapidly), as opposed to gradient ascent, which usually takes tiny steps
  • gradient ascent is not guaranteed to converge
    • recall the difficulties of choosing an appropriate learning rate
EM: Lower Bound Maximization
• It can be shown that at time step t, the EM algorithm
  • constructs a function $l(\theta \mid \theta^{(t)})$ which is bounded above by $\ln p(X \mid \theta)$ and touches it at $\theta = \theta^{(t)}$
  • finds $\theta^{(t+1)}$ that maximizes $l(\theta \mid \theta^{(t)})$
• Therefore, the log likelihood $\ln p(X \mid \theta)$ can only go up
(figure: $\ln p(X \mid \theta)$ and the lower bound $l(\theta \mid \theta^{(t)})$, with $\theta^{(t)}$ and $\theta^{(t+1)}$ marked)
EM for Mixture of Gaussians: E step
• Let’s come back to our example:
  $p(x \mid \theta) = \sum_{j=1}^{m} \alpha_j \, p(x \mid c_j, \theta_j), \qquad p(x \mid c_j, \theta_j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu_j)^2}{2\sigma^2} \right)$
• need to estimate $\theta = \{\mu_1, \ldots, \mu_m\}$ and $\alpha_1, \ldots, \alpha_m$
• For $1 \le i \le n$, $1 \le k \le m$, define the hidden variables $z_i^{(k)}$: 1 if sample $x_i$ was generated by component k, 0 otherwise
• As before, $z_i = \{z_i^{(1)}, \ldots, z_i^{(m)}\}$, and $Z = \{z_1, \ldots, z_n\}$
• We need the log‐likelihood of the observed X and hidden Z:
  $\ln p(X, Z \mid \theta) = \ln \prod_{i=1}^{n} p(x_i, z_i \mid \theta) = \ln \prod_{i=1}^{n} p(x_i \mid z_i, \theta) \, P(z_i)$
EM for Mixture of Gaussians: E step
• We need the log‐likelihood of the observed X and hidden Z:
  $\ln p(X, Z \mid \theta) = \ln \prod_{i=1}^{n} p(x_i \mid z_i, \theta) \, P(z_i)$
• First, let’s rewrite $p(x_i \mid z_i, \theta) \, P(z_i)$:
  $p(x_i \mid z_i, \theta) \, P(z_i) = \begin{cases} p(x_i \mid z_i^{(1)} = 1, \theta) \, P(z_i^{(1)} = 1) & \text{if } z_i^{(1)} = 1 \\ \quad \vdots & \\ p(x_i \mid z_i^{(m)} = 1, \theta) \, P(z_i^{(m)} = 1) & \text{if } z_i^{(m)} = 1 \end{cases}$
  $= \prod_{k=1}^{m} \left[ p(x_i \mid z_i^{(k)} = 1, \theta) \, P(z_i^{(k)} = 1) \right]^{z_i^{(k)}} = \prod_{k=1}^{m} \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x_i - \mu_k)^2}{2\sigma^2} \right) P(z_i^{(k)} = 1) \right]^{z_i^{(k)}}$
EM for Mixture of Gaussians: E step
• The log‐likelihood of the observed X and hidden Z is
  $\ln p(X, Z \mid \theta) = \sum_{i=1}^{n} \ln \left[ p(x_i \mid z_i, \theta) \, P(z_i) \right] = \sum_{i=1}^{n} \ln \prod_{k=1}^{m} \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x_i - \mu_k)^2}{2\sigma^2} \right) P(z_i^{(k)} = 1) \right]^{z_i^{(k)}}$
  $= \sum_{i=1}^{n} \sum_{k=1}^{m} z_i^{(k)} \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$
  • where $\alpha_k = P(z_i^{(k)} = 1) = P(c_k)$ is the probability that a sample comes from class k
EM for Mixture of Gaussians: E step
• The log‐likelihood of the observed X and hidden Z is
  $\ln p(X, Z \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{m} z_i^{(k)} \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$
• For the E step, we must compute
  $Q(\theta \mid \theta^{(t)}) = Q(\mu_1, \ldots, \mu_m, \alpha_1, \ldots, \alpha_m \mid \mu_1^{(t)}, \ldots, \mu_m^{(t)}, \alpha_1^{(t)}, \ldots, \alpha_m^{(t)}) = E_Z \left[ \ln p(X, Z \mid \theta) \mid X, \theta^{(t)} \right]$
• Using linearity of expectation ($E_X[a x_i + b] = a\,E_X[x_i] + b$):
  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z[z_i^{(k)}] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$
EM for Mixture of Gaussians: E step
  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z[z_i^{(k)}] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$
• Need to compute $E_Z[z_i^{(k)}]$ in the expression above:
  $E_Z[z_i^{(k)}] = 0 \cdot P(z_i^{(k)} = 0 \mid x_i, \theta^{(t)}) + 1 \cdot P(z_i^{(k)} = 1 \mid x_i, \theta^{(t)}) = P(z_i^{(k)} = 1 \mid x_i, \theta^{(t)})$
  $= \frac{p(x_i \mid z_i^{(k)} = 1, \theta^{(t)}) \, P(z_i^{(k)} = 1 \mid \theta^{(t)})}{p(x_i \mid \theta^{(t)})} = \frac{\exp\left( -\frac{1}{2\sigma^2} \left( x_i - \mu_k^{(t)} \right)^2 \right) \alpha_k^{(t)}}{\sum_{j=1}^{m} \exp\left( -\frac{1}{2\sigma^2} \left( x_i - \mu_j^{(t)} \right)^2 \right) \alpha_j^{(t)}}$
• Done with the E step
  • for implementation, we only need to compute the $E_Z[z_i^{(k)}]$’s; we don’t need Q itself
EM for Mixture of Gaussians: M step
  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z[z_i^{(k)}] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$
• Need to maximize Q with respect to all parameters
• First differentiate with respect to $\mu_k$:
  $\frac{\partial Q}{\partial \mu_k} = \sum_{i=1}^{n} E_Z[z_i^{(k)}] \, \frac{x_i - \mu_k}{\sigma^2} = 0 \quad\Rightarrow\quad \mu_k^{(t+1)} = \frac{\sum_{i=1}^{n} E_Z[z_i^{(k)}] \, x_i}{\sum_{i=1}^{n} E_Z[z_i^{(k)}]}$
• The new mean for class k is a weighted average of all samples, where the weight of a sample is proportional to the current estimate of the probability that the sample belongs to class k
EM for Mixture of Gaussians: M step
  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z[z_i^{(k)}] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$
• For $\alpha_k$, use a Lagrange multiplier to preserve the constraint $\sum_{j=1}^{m} \alpha_j = 1$
• Need to differentiate
  $F = Q(\theta \mid \theta^{(t)}) + \lambda \left( 1 - \sum_{j=1}^{m} \alpha_j \right)$
  $\frac{\partial F}{\partial \alpha_k} = \frac{1}{\alpha_k} \sum_{i=1}^{n} E_Z[z_i^{(k)}] - \lambda = 0 \quad\Rightarrow\quad \lambda \, \alpha_k = \sum_{i=1}^{n} E_Z[z_i^{(k)}]$
• Summing up over all components:
  $\lambda \sum_{k=1}^{m} \alpha_k = \sum_{k=1}^{m} \sum_{i=1}^{n} E_Z[z_i^{(k)}]$
• Since $\sum_{k=1}^{m} \alpha_k = 1$ and $\sum_{k=1}^{m} \sum_{i=1}^{n} E_Z[z_i^{(k)}] = n$, we get $\lambda = n$ and
  $\alpha_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} E_Z[z_i^{(k)}]$
EM Algorithm
This version of the algorithm applies to a univariate Gaussian mixture with known variances:
1. Randomly initialize $\mu_1, \ldots, \mu_m$, $\alpha_1, \ldots, \alpha_m$ (with the constraint $\sum_i \alpha_i = 1$)
2. Iterate until there is no change in $\mu_1, \ldots, \mu_m$, $\alpha_1, \ldots, \alpha_m$:
  E. for all i, k, compute
    $E_Z[z_i^{(k)}] = \frac{\exp\left( -\frac{1}{2\sigma^2} (x_i - \mu_k)^2 \right) \alpha_k}{\sum_{j=1}^{m} \exp\left( -\frac{1}{2\sigma^2} (x_i - \mu_j)^2 \right) \alpha_j}$
  M. for all k, do the parameter update
    $\mu_k = \frac{\sum_{i=1}^{n} E_Z[z_i^{(k)}] \, x_i}{\sum_{i=1}^{n} E_Z[z_i^{(k)}]}, \qquad \alpha_k = \frac{1}{n} \sum_{i=1}^{n} E_Z[z_i^{(k)}]$
EM Algorithm
• For the more general case of multivariate Gaussians with unknown means and covariances,
  $p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)$
• E step:
  $E_Z[z_i^{(k)}] = \frac{\alpha_k \, p(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{m} \alpha_j \, p(x_i \mid \mu_j, \Sigma_j)}$
• M step:
  $\alpha_k = \frac{1}{n} \sum_{i=1}^{n} E_Z[z_i^{(k)}], \qquad \mu_k = \frac{\sum_{i=1}^{n} E_Z[z_i^{(k)}] \, x_i}{\sum_{i=1}^{n} E_Z[z_i^{(k)}]}, \qquad \Sigma_k = \frac{\sum_{i=1}^{n} E_Z[z_i^{(k)}] \, (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} E_Z[z_i^{(k)}]}$
EM Algorithm and K‐means
• k‐means can be derived from the EM algorithm
• Setting the mixing parameters equal for all classes, $\alpha_j = 1/m$, the E step becomes
  $E_Z[z_i^{(k)}] = \frac{\exp\left( -\frac{1}{2\sigma^2} \|x_i - \mu_k\|^2 \right)}{\sum_{j=1}^{m} \exp\left( -\frac{1}{2\sigma^2} \|x_i - \mu_j\|^2 \right)}$
• If we let $\sigma \to 0$, then
  $E_Z[z_i^{(k)}] \to \begin{cases} 1 & \text{if } \|x_i - \mu_k\| \le \|x_i - \mu_j\| \text{ for all } j \\ 0 & \text{otherwise} \end{cases}$
• E step: for each current mean, find all points closest to it and form new clusters
• M step: compute the new means inside the current clusters,
  $\mu_k = \frac{\sum_{i=1}^{n} E_Z[z_i^{(k)}] \, x_i}{\sum_{i=1}^{n} E_Z[z_i^{(k)}]}$, which with these hard assignments is just the mean of cluster k
EM Gaussian Mixture Example
(figures: the fitted Gaussian mixture after the first, second, third, and 20th iterations)
EM Example
• Example from R. Gutierrez-Osuna
• Training set of 900 examples forming an annulus
• A mixture model with m = 30 Gaussian components of unknown mean and variance is used
• Training:
  • initialization:
    • means set to 30 random examples
    • covariance matrices initialized to be diagonal, with large variances on the diagonal compared to the training data variance
  • during EM training, components with small mixing coefficients were trimmed
    • this is a trick to get a more compact model, with fewer than 30 Gaussian components
EM Example
(figure from R. Gutierrez-Osuna)
EM Texture Segmentation Example
Figure from “Color and Texture Based Image Segmentation Using EM and Its Application to Content Based Image Retrieval”, S.J. Belongie et al., ICCV 1998
EM Motion Segmentation Example
Three frames from the MPEG “flower garden” sequence
Figure from “Representing Images with Layers”, by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, © 1994 IEEE
EM Algorithm Summary
• Advantages
  • guaranteed to converge (to a local maximum)
  • if the assumed data distribution is correct, the algorithm works well
• Disadvantages
  • if the assumed data distribution is wrong, results can be quite bad
  • in particular, bad results if an incorrect number of classes (i.e., the number of mixture components) is used