CSE 555: Srihari
Maximum-Likelihood and Bayesian Parameter Estimation: Expectation Maximization (EM)
(srihari/CSE555/Chap3.Part7.pdf, 52 slides)

Page 1: Maximum-Likelihood and Bayesian Parameter Estimation: Expectation Maximization (EM)

CSE 555: Srihari

Maximum-Likelihood and Bayesian Parameter Estimation

Expectation Maximization (EM)

Page 2

Estimating Missing Feature Value

Estimating the missing variable when the parameters are known:

• In the absence of x1, the most likely class is ω2
• Choose the value of x1 which maximizes the likelihood
• Choosing the mean of the missing feature (over all classes) will result in worse performance!
• This is the case of estimating the hidden variable given the parameters; in EM the unknowns are both the parameters and the hidden variables

[Figure: two-class scatter plot with the known value x2 and the missing variable x1.]

Page 3

EM Task

• Estimate unknown parameters θ given measurement data U
• However, some variables J are missing and need to be integrated out
• We want to maximize the posterior probability of θ given the data U, marginalizing over J:

θ̂ = argmax_θ Σ_{J ∈ 𝒥} P(J, θ | U)

where θ is the parameter to be estimated, J are the missing variables, and U is the data.

Page 4

EM Principle

• Estimate unknown parameters θ given measurement data U, but not the nuisance variables J, which need to be integrated out
• Alternate between estimating the unknowns θ and the hidden variables J
• At each iteration, instead of finding the best J ∈ 𝒥 given an estimate θ, EM computes a distribution over the space 𝒥:

θ̂ = argmax_θ Σ_{J ∈ 𝒥} P(J, θ | U)

Page 5

k-means Algorithm as EM

• Estimate the means of k classes when the class labels are unknown
• Parameters: the means to be estimated
• Hidden variables: the class labels

begin initialize m1, m2, ..., mk
  do classify the n samples according to the nearest mi   (E-step)
     recompute each mi                                    (M-step)
  until no change in mi
  return m1, m2, ..., mk
end

An iterative algorithm derivable from EM
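The loop above can be sketched in a few lines. This is a minimal illustration of the slide's pseudocode, not part of the slides; the function name `k_means`, the random initialization from sample points, and the empty-cluster guard are my own assumptions.

```python
import numpy as np

def k_means(samples, k, rng=None, max_iter=100):
    """Sketch of the slide's loop: classify samples by the nearest mean
    (E-step analogue), then recompute the means (M-step analogue)."""
    rng = np.random.default_rng(rng)
    # initialize m1, ..., mk with k distinct sample points (an assumption;
    # the slide leaves initialization unspecified)
    means = samples[rng.choice(len(samples), size=k, replace=False)]
    for _ in range(max_iter):
        # classify the n samples according to the nearest m_i
        dists = np.linalg.norm(samples[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each m_i from the samples assigned to it
        # (keep the old mean if a cluster happens to be empty)
        new_means = np.array([samples[labels == j].mean(axis=0)
                              if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):   # until no change in m_i
            return new_means
        means = new_means
    return means
```

Like EM, this converges only to a local optimum, so the result depends on initialization.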

Page 6

EM Importance

The EM algorithm is:
• widely used for learning in the presence of unobserved variables, e.g., missing features or class labels
• used even for variables whose values are never directly observed, provided the general form of the pdf governing these variables is known
• used to train Radial Basis Function networks and Bayesian belief networks
• the basis for many unsupervised clustering algorithms
• the basis for the widely used Baum-Welch forward-backward algorithm for HMMs

Page 7

EM Algorithm

• Learning in the presence of unobserved variables
• Only a subset of the relevant instance features might be observable
• Includes the case of unsupervised learning or clustering: how many classes are there?

Page 8

EM Principle

• The EM algorithm iteratively estimates the likelihood given the data that is present

Page 9

Likelihood Formulation

• Sample points come from a single distribution: D = {x1, ..., xn}
• Any sample may have good and missing (bad) features, so each sample is split as xk = {xkg, xkb}
• The data are correspondingly divided into two sets: D = Dg ∪ Db

Page 10

Likelihood Formulation

• Central equation in EM:

Q(θ; θ^i) = E_{Db}[ ln p(Dg, Db; θ) | θ^i; Dg ]

where θ^i is the current best estimate for the full distribution, θ is the candidate vector for an improved estimate, and the expected value is over the missing features Db.
• The algorithm will select the best candidate θ and call it θ^(i+1)

Page 11

Algorithm EM

begin initialize θ^0, T, i = 0
  do i = i + 1
    E step: compute Q(θ; θ^i)
    M step: θ^(i+1) = argmax_θ Q(θ; θ^i)
  until Q(θ^(i+1); θ^i) − Q(θ^i; θ^(i−1)) < T
  return θ^(i+1)
end

Page 12

EM for 2D Normal Model

Suppose the data consist of four points in two dimensions, one of which is missing a feature:

D = {x1, x2, x3, x4} = { (0, 2)ᵀ, (1, 0)ᵀ, (2, 2)ᵀ, (*, 4)ᵀ }

where * represents the unknown value of the first feature of point x4. Thus our bad data Db consists of the single feature x41, and the good data Dg consists of the rest.

Page 13

EM for 2D Normal Model

Assuming that the model is a Gaussian with diagonal covariance and arbitrary mean, it can be described by the parameter vector

θ = (μ1, μ2, σ1², σ2²)ᵀ

Page 14

EM for 2D Normal model

We take our initial guess to be a Gaussian centered on the origin with Σ = I, that is,

θ^0 = (0, 0, 1, 1)ᵀ

Page 15

EM for a 2D Normal Model

To find an improved estimate we must calculate

Q(θ; θ^0) = E_{x41}[ ln p(xg, xb; θ) | θ^0; Dg ]

Page 16

EM for a 2D Normal Model

Simplifying the expectation completes the E step. Maximizing Q then gives the next estimate, and iterating yields the final solution

θ = (1.0, 2.0, 0.667, 2.0)ᵀ

Page 17

EM for 2D normal model

Four data points, one missing the value of x1, are in red. Initial estimate is a circularly symmetric Gaussian, centered on the origin (gray). (A better initial estimate could have been derived from the 3 known points.) Each iteration leads to an improved estimate, labeled by iteration number i; after 3 iterations, the algorithm converged.
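This example can be reproduced numerically. The sketch below is my own, not from the slides; it relies on the fact that under a diagonal-covariance Gaussian the missing x41 is independent of the observed x42, so the E step reduces to E[x41] = μ1 and E[(x41 − m)²] = σ1² + (μ1 − m)² under the old parameters.

```python
import numpy as np

# Data from the example: x1 = (0,2), x2 = (1,0), x3 = (2,2), x4 = (*,4)
good_x1 = np.array([0.0, 1.0, 2.0])      # first feature of the complete points
all_x2 = np.array([2.0, 0.0, 2.0, 4.0])  # second feature of all four points
n = 4

mu1, mu2 = 0.0, 0.0   # initial guess theta^0 = (0, 0, 1, 1)
s1, s2 = 1.0, 1.0     # variances sigma1^2, sigma2^2

for _ in range(200):
    # E-step (folded into the updates): with diagonal covariance, x41 is
    # independent of x42, so E[x41] = mu1 and, for any candidate mean m,
    # E[(x41 - m)^2] = s1 + (mu1 - m)^2 under the old model.
    mu1_new = (good_x1.sum() + mu1) / n
    mu2_new = all_x2.mean()
    s1_new = (np.sum((good_x1 - mu1_new) ** 2)
              + s1 + (mu1 - mu1_new) ** 2) / n
    s2_new = np.mean((all_x2 - mu2_new) ** 2)
    change = max(abs(mu1_new - mu1), abs(s1_new - s1))
    mu1, mu2, s1, s2 = mu1_new, mu2_new, s1_new, s2_new
    if change < 1e-12:
        break

print(mu1, mu2, s1, s2)   # ≈ (1.0, 2.0, 0.667, 2.0)
```

The fixed point agrees with the slides' final solution: μ1 satisfies μ1 = (3 + μ1)/4, giving μ1 = 1, and σ1² satisfies σ1² = (2 + σ1²)/4, giving σ1² = 2/3 ≈ 0.667.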

Page 18

EM to Estimate Means of k Gaussians

• Data D drawn from a mixture of k distinct normal distributions
• A two-step process generates each sample:
  • One of the k distributions is selected at random
  • A single random instance xi is generated according to the selected distribution

Page 19

Instances Generated by a Mixture of Two Normal Distributions

[Figure: histogram of instances generated by a mixture of two normal distributions.]

Page 20

Example of EM to Estimate Means of k Gaussians

• Each instance is generated by:
  1. Choosing one of the k Gaussians with uniform probability
  2. Generating an instance at random according to that Gaussian
• Each of the k normal distributions has the same known variance
• Learning task: output a hypothesis h = ⟨μ1, ..., μk⟩ that describes the means of each of the k distributions

Page 21

Estimating Means of k Gaussians

• We would like to find a maximum likelihood hypothesis for these means: a hypothesis h that maximizes p(D|h)

Page 22

Maximum Likelihood Estimate of Mean of a Single Gaussian

• Given observed data instances x1, x2, ..., xm
• Drawn from a single distribution that is normally distributed
• Problem: find the mean of that distribution

Page 23

Maximum Likelihood Estimate of Mean of a Single Gaussian

• The maximum likelihood estimate of the mean of a normal distribution is the one that minimizes the sum of squared errors:

μ_ML = argmin_μ Σ_{i=1..m} (xi − μ)²

• The right-hand side is minimized (and the likelihood maximized) at

μ_ML = (1/m) Σ_{i=1..m} xi

• which is the sample mean
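A quick numerical check of this claim (the data values here are arbitrary, chosen only for illustration):

```python
import numpy as np

x = np.array([2.0, 3.0, 7.0])   # arbitrary illustrative data
mu_ml = x.mean()                # sample mean = 4.0

# brute-force check: the sum of squared errors is smallest at the sample mean
def sse(mu):
    return np.sum((x - mu) ** 2)

candidates = np.linspace(0.0, 10.0, 1001)
best = candidates[np.argmin([sse(m) for m in candidates])]
print(mu_ml, best)   # both ≈ 4.0
```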

Page 24

Mixture of Two Normal Distributions

• We cannot observe which instances were generated by which distribution
• Full description of an instance: ⟨xi, zi1, zi2⟩
  • xi = observed value of the ith instance
  • zi1 and zi2 indicate which of the two normal distributions was used to generate xi
  • zij = 1 if the jth normal distribution was used to generate xi, 0 otherwise
• zi1 and zi2 are hidden variables, which have probability distributions associated with them

Page 25

Hidden variables specify distribution

[Figure: an instance generated by the first distribution is described as ⟨xi, 1, 0⟩, i.e. zi1 = 1, zi2 = 0; one generated by the second as ⟨xi, 0, 1⟩, i.e. zi1 = 0, zi2 = 1.]

Page 26

2-Means Problem

• Full description of an instance: ⟨xi, zi1, zi2⟩
  • xi = observed variable
  • zi1 and zi2 are hidden variables
• If zi1 and zi2 were observed, we could use maximum likelihood estimates for the means:

μ1 = (1/m1) Σ_{i : zi1 = 1} xi,   μ2 = (1/m2) Σ_{i : zi2 = 1} xi

where mj is the number of instances for which zij = 1
• Since we do not know zi1 and zi2, we use EM instead

Page 27

EM Algorithm Applied to k-Means Problem

• Search for a maximum likelihood hypothesis by repeatedly re-estimating the expected values of the hidden binary variables zij given the current hypothesis ⟨μ1, ..., μk⟩
• Then recalculate the maximum likelihood hypothesis using these expected values for the hidden variables

Page 28

EM algorithm for two means

1. Hypothesize means, then determine the expected values of the hidden variables for all samples
2. Use these hidden-variable values, regarded as probabilities, to recalculate the means

[Figure: samples labeled zi1 = 1, zi2 = 0 and zi1 = 0, zi2 = 1 under the two hypothesized means.]

Page 29

EM Applied to Two-Means Problem

• Initialize the hypothesis to h = ⟨μ1, μ2⟩
• Estimate the expected values of the hidden variables zij given the current hypothesis
• Recalculate the maximum likelihood hypothesis using these expected values for the hidden variables
• Re-estimate h repeatedly until the procedure converges to a stationary value of h

Page 30

EM Algorithm for 2-Means

Step 1: Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = ⟨μ1, μ2⟩ holds.

Step 2: Calculate a new maximum likelihood hypothesis h′ = ⟨μ1′, μ2′⟩, assuming each hidden variable zij takes on its expected value E[zij] calculated in Step 1. Then replace the hypothesis h by the new hypothesis h′ and iterate.

Page 31

EM First Step

Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = ⟨μ1, μ2⟩ holds:

E[zij] = p(x = xi | μ = μj) / Σ_{n=1..2} p(x = xi | μ = μn)
       = exp(−(xi − μj)²/(2σ²)) / Σ_{n=1..2} exp(−(xi − μn)²/(2σ²))

= probability that instance xi was generated by the jth Gaussian

Page 32

EM Second Step

Calculate a new maximum likelihood hypothesis h′ = ⟨μ1′, μ2′⟩:

μj ← Σ_{i=1..m} E[zij] xi / Σ_{i=1..m} E[zij]

Observation: this is similar to the earlier sample-mean calculation for a single Gaussian, μ_ML = (1/m) Σ_{i=1..m} xi, with each instance now weighted by E[zij].
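The two steps can be combined into a short sketch. This is my own illustration, not from the slides; the function name `em_two_means`, the initialization, and the synthetic data are assumptions.

```python
import numpy as np

def em_two_means(x, sigma2, mu_init, iters=50):
    """EM for the means of a mixture of two 1-D Gaussians with equal,
    known variance sigma2 and uniform mixing (the slides' 2-means problem)."""
    mu = np.asarray(mu_init, dtype=float)
    for _ in range(iters):
        # E-step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma2)),
        # normalized over the two Gaussians
        logits = -(x[:, None] - mu[None, :]) ** 2 / (2.0 * sigma2)
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu
```

On well-separated data the estimated means converge close to the true component means; like EM in general, the result is only a local optimum and depends on `mu_init`.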

Page 33

Clustering using EM

Page 34

Feature Extraction

Image I_1 → Feature Extraction → Feature Vector (74 values)

[Figure: image I_1 and its 74 normalized feature values.]

Page 35

Feature Extraction

• 2 global features: aspect ratio, stroke ratio
• 72 local features, for subimages i = 1, ..., 9 and slopes j = 0, ..., 7:

F(i,j) = s(i,j) / ( N(i) · S(j) ),   with S(j) = max_i s(i,j) / N(i)

where s(i,j) is the number of components with slope j in subimage i, and N(i) is the number of components in subimage i.

Image I_1 → Feature Extraction → Feature Vector (74 values)

Page 36

For One Feature

            P(I_1|C_1)  P(I_1|C_2)  P(I_1|C_3)  P(I_1|C_4)  P(I_1|C_5)
Cycle 1:    1.000000    0.000000    0.000000    0.000000    0.000000
Cycle 2:    1.000000    0.000000    0.000000    0.000000    0.000000
…
Cycle N:    0.990915    0.009083    0.000001    0.000000    0.000000

Final cluster centers (images closest to the means), with priors
P(C_1) = 0.206088, P(C_2) = 0.228980, P(C_3) = 0.203198, P(C_4) = 0.218480, P(C_5) = 0.143254

Page 37

Clustering

Initial mean for Cluster 1 (initialized with the feature vector for Image 1) → 10 cycles → Final mean for Cluster 1

[Figure: the 74-value initial and final mean vectors for Cluster 1.]

Page 38

Essence of EM Algorithm

• The current hypothesis is used to estimate the unobserved variables
• The expected values of these variables are then used to calculate an improved hypothesis
• It can be shown that on each iteration through the loop, EM increases the likelihood P(D|h) unless it is at a local maximum
• The algorithm thus converges to a local maximum likelihood hypothesis for ⟨μ1, μ2⟩

Page 39

General Statement of EM Algorithm

• The parameters of interest were θ = ⟨μ1, μ2⟩
• The full data were the triples ⟨xi, zi1, zi2⟩, of which only xi is observed

Page 40

General Statement of EM Algorithm (cont’d.)

• Let X = {x1, x2, ..., xm} denote the observed data in a set of m independently drawn instances
• Let Z = {z1, z2, ..., zm} denote the unobserved data in these same instances
• Let Y = X ∪ Z denote the full data

Page 41

General Statement of EM Algorithm (cont’d.)

• The unobserved Z can be treated as a random variable whose p.d.f. depends on the unknown parameters θ and on the observed data X
• Similarly, Y is a random variable because it is defined in terms of the random variable Z
• h denotes the current hypothesized values of the parameters θ
• h′ denotes the revised hypothesis estimated on each iteration of the EM algorithm

Page 42

General Statement of EM Algorithm (cont’d.)

• The EM algorithm searches for the maximum likelihood hypothesis h′ by seeking the h′ that maximizes E[ln P(Y|h′)]

Page 43

General Statement of EM Algorithm (cont’d.)

• Define the function Q(h′|h) that gives E[ln P(Y|h′)] as a function of h′, under the assumption that θ = h and given the observed portion X of the full data Y:

Q(h′|h) ← E[ ln P(Y|h′) | h, X ]

Page 44

General Statement of EM Algorithm

• Repeat until convergence:
• Step 1: Estimation (E) step: Calculate Q(h′|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

Q(h′|h) ← E[ ln P(Y|h′) | h, X ]

• Step 2: Maximization (M) step: Replace hypothesis h by the hypothesis h′ that maximizes this Q function:

h ← argmax_{h′} Q(h′|h)
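The two steps above can be written as a generic loop. The callback structure and the toy missing-data model below are illustrative assumptions of mine, not part of the slides.

```python
import numpy as np

def em(h0, e_step, m_step, iters=100, tol=1e-10):
    """Generic EM loop from the slides:
    E-step: use the current h and observed X to take expectations over Y;
    M-step: h <- argmax_{h'} E[ln P(Y|h') | h, X]."""
    h = h0
    for _ in range(iters):
        h_new = m_step(e_step(h))
        if abs(h_new - h) < tol:
            return h_new
        h = h_new
    return h

# Toy model (hypothetical, for illustration only): estimate the mean of a
# Gaussian with known variance when one of four observations is missing.
# The E-step imputes the missing value with the current mean; the M-step
# averages the completed data.
observed = np.array([1.0, 2.0, 3.0])
e_step = lambda mu: np.append(observed, mu)   # E[x_missing] = mu
m_step = lambda y: y.mean()                   # ML mean of the completed data

print(em(0.0, e_step, m_step))   # ≈ 2.0, the mean of the observed data
```

The fixed point satisfies μ = (6 + μ)/4, i.e. μ = 2, which is exactly the mean of the observed data, as expected when the missing value carries no extra information.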

Page 45

Derivation of k Means Algorithm from General EM algorithm

• Derive the previously seen algorithm for estimating the means of a mixture of k normal distributions, i.e. estimate the parameters θ = ⟨μ1, ..., μk⟩
• We are given the observed data X = {⟨xi⟩}
• The hidden variables Z = {⟨zi1, ..., zik⟩} indicate which of the k normal distributions was used to generate xi

Page 46

Derivation of the k Means Algorithm from General EM Algorithm (cont’d.)

• We need to derive an expression for Q(h′|h)
• First we derive an expression for ln P(Y|h′)

Page 47

Derivation of the k Means Algorithm

• The probability p(yi|h′) of a single instance yi = ⟨xi, zi1, ..., zik⟩ of the full data can be written

p(yi|h′) = p(xi, zi1, ..., zik | h′) = (1/√(2πσ²)) exp( −(1/(2σ²)) Σ_{j=1..k} zij (xi − μj′)² )

Page 48

Derivation of the k Means Algorithm (cont’d.)

• Given this probability p(yi|h′) for a single instance, the logarithmic probability ln P(Y|h′) for all m instances in the data is

ln P(Y|h′) = ln Π_{i=1..m} p(yi|h′)
          = Σ_{i=1..m} ln p(yi|h′)
          = Σ_{i=1..m} ( ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1..k} zij (xi − μj′)² )

Page 49

Derivation of the k Means Algorithm (cont’d.)

• In general, for any function f(z) that is a linear function of z, the equality E[f(z)] = f(E[z]) holds. Applying this to ln P(Y|h′):

E[ln P(Y|h′)] = E[ Σ_{i=1..m} ( ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1..k} zij (xi − μj′)² ) ]
             = Σ_{i=1..m} ( ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1..k} E[zij] (xi − μj′)² )

Page 50

Derivation of the k Means Algorithm

• To summarize: with h′ = ⟨μ1′, ..., μk′⟩, and with E[zij] calculated on the current hypothesis h and the observed data X, the Q function for the k-means problem is

Q(h′|h) = Σ_{i=1..m} ( ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1..k} E[zij] (xi − μj′)² )

where

E[zij] = exp(−(xi − μj)²/(2σ²)) / Σ_{n=1..k} exp(−(xi − μn)²/(2σ²))

Page 51

Derivation of the k Means Algorithm: Second (Maximization) Step, To Find the Values h′ = ⟨μ1′, ..., μk′⟩

argmax_{h′} Q(h′|h) = argmax_{h′} Σ_{i=1..m} ( ln(1/√(2πσ²)) − (1/(2σ²)) Σ_{j=1..k} E[zij] (xi − μj′)² )
                  = argmin_{h′} Σ_{i=1..m} Σ_{j=1..k} E[zij] (xi − μj′)²

which is minimized by setting each mean to

μj ← Σ_{i=1..m} E[zij] xi / Σ_{i=1..m} E[zij]

Page 52

Summary

• In many parameter estimation tasks, some of the relevant instance variables may be unobservable.

• In this case, the EM algorithm is useful.