
Computational Statistics, 2nd Edition

Chapter 4: EM Optimization Methods

Presented by: Weiyu Li, Jincheng Pang

2018.03

Givens & Hoeting, Computational Statistics, 2nd Edition


Focus

1. Introduction: the MM Algorithm

2. The EM Algorithm

Examples, Convergence, Variance Estimation

3. Improvements

MCEM, ECM, EM gradient, Acceleration methods


Introduction: MM

MM: Majorize-Minimization or Minorize-Maximization.

Algorithm: find a surrogate function g(θ|θ(t)) that minorizes the objective function f(θ) (here taken to be concave) and maximize g.

1. Minorization: find g(θ|θ(t)) satisfying

g(θ|θ(t)) ≤ f(θ) for all θ, and g(θ(t)|θ(t)) = f(θ(t)).   (1)

2. Maximization: θ(t+1) = argmaxθ g(θ|θ(t)).

3. Stop or return to 1.
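A minimal sketch of this loop in Python (the callables minorizer and maximize_surrogate are placeholders introduced here, not notation from the chapter); EM below is the special case in which building the surrogate is the E step and maximizing it is the M step:

```python
# Generic minorize-maximization loop (scalar theta for simplicity).
def mm_maximize(theta0, minorizer, maximize_surrogate, tol=1e-8, max_iter=500):
    theta = theta0
    for _ in range(max_iter):
        g = minorizer(theta)                  # 1. build the surrogate g(. | theta^(t))
        theta_new = maximize_surrogate(g)     # 2. maximize the surrogate
        if abs(theta_new - theta) < tol:      # 3. stop or return to step 1
            return theta_new
        theta = theta_new
    return theta
```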


Figure 1: Illustration of how MM works.


Introduction: What’s E-M for?

EM can be treated as a special case of the MM algorithm.

• Expectation

Missing data Z ← expectation given the observed data X

Complete data Y = (X,Z)

• Maximization (aim)

Maximize L(θ|X)

*Bayesian: estimate the mode of a posterior distribution f(θ|X)

(maximum a posteriori estimation)

• The distributions of Y|θ and Z|(X,θ) are easier to work with

• Latent variables: not actually missing data

• Bayesian setting: the "missing" quantities are parameters rather than data


EM Algorithm

Q(θ|θ(t)) := E{ log L(θ|Y) | x, θ(t) }   (2)

= E{ log fY(Y|θ) | x, θ(t) }   (3)

= ∫ [ log fY(y|θ) ] fZ|X(z|x, θ(t)) dz   (4)

Initial: θ(0)

Iterations: alternating between the E step and the M step.

1. E: Compute Q(θ|θ(t)).

2. M: θ(t+1) = argmaxθQ(θ|θ(t)).

3. Stop or return to 1.

Stopping criteria: a small change d(θ(t+1), θ(t)), or a small change d(Q(θ(t+1)|θ(t)), Q(θ(t)|θ(t))), etc., for a chosen distance measure d.


Example 1: How EM works

Y1, Y2 ∼ i.i.d. Exp(θ) with y1 = 5 observed but y2 missing.

Thus

Q(θ|θ(t)) = 2 log{θ} − 5θ − θ/θ(t) (5)

Updating equation: θ(t+1) = 2θ(t) / (5θ(t) + 1) → θ̂ = 0.2.

Easy analytic solution. No need of EM at all!
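A quick numerical check of this updating equation (a sketch; the starting value is arbitrary):

```python
# Iterate theta^(t+1) = 2*theta^(t) / (5*theta^(t) + 1) from an arbitrary start.
theta = 1.0
for t in range(10):
    theta = 2 * theta / (5 * theta + 1)
    print(t + 1, round(theta, 6))
# The iterates approach 0.2, the MLE based on the observed y1 = 5 alone.
```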


Example 2: Peppered moths

Alleles: C > I > T in dominance (carbonaria, insularia, typica).

How do we estimate allele frequencies from phenotype counts?

Hardy-Weinberg principle: if the allele frequencies in the population are pC, pI, and pT, then the genotype frequencies should be pC², 2pCpI, 2pCpT, pI², 2pIpT, and pT², for genotypes CC, CI, CT, II, IT, and TT, respectively.

Observations: phenotype counts x = (nC, nI, nT), where n = nC + nI + nT.

Complete data: y = (nCC, nCI, nCT, nII, nIT, nTT).

Aim: estimate p = (pC, pI), where pT = 1 − pC − pI.

x = (nC, nI, nT) = M(y) = (nCC + nCI + nCT, nII + nIT, nTT)


Example 2: Peppered moths - computation

log{fY(y|p)} = nCC log{pC²} + nCI log{2pCpI} + · · · + log( n choose nCC, nCI, nCT, nII, nIT, nTT ) (a multinomial coefficient).   (6)

E-Step:

Q(p|p(t)) = nCC(t) log{pC²} + nCI(t) log{2pCpI} + · · · + nTT(t) log{pT²} + k(nC, nI, nT, p(t)),   (7)

where nCC(t) = E{NCC | nC, nI, nT, p(t)} = nC (pC(t))² / [ (pC(t))² + 2 pC(t) pI(t) + 2 pC(t) pT(t) ], and so forth.

M-Step:

Setting dQ(p|p(t))/dpC = dQ(p|p(t))/dpI = 0 yields

pC(t+1) = [ 2nCC(t) + nCI(t) + nCT(t) ] / (2n),   pI(t+1) = [ 2nII(t) + nIT(t) + nCI(t) ] / (2n),   and   pT(t+1) = [ 2nTT(t) + nCT(t) + nIT(t) ] / (2n).
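A sketch of the full EM iteration for this example in Python (variable names are mine); with the observed counts nC = 85, nI = 196, nT = 341 used on the next slide it reproduces the limits in Table 1:

```python
# EM for the peppered moth example; converges to p_C ~ 0.070837, p_I ~ 0.188737.
def peppered_moth_em(nC, nI, nT, pC=1/3, pI=1/3, n_iter=20):
    n = nC + nI + nT
    for _ in range(n_iter):
        pT = 1.0 - pC - pI
        # E step: expected genotype counts given phenotype counts and p^(t)
        denC = pC**2 + 2*pC*pI + 2*pC*pT
        nCC, nCI, nCT = nC*pC**2/denC, nC*2*pC*pI/denC, nC*2*pC*pT/denC
        denI = pI**2 + 2*pI*pT
        nII, nIT, nTT = nI*pI**2/denI, nI*2*pI*pT/denI, nT
        # M step: allele-counting updates
        pC = (2*nCC + nCI + nCT) / (2*n)
        pI = (2*nII + nIT + nCI) / (2*n)
    return pC, pI

print(peppered_moth_em(85, 196, 341))   # approximately (0.070837, 0.188737)
```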


Example 2: Peppered moths - simulation results

Observed data: nC = 85, nI = 196, and nT = 341.

Table 1: EM results for the peppered moth example. R(t) is the relative convergence criterion; DC(t) and DI(t) are ratios of consecutive errors.

t   pC(t)      pI(t)      R(t)         DC(t)    DI(t)
0   0.333333   0.333333
1   0.081994   0.237406   5.7 × 10⁻¹   0.0425   0.337
2   0.071249   0.197870   1.6 × 10⁻¹   0.0369   0.188
3   0.070852   0.190360   3.6 × 10⁻²   0.0367   0.178
4   0.070837   0.189023   6.6 × 10⁻³   0.0367   0.176
5   0.070837   0.188787   1.2 × 10⁻³   0.0367   0.176
6   0.070837   0.188745   2.1 × 10⁻⁴   0.0367   0.176
7   0.070837   0.188738   3.6 × 10⁻⁵   0.0367   0.176
8   0.070837   0.188737   6.4 × 10⁻⁶   0.0367   0.176

One can notice that the error ratios DC(t) and DI(t) approach constants, i.e. the convergence order is β = 1 (linear convergence).


Convergence

Note that log fX(x|θ) = log fY(y|θ) − log fZ|X(z|x,θ); taking expectations with respect to the distribution of Z given (x, θ(t)) yields

log fX(x|θ) = Q(θ|θ(t)) − H(θ|θ(t))   (8)

where H(θ|θ(t)) = E{ log fZ|X(Z|x,θ) | x, θ(t) }.

Claim: maxθ H(θ|θ(t)) = H(θ(t)|θ(t)). (Hint: use Jensen's inequality.)

Therefore, increasing Q(θ|θ(t)) leads to increasing log fX(x|θ) (our aim!).

Generalized EM (GEM): merely increase Q(θ|θ(t)), i.e. Q(θ(t+1)|θ(t)) > Q(θ(t)|θ(t)).

Convergence order: Linear (slow!). Rate is inversely related to the

proportion of missing data.


Remarks about EM

• Ease of implementation and stable ascent.

• Optimization transfer: defining G(θ|θ(t)) := Q(θ|θ(t)) + l(θ(t)|x) − Q(θ(t)|θ(t)) yields the surrogate function g of the MM algorithm!

– Q(θ|θ(t)), G(θ|θ(t)) maximized at the same θ.

– minorizing function: G(θ|θ(t)) ≤ l(θ|x),∀θ.

– G is tangent to l at θ(t).

G is more convenient to maximize.

Each E step forms a minorizing function G, and each M step maximizes

it to provide an uphill step.


Discussion: Exponential families

Derivations:

f(y|θ) = c1(y) c2(θ) exp{θᵀ s(y)}

Q(θ|θ(t)) = k + log c2(θ) + ∫ θᵀ s(y) fZ|X(z|x,θ(t)) dz

Setting Q′(θ|θ(t)) = 0 yields −c2′(θ)/c2(θ) = ∫ s(y) fZ|X(z|x,θ(t)) dz.

Note that c2′(θ) = −c2(θ) E{s(Y)|θ}, so θ(t+1) is the solution of

E{s(Y)|θ} = ∫ s(y) fZ|X(z|x,θ(t)) dz.   (9)

Algorithm:

1. E step: Compute s(t) := E{s(Y) | x, θ(t)} = ∫ s(y) fZ|X(z|x,θ(t)) dz.

2. M step: θ(t+1) solves E{s(Y)|θ} = s(t).

3. Stop or return to 1.


Variance estimation: Outline

• Aim: estimate Var(θ̂) → compute the observed information −l′′(θ̂|x).

(Bayesian: the Hessian of the log posterior density)

• Theoretical derivations: Louis’s method.

• Methods:

– SEM: easy, fast, reliable.

– Bootstrapping: easier, nested looping.

– Others: empirical information, numerical differentiation, ...


Variance estimation: Louis’s method

Taking second derivatives of log fX(x|θ) = Q(θ|ω) − H(θ|ω) with respect to θ and evaluating at ω = θ yields

−l′′(θ|x) = −Q′′(θ|ω)|ω=θ + H′′(θ|ω)|ω=θ   (10)

Define iX(θ) = −l′′(θ|x), iY(θ) = −Q′′(θ|ω)|ω=θ ( = −E{ l′′(θ|Y) | x, θ } ), and iZ|X(θ) = −H′′(θ|ω)|ω=θ = VarZ|X{ d log fZ|X(Z|x,θ)/dθ }.

Missing information principle:

iX(θ) = iY(θ) − iZ|X(θ),   (11)

i.e. (observed information) = (complete information) − (missing information).   (12)


Variance estimation: Louis’s method - remarks

• Define SZ|X(θ) = d log fZ|X(z|x,θ)/dθ; then

iZ|X(θ) = ∫ SZ|X(θ) SZ|X(θ)ᵀ fZ|X(z|x,θ) dz   (13)

since E{SZ|X(θ)} = 0.

• Avoids working with the observed-data likelihood fX(x|θ) directly; sometimes easier to derive and code.

• If the required expectations are difficult to compute analytically, use a Monte Carlo method: e.g. estimate iY(θ) by

(1/m) Σ_{i=1}^m [ −d² log fY(yi|θ)/dθ² ],   (14)

where yi = (x, zi) and the zi are i.i.d. draws from fZ|X.


Variance estimation: Example - censored exponential data

Observed data:

xi = (ci, 0) if yi > ci (censored), and xi = (yi, 1) if yi ≤ ci (uncensored), where Y1, . . . , Yn i.i.d. ∼ Exp(λ).   (15)

Complete-data log likelihood: l(λ|y) = n log λ − λ Σ_{i=1}^n yi.

Q(λ|λ(t)) = n log λ − λ Σ_{i=1}^n E{Yi | xi, λ(t)}   (16)

= n log λ − λ Σ_{i=1}^n [ yi δi + ci(1 − δi) ] − λC/λ(t)   (17)

where δi = 1{case i is uncensored} and C = Σ_{i=1}^n (1 − δi) denotes the number of censored cases.

Therefore iY(λ) = −Q′′(λ|λ(t)) = n/λ², and we can also calculate iZ|X(λ) = Var{ d log fZ|X(z|x,λ)/dλ } = C/λ². Applying Louis's method, we find iX(λ) = U/λ², where U = Σ_{i=1}^n δi denotes the number of uncensored cases.
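A small numerical illustration of Louis's method for this example (a sketch: the simulated data, rate, and censoring distribution are my choices, used only to show the bookkeeping):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam_true = 200, 0.5
y = rng.exponential(1 / lam_true, n)     # complete (partly unobserved) survival times
c = rng.exponential(3.0, n)              # censoring times
obs = np.minimum(y, c)                   # observed times
delta = (y <= c).astype(float)           # 1 = uncensored, 0 = censored
U, C = int(delta.sum()), int(n - delta.sum())

lam = 1.0
for _ in range(100):                     # EM update obtained by maximizing (17)
    lam = n / (obs.sum() + C / lam)

i_Y = n / lam**2                         # complete information
i_ZX = C / lam**2                        # missing information
i_X = i_Y - i_ZX                         # observed information = U / lam^2
print(lam, np.sqrt(1 / i_X))             # lambda_hat and its standard error
```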


Variance estimation: SEM - introduction

Motivation:

Let Ψ denote the EM mapping, θ(t+1) = Ψ(θ(t)), with fixed point θ̂ and Jacobian matrix Ψ′(θ) whose (i, j)th element is dΨi(θ)/dθj. It can be shown that

Ψ′(θ̂)ᵀ = iZ|X(θ̂) iY(θ̂)⁻¹   (18)

Further use of the missing information principle leads to

Var{θ̂} = iY(θ̂)⁻¹ ( I + Ψ′(θ̂)ᵀ (I − Ψ′(θ̂)ᵀ)⁻¹ ).   (19)

SEM works with complete-data quantities plus an incremental matrix, so the extra uncertainty due to the missing data is accounted for automatically.

SEM is more stable than the generic numerical differentiation approach.

Aim: estimate Ψ′(θ̂).


Variance estimation: SEM - algorithm

1. Find θ̂ by standard EM.

2. Restart from some θ(0) closer to θ̂. For t = 0, 1, 2, . . .

(a) Produce θ(t+1) from θ(t) by standard EM.

(b) Define θ(t)(j) = (θ̂1, . . . , θ̂j−1, θj(t), θ̂j+1, . . . , θ̂p) and calculate

rij(t) = [ Ψi(θ(t)(j)) − θ̂i ] / [ θj(t) − θ̂j ].   (20)

(c) Stop when convergence criteria are met for the rij(t).

• Plug the final estimate of Ψ′(θ) into (19) to get the variance.

• The resulting covariance estimate can be slightly asymmetric (a diagnostic of numerical imprecision).

• No inverse.

• A transformation of θ may improve numerical performance.
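A sketch of step 2 in Python for a generic EM mapping (the arguments psi, theta_hat, and theta0 are placeholders for a user-supplied EM step, the converged EM estimate, and the restart point); the returned matrix estimates Ψ′(θ̂):

```python
import numpy as np

def sem_jacobian(psi, theta_hat, theta0, n_iter=5):
    """Estimate Psi'(theta_hat) via the SEM ratios r_ij^(t)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    theta = np.asarray(theta0, dtype=float)
    p = theta_hat.size
    R = np.zeros((p, p))
    for _ in range(n_iter):                          # in practice: stop when R stabilizes (step (c))
        for j in range(p):                           # (b) perturb one component at a time
            denom = theta[j] - theta_hat[j]
            if denom == 0.0:                         # component already converged; keep last ratio
                continue
            theta_j = theta_hat.copy()
            theta_j[j] = theta[j]                    # theta^(t)(j): hat values except in slot j
            R[:, j] = (np.asarray(psi(theta_j)) - theta_hat) / denom
        theta = np.asarray(psi(theta))               # (a) one standard EM step -> theta^(t+1)
    return R
```

With iY(θ̂) in hand, plugging R into (19) gives iY(θ̂)⁻¹ ( I + Rᵀ (I − Rᵀ)⁻¹ ) as the variance estimate.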


Variance estimation: SEM - remark

Why restart?

Using rij(t) = [ Ψi(θ1(t−1), . . . , θj−1(t−1), θj(t), θj+1(t−1), . . . , θp(t−1)) − Ψi(θ(t−1)) ] / [ θj(t) − θj(t−1) ] will not require fewer iterations overall and will be less stable.

Restarting from a point closer to θ̂ offsets the extra iterations spent re-approaching it.


Variance estimation: Other methods

Bootstrapping:

1. Initialization: θ̂1 = θ̂EM, applied to the observed data x1, . . . , xn.

2. Pseudo-data: for j = 2, . . . , B, compute θ̂j = θ̂EM(j), applied to x1(j), . . . , xn(j) generated by sampling the observed data randomly with replacement.

3. Estimate the variance of θ̂ from the sample variance of θ̂1, . . . , θ̂B; more generally, estimate E f(θ̂) by (1/B) Σ_{j=1}^B f(θ̂j).

The nested looping can be computationally burdensome.
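A sketch of this bootstrap recipe (em_fit is a placeholder for any routine that runs EM on a data set and returns θ̂):

```python
import numpy as np

def bootstrap_variance(x, em_fit, B=200, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    thetas = [em_fit(x)]                       # theta_1: EM applied to the observed data
    for _ in range(B - 1):                     # theta_2..theta_B: EM applied to pseudo-data
        idx = rng.integers(0, n, size=n)       # resample cases with replacement
        thetas.append(em_fit(x[idx]))
    thetas = np.asarray(thetas, dtype=float)
    return np.cov(thetas, rowvar=False)        # sample (co)variance of the B estimates
```

For the censored exponential example, x could be an n × 2 array of (observed time, censoring indicator) rows and em_fit the EM routine from that example.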

Empirical information (related to the Fisher information):

(1/n) Σ_{i=1}^n l′(θ|xi) l′(θ|xi)ᵀ − (1/n²) l′(θ|x) l′(θ|x)ᵀ   (21)

for i.i.d. data.

All terms are by-products of the M step: since H(θ|θ(t)) is maximized at θ = θ(t), we have l′(θ|x)|θ=θ(t) = Q′(θ|θ(t))|θ=θ(t).

Numerical differentiation: a trade-off between inaccuracy from too-large perturbations and round-off error from too-small ones.


Improvements: Outline

1. E step

• Monte Carlo EM: mean value.

2. M step

• Expectation Conditional Maximization (ECM): a CM cycle.

• EM gradient: a single step of Newton’s method.

3. Acceleration methods

• Aitken acceleration: a Newton update with Taylor expansion

approximation.

• Quasi-Newton acceleration: Quasi-Newton update.


Improvements: MCEM

Replace the E step with

1. Draw Zj(t), j = 1, . . . , m(t), i.i.d. from fZ|X(z|x, θ(t)).

2. Calculate Q(t+1)(θ|θ(t)) = (1/m(t)) Σ_{j=1}^{m(t)} log fY(Yj(t)|θ), a Monte Carlo estimate of Q(θ|θ(t)), where Yj(t) = (x, Zj(t)).

• Choice of m(t): small at first, increasing as the iterations proceed.

• Convergence: the iterates eventually bounce around the true maximum rather than converging exactly.

Example: censored exponential data reviewed

• Ordinary EM update: λ(t+1) = n / [ Σ_{i=1}^n xi + C/λ(t) ] (with xi here denoting the observed time, yi or ci).

• Using MCEM: Q(t+1)(λ|λ(t)) = n log λ − (λ/m(t)) Σ_{j=1}^{m(t)} Yjᵀ1, where Yj = (x, Zj1, . . . , ZjC) and Zjk − ck ∼ i.i.d. Exp(λ(t)), k = 1, . . . , C.

Therefore the MCEM update is λ(t+1) = n / [ (1/m(t)) Σ_{j=1}^{m(t)} Yjᵀ1 ].
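A sketch of this MCEM scheme for the censored exponential example (the growth schedule for m(t) and the data handling are my choices):

```python
import numpy as np

def mcem_censored_exp(obs, delta, n_sweeps=50, m0=10, seed=0):
    rng = np.random.default_rng(seed)
    obs, delta = np.asarray(obs, float), np.asarray(delta, int)
    n = len(obs)
    cens = obs[delta == 0]                       # censoring times c_k of the C censored cases
    unc_sum = obs[delta == 1].sum()              # total of the uncensored y_i
    lam = 1.0
    for t in range(n_sweeps):
        m_t = m0 + 5 * t                         # let m^(t) grow across iterations
        # impute each censored case as c_k + Exp(lambda^(t)) and form 1'Y_j, j = 1..m_t
        z = cens + rng.exponential(1 / lam, size=(m_t, len(cens)))
        totals = unc_sum + z.sum(axis=1)
        lam = n / totals.mean()                  # MCEM update: n / [(1/m^(t)) sum_j 1'Y_j]
    return lam
```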


Improvements: ECM

Using S conditional maximization problems to simplify the original

maximization problem.

θ(t+s/S) = argmax{θ : gs(θ) = gs(θ(t+(s−1)/S))} Q(θ|θ(t)),   (22)

for s = 1, . . . , S, and set θ(t+1) = θ(t+S/S).

Choice of constraints:

Partition θ into S subvectors, θ = (θ1, . . . ,θS).

1. gs(θ) = (θ1, . . . ,θs−1,θs+1, . . . ,θS), i.e. holding θ(−s) fixed. (Gauss-Seidel)

2. gs(θ) = θs, i.e. holding θs fixed.

Note: this introduces nested iterative loops (a CM cycle inside each EM iteration).


Improvements: ECM - (sketched) example

Multivariate regression

Let U1, . . . ,Un be independent where

Ui ∼ Nd (µi,Σ) (23)

for µi = Viβ, where Vi are known d× p matrices, and β and Σ are unknown.

Algorithm: partition the unknown parameters into (β,Σ).

• E-step: Find the expectation of the complete data sufficient statistics

conditional on the observed data and β(t), Σ(t). (The sufficient statistics are Σ_{i=1}^n Uij for j = 1, . . . , d and Σ_{i=1}^n Uij Uik for j, k = 1, . . . , d.)

• CM 1: β(t+1/2) is estimated given Σ = Σ(t).

• CM 2: Σ(t+2/2) is estimated given β = β(t+1/2).

• Return to the E-step.


Improvements: EM gradient

Since our aim is maximizing Q(θ|θ(t)) in the M step, replace the M-step update by a single Newton step:

θ(t+1) = θ(t) − Q′′(θ|θ(t))⁻¹|θ=θ(t) Q′(θ|θ(t))|θ=θ(t)   (24)

= θ(t) − Q′′(θ|θ(t))⁻¹|θ=θ(t) l′(θ(t)|x)   (25)

• Avoid the computational burden of nested looping.

• Same rate of convergence to θ as EM.

• Choice of step length: scaling can ensure ascent while inflating speeds

convergence.
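A sketch of the EM gradient iteration for the peppered moth example, reusing the E step from the earlier sketch (variable names are mine); the step-halving loop is one way to implement the scaling mentioned above, keeping the update inside the simplex and ensuring ascent of Q:

```python
import numpy as np

def em_gradient_step(p, nC=85, nI=196, nT=341):
    pC, pI = p
    pT = 1.0 - pC - pI
    # E step: expected genotype counts, then expected allele counts a, b, c
    denC = pC**2 + 2*pC*pI + 2*pC*pT
    nCC, nCI, nCT = nC*pC**2/denC, nC*2*pC*pI/denC, nC*2*pC*pT/denC
    denI = pI**2 + 2*pI*pT
    nII, nIT, nTT = nI*pI**2/denI, nI*2*pI*pT/denI, nT
    a, b, c = 2*nCC + nCI + nCT, 2*nII + nIT + nCI, 2*nTT + nCT + nIT
    # Q(p|p^(t)) = a log pC + b log pI + c log pT + const, with pT = 1 - pC - pI
    Q = lambda q: a*np.log(q[0]) + b*np.log(q[1]) + c*np.log(1 - q[0] - q[1])
    grad = np.array([a/pC - c/pT, b/pI - c/pT])
    hess = -np.array([[a/pC**2 + c/pT**2, c/pT**2],
                      [c/pT**2,           b/pI**2 + c/pT**2]])
    step = -np.linalg.solve(hess, grad)          # full Newton step on Q
    new = np.array([pC, pI]) + step
    while min(new[0], new[1], 1 - new.sum()) <= 0 or Q(new) < Q((pC, pI)):
        step /= 2                                # step halving: stay in the simplex
        new = np.array([pC, pI]) + step          # and keep Q moving uphill
    return new

p = np.array([1/3, 1/3])
for _ in range(10):
    p = em_gradient_step(p)
print(p)    # should approach (0.070837, 0.188737), as with ordinary EM
```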


Improvements: Aitken acceleration

Acceleration methods speed convergence by Newton-like steps.

To maximize l(θ|x), the Newton update would be

θ(t+1) = θ(t) − l′′(θ(t)|x)⁻¹ l′(θ(t)|x).   (26)

To replace l′(θ(t)|x), note that l′(θ(t)|x) = Q′(θ|θ(t))|θ=θ(t) and use the Taylor expansion

0 = Q′(θ|θ(t))|θ=θEM(t+1) ≈ Q′(θ|θ(t))|θ=θ(t) − iY(θ(t)) (θEM(t+1) − θ(t)).

Therefore the update would be

θ(t+1) = θ(t) − l′′(θ(t)|x)⁻¹ iY(θ(t)) (θEM(t+1) − θ(t)).   (27)

• Since iY is included, the increment is larger when more information is missing.

• The approximation is precise only when θ(t) is near θ̂, so initially iterate ordinary EM.

• Equivalent to applying Newton's method to find a zero of Ψ(θ) − θ, where Ψ is the EM mapping producing θ(t+1) = Ψ(θ(t)).


Improvements: Quasi-Newton acceleration

Newton-like method: θ(t+1) = θ(t) − (M(t))−1l′(θ(t)|x).

Take

M(t) = Q′′(θ|θ(t))|θ=θ(t) − B(t)

where B(t) approximates H′′(θ(t)|θ(t)), so that M(t) approximates l′′(θ(t)|x).

Quasi-Newton EM algorithm:

• Start with B(0) = 0.

• Gradually accumulate information about H′′ using the secant condition

B(t+1) (θ(t+1) − θ(t)) = H′(θ|θ(t+1))|θ=θ(t+1) − H′(θ|θ(t))|θ=θ(t).

Then quasi-Newton methods like BFGS can be used.

• Quasi-Newton EM vs. EM gradient.

• Scaling can be used to guarantee M(t) < 0 (negative definite).

• A potentially superior strategy approximates (l′′)−1 instead of l′′.


Figure 2: Steps taken by the EM gradient algorithm (long dashes). Ordinary EM steps are shown with the solid line. Steps from two methods from later sections (Aitken and quasi-Newton acceleration) are also shown, as indicated in the key. The observed-data log likelihood is shown with the grey scale, with light shading corresponding to high likelihood. All algorithms were started from pC = pI = 1/3.


Miscellanea: Gaussian Mixture Model (GMM)

Data from K different normal distributions:

p(x) = Σ_{k=1}^K πk N(x|µk, Σk),   (28)

where the mixture coefficient πk is the weight of the kth component.

We can use EM to estimate the unknown πk,µk,Σk by introducing a latent

variable zk. zk = 1 iff the point belongs to the kth normal distribution.

In the Bayesian view, with the prior density p(z) and the likelihood function p(x|z), the posterior density is

γ(zk) = p(zk = 1|x) = πk N(x|µk, Σk) / Σ_{j=1}^K πj N(x|µj, Σj).


Knowing πk(t), µk(t), Σk(t), we solve for the updates:

µk(t+1) = (1/Nk(t)) Σ_{n=1}^N γ(t)(znk) xn,   (29)

Σk(t+1) = (1/Nk(t)) Σ_{n=1}^N γ(t)(znk) (xn − µk(t+1)) (xn − µk(t+1))ᵀ,   (30)

πk(t+1) = Nk(t) / N,   (31)

where Nk(t) = Σ_{n=1}^N γ(t)(znk).

Because Σk(t+1) uses µk(t+1), the M step looks like a CM cycle rather than an ordinary simultaneous M step.
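A compact numpy sketch of these updates (the initialization at random data points, the log-scale normalization of the responsibilities, and the small ridge added to each Σk are my choices, not part of the slides):

```python
import numpy as np

def gmm_em(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)].astype(float)   # K data points as initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk), computed on the log scale for stability
        log_r = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            _, logdet = np.linalg.slogdet(Sigma[k])
            maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma[k]), diff)
            log_r[:, k] = np.log(pi[k]) - 0.5 * (logdet + maha + d * np.log(2 * np.pi))
        log_r -= log_r.max(axis=1, keepdims=True)
        gamma = np.exp(log_r)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: updates (29)-(31)
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]                                          # (29)
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)  # (30)
        pi = Nk / N                                                               # (31)
    return pi, mu, Sigma
```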
