
Expectation-Maximization Algorithm and Applications

Eugene Weinstein
Courant Institute of Mathematical Sciences
Nov 14th, 2006


List of Concepts

- Maximum-Likelihood Estimation (MLE)
- Expectation-Maximization (EM)
- Conditional Probability
- Mixture Modeling
- Gaussian Mixture Models (GMMs)
- String edit-distance
- Forward-backward algorithms


Overview

- Expectation-Maximization
- Mixture Model Training
- Learning String Edit-Distance


One-Slide MLE Review

Say I give you a coin with $P(\text{heads}) = \theta$ and $P(\text{tails}) = 1 - \theta$, but I don't tell you the value of $\theta$. Now say I let you flip the coin $n$ times, and you get $h$ heads and $n-h$ tails. What is the natural estimate of $\theta$? This is $h/n$.

More formally, the likelihood of $\theta$ is governed by a binomial distribution:

$$L(\theta) = \binom{n}{h}\,\theta^h (1-\theta)^{n-h}$$

Can prove $\hat{\theta} = h/n$ is the maximum-likelihood estimate of $\theta$: differentiate with respect to $\theta$, set equal to 0.
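
Spelling out that one-line maximization (standard calculus, not verbatim from the slide): take the log-likelihood, differentiate, and set to zero.

$$\log L(\theta) = \log\binom{n}{h} + h\log\theta + (n-h)\log(1-\theta)$$

$$\frac{d}{d\theta}\log L(\theta) = \frac{h}{\theta} - \frac{n-h}{1-\theta} = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{h}{n}$$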


EM Motivation

So, to solve any ML-type problem, we analytically maximize the likelihood function?
- Seems to work for 1D Bernoulli (coin toss)
- Also works for 1D Gaussian (find µ, σ²)

Not quite:
- The distribution may not be well-behaved, or may have too many parameters
- Say your likelihood function is a mixture of 1000 1000-dimensional Gaussians (1M parameters): direct maximization is not feasible

Solution: introduce hidden variables to
- Simplify the likelihood function (more common)
- Account for actual missing data


Hidden and Observed Variables

Observed variables: directly measurable from the data, e.g.
- The waveform values of a speech recording
- Is it raining today?
- Did the smoke alarm go off?

Hidden variables: influence the data, but are not trivial to measure, e.g.
- The phonemes that produce a given speech recording
- P(rain today | rain yesterday)
- Is the smoke alarm malfunctioning?


Expectation-Maximization

Model dependent random variables:
- Observed variable x
- Unobserved (hidden) variable y that generates x

Assume probability distributions $P(x, y \mid \theta)$, where $\theta$ represents the set of all parameters of the distribution.

Repeat until convergence:
- E-step: compute the expectation $Q(\theta \mid \theta') = E\left[\log P(x, y \mid \theta) \mid x, \theta'\right]$ ($\theta'$, $\theta$: old, new distribution parameters)
- M-step: find the $\theta$ that maximizes Q


Conditional Expectation Review

Let X, Y be r.v.'s drawn from the distributions P(x) and P(y). The conditional distribution is given by

$$P(y \mid x) = \frac{P(x, y)}{P(x)}$$

Then, for a function h(Y), the expectation is

$$E[h(Y)] = \sum_y h(y)\, P(y)$$

Given a particular value of X (X = x), the conditional expectation is

$$E[h(Y) \mid X = x] = \sum_y h(y)\, P(y \mid x)$$
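
A tiny numeric illustration (added, not from the talk) of $E[h(Y) \mid X = x]$ computed from a joint distribution table:

```python
# Conditional expectation from a small joint distribution table.
# Joint probabilities P(x, y) for X in {0, 1}, Y in {0, 1, 2}.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.30,
}

def h(y):
    return y * y  # any function of Y

def cond_expectation(x):
    # P(x) = sum_y P(x, y);  E[h(Y) | X=x] = sum_y h(y) P(x, y) / P(x)
    px = sum(p for (xi, _), p in joint.items() if xi == x)
    return sum(h(y) * p for (xi, y), p in joint.items() if xi == x) / px

print(cond_expectation(0))  # (0*0.10 + 1*0.20 + 4*0.10) / 0.40 = 1.5
```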


Maximum Likelihood Problem

Want to pick the $\theta$ that maximizes the log-likelihood of the observed (x) and unobserved (y) variables, given:
- Observed variable x
- Previous parameters $\theta'$

The conditional expectation of $\log P(x, y \mid \theta)$ given x and $\theta'$ is

$$Q(\theta \mid \theta') = E\left[\log P(x, y \mid \theta) \mid x, \theta'\right] = \sum_y P(y \mid x, \theta')\, \log P(x, y \mid \theta)$$


EM Derivation

Lemma (special case of Jensen's inequality): let p(x), q(x) be probability distributions. Then

$$\sum_x p(x) \log p(x) \;\ge\; \sum_x p(x) \log q(x)$$

Proof: rewrite as

$$\sum_x p(x) \log \frac{p(x)}{q(x)} \;\ge\; 0$$

Interpretation: relative entropy (KL divergence) is non-negative.
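
A quick numeric sanity check of the lemma (an added illustration, not part of the talk): for random distributions p and q, the relative entropy is always non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    # Draw two random probability distributions over 10 outcomes.
    p = rng.random(10); p /= p.sum()
    q = rng.random(10); q /= q.sum()
    # KL(p || q) = sum_x p(x) log(p(x)/q(x)) >= 0, equality iff p == q.
    kl = np.sum(p * np.log(p / q))
    print(f"KL = {kl:.4f}")  # always >= 0
```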


EM Derivation

EM Theorem: if $Q(\theta \mid \theta') \ge Q(\theta' \mid \theta')$, then $P(x \mid \theta) \ge P(x \mid \theta')$.

Proof: by some algebra and the lemma (spelled out below),

$$\log P(x \mid \theta) - \log P(x \mid \theta') \;\ge\; Q(\theta \mid \theta') - Q(\theta' \mid \theta')$$

So, if the right-hand side is non-negative, so is the change in log-likelihood.
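
The "some algebra", spelled out (the standard EM argument, reconstructed rather than verbatim from the slide): since $P(x, y \mid \theta) = P(y \mid x, \theta)\, P(x \mid \theta)$, taking the conditional expectation given x and $\theta'$ yields

$$\log P(x \mid \theta) = Q(\theta \mid \theta') - \sum_y P(y \mid x, \theta') \log P(y \mid x, \theta)$$

Subtracting the same identity at $\theta = \theta'$ and applying the lemma with $p = P(y \mid x, \theta')$ and $q = P(y \mid x, \theta)$ gives

$$\log P(x \mid \theta) - \log P(x \mid \theta') = Q(\theta \mid \theta') - Q(\theta' \mid \theta') + \sum_y P(y \mid x, \theta') \log \frac{P(y \mid x, \theta')}{P(y \mid x, \theta)} \;\ge\; Q(\theta \mid \theta') - Q(\theta' \mid \theta')$$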


EM Summary

Repeat until convergence:
- E-step: compute the expectation $Q(\theta \mid \theta') = E\left[\log P(x, y \mid \theta) \mid x, \theta'\right]$ ($\theta'$, $\theta$: old, new distribution parameters)
- M-step: find the $\theta$ that maximizes Q

EM Theorem: if $Q(\theta \mid \theta') \ge Q(\theta' \mid \theta')$, then $P(x \mid \theta) \ge P(x \mid \theta')$.

Interpretation: as long as we can improve the expectation of the log-likelihood, EM improves our model of the observed variable x. Actually, it's not necessary to maximize the expectation; it's enough to make sure that it increases. This is called "Generalized EM".
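
A minimal sketch of this loop in Python (added for illustration; `e_step` and `m_step` are hypothetical callbacks supplied per model, not part of the talk):

```python
def em(theta, e_step, m_step, tol=1e-6, max_iter=100):
    """Generic EM loop.

    e_step(theta) -> (expected statistics, log-likelihood under theta)
    m_step(stats) -> new theta maximizing (or merely improving) Q
    """
    prev_ll = float("-inf")
    for _ in range(max_iter):
        stats, ll = e_step(theta)    # E-step: expectations under current theta
        theta = m_step(stats)        # M-step: re-estimate parameters
        if abs(ll - prev_ll) < tol:  # log-likelihood is non-decreasing (EM theorem)
            break
        prev_ll = ll
    return theta
```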


EM Comments

In practice, x is a series of data points. To calculate the expectation, we can assume the points are i.i.d. and sum over all of them:

$$Q(\theta \mid \theta') = \sum_{i=1}^n \sum_y P(y \mid x_i, \theta')\, \log P(x_i, y \mid \theta)$$

Problems with EM?
- Local maxima
- Need to bootstrap the training process (pick an initial $\theta$)

When is EM most useful? When the model distributions are easy to maximize (e.g., Gaussian mixture models).

EM is a meta-algorithm; it needs to be adapted to each particular application.


Overview

- Expectation-Maximization
- Mixture Model Training
- Learning String Distance


EM Applications: Mixture Models

Gaussian/normal distribution:
- Parameters: mean µ and variance σ²
- In the multi-dimensional case, assume an isotropic Gaussian: same variance in all dimensions

We can model arbitrary distributions with density mixtures.
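
For reference, the densities in question (standard definitions; the slide's formula images are not in the transcript): the 1D density and its isotropic d-dimensional counterpart,

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad p(\mathbf{x} \mid \boldsymbol{\mu}, \sigma^2) = (2\pi\sigma^2)^{-d/2} \exp\!\left(-\frac{\lVert \mathbf{x}-\boldsymbol{\mu} \rVert^2}{2\sigma^2}\right)$$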


Density Mixtures

Combine m elementary densities to model a complex data distribution:

$$p(x \mid \Theta) = \sum_{k=1}^m \alpha_k\, p_k(x \mid \theta_k), \qquad \sum_{k=1}^m \alpha_k = 1$$

The kth Gaussian is parametrized by $\theta_k = (\mu_k, \sigma_k)$.

Log-likelihood function of the data x given $\Theta$:

$$\log L(\Theta \mid x) = \sum_{i=1}^n \log \sum_{k=1}^m \alpha_k\, p_k(x_i \mid \theta_k)$$

Log of a sum: hard to optimize analytically! Instead, introduce a hidden variable y:

$y_i = k$: $x_i$ generated by Gaussian k

EM formulation: maximize $Q(\Theta \mid \Theta')$.


Gaussian Mixture Model EM

Goal: maximize $Q(\Theta \mid \Theta')$, given n (observed) data points $x_1, \ldots, x_n$ and n (hidden) labels $y_1, \ldots, y_n$, where $y_i = k$ means $x_i$ was generated by Gaussian k.

Several pages of math later, we get:

E-step: compute the likelihood that Gaussian k generated each point:

$$p(k \mid x_i) = \frac{\alpha_k\, p_k(x_i \mid \theta_k)}{\sum_{j=1}^m \alpha_j\, p_j(x_i \mid \theta_j)}$$

M-step: update $\alpha_k$, $\mu_k$, $\sigma_k$ for each Gaussian $k = 1 \ldots m$ (these are the standard isotropic-GMM re-estimates):

$$\alpha_k = \frac{1}{n}\sum_{i=1}^n p(k \mid x_i), \qquad \mu_k = \frac{\sum_i p(k \mid x_i)\, x_i}{\sum_i p(k \mid x_i)}, \qquad \sigma_k^2 = \frac{\sum_i p(k \mid x_i)\, \lVert x_i - \mu_k \rVert^2}{d\,\sum_i p(k \mid x_i)}$$
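
A compact NumPy sketch of these updates (an added illustration consistent with the formulas above, not the talk's code; assumes isotropic components):

```python
import numpy as np

def gmm_em(X, m, n_iter=50, seed=0):
    """EM for an isotropic Gaussian mixture: X is (n, d), m components."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Bootstrap: random data points as means, global variance, uniform weights.
    mu = X[rng.choice(n, m, replace=False)]   # (m, d)
    var = np.full(m, X.var())                 # (m,)
    alpha = np.full(m, 1.0 / m)               # (m,)

    for _ in range(n_iter):
        # E-step: responsibilities p(k | x_i), via log densities for stability.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, m)
        log_p = (np.log(alpha)
                 - 0.5 * d * np.log(2 * np.pi * var)
                 - sq / (2 * var))                             # (n, m)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)                      # (n, m)

        # M-step: re-estimate weights, means, isotropic variances.
        nk = r.sum(axis=0)                                     # (m,)
        alpha = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * sq).sum(axis=0) / (d * nk)
    return alpha, mu, var

# Usage: two well-separated 2D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
print(gmm_em(X, m=2))
```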


GMM-EM Discussion

Summary: EM is naturally applicable to training probabilistic models. EM is a generic formulation, though; you need to do some hairy math to get from it to an implementation.

Problems with GMM-EM?
- Local maxima
- Need to bootstrap the training process (pick an initial $\theta$)

GMM-EM is applicable to an enormous number of pattern recognition tasks: speech, vision, etc. Hours of fun with GMM-EM!


Overview

- Expectation-Maximization
- Mixture Model Training
- Learning String Distance


String Edit-Distance

Notation: operate on two strings $x, y \in \Sigma^*$.

Edit-distance: transform one string into another using
- Substitution: kitten → bitten, at some substitution cost
- Insertion: cop → crop, at some insertion cost
- Deletion: learn → earn, at some deletion cost

Can be computed efficiently with a recursive (dynamic-programming) formulation; a standard rendering follows.
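
The usual unit-cost dynamic program (an added illustration; the talk does not spell out the recursion here):

```python
def edit_distance(x, y):
    """Levenshtein distance with unit costs, O(|x|*|y|) time."""
    m, n = len(x), len(y)
    # d[i][j] = edit distance between prefixes x[:i] and y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of x[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

print(edit_distance("kitten", "bitten"))  # 1
```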


Stochastic String Edit-Distance

Instead of setting costs, model the edit operation sequence as a random process: edit operations are selected according to a probability distribution $\delta(\cdot)$.

For an edit operation sequence $z_1 \ldots z_n$, view string edit-distance as a memoryless stochastic transducer:
- memoryless (Markov): each operation is chosen independently of the ones before it, so $p(z_1 \ldots z_n) = \prod_i \delta(z_i)$
- stochastic: the random process selects operations according to $\delta(\cdot)$, which is governed by a true probability distribution
- transducer: each operation consumes symbols of one string and emits symbols of the other


Edit-Distance Transducer

[Figure: an edit-distance transducer. An arc label a:b/0 means input a, output b, and weight 0.]


Two Distances

Define the yield of an edit sequence, $\nu(z^n\#)$, as the set of all string pairs $\langle x, y \rangle$ such that $z^n\#$ turns x into y ($\#$ marks the end of the sequence).

Viterbi edit-distance: negative log-likelihood of the most likely edit sequence:

$$d_V(x, y) = -\log \max_{z^n\#\,:\, \langle x, y \rangle \in \nu(z^n\#)} p(z^n\#)$$

Stochastic edit-distance: negative log-likelihood of all edit sequences from x to y:

$$d_S(x, y) = -\log \sum_{z^n\#\,:\, \langle x, y \rangle \in \nu(z^n\#)} p(z^n\#)$$


Evaluating Likelihood

Viterbi: maximize $p(z^n\#)$ over edit sequences. Stochastic: sum $p(z^n\#)$ over edit sequences.

Both require a calculation over all possible edit sequences, and with three kinds of edit operation there are exponentially many of them. However, the memoryless assumption allows us to compute the likelihood efficiently: use the forward-backward method!


Forward

Evaluation of forward probabilities $\alpha_{i,j}$: the likelihood of picking an edit sequence that generates the prefix pair $\langle x_1 \ldots x_i,\, y_1 \ldots y_j \rangle$. The memoryless assumption allows efficient recursive computation:

$$\alpha_{0,0} = 1, \qquad \alpha_{i,j} = \delta(x_i \to y_j)\,\alpha_{i-1,j-1} + \delta(x_i \to \epsilon)\,\alpha_{i-1,j} + \delta(\epsilon \to y_j)\,\alpha_{i,j-1}$$

(terms with a negative index are zero)


Backward

Evaluation of backward probabilities $\beta_{i,j}$: the likelihood of picking an edit sequence that generates the suffix pair $\langle x_{i+1} \ldots x_m,\, y_{j+1} \ldots y_n \rangle$. The memoryless assumption allows efficient recursive computation:

$$\beta_{m,n} = \delta(\#), \qquad \beta_{i,j} = \delta(x_{i+1} \to y_{j+1})\,\beta_{i+1,j+1} + \delta(x_{i+1} \to \epsilon)\,\beta_{i+1,j} + \delta(\epsilon \to y_{j+1})\,\beta_{i,j+1}$$

(terms with an out-of-range index are zero)
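
A sketch of the forward evaluation in Python (an added illustration in the spirit of Ristad & Yianilos; the dictionaries `sub`, `ins`, `del_` and the end probability `end` are hypothetical stand-ins for $\delta$):

```python
import math

def stochastic_edit_distance(x, y, sub, ins, del_, end):
    """Forward evaluation: -log of the total probability of all edit
    sequences that turn x into y.

    sub[(a, b)] = delta(a -> b), ins[b] = delta(eps -> b),
    del_[a] = delta(a -> eps), end = delta(#); together they sum to 1.
    """
    m, n = len(x), len(y)
    # alpha[i][j] = total probability of generating the prefix pair (x[:i], y[:j])
    alpha = [[0.0] * (n + 1) for _ in range(m + 1)]
    alpha[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0:
                alpha[i][j] += del_.get(x[i - 1], 0.0) * alpha[i - 1][j]
            if j > 0:
                alpha[i][j] += ins.get(y[j - 1], 0.0) * alpha[i][j - 1]
            if i > 0 and j > 0:
                alpha[i][j] += sub.get((x[i - 1], y[j - 1]), 0.0) * alpha[i - 1][j - 1]
    return -math.log(alpha[m][n] * end)

# Toy distribution over alphabet {a, b}: most mass on matches. Sums to 1.
sub = {("a", "a"): 0.3, ("b", "b"): 0.3, ("a", "b"): 0.05, ("b", "a"): 0.05}
ins = {"a": 0.05, "b": 0.05}
del_ = {"a": 0.05, "b": 0.05}
print(stochastic_edit_distance("ab", "bb", sub, ins, del_, end=0.1))
```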


EM Formulation

Edit operations are selected according to a probability distribution $\delta(\cdot)$. So, EM has to update $\delta$ based on occurrence counts of each operation (similar to the coin-tossing example). Idea: accumulate expected counts from the forward and backward variables; $\gamma(z)$ is the expected count of edit operation z.


EM Details

$\gamma(z)$: expected count of edit operation z, accumulated position by position from the forward and backward variables; an example accumulation is sketched below.
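
For instance, for a substitution $a \to b$ (reconstructed in the style of Ristad & Yianilos, not verbatim from the slide): at every position pair (i, j) with $x_i = a$ and $y_j = b$, accumulate

$$\gamma(a \to b) \mathrel{+}= \frac{\alpha_{i-1,j-1}\; \delta(a \to b)\; \beta_{i,j}}{\alpha_{m,n}\; \delta(\#)}$$

The M-step then simply renormalizes the expected counts: $\delta(z) = \gamma(z) / \sum_{z'} \gamma(z')$.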


References

- A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1), 1977, pp. 1-38.
- C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1), Mar 1983, pp. 95-103.
- F. Jelinek. Statistical Methods for Speech Recognition, 1997.
- M. Collins. The EM Algorithm, 1997.
- J. A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report TR-97-021, U.C. Berkeley, 1998.
- E. S. Ristad and P. N. Yianilos. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2), 1998, pp. 522-532.
- L. R. Rabiner. A tutorial on HMM and selected applications in speech recognition. In Proc. IEEE, 77(2), 1989, pp. 257-286.
- A. D'Souza. Using EM To Estimate A Probablity [sic] Density With A Mixture Of Gaussians.
- M. Mohri. Edit-Distance of Weighted Automata. In Proc. Implementation and Application of Automata (CIAA), 2002, pp. 1-23.
- J. Glass. Lecture Notes, MIT class 6.345: Automatic Speech Recognition, 2003.
- C. Tomasi. Estimating Gaussian Mixture Densities with EM: A Tutorial, 2004.
- Wikipedia.