
Expectation-Maximization Algorithm and Applications

Eugene Weinstein
Courant Institute of Mathematical Sciences
Nov 14th, 2006


List of Concepts

- Maximum-Likelihood Estimation (MLE)
- Expectation-Maximization (EM)
- Conditional Probability
- Mixture Modeling
- Gaussian Mixture Models (GMMs)
- String edit-distance
- Forward-backward algorithms


Overview

- Expectation-Maximization
- Mixture Model Training
- Learning String Edit-Distance


One-Slide MLE Review

Say I give you a coin with $P(\text{heads}) = \theta$ and $P(\text{tails}) = 1 - \theta$, but I don't tell you the value of $\theta$. Now say I let you flip the coin $n$ times, and you get $h$ heads and $n-h$ tails. What is the natural estimate of $\theta$? This is $h/n$.

More formally, the likelihood of $\theta$ is governed by a binomial distribution:

$$L(\theta) = \binom{n}{h}\,\theta^h (1-\theta)^{n-h}$$

Can prove $\hat{\theta} = h/n$ is the maximum-likelihood estimate of $\theta$: differentiate with respect to $\theta$, set equal to 0.
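
Spelling out that one-line maximization (standard calculus, not verbatim from the slide): take the log-likelihood, differentiate, and set to zero.

$$\log L(\theta) = \log\binom{n}{h} + h\log\theta + (n-h)\log(1-\theta)$$

$$\frac{d}{d\theta}\log L(\theta) = \frac{h}{\theta} - \frac{n-h}{1-\theta} = 0 \quad\Longrightarrow\quad \hat{\theta} = \frac{h}{n}$$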


EM Motivation

So, to solve any ML-type problem, we analytically maximize the likelihood function?
- Seems to work for 1D Bernoulli (coin toss)
- Also works for 1D Gaussian (find µ, σ²)

Not quite:
- The distribution may not be well-behaved, or may have too many parameters
- Say your likelihood function is a mixture of 1000 1000-dimensional Gaussians (1M parameters): direct maximization is not feasible

Solution: introduce hidden variables to
- Simplify the likelihood function (more common)
- Account for actual missing data


Hidden and Observed Variables

Observed variables: directly measurable from the data, e.g.
- The waveform values of a speech recording
- Is it raining today?
- Did the smoke alarm go off?

Hidden variables: influence the data, but are not trivial to measure, e.g.
- The phonemes that produce a given speech recording
- P(rain today | rain yesterday)
- Is the smoke alarm malfunctioning?


Expectation-Maximization

Model dependent random variables:
- Observed variable x
- Unobserved (hidden) variable y that generates x

Assume probability distributions $P(x, y \mid \theta)$, where $\theta$ represents the set of all parameters of the distribution.

Repeat until convergence:
- E-step: compute the expectation $Q(\theta \mid \theta') = E\left[\log P(x, y \mid \theta) \mid x, \theta'\right]$ ($\theta'$, $\theta$: old, new distribution parameters)
- M-step: find the $\theta$ that maximizes Q


Conditional Expectation Review

Let X, Y be r.v.'s drawn from the distributions P(x) and P(y). The conditional distribution is given by

$$P(y \mid x) = \frac{P(x, y)}{P(x)}$$

Then, for a function h(Y), the expectation is

$$E[h(Y)] = \sum_y h(y)\, P(y)$$

Given a particular value of X (X = x), the conditional expectation is

$$E[h(Y) \mid X = x] = \sum_y h(y)\, P(y \mid x)$$
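
A tiny numeric illustration (added, not from the talk) of $E[h(Y) \mid X = x]$ computed from a joint distribution table:

```python
# Conditional expectation from a small joint distribution table.
# Joint probabilities P(x, y) for X in {0, 1}, Y in {0, 1, 2}.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.30,
}

def h(y):
    return y * y  # any function of Y

def cond_expectation(x):
    # P(x) = sum_y P(x, y);  E[h(Y) | X=x] = sum_y h(y) P(x, y) / P(x)
    px = sum(p for (xi, _), p in joint.items() if xi == x)
    return sum(h(y) * p for (xi, y), p in joint.items() if xi == x) / px

print(cond_expectation(0))  # (0*0.10 + 1*0.20 + 4*0.10) / 0.40 = 1.5
```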


Maximum Likelihood Problem

Want to pick the $\theta$ that maximizes the log-likelihood of the observed (x) and unobserved (y) variables, given:
- Observed variable x
- Previous parameters $\theta'$

The conditional expectation of $\log P(x, y \mid \theta)$ given x and $\theta'$ is

$$Q(\theta \mid \theta') = E\left[\log P(x, y \mid \theta) \mid x, \theta'\right] = \sum_y P(y \mid x, \theta')\, \log P(x, y \mid \theta)$$


EM Derivation

Lemma (special case of Jensen's inequality): let p(x), q(x) be probability distributions. Then

$$\sum_x p(x) \log p(x) \;\ge\; \sum_x p(x) \log q(x)$$

Proof: rewrite as

$$\sum_x p(x) \log \frac{p(x)}{q(x)} \;\ge\; 0$$

Interpretation: relative entropy (KL divergence) is non-negative.
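
A quick numeric sanity check of the lemma (an added illustration, not part of the talk): for random distributions p and q, the relative entropy is always non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    # Draw two random probability distributions over 10 outcomes.
    p = rng.random(10); p /= p.sum()
    q = rng.random(10); q /= q.sum()
    # KL(p || q) = sum_x p(x) log(p(x)/q(x)) >= 0, equality iff p == q.
    kl = np.sum(p * np.log(p / q))
    print(f"KL = {kl:.4f}")  # always >= 0
```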


EM Derivation

EM Theorem: if $Q(\theta \mid \theta') \ge Q(\theta' \mid \theta')$, then $P(x \mid \theta) \ge P(x \mid \theta')$.

Proof: by some algebra and the lemma (spelled out below),

$$\log P(x \mid \theta) - \log P(x \mid \theta') \;\ge\; Q(\theta \mid \theta') - Q(\theta' \mid \theta')$$

So, if the right-hand side is non-negative, so is the change in log-likelihood.
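
The "some algebra", spelled out (the standard EM argument, reconstructed rather than verbatim from the slide): since $P(x, y \mid \theta) = P(y \mid x, \theta)\, P(x \mid \theta)$, taking the conditional expectation given x and $\theta'$ yields

$$\log P(x \mid \theta) = Q(\theta \mid \theta') - \sum_y P(y \mid x, \theta') \log P(y \mid x, \theta)$$

Subtracting the same identity at $\theta = \theta'$ and applying the lemma with $p = P(y \mid x, \theta')$ and $q = P(y \mid x, \theta)$ gives

$$\log P(x \mid \theta) - \log P(x \mid \theta') = Q(\theta \mid \theta') - Q(\theta' \mid \theta') + \sum_y P(y \mid x, \theta') \log \frac{P(y \mid x, \theta')}{P(y \mid x, \theta)} \;\ge\; Q(\theta \mid \theta') - Q(\theta' \mid \theta')$$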


EM Summary

Repeat until convergence:
- E-step: compute the expectation $Q(\theta \mid \theta') = E\left[\log P(x, y \mid \theta) \mid x, \theta'\right]$ ($\theta'$, $\theta$: old, new distribution parameters)
- M-step: find the $\theta$ that maximizes Q

EM Theorem: if $Q(\theta \mid \theta') \ge Q(\theta' \mid \theta')$, then $P(x \mid \theta) \ge P(x \mid \theta')$.

Interpretation: as long as we can improve the expectation of the log-likelihood, EM improves our model of the observed variable x. Actually, it's not necessary to maximize the expectation; it's enough to make sure that it increases. This is called "Generalized EM".
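
A minimal sketch of this loop in Python (added for illustration; `e_step` and `m_step` are hypothetical callbacks supplied per model, not part of the talk):

```python
def em(theta, e_step, m_step, tol=1e-6, max_iter=100):
    """Generic EM loop.

    e_step(theta) -> (expected statistics, log-likelihood under theta)
    m_step(stats) -> new theta maximizing (or merely improving) Q
    """
    prev_ll = float("-inf")
    for _ in range(max_iter):
        stats, ll = e_step(theta)    # E-step: expectations under current theta
        theta = m_step(stats)        # M-step: re-estimate parameters
        if abs(ll - prev_ll) < tol:  # log-likelihood is non-decreasing (EM theorem)
            break
        prev_ll = ll
    return theta
```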


EM Comments

In practice, x is a series of data points. To calculate the expectation, we can assume the points are i.i.d. and sum over all of them:

$$Q(\theta \mid \theta') = \sum_{i=1}^n \sum_y P(y \mid x_i, \theta')\, \log P(x_i, y \mid \theta)$$

Problems with EM?
- Local maxima
- Need to bootstrap the training process (pick an initial $\theta$)

When is EM most useful? When the model distributions are easy to maximize (e.g., Gaussian mixture models).

EM is a meta-algorithm; it needs to be adapted to each particular application.


Overview

- Expectation-Maximization
- Mixture Model Training
- Learning String Distance


EM Applications: Mixture Models

Gaussian/normal distribution:
- Parameters: mean µ and variance σ²
- In the multi-dimensional case, assume an isotropic Gaussian: same variance in all dimensions

We can model arbitrary distributions with density mixtures.
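
For reference, the densities in question (standard definitions; the slide's formula images are not in the transcript): the 1D density and its isotropic d-dimensional counterpart,

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad p(\mathbf{x} \mid \boldsymbol{\mu}, \sigma^2) = (2\pi\sigma^2)^{-d/2} \exp\!\left(-\frac{\lVert \mathbf{x}-\boldsymbol{\mu} \rVert^2}{2\sigma^2}\right)$$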


Density Mixtures

Combine m elementary densities to model a complex data distribution:

$$p(x \mid \Theta) = \sum_{k=1}^m \alpha_k\, p_k(x \mid \theta_k), \qquad \sum_{k=1}^m \alpha_k = 1$$

The kth Gaussian is parametrized by $\theta_k = (\mu_k, \sigma_k)$.

Log-likelihood function of the data x given $\Theta$:

$$\log L(\Theta \mid x) = \sum_{i=1}^n \log \sum_{k=1}^m \alpha_k\, p_k(x_i \mid \theta_k)$$

Log of a sum: hard to optimize analytically! Instead, introduce a hidden variable y:

$y_i = k$: $x_i$ generated by Gaussian k

EM formulation: maximize $Q(\Theta \mid \Theta')$.


Gaussian Mixture Model EM

Goal: maximize $Q(\Theta \mid \Theta')$, given n (observed) data points $x_1, \ldots, x_n$ and n (hidden) labels $y_1, \ldots, y_n$, where $y_i = k$ means $x_i$ was generated by Gaussian k.

Several pages of math later, we get:

E-step: compute the likelihood that Gaussian k generated each point:

$$p(k \mid x_i) = \frac{\alpha_k\, p_k(x_i \mid \theta_k)}{\sum_{j=1}^m \alpha_j\, p_j(x_i \mid \theta_j)}$$

M-step: update $\alpha_k$, $\mu_k$, $\sigma_k$ for each Gaussian $k = 1 \ldots m$ (these are the standard isotropic-GMM re-estimates):

$$\alpha_k = \frac{1}{n}\sum_{i=1}^n p(k \mid x_i), \qquad \mu_k = \frac{\sum_i p(k \mid x_i)\, x_i}{\sum_i p(k \mid x_i)}, \qquad \sigma_k^2 = \frac{\sum_i p(k \mid x_i)\, \lVert x_i - \mu_k \rVert^2}{d\,\sum_i p(k \mid x_i)}$$
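
A compact NumPy sketch of these updates (an added illustration consistent with the formulas above, not the talk's code; assumes isotropic components):

```python
import numpy as np

def gmm_em(X, m, n_iter=50, seed=0):
    """EM for an isotropic Gaussian mixture: X is (n, d), m components."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Bootstrap: random data points as means, global variance, uniform weights.
    mu = X[rng.choice(n, m, replace=False)]   # (m, d)
    var = np.full(m, X.var())                 # (m,)
    alpha = np.full(m, 1.0 / m)               # (m,)

    for _ in range(n_iter):
        # E-step: responsibilities p(k | x_i), via log densities for stability.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, m)
        log_p = (np.log(alpha)
                 - 0.5 * d * np.log(2 * np.pi * var)
                 - sq / (2 * var))                             # (n, m)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)                      # (n, m)

        # M-step: re-estimate weights, means, isotropic variances.
        nk = r.sum(axis=0)                                     # (m,)
        alpha = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = (r * sq).sum(axis=0) / (d * nk)
    return alpha, mu, var

# Usage: two well-separated 2D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
print(gmm_em(X, m=2))
```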


GMM-EM Discussion

Summary: EM is naturally applicable to training probabilistic models. EM is a generic formulation, though; you need to do some hairy math to get from it to an implementation.

Problems with GMM-EM?
- Local maxima
- Need to bootstrap the training process (pick an initial $\theta$)

GMM-EM is applicable to an enormous number of pattern recognition tasks: speech, vision, etc. Hours of fun with GMM-EM!


Overview

- Expectation-Maximization
- Mixture Model Training
- Learning String Distance


String Edit-Distance

Notation: operate on two strings $x, y \in \Sigma^*$.

Edit-distance: transform one string into another using
- Substitution: kitten → bitten, at some substitution cost
- Insertion: cop → crop, at some insertion cost
- Deletion: learn → earn, at some deletion cost

Can be computed efficiently with a recursive (dynamic-programming) formulation; a standard rendering follows.
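
The usual unit-cost dynamic program (an added illustration; the talk does not spell out the recursion here):

```python
def edit_distance(x, y):
    """Levenshtein distance with unit costs, O(|x|*|y|) time."""
    m, n = len(x), len(y)
    # d[i][j] = edit distance between prefixes x[:i] and y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                       # delete all of x[:i]
    for j in range(n + 1):
        d[0][j] = j                       # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[m][n]

print(edit_distance("kitten", "bitten"))  # 1
```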


Stochastic String Edit-Distance

Instead of setting costs, model the edit operation sequence as a random process: edit operations are selected according to a probability distribution $\delta(\cdot)$.

For an edit operation sequence $z_1 \ldots z_n$, view string edit-distance as a memoryless stochastic transducer:
- memoryless (Markov): each operation is chosen independently of the ones before it, so $p(z_1 \ldots z_n) = \prod_i \delta(z_i)$
- stochastic: the random process selects operations according to $\delta(\cdot)$, which is governed by a true probability distribution
- transducer: each operation consumes symbols of one string and emits symbols of the other


Edit-Distance Transducer

[Figure: an edit-distance transducer. An arc label a:b/0 means input a, output b, and weight 0.]


Two Distances

Define the yield of an edit sequence, $\nu(z^n\#)$, as the set of all string pairs $\langle x, y \rangle$ such that $z^n\#$ turns x into y ($\#$ marks the end of the sequence).

Viterbi edit-distance: negative log-likelihood of the most likely edit sequence:

$$d_V(x, y) = -\log \max_{z^n\#\,:\, \langle x, y \rangle \in \nu(z^n\#)} p(z^n\#)$$

Stochastic edit-distance: negative log-likelihood of all edit sequences from x to y:

$$d_S(x, y) = -\log \sum_{z^n\#\,:\, \langle x, y \rangle \in \nu(z^n\#)} p(z^n\#)$$


Evaluating Likelihood

Viterbi: maximize $p(z^n\#)$ over edit sequences. Stochastic: sum $p(z^n\#)$ over edit sequences.

Both require a calculation over all possible edit sequences, and with three kinds of edit operation there are exponentially many of them. However, the memoryless assumption allows us to compute the likelihood efficiently: use the forward-backward method!


Forward

Evaluation of forward probabilities $\alpha_{i,j}$: the likelihood of picking an edit sequence that generates the prefix pair $\langle x_1 \ldots x_i,\, y_1 \ldots y_j \rangle$. The memoryless assumption allows efficient recursive computation:

$$\alpha_{0,0} = 1, \qquad \alpha_{i,j} = \delta(x_i \to y_j)\,\alpha_{i-1,j-1} + \delta(x_i \to \epsilon)\,\alpha_{i-1,j} + \delta(\epsilon \to y_j)\,\alpha_{i,j-1}$$

(terms with a negative index are zero)


Backward

Evaluation of backward probabilities $\beta_{i,j}$: the likelihood of picking an edit sequence that generates the suffix pair $\langle x_{i+1} \ldots x_m,\, y_{j+1} \ldots y_n \rangle$. The memoryless assumption allows efficient recursive computation:

$$\beta_{m,n} = \delta(\#), \qquad \beta_{i,j} = \delta(x_{i+1} \to y_{j+1})\,\beta_{i+1,j+1} + \delta(x_{i+1} \to \epsilon)\,\beta_{i+1,j} + \delta(\epsilon \to y_{j+1})\,\beta_{i,j+1}$$

(terms with an out-of-range index are zero)
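
A sketch of the forward evaluation in Python (an added illustration in the spirit of Ristad & Yianilos; the dictionaries `sub`, `ins`, `del_` and the end probability `end` are hypothetical stand-ins for $\delta$):

```python
import math

def stochastic_edit_distance(x, y, sub, ins, del_, end):
    """Forward evaluation: -log of the total probability of all edit
    sequences that turn x into y.

    sub[(a, b)] = delta(a -> b), ins[b] = delta(eps -> b),
    del_[a] = delta(a -> eps), end = delta(#); together they sum to 1.
    """
    m, n = len(x), len(y)
    # alpha[i][j] = total probability of generating the prefix pair (x[:i], y[:j])
    alpha = [[0.0] * (n + 1) for _ in range(m + 1)]
    alpha[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0:
                alpha[i][j] += del_.get(x[i - 1], 0.0) * alpha[i - 1][j]
            if j > 0:
                alpha[i][j] += ins.get(y[j - 1], 0.0) * alpha[i][j - 1]
            if i > 0 and j > 0:
                alpha[i][j] += sub.get((x[i - 1], y[j - 1]), 0.0) * alpha[i - 1][j - 1]
    return -math.log(alpha[m][n] * end)

# Toy distribution over alphabet {a, b}: most mass on matches. Sums to 1.
sub = {("a", "a"): 0.3, ("b", "b"): 0.3, ("a", "b"): 0.05, ("b", "a"): 0.05}
ins = {"a": 0.05, "b": 0.05}
del_ = {"a": 0.05, "b": 0.05}
print(stochastic_edit_distance("ab", "bb", sub, ins, del_, end=0.1))
```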


EM Formulation

Edit operations are selected according to a probability distribution $\delta(\cdot)$. So, EM has to update $\delta$ based on occurrence counts of each operation (similar to the coin-tossing example). Idea: accumulate expected counts from the forward and backward variables; $\gamma(z)$ is the expected count of edit operation z.


EM Details

$\gamma(z)$: expected count of edit operation z, accumulated position by position from the forward and backward variables; an example accumulation is sketched below.
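
For instance, for a substitution $a \to b$ (reconstructed in the style of Ristad & Yianilos, not verbatim from the slide): at every position pair (i, j) with $x_i = a$ and $y_j = b$, accumulate

$$\gamma(a \to b) \mathrel{+}= \frac{\alpha_{i-1,j-1}\; \delta(a \to b)\; \beta_{i,j}}{\alpha_{m,n}\; \delta(\#)}$$

The M-step then simply renormalizes the expected counts: $\delta(z) = \gamma(z) / \sum_{z'} \gamma(z')$.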


References

- A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1), 1977, pp. 1-38.
- C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1), Mar 1983, pp. 95-103.
- F. Jelinek. Statistical Methods for Speech Recognition, 1997.
- M. Collins. The EM Algorithm, 1997.
- J. A. Bilmes. A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical Report TR-97-021, U.C. Berkeley, 1998.
- E. S. Ristad and P. N. Yianilos. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2), 1998, pp. 522-532.
- L. R. Rabiner. A tutorial on HMM and selected applications in speech recognition. In Proc. IEEE, 77(2), 1989, pp. 257-286.
- A. D'Souza. Using EM To Estimate A Probablity [sic] Density With A Mixture Of Gaussians.
- M. Mohri. Edit-Distance of Weighted Automata. In Proc. Implementation and Application of Automata (CIAA), 2002, pp. 1-23.
- J. Glass. Lecture Notes, MIT class 6.345: Automatic Speech Recognition, 2003.
- C. Tomasi. Estimating Gaussian Mixture Densities with EM: A Tutorial, 2004.
- Wikipedia.