![Page 1: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/1.jpg)
Lecture 3
Gaussian Mixture Models and Introduction to HMM’s
Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen
IBM T.J. Watson Research CenterYorktown Heights, New York, USA
{picheny,bhuvana,stanchen}@us.ibm.com
24 September 2012
![Page 2: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/2.jpg)
Administrivia
Feedback (2+ votes):Too fast (pace/content/talking): many.More details/explanation of formulae: 5.More examples + explanation: 4.Talk more about labs: 2.Earlier break or more breaks: 2(Provide extra readings for people w/o DSP.)
Will try to address most of these today.Muddiest topic: DTW (4), LPC (3), DSP (2), deltas (1).Lab 1 due Wednesday, October 3rd at 6pm.
Should have received username and password.Courseworks discussion has been started.
2 / 113
![Page 3: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/3.jpg)
Where Are We?
Can extract features over time (LPC, MFCC, PLP) that . . .Characterize info in speech signal in compact form.Every ∼10 ms, process window of samples . . .To get ∼40 features.
DTW computes distance between feature vectors . . .While accounting for nonlinear time alignment.
Learned basic concepts (e.g., distances, shortest paths) . . .That will reappear throughout course.
3 / 113
![Page 4: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/4.jpg)
DTW Revisited
w∗ = arg minw∈vocab
distance(A′test, A′
w)
Training: collect audio Aw for each word w in vocab.Generate features ⇒ A′
w (template for w).Test time: given audio Atest, convert to A′
test.For each w , compute distance(A′
test, A′w) using DTW.
Return w with smallest distance.
4 / 113
![Page 5: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/5.jpg)
What are Pros and Cons of DTW?
5 / 113
![Page 6: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/6.jpg)
Pros
Easy to implement.Lots of freedom — can model arbitrary time warpings.
6 / 113
![Page 7: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/7.jpg)
Cons: It’s Ad Hoc
Distance measures completely heuristic.Why Euclidean?Weight all dimensions of feature vector equally?
Warping paths heuristic.Too much freedom not good for robustness?Allowable local paths hand-derived.
No guarantees of optimality or convergence.
7 / 113
![Page 8: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/8.jpg)
Cons (cont’d)
Doesn’t scale well?Run DTW for each template in training data.What if large vocabulary? Lots of templates per word?
Generalization.Doesn’t support mix and match between templates.
8 / 113
![Page 9: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/9.jpg)
Can We Do Better?
Key insight 1: Learn as much as possible from data.e.g., distance measure; graph weights; graphstructure?
Key insight 2: Use probabilistic modeling.Use well-described theories and models from . . .Probability, statistics, and computer science . . .Rather than arbitrary heuristics with ill-definedproperties.
9 / 113
![Page 10: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/10.jpg)
Next Two Main Topics
Gaussian Mixture models (today) — A probabilistic modelof . . .
Feature vectors associated with a speech sound.Principled distance between test frame . . .And set of template frames.
Hidden Markov models (next week) — A probabilistic modelof . . .
Time evolution of feature vectors for a speech sound.Principled generalization of DTW.
10 / 113
![Page 11: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/11.jpg)
Part I
Gaussian Distributions
11 / 113
![Page 12: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/12.jpg)
The Scenario
Given alignment between training feats X , test feats Y .Warping functions τx(t), τy(t), t = 1, . . . , T .i.e., time τx(t) in X aligns with time τy(t) in Y .
Total distance is sum of distance between aligned vectors.
distanceτx ,τy (X , Y ) =T∑
t=1
framedist(xτx (t), yτy (t))
12 / 113
![Page 13: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/13.jpg)
The Scenario
Computing frame distance for pair of frames is easy.
framedist(xτx (t), yτy (t))
Imagine 2d feature vectors instead of 40d for visualization.
13 / 113
![Page 14: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/14.jpg)
Problem Formulation
What if instead of one training sample, have many?
framedist((x1τx1 (t), x2
τx2 (t), x3τx3 (t), . . .); yτy (t))
14 / 113
![Page 15: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/15.jpg)
Ideas
Average training samples; compute Euclidean distance.Find best match over all training samples.Make probabilistic model of training samples.
15 / 113
![Page 16: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/16.jpg)
Probabilistic Modeling
Old paradigm:
w∗ = arg minw∈vocab
distance(A′test, A′
w)
New paradigm:
w∗ = arg minw∈vocab
− log P(A′test|w)
P(A′|w) is (relative) frequency with which w . . .Is realized as feature vector A′.
16 / 113
![Page 17: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/17.jpg)
Why Probabilistic Modeling?
If can estimate P(A′|w) perfectly . . .Can perform classification optimally!
e.g., two-class classification (w/ classes equally frequent)Choose yes iff P(A′
test|yes) > P(A′test|no).
This is best you can do!
17 / 113
![Page 18: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/18.jpg)
Why Probabilistic Modeling?
In limit of infinite data, . . .Can estimate probabilities perfectly (consistency ).
In real world situations (e.g., sparse data) . . .No guarantees.Still, better to follow principles imperfectly . . .Than to not have principles at all.
18 / 113
![Page 19: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/19.jpg)
Where Are We?
1 Gaussians in One Dimension
2 Gaussians in Multiple Dimensions
3 Estimating Gaussians From Data
19 / 113
![Page 20: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/20.jpg)
Problem Formulation, Two Dimensions
Estimate P(x1, x2), the “frequency” . . .That training sample occurs at location (x1, x2).
20 / 113
![Page 21: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/21.jpg)
Let’s Start With One Dimension
Estimate P(x), the “frequency” . . .That training sample occurs at location x .
21 / 113
![Page 22: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/22.jpg)
The Gaussian or Normal Distribution
Pµ,σ2(x) = N (µ, σ2) =1√2πσ
e−(x−µ)2
2σ2
Parametric distribution with two parameters:µ = mean (the center of the data).σ2 = variance (how wide data is spread).
22 / 113
![Page 23: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/23.jpg)
Visualization
Density function:
µ− 4σ µ− 2σ µ µ + 2σ µ + 4σ
Sample from distribution:
µ− 4σ µ− 2σ µ µ + 2σ µ + 4σ
23 / 113
![Page 24: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/24.jpg)
Properties of Gaussian Distributions
Is valid distribution.∫ ∞
−∞
1√2πσ
e−(x−µ)2
2σ2 dx = 1
Central Limit Theorem: Sums of large numbers ofidentically distributed random variables tend to Gaussian.
Lots of different types of data look “bell-shaped”.Sums and differences of Gaussian random variables . . .
Are Gaussian.If X is distributed as N (µ, σ2) . . .
aX + b is distributed as N (aµ + b, (aσ)2).Negative log looks like weighted Euclidean distance!
ln√
2πσ +(x − µ)2
2σ2
24 / 113
![Page 25: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/25.jpg)
Where Are We?
1 Gaussians in One Dimension
2 Gaussians in Multiple Dimensions
3 Estimating Gaussians From Data
25 / 113
![Page 26: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/26.jpg)
Gaussians in Two Dimensions
N (µ1, µ2, σ21, σ
22) =
12πσ1σ2
√1− r 2
e− 1
2(1−r2)
„(x1−µ1)2
σ21
− 2rx1x2σ1σ2
+(x2−µ2)2
σ22
«
If r = 0, simplifies to
1√2πσ1
e− (x1−µ1)2
2σ21
1√2πσ2
e− (x2−µ2)2
2σ22 = N (µ1, σ
21)N (µ2, σ
22)
i.e., like generating each dimension independently.
26 / 113
![Page 27: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/27.jpg)
Example: r = 0, σ1 = σ2
x1, x2 uncorrelated.Knowing x1 tells you nothing about x2.
27 / 113
![Page 28: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/28.jpg)
Example: r = 0, σ1 6= σ2
x1, x2 can be uncorrelated and have unequal variance.
28 / 113
![Page 29: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/29.jpg)
Example: r > 0, σ1 6= σ2
x1, x2 correlated.Knowing x1 tells you something about x2.
29 / 113
![Page 30: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/30.jpg)
Generalizing to More Dimensions
If we write following matrix:
Σ =
∣∣∣∣ σ21 rσ1σ2
rσ1σ2 σ22
∣∣∣∣then another way to write two-dimensional Gaussian is:
N (µ,Σ) =1
(2π)d/2|Σ|1/2 e−12 (x−µ)T Σ−1(x−µ)
where x = (x1, x2), µ = (µ1, µ2).More generally, µ and Σ can have arbitrary numbers ofcomponents.
Multivariate Gaussians.
30 / 113
![Page 31: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/31.jpg)
Diagonal and Full Covariance Gaussians
Let’s say have 40d feature vector.How many parameters in covariance matrix Σ?The more parameters, . . .The more data you need to estimate them.
In ASR, usually assume Σ is diagonal ⇒ d params.This is why like having uncorrelated features!
(Research direction: is there something in between?)
31 / 113
![Page 32: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/32.jpg)
Computing Gaussian Log Likelihoods
Why log likelihoods?Full covariance:
log P(x) = −d2
ln(2π)− 12
ln |Σ| − 12
(x− µ)TΣ−1(x− µ)
Diagonal covariance:
log P(x) = −d2
ln(2π)−d∑
i=1
ln σi −12
d∑i=1
(xi − µi)2/σ2
i
Again, note similarity to weighted Euclidean distance.Terms on left independent of x; precompute.A few multiplies/adds per dimension.
32 / 113
![Page 33: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/33.jpg)
Where Are We?
1 Gaussians in One Dimension
2 Gaussians in Multiple Dimensions
3 Estimating Gaussians From Data
33 / 113
![Page 34: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/34.jpg)
Estimating Gaussians
Give training data, how to choose parameters µ, Σ?Find parameters so that resulting distribution . . .
“Matches” data as well as possible.Sample data: height, weight of baseball players.
140
180
220
260
300
66 70 74 78 8234 / 113
![Page 35: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/35.jpg)
Maximum-Likelihood Estimation (Univariate)
One criterion: data “matches” distribution well . . .If distribution assigns high likelihood to data.
Likelihood of string of observations x1, x2, . . . , xN is . . .Product of individual likelihoods.
L(xN1 |µ, σ) =
N∏i=1
1√2πσ
e−(xi−µ)2
2σ2
Maximum likelihood estimation: choose µ, σ . . .That maximizes likelihood of training data.
(µ, σ)MLE = arg maxµ,σ
L(xN1 |µ, σ)
35 / 113
![Page 36: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/36.jpg)
Why Maximum-Likelihood Estimation?
Assume we have “correct” model form.Then, in presence of infinite training samples . . .
ML estimates approach “true” parameter values.For most models, MLE is asymptotically consistent,unbiased, and efficient.
ML estimation is easy for many types of models.Count and normalize!
36 / 113
![Page 37: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/37.jpg)
What is ML Estimate for Gaussians?
Much easier to work with log likelihood L = lnL:
L(xN1 |µ, σ) = −N
2ln 2πσ2 − 1
2
N∑i=1
(xi − µ)2
σ2
Take partial derivatives w.r.t. µ, σ:
∂L(xN1 |µ, σ)
∂µ=
N∑i=1
(xi − µ)
σ2
∂L(xN1 |µ, σ)
∂σ2 = − N2σ2 +
N∑i=1
(xi − µ)2
σ4
Set equal to zero; solve for µ, σ2.
µ =1N
N∑i=1
xi σ2 =1N
N∑i=1
(xi − µ)2
37 / 113
![Page 38: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/38.jpg)
What is ML Estimate for Gaussians?
Multivariate case.
µ =1N
N∑i=1
xi
Σ =1N
N∑i=1
(xi − µ)T (xi − µ)
What if diagonal covariance?Estimate params for each dimension independently.
38 / 113
![Page 39: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/39.jpg)
Example: ML Estimation
Heights (in.) and weights (lb.) of 1033 pro baseball players.Noise added to hide discretization effects.
∼stanchen/e6870/data/mlb_data.dat
height weight74.34 181.2973.92 213.7972.01 209.5272.28 209.0272.98 188.4269.41 176.0268.78 210.28
. . . . . .
. . . . . .
39 / 113
![Page 40: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/40.jpg)
Example: ML Estimation
140
180
220
260
300
66 70 74 78 82
40 / 113
![Page 41: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/41.jpg)
Example: Diagonal Covariance
µ1 =1
1033(74.34 + 73.92 + 72.01 + · · · ) = 73.71
µ2 =1
1033(181.29 + 213.79 + 209.52 + · · · ) = 201.69
σ21 =
11033
[(74.34− 73.71)2 + (73.92− 73.71)2 + · · · )
]= 5.43
σ22 =
11033
[(181.29− 201.69)2 + (213.79− 201.69)2 + · · · )
]= 440.62
41 / 113
![Page 42: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/42.jpg)
Example: Diagonal Covariance
140
180
220
260
300
66 70 74 78 82
42 / 113
![Page 43: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/43.jpg)
Example: Full Covariance
Mean; diagonal elements of covariance matrix the same.
Σ12 = Σ21
=1
1033[(74.34− 73.71)× (181.29− 201.69)+
(73.92− 73.71)× (213.79− 201.69) + · · · )]
= 25.43
µ = [ 73.71 201.69 ]
Σ =
∣∣∣∣ 5.43 25.4325.43 440.62
∣∣∣∣43 / 113
![Page 44: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/44.jpg)
Example: Full Covariance
140
180
220
260
300
66 70 74 78 82
44 / 113
![Page 45: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/45.jpg)
Recap: Gaussians
Lots of data “looks” Gaussian.Central limit theorem.
ML estimation of Gaussians is easy.Count and normalize.
In ASR, mostly use diagonal covariance Gaussians.Full covariance matrices have too many parameters.
45 / 113
![Page 46: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/46.jpg)
Part II
Gaussian Mixture Models
46 / 113
![Page 47: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/47.jpg)
Problems with Gaussian Assumption
47 / 113
![Page 48: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/48.jpg)
Problems with Gaussian Assumption
Sample from MLE Gaussian trained on data on last slide.Not all data is Gaussian!
48 / 113
![Page 49: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/49.jpg)
Problems with Gaussian Assumption
What can we do? What about two Gaussians?
P(x) = p1 ×N (µ1,Σ1) + p2 ×N (µ2,Σ2)
where p1 + p2 = 1.49 / 113
![Page 50: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/50.jpg)
Gaussian Mixture Models (GMM’s)
More generally, can use arbitrary number of Gaussians:
P(x) =∑
j
pj1
(2π)d/2|Σj |1/2 e−12 (x−µj )
T Σ−1j (x−µj )
where∑
j pj = 1 and all pj ≥ 0.Also called mixture of Gaussians.Can approximate any distribution of interest pretty well . . .
If just use enough component Gaussians.
50 / 113
![Page 51: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/51.jpg)
Example: Some Real Acoustic Data
51 / 113
![Page 52: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/52.jpg)
Example: 10-component GMM (Sample)
52 / 113
![Page 53: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/53.jpg)
Example: 10-component GMM (µ’s, σ’s)
53 / 113
![Page 54: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/54.jpg)
ML Estimation For GMM’s
Given training data, how to estimate parameters . . .i.e., the µj , Σj , and mixture weights pj . . .To maximize likelihood of data?
No closed-form solution.Can’t just count and normalize.
Instead, must use an optimization technique . . .To find good local optimum in likelihood.Gradient searchNewton’s method
Tool of choice: The Expectation-Maximization Algorithm.
54 / 113
![Page 55: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/55.jpg)
Where Are We?
1 The Expectation-Maximization Algorithm
2 Applying the EM Algorithm to GMM’s
55 / 113
![Page 56: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/56.jpg)
Wake Up!
This is another key thing to remember from course.Used to train GMM’s, HMM’s, and lots of other things.Key paper in 1977 by Dempster, Laird, and Rubin [2].
56 / 113
![Page 57: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/57.jpg)
What Does The EM Algorithm Do?
Finds ML parameter estimates for models . . .With hidden variables.
Iterative hill-climbing method.Adjusts parameter estimates in each iteration . . .Such that likelihood of data . . .Increases (weakly) with each iteration.
Actually, finds local optimum for parameters in likelihood.
57 / 113
![Page 58: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/58.jpg)
What is a Hidden Variable?
A random variable that isn’t observed.Example: in GMMs, output prob depends on . . .
The mixture component that generated the observationBut you can’t observe it
So, to compute prob of observed x , need to sum over . . .All possible values of hidden variable h:
P(x) =∑
h
P(h, x) =∑
h
P(h)P(x |h)
58 / 113
![Page 59: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/59.jpg)
Mixtures and Hidden Variables
Consider probability that is mixture of probs, e.g., a GMM:
P(x) =∑
j
pj N (µj ,Σj)
Can be viewed as hidden model.h ⇔ Which component generated sample.P(h) = pj ; P(x |h) = N (µj ,Σj).
P(x) =∑
h
P(h)P(x |h)
59 / 113
![Page 60: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/60.jpg)
The Basic Idea
If nail down “hidden” value for each xi , . . .Model is no longer hidden!e.g., data partitioned among GMM components.
So for each data point xi , assign single hidden value hi .Take hi = arg maxh P(h)P(xi |h).e.g., identify GMM component generating each point.
Easy to train parameters in non-hidden models.Update parameters in P(h), P(x |h).e.g., count and normalize to get MLE for µj , Σj , pj .
Repeat!
60 / 113
![Page 61: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/61.jpg)
The Basic Idea
Hard decision:For each xi , assign single hi = arg maxh P(h, xi) . . .With count 1.
Soft decision:For each xi , compute for every h . . .the Posterior prob P̃(h|xi) = P(h,xi )P
h P(h,xi ).
Also called the “fractional count”e.g., partition event across every GMM component.
Rest of algorithm unchanged.
61 / 113
![Page 62: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/62.jpg)
The Basic Idea
Initialize parameter values somehow.For each iteration . . .Expectation step: compute posterior (count) of h for each xi .
P̃(h|xi) =P(h, xi)∑h P(h, xi)
Maximization step: update parameters.Instead of data xi with hidden h, pretend . . .Non-hidden data where . . .(Fractional) count of each (h, xi) is P̃(h|xi).
62 / 113
![Page 63: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/63.jpg)
Example: Training a 2-component GMM
Two-component univariate GMM; 10 data points.The data: x1, . . . , x10
8.4, 7.6, 4.2, 2.6, 5.1, 4.0, 7.8, 3.0, 4.8, 5.8
Initial parameter values:
p1 µ1 σ21 p2 µ2 σ2
20.5 4 1 0.5 7 1
Training data; densities of initial Gaussians.
63 / 113
![Page 64: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/64.jpg)
The E Step
xi p1 · N1 p2 · N2 P(xi) P̃(1|xi) P̃(2|xi)8.4 0.0000 0.0749 0.0749 0.000 1.0007.6 0.0003 0.1666 0.1669 0.002 0.9984.2 0.1955 0.0040 0.1995 0.980 0.0202.6 0.0749 0.0000 0.0749 1.000 0.0005.1 0.1089 0.0328 0.1417 0.769 0.2314.0 0.1995 0.0022 0.2017 0.989 0.0117.8 0.0001 0.1448 0.1450 0.001 0.9993.0 0.1210 0.0001 0.1211 0.999 0.0014.8 0.1448 0.0177 0.1626 0.891 0.1095.8 0.0395 0.0971 0.1366 0.289 0.711
P̃(h|xi) =P(h, xi)∑h P(h, xi)
=ph · Nh
P(xi)h ∈ {1, 2}
64 / 113
![Page 65: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/65.jpg)
The M Step
View: have non-hidden corpus for each component GMM.For hth component, have P̃(h|xi) counts for event xi .
Estimating µ: fractional events.
µ =1N
N∑i=1
xi ⇒ µh =1∑
i P̃(h|xi)
N∑i=1
P̃(h|xi)xi
µ1 =1
0.000 + 0.002 + 0.980 + · · ·×
(0.000× 8.4 + 0.002× 7.6 + 0.980× 4.2 + · · · )
= 3.98
Similarly, can estimate σ2h with fractional events.
65 / 113
![Page 66: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/66.jpg)
The M Step (cont’d)
What about the mixture weights ph?
To find MLE, count and normalize!
p1 =0.000 + 0.002 + 0.980 + · · ·
10= 0.59
66 / 113
![Page 67: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/67.jpg)
The End Result
iter p1 µ1 σ21 p2 µ2 σ2
20 0.50 4.00 1.00 0.50 7.00 1.001 0.59 3.98 0.92 0.41 7.29 1.292 0.62 4.03 0.97 0.38 7.41 1.123 0.64 4.08 1.00 0.36 7.54 0.8810 0.70 4.22 1.13 0.30 7.93 0.12
67 / 113
![Page 68: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/68.jpg)
First Few Iterations of EM
iter 0
iter 1
iter 2
68 / 113
![Page 69: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/69.jpg)
Later Iterations of EM
iter 2
iter 3
iter 10
69 / 113
![Page 70: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/70.jpg)
Why the EM Algorithm Works [3]
x = (x1, x2, . . .) = whole training set; h = hidden.θ = parameters of model.Objective function for MLE: (log) likelihood.
L(θ) = log P(x|θ) = log∑
h
P(h, x|θ)
Alternate objective functionWill show maximizing this equivalent to above
F (P̃, θ) = L(θ)− D(P̃ ‖ Pθ)
Pθ(h|x) = posterior over hidden.P̃(h) = distribution over hidden to be optimized . . .D(· ‖ ·) = Kullback-Leibler divergence.
70 / 113
![Page 71: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/71.jpg)
Why the EM Algorithm Works
F (P̃, θ) = L(θ)− D(P̃ ‖ Pθ)
Outline of proof:Show that both E step and M step improve F (P̃, θ).Will follow that likelihood L(θ) improves as well.
71 / 113
![Page 72: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/72.jpg)
The E Step
F (P̃, θ) = L(θ)− D(P̃ ‖ Pθ)
Properties of KL divergence.Nonnegative; and zero iff P̃ = Pθ.
What is best choice for P̃(h)?Compute the current posterior Pθ(h|x).Set P̃(h) equal to this posterior.
Since L(θ) is not function of P̃ . . .F (P̃, θ) can only improve in E step.
72 / 113
![Page 73: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/73.jpg)
The M Step
Lemma:
F (P̃, θ) = EP̃ [log P(h, x|θ)] + H(P̃)
Proof:
F (P̃, θ) = L(θ)− D(P̃ ‖ Pθ)
= log P(x|θ)−∑
h
P̃(h) logP̃(h)
P(h|x, θ)
= log P(x|θ)−∑
h
P̃(h) logP̃(h)P(x|θ)
P(h, x|θ)
=∑
h
P̃(h) log P(h, x|θ)−∑
h
P̃(h) log P̃(h)
= EP̃ [log P(h, x|θ)] + H(P̃)
73 / 113
![Page 74: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/74.jpg)
The M Step (cont’d)
F (P̃, θ) = EP̃ [log P(h, x|θ)] + H(P̃)
EP̃ [· · · ] = log likelihood of non-hidden corpus . . .Where each h gets P̃(h) counts.
H(P̃) = entropy of distribution P̃(h).
What do we do in M step?Pick θ to maximize term on leftNote this is just MLE of non-hidden corpus . . .Since we chose an estimate for h from the E step.
Since H(P̃) is not function of θ . . .F (P̃, θ) can only improve in M step.
74 / 113
![Page 75: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/75.jpg)
Why the EM Algorithm Works
Observation: F (P̃, θ) = L(θ) after E step (set P̃ = Pθ).
F (P̃, θ) = L(θ)− D(P̃ ‖ Pθ)
If F (P̃, θ) improves with each iteration . . .And F (P̃, θ) = L(θ) after each E step . . .L(θ) improves after each iteration.
There you go!
75 / 113
![Page 76: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/76.jpg)
Discussion
EM algorithm is elegant and general way to . . .Train parameters in hidden models . . .To optimize likelihood.
Only finds local optimum.Seeding is of paramount importance.
Generalized EM algorithm.F (P̃, θ) just needs to improve some in each step.i.e., P̃(h) in E step need not be exact posterior.i.e., θ in M step need not be ML estimate.e.g., can optimize Viterbi likelihood.
76 / 113
![Page 77: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/77.jpg)
Where Are We?
1 The Expectation-Maximization Algorithm
2 Applying the EM Algorithm to GMM’s
77 / 113
![Page 78: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/78.jpg)
Another Example Data Set
78 / 113
![Page 79: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/79.jpg)
Question: How Many Gaussians?
Method 1 (most common): Guess!Method 2: Bayesian Information Criterion (BIC)[1].
Penalize likelihood by number of parameters.
BIC(Ck) =k∑
j=1
{−12
nj log |Σj |} − Nk(d +12
d(d + 1))
k = Gaussian components.d = dimension of feature vector.nj = data points for Gaussian j ; N = total data points.
79 / 113
![Page 80: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/80.jpg)
The Bayesian Information Criterion
View GMM as way of coding data for transmission.Cost of transmitting model ⇔ number of params.Cost of transmitting data ⇔ log likelihood of data.
Choose number of Gaussians to minimize cost.
80 / 113
![Page 81: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/81.jpg)
Question: How To Initialize Parameters?
Set mixture weights pj to 1/k (for k Gaussians).Pick N data points at random and . . .
Use them to seed initial values of µj .Set all σ’s to arbitrary value . . .
Or to global variance of data.Extension: generate multiple starting points.
Pick one with highest likelihood.
81 / 113
![Page 82: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/82.jpg)
Another Way: Splitting
Start with single Gaussian, MLE.Repeat until hit desired number of Gaussians:
Double number of Gaussians by perturbing means . . .Of existing Gaussians by ±ε.Run several iterations of EM.
82 / 113
![Page 83: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/83.jpg)
Question: How Long To Train?
i.e., how many iterations of EM?Guess.Look at performance on training data.
Stop when change in log likelihood per event . . .Is below fixed threshold.
Look at performance on held-out data.Stop when performance no longer improves.
83 / 113
![Page 84: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/84.jpg)
The Data Set
84 / 113
![Page 85: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/85.jpg)
Sample From Best 1-Component GMM
85 / 113
![Page 86: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/86.jpg)
The Data Set, Again
86 / 113
![Page 87: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/87.jpg)
20-Component GMM Trained on Data
87 / 113
![Page 88: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/88.jpg)
20-Component GMM µ’s, σ’s
88 / 113
![Page 89: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/89.jpg)
Acoustic Feature Data Set
89 / 113
![Page 90: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/90.jpg)
5-Component GMM; Starting Point A
90 / 113
![Page 91: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/91.jpg)
5-Component GMM; Starting Point B
91 / 113
![Page 92: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/92.jpg)
5-Component GMM; Starting Point C
92 / 113
![Page 93: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/93.jpg)
Solutions With Infinite Likelihood
Consider log likelihood; two-component 1d Gaussian.
N∑i=1
ln
(p1
1√2πσ1
e− (xi−µ1)2
2σ21 + p2
1√2πσ2
e− (xi−µ2)2
2σ22
)
If µ1 = x1, above reduces to
ln
(1
2√
2πσ1+
12√
2πσ2e
12
(x1−µ2)2
σ22
)+
N∑i=2
. . .
which goes to ∞ as σ1 → 0.Only consider finite local maxima of likelihood function.
Variance flooring.Throw away Gaussians with “count” below threshold.
93 / 113
![Page 94: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/94.jpg)
Recap
GMM’s are effective for modeling arbitrary distributions.State-of-the-art in ASR for decades.
The EM algorithm is primary tool for training GMM’s.Very sensitive to starting point.Initializing GMM’s is an art.
94 / 113
![Page 95: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/95.jpg)
References
S. Chen and P.S. Gopalakrishnan, “Clustering via theBayesian Information Criterion with Applications in SpeechRecognition”, ICASSP, vol. 2, pp. 645–648, 1998.
A.P. Dempster, N.M. Laird, D.B. Rubin, “Maximum Likelihoodfrom Incomplete Data via the EM Algorithm”, Journal of theRoyal Stat. Society. Series B, vol. 39, no. 1, 1977.
R. Neal, G. Hinton, “A view of the EM algorithm that justifiesincremental, sparse, and other variants”, Learning inGraphical Models, MIT Press, pp. 355–368, 1999.
95 / 113
![Page 96: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/96.jpg)
Where Are We: The Big Picture
Given test sample, find nearest training sample.
w∗ = arg minw∈vocab
distance(A′test, A′
w)
Total distance between training and test sample . . .Is sum of distances between aligned frames.
distanceτx ,τy (X , Y ) =T∑
t=1
framedist(xτx (t), yτy (t))
Goal: move from ad hoc distances to probabilities.
96 / 113
![Page 97: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/97.jpg)
Gaussian Mixture Models
Assume many training templates for each word.Calc distance between set of training frames . . .And test frame.
framedist((x1, x2, . . . , xD); y)
Idea: use x1, x2, . . . , xD to train GMM: P(x).
framedist((x1, x2, . . . , xD); y) ⇒ − log P(y) !
97 / 113
![Page 98: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/98.jpg)
What’s Next: Hidden Markov Models
Replace DTW with probabilistic counterpart.Together, GMM’s and HMM’s comprise . . .
Unified probabilistic framework.Old paradigm:
w∗ = arg minw∈vocab
distance(A′test, A′
w)
New paradigm:
w∗ = arg maxw∈vocab
P(A′test|w)
98 / 113
![Page 99: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/99.jpg)
Part III
Introduction to Hidden Markov Models
99 / 113
![Page 100: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/100.jpg)
Introduction to Hidden Markov Models
The issue of weights in DTW.Interpretation of DTW grid as Directed Graph.Adding Transition and Output Probabilities to the Graphgives us an HMM!The three main HMM operations.
100 / 113
![Page 101: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/101.jpg)
Another Issue with Dynamic Time Warping
Weights are completely heuristic!Maybe we can learn weights from data?Take many utterances . . .
101 / 113
![Page 102: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/102.jpg)
Learning Weights From Data
For each node in DP path, count number of times move up↑ right → and diagonally ↗.Normalize number of times each direction taken by totalnumber of times node was actually visited.Take some constant times reciprocal as weight.Example: particular node visited 100 times.
Move ↗ 50 times; → 25 times; ↑ 50 times.Set weights to 2, 4, and 4, respectively (or 1, 2, and 2).
Point: weight distribution should reflect . . .Which directions are taken more frequently at a node.
Weight estimation not addressed in DTW . . .But central part of Hidden Markov models.
102 / 113
![Page 103: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/103.jpg)
DTW and Directed Graphs
Take following Dynamic Time Warping setup:
Let’s look at representation of this as directed graph:
103 / 113
![Page 104: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/104.jpg)
DTW and Directed Graphs
Another common DTW structure:
As a directed graph:
Can represent even more complex DTW structures . . .Resultant directed graphs can get quite bizarre.
104 / 113
![Page 105: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/105.jpg)
Path Probabilities
Let’s assign probabilities to transitions in directed graph:
aij is transition probability going from state i to state j ,where
∑j aij = 1.
Can compute probability P of individual path just usingtransition probabilities aij .
105 / 113
![Page 106: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/106.jpg)
Path Probabilities
It is common to reorient typical DTW pictures:
Above only describes path probabilities associated withtransitions.Also need to include likelihoods associated withobservations.
106 / 113
![Page 107: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/107.jpg)
Path Probabilities
As in GMM discussion, let us define likelihood of producingobservation xi from state j as
bj(xi) =∑
m
cjm1
(2π)d/2|Σjm|1/2 e−12 (xi−µjm)T Σ−1
jm (xi−µjm)
where cjm are mixture weights associated with state j .This state likelihood is also called the output probabilityassociated with state.
107 / 113
![Page 108: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/108.jpg)
Path Probabilities
In this case, likelihood of entire path can be written as:
108 / 113
![Page 109: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/109.jpg)
Hidden Markov Models
The output and transition probabilities define a HiddenMarkov Model or HMM.
Since probabilities of moving from state to state onlydepend on current and previous state, model is Markov.Since only see observations and have to infer statesafter the fact, model is hidden.
One may consider HMM to be generative model of speech.Starting at upper left corner of trellis, generateobservations according to permissible transitions andoutput probabilities.
Not only can compute likelihood of single path . . .Can compute overall likelihood of observation string . . .As sum over all paths in trellis.
109 / 113
![Page 110: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/110.jpg)
HMM: The Three Main Tasks
Compute likelihood of generating string of observationsfrom HMM (Forward algorithm).Compute best path from HMM (Viterbi algorithm).Learn parameters (output and transition probabilities) ofHMM from data (Baum-Welch a.k.a. Forward-Backwardalgorithm).
110 / 113
![Page 111: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/111.jpg)
Part IV
Epilogue
111 / 113
![Page 112: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/112.jpg)
Sample Project List
112 / 113
![Page 113: Lecture 3 - Gaussian Mixture Models and Introduction to HMM'sstanchen/fall12/e6870/slides/lecture3.pdf · Gaussian Mixture Models and Introduction to HMM’s Michael Picheny, Bhuvana](https://reader036.vdocument.in/reader036/viewer/2022062603/5f072eb67e708231d41bb77e/html5/thumbnails/113.jpg)
Course Feedback
1 Was this lecture mostly clear or unclear? What was themuddiest topic?
2 Other feedback (pace, content, atmosphere)?3 What is the chance you will do a non-reading project? If
nonzero, what type of project appeals most to you rightnow? (Doesn’t have to be on list.)
113 / 113