Speech Processing Overview & Supporting Concepts
Professor Marie Roch
Readings are listed in the schedule.
What is speech processing?
• Specialized branch of machine learning
• Interdisciplinary
  – signal processing
  – statistics
  – linear algebra
  – linguistics
  – optimization
  – perception & production

[Figure: surface plot of a joint pdf P(f1, f2) over features f1 and f2]
Basic ideas
• Acquire an audio signal
• Extract features
• Classify features to labels depending on goals, e.g.
  – speech recognition: speech to text
  – speaker recognition: identify a person
  – command and control: recognize directives or keywords

Rosie the robot, The Jetsons © Hanna-Barbera
Probability
• A measure of uncertainty.
• Sources of uncertainty:
  – Process is stochastic (nondeterministic).
  – We cannot observe all aspects of the process.
  – Incomplete modeling of the process.
Probability
Two ways to think about it:
• frequentist – related to the proportion of events occurring
• Bayesian – related to degree of belief

Both types of probability are treated using the same rules.
Random variables
• Variables that can be bound to values non-deterministically.
• Common to use notation to distinguish between*
  – a random variable X (capitalized)
  – and an observed value x (lower case, e.g. x=5)

* Goodfellow et al. use bold in place of capitalization.
Discrete random variables
• Values from a discrete set of labels:

$$X \in \{\text{red, orange, yellow, blue, indigo, violet}\}, \quad Y \in \{0, 1, 2\}$$

• Probability density (mass) function (pdf/pmf) is the probability of something occurring, P(X=x)*:

$$P(X=\text{red}) = 0.8, \quad P(Y=2) = \frac{1}{3}$$

• $$\forall x \in \mathrm{domain}(X),\; 0 \le P(X=x) \le 1 \quad \text{and} \quad \sum_{x \in \mathrm{domain}(X)} P(X=x) = 1$$

* Goodfellow et al. use PMF for discrete distributions. P(X=x) is commonly abbreviated to P(x).
Distribution
• Set of probabilities associated with the values of a random variable.
• Uniform distribution – special distribution where all probabilities are equal.
Example
• Let X be a die roll
  – P(X=3) denotes the probability of rolling a 3
  – Fair die: P(X=3) = 1/6
• If we don’t know if the die is fair, how would we approximate P(X=x)?
• Estimates of probability are denoted with a hat: $\hat{P}(X=x)$
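The counting approach hinted at above can be sketched quickly. A minimal example, assuming Python with NumPy; the simulated fair die and the number of rolls are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=10_000)  # simulate 10,000 rolls of a fair die

# Estimate P-hat(X=x) by counting occurrences of each face and
# dividing by the total number of rolls
p_hat = {x: np.mean(rolls == x) for x in range(1, 7)}
```

The estimates sum to 1 by construction and approach 1/6 per face as the number of rolls grows.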
Continuous random variables
• Values from a continuous domain
• Satisfies the following:

$$\forall x \in \mathrm{domain}(X),\; 0 \le P(X=x)$$

$$P(a \le X \le b) = \int_a^b P(X=x)\,dx$$

$$\int_x P(X=x)\,dx = 1$$

• Worth noting:
  – no max on P(X=x)
  – What is $P(a \le X \le a) = \int_a^a P(X=x)\,dx$?
Continuous uniform distribution
Notation: ~ means “has a distribution of.” U(a,b) is a uniform distribution between values a and b.

$$X \sim U(a,b) \text{ implies } P(X=x) = \begin{cases} 0 & x < a \text{ or } x > b \\ \dfrac{1}{b-a} & a \le x \le b \end{cases}$$
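The piecewise definition above translates directly to code. A small NumPy sketch (the interval U(2,6) is an arbitrary example, not from the slides):

```python
import numpy as np

def uniform_pdf(x, a, b):
    """P(X=x) for X ~ U(a, b): 1/(b-a) inside [a, b], 0 outside."""
    x = np.asarray(x, dtype=float)
    return np.where((x >= a) & (x <= b), 1.0 / (b - a), 0.0)

# X ~ U(2, 6): density is 1/(6-2) = 0.25 inside the interval, 0 outside
vals = uniform_pdf([1.0, 3.0, 6.0, 7.0], a=2, b=6)
```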
Expectation
Expected value of a function u of a random variable:

discrete: $$E[u(X)] = \sum_x u(x)\,P(x)$$

continuous: $$E[u(X)] = \int_x u(x)\,P(x)\,dx$$

When u(X) = X, we call this the mean, or average value: $\mu = E[X]$
Sample mean
• What if we don’t know P(X=x)?
• If we roll a die many times, we can estimate P(X=x) by counting the number of times each x occurs and dividing by the number of rolls, giving the estimate $\hat{\mu}$.
• To think about: How does this relate to our traditional idea of an average?
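The "to think about" question can be answered empirically: the expectation computed from estimated probabilities is exactly the ordinary average. A NumPy sketch (the die simulation is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=5000)

# Traditional average
mu_direct = rolls.mean()

# Same quantity via the expectation formula: sum over x of x * P-hat(X=x)
values, counts = np.unique(rolls, return_counts=True)
p_hat = counts / len(rolls)
mu_expect = np.sum(values * p_hat)
```

The two computations agree because P-hat(X=x) = count(x)/N makes the expectation collapse to the sample mean.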
Variance
• Answers the question: What is the expected squared deviation from the mean?
• Can be defined with the expectation operator:

$$\sigma^2 = E[(X-\mu)^2] = \sum_x (x-\mu)^2\,P(X=x)$$

• Sample variance (unbiased):

$$\hat{E}[(x-\mu)^2] = \frac{1}{N-1}\sum_x (x-\hat{\mu})^2$$
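The unbiased (divide by N-1) estimator above corresponds to `ddof=1` in NumPy. A quick sketch with illustrative synthetic data (true variance 4):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=2.0, size=10_000)  # true variance = 4

# Unbiased sample variance: divide by N-1
mu_hat = x.mean()
var_manual = np.sum((x - mu_hat) ** 2) / (len(x) - 1)
var_numpy = x.var(ddof=1)  # NumPy's equivalent computation
```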
Normal distribution
• Normal or Gaussian distributions are the so-called “bell curve” distributions
• Parameters: mean & variance: $X \sim N(\mu, \sigma^2)$

$$P(x\,|\,\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Normal distribution
• Mean controls the center, variance the breadth of the distribution.
• A normal distribution can be fit with the sample mean & variance.

  X ~ N(0,1), the standard normal
  X ~ N(2,1)
  X ~ N(0,0.04)
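Fitting a normal with the sample mean & variance can be sketched in a few lines of NumPy (the N(2,1) data below is a synthetic stand-in):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.0, size=20_000)  # draws from N(2, 1)

# Fit: sample mean and (unbiased) sample variance
mu_hat = data.mean()
var_hat = data.var(ddof=1)

def normal_pdf(x, mu, var):
    """Gaussian density P(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

peak = normal_pdf(mu_hat, mu_hat, var_hat)  # density is highest at the mean
```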
Application: Speech/noise segmentation
How can we determine if someone is talking?
1. Record: Acquire pressure signal
2. Frame: Break into pieces
3. Compute features: root mean square intensity
4. Train: Fit speech/noise distribution models
5. Classify: See which distribution matches new samples
Acquisition: pressure sensors
Basic ideas for microphones:
• Deformable membrane
  – pushed by compression
  – pulled by rarefaction
• Deformation is converted to voltage
• Samples: voltage measured N times/second and discretized
  – called the sample rate (Fs) and
  – measured in Hertz (Hz)
Pressure and Intensity

• Pressure
  – Amount of force per area
  – Typically measured in Pascals ($\mathrm{Pa} = \frac{N}{m^2}$) if the microphone is calibrated, otherwise in counts
• Intensity
  – Product of sound pressure and particle velocity per area
  – $\text{Intensity} \propto \text{Pressure}^2$
Framing
• Most audio signals vary over time
• Analyze small segments that are roughly static

[Figure: framed waveform of a chimpanzee pant hoot – courtesy Cat Hobaiter]
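Framing can be sketched as a strided indexing operation. A minimal NumPy version (the 16 kHz rate and 20 ms/10 ms frame parameters are illustrative choices, not from the slides):

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split signal x into fixed-length frames advanced by hop samples.
    Trailing samples that do not fill a frame are dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

# e.g. 1 s of audio at Fs = 16000 Hz, 20 ms frames with a 10 ms advance
Fs = 16000
x = np.arange(Fs, dtype=float)
frames = frame_signal(x, frame_len=int(0.020 * Fs), hop=int(0.010 * Fs))
```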
decibel (dB) scale

• Human auditory sensitivity to pressure: 20 µPa – 200 Pa (a 10⁷ range!)
• The decibel scale allows us to compare the intensity between two sounds in a compressed range:
  – Intensity: $$10\log_{10}\frac{I}{I_0}$$
  – ~ with pressure: $$10\log_{10}\frac{P^2}{P_0^2} = 20\log_{10}\frac{P}{P_0}$$

$I_0$, $P_0$ are reference intensity/pressure.

The bel is named after Alexander Graham Bell: 10 decibels = 1 bel. (image credit: biography.com)
What value of P0?

calibrated terrestrial systems: P0 = 20 µPa (threshold of hearing), denoted dB re 20 µPa

uncalibrated systems: P0 = 1 (for convenience), denoted dB rel.

Rabiner/Juang 1993
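With P0 = 20 µPa, the two endpoints of the hearing range above map to 0 dB and 140 dB. A small sketch of the conversion (assuming Python with NumPy):

```python
import numpy as np

P0 = 20e-6  # reference pressure, 20 µPa (threshold of hearing)

def db_re_20upa(p_pascals):
    """Sound pressure level in dB re 20 µPa: 20*log10(P/P0)."""
    return 20 * np.log10(p_pascals / P0)

# The threshold of hearing itself maps to 0 dB;
# the top of the range (200 Pa) maps to 140 dB
quiet = db_re_20upa(20e-6)
loud = db_re_20upa(200.0)
```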
Computing intensity
• Computed for each frame
• Root-mean-square (RMS) pressure
  – squared
  – averaged
  – square root

$$p_{RMS} = \sqrt{\frac{1}{|\text{frame}|}\sum_{i \in \text{frame}} x_i^2}$$

• Pressure converted to intensity in dB:

$$20\log_{10}(p_{RMS}) = 10\log_{10}\frac{1}{|\text{frame}|}\sum_{i \in \text{frame}} x_i^2 \;\text{dB}$$
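The RMS-in-dB feature above can be computed per frame without the explicit square root, since 20·log10 of the root equals 10·log10 of the mean square. A NumPy sketch (the test sine wave is illustrative):

```python
import numpy as np

def rms_db(frames):
    """Relative intensity (dB rel., P0 = 1) of each frame.
    frames: 2-D array with one frame per row."""
    mean_square = np.mean(frames ** 2, axis=1)
    return 10 * np.log10(mean_square)  # == 20*log10(p_RMS)

# A full-scale sine wave has p_RMS = 1/sqrt(2), i.e. about -3.01 dB rel.
t = np.arange(1024)
sine = np.sin(2 * np.pi * 8 * t / 1024)  # 8 complete cycles
level = rms_db(sine[None, :])[0]
```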
Our first speech problem
• Given a recording, separate speech from constant background noise

[Figure: waveform; amplitude (counts) vs. time (s), 0–6 s]
Relative Intensity

[Figure: frame intensity, dB rel. vs. time (s); roughly -75 to -30 dB over 0–6 s]
Intensity histogram

[Figure: histogram of frame intensities; counts vs. dB rel., -75 to -30 dB]
So how can we characterize speech and silence?
[Figure: intensity histogram; counts vs. dB rel.]
Silence estimator
• Beginning of recording vs.
• Entire recording

[Figure: intensity histograms (counts vs. dB rel.) for the beginning of the recording and for the entire recording]
Normal fit of silence
$$P(X=x\,|\,sil) = \frac{1}{\sqrt{2\pi\sigma_{sil}^2}}\, e^{-\frac{(x-\mu_{sil})^2}{2\sigma_{sil}^2}}$$

[Figure: intensity histogram overlaid with the fitted normal density P(X|sil)]
Similar for speech 1.3 – 2.8 s
[Figure: intensity histogram overlaid with the fitted normal density P(X|speech), estimated from the 1.3–2.8 s region]
Intensity distribution with pdfs
[Figure: intensity histogram overlaid with both pdfs, P(X|sil) & P(X|speech)]

Where should we draw the boundary?
Conditional probability
The probability of something occurring can change when you know something…

P(X = "nice to meet you")
vs.
P(X = "nice to meet you" | meeting new person)
Formally, P(X|Y) is defined as:

$$P(X|Y) \triangleq \frac{P(X,Y)}{P(Y)}$$
A problem…
Our intensity pdfs showed

$P(X|speech)$ and $P(X|noise)$

but we would like to know

$P(speech|X)$ and $P(noise|X)$

which is known as the posterior distribution.

[Figure: intensity histogram with overlaid pdfs]
Bayes’ decision rule
The optimal classification rule is the one that maximizes the posterior probability:

$$\arg\max_{\omega \in \{speech,\, noise\}} P(\omega|X)$$

note: arg returns the argument (ω) rather than the value

Regrettably, we do not know $P(\omega|X)$, but we do know $P(X|\omega)$.

Reverend Thomas Bayes, 1702–1761
Bayes’ rule (of conditional probability)
$$P(A|B) = \frac{P(A,B)}{P(B)} \quad \triangleq \text{ conditional probability}$$

$P(A,B)$: probability that both A and B occur.

Note that swapping names A and B gives

$$P(B|A) = \frac{P(B,A)}{P(A)} \;\rightarrow\; P(B,A) = P(B|A)\,P(A)$$

and $P(A,B) = P(B,A)$

$$\therefore\; P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$$

Also known as Bayes’ Theorem/Bayes’ Law. Bayes’ Rule ≠ Bayes’ Decision Rule.
Bayes’ decision rule
Bayes’ rule to the rescue!

$$\arg\max_\omega P(\omega|X) = \arg\max_\omega \frac{P(X|\omega)\,P(\omega)}{P(X)}$$

where $P(\omega|X)$ is the posterior probability, $P(X|\omega)$ the class-conditional probability, $P(\omega)$ the prior probability, and $P(X)$ the observation probability.
almost there…
We know P(X|ω). We don’t need to know P(X) as

$$P(\omega|X) = \frac{P(X|\omega)\,P(\omega)}{P(X)}$$

$$\max\left(\frac{P(X|speech)\,P(speech)}{P(X)},\; \frac{P(X|noise)\,P(noise)}{P(X)}\right) = \max\left(P(X|speech)\,P(speech),\; P(X|noise)\,P(noise)\right)$$
Prior probability
• Probability of an observation without any prior information.
• When we don’t know this, we frequently assume a uniform (equally likely) distribution:

$$\max\left(P(X|speech)\,\tfrac{1}{2},\; P(X|noise)\,\tfrac{1}{2}\right)$$
Bayes decision rule revisited
Assuming a uniform prior, we can now make decisions about speech or noise:

$$\text{decision}(x) = \arg\max_{\omega \in \{speech,\, noise\}} \frac{1}{2}\,\frac{1}{\sqrt{2\pi\sigma_\omega^2}}\, e^{-\frac{(x-\mu_\omega)^2}{2\sigma_\omega^2}}$$

How can we solve for the boundary threshold?
Threshold for our minimal speech detector
Solve for x (2 roots):

$$\frac{1}{\sqrt{2\pi\sigma_{noise}^2}}\, e^{-\frac{(x-\mu_{noise})^2}{2\sigma_{noise}^2}} = \frac{1}{\sqrt{2\pi\sigma_{speech}^2}}\, e^{-\frac{(x-\mu_{speech})^2}{2\sigma_{speech}^2}}$$
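Taking logs of both sides of the equal-likelihood condition turns it into a quadratic in x, whose two roots are the thresholds. A NumPy sketch; the fitted means and standard deviations below are hypothetical values for illustration only:

```python
import numpy as np

# Hypothetical fitted parameters (dB rel.), for illustration only
mu_n, sigma_n = -65.0, 3.0    # noise
mu_s, sigma_s = -45.0, 5.0    # speech

# Log both Gaussians and collect terms into a*x^2 + b*x + c = 0
a = 1 / (2 * sigma_n**2) - 1 / (2 * sigma_s**2)
b = mu_s / sigma_s**2 - mu_n / sigma_n**2
c = (mu_n**2 / (2 * sigma_n**2) - mu_s**2 / (2 * sigma_s**2)
     + np.log(sigma_n / sigma_s))
roots = np.roots([a, b, c])  # the 2 roots

def gauss(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
```

In practice only the root lying between the two means is used as the speech/noise boundary; the other root sits in a region where both likelihoods are negligible.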
Result
[Figure: frame intensity, dB rel. vs. time (s), with frames labeled speech/noise]

Caveat: Example is for pedagogical purposes; we can do better than this…
A better way
• This required us to have labels for the speech and noise
• Remember that the intensity histogram had two modes:

[Figure: bimodal intensity histogram; counts vs. dB rel.]

• Can we learn these automatically?
A better way
• Unsupervised model with unlabeled data
• Mixture models
  – Learn to fit complicated distributions with simpler parametric ones
  – We do not require labels
Multidimensional Gaussians

$$f(x_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)}$$

where d → # dimensions, T → transpose, |·| → determinant, µ → E[X], Σ → variance-covariance matrix

Maximum likelihood estimators: sample mean & variance

Note: We do not need d>1 for this problem, but let’s start thinking about higher dimensions.
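The density above can be evaluated directly with NumPy's linear algebra routines. A minimal sketch for a single point (the 2-D zero-mean, identity-covariance case is an illustrative check, since the peak then equals 1/(2π)):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density for a single d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

# With Sigma = I (d=2) and x = mu, the density is 1/(2*pi)
mu = np.array([0.0, 0.0])
Sigma = np.eye(2)
peak = mvn_pdf(mu, mu, Sigma)
```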
Gaussian Intuitions: Size of Σ
• Three zero-mean examples: µ = [0 0] with Σ = I, Σ = 0.6I, and Σ = 2I
• As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed.

Text and figures from Andrew Ng’s lecture notes for CS229 (via Jurafsky & Martin, Speech and Language Processing)
GMM: Parameters

• Parameters for each of K mixtures, k = 1...K:
  – μk: mean
  – Σk: variance-covariance matrix
  – ck: mixture weight (importance), where $\sum_{k=1}^{K} c_k = 1$
• Use Φk to denote (ck, μk, Σk) and Φ to denote the entire set of parameters
Sample 16 mixture model
Based upon R2 cepstral speech data

[Figure: surface plot of pdfs and equal-likelihood contour lines; mixtures have different weights, means, & covariance matrices]
GMM: Evaluation probability

• Probability of an observation:

$$P(\vec{x}\,|\,\Phi) = \sum_{k=1}^{K} c_k\, N_k(\vec{x}\,|\,\mu_k, \Sigma_k) = \sum_{k=1}^{K} c_k\, \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(\vec{x}-\mu_k)^T \Sigma_k^{-1} (\vec{x}-\mu_k)}$$
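For the 1-D intensity problem the sum above is just a weighted sum of scalar Gaussians. A NumPy sketch; the two components' weights, means, and standard deviations are hypothetical values for illustration:

```python
import numpy as np

def gmm_pdf(x, weights, means, sigmas):
    """1-D GMM density: weighted sum of Gaussian components."""
    comps = [
        c * np.exp(-(x - m) ** 2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
        for c, m, s in zip(weights, means, sigmas)
    ]
    return sum(comps)

# Two components with weights summing to 1 (hypothetical parameters)
w = [0.7, 0.3]
mu = [-65.0, -45.0]
sd = [3.0, 5.0]
density = gmm_pdf(np.array([-65.0, -45.0]), w, mu, sd)
```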
Learning GMM parameters (overview only)

• Given training data and a set of parameters, we would like to maximize the likelihood of the data.
• If the data are independent, select parameters to maximize:

$$P(x_1, x_2, \dots, x_T\,|\,\Phi) = \prod_{i=1}^{T} P(x_i\,|\,\Phi) = \prod_{i=1}^{T} \sum_{k=1}^{K} c_k\, \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}\, e^{-\frac{1}{2}(x_i-\mu_k)^T \Sigma_k^{-1} (x_i-\mu_k)}$$
Learning GMM parameters
• If we knew which mixture generated each observation, we could use the derivative to select model parameters that maximize the likelihood $P(x_1, x_2, \dots, x_T\,|\,\Phi)$
• We do not know this…
EM algorithm
• Suppose we have an estimate, model Φ
• Use the E[·] operator to determine how likely a mixture is to generate an observation.
• Now we can maximize the likelihood.
• This is called the expectation-maximization (EM) algorithm.
EM algorithm
Basic ideas
• Use expectation and an extant model to estimate what you do not know
• Use maximization to come up with a new model
• Repeat

Convergence guaranteed under minimal assumptions… but no guarantee on rate
Old Faithful Data
[Figure: Old Faithful data – Wikipedia, EM algorithm; slide courtesy Kaitlin Palmer]
Mixture models
• We do not have time for the details of the implementation
• For those interested, see:
  – Moon, T. K. (1996). "The expectation-maximization algorithm," IEEE Signal Proc. Mag. 13(6), 47-60.
  or
  – Huang, X., Acero, A., and Hon, H. W. (2001). Spoken Language Processing (Prentice Hall PTR, Upper Saddle River, NJ).
• For our purposes, we will use an implementation from scikit-learn.
GMM estimation
• Our training data are the RMS intensity values
• We use two mixtures
• When predicting, we look at the likelihood of each mixture and select the higher one.
• How do we know which mixture represents noise and which mixture represents speech?
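The slides point to scikit-learn for the implementation; the steps above can be sketched with its `GaussianMixture` class. The synthetic intensities and the lower-mean-is-noise heuristic below are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for frame intensities (dB rel.): a quieter noise mode
# and a louder speech mode. Real data would come from the RMS computation.
rng = np.random.default_rng(4)
intensities = np.concatenate([
    rng.normal(-65, 3, size=400),   # quieter mode (noise)
    rng.normal(-45, 5, size=300),   # louder mode (speech)
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(intensities)

# One plausible answer to the slide's question: take the mixture with the
# lower mean intensity to be noise and the higher one to be speech
noise_k = int(np.argmin(gmm.means_.ravel()))
speech_k = 1 - noise_k
labels = gmm.predict(np.array([[-66.0], [-44.0]]))
```

`predict` returns the component with the higher likelihood for each frame, which is exactly the decision rule described above.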