Speech and Language Processing
Chapter 9 of SLP: Automatic Speech Recognition (II)
Outline for ASR
ASR Architecture: The Noisy Channel Model
Five easy pieces of an ASR system:
1) Language Model
2) Lexicon/Pronunciation Model (HMM)
3) Feature Extraction
4) Acoustic Model
5) Decoder
Training
Evaluation
Acoustic Modeling (= Phone detection)
Given a 39-dimensional vector corresponding to the observation of one frame, o_i,
and given a phone q we want to detect,
compute p(o_i|q).
Most popular method: GMMs (Gaussian mixture models)
Other methods: neural nets, CRFs, SVMs, etc.
Problem: how to apply the HMM model to continuous observations?
We have assumed that the output alphabet V has a finite number of symbols
But spectral feature vectors are real-valued!
How to deal with real-valued features?
- Decoding: Given o_t, how do we compute P(o_t|q)?
- Learning: How do we modify EM to deal with real-valued features?
Vector Quantization
Create a training set of feature vectors
Cluster them into a small number of classes
Represent each class by a discrete symbol
For each class vk, we can compute the probability that it is generated by a given HMM state using Baum-Welch as above
VQ
We'll define a codebook, which lists for each symbol a prototype vector, or codeword.
If we had 256 classes ('8-bit VQ'), we'd have a codebook with 256 prototype vectors.
Given an incoming feature vector, we:
- compare it to each of the 256 prototype vectors,
- pick whichever one is closest (by some 'distance metric'),
- and replace the input vector by the index of this prototype vector.
VQ
VQ requirements
A distance metric or distortion metric: specifies how similar two vectors are. It is used
- to build clusters,
- to find the prototype vector for a cluster,
- and to compare an incoming vector to the prototypes.
A clustering algorithm: K-means, etc.
Distance metrics
Simplest: the (square of the) Euclidean distance, also called 'sum-squared error':

$$d^2(x,y) = \sum_{i=1}^{D} (x_i - y_i)^2$$
Distance metrics
More sophisticated: the (square of the) Mahalanobis distance. Assume that each dimension i of the feature vector has variance σ_i²:

$$d^2(x,y) = \sum_{i=1}^{D} \frac{(x_i - y_i)^2}{\sigma_i^2}$$

The equation above assumes a diagonal covariance matrix; more on this later.
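Both distances are one-liners; here is a minimal NumPy sketch (the function names and the per-dimension variance array are illustrative assumptions, not from the slides):

```python
import numpy as np

def sq_euclidean(x, y):
    """Squared Euclidean distance: sum_d (x_d - y_d)^2."""
    d = x - y
    return float(np.dot(d, d))

def sq_mahalanobis_diag(x, y, var):
    """Squared Mahalanobis distance with a diagonal covariance:
    sum_d (x_d - y_d)^2 / sigma_d^2."""
    d = x - y
    return float(np.sum(d * d / var))
```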
Training a VQ system (generating codebook): K-means clustering
1. Initialization: choose M vectors from the L training vectors (typically M = 2^B) as the initial code words, at random or by maximum distance.
2. Search: for each training vector, find the closest code word; assign the training vector to that cell.
3. Centroid update: for each cell, compute the centroid of that cell. The new code word is the centroid.
4. Repeat (2)-(3) until the average distance falls below a threshold (or no longer changes). A small sketch of this loop follows below.
Slide from John-Paul Hosum, OHSU/OGI
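A minimal NumPy sketch of this codebook-training loop, under the assumption of random initialization and a fixed number of iterations instead of a distance threshold (names are illustrative):

```python
import numpy as np

def train_codebook(train_vecs, M=256, n_iter=20, seed=0):
    """K-means VQ training: returns an (M, D) array of codewords."""
    train_vecs = np.asarray(train_vecs, dtype=float)
    rng = np.random.default_rng(seed)
    # 1. Initialization: choose M training vectors as the initial codewords
    codebook = train_vecs[rng.choice(len(train_vecs), size=M, replace=False)].copy()
    for _ in range(n_iter):
        # 2. Search: assign each training vector to its closest codeword
        d2 = ((train_vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        # 3. Centroid update: each codeword becomes the centroid of its cell
        for k in range(M):
            members = train_vecs[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(vec, codebook):
    """Replace an input vector by the index of the closest codeword (VQ)."""
    return int(((codebook - vec) ** 2).sum(axis=1).argmin())
```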
Vector Quantization
• Example
Given data points, split into 4 codebook vectors with initial values at (2,2), (4,6), (6,5), and (8,8).
[Figure: training points and the four initial codebook vectors]
Slide thanks to John-Paul Hosum, OHSU/OGI
Vector Quantization
• Example
Compute the centroid of each codebook cell, re-compute nearest neighbors, re-compute centroids...
[Figure: points reassigned to their nearest codebook vector, and the updated centroids]
Slide from John-Paul Hosum, OHSU/OGI
Vector Quantization
• Example
Once there's no more change, the feature space will be partitioned into 4 regions. Any input feature can be classified as belonging to one of the 4 regions. The entire codebook can be specified by the 4 centroid points.
[Figure: the final partition of the feature space into 4 regions around the centroids]
Slide from John-Paul Hosum, OHSU/OGI
Summary: VQ
To compute p(o_t|q_j):
- Compute the distance between the feature vector o_t and each codeword (prototype vector) in a pre-clustered codebook, where the distance is either Euclidean or Mahalanobis.
- Choose the prototype vector that is closest to o_t and take its codeword v_k.
- Then look up the likelihood of v_k given HMM state j in the B matrix:
  b_j(o_t) = b_j(v_k) s.t. v_k is the codeword of the prototype vector closest to o_t,
  where b_j(v_k) was estimated using Baum-Welch as above.
Computing bj(vk)
[Figure: feature value 1 vs. feature value 2 for the vectors assigned to state j, grouped by codebook index]
b_j(v_k) = (number of vectors with codebook index k in state j) / (number of vectors in state j) = 14/56 = 1/4
Slide from John-Paul Hosum, OHSU/OGI
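For intuition, this relative-frequency estimate can be sketched as a simple counting function (the array names and shapes are assumptions; in real training the state assignments come from Baum-Welch or forced alignment):

```python
import numpy as np

def estimate_b(frame_states, frame_codes, n_states, M):
    """b[j, k] = (# frames in state j with codebook index k) / (# frames in state j)."""
    counts = np.zeros((n_states, M))
    for j, k in zip(frame_states, frame_codes):
        counts[j, k] += 1
    totals = np.maximum(counts.sum(axis=1, keepdims=True), 1)
    return counts / totals

# e.g. if 14 of the 56 frames assigned to state j were quantized to codeword k,
# then b[j, k] = 14 / 56 = 0.25
```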
Summary: VQ
Training: do VQ and then use Baum-Welch to assign probabilities to each symbol.
Decoding: do VQ and then use the symbol probabilities in decoding.
Directly Modeling Continuous Observations
Gaussians:
- Univariate Gaussians
  - Baum-Welch for univariate Gaussians
- Multivariate Gaussians
  - Baum-Welch for multivariate Gaussians
- Gaussian Mixture Models (GMMs)
  - Baum-Welch for GMMs
Better than VQ
VQ is insufficient for real ASR. Instead:
- Assume the possible values of the observation feature vector o_t are normally distributed.
- Represent the observation likelihood function b_j(o_t) as a Gaussian with mean μ_j and variance σ_j²:

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Gaussians are parameterized by mean and variance
Reminder: means and variances
For a discrete random variable X:
- The mean is the expected value of X: a weighted sum over the values of X.
- The variance is the average squared deviation from the mean.
Gaussian as Probability Density Function
Gaussian PDFs
A Gaussian is a probability density function; probability is the area under the curve.
To make it a probability, we constrain the area under the curve to equal 1.
BUT we will be using "point estimates": the value of the Gaussian at a point.
Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx.
As we will see later, this is OK, since the same factor is omitted from all the Gaussians, so the argmax is still correct.
Gaussians for Acoustic Modeling
A Gaussian is parameterized by a mean and a variance; different means give different likelihood functions P(o|q).
[Figure: a Gaussian P(o|q) plotted over the observation o; P(o|q) is highest at the mean and very low far from the mean]
Using a (univariate) Gaussian as an acoustic likelihood estimator:
Suppose our observation were a single real-valued feature (instead of a 39-dimensional vector).
Then, if we had learned a Gaussian over the distribution of values of this feature, we could compute the likelihood of any given observation o_t by plugging o_t, μ_j, and σ_j² into the Gaussian pdf above.
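As a sketch, evaluating that univariate likelihood for one observation looks like this in Python (the parameter values are made up for illustration):

```python
import numpy as np

def gaussian_likelihood(o_t, mu, var):
    """f(o_t | mu, sigma^2) = 1/(sigma*sqrt(2*pi)) * exp(-(o_t - mu)^2 / (2*sigma^2))."""
    return np.exp(-(o_t - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Hypothetical state whose feature values cluster around 3.1 with variance 0.25:
print(gaussian_likelihood(2.9, mu=3.1, var=0.25))
```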
Training a Univariate Gaussian
A (single) Gaussian is characterized by a mean and a variance
Imagine that we had some training data in which each state was labeled
We could just compute the mean and variance from the data:
$$\hat{\mu}_i = \frac{1}{T}\sum_{t=1}^{T} o_t \quad \text{s.t. } o_t \text{ is labeled state } i$$

$$\hat{\sigma}_i^2 = \frac{1}{T}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } o_t \text{ is labeled state } i$$
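With fully labeled frames, those estimates are just per-state averages; a minimal sketch (the array names are assumptions):

```python
import numpy as np

def train_univariate_gaussians(features, labels, n_states):
    """features: (T,) observations; labels: (T,) state ids for each frame.
    Returns the per-state mean and variance, averaging over the frames of each state."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    mu = np.zeros(n_states)
    var = np.zeros(n_states)
    for i in range(n_states):
        o_i = features[labels == i]          # frames labeled with state i
        mu[i] = o_i.mean()
        var[i] = ((o_i - mu[i]) ** 2).mean()
    return mu, var
```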
Training Univariate Gaussians
But we don’t know which observation was produced by which state!
What we want: to assign each observation vector o_t to every possible state i, prorated by the probability that the HMM was in state i at time t.
The probability of being in state i at time t is ξ_t(i)!

$$\hat{\mu}_i = \frac{\sum_{t=1}^{T}\xi_t(i)\, o_t}{\sum_{t=1}^{T}\xi_t(i)} \qquad \hat{\sigma}_i^2 = \frac{\sum_{t=1}^{T}\xi_t(i)\,(o_t-\mu_i)^2}{\sum_{t=1}^{T}\xi_t(i)}$$
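A sketch of the same update with soft counts, where the ξ_t(i) weights are assumed to come from forward-backward:

```python
import numpy as np

def reestimate_univariate(features, xi):
    """features: (T,) observations; xi: (T, N), xi[t, i] = prob. of being in state i at time t.
    Returns the xi-weighted mean and variance for each state."""
    w = xi.sum(axis=0)                              # denominator: sum_t xi_t(i)
    mu = (xi * features[:, None]).sum(axis=0) / w
    var = (xi * (features[:, None] - mu) ** 2).sum(axis=0) / w
    return mu, var
```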
Multivariate Gaussians
Instead of a single mean μ and variance σ²:

$$f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

we use a vector of means μ and a covariance matrix Σ:

$$f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
Multivariate Gaussians
Defining μ and Σ:

$$\mu = E(x) \qquad \Sigma = E\left[(x-\mu)(x-\mu)^T\right]$$

So the (i,j)-th element of Σ is:

$$\sigma_{ij}^2 = E\left[(x_i-\mu_i)(x_j-\mu_j)\right]$$
Gaussian Intuitions: Size of Σ
μ = [0 0] in all three cases; Σ = I, Σ = 0.6I, Σ = 2I.
As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed.
Text and figures from Andrew Ng's lecture notes for CS229
From Chen, Picheny et al lecture slides
$$\Sigma = \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix} \qquad \Sigma = \begin{bmatrix}0.6 & 0\\ 0 & 2\end{bmatrix}$$

Different variances in different dimensions.
Gaussian Intuitions: Off-diagonal
As we increase the off-diagonal entries, more correlation between value of x and value of y
Text and figures from Andrew Ng's lecture notes for CS229
Gaussian Intuitions: off-diagonal
As we increase the off-diagonal entries, more correlation between value of x and value of y
Text and figures from Andrew Ng's lecture notes for CS229
Gaussian Intuitions: off-diagonal and diagonal
- Decreasing the non-diagonal entries (#1-2)
- Increasing the variance of one dimension on the diagonal (#3)
Text and figures from Andrew Ng's lecture notes for CS229
In two dimensions
From Chen, Picheny et al lecture slides
But: assume diagonal covariance
I.e., assume that the features in the feature vector are uncorrelated.
This isn't true for FFT features, but is true for MFCC features, as we will see.
Computation and storage are much cheaper with a diagonal covariance: only the diagonal entries are non-zero, and the diagonal contains the variance of each dimension, σ_ii².
So this means we consider the variance of each acoustic feature (dimension) separately.
Diagonal covariance
The diagonal contains the variance of each dimension, σ_ii².
So this means we consider the variance of each acoustic feature (dimension) separately:

$$f(x \mid \mu, \sigma) = \prod_{d=1}^{D} \frac{1}{\sigma_d\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x_d-\mu_d}{\sigma_d}\right)^2\right)$$

$$f(x \mid \mu, \sigma) = \frac{1}{(2\pi)^{D/2}\sqrt{\prod_{d=1}^{D}\sigma_d^2}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(x_d-\mu_d)^2}{\sigma_d^2}\right)$$
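As a sketch, the diagonal-covariance density is just a product of D univariate Gaussians (in practice this is evaluated in the log domain, as later slides note; the names are illustrative):

```python
import numpy as np

def diag_gaussian_pdf(x, mu, var):
    """f(x | mu, sigma) with diagonal covariance:
    prod_d 1/sqrt(2*pi*var_d) * exp(-(x_d - mu_d)^2 / (2*var_d))."""
    per_dim = np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return float(np.prod(per_dim))
```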
Baum-Welch reestimation equations for multivariate Gaussians
Natural extension of the univariate case, where now μ_i is the mean vector for state i:

$$\hat{\mu}_i = \frac{\sum_{t=1}^{T}\xi_t(i)\, o_t}{\sum_{t=1}^{T}\xi_t(i)}$$

$$\hat{\Sigma}_i = \frac{\sum_{t=1}^{T}\xi_t(i)\,(o_t-\mu_i)(o_t-\mu_i)^T}{\sum_{t=1}^{T}\xi_t(i)}$$
But we’re not there yet
A single Gaussian may do a bad job of modeling the distribution in any dimension.
Solution: mixtures of Gaussians.
Figure from Chen, Picheny et al lecture slides
Mixtures of Gaussians
M mixtures of Gaussians:

$$f(x \mid \mu_{jk}, \Sigma_{jk}) = \sum_{k=1}^{M} c_{jk}\, N(x; \mu_{jk}, \Sigma_{jk})$$

$$b_j(o_t) = \sum_{k=1}^{M} c_{jk}\, N(o_t; \mu_{jk}, \Sigma_{jk})$$

For diagonal covariance:

$$b_j(o_t) = \sum_{k=1}^{M} \frac{c_{jk}}{(2\pi)^{D/2}\sqrt{\prod_{d=1}^{D}\sigma_{jkd}^2}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jkd})^2}{\sigma_{jkd}^2}\right)$$
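A sketch of that diagonal-covariance mixture likelihood for a single state (the parameter shapes are assumptions):

```python
import numpy as np

def gmm_likelihood(o_t, c, mu, var):
    """b_j(o_t) = sum_k c_k * N(o_t; mu_k, diag(var_k)) for one state j.
    c: (M,) mixture weights; mu, var: (M, D) per-mixture means and variances."""
    log_comp = -0.5 * (np.log(2.0 * np.pi * var) + (o_t - mu) ** 2 / var).sum(axis=1)
    return float(np.sum(c * np.exp(log_comp)))
```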
GMMs
Summary: each state has a likelihood function parameterized by:
- M mixture weights
- M mean vectors of dimensionality D
- Either M covariance matrices of size D×D, or (more likely) M diagonal covariance matrices of size D×D, which is equivalent to M variance vectors of dimensionality D
Where we are
Given: a wave file. Goal: output a string of words.
What we know: the acoustic model
- How to turn the wavefile into a sequence of acoustic feature vectors, one every 10 ms
- If we had a complete phonetic labeling of the training set, how to train a Gaussian "phone detector" for each phone
- How to represent each word as a sequence of phones
What we knew from Chapter 4: the language model
Next:
- Seeing all this back in the context of HMMs
- Search: how to combine the language model and the acoustic model to produce a sequence of words
Decoding
In principle:

$$\hat{W} = \operatorname*{argmax}_{W}\; P(O \mid W)\, P(W)$$

In practice, the acoustic and language model probabilities are on very different scales, so we add a language model scaling factor (LMSF) and a word insertion penalty (WIP), with N the number of words in the hypothesis:

$$\hat{W} = \operatorname*{argmax}_{W}\; P(O \mid W)\, P(W)^{LMSF}\, WIP^{N}$$
Why is ASR decoding hard?
HMMs for speech
HMM for digit recognition task
The Evaluation (forward) problem for speech
The observation sequence O is a series of MFCC vectors
The hidden states W are the phones and words
For a given phone/word string W, our job is to evaluate P(O|W)
Intuition: how likely is the input to have been generated by just that word string W
Evaluation for speech: Summing over all different paths!
f ay ay ay ay v v v v
f f ay ay ay ay v v v
f f f f ay ay ay ay v
f f ay ay ay ay ay ay v
f f ay ay ay ay ay ay ay ay v
f f ay v v v v v v v
The forward lattice for “five”
The forward trellis for “five”
Viterbi trellis for “five”
Viterbi trellis for “five”
Search space with bigrams
Viterbi trellis
Viterbi backtrace
Evaluation
How to evaluate the word string output by a speech recognizer?
Word Error Rate
Word Error Rate = 100 × (Insertions + Substitutions + Deletions) / (Total Words in Correct Transcript)
Alignment example:
REF: portable ****  PHONE  UPSTAIRS  last night so
HYP: portable FORM  OF     STORES    last night so
Eval:         I     S      S
WER = 100 × (1 + 2 + 0) / 6 = 50%
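WER is word-level minimum edit distance divided by the reference length; a minimal dynamic-programming sketch with uniform costs for insertions, deletions, and substitutions:

```python
def wer(ref, hyp):
    """Word error rate: 100 * (S + D + I) / len(ref), via word-level edit distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(r)][len(h)] / len(r)

print(wer("portable phone upstairs last night so",
          "portable form of stores last night so"))  # 50.0
```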
NIST sctk-1.3 scoring software: computing WER with sclite
http://www.nist.gov/speech/tools/
Sclite aligns a hypothesized text (HYP, from the recognizer) with a correct or reference text (REF, human transcribed):
id: (2347-b-013)
Scores: (#C #S #D #I) 9 3 1 2
REF: was an engineer SO I i was always with **** **** MEN UM and they
HYP: was an engineer ** AND i was always with THEM THEY ALL THAT and they
Eval: D S I I S S
Sclite output for error analysis
CONFUSION PAIRS   Total (972)   With >= 1 occurances (972)
1:  6 -> (%hesitation) ==> on
2:  6 -> the ==> that
3:  5 -> but ==> that
4:  4 -> a ==> the
5:  4 -> four ==> for
6:  4 -> in ==> and
7:  4 -> there ==> that
8:  3 -> (%hesitation) ==> and
9:  3 -> (%hesitation) ==> the
10:  3 -> (a-) ==> i
11:  3 -> and ==> i
12:  3 -> and ==> in
13:  3 -> are ==> there
14:  3 -> as ==> is
15:  3 -> have ==> that
16:  3 -> is ==> this
Sclite output for error analysis
17:  3 -> it ==> that
18:  3 -> mouse ==> most
19:  3 -> was ==> is
20:  3 -> was ==> this
21:  3 -> you ==> we
22:  2 -> (%hesitation) ==> it
23:  2 -> (%hesitation) ==> that
24:  2 -> (%hesitation) ==> to
25:  2 -> (%hesitation) ==> yeah
26:  2 -> a ==> all
27:  2 -> a ==> know
28:  2 -> a ==> you
29:  2 -> along ==> well
30:  2 -> and ==> it
31:  2 -> and ==> we
32:  2 -> and ==> you
33:  2 -> are ==> i
34:  2 -> are ==> were
Better metrics than WER?
WER has been useful.
But should we be more concerned with meaning ("semantic error rate")?
- Good idea, but hard to agree on.
- Has been applied in dialogue systems, where the desired semantic output is more clear.
Training
Reminder: Forward-Backward Algorithm
1) Initialize λ = (A, B)
2) Compute α, β, ξ
3) Estimate new λ' = (A, B)
4) Replace λ with λ'
5) If not converged, go to 2
The Learning Problem: It’s not just Baum-Welch
The network structure of the HMM is always created by hand: no algorithm for double induction of optimal structure and probabilities has been able to beat simple hand-built structures.
It is always a Bakis network: links go forward in time.
A subcase of the Bakis net is the beads-on-a-string net.
Baum-Welch is only guaranteed to return a local maximum, rather than the global optimum.
Complete Embedded Training
Setting all the parameters in an ASR system.
Given:
- a training set: wavefiles & word transcripts for each sentence
- a hand-built HMM lexicon
Uses: the Baum-Welch algorithm
Baum-Welch for Mixture Models
By analogy with earlier, let's define ξ_tm(j), the probability of being in state j at time t with the mth mixture component accounting for o_t:

$$\xi_{tm}(j) = \frac{\sum_{i=1}^{N}\alpha_{t-1}(i)\, a_{ij}\, c_{jm}\, b_{jm}(o_t)\, \beta_t(j)}{\alpha_F(T)}$$

Now,

$$\hat{\mu}_{jm} = \frac{\sum_{t=1}^{T}\xi_{tm}(j)\, o_t}{\sum_{t=1}^{T}\sum_{k=1}^{M}\xi_{tk}(j)} \qquad \hat{c}_{jm} = \frac{\sum_{t=1}^{T}\xi_{tm}(j)}{\sum_{t=1}^{T}\sum_{k=1}^{M}\xi_{tk}(j)}$$

$$\hat{\Sigma}_{jm} = \frac{\sum_{t=1}^{T}\xi_{tm}(j)\,(o_t-\mu_{jm})(o_t-\mu_{jm})^T}{\sum_{t=1}^{T}\sum_{k=1}^{M}\xi_{tk}(j)}$$
How to train mixtures?
• Choose M (often 16; M can also be tuned to the amount of training observations)
• Then run various splitting or clustering algorithms
• One simple method for "splitting" (a small sketch follows below):
  1) Compute the global mean μ and global variance
  2) Split into two Gaussians, with means μ + ε and μ − ε (sometimes ε is 0.2)
  3) Run Forward-Backward to retrain
  4) Go to 2 until we have 16 mixtures
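A sketch of the splitting step only (steps 1, 2 and 4 of the recipe); the Forward-Backward retraining of step 3 is assumed to be available elsewhere, and the eps value is illustrative:

```python
import numpy as np

def split_mixtures(means, variances, weights, eps=0.2):
    """Double the number of Gaussians: perturb each mean by +/- eps,
    copy the variances, and halve the mixture weights."""
    means = np.concatenate([means + eps, means - eps], axis=0)
    variances = np.concatenate([variances, variances], axis=0)
    weights = np.concatenate([weights, weights], axis=0) / 2.0
    return means, variances, weights

# Hypothetical outer loop, growing from 1 to 16 mixtures:
# while len(weights) < 16:
#     means, variances, weights = split_mixtures(means, variances, weights)
#     run_forward_backward(...)   # retraining step, assumed to exist elsewhere
```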
Embedded Training
Components of a speech recognizer:
- Feature extraction: not statistical
- Language model: word transition probabilities, trained on some other corpus
- Acoustic model:
  - Pronunciation lexicon: the HMM structure for each word, built by hand
  - Observation likelihoods b_j(o_t)
  - Transition probabilities a_ij
Embedded training of acoustic model
If we had hand-segmented and hand-labeled training data, with word and phone boundaries, we could just compute:
- B: the means and variances of all our triphone Gaussians
- A: the transition probabilities
And we'd be done!
But we don't have word and phone boundaries, nor phone labeling.
Embedded training
Instead:
- We'll train each phone HMM embedded in an entire sentence.
- We'll do word/phone segmentation and alignment automatically as part of the training process.
Embedded Training
Initialization: “Flat start”
Transition probabilities:
- Set to zero any transition that you want to be "structurally zero". The probability computation includes the previous value of a_ij, so if it starts at zero it will never change.
- Set the rest to identical values.
Likelihoods:
- Initialize the μ and σ² of each state to the global mean and variance of all the training data.
Embedded Training
Now we have estimates for A and B.
So we just run the EM algorithm: during each iteration, we compute forward and backward probabilities, use them to re-estimate A and B, and run EM until convergence.
Viterbi training
Baum-Welch training says:
- We need to know what state we were in, to accumulate counts of a given output symbol o_t.
- We compute ξ_t(i), the probability of being in state i at time t, by using forward-backward to sum over all possible paths that might have been in state i and output o_t.
Viterbi training says:
- Instead of summing over all possible paths, just take the single most likely path.
- Use the Viterbi algorithm to compute this "Viterbi" path, via "forced alignment".
Forced Alignment
Computing the "Viterbi path" over the training data is called "forced alignment", because we know which word string to assign to each observation sequence; we just don't know the state sequence.
So we use the a_ij to constrain the path to go through the correct words, and otherwise do normal Viterbi.
Result: a state sequence!
Viterbi training equations
Viterbi analogues of the Baum-Welch re-estimates:

$$\hat{b}_j(v_k) = \frac{n_j(\text{s.t. } o_t = v_k)}{n_j} \qquad \hat{a}_{ij} = \frac{n_{ij}}{n_i}$$

for all pairs of emitting states, 1 <= i, j <= N,
where n_ij is the number of frames with a transition from i to j in the best path, and n_j is the number of frames where state j is occupied.
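A sketch of these counts for a discrete (VQ) output alphabet, with the state sequence assumed to come from forced alignment (array names are illustrative; the transition denominator here counts transitions out of state i):

```python
import numpy as np

def viterbi_reestimate(state_seq, symbol_seq, n_states, n_symbols):
    """Count-based re-estimates along the single best (forced-alignment) path:
    a_hat[i, j] ~ n_ij / n_i and b_hat[j, k] = n_j(o_t = v_k) / n_j."""
    a_counts = np.zeros((n_states, n_states))
    b_counts = np.zeros((n_states, n_symbols))
    for t in range(len(state_seq)):
        b_counts[state_seq[t], symbol_seq[t]] += 1
        if t + 1 < len(state_seq):
            a_counts[state_seq[t], state_seq[t + 1]] += 1
    a_hat = a_counts / np.maximum(a_counts.sum(axis=1, keepdims=True), 1)
    b_hat = b_counts / np.maximum(b_counts.sum(axis=1, keepdims=True), 1)
    return a_hat, b_hat
```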
Viterbi Training
Much faster than Baum-Welch, but doesn't work quite as well; the tradeoff is often worth it.
Viterbi training (II)
Equations for non-mixture Gaussians:

$$\hat{\mu}_i = \frac{1}{N_i}\sum_{t=1}^{T} o_t \quad \text{s.t. } q_t = i$$

$$\hat{\sigma}_i^2 = \frac{1}{N_i}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } q_t = i$$

Viterbi training for mixture Gaussians is more complex; generally we just assign each observation to one mixture component.
Log domain
In practice, do all computation in the log domain: this avoids underflow. Instead of multiplying lots of very small probabilities, we add numbers that are not so small.
For a single multivariate Gaussian with diagonal Σ, we compute:

$$b_j(o_t) = \prod_{d=1}^{D}\frac{1}{\sqrt{2\pi\sigma_{jd}^2}}\exp\left(-\frac{(o_{td}-\mu_{jd})^2}{2\sigma_{jd}^2}\right)$$

In log space:

$$\log b_j(o_t) = \sum_{d=1}^{D}\left[-\frac{1}{2}\log(2\pi\sigma_{jd}^2) - \frac{(o_{td}-\mu_{jd})^2}{2\sigma_{jd}^2}\right]$$
Log domain
Repeating:

$$\log b_j(o_t) = \sum_{d=1}^{D}\left[-\frac{1}{2}\log(2\pi\sigma_{jd}^2) - \frac{(o_{td}-\mu_{jd})^2}{2\sigma_{jd}^2}\right]$$

With some rearrangement of terms:

$$\log b_j(o_t) = C - \frac{1}{2}\sum_{d=1}^{D}\frac{(o_{td}-\mu_{jd})^2}{\sigma_{jd}^2}$$

where the constant

$$C = -\frac{1}{2}\sum_{d=1}^{D}\log(2\pi\sigma_{jd}^2)$$

can be precomputed.
Note that this looks like a weighted Mahalanobis distance! This also helps justify why these aren't really probabilities (they are point estimates); they are really just distances.
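A sketch of the precomputed-constant version of this log score (class and method names are illustrative assumptions):

```python
import numpy as np

class DiagGaussianLogScorer:
    """Cache C = -0.5 * sum_d log(2*pi*var_d); each frame then costs only
    C - 0.5 * sum_d (o_d - mu_d)^2 / var_d, a weighted distance to the mean."""
    def __init__(self, mu, var):
        self.mu = mu
        self.var = var
        self.C = -0.5 * np.sum(np.log(2.0 * np.pi * var))

    def log_b(self, o_t):
        return float(self.C - 0.5 * np.sum((o_t - self.mu) ** 2 / self.var))
```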
Summary: Acoustic Modeling for LVCSR.
Increasingly sophisticated models. For each state:
- Gaussians
- Multivariate Gaussians
- Mixtures of multivariate Gaussians
Where a state is progressively:
- a CI phone
- a CI subphone (3-ish per phone)
- a CD phone (= triphones)
- state-tying of CD phones
Training: Forward-Backward (Baum-Welch) training, or Viterbi training
Summary: ASR Architecture
Five easy pieces: the ASR noisy channel architecture
1) Feature extraction: 39 "MFCC" features
2) Acoustic model: Gaussians for computing p(o|q)
3) Lexicon/pronunciation model: an HMM specifying what phones can follow each other
4) Language model: N-grams for computing p(w_i|w_{i-1})
5) Decoder: the Viterbi algorithm, dynamic programming for combining all of these to get a word sequence from speech!
Summary
ASR architecture: the noisy channel model
Five easy pieces of an ASR system:
1) Language Model
2) Lexicon/Pronunciation Model (HMM)
3) Feature Extraction
4) Acoustic Model
5) Decoder
Training
Evaluation