Speech Recognition
Lecture 1: Introduction
Mehryar Mohri, Courant Institute of Mathematical Sciences
Mehryar Mohri - Speech Recognition, Courant Institute, NYU
Logistics
Prerequisites: basics of algorithm analysis and probability. No specific knowledge of signal processing is required.
Workload: 3 homework assignments, 1 project (your choice).
Textbooks: no single textbook covers the material presented in this course. Lecture slides are available electronically.
Objectives
Computer science view of automatic speech recognition (ASR) (no signal processing).
Essential algorithms for large-vocabulary speech recognition.
But the emphasis is on general algorithms:
• automata and transducer algorithms.
• statistical learning algorithms.
Topics
introduction, formulation, components, features.
weighted transducer software library.
weighted automata algorithms.
statistical language modeling software library.
ngram models.
maximum entropy models.
pronunciation models, decision trees, context-dependent models.
Topics
search algorithms, transducer optimizations, Viterbi decoder.
search algorithms, N-best algorithms, lattice generation, rescoring.
structured prediction algorithms.
adaptation.
active learning.
semi-supervised learning.
This Lecture
Speech recognition problem
Statistical formulation
Acoustic features
Speech Recognition Problem
Definition: find accurate written transcription of spoken utterances.
• transcriptions may be in words, phonemes, syllables, or other units.
Accuracy: typically measured in terms of the edit-distance between reference transcription and sequence output by the model.
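The edit-distance accuracy measure can be made concrete with a short sketch: word error rate is the Levenshtein distance between reference and hypothesis word sequences, normalized by the reference length. This is a minimal toy implementation, not a full scoring tool; the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word sequences
    (substitutions, insertions, and deletions all cost 1)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n]

def word_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat".split()   # one substitution
print(word_error_rate(ref, hyp))       # 1/6 ≈ 0.1667
```

Production scorers additionally report the alignment (substitution/insertion/deletion counts), but the normalized edit distance above is the core of the metric.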
Other Related Problems
Speaker verification.
Speaker identification.
Spoken-dialog systems.
Detection of voice features, e.g., gender, age, dialect, emotion, height, weight!
Speech synthesis.
Speech Spectrogram
Speech Recognition Is Difficult
Highly variable: the same words pronounced by the same person in the same conditions typically lead to different waveforms.
• source variation: speaking rate, volume, accent, dialect, pitch, coarticulation.
• channel variation: microphone (type, position), noise (background, distortion).
Key problem: robustness to such variations.
ASR Characteristics
Vocabulary size: small (digit recognition, ~10 words), medium (Resource Management, ~1,000), large (Broadcast News, ~100,000), very large (1M+).
Speaker-dependent or speaker-independent.
Domain-specific or unconstrained, e.g., travel reservation vs. modern spoken-dialog systems.
Isolated (pause between units) or continuous.
Read or spontaneous, e.g., dictation and news broadcasts vs. conversational speech.
Example - Broadcast News
History
1922: Radio Rex, a toy single-word recognizer ("Rex").
1939: Voder and Vocoder (early speech synthesizer and coder), Dudley (Bell Labs).
1952: isolated digit recognition, single speaker (Bell Labs).
1950s: recognition of 10 syllables of a single speaker, Olson and Belar (RCA Labs).
1950s: speaker-independent 10-vowel recognizer (MIT).
See (Juang and Rabiner, 1995)
History
1960s: Linear Predictive Coding (LPC), Atal and Itakura.
1969: John Pierce’s negative comments about ASR (Bell Labs).
1970s: Advanced Research Projects Agency (ARPA) funds speech understanding program. CMU’s Harpy system based on automata had reasonable accuracy for 1,000 words.
History
1980s: n-gram models. ARPA Resource Management, Wall Street Journal, and ATIS tasks. Delta/delta-delta cepstra, mel cepstra.
mid-1980s: Hidden Markov models (HMMs) become the preferred technique for speech recognition.
1990s: Discriminative training, vocal tract normalization, speaker adaptation. Very large-vocabulary speech recognition, e.g., a 1M-name recognizer (Bell Labs) and a 500,000-word North American Business News (NAB) recognizer.
History
mid-1990s: FSM library; weighted transducers become a major component of almost all modern speech recognition and understanding systems. SVMs, kernel methods. Dictation systems: Dragon, IBM speaker-dependent systems.
2000s: Broadcast News, conversational speech, e.g., Switchboard, Call Home, real-time large-vocabulary systems, unconstrained spoken-dialog systems, e.g., HMIHY.
History
[Figure: Milestones in Speech Recognition and Understanding Technology over the Past 40 Years, roughly 1960s-2000s:]
• Isolated Words (Small Vocabulary, Acoustic Phonetics-based): filterbank analysis; time normalization; dynamic programming.
• Isolated Words, Connected Digits, Continuous Speech (Medium Vocabulary, Template-based): pattern recognition; LPC analysis; clustering algorithms; level building.
• Continuous Speech, Speech Understanding (Large Vocabulary; Syntax, Semantics): stochastic language understanding; finite-state machines; statistical learning.
• Connected Words, Continuous Speech (Large Vocabulary, Statistical-based): hidden Markov models; stochastic language modeling.
• Spoken Dialog, Multiple Modalities (Very Large Vocabulary; Semantics, Multimodal Dialog, TTS): concatenative synthesis; machine learning; mixed-initiative dialog.
(Juang and Rabiner, 1995)
This Lecture
Speech recognition problem
Statistical formulation
Acoustic features
This Lecture
Speech recognition problem
Statistical formulation
• Maximum likelihood and maximum a posteriori
• Statistical formulation of speech recognition
• Components of a speech recognizer
Acoustic features
Problem
Data: sample \(x_1, \ldots, x_m \in X\) drawn i.i.d. from a set \(X\) according to some distribution \(D\).
Problem: find the distribution \(p\), out of a set \(P\), that best estimates \(D\).
Maximum Likelihood
Likelihood: probability of observing the sample under distribution \(p \in P\), which, given the independence assumption, is
\[ \Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i). \]
Principle: select the distribution maximizing the sample probability,
\[ p^\star = \operatorname*{argmax}_{p \in P} \prod_{i=1}^{m} p(x_i), \quad \text{or} \quad p^\star = \operatorname*{argmax}_{p \in P} \sum_{i=1}^{m} \log p(x_i). \]
Relative Entropy Formulation
Empirical distribution: the distribution \(\hat{p}\) that assigns to each point the frequency of its occurrence in the sample.
Lemma: \(p^\star\) has maximum likelihood iff \(p^\star = \operatorname*{argmin}_{p \in P} D(\hat{p} \parallel p)\).
Proof: let \(N_0\) denote the sample size and \(l(p)\) the log-likelihood; then
\[
D(\hat{p} \parallel p) = \sum_{z_i \text{ observed}} \hat{p}(z_i) \log \hat{p}(z_i) - \sum_{z_i \text{ observed}} \hat{p}(z_i) \log p(z_i)
= -H(\hat{p}) - \sum_{z_i} \frac{\operatorname{count}(z_i)}{N_0} \log p(z_i)
= -H(\hat{p}) - \frac{1}{N_0} \log \prod_{z_i} p(z_i)^{\operatorname{count}(z_i)}
= -H(\hat{p}) - \frac{1}{N_0}\, l(p).
\]
Since \(-H(\hat{p})\) does not depend on \(p\), maximizing \(l(p)\) is equivalent to minimizing \(D(\hat{p} \parallel p)\).
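The identity in the proof can be checked numerically on a toy sample: compute both sides of \(D(\hat{p} \parallel p) = -H(\hat{p}) - l(p)/N_0\) for an arbitrary candidate model. The sample and the candidate distribution below are made up for illustration.

```python
import math
from collections import Counter

sample = list("aabac")                            # toy sample, N0 = 5
N0 = len(sample)
counts = Counter(sample)
p_hat = {z: c / N0 for z, c in counts.items()}    # empirical distribution

p = {"a": 0.5, "b": 0.3, "c": 0.2}                # some candidate model

# left-hand side: D(p_hat || p)
kl = sum(p_hat[z] * math.log(p_hat[z] / p[z]) for z in p_hat)

# right-hand side: -H(p_hat) - (1/N0) * l(p), with l(p) the log-likelihood
neg_entropy = sum(p_hat[z] * math.log(p_hat[z]) for z in p_hat)
log_lik = sum(counts[z] * math.log(p[z]) for z in counts)
rhs = neg_entropy - log_lik / N0

print(abs(kl - rhs) < 1e-9)   # the two sides agree: True
```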
Example: Bernoulli Trials
Problem: find the most likely Bernoulli distribution, given a sequence of coin flips
H, T, T, H, T, H, T, H, H, H, T, T, …, H.
Bernoulli distribution: \( p(H) = \theta, \; p(T) = 1 - \theta \).
Likelihood: \( l(p) = \log \theta^{N(H)} (1 - \theta)^{N(T)} = N(H) \log \theta + N(T) \log(1 - \theta) \).
Solution: \(l\) is differentiable and concave;
\[ \frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1 - \theta} = 0 \;\Longrightarrow\; \theta = \frac{N(H)}{N(H) + N(T)}. \]
Example: Gaussian Distribution
Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations
3.18, 2.35, .95, 1.175, …
Normal distribution: \( p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). \)
Likelihood: \( l(p) = -\frac{m}{2} \log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}. \)
Solution: \(l\) is differentiable and concave;
\[ \frac{\partial l(p)}{\partial \mu} = 0 \;\Longrightarrow\; \mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \frac{\partial l(p)}{\partial \sigma^2} = 0 \;\Longrightarrow\; \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} x_i^2 - \mu^2. \]
Properties
Problems:
• the underlying distribution may not be among those searched.
• overfitting: the number of examples is too small relative to the number of parameters.
Maximum A Posteriori (MAP)
Principle: select the most likely hypothesis \(h \in H\) given the sample \(S\), with some prior distribution \(\Pr[h]\) over the hypotheses:
\[ h^\star = \operatorname*{argmax}_{h \in H} \Pr[h \mid S] = \operatorname*{argmax}_{h \in H} \frac{\Pr[S \mid h] \Pr[h]}{\Pr[S]} = \operatorname*{argmax}_{h \in H} \Pr[S \mid h] \Pr[h]. \]
Note: for a uniform prior, MAP coincides with maximum likelihood.
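A tiny discrete example shows how the prior can change the selected hypothesis; the two coin hypotheses, the prior, and the sample are all made up:

```python
import math

# hypotheses: the coin's bias Pr[H]
hypotheses = {"fair": 0.5, "biased": 0.8}
prior = {"fair": 0.9, "biased": 0.1}       # strong prior belief in a fair coin

S = "HHHHHT"                               # observed sample

def lik(h):
    """Pr[S | h] under the independence assumption."""
    th = hypotheses[h]
    return math.prod(th if c == "H" else 1 - th for c in S)

ml = max(hypotheses, key=lik)
map_ = max(hypotheses, key=lambda h: lik(h) * prior[h])
print(ml, map_)   # ML picks 'biased'; MAP's prior overrides it: 'fair'

# with a uniform prior, MAP reduces to ML
unif = max(hypotheses, key=lambda h: lik(h) * 0.5)
print(unif == ml)   # True
```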
This Lecture
Speech recognition problem
Statistical formulation
• Maximum likelihood and maximum a posteriori
• Statistical formulation of speech recognition
• Components of a speech recognizer
Acoustic features
General Ideas
Probabilistic formulation: given a spoken utterance, find the most likely transcription.
Decomposition: mapping from spoken utterances to word sequences decomposed into intermediate units.
word seq.      w1 w2 w3 w4
phoneme seq.   p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
CD phone seq.  c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14
observ. seq.   o1 o2 o3 o4 o5 o6 o7 o8 o9 o10 o11 o12 o13 o14 o15 o16
Statistical Formulation
Observation sequence produced by the signal processing system: \( o = o_1 \ldots o_m \).
Sequence of words over alphabet \(\Sigma\): \( w = w_1 \ldots w_k \).
Formulation (maximum a posteriori decoding):
\[ \hat{w} = \operatorname*{argmax}_{w \in \Sigma^*} \Pr[w \mid o] = \operatorname*{argmax}_{w \in \Sigma^*} \frac{\Pr[o \mid w] \Pr[w]}{\Pr[o]} = \operatorname*{argmax}_{w \in \Sigma^*} \underbrace{\Pr[o \mid w]}_{\text{acoustic \& pronunciation model}} \underbrace{\Pr[w]}_{\text{language model}}. \]
(Bahl, Jelinek, and Mercer, 1983)
Components
Acoustic and pronunciation model:
\[ \Pr(o \mid w) = \sum_{d,c,p} \Pr(o \mid d)\,\Pr(d \mid c)\,\Pr(c \mid p)\,\Pr(p \mid w). \]
• Pr(o | d): observation seq. → distribution seq. (acoustic model).
• Pr(d | c): distribution seq. → CD phone seq. (acoustic model).
• Pr(c | p): CD phone seq. → phoneme seq.
• Pr(p | w): phoneme seq. → word seq.
Language model: Pr(w), a distribution over word sequences.
Notes
Formulation does not match the way speech recognition errors are typically measured: edit-distance between hypothesis and reference transcription.
This Lecture
Speech recognition problem
Statistical formulation
• Maximum likelihood and maximum a posteriori
• Statistical formulation of speech recognition
• Components of a speech recognizer
Acoustic features
Acoustic Observations
Discretization
• time: local spectral analysis of the speech waveform at regular intervals,
\[ t = t_1, \ldots, t_m, \quad t_{i+1} - t_i = 10~\text{ms (typically)}. \]
Parameter vectors
• magnitude: \( o = o_1 \ldots o_m, \quad o_i \in \mathbb{R}^N, \; N = 39~\text{(typically)}. \)
• Note: other perceptual information, e.g., visual information, is ignored.
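The time discretization amounts to a simple framing routine: an analysis window is taken every 10 ms. This sketch assumes a 16 kHz sampling rate and a 25 ms window length for illustration (only the 10 ms step is stated above; 25 ms windows appear later in the lecture).

```python
def frames(samples, rate=16000, step_ms=10, win_ms=25):
    """Slice a waveform into overlapping analysis windows:
    one frame every step_ms, each win_ms long."""
    step = int(rate * step_ms / 1000)   # 160 samples at 16 kHz
    win = int(rate * win_ms / 1000)     # 400 samples at 16 kHz
    return [samples[i:i + win]
            for i in range(0, len(samples) - win + 1, step)]

one_second = [0.0] * 16000              # placeholder waveform
f = frames(one_second)
print(len(f), len(f[0]))                # 98 frames of 400 samples
```

Each frame would then be windowed (e.g., Hamming) and converted to a feature vector \(o_i\).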
Acoustic Model
Three-state hidden Markov models (HMMs)
[Figure: three-state left-to-right HMM transducer for the CD phone ae_{b,d}; each state has a self-loop, with arcs labeled d0:ε, d1:ε, d2:ε and a final arc d2:ae_{b,d}.]
Distributions:
• Full-covariance multivariate Gaussians:
\[ \Pr[x] = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu)}. \]
• Diagonal-covariance Gaussian mixtures.
• Semi-continuous, tied mixtures.
(Rabiner and Juang, 1993)
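For the diagonal-covariance case, the density above factorizes over dimensions, and mixture log-likelihoods are combined with log-sum-exp for numerical stability. This is a toy sketch in pure Python; all parameters are made up.

```python
import math

def log_gaussian_diag(x, mu, var):
    """Log-density of a multivariate Gaussian with diagonal covariance:
    -1/2 * sum_j [ log(2*pi*var_j) + (x_j - mu_j)^2 / var_j ]."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - mi) ** 2 / v
                      for xi, mi, v in zip(x, mu, var))

def log_gmm(x, weights, mus, vars_):
    """Log-likelihood under a diagonal-covariance Gaussian mixture,
    via the standard log-sum-exp trick."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, mus, vars_)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

x = [0.1, -0.2]
print(log_gmm(x, [0.6, 0.4],
              [[0.0, 0.0], [1.0, 1.0]],
              [[1.0, 1.0], [1.0, 1.0]]))
```

In a recognizer, `log_gmm` plays the role of the per-state emission score evaluated on each observation vector.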
Context-Dependent Model
Idea:
• phoneme pronunciation depends on the environment (allophones, co-articulation).
• modeling phones in context yields better accuracy.
Context-dependent rules:
• Context-dependent units: ae/b _ d → ae_{b,d}.
• Allophonic rules (e.g., flapping): t/V _ V → dx.
• Complex contexts: regular expressions.
(Lee, 1990; Young et al., 1994)
Pronunciation Dictionary
Phonemic transcription
• Example: word data in American English.
Representation
data  D ey dx ax  .32
data  D ey t  ax  .08
data  D ae dx ax  .48
data  D ae t  ax  .12

[Figure: weighted transducer with arcs 0 →d:ε/1.0→ 1, 1 →ey:ε/0.4→ 2, 1 →ae:ε/0.6→ 2, 2 →dx:ε/0.8→ 3, 2 →t:ε/0.2→ 3, 3 →ax:data/1.0→ 4 (final).]
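Each table probability is the product of the arc weights along one transducer path: the vowel choice (ey 0.4 vs. ae 0.6) times the middle-consonant choice (dx 0.8 vs. t 0.2). A small sketch enumerating the paths reproduces the table:

```python
# arc weights of the toy transducer: vowel choice, then flap-vs-stop choice
vowel = {"ey": 0.4, "ae": 0.6}
middle = {"dx": 0.8, "t": 0.2}

pronunciations = {
    ("d", v, m, "ax"): pv * pm
    for v, pv in vowel.items()
    for m, pm in middle.items()
}
for phones, prob in sorted(pronunciations.items(), key=lambda kv: -kv[1]):
    print(" ".join(phones), round(prob, 2))
# d ae dx ax 0.48
# d ey dx ax 0.32
# d ae t ax 0.12
# d ey t ax 0.08
```

The four path probabilities sum to 1, as they must for a probabilistic pronunciation model of a single word.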
Language Model
Definition: probabilistic model for sequences of words \( w = w_1 \ldots w_k \).
• By the chain rule,
\[ \Pr[w] = \prod_{i=1}^{k} \Pr[w_i \mid w_1 \ldots w_{i-1}]. \]
Modeling simplifications:
• Clustering of histories: \( (w_1, \ldots, w_{i-1}) \mapsto c(w_1, \ldots, w_{i-1}) \).
• Example: nth-order Markov assumption,
\[ \forall i, \; \Pr[w_i \mid w_1 \ldots w_{i-1}] = \Pr[w_i \mid h_i], \quad |h_i| \le n - 1. \]
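A bigram model is the simplest instance of this history clustering: each history is reduced to its last word, and the conditional probabilities are maximum-likelihood ratios of counts. The corpus below is made up, and no smoothing is applied (unseen bigrams get probability zero).

```python
from collections import Counter

corpus = "the cat sat the cat ran the dog sat".split()

# first-order Markov (bigram) model: Pr[w_i | w_{i-1}] from counts
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p(w, h):
    """Maximum-likelihood bigram probability Pr[w | h]."""
    return bigrams[(h, w)] / unigrams[h]

def sentence_prob(words):
    """Chain rule with the bigram clustering of histories."""
    prob = 1.0
    for h, w in zip(words, words[1:]):
        prob *= p(w, h)
    return prob

print(p("cat", "the"))                          # 2/3
print(sentence_prob("the cat sat".split()))     # (2/3)*(1/2) = 1/3
```

n-gram models used in practice add smoothing (e.g., back-off) precisely to avoid the zero probabilities this toy estimator assigns.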
Recognition Cascade
Combination of components
observ. seq. → [HMM] → CD phone seq. → [CD Model] → phoneme seq. → [Pron. Model] → word seq. → [Lang. Model] → word seq.

Viterbi approximation:
\[ \hat{w} = \operatorname*{argmax}_{w} \sum_{d,c,p} \Pr[o \mid d] \Pr[d \mid c] \Pr[c \mid p] \Pr[p \mid w] \Pr[w] \approx \operatorname*{argmax}_{w} \max_{d,c,p} \Pr[o \mid d] \Pr[d \mid c] \Pr[c \mid p] \Pr[p \mid w] \Pr[w]. \]
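The max in the Viterbi approximation is computed by dynamic programming over the hidden state sequence. The sketch below is the textbook Viterbi recursion on a made-up two-state HMM, not the full cascade decoder:

```python
def viterbi(obs, states, start, trans, emit):
    """Most likely state sequence for an observation sequence under an
    HMM: the max over paths, in place of the forward algorithm's sum."""
    # delta[s] = probability of the best path ending in state s
    delta = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []                      # back-pointers, one dict per time step
    for o in obs[1:]:
        prev = delta
        delta, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] * trans[r][s])
            ptr[s] = best
            delta[s] = prev[best] * trans[best][s] * emit[s][o]
        back.append(ptr)
    # follow back-pointers from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[last]

states = ("a", "b")
start = {"a": 0.6, "b": 0.4}
trans = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.4, "b": 0.6}}
emit = {"a": {"x": 0.5, "y": 0.5}, "b": {"x": 0.1, "y": 0.9}}
path, prob = viterbi("xyy", states, start, trans, emit)
print(path, prob)   # best path ['a', 'b', 'b'], probability ≈ 0.0437
```

Real decoders run this recursion in the log semiring with beam pruning, over the composed transducer rather than a single small HMM.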
Speech Recognition Problems
Learning: how to create accurate models for each component?
Search: how to efficiently combine models and determine best transcription?
Representation: how to compactly represent the models computationally?
This Lecture
Speech recognition problem
Statistical formulation
Acoustic features
Feature Selection
Short-time spectral analysis:
Idea: find a smooth approximation eliminating large variations over short frequency intervals.
\[ \log \left| \int g(\tau)\, x(t + \tau)\, e^{-i 2\pi f \tau}\, d\tau \right| \]
[Figure: short-time (25 ms Hamming window) spectrum of /ae/; power (dB) vs. frequency (Hz).]
Cepstral Coefficients
Let \(x(\omega)\) denote the Fourier transform of the signal.
Definition: the 13 cepstral coefficients are the energy and the first 12 coefficients \(c_n\) of the expansion
\[ \log |x(\omega)| = \sum_{n=-\infty}^{\infty} c_n e^{-i n \omega}. \]
Other coefficients: 13 first-order (delta-cepstra) and 13 second-order (delta-delta-cepstra) differentials.
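The expansion suggests a direct, if naive O(N²), computation of the real cepstrum: take the DFT of a frame, then the inverse DFT of its log magnitude. This is a toy sketch; practical front ends use an FFT and mel-warped filterbank energies instead.

```python
import cmath
import math

def real_cepstrum(x):
    """Real cepstrum of a frame: inverse DFT of the log magnitude
    spectrum. Low-order coefficients capture the smooth envelope."""
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
         for k in range(N)]
    logmag = [math.log(abs(Xk) + 1e-12) for Xk in X]   # +eps avoids log(0)
    c = [sum(logmag[k] * cmath.exp(2j * math.pi * k * n / N)
             for k in range(N)) / N
         for n in range(N)]
    return [cn.real for cn in c]   # imaginary parts vanish for real input

frame = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]
c = real_cepstrum(frame)
print(len(c))   # 64; c[0] is the average log magnitude
```

For a real frame the log magnitude spectrum is symmetric, so the cepstrum is real and even (c[n] = c[N−n]).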
Mel Frequency Cepstral Coefficients
Refinement: a non-linear scale approximating human perception of the distance between frequencies, e.g., the mel frequency scale:
\[ f_{\text{mel}} = 2595 \log_{10}(1 + f/700). \]
MFCCs:
• the signal is first transformed using the mel frequency bands.
• cepstral coefficients are then extracted.
(Stevens and Volkman, 1940)
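The mel mapping and its inverse follow directly from the formula above; spacing filter centers equally on the mel scale is the typical use in MFCC filterbanks. The 8 kHz upper edge and the filter count below are illustrative assumptions, not values from the lecture.

```python
import math

def hz_to_mel(f):
    """Mel scale from the slide: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used when laying out filterbank edges."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# 10 triangular-filter center frequencies equally spaced on the mel scale
lo, hi = hz_to_mel(0.0), hz_to_mel(8000.0)
centers = [mel_to_hz(lo + k * (hi - lo) / 11) for k in range(1, 11)]
print([round(c) for c in centers])
```

Note how the centers crowd together at low frequencies and spread out at high ones, mirroring the ear's finer resolution at low frequencies.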
Other Refinements
Speaker/Channel adaptation:
• mean cepstral subtraction.
• vocal tract normalization.
• linear transformations.
References
• L. R. Bahl, F. Jelinek, and R. Mercer. A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(2):179-190, 1983.
• Biing-Hwang Juang and Lawrence R. Rabiner. Automatic Speech Recognition - A Brief History of the Technology. Elsevier Encyclopedia of Language and Linguistics, Second Edition, 2005.
• Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1998.
• Kai-Fu Lee. Context-Dependent Phonetic Hidden Markov Models for Speaker-Independent Continuous Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):599-609, 1990.
• Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.
References
• S.S. Stevens and J. Volkman. The relation of pitch to frequency. American Journal of Psychology, 53:329, 1940.
• Steve Young, J. Odell, and Phil Woodland. Tree-Based State-Tying for High Accuracy Acoustic Modelling. In Proceedings of ARPA Human Language Technology Workshop, Morgan Kaufmann, San Francisco, 1994.