TRANSCRIPT
Computer Speech Recognition: Mimicking the Human System
Li Deng
Microsoft Research, Redmond
Feb. 2, 2005
at IPAM Workshop on Math of Ear and Sound Processing (UCLA)
Collaborators: Dong Yu (MSR), Xiang Li (CMU), A. Acero (MSR)
Speech Recognition --- Introduction
• Converting naturally uttered speech into text and meaning
• Human-machine dialogues (scenario demos)
• Conventional technology --- statistical modeling and estimation (HMM)
• Limitations:
– noisy acoustic environments
– rigid speaking style
– constrained tasks
– unrealistic demands on training data
– huge model sizes, etc.
– far below human speech recognition performance
• Trend: Incorporate key aspects of human speech processing mechanisms
Production & Perception: Closed-Loop Chain
[Diagram: SPEAKER encodes a message via motor/articulators into speech acoustics; LISTENER decodes via ear/auditory reception and an internal model into the decoded message. Speech acoustics link the two in a closed-loop chain.]
Encoder: Two-Stage Production Mechanisms
[Diagram: SPEAKER's message passes through motor/articulators to speech acoustics]

Phonology (higher level):
• Symbolic encoding of the linguistic message
• Discrete representation by phonological features
• Loosely coupled multiple feature tiers
• Overcomes the beads-on-a-string phone model
• Theories of distinctive features, feature geometry & articulatory phonology
• Accounts for partial/full sound deletion/modification in casual speech

Phonetics (lower level):
• Converts discrete linguistic features to continuous acoustics
• Mediated by motor control & articulatory dynamics
• Mapping from articulatory variables to VT area function to acoustics
• Accounts for co-articulation and reduction (target undershoot), etc.
Encoder: Phonological Modeling
Computational phonology:
• Represent pronunciation variations as a constrained factorial Markov chain
• Constraint: from articulatory phonology
• Language-universal representation

Example: "ten themes" /t ε n θ i: m z/, with overlapping TongueTip and TongueBody feature tiers (e.g., High/Front, Mid/Front).
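The factorial-chain idea can be sketched as multiple feature tiers, each a small left-to-right Markov chain evolving in parallel. A minimal sketch: the states, stay-probabilities, and frame count below are illustrative assumptions, and the articulatory-phonology coupling constraint between tiers is omitted for brevity (the tiers here advance fully independently).

```python
import random

# Hypothetical feature values per tier for "ten themes"; the states and
# transition probabilities are illustrative, not from the talk.
random.seed(0)

def sample_tier(states, stay_prob, n_frames):
    """One tier of the factorial chain: a left-to-right Markov chain that
    either stays in the current feature value or advances to the next."""
    seq, i = [], 0
    for _ in range(n_frames):
        seq.append(states[i])
        if i < len(states) - 1 and random.random() > stay_prob:
            i += 1
    return seq

tip  = sample_tier(["Closure", "Open", "Closure", "Dental"], 0.7, 12)
body = sample_tier(["MidFront", "HighFront"], 0.8, 12)

# Because the tiers advance independently, their boundaries need not align:
# this asynchrony is what models feature overlap beyond "beads on a string".
for t, (a, b) in enumerate(zip(tip, body)):
    print(t, a, b)
```

The key point is that a tier boundary on one chain (e.g., TongueTip reaching its dental closure) need not coincide with a boundary on another, which is how partial deletions and modifications in casual speech can be represented.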
Encoder: Phonetic Modeling
Computational phonetics:
• Segmental factorial HMM for sequential targets in the articulatory or vocal-tract-resonance domain
• Switching trajectory model for target-directed articulatory dynamics
• Switching nonlinear state-space model for dynamics in speech acoustics
• Illustration:
Phonetic Encoder: Computation
[Diagram: targets → articulation → distortion-free acoustics → distorted acoustics, with distortion factors feeding back to articulation]
Phonetic Reduction Illustration
[Figure: "yo-yo" spoken formally vs. casually]
$$z_n = 2\gamma_s\, z_{n-1} - \gamma_s^2\, z_{n-2} + (1-\gamma_s)^2\, T_s + w_n$$
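Target undershoot falls out of this kind of dynamics directly. A minimal sketch, assuming the critically damped target-directed form z[n] = 2g·z[n-1] − g²·z[n-2] + (1−g)²·T (noise term dropped); the value of g and the target values are illustrative, not the talk's:

```python
import numpy as np

# Sketch of critically damped target-directed dynamics:
#   z[n] = 2*g*z[n-1] - g**2*z[n-2] + (1-g)**2 * T
# g (gamma_s) and the targets below are illustrative values.
def simulate(targets, durations, g=0.85):
    z = [0.0, 0.0]                      # initial conditions
    for T, dur in zip(targets, durations):
        for _ in range(dur):
            z.append(2*g*z[-1] - g**2*z[-2] + (1 - g)**2 * T)
    return np.array(z[2:])

# Formal (long segments) vs. casual (short segments) rendering of targets:
formal = simulate([1.0, -1.0, 1.0], [40, 40, 40])
casual = simulate([1.0, -1.0, 1.0], [8, 8, 8])

# With long durations the trajectory reaches each target; with short
# durations it undershoots (phonetic reduction / target undershoot).
print(abs(formal[39] - 1.0), abs(casual[7] - 1.0))
```

The same parameters produce both the formal and the casual pattern; only the segment durations differ, which is the sense in which reduction needs no separate model.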
Decoder I: Auditory Reception
[Diagram: listener side of the closed-loop chain, with ear/auditory reception feeding the internal model and the decoded message]

LISTENER
• Convert speech acoustic waves into an efficient & robust auditory representation
• This processing is largely independent of phonological units
• Involves processing stages in the cochlea (ear), cochlear nucleus, SOC, IC, …, all the way to A1 cortex
• Principal roles:
1) combat environmental acoustic distortion
2) detect relevant speech features
3) provide temporal landmarks to aid decoding
• Key properties:
1) critical-band frequency scale, logarithmic compression
2) adaptive frequency selectivity, cross-channel correlation
3) sharp response to transient sounds
4) modulation in independent frequency bands
5) binaural noise suppression, etc.
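Two of the listed properties, the critical-band frequency scale and logarithmic compression, can be sketched with a standard mel-scale filterbank. This is a generic textbook construction, not the talk's specific auditory model; the filter count, FFT size, and sampling rate are arbitrary choices:

```python
import numpy as np

# Critical-band-like analysis: triangular filters spaced uniformly on the
# mel scale, followed by logarithmic compression of the band energies.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, sr=16000):
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

# Synthetic frame: 500 Hz tone plus noise; log energies are the output.
sr, n_fft = 16000, 512
t = np.arange(n_fft) / sr
frame = np.sin(2 * np.pi * 500 * t) + 0.01 * np.random.randn(n_fft)
power = np.abs(np.fft.rfft(frame)) ** 2
log_energies = np.log(mel_filterbank() @ power + 1e-10)
print(log_energies.shape)
```

The nonuniform band spacing concentrates resolution at low frequencies, and the log compresses the large dynamic range, both rough analogues of the cochlear properties listed above.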
Decoder II: Cognitive Perception
LISTENER
• Cognitive process: recovery of the linguistic message
• Relies on:
1) the "internal" model: structural knowledge of the encoder (production system)
2) a robust auditory representation of features
3) temporal landmarks
• The child's speech acquisition process is one that gradually establishes the "internal" model
• Strategy: analysis by synthesis, i.e., probabilistic inference on (deeply) hidden linguistic units using the internal model
• No motor theory: the above strategy requires no articulatory recovery from speech acoustics
Speaker-Listener Interaction
• On-line modification of the speaker's articulatory behavior (speaking effort, rate, clarity, etc.) based on the listener's "decoding" performance (i.e., discrimination)
• Especially important for conversational speech recognition and understanding
• On-line adaptation of "encoder" parameters
• Novel criterion: maximize discrimination while minimizing articulation effort
• In this closed-loop model, the "effort" is quantified as the "curvature" of the temporal sequence of the articulatory vector z_t
• No such concept of "effort" exists in conventional HMM systems
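The effort-as-curvature idea can be sketched numerically. The talk gives no concrete formula, so the second-temporal-difference measure below is an assumption, and the trajectories are synthetic:

```python
import numpy as np

# Sketch: approximate the "curvature" of an articulatory trajectory z_t
# by the summed squared second temporal difference (an assumption; the
# talk does not specify the exact functional form).
def effort(z):
    """Sum of squared second differences along a (T x D) trajectory."""
    return float(np.sum(np.diff(z, n=2, axis=0) ** 2))

t = np.linspace(0, 1, 100)[:, None]
smooth = np.hstack([t, t ** 2])                  # gently curving path
jerky  = smooth + 0.05 * np.sin(40 * np.pi * t)  # same path + rapid wiggles

# More rapid direction changes -> larger curvature -> larger "effort".
print(effort(smooth) < effort(jerky))
```

A criterion that trades this quantity against discrimination would prefer the smooth trajectory unless the extra movement actually helps the listener tell sounds apart.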
Stage-I illustration (effects of speaking rate)
Sound Confusion for Casual Speech (model vs. data)
• Two sounds merge when they become "sloppy" (as speaking rate increases)
• Human perception does "extrapolation"; so does our model
• 5000 hand-labeled speech tokens
• Source: J. Acoustical Society of America, 2000
[Figure: model prediction vs. hand measurements, each plotted against speaking rate]
Model Stage-I:
• Impulse response of FIR filter (non-causal):
• Output of filter:
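The formulas themselves were lost with the slide image. A plausible reconstruction, following the bi-directional exponentially decaying filter form used in hidden trajectory models (the symbols c_s, γ_s, and the half-width D are assumptions here, not a verbatim copy of the slide):

```latex
% Non-causal FIR impulse response (bi-directional, exponentially decaying;
% gamma_s depends on the current segment s):
h_s(k) = c_s\,\gamma_s^{|k|}, \qquad -D \le k \le D
% Filter output: a smoothed trajectory obtained from the segmental
% target sequence t(n):
z(n) = \sum_{k=-D}^{D} h_{s(n-k)}(k)\, t(n-k)
```

Because the filter looks both backward and forward in time, each output frame blends neighboring segments' targets, which is how context dependence and reduction arise without context-dependent parameters.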
Model Stage-II:
• Analytical prediction of cepstra (assuming a P-th-order all-pole model)
• Residual random vector for statistical bias modeling (finite pole order, no zeros):
residual
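The equations were lost with the slide image. A standard form for the cepstrum of a P-th-order all-pole model with resonance frequencies f_p and bandwidths b_p, consistent with the slide's description but a reconstruction rather than a verbatim copy (the Gaussian residual parameterization is likewise assumed):

```latex
% Cepstrum predicted from P vocal-tract-resonance poles with center
% frequencies f_p and bandwidths b_p (sampling frequency f_s):
\hat{c}_n = \frac{2}{n} \sum_{p=1}^{P} e^{-\pi n b_p / f_s}
            \cos\!\left( \frac{2\pi n f_p}{f_s} \right), \qquad n = 1, 2, \dots
% Residual vector capturing the bias of the finite-order, zero-free
% all-pole assumption:
c_n = \hat{c}_n + r_n, \qquad r_n \sim \mathcal{N}(\mu_s,\, \sigma_s^2)
```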
Illustration: Output of Stage-II (green)
[Figure: model output overlaid on data]
Speech Recognizer Architecture
• Stages I and II of the hidden trajectory model are combined into a speech recognizer
• No context-dependent parameters: the bi-directional FIR filter provides context dependence, as well as reduction
• Training procedure
• Recognition procedure
Procedure --- Training
• Training the residual parameters (μ_s and σ_s²):
– training waveform → feature extraction → LPCC
– phonetic transcript w/ time → table lookup → target sequence → target filtering w/ FIR → predicted VTR tracks → nonlinear mapping → predicted LPCC
– LPCC residual (LPCC minus predicted LPCC) → monophone HMM trainer → μ_s, σ_s²
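The last training step, estimating per-phone residual statistics from aligned residual frames, can be sketched as follows; the data, dimensions, and phone labels are synthetic placeholders:

```python
import numpy as np

# Sketch: given per-frame residuals (observed LPCC minus predicted LPCC)
# aligned to phones, estimate a per-phone residual mean and variance.
rng = np.random.default_rng(0)

def train_residual_params(residuals, labels):
    """residuals: (T, D) array; labels: length-T phone labels.
    Returns {phone: (mean, var)} over the frames of each phone."""
    params = {}
    labels = np.array(labels)
    for ph in set(labels.tolist()):
        frames = residuals[labels == ph]
        params[ph] = (frames.mean(axis=0), frames.var(axis=0))
    return params

residuals = rng.normal(0.0, 0.1, size=(200, 12))   # fake LPCC residuals
labels = ["eh"] * 100 + ["n"] * 100                # fake time alignment
params = train_residual_params(residuals, labels)
print(sorted(params), params["eh"][0].shape)
```

In the full procedure these statistics would come from an HMM trainer over the residual stream rather than a hard alignment, but the per-phone mean/variance output is the same shape.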
Procedure --- N-best Evaluation
• A triphone HMM system generates the N-best list (N = 1000) from the test data; each hypothesis has a phonetic transcript & time alignment
• Test data → feature extraction → LPCC
• For each hypothesis Hyp 1 … Hyp N: table lookup → FIR filtering → nonlinear mapping → predicted LPCC (this branch is parameter-free)
• Residual (LPCC minus predicted LPCC) scored by a Gaussian scorer using the trained μ_s, σ_s²
• H* = arg max { P(H_1), P(H_2), …, P(H_1000) }
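The rescoring step reduces to scoring each hypothesis's residual under a Gaussian and taking the argmax. A minimal sketch with made-up residuals and a single shared mean/variance (the real system uses per-phone, per-dimension parameters):

```python
import math

# Sketch of N-best rescoring: each hypothesis gets a Gaussian
# log-likelihood of its prediction residual, and H* is the argmax.
# The hypotheses and residual values are fabricated for illustration.
def gaussian_loglik(residual, mean, var):
    return sum(-0.5 * (math.log(2 * math.pi * var) + (r - mean) ** 2 / var)
               for r in residual)

hypotheses = {
    "ten themes":  [0.1, -0.2, 0.05],   # small residuals: good match
    "ten teams":   [0.9, -0.8, 0.70],   # larger residuals: poor match
    "tennis eems": [1.5, -1.2, 1.10],
}
scores = {h: gaussian_loglik(res, 0.0, 0.25) for h, res in hypotheses.items()}
best = max(scores, key=scores.get)      # H* = argmax over the N-best list
print(best)
```

The hypothesis whose predicted trajectory best explains the observed LPCC (smallest residual under the trained Gaussian) wins, regardless of how the HMM system originally ranked it.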
Results (recognition accuracy %)
[Chart: accuracy (%) from 30 to 100 vs. N in the N-best list (1 to 1000), with the HMM baseline shown as a dotted line]
Summary & Conclusion
• Human speech production & perception viewed as synergistic elements in a closed-loop communication chain
• They function as encoding & decoding of linguistic messages, respectively
• In humans, the speech "encoder" (production system) consists of phonological (symbolic) and phonetic (numeric) levels
• The current HMM approach approximates these two levels in a crude way:
– phone-based phonological model ("beads-on-a-string")
– multiple Gaussians as the phonetic model, applied to acoustics directly
– very weak hidden structure
Summary & Conclusion (cont'd)
• "Linguistic message recovery" (decoding) formulated as:
– auditory reception, for an efficient & robust speech representation and for providing temporal landmarks for phonological features
– cognitive perception, using "encoder" knowledge (the "internal model") to perform probabilistic analysis by synthesis or pattern matching
• Dynamic Bayes networks developed as a computational tool for constructing the encoder and decoder
• Speaker-listener interaction (in addition to poor acoustic environments) causes substantial changes in articulation behavior and acoustic patterns
• This is the scientific background and computational framework for our recent MSR speech recognition research
End &
Backup Slides