How to handle pronunciation variation in ASR: By storing episodes in memory?
Helmer Strik
Centre for Language and Speech Technology (CLST)
Radboud University Nijmegen, the Netherlands
Overview
Contents:
  Variation, invariance problem
  ASR: Automatic Speech Recognition
  HSR: Human Speech Recognition
  ESR: Episodic Speech Recognition
Invariance problem (1)
One of the main issues in speech recognition is the large amount of variability present in speech.
(SRIV2006: ITRW on Speech Recognition and Intrinsic Variation)
Invariance problem: variation in stimuli, invariant percept
  Also visual, tactile, etc.
  Studied in many fields, no consensus
2 paradigms: invariant, episodic
Invariance problem (1)
Example 1: Speech
Dutch word: “natuurlijk” (naturally, ‘of course’) [natyrlk] [natylk]… [tyk]
Multiword expressions (MWEs): a lot of reduction, many variants
Invariance problem (2)
Example 2: Writing (vision)
[The word “natuurlijk” shown twelve times, each in a different font or handwriting style]
Familiar ‘styles’ (fonts, handwriting) are recognized better
ASR - Paradigm
Invariant, symbolic approach:
  utterance
  sequence of words
  sequence of phonemes
  sequence of states
  parametric description: pdfs / ANN
ASR - Paradigm
Same paradigm (HMMs) since the 1970s
Assumptions: incorrect, questionable
Insufficient performance
  ASR vs. HSR: error rates 8-80x higher
  Slow progress (ceiling effect?)
  Simply using more and more data is not sufficient
  (Moore, 2001)
A new paradigm is needed!
However, only a few attempts
HSR - Indexical information
Speech - 2 types of information:
1. Verbal info.: what, contents
2. Indexical info.: how, form
   e.g. environmental and speaker-specific aspects (pitch, loudness, speech rate, voice quality)
HSR - Indexical information
Traditional ASR model:
  Verbal information is used
  Indexical information: treated as noise, disturbances
  Preprocessing:
    Strip off
    Normalization (VTLN, MLLR, etc.)
And in HSR?
HSR - Indexical information
HSR : Strip off indexical information?
No!
Familiar voices and accents: recognize and mimic
Indexical information is perceived and encoded
HSR - Indexical information
Verbal & indexical information: processed independently?
No!
Familiar ‘voices’ are recognized better
Facilitation, also with ‘similar’ speech
HSR - Indexical and detailed information
Experimental results: indexical information and fine phonetic detail (Hawkins et al.) influence perception
Difficult to explain / integrate in the traditional, invariant model
New models: episodic models, for auditory and visual perception
ESR - Basic idea
A new paradigm for ASR is needed: An episodic model !!??
Training:
  Store trajectories: (representatives of) episodes
Recognition:
  Calculate the distance between the unknown input X and sequences of stored trajectories (DTW)
  Take the one with minimum distance: the recognized word
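The training and recognition steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the slides: each candidate word is represented by a single stored trajectory (rather than a sequence of concatenated episodes), the local frame distance is Euclidean, and the names `frame_distance`, `dtw_distance`, and `recognize` are hypothetical.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed local metric)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dtw_distance(x, y):
    """Accumulated DTW distance between two trajectories (lists of frames)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)          # row 0: only the origin is reachable
    for frame_x in x:
        cur = [inf] * (len(y) + 1)
        for j, frame_y in enumerate(y, start=1):
            best_step = min(prev[j], cur[j - 1], prev[j - 1])
            cur[j] = frame_distance(frame_x, frame_y) + best_step
        prev = cur
    return prev[-1]

def recognize(x, episodes):
    """Return the word whose stored episode has minimum DTW distance to x.

    episodes: list of (word, trajectory) pairs stored during 'training'.
    """
    return min(episodes, key=lambda e: dtw_distance(x, e[1]))[0]

# Toy 1-dimensional trajectories; a real system would store e.g. MFCC frames.
episodes = [
    ("yes", [[0.0], [1.0], [2.0], [1.0]]),
    ("no", [[2.0], [2.0], [0.0], [0.0]]),
]
unknown = [[0.0], [1.0], [1.0], [2.0], [1.0]]   # a slower rendition of "yes"
print(recognize(unknown, episodes))
```

"Training" here is nothing more than appending `(word, trajectory)` pairs to the list, which is what makes the episodic model simple to build but expensive to search.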
ESR – Invariant vs. episodic
                  Phone-based HMM               ESR
-------------------------------------------------------------
Unit:             Phone                         Syllable, word, …
Representation:   States (pdfs or ANN)          Trajectories
Compare:          Trajectory (X) & states       Trajectory (X) & trajectories
                  Parsimonious representation   Extensive representation
                  Complex mapping               Simple mapping
                  ‘Variation is noise’          Variation contains info.
                  Normalization                 Use variation
Phone ‘aj’ from ‘nine’.
X = begin
3 parts: aj(, aj|, aj)
Representation: pdfs (Gaussians)
Much detail, dynamic information is lost
Trajectories: details
Unit: phone(me)
Switchboard (Greenberg et al.):
  deletion: 25% of the phones
  substitution: 30% of the phones
  together 55%!!
Difficult for a model based on ‘sequences of phones’.
Syllables: less than 1% deleted
Phonetic transcriptions and their evaluation:
  Large differences between humans
  What is the ‘golden reference’?
  Speech – a sequence of symbols?
Unit: Multiword expressions (MWEs)
MWEs (see poster):
  A lot of reduction; many phonemes deleted or substituted
  Many variants (= sequences of phonemes): more than 90 for 2 MWEs studied
  Difficult to handle in ASR systems with current methods for pronunciation variation modeling
  Reduction, e.g. for an MWE: 4 words with 7 syllables reduced to ‘1 entity’ with 2 syllables
What should be stored? Units of various lengths?
An episodic approach for ASR
Advantages:
  More information during search: dynamic, indexical, fine phonetic detail
  Continuity constraints can be used (reduces the trajectory folding problem)
  Model is simpler
Disadvantage:
  More information during search: complexity
Brain: a lot of storage and ‘CPU’
Computers: more and more powerful
An episodic approach for ASR
Strik (2003) ITC-irst, Trento, Italy; ICPhS, Barcelona
De Wachter et al. (2003) Interspeech-2003
Axelrod & Maison (2004) ICASSP-2004
Maier & Moore (2005) Interspeech-2005
Aradilla, Vepa, Bourlard (2005) Interspeech-2005
Matton, De Wachter, et al. (2005) SPECOM-2005
Promising results
The computing power and memory that are needed to investigate the episodic approach to speech recognition are (becoming) available
The HSR-ASR gap
HSR & ASR – 2 different communities
  Different people, departments, journals, terminology, goals, methodologies
Goals, evaluation
  HSR: simulate experimental findings
  ASR: reduce WER
The HSR-ASR gap
Marr (1982) – 3 levels of modeling:
1. Computational
2. Algorithmic
3. Implementational
HSR - (larger) differences at higher levels
ASR – implementations, end-to-end models using real speech signals as input
  Thousands of experiments: WER has been gradually reduced
  However, essentially the same model
  New model: performance (WER), funding, etc.
The HSR-ASR gap - bridge
Use same evaluation metric for HSR & ASR systems: reaction times (Cutler & Robinson, 1992)
Use knowledge or components from the other field (Scharenborg et al., 2003).
Use models that are suitable for HSR & ASR research
  Evaluation from HSR & ASR point of view
S2S – Sound to Sense (Sarah Hawkins)
  Marie Curie Research Training Network (MC-RTN)
  Recently approved by the EU
Episodic speech recognition
ESRASA model
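[Figure: network diagram of the ESRASA model. Labels: feature vector (F1, F2, ..., FN); attention weights (T1, T2, ..., TN); episodes with base activations (B1, B2, ..., BE) and episode activations (EA1, EA2, ..., EAE); association weights (C11, C12, C22, ..., CE2, ..., CEW); words with base activations (B1, B2, ..., BW) and word activations (WA1, WA2, ..., WAW)]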
ESRASA model
ESRASA: Episodic Speech Recognition And Structure Acquisition
The ESRASA model is inspired by several previous models, especially the model described in Johnson (1997), WRAPSA (Jusczyk, 1993), and GCM (Nosofsky, 1986)
The ESRASA model is a feedforward neural network with two sets of weights: atTention weights Tn and assoCiation weights Cew. Besides these two sets of weights, words, episodes (for speech units), and their base activation levels (Bw and Be, respectively) will be stored in memory.
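A forward pass through such a two-layer network might look like the sketch below. The slides do not give the activation rules, so everything beyond the named quantities (attention weights Tn, association weights Cew, base activations Be and Bw) is an assumption: the exponential, GCM-style similarity, the multiplicative use of base activations, and the `sensitivity` parameter are all hypothetical choices made for illustration.

```python
import math

def episode_activations(features, episodes, attention, base_e, sensitivity=1.0):
    """ASSUMED rule: EA_e = B_e * exp(-c * attention-weighted Euclidean distance)."""
    activations = []
    for episode, base in zip(episodes, base_e):
        dist = math.sqrt(sum(t * (f - s) ** 2
                             for t, f, s in zip(attention, features, episode)))
        activations.append(base * math.exp(-sensitivity * dist))
    return activations

def word_activations(ep_act, association, base_w):
    """ASSUMED rule: WA_w = B_w * sum over episodes e of C_ew * EA_e."""
    return [base * sum(association[e][w] * ep_act[e] for e in range(len(ep_act)))
            for w, base in enumerate(base_w)]

# Toy memory: 3 stored episodes (N = 2 features each) and 2 words.
features = [1.0, 2.0]                          # input feature vector F
episodes = [[1.0, 2.0], [1.1, 2.1], [5.0, 5.0]]
attention = [1.0, 1.0]                         # T_n, one weight per feature
base_e = [1.0, 1.0, 1.0]                       # B_e
association = [[1.0, 0.0],                     # C_ew: episodes 0 and 1 belong to word 0,
               [1.0, 0.0],                     #       episode 2 belongs to word 1
               [0.0, 1.0]]
base_w = [1.0, 1.0]                            # B_w

ea = episode_activations(features, episodes, attention, base_e)
wa = word_activations(ea, association, base_w)
winner = max(range(len(wa)), key=wa.__getitem__)
print(winner, wa)
```

With these rules, raising an attention weight makes mismatches on that feature more costly, and raising a base activation favors frequently used episodes or words.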
ESR - Recognition
[Figure: X -> L items in lexicon -> Preselection -> S items in subset -> Competition -> 1 item, the winner]
ESR - Preselection
Why preselection?
  Reduce CPU & memory
  Increase performance
  Also used in DTW-based pattern recognition applications
  Used in many HSR models
ESR - Competition
Recognize unknown word X:
  Calculate the distance between X and sequences of stored episodes (DTW)
  Take the one with minimum distance: the recognized word
Use continuity constraints (as in TTS)
ESR - DTW: Dynamic Time Warping
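For reference, one common formulation of the DTW distance between an input trajectory x_1 ... x_N and a stored trajectory y_1 ... y_M is given below. The slides do not say which step pattern or local distance d is used, so the symmetric three-way step shown here is an assumption.

```latex
\begin{aligned}
D(1,1) &= d(x_1, y_1) \\
D(i,j) &= d(x_i, y_j) + \min\bigl\{\, D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \,\bigr\} \\
D_{\mathrm{dtw}}(X, Y) &= D(N, M)
\end{aligned}
```

Terms with an index below 1 are taken to be infinite, so every warping path starts at (1,1) and ends at (N,M).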
ESR – Research: Preselection?
Best method? Compare:
  kNN – k nearest neighbor
  Lower bound distance: Ddtw ≥ Dlb > d
  Build an index for the lexicon
Is preselection needed? Compare: with & without preselection
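The lower-bound idea can be illustrated as follows: if a cheap bound on the DTW distance already exceeds the best distance found so far, the full DTW computation for that lexicon item can be skipped without changing the result. The particular bound used here (local distance of the first frames plus that of the last frames) is an assumption chosen for brevity, not necessarily the one intended in the slides; the function names are hypothetical.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed local metric)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def dtw_distance(x, y):
    """Accumulated DTW distance with the symmetric three-way step."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)
    for frame_x in x:
        cur = [inf] * (len(y) + 1)
        for j, frame_y in enumerate(y, start=1):
            cur[j] = frame_distance(frame_x, frame_y) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1]

def lower_bound(x, y):
    """Cheap bound: every warping path contains the first and the last cell."""
    first = frame_distance(x[0], y[0])
    if len(x) == 1 and len(y) == 1:
        return first                      # first and last cell coincide
    return first + frame_distance(x[-1], y[-1])

def recognize_with_pruning(x, lexicon):
    """Return (word, distance, number of full DTW computations)."""
    candidates = sorted(((lower_bound(x, traj), word, traj) for word, traj in lexicon),
                        key=lambda c: c[0])
    best_word, best_dist, n_full = None, float("inf"), 0
    for bound, word, traj in candidates:
        if bound >= best_dist:
            break                         # all remaining bounds are at least as large
        n_full += 1
        dist = dtw_distance(x, traj)
        if dist < best_dist:
            best_word, best_dist = word, dist
    return best_word, best_dist, n_full

lexicon = [
    ("a", [[0.0], [1.0], [2.0]]),
    ("b", [[5.0], [6.0], [7.0]]),
    ("c", [[0.1], [1.0], [2.1]]),
]
x = [[0.0], [1.0], [2.0]]
print(recognize_with_pruning(x, lexicon))
```

Because the bound never exceeds the true DTW distance, this pruning is exact; a kNN-style preselection that keeps a fixed number of candidates would be cheaper still but could discard the true winner.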
ESR – Research: Units for preselection?
Compare:
  Syllable
  Word
  Begin (window of fixed length)
ESR - Research: Units for competition?
Compare:
  Syllables
  Words
  In combination with multisyllables?
Multisyllables (reduction, resyllabification)
  Ik weet het niet (‘I don’t know’) -> kweeni
  Op een gegeven moment (‘at a given moment’) -> pgeefment
  Zeven-en (‘seven-and’) -> ze-fnen
ESR - Research: Exemplars?
How to select exemplars:
  DTW distances + hierarchical clustering
  VQ: LVQ & K-means
Trade-off normalization & (size) lexicon
Compare normalization techniques: TDNR, MVN, HN, VTLN
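As an illustration of the trade-off, the sketch below applies one of the listed techniques, assuming MVN stands for per-utterance mean and variance normalization of each feature dimension. After normalization, two trajectories that differ only by a constant offset become identical, so fewer exemplars would be needed to cover that kind of variation. The function name is hypothetical.

```python
import math

def mean_variance_normalize(trajectory):
    """Scale each feature dimension to zero mean and unit variance over the frames."""
    n_frames = len(trajectory)
    n_dims = len(trajectory[0])
    means = [sum(frame[d] for frame in trajectory) / n_frames for d in range(n_dims)]
    stds = []
    for d in range(n_dims):
        var = sum((frame[d] - means[d]) ** 2 for frame in trajectory) / n_frames
        stds.append(math.sqrt(var) or 1.0)   # constant dimension: avoid division by zero
    return [[(frame[d] - means[d]) / stds[d] for d in range(n_dims)]
            for frame in trajectory]

original = [[1.0, 10.0], [3.0, 10.0]]
shifted = [[6.0, 15.0], [8.0, 15.0]]         # same shape, constant offset of +5
print(mean_variance_normalize(original))
print(mean_variance_normalize(shifted))
```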
ESR - Research: Features?
Compare:
  Spectral features: MFCC, PLP, LPC
  Articulatory features (ANN)
  Combine spectral & articulatory feat.
Different features for preselection & competition?
ESR - Research: Distance metrics?
Compare (frame-based metrics):
  Euclidean
  Mahalanobis
  Itakura (for LPC)
  Perceptually-based?
Distance metric for trajectories?
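Two of the frame-based metrics listed above can be written down directly. The Mahalanobis version below assumes a diagonal covariance matrix (one variance per feature dimension), which is a simplification introduced here to keep the sketch free of matrix inversion; the function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def mahalanobis_diag(a, b, variances):
    """Mahalanobis distance under an ASSUMED diagonal covariance matrix."""
    return math.sqrt(sum((p - q) ** 2 / v for p, q, v in zip(a, b, variances)))

frame_a = [0.0, 0.0]
frame_b = [3.0, 4.0]
print(euclidean(frame_a, frame_b))
print(mahalanobis_diag(frame_a, frame_b, [9.0, 16.0]))
```

With unit variances the two metrics coincide; with unequal variances the Mahalanobis distance down-weights dimensions that vary a lot, which matters when features with different scales are combined.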
HMM-based ASR - Information sources
HMM-based ASR, roughly 3 ways:
1. Class-specific HMMs
2. Multistream
3. 2-pass decoding
Disadvantages:
1. Many classes
2. Synchronization & recombination
3. Pass 1: no / less knowledge
ESR - Research: Information sources
ESR: compare 2 trajectories
  All details are available during search, e.g. context & dynamic information
  Compare shape + timing of feat. contours
    F0 rise: early or final, half or complete
Tags can be added to the lexicon + continuity constraints
HSR - Foreign English Examples
Conversation about Italy.
dropped / robbed
I was robbed in Milan.
By parachute?
[ FEE 1 ]
HSR - Indexical information
HSR: Strip off indexical information?
No!
Familiar voices and accents: recognize and mimic [ FEE 2 ]
Indexical information is perceived and encoded
HSR - Indexical information
Verbal & indexical information: processed independently?
No!
Familiar ‘voices’ are recognized better [ FEE 3 ]
Facilitation, also with ‘similar’ speech [ FEE 4 ]
ASR - Pronunciation variation
SRIV2006: ITRW on Speech Recognition and Intrinsic Variation
Pronunciation variation modeling for ASR:
  Improvements, but generally small
  Current ASR paradigm: suitable?
Phonetic transcriptions and their evaluation:
  Large differences between humans
  What is the ‘golden reference’?
  Speech – a sequence of symbols?