Hierarchical Approach for Spotting Keywords from an Acoustic Stream
Supervisor: Professor Raimo Kantola
Instructor: Professor Hynek Hermansky, IDIAP Research Institute
8.11.2005 Hierarchical Approach for Spotting Keywords 2
Introduction to the thesis
Existing keyword spotting approaches are usually based on speech recognition techniques
Growing apart from the original problem can lead to drawbacks, like lack of generality
Another approach is presented and studied, where only the target sounds of the keyword are looked for
To study and formulate this approach was my work at IDIAP Research Institute, 3/2005 - 8/2005
Objective of the thesis: to see how far we can go without using hidden Markov models and dynamic programming techniques
Outline
Introduction to keyword spotting
Motivation for this work
Steps of hierarchical processing
Experiments
Conclusions
Keyword Spotting
Keyword Spotting (KWS) aims at finding only certain words while rejecting the rest (hypothesis test)
Finding only certain, rare and high-information-valued words is a feasible approach in, for example, voice-command-driven applications or multimedia indexing
Picture from [Jun96]
Performance measures for keyword spotting
The possible events in keyword spotting are hit, false alarm and miss
The performance is evaluated by presenting the detection rate as a function of the false alarm rate
This yields the receiver operating characteristic (ROC) curve
The average detection rate over the range of 0-10 false alarms per hour is called the figure of merit (FOM) [Roh89]
[Figure: ROC curve; x-axis: False Alarms / Hour; y-axis: Keywords detected / %]
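The FOM can be illustrated with a minimal sketch. It assumes each putative detection carries a score and a hit / false alarm flag, treats the ROC curve as a step function, and averages it over 0-10 false alarms per hour. The function name `figure_of_merit` and its arguments are hypothetical, and the exact interpolation used in [Roh89] may differ.

```python
def figure_of_merit(detections, n_keywords, hours, max_fa_per_hour=10.0):
    """Average detection rate over 0..max_fa_per_hour false alarms per hour.

    detections: iterable of (score, is_hit) pairs for putative keyword hits.
    n_keywords: number of true keyword occurrences in the test data.
    hours:      duration of the test data in hours.
    """
    # Lowering the threshold admits detections in order of decreasing score.
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    hits = 0
    false_alarms = 0
    area = 0.0  # area under the step-shaped ROC curve
    for _score, is_hit in ordered:
        if is_hit:
            hits += 1
            continue
        # Each false alarm moves the operating point one step to the right;
        # the detection rate reached so far holds over that step.
        left = false_alarms / hours
        right = min((false_alarms + 1) / hours, max_fa_per_hour)
        area += (hits / n_keywords) * (right - left)
        false_alarms += 1
        if false_alarms / hours >= max_fa_per_hour:
            break
    else:
        # Ran out of detections: the curve stays flat up to the limit.
        area += (hits / n_keywords) * (max_fa_per_hour - false_alarms / hours)
    return area / max_fa_per_hour
```

With four keywords in one hour of speech, three hits and two false alarms interleaved by score give an FOM of 0.7 under these assumptions.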
LVCSR / HMM based approaches
Typical large vocabulary continuous speech recognition (LVCSR) / hidden Markov model (HMM) based KWS approaches model both keywords and non-keywords (background or garbage)
Keywords are searched by using dynamic programming techniques
Keyword spotting network from [Roh89].
[Figure: an example of dynamic programming, showing the optimal alignment between the sequences X = x1 ... xN and Y = y1 ... yM]
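As a rough illustration of what aligning two sequences X and Y by dynamic programming means, here is a sketch of classic dynamic time warping (DTW). The slides do not state which algorithm the figure depicts, so DTW with an absolute-difference local cost is an assumption chosen for brevity; HMM-based systems use Viterbi decoding instead.

```python
def dtw_cost(x, y):
    """Cost of the optimal monotonic alignment between sequences x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = best cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y repeats
                cost[i][j - 1],      # y advances, x repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

A sequence and a time-stretched copy of it align at zero cost, which is the property that makes such alignment useful for speech of varying rate.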
LVCSR / HMM based approaches vs. hypothesis test approach
LABEL: um ... okay, uh ... please open the, uh ... window
Spot1: ---------------------------1111-------------------
Spot2: --------------------------------------------111111
Recog: garbage garbage garbage - OPEN - garbage - WINDOW
[Figure: hypothesis test for word 1 over time, with a Yes / No decision]
Motivation for this work
Typical LVCSR / HMM based approaches require a garbage model for Viterbi dynamic programming
The better the garbage model, the better the keyword spotting performance [Ros90] ... and the closer the system is to LVCSR
Use of LVCSR techniques can introduce:
• task dependency, lack of generality
• computational load, complexity
• need for training data
• off-line operating mode
• complexity to add keywords
How far can we go by looking only at the keysounds?
Hierarchical approach for spotting keywords
Key sounds (words) are spotted by looking for the target sounds (phonemes) that form the key sound.
STEP 1: Estimate equally sampled phoneme posteriors
STEP 2: Derive phoneme-spaced posterior estimates
STEP 3: Search right sequences of high-confidence phonemes
ALARM
Step 1: From acoustic stream to phoneme posteriors
TRAP-NN system: Feature extraction from 2-D filtering of critical band spectrogram, using 1010 ms long temporal patterns (TRAPs)
Features are fed to a trained neural net (NN) vector classifier that returns estimates of phoneme posterior probabilities every 10 ms
TRAP-NN was successfully used in [Szö05] for phoneme based keyword spotting
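To make the 1010 ms temporal pattern concrete, the sketch below cuts a 101-frame trajectory of one critical band around a center frame (101 frames at a 10 ms step span 1010 ms). It shows only this slicing step; the 2-D filtering and the neural net are not shown. The function name `temporal_pattern`, the list-of-frames spectrogram layout, and the edge padding by frame repetition are all assumptions.

```python
def temporal_pattern(spectrogram, band, center, half_width=50):
    """Return the trajectory of one critical band around frame `center`.

    spectrogram: list of frames, each a list of critical band energies
                 (one frame per 10 ms is assumed).
    half_width:  50 frames on each side give 101 frames, i.e. 1010 ms.
    """
    last = len(spectrogram) - 1
    # Frames outside the signal are replaced by the nearest valid frame
    # (an assumed padding rule, not taken from the slides).
    return [
        spectrogram[min(max(t, 0), last)][band]
        for t in range(center - half_width, center + half_width + 1)
    ]
```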
Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors
Phonemes are found by filtering the posteriogram with a bank of matched filters
Matched filters are obtained by averaging 0.5 s long segments of phoneme trajectories
The purpose of filtering is to have one peak per phoneme
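The matched filtering step could look roughly like the following sketch: the filter is the average of equally long posterior segments centered on training phonemes, and filtering is a correlation of the posterior trajectory with that filter. The function names, the zero padding at the edges, and the absence of any normalization are assumptions.

```python
def average_filter(trajectory_segments):
    """Matched filter as the mean of equally long posterior segments,
    each assumed to be centered on one training phoneme."""
    count = len(trajectory_segments)
    length = len(trajectory_segments[0])
    return [
        sum(segment[k] for segment in trajectory_segments) / count
        for k in range(length)
    ]


def apply_filter(posteriors, kernel):
    """Correlate one phoneme's posterior trajectory with the kernel.
    Frames outside the trajectory are treated as zero."""
    half = len(kernel) // 2
    filtered = []
    for t in range(len(posteriors)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(posteriors):
                acc += weight * posteriors[idx]
        filtered.append(acc)
    return filtered
```

Applied to a trajectory with a single bump, the output peaks at the bump's center, which is the "one peak per phoneme" behavior the slide describes.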
Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors (2)
The local maxima (peaks) of the filtered posteriogram are extracted and taken as estimates of underlying phonemes being present
The places of the peaks correspond to the center frames of the underlying phonemes: $P(\mathrm{phn}_i, \mathrm{frame}_j \mid \mathrm{observation})$
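Peak extraction can be sketched as a simple local-maximum search over one filtered posterior trajectory. The function name `local_maxima` and the optional `floor` parameter are hypothetical; the slides do not say how plateaus or edge frames are handled, so the rule below is an assumption.

```python
def local_maxima(filtered, floor=0.0):
    """Return (frame, value) pairs where the trajectory has a local peak.

    A frame counts as a peak if it exceeds `floor`, is strictly higher than
    its left neighbor and at least as high as its right neighbor.
    The first and last frames are never reported.
    """
    peaks = []
    for t in range(1, len(filtered) - 1):
        value = filtered[t]
        if value > floor and value > filtered[t - 1] and value >= filtered[t + 1]:
            peaks.append((t, value))
    return peaks
```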
Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors (3)
Matched filter bank, estimated from 30,000 phonemes of the training data (English numbers)
Filter lengths are 41 samples (210 ms processing delay)
Step 3: From phoneme estimates to words
Method 1: A posterior threshold is applied for phoneme estimates
An alarm is set for a correct stream of phonemes
Minimum and maximum intervals between phonemes are defined from the training data
Only the primary lexical form of each word is searched
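Method 1 can be sketched as follows, assuming the phoneme estimates are available as (frame, posterior) peaks per phoneme and the allowed intervals as (min, max) frame gaps between consecutive keyword phonemes. The function name `spot_keyword` and the data layout are hypothetical, and the exhaustive chain search is just one simple way to find a correct stream of phonemes.

```python
def spot_keyword(peaks, phonemes, gaps, threshold):
    """Return (first_frame, last_frame) for each detected keyword.

    peaks:     dict mapping phoneme -> list of (frame, posterior) estimates.
    phonemes:  the keyword's primary lexical form, e.g. ["w", "ah", "n"].
    gaps:      (min_gap, max_gap) in frames between consecutive phonemes.
    threshold: posterior threshold applied to every phoneme estimate.
    """
    # Keep only the high-confidence estimates of the keyword's phonemes.
    strong = {
        p: sorted(frame for frame, post in peaks.get(p, []) if post >= threshold)
        for p in set(phonemes)
    }
    alarms = []

    def extend(index, first_frame, last_frame):
        if index == len(phonemes):
            alarms.append((first_frame, last_frame))  # full sequence found
            return
        min_gap, max_gap = gaps[index - 1]
        for frame in strong[phonemes[index]]:
            if min_gap <= frame - last_frame <= max_gap:
                extend(index + 1, first_frame, frame)

    for frame in strong[phonemes[0]]:
        extend(1, frame, frame)
    return alarms
```

Raising the threshold removes weak phoneme estimates, so a keyword with one weak phoneme is missed while false alarms become rarer.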
Experiments
Two telephone corpora were used [Col94, Col95].
The MLP was trained to estimate the posterior probabilities of 28 English phonemes + silence (numbers from zero to ninety-nine)
A separate keyword spotter was implemented for all digits from zero to nine, with only the primary lexical forms
Results were compared to time-aligned phonemic labeling, and all legal pronunciations were treated as true alarms
Results – Experiment 1 (phoneme estimates only)
Keyword spotting results (FOM) from spotting digits in the stream of other digits (OGI-Numbers95), experiment 1 (only phoneme estimates)
Two main reasons for differences in performance:
1. Some phonemes are more prone to classification errors
2. The probability that a keyword is mixed with another word is not constant
Introduction of phoneme transition probability
Introduction of a confidence measure that tells whether there are extraneous phonemes between two keyword phonemes, the phoneme transition probability: $P(\mathrm{phn}_1 \to \mathrm{phn}_2)$
Phoneme transition probability is estimated using:
• Strategy 1: the height of the crossing point of the posterior trajectories of the corresponding phonemes
• Strategy 2: the height of the crossing point of the filtered posterior trajectories
• Strategy 3: one minus the minimum of the sum of the posteriors of the corresponding phonemes, between the phoneme estimates
New method for Step 3 (with transition probabilities):
The posterior threshold is applied to the product of phoneme and transition estimates:
$$\mathrm{Confidence}(\mathrm{keyword}) = P(\mathrm{phn}_1)\, P(\mathrm{phn}_1 \to \mathrm{phn}_2)\, P(\mathrm{phn}_2) \cdots$$
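A minimal sketch of Strategy 1 and of the combined confidence follows. It assumes the crossing point is located where the first phoneme's posterior drops below the second's between the two phoneme estimates, with linear interpolation between frames. The function names and the fallback value of 0.0 when no crossing exists are assumptions.

```python
import math


def crossing_height(post_a, post_b, frame_a, frame_b):
    """Height at which the trajectory of phoneme a crosses that of phoneme b
    between their estimated center frames (frame_a < frame_b)."""
    for t in range(frame_a, frame_b):
        d0 = post_a[t] - post_b[t]
        d1 = post_a[t + 1] - post_b[t + 1]
        if d0 >= 0 and d1 < 0:
            # Linear interpolation of the crossing inside [t, t + 1).
            w = d0 / (d0 - d1)
            return post_a[t] + w * (post_a[t + 1] - post_a[t])
    return 0.0  # assumed fallback: no crossing means no direct transition


def keyword_confidence(phoneme_posteriors, transition_probabilities):
    """Product of the phoneme estimates and the transition estimates."""
    return math.prod(phoneme_posteriors) * math.prod(transition_probabilities)
```

A high crossing point suggests the two phonemes follow each other directly, while a low one suggests that some other phoneme holds the probability mass in between.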
Results – Experiment 2 (Phoneme and transition estimates)
Keyword spotting results (FOM) from spotting digits in the stream of other digits (OGI-Numbers95), experiment 2 (with phoneme transition probability estimates)
The average increase in FOM compared to the first experiment is 5.6%
Only small differences between the different strategies of deriving the phoneme transition estimates.
ROC curve – ’zero’
ROC curve - ’eight’
Conclusions
A theoretical framework for keysound spotting was introduced and used to spot digits.
Besides keyword spotting, the proposed processing can be applied in:
• Phoneme detection (experimented in the thesis)
• Event spotting in general
This approach has no garbage model, and no dynamic programming techniques or HMMs are used
Benefits from looking only at the target sounds:
• Independence from vocabulary
• Some independence from language
• Less need for training the models
• Simple and fast
Relies on reliable phoneme estimates
Quite robust to the choice of matched filter and phoneme sequence search technique
High variance in results between different words
Short phonemes yield weaker estimates
Room to improve the performance:
• Treat closure forms of plosive phonemes
• Look for all the possible pronunciation forms
• Use the non-keyword phoneme estimates to extract complementary information
• Introduce prior lexical knowledge
Questions?
[Jun96] Junqua, J.-C., Haton, J.-P.: Robustness in Automatic Speech Recognition: Fundamentals and Applications. Dordrecht, The Netherlands, Kluwer Academic Publishers, 1996.
[Roh89] Rohlicek, J., Russell, W., Roukos, S., Gish, H.: Continuous Hidden Markov Modeling for Speaker-Independent Word Spotting. In ICASSP 89, pp. 627-630, 1989.
[Ros90] Rose, R., Paul, D.: A Hidden Markov Model Based Keyword Recognition System. In Proceedings of ICASSP 90, pp. 129-132, Albuquerque, New Mexico, United States, 1990.
[Szö05] Szöke, I., Schwarz, P., Matejka, P., Burget, L., Fapso, M., Karafiát, M., Cernocký, J.: Comparison of Keyword Spotting Approaches for Informal Continuous Speech. In MLMI 05, Edinburgh, United Kingdom, July 2005.
[Col94] Cole, R. et al.: Telephone Speech Corpus Development at CSLU. In Proceedings of ICSLP '94, pp. 1815-1818, Yokohama, Japan, 1994.
[Col95] Cole, R. et al.: New Telephone Speech Corpora at CSLU. In Proceedings of Eurospeech '95, pp. 821-824, Madrid, Spain, 1995.
Lehtonen, M., Fousek, P., Hermansky, H.: A Hierarchical Approach for Spotting Keywords. In 2nd Workshop on Multimodal Interaction and Related Machine Learning Algorithms – MLMI 05, Edinburgh, United Kingdom, July 2005.
Appendix: Application to phoneme detection
The phoneme estimates of Step 2 were used in phoneme detection
The phoneme stream was estimated by counting all the phoneme estimates over a threshold, with different threshold values
Results were estimated in terms of substitutions (S), insertions (I) and deletions (D)
For example (N = Number of phonemes in labeling):
Labeled:        s   eh  v   ah  n   f   ay  v
Recognized: sil n   eh  v       n   f   ay  v
Operation:  I   S           D
$$\mathrm{Accuracy} = \frac{N - S - I - D}{N} \cdot 100\,\% = \frac{8 - 1 - 1 - 1}{8} \cdot 100\,\% = 62.5\,\%$$
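The accuracy in the example above can be reproduced with a short sketch that counts substitutions, insertions and deletions as the minimum edit distance between the labeled and recognized phoneme streams. Using the minimum-cost alignment with unit costs is an assumption; the slides do not say how the alignment was obtained, and the function name `phoneme_accuracy` is hypothetical.

```python
def phoneme_accuracy(labeled, recognized):
    """Accuracy in percent: 100 * (N - S - I - D) / N, where S + I + D is
    the minimum number of edits turning `labeled` into `recognized`."""
    n, m = len(labeled), len(recognized)
    # dist[i][j] = minimum edits between labeled[:i] and recognized[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # i deletions
    for j in range(m + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if labeled[i - 1] == recognized[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + mismatch,  # match or substitution
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
            )
    return 100.0 * (n - dist[n][m]) / n
```

Because insertions are counted as errors, the accuracy can become negative when far more phonemes are reported than exist in the labeling, which happens at very low thresholds.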
Appendix: Application to phoneme detection (cont)
Results from phoneme detection:
Threshold Accuracy
0.01 -93.21 %
0.05 28.57 %
0.10 54.35 %
0.15 64.37 %
0.20 69.44 %
0.25 71.24 %
0.30 70.50 %
0.35 68.12 %
0.40 64.57 %
Taking into account also the transition probabilities yielded 73.15 % accuracy.
State-of-the-art phoneme recognition accuracy for unrestricted speech is 67% - 77%.
[Figure: phoneme detection accuracy (%, 0-100) as a function of the threshold (0.05-0.4)]
Appendix: System diagram
Appendix: Conclusions (table)
Step 1 (from acoustic stream to phoneme posteriors)
What affects/determines the performance:
• Phoneme's proneness to classification errors
• Phoneme's duration (longer phonemes yield stronger posteriors)
Places for improvement:
• To treat the closure form phonemes
Step 2 (from frame-based posteriors to phoneme-spaced posteriors)
What affects/determines the performance:
• How the matched filter models the duration of the phoneme
Places for improvement:
• To adapt the filter lengths more precisely to the phoneme durations (e.g. through speech rate)
Step 3 (from phoneme estimates to words)
What affects/determines the performance:
• How well the keyword's phonemes differentiate the keyword from the background
• How the single phoneme estimates are combined to a word estimate
• The length of the keyword
Places for improvement:
• To extract complementary information from the non-keyword phonemes to avoid false alarms
Appendix: false alarms from similar phoneme streams
The approach (method 1 in step 3) does not check that the detected phoneme stream is the complete underlying stream
Problem: False alarms
Example label: .. s eh v ah n w ah n ..
Example label: .. t r uw th ..
Solution: Make sure there are no extra phonemes between two keyword phonemes, by looking only at the target sounds
[Figure: the examples annotated with the keywords "nine" and "two" and the question "Extraneous phoneme?"]
Appendix: Phoneme intervals
Histograms of distances (in 10 ms frames) between phonemes of the word "one" (w – ah, ah – n and w – n).
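The minimum and maximum intervals used in Step 3 could be read off such distances as in the sketch below. It assumes each training occurrence of the keyword is given as a mapping from phoneme to center frame and simply takes the observed extremes; the slides do not say whether the raw extremes or some trimmed range of the histogram were used, and the function name `interval_limits` is hypothetical.

```python
def interval_limits(training_examples, first, second):
    """Smallest and largest distance (in 10 ms frames) observed between
    the center frames of two phonemes of a keyword.

    training_examples: list of dicts mapping phoneme -> center frame,
                       one dict per keyword occurrence in the training data.
    """
    distances = [example[second] - example[first] for example in training_examples]
    return min(distances), max(distances)
```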
Appendix: Average and variance filters
Appendix: Hard case - weak posteriors and classification error