
Hierarchical Approach for Spotting Keywords from an Acoustic Stream

Supervisor: Professor Raimo Kantola

Instructor: Professor Hynek Hermansky, IDIAP Research Institute

8.11.2005

Introduction to the thesis

Existing keyword spotting approaches are usually based on speech recognition techniques

Growing apart from the original problem can lead to drawbacks, like lack of generality

Another approach is presented and studied, where only the target sounds of the keyword are looked for

To study and formulate this approach was my work at IDIAP Research Institute, 3/2005 - 8/2005

Objective of the thesis: to see how far we can go without using hidden Markov models and dynamic programming techniques


Outline

Introduction to keyword spotting

Motivation for this work

Steps of hierarchical processing

Experiments

Conclusions


Keyword Spotting

Keyword Spotting (KWS) aims at finding only certain words while rejecting the rest (hypothesis – test)

Finding only certain rare, high-information words is a feasible approach in, for example, voice-command-driven applications or multimedia indexing

Picture from [Jun96]


Performance measures for keyword spotting

The possible events in keyword spotting are hit, false alarm and miss

The performance is evaluated by presenting the detection rate as a function of the false alarm rate

This yields the receiver operating characteristic (ROC) curve

The average detection rate over the range of 0-10 false alarms per hour is called the figure of merit (FOM) [Roh89]

[Figure: ROC curve. x-axis: False Alarms / Hour; y-axis: Keywords detected / %]
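As a rough illustration of the figure of merit, the sketch below averages the detection rate over 0-10 false alarms per hour by integrating a step-shaped ROC curve. The function name, the (score, is_hit) input format and the step-wise integration are assumptions made for this example; the exact procedure in [Roh89] may differ, for instance in how it interpolates between operating points.

```python
def figure_of_merit(detections, n_true, hours, max_fa_per_hour=10.0):
    """Average detection rate (0..1) over 0..max_fa_per_hour false alarms/hour.

    detections: list of (score, is_hit) pairs, one per putative keyword hit
                (hypothetical input format chosen for this sketch).
    n_true:     number of true keyword occurrences in the test data.
    hours:      duration of the test data in hours.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    hits = 0
    false_alarms = 0
    area = 0.0
    prev_fa_rate = 0.0
    for _score, is_hit in ranked:
        if is_hit:
            hits += 1
            continue
        false_alarms += 1
        fa_rate = min(false_alarms / hours, max_fa_per_hour)
        # The detection rate reached so far holds up to this false alarm.
        area += (hits / n_true) * (fa_rate - prev_fa_rate)
        prev_fa_rate = fa_rate
        if fa_rate >= max_fa_per_hour:
            break
    # Extend the last detection rate to the end of the 0..max range.
    area += (hits / n_true) * (max_fa_per_hour - prev_fa_rate)
    return area / max_fa_per_hour
```

Lowering the detection threshold corresponds to walking down the ranked list, so each false alarm closes one step of the ROC curve.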


LVCSR / HMM based approaches

Typical large vocabulary continuous speech recognition (LVCSR) / hidden Markov model (HMM) based KWS approaches model both keywords and non-keywords (background or garbage)

Keywords are searched by using dynamic programming techniques

Keyword spotting network from [Roh89].

[Figure: An example of dynamic programming: the optimal alignment between X = x1 ... xN and Y = y1 ... yM.]
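To make the idea of an optimal alignment between two sequences concrete, here is a minimal dynamic time warping sketch. It is a generic textbook example, not the Viterbi search of the keyword spotting network in [Roh89]; the function name and the absolute-difference local cost are assumptions chosen for illustration.

```python
def dtw_cost(x, y):
    """Cost of the optimal monotonic alignment between sequences x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = best cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # x advances, y[j-1] is reused
                cost[i][j - 1],      # y advances, x[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Each cell reuses the best cost of its three predecessors, which is what makes the search over all alignments tractable.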


LVCSR / HMM based approaches vs. hypothesis test approach

LABEL: um ... okay, uh ... please open the, uh ... window

Spot1: ---------------------------1111-------------------

Spot2: --------------------------------------------111111

Recog: garbage garbage garbage - OPEN – garbage – WINDOW

[Figure: hypothesis test for word 1 (Yes / No) as a function of time]


Motivation for this work

Typical LVCSR / HMM based approaches require a garbage model for Viterbi dynamic programming

The better the garbage model, the better the keyword spotting performance [Ros90] ... and the closer the system is to LVCSR

Use of LVCSR techniques can introduce: task dependency and lack of generality; computational load and complexity; need for training data; off-line operating mode; complexity in adding keywords

How far can we go by looking only at the keysounds?


Hierarchical approach for spotting keywords

Key sounds (words) are spotted by looking for the target sounds (phonemes) that form the key sound.

STEP 1: Estimate equally sampled phoneme posteriors

STEP 2: Derive phoneme-spaced posterior estimates

STEP 3: Search for the right sequences of high-confidence phonemes

ALARM


Step 1: From acoustic stream to phoneme posteriors

TRAP-NN system: Feature extraction from 2-D filtering of the critical band spectrogram, using 1010 ms long temporal patterns (TRAPs)

Features are fed to a trained neural net (NN) vector classifier that returns estimates of phoneme posterior probabilities every 10 ms

TRAP-NN was successfully used in [Szö05] for phoneme-based keyword spotting
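The sketch below only illustrates what a 1010 ms temporal pattern looks like as data: with one frame every 10 ms, it is a 101-frame window of one critical band, centered on the current frame. The function name, the array layout and the edge padding are assumptions for this example; the 2-D filtering and the neural net of the actual TRAP-NN system are not reproduced here.

```python
import numpy as np

def temporal_patterns(spectrogram, context=50):
    """Cut one temporal pattern per frame and band.

    spectrogram: array of shape (n_frames, n_bands), one frame per 10 ms
                 (assumed layout).
    context:     frames on each side; 50 gives 101 frames = 1010 ms.
    Returns an array of shape (n_frames, n_bands, 2 * context + 1).
    """
    spec = np.asarray(spectrogram, dtype=float)
    # Repeat the first and last frame so that edge frames get full windows.
    padded = np.pad(spec, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    return np.stack([padded[t:t + width].T for t in range(spec.shape[0])])
```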


Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors

Phonemes are found by filtering the posteriogram with a bank of matched filters

Matched filters are obtained by averaging 0.5 s long segments of phoneme trajectories

The purpose of filtering is to have one peak per phoneme
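A minimal sketch of the matched filtering, assuming a single phoneme's posterior trajectory and known phoneme center frames from labeled training data. The function names, the unit-sum normalization and the handling of edge segments are assumptions made for this illustration, not details taken from the thesis.

```python
import numpy as np

def estimate_matched_filter(trajectory, centers, length=41):
    """Average fixed-length segments of a posterior trajectory around centers."""
    traj = np.asarray(trajectory, dtype=float)
    half = length // 2
    segments = [
        traj[c - half:c + half + 1]
        for c in centers
        if c - half >= 0 and c + half + 1 <= len(traj)  # skip cut-off segments
    ]
    if not segments:
        raise ValueError("no complete segment available for averaging")
    kernel = np.mean(segments, axis=0)
    total = kernel.sum()
    # Unit-sum scaling (an assumption) keeps the output in posterior-like range.
    return kernel / total if total > 0 else kernel

def apply_matched_filter(trajectory, kernel):
    """Correlate the trajectory with the filter; output has the same length."""
    return np.correlate(np.asarray(trajectory, dtype=float), kernel, mode="same")
```

With an odd filter length, the "same" mode keeps the filter center aligned with the current frame, so a phoneme-shaped bump produces a single peak at its center.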


Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors (2)

The local maxima (peaks) of the filtered posteriogram are extracted and taken as estimates of underlying phonemes being present

The positions of the peaks correspond to the center frames of the underlying phonemes: $P(\mathrm{phn}_i, \mathrm{frame}_j \mid \mathrm{observation})$
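Peak extraction can be sketched as a search for local maxima of one filtered trajectory. The function name, the optional `min_height` parameter and the tie handling for flat tops are assumptions for this example.

```python
import numpy as np

def pick_peaks(filtered, min_height=0.0):
    """Return (frame, value) pairs for local maxima above min_height."""
    x = np.asarray(filtered, dtype=float)
    middle = x[1:-1]
    # Strictly higher than the left neighbour, at least as high as the right.
    is_peak = (middle > x[:-2]) & (middle >= x[2:]) & (middle > min_height)
    frames = np.flatnonzero(is_peak) + 1
    return [(int(f), float(x[f])) for f in frames]
```

Each returned pair plays the role of one phoneme estimate: the frame is taken as the phoneme center and the value as its confidence.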


Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors (3)

Matched filter bank, estimated from 30,000 phonemes of the training data (English numbers)

Filter lengths are 41 samples (210 ms processing delay)


Step 3: From phoneme estimates to words

Method 1: A posterior threshold is applied to the phoneme estimates

An alarm is raised for a correct sequence of phonemes

Minimum and maximum intervals between phonemes are defined from the training data

Only the primary lexical form of each word is searched

Threshold
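Method 1 can be sketched as follows: keep only the phoneme estimates above the threshold, then chain them in keyword order while checking the allowed interval between consecutive phonemes. The data layout, the function name and the example numbers are hypothetical; in the thesis the interval limits are derived from training data.

```python
def spot_keyword(peaks, phonemes, threshold, gaps):
    """Find phoneme-estimate sequences that match a keyword.

    peaks:     dict mapping phoneme -> list of (frame, posterior) estimates.
    phonemes:  phoneme sequence of the keyword (primary lexical form).
    threshold: minimum posterior for an estimate to be considered.
    gaps:      list of (min_frames, max_frames), one per consecutive pair.
    Returns a list of alarms, each a list of phoneme center frames.
    """
    strong = {
        p: [f for f, post in peaks.get(p, []) if post >= threshold]
        for p in set(phonemes)
    }
    paths = [[f] for f in strong[phonemes[0]]]
    for phn, (min_gap, max_gap) in zip(phonemes[1:], gaps):
        paths = [
            path + [f]
            for path in paths
            for f in strong[phn]
            if min_gap <= f - path[-1] <= max_gap
        ]
    return paths
```

For example, with estimates for w, ah and n at frames 10, 18 and 27, the keyword "one" raises one alarm as long as all three posteriors exceed the threshold and both intervals fall within the limits.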


Experiments

Two telephone corpora were used [Col94, Col95]:

The MLP was trained to estimate the posterior probabilities of 28 English phonemes + silence (numbers from zero to ninety-nine)

A separate keyword spotter was implemented for all digits from zero to nine, with only the primary lexical forms

Results were compared to time-aligned phonemic labeling, and all legal pronunciations were treated as true alarms


Results – Experiment 1 (phoneme estimates only)

Keyword spotting results (FOM) from spotting digits in the stream of other digits (OGI-Numbers95), experiment 1 (only phoneme estimates)

Two main reasons for differences in performance:

1. Some phonemes are more prone to classification errors

2. The probability that a keyword is confused with another word is not constant


Introduction of phoneme transition probability

Introduction of a confidence measure that tells whether there are extraneous phonemes between two phonemes: the phoneme transition probability $P(\mathrm{phn}_1 \rightarrow \mathrm{phn}_2)$

Phoneme transition probability is estimated using:

Strategy 1: the height of the crossing point of the posterior trajectories of the corresponding phonemes

Strategy 2: the height of the crossing point of the filtered posterior trajectories

Strategy 3: one minus the minimum of the sum of the posteriors of the corresponding phonemes, between the phoneme estimates

New method for Step 3 (with transition probabilities):

The posterior threshold is applied to the product of phoneme and transition estimates:

$\mathrm{Confidence}(\mathrm{keyword}) = P(\mathrm{phn}_1)\, P(\mathrm{phn}_1 \rightarrow \mathrm{phn}_2)\, P(\mathrm{phn}_2) \cdots$
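A minimal sketch of Strategy 1 and of the product of phoneme and transition estimates. On a 10 ms frame grid the crossing point of a falling and a rising trajectory is approximated here by the largest frame-wise minimum of the two posteriors; this approximation and the function names are assumptions for the example.

```python
import numpy as np

def transition_estimate(post_a, post_b, frame_a, frame_b):
    """Approximate crossing-point height between two phoneme estimates.

    post_a, post_b:   posterior trajectories of the two phonemes.
    frame_a, frame_b: center frames of the two estimates (frame_a < frame_b).
    """
    seg_a = np.asarray(post_a[frame_a:frame_b + 1], dtype=float)
    seg_b = np.asarray(post_b[frame_a:frame_b + 1], dtype=float)
    # Where the trajectories cross, min(a, b) reaches its maximum.
    return float(np.max(np.minimum(seg_a, seg_b)))

def keyword_confidence(phoneme_estimates, transition_estimates):
    """Product of all phoneme and transition estimates of one candidate."""
    return float(np.prod(phoneme_estimates) * np.prod(transition_estimates))
```

If another phoneme sits between the two estimates, both posteriors drop there, the crossing point is low, and the product falls below the threshold.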


Results – Experiment 2 (Phoneme and transition estimates)

Keyword spotting results (FOM) from spotting digits in the stream of other digits (OGI-Numbers95), experiment 2 (with phoneme transition probability estimates)

The average increase in FOM compared to the first experiment is 5.6%

Only small differences between the different strategies of deriving the phoneme transition estimates.


ROC curve – ’zero’


ROC curve - ’eight’


Conclusions

A theoretical framework for keysound spotting was introduced and used to spot digits.

Besides keyword spotting, the proposed processing can be applied in: phoneme detection (experimented in the thesis); event spotting in general

This approach has no garbage model and no dynamic programming techniques or HMMs are used

Benefits from looking only at the target sounds: independence from vocabulary; some independence from language; less need for training the models; simple and fast

Relies on reliable phoneme estimates

Quite robust to the choice of matched filter and phoneme sequence search technique

High variance in results between different words

Short phonemes yield weaker estimates

Room to improve the performance: treat closure forms of plosive phonemes; look for all the possible pronunciation forms; use the non-keyword phoneme estimates to extract complementary information; introduce prior lexical knowledge


Questions?

[Jun96] Junqua, J.-C., Haton, J.-P.: Robustness in Automatic Speech Recognition, Fundamentals and Applications. Dordrecht, The Netherlands, Kluwer Academic Publishers, 1996.

[Roh89] Rohlicek, J., Russell, W., Roukos, S., Gish, H.: Continuous Hidden Markov Modeling for Speaker-Independent Word-Spotting. In ICASSP 89, pp. 627-630, 1989.

[Ros90] Rose, R., Paul, D.: A Hidden Markov Model Based Keyword Recognition System. In Proceedings of ICASSP 90, pp. 129-132, Albuquerque, New Mexico, United States, 1990.

[Szö05] Szöke, I., Schwarz, P., Matejka, P., Burget, L., Fapso, M., Karafiát, M., Cernocký, J.: Comparison of Keyword Spotting Approaches for Informal Continuous Speech. In MLMI 05, Edinburgh, United Kingdom, July 2005.

[Col94] Cole, R. et al.: Telephone Speech Corpus Development at CSLU. In Proceedings of ICSLP '94, pp. 1815-1818, Yokohama, Japan, 1994.

[Col95] Cole, R. et al.: New Telephone Speech Corpora at CSLU. In Proceedings of Eurospeech '95, pp. 821-824, Madrid, Spain, 1995.

Lehtonen, M., Fousek, P., Hermansky, H.: A Hierarchical Approach for Spotting Keywords. In 2nd Workshop on Multimodal Interaction and Related Machine Learning Algorithms – MLMI 05, Edinburgh, United Kingdom, July 2005.


Appendix: Application to phoneme detection

The phoneme estimates of Step 2 were used in phoneme detection

The phoneme stream was estimated by counting all the phoneme estimates over a threshold, with different threshold values

Results were evaluated in terms of substitutions (S), insertions (I) and deletions (D)

For example (N = Number of phonemes in labeling):

Labeled:     -    s   eh  v   ah  n   f   ay  v
Recognized:  sil  n   eh  v   -   n   f   ay  v
Operation:   I    S           D

$\mathrm{Accuracy} = 100\,\% \cdot \frac{N - S - I - D}{N} = 100\,\% \cdot \frac{8 - 1 - 1 - 1}{8} = 62.5\,\%$
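The accuracy of the example can be checked with a short sketch. Since the accuracy depends only on the total S + I + D, it can be computed from the edit (Levenshtein) distance between the labeled and recognized phoneme sequences; the function names are assumptions for this illustration.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, insertions and deletions."""
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete all remaining reference phonemes
    for j in range(m + 1):
        dist[0][j] = j  # insert all remaining hypothesis phonemes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                # deletion
                dist[i][j - 1] + 1,                # insertion
                dist[i - 1][j - 1] + substitution  # match or substitution
            )
    return dist[n][m]

def phoneme_accuracy(reference, hypothesis):
    """Accuracy in percent: 100 * (N - S - I - D) / N."""
    n = len(reference)
    return 100.0 * (n - edit_distance(reference, hypothesis)) / n

labeled = "s eh v ah n f ay v".split()
recognized = "sil n eh v n f ay v".split()
print(phoneme_accuracy(labeled, recognized))  # 62.5
```

Because insertions are counted as errors, the accuracy becomes negative when the number of errors exceeds N, which is what happens at very low thresholds.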


Appendix: Application to phoneme detection (cont)

Results from phoneme detection:

Threshold   Accuracy
0.01        -93.21 %
0.05         28.57 %
0.10         54.35 %
0.15         64.37 %
0.20         69.44 %
0.25         71.24 %
0.30         70.50 %
0.35         68.12 %
0.40         64.57 %

Taking into account also the transition probabilities yielded 73.15 % accuracy.

State-of-the-art phoneme recognition accuracy for unrestricted speech is 67% - 77%.

[Figure: Accuracy (0-100 %) as a function of Threshold (0.05-0.4)]


Appendix: System diagram


Appendix: Conclusions (table)

Step 1 (from acoustic stream to phoneme posteriors)

What affects/determines the performance:

• Phoneme’s proneness to classification errors

• Phoneme’s duration (longer phonemes yield stronger posteriors)

Places for improvement:

• To treat the closure form phonemes

Step 2 (from frame-based posteriors to phoneme-spaced posteriors)

What affects/determines the performance:

• How the matched filter models the duration of the phoneme

Places for improvement:

• To adapt the filter lengths more precisely to the phoneme durations (e.g. through speech rate)

Step 3 (from phoneme estimates to words)

What affects/determines the performance:

• How well the keyword’s phonemes differentiate the keyword from the background

• How the single phoneme estimates are combined into a word estimate

• The length of the keyword

Places for improvement:

• To extract complementary information from the non-keyword phonemes to avoid false alarms


Appendix: false alarms from similar phoneme streams

The approach (method 1 in step 3) does not check that the detected phoneme stream is the complete underlying stream

Problem: False alarms

Example Label: .. s eh v ah n w ah n ..

Example Label: .. t r uw th ..

Solution: Make sure there are no extra phonemes between two keyword phonemes, by looking only at the target sounds

[Figure: keywords nine and two, each annotated with "Extraneous phoneme?"]


Appendix: Phoneme intervals

Histograms of distances (in 10 ms frames) between the phonemes of the word one (w - ah, ah - n and w - n).


Appendix: Average and variance filters


Appendix: Hard case - weak posteriors and classification error
