Speech and Audio Research Laboratory of the SAIVT program
Centre for Built Environment and Engineering Research
ACOUSTIC KEYWORD SPOTTING
IN SPEECH WITH APPLICATIONS
TO DATA MINING
A. J. Kishan Thambiratnam
BE(Electronics)/BInfTech
SUBMITTED AS A REQUIREMENT OF
THE DEGREE OF
DOCTOR OF PHILOSOPHY
AT
QUEENSLAND UNIVERSITY OF TECHNOLOGY
BRISBANE, QUEENSLAND
9 MARCH 2005
Keywords
Keyword Spotting, Wordspotting, Data Mining, Audio Indexing, Keyword Verification, Confidence Scoring, Speech Recognition, Utterance Verification
Abstract
Keyword spotting is the task of detecting keywords of interest within continuous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unrestricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have suffered from poor detection rates, high false alarm rates or slow execution times, thus reducing their commercial viability.
This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the field of keyword spotting.
The first major contribution is the development of a novel keyword verification method named Cohort Word Verification. This method combines high-level linguistic information with cohort-based verification techniques to obtain dramatic improvements in verification performance, particularly for the problematic class of short-duration target words.
The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique augments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains a significant improvement in detection rate over lattice-based audio document indexing while still maintaining extremely fast search speeds.
The third major contribution is a study of multiple verifier fusion for the task of keyword verification. The reported experiments demonstrate that substantial improvements in verification performance can be obtained through the fusion of multiple keyword verifiers. The research focuses on combinations of speech background model based verifiers and cohort word verifiers.
The final major contribution is a comprehensive study of the effects of limited training data on keyword spotting. This study considers how these effects impact the immediate development and deployment of speech technologies for non-English languages.
Contents
Keywords i
Abstract iii
List of Tables xiii
List of Figures xvi
List of Abbreviations xxi
Authorship xxiii
Acknowledgments xxv
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Research Scope . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Thesis Organisation . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Major Contributions of this Research . . . . . . . . . . . . . . . . 6
1.4 List of Publications . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 A Review of Keyword Spotting 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 The keyword spotting problem . . . . . . . . . . . . . . . . . . . . 10
2.3 Applications of keyword spotting . . . . . . . . . . . . . . . . . . 11
2.3.1 Keyword monitoring applications . . . . . . . . . . . . . . 11
2.3.2 Audio document indexing . . . . . . . . . . . . . . . . . . 13
2.3.3 Command controlled devices . . . . . . . . . . . . . . . . . 13
2.3.4 Dialogue systems . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 The development of keyword spotting . . . . . . . . . . . . . . . . 15
2.4.1 Sliding window approaches . . . . . . . . . . . . . . . . . . 15
2.4.2 Non-keyword model approaches . . . . . . . . . . . . . . . 16
2.4.3 Hidden Markov Model approaches . . . . . . . . . . . . . . 17
2.4.4 Further developments . . . . . . . . . . . . . . . . . . . . . 17
2.5 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.1 The reference and result sets . . . . . . . . . . . . . . . . . 19
2.5.2 The hit operator . . . . . . . . . . . . . . . . . . . . . . . 19
2.5.3 Miss rate . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.4 False alarm rate . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.5 False acceptance rate . . . . . . . . . . . . . . . . . . . . . 21
2.5.6 Execution time . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.7 Figure of Merit . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.8 Equal Error Rate . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.9 Receiver Operating Characteristic Curves . . . . . . . . . . 24
2.5.10 Detection Error Trade-off Plots . . . . . . . . . . . . . . . 25
2.6 Unconstrained vocabulary spotting . . . . . . . . . . . . . . . . . 26
2.6.1 HMM-based approach . . . . . . . . . . . . . . . . . . . . 26
2.6.2 Neural Network Approaches . . . . . . . . . . . . . . . . . 28
2.7 Approaches to non-keyword modeling . . . . . . . . . . . . . . . . 31
2.7.1 Speech background model . . . . . . . . . . . . . . . . . . 31
2.7.2 Phone models . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7.3 Uniform distribution . . . . . . . . . . . . . . . . . . . . . 34
2.7.4 Online garbage model . . . . . . . . . . . . . . . . . . . . 34
2.8 Constrained vocabulary spotting . . . . . . . . . . . . . . . . . . . 36
2.8.1 Language model approaches . . . . . . . . . . . . . . . . . 36
2.8.2 Event spotting . . . . . . . . . . . . . . . . . . . . . . . . 39
2.9 Keyword verification . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.9.1 A formal definition . . . . . . . . . . . . . . . . . . . . . . 42
2.9.2 Combining keyword spotting and verification . . . . . . . . 42
2.9.3 The problem of short duration keywords . . . . . . . . . . 43
2.9.4 Likelihood ratio based approaches . . . . . . . . . . . . . . 43
2.9.5 Alternate Information Sources . . . . . . . . . . . . . . . . 46
2.10 Audio Document Indexing . . . . . . . . . . . . . . . . . . . . . . 47
2.10.1 Limitations of the Speech-to-Text Transcription approach . . . . . 48
2.10.2 Reverse dictionary lookup searches . . . . . . . . . . . . . 49
2.10.3 Indexed reverse dictionary lookup searches . . . . . . . . . 51
2.10.4 Lattice based searches . . . . . . . . . . . . . . . . . . . . 53
3 HMM-based spotting and verification 57
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 The confusability circle framework . . . . . . . . . . . . . . . . . . 58
3.3 Analysis of non-keyword models . . . . . . . . . . . . . . . . . . . 60
3.3.1 All-speech models . . . . . . . . . . . . . . . . . . . . . . . 60
3.3.2 SBM methods . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.3 Phone-set methods . . . . . . . . . . . . . . . . . . . . . . 62
3.3.4 Target-word-excluding methods . . . . . . . . . . . . . . . 62
3.4 Evaluation of keyword spotting techniques . . . . . . . . . . . . . 63
3.4.1 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . 64
3.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.5 Tuning the phone set non-keyword model . . . . . . . . . . . . . . 68
3.6 Output score thresholding for SBM spotting . . . . . . . . . . . . . . 70
3.7 Performance across keyword length . . . . . . . . . . . . . . . . . 72
3.7.1 Evaluation sets . . . . . . . . . . . . . . . . . . . . . . . . 73
3.7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.8 HMM-based keyword verification . . . . . . . . . . . . . . . . . . 74
3.8.1 Evaluation set . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.8.2 Evaluation procedure . . . . . . . . . . . . . . . . . . . . . 77
3.8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.9 Discriminative background model KV . . . . . . . . . . . . . . . . 79
3.9.1 System architecture . . . . . . . . . . . . . . . . . . . . . . 79
3.9.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.10 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . 82
4 Cohort word keyword verification 85
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Foundational concepts . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2.1 Cohort-based scoring . . . . . . . . . . . . . . . . . . . . . 87
4.2.2 The use of language information . . . . . . . . . . . . . . . 88
4.3 Overview of the cohort word technique . . . . . . . . . . . . . . . 90
4.4 Cohort word set construction . . . . . . . . . . . . . . . . . . . . 92
4.4.1 The choice of dmin and dmax . . . . . . . . . . . . . . . . . 92
4.4.2 Cohort word set downsampling . . . . . . . . . . . . . . . 94
4.4.3 Distance function . . . . . . . . . . . . . . . . . . . . . . . 94
4.5 Classification approach . . . . . . . . . . . . . . . . . . . . . . . . 96
4.5.1 2-class classification approach . . . . . . . . . . . . . . . . 96
4.5.2 Hybrid N-class approach . . . . . . . . . . . . . . . . . . . 98
4.6 Summary of the cohort word algorithm . . . . . . . . . . . . . . . 100
4.7 Comparison of classifier approaches . . . . . . . . . . . . . . . . . 101
4.7.1 Evaluation set . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.7.2 Recogniser parameters . . . . . . . . . . . . . . . . . . . . 103
4.7.3 Cohort word selection . . . . . . . . . . . . . . . . . . . . 103
4.7.4 Evaluation procedure . . . . . . . . . . . . . . . . . . . . . 104
4.7.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.8 Performance across target keyword length . . . . . . . . . . . . . 106
4.8.1 Evaluation set . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.8.2 Recogniser parameters . . . . . . . . . . . . . . . . . . . . 107
4.8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.8.4 Analysis of poor 8-phone performance . . . . . . . . . . . . 110
4.8.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.9 Effects of selection parameters . . . . . . . . . . . . . . . . . . . . 113
4.9.1 Cohort word set downsampling . . . . . . . . . . . . . . . 114
4.9.2 Cohort word selection range . . . . . . . . . . . . . . . . . 116
4.9.3 MED cost parameters . . . . . . . . . . . . . . . . . . . . 119
4.9.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.10 Fused cohort word systems . . . . . . . . . . . . . . . . . . . . . . 122
4.10.1 Training dataset . . . . . . . . . . . . . . . . . . . . . . . 123
4.10.2 Neural network architecture . . . . . . . . . . . . . . . . . 123
4.10.3 Experimental procedure . . . . . . . . . . . . . . . . . . . 123
4.10.4 Baseline unfused results . . . . . . . . . . . . . . . . . . . 124
4.10.5 Fused SBM-CW experiments . . . . . . . . . . . . . . . . . 125
4.10.6 Fused CW-CW experiments . . . . . . . . . . . . . . . . . 128
4.10.7 Comparison of fused and unfused systems . . . . . . . . . 129
4.11 Conclusions and Summary . . . . . . . . . . . . . . . . . . . . . . 133
5 Dynamic Match Lattice Spotting 137
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.3 Dynamic Match Lattice Spotting method . . . . . . . . . . . . . . 140
5.3.1 Basic method . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.3.2 Optimised Dynamic Match Lattice Search . . . . . . . . . 145
5.4 Evaluation of DMLS performance . . . . . . . . . . . . . . . . . . 146
5.4.1 Evaluation set . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.4.2 Recogniser parameters . . . . . . . . . . . . . . . . . . . . 147
5.4.3 Lattice building . . . . . . . . . . . . . . . . . . . . . . . . 147
5.4.4 Query-time processing . . . . . . . . . . . . . . . . . . . . 148
5.4.5 Baseline systems . . . . . . . . . . . . . . . . . . . . . . . 149
5.4.6 Evaluation procedure . . . . . . . . . . . . . . . . . . . . . 150
5.4.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.5 Analysis of dynamic match rules . . . . . . . . . . . . . . . . . . . 152
5.5.1 System configurations . . . . . . . . . . . . . . . . . . . . 153
5.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.6 Analysis of DMLS algorithm parameters . . . . . . . . . . . . . . 156
5.6.1 Number of lattice generation tokens . . . . . . . . . . . . . 157
5.6.2 Pruning beamwidth . . . . . . . . . . . . . . . . . . . . . . 158
5.6.3 Number of lattice traversal tokens . . . . . . . . . . . . . . 159
5.6.4 MED cost threshold . . . . . . . . . . . . . . . . . . . . . 160
5.6.5 Tuned systems . . . . . . . . . . . . . . . . . . . . . . . . 162
5.6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 163
5.7 Conversational telephone speech experiments . . . . . . . . . . . . . . 165
5.7.1 Evaluation set . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.7.2 Recogniser parameters . . . . . . . . . . . . . . . . . . . . 165
5.7.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.8 Non-destructive optimisations . . . . . . . . . . . . . . . . . . . . 168
5.8.1 Prefix sequence optimisation . . . . . . . . . . . . . . . . . 169
5.8.2 Early stopping optimisation . . . . . . . . . . . . . . . . . 171
5.8.3 Combining optimisations . . . . . . . . . . . . . . . . . . . 173
5.9 Optimised system timings . . . . . . . . . . . . . . . . . . . . . . 174
5.9.1 Experimental procedure . . . . . . . . . . . . . . . . . . . 175
5.9.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6 Non-English Spotting 181
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.2 The issue of limited resources . . . . . . . . . . . . . . . . . . . . 182
6.3 The role of keyword spotting . . . . . . . . . . . . . . . . . . . . . 184
6.4 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.4.1 Database design . . . . . . . . . . . . . . . . . . . . . . . . 185
6.4.2 Model architectures . . . . . . . . . . . . . . . . . . . . . . 186
6.4.3 Evaluation set design . . . . . . . . . . . . . . . . . . . . . 188
6.4.4 Evaluation procedure . . . . . . . . . . . . . . . . . . . . . 188
6.5 English and Spanish stage 1 evaluations . . . . . . . . . . . . . . 189
6.6 English and Spanish post keyword verification . . . . . . . . . . . . . 192
6.7 Indonesian spotting and verification . . . . . . . . . . . . . . . . . 197
6.8 Extrapolating Indonesian performance . . . . . . . . . . . . . . . 198
6.9 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . 200
7 Summary, Conclusions and Future Work 203
7.1 HMM-based Spotting and Verification . . . . . . . . . . . . . . . 203
7.1.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 203
7.1.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 204
7.2 Cohort Word Verification . . . . . . . . . . . . . . . . . . . . . . . 205
7.2.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 205
7.2.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.3 Dynamic Match Lattice Spotting . . . . . . . . . . . . . . . . . . 206
7.3.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.3.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 208
7.4 Non-English Spotting . . . . . . . . . . . . . . . . . . . . . . . . . 208
7.4.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 208
7.5 Final Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Bibliography 210
A The Levenshtein Distance 217
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
A.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
A.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
List of Tables
3.1 Keyword spotting performance of baseline systems on Switchboard
1 data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2 Effect of target word insertion penalty on PM-KS performance . . 69
3.3 Equal error rates of unnormalised and duration normalised output
score thresholding applied to SBM-KS . . . . . . . . . . . . . . . 71
3.4 Details of phone-length dependent evaluation sets . . . . . . . . . 73
3.5 SBM-KS performance on Switchboard 1 data for different phone-length target words . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.6 Statistics for keyword verification evaluation sets . . . . . . . . . . 77
3.7 Equal error rates for SBM-based keyword verification . . . . . . . 78
3.8 Equal error rates for SBM and MLP-SBM keyword verification . . 82
4.1 Evaluated cohort word selection parameters . . . . . . . . . . . . 103
4.2 Performance of selected cohort word KV systems on the TIMIT evaluation set. Cohort word systems are qualified with the appropriate cohort word selection parameters using a tag in the format dmin, dmax, ψd, ψi. . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.3 Performance of SBM-KV and selected cohort word systems on the SWB1 evaluation sets. Cohort word selection parameters are specified with each system in the format dmin, dmax, ψd, ψi. . . . . . . . 108
4.4 Mean and standard deviation of the number of cohort words used in the 3 best performing cohort word KV methods for the SWB1 evaluation set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5 Performance of baseline SBM-KV and best cohort word systems
on the SWB1 evaluation sets . . . . . . . . . . . . . . . . . . . . . 124
4.6 Performance of the best fused SBM-cohort systems on the SWB1
evaluation sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.7 Performance of the best fused cohort-cohort systems on the SWB1
evaluation sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.8 Correlation analysis of fused EER and individual unfused EER . . 130
4.9 Summary of best performing systems . . . . . . . . . . . . . . . . 135
5.1 Phone substitution costs for DMLS . . . . . . . . . . . . . . . . . 149
5.2 Baseline keyword spotting results evaluated on TIMIT . . . . . . 151
5.3 TIMIT performance when isolating various DP rules . . . . . . . 154
5.4 Effect of adjusting number of lattice generation tokens . . . . . . 157
5.5 Effect of adjusting pruning beamwidth . . . . . . . . . . . . . . . 158
5.6 Effect of adjusting number of traversal tokens . . . . . . . . . . . 160
5.7 Effect of adjusting MED cost threshold Smax . . . . . . . . . . . . 161
5.8 Optimised DMLS configurations evaluated on TIMIT . . . . . . . 163
5.9 Keyword spotting results on SWB1 . . . . . . . . . . . . . . . . . 166
5.10 Relative speeds of optimised DMLS systems . . . . . . . . . . . . 176
5.11 Performance of a fully optimised DMLS system on Switchboard data . . 177
5.12 Summary of key results . . . . . . . . . . . . . . . . . . . . . . . . 179
6.1 Summary of training data sets . . . . . . . . . . . . . . . . . . . . 186
6.2 Codes used to refer to model architectures . . . . . . . . . . . . . 187
6.3 Summary of evaluation data sets. . . . . . . . . . . . . . . . . . . 188
6.4 Stage 1 spotting rates for various model sets and database sizes . 191
6.5 Equal error rates after keyword verification for various model sets
and training database sizes . . . . . . . . . . . . . . . . . . . . . . 194
6.6 Stage 1 spotting and stage 2 post verification results for S1I experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
List of Figures
2.1 An example of a Receiver Operating Characteristic curve . . . . . 24
2.2 An example of a Detection Error Trade-off plot . . . . . . . . . . 25
2.3 Recognition grammar for HMM-based keyword spotting . . . . . . 27
2.4 Sample recognition grammar for small non-keyword vocabulary
keyword spotting . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 System architecture for HMM keyword spotting using a Speech
Background Model as the non-keyword model . . . . . . . . . . . 32
2.6 System architecture for HMM keyword spotting using a composite
non-keyword model constructed from phone models . . . . . . . . 33
2.7 Constructing a recognition network for constrained vocabulary keyword spotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.8 An optimised constrained vocabulary keyword spotting recognition
network (language model probabilities omitted) . . . . . . . . . . 39
2.9 An event spotting network for detecting occurrences of times [16] 40
2.10 Likelihood ratio based keyword occurrence verification with multiple verifier fusion . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.11 Applying reverse dictionary searches to the detection of the word
ACQUIRE in a phone stream . . . . . . . . . . . . . . . . . . . . 50
2.12 Example of indexed reverse dictionary searching for the detection
of the word ACQUIRE . . . . . . . . . . . . . . . . . . . . . . . . 52
2.13 Using lattice based searching to locate instances of the word ACQUIRE within a phone lattice . . . . . . . . . . . . . . . . . . . . 54
3.1 Confusability circle for the target word STOCK . . . . . . . . . . 59
3.2 Example of the shared subevent confusable acoustic region for the
keyword STOCK . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3 Incorporating target word insertion penalty into HMM-based keyword spotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.4 DET plots for unnormalised and duration normalised output score
thresholding applied to SBM-KS . . . . . . . . . . . . . . . . . . . 72
3.5 DET plots for duration normalised output score thresholding applied to SBM-KS for keyword length dependent evaluation sets . . . . 75
3.6 DET plots for different target keyword lengths for SBM-KV on
Switchboard 1 evaluation sets . . . . . . . . . . . . . . . . . . . . 78
3.7 System architecture for MLP background model based KV . . . . 80
3.8 DET plots for SBM and MLP-SBM systems for 4-phone words . . 81
3.9 DET plots for SBM and MLP-SBM systems for 6-phone words . . 81
3.10 DET plots for SBM and MLP-SBM systems for 8-phone words . . 81
4.1 Controlling the degree of CAR region modeling via dmin and dmax tuning 93
4.2 An N-class classifier approach to cohort word verification for the keyword w and cohort word set R(w) . . . . . . . . . . . . . . . . . 99
4.3 DET plot for best cohort word and SBM-KV systems on SWB1
4-phone length evaluation set . . . . . . . . . . . . . . . . . . . . 109
4.4 DET plot for best cohort word and SBM-KV systems on SWB1
6-phone length evaluation set . . . . . . . . . . . . . . . . . . . . 109
4.5 Equal error rate versus mean number of cohort words . . . . . . . 112
4.6 Trends in equal error rate with changes in cohort word set downsampling size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.7 Trends in equal error rate with changes in cohort word selection
range for 4-phone length cohort word KV . . . . . . . . . . . . . . 117
4.8 Trends in equal error rate with changes in cohort word selection
range for 6-phone length cohort word KV . . . . . . . . . . . . . . 118
4.9 Trends in equal error rate with changes in cohort word selection
range for 8-phone length cohort word KV . . . . . . . . . . . . . . 118
4.10 Trends in equal error rate with changes in MED cost parameters . 120
4.11 Correlation between unfused system performances and fused system performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.12 Boxplot of EERs for all evaluated architectures and phone-lengths 131
4.13 Boxplot of log(EERs) for all evaluated architectures and phone-lengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.1 Segment of phone lattice for an instance of the word STOCK . . 142
5.2 Effect of lattice traversal token parameter . . . . . . . . . . . . . 159
5.3 Trends in miss rate and FA/kw rate performance for various types
of tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.4 Plot of miss rate versus FA/kw rate for HMM, CLS and DMLS
systems evaluated on Switchboard . . . . . . . . . . . . . . . . . . 168
5.5 The relationship between cost matrices for subsequences . . . . . 169
5.6 Demonstration of the MED prefix optimisation algorithm . . . . . 170
6.1 Effect of training dataset size on speech recognition [24] . . . . . . 183
6.2 Trends in miss rate across training database size . . . . . . . . . . 190
6.3 Trends in FA/kw rate across training database size . . . . . . . . 190
6.4 DET plot for T16 experiments. 1=T16S3E, 2=T16S2E, 3=T16S1E,
4=T16S2S, 5=T16S1S . . . . . . . . . . . . . . . . . . . . . . . . 193
6.5 DET plot for M16 experiments. 1=M16S3E, 2=M16S2E, 3=M16S1E,
4=M16S2S, 5=M16S1S . . . . . . . . . . . . . . . . . . . . . . . . 193
6.6 DET plot for M32 experiments. 1=M32S3E, 2=M32S2E, 3=M32S1E,
4=M32S2S, 5=M32S1S . . . . . . . . . . . . . . . . . . . . . . . . 193
6.7 Trends in EER across training dataset size . . . . . . . . . . . . . 195
6.8 DET plot for S2S experiments. 1=T16S2S, 2=M16S2S, 3=M32S2S 196
6.9 DET plot for S1I experiments. 1=T16S1I, 2=M16S1I, 3=M32S1I 197
6.10 Extrapolations of Indonesian keyword spotting performance using
larger sized databases . . . . . . . . . . . . . . . . . . . . . . . . . 199
A.1 Example of the cost matrix calculated using the Levenshtein algorithm for transforming deranged to hanged. The costs of substitutions, deletions and insertions are all fixed at 1; the cost of a match is fixed at 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
List of Abbreviations
ADI Audio Document Indexing
CAR Confusable Acoustic Region
CLS Conventional Lattice-based Spotting
CMS Cepstral Mean Subtraction
CW Cohort Word
DAR Disparate Acoustic Region
DET Detection Error Trade-off
DMLS Dynamic Match Lattice Spotting
EER Equal Error Rate
FA False Alarm
GMM Gaussian Mixture Model
HMM Hidden Markov Model
IRDL Indexed Reverse Dictionary Lookup
KS Keyword Spotting
KV Keyword Verification
LVCSR Large Vocabulary Continuous Speech Recognition
MED Minimum Edit Distance
MLP Multi-Layer Perceptron
PLP Perceptual Linear Prediction
RDL Reverse Dictionary Lookup
ROC Receiver Operating Characteristic
SBM Speech Background Model
SBM-KS Speech Background Model based Keyword Spotting
SBM-KV Speech Background Model based Keyword Verification
STT Speech-to-Text Transcription
SWB1 Switchboard-1
TAR Target Acoustic Region
WSJ1 Wall Street Journal 1
Authorship
The work contained in this thesis has not been previously submitted for a degree
or diploma at any other higher educational institution. To the best of my knowl-
edge and belief, the thesis contains no material previously published or written
by another person except where due reference is made.
Signed:
Date:
Acknowledgments
Foremost I would like to acknowledge my Lord and Saviour Jesus Christ. It is by
His grace that I was given the opportunity and necessary abilities to partake in
this research.
I would also like to thank my beautiful wife, Melenie, who has been a constant
source of support and inspiration. Your words of encouragement have seen me
through the more difficult and frustrating times of this work.
To my supervisor, Professor Sridha Sridharan, I would like to offer my heartfelt
gratitude for your unrelenting support in bringing this research to completion.
Your positive words and guidance have been a true blessing.
I would also like to offer a special thanks for the friendship of the members
of the QUT Speech Research Labs. In particular, I would like to thank Terry
Martin, Robbie Vogt, Michael Mason and Brendan Baker for their constructive
criticism as well as their constant joviality.
Finally, I would like to thank my two loving families for believing in and
supporting me during this long venture, and my wonderful dogs for always giving
me a reason to smile.
Kit Thambiratnam
Queensland University of Technology
February 2005
Chapter 1
Introduction
1.1 Overview
Keyword Spotting (KS) is the automated task of detecting keywords of interest
within continuous speech. This technology has been used in a variety of applications, ranging from telephone call centre systems to covert surveillance devices. Keyword spotting is closely related to the task of speech transcription,
but offers many advantages for certain applications.
Primarily, keyword spotting is well suited to data-mining tasks that process
large amounts of speech. This is because keyword spotting requires significantly
less processing power than transcription, and can therefore run at considerably
faster speeds. Real-time stream monitoring is one such example where this is
required. These applications monitor audio in real-time and flag occurrences of
segments of interest, such as news stories related to a specific topic. Clearly,
the majority of the stream does not require attention, and therefore a keyword
spotting solution that simply detects occurrences of topical keywords will be more
efficient than a fully-fledged large vocabulary transcription engine.
Keyword spotting is also an excellent technology for audio search applications,
such as audio document indexing. In particular, recent developments in KS, including lattice-based searching and reverse dictionary lookup methods, have made
possible the development of unrestricted vocabulary audio document database
search engines that can search hours of data in seconds.
However, many keyword spotting technologies are encumbered by poor detection performance or slow search speeds. There is a trade-off between accuracy
and speed that needs to be managed, and unfortunately to date, many practical
keyword spotting applications are forced to sacrifice detection performance to
realise the execution speeds required for commercial deployment. One has only
to use speech-recognition-enabled telephony services such as telephone banking
to conclude that these systems are far from perfect.
Nevertheless, keyword spotting is a powerful and relevant technology. Used
appropriately, a keyword spotting solution brings with it reduced computational
requirements, increased scalability and potentially higher accuracies than a large
vocabulary transcription system.
1.1.1 Aims and Objectives
This work specifically examines the application of keyword spotting technologies to two data mining tasks: real-time keyword monitoring and large audio
document database indexing. With the ever-increasing amounts of audio and
multimedia being generated daily, the ability to extract information from audio
streams at high speeds while maintaining good detection rates is paramount.
A desirable feature of data mining applications is support for unrestricted vocabulary keyword queries. However, a significant portion of past keyword spotting research has dealt primarily with restricted vocabulary methods. Although these approaches offer advantages in terms of detection and false alarm performance, they limit the flexibility of queries. As such, this work concerns itself
solely with the study of unrestricted vocabulary keyword spotting techniques.
Data throughput is another major consideration when dealing with large amounts of data. Although the cost of computing continues to fall, it is nevertheless beneficial to run at high speeds. This is particularly true for audio indexing applications, where hundreds of hours of audio may need to be interactively searched by a user. Unfortunately, many published KS works neglect
to consider execution time during experimentation. This research will therefore
give considerable attention to the issue of processing speed.
The primary objectives of this thesis are as follows:
1. To review and investigate current state-of-the-art keyword spotting tech-
niques that are relevant to the tasks of real-time keyword monitoring and
audio document indexing
2. To assess and evaluate the performance of these techniques with regards
to crucial performance metrics relevant to the target applications, and as
such, identify potential issues that need to be addressed
3. To investigate and develop novel techniques that can be used to improve the
performance of keyword spotting techniques for data mining applications
4. To investigate the application of keyword spotting technologies for non-
English data mining
1.1.2 Research Scope
Keyword spotting encompasses a plethora of speech recognition research topics
that unfortunately cannot be fully addressed in a single work. As such, the scope
of this research was limited to issues that were directly related to the application
of keyword spotting to real-time keyword monitoring and audio document index-
ing. Additionally, the following restrictions and constraints were applied to this
research:
1. Primarily this work concerns itself with the application of HMM-based
speech recognition techniques to the keyword spotting task. Alternate sta-
tistical modeling approaches, such as neural network techniques, have been
proposed and demonstrated to be suitable for keyword spotting. However,
it is believed that the HMM-based approach provides a greater degree of
flexibility particularly with regards to unrestricted vocabulary tasks, and
as such is the modeling architecture of choice for this research.
2. Experiments reported within this work are limited to single keyword detec-
tion. Although most practical applications of keyword spotting use multi-
word detection during a single pass, it is believed that research constrained
to single keyword detection offers a number of advantages. Primarily, it
allows ease of comparison between results in this thesis and other published
works. Additionally, the variability in performance due to different mix-
tures of words within a multi-word keyword set can be avoided, thereby
ensuring greater consistency between experiments. Finally, it is believed
that trends in single keyword spotting across methods will easily translate
to multi-word keyword spotting tasks, and as such, this constraint does not limit the value
of this research.
1.2 Thesis Organisation
An overview of the organisation of this thesis is given below:
Chapter 2 - A Review of Keyword Spotting presents a thorough review of
keyword spotting and associated technologies. A formal definition of the
keyword spotting problem is given, as well as a discussion of its primary appli-
cations. This is followed by an overview of the key performance metrics that
are relevant to evaluating and understanding keyword spotting methodol-
ogy. A detailed review of KS literature is then presented covering the topics
of unrestricted and restricted spotting techniques, non-keyword modeling
architectures, keyword verification and confidence scoring methods, and au-
dio indexing approaches.
Chapter 3 - HMM-based Spotting and Verification discusses and evalu-
ates existing HMM-based keyword spotting and verification techniques.
Such methods have a strong following within the keyword spotting com-
munity. However, to date, there has been little published work that com-
pares the performances of the various approaches. What little has
been published has primarily focused on measuring performance for sim-
plistic domains such as read microphone speech. A number of HMM-based
techniques are evaluated in this chapter and the strengths and weaknesses
of these methods are discussed.
Chapter 4 - Cohort Word Verification proposes a novel keyword verifica-
tion approach that combines high level linguistic information with cohort-
based verification techniques to yield improved performance. A number of
experiments are reported that measure the performance of this method for
the conversational telephone speech and read microphone speech domains.
The results demonstrate that significant gains can be obtained particularly
for the difficult task of short-word keyword verification. In addition, exper-
iments are performed using a fused architecture that combines cohort word
verification with traditional background model based verification. Further
gains in performance are obtained using this approach.
Chapter 5 - Dynamic Match Lattice Spotting presents and evaluates a novel
audio indexing technique. Although
existing unrestricted audio indexing methods are capable of very fast search
speeds, they are encumbered by very poor miss rate performance. It is ar-
gued here that this poor miss rate is a result of inherent phone recogniser
errors that are not accommodated by these techniques. As such, a new
method of lattice-based searching is proposed that incorporates dynamic
sequence matching methods to provide robustness against erroneous lattice
realisations. The results demonstrate that dramatic gains in performance
can be obtained while still maintaining extremely fast search speeds.
Chapter 6 - Non-English Spotting studies the application of keyword spot-
ting technologies to non-English languages. In particular, it examines the
effects of limited training data on keyword spotting performance. The lack
of availability of non-English training data has greatly hindered the de-
velopment of other speech technologies such as large vocabulary speech
transcribers. However, keyword spotting is a significantly more constrained
task, and therefore may be less affected by reduced amounts of training
data. If so, this may allow the immediate development of speech tech-
nologies for non-English languages without the need for the costly task of
creating large training databases.
Chapter 7 - Summary, Conclusions and Future Work presents the sum-
mary and conclusions of this work as well as a discussion of future research
directions.
1.3 Major Contributions of this Research
This work has generated a number of novel contributions to the field of keyword
spotting. These are:
1. The development of the novel Cohort Word Verification technique. This
method combines high level linguistic knowledge with cohort-based veri-
fication techniques to yield significant improvements particularly for the
problematic area of short-word keyword verification.
2. The use of multiple keyword verifier fusion, in particular applied to the
combination of cohort word verification with existing HMM-based tech-
niques. It is demonstrated that such fusion techniques allow the strengths
of individual verifiers to be combined to yield considerable improvements
in verification performance.
3. The development of the novel Dynamic Match Lattice Spotting approach.
This technique augments existing lattice-based audio indexing techniques
with dynamic sequence matching to improve robustness to erroneous lattice
realisation. The resulting algorithm is capable of searching hours of speech
using seconds of processor time while maintaining good miss and false alarm
rates.
4. A detailed study of the effects of limited training data for keyword spotting,
as well as how this impacts the immediate development and deployment of
speech technologies for non-English languages.
1.4 List of Publications
The research presented in this thesis has resulted in the publication of a number
of fully refereed, peer-reviewed works.
1. K. Thambiratnam and S. Sridharan. “Isolated word verification using Co-
hort Word-level Verification”, in Proceedings of the European Conference
on Speech Communication and Technology (EUROSPEECH), (Geneva,
Switzerland), 2003
2. K. Thambiratnam and S. Sridharan. “A study on the effects of limited
training data for English, Spanish and Indonesian keyword spotting”, in
Proceedings of the 10th Australian International Conference on Speech Sci-
ence and Technology (SST), (Sydney, Australia), 2004
3. T. Martin, K. Thambiratnam and S. Sridharan. “Target Structured Cross
Language Model Refinement”, in Proceedings of the 10th Australian In-
ternational Conference on Speech Science and Technology (SST), (Sydney,
Australia), 2004
4. K. Thambiratnam and S. Sridharan, “Fusion of cohort-word and speech
background model based confidence scores for improved keyword confidence
scoring and verification”, in Proceedings of the IEEE 3rd International Con-
ference on Sciences of Electronic, Technologies of Information and Telecom-
munications, (Susa, Tunisia), 2005
5. K. Thambiratnam and S. Sridharan, “Dynamic match phone-lattice searches
for very fast and accurate unrestricted vocabulary keyword spotting”, in Pro-
ceedings of the 2005 IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), (Philadelphia, USA), 2005
Chapter 2
A Review of Keyword Spotting
2.1 Introduction
This chapter presents a comprehensive review of keyword spotting technologies
to date. Section 2.2 gives a formal definition of the keyword spotting problem
and is followed by a discussion of the various applications of keyword spotting in
section 2.3. A brief synopsis of the development of keyword spotting research is
provided in section 2.4 as well as a detailed description of how keyword spotting
performance is measured in section 2.5.
Subsequent sections discuss the current methods of keyword spotting with
respect to their key applications. Section 2.6 discusses a number of algorithms
for unconstrained vocabulary keyword spotting. This is followed by a description
of the various approaches to non-keyword modeling in section 2.7. Approaches to
constrained vocabulary keyword spotting are then presented in section 2.8 as well
as methods for keyword occurrence verification in section 2.9. Finally, methods
of applying KS to the task of audio document indexing are discussed in section
2.10.
2.2 The keyword spotting problem
Keyword spotting can be viewed as a special case of Speech-to-Text Transcription
(STT), in which the transcription vocabulary is restricted to keywords of interest
plus a non-keyword symbol that is used to represent all other words in the target
application domain.
Let O be an observation sequence, V be the vocabulary of the target appli-
cation domain, Q be the set of keywords of interest and Ω be the non-keyword
symbol. If STT is represented as the transformation W = Transcribe(O, V ),
where W = w1, w2, . . . is the resulting hypothesised word sequence, then the
keyword spotting task can be defined as
KS(O, V,Q) = f(Transcribe(O, V ), Q) (2.1)
where f(W,Q) is a transformation applied to the output of STT and is given by
f(W,Q) =
    W                      |W| = 1, w1 ∈ Q
    Ω                      |W| = 1, w1 ∉ Q
    w1, f(Tail(W), Q)      |W| > 1, w1 ∈ Q
    Ω, f(Tail(W), Q)       |W| > 1, w1 ∉ Q, w2 ∈ Q
    f(Tail(W), Q)          otherwise

and Tail(x1, x2, . . . , xN) = x2, x3, . . . , xN
f(W,Q) essentially replaces all sequences of non-keywords in the word se-
quence output by the transcriber with a single non-keyword symbol Ω.
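This transformation can be sketched in a few lines of Python. The sketch below is illustrative only: the symbol name OMEGA (standing in for Ω) and the list-of-words representation are assumptions of this example rather than part of the formal definition.

```python
# Illustrative sketch of f(W, Q): collapse each maximal run of
# non-keywords in the hypothesised word sequence W into a single
# non-keyword symbol (the string "OMEGA" stands in for Ω here).
OMEGA = "OMEGA"

def f(W, Q):
    out = []
    for w in W:
        if w in Q:
            out.append(w)
        elif not out or out[-1] != OMEGA:
            # Start of a new non-keyword run: emit one OMEGA only.
            out.append(OMEGA)
    return out
```

For example, f(["the", "bomb", "is", "on", "the", "bridge"], {"bomb", "bridge"}) returns ["OMEGA", "bomb", "OMEGA", "bridge"].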
Although valid, this formulation of keyword spotting is inefficient as it requires
full transcription using a vocabulary of size |V |. Typically keyword spotting is
only interested in occurrences of a much smaller set of words defined by Q. Given
this simplification, a more practical and efficient formulation of keyword spotting
is
KS(O, V,Q) = Transcribe(O, g(Q)) (2.2)
where g(Q) = Q ∪ {Ω}
This alternate approach requires transcription using a much smaller vocab-
ulary of size |Q| + 1. Clearly, this is a considerably less computationally inten-
sive task than transcription using the formulation in equation 2.1. However, it
introduces the additional burden of an acoustic model representation of the non-
keyword symbol Ω. Definition of the non-keyword symbol is one of the active
areas of keyword spotting research and is discussed further in section 2.7.
2.3 Applications of keyword spotting
Keyword spotting lends itself to a plethora of speech-enabled applications. Key-
word spotting is particularly well suited to applications where large amounts of
speech need to be processed. This is because it offers a significant speed benefit
over a large vocabulary STT approach. Four major applications of this technol-
ogy are keyword monitoring, audio document indexing, command control devices
and dialogue systems.
2.3.1 Keyword monitoring applications
Keyword monitoring applications are required to continuously monitor a real-
time stream of audio and to flag any occurrences of a keyword in the query
set. Specific keyword monitoring applications include telephone tapping, listening
device monitoring and broadcast monitoring.
Telephone tapping and listening device monitoring are used extensively by
security organisations to detect criminal or malicious activity. Keyword spotting
provides a fast and automatic solution to this task and potentially a higher de-
tection accuracy than human monitoring, particularly when a very large number
of audio streams needs to be monitored. However, these applications create a
considerable challenge for keyword spotting because of the noisy nature of the
speech being monitored. Telephone conversations may be plagued with signifi-
cant background noise, multiple languages and even multiple speakers, providing
challenges for acoustic modeling. Listening device audio may suffer from very low
signal-to-noise ratios, a difficulty for any speech processing application.
Broadcast monitoring is actively performed by commercial broadcast mon-
itoring companies to locate segments that may be of interest to a client. For
example, a senator may be interested in all news stories in which he or she is
mentioned; broadcast monitoring organisations provide such a service for a
fee. A significant challenge of broadcast monitoring is the amount of audio that
needs to be processed daily. Broadcast monitoring clients may be interested in
stories from a comprehensive set of broadcast sources, including free-to-air tele-
vision, cable television, commercial radio and community radio. The vast number
of these sources, combined with the fact that many broadcast continuously 24
hours a day, 7 days a week, makes broadcast monitoring a very data-intensive
problem.
Keyword spotting provides an excellent solution to all these keyword mon-
itoring tasks. Faster-than-real-time keyword spotting technologies can process
audio faster than a human listener. Additionally, the accuracy of an automatic
system is likely to exceed that of a human operator, since computers do not
suffer from the fatigue and mental distractions that affect human listeners.
Keyword spotting is particularly well suited to the broadcast monitoring
task since audio in this domain is usually of much higher quality
than telephone and listening device audio.
2.3.2 Audio document indexing
Audio document indexing is the task of rapidly searching an audio document
database for keywords and topics of interest. This functionality is analogous
to traditional text document indexing systems such as the Google [11] Internet
search engine, but operates on audio documents instead. The need for efficient
and fast audio document indexing is paramount in a world where audio and
multimedia documents play a greater role in everyday life.
STT systems are one solution to the audio document indexing problem. Audio
is first transcribed to text that can then be rapidly searched during query time.
However, many applications of audio document indexing, such as news database
searching, require support for proper noun queries such as names, places and
foreign words — terms that in many cases are not a part of the transcription
system’s vocabulary. As such, alternatives to the STT-based approach that do not
constrain the query vocabulary are required.
Fortunately, a keyword spotting solution does provide support for unrestricted
vocabulary queries. The trade-off though is a reduction in query speeds, since
most KS approaches are nowhere near as fast as text-based searching methods.
Nevertheless, the support for unrestricted vocabulary queries is important, and
as such, a keyword spotting system can be used to augment an STT-based system
to provide very fast queries for in-vocabulary words while still supporting out-of-
vocabulary queries.
2.3.3 Command controlled devices
Command controlled devices monitor the ambient audio and react when they
detect specific command words. Examples of command controlled devices are
speech-enabled mobile phones, voice-controlled VCRs and command-controlled
factory machinery.
Although generic keyword monitoring technologies can be used for command
controlled devices, they typically impose too high a processing or memory require-
ment to be feasible, especially in the case of DSP-based or embedded applications.
Additionally, the query terms of command controlled devices tend to be fixed,
allowing more application-specific information to be incorporated into the key-
word detection process. This includes query word linguistic context information
and environmental noise conditions.
Hence command controlled device KS lends itself to the development of cus-
tom solutions. Though many of these solutions may be based on existing key-
word spotting approaches, significant enhancements and modifications are made
to provide maximum performance for the intended application.
2.3.4 Dialogue systems
Automated dialogue systems are becoming more common in the commercial en-
vironment as a viable alternative to human-operated call centres. A dialogue
system mimics a human call-centre operator by playing voice prompts to a caller
and then attempting to detect keywords that indicate the response of the caller.
Since the volume of calls processed by a call-centre can be very large, large vo-
cabulary STT approaches have proven infeasible due to their high computational
requirements. Instead, restricted grammar speech recognisers or keyword spotting
technologies are used to interpret the response of callers.
Keyword spotting approaches offer a benefit over restricted grammar speech
recogniser approaches because they allow greater flexibility in the response of the
speaker. This is because KS accommodates out-of-vocabulary words by means
of non-keyword modeling. However, a cleverly constructed restricted grammar
speech recogniser can better understand the intention of a caller using contextual
information, and therefore may prove more appropriate for certain applications.
2.4 The development of keyword spotting
In a similar fashion to general speech recognition theory, keyword spotting has un-
dergone a number of generations of development. Early approaches were
constrained by limited computing resources, and hence KS research focused on
simpler tasks
such as isolated keyword detection. As speech recognition technology matured,
more advanced tasks were explored, such as the detection of keywords embedded
in noise or continuous speech.
2.4.1 Sliding window approaches
Initial methods focused on using sliding window approaches such as the dynamic
time warping approaches proposed by Sakoe and Chiba [29] and Bridle [6], or
the sliding window based neural network method described by Zeppenfeld [40].
Such techniques yielded acceptable results in isolated keyword spotting tasks,
but suffered from considerable drops in performance when spotting keywords
embedded in continuous speech.
A major reason for this drop in performance was that sliding window ap-
proaches did not model non-keywords either implicitly or explicitly. Spotting
of keywords in continuous speech is essentially a 2-class discrimination task, at-
tempting to classify regions as either a keyword or a non-keyword instance. Since
the traditional sliding window approaches did not model non-keywords, they were
essentially attempting discrimination with knowledge of only the target class.
This was analogous to making measurements without a point of reference - all ob-
servations were purely relative and therefore provided little confidence for making
absolute decisions.
2.4.2 Non-keyword model approaches
To address the lack of knowledge of the non-target class, the concept of non-
keyword models (also known as filler models) was introduced into keyword spot-
ting. Non-keyword models attempted to model all speech that did not form a
part of the target keyword speech. For example, in a closed vocabulary system, a
non-keyword model would attempt to model all words in the vocabulary except
for the target keywords. Using a non-keyword model provided more confidence
when accepting or rejecting putative instances of target keywords compared to
the sliding window approaches because a comparison was being made between
the target keyword model and the non-keyword model.
One of the initial approaches used to incorporate non-keyword models was pro-
posed by Higgins and Wohlford [13]. Here a DTW-based continuous speech recog-
niser was modified to use filler non-keyword models to represent non-keyword
speech. The modified speech recogniser was then used to transcribe continuous
speech into regions of keywords and non-keywords. Finally, a likelihood ratio
was used to normalise keyword likelihoods by the corresponding likelihood of the
non-keyword model over the same observation sequence. Non-keyword models in
this particular approach were constructed from pieces and subsequences of the
target keyword.
The introduction of non-keyword models into keyword spotting saw the fu-
sion of continuous speech recognition research with keyword spotting techniques.
Whereas previously KS approaches had exclusively used sliding window tech-
niques, the use of non-keyword models required a paradigm shift into the speech
recognition context. Specifically, keyword spotting could be simply viewed as a
special case of continuous speech recognition, where all non-keyword speech was
labeled with a single non-keyword tag. Operating within the speech recognition
framework allowed the latest developments in continuous speech recognition such
as advances in modeling techniques to be transferred to the KS domain. Hence,
keyword spotting research began to more closely follow the trends of speech recog-
nition research.
2.4.3 Hidden Markov Model approaches
The advent of Hidden Markov Model (HMM) based speech recognition led to
the introduction of HMM-based keyword spotting techniques. As for DTW-based
keyword spotting, HMM-based keyword spotting could be viewed as a special case
of HMM-based speech recognition, where all non-target words were represented
by a non-keyword model.
One common approach was to use a word loop consisting of all target keywords
in parallel with the non-keyword model. Target keywords were typically modeled using
either word models or sub-word models, while non-keyword speech was modeled
using a plethora of architectures, including a high-order Gaussian Mixture Model
as prescribed by Wilpon et al. [35] or a monophone model set as suggested by Rose
and Paul [28]. This led to the development of better performing KS systems,
paving the way to more complex keyword spotting applications.
2.4.4 Further developments
Advances in high-level linguistic modeling through recognition grammars and lan-
guage modeling were also incorporated into keyword spotting. These advances
were motivated by the need to reduce false alarm rates of KS systems through
the use of contextual information, specifically to reduce or constrain the emission
of false putative keyword occurrences. Kenji et al. [18] and Gou et al. [12] both
described techniques of incorporating finite state grammars into the spotting pro-
cess. The reported experiments demonstrated significant gains in performance for
simple recognition grammar applications compared to non-grammar-constrained
approaches. However, such systems were less flexible since they were configured
specifically for a target grammar, limiting the ability to be easily ported to a
different recognition grammar environment.
Other experiments reported by Rohlicek et al. [27] and Jeanrenaud et al.
[16] described how bigram language modeling concepts could be incorporated
into KS to improve performance. Results demonstrated that such methods were
particularly well suited for event spotting (spotting a set of keywords belonging to
a specific class such as dates). However, such approaches resulted in significantly
greater computational burden due to increased recognition network complexity.
Finally, the increased importance of audio and multimedia introduced the
requirement for fast unrestricted vocabulary keyword spotting in large audio
databases. Prior to this, the majority of keyword spotting research had focused
on real-time monitoring style applications or telephone dialogue systems. Al-
though these existing technologies were fast, they did not provide the very high
speed and scalability required for large audio database data mining.
This saw the introduction of two-stage algorithms such as that proposed by
Young and Brown [39]. Such algorithms first transcribed speech to a low-level
textual representation (e.g. phone labels) that could then be searched very quickly
at query time for a target keyword of interest. This resulted in query speeds sev-
eral orders of magnitude faster than previously obtained using existing keyword
spotting methods.
2.5 Performance Measures
A broad range of metrics are used to measure the performance of keyword spotting
algorithms. Understanding these measures is essential both for discussing
algorithms and for comprehending the significance of results. This
section defines the metrics and terms used in keyword spotting literature and
within this work.
2.5.1 The reference and result sets
The output of a keyword spotting system is a set of tuples, Ψ, representing
putative keyword occurrences. Each tuple consists of an utterance identifier, a
keyword tag, a start time and an end time.
When evaluating performance, this result set is scored against a reference
set of tuples, Γ, containing the true occurrences of the keywords to be detected.
Formally, these sets are defined as follows:

Γ = γ1, γ2, . . . , γN (2.3)
Ψ = ψ1, ψ2, . . . , ψM (2.4)

where γi = (u_i^r, w_i^r, s_i^r, e_i^r)
      ψj = (u_j^p, w_j^p, s_j^p, e_j^p)

      u_i^r, u_j^p = utterance identifiers of the ith reference and jth putative occurrence
      w_i^r, w_j^p = keyword tags of the ith reference and jth putative occurrence
      s_i^r, s_j^p = keyword start times of the ith reference and jth putative occurrence
      e_i^r, e_j^p = keyword end times of the ith reference and jth putative occurrence
2.5.2 The hit operator
The hit operation is used to determine whether a reference occurrence was suc-
cessfully detected. In this work, the following definition of a hit is used
Hit Operator. Given a reference occurrence γ and a putative result set, Ψ,
then the reference occurrence γ is declared as hit if the mid-point of the reference
occurrence falls within the time boundaries of one of the putative occurrences in
Ψ, and the respective keyword tags and utterance identifiers are equal.
A similar hit definition is used in other keyword spotting literature and soft-
ware, including the HTK software package.
Formally, the hit operation is defined as follows:
γi ⊙ ψj =
    0    if u_i^r ≠ u_j^p
    0    if w_i^r ≠ w_j^p
    0    if s_j^p > (s_i^r + e_i^r)/2
    0    if e_j^p < (s_i^r + e_i^r)/2
    1    otherwise
(2.5)
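For illustration, the hit operator can be expressed directly in code. The tuple layout (utterance, word, start, end) follows the definitions of section 2.5.1, while the function name hit is an assumption of this sketch.

```python
def hit(ref, put):
    """Hit operator of eq. 2.5: returns 1 if the mid-point of the
    reference occurrence falls within the putative occurrence's time
    boundaries and the utterance identifiers and keyword tags match."""
    u_r, w_r, s_r, e_r = ref
    u_p, w_p, s_p, e_p = put
    if u_r != u_p or w_r != w_p:
        return 0
    mid = (s_r + e_r) / 2.0
    return 1 if s_p <= mid <= e_p else 0
```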
2.5.3 Miss rate
In keyword spotting literature, both miss rate and its converse, hit rate, are used
predominantly as measures of reference occurrence detection performance. The
miss rate is defined as
Miss rate. Given a reference occurrence set Γ and a putative result set Ψ, the
miss rate is defined as the proportion of elements of Γ that were not hit by any
element of Ψ.
Formally, the miss rate is defined in terms of the hit operator as follows:
MissRate(Γ,Ψ) = (|Γ| − |HitSet(Γ,Ψ)|) / |Γ|    (2.6)

where HitSet(Γ,Ψ) = { γ ∈ Γ | Σ_j (γ ⊙ ψj) > 0 }
Hit rate is the converse of miss rate and is defined as
HitRate(Γ,Ψ) = 1 − MissRate(Γ,Ψ)    (2.7)
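The miss and hit rates of equations 2.6 and 2.7 can then be computed as follows. This is an illustrative sketch in which the helper _hit implements the mid-point hit test of equation 2.5; the function names are assumptions of the example.

```python
def _hit(ref, put):
    # Mid-point hit test of eq. 2.5; tuples are (utt, word, start, end).
    u_r, w_r, s_r, e_r = ref
    u_p, w_p, s_p, e_p = put
    mid = (s_r + e_r) / 2.0
    return u_r == u_p and w_r == w_p and s_p <= mid <= e_p

def miss_rate(refs, puts):
    """Eq. 2.6: fraction of reference occurrences hit by no putative one."""
    n_hit = sum(1 for g in refs if any(_hit(g, p) for p in puts))
    return (len(refs) - n_hit) / len(refs)

def hit_rate(refs, puts):
    """Eq. 2.7: the converse of the miss rate."""
    return 1.0 - miss_rate(refs, puts)
```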
2.5.4 False alarm rate
False alarm rate is a measure of the number of incorrect results output by a
keyword spotter. A false alarm is defined as
False Alarm. A member of the result set Ψ that does not hit any of the members
of the reference set Γ.
Two different definitions of false alarm rate are used in the literature. The more
common of the two and the definition used within this work is
False Alarm Rate. The number of false alarms in the keyword spotting result
set normalised by the total duration of evaluation speech searched and the number
of unique keywords searched for.
False alarm rate can be expressed formally as
FARate(Γ,Ψ,W,T) = |FASet(Γ,Ψ)| / (|W| · T)    (2.8)

where FASet(Γ,Ψ) = { ψ ∈ Ψ | Σ_j (γj ⊙ ψ) = 0 }
      W = list of keywords being queried for
      T = duration of speech searched in hours
This definition of false alarm rate is used when measuring the overall keyword
spotting performance of a system. The alternate definition of false alarm rate, referred
to as false acceptance rate in this work, is typically more useful in measuring the
performance of the keyword confidence scoring and verification stages of a KS
system.
2.5.5 False acceptance rate
The false acceptance rate is an alternate measure to the false alarm rate that re-
flects the impurity of a result set. Specifically, the false acceptance rate measures
what proportion of the final result set consists of false alarms. The measure
is defined as follows:
False Acceptance Rate. The number of false alarms in the keyword spotting
result set normalised by the total size of the result set.
Formally, the false acceptance rate is defined as:
FalseAcceptRate(Γ,Ψ) = |FASet(Γ,Ψ)| / |Ψ|    (2.9)
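Both false alarm based measures derive from the same set FASet, so they can share one sketch. As before, the helper _hit implements the mid-point hit test of equation 2.5, and the function names are illustrative assumptions.

```python
def _hit(ref, put):
    # Mid-point hit test of eq. 2.5; tuples are (utt, word, start, end).
    u_r, w_r, s_r, e_r = ref
    u_p, w_p, s_p, e_p = put
    mid = (s_r + e_r) / 2.0
    return u_r == u_p and w_r == w_p and s_p <= mid <= e_p

def fa_set(refs, puts):
    # FASet: putative occurrences that hit no reference occurrence.
    return [p for p in puts if not any(_hit(g, p) for g in refs)]

def fa_rate(refs, puts, num_keywords, hours):
    """Eq. 2.8: false alarms per keyword per hour of speech searched."""
    return len(fa_set(refs, puts)) / (num_keywords * hours)

def false_accept_rate(refs, puts):
    """Eq. 2.9: fraction of the result set made up of false alarms."""
    return len(fa_set(refs, puts)) / len(puts)
```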
2.5.6 Execution time
In applications that require high query speeds, it is necessary to measure the
execution time of algorithms. A convenient measure of execution time for keyword
spotting is
Execution time. Given a query word set, W , execution time is defined as the
total number of minutes of CPU time used for execution normalised by the size
of the query word set and the number of hours of speech searched. CPU time is
measured using a 3.0GHz Pentium 4 processor.
Formally, execution time is expressed as:
ExecutionTime = t_cpu / (|W| · T)    (2.10)

where t_cpu = total CPU minutes for execution
      W = query word set
      T = total hours of speech searched
2.5.7 Figure of Merit
A frequently quoted metric in the literature is the Figure Of Merit (FOM). This
metric is a measure of the average hit rate performance across a cross-section of
system operating points. FOM is defined as
Figure Of Merit. The average hit rate taken over false alarm rates from 0
FA/kw-hr to 10 FA/kw-hr.
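One simple way to approximate the FOM is to rank the scored putative occurrences and record the hit rate each time the cumulative false alarm count crosses an integer FA/kw-hr budget. The sketch below assumes each result has been pre-labelled as a hit or a false alarm, and is an approximation for illustration rather than the exact averaging procedure used in the literature.

```python
def figure_of_merit(scored_results, num_refs, num_keywords, hours):
    """Approximate FOM: average hit rate at FA rates of 1..10 FA/kw-hr,
    obtained by sweeping the decision threshold down the ranked results.
    Each result is (score, is_hit)."""
    ranked = sorted(scored_results, key=lambda r: -r[0])
    budgets = [k * num_keywords * hours for k in range(1, 11)]
    hit_rates, hits, fas, b = [], 0, 0, 0
    for score, is_hit in ranked:
        if is_hit:
            hits += 1
        else:
            fas += 1
            # Record the hit rate each time an FA budget is exceeded.
            while b < 10 and fas > budgets[b]:
                hit_rates.append(hits / num_refs)
                b += 1
    while b < 10:  # budgets never exceeded: use the final hit rate
        hit_rates.append(hits / num_refs)
        b += 1
    return sum(hit_rates) / 10.0
```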
The FOM metric has a number of issues that restrict its usefulness. Most
importantly, it is not a good comparative measure for systems where low miss
rate (or high hit rate) is more important than low false alarm rates. This is
typically the case in security surveillance applications where it is important to
miss as few events as possible. For such applications, a better alternative would
be to take the average false alarm rate across a cross-section of low miss rates.
Additionally, the FOM cannot be used to measure the performance of systems
that do not provide an output score. This is because for such systems it is not
possible to easily tune the system to obtain hit rates at various false alarm rates,
and hence it is difficult to calculate the FOM.
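The FOM calculation can be sketched as follows. The thesis does not give an algorithm, so the discrete sampling scheme and all names here are assumptions: detections are sorted by descending output score, and the hit rate is sampled each time the cumulative false alarm count passes another 1 FA/kw-hr, up to 10 FA/kw-hr:

```python
def figure_of_merit(sorted_detections, n_true, n_keywords, hours):
    """Hedged sketch of the Figure of Merit. `sorted_detections` holds
    True (genuine hit) / False (false alarm) flags ordered by descending
    output score; `n_true` is the number of true keyword occurrences in
    the reference. The hit rate is sampled at 1, 2, ..., 10 FA/kw-hr
    and the ten samples are averaged."""
    fa_per_step = n_keywords * hours  # false alarms per 1 FA/kw-hr step
    hit_rates, hits, fas = [], 0, 0
    next_mark = fa_per_step
    for is_hit in sorted_detections:
        if is_hit:
            hits += 1
        else:
            fas += 1
            if fas >= next_mark and len(hit_rates) < 10:
                hit_rates.append(hits / n_true)
                next_mark += fa_per_step
    while len(hit_rates) < 10:  # fewer than 10 FA/kw-hr of false alarms seen
        hit_rates.append(hits / n_true)
    return sum(hit_rates) / 10.0
```

A production implementation would interpolate hit rates at the exact FA/kw-hr points; this discrete version is only meant to make the definition concrete.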
2.5.8 Equal Error Rate
The Equal Error Rate (EER) is a single metric that attempts to provide a figure
of comparison between tunable systems. For keyword spotting, EER is defined
as
Equal Error Rate. The miss rate at the point on the operating characteristic
curve at which the miss rate equals the false acceptance rate.
EERs should be used carefully when comparing systems as they only summarise
performance at a single operating point. They do not take into account important
performance considerations such as the gradient of the operating characteristic
curve. However, they are a good comparative measure when the gradient
and form of the operating characteristic curves being compared are similar.
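A minimal EER computation can be sketched as follows; the threshold-sweeping scheme and all names are assumptions, and the false acceptance rate follows the result-set-normalised definition of section 2.5:

```python
def equal_error_rate(scores, labels):
    """Hedged sketch of EER computation for a tunable detector.
    `scores` are detector output scores; `labels` are True for genuine
    keyword occurrences. The acceptance threshold is swept over the
    sorted scores and the miss rate is returned at the operating point
    where miss rate and false acceptance rate are closest to equal."""
    n_true = sum(labels)
    pairs = sorted(zip(scores, labels), reverse=True)
    hits = fas = 0
    best_miss, smallest_gap = 1.0, float("inf")
    for _, is_true in pairs:
        if is_true:
            hits += 1
        else:
            fas += 1
        miss = 1.0 - hits / n_true
        fa_rate = fas / (hits + fas)  # false alarms / accepted result set
        if abs(miss - fa_rate) < smallest_gap:
            smallest_gap = abs(miss - fa_rate)
            best_miss = miss
    return best_miss
```

On a perfectly separated score list the crossing point sits at zero error; on an interleaved list the EER rises accordingly.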
2.5.9 Receiver Operating Characteristic Curves
A Receiver Operating Characteristic (ROC) curve is a plot of hit rate versus
false acceptance/alarm rate. ROCs provide a view of how keyword spotting
performance varies as the system is tuned. Figure 2.1 gives an example of an
ROC plot.
Figure 2.1: An example of a Receiver Operating Characteristic curve
ROCs are excellent for examining performance at low false acceptance/alarm
rates. However, they provide very poor resolution at low miss rates, making it
difficult to see how performance varies in that region, as demonstrated in
figure 2.1.
2.5.10 Detection Error Trade-off Plots
Detection Error Trade-off (DET) plots show miss rate versus false acceptance
rate on a logarithmic scale. Viewing the data this way typically results in a
near-linear operating characteristic curve that is easier to interpret visually
than an ROC curve. Additionally, DET plots provide good resolution at both
low miss rates and low false acceptance rates. An example of a DET
plot is shown in figure 2.2.
Figure 2.2: An example of a Detection Error Trade-off plot
DET plots may suffer from step effects within the low miss rate regions in some
experiments. Such an effect indicates that the output scores of putative hits are
distinctly segregated from the putative false alarms within the step region. Hence
there are sudden jumps along the curve (i.e. a run of putative hits followed by
a run of putative false alarms within the result set ordered by output score).
Step effects may also be observed when a small reference set Γ is used. How-
ever, in such small reference set size cases, the step effects would be observed
across all operating points since the lower bound of changes in miss rate would
be 1/|Γ|.
2.6 Unconstrained vocabulary spotting
Unconstrained vocabulary keyword spotting algorithms perform keyword spotting
without placing significant constraints on the query vocabulary. Such systems of-
fer significant flexibility, in particular for applications with a dynamic vocabulary.
News monitoring is an example of such an application, where there is a need to
be able to continually update the query word set to include the latest terms of
interest, such as names, abbreviations and foreign keywords.
An unfortunate byproduct of this flexibility is that it is not possible to in-
corporate prior knowledge about query keywords into the KS search process.
This may result in potentially lower performance than a constrained vocabulary
keyword spotting system.
2.6.1 HMM-based approach
Subword acoustic unit HMM-based speech recognition techniques provide a con-
venient and popular framework for unconstrained vocabulary keyword spotting.
This is because the recognition vocabulary of such systems can be easily extended
by updating the recogniser lexicon, assuming appropriate subword unit acoustic
models (e.g. monophone or triphone models) are being used.
A HMM-based keyword spotting system is constructed in a similar fashion
to a word-loop speech recognition system. First a recognition word loop is built
with nodes for each of the words in the query keyword set. Additionally, a non-
keyword node is included to represent non-keyword speech, resulting in a word
Figure 2.3: Recognition grammar for HMM-based keyword spotting
network of the form shown in figure 2.3. Then the word network is expanded into
an acoustic model network using a lexicon. The lexicon must contain mappings
for all words in the recognition network as well as the non-keyword word.
Viterbi decoding is then performed using this network to transcribe a given
observation sequence into a word sequence. The result is a time-marked transcrip-
tion of the observation sequence in terms of keyword symbols and the non-keyword
symbol.
Modeling keywords for keyword spotting is straightforward, as well-
established techniques from the speech recognition literature can be employed. These
include word-based, phone-based and syllable-based models. However, for uncon-
strained vocabulary applications, phone-based or syllable-based models are more
suitable as they allow a much greater flexibility in the choice of keywords that
can be modeled.
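The word-loop construction described above can be sketched as a mapping from network branches to their subword model sequences. The lexicon entries, phone symbols and non-keyword placeholder below are all illustrative assumptions, not from the thesis:

```python
def build_spotting_network(keywords, lexicon, nonkeyword_symbol="<nkw>"):
    """Hedged sketch of the word loop in figure 2.3: each query keyword,
    plus one non-keyword entry, becomes a parallel branch, and each
    branch is expanded into its phone model sequence via the lexicon."""
    network = {}
    for word in list(keywords) + [nonkeyword_symbol]:
        network[word] = lexicon[word]  # phone sequence for this branch
    return network

# Toy lexicon; pronunciations and the garbage placeholder are made up.
lexicon = {"Joe": ["jh", "ow"],
           "Henry": ["hh", "eh", "n", "r", "iy"],
           "<nkw>": ["<garbage>"]}
net = build_spotting_network(["Joe", "Henry"], lexicon)
```

In a real system the resulting acoustic model network would then be passed to a Viterbi decoder; this sketch only shows the word-to-phone expansion step.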
In contrast, modeling of the non-keyword symbol is more difficult. A good
non-keyword model must adequately represent almost the entire vocabulary of an
application domain while still rejecting any instances of keyword speech. Ideally,
given a query word set, W , and an observation sequence O corresponding to the
utterance of a single word ω, the following conditions should hold in a maximum-
likelihood sense:
max_i p(O|λwi) − p(O|λnkw)  ≥ 0  if ω = wi
                            < 0  otherwise        (2.11)

where λwi is the acoustic model for word wi and λnkw is the non-keyword model.
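The condition in equation 2.11 amounts to a simple log-domain decision rule, sketched below with illustrative names (the thesis gives no algorithm):

```python
def classify_word(keyword_logliks, nonkeyword_loglik):
    """Hedged sketch of the ideal condition in equation 2.11, in the log
    domain: the observation sequence is labelled with the best-scoring
    keyword when that keyword model's log-likelihood is at least the
    non-keyword model's; otherwise it is labelled non-keyword (None)."""
    best_i = max(range(len(keyword_logliks)),
                 key=lambda i: keyword_logliks[i])
    if keyword_logliks[best_i] - nonkeyword_loglik >= 0:
        return best_i
    return None
```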
For domains where the vocabulary is anticipated to be small and well defined,
a possible composite non-keyword model could be constructed from a parallel
combination of all non-keyword words. Figure 2.4 demonstrates a trivial exam-
ple of such a non-keyword model using a small task vocabulary where the male
names Joe, William and Henry are target keywords and the female names Sherita,
Rachel, Jessica, Tiani and Chelsea are non-keywords.
Such a solution becomes increasingly intractable as the size of the vocabulary for non-
keyword speech increases. This is because the given architecture is essentially a
speech recognition solution, attempting to transcribe all speech. Instead, a non-
keyword model representation is required that does not increase in complexity
with the size of the non-keyword speech vocabulary. A number of solutions to
this problem have been proposed in literature and are discussed further in section
2.7.
2.6.2 Neural Network Approaches
Neural network based speech recognition is a popular alternative to the more
common HMM-based approach. In particular, neural networks are a good can-
didate for highly discriminative tasks, such as keyword spotting, since they are
more discriminative than the mixture of Gaussian HMM models typically used
in KS.
Figure 2.4: Sample recognition grammar for small non-keyword vocabulary keyword spotting
However, pure neural network methods have rarely been proposed for keyword
spotting, particularly in recent literature. Those that have proposed this archi-
tecture have mainly used time-modeling architectures such as Recurrent Neural
Networks or Time-Delay Neural Networks (e.g. Trentin et al. [33]). One such ex-
ample is the Recurrent Neural Network approach proposed by Jianlai et al. [17].
Unfortunately the reported experiments demonstrated that although this method
was capable of reliably detecting simple signal shapes, such as sinusoids embedded
in continuous signals, performance was less than pleasing for digit spotting.
A more common and far more successful approach to incorporating neural
networks has been through the use of hybrid neural network HMM systems.
Such systems include those described by Bernadis and Bourlard [4], Lippmann
and Singer [21] and Ou et al. [25]. These algorithms attempted to augment
the convenient and well-performing HMM architecture with the discriminative
benefits of neural networks.
Lippmann and Singer [21] performed keyword spotting by using HMM acous-
tic models with Radial Basis Function neural network state distributions. The
motivation here was to use the Radial Basis Function state distribution in place
of the less discriminative mixture of Gaussian architecture normally used with
HMMs. Results demonstrated some appreciable gains in performance.
An alternate hybrid system was proposed by Alvarez-Cercadillo et al. [1].
Here a Recurrent Neural Network decoder was used in conjunction with HMM-
based observation scoring. The method achieved low miss and false alarm rates
demonstrating the suitability of the method for the KS task. However, it must be
noted that very low word error rates (2% and lower) were also reported for speech
recognition using this approach, implying fairly simplistic (e.g. single speaker or
small vocabulary) evaluation data sets. The reported results should be considered
in the light of this.
Two stage hybrid systems have also been proposed, such as the methods de-
scribed by Bernadis and Bourlard [4] and Ou et al. [25]. In such approaches, an
initial mixture of Gaussians HMM system was used to generate a set of puta-
tive occurrences. A secondary neural network classifier was then used to further
classify the putative occurrences based on the HMM output likelihoods. Com-
parisons of these methods with standard HMM approaches are difficult because
they implicitly incorporate a secondary confidence scoring stage. It is more than
likely that a KS system with a confidence scoring stage will outperform a single-
stage HMM system since the latter does not incorporate any explicit confidence
scoring. This must be taken into account when considering results reported for
two-stage hybrid systems.
2.7 Approaches to non-keyword modeling
The choice of non-keyword model architecture is a significant consideration for
any keyword spotting and keyword confidence scoring system. These models play
a crucial role in determining the final miss rate and false alarm rates of these
systems since they represent the important non-target class in keyword spotting.
This section discusses a cross-section of non-keyword model approaches proposed
in literature.
2.7.1 Speech background model
Wilpon et al. [35] and Rose and Paul [28] both espoused the use of a single high-
order Gaussian Mixture Model (GMM) for non-keyword modeling. This type
of model is commonly referred to as a universal background model (UBM) or
speech background model (SBM) in literature. Reported experiments demon-
strated that the SBM-based approach was capable of attaining miss rates below
6% for telephone speech keyword spotting.
Figure 2.5 shows the typical architecture for an SBM-based approach. The
motivation for this approach was to capture the acoustic characteristics of all
speech in the target domain within a single non-keyword model. However, as
stated before, an ideal non-keyword model should uphold the condition given
in equation 2.11. In the SBM approach, this condition is not guaranteed to be
maintained since the SBM is trained on speech that may include instances of the
target keywords. As such, for an observation sequence O corresponding to an
instance of word w, the output likelihood of the SBM, p(O|λnkw), may exceed the
output likelihood of the target word model, p(O|λw), thus breaking the condition
in equation 2.11. This is more likely for words with a high frequency count in the
SBM training data.
However, since the SBM models greater acoustic variance than the target
Figure 2.5: System architecture for HMM keyword spotting using a Speech Background Model as the non-keyword model
keyword models, it is likely that the SBM distribution will be flatter than the
target keyword model distributions in the acoustic regions corresponding to target
keyword speech. This would result in average frame likelihoods output by the
SBM being lower than the average frame likelihoods output by the target keyword
models in the acoustic regions of target keyword speech. Hence, although there
would be some overlap with target keyword speech, it is not unreasonable to
expect that the SBM would be sufficiently generalised so as not to dominate
the target keyword models.
Additionally, a significant benefit of the SBM approach is target vocabulary
independence. Since an SBM is trained over all speech, there are no implicit
assumptions made regarding target vocabulary. In terms of flexibility, then, an
SBM-based system is well suited to the unconstrained vocabulary key-
word spotting task.
2.7.2 Phone models
In the phone model approach, a composite non-keyword model is constructed
from the parallel combination of phone models. Each phone model is assigned an
optional weight, α, allowing prior weighting of phones within the non-keyword
model. The phone models used may be monophones, triphones, or in fact any
appropriate subword model. Figure 2.6 demonstrates the recognition network
required for the phone model approach.
Figure 2.6: System architecture for HMM keyword spotting using a composite non-keyword model constructed from phone models
The phone model approach is similar to the SBM approach, in that all speech
is modeled within the non-keyword model. Hence, there may be similar issues
regarding upholding the ideal non-keyword model condition in equation 2.11 (see
section 2.7.1). In fact, this issue may be further compounded if the same set of
models are being used in the non-keyword model and the target keyword models,
for example, if both used the same triphone model set. If so, then an observation
sequence may be scored using the same sequence of models, resulting in equal
target keyword model and non-keyword model likelihoods. This will have a direct
impact on detection performance.
Experiments using the phone model approach were reported by Bourlard et al.
[5] and Bazzi and Glass [2]. Both reported good performance, though careful
choice of the phone model weightings was required. Weightings in these cases
were allocated using phone language model bigram/unigram statistics.
2.7.3 Uniform distribution
Silaghi and Bourlard [31] proposed a method of keyword spotting that used
a uniform distribution with a constant value of ε to approximate the non-
keyword model. Using a novel iterative dynamic programming method, Silaghi
and Bourlard [31] described how this uniform distribution could be estimated
from a set of training utterances. Once derived, the uniform distribution could
then be used as the non-keyword model during HMM-based keyword spotting.
A significant benefit of this approach is a dramatic reduction in computational
processing, as there are no calculations required to score an observation against
the non-keyword model. However, the method is limited by the requirement that ε
must be calculated on a per-target-word basis. This poses significant implications
for unconstrained vocabulary keyword spotting as it requires training examples
of all target words of interest.
2.7.4 Online garbage model
Bourlard et al. [5] proposed an implicitly derived non-keyword model for keyword
spotting. This online garbage model approach estimated a filler model likelihood
at each frame by using the average of the K-best frame likelihoods of phone models
not belonging to the target keyword. The method is classified as an implicit non-
keyword modeling approach because the architecture of the non-keyword model is
synthesised dynamically based on the phonetic decomposition of the target word.
This differs from the phone model approach discussed in section 2.7.2 where the
non-keyword model was built from the parallel combination of all phone models.
Formally, the online garbage likelihood for an observation o, when detecting
keyword w with phone pronunciation Q = (q1, q2, ..., qN) and using the top K
scoring phone models from the phone model set S, is given by:

p(o|λnkw, w) = ( Σi p(o|λj′i) ) / |J′|

where J′ = the first K elements of sortdesc(J)
      J = ( p(o|λr1), p(o|λr2), ..., p(o|λrM) ), ri ∈ R
      R = { s ∈ S | s ∉ Q }
      sortdesc(X) = X ordered in descending order
This approach is more computationally expensive than other approaches, as
each frame must be scored against a number of phone models. However, it has the
advantage of being more flexible, as no prior training is required. Additionally,
gains may be observed over the phone model approach in section 2.7.2 since only
phones that are not a part of the target word are used in discrimination. As such,
there is a reduction in modeling overlap between the target word models and the
non-keyword model.
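The per-frame computation above can be sketched as follows; the dictionary representation and all names are illustrative assumptions:

```python
def online_garbage_likelihood(frame_likelihoods, keyword_phones, k):
    """Hedged sketch of the online garbage model of Bourlard et al. [5]:
    the non-keyword likelihood for one frame is the average of the K
    best-scoring phone model likelihoods, excluding phones that appear
    in the target keyword's pronunciation. `frame_likelihoods` maps
    phone symbol -> likelihood for the current frame."""
    competitors = [p for phone, p in frame_likelihoods.items()
                   if phone not in keyword_phones]
    top_k = sorted(competitors, reverse=True)[:k]
    return sum(top_k) / len(top_k)

# One frame's phone-model likelihoods (illustrative values only);
# "aa" belongs to the target keyword and is excluded.
frame = {"aa": 0.5, "ae": 0.3, "z": 0.1, "zh": 0.4}
print(online_garbage_likelihood(frame, {"aa"}, k=2))  # (0.4 + 0.3) / 2
```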
2.8 Constrained vocabulary spotting
Constrained vocabulary keyword spotting is used for fixed or well-defined query
set tasks, such as dialogue and command control applications. Using a con-
strained vocabulary allows greater levels of syntactic and lexical information to
be incorporated into the recognition process through the use of recognition gram-
mars and task-specific trigger words. Incorporating such prior knowledge has
been demonstrated in literature to yield significant improvements.
2.8.1 Language model approaches
Language Models (LMs), in particular N-gram models, have proven to be a pop-
ular means of incorporating contextual information into constrained vocabulary
keyword spotting. The simplest means of incorporating language models is to use
the LM to obtain all sequences of words that contain one of the target keywords.
Decoding can then be performed to detect these significantly longer phrasal events
rather than single word events. Since longer events are markedly easier to detect
(due to an increased number of associated frames), significant improvements in
miss rate and false alarm rate can be obtained.
A simple and naive means of constructing an LM-based keyword spotting net-
work is as follows. Let W be the set of target keywords and Θ be the task-specific
N-gram language model that models word sequences of length N occurring in the
target application domain. The language model Θ is represented as a collection
of 2-element tuples as specified by

Θ = {(θ1, P(θ1)), (θ2, P(θ2)), ..., (θM, P(θM))}        (2.12)

θi = (θi(1), ..., θi(N))        (2.13)
Then the subset of all sequences in Θ that contain an element of W at position k,
called the constrained vocabulary subsequence set, is given by Ω(W, k, Θ), defined
as:

Ω(W, k, Θ) = {(ω1, P(ω1)), (ω2, P(ω2)), ...}        (2.14)
           = F(W, k, Θ)

where F(W, k, Θ) = {(Si, P(Si)) ∈ Θ | G(W, k, Si) = 1}

G(W, k, X) = 1 if X(k) ∈ W
             0 otherwise
A recognition network can then be constructed from the constrained vocabulary
subsequence set using approaches similar to unconstrained vocabulary keyword
spotting. A single Viterbi decoding pass will then result in a transcription se-
quence consisting of sequences in F (W, k,Θ) and the non-keyword symbol. Figure
2.7 describes the process of building the constrained vocabulary keyword spotting
recognition network for a simple small vocabulary domain.
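The construction of the constrained vocabulary subsequence set in equation 2.14 can be sketched as a filter over the language model. The toy trigram entries below loosely follow figure 2.7; "Jones" is a made-up non-target surname, and the 0-indexed position convention is an implementation choice:

```python
def constrained_subsequence_set(language_model, targets, k):
    """Hedged sketch of equation 2.14: filter a language model, held as
    (word_sequence, probability) tuples, down to those sequences whose
    word at position k (0-indexed) is a target keyword."""
    return [(seq, prob) for seq, prob in language_model if seq[k] in targets]

# Toy trigram language model with invented probabilities.
lm = [(("Mister", "Smith", "Said"), 0.2),
      (("John", "Smith", "Said"), 0.1),
      (("Miss", "Doe", "Said"), 0.3),
      (("Jane", "Jones", "Said"), 0.4)]
subset = constrained_subsequence_set(lm, {"Smith", "Doe"}, k=1)
print(len(subset))  # 3: the "Jones" sequence is excluded
```

Each surviving sequence would then become one branch of the recognition network, alongside the non-keyword node.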
A significant issue with this approach is that even for low values of N , the
number of elements in the constrained vocabulary subsequence set can be large,
particularly for large target vocabulary domains. This results in significantly
slower decoding, reducing the speed benefits offered by keyword spotting over
large vocabulary speech recognition. Network optimisation techniques such as
the determinisation and minimisation techniques of Mohri et al. [23] may be ap-
plied to reduce the size of networks, resulting in significant reductions in network
size as demonstrated in figure 2.8. Nevertheless, the network complexity is still
considerably greater (particularly for a large number of contexts) than that of
unconstrained vocabulary keyword spotting, though this may be an acceptable
trade-off in exchange for improved miss and false alarm rates.
Figure 2.7: Constructing a recognition network for constrained vocabulary keyword spotting
Figure 2.8: An optimised constrained vocabulary keyword spotting recognition network (language model probabilities omitted)
2.8.2 Event spotting
Event spotting is the task of detecting events of interest such as occurrences of
times, dates, or questions. If an event can be sufficiently described by a set of
word sequences, then constrained vocabulary keyword spotting methods can be
used to detect these events embedded in continuous speech.
Event spotting experiments for the detection of times in continuous speech
were reported by Jeanrenaud et al. [16] and Jeanrenaud et al. [15]. A recognition
network (see figure 2.9) was first constructed to represent various means of utter-
ing a time, such as Midnight, 3 o’clock in the afternoon and 12 thirty PM. Decod-
ing was then performed using phone-based HMMs to transcribe a given utterance
in terms of time and non-time events. The results demonstrated that incorpora-
tion of contextual information and the increased phrase length significantly aided
event detection performance. However recognition speed was markedly reduced
because of the additional complexity of the recognition network.
Figure 2.9: An event spotting network for detecting occurrences of times [16]
An obvious restriction of the event spotting method described above is that
any deviation from one of the prescribed methods of uttering a time may result
in a missed detection. For example, the network shown in figure 2.9 does not
represent utterances of times such as Half Past One or Quarter to Six.
To address this issue, Yining et al. [38] proposed augmenting the event spot-
ting network with filler context models. For example, for the network in figure
2.9, the parallel group of numbers 1 to 12 could be augmented with a parallel
filler model to capture any alternate forms at that point in the network, such
as the slang term Twelvish. Another filler model could be included in parallel
with morning and evening to handle phrases such as One thirty in the afternoon.
Similarly filler models could be introduced at other points to handle variations
in time phrase utterance.
Finite State Grammar approaches have also been proposed for event spotting
by Gou et al. [12] and Kenji et al. [18]. In these methods, nodes in an event
spotting recognition network were clustered using knowledge-based classing to
reduce the complexity of a network. For example, the nodes in figure 2.9 could
be clustered using class tags such as number = {1, 2, ..., 12} and timeofday =
{morning, evening, A.M., P.M.}. Models were then constructed for each class
and decoding was performed using this simplified grammar. This approach gave
considerable improvements in recognition speed since the network complexity was
greatly reduced.
2.9 Keyword verification
Keyword Verification (KV) is a vital stage of many keyword spotting systems.
The purpose of these algorithms is to reduce the typically high false alarm rate
of a prior keyword spotting stage while preserving as many true keyword oc-
currences as possible. The majority of methods proposed in literature perform
keyword verification by deriving a confidence score for each putative occurrence.
Thresholding is then used to accept or reject candidates.
2.9.1 A formal definition
Verification is usually performed on an isolated sequence of observations corre-
sponding to a putative occurrence output by a prior keyword spotting stage. The
verification task is a 2-class discrimination task seeking to determine if a given ob-
servation sequence, O, corresponds to a true occurrence of the keyword W. Thus,
a keyword verification classifier must maximise the probability of the following
condition being maintained:

KV(W, O) = 1 if O is a true occurrence of W
           0 otherwise        (2.15)
The classifier needs to be robust against a number of errors. For example,
there may be a degree of error with regards to the word boundary time alignments
output by the keyword spotter. This may have ramifications for target word scores
if used within the verification algorithm. A keyword verification algorithm should
also remain robust against variations in target word length, to accommodate
cross-speaker and cross-utterance variations. Thus, it is often prudent to include
a degree of duration normalisation in any keyword verification algorithm.
2.9.2 Combining keyword spotting and verification
Many keyword spotting algorithms already contain a degree of implicit key-
word verification. For example, HMM-based techniques normalise target keyword
model scores using filler model likelihoods, thus implicitly performing KV. How-
ever, it is important not to rely too heavily on the implicit verification ability of
a keyword spotting stage, for example, through aggressive system tuning. This is
because any true occurrences that are missed by the keyword spotting stage are
occurrences that cannot be recovered by a subsequent post processing stage.
Instead, keyword spotting stages should be tuned to obtain lower miss rates
at the expense of higher false alarm rates. A subsequent KV stage can then be
used to cull extraneous false alarms. In this way, each subsystem does what it
is best at: detection with low miss rate for keyword spotting and verification
with low false alarm rate for keyword verification.
2.9.3 The problem of short duration keywords
Short duration keywords pose a significant difficulty for many keyword verification
methods. This is because it is more difficult to obtain a robust likelihood estimate
as the number of observations decreases, since there is less information upon which
to base any classification decisions.
Consider the example of a simple KV algorithm that uses a mean frame like-
lihood confidence statistic. In this case, the error in the mean frame likelihood
estimate will be a function of the number of frames used in the estimation. For
instance, a confidence statistic with a Gaussian distribution will have an error
proportional to σ/√N, based on normal distribution error analysis theory.
Hence, verification of short words tends to be more erroneous than that of
long words. This is a significant issue as the false alarm rates of keyword spotters
are also typically poorer for shorter words.
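The σ/√N behaviour of a mean-frame-likelihood estimate can be illustrated with a small Monte Carlo check on synthetic Gaussian frame scores (all values and names are invented for illustration, not thesis results):

```python
import math
import random

def mean_score_error(sigma, n_frames, trials=4000, seed=0):
    """Monte Carlo illustration: the standard deviation of the mean of
    n_frames i.i.d. Gaussian frame scores behaves as sigma / sqrt(n_frames),
    so shorter words yield noisier confidence estimates."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.0, sigma) for _ in range(n_frames)) / n_frames
             for _ in range(trials)]
    mu = sum(means) / trials
    return math.sqrt(sum((m - mu) ** 2 for m in means) / trials)

# A 10-frame (short) word's mean-score error is about three times that
# of a 90-frame word, matching the sigma / sqrt(N) prediction.
```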
2.9.4 Likelihood ratio based approaches
The likelihood ratio test is one of the more common forms of confidence scores
used in keyword verification. Specifically, the likelihood ratio is used in keyword
verification as a simple 2-class discrimination statistic for determining if an ob-
servation sequence, O, should be classified as an occurrence of the target word w.
A typical formulation uses the non-keyword model likelihood as the normalising
term, as given by
LR(W, O) = p(O|w) / p(O|!w)        (2.16)

log LR(W, O) = log p(O|λw) − log p(O|λnkw)        (2.17)
This formulation is convenient since it makes use of the non-keyword model, λnkw,
that has been well defined in keyword spotting literature.
Figure 2.10 shows a typical configuration for a verification system that uses a
likelihood ratio based confidence score. Additionally it shows how multiple veri-
fiers can be combined using fusion techniques to obtain a more robust verification
system.
As for keyword spotting, the main difference between the majority of likeli-
hood ratio based methods is the choice of non-keyword model architecture. A
high-order GMM non-keyword model was proposed by Wilpon et al. [35]. In
a similar fashion to SBM-based keyword spotting (see section 2.7.1), this non-
keyword model would provide discrimination against the speech of the average
word. Wilpon et al. [35] first scored putative occurrences against the target word
model and the non-keyword model. These scores were then combined using a
likelihood ratio to obtain the confidence statistic:
LR(W, O) = ( p(O|λw) − p(O|λGMM) ) / |O|        (2.18)
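A duration-normalised, log-domain variant of this verification scheme can be sketched as follows; the function and parameter names are assumptions, and the threshold would be tuned on held-out data:

```python
def verify_occurrence(loglik_word, loglik_background, n_frames, threshold):
    """Hedged sketch of likelihood ratio keyword verification in the log
    domain (cf. equations 2.17 and 2.18): the target-word minus background
    log-likelihood difference is normalised by the number of frames and
    compared against a tuned acceptance threshold."""
    confidence = (loglik_word - loglik_background) / n_frames
    return confidence, confidence >= threshold

# A putative occurrence scoring well above the background model is kept.
score, accepted = verify_occurrence(-100.0, -130.0, n_frames=30, threshold=0.5)
print(score, accepted)  # 1.0 True
```

The per-frame normalisation is what gives the statistic some robustness to the duration variations discussed in section 2.9.1.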
Leung and Fung [20] proposed an alternate non-keyword model that was con-
structed from the parallel combination of all states and models that were not
a part of the target keyword model. The non-keyword model likelihood for a
Figure 2.10: Likelihood ratio based keyword occurrence verification with multiple verifier fusion
putative occurrence was then calculated by determining the maximum likelihood
obtained when decoding the occurrence set using the non-keyword network:
log LR(W, O) = log p(O|λw) − max_i log p(O|qi)        (2.19)

where qi = the ith path through the non-keyword network
Sukkar and Lee [32] and Xin and Wang [37] proposed a more constrained non-
keyword network constructed from the parallel combination of all cohort phones.
A cohort phone was defined as a phone that was highly confusable with one
of the phones of the target keyword. Confusion information could be derived
from either knowledge based rules or from confusion matrix statistics of a phone
recogniser. These approaches alleviated the ad hoc nature of constructing the
non-keyword network and additionally reduced the size of the non-keyword
model network.
A similar cohort-based approach was proposed by Liu and Zhu [22]; however,
the cohort models were derived dynamically. A non-keyword network was
first constructed from all subword models. Then, an N-Best recognition pass was
used to determine the N-Best subword models for the given putative occurrence
sequence. Finally, the non-keyword model likelihood was estimated using the
average of the likelihoods of the N-Best subword models.
2.9.5 Alternate Information Sources
Another means of keyword verification is to use an alternate information source
to that used in the initial keyword spotting stage. This means that the keyword
is verified using new information, rather than simply using a transformation of
the same information.
One source of alternate information is to use an alternate feature set. The
information captured within a different feature set may contain complementary
information which may aid in discrimination. Wu et al. [36] suggested the use
of prosodic information to augment phonetic information in confidence scoring.
This system first used phonetic information in a typical filler model based keyword
spotter. The verification stage then used phonetic features and prosodic features
to derive a log likelihood ratio.
Another means of incorporating alternate information is to use a different
classifier. For example, a Gaussian-based classifier uses Gaussian decision
boundaries as a basis for decision making. In contrast, certain classes of
neural-network classifiers use a non-linear decision boundary. The two
classifiers may provide complementary information that is useful in keyword
verification. Ou et al. [25] combined a neural network in series with an HMM
system to utilise the orthogonal discriminative abilities of the two
recognition systems. The speech signal was
passed into a typical filler-model based keyword spotting system. The posterior
probabilities of the HMM system were then passed to a neural network for further
discriminative analysis.
2.10 Audio Document Indexing
Audio Document Indexing (ADI) algorithms are required to provide rapid search-
ing of large collections of audio documents for keywords of interest. Typical al-
gorithms use a two-pass approach, where the audio is first prepared during an
initial pass for rapid searching during subsequent query passes. In this way, the
majority of query-word independent processing can be performed during audio
preparation, leaving as little processing as possible to be performed during query
time.
48 Chapter 2. A Review of Keyword Spotting
2.10.1 Limitations of the Speech-to-Text Transcription approach
The most intuitive approach, and the one offering the greatest query speed,
uses a large vocabulary speech-to-text (STT) transcription system to fully
transcribe audio documents to text for subsequent rapid textual searching.
This requires a large initial overhead to perform STT, but as a result allows
very fast text-only searching during queries.
Unfortunately, a significant restriction of this approach is that queries are
restricted to the vocabulary of the STT engine. Many ADI applications require
support for unrestricted vocabulary queries, such as names, places, slang and
foreign language terms. Even transcription systems with very large vocabularies are
unlikely to provide sufficient coverage of the required query vocabulary for such
systems.
Additionally, any errors made by the transcription system during transcription
are completely unrecoverable at query time, even with knowledge of the actual
query term. Hence, the performance of these systems will always be limited by
the word error rate of the associated STT system.
In contrast, unconstrained vocabulary audio document indexing techniques
are specifically designed to support unrestricted vocabulary queries. Unfortu-
nately, these unconstrained methods also have much slower query speeds than
the STT approach.
Unconstrained vocabulary indexing methods use an initial pass of the data
to derive a compact intermediary representation of the speech. This intermedi-
ary representation must be sufficiently terse to allow rapid query-time searching,
while still preserving sufficient information to provide accurate unconstrained vo-
cabulary querying. At search time, the intermediary representation can then be
searched in a bottom-up fashion to locate putative keyword locations.
2.10.2 Reverse dictionary lookup searches
Early audio document indexing approaches used Reverse Dictionary Lookup
(RDL) searching to detect keywords. RDL searches attempt to infer the location
of high-level events (i.e. dictionary items) from a stream of low-level events
(i.e. dictionary decompositions). For audio document indexing, this inference is made
from a stream of low-level acoustic events such as phones or syllables.
Speech utterances are first transcribed using a subevent-level recogniser, typ-
ically constrained by a language model, to generate a set of subevent level tran-
scriptions. This is only performed once for each speech utterance to prepare the
utterance for subsequent querying.
At search time, the query word is decomposed using a dictionary or letter-
to-sound rules to obtain a target subevent sequence. The previously generated
subevent transcriptions are then searched to locate instances of this target se-
quence.
Commonly, phones have been used as the subevent representation, such as
in the experiments reported by Chigier [8]. In these experiments, the speech
preparation stage used a phone recogniser to generate phone transcriptions. Then,
at query time, the query word was decomposed to a phone sequence and the phone
transcriptions were searched for instances of the target phone sequence.
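The query stage of an RDL search is essentially an exact contiguous subsequence match over the stored phone transcriptions. A minimal sketch; the phone stream below is illustrative:

```python
def find_occurrences(transcription, target):
    """Return the start indices at which the target phone sequence
    occurs contiguously within a phone transcription."""
    n = len(target)
    return [i for i in range(len(transcription) - n + 1)
            if transcription[i:i + n] == target]

# Illustrative phone stream containing the word ACQUIRE (/ax k w ay r/).
stream = "d iy ah sh ax k w ay r d".split()
target = "ax k w ay r".split()
print(find_occurrences(stream, target))  # [4]
```

A single recogniser error anywhere inside the target span (a deletion, substitution or insertion) causes the exact match to fail, which is precisely the fragility discussed above.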
A significant shortcoming of RDL searches is that there is little consideration
given to subevent-level recogniser errors. For example, phone-level transcribers
typically have high error rates (in excess of 40% in many cases) and as such any
generated phone transcriptions are likely to contain a large number of errors.
Figure 2.11 demonstrates how RDL performance is affected by various subevent-
level recogniser errors.
Figure 2.11: Applying reverse dictionary searches to the detection of the word
ACQUIRE in a phone stream. (The figure shows a phone stream in which the
target sequence is detected successfully, and streams in which it is missed
because of deletion, substitution and insertion errors.)
One means of improving robustness is to generate and search multiple phone
transcription hypotheses for each speech utterance. This increases the chance of
the correct transcription being generated for a speech utterance but also increases
the chance of false alarms. Zue et al. [41] used a phone dendrogram to maintain
multiple hypothesis information while Sethy and Narayanan [30] used N-best
phone transcriptions. Both methods yielded improved performance over single-
level phone transcriptions.
An alternative is to use a more robust subevent representation. Sethy and
Narayanan [30] proposed the use of the syllable subevent instead of the phone
subevent. The syllable is a considerably longer unit than the phone, and hence
more easily detected and classified. The robustness of the syllable was demon-
strated by Sethy and Narayanan [30] in experiments where a 17% reduction in
transcription error was obtained using a syllable transcriber instead of a phone
recogniser. The improvement in transcription error rate resulted in considerable
improvements in overall ADI system performance.
2.10.3 Indexed reverse dictionary lookup searches
A shortcoming of reverse dictionary lookup ADI methods is that the amount
of data that needs to be searched at query time is very large. In fact, an
RDL search has O(N) complexity in the size of the database, which may prove
problematic for very large database searches. The Indexed Reverse Dictionary
Lookup (IRDL) search, proposed by Dharanipragada and Roukos [10], addresses
this issue by using an index to constrain searching to only plausible regions
of speech. This reduced set of regions can then be searched using a more
thorough algorithm to generate a final set of putative occurrences.
Figure 2.12 demonstrates the IRDL method proposed by Dharanipragada and
Roukos [10]. During the preparation stage, speech is transcribed to a subevent
level using the same approach used in RDL. A subevent index is then built that
contains the times at which each subevent occurred. Dharanipragada and Roukos
[10] used trigram subevents for this index.
At query time, the query term is first decomposed into a subevent-level
representation. The subevent index is then consulted to obtain the locations
of all subevents of which the query word is composed. Following this,
correlations between subevent locations are examined to determine locations in speech that most
closely match the query term subevent sequence. Dharanipragada and Roukos
[10] performed this correlation by quantising the subevent timescale into 1 sec-
ond units and then finding times at which a majority of the subevents occurred
in close proximity. Finally, the resultant candidate regions are searched using a
more thorough keyword spotting method. An HMM-based approach was used by
Dharanipragada and Roukos [10] for this stage.
Figure 2.12: Example of indexed reverse dictionary searching for the detection
of the word ACQUIRE. (The preparation stage produces subevent transcripts,
e.g. a lattice or dendrogram, and a subevent index of subevent occurrence
times; the query stage decomposes the query term into a phone sequence,
combines subevent locations, and outputs putative results.)
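The preparation and coarse-search steps can be sketched as below. The index maps each phone trigram to the times at which it occurs; at query time, the 1-second bins in which a majority of the query trigrams co-occur become candidate regions. All data, and the majority fraction, are illustrative:

```python
from collections import defaultdict

def build_trigram_index(timed_phones):
    """Map each phone trigram to the times (seconds) at which it starts."""
    index = defaultdict(list)
    for i in range(len(timed_phones) - 2):
        tri = tuple(p for p, _ in timed_phones[i:i + 3])
        index[tri].append(timed_phones[i][1])
    return index

def candidate_regions(index, query_phones, quantum=1.0, majority=0.5):
    """Quantise trigram occurrence times into bins of the given quantum and
    return bins containing at least a majority of the query trigrams."""
    trigrams = [tuple(query_phones[i:i + 3]) for i in range(len(query_phones) - 2)]
    hits = defaultdict(set)
    for tri in trigrams:
        for t in index.get(tri, []):
            hits[int(t / quantum)].add(tri)
    return sorted(b for b, s in hits.items() if len(s) >= majority * len(trigrams))

# Illustrative timed phone transcript: (phone, start time in seconds).
transcript = [("ax", 3.1), ("k", 3.2), ("w", 3.3), ("ay", 3.4), ("r", 3.5), ("d", 3.6)]
index = build_trigram_index(transcript)
print(candidate_regions(index, ["ax", "k", "w", "ay", "r"]))  # [3]
```

Each returned bin would then be passed to the slower, more thorough acoustic search stage.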
The IRDL approach offers considerable benefits in terms of scalability by
using the subevent index to constrain query time searching. Since the size of the
candidate region set is considerably smaller than the size of the entire database,
processing requirements are considerably lower than for the standard reverse
dictionary lookup approach.
However, the coarseness of the subevent correlation process may result in
a very large number of candidate regions that then need to be searched using
slower acoustic search methods. Hence, any gain in speed obtained by reducing
the candidate search space through subevent indexing may be lost to the
requirement for subsequent slower acoustic searching.
2.10.4 Lattice based searches
Lattice based searching, proposed by Young and Brown [39], reduces the effect
of phone recogniser error in RDL by searching a phone lattice representation
of speech at query time. Phone lattices encode a significantly greater number
of recognition paths than individual N-best transcriptions, therefore greatly in-
creasing the possible search space for query time processing.
In lattice based searching, an initial pass of the speech is first made using a
phone recogniser to generate phone lattices. At query time, these lattices are then
searched for any instances of the target word phone sequence. Any discovered
sequences are extracted from the lattices together with corresponding node times
and scores to generate a set of putative occurrences. Figure 2.13 demonstrates
the lattice based search process for detecting the word ACQUIRE with phone
transcription (/ax/, /k/, /w/, /ay/, /r/).
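Searching the lattice for a target phone sequence amounts to finding a connected path of nodes whose labels spell the sequence. A minimal depth-first sketch over a toy lattice; the node and edge structure below is illustrative, not a real lattice format:

```python
def lattice_search(nodes, edges, target):
    """Find paths of connected lattice nodes whose phone labels spell the
    target sequence. nodes: {id: phone}; edges: {id: [successor ids]}."""
    def extend(node, depth, path):
        if nodes[node] != target[depth]:
            return
        path = path + [node]
        if depth == len(target) - 1:
            results.append(path)
            return
        for nxt in edges.get(node, []):
            extend(nxt, depth + 1, path)

    results = []
    for start in nodes:
        extend(start, 0, [])
    return results

# Toy lattice containing /ax k w ay r/ (ACQUIRE) plus a confusable /p/ arc.
nodes = {1: "ax", 2: "k", 3: "p", 4: "w", 5: "ay", 6: "r"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: [6]}
print(lattice_search(nodes, edges, ["ax", "k", "w", "ay", "r"]))  # [[1, 2, 4, 5, 6]]
```

In a full system, the node times and scores along each matching path would be extracted to form the putative occurrence record.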
One advantage of this method is that a degree of phone recogniser error ro-
bustness is implicitly provided by the multiple paths encoded within the lattice.
Figure 2.13: Using lattice based searching to locate instances of the word
ACQUIRE within a phone lattice
Phone recognisers are highly susceptible to error and therefore such robustness is
likely to be beneficial.
However, a major problem when using phone lattices is generating a suffi-
ciently compact representation for storage. A phone lattice can potentially rep-
resent thousands of possible utterance transcriptions that need to be preserved
for query time searching. This results in significant disk storage
requirements, and consequently query time searching requires a large amount
of disk access - a potentially slow operation.
Hence, Young and Brown [39] proposed storing an approximation of the lattice.
The lattice time scale was quantised and nodes were then placed into bins based on
their quantised time value. It was then assumed that a node in bin T was always
connected to a node in bin T + δ, where δ was the quantisation unit. In this way,
lattice storage no longer required storage of node interconnection information,
resulting in a significant saving in storage for typically highly interconnected
phone lattices.
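Young and Brown's approximation can be sketched as follows: node times are quantised into bins of width δ and the interconnection information is simply discarded, since every node in bin T is assumed connected to every node in the next bin. The lattice data here is illustrative:

```python
from collections import defaultdict

def quantise_lattice(node_list, delta):
    """Store a lattice as time bins only: each (phone, time) node is placed
    in bin floor(time / delta). Edge information is not stored, because a
    node in bin T is assumed connected to every node in bin T + 1."""
    bins = defaultdict(list)
    for phone, time in node_list:
        bins[int(time / delta)].append(phone)
    return {b: sorted(p) for b, p in bins.items()}

# Illustrative lattice nodes as (phone, time in seconds).
node_list = [("ax", 0.02), ("aa", 0.03), ("k", 0.11), ("p", 0.12), ("w", 0.21)]
print(quantise_lattice(node_list, delta=0.1))
# {0: ['aa', 'ax'], 1: ['k', 'p'], 2: ['w']}
```

Only the bins need to be written to disk, which is the source of the storage saving; the false lattice paths discussed below arise because any phone in one bin may now be followed by any phone in the next.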
Unfortunately, this optimisation also introduces node interconnections that
may not have previously existed, potentially leading to false lattice paths and
therefore increased false alarm rates.
An extension to this lattice based search was proposed by James and Young
[14]. This work proposed the introduction of dynamic programming techniques
into the lattice search to provide additional robustness against phone recogniser
error. Experiments demonstrated some gains in detection performance.
Interestingly though, this technique was removed from subsequent publications
by the secondary authors, suggesting that the additional processing required
for this dynamic programming search did not necessarily justify the gains in
performance.
Chapter 3
HMM-based spotting and verification
3.1 Introduction
A range of keyword spotting and verification algorithms have been proposed in
the literature. Of these, HMM-based methods have stood out as offering
exceptional performance and have the benefit of lying within a well-studied
and mature speech recognition framework.
This chapter analyses a number of these methods and reports on compara-
tive evaluations performed on conversational telephone speech. The motivation
for this work was to provide a platform for further research by investigating the
current approaches and identifying the key issues that needed to be addressed in
subsequent work. The research was further necessitated by the lack of compara-
tive keyword spotting evaluations available in recent literature.
The majority of HMM-based methods differ primarily in their approach to
non-keyword modeling. As such, the initial sections of this chapter examine a
number of non-keyword modeling techniques and discuss their individual
strengths and weaknesses with regard to keyword spotting, verification and
discrimination in general. A conceptual framework named the confusability
circle is presented and used here to aid discussion.
Subsequent sections report on HMM-based keyword spotting experiments us-
ing these non-keyword model architectures. The reported experiments include a
comparative evaluation of various non-keyword models, the effects of target key-
word length on performance and a discussion on the tunability of HMM-based
systems.
The final sections of this chapter discuss and evaluate HMM-based keyword
verification. The presented work investigates the performance of SBM-based
verification as well as the role of discriminative classifiers.
3.2 The confusability circle framework
The confusability circle is a simple tool that aids the visualisation of aspects
related to the analysis of non-keyword models. This concept is used extensively
within this work to provide a well-defined framework for such discussion.
The framework is based on three key properties that define a good non-
keyword model:
1. A high degree of representation for words that are confusable with the target
word. Within this work, the word W1 is said to be confusable with word
W2 if there is a reasonably high probability that occurrences of word W1
will be output as putative occurrences of W2 by a keyword spotter.
2. A high degree of representation for words that are very disparate from the
target word. The word W1 is said to be disparate with regard to the word
W2 if there is a low probability of occurrences of word W1 being output as
putative occurrences of word W2.
3. A low degree of representation for the target word.
The properties above can be viewed as modeling three types of regions in
acoustic feature space. The confusability circle in figure 3.1 shows these regions
within a simplified two-dimensional feature space. The centre region, labeled the
Target Acoustic Region (TAR), corresponds to the acoustic feature space within
which observations for the target word fall. The surrounding region, named
the Confusable Acoustic Region (CAR), represents the region where observations
from highly confusable words occur. Finally the outer region, called the Disparate
Acoustic Region (DAR), represents the rest of acoustic feature space, within
which observations for words that are disparate from the target word fall.
Figure 3.1: Confusability circle for the target word STOCK (the surrounding
regions are populated by example words such as STACK, CLOCK, STROKES, FLAKEY,
DRAGON and LOGISTICAL)
Within this framework, a good non-keyword model should have a high degree
of representation of speech within the DAR and CAR regions and a low level
of representation within the TAR region. As continually highlighted in keyword
verification literature, this is a difficult problem because of the large non-keyword
acoustic space that needs to be modeled and the comparatively small target
keyword region of acoustic space that has to be excluded.
The confusability circle paradigm has a number of limitations. Primarily, it is
unlikely that there will be clear distinctions between the individual confusability
regions. Instead, there is more likely to be some overlap between regions, resulting
in fuzzy boundaries. Additionally, the notions of confusability are fairly loose
and are not formally defined.
Nevertheless, the confusability circle framework provides a convenient means
of representing and discussing the various aspects related to non-keyword mod-
eling. It does not provide a means of proving any conjectures arising from these
discussions but merely simplifies the process of visualising related issues. It is for
this reason that it is used in subsequent discussion in this work.
3.3 Analysis of non-keyword models
3.3.1 All-speech models
A distinct class of non-keyword models is those that model all speech, such as the
SBM method used by Wilpon et al. [35] and the phone set approach reported by
Bourlard et al. [5]. Within the confusability circle framework, it can be seen that
these methods perform unbiased modeling of all three regions: the TAR, CAR
and DAR. No attempts are made to include implicit discrimination between the
various confusability regions directly into the non-keyword model.
Despite this, all-speech non-keyword modeling techniques have been
demonstrated to be effective in the literature, for example in the work of Bourlard et al. [5].
Clearly this cannot be because of any specific DAR and CAR modeling. Instead,
such methods must rely on robust CAR modeling in the target keyword models
to obtain good discriminability. If such a robust target keyword model can be
constructed, then output likelihoods of the target word model would hopefully
exceed the output likelihoods of the non-keyword model within the CAR region,
leading to proper target word discrimination. All-speech non-keyword models are
therefore dependent on the robustness of the target keyword model rather than
the robustness of the non-keyword model.
3.3.2 SBM methods
SBM based non-keyword models are a specific type of all-speech model that have
been demonstrated to be effective for keyword discrimination by Wilpon et al. [35]
and Rose and Paul [28]. The key benefits of such an approach are the simplicity of
the model and the reduced network complexity compared to a phone model set
approach.
As stated before, the performance of all-speech models will be dictated to a
degree by the quality of the target word model. In particular, good performances
can only be expected if a robust target word model is used. In an HMM-based
framework, it is easy to build such a model, either through a word-based model or
more flexibly through the concatenation of phone models. Both these approaches
will result in a robust target word model that is sufficiently disparate from the
non-keyword SBM model.
As such, one would expect good detection rates using an SBM non-keyword
model. However, the highly generalised nature of the SBM means that there
is little CAR modeling in the non-keyword model, and therefore, a significant
number of false alarms is likely to be observed.
3.3.3 Phone-set methods
The phone-set approach is another all-speech architecture that performs more
explicit modeling of speech than an SBM by using individual phone models. In
the confusability circle, this can be visualised as modeling individual pockets of
the entire confusability circle.
One limitation of a phone-set non-keyword model is that there will be overlap
between the target word model and the non-keyword model if both are con-
structed using models from the same phone model set. Given this, observations
associated with common phones will score equally well in the target word model
and the non-keyword model. As a result, one would expect an increase in miss
rate compared to an SBM non-keyword model.
Obviously, this overlap is reduced if a word-based target model is used.
However, the use of word-based models significantly reduces the flexibility
of keyword spotting and verification algorithms, and is particularly
unsuitable for unrestricted vocabulary tasks.
3.3.4 Target-word-excluding methods
Another distinct class of non-keyword models is those constructed from a
combination of all subevent models (e.g. phones, states) excluding those that occur in
the target keyword. The methods proposed by Leung and Fung [20], Sukkar and
Lee [32], and Xin and Wang [37] are examples of such models. These approaches
specifically attempt to exclude speech in the TAR region, hence constraining
non-keyword modeling to only the DAR and CAR regions.
Unfortunately, excluding target word subevents from the non-keyword model
does not necessarily guarantee this to be the case. This is because a number of
confusable words in the CAR region will have subevents that are shared with the
target word. This region of overlap between the subevents from confusable words
and those from the target word is shown as the Shared Subevent CAR (SSCAR)
region in figure 3.2.
The subevents in the SSCAR are a subset of the subevents in the target word
model but are also excluded from the non-keyword model. As a result, there will
be a number of highly confusable words that are better modeled by the target
word model than the non-keyword model. This unfortunately is likely to lead to
an increased number of false alarms over the standard phone model approach.
However, a possible benefit is a decrease in miss rate since there is now a greater
separation between the non-keyword model and the target word model.
Figure 3.2: Example of the shared subevent confusable acoustic region for the
keyword STOCK (confusable words shown include STACK, SOCK, FLOCK and SPOCK)
3.4 Evaluation of keyword spotting techniques
The choice of non-keyword architecture is the primary difference between many
HMM-based keyword spotting techniques. In the previous section, a number
of these were compared and contrasted, and hypotheses were made regarding
expected miss rate and false alarm rate performances. This section reports on
experiments that quantitatively evaluate HMM-based keyword spotting using a
selection of these non-keyword models. Three specific algorithms are examined
here:
1. SBM-KS - Speech background model approach (section 2.7.1)
2. PM-KS - Phone model approach (section 2.7.2)
3. XPM-KS - Target-word-exclusive phone model approach based on the
non-keyword model described by Leung and Fung [20] (section 2.9.4).
3.4.1 Experiment setup
Recogniser parameters
HMM models were trained using data taken from a 165 hour subset of the
Switchboard 1 conversational telephone speech corpus. Speech was parameterised using
Perceptual Linear Prediction (12 statics and C0 + deltas + accelerations) fea-
ture extraction with Cepstral Mean Subtraction applied to provide speaker and
channel compensation.
Target word models were constructed by training cross-word triphone HMMs
with 16-mixture Gaussian state distributions on the Switchboard training
data. A 256-mixture GMM model was trained on this same data to be used as
the non-keyword model for the SBM-KS experiments. Additionally, 32-mixture
monophone models were constructed for use as non-keyword models for the PM-
KS and XPM-KS methods.
Evaluation set
Evaluation speech was taken from a 2 hour subset (not overlapping with the
training data) of the Switchboard data. The query words were constrained to
6-phone words to remove any variation in experiment results due to keyword
length. As such, 360 unique 6-phone words were randomly chosen from the
evaluation speech and labeled as query words. These query words appeared a
total of 808 times in the evaluation set.
Evaluation procedure
Experiments were performed as follows.
1. For each query word, single-word keyword spotting was performed for every
utterance in the evaluation set.
2. Total execution time was measured using a single 3GHz Pentium 4 proces-
sor.
3. Miss and false alarm rates were calculated using reference forced-aligned
word-level transcription timings. No thresholding or normalisation was ap-
plied to output scores prior to calculating performance metrics.
Execution time is reported in terms of CPU minutes per queried word per hour
(CPU/kw-hr).
3.4.2 Results
Results of the experiments are shown in table 3.1. In terms of detection per-
formance, SBM-KS clearly outperforms the other two systems, yielding a very
low miss rate of 1.9%. Execution time is also significantly faster than the phone-
model based methods, being approximately 10 times faster than the XPM-KS and
PM-KS systems. Unfortunately, the high FA/kw rate of SBM-KS is of concern.
Method    Miss rate (%)   FA/kw   CPU/kw-hr
SBM-KS    1.9             419.7   1.82
PM-KS     32.6            2.1     18.0
XPM-KS    19.8            10.3    16.1
Table 3.1: Keyword spotting performance of baseline systems on Switchboard 1 data
Typically a high FA/kw rate is considered crippling for a keyword spotting
system. This is because a system that has such a high number of extraneous
incorrect results is essentially unusable from a practical perspective. However,
if a keyword spotting system is used in conjunction with an accurate keyword
verification stage that can successfully remove a large proportion of the false
alarms, then the significance of the high false alarm rate is greatly reduced. In
essence, the major penalty of a high false alarm rate is then simply an increase in
the amount of processing required in the keyword verification stage. False alarm
errors are therefore recoverable errors within the keyword spotting context.
In contrast, a poor miss rate is an error that is completely unrecoverable.
Once a true occurrence has been missed by a keyword spotter, no amount of
subsequent processing can recover the missed occurrence. As such, it is more
favourable for a keyword spotting system to have a low miss rate rather than a
low false alarm rate, given an accurate subsequent keyword verification stage.
The poor miss rate performances of PM-KS and XPM-KS are a result of overly
greedy non-keyword models, and are in line with assertions made in sections 3.3.3
and 3.3.4. A good non-keyword model should generate much lower likelihoods
than the target word model in the TAR. However, an overly greedy non-keyword
model generates likelihoods that are close to or exceed the likelihoods of the
target word model in the TAR, leading to miss errors. The presented
results validate the assertion that the PM-KS and XPM-KS non-keyword models
are greedier than the SBM-KS non-keyword model.
Reduction of overlap between the non-keyword model and the target-word
model is one method of reducing the greediness of a non-keyword model. This
in fact is the motivation behind the XPM-KS method. XPM-KS removes all
phone models that exist in the target word from the list of phone models used in
its non-keyword model, thus generating a less-overlapping non-keyword model
on a per word basis. As demonstrated by the above results, and as postulated
in section 3.3.4, the method is effective in improving miss error rate over a non-
target-word-excluding PM-KS system. Specifically, an absolute decrease of 12.8%
in miss rate is observed.
However, an unfortunate side-effect of this is that speech corresponding to
words similar to the target word is now consumed by the target word model.
As explained in section 3.3.4, this leads to an increase in false alarm rate,
as demonstrated by the 8.2 FA/kw absolute increase observed for XPM-KS over
PM-KS.
The significantly slower execution times for PM-KS and XPM-KS are a direct
result of the increased complexity of the non-keyword model. Using a composite
non-keyword model comprised of multiple phone models results in an increase
in the number of nodes in the recognition lattice, and hence an increase in the
amount of decoding processing. For example, a 44-entry phone set requires 44
nodes in the recognition network as well as 44 × 44 = 1936 extra links between
phone nodes. In contrast, the SBM-KS approach only has a single node within
the non-keyword model, resulting in very little impact on the complexity of the
decoding network.
Overall then, the SBM-KS appears to be the most appealing choice of al-
gorithm for keyword spotting, assuming the availability of a subsequent well-
performing keyword verification stage. Not only is a very low miss rate obtained
using this method, but execution speed is significantly faster, making the method
suitable for real-time speech processing tasks.
3.5 Tuning the phone set non-keyword model
As noted above, one reason for the poor detection performance of the PM-KS
and XPM-KS methods was an overly greedy non-keyword model. One means
of addressing this issue is to introduce a target word insertion penalty. This is
analogous to the concept of word insertion penalty used in speech recognition -
a means to control the insertion rate of tokens.
The word insertion penalty is incorporated into the recognition process by
inserting a link transition probability into the recognition network. This is de-
noted by the α link probabilities in figure 3.3. Using a positive word insertion
penalty on a target keyword node will favour the emission of target words over
non-keywords during the decoding process. Conversely, using a negative
penalty simulates a more greedy non-keyword model, resulting in an increase in
miss rate but also a decrease in false alarm rate.
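The effect of the penalty on decoding can be sketched as a shift applied to the keyword path score before it is compared against the non-keyword path; the scores and penalty values below are hypothetical:

```python
def emit_keyword(keyword_loglike, background_loglike, insertion_penalty):
    """Decide whether the decoder emits the target keyword: a positive
    penalty added to the keyword path score favours keyword emission,
    counteracting an overly greedy non-keyword model."""
    return keyword_loglike + insertion_penalty > background_loglike

# Hypothetical scores: without a penalty the background model wins (a miss).
print(emit_keyword(-530.0, -500.0, 0))   # False
print(emit_keyword(-530.0, -500.0, 50))  # True
```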
Experiments were performed to determine if adjusting target word inser-
tion penalty would yield improvements in the performance of keyword spotting
methods that use phone-model non-keyword models. The PM-KS experiments reported
in section 3.4 were repeated using two different values of target word insertion
penalty: 50 and 100. It was assumed that trends observed for PM-KS would be
similar to trends observed in XPM-KS and as such experiments were not repeated
for the XPM-KS method. Results of these evaluations are shown in table 3.2.
Figure 3.3: Incorporating target word insertion penalty into HMM-based
keyword spotting
The original PM-KS method evaluated in section 3.4 is listed in this table as
having a word insertion penalty of 0.
Method   Word insertion penalty   Miss rate (%)   FA/kw
PM-KS    0                        32.6            2.1
PM-KS    50                       21.0            8.8
PM-KS    100                      15.8            17.8
Table 3.2: Effect of target word insertion penalty on PM-KS performance
The results demonstrate that target word insertion penalty is indeed an ef-
fective means of obtaining a more palatable level of performance. Specifically, an
absolute decrease of 16.8% in miss rate is obtained using a penalty of 100, though
at the expense of a 15.7 increase in FA/kw. This suggests phone-model based
systems may be tuned to obtain low miss rates in a similar fashion to SBM-KS,
in particular if much higher word insertion penalties are used. Unfortunately, the
growth in FA/kw rate indicates that a similar problem as faced in SBM-KS will
occur - a very low miss rate but at the expense of a very high FA/kw rate.
Additionally, this type of tuning does not provide any benefit for execution
speed. As a result, even though improved detection performance may be achiev-
able, the very high execution time of PM-KS and XPM-KS still make them
unattractive for real-time keyword spotting tasks.
3.6 Output score thresholding for SBM spotting
Output score thresholding is a simple means of adjusting the operating point
of a keyword spotting system. Given a set of putative occurrences and their
corresponding likelihoods, such thresholding can be used to reduce false alarm
rate in exchange for an increase in miss rate.
As demonstrated in section 3.4, SBM-KS was capable of delivering lower miss
rates and faster execution speeds than PM-KS and XPM-KS. However, this was
at the expense of a considerably higher false alarm rate. Experiments were there-
fore performed to determine whether output score thresholding could be used to
reduce false alarm rates.
It must be noted that the aim of these experiments was not to yield false alarm
rates sufficiently low for the final output of a keyword spotting system. This was
because in practice a subsequent keyword verification stage would be used to
further cull false alarms. Instead, it was hoped that output score thresholding
could be applied here to remove any highly improbable putative occurrences
without significantly affecting miss rate.
The output score thresholding techniques that were evaluated were:
1. UNT - This method applied direct thresholding on the unnormalised pu-
tative occurrence likelihoods output by SBM-KS
2. DNT - In this approach, duration normalisation was applied to all putative
occurrence likelihoods prior to applying thresholding. This was done to
compensate for variations in the length of realisations of keywords. The
duration normalised likelihood was calculated by:
DNT(p, ts, te) = p / (te − ts)    (3.1)
where p = Output likelihood of putative occurrence
ts = Start time of putative occurrence
te = End time of putative occurrence
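Equation 3.1 and the subsequent culling step can be sketched as follows (the tuple layout for putative occurrences and the threshold value are assumptions for illustration):

```python
def dnt_score(p, ts, te):
    """Duration-normalised likelihood of a putative occurrence (equation 3.1)."""
    return p / (te - ts)

def cull(putative_occurrences, threshold):
    # Keep only occurrences whose duration-normalised likelihood clears the
    # threshold; each occurrence is assumed to be (likelihood, start, end).
    return [occ for occ in putative_occurrences
            if dnt_score(*occ) >= threshold]

occurrences = [(-50.0, 10.0, 10.4), (-200.0, 31.0, 31.5)]
print(cull(occurrences, threshold=-150.0))  # drops the improbable second entry
```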
The SBM-KS putative occurrence set from experiments reported in section
3.4 was initially post filtered using each of the above methods. Performance was
then measured at various thresholds to obtain the DET plots shown in figure 3.4.
Equal error rates for these systems are given in table 3.3.
Thresholding Method Equal error rate
UNT 58.1
DNT 41.1
Table 3.3: Equal error rates of unnormalised and duration normalised output score thresholding applied to SBM-KS
The results clearly indicate that duration normalisation is more appropriate
than unnormalised score thresholding for the task of crude false alarm culling.
The equal error rate obtained using DNT is almost 20% lower than that obtained
for UNT.
Figure 3.4: DET plots for unnormalised and duration normalised output score thresholding applied to SBM-KS
The slope of both DET plots is almost at a 45° angle, indicating that the
trade-off between miss rate and false acceptance rate is approximately one-for-
one. That is, a reduction of 10% in false acceptance rate would be matched by
an approximately equal increase of 10% in miss rate. This is unfortunate, as
such sensitivity means that attempting to reduce false alarm rates by any
significant amount would incur correspondingly large increases in miss rate.
The experiments therefore indicate that output score thresholding is a poor
method of reducing the false alarm rate of SBM-KS, even as an initial pre-processing
step to remove highly improbable putative occurrences.
3.7 Performance across keyword length
The keyword spotting experiments reported so far in this chapter have been re-
stricted to only the detection of 6-phone-length keywords. This was done to
reduce any variations in performance caused by target keyword length. This
section quantitatively measures the performance of SBM-KS for three target key-
word lengths: 4-phone, 6-phone and 8-phone, corresponding to short, medium
and long keywords respectively.
3.7.1 Evaluation sets
Three target-word-length dependent evaluation sets were used for this set of ex-
periments. The 6-phone set was taken directly from previous experiments, while
4-phone and 8-phone sets were constructed anew. These new sets were built in a
similar fashion to the 6-phone set, except for the number of query words selected.
This was because there were limits on the available number of unique words in
the evaluation speech.
In particular, it was not possible to find 360 unique 8-phone words, and as
such, only 184 words were selected. In contrast, the number of words selected for
the 4-phone set was increased from 360 to 400. Table 3.4 shows the number of
query words selected for each phone length as well as the corresponding number
of actual occurrences of these words within the evaluation speech.
Phone-length Number of query words Number of occurrences in data
4 400 1957
6 360 808
8 184 364
Table 3.4: Details of phone-length dependent evaluation sets
3.7.2 Results
SBM-KS performance for the phone-length dependent evaluation sets are shown
in table 3.5. As expected, the numbers show that miss rate decreases as target
word phone-length increases. For very short 4-phone keywords, a miss rate of
5.7% was obtained, which is pleasing considering the very short duration of 4-
phone length events.
Phone-Length Miss rate FA/kw
4 5.7 315.4
6 1.9 419.7
8 1.6 302.9
Table 3.5: SBM-KS performance on Switchboard 1 data for different phone-length target words
An unexpected result is the trend in FA/kw rates. False alarm rates for
medium length words actually exceeded that observed for both 4-phone and 8-
phone words. This is a perplexing result, but is most likely due to the chance
occurrence that the operating point for SBM-KS for this particular 6-phone set
happened to be at a lower miss-rate and higher false alarm rate.
Duration normalised output thresholding was applied to further study the
trends in performance across keyword length. As shown in figure 3.5, DNT does
not provide any significant capability for culling false alarms without dramatically
affecting miss rate. This is true even for long duration 8-phone keywords which
theoretically should be easier to cull due to their increased duration and therefore
increased discriminability.
3.8 HMM-based keyword verification
To date, many of the HMM-based verification systems proposed in literature
have differed primarily in their choice of non-keyword model. The proposed non-
keyword models have also been used in keyword spotting and hence there is a
significant overlap between the two areas of research.
Figure 3.5: DET plots for duration normalised output score thresholding applied to SBM-KS for keyword length dependent evaluation sets
HMM-based keyword verification is very similar to HMM-based keyword spot-
ting, with the major exception that recognition is performed on isolated word
instances. As such, many findings from keyword spotting research will also apply
to keyword verification.
In previous sections, a number of experiments comparing SBM-based and
phone set based keyword spotting were presented. In these experiments, it was
established that the SBM-based method was considerably faster in terms of ex-
ecution speed. This is equally applicable in SBM-based keyword verification
providing a clear benefit for real-time processing.
Additionally, it was found that the SBM-based system achieved significantly
lower miss rates (though a phone set based system could potentially be tuned to
obtain a lower miss rate as well). A low miss rate is paramount for a keyword
verification system to avoid compounding the miss rate of the previous keyword
spotting stage.
For these reasons, only the SBM-based configuration is evaluated in this sec-
tion. Its speed and low miss rate performance make it an excellent candidate for
real-time keyword verification tasks.
3.8.1 Evaluation set
A keyword verification evaluation set can be constructed in two ways. The sim-
plest is to randomly select words from a word-level transcription and relabel a
selection of these to simulate false alarms. An alternate approach is to use a
keyword spotter to first generate a putative occurrence set. The evaluation set
can then be constructed by selecting hits and false alarms from this result set.
The latter approach better reflects the type of putative occurrences that a KV
system would typically operate on. Occurrences in this set would be acoustically
similar to the target word since they were obviously confused by the keyword
spotter. As a result, they would also be more difficult to verify. Therefore, this is
the approach used to generate the evaluation set for this set of experiments.
Target word length dependent evaluation sets were constructed for 3 keyword
lengths - 4-phone, 6-phone and 8-phone - using the following procedure:
1. SBM-KS was performed using the appropriate target word length dependent
evaluation set from section 3.7
2. A reference transcription was then used to mark each putative occurrence
as a hit or false alarm
3. Finally, a verification evaluation set was constructed by randomly select-
ing a required number of hits and false alarms from the set of putative
occurrences.
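The three-step procedure above can be sketched as follows (the overlap-based hit criterion and the tuple layouts are simplifying assumptions, not the thesis's exact marking rule):

```python
import random

def label_putatives(putatives, reference):
    """Step 2: mark each putative occurrence as a hit or a false alarm.

    Both lists hold (word, start, end) tuples; an occurrence counts as a hit
    if it overlaps a reference occurrence of the same word (an assumed rule).
    """
    hits, false_alarms = [], []
    for word, ts, te in putatives:
        matched = any(w == word and r0 < te and ts < r1
                      for w, r0, r1 in reference)
        (hits if matched else false_alarms).append((word, ts, te))
    return hits, false_alarms

def build_kv_set(hits, false_alarms, n_hits, n_fas, seed=0):
    """Step 3: randomly draw the required number of hits and false alarms."""
    rng = random.Random(seed)
    return rng.sample(hits, n_hits) + rng.sample(false_alarms, n_fas)
```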
Table 3.6 summarises the number of hits and false alarms in each evaluation
set.
Phone-length # Hits # False alarms
4 1882 617171
6 799 339115
8 362 110266
Table 3.6: Statistics for keyword verification evaluation sets
3.8.2 Evaluation procedure
SBM-based keyword verification was performed by calculating the log-likelihood
ratio confidence score for each item in the evaluation set using equation 2.18
in section 2.9.4. False alarm and miss rate performance were then measured at
various thresholds to obtain DET plots and equal error rates. Experiments for
each target word length were performed completely independently.
Note that when calculating miss rate, the miss rate from the previous keyword
spotting stage was added to the keyword verification miss rate. In this way, true
overall system miss rates are reported. This method of calculating miss rate is
used for all keyword verification experiments in this work unless otherwise noted.
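The scoring and miss-rate bookkeeping can be sketched as below (the duration-normalised form of the log-likelihood ratio is an assumed stand-in for equation 2.18, which is not reproduced in this section):

```python
def llr_confidence(ll_target, ll_background, t_start, t_end):
    # Log-likelihood ratio confidence score, normalised by duration
    # (assumed form; the thesis defines the exact score in equation 2.18).
    return (ll_target - ll_background) / (t_end - t_start)

def overall_miss_rate(ks_miss_rate, kv_miss_rate):
    # Convention used in this work: the miss rate of the preceding keyword
    # spotting stage is added to the keyword verification miss rate.
    return ks_miss_rate + kv_miss_rate

print(llr_confidence(-100.0, -120.0, 0.0, 2.0))  # → 10.0
print(overall_miss_rate(2.0, 10.0))              # → 12.0
```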
3.8.3 Results
The equal error rates for SBM-KV on the individual keyword length evaluations
sets are given in table 3.7. DET plots for performance across operating points
are shown in figure 3.6.
The results indicate that target keyword length has a significant effect on
keyword verification. Performance is markedly better for longer length keywords,
with the 8-phone tests yielding the lowest equal error rate of 7.9%. Medium
length KV performance is poorer, with an equal error rate of 12.1%. However,
a very poor 19.9% EER was obtained for 4-phone length words, corresponding to
Phone-length Equal error rate
4 19.9
6 12.1
8 7.9
Table 3.7: Equal error rates for SBM-based keyword verification
Figure 3.6: DET plots for different target keyword lengths for SBM-KV on Switchboard 1 evaluation sets
an absolute increase of 7.8% going from 6-phone to 4-phone words.
These numbers highlight a significant issue for SBM-based keyword verifica-
tion and in fact a problem in general for KV. That is, the inability to estimate
robust confidence scores for short words as a result of a reduced number of obser-
vation vectors available for scoring. Since the SBM-KV confidence score is derived
from the ratio of target word score to non-keyword score, a reduced amount of
data results in statistically less robust estimates of the individual component
scores, and therefore a less robust confidence score ratio.
3.9 Discriminative background model KV
Background model based KV has typically been implemented using an LLR confidence
score formulation. However, the log-likelihood ratio only provides a very
crude decision boundary and therefore is suboptimal for discriminative tasks. In-
stead, a discriminative framework such as a neural network or support vector
machine is likely to provide a more robust decision boundary.
Such concepts have been previously applied to keyword verification in the
works of Ou et al. [25] and Bourlard et al. [5]. Discriminative frameworks have
also been shown to be effective for other classification tasks, such as speaker
verification as demonstrated by Bengio [3].
Given this, the previously evaluated speech background model approach was
modified to incorporate a decision boundary governed by a Multi Layer Per-
ceptron (MLP) neural network. Experiments were performed to compare the
performance of such an approach with the traditional SBM-KV approach.
3.9.1 System architecture
The MLP Speech Background Model (MLP-SBM) method was implemented as
shown in figure 3.7. Each putative occurrence was scored against the target
word model and background model to obtain segment based likelihoods. These
likelihoods, together with the start and end times of the putative occurrence
and duration normalised versions of the likelihoods were then fed into the neural
network as inputs.
The MLP itself was constructed using a single hidden layer consisting of 25
sigmoid neurons. Standard squared error gradient descent training methods were
used to train the network. Training data for the neural network was obtained
by using half the evaluation set from section 3.8. The remaining half of the
evaluation set was used for actual experimentation.
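The network described above can be sketched as a minimal NumPy implementation. The text specifies 25 sigmoid hidden units, squared-error gradient descent, and six inputs per putative occurrence (two raw likelihoods, start and end times, and two duration-normalised likelihoods); everything else here (learning rate, initialisation, batch handling) is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MLPVerifier:
    """Single hidden layer of 25 sigmoid units, squared-error gradient descent."""

    def __init__(self, n_inputs=6, n_hidden=25, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.lr = lr
        self.W1 = rng.normal(0.0, 0.5, (n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.5, (n_hidden, 1))
        self.b2 = 0.0

    def forward(self, X):
        self.h = sigmoid(X @ self.W1 + self.b1)             # hidden activations
        return sigmoid(self.h @ self.W2 + self.b2).ravel()  # scores in (0, 1)

    def train_step(self, X, targets):
        n = len(X)
        y = self.forward(X)
        dy = (y - targets) * y * (1.0 - y)                  # error at the output
        dh = (dy[:, None] @ self.W2.T) * self.h * (1.0 - self.h)
        self.W2 -= self.lr * (self.h.T @ dy[:, None]) / n
        self.b2 -= self.lr * dy.mean()
        self.W1 -= self.lr * (X.T @ dh) / n
        self.b1 -= self.lr * dh.mean(axis=0)
        return 0.5 * np.mean((y - targets) ** 2)            # mean squared error
```

In use, each row of `X` would carry the six features listed in the text, and the network's output would serve as the verification confidence score.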
Figure 3.7: System architecture for MLP background model based KV
3.9.2 Results
Figures 3.8, 3.9 and 3.10 show DET plot comparisons of the standard SBM
method and the MLP-SBM method. Table 3.8 shows the equal error rates for
these systems. Note that the miss rate of the previous keyword spotting stage was
not included when calculating miss rates as the focus of these experiments was to
compare the performances of the individual KV methods rather than overall sys-
tem performance. As such, the DET plots and equal error rates for SBM-KV do
not coincide with those reported in section 3.8. Additionally, the longer-length
evaluations appear to suffer from data sparsity issues as indicated by the step
effects that are prominent in figure 3.10. This is unfortunate but could not be
avoided since the evaluation set size had to be reduced to provide training data
for the MLP.
The results show that there is a marked improvement when using MLP-SBM
for short word verification. An absolute equal error rate gain of 3.7% (from
18.0% down to 14.3%) is observed. This is a notable improvement considering
the difficulty of short-word KV. Additionally, the DET plots show that this gain
is consistent across operating points.
Figure 3.8: DET plots for SBM and MLP-SBM systems for 4-phone words

Figure 3.9: DET plots for SBM and MLP-SBM systems for 6-phone words

Figure 3.10: DET plots for SBM and MLP-SBM systems for 8-phone words
Phone-length SBM MLP-SBM
4 18.0 14.3
6 11.2 9.9
8 7.1 7.2
Table 3.8: Equal error rates for SBM and MLP-SBM keyword verification
Unfortunately, the benefits diminish as keyword length increases, with only a
1.3% gain for 6-phone words and a marginal 0.1% degradation for 8-phone words. Considering
the additional complexity and effort required for an MLP-SBM system, these
minimal gains may not warrant the use of MLP-SBM for longer words.
Overall, the experiments demonstrate that a decision boundary generated by
a discriminative framework, such as an MLP, is beneficial. This is particularly
true for the task of short-word KV. A possible future research task would be to
examine whether additional inputs could be used to further improve performance,
such as scores from multiple non-keyword models or likelihood scores generated
using different feature sets. Compared to the fairly restrictive log-likelihood ratio,
the flexibility of the MLP architecture allows a plethora of possible structures to
be examined, providing greater potential for improved KV performance.
3.10 Summary and Conclusions
A number of keyword spotting and verification architectures were discussed and
evaluated in this chapter. It was found that individual systems excelled in dif-
ferent aspects of performance. Overall though, when the suite of experiments
presented in this chapter are considered holistically, it was established that the
conventional background model based keyword spotting and verification systems
provided the best compromise in performance for a real-time unrestricted vocab-
ulary task.
Specifically, experiments demonstrated that the SBM non-keyword model ob-
tained good miss rate and execution speed performance for keyword spotting.
However, the false alarm rate for this system was very high, indicating that in
practical applications a subsequent keyword verification stage would be required.
In contrast, the phone set non-keyword model method achieved a significantly
lower false alarm rate, though at the expense of a severe degradation in miss rate.
Additionally, this method was approximately 10 times slower than the background
model method, making it less appropriate for real-time and speed-critical
applications.
A target-word-excluding phone model approach was also evaluated for key-
word spotting. This approach was found to have better miss rate performance
than the basic phone model approach, but still suffered from very slow execution
speed.
Experiments were performed to evaluate the use of target word insertion
penalties as a means of improving the miss rate performance of phone model
based keyword spotting. It was found that tuning word insertion penalty was an
effective means of improving miss rate, though at the expense of increasing false
alarm rate. However, this still did not address the problematic execution speed
of phone model based methods.
The effects of output score thresholding for SBM keyword spotting were also
studied as a means of reducing high false alarm rates. Unfortunately it was
found that any significant reduction in false alarm rate using this approach was
matched by dramatic increases in miss rate, making output score thresholding an
inappropriate tuning method.
Analysis of how keyword spotting performance of the SBM-KS method varied
for different target word lengths demonstrated that there were indeed significant
variations. In general, miss rate was poorer, and the associated DET curves were
further from the origin for shorter keywords.
Keyword verification experiments were performed using an SBM based ap-
proach. The results showed that this method was effective for long-length key-
word verification tasks but performed very poorly for short words. Specifically
an equal error rate of 19.9% was obtained for 4-phone keywords.
To address the issue of poor short-word performance, a discriminative frame-
work KV approach was evaluated. Here a neural network was used to estimate
the keyword verification confidence score using component scores from SBM key-
word verification. It was found that this approach provided significant benefits
for short word verification, yielding an impressive absolute gain of 3.7%. Sadly,
the magnitude of this gain was not matched at longer keyword lengths, suggesting
that the additional complexity of using a neural network may not be warranted
for longer word keyword verification.
Chapter 4
Cohort word keyword verification
4.1 Introduction
This chapter presents a novel KV technique called Cohort Word Keyword Veri-
fication. The proposed method attempts to address the issue of poor short-word
performance by incorporating linguistic information into the non-keyword model
construction process. This is done by synthesising the non-keyword model from
a combination of similarly pronounced cohort words in the target language. The
verification confidence measure is then derived using a combination of scores from
the target word model and cohort words in the non-keyword model.
Chapter 3 demonstrated the poor performance of HMM-based spotting and
verification for short target words. Detection of such words led to significantly
higher miss and false alarm rates than those observed for longer words. The
resulting performance would significantly restrict the practical application of such
algorithms.
The key motivation for the cohort word technique was to obtain a more robust
non-keyword model representation that was suitable for short-word KV. Many of
the traditional keyword verification techniques discussed in section 2.9 used non-
keyword models that were either independent of the target keyword or used fairly
simplistic means of deriving a word-dependent non-keyword model. For example
the SBM approach used a GMM that was completely independent of the target
keyword while target-word-excluding methods, such as the method proposed by
Leung and Fung [20], used a simple out-of-target-word phone selection process to
construct the non-keyword model.
Examination of the putative occurrence sets resulting from short word key-
word spotting revealed that a large portion of the false putative occurrences
corresponded to instances of words that had very similar pronunciations to the
target word. For example, the result set when spotting the word STOCK had
instances of words such as LOCK, STOP , and STICK. Since the durations of
these putative occurrences were short, only a very small number of the observa-
tions in each occurrence scored poorly against the target word model. As such,
in many cases, these false putative occurrences would score highly against the
target word model resulting in the potential for false acceptance.
This suggests that constructing a non-keyword model for a given target word
from knowledge of similarly pronounced words may provide more robust keyword
verification. One means of doing this is to use linguistic pronunciation information
to determine the set of confusable words and to then use cohort-based scoring
techniques to perform keyword verification.
Initial sections of this chapter present a thorough discussion of the cohort word
technique. First, discussions of cohort-based scoring and the role of linguistic
information in keyword verification are presented. This is followed by a detailed
description of the cohort word approach, including a formalised definition of the
algorithm.
Subsequent sections report on various experiments that were performed to
evaluate the performance of the cohort word technique. In addition to deter-
mining miss and false alarm rates of cohort word KV for various target keyword
lengths, a number of experiments to validate various design choices and assump-
tions are reported on. The effects of the various parameters of the cohort word
method are also discussed here.
The final sections of this chapter examine fused KV architectures involving
the cohort word method. Specifically, two key systems are examined: the fusion
of multiple cohort word verifiers and the fusion of a cohort word system with a
SBM-based verifier.
4.2 Foundational concepts
The key aspects of cohort verification are the incorporation of word-level linguistic
information and cohort-based scoring into the keyword verification process. These
concepts are discussed further in this section.
4.2.1 Cohort-based scoring
Section 3.3 presented a detailed analysis of two common classes of non-keyword
models: all-speech and target-word-excluding models. A third class is the co-
hort non-keyword model. Such models are constructed using prior information
concerning the cohorts of a target class.
In keyword verification, a cohort non-keyword model is built using subevents
that are deemed confusable with the subevents of the target word. This style
of non-keyword model has been proposed previously in literature, such as in the
works of Sukkar and Lee [32], and Xin and Wang [37].
Within the confusability circle framework, cohort methods can be seen as
attempting to directly model the Confusable Acoustic Region, and as such should
provide better discrimination than non-cohort methods when operating in this
region.
Unfortunately, the behaviour of such systems is unpredictable when operating
in the DAR region. This is because both the target keyword model and the non-
keyword model do not explicitly model this region. For keyword verification,
it is imperative that the non-keyword model generates higher probabilities than
the target word model in both the DAR and CAR regions. Since this cannot
be guaranteed for the DAR region, performance may be poor when verifying
occurrences from this region.
However, within the context of a keyword spotting system, the significance of
this issue is greatly reduced. This is because the majority of false alarms in the
output of a keyword spotter should fall within the CAR region, given that this
is the region that is considered confusable. The putative occurrence set will then
only contain a limited number of entries in the DAR. Any errors arising from
poor DAR region verification performance will thus be minimal.
4.2.2 The use of language information
A limitation of many previously proposed KV techniques is that non-keyword
models are constructed without giving any consideration to higher-level linguistic
information. Linguistic information has been demonstrated in speech processing
literature to be of considerable value. For example, language models are perva-
sively used in speech recognisers to constrain word sequences, and to improve
recognition performance through word n-gram statistics. Linguistic information
has also been used to improve the performance of speaker verification systems,
for example by Reynolds et al. [26].
The use of language information is particularly relevant for approaches that
use subevent models internally, such as the phone model approaches described by
Leung and Fung [20], and Sukkar and Lee [32]. These methods use a non-keyword
model likelihood based on the subevent sequence that maximises the likelihood
of an observation sequence.
When no auxiliary linguistic syntactic constraints are applied, the best scoring
subevent sequence may be a high scoring nonsensical sequence that does not
occur in the language. For example, when scoring a true occurrence of the word
DONOR = (/d/, /ow/, /n/, /er/), the best scoring phone sequence in the non-
keyword model may, for example, be G = (/p/, /ow/, /n/, /er/) — a sequence
that does not correspond to an English word. G may in this case actually score
higher than DONOR because of effects such as modeling inadequacies or channel
mismatch effects. If this happened, then this true occurrence may be falsely
rejected based on a comparison made with a nonsensical phone sequence.
The use of phone-level language model statistics may reduce this effect by
applying probabilistic constraints to the phone sequences that are considered
for scoring in the non-keyword model. However, this approach still has some
shortcomings. For example, if the trigram /p/ − /ow/ − /n/ occurs frequently
in English, then G may still score highly, resulting in an unrealistically high
non-keyword model score.
Word-level information is another source of language information that un-
fortunately to date has not been explored in KV literature. One application of
word-level information is in the derivation of a set of potentially confusable words
for a target word. Intuitively one would expect that words that sound similar to
a target word are likely candidates for confusion by a keyword spotter. As such,
there is the potential that false alarms in the output putative occurrence set are in
fact true occurrences of a similar sounding word. Therefore, if knowledge regard-
ing the target vocabulary could be used to derive a list of potentially confusable
words for a given target word, then this list could be used to aid discrimination
in keyword verification.
One major advantage of word-level information over phone-level information
is the coupling between target word and non-keyword model construction. In
word-level methods, the KV process is very tightly coupled to the target word.
For example, the confusion word set is specific for each target word. In con-
trast, phone-level information is more loosely coupled to the target word since
decisions regarding non-keyword model construction are made at the phone-level.
For example, when using the cohort phone approach proposed by Sukkar and Lee
[32], the choice of models included in the non-keyword model is derived from a
knowledge based mapping of cohort phones. Although the list of phones consid-
ered is constrained by the phone decomposition of the target word, the level of
dependence on the target word is still considerably lower than word-level based
methods.
4.3 Overview of the cohort word technique
The cohort word keyword verification method incorporates the distinct advan-
tages of word-level language information and cohort non-keyword models. Word-
level language information is used to obtain a set of potentially confusable words
for a given target word. This set of confusable words are combined in parallel to
create the cohort word non-keyword model. Cohort verification techniques are
then used to classify a putative occurrence as belonging to either the target class
or the non-keyword class.
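As a sketch, one plausible form of such a cohort-based decision is shown below. The max-cohort comparison and the threshold are illustrative assumptions; the thesis derives its confidence measure from a combination of target word and cohort word scores:

```python
def cohort_confidence(ll_target, cohort_lls):
    """Confidence of a putative occurrence: target word log-likelihood
    relative to the best-scoring cohort word (illustrative form only)."""
    return ll_target - max(cohort_lls)

def verify(ll_target, cohort_lls, threshold=0.0):
    """Accept the occurrence as the target word if it out-scores every
    word in its cohort set by at least the threshold."""
    return cohort_confidence(ll_target, cohort_lls) > threshold

# Verifying a putative occurrence of DONOR against cohorts {SONAR, LONER}:
print(verify(-90.0, [-120.0, -100.0]))  # → True (target beats both cohorts)
```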
It is anticipated that using this approach will improve the robustness of the
non-keyword model since it is more directly representative of potential false pu-
tative occurrences, and as such the CAR region. For example, when verifying the
word DONOR, comparisons will be made with true events that occur in the ap-
plication’s vocabulary, such as SONAR and LONER. As a result, non-keyword
model likelihoods will not be derived from nonsensical events, hence avoiding the
issue of unrealistically high non-keyword likelihoods.
In essence, the cohort word method can be described as a classifier that asks
the question Is this occurrence best modeled by the target word or one of the words
that are easily confusable with the target word? This is in fact a sensible question
to ask for keyword verification, since in practice this is exactly the question
that needs to be answered.
The behaviour of the cohort word method is anticipated to be as follows for
the given scenarios:
1. The putative occurrence is an instance of the target word. In this case, both
the target keyword model and the individual confusable word non-keyword
models will output high likelihoods. However, on average the likelihood
output by the target word model should be greater than the likelihood
output by any of the individual confusable word models.
2. The putative occurrence is an instance of a confusable word. Given that
a keyword spotter is trying to detect occurrences of a specific keyword, it
is not unreasonable to expect that a significant portion of putative occur-
rences will actually correspond to instances of a confusable word. In these
cases, a cohort word non-keyword model will provide robust discrimina-
tion since the confusable words are explicitly modeled by the non-keyword
model. The cohort word approach may also outperform cohort subevent methods,
since its non-keyword models cannot be based on nonsensical subevent sequences.
Unfortunately, as with cohort-subevent approaches, performance for putative
occurrences in the DAR of the confusability circle may be unpredictable. For
such putative occurrences, low likelihoods would be observed for both the target
word model and the non-keyword model. Ratio based confidence scores, such as
the log likelihood ratio, may not provide sufficient discrimination in such cases.
Instead, a confidence score that accounts for low absolute model likelihoods may
be more appropriate.
A number of issues need to be considered when formulating the cohort word
method. Of most importance are the choice of cohort word selection process
and the non-keyword modeling architecture. These are discussed in the following
sections.
4.4 Cohort word set construction
Obtaining the cohort word set of a target word is in itself a considerable problem.
In particular, for unconstrained vocabulary keyword verification, an automatic
procedure for deriving this set is required. One means of performing this auto-
matic selection is to use a word-confusability measure in conjunction with a large
word list that provides adequate coverage of the target application vocabulary.
Let D(w1, w2) be a distance that measures the confusability between two
words, w1 and w2. Additionally, let V be a list of words representing the target
application vocabulary. Then the cohort word set, R(w), of word w can be
expressed as:
R(w) = {v ∈ V | dmin ≤ D(w, v) ≤ dmax}    (4.1)
where dmin and dmax are thresholds used to limit the degree of confusability of
cohort words.
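The selection rule of equation 4.1 can be sketched in a few lines of Python. The distance function below is a deliberately trivial stand-in (word-length difference); in practice D would be the MED-based measure developed later in this section. All names and the toy vocabulary are illustrative.

```python
def cohort_word_set(target, vocabulary, distance, d_min, d_max):
    """Equation 4.1: all vocabulary words whose distance from the target
    lies within [d_min, d_max]."""
    return [v for v in vocabulary
            if v != target and d_min <= distance(target, v) <= d_max]

def toy_distance(w1, w2):
    # Deliberately trivial stand-in for D(w1, w2): word-length difference.
    return abs(len(w1) - len(w2))

vocab = ["STOCK", "SOCK", "STACK", "FLOCK", "ANTIDISESTABLISHMENT"]
print(cohort_word_set("STOCK", vocab, toy_distance, 0, 1))
# -> ['SOCK', 'STACK', 'FLOCK']
```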
4.4.1 The choice of dmin and dmax
Both dmin and dmax play important roles in controlling the performance of cohort
word keyword verification. Within the confusability circle, these can be seen as
controlling the extent of CAR modeling within the non-keyword model, as shown
in figure 4.1.
[Figure 4.1: Controlling the degree of CAR region modeling via dmin and dmax
tuning. The figure depicts the confusability circle for the target word STOCK:
cohort words such as STACK, SOCK, SPOCK and FLOCK lie between the dmin and
dmax radii, within the Confusable Acoustic Region between the Target Acoustic
Region and the Disparate Acoustic Region.]
Tuning of the dmax parameter changes the number of words included in the
cohort word set. Specifically, using a very large value is likely to improve perfor-
mance since a greater number of cohort words are included in the discrimination
process. However, a large dmax will also reduce execution speed because of the
increased complexity of the non-keyword model.
Careful attention must also be given to the choice of dmin. Using too small
a value will result in extremely confusable words being included in the cohort
word set. This may result in reduced false alarm rates since highly confusable
words are included in the non-keyword model. However, it is more likely that
this will introduce significant overlaps between the target word model and the
non-keyword model, resulting in an increase in miss rate.
4.4.2 Cohort word set downsampling
Further control of the cohort word set can be obtained by downsampling. This
results in less coverage of the CAR region, potentially leading to suboptimal
performance. However, for large vocabularies, this may be necessary to maintain
practical execution speed. In this work, random sampling is used to perform this
downsampling. That is, the reduced cohort word set is given by:
RJ(w) = {shuffle(R(w))i : i = 1, . . . , J}    (4.2)
where the shuffle function randomly shuffles a set.
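A minimal sketch of this downsampling step, using Python's standard random module as the shuffle source (the seed parameter is an addition here, for reproducibility of the sketch):

```python
import random

def downsample_cohorts(cohort_set, j, seed=None):
    """Equation 4.2: randomly shuffle the cohort set, then keep at most J words."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    shuffled = list(cohort_set)
    rng.shuffle(shuffled)
    return shuffled[:j]

print(downsample_cohorts(["SOCK", "STACK", "FLOCK", "SPOCK"], j=2, seed=0))
```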
4.4.3 Distance function
The choice of distance function will also play a significant role in controlling the
quality of the cohort word set. Distance can be calculated either acoustically or
linguistically.
An acoustic approach is more in line with the motivation of cohort word KV -
that is, to determine and model the acoustic region of confusability. The distance
measures used in data clustering algorithms, such as likelihood-based metrics and
the Kullback-Leibler divergence, are candidates for such a distance function.
Although both these acoustic approaches are suitable for determining cohort
words, they are computationally and data intensive. In particular, for large target
application vocabularies, determining the acoustic distance between the target
word and every word in the vocabulary list V would be almost intractable.
In contrast, a pronunciation-based linguistic distance function is more practical.
Such a measure would reflect the distance between two words with regard to
to their difference in pronunciation. Intuitively, one would expect words with
similar pronunciations to be confusable, and therefore likely candidates for the
CAR region. This is particularly true for phone-based recognition systems, where
confusable words would share a significant number of the same phone models.
A candidate for this pronunciation-based linguistic distance function is the
MED or Levenshtein distance (see Appendix A). The MED distance determines
the minimum cost of transforming a source sequence to a target sequence using
a combination of match, substitution, insertion and deletion operations, where
each operation has an associated cost.
Within the context of cohort word selection, the MED distance can be used to
measure the distance between the phonetic pronunciations of two words. Therefore,
given the mapping function Φ(w), which maps a word w to its phonetic
pronunciation sequence, the MED-based cohort word distance function
is given by:
DL(w1, w2) = MED(Φ(w1), Φ(w2))    (4.3)

where MED(A, B) is the MED distance between sequence A and sequence B.
A powerful feature of the Minimum Edit Distance is the support for vari-
able cost functions. This allows certain types of transformations to be favoured
over others. For example, a low deletion cost will result in a reduced cost of
transformation to shorter sequences.
In cohort word selection, tuning of the MED costs can be used to favour certain
types of cohort words over others. For example, using a high insertion penalty
will result in more cohort words with fewer phones in their pronunciation than
the target word. It is not possible to intuitively predict which combination of
costs would be best suited to cohort word selection. However, the ability to tune MED
cost parameters provides an additional avenue for performance optimisation.
Thus, a more generalised MED-based cohort word distance function is
given by:
DM(w1, w2) = MED(Φ(w1), Φ(w2), ψd, ψi, ψs, ψm)    (4.4)

where MED(A, B, . . .) = MED distance to transform A to B
      ψd = MED deletion cost function
      ψi = MED insertion cost function
      ψs = MED substitution cost function
      ψm = MED match cost function
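The weighted MED of equation 4.4 can be sketched with the standard Levenshtein dynamic-programming recurrence. Constant costs stand in here for the cost functions ψd, ψi, ψs, ψm; the function and parameter names are illustrative.

```python
def med(a, b, c_del=1, c_ins=1, c_sub=1, c_match=0):
    """Minimum edit distance from sequence a to sequence b with configurable
    deletion, insertion, substitution and match costs (constant costs stand
    in for the cost functions of equation 4.4)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + c_del          # delete everything in a
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + c_ins          # insert everything in b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = c_match if a[i - 1] == b[j - 1] else c_sub
            d[i][j] = min(d[i - 1][j] + c_del,      # deletion
                          d[i][j - 1] + c_ins,      # insertion
                          d[i - 1][j - 1] + diag)   # match / substitution
    return d[m][n]

# Illustrative phone sequences for STOCK and SOCK:
print(med(["s", "t", "aa", "k"], ["s", "aa", "k"]))  # -> 1 (one deletion)
```

Raising c_ins, as discussed above, penalises transformations that must insert phones, and so favours cohort candidates with fewer phones than the target.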
4.5 Classification approach
The cohort non-keyword model is based on the knowledge of the cohort words of
a given target word. How this knowledge is incorporated into the construction
process of the non-keyword model, and more importantly, how individual putative
occurrences are scored is an important consideration.
Two approaches to this issue are discussed here: a 2-class approach and a
hybrid N-class approach.
4.5.1 2-class classification approach
From a functional view, cohort word keyword verification is a 2-class classification
task, attempting to discriminate between putative occurrences corresponding to
a target word and putative occurrences representing false alarms. A benefit of
using a 2-class approach is that classifiers used in other verification methods,
such as LLR discrimination, can then be easily incorporated. However, a distinct
disadvantage is that a means of fusing the individual cohort word scores must be
determined in order to generate a single non-keyword model likelihood.
A possible formulation of a cohort word LLR is as follows. Given a target
keyword w, let R(w) = {r1, r2, . . .} be defined as the cohort word set of w. Note
that R(w) does not include the target word w. The corresponding models of the
cohort word set can then be used as the non-keyword model term in equation
2.17. The generalised cohort word confidence score for an observation sequence,
X, is hence given by:
C(X, w, R(w)) = log p(X|λw) − log F(p(X|λr1), p(X|λr2), . . .)    (4.5)

The fusion function F(. . .) is used here to denote the fusion of the individual
cohort word model scores.
The choice of fusion function will have an effect on overall performance, and
therefore must be carefully considered. One candidate for this fusion function is
to use a probabilistic ’OR’. This is an intuitive approach since the non-keyword
model likelihood will then be based on the likelihood that the putative occurrence
represents any one of the cohort words. The confidence score formulation using
this technique is then:
Ce(X, w, R(w)) = log p(X|λw) − log ( Σi p(X|λri) / |R(w)| )    (4.6)
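The score of equation 4.6 can be sketched as follows, assuming the model log-likelihoods are already available; the function and variable names are illustrative, and the toy scores are not from any real system.

```python
import math

def cohort_llr(logp_target, logp_cohorts):
    """Equal-weighted 'probabilistic OR' score of equation 4.6: an LLR of
    the target model against the mean cohort word likelihood. Inputs are
    log-likelihoods; the mean is taken in the linear domain. (Real acoustic
    scores would need a log-sum-exp formulation to avoid underflow.)"""
    mean_lik = sum(math.exp(lp) for lp in logp_cohorts) / len(logp_cohorts)
    return logp_target - math.log(mean_lik)

# Hypothetical log-likelihoods for one putative occurrence:
print(cohort_llr(-10.0, [-11.0, -14.0, -15.0]))
```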
However, this approach requires likelihood calculations for all words in the
cohort word set. This can be computationally very expensive, particularly if a
large cohort word set is used to maximise coverage of the CAR region.
To reduce computational requirements, the following approximation may be
used. Let S(w, K) = {s1, s2, . . . , sK} be the subset of the K top scoring words
from the cohort word set R(w). Then the simplified cohort word confidence score
can be defined as:
Ce′(X, w, R(w), K) = log p(X|λw) − log F(p(X|λs1), . . . , p(X|λsK))    (4.7)
This simplified confidence score only requires the likelihoods of the K best
scoring cohort words. Computational requirements can be significantly reduced
for many scoring algorithms if only the top K scoring models are required. For
example, if HMM cohort word models are being used, Viterbi N-Best recognition
can be used to determine the K best scoring models - a significantly cheaper
operation than obtaining the individual scores of every cohort word.
However a trade-off is that the overall non-keyword model likelihood is now
estimated from a much smaller set of cohort words. Estimation theory states
that the error in estimation tends to increase as the number of samples decreases.
As a result, reduced KV performance may be observed when using the
Ce′(X, w, R(w), K) approximation.
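A sketch of the approximated score of equation 4.7, under the same assumptions as before; sorting the scores stands in for the Viterbi N-best pass that would supply the K best likelihoods in a real HMM system.

```python
import math

def cohort_llr_topk(logp_target, logp_cohorts, k):
    """Approximated score of equation 4.7: fuse only the K best-scoring
    cohort word likelihoods (equal-weighted mean in the linear domain)."""
    top = sorted(logp_cohorts, reverse=True)[:k]
    mean_lik = sum(math.exp(lp) for lp in top) / len(top)
    return logp_target - math.log(mean_lik)

# With K = 1 this collapses to an LLR against the single best cohort.
print(cohort_llr_topk(-10.0, [-11.0, -14.0, -15.0], k=1))  # -> 1.0
```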
4.5.2 Hybrid N-class approach
Cohort word KV may also be implemented as an N-class classification task. Here,
the algorithm attempts to classify a putative occurrence as belonging to one of
a set of word classes, {w} ∪ R(w), where w is the target word and R(w) is the
cohort word set of w. This is a direct application of the cohort word motivation:
attempting to ask the question Is this occurrence best modeled by the target word
or one of the words that are easily confusable with the target word?
However, a distinct consequence of using an N-class approach is that there is
no special consideration given to the target word class. Standard N-class clas-
sifier training algorithms do not typically provide the facility to directly train
the classifier to favour optimal decision making for a single class (i.e. the target
word class for KV). In fact, optimising a classifier to favour a specific class is very
much against the fundamental concepts of many N-class classifier training meth-
ods. As such, any classifier training for N-class cohort word keyword verification
would have to be indirectly optimised to favour correct target word classification
decisions.
An alternative is to only consider the cohort word models in the classification
task - that is, to exclude the target word from the N-class classification. Using this
approach, a putative occurrence can first be classified in terms of the best scoring
cohort word. A subsequent 2-class classification may then be used to discriminate
between the target keyword model and the best-scoring cohort word. Figure 4.2
demonstrates this technique.
[Figure 4.2: An N-class classifier approach to cohort word verification for the
keyword w and cohort word set R(w). A putative occurrence is passed to an
N-class classifier over the cohort models λr1, . . . , λrN; the best class λrk is
then passed, together with λw, to a 2-class classifier that produces the result.]
This hybrid approach combines the benefits of N-class classification and 2-class
classification. N-class classification is used to select the most appropriate cohort
word model from which to estimate the non-keyword model likelihood. This
circumvents the need for a fusion function as required for the 2-class classifier
approach. In addition, the subsequent 2-class classification stage allows the final
decision to be directly optimised for target word class
discrimination.
In this work, the maximum likelihood N-class classifier and the LLR 2-class
classifier are used to implement the hybrid approach. Using these classifiers, the
hybrid N-class cohort word confidence score can be expressed as:
Ch(X, w, R(w)) = log p(X|λw) − log maxk p(X|λrk)    (4.8)
It can be seen that this formulation is in fact a special case of the 2-class
classification Ce′(X,w,R(w), K) score, with K = 1.
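Given the cohort log-likelihoods, equation 4.8 reduces to a one-liner; the names are illustrative, and an HMM decoder would supply these scores in practice.

```python
def hybrid_score(logp_target, logp_cohorts):
    """Hybrid N-class confidence score of equation 4.8: an LLR of the target
    model against the single best-scoring cohort word model (a maximum
    likelihood N-class decision followed by a 2-class LLR)."""
    return logp_target - max(logp_cohorts)

# Hypothetical scores: the target just outscores its best cohort.
print(hybrid_score(-10.0, [-11.0, -14.0, -15.0]))  # -> 1.0
```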
4.6 Summary of the cohort word algorithm
A summary of the cohort word keyword verification algorithm for verifying pu-
tative occurrences of the target word w is given below:
1. Define algorithm parameters
(a) Let V be the list of words in the target application vocabulary and
Φ(w) be a function that maps the word w to its phonetic pronunciation
sequence.
(b) Let J be defined as the maximum cohort word set size as required for
cohort word set downsampling (see equation 4.2).
(c) Let [dmin, dmax] be defined as the cohort word selection range as re-
quired for cohort word selection in equation 4.1.
(d) Let DM(x, y) be the MED distance function used for cohort word
distance calculations.

(e) Let ψd, ψi, ψs, ψm be defined as the MED deletion, insertion, substitution
and match cost functions respectively.
(f) Let Ca(X,w,R(w)) be defined as the cohort word confidence score,
where this is one of the previously discussed confidence score formula-
tions:
i. 2-class classification score: Ce(X,w,R(w))
ii. Approximated 2-class classification score: Ce′(X,w,R(w), K)
iii. Hybrid N-classification score: Ch(X,w,R(w)).
2. Determine the cohort word set, RJ(w)
(a) Obtain the cohort word set:
R(w) = {v ∈ V | dmin ≤ DM(w, v) ≤ dmax}
(b) Generate the downsampled cohort word set:
RJ(w) = {shuffle(R(w))i : i = 1, . . . , J}
3. For each putative occurrence, X, perform cohort word verification
(a) Calculate the cohort word confidence score Ca(X,w,R(w))
(b) Perform thresholding to accept or reject the putative occurrence
For convenience, the parameter set Θ = {ψd, ψi, ψs, ψm, dmin, dmax, J} is
collectively referred to as the cohort word selection parameters within the rest of
this work.
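The steps above can be tied together in a small end-to-end sketch, under simplifying assumptions: a toy vocabulary and pronunciation map, unit MED costs, the hybrid N-class score, and precomputed log-likelihoods standing in for per-model HMM scoring. All names and values are illustrative.

```python
import random

def med(a, b):
    """Unit-cost minimum edit distance between two phone sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[m][n]

def verify(target, vocab, pron, loglik, d_min, d_max, j, threshold, seed=0):
    """Steps 1-3 of the summary: build the cohort set, downsample it, score
    the putative occurrence with the hybrid N-class LLR, then threshold."""
    cohorts = [v for v in vocab
               if v != target and d_min <= med(pron[target], pron[v]) <= d_max]
    rng = random.Random(seed)
    rng.shuffle(cohorts)
    cohorts = cohorts[:j]
    score = loglik[target] - max(loglik[v] for v in cohorts)
    return score >= threshold  # accept or reject the putative occurrence

# Toy pronunciations; `loglik` stands in for per-model Viterbi log-likelihoods.
pron = {"STOCK": ["s", "t", "aa", "k"], "SOCK": ["s", "aa", "k"],
        "STACK": ["s", "t", "ae", "k"], "FLOCK": ["f", "l", "aa", "k"]}
loglik = {"STOCK": -10.0, "SOCK": -12.5, "STACK": -11.5, "FLOCK": -13.0}
print(verify("STOCK", list(pron), pron, loglik, 1, 2, 200, threshold=1.0))
```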
4.7 Comparison of classifier approaches
An important consideration for any keyword verification algorithm is the struc-
ture of the confidence score metric. The confidence score metric is the crux of
these algorithms and therefore it is paramount that significant care and attention
is given to it’s formulation.
In section 4.5, two cohort word confidence metrics were presented: the 2-class
classification approach and the hybrid N-class approach. The benefits and flaws
of each method were discussed; however, no quantitative performance figures were
provided to validate these assertions. As such, this section reports on experiments
that were performed to compare these two methods.
4.7.1 Evaluation set
The evaluation set was generated from the putative occurrence set of an SBM-based
keyword spotter using the following process:
1. 100 6-phone length keywords were randomly selected from the TIMIT test
set.
2. Keyword spotting was performed on the TIMIT test set for each of the 100
words to generate a set of putative occurrences.
3. Each putative occurrence was classified as a true or false occurrence using
the reference transcriptions of the TIMIT test set.
4. 111 true occurrences were randomly selected from the set of putative occurrences
and included in the evaluation set as true occurrences.

5. 555 false alarm occurrences were randomly selected from the set of putative
occurrences and included in the evaluation set as false alarms.
Restrictions were applied to the number of true and false occurrences in the
evaluation set to allow practical computational time. This was particularly rele-
vant for the 2-class classification approach.
4.7.2 Recogniser parameters
Perceptual Linear Prediction (12 statics and C0 + deltas + accelerations) feature
extraction was used to parameterise each utterance. In addition, Cepstral Mean
Subtraction was applied to reduce the effects of channel/speaker mismatch.
Cross-word triphone HMM models with 16-mixture Gaussian state distribu-
tions were used to model keywords and cohort words. As there was insufficient
data in the TIMIT training set to train robust triphone models, the HMMs were
trained using the Long and Short Training subsets (140 hours) of the Wall Street
Journal 1 clean microphone speech database.
4.7.3 Cohort word selection
Table 4.1 shows the values of the cohort word parameters that were used. Every
combination of these parameters was evaluated for each cohort word method so
that experiments would not be penalised by a poor choice of selection parameters.
This resulted in a total of 40 evaluation systems.
Parameter        Range
V                The CALLHOME PronLex [19] word list, consisting of
                 approximately 90000 words
Φ(w)             Phonetic pronunciations taken from the PronLex lexicon
J                200
dmin             1-4
dmax             dmin-4
ψd, ψi, ψs, ψm   1-2, 1-2, 1, 0

Table 4.1: Evaluated cohort word selection parameters
The limits on the evaluated values of dmin, dmax and the MED cost parameters
were applied to restrict the scope of the experiments.
For the 2-class experiments, all cohort words were included in the summation
of cohort word likelihoods. That is, the confidence score formulation used was
Ce(X,w,R(w)).
4.7.4 Evaluation procedure
Experiments were performed using the SBM-KV, equal-weighted sum cohort word
and hybrid N-class cohort word methods. For each method, the following proce-
dure was used:
1. For each putative occurrence:
(a) The target word score was calculated using a target-word model con-
structed from a concatenation of triphone models
(b) The non-keyword model score was calculated using the appropriate
non-keyword model
(c) The confidence score was calculated
2. Thresholding on confidence score was applied to obtain false acceptance
performance at the 3% and 10% miss rate performance points.
4.7.5 Results
The results of the KV experiments are shown in table 4.2. Once again every
combination of selection parameters was evaluated resulting in 40 systems per
method. However, only the best performing system at each of the two miss rates
is reported.
Clearly the CW-NClass architecture outperforms the CW-2Class system at
both miss rate operating points. The optimal cohort word selection parameters
Method       FA @ 3% miss rate    FA @ 10% miss rate
CW-2Class    67.8% (1,4,1,2)      44.1% (1,4,1,2)
CW-NClass    31.5% (2,4,1,2)      17.8% (3,3,1,2)

Table 4.2: Performance of selected cohort word KV systems on the TIMIT
evaluation set. Cohort word systems are qualified with the appropriate cohort word
selection parameters using a tag in the format dmin, dmax, ψd, ψi.
varied at each operating point, but in nearly all of the 40 evaluated systems, the
CW-NClass method significantly outperformed the CW-2Class method.
Additionally, the CW-NClass systems were in the order of 30 times faster
than the CW-2Class systems. This was because the CW-2Class systems required
calculation of likelihoods for every cohort word model (200 cohort words in this
case). In contrast, the CW-NClass systems only required finding the best scoring
cohort word using a single Viterbi recognition pass - a considerably faster task.
The execution speed of the CW-2Class approach could be improved by reducing
the number of cohort word models considered in the fusion function. That is, a
smaller value of K could be used. However, this would result
in less coverage of the CAR region within the non-keyword model, and therefore
poorer performance.
Further analysis of the CW-2Class method revealed that there was a
large variance in the distribution of cohort word likelihoods for a given putative
occurrence. Additionally, the mean of these likelihoods was often very much lower
than the best cohort word likelihood. Since the fusion function used essentially
calculated the mean cohort word likelihood, this meant that the resulting non-
keyword model likelihood was often quite low. This explains the high false alarm
rates of the CW-2Class approach.
In summary, the reported experiments demonstrate the advantages of the N-
class hybrid cohort word approach over the 2-class cohort word method. The
N-class approach outperformed the 2-class approach in terms of detection perfor-
mance and was also approximately 30 times faster. It can be concluded then that
the N-class hybrid confidence score formulation is more appropriate for cohort
word keyword verification.
4.8 Performance across target keyword length
In a similar fashion to keyword spotting, KV performance is very much affected
by the length of the target keyword. Specifically, the longer the average duration
of a target word, the easier it is to robustly verify its putative occurrences.
This section examines the effects of target keyword length on cohort word
verification performance. In particular, three target word lengths are studied:
4-phone, 6-phone and 8-phone, corresponding to short, medium and long words
respectively.
Experiments were performed using the specific classes of target word length.
Only the hybrid N-class approach was evaluated since experiments in section 4.7
demonstrated it’s superiority over the 2-class formulation particularly in terms of
speed. Additionally, a more difficult evaluation set taken from the Switchboard-1
conversational telephone speech corpus was used. This was done to provide in-
sight into the performance of the cohort word method under less ideal conditions.
The cohort word selection parameters and evaluation procedure used were the
same as those described in sections 4.7.3 and 4.7.4.
4.8.1 Evaluation set
Evaluation speech was taken from a subset of the SWB1 database. A list of
words for each phone length class was obtained using phone pronunciations
from the PronLex lexicon. Evaluation sets for each phone length class were then
restricted to containing only putative occurrences for words of the same phone
length.
The putative occurrence sets for 4-phone and 6-phone words consisted of 375
true putative occurrences and 1875 false putative occurrences. Only 175 true
and 875 false putative occurrences were used for the 8-phone evaluation set as
8-phone words were less frequent in the test set.
True occurrences were obtained by randomly selecting words of the required
phone length from high quality forced-aligned transcriptions of the SWB1 test
data. False putative occurrences were obtained by first performing keyword spot-
ting using a SBM-based keyword spotting system for each of the words in the
true putative occurrence set, and then randomly selecting the necessary number
of false occurrences from the false alarm outputs. This restricted the false oc-
currences to being acoustically similar occurrences of words that were confusable
with the target word.
4.8.2 Recogniser parameters
Speech was parameterised using Perceptual Linear Prediction (12 statics and C0
+ deltas + accelerations) feature extraction. Cepstral Mean Subtraction was
applied to provide speaker and channel compensation.
Cross-word triphone HMM models with 16-mixture Gaussian state distribu-
tions were used to model target words and cohort words. These HMMs were
trained using a 165 hour training subset of the SWB1 data that was independent
of the evaluation data set.
Additionally, a 256-mixture GMM was trained on the same SWB1 training
dataset for use with SBM-KV baseline experiments.
4.8.3 Results
Equal error results for the SBM-KV baseline, the 3 best cohort word methods,
the median cohort word method and the worst cohort word method are shown
below.
4 phone-len             6 phone-len             8 phone-len
Method          EER     Method          EER     Method          EER
SBM-KV          19.7    SBM-KV          11.7    SBM-KV          7.5
CW 3, 3, 2, 1   14.1    CW 1, 4, 2, 1   9.6     CW 1, 4, 1, 1   9.7
CW 3, 3, 2, 2   14.1    CW 3, 4, 2, 1   9.8     CW 3, 4, 2, 1   10.3
CW 2, 3, 1, 2   14.2    CW 4, 4, 1, 1   10.1    CW 2, 4, 1, 1   10.3
CW 1, 4, 2, 1   16.3    CW 1, 3, 1, 1   12.3    CW 3, 3, 2, 1   19.3
CW 1, 1, 2, 2   25.9    CW 1, 1, 2, 2   55.7    CW 1, 1, 2, 2   66.3

Table 4.3: Performance of SBM-KV and selected cohort word systems on the
SWB1 evaluation sets. Cohort word selection parameters are specified with each
system in the format dmin, dmax, ψd, ψi.
In terms of EER, the cohort word methods clearly outperformed SBM-KV for
the 4-phone and 6-phone keyword lengths. However, the contrary was observed
for 8-phone words. EER gains were particularly pleasing for the 4-phone set,
where the 3 best cohort word methods yielded an approximate 30% relative gain
over SBM keyword verification.
Cohort word performance gains are consistent across operating points for the
4-phone and 6-phone tests. Figures 4.3 and 4.4 show the DET plots for the best
cohort word system and the SBM-KV system for 4-phone and 6-phone words
respectively. For short length 4-phone KV, the cohort word method maintained
a considerable margin of gain over SBM-KV at all operating points. Cohort
word KV also outperformed SBM-KV for medium length 6-phone words at false
acceptance rates above 6%.
[Figure 4.3: DET plot (miss probability versus false alarm probability, in %) for
the best cohort word system (3,3,2,1) and the SBM-KV system on the SWB1
4-phone length evaluation set]
[Figure 4.4: DET plot (miss probability versus false alarm probability, in %) for
the best cohort word system (1,4,2,1) and the SBM-KV system on the SWB1
6-phone length evaluation set]
4.8.4 Analysis of poor 8-phone performance
Trends in performance figures across keyword length indicate that the benefits
of cohort word KV over SBM-KV decreased with increased keyword length. A
likely explanation is a decrease in the quality of the cohort word non-keyword
model.
The cohort word method relies on a non-keyword model that adequately mod-
els the CAR region. In order to maintain practical execution speeds, the exper-
iments used a random sampling of at most 200 cohort words to model the CAR
region.
However, examination of the average number of cohort words used for each
phone length demonstrated that in fact 200 cohort words were not used in every
case. Table 4.4 shows the mean and standard deviation of the number of cohort
words used for the 3 best performing systems for each phone length. For the
majority of the 4-phone and 6-phone experiments, the number of cohort words
was 200 as expected. In contrast, none of the 3 best performing 8-phone systems
had a mean cohort word set size of 200. This implies that for the 8-phone length
experiments, the coverage of the CAR region for non-keyword modeling was less
than expected, and as such, the quality of the cohort word confidence score was
compromised.
A reduced number of cohort words was obtained for the 8-phone length exper-
iments because there were simply insufficient words of the required length that
were potential cohort candidates. In fact, it must be noted that the 3 best per-
forming 8-phone length cohort word systems all had high dmax values allowing a
greater number of words to be considered as cohort words. This suggests that
higher values of dmax than those evaluated may result in better 8-phone
performance.
Phone Length    Method          Mean # cohort words    Std # cohort words
4               CW 3, 3, 2, 1   200.0                  0.0
4               CW 3, 3, 2, 2   199.9                  0.2
4               CW 2, 3, 1, 2   200.0                  0.0
6               CW 1, 4, 2, 1   200.0                  0.0
6               CW 3, 4, 2, 1   200.0                  0.0
6               CW 4, 4, 1, 1   199.8                  2.6
8               CW 1, 4, 1, 1   180.9                  39.9
8               CW 3, 4, 2, 1   107.8                  53.4
8               CW 2, 4, 1, 1   158.2                  51.1

Table 4.4: Mean and standard deviation of the number of cohort words used in
the 3 best performing cohort word KV methods for the SWB1 evaluation set

Further analysis of experimental results reveals a high degree of correlation
between equal error rate and the mean number of cohort words. Figure 4.5 shows
a plot of equal error rate versus the mean number of cohort words for all the
cohort word systems that were evaluated. The trends show that increased equal
error rates were observed for systems with a lower mean number of cohort words.
Additionally, the majority of 4-phone and 6-phone systems had mean cohort word
set sizes of 200, while the 8-phone systems had a significant number of smaller
cohort word set sizes.
4.8.5 Conclusions
In conclusion, the experiments demonstrate that the cohort word method is ad-
vantageous for short to medium length KV. This is a pleasing result since short
word KV is a particularly difficult problem prone to high miss rates and false
alarm rates.
However, there are issues that need to be addressed for long word KV. In
[Figure 4.5: Equal error rate versus mean number of cohort words, for all
evaluated 4-phone, 6-phone and 8-phone cohort word systems]
particular, the experiments suggest that for long words there were insufficient
words in the dictionary to construct a cohort word non-keyword model that
provided sufficient coverage of the CAR region.
A simple solution to this may be to increase the number of long words in
the dictionary. However, considering that a very large 90000 word dictionary was
used in these experiments, it is likely that there are simply not enough linguistically
similar words in the English language that are suitable for use in 8-phone cohort
word keyword verification.
An alternative solution may be to increase dmax. This would increase the
number of candidate cohort words, thus resulting in larger cohort word set sizes
and more robust confidence scores.
4.9 Effects of selection parameters
The experiments in previous sections demonstrated that cohort word verification
performance is very sensitive to the choice of cohort word selection parameters.
In particular, for 6-phone and 8-phone target words, a poor choice of cohort
word selection parameters can result in extraordinarily poor performance
compared to the best achievable rates. For example in section 4.8 the best 6-
phone cohort word system had an EER of 9.6% while the worst had an EER of
55.7%.
This section examines the contributions of various selection parameters to
overall cohort word performance. In particular, attention is given to the effects
of cohort word set downsampling, selection range tuning and MED cost function
tuning.
Analysis is performed on the experimental outputs of the evaluations reported
in section 4.8. Any additional experiments conducted used the same evaluation
sets, recogniser parameters and experimental procedures documented in sections
4.8.1, 4.8.2 and 4.7.4 respectively.
4.9.1 Cohort word set downsampling
Cohort word set downsampling is used to reduce the size of a cohort word set
to maintain fast execution speeds. Without downsampling, the cohort word
set can become extremely large, resulting in increased non-keyword model
complexity and thus slower processing times.
Downsampling is particularly important when a large cohort word selection
range is used, since this dramatically increases the number of candidate cohort
words. For example, the word GALLERY has a cohort word set size of 3044
when using the selection parameters dmin = 1, dmax = 4, φd = 2 and φi = 1
in conjunction with the 90000 word PronLex dictionary. In contrast, using a
reduced selection range of dmin = 1 and dmax = 2 results in a set size of only 49
words.
The method of downsampling will also have some effect on performance. In
this work, random sampling is used to perform downsampling. There is likely to
be a significant amount of redundancy in terms of CAR region coverage within
the cohort word set. As such, in addition to reducing the size of the cohort word
set, random sampling will hopefully also remove some of this redundancy without
significantly compromising the quality of the non-keyword model.
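As a concrete illustration, random downsampling can be sketched in a few lines of Python. The function name, seeding behaviour and word list are illustrative only, not taken from the thesis implementation:

```python
import random

def downsample_cohort_set(cohort_words, max_size, seed=None):
    """Randomly downsample a cohort word set to at most max_size words.

    Random sampling caps non-keyword model complexity and, ideally, also
    discards some of the redundant coverage of the CAR region.
    """
    if len(cohort_words) <= max_size:
        return list(cohort_words)
    return random.Random(seed).sample(list(cohort_words), max_size)

# e.g. cap a 3044-word cohort set (as for GALLERY) at 200 words
cohorts = [f"word{i}" for i in range(3044)]
print(len(downsample_cohort_set(cohorts, 200)))  # 200
```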
In order to gain a better understanding of the effects of downsampling, the
experiments from section 4.8 were repeated using three alternate cohort word
set downsampling sizes: 50, 100 and 300. Figure 4.6 shows the distribution of
EER across these various cohort word set downsampling sizes. The effect of
downsampling clearly decreases with increased keyword length. For 6-phone KV,
there appears to be only a small effect on both the mean EER and the best
obtained EER. There are no noticeable effects on 8-phone KV, though this is to
Figure 4.6: Trends in equal error rate with changes in cohort word set downsampling size, with panels for phone lengths 4, 6 and 8 and downsample set sizes 50, 100, 200 and 300.
be expected since, as discussed previously, many of the configurations had very
small cohort word sets anyway. This is a positive result indicating that smaller
cohort word set downsampling sizes can be used without significantly impacting
KV performance. This is particularly beneficial for the overall execution speed
of cohort word KV.
In contrast, 4-phone cohort word KV appears to be very sensitive to the size
of the cohort word set. Specifically, an absolute gain of 0.8% can be obtained by
increasing the set size to 300, while losses of 0.8% and 2.1% result from reducing
the downsampled set size to 100 and 50 respectively. This indicates that short-
word KV requires considerably more information within the non-keyword model
to perform robust verification. This additional information may be required to
compensate for the reduced number of observations available for scoring.
4.9.2 Cohort word selection range
Tuning of the cohort word selection range, [dmin, dmax], affects the degree of lin-
guistic similarity between the cohort word set and the target word. This in turn
directly affects the portion of the CAR modeled in the cohort word non-keyword
model.
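The selection process can be sketched as follows, assuming phone-sequence pronunciations and a unit substitution cost. The function names and toy pronunciations are illustrative only; the thesis defines its exact MED formulation elsewhere:

```python
def med(target, candidate, del_cost=1, ins_cost=1, sub_cost=1):
    """Minimum edit distance between two phone sequences via dynamic programming.

    del_cost penalises dropping a phone of the target; ins_cost penalises a
    phone in the candidate with no counterpart in the target.
    """
    n, m = len(target), len(candidate)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = 0 if target[i - 1] == candidate[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,   # deletion
                          d[i][j - 1] + ins_cost,   # insertion
                          d[i - 1][j - 1] + same)   # match / substitution
    return d[n][m]

def select_cohorts(target, dictionary, dmin, dmax, del_cost, ins_cost):
    """Keep dictionary words whose distance to the target lies in [dmin, dmax]."""
    return [word for word, phones in dictionary.items()
            if dmin <= med(target, phones, del_cost, ins_cost) <= dmax]

# Toy example with made-up pronunciations (not the PronLex entries).
target = ["g", "ae", "l", "er", "iy"]  # GALLERY
dictionary = {
    "galley": ["g", "ae", "l", "iy"],
    "valley": ["v", "ae", "l", "iy"],
    "cat": ["k", "ae", "t"],
}
print(select_cohorts(target, dictionary, dmin=1, dmax=2, del_cost=1, ins_cost=1))
# -> ['galley', 'valley']
```

Widening [dmin, dmax] admits less similar words, which is exactly why larger selection ranges inflate the candidate set and make downsampling necessary.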
A variety of selection ranges was evaluated in the experiments reported in
section 4.8. Figure 4.7 shows how equal error rate varied with changes in selection
range for the 4-phone length experiments.
A key observation is that the choice of dmax had a significant impact on EER
performance. In all systems, the change in EER when tuning dmax was more
dramatic than changes observed when tuning dmin. Figure 4.7 shows that for
4-phone KV, a value of dmax = 3 yielded close to optimum performance in most
cases. Further tuning of dmin could be used to obtain the absolute local minimum.
Figure 4.7: Trends in equal error rate with changes in cohort word selection range for 4-phone length cohort word KV, with EER surfaces over dmin and dmax for each combination of deletion and insertion penalties of 1 and 2.
Figure 4.8: Trends in equal error rate with changes in cohort word selection range for 6-phone length cohort word KV (deletion penalty = 2, insertion penalty = 1).
Figure 4.9: Trends in equal error rate with changes in cohort word selection range for 8-phone length cohort word KV (deletion penalty = 2, insertion penalty = 1).
Figures 4.8 and 4.9 show EER trends for the 6-phone and 8-phone experiments.
As with the 4-phone systems, similar trends in EER were observed for all
combinations of φi and φd, and as such only the plots for φd = 2, φi = 1 are
given here.
Once again, the results demonstrate that system performance is more sensitive
to the choice of dmax, while dmin only provides some fine tuning capability. Notably,
a value of dmax = 4 yielded the best EER performance. Since the
maximum value of dmax evaluated was 4, this suggests that further improvements
in performance may be obtained by using even higher values of dmax. This seems
more promising for the 8-phone experiments since the trend curve does not seem
to have flattened at dmax = 4. In contrast, it is unlikely significant improvements
in EER will be observed for 6-phone KV since the trend curve is fairly flat at
dmax = 4.
4.9.3 MED cost parameters
Adjusting the MED insertion and deletion cost parameters affects the type of
words included in the cohort word set. For example, using a higher insertion cost
than deletion cost will favour the inclusion of words shorter than the target word,
while penalising words longer than the target word.
A number of cohort word configurations were evaluated in section 4.8. Figure
4.10 shows box and whisker plots of the equal error rates grouped by MED cost
parameters for the 4-phone, 6-phone and 8-phone experiments.
For the 4-phone and 6-phone data, there does not appear to be a significant
difference in terms of mean equal error between the four MED cost parameter
sets. There are however some differences in mean EER for the 8-phone data.
These differences though must be considered in light of the fact that the 8-phone
Figure 4.10: Trends in equal error rate with changes in MED cost parameters, with box and whisker plots of EER grouped by (DelCost, InsCost) pairs 1,1; 1,2; 2,1 and 2,2 for 4-phone, 6-phone and 8-phone keywords.
experiments suffered from poorer performance in many cases due to reduced co-
hort word set size, as demonstrated in section 4.8.
The range of EERs observed for a fixed MED cost parameter set indicates the
potential robustness of a cohort word system to a careless choice of the other cohort
word selection parameters. In this dataset, it is clear that certain choices of MED
cost parameters result in less sensitivity towards other selection parameters. For
example, for 4-phone KV, the choice of φd = 1, φi = 2 clearly gives a much smaller
range in observed EERs. In contrast, this same combination of MED parameters
leads to significantly greater sensitivity for 6-phone KV.
Overall, there are no strong conclusions that can be drawn regarding the choice
of MED cost parameters. At best, the results suggest that the φd = 2, φi = 1
combination is likely to yield better performance, since this configuration resulted
in the lowest mean EER as well as one of the lowest equal error rates for all phone
lengths. However, this configuration is particularly sensitive to other cohort word
selection parameters for 4-phone and 6-phone KV.
4.9.4 Conclusions
The reported experiments quantify the benefits of tuning the various cohort word
selection parameters. Specifically, cohort word KV appears most sensitive to the
choice of dmax. Additionally for short-word KV, system performance is highly
dependent on the choice of cohort word set downsampling size. In contrast, the
remaining cohort word selection parameters: dmin, φd and φi, only have a minor
effect on overall system performance.
As such, when using a cohort word system, tuning of dmax and cohort word
set downsampling size will provide close to optimal performance. This finding
significantly reduces the complexity of constructing and deploying a cohort word
keyword verification system.
4.10 Fused cohort word systems
Multiple system fusion has been demonstrated to yield significant gains in a va-
riety of applications, such as biometric authentication. The primary benefit of
fusion is the ability to use complementary and orthogonal information effectively
to combine the strengths of individual systems as well as to negate their weak-
nesses.
A notable finding of previous sections of this chapter was that the gains of
the cohort word method over SBM-based KV reduced with increases in keyword
length. Specifically, while cohort word KV outperformed SBM-based KV for
short and medium length keywords, performance was in fact poorer for long
length keywords.
Verifier fusion may provide a means of combining the benefits of the cohort
word method and SBM method. There are a number of potential benefits of
using such a fused system, including:
1. Improved performance for other keyword lengths using the fusion of in-
dependent keyword-length-optimised systems. For example, given a cohort
word system optimised for 4-phone words, and a SBM-KV system optimised
for 8-phone words, a fused verifier may be able to improve the verification
performance for other phone-length words, such as 5-phone words.
2. The potential to fuse mutual information (if any exists) from individual
verifiers to improve robustness. For example, if all individual systems give
low confidence scores, then it is significantly more likely that the putative
occurrence should be rejected.
In these experiments, only the late fusion of systems will be examined. Fu-
sion is performed by combining the output scores of individual systems using a
Multi-Layer Perceptron neural network. Middle-fusion techniques such as the
inclusion of a SBM within the cohort word recognition network may also yield
improvements, but are not explored here.
The cohort word selection parameters, evaluation set and recogniser parame-
ters used are the same as those described in section 4.8.
4.10.1 Training dataset
The use of an MLP network for fusion required an appropriate training dataset
to train the network. As such, network training datasets were constructed for
each evaluated keyword length. These training sets were constructed in the same
fashion as the evaluation sets and it was ensured that there were no overlaps
between the training and evaluation sets. The resulting training sets consisted
of 375/1875 true/false putative occurrences for the 4-phone and 6-phone training
sets, and 175/875 true/false putative occurrences for the 8-phone training set.
4.10.2 Neural network architecture
A three-layer MLP neural network architecture was used. The confidence scores
from the individual verifiers to be fused were used as the input values for the
neural network. The hidden layer contained 25 nodes, and the output layer
contained 2 nodes, one for true occurrences and one for false occurrences. The
network was then trained using 4-fold cross validation and squared error
gradient descent training.
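The fusion network can be sketched with NumPy. The 25-node hidden layer, 2-node output layer and squared-error gradient descent follow the description above; the tanh/sigmoid activations, learning rate, toy data and the way a single fused score is read off the two output nodes are illustrative assumptions not specified in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(n_inputs, n_hidden=25, n_outputs=2):
    """Three-layer MLP: verifier confidence scores in, true/false nodes out."""
    return {"W1": rng.normal(0, 0.5, (n_inputs, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(0, 0.5, (n_hidden, n_outputs)), "b2": np.zeros(n_outputs)}

def forward(p, X):
    H = np.tanh(X @ p["W1"] + p["b1"])                  # hidden layer (assumed tanh)
    Y = 1.0 / (1.0 + np.exp(-(H @ p["W2"] + p["b2"])))  # output layer (assumed sigmoid)
    return H, Y

def train_step(p, X, T, lr=0.5):
    """One batch step of squared-error gradient descent; returns the loss."""
    H, Y = forward(p, X)
    dY = (Y - T) * Y * (1 - Y)            # backprop through squared error + sigmoid
    dH = (dY @ p["W2"].T) * (1 - H ** 2)  # backprop through tanh
    n = len(X)
    p["W2"] -= lr * H.T @ dY / n
    p["b2"] -= lr * dY.mean(axis=0)
    p["W1"] -= lr * X.T @ dH / n
    p["b1"] -= lr * dH.mean(axis=0)
    return float(((Y - T) ** 2).mean())

# Toy data: two verifier scores per putative occurrence; genuine hits score higher.
X = np.vstack([rng.normal(0.8, 0.1, (50, 2)), rng.normal(0.2, 0.1, (50, 2))])
T = np.vstack([np.tile([1.0, 0.0], (50, 1)), np.tile([0.0, 1.0], (50, 1))])
p = init_mlp(n_inputs=2)
losses = [train_step(p, X, T) for _ in range(500)]
# One way to form a fused confidence score: true-node output minus false-node output.
fused = forward(p, X)[1][:, 0] - forward(p, X)[1][:, 1]
```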
4.10.3 Experimental procedure
Two fusion architectures were examined. The first was a fused cohort-SBM sys-
tem. Each of the 40 cohort word systems evaluated in section 4.8 was fused with
a SBM-KV system. It was anticipated that this approach would allow the
individual verifiers to compensate for each other's weaknesses at different keyword lengths.
The second approach was the fusion of two cohort word systems. The per-
formance of every combination of two cohort word systems was evaluated; that
is, 40 × 39 = 1560 fused systems. These experiments sought to use the mutual
information of individual cohort word verifiers to improve overall performance.
The following experimental procedure was then used for each KV experiment:
1. The confidence score for each individual KV system was calculated for every
putative occurrence in the training and evaluation sets.
2. An MLP was trained using the individual confidence scores of the training
data set as inputs and the reference classifications as outputs.
3. The trained neural network was used to calculate the fused confidence score
for each putative occurrence in the evaluation sets.
4. Thresholding was applied to the fused confidence scores to obtain EER
performance.
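Step 4 can be sketched as a simple threshold sweep. This is an illustrative implementation, not the thesis code; it approximates the EER at the threshold where the miss and false alarm rates are closest:

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold over the scores.

    scores: fused confidence scores for the putative occurrences.
    labels: True for genuine keyword occurrences, False for false alarms.
    A putative occurrence is accepted when its score is >= the threshold;
    the EER is taken where miss rate and false alarm rate are closest.
    """
    n_true = sum(labels)
    n_false = len(labels) - n_true
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        miss = sum(l and s < t for s, l in zip(scores, labels)) / n_true
        fa = sum(not l and s >= t for s, l in zip(scores, labels)) / n_false
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2
    return eer

# Perfectly separated scores give an EER of zero.
print(equal_error_rate([0.9, 0.8, 0.7, 0.3, 0.2, 0.1],
                       [True, True, True, False, False, False]))  # 0.0
```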
4.10.4 Baseline unfused results
Baseline SBM and cohort word KV performances from experiments in section 4.8
are reproduced in table 4.5. Only the best performing cohort word KV systems
are reported here.
4-phone                6-phone                8-phone
Method          EER    Method          EER    Method          EER
SBM-KV          19.7   SBM-KV          11.7   SBM-KV          7.5
CW 3, 3, 2, 1   14.1   CW 1, 4, 2, 1   9.6    CW 1, 4, 1, 1   9.7

Table 4.5: Performance of baseline SBM-KV and best cohort word systems on the SWB1 evaluation sets
4.10.5 Fused SBM-CW experiments
Table 4.6 shows the results of SBM-cohort fusion experiments. Only the top 3
fused systems are reported here.
Phone-length   Method                   EER
4              SBM-CW 1, 4, 2, 2        13.9
               SBM-CW 3, 3, 2, 1        13.9
               SBM-CW 1, 3, 1, 1        14.1
               Baseline SBM             19.7
               Baseline CW 3, 3, 2, 1   14.1
6              SBM-CW 2, 4, 2, 1        8.2
               SBM-CW 1, 4, 1, 2        8.3
               SBM-CW 1, 3, 2, 1        8.3
               Baseline SBM             11.7
               Baseline CW 3, 3, 2, 1   9.6
8              SBM-CW 2, 4, 1, 2        5.7
               SBM-CW 2, 4, 2, 1        5.7
               SBM-CW 3, 4, 2, 1        5.7
               Baseline SBM-KV          7.5
               Baseline CW 3, 3, 2, 1   9.7

Table 4.6: Performance of the best fused SBM-cohort systems on the SWB1 evaluation sets
The results demonstrate considerable gains over the baseline SBM system for
all configurations, particularly for the longer length keywords. Specifically, gains
in EER of 5.8%, 3.5% and 1.8% absolute were observed using the best fused
systems for the 4-phone, 6-phone, and 8 phone-length experiments respectively.
Gains were also observed over the individual unfused cohort word systems, though
these were not as dramatic except for the 8-phone experiments.
The mean EER gains across all cohort word selection parameter configurations
were considerable in all cases: 2.1%, 7.3%, and 15.2% for 4-phone, 6-phone, and
8 phone-length experiments respectively. However, these mean statistics were
largely dominated by the gains observed for the very poorly performing unfused
cohort word systems.
Gains were better for longer keyword lengths. This is an interesting result
since the unfused cohort word systems actually performed more poorly at longer
keyword lengths than the unfused SBM system. This result suggests that the
extra information provided by the cohort word system is able to significantly
improve performance for longer keywords. However, the results suggest that
little information is provided by the SBM system for short word KV.
The plots in figure 4.11 show a comparison of EER between the fused and
unfused systems, where unfused cohort word and fused SBM-cohort systems with
the same cohort word selection parameters are plotted at the same point on the
horizontal axis. Clearly for the majority of cohort word configurations, the fused
system outperformed the unfused system.
The plots also demonstrate that there was significant correlation between the
performance of unfused and fused cohort word systems that used the same cohort
word selection parameters. In fact, numerical analysis found that there was a cor-
relation coefficient of 0.84 between unfused and fused system EER performance.
This indicates that a well performing unfused system would be a good candidate
for a fused architecture.
Moreover, the trend lines in figure 4.11 suggest that the best performing unfused
system will give close to the best achievable performance of a fused setup. Thus,
when constructing a fused SBM-CW system, it is sufficient to simply select the
best performing isolated cohort word system as the candidate for fusion.
Figure 4.11: Correlation between unfused system performances and fused system performances, plotting log(EER) against cohort word configuration index for the CW, SBM-CW and SBM systems at phone lengths 4, 6 and 8.
4.10.6 Fused CW-CW experiments
Table 4.7 shows the results of the fusion of 2 cohort word systems. Only the top 3
fused systems are reported here. Cohort word selection parameters are reported
in the format dmin1, dmax1, φd1, φi1, dmin2, dmax2, φd2, φi2, where the two
groups of cohort word selection parameters correspond to the two individual
cohort word systems.
Phone-length   Method                         EER
4              CW-CW 1, 3, 2, 1, 3, 4, 2, 2   12.3
               CW-CW 3, 3, 2, 2, 2, 4, 2, 1   12.3
               CW-CW 4, 4, 2, 1, 3, 3, 2, 2   12.5
               Baseline SBM                   19.7
               Baseline CW 3, 3, 2, 1         14.1
6              CW-CW 1, 4, 2, 1, 1, 4, 1, 1   8.3
               CW-CW 4, 4, 1, 1, 1, 4, 2, 1   8.5
               CW-CW 1, 4, 1, 2, 4, 4, 1, 1   8.5
               Baseline SBM                   11.7
               Baseline CW 3, 3, 2, 1         9.6
8              CW-CW 1, 1, 1, 1, 1, 4, 1, 1   8.0
               CW-CW 1, 1, 1, 2, 1, 4, 1, 1   8.1
               CW-CW 3, 3, 2, 2, 1, 4, 1, 1   8.5
               Baseline SBM-KV                7.5
               Baseline CW 3, 3, 2, 1         9.7

Table 4.7: Performance of the best fused cohort-cohort systems on the SWB1 evaluation sets
The trends in gains across phone-length for the CW-CW systems versus the
unfused systems were in stark contrast with the trends observed for the SBM-
CW gains. Specifically, the magnitude of gains was greater for the short-word
evaluations. This is in line with previous results that found that unfused CW
was much better suited to short-word KV. Overall maximum EER gains of 1.8%,
1.3% and 1.7% over the best unfused cohort word systems were observed for the
4-phone, 6-phone and 8-phone experiments respectively. The gain for 4-phone
KV is particularly pleasing, corresponding to a relative gain of 13%.
Table 4.8 shows a correlation analysis between unfused and fused EERs. For
any given fusion pair, EER1 was the lower unfused equal error rate while EER2
was the other equal error rate. The results demonstrate two important properties
of CW-CW fusion. First, there is a high degree of correlation between the fused
EER and the EER1 statistic. For example, the correlation coefficient between
4-phone FusedEER and EER1 was 0.8961. This shows that the equal error rate
of the better performing system in a fusion pair has a significant effect on the
overall fused equal error rate.
Additionally, there is also a high degree of correlation between the product of
the individual unfused EERs and the fused EER. For example, for 6-phone EER,
the correlation coefficient is 0.8427. This indicates that combining two well-
performing systems will result in even better fused system performance,
while combining two poor performing systems will result in poor fused system
performance.
Given the results of the correlation analysis, it is clear that selecting two
well performing unfused cohort word systems will result in good fused system
performance. This means that it is not necessary to try every combination of
cohort word system configurations to obtain close to optimum fusion performance.
4.10.7 Comparison of fused and unfused systems
Figure 4.12 shows a comparison of performance across all evaluated architectures
and phone-lengths. Figure 4.13 shows the same results using a log EER scale to
provide better resolution at low error rates. The figures clearly demonstrate the
4-phones
             FusedEER   EER1     EER2     EER1 * EER2
FusedEER     1.0000     0.8961   0.5218   0.7820
EER1         0.8961     1.0000   0.4481   0.7729
EER2         0.5218     0.4481   1.0000   0.9099
EER1 * EER2  0.7820     0.7729   0.9099   1.0000

6-phones
             FusedEER   EER1     EER2     EER1 * EER2
FusedEER     1.0000     0.9887   0.4783   0.8427
EER1         0.9887     1.0000   0.4213   0.8158
EER2         0.4783     0.4213   1.0000   0.8332
EER1 * EER2  0.8427     0.8158   0.8332   1.0000

8-phones
             FusedEER   EER1     EER2     EER1 * EER2
FusedEER     1.0000     0.9867   0.4708   0.8681
EER1         0.9867     1.0000   0.4498   0.8511
EER2         0.4708     0.4498   1.0000   0.8000
EER1 * EER2  0.8681     0.8511   0.8000   1.0000

Table 4.8: Correlation analysis of fused EER and individual unfused EER
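A correlation matrix of this form can be reproduced with a few lines of NumPy. The values below are synthetic placeholders, not the thesis data:

```python
import numpy as np

# Hypothetical EERs for five fusion pairs (synthetic values, not thesis data).
fused_eer = np.array([0.12, 0.09, 0.15, 0.20, 0.11])
eer1 = np.array([0.14, 0.10, 0.16, 0.22, 0.13])  # better unfused system of each pair
eer2 = np.array([0.20, 0.18, 0.25, 0.30, 0.17])  # weaker unfused system of each pair

# Rows in the order of table 4.8: FusedEER, EER1, EER2, EER1 * EER2.
rows = np.vstack([fused_eer, eer1, eer2, eer1 * eer2])
corr = np.corrcoef(rows)  # 4 x 4 symmetric correlation matrix
print(np.round(corr, 4))
```

np.corrcoef treats each row as a variable, so corr[0, 1] is the correlation between the fused EER and EER1.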
benefits of various architectures.
For 4-phone KV, the best performing system was CW-CW: the majority of its
configurations significantly outperformed the baseline SBM system. Additionally,
both unfused cohort word KV and SBM-CW fused KV gave good improvements
over SBM-KV. However, the gains observed for CW-CW, particularly
when using optimal cohort word selection parameters, make it clearly the system
of choice for short-word KV.
Figure 4.12: Boxplot of EERs for all evaluated architectures (SBM, CW, CW-CW and SBM-CW) and phone-lengths.
Figure 4.13: Boxplot of log(EERs) for all evaluated architectures (SBM, CW, CW-CW and SBM-CW) and phone-lengths.
Both the SBM-CW and CW-CW fused architectures provided good improve-
ments for medium length 6-phone KV. SBM-CW KV appeared particularly well
suited to this task, with almost all evaluated configurations outperforming SBM-
KV. In terms of execution speed, an SBM-CW system is also quicker than a
CW-CW system and as such, would be a better choice for real-time applications.
Finally, some benefits can be observed for long 8-phone KV using SBM-CW
KV over the baseline SBM system. All fused SBM-CW systems outperformed the
baseline. As such, even a careless choice of cohort word selection parameters would
yield a better system. However, the maximum achieved EER gain was 1.8%
absolute (24% relative gain). Considering that long word KV EERs are already
quite low compared to shorter word KV, the additional processing burden may
not justify the small absolute gains that may be achieved by a fused system. As
such, based on available processing power, the most appropriate architecture for
long word KV is either a pure SBM approach or the SBM-CW approach.
4.11 Conclusions and Summary
This chapter presented a novel KV technique named Cohort Word Verification.
The cohort word technique combines high level linguistic information with cohort
verification techniques to obtain a well performing non-keyword model.
Multiple formulations for the cohort word confidence score were presented
and evaluated. Small scale experiments on the TIMIT clean microphone speech
database demonstrated that the N-class hybrid approach provided better KV
performance than the 2-class approach. Additionally, the N-class approach was
significantly faster, further supporting it as the formulation of choice.
KV experiments on the Switchboard-1 telephone speech corpus were also pre-
sented. These experiments sought to benchmark the performance of the cohort
word method compared to the baseline SBM approach. The experiments demon-
strated that considerable gains in EER performance could be obtained using co-
hort word KV over the baseline method for short and medium keyword lengths.
Specifically, the experiments showed absolute EER gains of 5.6% and 2.1% for
4-phone and 6-phone target words respectively. The observed gains were particu-
larly pleasing for the 4-phone case since the baseline SBM system achieved a very
poor EER of 19.7%. Unfortunately, the cohort word method performed poorly
for long word 8-phone KV, resulting in an absolute EER loss of 2.2%.
Analysis was performed on the various cohort word selection parameters to
quantify the effects of the method's many tuning parameters. It was found that the
key parameters of interest were dmax and the amount of cohort word set size
downsampling. Tuning of the other cohort word parameters only gave small
subsequent refinements in performance. These findings radically simplify the
process of constructing and deploying a cohort word keyword verifier.
Finally, fused SBM-cohort and cohort-cohort systems were examined. Exper-
iments were performed to quantify the gains of these architectures over unfused
SBM and cohort word systems. The results demonstrated that cohort-cohort fu-
sion was well suited for short-word KV, yielding absolute EER gains of 7.4%
and 1.8% over the baseline SBM and cohort word systems. SBM-cohort KV was
found to be excellent for medium length KV, resulting in an absolute EER gain
of 3.5% and 1.4% over the baseline SBM and cohort word systems. Additionally,
SBM-cohort KV was also found to yield improvements for long phone-length KV,
providing an absolute EER gain of 1.8% over the baseline SBM system.
In summary, cohort word verification has been demonstrated to be a well per-
forming means of keyword verification. In particular, fusion of this technique with
SBM-KV yields systems that perform well across a variety of keyword lengths.
The key results of this chapter are summarised in table 4.9.
Phone-length   Method                         EER
4              SBM-KV                         19.7
               CW 3, 3, 2, 1                  14.1
               SBM-CW 1, 4, 2, 2              13.9
               CW-CW 1, 3, 2, 1, 3, 4, 2, 2   12.3
6              SBM-KV                         11.7
               CW 1, 4, 2, 1                  9.6
               SBM-CW 2, 4, 2, 1              8.2
               CW-CW 1, 4, 2, 1, 1, 4, 1, 1   8.3
8              SBM-KV                         7.5
               CW 1, 4, 1, 1                  9.7
               SBM-CW 2, 4, 1, 2              5.7
               CW-CW 1, 1, 1, 1, 1, 4, 1, 1   8.0

Table 4.9: Summary of best performing systems
Chapter 5
Dynamic Match Lattice Spotting
5.1 Introduction
The ever-increasing volume and importance of audio and multimedia data has
brought with it the need for rapid audio indexing technologies. Complex large
vocabulary speech-to-text transcription engines have provided an intermediary
solution by transcribing speech into text that can then be rapidly searched using
conventional text search engines. However such systems are severely restricted
by the vocabulary of the STT engine.
Many applications, such as surveillance and news-story indexing, require sup-
port for typically out-of-vocabulary keyword queries such as names, acronyms and
foreign words. In such cases, unrestricted vocabulary keyword spotting methods
such as HMM-based keyword spotting have provided a solution, though at the
expense of considerably slower query speeds. Faster approaches such as reverse
dictionary lookup search engines and phone lattice spotting techniques (see sec-
tion 2.10) offer significantly quicker searching but are encumbered by poor miss
rate performance.
This chapter proposes a very fast and accurate keyword spotting method
named Dynamic Match Lattice Spotting (DMLS). DMLS builds upon lattice-
based spotting methods, but addresses the issue of inherent phone recogniser
errors that adversely affect miss rate performance. This is done by augmenting
the lattice search with dynamic programming sequence matching techniques to
provide robustness against erroneous phone lattice realisations. The resulting
system is capable of searching 1 hour of audio in 2 seconds while maintaining
good detection performance.
Initial sections of this chapter discuss the motivation for the DMLS method
and present a detailed description of all associated algorithms. Subsequent sec-
tions then report on experiments to compare the performance of DMLS to pre-
existing keyword spotting methods. These sections also provide a detailed anal-
ysis of the various parameters of DMLS. The final sections discuss methods of
optimising the execution speed of DMLS to improve real-time speed without af-
fecting detection performance.
5.2 Motivation
The experiments reported in chapter 3 demonstrated that although HMM-based
keyword spotting systems provided good detection performance, they left much
to be desired in terms of execution speed. Specifically, it was found that the fastest
SBM-based spotting method required 110 seconds to search 1 hour of speech for
a single keyword using a 3GHz Pentium 4 processor. Although such speeds are
more than adequate for real-time monitoring tasks such as broadcast monitoring
and keyword control systems, significantly faster speeds are required for tasks
such as large database searching.
One solution is to use a two stage STT approach as described in section
2.10.1. Such an approach provides very fast searches since query time processing
is purely textual. However, keyword queries using this method are restricted by
the vocabulary of the STT system. In very large vocabulary domains or domains
with dynamic vocabulary sets, such as the broadcast domain, this restriction
will be problematic. For example, the name of the latest elected president of Sri
Lanka is unlikely to be in the vocabulary of most STT systems, but may be of
interest to a user searching for any related news stories.
In contrast, lattice-based and bottom-up keyword spotting methods provide
unrestricted query-time vocabularies while maintaining very fast query speeds.
Instead of transcribing the speech into words during the speech preparation stage,
these methods use a low-level representation such as phone or syllable labels. This
low-level representation can then be searched at query time to infer putative
locations of a target word.
Unfortunately, the detection performances of such methods are significantly
poorer than HMM-based methods. For example, the indexed reverse dictionary
lookup method proposed by Dharanipragada and Roukos [10] achieved a miss
rate of approximately 35% @ 10 FA/kw-hr. Poor performances have also been
reported for lattice based keyword spotting by Young and Brown [39].
A major reason for the poor performance of lattice-based and bottom-up
methods is that query time searching is based on highly erroneous phone recog-
niser transcriptions. Phone-recogniser error rates are typically in the vicinity of
30%-50% even in favourable conditions, and potentially poorer still in adverse
conditions. This high error rate is clearly propagated to the query-time search
stages, resulting in poor keyword spotting performance.
Lattice-based approaches attempt to accommodate phone recogniser errors by
encoding an utterance within the recognition lattice in terms of multiple hypothe-
ses. A phone lattice not only represents multiple utterance level transcriptions,
but also maintains multiple localised transcriptions at a given time point in an
utterance. It is hoped then, that at least one of the localised hypotheses occurring
at the point of a true target keyword occurrence will match the target keyword
phone sequence.
One means of further reducing the impact of phone recogniser errors is to
incorporate any prior knowledge of recogniser errors into the search process. For
example, if the phones /aa/ and /ih/ are known to be highly substitutionary
for a given phone recogniser, then improved detection rates may be obtained by
including this prior information in the search process.
However, an unfortunate side-effect is that allowing for such error corrections
will inevitably lead to an increase in false alarm rates. For example, when
using the /aa/ ↔ /ih/ substitutionary rule, true occurrences of the
word STICK = (/s/, /t/, /ih/, /k/) will be labeled as instances of the word
STOCK = (/s/, /t/, /aa/, /k/). As such, any error correction will need to incur
some kind of cost that affects the overall likelihood of a putative instance.
A method that successfully incorporates phone recogniser error correction
will improve overall keyword spotting robustness. The resultant gains in per-
formance will improve the suitability of lattice-based and bottom-up keyword
spotting methods for very fast keyword spotting tasks.
5.3 Dynamic Match Lattice Spotting method
Dynamic Match Lattice Spotting is an extension of conventional lattice-based
keyword spotting, but uses the Minimum Edit Distance (see Appendix A) during
lattice searching to compensate for phone recogniser insertion, deletion and sub-
stitution errors. This addresses the major shortcoming of lattice-based methods
— the requirement for the target phone sequence to appear in its entirety within
the phone-lattice for consideration as a hypothesised keyword occurrence.
Given source and target sequences, the MED calculates the minimum cost
of transforming the source sequence to the target sequence using a combination
of insertion, deletion, substitution and match operations, where each operation
has an associated cost. In the Dynamic Match Lattice Spotting method, each
observed lattice phone sequence is scored against the target phone sequence using
the MED. Lattice sequences are then accepted or rejected by thresholding on this
MED score, hence providing robustness against phone recogniser errors. Conven-
tional lattice-based spotting is a special case of DMLS where a score threshold of
0 is used.
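As a concrete illustration of this scoring scheme, the following minimal Python sketch (illustrative only; the MED itself is defined formally in Appendix A) computes the MED between an observed and a target phone sequence and thresholds it. Setting the threshold to 0 reduces to conventional exact lattice matching:

```python
import math

def med(source, target, ci=1.0, cd=math.inf, cs=None):
    """Minimum Edit Distance between an observed phone sequence (source)
    and a target phone sequence, built from insertion (ci), deletion (cd)
    and substitution (cs) costs. A sketch only, not the thesis code."""
    if cs is None:
        cs = lambda a, b: 1.0          # flat substitution cost by default
    M, N = len(source), len(target)
    D = [[0.0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        D[i][0] = D[i - 1][0] + ci     # extra phones in the observation
    for j in range(1, N + 1):
        D[0][j] = D[0][j - 1] + cd     # target phones the recogniser dropped
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            step = 0.0 if source[i - 1] == target[j - 1] \
                   else cs(source[i - 1], target[j - 1])
            D[i][j] = min(D[i - 1][j] + ci,         # insertion
                          D[i][j - 1] + cd,         # deletion
                          D[i - 1][j - 1] + step)   # match / substitution
    return D[M][N]

def accept(source, target, s_max, **costs):
    """Threshold the MED score; s_max = 0 behaves like conventional
    lattice-based spotting (exact matches only)."""
    return med(source, target, **costs) <= s_max
```

With a substitution rule such as /aa/ ↔ /ih/ costing 1, the lattice sequence (/s/, /t/, /ih/, /k/) scores 1 against STOCK and is accepted at a threshold of 2 but rejected at 0, mirroring the behaviour described above.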
This means of phone recogniser error correction has significant potential for
improving keyword spotting performance. Consider the segment of a phone lattice
from the Wall Street Journal database corresponding to an instance of the word
STOCK, as shown in figure 5.1. Using the conventional lattice-based search, none
of the paths will match the target sequence STOCK = (/s/, /t/, /aa/, /k/).
The lattice shows that the phone recogniser correctly transcribed 3 of the 4
phones: /s/, /t/ and /k/. However, a single substitution error for the
phone /aa/ prevents the word STOCK from being detected.
In contrast, the DMLS search will match a large number of paths, though
each path will have a non-zero MED score. The decision to accept or reject these
putative occurrences is then left to a subsequent thresholding or keyword verifi-
cation stage. This is one of the many examples where DMLS’ error robustness
aids detection performance.
As stated before, a drawback of DMLS is that there will be an increase in false
alarm rate since the sequence matching process is significantly more liberal than
the conventional lattice based technique. However, since each putative occurrence
will have an associated MED score that is indicative of the looseness of the match,
simple techniques such as thresholding can be used to limit the looseness of the
final result set.
Additionally, a keyword spotting stage should try to achieve as low a miss rate
as possible at the expense of false alarm rate if a subsequent keyword verification
stage is being used. The burden of false alarm reduction is then left to the more
Figure 5.1: Segment of phone lattice for an instance of the word STOCK
specialised keyword verification stage. Hence, although the anticipated increase in
false alarm rate for the Dynamic Match Lattice Spotting technique is unfortunate,
it is a resolvable issue.
5.3.1 Basic method
The Dynamic Match Lattice Spotting algorithm is an extension of the conven-
tional lattice-based spotting method. This is a two stage process consisting of an
initial lattice building stage and a subsequent query-time search stage.
During the lattice building stage, each utterance is decoded using a Viterbi
phone recogniser to generate a recognition phone lattice. The quality of this
lattice can be controlled by adjusting a variety of factors including:
1. Phone language model: The quality of this model has a significant im-
pact on the overall error rate of the recogniser. Typically long context
models such as 4-grams provide better performance than short context 2-
gram models.
2. Word insertion penalty: Tuning of this parameter allows a trade off
between insertion and deletion rates of the recogniser.
3. Grammar scale factor: This affects the importance attributed to the
language model likelihoods
4. Number of phone classes: Using a smaller number of phone classes will
yield improved phone recogniser performance. However, using too small a
phone set will result in very broadly labeled data and hence more confusable
lattices for subsequent searching.
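These knobs can be thought of as a single lattice-building configuration. The names and values below are hypothetical illustrations only, not settings used in the experiments reported here:

```python
# Illustrative lattice-building configuration (hypothetical names and
# values; the four entries correspond to the factors listed above).
lattice_config = {
    "phone_lm_order": 4,             # longer-context LMs beat 2-grams
    "word_insertion_penalty": -4.0,  # trades insertions against deletions
    "grammar_scale_factor": 6.0,     # weight on language model likelihoods
    "num_phone_classes": 40,         # too few -> broad, confusable lattices
}
```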
Lattice building only needs to be performed once per utterance. Once a lattice
is built, it can be used repeatedly in subsequent queries regardless of the query
term.
The second stage of the Dynamic Match Lattice Spotting method is the lattice
search stage. This step is considerably faster than the initial lattice building stage
since processing is purely textual. The process consists of a modified Viterbi
traversal of the lattice that emits putative matches during traversal.
Let P = (p1, ..., pN ) be defined as the target phone sequence, where N is
the target phone sequence length. Additionally let Smax be the maximum MED
score threshold, K be the maximum number of observed phone sequences to
be emitted at each node, and V be defined as the number of tokens used during
lattice traversal. Then for each node in the phone lattice, where node list traversal
is done in time-order:
1. For each token in the top K scoring tokens in the current node:
(a) Let Q = (q1, ..., qM), M = N + MAX(Ci) ∗ Smax, be the observed phone
sequence obtained by traversing the token history backwards M levels,
where Ci is the insertion MED cost function.
(b) Let S = BESTMED(Q,P,Ci, Cd, Cs), where
i. Cd is the deletion MED cost function
ii. Cs is the substitution MED cost function
iii. BESTMED(. . .) returns the score of the first element in the last
column of the MED cost matrix that is less than or equal to Smax
(or ∞ otherwise).
(c) Emit Q as a keyword occurrence if S ≤ Smax
2. For each node linked to the current node, perform V -best token set merging
of the current node’s token set into the target node’s token set.
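The node loop above can be sketched as follows, under simplifying assumptions: a token is reduced to a (path score, phone history) pair, nodes and their links are plain dictionaries, and BESTMED is implemented directly from the definition in step 1(b). This is a structural sketch only, not the thesis implementation, and it assumes a scalar insertion cost:

```python
import math
from heapq import nlargest

def best_med(Q, P, Ci, Cd, Cs, s_max):
    """BESTMED: the first entry of the MED matrix's last column that is
    <= s_max (infinity otherwise), so a prefix of Q may match all of P."""
    N = len(P)
    prev = [0.0] * (N + 1)
    for j in range(1, N + 1):
        prev[j] = prev[j - 1] + Cd       # row 0: deletions only
    if prev[N] <= s_max:
        return prev[N]
    for q in Q:
        cur = [prev[0] + Ci]
        for j in range(1, N + 1):
            sub = 0.0 if q == P[j - 1] else Cs(q, P[j - 1])
            cur.append(min(prev[j] + Ci, cur[j - 1] + Cd, prev[j - 1] + sub))
        if cur[N] <= s_max:
            return cur[N]
        prev = cur
    return math.inf

def process_node(node, successors, P, Ci, Cd, Cs, s_max, K, V):
    """One iteration of the time-ordered node loop: emit matches from the
    top-K token histories (step 1), then V-best-merge the current node's
    tokens into each linked node (step 2)."""
    M = len(P) + int(Ci * s_max)         # history depth from step 1(a)
    hits = []
    for score, hist in nlargest(K, node["tokens"]):
        Q = hist[-M:]
        S = best_med(Q, P, Ci, Cd, Cs, s_max)
        if S <= s_max:
            hits.append((Q, S))
    for succ in successors:              # token propagation and merging
        moved = [(sc + succ["loglik"], h + (succ["phone"],))
                 for sc, h in node["tokens"]]
        succ["tokens"] = nlargest(V, succ["tokens"] + moved)
    return hits
```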
5.3.2 Optimised Dynamic Match Lattice Search
The basic DMLS method described above will execute significantly faster than
HMM-based keyword spotting. This is because all search-time processing is
purely textual. However, a significant part of this search process is Viterbi de-
coding, which in itself is a computationally intensive task. It is in fact possible
to remove this Viterbi lattice traversal from query-time processing, as described
below. This yields further increases in query speed.
Since the paths traversed through the lattice are independent of the query
term (traversal is done purely by maximum likelihood), it is possible to perform
the lattice traversal during the lattice building stage. Then it is only necessary
to store the observed phone sequences at each node for searching at query-time.
Furthermore, if it is assumed that the maximum queried phone sequence
length is fixed at Nmax and the maximum sequence match score threshold is
preset at Smax, then it is only necessary to store observed phone sequences of
length Mmax = Nmax + MAX(Ci) ∗ Smax.
Query-time processing then reduces to simply calculating the MED between
each stored observed phone sequence and the target phone sequence. The algo-
rithm for the lattice building stage is hence:
1. Construct the recognition lattice using the same approach as in the basic
DMLS method
2. Let A = ∅, where A is the collection of observed phone sequences
3. For each node in the phone-lattice, where node list traversal is done in
time-order:
(a) For each token in the top K scoring tokens in the current node:
i. Let Q = (q1, ..., qMmax) be the observed phone sequence obtained
by traversing the token history backwards Mmax levels.
ii. Append the sequence Q to the collection A
(b) For each node linked to the current node, perform V -best token set
merging of the current node’s token set into the target node’s token
set.
4. Store the observed phone sequence collection for subsequent searching
5. The recognition lattice can now be discarded as it is no longer required for
query-time searching
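The steps above can be sketched with the same simplified token representation (a score plus a phone-history tuple; illustrative names, not the thesis code). Only the returned collection A needs to survive the pass:

```python
from heapq import nlargest

def build_sequence_store(nodes, links, K, V, M_max):
    """Optimised lattice-building pass: walk nodes in time order, record
    the top-K token histories (truncated to M_max phones) at every node,
    and V-best-merge tokens into linked nodes. The recognition lattice
    can be discarded once the collection A has been stored."""
    A = []
    for node in nodes:                        # nodes already in time order
        for score, hist in nlargest(K, node["tokens"]):
            A.append(hist[-M_max:])           # observed phone sequence
        for succ in links.get(node["id"], ()):
            moved = [(sc + succ["loglik"], h + (succ["phone"],))
                     for sc, h in node["tokens"]]
            succ["tokens"] = nlargest(V, succ["tokens"] + moved)
    return A
```

Query-time processing then needs only A: each stored sequence is scored against the target phone sequence with the MED, with no lattice traversal at all.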
This optimisation results in a significant reduction in the complexity of query-
time processing. Whereas in the basic DMLS approach, full Viterbi traversal was
required, processing using this optimised approach is now a linear progression
through a set of observed phone sequences.
Thus the optimised query time search algorithm is as follows. For each mem-
ber, Q of the collection of observation sequences, A:
1. Let S = BESTMED(Q,P,Ci, Cd, Cs)
2. Emit Q as a putative occurrence if S ≤ Smax
5.4 Evaluation of DMLS performance
Evaluations were performed to compare the DMLS technique with conventional
keyword spotting approaches. Specifically, comparisons were made against the
conventional SBM-based and lattice-based systems. Evaluations were performed
on the TIMIT clean microphone speech database.
5.4.1 Evaluation set
A keyword spotting evaluation set was constructed using speech taken from the
TIMIT test database. The choice of query words was constrained to words that
had 6-phone-length pronunciations to reduce target word length dependent vari-
ability.
Approximately 1 hour of TIMIT test speech (excluding the SA1 and SA2 utter-
ances) was labeled as evaluation speech. From this speech, 200 6-phone-length
unique words were randomly chosen and labeled as query words. These query
words appeared a total of 480 times in the evaluation speech.
5.4.2 Recogniser parameters
16-mixture triphone HMM acoustic models and a 256-mixture Gaussian Mixture
Model background model were trained on a 140 hour subset of the Wall Street
Journal 1 (WSJ1) database for use with the SBM-based, conventional lattice-based and DMLS
evaluations. Additionally 2-gram and 4-gram phone-level language models were
trained on the same section of WSJ1 for use during the lattice building stages of
DMLS and the conventional lattice-based methods.
All speech was parameterised using Perceptual Linear Prediction coefficient
feature extraction and Cepstral Mean Subtraction. In addition to 13 static Cep-
stral coefficients (including the 0th coefficient), deltas and accelerations were
computed to generate 39-dimension observation vectors.
5.4.3 Lattice building
The following lattice building procedure, based on the optimised lattice building
approach described in section 5.3.2, was used for these experiments:
1. Lattices were generated for each utterance by performing a U -token Viterbi
decoding pass using the 2-gram phone-level language model
2. The resulting lattices were expanded using the 4-gram phone-level language
model
3. Output likelihood lattice pruning was applied using a beam-width of W to
reduce the complexity of the lattices. This essentially removed all paths
from the lattice that had a total likelihood outside a beamwidth of W of
the top-scoring path.
4. A second V -token traversal was performed to generate the top 10 scoring
observed phone sequences of length 11 at each node (allowing spotting of
sequences of up to 11 − MAX(Ci) ∗ Smax phones).
Lattice building was only performed once for each utterance. The resulting
observed phone sequence collections were then stored to disk and used during the
actual query time search experiments.
5.4.4 Query-time processing
The optimised lattice search algorithm described in section 5.3.2 was used for
these experiments. The sequence matching threshold, Smax, was fixed at 2 for all
experiments unless noted otherwise. MED calculations used a constant deletion
cost of Cd = ∞, as preliminary experiments found that poor results were obtained
for non-infinite values of Cd. The insertion cost was also fixed at Ci = 1.
In contrast, Cs was allowed to vary based on phone substitution rules. The
phone substitution costs used are given in table 5.1 and were determined em-
pirically by examining phone recogniser confusion matrices. Substitutions were
completely symmetric. Hence the substitution of a phone m in a given phone
group with another phone n in the same group yielded the same cost as the
substitution of n with phone m.
The basic rules used to obtain these costs were:
1. Cs = 0 for same-letter consonant phone substitution (eg. /n/ ↔ /nx/,
/z/↔ /zh/)
2. Cs = 1 for vowel substitutions
3. Cs = 1 for closure and stop substitutions
4. Cs = ∞ for all other substitutions.
However, some exceptions to these rules were made based on empirical observa-
tions from small scale experiments.
Phone Group Subst. Cost
aa ae ah ao aw ax ay eh en er ey ih iy ow oy uh uw 1
b d dh g k p t th jh 1
d dh 0
n nx 0
t th 0
w wh 0
uw w 1
z zh s sh 1
Table 5.1: Phone substitution costs for DMLS
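The costs in table 5.1 can be encoded as a simple group lookup. The sketch below is an illustrative transcription of the table, not the thesis code: substitution is symmetric, identical phones cost 0, and where a pair appears in more than one group (e.g. /d/ and /dh/) the cheapest group wins, which reproduces the stated exceptions to the basic rules.

```python
import math

# Phone groups transcribed from table 5.1, each with its substitution cost.
GROUPS = [
    (set("aa ae ah ao aw ax ay eh en er ey ih iy ow oy uh uw".split()), 1),
    (set("b d dh g k p t th jh".split()), 1),
    ({"d", "dh"}, 0),
    ({"n", "nx"}, 0),
    ({"t", "th"}, 0),
    ({"w", "wh"}, 0),
    ({"uw", "w"}, 1),
    ({"z", "zh", "s", "sh"}, 1),
]

def subst_cost(a, b):
    """Cs(a, b): cost of the cheapest group containing both phones,
    0 for identical phones, infinity for uncovered pairs."""
    if a == b:
        return 0
    return min((cost for group, cost in GROUPS if a in group and b in group),
               default=math.inf)
```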
5.4.5 Baseline systems
The standard SBM-based background model system described in chapter 3 was
used for evaluating HMM-based keyword spotter performance. This baseline pro-
vides a comparison with standard state-of-the-art keyword spotting performances
achieved for real-time keyword spotting.
The lattice-based baseline system was constructed using the method proposed
by Young and Brown [39]. This algorithm was implemented by simply using
a DMLS system with Smax = 0. This in essence would result in only exact
matches within the recognition lattice being emitted as putative occurrences,
as required for conventional lattice-based spotting. Miss and false alarm rates
obtained using this approach would be indicative of conventional lattice-based
spotting performance. However, true execution times could not be measured for
this baseline system, as a simulated system was being used rather than the true
lattice-based system described by Young and Brown [39].
5.4.6 Evaluation procedure
The systems were evaluated by performing single-word keyword spotting for each
query word across all utterances in the evaluation set. The total miss rate for all
query words and the false alarm per keyword occurrence rate (FA/kw) were then
calculated using reference transcriptions of the evaluation data. Additionally the
total CPU processing minutes per queried keyword per hour (CPU/kw-hr) was
measured for each experiment using a 3GHz Pentium 4 processor.
For DMLS, CPU/kw-hr only included the CPU time used during the DMLS
search stage. That is, the time required for lattice building was not included.
All experiments used a commercial-grade decoder to ensure that the best
possible CPU/kw-hr results were reported for the HMM-based system. This is
because HMM-based keyword spotting time performance is bound by decoder
performance.
5.4.7 Results
To aid discussion, the notation DMLS[U, V, W, Smax] is used to specify DMLS
configurations, where U is the number of tokens for lattice generation, V is the
number of tokens for lattice traversal, W is the pruning beamwidth, and Smax is
the sequence match score threshold (see section 5.4.3 and section 5.4.4 for further
details on these parameters). The notation HMM[α] is used when referring to
baseline SBM-KS systems where α was the duration-normalised output likelihood
threshold used. Additionally the baseline conventional lattice-based method is
referred to as CLS.
Performances for the HMM-based, lattice-based and the DMLS systems mea-
sured for the TIMIT evaluation set are shown in Table 5.2. For this set of experi-
ments, the DMLS[3,10,200,2] configuration was arbitrarily chosen as the baseline
DMLS configuration.
Method Miss Rate FA/kw CPU/kw-hr
HMM[∞] 1.6 44.2 1.58
HMM[-7580] 10.4 36.6 1.58
HMM[-7000] 39.8 16.8 1.58
CLS[3,10,200,0] 32.9 0.4 -
DMLS[3,10,200,2] 10.2 18.5 0.30
Table 5.2: Baseline keyword spotting results evaluated on TIMIT
The timing results demonstrate that, as expected, DMLS was significantly
faster than the SBM-KS method, running at approximately 5 times the speed.
This amounts to a baseline DMLS system capable of searching 1 hour of speech in
18 seconds. DMLS also had more favourable FA/kw performance: at 10.2% miss
rate, it had a FA/kw rate of 18.5, significantly lower than the 36.6 FA/kw rate
achieved by the HMM[-7580] system. However, the HMM system was still capa-
ble of achieving a much lower miss rate of 1.6% using the HMM[∞] configuration,
though at the expense of considerably more false alarms.
The miss rate achieved by the conventional lattice-based system was very poor
compared to that of Dynamic Match Lattice Spotting. This confirms that the
phone error robustness inherent in DMLS yields considerable detection perfor-
mance benefits. However, the false alarm rate for CLS was dramatically better
than all other systems, though with such a high miss rate, this is not surprising.
In summary, these experiments demonstrate that Dynamic Match Lattice
Spotting is well suited for very fast keyword spotting tasks. Specifically, in the
reported evaluations, DMLS was able to search 1 hour of speech in 18 seconds.
DMLS significantly outperformed the baseline lattice-based system in terms of
miss rate and also yielded considerably lower false alarm rates than the baseline
HMM-based system. Overall, Dynamic Match Lattice Spotting appears to pro-
vide a compromise between the low miss rate of HMM-based systems and the
low false alarm rate of lattice-based systems, while still providing extremely fast
keyword spotting.
5.5 Analysis of dynamic match rules
The miss rate achieved by the baseline lattice-based system was very poor com-
pared to that of DMLS. This indicates that the phone recogniser error robustness
incorporated into the DMLS search does significantly improve keyword spotting
performance. However, it is not immediately clear which aspects of the dynamic
match process are most effective in improving performance.
Specifically, improvements in performance can be attributed to the four main
cost rules used in the dynamic match process:
1. Insertion costs
2. Same letter substitution costs (eg. /d/↔ /dh/, /n/↔ /nx/)
3. Vowel substitution costs
4. Closure/stop substitution costs (eg. /b/↔ /d/, /k/↔ /p/)
As such, experiments are presented here to quantify the benefits of individual
cost rules. The evaluation set, recogniser parameters, experimental procedure
and DMLS algorithm are the same as those used in section 5.4.
5.5.1 System configurations
Specialised DMLS systems were built to evaluate the effects of individual cost
rules in isolation. These systems were implemented as follows:
• Same letter substitution rules only. Implemented using MED cost functions:
Cd = ∞
Ci = ∞
Cs(a, b) = 1 if a and b have the same letter base; ∞ otherwise
• Vowel substitution rules only. Implemented using MED cost functions:
Cd = ∞
Ci = ∞
Cs(a, b) = 1 if a and b are both vowels; ∞ otherwise
• Closure/stop substitution rules only. Implemented using MED cost func-
tions:
Cd = ∞
Ci = ∞
Cs(a, b) = 1 if a and b are closures or stops; ∞ otherwise
• Insertion cost rule only. Implemented using MED cost functions:
Cd = ∞
Cs = ∞
Ci = 1
5.5.2 Results
Table 5.3 shows the results of the specialised DMLS systems, baseline lattice-
based CLS system and the previously evaluated DMLS[3,10,200,2] system with
all MED rules.
Method Miss Rate FA/kw
CLS[3,10,200,0] 32.9 0.4
DMLS[3,10,200,2] insertions 28.5 1.2
DMLS[3,10,200,2] same letter subst 31.0 0.5
DMLS[3,10,200,2] vowel subst 15.6 7.8
DMLS[3,10,200,2] closure/stop subst 23.5 3.0
DMLS[3,10,200,2] all rules 10.2 18.5
Table 5.3: TIMIT performance when isolating various DP rules
The experiments demonstrate that the magnitude of contributions of the vari-
ous rules to overall keyword spotting performance varies drastically. Interestingly,
no single rule alone brought miss rate down to the level of the all-rules DMLS
system. This indicates that the rules are complementary in nature and yield a
combined overall improvement in miss rate performance.
Using the same letter substitution rules only yielded a small gain in performance
over the null-rule CLS system: 1.9% absolute in miss rate with only a 0.1 rise
in FA/kw rate. The result suggests that the phone-lattice is already
robust to same letter substitutions, and as such, the inclusion of this rule does
not yield significant gains in performance. Empirical study of the phone-lattices revealed
this to be the case in many situations. For example, typically if the phone /s/
appeared in the lattice, then it was almost guaranteed that the phone /sh/ also
appeared at a similar time location in the lattice.
The insertions-only system yielded a slightly larger gain of 4.4% absolute in
miss rate with only a 0.8 increase in FA/kw rate. The result indicates that the
lattices contain extraneous insertions across many of the multiple hypotheses
paths, preventing detection of the target phone sequence when insertions are
not accounted for. This observation is to be expected since phone recognisers
typically do have significant insertion error rates, even when considering multiple
levels of transcription hypotheses.
A significant absolute miss rate gain of 17.3% was observed for the vowel
substitution system. However, this gain was at the expense of a 7.4 absolute
increase in FA/kw rate. This is a pleasing gain and is supported by the fact that
vowel substitution is a frequent occurrence in the realisation of speech. As such,
incorporating support for vowel substitutions in DMLS not only corrects errors in
the phone recogniser but also accommodates this substitutionary habit of human
speech.
Finally, significant gains were also observed for the closure/stop substitution
system. An absolute gain of 9.4% in miss rate combined with an unfortunate
2.6 absolute increase in FA/kw rate was obtained for this system. Typically
closures and stops are shorter acoustic units and therefore more likely to yield
classification errors. As such, even though the phone lattice encodes multiple
hypotheses, it appears that it is still necessary to incorporate robustness against
closure/stop confusion for lattice-based keyword spotting.
Overall, the experiments demonstrate the benefits of the various classes of
MED rules used in the evaluated DMLS systems. It was pleasing to note that
even the simplest of these rules still provided tangible gains in performance over
the baseline lattice-based CLS system. This clearly reinforces the fact that the
dynamic matching aspects of DMLS are beneficial. The results showed that
insertion and same-letter consonant substitution rules only provided a small per-
formance benefit over a conventional lattice-based system, whereas vowel and
closure/stop substitution rules yielded considerable gains in miss rate. Gains in
miss rate were unfortunately offset by increases in FA/kw rate, although the
majority of these increases were fairly small and would most likely be justifiable
in light of the resulting improvements in miss rate.
5.6 Analysis of DMLS algorithm parameters
Earlier experiments in this chapter used a fixed DMLS[3,10,200,2] configuration
to reduce the scope of experiments. In this section, the quantitative effects of
these individual parameters on keyword spotting performance are measured and
examined. Specifically the following parameters are studied:
1. Number of tokens used for lattice generation, U
2. Number of tokens used for lattice traversal, V
3. Lattice pruning beamwidth, W
4. The threshold applied to MED score, Smax
The evaluation set, recogniser parameters, experimental procedure and DMLS
algorithm are the same as used in section 5.4.
5.6.1 Number of lattice generation tokens
The number of tokens used for lattice generation, U , has a direct impact on the
maximum size of the resulting phone lattice. For example, if a value of U = 3
is used, then a lattice node can have at most 3 predecessor nodes. Whereas, if a
value of U = 5 is used, then the same node can have up to 5 predecessor nodes,
greatly increasing the size and complexity of the lattice when applied across all
nodes.
Tuning of U directly affects the number of hypotheses encoded in the lattice,
and hence the best achievable miss rate. However, using larger values of U also
increases the number of nodes in the lattice, resulting in an increased amount of
processing during DMLS searching and therefore increased execution time.
Table 5.4 shows the result of increasing U from 3 to 5. As expected, increasing
U resulted in an improvement in miss rate of 4.4% absolute but also in an increase
in execution time by a factor of 2.3. A corresponding 19.9 increase in FA/kw rate
was also observed.
The obvious benefit of tuning the number of lattice generation tokens is that
appreciable gains in miss rate can be obtained. Although this has a negative
effect on FA/kw rate, a subsequent keyword verification stage may be able to
accommodate the increase.
Method Miss Rate FA/kw CPU/kw-hr
DMLS[3,10,200,2] 10.2 18.5 0.30
DMLS[5,10,200,2] 5.8 38.4 0.71
Table 5.4: Effect of adjusting number of lattice generation tokens
5.6.2 Pruning beamwidth
Lattice pruning is applied to remove less likely paths from the generated phone
lattice, thus making the lattice more compact. This is typically necessary when
language model expansion is applied. For example, applying 4-gram language
model expansion to a lattice generated using a 2-gram language model results
in a significant increase in the number of nodes in the lattice, many of which
may now have much poorer likelihoods due to the additional 4-gram language model
scores.
The direct benefit of applying lattice pruning is an immediate reduction in
the size of the lattice that needs to be searched. This will give improvements
in execution time, though at the expense of losing potentially correct paths that
unfortunately did not score well linguistically.
Table 5.5 shows the effect of pruning beamwidth for four different values: 150,
200, 250 and ∞. As predicted, decreasing the pruning beamwidth yielded significant
gains in execution speed at the expense of increased miss rate. Corresponding
drops in FA/kw rate were also observed.
Adjusting pruning beamwidth appears to be particularly well suited for tun-
ing execution time. The changes in CPU/kw-hr figures were dramatic, and in
comparison, the miss rate figures varied in a much smaller range.
Method Miss Rate FA/kw CPU/kw-hr
DMLS[3,10,150,2] 12.5 12.2 0.18
DMLS[3,10,200,2] 10.2 18.5 0.30
DMLS[3,10,250,2] 9.2 24.7 0.47
DMLS[3,10,∞,2] 7.3 60.6 2.93
Table 5.5: Effect of adjusting pruning beamwidth
5.6.3 Number of lattice traversal tokens
The number of lattice traversal tokens, V , corresponds to the number of tokens
used during the secondary Viterbi traversal. Tuning this parameter affects how
many tokens are propagated out from a node, and hence, the number of paths
entering a node that survive subsequent propagation.
Figure 5.2: Effect of lattice traversal token parameter (3-token vs 5-token propagation)
The impact of this on DMLS is actually more subtle, and is demonstrated
by figure 5.2. In this instance, the scores of tokens propagated from the t node
are much higher than the scores from the other nodes. As such, in the 5-token
propagation case, the majority of the high-scoring tokens in the target node are
from the t node. Hence the tokens above the emission cutoff (ie. the tokens from
which observed phone sequences are generated) come mainly from the t node. However,
using the same emission cutoff and 3-token propagation results in a set of top-
scoring tokens from a variety of source nodes. It is not immediately obvious
whether it is better to use a high or low number of lattice traversal tokens for
optimal DMLS performance.
Table 5.6 shows the results of experiments using three different numbers of
traversal tokens: 5, 10 and 20. It appears that all three measured performance
metrics were fairly insensitive to changes in the number of traversal tokens. There
was a slight decrease in miss rate when using a higher value of V , though this may
not be considered a dramatic enough change to justify the additional processing
burden required at the lattice building stage.
Method Miss Rate FA/kw CPU/kw-hr
DMLS[3,5,200,2] 10.4 17.4 0.28
DMLS[3,10,200,2] 10.2 18.5 0.30
DMLS[3,20,200,2] 9.8 18.8 0.29
Table 5.6: Effect of adjusting number of traversal tokens
5.6.4 MED cost threshold
Tuning of the MED cost threshold, Smax, is the most direct means of tuning
miss and FA/kw performance. However, if discrete MED costs are used, then
Smax itself will be a discrete variable, and as such, thresholding will not be on a
continuous scale.
Smax specifies the maximum allowable discrepancy between an observed phone
sequence and the target phone sequence. However, this single threshold does
not take into account what kind of mismatch occurred. For example, the se-
quences (/s/, /t/, /aa/, /p/) and (/s/, /sh/, /t/, /aa/, /k/, /g/) will both have a
MED score of 2 when scored against the target sequence (/s/, /p/, /aa/, /k/)
using the cost functions C(i) = 1, C(s) = 1, and C(d) = ∞.
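This behaviour can be checked with a short dynamic-programming sketch. The scoring below assumes a BESTMED-style rule in which trailing observed phones beyond the target alignment are free (the score is taken as the minimum over the final row of the cost matrix); under that assumption both sequences do score 2.

```python
# MED between a target and an observed phone sequence using the cost
# functions from the text: C(i) = 1, C(s) = 1 (0 on a match), C(d) = infinity.
# Assumption: BESTMED-style scoring in which trailing observed phones beyond
# the target alignment are free, i.e. the score is the minimum over the
# final row of the cost matrix.
INF = float("inf")

def med(target, observed, ci=1.0, cs=1.0, cd=INF):
    N, M = len(target), len(observed)
    D = [[INF] * (M + 1) for _ in range(N + 1)]
    D[0][0] = 0.0
    for j in range(1, M + 1):
        D[0][j] = j * ci                          # row 0: pure insertions
    for i in range(1, N + 1):
        D[i][0] = i * cd                          # column 0: pure deletions
        for j in range(1, M + 1):
            sub = 0.0 if target[i - 1] == observed[j - 1] else cs
            D[i][j] = min(D[i - 1][j - 1] + sub,  # substitution / match
                          D[i - 1][j] + cd,       # deletion
                          D[i][j - 1] + ci)       # insertion
    return min(D[N])                              # trailing observed phones free

target = ["s", "p", "aa", "k"]
print(med(target, ["s", "t", "aa", "p"]))             # 2.0
print(med(target, ["s", "sh", "t", "aa", "k", "g"]))  # 2.0
```

The first sequence reaches the score via two substitutions, the second via one substitution and one insertion: the same threshold admits quite different kinds of mismatch.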
Experiments were carried out to study the effects of changes in Smax on per-
formance. The results of these experiments are shown in table 5.7. Since thresh-
olding was applied on the result set of DMLS, there were no changes in execution
time.
Method Miss Rate FA/kw CPU/kw-hr
DMLS[3,10,200,0] 31.0 0.5 0.30
DMLS[3,10,200,1] 13.3 4.3 0.30
DMLS[3,10,200,2] 10.2 18.5 0.30
DMLS[3,10,200,3] 8.7 52.0 0.30
Table 5.7: Effect of adjusting MED cost threshold Smax
The experiments demonstrated that adjusting Smax gave dramatic changes in
FA/kw. In contrast, the changes in miss rate were considerably more conserva-
tive. Tuning of the MED cost threshold therefore appears to be most applicable
to adjusting the FA/kw operating point. This is intuitive, since adjusting Smax
controls how much error an observed phone sequence is allowed to have, and as
such has a direct correlation with the false alarm rate.
5.6.5 Tuned systems
Previous sections examined tuning of the various DMLS parameters in isolation.
However, it was not clear from these experiments how a system constructed using
a combination of tuned parameters would perform. In particular, it is essential
to know whether the benefits obtained from tuning the individual parameters are
complementary, resulting in even greater increases in keyword spotting perfor-
mance.
As such, two tuned systems were constructed and evaluated on the TIMIT
data set. Parameters for these systems were selected as follows:
1. The number of lattice generation tokens appeared to be well suited to ad-
justing miss rate. As such, a value of U = 5 was used for the tuned systems.
2. DMLS performance appeared insensitive to changes in the number of lattice
traversal tokens. Hence, to remain consistent with previous experiments, a
value of V = 10 was used.
3. The speed increases observed using a reduced lattice pruning beamwidth
were quite dramatic, and in comparison only resulted in a small degradation
in miss rate. Considering the anticipated gains in miss rate from the increase
in the number of lattice generation tokens, a reduced value of W = 150 was
used for the tuned systems.
4. Two values of Smax were evaluated to obtain performance at different false
alarm points. The values evaluated were Smax = 1 and Smax = 2. Although
it was anticipated that a degradation in miss rate would be observed for the
lower Smax = 1 system, it was hoped that this would be compensated for by
the increase in the number of lattice generation tokens, and further justified
by the significantly lower false alarm rate.
The results of the tuned systems on the TIMIT evaluation set are shown in
table 5.8. The first system achieved a significant reduction in FA/kw rate over
the initial DMLS[3,10,200,2] system at the expense of only a small 1.3% absolute
increase in miss rate. The second system obtained a good decrease in miss rate of
2.9% with only a small 3.8 FA/kw rate increase. Both these systems maintained
the same execution speed as the initial DMLS system. It is difficult to say which
of the tuned systems is preferable, since the choice of operating point is typically
application dependent.
Method Miss Rate FA/kw CPU/kw-hr
DMLS[3,10,200,2] 10.2 18.5 0.30
DMLS[5,10,150,1] 11.5 5.6 0.31
DMLS[5,10,150,2] 7.3 22.3 0.31
Table 5.8: Optimised DMLS configurations evaluated on TIMIT
A comparison of these systems in terms of miss rate and FA/kw rate with all
other evaluated DMLS systems is shown in figure 5.3. It can be seen from this
figure that the optimised systems are closer to the origin, indicating an overall
system improvement. More importantly, these performance gains have been made
while maintaining the same execution time.
5.6.6 Conclusions
A number of experiments were conducted to evaluate the impact of four Dynamic
Match Lattice Spotting parameters: the number of lattice generation tokens, the
number of lattice traversal tokens, the pruning beamwidth and the MED cost
threshold Smax. It was concluded that:
1. Good control of miss rate performance can be obtained by tuning the num-
ber of lattice generation tokens.
[Figure: scatter plot of log(miss rate) against log(FA/kw rate) for the lattice generation token, lattice traversal token, pruning beamwidth and Smax tuning experiments, together with the two tuned systems (Tuned 1, Tuned 2).]
Figure 5.3: Trends in miss rate and FA/kw rate performance for various types of tuning
2. System performance is fairly insensitive to the number of lattice traversal
tokens, making this a poor parameter for tuning.
3. Adjustment of the lattice pruning beamwidth gives significant changes in
execution time. As such, this is an excellent parameter for tuning system
speed. Although changes in both miss rate and FA/kw rate were also ob-
served, the magnitudes of these changes were much less than the changes
obtained for execution speed.
4. Smax tuning is particularly useful for adjusting FA/kw rates. This parame-
ter can also be used for adjusting miss rate, though the changes are not as
dramatic.
5. Improvements in miss and false alarm rates can be obtained simultaneously
by tuning of all parameters in combination, as demonstrated by the two
tuned systems that were constructed. These gains can be obtained without
sacrificing execution speed.
5.7 Conversational telephone speech
experiments
Previous sections of this chapter only evaluated DMLS on the clean microphone
speech domain. The conversational telephone speech domain is a more difficult
domain but is more representative of a real-world practical application of DMLS.
As such, this section reports on experiments to evaluate the performance of DMLS
for this domain.
Specifically, experiments were performed using the Switchboard 1 telephone
speech corpus. To maintain consistency, the same baseline systems, DMLS algo-
rithms and evaluation procedure as used in section 5.4 are used here.
5.7.1 Evaluation set
The evaluation set was constructed in a similar fashion to the previously con-
structed TIMIT evaluation set. Approximately 2 hours of speech was taken from
the Switchboard corpus and labelled as evaluation speech. From this speech, 360
unique 6-phone-length words were randomly chosen and marked as query words.
These query words appeared a total of 808 times in the evaluation set.
5.7.2 Recogniser parameters
The same recogniser parameters that were used for the previous TIMIT experi-
ments were used for this set of experiments. This consisted of training 16-mixture
triphone HMM acoustic models as well as a 256-mixture Gaussian Mixture Model
background model for use with the SBM-based keyword spotting experiments. A
total of 165 hours of speech taken from the Switchboard corpus was used as
training data for these models.
In addition, 2-gram and 4-gram phone-level language models were trained
on the same data set. Phone-level transcriptions were obtained using forced
alignment in conjunction with the PronLex lexicon.
All speech was parameterised using Perceptual Linear Prediction coefficient
feature extraction followed by Cepstral Mean Subtraction.
5.7.3 Results
The results for the HMM-based, conventional lattice-based, and DMLS experi-
ments on conversational speech SWB1 data are shown in table 5.9. DMLS per-
formance was measured using the baseline DMLS[3,10,200,2] system as well as a
number of tuned configurations. Tuned systems were constructed using a combi-
nation of lattice generation tokens, pruning beamwidth and Smax tuning.
Method Miss Rate FA/kw CPU/kw-hr
HMM[-7500] 8.0 366.9 1.77
HMM[-7300] 14.1 319.6 1.77
CLS[3,10,200,0] 38.4 3.2 -.–
DMLS[3,10,200,2] 17.5 59.0 0.51
DMLS[5,10,150,2] 11.0 83.6 0.72
DMLS[5,10,150,1] 14.2 23.0 0.72
DMLS[5,10,100,2] 13.9 36.1 0.18
Table 5.9: Keyword spotting results on SWB1
Of note is the dramatic increase in FA/kw rates for all systems compared to
those observed for the TIMIT evaluations. This is an expected result, since the
conversational telephone speech domain is a more difficult domain for recognition.
For DMLS, this increase in false alarm rate is a result of the increased complexity
of the lattices. It was found that the lattices generated for the Switchboard data
were significantly larger than those generated for the TIMIT data when using
the same pruning beamwidth. This meant that there were more paths with high
likelihoods, indicating a greater degree of confusability within the lattices. As a
result, more false alarms were generated.
Losses in miss rate in the vicinity of 5% absolute were also observed for all
systems compared to the TIMIT evaluations. Although this is unfortunate, these
losses are still minor in light of the increased difficulty of the data.
Overall though, DMLS still achieved more favourable performance than the
baseline HMM-based and lattice-based systems. The DMLS systems not only
yielded considerably lower miss rates than CLS but also significantly lower FA/kw
and CPU/kw-hr rates than the HMM-based systems. Figure 5.4 shows a plot of
miss rate versus FA/kw rate for the evaluated systems. The plot indicates that
DMLS offers good middle ground performance.
In terms of detection performance, the two best DMLS systems were the
DMLS[5,10,150,1] and the DMLS[5,10,100,2] configurations. Both had lower false
alarm rates than the other DMLS systems and still maintained fairly low miss
rates. However, the execution speed of the DMLS[5,10,100,2] configuration was 4
times faster than the DMLS[5,10,150,1] system. In fact, this system was capable
of searching 1 hour of speech in 10 seconds. For applications requiring very fast
search speeds, the DMLS[5,10,100,2] system would clearly be the better choice.
Overall, the experiments demonstrate that DMLS is capable of delivering
good keyword spotting performance on the more difficult conversational telephone
speech domain. Although there was some degradation in performance compared
to the clean speech microphone domain, the losses were in line with what would
be expected. Also, DMLS offered much faster performance than the HMM-based
system and considerably lower miss rates than the conventional lattice-based
system.
[Figure: plot of log(miss rate) against log(FA/kw rate) for the HMM[-7500], HMM[-7300], CLS[3,10,200,0], DMLS[3,10,200,2], DMLS[5,10,150,2], DMLS[5,10,150,1] and DMLS[5,10,100,2] systems.]
Figure 5.4: Plot of miss rate versus FA/kw rate for HMM, CLS and DMLS systems evaluated on Switchboard
5.8 Non-destructive optimisations
Experiments on the TIMIT and Switchboard databases have clearly demonstrated
that DMLS is capable of obtaining very fast keyword spotting speeds. Although
these speeds are impressive, further gains in throughput can be obtained through
optimisation of the MED calculations.
MED calculations are in fact the most costly operations performed during
the DMLS search stage. The basic MED algorithm is an O(N²) algorithm and
hence not particularly suitable for high-speed calculation. However, within the
DMLS search context, two specific optimisations can be applied to reduce the
computational cost of these MED calculations. These optimisations are the prefix
sequence optimisation and the early stopping optimisation.
5.8.1 Prefix sequence optimisation
The prefix sequence optimisation utilises the similarities in the MED cost matrix
of two observed phone sequences that share a common prefix sequence.
Let A = (a_1, a_2, ..., a_N) and B = (b_1, b_2, ..., b_M). Also let B′ be defined as
the first-order prefix sequence of B, given by B′ = (b_i)_{i=1}^{M−1}. Finally, let the MED
cost matrix between two sequences X and Y be denoted Ω(X, Y).
From the basic definition of the MED cost matrix, the (N + 1) × M cost
matrix Ω(A, B′) is equal to the first M columns of the cost matrix Ω(A, B). This
is because B′ is equal to the first M − 1 elements of B.
Therefore, given the cost matrix Ω(A, B′), it is only necessary to calculate
the values of the (M + 1)th column of Ω(A, B) to obtain the full cost matrix
Ω(A, B). This is demonstrated in figure 5.5.
[Figure: the cost matrices Ω(A, B′) and Ω(A, B) for two example sequences, showing that Ω(A, B′) forms the leading columns of Ω(A, B), with only the final column requiring calculation.]
Figure 5.5: The relationship between cost matrices for subsequences
The argument extends to even shorter prefix sequences of B. For example, let
B′′′ be defined as the third-order prefix sequence of B, given by B′′′ = (b_i)_{i=1}^{M−3}.
Then, given Ω(A, B′′′), it is only necessary to calculate the (M − 1)th, Mth and
(M + 1)th columns of Ω(A, B) to obtain the full cost matrix.
Now, given that the MED cost matrix Ω(A,B) is known, consider the task of
calculating the MED cost matrix Ω(A,C). Let P (B,C) return the longest prefix
sequence of B that is also a prefix sequence of C. Then, Ω(A,C) can be obtained
by taking Ω(A,B) and recalculating the last |C| − |P (B,C)| columns.
In DMLS, an utterance is represented by a collection of observed phone se-
quences. Typically, there is a degree of prefix similarity between sequences from
the same temporal location, and in particular between sequences emitted from
the same node. As demonstrated above, knowledge of prefix similarity will allow
a significant reduction in the number of MED calculations required.
The simplest means of obtaining this knowledge is to sort the phone sequences
of an utterance lexically during the lattice building stage. Then the degree of
prefix similarity between each sequence and its predecessor can be calculated
and stored. For this purpose, the degree of prefix similarity is defined as the
length of the longest common prefix subsequence of two sequences.
Then, during the DMLS search stage, all that is required is to step through the
sequence collection and use the predetermined prefix similarity value to determine
what portion of the MED cost matrix needs to be calculated, as demonstrated in
figure 5.6. As such, only changed portions of the MED cost matrix are iteratively
updated, greatly reducing computational burden.
[Figure: a lexically sorted collection of observed phone sequences annotated with each sequence's prefix similarity to its predecessor and the corresponding number of MED columns requiring calculation.]
Figure 5.6: Demonstration of the MED prefix optimisation algorithm
The resulting algorithm can be summarised as follows:
1. Initialise a MED cost matrix of size (N + 1) × (M + 1), where N is the
length of the target phone sequence and M is the maximum length of the
observed phone sequences.
2. For each sequence in the observed phone sequence collection
(a) Let k be defined as the previously computed degree of prefix similarity
metric between this sequence and the previous sequence
(b) Recalculate the last M − k columns of the MED cost matrix
(c) Obtain S = BESTMED(. . .) in the normal fashion given this MED
cost matrix
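The algorithm above can be sketched as follows. The costs here are illustrative (insertion 1, deletion 1, substitution 1 with 0 on a match), and the plain final-cell score stands in for the thesis's BESTMED scoring; the helper names are assumptions, not the actual implementation.

```python
# Sketch of the prefix sequence optimisation: observed sequences are sorted
# lexically, and for each sequence only the columns beyond its common prefix
# with the previous sequence are recalculated.
def lcp(a, b):
    """Degree of prefix similarity: length of the longest common prefix."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def update_columns(D, target, observed, start_col):
    """(Re)calculate columns start_col..len(observed) of cost matrix D."""
    for j in range(start_col, len(observed) + 1):
        D[0][j] = j                                   # row 0: pure insertions
        for i in range(1, len(target) + 1):
            sub = 0 if target[i - 1] == observed[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + sub,      # substitution / match
                          D[i - 1][j] + 1,            # deletion
                          D[i][j - 1] + 1)            # insertion

def med_collection(target, sequences):
    """MED score of the target against each sequence, reusing prefix columns."""
    N = len(target)
    M = max(len(s) for s in sequences)
    D = [[0] * (M + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        D[i][0] = i                                   # column 0: pure deletions
    scores, prev = [], []
    for seq in sorted(sequences):                     # lexical sort maximises reuse
        k = lcp(prev, seq)                            # shared columns stay valid
        update_columns(D, target, seq, k + 1)
        scores.append(D[N][len(seq)])
        prev = seq
    return scores

print(med_collection(["b", "ae", "k"],
                     [["b", "ae", "k", "ax"], ["b", "ae", "g"], ["s", "eh", "k"]]))
# [1, 1, 2]  (scores follow the lexically sorted order of the sequences)
```

The scores are identical to those of a full per-sequence MED; only the redundant leading columns are skipped.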
5.8.2 Early stopping optimisation
The early stopping optimisation uses knowledge about the Smax threshold to limit
the extent of the MED matrix that has to be calculated.
From MED theory, the element Ω(X, Y)_{i,j} of the MED cost matrix Ω(X, Y)
corresponds to the minimum cost of transforming the sequence (x)_1^i into the
sequence (y)_1^j. For convenience, the notation Ω is used to represent Ω(X, Y). The
value of Ω_{i,j} is given by the recursive expression

    Ω_{i,j} = min( Ω_{i−1,j−1} + C_s(x_i, y_j),
                   Ω_{i−1,j} + C_d(x_i),
                   Ω_{i,j−1} + C_i(y_j) )                                    (5.1)
Given the above formulation, and assuming non-negative cost functions, the
value of Ω_{i,j} has a lower bound governed by

    LowerBound(Ω_{i,j}) ≥ min( {Ω_{k,j}}_{k=1}^{i−1} ∪ {Ω_{k,j−1}}_{k=1}^{|X|} )         (5.2)
That is, Ω_{i,j} is bounded below by the minimum of all values in column j − 1 and
all values above row i in column j. This states that the lower bound of Ω_{i,j} is a
function of Ω_{i−1,j}, which implies the recursive formulation

    LowerBound(Ω_{i,j}) ≥ min( {LowerBound(Ω_{i−1,j})} ∪ {Ω_{k,j−1}}_{k=1}^{|X|} )       (5.3)
This states that the lower bound of Ω_{i,j} is governed by all entries in the previous
column and the lower bound of the element directly above it in the cost matrix.
If the recursion is continuously unrolled, then the lower bound reduces to being
only a function of the previous column and the very first element in column j,
that is

    LowerBound(Ω_{i,j}) ≥ min( {LowerBound(Ω_{1,j})} ∪ {Ω_{k,j−1}}_{k=1}^{|X|} )         (5.4)
Now MED theory states that Ω_{1,j} = j × C_i(y_j) for all values of j. This means
that for a positive insertion cost function

    LowerBound(Ω_{1,j}) ≥ LowerBound(Ω_{1,j−1})                                          (5.5)
Substituting this back into equation 5.4 gives

    LowerBound(Ω_{i,j}) ≥ min( {LowerBound(Ω_{1,j})} ∪ {Ω_{k,j−1}}_{k=1}^{|X|} )         (5.6)
                        ≥ min( {LowerBound(Ω_{1,j−1})} ∪ {Ω_{k,j−1}}_{k=1}^{|X|} )       (5.7)
This reduces to the simple relationship

    LowerBound(Ω_{i,j}) ≥ min( {Ω_{k,j−1}}_{k=1}^{|X|} )                                 (5.8)
It has therefore been demonstrated that the lower bound of Ω_{i,j} is only a function
of the values of the previous column of the MED matrix. This lends itself to a
significant optimisation within the DMLS framework.
Since Smax is fixed prior to the DMLS search, there is an upper bound on the
MED score of observed phone sequences that are to be considered as putative
hits. When calculating columns of the MED matrix, the relationship in equation
5.8 can be used to predict what the lower bound of the current column is. If this
lower bound exceeds Smax then it is not necessary to calculate the current or any
subsequent columns of the cost matrix, since all elements will exceed Smax.
This is a very powerful optimisation, particularly when comparing two se-
quences that are very different. It means that in many cases only the first few
columns will need to be calculated before it can be declared that a sequence is
not a putative occurrence.
The resulting algorithm can be summarised as follows. For each column j of
the MED cost matrix
1. Determine the minimum score, MinScore(j−1) in column j−1 of the cost
matrix
2. If MinScore(j − 1) > Smax then declare this sequence as not being a puta-
tive occurrence and stop processing
3. Calculate all elements for column j of the cost matrix
5.8.3 Combining optimisations
The early stopping optimisation and the prefix sequence optimisation can be
easily combined to give even greater speed improvements. Essentially the prefix
sequence optimisation uses prior information to eliminate computation of the
starting columns of the cost matrix, while the early stopping optimisation uses
prior information to prevent unnecessary computation of the final columns of the
cost matrix.
When combined, all that remains during MED costing is to calculate the nec-
essary in-between columns of the cost matrix. As such, the combined algorithm
is given by:
1. Initialise a MED cost matrix of size (N + 1) × (M + 1), where N is the
length of the target phone sequence and M is the maximum length of the
observed phone sequences.
2. For each sequence in the observed phone sequence collection
(a) Let k be defined as the previously computed degree of prefix similarity
metric between this sequence and the previous sequence
(b) Using the prefix sequence optimisation, it is only necessary to update
the trailing columns of the MED matrix. Thus, for each column, j,
from k + 2 to M + 1 of the MED cost matrix
i. Determine the minimum score, MinScore(j − 1) in column j − 1
of the cost matrix
ii. If MinScore(j − 1) > Smax then using the early stopping opti-
misation, this sequence can be declared as not being a putative
occurrence and processing can stop
iii. Calculate all elements for column j of the cost matrix
(c) Obtain S = BESTMED(. . .) in the normal fashion given this MED
cost matrix
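The combined algorithm can be sketched as follows. As before, the unit costs (insertion 1, deletion 1, substitution 1 with 0 on a match) and the plain final-cell score in place of BESTMED are illustrative assumptions, as are the helper names.

```python
# Sketch of the combined prefix-reuse and early-stopping MED search.
# A sequence is abandoned as soon as the minimum of the previous column
# exceeds S_max, since every subsequent entry is then guaranteed to as well.
def lcp(a, b):
    """Degree of prefix similarity: length of the longest common prefix."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def search(target, sequences, s_max):
    """Return {sequence: score} for sequences whose MED score <= s_max."""
    N = len(target)
    M = max(len(s) for s in sequences)
    D = [[0] * (M + 1) for _ in range(N + 1)]
    for i in range(N + 1):
        D[i][0] = i                                   # column 0: pure deletions
    hits = {}
    valid = []                    # phones whose matrix columns are up to date
    for seq in sorted(sequences):                     # lexical sort maximises reuse
        k = lcp(valid, seq)
        abandoned = False
        for j in range(k + 1, len(seq) + 1):          # prefix optimisation
            if min(D[i][j - 1] for i in range(N + 1)) > s_max:
                abandoned = True                      # early stopping
                break
            D[0][j] = j                               # row 0: pure insertions
            for i in range(1, N + 1):
                sub = 0 if target[i - 1] == seq[j - 1] else 1
                D[i][j] = min(D[i - 1][j - 1] + sub,  # substitution / match
                              D[i - 1][j] + 1,        # deletion
                              D[i][j - 1] + 1)        # insertion
            valid = seq[:j]                           # columns up to j now valid
        if not abandoned and D[N][len(seq)] <= s_max:
            hits[tuple(seq)] = D[N][len(seq)]
    return hits

target = ["s", "p", "aa", "k"]
seqs = [["s", "t", "aa", "p"], ["s", "sh", "t", "aa", "k", "g"],
        ["m", "ah", "m"], ["z", "z", "z", "z", "z", "z"]]
print(search(target, seqs, s_max=2))   # {('s', 't', 'aa', 'p'): 2}
```

In this example the entirely dissimilar sequence is abandoned after only a few columns, while the near-match reuses the column it shares with its lexical predecessor.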
5.9 Optimised system timings
Experiments were performed to evaluate the execution time benefits of the prefix
sequence and early stopping optimisations. Five systems were evaluated:
1. NOPT: DMLS system without prefix sequence and early stopping optimi-
sations
2. ESOPT: DMLS system with early stopping optimisation
3. PSOPT: DMLS system with prefix sequence optimisation
4. COPT: DMLS system with combined early stopping and prefix sequence
optimisations
5. CXOPT: The COPT system with miscellaneous coding optimisations ap-
plied such as removal of dynamic memory allocation, more efficient passing
of data, etc.
5.9.1 Experimental procedure
Experiments were performed using 10 randomly selected utterances from the
Switchboard evaluation set detailed in section 5.7.1. Single-word keyword spotting was
performed for each utterance using a 6-phone-length target word.
Each utterance was processed repeatedly for the same word 1400 times and
the total execution time was measured for all passes. The total time was then
summed across all tested utterances to obtain the total time required to perform
10 × 1400 passes.
The relative speeds were calculated by finding the ratio between the measured
speed of the tested system and the measured speed of the baseline NOPT system.
The entire evaluation was then repeated a total of 10 times and the average
relative speed factor was calculated. Execution time was measured on a single
3GHz Pentium 4 processor.
Additionally, the final putative occurrence result sets were examined to ensure
that exactly the same miss rate and FA/kw rates were obtained across all
methods, since neither optimisation should affect these metrics.
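The measurement procedure can be sketched with a hypothetical harness; dummy workloads stand in below for the actual spotting systems, which are not reproduced here.

```python
# Hypothetical timing harness mirroring the procedure above: each system
# processes the same workload repeatedly, and total execution times are
# compared to obtain a relative speed factor.
import time

def total_time(fn, passes=1400):
    """Total execution time for repeated passes over the same utterance."""
    start = time.perf_counter()
    for _ in range(passes):
        fn()
    return time.perf_counter() - start

def relative_speed(system_fn, baseline_fn, passes=1400):
    """Speed factor relative to the baseline; values below 1.0 are faster."""
    return total_time(system_fn, passes) / total_time(baseline_fn, passes)

baseline = lambda: sum(i * i for i in range(2000))   # stands in for NOPT
optimised = lambda: sum(i * i for i in range(500))   # stands in for an optimised system
print(round(relative_speed(optimised, baseline, passes=100), 2))
```

A speed factor below 1.0 corresponds to the "Speed factor" column of table 5.10, where for example 0.16 means the optimised system needed 16% of the baseline's execution time.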
5.9.2 Results
Table 5.10 shows the speed of each system relative to the baseline unoptimised
NOPT system. Tests were performed using Smax values of 2 and 4, since the
benefits of the early stopping optimisation depend on the value of Smax.
Smax System Speed factor
2 NOPT 1.00
2 PSOPT 0.60
2 ESOPT 0.42
2 COPT 0.25
2 CXOPT 0.16
4 NOPT 1.00
4 PSOPT 0.60
4 ESOPT 0.64
4 COPT 0.32
4 CXOPT 0.21
Table 5.10: Relative speeds of optimised DMLS systems
The results clearly demonstrate that both optimisations yielded significant
speed benefits. An even more pleasing result was that the two optimisations
combined effectively to reduce execution time by a factor of 4 for the Smax = 2
tests, and by a factor of 3 for the Smax = 4 tests. Overall the fully optimised
CXOPT system ran about 5 to 6 times faster than the original unoptimised
system.
Table 5.11 shows the execution time of the unoptimised DMLS system eval-
uated in section 5.7 as well as the anticipated CPU/kw-hr figure for the same
system incorporating the early stopping and prefix sequence optimisations. It
can be seen that the resultant CPU/kw-hr figure is 0.03, which corresponds to
searching one hour of speech in 1.8 seconds. This is an impressive
result and clearly emphasises the suitability of DMLS for very fast large database
keyword spotting applications.
Method Miss Rate FA/kw CPU/kw-hr
DMLS[5,10,100,2] 13.9 36.1 0.18
DMLS[5,10,100,2] with CXOPT 13.9 36.1 0.03
Table 5.11: Performance of a fully optimised DMLS system on Switchboard data
5.10 Summary
This chapter presented a novel unrestricted vocabulary audio document index-
ing method named Dynamic Match Lattice Spotting. Through experimentation,
it was demonstrated that this method was capable of searching hours of data
using only seconds of processing time, while maintaining excellent detection per-
formance.
The lack of robustness to subevent recogniser error was identified as a reason
for the poor detection performance of pre-existing unrestricted vocabulary audio
indexing techniques. It was postulated that incorporating prior knowledge of
subevent recogniser errors would be a means of improving detection rates. The
DMLS method was proposed as a means of doing this.
Initial experiments using DMLS demonstrated that it outperformed pre-existing
techniques for the clean microphone speech domain. Compared to SBM-KS,
DMLS was significantly faster and also obtained considerably lower false alarm
rates. Comparisons with the conventional lattice-based technique demonstrated
the miss rate performance of DMLS to be vastly superior.
An analysis of the contributions of dynamic matching rules to DMLS perfor-
mance was presented, to rationalise the benefits of DMLS over the conventional
178 Chapter 5. Dynamic Match Lattice Spotting
lattice-based technique. It was found that the vowel substitution and closure/stop
substitution rules contributed significantly to improving miss rate performance,
while the same-letter substitution and insertion rules only offered small improve-
ments. Nevertheless, in all cases, inclusion of any given dynamic matching rule
offered clear benefits over the null-rule conventional lattice-based method.
A study of key parameters of DMLS was also presented. It was found that
careful tuning of these parameters offered the ability to significantly enhance
DMLS performance. In particular:
1. Lattice generation token tuning was excellent for adjusting the miss rate
operating point
2. The pruning beamwidth was useful for tailoring execution speed
3. Smax tuning was suitable for adjusting the false alarm operating point
Through careful adjustment of these parameters, it was possible to construct a
tuned DMLS system that outperformed the previously evaluated baseline DMLS
system.
Evaluation results were also provided for the conversational telephone speech
domain. As would be expected, there was some degradation in performance
compared to the clean microphone speech domain. Nevertheless the performance
of DMLS was still excellent compared to that of the evaluated baseline techniques.
Finally, two key algorithmic optimisations to increase the speed of DMLS were
presented. It was shown that these optimisations could be combined to further
improve the execution speed of DMLS by a factor of 5 to 6 times.
In summary, this chapter has demonstrated that DMLS is an excellent can-
didate for very fast audio document indexing. The search speeds offered by this
method are exceptional considering the low miss rates it offers. The key results
of this chapter are summarised in table 5.12.
Domain Method Miss Rate FA/kw CPU/kw-hr Secs. to search 1h
TIMIT
HMM[-7580] 10.4 36.6 1.58 95
CLS[3,10,200,0] 32.9 0.4 -.– -.–
DMLS[3,10,200,2] 10.2 18.5 0.30 18
DMLS[5,10,150,1] 11.5 5.6 0.31 18
DMLS[5,10,150,2] 7.3 22.3 0.31 18
SWB1
HMM[-7300] 14.1 319.6 1.77 106
CLS[3,10,200,0] 38.4 3.2 -.– -.–
DMLS[3,10,200,2] 17.5 59.0 0.51 31
DMLS[5,10,100,2] 13.9 36.1 0.18 10
DMLS[5,10,100,2] CXOPT 13.9 36.1 0.03 1.8
Table 5.12: Summary of key results
Chapter 6
Non-English Spotting
6.1 Introduction
With the recent increase in global security awareness, non-English speech recog-
nition has emerged as a major topic of interest. One problem that has hindered
the development of robust non-English speech recognition is the lack of large
transcribed non-English speech databases.
A lack of available training data has been reported to considerably degrade
the performance of speech recognition. For example, Moore [24] reported losses
of as much as 15% absolute for large vocabulary speech recognition when using
training databases smaller than 100 hours (it is noted that the reported losses
in Moore [24] are interpolated results; however, they provide an approximation of
the anticipated loss).
Losses in performance would also be expected for keyword spotting. However,
keyword spotting is a very different task from speech transcription, and as such,
the magnitude and nature of these losses are likely to be different. In particular,
keyword spotting is a more constrained task, attempting to discriminate between
a much smaller number of classes. It is possible then that it will be less affected by
reduced amounts of training data. If so, keyword spotting techniques may provide
a viable short-term solution for the development and deployment of non-English
data mining applications.
Primarily, this chapter investigates the effects of limited training resources on
the performance of keyword spotting systems. Experiments and discussion are
presented to assess the benefits of large training corpora for keyword spotting,
and to determine whether the benefits from increased amounts of training data
provide sufficient gain to motivate the collection of this data.
Trends in English and Spanish keyword spotting performance are examined
with regard to changes in training database size. Given these trends, extrapola-
tions are made to anticipate the loss in performance from using reduced training
data sizes for the low resourced language of Indonesian.
6.2 The issue of limited resources
The amount of transcribed training data available for training speech recognition
systems has been demonstrated to have a marked impact on performance. In par-
ticular, these effects are considerably greater when using small training database
sizes.
Figure 6.1 illustrates these effects on word error rate as evaluated by Moore
[24]. The plots clearly depict that considerable gains in performance are obtained
especially up to the first 100 hours of training data. For example, an anticipated
gain of 10-15% absolute is observed when increasing training database size from
10 hours to 100 hours.
A significant barrier to the research and development of speech recognition
technologies for many non-English languages is the lack of resources. Some of the
key resources required for this development are well-transcribed speech databases
and sizable pronunciation lexicons. Although such resources are slowly becoming
Figure 6.1: Effect of training dataset size on speech recognition [24]
more easily available, such as the OGI Multilingual Corpus [9] and the CALL-
HOME Database [7], the amount of data is still very small in comparison to that
available for English.
As such, the performances of many reported low resource non-English recog-
nisers have been very poor. For example, in a recent publication, Walker et al.
[34] reported phone recognition error rates as high as 60-70% for languages such
as Japanese, Spanish and Farsi. These results are very poor compared to the
typical 20-40% error rates frequently reported for English.
Such poor error rates have hindered the deployment of non-English speech
recognition technologies. It is well known that many consumers consider
commercially deployed English-based speech recognition applications to be
unreliable and error-prone. Given the marked inferiority of current non-English
speech recognition technologies, it is more than likely that non-English speech
recognition will be even less well received.
A long term solution for this problem is to simply transcribe more data and
build the required pronunciation lexicons. Apart from the immense cost and time
required to do this, this approach also leaves many poorer non-English-speaking
countries without speech recognition technology. Additionally, non-English speech research
is also of interest in many English-speaking countries, for example, in applications
such as telephone conversation monitoring and security surveillance. These, and
many other reasons, provide a motivation for the immediate development and
deployment of non-English speech technologies.
6.3 The role of keyword spotting
Keyword spotting is a considerably simpler task than speech transcription. For
example, the single-keyword spotting task is a 2-class transcription task, attempt-
ing to segment speech into sequences of target speech and non-target speech. As
such, it is likely to require less prior information and hence less training data
to achieve acceptable performance compared to a large vocabulary STT system.
Therefore, keyword spotting may be less affected by limited amounts of training
resources.
There is a plethora of applications for which keyword spotting provides a sufficient
solution. These include data mining, real-time monitoring and dialogue
systems. Although a full large vocabulary STT system may be able to provide
further functionality, such as improved speech understanding capability for dia-
logue system applications, a keyword spotting solution will provide a short term
solution if it is less affected by limited training data.
6.4 Experiment setup
To evaluate the effects of reduced training database size, a number of keyword
spotting systems were trained for the languages of interest. Details of these
systems, and how they were evaluated are given in this section.
6.4.1 Database design
Data was sourced from three language specific databases: the English Switch-
board database, the CALLHOME Spanish database, and the Indonesian split of
the OGI Multilingual corpus. All three databases consisted of narrowband tele-
phone speech. After removing all utterances containing out-of-vocabulary words,
there was a total of 165 hours of English data, 10.2 hours of Spanish data, and
3.5 hours of Indonesian data. Due to the limited amount of data available for
the non-English languages, only 40 minutes was designated as test data while the
remaining data was used for training.
Since there was a large amount of data available for English, three different
sized training datasets were constructed. The first used the entire training
database, consisting of 165.05 hours of speech. The second was an intermediate-sized
database consisting of 15.4 hours of speech, and the final database was very
small, comprising only 4.15 hours of data.
Training databases for the non-English languages were not selected to be the
same size as the English training datasets. Instead, it was decided to match
the databases using an hours-per-phone metric. This was done because the
number of phones used in English was 44 while the number used for Spanish and
Indonesian was only 28 (taken from the WorldBet phone set). As such, datasets
had to be sized to ensure that there was an equal number of training examples
per phone across all languages to avoid unfairly penalising any one language.
Additionally, database sizes for the non-English languages were limited by the
amount of available data. As such, it was only possible to create intermediate and
small sized databases for Spanish, and only a small sized database for Indonesian.
Sizes were matched using the hours per phone metric as discussed above. Table
6.1 shows a summary of all training datasets.
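The hours-per-phone matching described above is simple arithmetic; the sketch below (a hypothetical helper, not the thesis tooling) reproduces the sizing using the phone set sizes quoted above (44 phones for English, 28 for Spanish and Indonesian):

```python
def hours_per_phone(total_hours: float, num_phones: int) -> float:
    """Average amount of training speech available per phone model."""
    return total_hours / num_phones

def matched_hours(target_h_per_phone: float, num_phones: int) -> float:
    """Database size needed to reach a target hours-per-phone ratio."""
    return target_h_per_phone * num_phones

# English S1 set: 4.15 hours spread over 44 phones
english_s1 = hours_per_phone(4.15, 44)        # ~0.094 h/phone
# Spanish set matched to ~0.1 h/phone over 28 phones
spanish_target = matched_hours(0.10, 28)      # ~2.8 hours
```

This makes explicit why the non-English databases in table 6.1 are smaller in absolute hours than their English counterparts while remaining comparable per phone.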
To avoid confusion, the codes in table 6.1 are used when referring to the in-
dividual training data sets. The S1 training sets correspond to the 0.1 h/phone
training data sets and exist for all three languages. The S2 training sets cor-
respond to the 0.35 h/phone training data sets and only exist for English and
Spanish. Finally the S3E set corresponds to the full sized English training data
set and was included to provide insight into spotting and verification performance
for systems trained using very large databases.
Code  Language     Hours of speech  Hours per phone
S1E   English        4.15           0.095
S1S   Spanish        2.82           0.10
S1I   Indonesian     2.78           0.099
S2E   English       15.4            0.35
S2S   Spanish        9.59           0.34
S3E   English      164.05           3.73

Table 6.1: Summary of training data sets
All data was parameterised using Perceptual Linear Prediction coefficient fea-
ture extraction. Utterance based Cepstral Mean Subtraction was applied to re-
duce the effects of channel/speaker mismatch.
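Utterance-based cepstral mean subtraction simply removes the mean cepstral vector, computed over the whole utterance, from every frame; a minimal sketch, assuming the features are held as a NumPy array of shape (frames, coefficients):

```python
import numpy as np

def cepstral_mean_subtraction(features: np.ndarray) -> np.ndarray:
    """Subtract the per-utterance mean from each cepstral coefficient.

    features: array of shape (num_frames, num_coefficients), e.g. PLP cepstra.
    Removing the mean suppresses stationary convolutional effects such as
    channel and (partially) speaker mismatch.
    """
    mean = features.mean(axis=0, keepdims=True)  # one mean per coefficient
    return features - mean
```

Because the mean is estimated per utterance, the normalisation adapts to each call without requiring any channel labels.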
6.4.2 Model architectures
When using limited size training databases, it is often necessary to use simpler
model architectures to avoid data sparsity issues. As such, for this set of ex-
periments, three different HMM phone model architectures were built for each
training data set. These were:
1. 16-mixture triphone - It was anticipated that the triphone architecture
would provide the greatest performance when using the large training data
sets but would have reduced performance for smaller training data sets due
to data sparsity.
2. 16-mixture monophone - This architecture was included to address the
data sparsity issues of the triphone architecture, although it was expected
that for the large datasets, these models would be too simplistic.
3. 32-mixture monophone - This provides a compromise between the very
high data requirements of the triphone architecture and the modeling sim-
plicity of the 16-mixture monophone set.
In addition, a 256-mixture Gaussian Mixture Model SBM was trained on each
dataset for use as the background model in HMM-based keyword spotting.
For ease of reference, the codes shown in table 6.2 are used when referring to
individual model architectures.

Code  Description
T16   16-mixture triphone
M16   16-mixture monophone
M32   32-mixture monophone

Table 6.2: Codes used to refer to model architectures

Furthermore, when referring to a model trained on a specific training set, the
name of the training set is appended to the model label. Hence, a 16-mixture
triphone model set trained on the S2S training set is referred to as the T16S2S
model set, whereas the 32-mixture monophone model set trained on the S1I set is
referred to as the M32S1I model set.
6.4.3 Evaluation set design
The evaluation data sets consisted of approximately 40 minutes for each language.
As stated before, it was not possible to use a larger evaluation set because of the
limited amount of data available for Indonesian and Spanish.
Target words were restricted to 6-phone keywords. This was done
to minimise any variations in performance across the evaluation sets for the differ-
ent languages due to non-identical distributions of keyword lengths. For English
and Spanish, 180 unique 6-phone words were randomly selected for each language
and designated as the evaluation query word set. Unfortunately it was not pos-
sible to find 180 unique 6-phone words in the Indonesian evaluation set, and as
such, only 153 words were used in this set.
Table 6.3 summarises each evaluation set. The # words in eval data column
corresponds to the number of instances of the words in the query word set that
occurred in the evaluation data — that is, the total number of hits required to
obtain a miss rate of 0%.
Code  Language    Mins of speech  # query words  # words in eval data
EE    English     43.62           180            298
ES    Spanish     39.60           180            353
EI    Indonesian  41.40           150            349

Table 6.3: Summary of evaluation data sets.
6.4.4 Evaluation procedure
Experiments were performed for each language to evaluate keyword spotting per-
formance for every combination of model architecture and training database. A
2-stage keyword spotting system was used, consisting of an SBM-based keyword
spotting front-end (see section 2.7.1) followed by SBM-based keyword verification
(see section 2.9.4). The procedure used was as follows:
1. Keyword spotting was performed to generate a putative occurrence set
2. Stage 1 miss and FA/kw rates were calculated using reference word-level
transcriptions
3. Keyword verification was performed to generate a confidence-scored puta-
tive occurrence set
4. Stage 2 miss and false alarm probabilities were calculated across a cross-
section of thresholds. Additionally, the equal error rate statistic and DET
plots were generated.
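The metrics in steps 2 and 4 can be computed directly from counts and confidence scores; the sketch below is illustrative only (hypothetical helper names and counts, not the thesis scoring scripts), with the EER taken at the threshold where the miss and false alarm probabilities are closest:

```python
def miss_rate(num_hits: int, num_targets: int) -> float:
    """Percentage of true keyword occurrences that were not detected."""
    return 100.0 * (num_targets - num_hits) / num_targets

def fa_per_keyword(num_false_alarms: int, num_keywords: int) -> float:
    """Average number of false alarms generated per query keyword."""
    return num_false_alarms / num_keywords

def equal_error_rate(scores_true, scores_false):
    """Sweep thresholds over the confidence scores of true hits and false
    alarms, returning the error rate (%) where miss and FA rates are closest."""
    best = None
    for t in sorted(scores_true + scores_false):
        miss = 100.0 * sum(s < t for s in scores_true) / len(scores_true)
        fa = 100.0 * sum(s >= t for s in scores_false) / len(scores_false)
        if best is None or abs(miss - fa) < best[0]:
            best = (abs(miss - fa), (miss + fa) / 2.0)
    return best[1]
```

Sweeping the same thresholds and plotting miss probability against false alarm probability yields the DET curves used throughout this chapter.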
6.5 English and Spanish stage 1 evaluations
Experiments were first performed to evaluate the impact of limited training data
on the stage 1 miss and FA/kw rates. Of particular interest was the effect of
training database size on stage 1 miss rate, as this gives a lower bound on the
achievable miss rate for a successive keyword verification stage. Table 6.4 shows
the results of these experiments.
A number of observations can be made regarding the stage 1 spotting rates. Of
particular note is that the Spanish miss rates were much higher than the English
miss rates. The most likely explanation for this is that the Spanish data was
simply harder to recognise. Listening to a random sample of the Spanish utterances
revealed many adverse factors, such as significant background noise and very fast
speaking rates.
The trend curves shown in figures 6.2 and 6.3 clearly demonstrate that in
most cases increased training database size resulted in decreased miss rates and
increased FA/kw rates.

Figure 6.2: Trends in miss rate across training database size

Figure 6.3: Trends in FA/kw rate across training database size

English
Model   Miss rate  FA/kw
M16S1E   4.0       675.528
M16S2E   2.7       702.064
M16S3E   3.7       687.451
M32S1E   2.3       882.869
M32S2E   2.0       989.52
M32S3E   2.3       999.334
T16S1E   5.7       268.045
T16S2E   5.0       223.189
T16S3E   1.0       215.992

Spanish
Model   Miss rate  FA/kw
M16S1S   7.6       539.946
M16S2S   6.2       606.144
M32S1S   4.5       733.128
M32S2S   3.7       872.19
T16S1S  11.9       201.63
T16S2S  10.8       208.758

Table 6.4: Stage 1 spotting rates for various model sets and database sizes

A decrease in miss rate is beneficial as this reduces the lower
bound for the minimum achievable miss rate for a subsequent keyword verification
stage. Interestingly though, the absolute gains in miss rate were not particularly
large. Apart from the gain observed for the T16S3E system, the other gains were
below 2%, and in most cases below 1%. This implies that, in absolute terms, the
miss rate of HMM-based keyword spotting is not dramatically affected by training
database size.
The only cases where a decreased miss rate was not observed when increasing
training database size were the M16S2E → M16S3E and M32S2E →
M32S3E monophone cases. As stated earlier, it was expected that for very large
training database sizes, the simplistic monophone architectures would not be able
to sufficiently model the increased number of modalities of the data, and therefore
would become too generalised and hence poor discriminators.
The triphone architectures also provided significantly lower FA/kw rates than
the monophone architectures for all training data set sizes. One may argue that
this is simply a trade-off in performance: a lower FA/kw rate in exchange for
a higher miss rate. This appears to be the case for the Spanish experiments.
However, in the English experiments, both miss and FA/kw rates decreased
as training data size was increased. From this limited set of experiments, it
is not possible to determine whether the triphone architecture truly provides an
improvement in both rates or simply a trade-off between the two measures.
Overall, increased training database size does yield improved performance in
stage 1 miss rate, though the gains are not dramatic unless very large database
sizes are used. For the S1 and S2 sized databases, the monophone architectures
yielded more favourable stage 1 miss rates at the expense of significantly higher
stage 1 FA/kw rates.
6.6 English and Spanish post keyword verification
Post verification performance was evaluated for the various English and Spanish
training databases and model architectures. The aim of these experiments was
to determine the effect of training database size on the final keyword spotting
performance for a multi-stage system, not just the effect on the keyword verifi-
cation stage in isolation. This is because in practice the same data sets would
be used when training models for the spotting and verification stages. As such,
identical model architectures and database sizes were used for the spotting and
verification stages, and the final system performance was measured.
Table 6.5 shows the EERs after keyword verification. Figures 6.4, 6.5 and 6.6
show the detection error trade-off plots for the T16, M16 and M32 experiments
respectively. A number of interesting characteristics can be seen in these results.
Figure 6.4: DET plot for T16 experiments. 1=T16S3E, 2=T16S2E, 3=T16S1E, 4=T16S2S, 5=T16S1S

Figure 6.5: DET plot for M16 experiments. 1=M16S3E, 2=M16S2E, 3=M16S1E, 4=M16S2S, 5=M16S1S

Figure 6.6: DET plot for M32 experiments. 1=M32S3E, 2=M32S2E, 3=M32S1E, 4=M32S2S, 5=M32S1S
English
Model   EER
M16S1E  22.2
M16S2E  19.8
M16S3E  20.5
M32S1E  19.1
M32S2E  17.8
M32S3E  18.5
T16S1E  18.1
T16S2E  17.8
T16S3E  13.0

Spanish
Model   EER
M16S1S  25.8
M16S2S  25.2
M32S1S  24.4
M32S2S  22.6
T16S1S  28.7
T16S2S  26.9

Table 6.5: Equal error rates after keyword verification for various model sets and training database sizes
The plots in figure 6.7 demonstrate that the trends in post verification EER
are similar to the trends in stage 1 miss rate. This is reassuring, demonstrating
consistency in performance between the stages.
Of note is the gain in performance between the S1 and S2 systems given a
fixed model architecture. In most cases, increasing the amount of training data
from the S1 to S2 database size resulted in absolute gains of approximately 1-
2% in EER. Further increasing the database size as done in the S3 experiments
resulted in gains for the triphone system only (4.8% absolute).
This is a positive result, indicating that the relatively small increase in train-
ing database size between S1 and S2 provided a tangible gain in performance.
Furthermore, the fact that a significantly larger training database only yielded
a 4.8% absolute gain for the T16S3E experiment suggests that returns dimin-
ish with increases in training database size. That is, the gain per hour of extra
training data diminishes as total database size increases.
Figure 6.7: Trends in EER across training dataset size
This observation has important ramifications for the development and deploy-
ment of keyword spotting systems. It indicates that keyword spotting systems
trained on relatively small databases are able to achieve performances well within
an order of magnitude of systems trained using significantly larger databases. De-
pending on the target application, this loss in performance may be an acceptable
trade-off for the time and monetary costs of obtaining larger databases.
Another interesting result is the difference in EER gains observed for English
triphone systems over English monophone systems compared to those observed
for the equivalent Spanish systems. In all cases, the English triphone systems
markedly outperformed the monophone systems, whereas for Spanish, the triphone
systems yielded considerably higher EERs than the monophone systems. Further
analysis of the data revealed that for the S1S and S2S evaluations, the M32
systems outperformed the T16 systems at all operating points, as shown in
figure 6.8.
One possible explanation for this disparity in performance gains is the decision
tree clustering process used during triphone training. The question set used for
English decision tree clustering was well established and well tested, whereas
the Spanish question set was relatively new, constructed for this particular set
of experiments. Although much care was taken in building the Spanish question
set and in removing any errors, it is possible that the nature of the phonetic
questions asked, though relevant and applicable to English, was not suitable for
Spanish decision tree clustering.
In summary, the experiments demonstrate that although some gains in performance
were achieved using larger training databases, the magnitude of these
gains was not dramatic and may not justify the costs of obtaining such databases.
For smaller-sized databases, the M32 architecture resulted in more robust perfor-
mance for Spanish keyword spotting, though this may be due to issues with the
triphone training procedures for Spanish.
Figure 6.8: DET plot for S2S experiments. 1=T16S2S, 2=M16S2S, 3=M32S2S
6.7 Indonesian spotting and verification
Given the results and trends observed for English and Spanish, experiments were
performed using the small amount of available Indonesian data to obtain baseline
keyword spotting performance. Table 6.6 and figure 6.9 show the stage 1 and
stage 2 results of these experiments.
Model   Stage 1 miss rate  Stage 1 FA/kw  Post-verifier EER
M16S1I  3.4                271.308        22.0
M32S1I  3.0                302.979        21.0
T16S1I  3.4                272.412        22.0

Table 6.6: Stage 1 spotting and stage 2 post-verification results for S1I experiments
Figure 6.9: DET plot for S1I experiments. 1=T16S1I, 2=M16S1I, 3=M32S1I
Stage 1 results were not as diverse as those observed for English and Spanish:
all models yielded similar miss rates and comparable FA/kw rates. In contrast,
the trends for post-verification EER were similar to those observed for Spanish,
with the M32 architecture yielding the best EER performance, and in fact the best
performance at most other operating points. Ultimately though, as demonstrated
by figure 6.9, the post-verification performance of all model types was very close,
being within 1% absolute in most cases.
6.8 Extrapolating Indonesian performance
The original goal of this research was to examine how training dataset size af-
fected keyword spotting performance, and more importantly, whether the gains
obtained were sufficiently large to justify the collection of more data. The ex-
periments performed for English and Spanish data provide some insight into how
performance varies with changes in dataset size.
Given these observations, it is possible to perform a degree of extrapolation
to predict keyword spotting performance for other languages, and in particular,
the anticipated gains from increasing the amount of training data. It should be
noted that these extrapolations carry only a low degree of
confidence, since there were only a few data points upon which the extrapolations
were based. Nevertheless, the predictions will still give a general indication of
expected performance.
Of particular interest is the keyword spotting performance of the previously
evaluated M16S1I, M32S1I and T16S1I Indonesian systems. Given the consistent
1-2% EER gain observed when increasing from S1 to S2 sized training data sets
for the English and Spanish experiments, it is reasonable to postulate that similar
gains in EER would be observed in Indonesian. As such, one would expect EER
rates in the vicinity of 19-20%.
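The arithmetic behind these projections is simply the subtraction of observed absolute gains from a measured baseline; the sketch below uses the gains quoted in the text and the measured S1I EERs as baselines, and its outputs are low-confidence projections rather than measured results:

```python
def extrapolate_eer(baseline_eer: float, expected_gain: tuple) -> tuple:
    """Project an EER range by subtracting a (low, high) absolute gain,
    observed on better-resourced languages, from a measured baseline EER."""
    low_gain, high_gain = expected_gain
    return (baseline_eer - high_gain, baseline_eer - low_gain)

m32_s1i_eer = 21.0   # measured M32S1I baseline from table 6.6
t16_s1i_eer = 22.0   # measured T16S1I baseline from table 6.6

# S2-sized data: 1-2% absolute gain observed for English and Spanish
s2_projection = extrapolate_eer(m32_s1i_eer, (1.0, 2.0))   # (19.0, 20.0)
# S3-sized data with a triphone system: 5-7% absolute gain assumed
s3_projection = extrapolate_eer(t16_s1i_eer, (5.0, 7.0))   # (15.0, 17.0)
```

The projected ranges match the 19-20% and 15-17% figures discussed in this section.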
Further extrapolations regarding even larger sized training databases can be
made given observations from the English S3E experiments. However, these ex-
trapolations may be problematic, since conflicting results were observed across
the three evaluated systems - a decrease in miss rate for the T16S3E system
compared to an increase in miss rate for the M16S3E and M32S3E systems. Re-
alistically though, when developing an Indonesian keyword spotting system with
an S3 sized database, it is likely that a triphone architecture will be used, since
the monophone systems will be too simplistic to model such a large amount of
data.
Since gains in EER remained consistent across English and Spanish for the S1
to S2 experiments, it is not unreasonable to assume that similar consistency will
be maintained going from an S1 sized set to an S3 sized set. As such, one would
expect gains in EER in the vicinity of 5-7% for an Indonesian S3 triphone system,
given the 6.1% gain observed for the T16S3E system over the T16S1E system.
This gives a likely absolute Indonesian EER of approximately 15-17% for an S3
sized database.
Figure 6.10: Extrapolations of Indonesian keyword spotting performance using larger sized databases
6.9 Summary and Conclusions
The reported experiments demonstrate a number of interesting results regarding
the effects of training database size. Most importantly, they indicate that the sen-
sitivity of keyword spotting performance to training database size is significantly
less than that of speech transcription. Specifically, it was found that decreasing
the amount of training data from 160 hours to 4 hours for English resulted in a
loss of only 6.1% in equal error rate. This is significantly less than the
approximately 18% loss in word error rate reported by Moore [24] for speech
transcription.
It was also found that monophone based keyword spotting yielded better
miss rate performance for limited training database sizes compared to triphone
modeling. This is most likely because triphone systems suffered from data sparsity
issues for the limited data cases. However, the monophone systems did have
significantly higher FA/kw rates which translates to a greater number of actual
false alarms in the final system output. Given this, a triphone system may still
be more appropriate for a limited training data system, even though miss rate is
slightly poorer.
Low confidence extrapolations were also made regarding expected equal error
rate gains for an Indonesian keyword spotting system trained on a large database.
A system trained on 2.8 hours of training data yielded an EER of 21.0% using a
32-mixture monophone model set. Trends seen in English and Spanish implied
an EER gain for Indonesian of 1-2% using a 9.6 hour database and a gain of 5-7%
using a significantly larger training database.
Overall, the research demonstrates that keyword spotting using limited size
training databases is feasible. Such systems are capable of achieving keyword
spotting performance within an order of magnitude of systems trained on significantly
larger databases, while requiring fewer resources. This has ramifications for
the immediate development and deployment of speech enabled systems for low
resourced non-English languages.
Chapter 7

Summary, Conclusions and Future Work
This chapter summarises the work presented in this thesis, together with the
primary conclusions arising from it and a discussion of possible future research
directions.
7.1 HMM-based Spotting and Verification
Chapter 3 presented a comprehensive study of HMM-based spotting and verifi-
cation techniques. In particular, the methods were considered in terms of their
suitability for real-time monitoring applications.
7.1.1 Conclusions
• Of the evaluated methods, the SBM-based approach was found to be the
most appropriate for real-time monitoring applications. This was because
it obtained excellent miss rates as well as fast execution speeds. Although
this method was also hindered by very high false alarm rates, it was argued
and subsequently demonstrated that a well-performing keyword verification
stage would be capable of culling a significant portion of these false alarms.
• An analysis of the effect of target word length demonstrated that keyword
spotting and verification performance was noticeably poorer for shorter key-
words. This highlighted the need for techniques that specifically addressed
the issue of short-word spotting and verification.
• The tuning of HMM-based spotting using techniques such as output score
thresholding and target word insertion penalties was demonstrated to be in-
appropriate. This was because any attempts to significantly improve either
miss rate or false alarm rate resulted in considerable losses in the comple-
mentary performance metric.
• A neural network based decision boundary estimate was proposed as an
alternative to the traditional log-likelihood ratio. It was found that such
an approach yielded considerable gains in performance for SBM-based key-
word verification, particularly for short keywords. This suggested that a
similar approach could be used to improve the robustness of many other
log-likelihood ratio based confidence score measures.
7.1.2 Future Work
The reported experiments highlighted the need for well-performing short-word
keyword spotting and verification techniques. Poor performance is typically en-
countered for short words because of the reduced number of observations available
for scoring. As such, techniques that examine additional sources of information
from the observation sequence, such as linguistic or orthogonal feature set infor-
mation may yield improved performance. Additionally, more appropriate decision
boundary modeling techniques, such as the proposed neural network approach,
may provide an avenue for further improvements.
7.2 Cohort Word Verification
A novel technique of keyword verification was presented in chapter 4. This
method combined high level linguistic information with cohort-based verification
techniques to yield significant improvements in verification performance, partic-
ularly for the problematic class of short-duration target words. Additionally, the
fusion of multiple keyword verifiers was investigated and found to provide further
gains in performance.
7.2.1 Conclusions
• The reported evaluations compared the performance of cohort word verifi-
cation and SBM-based verification for the conversational telephone speech
domain. It was found that cohort word verification provided excellent gains
for short to medium length target words but was markedly poorer for long
words. Further analysis demonstrated that this poor performance was a
result of reduced cohort word set sizes for long words.
• It was found that the fusion of cohort word verification and SBM verification
provided some excellent gains in performance over the unfused systems. In
particular, this architecture was well suited for medium length target word
verification.
• The fusion of multiple cohort word verifiers was also examined and demon-
strated to provide considerable gains for short word verification. This was
a pleasing result considering the problematic nature of short word verifica-
tion.
• Multiple formulations of the cohort word confidence score were presented
and investigated. Of these, it was found that the N-class hybrid approach
provided the best compromise between error rate and execution speed.
• The large number of cohort word parameters was rationalised through a
detailed analysis of their effects. It was found that the main parameters of
importance were dmax and the amount of cohort word set downsampling.
Other parameters provided only minor changes in performance.
7.2.2 Future Work
• It was demonstrated in chapter 3 that considerable gains in performance
could be obtained by using a more robust decision boundary estimate, such
as a neural network classifier. Future work could examine the application of
discriminative decision boundary estimates to the cohort word confidence
score as a means of further improving performance.
• Execution speeds for cohort word verification were not reported in this
thesis. However, this is an important metric that needs to be considered
when applying this method to speed-critical tasks such as audio document
indexing. Speed improvements may be obtained in a variety of ways, for
example, through the use of aggressive cohort word set size downsampling
or tighter decoding pruning beamwidths. A study of the execution speed of
cohort word verification and an investigation of techniques to improve it
is a possible avenue for future research.
7.3 Dynamic Match Lattice Spotting
Chapter 5 presented a novel technique of fast and accurate unrestricted vocab-
ulary audio document indexing. This method was evaluated on conversational
telephone speech and found to provide significant improvements over conventional
lattice-based and HMM-based techniques.
7.3.1 Conclusions
• The chapter presented a novel unrestricted vocabulary audio document in-
dexing method named Dynamic Match Lattice Spotting. Through experi-
mentation, it was demonstrated that this method was capable of searching
hours of data using only seconds of processing time, while maintaining ex-
cellent detection performance. The proposed method provided significant
improvements in detection rate and execution speed over the baseline con-
ventional lattice based and SBM-based systems.
• The lack of robustness to erroneous lattice realisations was identified as a
weakness in conventional lattice-based techniques. Experiments reported
in this chapter highlighted that significant gains in miss rate performance
could be obtained by incorporating robustness to such errors within the
lattice search process.
• Two methods of improving the speed of DMLS were investigated and im-
plemented. It was found that these provided considerable gains in search
speed without affecting miss or false alarm rates. As a result, a DMLS
system was constructed that could search at speeds of 33 hours per minute
with good miss and false alarm rates.
• Individual dynamic match rules were evaluated within the context of the
proposed technique. It was found that even the simplest of rules provided
some tangible gains in performance over the conventional lattice-based tech-
nique. However, in particular, vowel-substitution and closure/stop substi-
tution rules provided dramatic gains, though at the expense of increased
false alarm rates.
• An analysis of the parameters of DMLS demonstrated that the technique
could be easily tuned to obtain low miss rates or low false alarms while
maintaining its fast execution speed.
7.3.2 Future Work
• The dynamic match rules that were proposed and evaluated were derived
empirically. Although these rules provided good performance, they were
unlikely to be optimal. Future work could examine the use of probabilistic
rules, for example, derived from the phone recogniser confusion matrix.
• The MED score used in DMLS was a discrete variable, and as such thresholding on this value resulted in a discontinuous tuning curve. Smoother tuning would be possible if a continuous probabilistic output score could be derived. One possible solution is to combine the MED score with the acoustic score of the putative occurrence as estimated from the lattice; fusion of these values may yield an output score that is more useful for continuous tuning.
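As a sketch of how these two future-work ideas might be realised, the fragment below derives substitution costs from a phone recogniser confusion matrix and linearly fuses an MED score with an acoustic score. All function names, the negative-log-probability costing, and the linear fusion weight are illustrative assumptions, not methods specified in this thesis:

```python
import math

def substitution_costs(confusion_counts):
    """Hypothetical sketch: turn a phone recogniser confusion matrix,
    given as confusion_counts[reference_phone][hypothesis_phone] = count,
    into negative-log-probability substitution costs, so that frequently
    confused phone pairs become cheap to substitute in the MED search."""
    costs = {}
    for ref, row in confusion_counts.items():
        total = sum(row.values())
        for hyp, count in row.items():
            # Rare or unseen confusions receive a high (or infinite) cost.
            costs[(ref, hyp)] = -math.log(count / total) if count else float("inf")
    return costs

def fused_score(med_score, acoustic_score, alpha=0.5):
    """Simple linear fusion of the discrete MED score with the continuous
    acoustic score of a putative occurrence; alpha is a tuning weight
    chosen for illustration, not a value from the thesis."""
    return alpha * med_score + (1 - alpha) * acoustic_score
```

Because the acoustic score is continuous, the fused value varies smoothly and could support the continuous tuning curve described above.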
7.4 Non-English Spotting
Chapter 6 examined the application of keyword spotting to non-English languages
and assessed the impact of limited training data on system performance.
7.4.1 Conclusions
• It was found that keyword spotting performance was considerably less sensitive to training database size than previously reported for speech transcription. This finding supports the argument for developing speech applications that use keyword spotting instead of speech transcription to satisfy the immediate need for non-English speech-enabled applications.
• Analysis of the experimental results demonstrated that keyword spotters trained on limited amounts of data could achieve performance well within an order of magnitude of that of systems trained on very large databases.
• It was demonstrated that monophone-based systems were more effective than triphone-based systems in terms of miss rate when using limited amounts of training data. This was because triphone models suffered from data sparsity issues on small training databases.
7.5 Final Comments
A number of novel contributions to the field of keyword spotting have been generated by this research. A considerable amount of this work has been used in the development of data mining applications that are being actively trialled by external bodies, demonstrating that the work is not only theoretically sound but also practically viable.
Appendix A
The Levenshtein Distance
A.1 Introduction
The Levenshtein distance measures the minimum cost of transforming one string into another. Transformation is performed by successive applications of one of four operations: match, substitution, insertion and deletion. Each operation has an associated cost, and hence the Levenshtein algorithm must implicitly discover the sequence of operations that yields the cheapest total transformation cost.
A.2 Applications
Applications of the Levenshtein distance, also known as the Minimum Edit Distance (MED), span a wide range of fields. In biology, the algorithm is used to identify similar sequences of nucleic acids in DNA or of amino acids in proteins. Web search engines have used this method to detect similarity between phrases and query terms. Less obvious is the use of the edit distance to discover similarities between documents for the purpose of detecting plagiarism.
In speech research, the Levenshtein distance is particularly useful in the analysis of phonetic and word sequences. For example, the word error rate of a speech transcription system can be calculated using this method. The phonetic similarity between two pronunciations can also be measured using the Levenshtein distance, for example to find similarly pronounced words.
A.3 Algorithm
A basic implementation of the Levenshtein algorithm uses a cost matrix to accumulate transformation costs. A recursive process updates successive elements of this matrix in order to discover the overall minimum transformation cost.
Let the sequence P = (p1, p2, . . . , pM) be the source sequence and the sequence Q = (q1, q2, . . . , qN) be the target sequence. Additionally, three transformation cost functions are defined:
• Cs(x, y) – the cost of substituting symbol x in P with symbol y in Q. Typically this cost is 0 if x = y, i.e. a match operation.
• Ci(y) – the cost of inserting the symbol y into sequence P.
• Cd(x) – the cost of deleting the symbol x from sequence P.
The element at row i and column j of the cost matrix represents the minimum cost of transforming the subsequence (p1, . . . , pi) into (q1, . . . , qj). Hence the bottom-right element of the cost matrix represents the total minimum cost of transforming the entire source sequence P into the target sequence Q.
The basic premise of the Levenshtein algorithm is that the minimum cost of transforming the sequence (p1, . . . , pi) into (q1, . . . , qj) is the smallest of:
1. The cost of transforming (p1, . . . , pi) into (q1, . . . , qj−1) plus the cost of inserting qj.
2. The cost of transforming (p1, . . . , pi−1) into (q1, . . . , qj) plus the cost of deleting pi.
3. The cost of transforming (p1, . . . , pi−1) into (q1, . . . , qj−1) plus the cost of substituting pi with qj. If pi = qj then this is usually taken to have a cost of 0.
In this way, the cost matrix can be filled from the top-left corner to the bottom-
right corner in an iterative fashion.
The Levenshtein algorithm is then as follows:
1. Initialise an (M + 1) × (N + 1) matrix Ω. This is called the Levenshtein cost matrix.
2. The top-left element Ω0,0 represents the cost of transforming the empty sequence into the empty sequence, and is therefore initialised to 0.
3. The first row of the cost matrix represents successive insertions, and hence accumulates insertion costs:
Ω0,j = Ω0,j−1 + Ci(qj) (A.1)
4. The first column of the cost matrix represents successive deletions, and similarly accumulates deletion costs:
Ωi,0 = Ωi−1,0 + Cd(pi) (A.2)
5. Update the remaining elements of the cost matrix from the top-left down to the bottom-right using the Levenshtein update equation:
Ωi,j = min( Ωi,j−1 + Ci(qj), Ωi−1,j + Cd(pi), Ωi−1,j−1 + Cs(pi, qj) ) (A.3)
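The five steps above can be sketched directly in code. The following is a minimal illustrative implementation (the function name and the default constant costs, which match those used in Figure A.1, are assumptions rather than code from this thesis):

```python
def levenshtein(p, q, sub_cost=None, ins_cost=None, del_cost=None):
    """Minimum cost of transforming sequence p into sequence q.

    The cost functions default to the constant costs of Figure A.1:
    substitution, insertion and deletion cost 1; a match costs 0.
    """
    Cs = sub_cost or (lambda x, y: 0 if x == y else 1)
    Ci = ins_cost or (lambda y: 1)
    Cd = del_cost or (lambda x: 1)

    M, N = len(p), len(q)
    # (M+1) x (N+1) cost matrix; omega[i][j] is the minimum cost of
    # transforming p[:i] into q[:j].
    omega = [[0] * (N + 1) for _ in range(M + 1)]
    for j in range(1, N + 1):            # first row: successive insertions (A.1)
        omega[0][j] = omega[0][j - 1] + Ci(q[j - 1])
    for i in range(1, M + 1):            # first column: successive deletions (A.2)
        omega[i][0] = omega[i - 1][0] + Cd(p[i - 1])
    for i in range(1, M + 1):            # update equation (A.3)
        for j in range(1, N + 1):
            omega[i][j] = min(
                omega[i][j - 1] + Ci(q[j - 1]),                # insert qj
                omega[i - 1][j] + Cd(p[i - 1]),                # delete pi
                omega[i - 1][j - 1] + Cs(p[i - 1], q[j - 1]),  # substitute
            )
    return omega[M][N]
```

For example, `levenshtein("deranged", "hanged")` returns 3, the bottom-right element of the cost matrix in Figure A.1.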
Figure A.1 shows an example of the cost matrix obtained using the MED method for transforming the word deranged into the word hanged using constant-cost transformation functions. It shows that the cheapest transformation cost is 3. There are multiple operation sequences that achieve this minimum cost. For example, both (del, del, subst, match, match, match, match, match) and (subst, del, del, match, match, match, match, match) have a cost of 3.
      /  h  a  n  g  e  d
  /   0  1  2  3  4  5  6
  d   1  1  2  3  4  5  5
  e   2  2  2  3  4  4  5
  r   3  3  3  3  4  5  5
  a   4  4  3  4  4  5  6
  n   5  5  4  3  4  5  6
  g   6  6  5  4  3  4  5
  e   7  7  6  5  4  3  4
  d   8  8  7  6  5  4  3
Figure A.1: Example of the cost matrix calculated using the Levenshtein algorithm for transforming deranged to hanged. The costs of substitution, deletion and insertion are all fixed at 1; the cost of a match is fixed at 0.