
Speech and Audio Research Laboratory of the SAIVT program

Centre for Built Environment and Engineering Research

ACOUSTIC KEYWORD SPOTTING

IN SPEECH WITH APPLICATIONS

TO DATA MINING

A. J. Kishan Thambiratnam

BE(Electronics)/BInfTech

SUBMITTED AS A REQUIREMENT OF

THE DEGREE OF

DOCTOR OF PHILOSOPHY

AT

QUEENSLAND UNIVERSITY OF TECHNOLOGY

BRISBANE, QUEENSLAND

9 MARCH 2005


Keywords

Keyword Spotting, Wordspotting, Data Mining, Audio Indexing, Keyword Verification, Confidence Scoring, Speech Recognition, Utterance Verification


Abstract

Keyword Spotting is the task of detecting keywords of interest within continuous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unrestricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have suffered from poor detection rates, high false alarm rates, or slow execution times, thus reducing their commercial viability.

This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the field of keyword spotting.

The first major contribution is the development of a novel keyword verification method named Cohort Word Verification. This method combines high level linguistic information with cohort-based verification techniques to obtain dramatic improvements in verification performance, in particular for the problematic short duration target word class.

The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique augments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains significant improvement in detection rate over lattice-based audio document indexing while still maintaining extremely fast search speeds.

The third major contribution is the study of multiple verifier fusion for the task of keyword verification. The reported experiments demonstrate that substantial improvements in verification performance can be obtained through the fusion of multiple keyword verifiers. The research focuses on combinations of speech background model based verifiers and cohort word verifiers.

The final major contribution is a comprehensive study of the effects of limited training data for keyword spotting. This study is performed with consideration as to how these effects impact the immediate development and deployment of speech technologies for non-English languages.


Contents

Keywords i

Abstract iii

List of Tables xiii

List of Figures xvi

List of Abbreviations xxi

Authorship xxiii

Acknowledgments xxv

1 Introduction 1

1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 Aims and Objectives . . . . . . . . . . . . . . . . . . . . . 2

1.1.2 Research Scope . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Thesis Organisation . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Major Contributions of this Research . . . . . . . . . . . . . . . . 6

1.4 List of Publications . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 A Review of Keyword Spotting 9

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9


2.2 The keyword spotting problem . . . . . . . . . . . . . . . . . . . . 10

2.3 Applications of keyword spotting . . . . . . . . . . . . . . . . . . 11

2.3.1 Keyword monitoring applications . . . . . . . . . . . . . . 11

2.3.2 Audio document indexing . . . . . . . . . . . . . . . . . . 13

2.3.3 Command controlled devices . . . . . . . . . . . . . . . . . 13

2.3.4 Dialogue systems . . . . . . . . . . . . . . . . . . . . . . . 14

2.4 The development of keyword spotting . . . . . . . . . . . . . . . . 15

2.4.1 Sliding window approaches . . . . . . . . . . . . . . . . . . 15

2.4.2 Non-keyword model approaches . . . . . . . . . . . . . . . 16

2.4.3 Hidden Markov Model approaches . . . . . . . . . . . . . . 17

2.4.4 Further developments . . . . . . . . . . . . . . . . . . . . . 17

2.5 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . 18

2.5.1 The reference and result sets . . . . . . . . . . . . . . . . . 19

2.5.2 The hit operator . . . . . . . . . . . . . . . . . . . . . . . 19

2.5.3 Miss rate . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.5.4 False alarm rate . . . . . . . . . . . . . . . . . . . . . . . . 21

2.5.5 False acceptance rate . . . . . . . . . . . . . . . . . . . . . 21

2.5.6 Execution time . . . . . . . . . . . . . . . . . . . . . . . . 22

2.5.7 Figure of Merit . . . . . . . . . . . . . . . . . . . . . . . . 22

2.5.8 Equal Error Rate . . . . . . . . . . . . . . . . . . . . . . . 23

2.5.9 Receiver Operating Characteristic Curves . . . . . . . . . . 24

2.5.10 Detection Error Trade-off Plots . . . . . . . . . . . . . . . 25

2.6 Unconstrained vocabulary spotting . . . . . . . . . . . . . . . . . 26

2.6.1 HMM-based approach . . . . . . . . . . . . . . . . . . . . 26

2.6.2 Neural Network Approaches . . . . . . . . . . . . . . . . . 28

2.7 Approaches to non-keyword modeling . . . . . . . . . . . . . . . . 31

2.7.1 Speech background model . . . . . . . . . . . . . . . . . . 31

2.7.2 Phone models . . . . . . . . . . . . . . . . . . . . . . . . . 33


2.7.3 Uniform distribution . . . . . . . . . . . . . . . . . . . . . 34

2.7.4 Online garbage model . . . . . . . . . . . . . . . . . . . . 34

2.8 Constrained vocabulary spotting . . . . . . . . . . . . . . . . . . . 36

2.8.1 Language model approaches . . . . . . . . . . . . . . . . . 36

2.8.2 Event spotting . . . . . . . . . . . . . . . . . . . . . . . . 39

2.9 Keyword verification . . . . . . . . . . . . . . . . . . . . . . . . . 41

2.9.1 A formal definition . . . . . . . . . . . . . . . . . . . . . . 42

2.9.2 Combining keyword spotting and verification . . . . . . . . 42

2.9.3 The problem of short duration keywords . . . . . . . . . . 43

2.9.4 Likelihood ratio based approaches . . . . . . . . . . . . . . 43

2.9.5 Alternate Information Sources . . . . . . . . . . . . . . . . 46

2.10 Audio Document Indexing . . . . . . . . . . . . . . . . . . . . . . 47

2.10.1 Limitations of the Speech-to-Text Transcription approach . . . . 48

2.10.2 Reverse dictionary lookup searches . . . . . . . . . . . . . 49

2.10.3 Indexed reverse dictionary lookup searches . . . . . . . . . 51

2.10.4 Lattice based searches . . . . . . . . . . . . . . . . . . . . 53

3 HMM-based spotting and verification 57

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.2 The confusability circle framework . . . . . . . . . . . . . . . . . . 58

3.3 Analysis of non-keyword models . . . . . . . . . . . . . . . . . . . 60

3.3.1 All-speech models . . . . . . . . . . . . . . . . . . . . . . . 60

3.3.2 SBM methods . . . . . . . . . . . . . . . . . . . . . . . . . 61

3.3.3 Phone-set methods . . . . . . . . . . . . . . . . . . . . . . 62

3.3.4 Target-word-excluding methods . . . . . . . . . . . . . . . 62

3.4 Evaluation of keyword spotting techniques . . . . . . . . . . . . . 63

3.4.1 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . 64


3.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.5 Tuning the phone set non-keyword model . . . . . . . . . . . . . . 68

3.6 Output score thresholding for SBM spotting . . . . 70

3.7 Performance across keyword length . . . . . . . . . . . . . . . . . 72

3.7.1 Evaluation sets . . . . . . . . . . . . . . . . . . . . . . . . 73

3.7.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

3.8 HMM-based keyword verification . . . . . . . . . . . . . . . . . . 74

3.8.1 Evaluation set . . . . . . . . . . . . . . . . . . . . . . . . . 76

3.8.2 Evaluation procedure . . . . . . . . . . . . . . . . . . . . . 77

3.8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

3.9 Discriminative background model KV . . . . . . . . . . . . . . . . 79

3.9.1 System architecture . . . . . . . . . . . . . . . . . . . . . . 79

3.9.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

3.10 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . 82

4 Cohort word keyword verification 85

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4.2 Foundational concepts . . . . . . . . . . . . . . . . . . . . . . . . 87

4.2.1 Cohort-based scoring . . . . . . . . . . . . . . . . . . . . . 87

4.2.2 The use of language information . . . . . . . . . . . . . . . 88

4.3 Overview of the cohort word technique . . . . . . . . . . . . . . . 90

4.4 Cohort word set construction . . . . . . . . . . . . . . . . . . . . 92

4.4.1 The choice of dmin and dmax . . . . . . . . . . . . . . . . . 92

4.4.2 Cohort word set downsampling . . . . . . . . . . . . . . . 94

4.4.3 Distance function . . . . . . . . . . . . . . . . . . . . . . . 94

4.5 Classification approach . . . . . . . . . . . . . . . . . . . . . . . . 96

4.5.1 2-class classification approach . . . . . . . . . . . . . . . . 96


4.5.2 Hybrid N-class approach . . . . . . . . . . . . . . . . . . . 98

4.6 Summary of the cohort word algorithm . . . . . . . . . . . . . . . 100

4.7 Comparison of classifier approaches . . . . . . . . . . . . . . . . . 101

4.7.1 Evaluation set . . . . . . . . . . . . . . . . . . . . . . . . . 102

4.7.2 Recogniser parameters . . . . . . . . . . . . . . . . . . . . 103

4.7.3 Cohort word selection . . . . . . . . . . . . . . . . . . . . 103

4.7.4 Evaluation procedure . . . . . . . . . . . . . . . . . . . . . 104

4.7.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4.8 Performance across target keyword length . . . . . . . . . . . . . 106

4.8.1 Evaluation set . . . . . . . . . . . . . . . . . . . . . . . . . 106

4.8.2 Recogniser parameters . . . . . . . . . . . . . . . . . . . . 107

4.8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

4.8.4 Analysis of poor 8-phone performance . . . . . . . . . . . . 110

4.8.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 111

4.9 Effects of selection parameters . . . . . . . . . . . . . . . . . . . . 113

4.9.1 Cohort word set downsampling . . . . . . . . . . . . . . . 114

4.9.2 Cohort word selection range . . . . . . . . . . . . . . . . . 116

4.9.3 MED cost parameters . . . . . . . . . . . . . . . . . . . . 119

4.9.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 121

4.10 Fused cohort word systems . . . . . . . . . . . . . . . . . . . . . . 122

4.10.1 Training dataset . . . . . . . . . . . . . . . . . . . . . . . 123

4.10.2 Neural network architecture . . . . . . . . . . . . . . . . . 123

4.10.3 Experimental procedure . . . . . . . . . . . . . . . . . . . 123

4.10.4 Baseline unfused results . . . . . . . . . . . . . . . . . . . 124

4.10.5 Fused SBM-CW experiments . . . . . . . . . . . . . . . . . 125

4.10.6 Fused CW-CW experiments . . . . . . . . . . . . . . . . . 128

4.10.7 Comparison of fused and unfused systems . . . . . . . . . 129

4.11 Conclusions and Summary . . . . . . . . . . . . . . . . . . . . . . 133


5 Dynamic Match Lattice Spotting 137

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

5.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

5.3 Dynamic Match Lattice Spotting method . . . . . . . . . . . . . . 140

5.3.1 Basic method . . . . . . . . . . . . . . . . . . . . . . . . . 143

5.3.2 Optimised Dynamic Match Lattice Search . . . . . . . . . 145

5.4 Evaluation of DMLS performance . . . . . . . . . . . . . . . . . . 146

5.4.1 Evaluation set . . . . . . . . . . . . . . . . . . . . . . . . . 146

5.4.2 Recogniser parameters . . . . . . . . . . . . . . . . . . . . 147

5.4.3 Lattice building . . . . . . . . . . . . . . . . . . . . . . . . 147

5.4.4 Query-time processing . . . . . . . . . . . . . . . . . . . . 148

5.4.5 Baseline systems . . . . . . . . . . . . . . . . . . . . . . . 149

5.4.6 Evaluation procedure . . . . . . . . . . . . . . . . . . . . . 150

5.4.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

5.5 Analysis of dynamic match rules . . . . . . . . . . . . . . . . . . . 152

5.5.1 System configurations . . . . . . . . . . . . . . . . . . . . 153

5.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

5.6 Analysis of DMLS algorithm parameters . . . . . . . . . . . . . . 156

5.6.1 Number of lattice generation tokens . . . . . . . . . . . . . 157

5.6.2 Pruning beamwidth . . . . . . . . . . . . . . . . . . . . . . 158

5.6.3 Number of lattice traversal tokens . . . . . . . . . . . . . . 159

5.6.4 MED cost threshold . . . . . . . . . . . . . . . . . . . . . 160

5.6.5 Tuned systems . . . . . . . . . . . . . . . . . . . . . . . . 162

5.6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 163

5.7 Conversational telephone speech experiments . . . . 165

5.7.1 Evaluation set . . . . . . . . . . . . . . . . . . . . . . . . . 165

5.7.2 Recogniser parameters . . . . . . . . . . . . . . . . . . . . 165


5.7.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

5.8 Non-destructive optimisations . . . . . . . . . . . . . . . . . . . . 168

5.8.1 Prefix sequence optimisation . . . . . . . . . . . . . . . . . 169

5.8.2 Early stopping optimisation . . . . . . . . . . . . . . . . . 171

5.8.3 Combining optimisations . . . . . . . . . . . . . . . . . . . 173

5.9 Optimised system timings . . . . . . . . . . . . . . . . . . . . . . 174

5.9.1 Experimental procedure . . . . . . . . . . . . . . . . . . . 175

5.9.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

5.10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

6 Non-English Spotting 181

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

6.2 The issue of limited resources . . . . . . . . . . . . . . . . . . . . 182

6.3 The role of keyword spotting . . . . . . . . . . . . . . . . . . . . . 184

6.4 Experiment setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

6.4.1 Database design . . . . . . . . . . . . . . . . . . . . . . . . 185

6.4.2 Model architectures . . . . . . . . . . . . . . . . . . . . . . 186

6.4.3 Evaluation set design . . . . . . . . . . . . . . . . . . . . . 188

6.4.4 Evaluation procedure . . . . . . . . . . . . . . . . . . . . . 188

6.5 English and Spanish stage 1 evaluations . . . . . . . . . . . . . . 189

6.6 English and Spanish post keyword verification . . . . 192

6.7 Indonesian spotting and verification . . . . . . . . . . . . . . . . . 197

6.8 Extrapolating Indonesian performance . . . . . . . . . . . . . . . 198

6.9 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . 200

7 Summary, Conclusions and Future Work 203

7.1 HMM-based Spotting and Verification . . . . . . . . . . . . . . . 203

7.1.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 203


7.1.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 204

7.2 Cohort Word Verification . . . . . . . . . . . . . . . . . . . . . . . 205

7.2.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 205

7.2.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 206

7.3 Dynamic Match Lattice Spotting . . . . . . . . . . . . . . . . . . 206

7.3.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 207

7.3.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 208

7.4 Non-English Spotting . . . . . . . . . . . . . . . . . . . . . . . . . 208

7.4.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 208

7.5 Final Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

Bibliography 210

A The Levenstein Distance 217

A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

A.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

A.3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218


List of Tables

3.1 Keyword spotting performance of baseline systems on Switchboard 1 data . . . . 66

3.2 Effect of target word insertion penalty on PM-KS performance . . . . 69

3.3 Equal error rates of unnormalised and duration normalised output score thresholding applied to SBM-KS . . . . 71

3.4 Details of phone-length dependent evaluation sets . . . . 73

3.5 SBM-KS performance on Switchboard 1 data for different phone-length target words . . . . 74

3.6 Statistics for keyword verification evaluation sets . . . . . . . . . . 77

3.7 Equal error rates for SBM-based keyword verification . . . . . . . 78

3.8 Equal error rates for SBM and MLP-SBM keyword verification . . 82

4.1 Evaluated cohort word selection parameters . . . . . . . . . . . . 103

4.2 Performance of selected cohort word KV systems on TIMIT evaluation set. Cohort word systems are qualified with the appropriate cohort word selection parameters using a tag in the format dmin, dmax, ψd, ψi . . . . 105

4.3 Performance of SBM-KV and selected cohort word systems on the SWB1 evaluation sets. Cohort word selection parameters are specified with each system in the format dmin, dmax, ψd, ψi . . . . 108


4.4 Mean and standard deviation of the number of cohort words used in the 3 best performing cohort word KV methods for the SWB1 evaluation set . . . . 111

4.5 Performance of baseline SBM-KV and best cohort word systems on the SWB1 evaluation sets . . . . 124

4.6 Performance of the best fused SBM-cohort systems on the SWB1 evaluation sets . . . . 125

4.7 Performance of the best fused cohort-cohort systems on the SWB1 evaluation sets . . . . 128

4.8 Correlation analysis of fused EER and individual unfused EER . . 130

4.9 Summary of best performing systems . . . . . . . . . . . . . . . . 135

5.1 Phone substitution costs for DMLS . . . . . . . . . . . . . . . . . 149

5.2 Baseline keyword spotting results evaluated on TIMIT . . . . . . 151

5.3 TIMIT performance when isolating various DP rules . . . . . . . 154

5.4 Effect of adjusting number of lattice generation tokens . . . . . . 157

5.5 Effect of adjusting pruning beamwidth . . . . . . . . . . . . . . . 158

5.6 Effect of adjusting number of traversal tokens . . . . . . . . . . . 160

5.7 Effect of adjusting MED cost threshold Smax . . . . . . . . . . . . 161

5.8 Optimised DMLS configurations evaluated on TIMIT . . . . . . . 163

5.9 Keyword spotting results on SWB1 . . . . . . . . . . . . . . . . . 166

5.10 Relative speeds of optimised DMLS systems . . . . . . . . . . . . 176

5.11 Performance of a fully optimised DMLS system on Switchboard data . . . . 177

5.12 Summary of key results . . . . . . . . . . . . . . . . . . . . . . . . 179

6.1 Summary of training data sets . . . . . . . . . . . . . . . . . . . . 186

6.2 Codes used to refer to model architectures . . . . . . . . . . . . . 187

6.3 Summary of evaluation data sets. . . . . . . . . . . . . . . . . . . 188

6.4 Stage 1 spotting rates for various model sets and database sizes . 191


6.5 Equal error rates after keyword verification for various model sets and training database sizes . . . . 194

6.6 Stage 1 spotting and stage 2 post verification results for S1I experiments . . . . 197


List of Figures

2.1 An example of a Receiver Operating Characteristic curve . . . . . 24

2.2 An example of a Detection Error Trade-off plot . . . . . . . . . . 25

2.3 Recognition grammar for HMM-based keyword spotting . . . . . . 27

2.4 Sample recognition grammar for small non-keyword vocabulary keyword spotting . . . . 29

2.5 System architecture for HMM keyword spotting using a Speech Background Model as the non-keyword model . . . . 32

2.6 System architecture for HMM keyword spotting using a composite non-keyword model constructed from phone models . . . . 33

2.7 Constructing a recognition network for constrained vocabulary keyword spotting . . . . 38

2.8 An optimised constrained vocabulary keyword spotting recognition network (language model probabilities omitted) . . . . 39

2.9 An event spotting network for detecting occurrences of times [16] . . . . 40

2.10 Likelihood ratio based keyword occurrence verification with multiple verifier fusion . . . . 45

2.11 Applying reverse dictionary searches to the detection of the word ACQUIRE in a phone stream . . . . 50

2.12 Example of indexed reverse dictionary searching for the detection of the word ACQUIRE . . . . 52


2.13 Using lattice based searching to locate instances of the word ACQUIRE within a phone lattice . . . . 54

3.1 Confusability circle for the target word STOCK . . . . 59

3.2 Example of the shared subevent confusable acoustic region for the keyword STOCK . . . . 63

3.3 Incorporating target word insertion penalty into HMM-based keyword spotting . . . . 69

3.4 DET plots for unnormalised and duration normalised output score thresholding applied to SBM-KS . . . . 72

3.5 DET plots for duration normalised output score thresholding applied to SBM-KS for keyword length dependent evaluation sets . . . . 75

3.6 DET plots for different target keyword lengths for SBM-KV on Switchboard 1 evaluation sets . . . . 78

3.7 System architecture for MLP background model based KV . . . . 80

3.8 DET plots for SBM and MLP-SBM systems for 4-phone words . . . . 81

3.9 DET plots for SBM and MLP-SBM systems for 6-phone words . . . . 81

3.10 DET plots for SBM and MLP-SBM systems for 8-phone words . . . . 81

4.1 Controlling the degree of CAR region modeling dmin and dmax tuning . . . . 93

4.2 A N-class classifier approach to cohort word verification for the keyword w and cohort word set R(w) . . . . 99

4.3 DET plot for best cohort word and SBM-KV systems on SWB1 4-phone length evaluation set . . . . 109

4.4 DET plot for best cohort word and SBM-KV systems on SWB1 6-phone length evaluation set . . . . 109

4.5 Equal error rate versus mean number of cohort words . . . . 112

4.6 Trends in equal error rate with changes in cohort word set downsampling size . . . . 115


4.7 Trends in equal error rate with changes in cohort word selection range for 4-phone length cohort word KV . . . . 117

4.8 Trends in equal error rate with changes in cohort word selection range for 6-phone length cohort word KV . . . . 118

4.9 Trends in equal error rate with changes in cohort word selection range for 8-phone length cohort word KV . . . . 118

4.10 Trends in equal error rate with changes in MED cost parameters . . . . 120

4.11 Correlation between unfused system performances and fused system performances . . . . 127

4.12 Boxplot of EERs for all evaluated architectures and phone-lengths . . . . 131

4.13 Boxplot of log(EERs) for all evaluated architectures and phone-lengths . . . . 132

5.1 Segment of phone lattice for an instance of the word STOCK . . . . 142

5.2 Effect of lattice traversal token parameter . . . . 159

5.3 Trends in miss rate and FA/kw rate performance for various types of tuning . . . . 164

5.4 Plot of miss rate versus FA/kw rate for HMM, CLS and DMLS systems evaluated on Switchboard . . . . 168

5.5 The relationship between cost matrices for subsequences . . . . 169

5.6 Demonstration of the MED prefix optimisation algorithm . . . . 170

6.1 Effect of training dataset size on speech recognition [24] . . . . 183

6.2 Trends in miss rate across training database size . . . . 190

6.3 Trends in FA/kw rate across training database size . . . . 190

6.4 DET plot for T16 experiments. 1=T16S3E, 2=T16S2E, 3=T16S1E, 4=T16S2S, 5=T16S1S . . . . 193

6.5 DET plot for M16 experiments. 1=M16S3E, 2=M16S2E, 3=M16S1E, 4=M16S2S, 5=M16S1S . . . . 193


6.6 DET plot for M32 experiments. 1=M32S3E, 2=M32S2E, 3=M32S1E, 4=M32S2S, 5=M32S1S . . . . 193

6.7 Trends in EER across training dataset size . . . . 195

6.8 DET plot for S2S experiments. 1=T16S2S, 2=M16S2S, 3=M32S2S . . . . 196

6.9 DET plot for S1I experiments. 1=T16S1I, 2=M16S1I, 3=M32S1I . . . . 197

6.10 Extrapolations of Indonesian keyword spotting performance using larger sized databases . . . . 199

A.1 Example of cost matrix calculated using Levenstein algorithm for transforming deranged to hanged. Cost of substitutions, deletions and insertions all fixed at 1, cost of match fixed at 0 . . . . 220


List of Abbreviations

ADI Audio Document Indexing

CAR Confusable Acoustic Region

CLS Conventional Lattice-based Spotting

CMS Cepstral Mean Subtraction

CW Cohort Word

DAR Disparate Acoustic Region

DET Detection Error Trade-off

DMLS Dynamic Match Lattice Spotting

EER Equal Error Rate

FA False Alarm

GMM Gaussian Mixture Model

HMM Hidden Markov Model

IRDL Indexed Reverse Dictionary Lookup

KS Keyword Spotting

KV Keyword Verification

LVCSR Large Vocabulary Continuous Speech Recognition

MED Minimum Edit Distance

MLP Multi-Layer Perceptron

PLP Perceptual Linear Prediction

RDL Reverse Dictionary Lookup


ROC Receiver Operating Characteristic

SBM Speech Background Model

SBM-KS Speech Background Model based Keyword Spotting

SBM-KV Speech Background Model based Keyword Verification

STT Speech-to-Text Transcription

SWB1 Switchboard-1

TAR Target Acoustic Region

WSJ1 Wall Street Journal 1


Authorship

The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher educational institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signed:

Date:


Acknowledgments

Foremost I would like to acknowledge my Lord and Saviour Jesus Christ. It is by His grace that I was given the opportunity and necessary abilities to partake in this research.

I would also like to thank my beautiful wife, Melenie, who has been a constant source of support and inspiration. Your words of encouragement have seen me through the more difficult and frustrating times of this work.

To my supervisor, Professor Sridha Sridharan, I would like to offer my heartfelt gratitude for your unrelenting support in bringing this research to completion. Your positive words and guidance have been a true blessing.

I would also like to offer a special thanks for the friendship of the members of the QUT Speech Research Labs. In particular, I would like to thank Terry Martin, Robbie Vogt, Michael Mason and Brendan Baker for their constructive criticism as well as their constant joviality.

Finally, I would like to thank my loving two families for believing in and supporting me during this long venture, and my wonderful dogs for always giving me a reason to smile.

Kit Thambiratnam

Queensland University of Technology

February 2005


Chapter 1

Introduction

1.1 Overview

Keyword Spotting (KS) is the automated task of detecting keywords of interest within continuous speech. This technology has been used in a variety of applications, ranging from telephone call centre systems to covert surveillance applications. Keyword spotting is closely related to the task of speech transcription, but offers many advantages for certain applications.

Primarily, keyword spotting is well suited to data-mining tasks that process large amounts of speech. This is because keyword spotting requires significantly less processing power than transcription, and can therefore run at considerably faster speeds. Real-time stream monitoring is one such example where this is required. These applications monitor audio in real-time and flag occurrences of segments of interest, such as news stories related to a specific topic. Clearly, the majority of the stream does not require attention, and therefore a keyword spotting solution that simply detects occurrences of topical keywords will be more efficient than a fully-fledged large vocabulary transcription engine.

Keyword spotting is also an excellent technology for audio search applications, such as audio document indexing. In particular, recent developments in KS including lattice-based searching and reverse dictionary lookup methods have made possible the development of unrestricted vocabulary audio document database search engines that can search hours of data in seconds.

However, many keyword spotting technologies are encumbered by poor detection performance or slow search speeds. There is a trade-off between accuracy and speed that needs to be managed, and unfortunately to date, many practical keyword spotting applications are forced to sacrifice detection performance to realise the execution speeds required for commercial deployment. One has only to use speech-recognition-enabled telephony services such as telephone banking to conclude that these systems are far from perfect.

Nevertheless, keyword spotting is a powerful and relevant technology. Used appropriately, a keyword spotting solution brings with it reduced computational requirements, increased scalability and potentially higher accuracies than a large vocabulary transcription system.

1.1.1 Aims and Objectives

This work specifically examines the application of keyword spotting technologies to two data mining tasks: real-time keyword monitoring and large audio document database indexing. With the ever-increasing amounts of audio and multimedia being generated daily, the ability to extract information from audio streams at high speeds while maintaining good detection rates is paramount.

A desirable feature of data mining applications is the support for unrestricted vocabulary keyword queries. However, a significant portion of past keyword spotting research has dealt primarily with restricted vocabulary methods. Although these approaches offer advantages in terms of detection and false alarm performance, they limit the flexibility of queries. As such, this work concerns itself solely with the study of unrestricted vocabulary keyword spotting techniques.

Data throughput is another major consideration when dealing with large amounts of data. Although the cost of computing is constantly becoming cheaper, it is nevertheless beneficial to run at high speeds. This is particularly true for audio indexing applications, where literally hundreds of hours may need to be interactively searched by a user. Unfortunately, many published KS works neglect to consider execution time during experimentation. This research will therefore give considerable attention to the issue of processing speed.

The primary objectives of this thesis are as follows:

1. To review and investigate current state-of-the-art keyword spotting techniques that are relevant to the tasks of real-time keyword monitoring and audio document indexing

2. To assess and evaluate the performance of these techniques with regard to crucial performance metrics relevant to the target applications, and as such, identify potential issues that need to be addressed

3. To investigate and develop novel techniques that can be used to improve the performance of keyword spotting techniques for data mining applications

4. To investigate the application of keyword spotting technologies for non-English data mining

1.1.2 Research Scope

Keyword spotting encompasses a plethora of speech recognition research topics that unfortunately cannot be fully addressed in a single work. As such, the scope of this research was limited to issues that were directly related to the application of keyword spotting to real-time keyword monitoring and audio document indexing. Additionally, the following restrictions and constraints were applied to this research:

1. Primarily, this work concerns itself with the application of HMM-based speech recognition techniques to the keyword spotting task. Alternate statistical modeling approaches, such as neural network techniques, have been proposed and demonstrated to be suitable for keyword spotting. However, it is believed that the HMM-based approach provides a greater degree of flexibility, particularly with regard to unrestricted vocabulary tasks, and as such is the modeling architecture of choice for this research.

2. Experiments reported within this work are limited to single keyword detection. Although most practical applications of keyword spotting use multi-word detection during a single pass, it is believed that research constrained to single keyword detection offers a number of advantages. Primarily, it allows ease of comparison between results in this thesis and other published works. Additionally, the variability in performance due to different mixtures of words within a multi-word keyword set can be avoided, thereby ensuring greater consistency between experiments. Finally, it is believed that trends in single keyword spotting across methods will easily translate to multi-word keyword spotting tasks, and as such, this restriction does not limit the value of this research.

1.2 Thesis Organisation

An overview of the organisation of this thesis is given below:

Chapter 2 - A Review of Keyword Spotting presents a thorough review of keyword spotting and associated technologies. A formal definition of the keyword spotting problem is given, as well as a discussion of its primary applications. This is followed by an overview of the key performance metrics that are relevant to evaluating and understanding keyword spotting methodology. A detailed review of KS literature is then presented, covering the topics of unrestricted and restricted spotting techniques, non-keyword modeling architectures, keyword verification and confidence scoring methods, and audio indexing approaches.

Chapter 3 - HMM-based Spotting and Verification discusses and evaluates existing HMM-based keyword spotting and verification techniques. Such methods have a strong following within the keyword spotting community. However, to date, there has been little published work that compares the performances of the various approaches, and what little has been published has primarily focused on measuring performance for simplistic domains such as read microphone speech. A number of HMM-based techniques are evaluated in this chapter and the strengths and weaknesses of these methods are discussed.

Chapter 4 - Cohort Word Verification proposes a novel keyword verification approach that combines high level linguistic information with cohort-based verification techniques to yield improved performance. A number of experiments are reported to measure the performance of this method for the conversational telephone speech and read microphone speech domains. The results demonstrate that significant gains can be obtained, particularly for the difficult task of short-word keyword verification. In addition, experiments are performed using a fused architecture that combines cohort word verification with traditional background model based verification. Further gains in performance are obtained using this approach.

Chapter 5 - Dynamic Match Lattice Spotting proposes and evaluates a novel audio indexing technique. Although existing unrestricted audio indexing methods are capable of very fast search speeds, they are encumbered by very poor miss rate performance. It is argued here that this poor miss rate is a result of inherent phone recogniser errors that are not accommodated by these techniques. As such, a new method of lattice-based searching is proposed that incorporates dynamic sequence matching methods to provide robustness against erroneous lattice realisations. The results demonstrate that dramatic gains in performance can be obtained while still maintaining extremely fast search speeds.

Chapter 6 - Non-English Spotting studies the application of keyword spotting technologies to non-English languages. In particular, it examines the effects of limited training data on keyword spotting performance. The lack of availability of non-English training data has greatly hindered the development of other speech technologies such as large vocabulary speech transcribers. However, keyword spotting is a significantly more constrained task, and therefore may be less affected by reduced amounts of training data. If so, this may allow the immediate development of speech technologies for non-English languages without the need for the costly task of creating large training databases.

Chapter 7 - Summary, Conclusions and Future Work presents the summary and conclusions of this work as well as a discussion of future research directions.

1.3 Major Contributions of this Research

This work has generated a number of novel contributions to the field of keyword spotting. These are:

1. The development of the novel Cohort Word Verification technique. This method combines high level linguistic knowledge with cohort-based verification techniques to yield significant improvements, particularly for the problematic area of short-word keyword verification.

2. The use of multiple keyword verifier fusion, in particular applied to the combination of cohort word verification with existing HMM-based techniques. It is demonstrated that such fusion techniques allow the strengths of individual verifiers to be combined to yield considerable improvements in verification performance.

3. The development of the novel Dynamic Match Lattice Spotting approach. This technique augments existing lattice-based audio indexing techniques with dynamic sequence matching to improve robustness to erroneous lattice realisation. The resulting algorithm is capable of searching hours of speech using seconds of processor time while maintaining good miss and false alarm rates.

4. A detailed study of the effects of limited training data for keyword spotting, as well as how this impacts the immediate development and deployment of speech technologies for non-English languages.

1.4 List of Publications

The research presented in this thesis has resulted in the publication of a number of fully refereed, peer-reviewed works.

1. K. Thambiratnam and S. Sridharan, "Isolated word verification using Cohort Word-level Verification", in Proceedings of the European Conference on Speech Communication and Technology (EUROSPEECH), (Geneva, Switzerland), 2003.

2. K. Thambiratnam and S. Sridharan, "A study on the effects of limited training data for English, Spanish and Indonesian keyword spotting", in Proceedings of the 10th Australian International Conference on Speech Science and Technology (SST), (Sydney, Australia), 2004.

3. T. Martin, K. Thambiratnam and S. Sridharan, "Target Structured Cross Language Model Refinement", in Proceedings of the 10th Australian International Conference on Speech Science and Technology (SST), (Sydney, Australia), 2004.

4. K. Thambiratnam and S. Sridharan, "Fusion of cohort-word and speech background model based confidence scores for improved keyword confidence scoring and verification", in Proceedings of the IEEE 3rd International Conference on Sciences of Electronic, Technologies of Information and Telecommunications, (Susa, Tunisia), 2005.

5. K. Thambiratnam and S. Sridharan, "Dynamic match phone-lattice searches for very fast and accurate unrestricted vocabulary keyword spotting", in Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), (Philadelphia, USA), 2005.


Chapter 2

A Review of Keyword Spotting

2.1 Introduction

This chapter presents a comprehensive review of keyword spotting technologies to date. Section 2.2 gives a formal definition of the keyword spotting problem and is followed by a discussion of the various applications of keyword spotting in section 2.3. A brief synopsis of the development of keyword spotting research is provided in section 2.4, as well as a detailed description of how keyword spotting performance is measured in section 2.5.

Subsequent sections discuss the current methods of keyword spotting with respect to their key applications. Section 2.6 discusses a number of algorithms for unconstrained vocabulary keyword spotting. This is followed by a description of the various approaches to non-keyword modeling in section 2.7. Approaches to constrained vocabulary keyword spotting are then presented in section 2.8, as well as methods for keyword occurrence verification in section 2.9. Finally, methods of applying KS to the task of audio document indexing are discussed in section 2.10.


2.2 The keyword spotting problem

Keyword spotting can be viewed as a special case of Speech-to-Text Transcription (STT), in which the transcription vocabulary is restricted to keywords of interest plus a non-keyword symbol that is used to represent all other words in the target application domain.

Let O be an observation sequence, V be the vocabulary of the target application domain, Q be the set of keywords of interest and Ω be the non-keyword symbol. If STT is represented as the transformation W = Transcribe(O, V), where W = w_1, w_2, ... is the resulting hypothesised word sequence, then the keyword spotting task can be defined as

KS(O, V, Q) = f(Transcribe(O, V), Q)    (2.1)

where f(W, Q) is a transformation applied to the output of STT and is given by

f(W, Q) =
\begin{cases}
W & \text{if } |W| = 1,\ w_1 \in Q \\
\Omega & \text{if } |W| = 1,\ w_1 \notin Q \\
w_1,\ f(\mathrm{Tail}(W), Q) & \text{if } |W| > 1,\ w_1 \in Q \\
\Omega,\ f(\mathrm{Tail}(W), Q) & \text{if } |W| > 1,\ w_1 \notin Q,\ w_2 \in Q \\
f(\mathrm{Tail}(W), Q) & \text{otherwise}
\end{cases}

and \mathrm{Tail}(\{x_i\}_{i=1}^{N}) = \{x_i\}_{i=2}^{N}.

f(W, Q) essentially replaces each maximal sequence of consecutive non-keywords in the word sequence output by the transcriber with a single non-keyword symbol Ω.
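To make the behaviour of f(W, Q) concrete, the following Python sketch (illustrative only; the function name collapse_non_keywords and the placeholder symbol NON_KEYWORD are not from the thesis) applies the same transformation iteratively rather than recursively: each maximal run of non-keywords in the hypothesised word sequence is collapsed into a single non-keyword symbol, while keywords are passed through unchanged.

# Illustrative sketch of the transformation f(W, Q) in equation 2.1
# (not code from the thesis). Each maximal run of consecutive
# non-keywords in the hypothesised word sequence W is replaced by a
# single non-keyword symbol; keywords are copied through unchanged.
NON_KEYWORD = "<non-kw>"   # stands in for the symbol Omega

def collapse_non_keywords(word_sequence, keywords):
    keywords = set(keywords)
    output = []
    for word in word_sequence:
        if word in keywords:
            output.append(word)
        elif not output or output[-1] != NON_KEYWORD:
            # first word of a new run of non-keywords: emit one Omega
            output.append(NON_KEYWORD)
        # otherwise this run of non-keywords is already represented
    return output

# Example: with Q = {"stock"}, the hypothesised sequence
# "the stock market fell" maps to [Omega, "stock", Omega].
print(collapse_non_keywords("the stock market fell".split(), {"stock"}))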

Although valid, this formulation of keyword spotting is inefficient, as it requires full transcription using a vocabulary of size |V|. Typically keyword spotting is only interested in occurrences of a much smaller set of words defined by Q. Given this simplification, a more practical and efficient formulation of keyword spotting is

KS(O, V, Q) = Transcribe(O, g(Q))    (2.2)

where g(Q) = Q ∪ {Ω}.

This alternate approach requires transcription using a much smaller vocabulary of size |Q| + 1. Clearly, this is a considerably less computationally intensive task than transcription using the formulation in equation 2.1. However, it introduces the additional burden of an acoustic model representation of the non-keyword symbol Ω. Definition of the non-keyword symbol is one of the active areas of keyword spotting research and is discussed further in section 2.7.
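A minimal sketch of the more practical formulation in equation 2.2, under the assumption that a decoder is available behind a hypothetical transcribe(observations, vocabulary) interface (this interface is assumed for illustration and is not defined in the thesis): the spotter simply decodes against the reduced vocabulary g(Q) = Q ∪ {Ω} instead of the full application vocabulary V, so the decoding vocabulary contains only |Q| + 1 entries.

# Illustrative sketch of equation 2.2 (not code from the thesis). The
# transcribe() callable is assumed to decode an observation sequence
# against a given vocabulary; a real system would also need an acoustic
# model for the non-keyword symbol Omega, as discussed in section 2.7.
NON_KEYWORD = "<non-kw>"

def keyword_spot(observations, keywords, transcribe):
    spotting_vocabulary = set(keywords) | {NON_KEYWORD}   # size |Q| + 1, not |V|
    return transcribe(observations, spotting_vocabulary)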

2.3 Applications of keyword spotting

Keyword spotting lends itself to a plethora of speech-enabled applications, and is particularly well suited to applications where large amounts of speech need to be processed. This is because it offers a significant speed benefit over a large vocabulary STT approach. Four major applications of this technology are keyword monitoring, audio document indexing, command controlled devices and dialogue systems.

2.3.1 Keyword monitoring applications

Keyword monitoring applications are required to continuously monitor a real-time stream of audio and to flag any occurrences of a keyword in the query set. Specific keyword monitoring applications include telephone tapping, listening device monitoring and broadcast monitoring.

Telephone tapping and listening device monitoring are used extensively by security organisations to detect criminal or malicious activity. Keyword spotting provides a fast and automatic solution to this task and potentially a higher detection accuracy than human monitoring, particularly when a very large number of audio streams needs to be monitored. However, these applications create a considerable challenge for keyword spotting because of the noisy nature of the speech being monitored. Telephone conversations may be plagued with significant background noise, multiple languages and even multiple speakers, providing challenges for acoustic modeling. Listening device audio may suffer from very low signal-to-noise ratios, a difficulty for any speech processing application.

Broadcast monitoring is actively performed by commercial broadcast monitoring companies to locate segments that may be of interest to a client. For example, a senator may be interested in all news stories in which he or she is mentioned; broadcast monitoring organisations provide such a service for a fee. A significant challenge of broadcast monitoring is the amount of audio that needs to be processed daily. Broadcast monitoring clients may be interested in stories from a comprehensive set of broadcast sources, including free-to-air television, cable television, commercial radio and community radio. The vast number of these sources, combined with the fact that many of them broadcast continually 24 hours a day, 7 days a week, makes broadcast monitoring a very data intensive problem.

Keyword spotting provides an excellent solution to all of these keyword monitoring tasks. Faster-than-real-time keyword spotting technologies are likely to process audio faster than a human operator, and the accuracy of an automatic system is also likely to exceed that of a human, since computers do not suffer from the fatigue and mental distractions that plague a human operator. Keyword spotting is particularly well suited to the broadcast monitoring task since audio in this domain is usually of much higher quality than telephone and listening device audio.

2.3.2 Audio document indexing

Audio document indexing is the task of rapidly searching an audio document

database for keywords and topics of interest. This functionality is analogous

to traditional text document indexing systems such as the Google [11] Internet

search engine, but operates on audio documents instead. The need for efficient

and fast audio document indexing is paramount in a world where audio and

multimedia documents play a greater role in everyday life.

STT systems are one solution to the audio document indexing problem. Audio

is first transcribed to text that can then be rapidly searched during query time.

However, many applications of audio document indexing, such as news database

searching, require support for proper noun queries such as names, places and

foreign words — terms that in many cases are not a part of the transcription

system’s vocabulary. As such, alternatives to the STT-based approach that do not

constrain the query vocabulary are required.

Thankfully, a keyword spotting solution does provide support for unrestricted

vocabulary queries. The trade-off though is a reduction in query speeds, since

most KS approaches are nowhere near as fast as text-based searching methods.

Nevertheless, the support for unrestricted vocabulary queries is important, and

as such, a keyword spotting system can be used to augment an STT-based system

to provide very fast queries for in-vocabulary words while still supporting out-of-

vocabulary queries.

2.3.3 Command controlled devices

Command controlled devices monitor the ambient audio and react when they

detect specific command words. Examples of command controlled devices are

speech-enabled mobile phones, voice-controlled VCRs and command-controlled

factory machinery.

Although generic keyword monitoring technologies can be used for command

controlled devices, they typically impose processing and memory requirements that are

too high to be feasible, especially in the case of DSP-based or embedded applications.

Additionally, the query terms of command controlled devices tend to be fixed,

allowing more application-specific information to be incorporated into the key-

word detection process. This includes query word linguistic context information

and environmental noise conditions.

Hence command controlled device KS lends itself to the development of cus-

tom solutions. Though many of these solutions may be based on existing key-

word spotting approaches, significant enhancements and modifications are made

to provide maximum performance for the intended application.

2.3.4 Dialogue systems

Automated dialogue systems are becoming more common in the commercial en-

vironment as a viable alternative to human-operated call centres. A dialogue

system mimics a human call-centre operator by playing voice prompts to a caller

and then attempting to detect keywords that indicate the response of the caller.

Since the volume of calls processed by a call-centre can be very large, large vo-

cabulary STT approaches have proven infeasible due to their high computational

requirements. Instead restricted grammar speech recognisers or keyword spotting

technologies are used to interpret the response of callers.

Keyword spotting approaches offer a benefit over restricted grammar speech

recogniser approaches because they allow greater flexibility in the response of the

speaker. This is because KS accommodates out-of-vocabulary words by means

of non-keyword modeling. However, a cleverly constructed restricted grammar

speech recogniser can better understand the intention of a caller using contextual

information, and therefore may prove more appropriate for certain applications.

2.4 The development of keyword spotting

In a similar fashion to general speech recognition theory, keyword spotting has un-

dergone a number of generations of development. Early approaches were constrained

by limited computing resources, and hence KS research was restricted to simpler tasks

such as isolated keyword detection. As speech recognition technology matured,

more advanced tasks were explored, such as the detection of keywords embedded

in noise or continuous speech.

2.4.1 Sliding window approaches

Initial methods focused on using sliding window approaches such as the dynamic

time warping approaches proposed by Sakoe and Chiba [29] and Bridle [6], or

the sliding window based neural network method prescribed by Zeppenfeld [40].

Such techniques yielded acceptable results in isolated keyword spotting tasks,

but suffered from considerable drops in performance when spotting keywords

embedded in continuous speech.

A major reason for this drop in performance was that sliding window ap-

proaches did not model non-keywords either implicitly or explicitly. Spotting

of keywords in continuous speech is essentially a 2-class discrimination task, at-

tempting to classify regions as either a keyword or a non-keyword instance. Since

the traditional sliding window approaches did not model non-keywords, they es-

sentially were attempting discrimination with only knowledge of the target class.

This was analogous to making measurements without a point of reference - all ob-

servations were purely relative and therefore provided little confidence for making

absolute decisions.

2.4.2 Non-keyword model approaches

To address the lack of knowledge of the non-target class, the concept of non-

keyword models (also known as filler models) was introduced into keyword spot-

ting. Non-keyword models attempted to model all speech that did not form a

part of the target keyword speech. For example, in a closed vocabulary system, a

non-keyword model would attempt to model all words in the vocabulary except

for the target keywords. Using a non-keyword model provided more confidence

when accepting or rejecting putative instances of target keywords compared to

the sliding window approaches because a comparison was being made between

the target keyword model and the non-keyword model.

One of the initial approaches used to incorporate non-keyword models was pro-

posed by Higgins and Wohlford [13]. Here a DTW-based continuous speech recog-

niser was modified to use filler non-keyword models to represent non-keyword

speech. The modified speech recogniser was then used to transcribe continuous

speech into regions of keywords and non-keywords. Finally, a likelihood ratio

was used to normalise keyword likelihoods by the corresponding likelihood of the

non-keyword model over the same observation sequence. Non-keyword models in

this particular approach were constructed from pieces and subsequences of the

target keyword.

The introduction of non-keyword models into keyword spotting saw the fu-

sion of continuous speech recognition research with keyword spotting techniques.

Whereas previously KS approaches had exclusively used sliding window tech-

niques, the use of non-keyword models required a paradigm shift into the speech

recognition context. Specifically, keyword spotting could be simply viewed as a

special case of continuous speech recognition, where all non-keyword speech was

labeled with a single non-keyword tag. Operating within the speech recognition

framework allowed the latest developments in continuous speech recognition such

as advances in modeling techniques to be transferred to the KS domain. Hence,

keyword spotting research began to more closely follow the trends of speech recog-

nition research.

2.4.3 Hidden Markov Model approaches

The advent of Hidden Markov Model (HMM) based speech recognition led to

the introduction of HMM-based keyword spotting techniques. As for DTW-based

keyword spotting, HMM-based keyword spotting could be viewed as a special case

of HMM-based speech recognition, where all non-target words were represented

by a non-keyword model.

One common approach was to use a word loop consisting of all target keywords

in parallel with the non-keyword model. Target keywords were typically modeled using

either word models or sub-word models, while non-keyword speech was modeled

using a plethora of architectures, including a high-order Gaussian Mixture Model

as prescribed by Wilpon et al. [35] or a monophone model set as suggested by Rose

and Paul [28]. This led to the development of better-performing KS systems,

paving the way to more complex keyword spotting applications.

2.4.4 Further developments

Advances in high-level linguistic modeling through recognition grammars and lan-

guage modeling were also incorporated into keyword spotting. These advances

were motivated by the need to reduce false alarm rates of KS systems through

the use of contextual information, specifically to reduce or constrain the emission

of false putative keyword occurrences. Kenji et al. [18] and Gou et al. [12] both

described techniques of incorporating finite state grammars into the spotting pro-

cess. The reported experiments demonstrated significant gains in performance for

simple recognition grammar applications compared to non-grammar-constrained

approaches. However, such systems were less flexible since they were configured

specifically for a target grammar, limiting the ability to be easily ported to a

different recognition grammar environment.

Other experiments reported by Rohlicek et al. [27] and Jeanrenaud et al.

[16] described how bigram language modeling concepts could be incorporated

into KS to improve performance. Results demonstrated that such methods were

particularly well suited for event spotting (spotting a set of keywords belonging to

a specific class such as dates). However, such approaches resulted in significantly

greater computational burden due to increased recognition network complexity.

Finally, the increased importance of audio and multimedia introduced the

requirement for fast unrestricted vocabulary keyword spotting in large audio

databases. Prior to this, the majority of keyword spotting research had focused

on real-time monitoring style applications or telephone dialogue systems. Al-

though these existing technologies were fast, they did not provide the very high

speed and scalability required for large audio database data mining.

This saw the introduction of two-stage algorithms such as that proposed by

Young and Brown [39]. Such algorithms first transcribed speech to a low-level

textual representation (eg. phone labels) that could then be searched very quickly

at query time for a target keyword of interest. This resulted in query speeds sev-

eral orders of magnitude faster than previously obtained using existing keyword

spotting methods.

2.5 Performance Measures

A broad range of metrics are used to measure the performance of keyword spotting

algorithms. Understanding these measures is essential both for the purpose of

discussing algorithms as well as comprehending the significance of results. This

section defines the metrics and terms used in keyword spotting literature and

within this work.

2.5.1 The reference and result sets

The output of a keyword spotting system is a set of tuples, Ψ, representing

putative keyword occurrences. Each tuple consists of an utterance identifier, a

keyword tag, a start time and an end time.

When evaluating performance, this result set is scored against a reference

set of tuples, Γ, containing the true occurrences of the keywords to be detected.

Formally, these results are defined as follows:

\[ \Gamma = \{ \gamma_1, \gamma_2, \ldots, \gamma_N \} \tag{2.3} \]

\[ \Psi = \{ \psi_1, \psi_2, \ldots, \psi_M \} \tag{2.4} \]

where $\gamma_i = (u_i^r, w_i^r, s_i^r, e_i^r)$ and $\psi_j = (u_j^p, w_j^p, s_j^p, e_j^p)$, and

$u_i^r, u_i^p$ = $i$th reference/putative occurrence's utterance identifier

$w_i^r, w_i^p$ = $i$th reference/putative occurrence's keyword tag

$s_i^r, s_i^p$ = $i$th reference/putative occurrence's keyword start time

$e_i^r, e_i^p$ = $i$th reference/putative occurrence's keyword end time

2.5.2 The hit operator

The hit operation is used to determine whether a reference occurrence was suc-

cessfully detected. In this work, the following definition of a hit is used

Hit Operator. Given a reference occurrence γ and a putative result set, Ψ,

then the reference occurrence γ is declared a hit if the mid-point of the reference

occurrence falls within the time boundaries of one of the putative occurrences in

Ψ, and the respective keyword tags and utterance identifiers are equal.

A similar hit definition is used in other keyword spotting literature and soft-

ware, including the HTK software package.

Formally, the hit operation is defined as follows:

\[ \gamma_i \odot \psi_j = \begin{cases} 0 & u_i^r \neq u_j^p \\ 0 & w_i^r \neq w_j^p \\ 0 & s_j^p > (s_i^r + e_i^r)/2 \\ 0 & e_j^p < (s_i^r + e_i^r)/2 \\ 1 & \text{otherwise} \end{cases} \tag{2.5} \]
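
As an illustration, the hit operator translates directly into code. The following Python sketch is not from this thesis; it assumes reference and putative occurrences are stored as simple (utterance, keyword, start, end) tuples.

from typing import NamedTuple

class Occurrence(NamedTuple):
    utt: str      # utterance identifier
    word: str     # keyword tag
    start: float  # start time in seconds
    end: float    # end time in seconds

def hit(ref: Occurrence, put: Occurrence) -> int:
    # Hit operator of equation 2.5: returns 1 if the putative occurrence detects the reference.
    if ref.utt != put.utt or ref.word != put.word:
        return 0
    mid = (ref.start + ref.end) / 2.0
    # the reference mid-point must fall inside the putative time boundaries
    return 1 if put.start <= mid <= put.end else 0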

2.5.3 Miss rate

In keyword spotting literature, both miss rate and its converse, hit rate, are used

predominantly as measures of reference occurrence detection performance. The

miss rate is defined as

Miss rate. Given a reference occurrence set Γ and a putative result set Ψ, the

miss rate is defined as the proportion of elements of Γ that were not hit by at

least one element in Ψ.

Formally, the miss rate is defined in terms of the hit operator as follows:

\[ \text{MissRate}(\Gamma, \Psi) = \frac{|\Gamma| - |\text{HitSet}(\Gamma, \Psi)|}{|\Gamma|} \tag{2.6} \]

where $\text{HitSet}(\Gamma, \Psi) = \{ \gamma \in \Gamma \mid \textstyle\sum_j (\gamma \odot \psi_j) > 0 \}$

Hit rate is the converse of miss rate and is defined as

\[ \text{HitRate}(\Gamma, \Psi) = 1 - \text{MissRate}(\Gamma, \Psi) \tag{2.7} \]

2.5.4 False alarm rate

False alarm rate is a measure of the number of incorrect results output by a

keyword spotter. A false alarm is defined as

False Alarm. A member of the result set Ψ that does not hit any of the members

of the reference set Γ.

Two different definitions for false alarm rate are used in literature. The more

common of the two and the definition used within this work is

False Alarm Rate. The number of false alarms in the keyword spotting result

set normalised by the total duration of evaluation speech searched and the number

of unique keywords searched for.

False alarm rate can be expressed formally as

\[ \text{FARate}(\Gamma, \Psi, W, T) = \frac{|\text{FASet}(\Gamma, \Psi)|}{|W| \cdot T} \tag{2.8} \]

where $\text{FASet}(\Gamma, \Psi) = \{ \psi \in \Psi \mid \textstyle\sum_j (\gamma_j \odot \psi) = 0 \}$

$W$ = list of keywords being queried for

$T$ = duration of speech searched in hours

This definition of false alarm rate is used when measuring the overall keyword

spotting performance of a system. The alternate definition of false alarm rate, referred

to as false acceptance rate in this work, is typically more useful in measuring the

performance of the keyword confidence scoring and verification stages of a KS

system.

2.5.5 False acceptance rate

The false acceptance rate is an alternate measure to the false alarm rate that re-

flects the impurity of a result set. Specifically, the false acceptance rate measures

what percentage of the final result set is comprised of false alarms. The measure

is defined as follows:

False Acceptance Rate. The number of false alarms in the keyword spotting

result set normalised by the total size of the result set.

Formally, the false acceptance rate is defined as:

\[ \text{FalseAcceptRate}(\Gamma, \Psi) = \frac{|\text{FASet}(\Gamma, \Psi)|}{|\Psi|} \tag{2.9} \]
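
A minimal sketch of these scoring metrics (equations 2.6, 2.8 and 2.9), reusing the hit operator sketched in section 2.5.2; the argument names are illustrative, and hours corresponds to the duration T of the searched speech.

def miss_rate(refs, puts):
    # Equation 2.6: fraction of reference occurrences not hit by any putative occurrence.
    n_hit = sum(1 for r in refs if any(hit(r, p) for p in puts))
    return (len(refs) - n_hit) / len(refs)

def false_alarm_rate(refs, puts, num_keywords, hours):
    # Equation 2.8: false alarms per keyword per hour of searched speech.
    n_fa = sum(1 for p in puts if not any(hit(r, p) for r in refs))
    return n_fa / (num_keywords * hours)

def false_accept_rate(refs, puts):
    # Equation 2.9: fraction of the result set made up of false alarms.
    n_fa = sum(1 for p in puts if not any(hit(r, p) for r in refs))
    return n_fa / len(puts)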

2.5.6 Execution time

In applications that require high query speeds, it is necessary to measure the

execution time of algorithms. A convenient measure of execution time for keyword

spotting is

Execution time. Given a query word set, W , execution time is defined as the

total number of minutes of CPU time used for execution normalised by the size

of the query word set and the number of hours of speech searched. CPU time is

measured using a 3.0GHz Pentium 4 processor.

Formally, execution time is expressed as:

\[ \text{ExecutionTime} = \frac{t_{cpu}}{|W| \cdot T} \tag{2.10} \]

where $t_{cpu}$ = total CPU minutes for execution

$W$ = query word set

$T$ = total hours of speech searched

2.5.7 Figure of Merit

A frequently quoted metric in literature is the Figure Of Merit (FOM). This

metric is a measure of the average hit rate performance across a cross-section of

system operating points. FOM is defined as

Figure Of Merit. The average hit rate taken over false alarm rates from 0

FA/kw-hr to 10 FA/kw-hr.
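
One common way of computing an FOM from scored putative occurrences is sketched below, reusing the hit operator from section 2.5.2. It is an approximation of the definition above (the hit rate is sampled after 1 to 10 allowed false alarms per keyword-hour and averaged), not the exact evaluation code used in this work, and the data layout is assumed.

def figure_of_merit(refs, scored_puts, num_keywords, hours):
    # Approximate FOM: mean hit rate at 1..10 false alarms per keyword-hour.
    # scored_puts is a list of (score, Occurrence) pairs; higher scores are better.
    ranked = [p for _, p in sorted(scored_puts, key=lambda sp: sp[0], reverse=True)]
    detected = set()         # indices of reference occurrences hit so far
    hit_rate_after_fa = []   # hit rate observed after each successive false alarm
    for p in ranked:
        matched = [i for i, r in enumerate(refs) if hit(r, p)]
        if matched:
            detected.update(matched)
        else:
            hit_rate_after_fa.append(len(detected) / len(refs))
    if not hit_rate_after_fa:                  # no false alarms were produced at all
        return len(detected) / len(refs)
    rates = []
    for a in range(1, 11):                     # a = allowed FA per keyword-hour
        k = max(1, int(a * num_keywords * hours))
        idx = min(k, len(hit_rate_after_fa)) - 1
        rates.append(hit_rate_after_fa[idx])
    return sum(rates) / len(rates)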

The FOM metric has a number of issues that restrict its usefulness. Most

importantly, it is not a good comparative measure for systems where low miss

rate (or high hit rate) is more important than low false alarm rates. This is

typically the case in security surveillance applications where it is important to

miss as few events as possible. For such applications, a better alternative would

be to take the average false alarm rate across a cross-section of low miss rates.

Additionally, the FOM cannot be used to measure the performance of systems

that do not provide an output score. This is because for such systems it is not

possible to easily tune the system to obtain hit rates at various false alarm rates,

and hence it is difficult to calculate the FOM.

2.5.8 Equal Error Rate

The Equal Error Rate (EER) is a single metric that attempts to provide a figure

of comparison between tunable systems. For keyword spotting, EER is defined

as

Equal Error Rate. The miss rate at the point on the operating characteristic

curve at which the miss rate equals the false acceptance rate.

EERs should be used carefully when comparing systems as they only sum-

marise performance at a single operating point. They do not take into account

important performance considerations such as the gradient of the operating char-

acteristic curve. However they are a good comparative measure when the gradient

and form of the operating characteristic curves being compared are similar.

2.5.9 Receiver Operating Characteristic Curves

A Receiver Operating Characteristic (ROC) curve is a plot of hit rate versus

false acceptance/alarm rate. ROCs provide a view of how keyword spotting

performance varies as the system is tuned. Figure 2.1 gives an example of an

ROC plot.

Figure 2.1: An example of a Receiver Operating Characteristic curve

ROCs are excellent for examining performance at low false acceptance/alarm

rates. However, they provide very poor resolution at low miss rates, making it

difficult to understand how performance varies at low miss rates, as demonstrated

in figure 2.1.

2.5.10 Detection Error Trade-off Plots

A Detection Error Trade-off (DET) plot is a plot of miss rate versus false ac-

ceptance rate on a logarithmic scale. Viewing this data on a logarithmic scale

typically results in a linear operating characteristic curve that is easier to interpret

visually than an ROC curve. Additionally DET plots provide good resolution at

both low miss rates as well as low false acceptance rates. An example of a DET

plot is shown in figure 2.2.

Figure 2.2: An example of a Detection Error Trade-off plot

DET plots may suffer from step effects within the low miss rate regions in some

experiments. Such an effect indicates that the output scores of putative hits are

distinctly segregated from the putative false alarms within the step region. Hence

there are sudden jumps along the curve (ie. a sequence of putative hits followed

by a sequence of false alarms within the result set ordered by output score).

Step effects may also be observed when a small reference set Γ is used. How-

ever, in such small reference set size cases, the step effects would be observed

across all operating points since the lower bound of changes in miss rate would

be 1/|Γ|.

2.6 Unconstrained vocabulary spotting

Unconstrained vocabulary keyword spotting algorithms perform keyword spotting

without placing significant constraints on the query vocabulary. Such systems of-

fer significant flexibility, in particular for applications with a dynamic vocabulary.

News monitoring is an example of such an application, where there is a need to

be able to continually update the query word set to include the latest terms of

interest, such as names, abbreviations and foreign keywords.

An unfortunate byproduct of this flexibility is that it is not possible to in-

corporate prior knowledge about query keywords into the KS search process.

This may result in potentially lower performance than a constrained vocabulary

keyword spotting system.

2.6.1 HMM-based approach

Subword acoustic unit HMM-based speech recognition techniques provide a con-

venient and popular framework for unconstrained vocabulary keyword spotting.

This is because the recognition vocabulary of such systems can be easily extended

by updating the recogniser lexicon, assuming appropriate subword unit acoustic

models (eg. monophone or triphone models) are being used.

A HMM-based keyword spotting system is constructed in a similar fashion

to a word-loop speech recognition system. First a recognition word loop is built

with nodes for each of the words in the query keyword set. Additionally, a non-

keyword node is included to represent non-keyword speech, resulting in a word

network of the form shown in figure 2.3. Then the word network is expanded into

an acoustic model network using a lexicon. The lexicon must contain mappings

for all words in the recognition network as well as the non-keyword word.

Figure 2.3: Recognition grammar for HMM-based keyword spotting

Viterbi decoding is then performed using this network to transcribe a given

observation sequence into a word sequence. The result is a time-marked transcrip-

tion of the observation sequence in terms of keyword symbols and the non-keyword

symbol.
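
To make the construction concrete, the sketch below assembles the kind of parallel word-loop network described above as a plain data structure, expanding each word into a subword model sequence via a toy lexicon. This is an illustration only, not the construction used in this work; the lexicon entries and the NONKEYWORD label are hypothetical, and a real system would compile the result into a decoder-specific network format.

# Toy lexicon mapping words (and the non-keyword label) to subword model sequences.
LEXICON = {
    "henry":      ["hh", "eh", "n", "r", "iy"],
    "william":    ["w", "ih", "l", "y", "ax", "m"],
    "NONKEYWORD": ["<sbm>"],   # e.g. a single speech background model
}

def build_keyword_loop(keywords):
    # Return a list of parallel branches, each a sequence of acoustic model names.
    # The decoder may traverse branches repeatedly to produce a transcription in
    # terms of keyword symbols and the non-keyword symbol.
    branches = []
    for word in list(keywords) + ["NONKEYWORD"]:
        branches.append({"label": word, "models": LEXICON[word]})
    return branches

# Example: build_keyword_loop(["henry", "william"]) yields three parallel branches.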

Modeling of keywords for keyword spotting is straightforward, as well-

established techniques from speech recognition literature can be employed. These

include word-based, phone-based and syllable-based models. However, for uncon-

strained vocabulary applications, phone-based or syllable-based models are more

suitable as they allow a much greater flexibility in the choice of keywords that

can be modeled.

In contrast, modeling of the non-keyword symbol is more difficult. A good

non-keyword model must adequately represent almost the entire vocabulary of an

application domain while still rejecting any instances of keyword speech. Ideally,

given a query word set, W , and an observation sequence O corresponding to the

utterance of a single word ω, the following conditions should hold in a maximum-

likelihood sense:

\[ \max_i p(O|\lambda_{w_i}) - p(O|\lambda_{nkw}) \;\begin{cases} \geq 0 & \text{if } \omega = w_i \\ < 0 & \text{otherwise} \end{cases} \tag{2.11} \]

where $\lambda_{w_i}$ is the acoustic model for word $w_i$ and $\lambda_{nkw}$ is the non-keyword model.
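
In code, the decision implied by this condition is simply a comparison between the best-scoring keyword model and the non-keyword model. The sketch below is a hypothetical illustration; the log-likelihoods are assumed to have been computed elsewhere.

def best_keyword_or_reject(keyword_logliks, nkw_loglik):
    # keyword_logliks: dict mapping keyword -> log-likelihood of its acoustic model
    # nkw_loglik:      log-likelihood of the non-keyword model for the same observations
    # Returns the best keyword if it outscores the non-keyword model, otherwise None.
    best_word = max(keyword_logliks, key=keyword_logliks.get)
    return best_word if keyword_logliks[best_word] >= nkw_loglik else None

# Example: best_keyword_or_reject({"henry": -310.2, "joe": -295.7}, -301.4) returns "joe".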

For domains where the vocabulary is anticipated to be small and well defined,

a possible composite non-keyword model could be constructed from a parallel

combination of all non-keyword words. Figure 2.4 demonstrates a trivial exam-

ple of such a non-keyword model using a small task vocabulary where the male

names Joe, William and Henry are target keywords and the female names Sherita,

Rachel, Jessica, Tiani and Chelsea are non-keywords.

Such a solution becomes more intractable as the size of the vocabulary for non-

keyword speech increases. This is because the given architecture is essentially a

speech recognition solution, attempting to transcribe all speech. Instead, a non-

keyword model representation is required that does not increase in complexity

with the size of the non-keyword speech vocabulary. A number of solutions to

this problem have been proposed in literature and are discussed further in section

2.7.

2.6.2 Neural Network Approaches

Neural network based speech recognition is a popular alternative to the more

common HMM-based approach. In particular, neural networks are a good can-

didate for highly discriminative tasks, such as keyword spotting, since they are

more discriminative than the mixture of Gaussian HMM models typically used

in KS.

Figure 2.4: Sample recognition grammar for small non-keyword vocabulary keyword spotting

However, pure neural network methods have rarely been proposed for keyword

spotting, particularly in recent literature. Those that have proposed this archi-

tecture have mainly used time-modeling architectures such as Recurrent Neural

Networks or Time-Delay Neural Networks (eg. Trentin et al. [33]). One such ex-

ample is the Recurrent Neural Network approach proposed by Jianlai et al. [17].

Unfortunately the reported experiments demonstrated that although this method

was capable of reliably detecting simple signal shapes, such as sinusoids embedded

in continuous signals, performance was less than pleasing for digit spotting.

A more common and far more successful approach to incorporating neural

networks has been through the use of hybrid neural network HMM systems.

Such systems include those prescribed by Bernadis and Bourlard [4], Lippmann

and Singer [21] and Ou et al. [25]. These algorithms attempted to augment

the convenient and well-performing HMM architecture with the discriminative

benefits of neural networks.

Lippmann and Singer [21] performed keyword spotting by using HMM acous-

tic models with Radial Basis Function neural network state distributions. The

motivation here was to use the Radial Basis Function state distribution in place

of the less discriminative mixture of Gaussian architecture normally used with

HMMs. Results demonstrated some appreciable gains in performance.

An alternate hybrid system was proposed by Alvarez-Cercadillo et al. [1].

Here a Recurrent Neural Network decoder was used in conjunction with HMM-

based observation scoring. The method achieved low miss and false alarm rates

demonstrating the suitability of the method for the KS task. However, it must be

noted that very low word error rates (2% and lower) were also reported for speech

recognition using this approach, implying fairly simplistic (eg. single speaker or

small vocabulary) evaluation data sets. The reported results should be considered

in the light of this.

Two stage hybrid systems have also been proposed, such as the methods de-

scribed by Bernadis and Bourlard [4] and Ou et al. [25]. In such approaches, an

initial mixture of Gaussians HMM system was used to generate a set of puta-

tive occurrences. A secondary neural network classifier was then used to further

classify the putative occurrences based on the HMM output likelihoods. Com-

parisons of these methods with standard HMM approaches are difficult because

they implicitly incorporate a secondary confidence scoring stage. It is more than

likely that a KS system with a confidence scoring stage will outperform a single-

stage HMM system since the latter does not incorporate any explicit confidence

scoring. This must be taken into account when considering results reported for

two-stage hybrid systems.

2.7 Approaches to non-keyword modeling

The choice of non-keyword model architecture is a significant consideration for

any keyword spotting and keyword confidence scoring system. These models play

a crucial role in determining the final miss rate and false alarm rates of these

systems since they represent the important non-target class in keyword spotting.

This section discusses a cross-section of non-keyword model approaches proposed

in literature.

2.7.1 Speech background model

Wilpon et al. [35] and Rose and Paul [28] both espoused the use of a single high-

order Gaussian Mixture Model (GMM) for non-keyword modeling. This type

of model is commonly referred to as a universal background model (UBM) or

speech background model (SBM) in literature. Reported experiments demon-

strated that the SBM-based approach was capable of attaining miss rates below

6% for telephone speech keyword spotting.

Figure 2.5 shows the typical architecture for an SBM-based approach. The

motivation for this approach was to capture the acoustic characteristics of all

speech in the target domain within a single non-keyword model. However, as

stated before, an ideal non-keyword model should uphold the condition given

in equation 2.11. In the SBM approach, this condition is not guaranteed to be

maintained since the SBM is trained on speech that may include instances of the

target keywords. As such, for an observation sequence O corresponding to an

instance of word w, the output likelihood of the SBM, p(O|λnkw), may exceed the

output likelihood of the target word model, p(O|λw), thus breaking the condition

in equation 2.11. This is more likely for words with a high frequency count in the

SBM training data.

Figure 2.5: System architecture for HMM keyword spotting using a Speech Background Model as the non-keyword model

However, since the SBM models greater acoustic variance than the target

keyword models, it is likely that the SBM distribution will be flatter than the

target keyword model distributions in the acoustic regions corresponding to target

keyword speech. This would result in average frame likelihoods output by the

SBM being lower than the average frame likelihoods output by the target keyword

models in the acoustic regions of target keyword speech. Hence, although there

would be some overlap with target keyword speech, it is not unreasonable to

expect that the SBM would be sufficiently generalised so as not to dominate

the target keyword models.

Additionally, a significant benefit of the SBM approach is target vocabulary

independence. Since an SBM is trained over all speech, there are no implicit

assumptions made regarding the target vocabulary. In terms of flexibility, then, an

SBM-based system is well suited to the unconstrained vocabulary key-

word spotting task.

2.7.2 Phone models

In the phone model approach, a composite non-keyword model is constructed

from the parallel combination of phone models. Each phone model is assigned an

optional weight, α, allowing prior weighting of phones within the non-keyword

model. The phone models used may be monophones, triphones, or in fact any

appropriate subword model. Figure 2.6 demonstrates the recognition network

required for the phone model approach.

Figure 2.6: System architecture for HMM keyword spotting using a composite non-keyword model constructed from phone models

The phone model approach is similar to the SBM approach, in that all speech

is modeled within the non-keyword model. Hence, there may be similar issues

regarding upholding the ideal non-keyword model condition in equation 2.11 (see

section 2.7.1). In fact, this issue may be further compounded if the same set of

models are being used in the non-keyword model and the target keyword models,

for example, if both used the same triphone model set. If so, then an observation

sequence may be scored using the same sequence of models, resulting in equal

target keyword model and non-keyword model likelihoods. This will have a direct

impact on detection performance.

Experiments using the phone model approach were reported by Bourlard et al.

[5] and Bazzi and Glass [2]. Both reported good performance, though careful

choice of the phone model weightings was required. Weightings in these cases

were allocated using phone language model bigram/unigram statistics.

2.7.3 Uniform distribution

Silaghi and Bourlard [31] proposed a method of keyword spotting that used

a uniform distribution with a constant value of ε to approximate the non-

keyword model. Using a novel iterative dynamic programming method, Silaghi

and Bourlard [31] described how this uniform distribution could be estimated

from a set of training utterances. Once derived, the uniform distribution could

then be used as the non-keyword model during HMM-based keyword spotting.

A significant benefit of this approach is a dramatic reduction in computational

processing, as there are no calculations required to score an observation against

the non-keyword model. However, the method is limited by the requirement that ε

must be calculated on a per-target-word basis. This has significant implications

for unconstrained vocabulary keyword spotting as it requires training examples

of all target words of interest.

2.7.4 Online garbage model

Bourlard et al. [5] proposed an implicitly derived non-keyword model for keyword

spotting. This online garbage model approach estimated a filler model likelihood

at each frame by using the average of the K-best frame likelihoods of phone models

not belonging to the target keyword. The method is classified as an implicit non-

keyword modeling approach because the architecture of the non-keyword model is

synthesised dynamically based on the phonetic decomposition of the target word.

This differs from the phone model approach discussed in section 2.7.2 where the

non-keyword model was built from the parallel combination of all phone models.

Formally, the online garbage likelihood for an observation o, when detecting

keyword w with phone pronunciation Q = (q1, q2, ..., qN ) and using the top K

scoring phone models from the phone model set S is given by:

\[ p(o|\lambda_{nkw}, w) = \frac{\sum_i p(o|\lambda_{j'_i})}{|J'|} \]

where $J' = \text{sort}_{desc}(J)_1^K$ (the $K$ largest elements of $J$)

$J = \{ p(o|\lambda_{r_1}), p(o|\lambda_{r_2}), \ldots, p(o|\lambda_{r_M}) \}$, with $r_i \in R$

$R = \{ s \in S \mid s \notin Q \}$

$\text{sort}_{desc}(X)$ = $X$ ordered in descending order
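
A minimal sketch of this computation, assuming each phone model exposes a per-frame likelihood function; the container and method names are illustrative, not part of the original formulation.

def online_garbage_likelihood(frame, phone_models, target_phones, k):
    # Average of the K best frame likelihoods over phone models not in the target word.
    # phone_models:  dict mapping phone symbol -> model with a .likelihood(frame) method
    # target_phones: set of phone symbols in the target keyword's pronunciation Q
    competitors = [model.likelihood(frame)
                   for phone, model in phone_models.items()
                   if phone not in target_phones]
    best_k = sorted(competitors, reverse=True)[:k]
    return sum(best_k) / len(best_k) if best_k else 0.0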

This approach is more computationally expensive than other approaches, as

each frame must be scored against a number of phone models. However, it has the

advantage of being more flexible, as no prior training is required. Additionally,

gains may be observed over the phone model approach in section 2.7.2 since only

phones that are not a part of the target word are used in discrimination. As such,

there is a reduction in modeling overlap between the target word models and the

non-keyword model.

2.8 Constrained vocabulary spotting

Constrained vocabulary keyword spotting is used for fixed or well-defined query

set tasks, such as dialogue and command control applications. Using a con-

strained vocabulary allows greater levels of syntactic and lexical information to

be incorporated into the recognition process through the use of recognition gram-

mars and task-specific trigger words. Incorporating such prior knowledge has

been demonstrated in literature to yield significant improvements.

2.8.1 Language model approaches

Language Models (LMs), in particular N-gram models, have proven to be a pop-

ular means of incorporating contextual information into constrained vocabulary

keyword spotting. The simplest means of incorporating language models is to use

the LM to obtain all sequences of words that contain one of the target keywords.

Decoding can then be performed to detect these significantly longer phrasal events

rather than single word events. Since longer events are markedly easier to detect

(due to an increased number of associated frames), significant improvements in

miss rate and false alarm rate can be obtained.

A simple and naive means of constructing a LM-based keyword spotting net-

work is as follows. Let W be defined as the set of target keywords, and Θ be

defined as the task-specific N -gram language model that models word sequences

of length N that occur in the target application domain. The language model Θ

is represented as a collection of 2-element tuples as specified by

\[ \Theta = \{ (\theta_1, P(\theta_1)), (\theta_2, P(\theta_2)), \ldots, (\theta_M, P(\theta_M)) \} \tag{2.12} \]

\[ \theta_i = \theta_i(1), \ldots, \theta_i(N) \tag{2.13} \]

Then the subset of all sequences in Θ that contain an element in W at posi-

tion k, called the constrained vocabulary subsequence set, is given by Ω(W, k,Θ)

defined as:

\[ \Omega(W, k, \Theta) = \{ (\omega_1, P(\omega_1)), (\omega_2, P(\omega_2)), \ldots \} = F(W, k, \Theta) \tag{2.14} \]

where $F(W, k, \Theta) = \{ (S_i, P(S_i)) \in \Theta \mid G(W, k, S_i) = 1 \}$

\[ G(W, k, X) = \begin{cases} 1 & \text{if } X(k) \in W \\ 0 & \text{otherwise} \end{cases} \]
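
With the language model stored as (word sequence, probability) pairs, extracting the constrained vocabulary subsequence set is a simple filter, as the hypothetical Python sketch below shows (k is taken as a 0-based position here).

def constrained_subsequence_set(keywords, k, language_model):
    # F(W, k, Theta): keep LM sequences whose k-th word is a target keyword.
    # language_model: iterable of (sequence, probability) pairs, each sequence a tuple of N words.
    keyword_set = set(keywords)
    return [(seq, prob) for seq, prob in language_model if seq[k] in keyword_set]

# Example with a toy trigram LM:
#   lm = [(("mister", "smith", "said"), 0.4), (("john", "bloggs", "said"), 0.1)]
#   constrained_subsequence_set({"smith", "doe"}, 1, lm)
#   -> [(("mister", "smith", "said"), 0.4)]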

A recognition network can then be constructed from the constrained vocabulary

subsequence set using approaches similar to unconstrained vocabulary keyword

spotting. A single Viterbi decoding pass will then result in a transcription se-

quence consisting of sequences in F (W, k,Θ) and the non-keyword symbol. Figure

2.7 describes the process of building the constrained vocabulary keyword spotting

recognition network for a simple small vocabulary domain.

A significant issue with this approach is that even for low values of N , the

number of elements in the constrained vocabulary subsequence set can be large,

particularly for large target vocabulary domains. This results in significantly

slower decoding, reducing the speed benefits offered by keyword spotting over

large vocabulary speech recognition. Network optimisation techniques such as

the determinisation and minimisation techniques of Mohri et al. [23] may be ap-

plied to reduce the size of networks, resulting in significant reductions in network

size as demonstrated in figure 2.8. Nevertheless, the network complexity is still

considerably greater (particularly for a large number of contexts) than that of

unconstrained vocabulary keyword spotting, though this may be an acceptable

trade-off in exchange for improved miss and false alarm rates.

Figure 2.7: Constructing a recognition network for constrained vocabulary keyword spotting

Figure 2.8: An optimised constrained vocabulary keyword spotting recognition network (language model probabilities omitted)

2.8.2 Event spotting

Event spotting is the task of detecting events of interest such as occurrences of

times, dates, or questions. If an event can be sufficiently described by a set of

word sequences, then constrained vocabulary keyword spotting methods can be

used to detect these events embedded in continuous speech.

Event spotting experiments for the detection of times in continuous speech

were reported by Jeanrenaud et al. [16] and Jeanrenaud et al. [15]. A recognition

network (see figure 2.9) was first constructed to represent various means of utter-

ing a time, such as Midnight, 3 o’clock in the afternoon and 12 thirty PM. Decod-

ing was then performed using phone-based HMMs to transcribe a given utterance

in terms of time and non-time events. The results demonstrated that incorpora-

tion of contextual information and the increased phrase length significantly aided

event detection performance. However recognition speed was markedly reduced

because of the additional complexity of the recognition network.

Figure 2.9: An event spotting network for detecting occurrences of times [16]

An obvious restriction of the event spotting method described above is that

any deviation from one of the prescribed methods of uttering a time may result

in a missed detection. For example, the network shown in figure 2.9 does not

represent utterances of times such as Half Past One or Quarter to Six.

To address this issue, Yining et al. [38] proposed augmenting the event spot-

ting network with filler context models. For example, for the network in figure

2.9, the parallel group of numbers 1 to 12 could be augmented with a parallel

filler model to capture any alternate forms at that point in the network, such

as the slang term Twelvish. Another filler model could be included in parallel

with morning and evening to handle phrases such as One thirty in the afternoon.

Similarly filler models could be introduced at other points to handle variations

in time phrase utterance.

Finite State Grammar approaches have also been proposed for event spotting

by Gou et al. [12] and Kenji et al. [18]. In these methods, nodes in an event

spotting recognition network were clustered using knowledge based classing to

reduce the complexity of a network. For example, the nodes in figure 2.9 could

be clustered using class tags such as number = 1, 2, ..., 12 and timeofday =

morning, evenining,A.M., P.M. Models were then constructed for each class

and decoding was performed using this simplified grammar. This approach gave

considerable improvements in recognition speed since the network complexity was

greatly reduced.

2.9 Keyword verification

Keyword Verification (KV) is a vital stage of many keyword spotting systems.

The purpose of these algorithms is to reduce the typically high false alarm rate

of a prior keyword spotting stage while preserving as many true keyword oc-

currences as possible. The majority of methods proposed in literature perform

keyword verification by deriving a confidence score for each putative occurrence.

Thresholding is then used to accept or reject candidates.

2.9.1 A formal definition

Verification is usually performed on an isolated sequence of observations corre-

sponding to a putative occurrence output by a prior keyword spotting stage. The

verification task is a 2-class discrimination task seeking to determine if a given ob-

servation sequence, O, corresponds to a true occurrence of the keyword W. Thus,

a keyword verification classifier must maximise the probability of the following

condition being maintained:

\[ KV(W, O) = \begin{cases} 1 & \text{if } O \text{ is a true occurrence of } W \\ 0 & \text{otherwise} \end{cases} \tag{2.15} \]

The classifier needs to be robust against a number of errors. For example,

there may be a degree of error with regard to the word boundary time alignments

output by the keyword spotter. This may have ramifications for target word scores

if used within the verification algorithm. A keyword verification algorithm should

also remain robust against variations in target word length, to accommodate

cross-speaker and cross-utterance variations. Thus, it is often prudent to include

a degree of duration normalisation in any keyword verification algorithm.

2.9.2 Combining keyword spotting and verification

Many keyword spotting algorithms already contain a degree of implicit key-

word verification. For example, HMM-based techniques normalise target keyword

model scores using filler model likelihoods, thus implicitly performing KV. How-

ever, it is important not to rely too heavily on the implicit verification ability of

a keyword spotting stage, for example, through aggressive system tuning. This is

because any true occurrences that are missed by the keyword spotting stage are

occurrences that cannot be recovered by a subsequent post processing stage.

Instead, keyword spotting stages should be tuned to obtain lower miss rates

at the expense of higher false alarm rates. A subsequent KV stage can then be

used to cull extraneous false alarms. In this way, each subsystem does what it

is best at — detection with low miss rate for keyword spotting and verification

with low false alarm rate for keyword verification.

2.9.3 The problem of short duration keywords

Short duration keywords pose a significant difficulty for many keyword verification

methods. This is because it is more difficult to obtain a robust likelihood estimate

as the number of observations decreases, since there is less information upon which

to base any classification decisions.

Consider the example of a simple KV algorithm that uses a mean frame like-

lihood confidence statistic. In this case, the error in the mean frame likelihood

estimate will be a function of the number of frames used in the estimation. For

instance, a confidence statistic with a Gaussian distribution will have an error

proportional to σ/√N based on normal distribution error analysis theory.
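
For example, treating the frame scores as roughly independent with standard deviation σ, the standard error of the mean frame likelihood shrinks only with the square root of the number of frames (a standard result, restated here purely for illustration):

\[ \mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{N}}, \qquad \frac{\mathrm{SE}_{N=25}}{\mathrm{SE}_{N=100}} = \frac{\sigma/\sqrt{25}}{\sigma/\sqrt{100}} = 2 \]

so a keyword spanning 25 frames carries roughly twice the estimation error of one spanning 100 frames.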

Hence, verification of short words tends to be more erroneous than that of

long words. This is a significant issue as the false alarm rates of keyword spotters

are also typically poorer for shorter words.

2.9.4 Likelihood ratio based approaches

The likelihood ratio test is one of the more common forms of confidence scores

used in keyword verification. Specifically, the likelihood ratio is used in keyword

verification as a simple 2-class discrimination statistic for determining if an ob-

servation sequence, O, should be classified as an occurrence of the target word w.

A typical formulation uses the non-keyword model likelihood as the normalising

term, as given by

\[ LR(w, O) = \frac{p(O|w)}{p(O|\bar{w})} \tag{2.16} \]

\[ \log LR(w, O) = \log p(O|\lambda_w) - \log p(O|\lambda_{nkw}) \tag{2.17} \]

This formulation is convenient since it makes use of the non-keyword model, λnkw,

that has been well defined in keyword spotting literature.
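
A minimal sketch of a log likelihood ratio verifier of this form, with the duration normalisation discussed in section 2.9.1; the function and its threshold are hypothetical, and the two log-likelihoods are assumed to have been computed over the putative segment by a prior stage.

def verify(keyword_loglik, nonkeyword_loglik, num_frames, threshold):
    # Accept a putative occurrence if its duration-normalised log LR clears a threshold.
    # keyword_loglik / nonkeyword_loglik: total segment log-likelihoods under the
    # keyword model and the non-keyword model respectively.
    score = (keyword_loglik - nonkeyword_loglik) / num_frames   # per-frame log LR
    return score >= threshold, score

# Example: verify(-4200.0, -4350.0, num_frames=150, threshold=0.5) -> (True, 1.0)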

Figure 2.10 shows a typical configuration for a verification system that uses a

likelihood ratio based confidence score. Additionally it shows how multiple veri-

fiers can be combined using fusion techniques to obtain a more robust verification

system.

Figure 2.10: Likelihood ratio based keyword occurrence verification with multiple verifier fusion

As for keyword spotting, the main difference between the majority of likeli-

hood ratio based methods is the choice of non-keyword model architecture. A

high-order GMM non-keyword model was proposed by Wilpon et al. [35]. In

a similar fashion to SBM-based keyword spotting (see section 2.7.1), this non-

keyword model would provide discrimination against the speech of the average

word. Wilpon et al. [35] first scored putative occurrences against the target word

model and the non-keyword model. These scores were then combined using a

likelihood ratio to obtain the confidence statistic:

\[ LR(w, O) = \frac{p(O|\lambda_w) - p(O|\lambda_{GMM})}{|O|} \tag{2.18} \]

Leung and Fung [20] proposed an alternate non-keyword model that was con-

structed from the parallel combination of all states and models that were not

a part of the target keyword model. The non-keyword model likelihood for a

putative occurrence was then calculated by determining the maximum likelihood

obtained when decoding the occurrence set using the non-keyword network:

\[ \log LR(w, O) = \log p(O|\lambda_w) - \max_i \log p(O|q_i) \tag{2.19} \]

where $q_i$ is the $i$th path through the non-keyword network.

Sukkar and Lee [32] and Xin and Wang [37] proposed a more constrained non-

keyword network constructed from the parallel combination of all cohort phones.

A cohort phone was defined as a phone that was highly confusable with one

of the phones of the target keyword. Confusion information could be derived

from either knowledge based rules or from confusion matrix statistics of a phone

recogniser. These approaches alleviated the ad hoc nature of constructing

the non-keyword network and additionally reduced the size of the non-keyword

model network.

A similar cohort based approach was proposed by Liu and Zhu [22]; how-

ever, the cohort models were derived dynamically. A non-keyword network was

first constructed from all subword models. Then, an N-Best recognition pass was

used to determine the N-Best subword models for the given putative occurrence

sequence. Finally, the non-keyword model likelihood was estimated using the

average of the likelihoods of the N-Best subword models.

2.9.5 Alternate Information Sources

Another means of keyword verification is to use an alternate information source

to that used in the initial keyword spotting stage. This means that the keyword

is verified using new information, rather than simply using a transformation of

the same information.

One source of alternate information is to use an alternate feature set. The

information captured within a different feature set may contain complementary

information which may aid in discrimination. Wu et al. [36] suggested the use

of prosodic information to augment phonetic information in confidence scoring.

This system first used phonetic information in a typical filler model based keyword

spotter. The verification stage then used phonetic features and prosodic features

to derive a log likelihood ratio.

An alternate means of incorporating alternate information is to use a different

classifier. For example, a Gaussian-based classifier uses Gaussian decision bound-

aries as a basis for decision making. In contrast, certain classes of neural-network

classifiers use a non-linear decision boundary. Both classifiers may provide com-

plimentary information that may be useful in keyword verification. Ou et al. [25]

combined a neural network in series with a HMM system to utilise the orthogo-

nal discriminative abilities of the two recognition systems. The speech signal was

passed into a typical filler-model based keyword spotting system. The posterior

probabilities of the HMM system were then passed to a neural network for further

discriminative analysis.

2.10 Audio Document Indexing

Audio Document Indexing (ADI) algorithms are required to provide rapid search-

ing of large collections of audio documents for keywords of interest. Typical al-

gorithms use a two-pass approach, where the audio is first prepared during an

initial pass for rapid searching during subsequent query passes. In this way, the

majority of query-word independent processing can be performed during audio

preparation, leaving as little processing as possible to be performed during query

time.


2.10.1 Limitations of the Speech-to-Text Transcription approach

The most intuitive approach, and the one offering the greatest query speed, uses a
large vocabulary speech-to-text (STT) transcription system to fully transcribe audio documents

to text for subsequent rapid textual searching. This requires an initial large

overhead to perform STT, but as a result allows very fast text-only searching

during queries.

Unfortunately, a significant restriction of this approach is that queries are

restricted to the vocabulary of the STT engine. Many ADI applications require

support for unrestricted vocabulary queries, such as name, place, slang, and for-

eign language terms. Even transcription systems with very large vocabularies are

unlikely to provide sufficient coverage of the required query vocabulary for such

systems.

Additionally, any errors made by the transcription system during transcription

are completely unrecoverable at query time, even with knowledge of the actual

query term. Hence, the performance of these systems will always be limited by

the word error rate of the associated STT system.

In contrast, unconstrained vocabulary audio document indexing techniques

are specifically designed to support unrestricted vocabulary queries. Unfortu-

nately, these unconstrained methods also have much slower query speeds than

the STT approach.

Unconstrained vocabulary indexing methods use an initial pass of the data

to derive a compact intermediary representation of the speech. This intermedi-

ary representation must be sufficiently terse to allow rapid query-time searching,

while still preserving sufficient information to provide accurate unconstrained vo-

cabulary querying. At search time, the intermediary representation can then be

searched in a bottom-up fashion to locate putative keyword locations.


2.10.2 Reverse dictionary lookup searches

Early audio document indexing approaches used Reverse Dictionary Lookup

(RDL) searching to detect keywords. RDL searches attempt to infer the location

of high-level events (ie. dictionary items) from a stream of low-level events (ie.

dictionary decompositions). For audio document indexing, this inference is made

from a stream of low-level acoustic events such as phones or syllables.

Speech utterances are first transcribed using a subevent-level recogniser, typ-

ically constrained by a language model, to generate a set of subevent level tran-

scriptions. This is only performed once for each speech utterance to prepare the

utterance for subsequent querying.

At search time, the query word is decomposed using a dictionary or letter-

to-sound rules to obtain a target subevent sequence. The previously generated

subevent transcriptions are then searched to locate instances of this target se-

quence.

Commonly, phones have been used as the subevent representation, such as

in the experiments reported by Chigier [8]. In these experiments, the speech

preparation stage used a phone recogniser to generate phone transcriptions. Then,

at query time, the query word was decomposed to a phone sequence and the phone

transcriptions were searched for instances of the target phone sequence.
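The core of such an RDL search is a simple subsequence match over the stored phone transcription; a minimal sketch (the phone labels and example stream are illustrative):

    def rdl_search(transcription, target):
        """Return the start indices at which the target phone sequence
        occurs verbatim in a phone transcription (both lists of labels)."""
        hits = []
        for i in range(len(transcription) - len(target) + 1):
            if transcription[i:i + len(target)] == target:
                hits.append(i)
        return hits

    # Searching for ACQUIRE (/ax/ /k/ /w/ /ay/ /r/) in a phone stream.
    stream = ['d', 'iy', 'ax', 'k', 'w', 'ay', 'r', 'd']
    print(rdl_search(stream, ['ax', 'k', 'w', 'ay', 'r']))   # -> [2]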

A significant shortcoming of RDL searches is that there is little consideration

given to subevent-level recogniser errors. For example, phone-level transcribers

typically have high error rates (in excess of 40% in many cases) and as such any

generated phone transcriptions are likely to contain a large number of errors.

Figure 2.11 demonstrates how RDL performance is affected by various subevent-

level recogniser errors.

Figure 2.11: Applying reverse dictionary searches to the detection of the word ACQUIRE in a phone stream (showing a successful detection, and misses caused by deletion, substitution and insertion errors)

One means of improving robustness is to generate and search multiple phone
transcription hypotheses for each speech utterance. This increases the chance of

the correct transcription being generated for a speech utterance but also increases

the chance of false alarms. Zue et al. [41] used a phone dendrogram to maintain

multiple hypothesis information while Sethy and Narayanan [30] used N-best

phone transcriptions. Both methods yielded improved performance over single-

level phone transcriptions.

An alternative is to use a more robust subevent representation. Sethy and

Narayanan [30] proposed the use of the syllable subevent instead of the phone

subevent. The syllable is a considerably longer unit than the phone, and hence

more easily detected and classified. The robustness of the syllable was demon-

strated by Sethy and Narayanan [30] in experiments where a 17% reduction in


transcription error was obtained using a syllable transcriber instead of a phone

recogniser. The improvement in transcription error rate resulted in considerable

improvements in overall ADI system performance.

2.10.3 Indexed reverse dictionary lookup searches

A shortcoming of reverse dictionary lookup ADI methods is that the amount
of data that needs to be searched at query time is very large. In fact, an
RDL search has O(N) complexity in the size of the database, which may prove problematic for very

large database searches. The Indexed Reverse Dictionary Lookup (IRDL) search,

proposed by Dharanipragada and Roukos [10], addresses this issue by using an

index to constrain searching to only plausible regions of speech. This reduced set

of regions can then be searched using a more thorough algorithm to generate a final

set of putative occurrences.

Figure 2.12 demonstrates the IRDL method proposed by Dharanipragada and

Roukos [10]. During the preparation stage, speech is transcribed to a subevent

level using the same approach used in RDL. A subevent index is then built that

contains the times at which each subevent occurred. Dharanipragada and Roukos

[10] used trigram subevents for this index.

At query time, the query term is first decomposed into a subevent-level rep-

resentation. The subevent index is then consulted to obtain the location of all

subevents of which the query word is comprised. Following this, correlations be-

tween subevent locations are examined to determine locations in speech that most

closely match the query term subevent sequence. Dharanipragada and Roukos

[10] performed this correlation by quantising the subevent timescale into 1 sec-

ond units and then finding times at which a majority of the subevents occurred

in close proximity. Finally, the resultant candidate regions are searched using a

more thorough keyword spotting method. A HMM-based approach was used by

Dharanipragada and Roukos [10] for this stage.

Figure 2.12: Example of indexed reverse dictionary searching for the detection of the word ACQUIRE

The IRDL approach offers considerable benefits in terms of scalability by

using the subevent index to constrain query time searching. Since the size of the

candidate region set is considerably smaller than the size of the entire database,

processing requirements will be considerably lower than the standard reverse

dictionary lookup approach.
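As a rough, simplified sketch of the preparation and query-stage correlation described above (not the exact implementation of Dharanipragada and Roukos [10]; the bin size, majority criterion and all names are assumptions), the index maps each subevent to its occurrence times and candidate regions are the 1-second bins in which most of the query subevents appear:

    from collections import defaultdict

    def build_subevent_index(transcript):
        """transcript: list of (subevent_label, time_in_seconds) pairs
        produced during the preparation stage."""
        index = defaultdict(list)
        for label, time in transcript:
            index[label].append(time)
        return index

    def candidate_regions(index, query_subevents, bin_size=1.0, majority=0.5):
        """Return quantised time bins in which at least a `majority`
        fraction of the distinct query subevents occur; these crude
        candidates are then rescored with a slower acoustic search."""
        votes = defaultdict(set)
        for subevent in query_subevents:
            for t in index.get(subevent, []):
                votes[int(t // bin_size)].add(subevent)
        needed = majority * len(set(query_subevents))
        return sorted(b for b, seen in votes.items() if len(seen) >= needed)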

However, the coarseness of the subevent correlation process may result in

a very large number of candidate regions that then need to be searched using

slower acoustic search methods. Hence any gain in speed obtained in reducing

the candidate search space from subevent indexing may be lost in the requirement

for subsequent slower acoustic searching.

2.10.4 Lattice based searches

Lattice based searching, proposed by Young and Brown [39], reduces the effect

of phone recogniser error in RDL by searching a phone lattice representation

of speech at query time. Phone lattices encode a significantly greater number

of recognition paths than individual N-best transcriptions, therefore greatly in-

creasing the possible search space for query time processing.

In lattice based searching, an initial pass of the speech is first made using a

phone recogniser to generate phone lattices. At query time, these lattices are then

searched for any instances of the target word phone sequence. Any discovered

sequences are extracted from the lattices together with corresponding node times

and scores to generate a set of putative occurrences. Figure 2.13 demonstrates

the lattice based search process for detecting the word ACQUIRE with phone

transcription (/ax/, /k/, /w/, /ay/, /r/).
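A minimal sketch of such a lattice search, assuming the lattice is represented as a dictionary of nodes (phone, start time, end time) and a successor map (these data structures are chosen purely for illustration):

    def lattice_search(nodes, successors, target):
        """nodes: {node_id: (phone, start_time, end_time)}
        successors: {node_id: [following node_ids]}
        target: phone sequence, e.g. ['ax', 'k', 'w', 'ay', 'r'].
        Returns (start_time, end_time) pairs for matching lattice paths."""
        hits = []

        def extend(node_id, depth, start_time):
            phone, _start, end = nodes[node_id]
            if phone != target[depth]:
                return
            if depth == len(target) - 1:
                hits.append((start_time, end))
                return
            for nxt in successors.get(node_id, []):
                extend(nxt, depth + 1, start_time)

        for node_id, (phone, start, _end) in nodes.items():
            if phone == target[0]:
                extend(node_id, 0, start)
        return hits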

One advantage of this method is that a degree of phone recogniser error ro-

bustness is implicitly provided by the multiple paths encoded within the lattice.

Figure 2.13: Using lattice based searching to locate instances of the word ACQUIRE within a phone lattice


Phone recognisers are highly susceptible to error and therefore such robustness is

likely to be beneficial.

However, a major problem when using phone lattices is generating a suffi-

ciently compact representation for storage. A phone lattice can potentially rep-

resent thousands of possible utterance transcriptions that need to be preserved

for query time searching. This results in significant disk storage requirements

which results in query time searching requiring a large amount of disk access - a

potentially slow operation.

Hence Young and Brown [39] proposed storing an approximation of lattices.

The lattice time scale was quantised and nodes were then placed into bins based on

their quantised time value. Then, it was assumed that a node in bin T was always

connected to a node in bin T + δ where δ was the quantisation unit. In this way,

lattice storage no longer required storage of node interconnection information,

resulting in a significant saving in storage for typically highly interconnected

phone lattices.
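A sketch of this storage approximation, assuming each lattice node carries a (phone, time, score) triple; the node interconnections are discarded, and any node in bin T is later treated as connecting to any node in bin T + 1:

    from collections import defaultdict

    def quantise_lattice(lattice_nodes, delta=0.1):
        """Group lattice nodes by quantised time so that only (phone, score)
        pairs per bin need to be stored, not the interconnections."""
        bins = defaultdict(list)
        for phone, time, score in lattice_nodes:
            bins[int(time / delta)].append((phone, score))
        return bins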

Unfortunately, this optimisation also introduces node interconnections that

may not have previously existed, potentially leading to false lattice paths and

therefore increased false alarm rates.

An extension to this lattice based search was proposed by James and Young

[14]. This work proposed the introduction of dynamic programming techniques

into the lattice search to provide additional robustness against phone recogniser

error. Experiments demonstrated some gains in detection performance. Interest-

ingly though, this technique was removed from subsequent publications by the

secondary authors, suggesting that the additional processing required for this dy-

namic programming search did not necessarily justify the gains in performance.


Chapter 3

HMM-based spotting and verification

3.1 Introduction

A range of keyword spotting and verification algorithms have been proposed in

literature. Of these, HMM-based methods have stood out as offering exceptional

performance and have the benefit of being within a well-studied and mature

speech recognition framework.

This chapter analyses a number of these methods and reports on compara-

tive evaluations performed on conversational telephone speech. The motivation

for this work was to provide a platform for further research by investigating the

current approaches and identifying the key issues that needed to be addressed in

subsequent work. The research was further necessitated by the lack of compara-

tive keyword spotting evaluations available in recent literature.

The majority of HMM-based methods differ primarily in their approach to

non-keyword modeling. As such, initial sections of this chapter examine a num-

ber of non-keyword modeling techniques and discuss individual strengths and


weaknesses with regards to keyword spotting, verification and discrimination in

general. A conceptual framework named the confusability circle is presented and

used here to aid discussion.

Subsequent sections report on HMM-based keyword spotting experiments us-

ing these non-keyword model architectures. The reported experiments include a

comparative evaluation of various non-keyword models, the effects of target key-

word length on performance and a discussion on the tunability of HMM-based

systems.

The final sections of this chapter discuss and evaluate HMM-based keyword

verification. The presented work investigates the performance of SBM-based

verification as well as the role of discriminative classifiers.

3.2 The confusability circle framework

The confusability circle is a simple tool that aids the visualisation of aspects

related to the analysis of non-keyword models. This concept is used extensively

within this work to provide a well-defined framework for such discussion.

The framework is based on three key properties that define a good non-

keyword model:

1. A high degree of representation for words that are confusable with the target

word. Within this work, the word W1 is said to be confusable with word

W2 if there is a reasonably high probability that occurrences of word W1

will be output as putative occurrences of W2 by a keyword spotter.

2. A high degree of representation for words that are very disparate from the

target word. The word W1 is said to be disparate with regards to the word

W2 if there is a low probability of occurrences of word W1 being output as

putative occurrences of word W2.


3. A low degree of representation for the target word.

The properties above can be viewed as modeling three types of regions in

acoustic feature space. The confusability circle in figure 3.1 shows these regions

within a simplified two-dimensional feature space. The centre region, labeled the

Target Acoustic Region (TAR), corresponds to the acoustic feature space within

which observations for the target word fall. The surrounding region, named

the Confusable Acoustic Region (CAR), represents the region where observations

from highly confusable words occur. Finally the outer region, called the Disparate

Acoustic Region (DAR), represents the rest of acoustic feature space, within

which observations for words that are disparate from the target word fall.

Figure 3.1: Confusability circle for the target word STOCK

Within this framework, a good non-keyword model should have a high degree

of representation of speech within the DAR and CAR regions and a low level

of representation within the TAR region. As continually highlighted in keyword


verification literature, this is a difficult problem because of the large non-keyword

acoustic space that needs to be modeled and the comparatively small target

keyword region of acoustic space that has to be excluded.

The confusability circle paradigm has a number of limitations. Primarily, it is

unlikely that there will be clear distinctions between the individual confusability

regions. Instead, there is more likely to be some overlap between regions resulting

in fuzzy boundaries. Additionally, the notions of confusability used here are fairly
loose and are not defined formally.

Nevertheless, the confusability circle framework provides a convenient means

of representing and discussing the various aspects related to non-keyword mod-

eling. It does not provide a means of proving any conjectures arising from these

discussions but merely simplifies the process of visualising related issues. It is for

this reason that it is used in subsequent discussion in this work.

3.3 Analysis of non-keyword models

3.3.1 All-speech models

A distinct class of non-keyword models is those that model all speech, such as the

SBM method used by Wilpon et al. [35] and the phone set approach reported by

Bourlard et al. [5]. Within the confusability circle framework, it can be seen that

these methods perform unbiased modeling of all three regions: the TAR, CAR

and DAR. No attempts are made to include implicit discrimination between the

various confusability regions directly into the non-keyword model.

Despite this, all-speech non-keyword modeling techniques have been demon-

strated to be effective in literature, for example in the work of Bourlard et al. [5].

Clearly this cannot be because of any specific DAR and CAR modeling. Instead,

such methods must rely on robust CAR modeling in the target keyword models


to obtain good discriminability. If such a robust target keyword model can be

constructed, then output likelihoods of the target word model would hopefully

exceed the output likelihoods of the non-keyword model within the CAR region,

leading to proper target word discrimination. All-speech non-keyword models are

therefore dependent on the robustness of the target keyword model rather than

the robustness of the non-keyword model.

3.3.2 SBM methods

SBM based non-keyword models are a specific type of all-speech model that have

been demonstrated to be effective for keyword discrimination by Wilpon et al. [35]

and Rose and Paul [28]. The key benefits of such an approach are the simplicity of

the model and the reduced network complexity compared to a phone model set

approach.

As stated before, the performance of all-speech models will be dictated to a

degree by the quality of the target word model. In particular, good performances

can only be expected if a robust target word model is used. In a HMM-based

framework, it is easy to build such a model, either through a word-based model or

more flexibly through the concatenation of phone models. Both these approaches

will result in a robust target word model that is sufficiently disparate from the

non-keyword SBM model.

As such, one would expect good detection rates using an SBM non-keyword

model. However, the highly generalised nature of the SBM means that there

is little CAR modeling in the non-keyword model, and therefore, a significant

number of false alarms is likely to be observed.


3.3.3 Phone-set methods

The phone-set approach is another all-speech architecture that performs more

explicit modeling of speech than an SBM by using individual phone models. In

the confusability circle, this can be visualised as modeling individual pockets of

the entire confusability circle.

One limitation of a phone-set non-keyword model is that there will be overlap

between the target word model and the non-keyword model if both are con-

structed using models from the same phone model set. Given this, observations

associated with common phones will score equally well in the target word model

and the non-keyword model. As a result, one would expect an increase in miss

rate compared to an SBM non-keyword model.

Obviously, this overlap is reduced if using a word-based target model. How-

ever, the use of word-based models significantly reduces the flexibility of key-

word spotting and verification algorithms, and is particularly unsuitable for un-

restricted vocabulary tasks.

3.3.4 Target-word-excluding methods

Another distinct class of non-keyword models is those constructed from a combi-

nation of all subevent models (eg. phones, states) excluding those that occur in

the target keyword. The methods proposed by Leung and Fung [20], Sukkar and

Lee [32], and Xin and Wang [37] are examples of such models. These approaches

specifically attempt to exclude speech in the TAR region, hence constraining

non-keyword modeling to only the DAR and CAR regions.

Unfortunately, excluding target word subevents from the non-keyword model

does not necessarily guarantee this to be the case. This is because a number of

confusable words in the CAR region will have subevents that are shared with the

target word. This region of overlap between the subevents from confusable words


and those from the target word is shown as the Shared Subevent CAR (SSCAR)

region in figure 3.2.

The subevents in the SSCAR are a subset of the subevents in the target word

model but are also excluded from the non-keyword model. As a result, there will

be a number of highly confusable words that are better modeled by the target

word model than the non-keyword model. This unfortunately is likely to lead to

an increased number of false alarms over the standard phone model approach.

However, a possible benefit is a decrease in miss rate since there is now a greater

separation between the non-keyword model and the target word model.

Figure 3.2: Example of the shared subevent confusable acoustic region for the keyword STOCK

3.4 Evaluation of keyword spotting techniques

The choice of non-keyword architecture is the primary difference between many

HMM-based keyword spotting techniques. In the previous section, a number


of these were compared and contrasted, and hypotheses were made regarding

expected miss rate and false alarm rate performances. This section reports on

experiments that quantitatively evaluate HMM-based keyword spotting using a

selection of these non-keyword models. Three specific algorithms are examined

here:

1. SBM-KS - Speech background model approach (section 2.7.1)

2. PM-KS - Phone model approach (section 2.7.2)

3. XPM-KS - Target-word-exclusive phone model approach based on the

non-keyword model described by Leung and Fung [20] (section 2.9.4).

3.4.1 Experiment setup

Recogniser parameters

HMM models were trained using data taken from a 165 hour subset of the Switch-

board 1 conversational telephone speech corpus. Speech was parameterised using

Perceptual Linear Prediction (12 statics and C0 + deltas + accelerations) fea-

ture extraction with Cepstral Mean Subtraction applied to provide speaker and

channel compensation.
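Cepstral Mean Subtraction itself is a simple per-utterance operation; a minimal numpy sketch (the feature matrix layout assumed here is an illustration, not the thesis's implementation):

    import numpy as np

    def cepstral_mean_subtraction(features):
        """features: (num_frames, num_coefficients) array of cepstral
        features. Subtracting the per-utterance mean of each coefficient
        provides crude speaker and channel compensation."""
        return features - features.mean(axis=0, keepdims=True)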

Target word models were constructed by training 16-mixture cross-word tri-

phone Gaussian state distribution HMM models on the Switchboard training

data. A 256-mixture GMM model was trained on this same data to be used as

the non-keyword model for the SBM-KS experiments. Additionally, 32-mixture

monophone models were constructed for use as non-keyword models for the PM-

KS and XPM-KS methods.


Evaluation set

Evaluation speech was taken from a 2 hour subset (not overlapping with the

training data) of the Switchboard data. The query words were constrained to

only 6-phone length words to remove any variations in experiment results due

to keyword length. As such, 360 6-phone length unique words were randomly

chosen from the evaluation speech and labeled as query words. These query

words appeared a total of 808 times in the evaluation set.

Evaluation procedure

Experiments were performed as follows.

1. For each query word, single-word keyword spotting was performed for every

utterance in the evaluation set.

2. Total execution time was measured using a single 3GHz Pentium 4 proces-

sor.

3. Miss and false alarm rates were calculated using reference forced-aligned

word-level transcription timings. No thresholding or normalisation was ap-

plied to output scores prior to calculating performance metrics.

Execution time is reported in terms of CPU minutes per queried word per hour

(CPU/kw-hr).
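For illustration, miss rate and false alarm counts can be computed from the putative occurrences and reference timings along the following lines (a sketch only; the exact alignment rule, e.g. the midpoint tolerance used here, is an assumption):

    def score_spotting(putatives, references, tolerance=0.5):
        """putatives / references: lists of (keyword, start_time, end_time).
        A putative occurrence is a hit if an unused reference of the same
        keyword has a midpoint within `tolerance` seconds of its own."""
        used = set()
        hits = 0
        for word, ps, pe in putatives:
            midpoint = 0.5 * (ps + pe)
            for i, (rword, rs, re) in enumerate(references):
                if i in used or rword != word:
                    continue
                if abs(midpoint - 0.5 * (rs + re)) <= tolerance:
                    hits += 1
                    used.add(i)
                    break
        miss_rate = 100.0 * (len(references) - len(used)) / max(len(references), 1)
        false_alarms = len(putatives) - hits   # divide by #keywords for FA/kw
        return miss_rate, false_alarms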

3.4.2 Results

Results of the experiments are shown in table 3.1. In terms of detection per-

formance, SBM-KS clearly outperforms the other two systems, yielding a very

low miss rate of 1.9%. Execution time is also significantly faster than the phone-

model based methods, being approximately 10 times faster than the XPM-KS and


PM-KS systems. Unfortunately, the high FA/kw rate of SBM-KS is of concern.

Method     Miss rate (%)    FA/kw    CPU/kw-hr
SBM-KS     1.9              419.7    1.82
PM-KS      32.6             2.1      18.0
XPM-KS     19.8             10.3     16.1

Table 3.1: Keyword spotting performance of baseline systems on Switchboard 1 data

Typically a high FA/kw rate is considered crippling for a keyword spotting

system. This is because a system that has such a high number of extraneous

incorrect results is essentially unusable from a practical perspective. However,

if a keyword spotting system is used in conjunction with an accurate keyword

verification stage that can successfully remove a large proportion of the false

alarms, then the significance of the high false alarm rate is greatly reduced. In

essence, the major penalty of a high false alarm rate is then simply an increase in

the amount of processing required in the keyword verification stage. False alarm

errors are therefore recoverable errors within the keyword spotting context.

In contrast, a poor miss rate is an error that is completely unrecoverable.

Once a true occurrence has been missed by a keyword spotter, no amount of

subsequent processing can recover the missed occurrence. As such, it is more

favourable for a keyword spotting system to have a low miss rate rather than a

low false alarm rate, given an accurate subsequent keyword verification stage.

The poor miss rate performances of PM-KS and XPM-KS are a result of overly
greedy non-keyword models, and are in line with assertions made in sections 3.3.3

and 3.3.4. A good non-keyword model should generate much lower likelihoods

than the target word model in the TAR. However, an overly-greedy non-keyword


model generates likelihoods that are close to or exceed the likelihoods of the

target word model in the TAR, leading to unfortunate miss errors. The presented

results validate the assertion that the PM-KS and XPM-KS non-keyword models

are more greedy than the SBM-KS non-keyword model.

Reduction of overlap between the non-keyword model and the target-word

model is one method of reducing the greediness of a non-keyword model. This

in fact is the motivation behind the XPM-KS method. XPM-KS removes all

phone models that exist in the target word from the list of phone models used in

its non-keyword model, thus generating a less-overlapping non-keyword model

on a per word basis. As demonstrated by the above results, and as postulated

in section 3.3.4, the method is effective in improving miss error rate over a non-

target-word-excluding PM-KS system. Specifically, an absolute decrease of 12.8%

in miss rate is observed.

However, an unfortunate side-effect of this is that speech corresponding to
words similar to the target word is now consumed by the target word model.
As explained in section 3.3.4, this leads to an increase in false alarm rate,
as demonstrated by the 8.2 FA/kw absolute

increase observed for XPM-KS over PM-KS.

The significantly slower execution times for PM-KS and XPM-KS are a direct

result of the increased complexity of the non-keyword model. Using a composite

non-keyword model comprised of multiple phone models results in an increase

in the number of nodes in the recognition lattice, and hence an increase in the

amount of decoding processing. For example, a 44-entry phone set requires 44

nodes in the recognition network as well as 44 × 44 = 1936 extra links between

phone nodes. In contrast, the SBM-KS approach only has a single node within

the non-keyword model, resulting in very little impact on the complexity of the

decoding network.


Overall then, the SBM-KS appears to be the most appealing choice of al-

gorithm for keyword spotting, assuming the availability of a subsequent well-

performing keyword verification stage. Not only is a very low miss rate obtained

using this method, but execution speed is significantly faster, making the method

suitable for real-time speech processing tasks.

3.5 Tuning the phone set non-keyword model

As noted above, one reason for the poor detection performance of the PM-KS

and XPM-KS methods was an overly greedy non-keyword model. One means

of addressing this issue is to introduce a target word insertion penalty. This is

analogous to the concept of word insertion penalty used in speech recognition -

a means to control the insertion rate of tokens.

The word insertion penalty is incorporated into the recognition process by

inserting a link transition probability into the recognition network. This is de-

noted by the α link probabilities in figure 3.3. Using a positive word insertion

penalty on a target keyword node will favour the emission of target words over

non-keywords during the decoding process. Conversely, using a negative

penalty simulates a more greedy non-keyword model, resulting in an increase in

miss rate but also a decrease in false alarm rate.
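Conceptually, the penalty is just an additive term (in the log domain) applied when the decoder enters a target keyword node; a toy sketch of the resulting comparison (the decoder internals here are entirely hypothetical):

    def keyword_path_wins(keyword_logscore, nonkeyword_logscore, insertion_penalty=0.0):
        """Token-passing style comparison at a network junction. A positive
        insertion penalty favours emitting the target keyword; a negative
        penalty simulates a greedier non-keyword model."""
        return keyword_logscore + insertion_penalty > nonkeyword_logscore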

Experiments were performed to determine if adjusting target word inser-

tion penalty would yield improvements in the performance of keyword spotting
methods that use phone-model-based non-keyword models. The PM-KS experiments reported

in section 3.4 were repeated using two different values of target word insertion

penalty: 50 and 100. It was assumed that trends observed for PM-KS would be

similar to trends observed in XPM-KS and as such experiments were not repeated

for the XPM-KS method. Results of these evaluations are shown in table 3.2.

The original PM-KS method evaluated in section 3.4 is listed in this table as

having a word insertion penalty of 0.

Figure 3.3: Incorporating target word insertion penalty into HMM-based keyword spotting

Method    Word insertion penalty    Miss rate (%)    FA/kw
PM-KS     0                         32.6             2.1
PM-KS     50                        21.0             8.8
PM-KS     100                       15.8             17.8

Table 3.2: Effect of target word insertion penalty on PM-KS performance

The results demonstrate that target word insertion penalty is indeed an ef-

fective means of obtaining a more palatable level of performance. Specifically, an

absolute decrease of 16.8% in miss rate is obtained using a penalty of 100, though

at the expense of a 15.7 increase in FA/kw. This suggests phone-model based

systems may be tuned to obtain low miss rates in a similar fashion to SBM-KS,


in particular if much higher word insertion penalties are used. Unfortunately, the

growth in FA/kw rate indicates that a similar problem as faced in SBM-KS will

occur - a very low miss rate but at the expense of a very high FA/kw rate.

Additionally, this type of tuning does not provide any benefit for execution

speed. As a result, even though improved detection performance may be achiev-

able, the very high execution time of PM-KS and XPM-KS still make them

unattractive for real-time keyword spotting tasks.

3.6 Output score thresholding for SBM spotting

Output score thresholding is a simple means of adjusting the operating point

of a keyword spotting system. Given a set of putative occurrences and their

corresponding likelihoods, such thresholding can be used to reduce false alarm

rate in exchange for an increase in miss rate.

As demonstrated in section 3.4, SBM-KS was capable of delivering lower miss

rates and faster execution speeds than PM-KS and XPM-KS. However, this was

at the expense of a considerably higher false alarm rate. Experiments were there-

fore performed to determine whether output score thresholding could be used to

reduce false alarm rates.

It must be noted that the aim of these experiments was not to yield false alarm

rates sufficiently low for the final output of a keyword spotting system. This was

because in practice a subsequent keyword verification stage would be used to

further cull false alarms. Instead, it was hoped that output score thresholding

could be applied here to remove any highly improbable putative occurrences

without significantly affecting miss rate.

The output score thresholding techniques that were evaluated were:


1. UNT - This method applied direct thresholding on the unnormalised pu-

tative occurrence likelihoods output by SBM-KS

2. DNT - In this approach, duration normalisation was applied to all putative

occurrence likelihoods prior to applying thresholding. This was done to

compensate for variations in the length of realisations of keywords. The

duration normalised likelihood was calculated by:

DNT(p, t_s, t_e) = p / (t_e − t_s)    (3.1)

where p = output likelihood of putative occurrence
t_s = start time of putative occurrence
t_e = end time of putative occurrence
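A minimal sketch of the DNT normalisation and culling described in item 2 above (the threshold value and the structure of the putative occurrence records are assumptions):

    def duration_normalised_score(likelihood, start_time, end_time):
        """Equation 3.1: divide the putative occurrence likelihood by its
        duration so that realisations of different lengths are comparable."""
        return likelihood / (end_time - start_time)

    def cull(putatives, threshold):
        """Keep putative occurrences, given as (likelihood, start, end)
        tuples, whose duration normalised score exceeds the threshold."""
        return [p for p in putatives
                if duration_normalised_score(*p) > threshold]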

The SBM-KS putative occurrence set from experiments reported in section

3.4 was initially post filtered using each of the above methods. Performance was

then measured at various thresholds to obtain the DET plots shown in figure 3.4.

Equal error rates for these systems are given in table 3.3.

Thresholding method    Equal error rate (%)
UNT                    58.1
DNT                    41.1

Table 3.3: Equal error rates of unnormalised and duration normalised output score thresholding applied to SBM-KS

The results clearly indicate that duration normalisation is more appropriate

than unnormalised score thresholding for the task of crude false alarm culling.

The equal error rate obtained using DNT is almost 20% lower than that obtained

for UNT.


Figure 3.4: DET plots for unnormalised and duration normalised output score thresholding applied to SBM-KS

The slope of both DET plots is almost at a 45° angle, indicating that the

trade-off between miss rate and false acceptance rate is approximately one-for-

one. That is, a reduction of 10% in false acceptance rate would be matched by

an approximately equal increase of 10% in miss rate. This is unfortunate as such

a sensitivity means that attempting to reduce false alarm rates by any significant

amount would result in a correspondingly significant degradation in miss rate.

The experiments then indicate that output score thresholding provides a poor

method of reducing the false alarm rate of SBM-KS, even as an initial pre-processing

step to remove highly improbable putative occurrences.

3.7 Performance across keyword length

The keyword spotting experiments reported so far in this chapter have been re-

stricted to only the detection of 6-phone-length keywords. This was done to

reduce any variations in performance caused by target keyword length. This


section quantitatively measures the performance of SBM-KS for three target key-

word lengths: 4-phone, 6-phone and 8-phone, corresponding to short, medium

and long keywords respectively.

3.7.1 Evaluation sets

Three target-word-length dependent evaluation sets were used for this set of ex-

periments. The 6-phone set was taken directly from previous experiments, while

4-phone and 8-phone sets were constructed anew. These new sets were built in a

similar fashion to the 6-phone set, except for the number of query words selected.

This was because there were limits on the available number of unique words in

the evaluation speech.

In particular, it was not possible to find 360 unique 8-phone words, and as

such, only 184 words were selected. In contrast, the number of words selected for

the 4-phone set was increased from 360 to 400. Table 3.4 shows the number of

query words selected for each phone length as well as the corresponding number

of actual occurrences of these words within the evaluation speech.

Phone-length    Number of query words    Number of occurrences in data
4               400                      1957
6               360                      808
8               184                      364

Table 3.4: Details of phone-length dependent evaluation sets

3.7.2 Results

SBM-KS performance for the phone-length dependent evaluation sets are shown

in table 3.5. As expected, the numbers show that miss rate decreases as target


word phone-length increases. For very short 4-phone keywords, a miss rate of

5.7% was obtained, which is pleasing considering the very short duration of 4-

phone length events.

Phone-length    Miss rate (%)    FA/kw
4               5.7              315.4
6               1.9              419.7
8               1.6              302.9

Table 3.5: SBM-KS performance on Switchboard 1 data for different phone-length target words

An unexpected result is the trend in FA/kw rates. False alarm rates for

medium length words actually exceeded that observed for both 4-phone and 8-

phone words. This is a perplexing result, but is most likely due to the chance

occurrence that the operating point for SBM-KS for this particular 6-phone set

happened to be at a lower miss-rate and higher false alarm rate.

Duration normalised output thresholding was applied to further study the

trends in performance across keyword length. As shown in figure 3.5, DNT does

not provide any significant capability for culling false alarms without dramatically

affecting miss rate. This is true even for long duration 8-phone keywords which

theoretically should be easier to cull due to their increased duration and therefore

increased discriminability.

Figure 3.5: DET plots for duration normalised output score thresholding applied to SBM-KS for keyword length dependent evaluation sets

3.8 HMM-based keyword verification

To date, many of the HMM-based verification systems proposed in literature
have differed primarily in their choice of non-keyword model. The proposed non-
keyword models have also been used in keyword spotting and hence there is a
significant overlap between the two areas of research.

HMM-based keyword verification is very similar to HMM-based keyword spot-

ting, with the major exception that recognition is performed on isolated word

instances. As such, many findings from keyword spotting research will also apply

to keyword verification.

In previous sections, a number of experiments comparing SBM-based and

phone set based keyword spotting were presented. In these experiments, it was

established that the SBM-based method was considerably faster in terms of ex-

ecution speed. This is equally applicable in SBM-based keyword verification

providing a clear benefit for real-time processing.

Additionally, it was found that the SBM-based system achieved significantly

lower miss rates (though a phone set based system could potentially be tuned to

obtain a lower miss rate as well). A low miss rate is paramount for a keyword

verification system to avoid compounding the miss rate of the previous keyword

spotting stage.


For these reasons, only the SBM-based configuration is evaluated in this sec-

tion. Its speed and low miss rate performance make it an excellent candidate for

real-time keyword verification tasks.

3.8.1 Evaluation set

A keyword verification evaluation set can be constructed in two ways. The sim-

plest is to randomly select words from a word-level transcription and relabel a

selection of these to simulate false alarms. An alternate approach is to use a

keyword spotter to first generate a putative occurrence set. The evaluation set

can then be constructed by selecting hits and false alarms from this result set.

The latter approach better reflects the type of putative occurrences that a KV

system would typically operate on. Occurrences in this set would be acoustically

similar to the target word since they were obviously confused by the keyword

spotter. As a result, they would also be more difficult to verify. Therefore, this is

the approach used for generating the evaluation set for this set of experiments.

Target word length dependent evaluation sets were constructed for 3 keyword

lengths - 4-phone, 6-phone and 8-phone - using the following procedure:

1. SBM-KS was performed using the appropriate target word length dependent

evaluation set from section 3.7

2. A reference transcription was then used to mark each putative occurrence

as a hit or false alarm

3. Finally, a verification evaluation set was constructed by randomly select-

ing a required number of hits and false alarms from the set of putative

occurrences.

Table 3.6 summarises the number of hits and false alarms in each evaluation

set.


Phone-length    # Hits    # False alarms
4               1882      617171
6               799       339115
8               362       110266

Table 3.6: Statistics for keyword verification evaluation sets

3.8.2 Evaluation procedure

SBM-based keyword verification was performed by calculating the log-likelihood

ratio confidence score for each item in the evaluation set using equation 2.18

in section 2.9.4. False alarm and miss rate performance were then measured at

various thresholds to obtain DET plots and equal error rates. Experiments for

each target word length were performed completely independently.

Note that when calculating miss rate, the miss rate from the previous keyword

spotting stage was added to the keyword verification miss rate. In this way, true

overall system miss rates are reported. This method of calculating miss rate is

used for all keyword verification experiments in this work unless otherwise noted.
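For reference, the equal error rate can be estimated from the hit and false alarm confidence scores by sweeping a threshold; a rough numpy sketch (a simple linear sweep, not an optimised DET computation):

    import numpy as np

    def equal_error_rate(hit_scores, fa_scores):
        """Sweep a decision threshold over all observed confidence scores
        and return the operating point where miss rate and false alarm
        rate are closest, averaged as the EER (in %)."""
        hit_scores = np.asarray(hit_scores)
        fa_scores = np.asarray(fa_scores)
        best = None
        for t in np.sort(np.concatenate([hit_scores, fa_scores])):
            miss = 100.0 * np.mean(hit_scores < t)
            fa = 100.0 * np.mean(fa_scores >= t)
            if best is None or abs(miss - fa) < abs(best[0] - best[1]):
                best = (miss, fa)
        return 0.5 * (best[0] + best[1])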

3.8.3 Results

The equal error rates for SBM-KV on the individual keyword length evaluation

sets are given in table 3.7. DET plots for performance across operating points

are shown in figure 3.6.

The results indicate that target keyword length has a significant effect on

keyword verification. Performance is markedly better for longer length keywords,

with the 8-phone tests yielding the lowest equal error rate of 7.9%. Medium

length KV performance is poorer, with an equal error rate of 12.1%. However,

a very poor 19.9% EER was obtained for 4-phone length words, corresponding to
an absolute increase of 7.8% going from 6-phone to 4-phone words.

Phone-length    Equal error rate (%)
4               19.9
6               12.1
8               7.9

Table 3.7: Equal error rates for SBM-based keyword verification

Figure 3.6: DET plots for different target keyword lengths for SBM-KV on Switchboard 1 evaluation sets

These numbers highlight a significant issue for SBM-based keyword verifica-

tion, and in fact a problem for KV in general: the inability to estimate

robust confidence scores for short words as a result of a reduced number of obser-

vation vectors available for scoring. Since the SBM-KV confidence score is derived

from the ratio of target word score to non-keyword score, a reduced amount of

data results in statistically less robust estimates of the individual component

scores, and therefore a less robust confidence score ratio.


3.9 Discriminative background model KV

Background model based KV has typically been implemented using a LLR con-

fidence score formulation. However, the log-likelihood ratio only provides a very

crude decision boundary and therefore is suboptimal for discriminative tasks. In-

stead, a discriminative framework such as a neural network or support vector

machine is likely to provide a more robust decision boundary.

Such concepts have been previously applied to keyword verification in the

works of Ou et al. [25] and Bourlard et al. [5]. Discriminative frameworks have

also been shown to be effective for other classification tasks, such as speaker

verification as demonstrated by Bengio [3].

Given this, the previously evaluated speech background model approach was

modified to incorporate a decision boundary governed by a Multi Layer Per-

ceptron (MLP) neural network. Experiments were performed to compare the

performance of such an approach with the traditional SBM-KV approach.

3.9.1 System architecture

The MLP Speech Background Model (MLP-SBM) method was implemented as

shown in figure 3.7. Each putative occurrence was scored against the target

word model and background model to obtain segment based likelihoods. These

likelihoods, together with the start and end times of the putative occurrence

and duration normalised versions of the likelihoods were then fed into the neural

network as inputs.

The MLP itself was constructed using a single hidden layer consisting of 25

sigmoid neurons. Standard squared error gradient descent training methods were

used to train the network. Training data for the neural network was obtained

by using half the evaluation set from section 3.8. The remaining half of the

evaluation set was used for actual experimentation.
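A sketch of such a verifier, using scikit-learn's MLPClassifier as a stand-in for the gradient-descent-trained network described above (the feature layout and all names are assumptions, not the thesis's implementation):

    from sklearn.neural_network import MLPClassifier

    def build_features(target_ll, background_ll, start, end):
        """Inputs described in the text: the two segment likelihoods, the
        occurrence timings and duration normalised likelihood versions."""
        duration = max(end - start, 1e-3)
        return [target_ll, background_ll, start, end,
                target_ll / duration, background_ll / duration]

    # One hidden layer of 25 sigmoid (logistic) units, as in the text.
    verifier = MLPClassifier(hidden_layer_sizes=(25,), activation='logistic')

    # Training/use (X rows built with build_features, y = 1 hit / 0 false alarm):
    # verifier.fit(X_train, y_train)
    # confidence = verifier.predict_proba(X_test)[:, 1]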


Figure 3.7: System architecture for MLP background model based KV

3.9.2 Results

Figures 3.8, 3.9 and 3.10 show DET plot comparisons of the standard SBM

method and the MLP-SBM method. Table 3.8 shows the equal error rates for

these systems. Note that the miss rate of the previous keyword spotting stage was

not included when calculating miss rates as the focus of these experiments was to

compare the performances of the individual KV methods rather than overall sys-

tem performance. As such, the DET plots and equal error rates for SBM-KV do

not coincide with those reported in section 3.8. Additionally, the longer-length

evaluations appear to suffer from data sparsity issues as indicated by the step

effects that are prominent in figure 3.10. This is unfortunate but could not be

avoided since the evaluation set size had to be reduced to provide training data

for the MLP.

The results show that there is a marked improvement when using MLP-SBM

for short word verification. An absolute equal error rate gain of 3.7% (from

18.0% down to 14.3%) is observed. This is a notable improvement considering

the difficulty of short-word KV. Additionally, the DET plots show that this gain

is consistent across operating points.

Figure 3.8: DET plots for SBM and MLP-SBM systems for 4-phone words

Figure 3.9: DET plots for SBM and MLP-SBM systems for 6-phone words

Figure 3.10: DET plots for SBM and MLP-SBM systems for 8-phone words


Phone-length    SBM EER (%)    MLP-SBM EER (%)
4               18.0           14.3
6               11.2           9.9
8               7.1            7.2

Table 3.8: Equal error rates for SBM and MLP-SBM keyword verification

Unfortunately, the benefits appear to reduce as keyword length increases, with

only a 1.3% gain for 6-phone words and a marginal 0.1% degradation for 8-phone words. Considering

the additional complexity and effort required for an MLP-SBM system, these

minimal gains may not warrant the use of MLP-SBM for longer words.

Overall, the experiments demonstrate that a decision boundary generated by

a discriminative framework, such as an MLP, is beneficial. This is particularly

true for the task of short-word KV. A possible future research task would be to

examine whether additional inputs could be used to further improve performance,

such as scores from multiple non-keyword models or likelihood scores generated

using different feature sets. Compared to the fairly restrictive log-likelihood ratio,

the flexibility of the MLP architecture allows a plethora of possible structures to

be examined, providing greater potential for improved KV performance.

3.10 Summary and Conclusions

A number of keyword spotting and verification architectures were discussed and

evaluated in this chapter. It was found that individual systems excelled in dif-

ferent aspects of performance. Overall though, when the suite of experiments

presented in this chapter is considered holistically, it was established that the

conventional background model based keyword spotting and verification systems


provided the best compromise in performance for a real-time unrestricted vocab-

ulary task.

Specifically, experiments demonstrated that the SBM non-keyword model ob-

tained good miss rate and execution speed performance for keyword spotting.

However, the false alarm rate for this system was very high, indicating that in practical applications a subsequent keyword verification stage would be required.

In contrast, the phone set non-keyword model method achieved a significantly lower false alarm rate, though at the expense of severe degradation in miss rate. Additionally, this method was approximately 10 times slower than the background model method, making it less appropriate for real-time and speed-critical applications.

A target-word-excluding phone model approach was also evaluated for key-

word spotting. This approach was found to have better miss rate performance

than the basic phone model approach, but still suffered from very slow execution

speed.

Experiments were performed to evaluate the use of target word insertion

penalties as a means of improving the miss rate performance of phone model

based keyword spotting. It was found that tuning the word insertion penalty was an

effective means of improving miss rate, though at the expense of increasing false

alarm rate. However, this still did not address the problematic execution speed

of phone model based methods.

The effects of output score thresholding for SBM keyword spotting were also studied as a means of reducing high false alarm rates. Unfortunately, it was

found that any significant reduction in false alarm rate using this approach was

matched by dramatic increases in miss rate, making output score thresholding an

inappropriate tuning method.

Analysis of how keyword spotting performance of the SBM-KS method varied

for different target word lengths demonstrated that there were indeed significant


variations. In general, miss rate was poorer, and the associated DET curves were

further from the origin for shorter keywords.

Keyword verification experiments were performed using an SBM based ap-

proach. The results showed that this method was effective for long-length key-

word verification tasks but performed very poorly for short words. Specifically

an equal error rate of 19.9% was obtained for 4-phone keywords.

To address the issue of poor short-word performance, a discriminative frame-

work KV approach was evaluated. Here a neural network was used to estimate

the keyword verification confidence score using component scores from SBM key-

word verification. It was found that this approach provided significant benefits

for short word verification, yielding an impressive absolute gain of 3.7%. Sadly,

the magnitude of this gain was not matched at longer keyword lengths, suggesting

that the additional complexity of using a neural network may not be warranted

for longer word keyword verification.


Chapter 4

Cohort word keyword verification

4.1 Introduction

This chapter presents a novel KV technique called Cohort Word Keyword Veri-

fication. The proposed method attempts to address the issue of poor short-word

performance by incorporating linguistic information into the non-keyword model

construction process. This is done by synthesising the non-keyword model from

a combination of similarly pronounced cohort words in the target language. The

verification confidence measure is then derived using a combination of scores from

the target word model and cohort words in the non-keyword model.

Chapter 3 demonstrated the poor performance of HMM-based spotting and

verification for short target words. Detection of such words led to significantly

higher miss and false alarm rates than those observed for longer words. The

resulting performance would significantly restrict the practical application of such

algorithms.

The key motivation for the cohort word technique was to obtain a more robust

non-keyword model representation that was suitable for short-word KV. Many of


the traditional keyword verification techniques discussed in section 2.9 used non-

keyword models that were either independent of the target keyword or used fairly

simplistic means of deriving a word-dependent non-keyword model. For example

the SBM approach used a GMM that was completely independent of the target

keyword while target-word-excluding methods, such as the method proposed by

Leung and Fung [20], used a simple out-of-target-word phone selection process to

construct the non-keyword model.

Examination of the putative occurrence sets resulting from short word key-

word spotting revealed that a large portion of the false putative occurrences

corresponded to instances of words that had very similar pronunciations to the

target word. For example, the result set when spotting the word STOCK had

instances of words such as LOCK, STOP, and STICK. Since the durations of

these putative occurrences were short, only a very small number of the observa-

tions in each occurrence scored poorly against the target word model. As such,

in many cases, these false putative occurrences would score highly against the

target word model resulting in the potential for false acceptance.

This suggests that constructing a non-keyword model for a given target word

from knowledge of similarly pronounced words may provide more robust keyword

verification. One means of doing this is to use linguistic pronunciation information

to determine the set of confusable words and to then use cohort-based scoring

techniques to perform keyword verification.

Initial sections of this chapter present a thorough discussion of the cohort word

technique. First, discussions of cohort-based scoring and the role of linguistic

information in keyword verification are presented. This is followed by a detailed

description of the cohort word approach, including a formalised definition of the

algorithm.

Subsequent sections report on various experiments that were performed to


evaluate the performance of the cohort word technique. In addition to deter-

mining miss and false alarm rates of cohort word KV for various target keyword

lengths, a number of experiments to validate various design choices and assump-

tions are reported on. The effects of the various parameters of the cohort word

method are also discussed here.

The final sections of this chapter examine fused KV architectures involving

the cohort word method. Specifically, two key systems are examined: the fusion

of multiple cohort word verifiers and the fusion of a cohort word system with an SBM-based verifier.

4.2 Foundational concepts

The key aspects of cohort verification are the incorporation of word-level linguistic

information and cohort-based scoring into the keyword verification process. These

concepts are discussed further in this section.

4.2.1 Cohort-based scoring

Section 3.3 presented a detailed analysis of two common classes of non-keyword

models: all-speech and target-word-excluding models. A third class is the co-

hort non-keyword model. Such models are constructed using prior information

concerning the cohorts of a target class.

In keyword verification, a cohort non-keyword model is built using subevents

that are deemed confusable with the subevents of the target word. This style

of non-keyword model has been proposed previously in literature, such as in the

works of Sukkar and Lee [32], and Xin and Wang [37].

Within the confusability circle framework, cohort methods can be seen as

attempting to directly model the Confusable Acoustic Region, and as such should


provide better discrimination than non-cohort methods when operating in this

region.

Unfortunately, the behaviour of such systems is unpredictable when operating

in the DAR region. This is because neither the target keyword model nor the non-keyword model explicitly models this region. For keyword verification,

it is imperative that the non-keyword model generates higher probabilities than

the target word model in both the DAR and CAR regions. Since this cannot

be guaranteed for the DAR region, performance may be poor when verifying

occurrences from this region.

However, within the context of a keyword spotting system, the significance of

this issue is greatly reduced. This is because the majority of false alarms in the

output of a keyword spotter should fall within the CAR region, given that this

is the region that is considered confusable. The putative occurrence set will then

only contain a limited number of entries in the DAR. Any errors arising from

poor DAR region verification performance will thus be minimal.

4.2.2 The use of language information

A limitation of many previously proposed KV techniques is that non-keyword

models are constructed without giving any consideration to higher level linguistic

information. Linguistic information has been demonstrated in speech processing

literature to be of considerable value. For example, language models are perva-

sively used in speech recognisers to constrain word sequences, and to improve

recognition performance through word n-gram statistics. Linguistic information

has also been used to improve the performance of speaker verification systems,

for example by Reynolds et al. [26].

The use of language information is particularly relevant for approaches that

use subevent models internally, such as the phone model approaches described by


Leung and Fung [20], and Sukkar and Lee [32]. These methods use a non-keyword

model likelihood based on the subevent sequence that maximises the likelihood

of an observation sequence.

When no auxiliary linguistic syntactic constraints are applied, the best scoring

subevent sequence may be a high scoring nonsensical sequence that does not

occur in the language. For example, when scoring a true occurrence of the word

DONOR = (/d/, /ow/, /n/, /er/), the best scoring phone sequence in the non-

keyword model may, for example, be G = (/p/, /ow/, /n/, /er/) — a sequence

that does not correspond to an English word. G may in this case actually score higher than DONOR because of effects such as modeling inadequacies or channel mismatch effects. If this happened, then this true occurrence may be falsely

rejected based on a comparison made with a nonsensical phone sequence.

The use of phone-level language model statistics may reduce this effect by

applying probabilistic constraints to the phone sequences that are considered

for scoring in the non-keyword model. However, this approach still has some

shortcomings. For example, if the trigram /p/ − /ow/ − /n/ occurs frequently

in English, then G may still score highly, resulting in an unrealistically high

non-keyword model score.

Word-level information is another source of language information that un-

fortunately to date has not been explored in KV literature. One application of

word-level information is in the derivation of a set of potentially confusable words

for a target word. Intuitively one would expect that words that sound similar to

a target word are likely candidates for confusion by a keyword spotter. As such,

there is the potential that false alarms in the output putative occurrence set are in

fact true occurrences of a similar sounding word. Therefore, if knowledge regard-

ing the target vocabulary could be used to derive a list of potentially confusable

words for a given target word, then this list could be used to aid discrimination

in keyword verification.


One major advantage of word-level information over phone-level information

is the coupling between target word and non-keyword model construction. In

word-level methods, the KV process is very tightly coupled to the target word.

For example, the confusion word set is specific for each target word. In con-

trast, phone-level information is more loosely coupled to the target word since

decisions regarding non-keyword model construction are made at the phone-level.

For example, when using the cohort phone approach proposed by Sukkar and Lee

[32], the choice of models included in the non-keyword model is derived from a

knowledge based mapping of cohort phones. Although the list of phones consid-

ered is constrained by the phone decomposition of the target word, the level of

dependence on the target word is still considerably lower than word-level based

methods.

4.3 Overview of the cohort word technique

The cohort word keyword verification method incorporates the distinct advan-

tages of word-level language information and cohort non-keyword models. Word-

level language information is used to obtain a set of potentially confusable words

for a given target word. This set of confusable words is combined in parallel to

create the cohort word non-keyword model. Cohort verification techniques are

then used to classify a putative occurrence as belonging to either the target class

or the non-keyword class.

It is anticipated that using this approach will improve the robustness of the

non-keyword model since it is more directly representative of potential false pu-

tative occurrences, and as such the CAR region. For example, when verifying the

word DONOR, comparisons will be made with true events that occur in the ap-

plication’s vocabulary, such as SONAR and LONER. As a result, non-keyword

model likelihoods will not be derived from nonsensical events, hence avoiding the


issue of unrealistically high non-keyword likelihoods.

In essence, the cohort word method can be described as a classifier that asks

the question Is this occurrence best modeled by the target word or one of the words

that are easily confusable with the target word? This is in fact a sensible question

to ask for keyword verification, since in practice this is exactly the question that

is trying to be answered.

The behaviour of the cohort word method is anticipated to be as follows for

the given scenarios:

1. The putative occurrence is an instance of the target word. In this case, both

the target keyword model and the individual confusable word non-keyword

models will output high likelihoods. However, on average the likelihood

output by the target word model should be greater than the likelihood

output by any of the individual confusable word models.

2. The putative occurrence is an instance of a confusable word. Given that

a keyword spotter is trying to detect occurrences of a specific keyword, it

is not unreasonable to expect that a significant portion of putative occur-

rences will actually correspond to instances of a confusable word. In these

cases, a cohort word non-keyword model will provide robust discrimina-

tion since the confusable words are explicitly modeled by the non-keyword

model. The performance of the cohort word approach may outperform that

of cohort subevent methods since non-keyword models cannot be based on

nonsensical subevent sequences.

Unfortunately, as with cohort-subevent approaches, performance for putative

occurrences in the DAR of the confusability circle may be unpredictable. For

such putative occurrences, low likelihoods would be observed for both the target

word model and the non-keyword model. Ratio based confidence scores, such as


the log likelihood ratio, may not provide sufficient discrimination in such cases.

Instead, a confidence score that accounts for low absolute model likelihoods may

be more appropriate.

A number of issues need to be considered when formulating the cohort word

method. Of most importance are the choice of cohort word selection process

and the non-keyword modeling architecture. These are discussed in the following

sections.

4.4 Cohort word set construction

Obtaining the cohort word set of a target word is in itself a considerable problem.

In particular, for unconstrained vocabulary keyword verification, an automatic

procedure for deriving this set is required. One means of performing this auto-

matic selection is to use a word-confusability measure in conjunction with a large

word list that provides adequate coverage of the target application vocabulary.

Let D(w1, w2) be a distance that measures the confusability between two

words, w1 and w2. Additionally, let V be a list of words representing the target

application vocabulary. Then the cohort word set, R(w), of word w can be

expressed as:

R(w) = {v ∈ V | dmin ≤ D(w, v) ≤ dmax}    (4.1)

where dmin and dmax are thresholds used to limit the degree of confusability of

cohort words.
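As an illustration, the selection rule of equation 4.1 amounts to a filter over the vocabulary. The minimal sketch below assumes that some word-confusability distance D(w1, w2) is available (the stand-in distance used in the toy example is purely illustrative), and excludes the target word itself, as required later in this chapter.

    # Sketch of equation 4.1: R(w) = {v in V : dmin <= D(w, v) <= dmax}.
    def cohort_word_set(target, vocabulary, distance, d_min, d_max):
        return [v for v in vocabulary
                if v != target and d_min <= distance(target, v) <= d_max]

    # Toy usage with a stand-in distance (absolute difference in word length).
    toy_distance = lambda w1, w2: abs(len(w1) - len(w2))
    print(cohort_word_set("STOCK", ["STACK", "SOCK", "STOP", "STOCKING"], toy_distance, 1, 2))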

4.4.1 The choice of dmin and dmax

Both dmin and dmax play important roles in controlling the performance of cohort

word keyword verification. Within the confusability circle, these can be seen as

controlling the extent of CAR modeling within the non-keyword model, as shown


in figure 4.1.

[Diagram: confusability circle for the target word STOCK, showing the Target Acoustic Region, the Confusable Acoustic Region containing similarly pronounced words such as STACK, SOCK, FLOCK and SPOCK, and the Disparate Acoustic Region, with dmin and dmax bounding the band of words selected as cohorts]

Figure 4.1: Controlling the degree of CAR region modeling via dmin and dmax tuning

Tuning of the dmax parameter changes the number of words included in the

cohort word set. Specifically, using a very large value is likely to improve perfor-

mance since a greater number of cohort words are included in the discrimination

process. However, a large dmax will also reduce execution speed because of the

increased complexity of the non-keyword model.

Careful attention must also be given to the choice of dmin. Using too small

a value will result in extremely confusable words being included in the cohort

word set. This may result in reduced false alarm rates since highly confusable

words are included in the non-keyword model. However, it is more likely that

this will introduce significant overlaps between the target word model and the

non-keyword model, resulting in an increase in miss rate.


4.4.2 Cohort word set downsampling

Further control of the cohort word set can be obtained by downsampling. This

results in less coverage of the CAR region, potentially leading to suboptimal

performance. However, for large vocabularies, this may be necessary to maintain

practical execution speed. In this work, random sampling is used to perform this

downsampling. That is, the reduced cohort word set is given by:

RJ(w) = {shuffle(R(w))i : i = 1, . . . , J}    (4.2)

where the shuffle function randomly shuffles a set.
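A minimal sketch of this step, using the standard library shuffle as the shuffle function, is:

    # Sketch of equation 4.2: randomly shuffle R(w) and keep at most J cohort words.
    import random

    def downsample_cohort_set(cohort_words, J, seed=None):
        shuffled = list(cohort_words)
        random.Random(seed).shuffle(shuffled)
        return shuffled[:J]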

4.4.3 Distance function

The choice of distance function will also play a significant role in controlling the

quality of the cohort word set. Distance can be calculated either acoustically or

linguistically.

An acoustic approach is more in line with the motivation of cohort word KV -

that is, to determine and model the acoustic region of confusability. The distance

measures used in data clustering algorithms, such as likelihood-based metrics and

the Kullback-Leibler divergence, are candidates for such a distance function.

Although both these acoustic approaches are suitable for determining cohort

words, they are computationally and data intensive. In particular, for large target

application vocabularies, determining the acoustic distance between the target

word and every word in the vocabulary list V would be almost intractable.

In contrast, a pronunciation-based linguistic distance function is more practi-

cal. Such a measure would reflect the distance between two words with regards

to their difference in pronunciation. Intuitively, one would expect words with

similar pronunciations to be confusable, and therefore likely candidates for the


CAR region. This is particularly true for phone-based recognition systems, where

confusable words would share a significant number of the same phone models.

A candidate for this pronunciation-based linguistic distance function is the

MED or Levenshtein distance (see Appendix A). The MED distance determines

the minimum cost of transforming a source sequence to a target sequence using

a combination of match, substitution, insertion and deletion operations, where

each operation has an associated cost.

Within the context of cohort word selection, the MED distance can be used to

measure the distance between the phonetic pronunciations of two words. There-

fore, given the mapping function, Φ(w), that maps a word w to its phonetic pronunciation sequence, the MED distance based cohort word distance function is given by:

DL(w1, w2) = MED(Φ(w1), Φ(w2))    (4.3)

where MED(A,B) is the MED distance between sequence A and sequence B.

A powerful feature of the Minimum Edit Distance is the support for vari-

able cost functions. This allows certain types of transformations to be favoured

over others. For example, a low deletion cost will result in a reduced cost of

transformation to shorter sequences.

In cohort word selection, tuning of the MED costs can be used to favour certain

types of cohort words over others. For example, using a high insertion penalty

will result in more cohort words with fewer phones in their pronunciation than the target word. It is not possible to intuitively predict which combination of costs would be best suited for cohort word selection. However, the ability to tune MED

cost parameters provides an additional avenue for performance optimisation.

Thus, a more generalised MED-based cohort word distance function is then


given by:

DM(w1, w2) = MED(Φ(w1), Φ(w2), ψd, ψi, ψs, ψm)    (4.4)

where MED(A,B, . . .) = MED distance to transform A to B

ψd = MED deletion cost function

ψi = MED insertion cost function

ψs = MED substitution cost function

ψm = MED match cost function
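The sketch below implements this distance as a standard dynamic-programming minimum edit distance over phone sequences, with the four costs exposed as scalar parameters rather than the general cost functions allowed by equation 4.4; the small lexicon standing in for Φ(w) is an assumption of the example.

    # Sketch of the MED-based cohort word distance of equation 4.4.
    def med_distance(src, dst, del_cost=1.0, ins_cost=1.0, sub_cost=1.0, match_cost=0.0):
        n, m = len(src), len(dst)
        d = [[0.0] * (m + 1) for _ in range(n + 1)]   # d[i][j]: cost of turning src[:i] into dst[:j]
        for i in range(1, n + 1):
            d[i][0] = d[i - 1][0] + del_cost
        for j in range(1, m + 1):
            d[0][j] = d[0][j - 1] + ins_cost
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                step = match_cost if src[i - 1] == dst[j - 1] else sub_cost
                d[i][j] = min(d[i - 1][j] + del_cost,       # delete src[i-1]
                              d[i][j - 1] + ins_cost,       # insert dst[j-1]
                              d[i - 1][j - 1] + step)       # match or substitute
        return d[n][m]

    def cohort_distance(w1, w2, lexicon, **costs):
        # D_M(w1, w2): MED between the phonetic pronunciations Phi(w1) and Phi(w2).
        return med_distance(lexicon[w1], lexicon[w2], **costs)

    # Toy usage: DONOR and LONER differ by a single substitution.
    lexicon = {"DONOR": ["d", "ow", "n", "er"], "LONER": ["l", "ow", "n", "er"]}
    print(cohort_distance("DONOR", "LONER", lexicon))   # 1.0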

4.5 Classification approach

The cohort non-keyword model is based on the knowledge of the cohort words of

a given target word. How this knowledge is incorporated into the construction

process of the non-keyword model, and more importantly, how individual putative

occurrences are scored is an important consideration.

Two approaches to this issue are discussed here: a 2-class approach and a

hybrid N-class approach.

4.5.1 2-class classification approach

From a functional view, cohort word keyword verification is a 2-class classification

task, attempting to discriminate between putative occurrences corresponding to

a target word and putative occurrences representing false alarms. A benefit of

using a 2-class approach is that classifiers used in other verification methods,

such as LLR discrimination, can then be easily incorporated. However, a distinct

disadvantage is that a means of fusing the individual cohort word scores must be

determined in order to generate a single non-keyword model likelihood.

A possible formulation of a cohort word LLR is as follows. Given a target


keyword w, let R(w) = {r1, r2, . . .} be defined as the cohort word set of w. Note

that R(w) does not include the target word w. The corresponding models of the

cohort word set can then be used as the non-keyword model term in equation

2.17. The generalised cohort word confidence score for an observation sequence,

X, is hence given by:

C(X,w,R(w)) = log p(X|λw) − log F(p(X|λr1), p(X|λr2), . . .)    (4.5)

The fusion function, F (. . .) is used here to denote the fusion of the individual

cohort word model scores.

The choice of fusion function will have an effect on overall performance, and

therefore must be carefully considered. One candidate for this fusion function is

to use a probabilistic ’OR’. This is an intuitive approach since the non-keyword

model likelihood will then be based on the likelihood that the putative occurrence

represents any one of the cohort words. The confidence score formulation using

this technique is then:

Ce(X,w,R(w)) = log p(X|λw) − log [ (1/|R(w)|) Σi p(X|λri) ]    (4.6)

However, this approach requires likelihood calculations for all words in the

cohort word set. This can be computationally very expensive, particularly if a

large cohort word set is used to maximise coverage of the CAR region.

To reduce computational requirements, the following approximation may be

used. Let S(w,K) = {s1, s2, . . . , sK} be the subset of the K top scoring words

from the cohort word set R(w). Then the simplified cohort word confidence score

can be defined as:

Ce′(X,w,R(w), K) = log p(X|λw) − log F(p(X|λs1), . . . , p(X|λsK))    (4.7)


This simplified confidence score only requires the likelihoods of the K best

scoring cohort words. Computational requirements can be significantly reduced

for many scoring algorithms if only the top K scoring models are required. For

example, if HMM cohort word models are being used, Viterbi N-Best recognition

can be used to determine the K best scoring models - a significantly cheaper

operation than obtaining the individual scores of every cohort word.

However a trade-off is that the overall non-keyword model likelihood is now

estimated from a much smaller set of cohort words. Estimation theory states

that the error in estimation tends to increase as the number of samples de-

creases. As a result, reduced KV performance may be observed when using this

Ce′(X,w,R(w), K) approximation.
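As an illustration, both scores can be computed directly from log likelihoods, which is the form in which HMM scores are normally available; the mean of equation 4.6 is then evaluated in the log domain with a log-sum-exp for numerical stability, and the same mean fusion is assumed for F in the top-K approximation.

    # Sketch of equations 4.6 and 4.7 operating on log likelihoods.
    import math

    def log_mean_exp(log_values):
        # log( (1/N) * sum_i exp(log_values[i]) ), computed stably.
        m = max(log_values)
        return m + math.log(sum(math.exp(v - m) for v in log_values)) - math.log(len(log_values))

    def confidence_2class(target_loglik, cohort_logliks):
        # C_e (equation 4.6): target score minus the log of the mean cohort likelihood.
        return target_loglik - log_mean_exp(cohort_logliks)

    def confidence_2class_topk(target_loglik, cohort_logliks, K):
        # C_e' (equation 4.7): the same score restricted to the K best scoring cohort words.
        return target_loglik - log_mean_exp(sorted(cohort_logliks, reverse=True)[:K])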

4.5.2 Hybrid N-class approach

Cohort word KV may also be implemented as an N-class classification task. Here,

the algorithm attempts to classify a putative occurrence as belonging to one of

a set of word classes, {w} ∪ R(w), where w is the target word and R(w) is the cohort word set of w. This is a direct application of the cohort word motivation: attempting to ask the question "Is this occurrence best modeled by the target word or by one of the words that are easily confusable with the target word?".

However, a distinct consequence of using an N-class approach is that there is

no special consideration given to the target word class. Standard N-class clas-

sifier training algorithms do not typically provide the facility to directly train

the classifier to favour optimal decision making for a single class (i.e. the target

word class for KV). In fact, optimising a classifier to favour a specific class is very

much against the fundamental concepts of many N-class classifier training meth-

ods. As such, any classifier training for N-class cohort word keyword verification

would have to be indirectly optimised to favour correct target word classification


decisions.

An alternative is to only consider the cohort word models in the classification

task - that is, to exclude the target word from the N-class classification. Using this

approach, a putative occurrence can first be classified in terms of the best scoring

cohort word. A subsequent 2-class classification may then be used to discriminate

between the target keyword model and the best-scoring cohort word. Figure 4.2

demonstrates this technique.

[Diagram: a putative occurrence is scored against the cohort word models λr1, λr2, . . . , λrN by an N-class classifier; the best scoring cohort model λrk is then passed, together with the target word model λw, to a 2-class classifier which produces the final result]

Figure 4.2: An N-class classifier approach to cohort word verification for the keyword w and cohort word set R(w)

This hybrid approach combines the benefits of N-class classification and 2-class

classification. N-class classification is used to select the most appropriate cohort

word model from which to estimate the non-keyword model likelihood. This

circumvents the need for a fusion function as required for the 2-class classifier

approach. In addition the use of the subsequent 2-class classification stage allows

direct tuning of the final decision stage to be optimised for target word class


discrimination.

In this work, the maximum likelihood N-class classifier and the LLR 2-class

classifier are used to implement the hybrid approach. Using these classifiers, the

hybrid N-class cohort word confidence score can be expressed as:

Ch(X,w,R(w)) = log p(X|λw) − log maxk p(X|λrk)    (4.8)

It can be seen that this formulation is in fact a special case of the 2-class

classification Ce′(X,w,R(w), K) score, with K = 1.
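In terms of log likelihoods the hybrid score therefore reduces to a subtraction against the best cohort score, as in the short illustrative sketch below.

    # Sketch of equation 4.8: the non-keyword term is the best cohort log likelihood.
    def confidence_hybrid(target_loglik, cohort_logliks):
        return target_loglik - max(cohort_logliks)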

4.6 Summary of the cohort word algorithm

A summary of the cohort word keyword verification algorithm for verifying pu-

tative occurrences of the target word w is given below:

1. Define algorithm parameters

(a) Let V be the list of words in the target application vocabulary and

Φ(w) be a function that maps the word w to its phonetic pronunciation

sequence.

(b) Let J be defined as the maximum cohort word set size as required for

cohort word set downsampling (see equation 4.2).

(c) Let [dmin, dmax] be defined as the cohort word selection range as re-

quired for cohort word selection in equation 4.1.

(d) Let DM(x, y) be the MED distance function used for cohort word dis-

tance calculations

(e) Let ψd, ψi, ψs, ψm be defined as the MED deletion, insertion, substi-

tution and match cost functions respectively


(f) Let Ca(X,w,R(w)) be defined as the cohort word confidence score,

where this is one of the previously discussed confidence score formula-

tions:

i. 2-class classification score: Ce(X,w,R(w))

ii. Approximated 2-class classification score: Ce′(X,w,R(w), K)

iii. Hybrid N-class classification score: Ch(X,w,R(w)).

2. Determine the cohort word set, RJ(w)

(a) Obtain the cohort word set:

R(w) = {v ∈ V | dmin ≤ DM(w, v) ≤ dmax}

(b) Generate the downsampled cohort word set:

RJ(w) = {shuffle(R(w))i : i = 1, . . . , J}

3. For each putative occurrence, X, perform cohort word verification

(a) Calculate the cohort word confidence score Ca(X,w,R(w))

(b) Perform thresholding to accept or reject the putative occurrence

For convenience, the parameter set Θ = {ψd, ψi, ψs, ψm, dmin, dmax, J} is col-

lectively referred to as the cohort word selection parameters within the rest of

this work.
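For illustration only, the steps above can be composed from the sketches given earlier in this chapter (cohort_word_set, downsample_cohort_set, cohort_distance and confidence_hybrid); the score_word argument, which would return log p(X|λw) for a putative occurrence scored against a word model, stands in for the HMM recogniser and is an assumption of the sketch.

    # Illustrative composition of the cohort word verification algorithm.
    def verify_occurrence(X, target, vocabulary, lexicon, score_word,
                          d_min, d_max, J, threshold, med_costs=None):
        distance = lambda a, b: cohort_distance(a, b, lexicon, **(med_costs or {}))
        cohorts = downsample_cohort_set(
            cohort_word_set(target, vocabulary, distance, d_min, d_max), J)
        if not cohorts:
            return True, float("inf")     # arbitrary fallback when no cohort word exists
        # Hybrid N-class confidence score of equation 4.8, thresholded to accept or reject.
        score = confidence_hybrid(score_word(X, target),
                                  [score_word(X, r) for r in cohorts])
        return score >= threshold, score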

4.7 Comparison of classifier approaches

An important consideration for any keyword verification algorithm is the struc-

ture of the confidence score metric. The confidence score metric is the crux of


these algorithms and therefore it is paramount that significant care and attention

is given to its formulation.

In section 4.5, two cohort word confidence metrics were presented: the 2-class

classification approach and the hybrid N-class approach. Discussion was presented regarding the benefits and flaws of each method; however, no quantitative results were provided to validate these assertions. As such, this section

reports on experiments that were performed to compare these two methods.

4.7.1 Evaluation set

The evaluation set was generated from the putative occurrence set of a SBM-

based keyword spotter using the following process:

1. 100 6-phone length keywords were randomly selected from the TIMIT test

set

2. Keyword spotting was performed on the TIMIT test set for each of the 100

words to generate a set of putative occurrences.

3. Each putative occurrence was classified as a true or false occurrence using

the reference transcriptions of the TIMIT test set.

4. 111 true occurrences were randomly selected from the set of putative oc-

currences and included in the evaluation set as true occurrences

5. 555 false alarm occurrences were randomly selected from the set of putative

occurrences and included in the evaluation set as false alarms

Restrictions were applied to the number of true and false occurrences in the

evaluation set to allow practical computational time. This was particularly rele-

vant for the 2-class classification approach.


4.7.2 Recogniser parameters

Perceptual Linear Prediction (12 statics and C0 + deltas + accelerations) feature

extraction was used to parameterise each utterance. In addition, Cepstral Mean

Subtraction was applied to reduce the effects of channel/speaker mismatch.

Cross-word triphone HMM models with 16-mixture Gaussian state distribu-

tions were used to model keywords and cohort words. As there was insufficient

data in the TIMIT training set to train robust triphone models, the HMMs were

trained using the Long and Short Training subsets (140 hours) of the Wall Street

Journal 1 clean microphone speech database.

4.7.3 Cohort word selection

Table 4.1 shows the values of the cohort word parameters that were used. Every

combination of these parameters was evaluated for each cohort word method so

that experiments would not be penalised by a poor choice of selection parameters.

This resulted in a total of 40 evaluation systems.

Parameter          Range
V                  The CALLHOME PronLex [19] word list, consisting of approximately 90000 words
Φ(w)               Phonetic pronunciations taken from the PronLex lexicon
J                  200
dmin               1-4
dmax               dmin-4
ψd, ψi, ψs, ψm     1-2, 1-2, 1, 0

Table 4.1: Evaluated cohort word selection parameters

The limits on the evaluated values of dmin, dmax and the MED cost parameters


were applied to restrict the scope of the experiments. Using these parameters

resulted in 40 different possible cohort word systems that had to be evaluated.

For the 2-class experiments, all cohort words were included in the summation

of cohort word likelihoods. That is, the confidence score formulation used was

Ce(X,w,R(w)).

4.7.4 Evaluation procedure

Experiments were performed using the SBM-KV, equal-weighted sum cohort word

and hybrid N-class cohort word methods. For each method, the following proce-

dure was used:

1. For each putative occurrence:

(a) The target word score was calculated using a target-word model con-

structed from a concatenation of triphone models

(b) The non-keyword model score was calculated using the appropriate

non-keyword model

(c) The confidence score was calculated

2. Thresholding on confidence score was applied to obtain false acceptance

performance at the 3% and 10% miss rate performance points.
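The thresholding step can be sketched as a simple sweep over the scored putative occurrences: sort the true-occurrence scores, choose the threshold that concedes the target miss rate, and count the false alarms that still score at or above it. The function below is illustrative and assumes the scores are supplied as (score, is_true) pairs.

    # Sketch of step 2: false acceptance rate at a fixed miss rate (e.g. 0.03 or 0.10).
    def false_alarm_rate_at_miss_rate(scored_occurrences, target_miss_rate):
        trues = sorted(score for score, is_true in scored_occurrences if is_true)
        falses = [score for score, is_true in scored_occurrences if not is_true]
        n_allowed_misses = int(target_miss_rate * len(trues))
        threshold = trues[n_allowed_misses]      # occurrences scoring below this are rejected
        false_accepts = sum(1 for score in falses if score >= threshold)
        return false_accepts / len(falses)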

4.7.5 Results

The results of the KV experiments are shown in table 4.2. Once again every

combination of selection parameters was evaluated resulting in 40 systems per

method. However, only the best performing system at each of the two miss rates

is reported.

Clearly the CW-NClass architecture outperforms the CW-2Class system at

both miss rate operating points. The optimal cohort word selection parameters


Method        FA @ 3% miss rate        FA @ 10% miss rate
CW-2Class     67.8%  (1,4,1,2)         44.1%  (1,4,1,2)
CW-NClass     31.5%  (2,4,1,2)         17.8%  (3,3,1,2)

Table 4.2: Performance of selected cohort word KV systems on the TIMIT evaluation set. Cohort word systems are qualified with the appropriate cohort word selection parameters using a tag in the format dmin, dmax, ψd, ψi.

varied at each operating point, but in nearly all of the 40 evaluated systems, the

CW-NClass method significantly outperformed the CW-2Class method.

Additionally, the CW-NClass systems were in the order of 30 times faster

than the CW-2Class systems. This was because the CW-2Class systems required

calculation of likelihoods for every cohort word model (200 cohort words in this

case). In contrast, the CW-NClass systems only required finding the best scoring

cohort word using a single Viterbi recognition pass - a considerably faster task.

The execution speed of the CW-2Class approach could be improved by re-

ducing the number of cohort word models considered in the fusion

function. That is, a smaller value of K could be used. However, this would result

in less coverage of the CAR region within the non-keyword model, and therefore

poorer performance.

Further analysis of the CW-2Class method revealed that there was a

large variance in the distribution of cohort word likelihoods for a given putative

occurrence. Additionally, the mean of these likelihoods was often very much lower

than the best cohort word likelihood. Since the fusion function used essentially

calculated the mean cohort word likelihood, this meant that the resulting non-

keyword model likelihood was often quite low. This explains the high false alarm

rates of the CW-2Class approach.

In summary, the reported experiments demonstrate the advantages of the N-

class hybrid cohort word approach over the 2-class cohort word method. The


N-class approach outperformed the 2-class approach in terms of detection perfor-

mance and was also approximately 30 times faster. It can be concluded then that

the N-class hybrid confidence score formulation is more appropriate for cohort

word keyword verification.

4.8 Performance across target keyword length

In a similar fashion to keyword spotting, KV performance is very much affected

by the length of the target keyword. Specifically, the longer the average duration

of a target word, the easier it is to robustly verify its putative occurrences.

This section examines the effects of target keyword length on cohort word

verification performance. In particular, three target word lengths are studied:

4-phone, 6-phone and 8-phone, corresponding to short, medium and long words

respectively.

Experiments were performed using the specific classes of target word length.

Only the hybrid N-class approach was evaluated since experiments in section 4.7

demonstrated its superiority over the 2-class formulation, particularly in terms of

speed. Additionally, a more difficult evaluation set taken from the Switchboard-1

conversational telephone speech corpus was used. This was done to provide in-

sight into the performance of the cohort word method under less ideal conditions.

The cohort word selection parameters and evaluation procedure used were the

same as those described in sections 4.7.3 and 4.7.4.

4.8.1 Evaluation set

Evaluation speech was taken from a subset of the SWB1 database. A list of

words for each phone length class was obtained by using phone pronunciations

from the PronLex lexicon. Evaluation sets for each phone length class were then


restricted to containing only putative occurrences for words of the same phone

length.

The putative occurrence sets for 4-phone and 6-phone words consisted of 375

true putative occurrences and 1875 false putative occurrences. Only 175 true

and 875 false putative occurrences were used for the 8-phone evaluation set as

8-phone words were less frequent in the test set.

True occurrences were obtained by randomly selecting words of the required

phone length from high quality forced-aligned transcriptions of the SWB1 test

data. False putative occurrences were obtained by first performing keyword spot-

ting using a SBM-based keyword spotting system for each of the words in the

true putative occurrence set, and then randomly selecting the necessary number

of false occurrences from the false alarm outputs. This restricted the false oc-

currences to being acoustically similar occurrences of words that were confusable

with the target word.

4.8.2 Recogniser parameters

Speech was parameterised using Perceptual Linear Prediction (12 statics and C0

+ deltas + accelerations) feature extraction. Cepstral Mean Subtraction was

applied to provide speaker and channel compensation.

Cross-word triphone HMM models with 16-mixture Gaussian state distribu-

tions were used to model target words and cohort words. These HMMs were

trained using a 165 hour training subset of the SWB1 data that was independent

of the evaluation data set.

Additionally, a 256-mixture GMM was trained on the same SWB1 training

dataset for use with SBM-KV baseline experiments.


4.8.3 Results

Equal error rate results for the SBM-KV baseline, the 3 best cohort word methods,

the median cohort word method and the worst cohort word method are shown

below.

4 phone-len                6 phone-len                8 phone-len
Method          EER        Method          EER        Method          EER
SBM-KV          19.7       SBM-KV          11.7       SBM-KV           7.5
CW 3, 3, 2, 1   14.1       CW 1, 4, 2, 1    9.6       CW 1, 4, 1, 1    9.7
CW 3, 3, 2, 2   14.1       CW 3, 4, 2, 1    9.8       CW 3, 4, 2, 1   10.3
CW 2, 3, 1, 2   14.2       CW 4, 4, 1, 1   10.1       CW 2, 4, 1, 1   10.3
CW 1, 4, 2, 1   16.3       CW 1, 3, 1, 1   12.3       CW 3, 3, 2, 1   19.3
CW 1, 1, 2, 2   25.9       CW 1, 1, 2, 2   55.7       CW 1, 1, 2, 2   66.3

Table 4.3: Performance (equal error rate, %) of SBM-KV and selected cohort word systems on the SWB1 evaluation sets. Cohort word selection parameters are specified with each system in the format dmin, dmax, ψd, ψi.

In terms of EER, the cohort word methods clearly outperformed SBM-KV for

the 4-phone and 6-phone keyword lengths. However, the contrary was observed

for 8-phone words. EER gains were particularly pleasing for the 4-phone set,

where the 3 best cohort word methods yielded an approximate 30% relative gain

over SBM keyword verification.

Cohort word performance gains are consistent across operating points for the

4-phone and 6-phone tests. Figures 4.3 and 4.4 show the DET plots for the best

cohort word system and the SBM-KV system for 4-phone and 6-phone words

respectively. For short length 4-phone KV, the cohort word method maintained

a considerable margin of gain over SBM-KV at all operating points. Cohort

word KV also outperformed SBM-KV for medium length 6-phone words at false

acceptance rates above 6%.


[DET plot: miss probability (%) versus false alarm probability (%); curves: Cohort word (3,3,2,1), SBM-KV]

Figure 4.3: DET plot for best cohort word and SBM-KV systems on SWB1 4-phone length evaluation set

[DET plot: miss probability (%) versus false alarm probability (%); curves: Cohort word (1,4,2,1), SBM-KV]

Figure 4.4: DET plot for best cohort word and SBM-KV systems on SWB1 6-phone length evaluation set


4.8.4 Analysis of poor 8-phone performance

Trends in performance figures across keyword length indicate that the benefits

of cohort word KV over SBM-KV decreased with increased keyword length. A

likely explanation is a decrease in the quality of the cohort word non-keyword

model.

The cohort word method relies on a non-keyword model that adequately mod-

els the CAR region. In order to maintain practical execution speeds, the exper-

iments used a random sampling of at most 200 cohort words to model the CAR

region.

However, examination of the average number of cohort words used for each

phone length demonstrated that in fact 200 cohort words were not used in every

case. Table 4.4 shows the mean and standard deviation of the number of cohort

words used for the 3 best performing systems for each phone length. For the

majority of the 4-phone and 6-phone experiments, the number of cohort words

was 200 as expected. In contrast, none of the 3 best performing 8-phone systems

had a mean cohort word set size of 200. This implies that for the 8-phone length

experiments, the coverage of the CAR region for non-keyword modeling was less

than expected, and as such, the quality of the cohort word confidence score was

compromised.

A reduced number of cohort words was obtained for the 8-phone length exper-

iments because there were simply insufficient words of the required length that

were potential cohort candidates. In fact, it must be noted that the 3 best per-

forming 8-phone length cohort word systems all had high dmax values allowing a

greater number of words to be considered as cohort words. This suggests that

perhaps higher values of dmax than those evaluated may result in better 8-phone

performance.

Further analysis of experimental results reveals that there is a high degree


Phone length    Method          Mean # cohort words    Std # cohort words
4               CW 3, 3, 2, 1   200.0                   0.0
4               CW 3, 3, 2, 2   199.9                   0.2
4               CW 2, 3, 1, 2   200.0                   0.0
6               CW 1, 4, 2, 1   200.0                   0.0
6               CW 3, 4, 2, 1   200.0                   0.0
6               CW 4, 4, 1, 1   199.8                   2.6
8               CW 1, 4, 1, 1   180.9                  39.9
8               CW 3, 4, 2, 1   107.8                  53.4
8               CW 2, 4, 1, 1   158.2                  51.1

Table 4.4: Mean and standard deviation of the number of cohort words used in the 3 best performing cohort word KV methods for the SWB1 evaluation set

of correlation between equal error rate and the mean number of cohort words.

Figure 4.5 shows a plot of equal error rate versus the mean number of cohort

words for all the cohort word systems that were evaluated. The reflected trends

show that increased equal error rates were observed for systems that had a lower

mean number of cohort words. Additionally it can be seen that a good majority

of 4-phone and 6-phone systems had mean cohort word set sizes of 200 while the

8-phone systems had a significant number of lower valued cohort word set sizes.

4.8.5 Conclusions

In conclusion, the experiments demonstrate that the cohort word method is ad-

vantageous for short to medium length KV. This is a pleasing result since short

word KV is a particularly difficult problem prone to high miss rates and false

alarm rates.

However, there are issues that need to be addressed for long word KV. In


[Plot of equal error rate against mean cohort-word count for the 4-phone, 6-phone and 8-phone systems]

Figure 4.5: Equal error rate versus mean number of cohort words


particular, the experiments suggest that for long words there were insufficient words in the dictionary to construct a cohort word non-keyword model that provided sufficient coverage of the CAR region.

A simple solution to this may be to increase the number of long words in

the dictionary. However, considering that a very large 90000 word dictionary was

used in these experiments, it is likely that there simply are not enough linguistically

similar words in the English language that are suitable for use in 8-phone cohort

word keyword verification.

An alternative solution may be to increase dmax. This would increase the

number of candidate cohort words, thus resulting in larger cohort word set sizes

and more robust confidence scores.

4.9 Effects of selection parameters

The experiments in previous sections demonstrated that cohort word verification

performance is very sensitive to the choice of cohort word selection parameters.

In particular, for 6-phone and 8-phone target words, a poor choice in cohort

word selection parameters can result in extraordinarily poor rates of performance

compared to the best achievable rates. For example in section 4.8 the best 6-

phone cohort word system had an EER of 9.6% while the worst had an EER of

55.7%.

This section examines the contributions of various selection parameters to

overall cohort word performance. In particular, attention is given to the effects

of cohort word set downsampling, selection range tuning and MED cost function

tuning.

Analysis is performed on the experimental outputs of the evaluations reported

in section 4.8. Any additional experiments conducted used the same evaluation

sets, recogniser parameters and experimental procedures documented in sections


4.8.1, 4.8.2 and 4.7.4 respectively.

4.9.1 Cohort word set downsampling

Cohort word set downsampling is used to reduce the size of a cohort word set

to maintain fast execution speeds. Without downsampling, the cohort word set can become extremely large, resulting in increased non-keyword model complexity and thus slower processing times.

Downsampling is particularly important when a large cohort word selection

range is used, since this dramatically increases the number of candidate cohort

words. For example, the word GALLERY has a cohort word set size of 3044

when using the selection parameters dmin = 1, dmax = 4, ψd = 2 and ψi = 1

in conjunction with the 90000 word PronLex dictionary. In contrast, using a

reduced selection range of dmin = 1 and dmax = 2 results in a set size of only 49

words.

The method of downsampling will also have some effect on performance. In

this work, random sampling is used to perform downsampling. There is likely to

be a significant amount of redundancy in terms of CAR region coverage within

the cohort word set. As such, in addition to reducing the size of the cohort word

set, random sampling will hopefully also remove some of this redundancy without

significantly compromising the quality of the non-keyword model.

In order to gain a better understanding of the effects of downsampling, the

experiments from section 4.8 were repeated using three alternate cohort word

set downsampling sizes: 50, 100 and 300. Figure 4.6 shows the distribution of

EER across these various cohort word set downsampling sizes. The effect of

downsampling clearly decreases with increased keyword length. For 6-phone KV,

there appears to be only a small effect on both the mean EER and the best

obtained EER. There are no noticeable effects on 8-phone KV, though this is to


[Plots of EER against cohort word downsample set size (50, 100, 200, 300), one panel per phone length: PL = 4, PL = 6 and PL = 8]

Figure 4.6: Trends in equal error rate with changes in cohort word set downsampling size


be expected since as discussed previously many of the configurations had very

small cohort word sets anyway. This is a positive result indicating that smaller

cohort word set downsampling sizes can be used without significantly impacting

KV performance. This is particularly beneficial for the overall execution speed

of cohort word KV.

In contrast, 4-phone cohort word KV appears to be very sensitive to the size

of the cohort word set. Specifically an absolute gain of 0.8% can be obtained from

increasing the set size to 300 while losses of 0.8% and 2.1% result from reducing

the downsampled set size to 100 and 50 respectively. This indicates that short-

word KV requires considerably more information within the non-keyword model

to perform robust verification. This additional information may be required to

compensate for the reduced number of observations available for scoring.

4.9.2 Cohort word selection range

Tuning of the cohort word selection range, [dmin, dmax], affects the degree of lin-

guistic similarity between the cohort word set and the target word. This in turn

directly affects the portion of the CAR modeled in the cohort word non-keyword

model.

A variety of selection ranges was evaluated in the experiments reported in

section 4.8. Figure 4.7 shows how equal error rate varied with changes in selection

range for the 4-phone length experiments.

A key observation is that the choice of dmax had a significant impact on EER

performance. In all systems, the change in EER when tuning dmax was more

dramatic than changes observed when tuning dmin. Figure 4.7 shows that for

4-phone KV, a value of dmax = 3 yielded close to optimum performance in most

cases. Further tuning of dmin could be used to obtain the absolute local minimum.


[Figure: surface plots of EER over the cohort word selection range (dmin, dmax), one panel for each combination of deletion penalty and insertion penalty in {1, 2}]

Figure 4.7: Trends in equal error rate with changes in cohort word selection range for 4-phone length cohort word KV


[Figure: surface plot of EER over (dmin, dmax) for deletion penalty = 2, insertion penalty = 1]

Figure 4.8: Trends in equal error rate with changes in cohort word selection range for 6-phone length cohort word KV

[Figure: surface plot of EER over (dmin, dmax) for deletion penalty = 2, insertion penalty = 1]

Figure 4.9: Trends in equal error rate with changes in cohort word selection range for 8-phone length cohort word KV


Figures 4.8 and 4.9 show EER trends for the 6-phone and 8-phone experi-

ments. As for the 4-phone systems, similar trends in EER were observed for all

combinations of φi and φd and as such only the plots for φd = 2, φi = 1 are given

here.

Once again, the results demonstrate that system performance is more sensitive

to the choice of dmax, while dmin only provides some fine-tuning capability. However, of note is that a value of dmax = 4 yielded the best EER performance. Since the

maximum value of dmax evaluated was 4, this suggests that further improvements

in performance may be obtained by using even higher values of dmax. This seems

more promising for the 8-phone experiments since the trend curve does not seem

to have flattened at dmax = 4. In contrast, it is unlikely significant improvements

in EER will be observed for 6-phone KV since the trend curve is fairly flat at

dmax = 4.

4.9.3 MED cost parameters

Adjusting the MED insertion and deletion cost parameters affects the type of

words included in the cohort word set. For example, using a higher insertion cost

than deletion cost will favour the inclusion of words shorter than the target word,

while penalising words longer than the target word.

A number of cohort word configurations were evaluated in section 4.8. Figure

4.10 shows box and whisker plots of the equal error rates grouped by MED cost

parameters for the 4-phone, 6-phone and 8-phone experiments.

For the 4-phone and 6-phone data, there does not appear to be a significant

difference in terms of mean equal error between the four MED cost parameter

sets. There are however some differences in mean EER for the 8-phone data.

These differences, though, must be considered in light of the fact that the 8-phone experiments suffered from poorer performance in many cases due to reduced cohort word set size, as demonstrated in section 4.8.


[Figure: box and whisker plots of EER grouped by MED cost parameters (DelCost, InsCost) = 11, 12, 21, 22 for 4-phone, 6-phone and 8-phone keywords]

Figure 4.10: Trends in equal error rate with changes in MED cost parameters



The range of EERs observed for a fixed MED cost parameter set indicates the

potential robustness of a cohort word system to careless choice of the other cohort

word selection parameters. In this dataset, it is clear that certain choices of MED

cost parameters result in less sensitivity towards other selection parameters. For

example, for 4-phone KV, the choice of φd = 1, φi = 2 clearly gives a much smaller

range in observed EERs. In contrast, this same combination of MED parameters

leads to significantly greater sensitivity for 6-phone KV.

Overall, there are no strong conclusions that can be drawn regarding the choice

of MED cost parameters. At best, the results suggest that the φd = 2, φi = 1

combination is likely to yield better performance, since this configuration resulted

in the lowest mean EER as well as one of the lowest equal error rates for all phone

lengths. However, this configuration is particularly sensitive to other cohort word

selection parameters for 4-phone and 6-phone KV.

4.9.4 Conclusions

The reported experiments quantify the benefits of tuning the various cohort word

selection parameters. Specifically, cohort word KV appears most sensitive to the

choice of dmax. Additionally for short-word KV, system performance is highly

dependent on the choice of cohort word set downsampling size. In contrast, the

remaining cohort word selection parameters: dmin, φd and φi, only have a minor

effect on overall system performance.

As such, when using a cohort word system, tuning of dmax and cohort word

set downsampling size will provide close to optimal performance. This finding

significantly reduces the complexity of constructing and deploying a cohort word

keyword verification system.


4.10 Fused cohort word systems

Multiple system fusion has been demonstrated to yield significant gains in a va-

riety of applications, such as biometric authentication. The primary benefit of

fusion is the ability to use complementary and orthogonal information effectively

to combine the strengths of individual systems as well as to negate their weak-

nesses.

A notable finding of previous sections of this chapter was that the gains of

the cohort word method over SBM-based KV reduced with increases in keyword

length. Specifically, while cohort word KV outperformed SBM-based KV for

short and medium length keywords, performance was in fact poorer for long

length keywords.

Verifier fusion may provide a means of combining the benefits of the cohort

word method and SBM method. There are a number of potential benefits of

using such a fused system, including:

1. Improved performance for other keyword lengths using the fusion of in-

dependent keyword-length-optimised systems. For example, given a cohort

word system optimised for 4-phone words, and a SBM-KV system optimised

for 8-phone words, a fused verifier may be able to improve the verification

performance for other phone-length words, such as 5-phone words.

2. The potential to fuse mutual information (if any exists) from individual

verifiers to improve robustness. For example, if all individual systems give

low confidence scores, then it is significantly more likely that the putative

occurrence should be rejected.

In these experiments, only the late fusion of systems will be examined. Fu-

sion is performed by combining the output scores of individual systems using a

Multi-Layer Perceptron neural network. Middle-fusion techniques such as the


inclusion of a SBM within the cohort word recognition network may also yield

improvements, but are not explored here.

The cohort word selection parameters, evaluation set and recogniser parame-

ters used are the same as those described in section 4.8.

4.10.1 Training dataset

The use of an MLP network for fusion required an appropriate training dataset

to train the network. As such, network training datasets were constructed for

each evaluated keyword length. These training sets were constructed in the same

fashion as the evaluation sets and it was ensured that there were no overlaps

between the training and evaluation sets. The resulting training sets consisted

of 375/1875 true/false putative occurrences for the 4-phone and 6-phone training

sets, and 175/875 true/false putative occurrences for the 8-phone training set.

4.10.2 Neural network architecture

A three layer MLP neural network architecture was used. The confidence scores

from the individual verifiers to be fused were used as the input values for the

neural network. A 25 node hidden layer was used in the intermediary layer and

2 nodes were used in the output layer, one for true occurrences and one for false

occurrences. The network was then trained using 4-fold cross validation and

squared error gradient descent training.

4.10.3 Experimental procedure

Two fusion architectures were examined. The first was a fused cohort-SBM sys-

tem. Each of the 40 cohort word systems evaluated in section 4.8 was fused with

a SBM-KV system. It was anticipated that this approach would allow the indi-

vidual verifiers to compensate for each other's weaknesses at different keyword lengths.


The second approach was the fusion of two cohort word systems. The per-

formance of every combination of two cohort word systems was evaluated; that

is 40 × 39 = 1536 fused systems. These experiments sought to use the mutual

information of individual cohort word verifiers to improve overall performance.

The following experimental procedure was then used for each KV experiment:

1. The confidence score for each individual KV system was calculated for every

putative occurrence in the training and evaluation sets.

2. An MLP was trained using the individual confidence scores of the training

data set as inputs and the reference classifications as outputs.

3. The trained neural network was used to calculate the fused confidence score

for each putative occurrence in the evaluation sets.

4. Thresholding was applied on the fused confidence scores to obtain EER performance (a minimal sketch of this fusion pipeline is given below).
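The sketch below illustrates this procedure using scikit-learn's MLPClassifier. It is only an approximation of the setup described above: scikit-learn trains a two-class MLP with a cross-entropy rather than squared-error criterion, and the variable names are hypothetical.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def fuse_confidence_scores(train_scores, train_labels, eval_scores):
        """Late fusion of verifier outputs.

        train_scores / eval_scores -- arrays of shape (num_occurrences, num_verifiers)
                                      holding the individual confidence scores
        train_labels               -- 1 for true occurrences, 0 for false alarms
        Returns the fused confidence score for each evaluation occurrence.
        """
        mlp = MLPClassifier(hidden_layer_sizes=(25,), max_iter=2000, random_state=0)
        mlp.fit(train_scores, train_labels)
        # Fused score = posterior probability of the "true occurrence" class.
        return mlp.predict_proba(eval_scores)[:, 1]

    def equal_error_rate(scores, labels):
        """Sweep a threshold over the fused scores and return the EER."""
        scores, labels = np.asarray(scores), np.asarray(labels)
        best_gap, eer = np.inf, 1.0
        for t in np.unique(scores):
            accept = scores >= t
            far = accept[labels == 0].mean()          # false alarm rate
            frr = 1.0 - accept[labels == 1].mean()    # miss rate
            if abs(far - frr) < best_gap:
                best_gap, eer = abs(far - frr), (far + frr) / 2.0
        return eer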

4.10.4 Baseline unfused results

Baseline SBM and cohort word KV performances from experiments in section 4.8

are reproduced in table 4.5. Only the best performing cohort word KV systems

are reported here.

4-phone                    6-phone                    8-phone
Method           EER       Method           EER       Method           EER
SBM-KV          19.7       SBM-KV          11.7       SBM-KV           7.5
CW 3, 3, 2, 1   14.1       CW 1, 4, 2, 1    9.6       CW 1, 4, 1, 1    9.7

Table 4.5: Performance of baseline SBM-KV and best cohort word systems on the SWB1 evaluation sets


4.10.5 Fused SBM-CW experiments

Table 4.6 shows the results of SBM-cohort fusion experiments. Only the top 3

fused systems are reported here.

Phone-length   Method                    EER
4              SBM-CW 1, 4, 2, 2        13.9
               SBM-CW 3, 3, 2, 1        13.9
               SBM-CW 1, 3, 1, 1        14.1
               Baseline SBM             19.7
               Baseline CW 3, 3, 2, 1   14.1
6              SBM-CW 2, 4, 2, 1         8.2
               SBM-CW 1, 4, 1, 2         8.3
               SBM-CW 1, 3, 2, 1         8.3
               Baseline SBM             11.7
               Baseline CW 3, 3, 2, 1    9.6
8              SBM-CW 2, 4, 1, 2         5.7
               SBM-CW 2, 4, 2, 1         5.7
               SBM-CW 3, 4, 2, 1         5.7
               Baseline SBM-KV           7.5
               Baseline CW 3, 3, 2, 1    9.7

Table 4.6: Performance of the best fused SBM-cohort systems on the SWB1 evaluation sets

The results demonstrate considerable gains over the baseline SBM system for

all configurations, particularly for the longer keyword lengths. Specifically, gains in EER of 5.8%, 3.5% and 1.8% absolute were observed using the best fused systems for the 4-phone, 6-phone and 8-phone experiments respectively.

Gains were also observed over the individual unfused cohort word systems, though

these were not as dramatic except for the 8-phone experiments.

The mean EER gains across all cohort word selection parameter configurations


were considerable in all cases: 2.1%, 7.3% and 15.2% for the 4-phone, 6-phone and 8-phone experiments respectively. However, the majority of these mean gains were driven by the very poorly performing unfused

cohort word systems.

Gains were better for longer keyword lengths. This is an interesting result

since the unfused cohort word systems actually performed more poorly at longer

keyword lengths than the unfused SBM system. This result suggests that the

extra information provided by the cohort word system is able to significantly

improve performance for longer keywords. However, the results suggest that

little information is provided by the SBM system for short word KV.

The plots in figure 4.11 show a comparison of EER between the fused and

unfused systems, where unfused cohort word and fused SBM-cohort systems with

the same cohort word selection parameters are plotted at the same point on the

horizontal axis. Clearly for the majority of cohort word configurations, the fused

system outperformed the unfused system.

The plots also demonstrate that there was significant correlation between the

performance of unfused and fused cohort word systems that used the same cohort

word selection parameters. In fact, numerical analysis found that there was a cor-

relation coefficient of 0.84 between unfused and fused system EER performance.

This indicates that a well performing unfused system would be a good candidate

for a fused architecture.

Moreover, the trend lines in figure 4.11 suggest that the best performing unfused

system will give close to the best achievable performance of a fused setup. Thus,

when constructing a fused SBM-CW system, it is sufficient to simply select the

best performing isolated cohort word system as the candidate for fusion.


[Figure: log(EER) of unfused CW, fused SBM-CW and baseline SBM systems plotted against CW configuration index for PL = 4, 6 and 8; correlation coefficients CW/SBM-CW = 0.8354, 0.8479 and 0.8417; mean CW/SBM-CW gains = 0.0213, 0.0730 and 0.1528; standard deviations of the gain = 0.0217, 0.0899 and 0.1195]

Figure 4.11: Correlation between unfused system performances and fused system performances


4.10.6 Fused CW-CW experiments

Table 4.7 shows the results of the fusion of 2 cohort word systems. Only the top 3

fused systems are reported here. Cohort word selection parameters are reported in

the format dmin1, dmax1, φd1, φi1, dmin2, dmax2, φd2, φi2, where the two groups of cohort word selection parameters correspond to the two individual cohort word systems.

Phone-length   Method                            EER
4              CW-CW 1, 3, 2, 1, 3, 4, 2, 2     12.3
               CW-CW 3, 3, 2, 2, 2, 4, 2, 1     12.3
               CW-CW 4, 4, 2, 1, 3, 3, 2, 2     12.5
               Baseline SBM                     19.7
               Baseline CW 3, 3, 2, 1           14.1
6              CW-CW 1, 4, 2, 1, 1, 4, 1, 1      8.3
               CW-CW 4, 4, 1, 1, 1, 4, 2, 1      8.5
               CW-CW 1, 4, 1, 2, 4, 4, 1, 1      8.5
               Baseline SBM                     11.7
               Baseline CW 3, 3, 2, 1            9.6
8              CW-CW 1, 1, 1, 1, 1, 4, 1, 1      8.0
               CW-CW 1, 1, 1, 2, 1, 4, 1, 1      8.1
               CW-CW 3, 3, 2, 2, 1, 4, 1, 1      8.5
               Baseline SBM-KV                   7.5
               Baseline CW 3, 3, 2, 1            9.7

Table 4.7: Performance of the best fused cohort-cohort systems on the SWB1 evaluation sets

The trends in gains across phone-length for the CW-CW systems versus the

unfused systems were in stark contrast with the trends observed for the SBM-

CW gains. Specifically the magnitude of gains was greater for the short-word

evaluations. This is in line with previous results that found that unfused CW


was much better suited to short-word KV. Overall maximum EER gains of 1.8%,

1.3% and 1.7% over the best unfused cohort word systems were observed for the

4-phone, 6-phone and 8-phone experiments respectively. The gain for 4-phone

KV is particularly pleasing, corresponding to a relative gain of 13%.

Table 4.8 shows a correlation analysis between unfused and fused EERs. For

any given fusion pair, EER1 was the lower unfused equal error rate while EER2

was the other equal error rate. The results demonstrate two important properties

of CW-CW fusion. First, there is a high degree of correlation between the fused

EER and the EER1 statistic. For example, the correlation coefficient between

4-phone FusedEER and EER1 was 0.8961. This shows that the equal error rate

of the better performing system in a fusion pair has a significant effect on the

overall fused equal error rate.

Additionally, there is also a high degree of correlation between the product of

the individual unfused EERs and the fused EER. For example, for 6-phone EER,

the correlation coefficient is 0.8427. This indicates that combining two well-

performing systems will result in better fused system performance, while combining two poorly performing systems will result in poor fused system

performance.

Given the results of the correlation analysis, it is clear that selecting two

well performing unfused cohort word systems will result in good fused system

performance. This means that it is not necessary to try every combination of

cohort word system configurations to obtain close to optimum fusion performance.
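This kind of correlation analysis is easy to reproduce; a minimal sketch is given below, with hypothetical array names holding one entry per evaluated fusion pair.

    import numpy as np

    def fusion_correlation_matrix(fused_eer, eer1, eer2):
        """Correlation matrix over FusedEER, EER1, EER2 and EER1 * EER2,
        in the same layout as Table 4.8.

        eer1 -- the lower unfused EER of each fusion pair
        eer2 -- the other unfused EER of each fusion pair
        """
        fused_eer, eer1, eer2 = map(np.asarray, (fused_eer, eer1, eer2))
        variables = np.vstack([fused_eer, eer1, eer2, eer1 * eer2])
        return np.corrcoef(variables)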

4.10.7 Comparison of fused and unfused systems

Figure 4.12 shows a comparison of performance across all evaluated architectures

and phone-lengths. Figure 4.13 shows the same results using a log EER scale to

provide better resolution at low error rates. The figures clearly demonstrate the benefits of the various architectures.


4-phones
              FusedEER   EER1     EER2     EER1 * EER2
FusedEER        1.0000   0.8961   0.5218   0.7820
EER1            0.8961   1.0000   0.4481   0.7729
EER2            0.5218   0.4481   1.0000   0.9099
EER1 * EER2     0.7820   0.7729   0.9099   1.0000

6-phones
              FusedEER   EER1     EER2     EER1 * EER2
FusedEER        1.0000   0.9887   0.4783   0.8427
EER1            0.9887   1.0000   0.4213   0.8158
EER2            0.4783   0.4213   1.0000   0.8332
EER1 * EER2     0.8427   0.8158   0.8332   1.0000

8-phones
              FusedEER   EER1     EER2     EER1 * EER2
FusedEER        1.0000   0.9867   0.4708   0.8681
EER1            0.9867   1.0000   0.4498   0.8511
EER2            0.4708   0.4498   1.0000   0.8000
EER1 * EER2     0.8681   0.8511   0.8000   1.0000

Table 4.8: Correlation analysis of fused EER and individual unfused EER


For 4-phone KV, the best performing system was CW-CW: the majority of

configurations significantly outperformed the baseline SBM system. Additionally,

both unfused cohort word KV and SBM-CW fused KV gave good improvements

over SBM-KV. However, the gains observed for CW-CW, particularly when using optimal cohort word selection parameters, make it clearly the system

of choice for short-word KV.


[Figure: box plots of EER for the SBM, CW, CW-CW and SBM-CW architectures at PL = 4, PL = 6 and PL = 8]

Figure 4.12: Boxplot of EERs for all evaluated architectures and phone-lengths


[Figure: box plots of log(EER) for the SBM, CW, CW-CW and SBM-CW architectures at PL = 4, PL = 6 and PL = 8]

Figure 4.13: Boxplot of log(EERs) for all evaluated architectures and phone-lengths


Both the SBM-CW and CW-CW fused architectures provided good improve-

ments for medium length 6-phone KV. SBM-CW KV appeared particularly well

suited to this task, with almost all evaluated configurations outperforming SBM-

KV. In terms of execution speed, an SBM-CW system is also quicker than a

CW-CW system and as such, would be a better choice for real-time applications.

Finally, some benefits can be observed for long 8-phone KV using SBM-CW

KV over the baseline SBM system. All fused SBM-CW systems outperformed the

baseline. As such even careless choice of cohort word selection parameters would

yield a better system. However, the maximum achieved EER gain was 1.8%

absolute (24% relative gain). Considering that long word KV EERs are already

quite low compared to shorter word KV, the additional processing burden may

not justify the small absolute gains that may be achieved by a fused system. As

such, based on available processing power, the most appropriate architecture for

long word KV is either a pure SBM approach or the SBM-CW approach.

4.11 Conclusions and Summary

This chapter presented a novel KV technique named Cohort Word Verification.

The cohort word technique combines high level linguistic information with cohort

verification techniques to obtain a well performing non-keyword model.

Multiple formulations for the cohort word confidence score were presented

and evaluated. Small scale experiments on the TIMIT clean microphone speech

database demonstrated that the N-class hybrid approach provided better KV

performance than the 2-class approach. Additionally, the N-class approach was

significantly faster, further supporting it as the formulation of choice.

KV experiments on the Switchboard-1 telephone speech corpus were also pre-

sented. These experiments sought to benchmark the performance of the cohort


word method compared to the baseline SBM approach. The experiments demon-

strated that considerable gains in EER performance could be obtained using co-

hort word KV over the baseline method for short and medium keyword lengths.

Specifically the experiments showed absolute EER gains of 5.6% and 2.1% for

4-phone and 6-phone target words respectively. The observed gains were particu-

larly pleasing for the 4-phone case since the baseline SBM system achieved a very

poor EER of 19.7%. Unfortunately, the cohort word method performed poorly

for long word 8-phone KV, resulting in an absolute EER loss of 2.2%.

Analysis was performed on the various cohort word selection parameters to

quantify the effects of its plethora of tuning parameters. It was found that the

key parameters of interest were dmax and the amount of cohort word set size

downsampling. Tuning of the other cohort word parameters only gave small

subsequent refinements in performance. These findings radically simplify the

process of constructing and deploying a cohort word keyword verifier.

Finally, fused SBM-cohort and cohort-cohort systems were examined. Exper-

iments were performed to quantify the gains of these architectures over unfused

SBM and cohort word systems. The results demonstrated that cohort-cohort fu-

sion was well suited for short-word KV, yielding absolute EER gains of 7.4%

and 1.8% over the baseline SBM and cohort word systems. SBM-cohort KV was

found to be excellent for medium length KV, resulting in an absolute EER gain

of 3.5% and 1.4% over the baseline SBM and cohort word systems. Additionally

SBM-cohort KV was also found to yield improvements for long phone-length KV,

providing an absolute EER gain of 1.8% over the baseline SBM system.

In summary, cohort word verification has been demonstrated to be a well per-

forming means of keyword verification. In particular, fusion of this technique with

SBM-KV yields systems that perform well across a variety of keyword lengths.

The key results of this chapter are summarised in table 4.9.


Phone-length   Method                            EER
4              SBM-KV                           19.7
               CW 3, 3, 2, 1                    14.1
               SBM-CW 1, 4, 2, 2                13.9
               CW-CW 1, 3, 2, 1, 3, 4, 2, 2     12.3
6              SBM-KV                           11.7
               CW 1, 4, 2, 1                     9.6
               SBM-CW 2, 4, 2, 1                 8.2
               CW-CW 1, 4, 2, 1, 1, 4, 1, 1      8.3
8              SBM-KV                            7.5
               CW 1, 4, 1, 1                     9.7
               SBM-CW 2, 4, 1, 2                 5.7
               CW-CW 1, 1, 1, 1, 1, 4, 1, 1      8.0

Table 4.9: Summary of best performing systems


Chapter 5

Dynamic Match Lattice Spotting

5.1 Introduction

The ever-increasing volume and importance of audio and multimedia data have brought with them the need for rapid audio indexing technologies. Complex large

vocabulary speech-to-text transcription engines have provided an intermediary

solution by transcribing speech into text that can then be rapidly searched using

conventional text search engines. However such systems are severely restricted

by the vocabulary of the STT engine.

Many applications, such as surveillance and news-story indexing, require sup-

port for typically out-of-vocabulary keyword queries such as names, acronyms and

foreign words. In such cases, unrestricted vocabulary keyword spotting methods

such as HMM-based keyword spotting have provided a solution, though at the

expense of considerably slower query speeds. Faster approaches such as reverse

dictionary lookup search engines and phone lattice spotting techniques (see sec-

tion 2.10) offer significantly quicker searching but are encumbered by poor miss

rate performance.

This chapter proposes a very fast and accurate keyword spotting method


named Dynamic Match Lattice Spotting (DMLS). DMLS builds upon lattice-

based spotting methods, but addresses the issue of inherent phone recogniser

errors that adversely affect miss rate performance. This is done by augmenting

the lattice search with dynamic programming sequence matching techniques to

provide robustness against erroneous phone lattice realisations. The resulting

system is capable of searching 1 hour of audio in 2 seconds while maintaining

good detection performance.

Initial sections of this chapter discuss the motivation for the DMLS method

and present a detailed description of all associated algorithms. Subsequent sec-

tions then report on experiments to compare the performance of DMLS to pre-

existing keyword spotting methods. These sections also provide a detailed anal-

ysis of the various parameters of DMLS. The final sections discuss methods of

optimising the execution speed of DMLS to improve real-time speed without af-

fecting detection performance.

5.2 Motivation

The experiments reported in chapter 3 demonstrated that although HMM-based

keyword spotting systems provided good detection performance, they left much

wanting in terms of execution speed. Specifically, it was found that the fastest

SBM-based spotting method required 110 seconds to search 1 hour of speech for

a single keyword using a 3GHz Pentium 4 processor. Although such speeds are

more than adequate for real-time monitoring tasks such as broadcast monitoring

and keyword control systems, significantly faster speeds are required for tasks

such as large database searching.

One solution is to use a two stage STT approach as described in section

2.10.1. Such an approach provides very fast searches since query time processing

is purely textual. However, keyword queries using this method are restricted by


the vocabulary of the STT system. In very large vocabulary domains or domains

with dynamic vocabulary sets, such as the broadcast domain, this restriction

will be problematic. For example, the name of the latest elected president of Sri

Lanka is unlikely to be in the vocabulary of most STT systems, but may be of

interest to a user searching for any related news stories.

In contrast, lattice-based and bottom-up keyword spotting methods provide

unrestricted query-time vocabularies while maintaining very fast query speeds.

Instead of transcribing the speech into words during the speech preparation stage,

these methods use a low-level representation such as phone or syllable labels. This

low-level representation can then be searched at query time to infer putative

locations of a target word.

Unfortunately, the detection performances of such methods are significantly

poorer than HMM-based methods. For example, the indexed reverse dictionary

lookup method proposed by Dharanipragada and Roukos [10] achieved a miss

rate of approximately 35% @ 10 FA/kw-hr. Poor performances have also been

reported for lattice based keyword spotting by Young and Brown [39].

A major reason for the poor performance of lattice-based and bottom-up

methods is that query time searching is based on highly erroneous phone recog-

niser transcriptions. Phone-recogniser error rates are typically in the vicinity of

30%-50% in favourable conditions, and potentially even poorer for adverse con-

ditions. This high error rate will clearly be propagated to the query-time search

stages resulting in poor keyword spotting performance.

Lattice-based approaches attempt to accommodate phone recogniser errors by

encoding an utterance within the recognition lattice in terms of multiple hypothe-

ses. A phone lattice not only represents multiple utterance level transcriptions,

but also maintains multiple localised transcriptions at a given time point in an

utterance. It is hoped then, that at least one of the localised hypotheses occurring

at the point of a true target keyword occurrence will match the target keyword


phone sequence.

One means of further reducing the impact of phone recogniser errors is to

incorporate any prior knowledge of recogniser errors into the search process. For

example, if the phones /aa/ and /ih/ are known to be frequently confused by a given phone recogniser, then improved detection rates may be obtained by

including this prior information in the search process.

However, an unfortunate side-effect is that allowing for such error correc-

tions will inadvertently lead to an increase in false alarm rates. For exam-

ple, when using the /aa/ ↔ /ih/ substitutionary rule, true occurrences of the

word STICK = (/s/, /t/, /ih/, /k/) will be labeled as instances of the word

STOCK = (/s/, /t/, /aa/, /k/). As such, any error correction will need to incur

some kind of cost that affects the overall likelihood of a putative instance.

A method that successfully incorporates phone recogniser error correction

will improve overall keyword spotting robustness. The resultant gains in per-

formance will improve the suitability of lattice-based and bottom-up keyword

spotting methods for very fast keyword spotting tasks.

5.3 Dynamic Match Lattice Spotting method

Dynamic Match Lattice Spotting is an extension of conventional lattice-based

keyword spotting, but uses the Minimum Edit Distance (see Appendix A) during

lattice searching to compensate for phone recogniser insertion, deletion and sub-

stitution errors. This addresses the major shortcoming of lattice-based methods

— the requirement for the target phone sequence to appear in its entirety within

the phone-lattice for consideration as a hypothesised keyword occurrence.

Given source and target sequences, the MED calculates the minimum cost

of transforming the source sequence to the target sequence using a combination

of insertion, deletion, substitution and match operations, where each operation


has an associated cost. In the Dynamic Match Lattice Spotting method, each

observed lattice phone sequence is scored against the target phone sequence using

the MED. Lattice sequences are then accepted or rejected by thresholding on this

MED score, hence providing robustness against phone recogniser errors. Conven-

tional lattice-based spotting is a special case of DMLS where a score threshold of

0 is used.
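To make the scoring step concrete, the sketch below implements a standard minimum edit distance with pluggable cost functions. It follows one common convention (rewriting the target keyword sequence as the observed lattice sequence) and is a generic illustration rather than the exact BESTMED variant defined later in this chapter.

    def med(target, observed, ins_cost, del_cost, sub_cost):
        """Minimum cost of rewriting the target phone sequence as the observed
        lattice phone sequence. Under this convention:
          insertion    -- an observed phone with no counterpart in the target
          deletion     -- a target phone missing from the observed sequence
          substitution -- a target phone realised as a different observed phone
        Exact matches cost 0.
        """
        n, m = len(target), len(observed)
        # d[i][j] = cost of aligning target[:i] with observed[:j]
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = d[i - 1][0] + del_cost(target[i - 1])
        for j in range(1, m + 1):
            d[0][j] = d[0][j - 1] + ins_cost(observed[j - 1])
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                pair = 0.0 if target[i - 1] == observed[j - 1] \
                    else sub_cost(target[i - 1], observed[j - 1])
                d[i][j] = min(d[i - 1][j - 1] + pair,                   # match / substitution
                              d[i - 1][j] + del_cost(target[i - 1]),    # deletion
                              d[i][j - 1] + ins_cost(observed[j - 1]))  # insertion
        return d[n][m]

With a vowel substitution cost of 1, for example, scoring the observed sequence (/s/, /t/, /ih/, /k/) against the target STOCK = (/s/, /t/, /aa/, /k/) would return 1, reflecting the single vowel substitution.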

This means of phone recogniser error correction has significant potential for

improving keyword spotting performance. Consider the segment of a phone lattice

from the Wall Street Journal database corresponding to an instance of the word

STOCK, as shown in figure 5.1. Using the conventional lattice-based search, none

of the paths will match the target sequence STOCK = (/s/, /t/, /aa/, /k/).

The lattice shows that the phone recogniser correctly transcribed 3 of the 4 phones: /s/, /t/ and /k/. However, a simple substitution error for the

phone /aa/ prevents the word STOCK from being detected.

In contrast, the DMLS search will match a large number of paths, though

each path will have a non-zero MED score. The decision to accept or reject these

putative occurrences is then left to a subsequent thresholding or keyword verifi-

cation stage. This is one of the many examples where DMLS’ error robustness

aids detection performance.

As stated before, a drawback of DMLS is that there will be an increase in false

alarm rate since the sequence matching process is significantly more liberal than

the conventional lattice based technique. However, since each putative occurrence

will have an associated MED score that is indicative of the looseness of the match,

simple techniques such as thresholding can be used to limit the looseness of the

final result set.

Additionally, a keyword spotting stage should try to achieve as low a miss rate

as possible at the expense of false alarm rate if a subsequent keyword verification

stage is being used. The burden of false alarm reduction is then left to the more specialised keyword verification stage.


[Figure: segment of a phone lattice with multiple parallel paths through phones such as /s/, /sh/, /z/, /t/, /d/, /g/, /ih/, /ah/, /aa/, /ae/, /ow/ and /k/]

Figure 5.1: Segment of phone lattice for an instance of the word STOCK


Hence, although the anticipated increase in

false alarm rate for the Dynamic Match Lattice Spotting technique is unfortunate,

it is a resolvable issue.

5.3.1 Basic method

The Dynamic Match Lattice Spotting algorithm is an extension of the conven-

tional lattice-based spotting method. This is a two stage process consisting of an

initial lattice building stage and a subsequent query-time search stage.

During the lattice building stage, each utterance is decoded using a Viterbi

phone recogniser to generate a recognition phone lattice. The quality of this

lattice can be controlled by adjusting a variety of factors including:

1. Phone language model: The quality of this model has a significant im-

pact on the overall error rate of the recogniser. Typically long context

models such as 4-grams provide better performance than short context 2-

gram models.

2. Word insertion penalty: Tuning of this parameter allows a trade off

between insertion and deletion rates of the recogniser.

3. Grammar scale factor: This affects the importance attributed to the

language model likelihoods

4. Number of phone classes: Using a smaller number of phone classes will

yield improved phone recogniser performance. However, using too small a

phone set will result in very broadly labeled data and hence more confusable

lattices for subsequent searching.

Lattice building only needs to be performed once per utterance. Once a lattice

is built, it can be used repeatedly in subsequent queries regardless of the query

term.


The second stage of the Dynamic Match Lattice Spotting method is the lattice

search stage. This step is considerably faster than the initial lattice building stage

since processing is purely textual. The process consists of a modified Viterbi

traversal of the lattice that emits putative matches during traversal.

Let P = (p1, ..., pN ) be defined as the target phone sequence, where N is

the target phone sequence length. Additionally let Smax be the maximum MED

score threshold, K be the maximum number of observed phone sequences to

be emitted at each node, and V be defined as the number of tokens used during

lattice traversal. Then for each node in the phone lattice, where node list traversal

is done in time-order:

1. For each token in the top K scoring tokens in the current node:

(a) Let Q = (q1, ..., qM), M = N + MAX(Ci) ∗ Smax, be the observed phone sequence obtained by traversing the token history backwards M levels, where Ci is the insertion MED cost function.

(b) Let S = BESTMED(Q,P,Ci, Cd, Cs), where

i. Cd is the deletion MED cost function

ii. Cs is the substitution MED cost function

iii. BESTMED(...) returns the score of the first element in the last

column of the MED cost matrix that is less than or equal to Smax

(or ∞ otherwise).

(c) Emit Q as a keyword occurrence if S ≤ Smax

2. For each node linked to the current node, perform V -best token set merging

of the current node’s token set into the target node’s token set.


5.3.2 Optimised Dynamic Match Lattice Search

The basic DMLS method described above will execute significantly faster than

HMM-based keyword spotting. This is because all search-time processing is

purely textual. However, a significant part of this search process is Viterbi de-

coding, which in itself is a computationally intensive task. It is in fact possible

to remove this Viterbi lattice traversal from query-time processing, as described

below. This yields further increases in query speed.

Since the paths traversed through the lattice are independent of the query

term (traversal is done purely by maximum likelihood), it is possible to perform

the lattice traversal during the lattice building stage. Then it is only necessary

to store the observed phone sequences at each node for searching at query-time.

Furthermore, if it is assumed that the maximum queried phone sequence

length is fixed at Nmax and the maximum sequence match score threshold is

preset at Smax, then it is only necessary to store observed phone sequences of

length Mmax = Nmax + MAX(Ci) ∗ Smax.

Query-time processing then reduces to simply calculating the MED between

each stored observed phone sequence and the target phone sequence. The algo-

rithm for the lattice building stage is hence:

1. Construct the recognition lattice using the same approach as in the basic

DMLS method

2. Let A = ∅, where A is the collection of observed phone sequences

3. For each node in the phone-lattice, where node list traversal is done in

time-order:

(a) For each token in the top K scoring tokens in the current node:

i. Let Q = (q1, ..., qMmax) be the observed phone sequence obtained

by traversing the token history backwards Mmax levels.


ii. Append the sequence Q to the collection A

(b) For each node linked to the current node, perform V -best token set

merging of the current node’s token set into the target node’s token

set.

4. Store the observed phone sequence collection for subsequent searching

5. The recognition lattice can now be discarded as it is no longer required for

query-time searching

This optimisation results in a significant reduction in the complexity of query-

time processing. Whereas in the basic DMLS approach, full Viterbi traversal was

required, processing using this optimised approach is now a linear progression

through a set of observed phone sequences.

Thus the optimised query-time search algorithm is as follows (a short sketch in code is given after the steps). For each member, Q, of the collection of observation sequences, A:

1. Let S = BESTMED(Q,P,Ci, Cd, Cs)

2. Emit Q as a putative occurrence if S ≤ Smax
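A minimal sketch of this query-time scan is given below. The record layout of the stored sequence collection and the score_fn argument (for example, the MED sketch from section 5.3 wrapped with the chosen cost functions and the BESTMED prefix handling) are assumptions made for illustration.

    def search_sequences(sequences, target, s_max, score_fn):
        """Optimised query-time search: a purely textual linear scan over the
        observed phone sequences stored during lattice building.

        sequences -- iterable of (utterance_id, end_time, phone_sequence) records
        target    -- target keyword phone sequence (list of phones)
        score_fn  -- callable returning the sequence match score S for an
                     (observed, target) pair
        """
        hits = []
        for utterance_id, end_time, observed in sequences:
            score = score_fn(observed, target)
            if score <= s_max:          # emit as a putative occurrence
                hits.append((utterance_id, end_time, score))
        return hits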

5.4 Evaluation of DMLS performance

Evaluations were performed to compare the DMLS technique with conventional

keyword spotting approaches. Specifically, comparisons were made against the

conventional SBM-based and lattice-based systems. Evaluations were performed

on the TIMIT clean microphone speech database.

5.4.1 Evaluation set

A keyword spotting evaluation set was constructed using speech taken from the

TIMIT test database. The choice of query words was constrained to words that


had 6-phone-length pronunciations to reduce target word length dependent vari-

ability.

Approximately 1 hour of TIMIT test speech (excluding SA1 and SA2 utter-

ances) was labeled as evaluation speech. From this speech, 200 6-phone-length

unique words were randomly chosen and labeled as query words. These query

words appeared a total of 480 times in the evaluation speech.

5.4.2 Recogniser parameters

16-mixture triphone HMM acoustic models and a 256-mixture Gaussian Mixture

Model background model were trained on a 140 hour subset of the Wall Street Jour-

nal 1 database for use with SBM-based, conventional lattice-based and DMLS

evaluations. Additionally 2-gram and 4-gram phone-level language models were

trained on the same section of WSJ1 for use during the lattice building stages of

DMLS and the conventional lattice-based methods.

All speech was parameterised using Perceptual Linear Prediction coefficient

feature extraction and Cepstral Mean Subtraction. In addition to 13 static Cep-

stral coefficients (including the 0th coefficient), deltas and accelerations were

computed to generate 39-dimension observation vectors.

5.4.3 Lattice building

The following lattice building procedure, based on the optimised lattice building

approach described in section 5.3.2, was used for these experiments:

1. Lattices were generated for each utterance by performing a U-token Viterbi

decoding pass using the 2-gram phone-level language model

2. The resulting lattices were expanded using the 4-gram phone-level language

model


3. Output likelihood lattice pruning was applied using a beam-width of W to

reduce the complexity of the lattices. This essentially removed all paths

from the lattice that had a total likelihood outside a beamwidth of W of

the top-scoring path.

4. A second V-token traversal was performed to generate the top 10 scoring observed phone sequences of length 11 at each node (allowing spotting of sequences of up to 11 − MAX(Ci) ∗ Smax phones).

Lattice building was only performed once for each utterance. The resulting

observed phone sequence collections were then stored to disk and used during the

actual query time search experiments.

5.4.4 Query-time processing

The optimised lattice search algorithm described in section 5.3.2 was used for

these experiments. The sequence matching threshold, Smax, was fixed at 2 for all

experiments unless noted otherwise. MED calculations used a constant deletion

cost of Cd = ∞ as preliminary experiments found that poor results were obtained

for non-infinite values of Cd. The insertion cost was also fixed at Ci = 1.

In contrast, Cs was allowed to vary based on phone substitution rules. The

phone substitution costs used are given in table 5.1 and were determined em-

pirically by examining phone recogniser confusion matrices. Substitutions were

completely symmetric. Hence the substitution of a phone, m in a given phone

group with another phone, n, in the same group yielded the same cost as the

substitution of n with phone m.

The basic rules used to obtain these costs were:

1. Cs = 0 for same-letter consonant phone substitution (eg. /n/ ↔ /nx/,

/z/↔ /zh/)


2. Cs = 1 for vowel substitutions

3. Cs = 1 for closure and stop substitutions

4. Cs =∞ for all other substitutions.

However, some exceptions to these rules were made based on empirical observa-

tions from small scale experiments.

Phone Group                                            Subst. Cost
aa ae ah ao aw ax ay eh en er ey ih iy ow oy uh uw     1
b d dh g k p t th jh                                   1
d dh                                                   0
n nx                                                   0
t th                                                   0
w wh                                                   0
uw w                                                   1
z zh s sh                                              1

Table 5.1: Phone substitution costs for DMLS
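One possible encoding of these cost rules as lookup functions is sketched below. The group definitions follow Table 5.1 and the surrounding text (Ci = 1, Cd = ∞); the function and variable names are illustrative rather than the actual implementation.

    import math

    # Phone groups from Table 5.1; substitutions are symmetric within a group.
    COST_ZERO_GROUPS = [{'d', 'dh'}, {'n', 'nx'}, {'t', 'th'}, {'w', 'wh'}]
    COST_ONE_GROUPS = [
        {'aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ay', 'eh', 'en', 'er',
         'ey', 'ih', 'iy', 'ow', 'oy', 'uh', 'uw'},        # vowels
        {'b', 'd', 'dh', 'g', 'k', 'p', 't', 'th', 'jh'},  # closures / stops
        {'uw', 'w'},
        {'z', 'zh', 's', 'sh'},
    ]

    def substitution_cost(a, b):
        """Cs(a, b): symmetric phone substitution cost."""
        if a == b:
            return 0.0
        for group in COST_ZERO_GROUPS:      # same-letter pairs first (cost 0)
            if a in group and b in group:
                return 0.0
        for group in COST_ONE_GROUPS:       # vowel and closure/stop confusions
            if a in group and b in group:
                return 1.0
        return math.inf                     # all other substitutions disallowed

    def insertion_cost(phone):
        return 1.0                          # Ci fixed at 1

    def deletion_cost(phone):
        return math.inf                     # Cd = infinity (no deletions)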

5.4.5 Baseline systems

The standard SBM-based background model system described in chapter 3 was

used for evaluating HMM-based keyword spotter performance. This baseline pro-

vides a comparison with standard state-of-the-art keyword spotting performances

achieved for real-time keyword spotting.

The lattice-based baseline system was constructed using the method proposed

by Young and Brown [39]. This algorithm was implemented by simply using

a DMLS system with Smax = 0. This in essence would result in only exact

matches within the recognition lattice being emitted as putative occurrences,


as required for conventional lattice-based spotting. Miss and false alarm rates

obtained using this approach would be indicative of conventional lattice-based

spotting performance. However, true execution times could not be measured for

this baseline system, as a simulated system was being used rather than the true

lattice-based system described by Young and Brown [39].

5.4.6 Evaluation procedure

The systems were evaluated by performing single-word keyword spotting for each

query word across all utterances in the evaluation set. The total miss rate for all

query words and the false alarm per keyword occurrence rate (FA/kw) were then

calculated using reference transcriptions of the evaluation data. Additionally the

total CPU processing minutes per queried keyword per hour (CPU/kw-hr) was

measured for each experiment using a 3GHz Pentium 4 processor.
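The detection metrics themselves are simple ratios; a minimal sketch is shown below, assuming FA/kw is the total number of false alarms normalised by the number of true keyword occurrences (the exact normalisation used for the reported figures may differ).

    def detection_metrics(num_detected, num_false_alarms, num_true_occurrences):
        """Return (miss rate in %, false alarms per keyword occurrence).

        num_detected         -- reference keyword occurrences correctly detected
        num_false_alarms     -- putative occurrences with no matching reference
        num_true_occurrences -- total keyword occurrences in the reference
                                transcriptions (480 in this evaluation set)
        """
        miss_rate = 100.0 * (num_true_occurrences - num_detected) / num_true_occurrences
        fa_per_kw = num_false_alarms / num_true_occurrences
        return miss_rate, fa_per_kw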

For DMLS, CPU/kw-hr only included the CPU time used during the DMLS

search stage. That is, the time required for lattice building was not included.

All experiments used a commercial-grade decoder to ensure that the best

possible CPU/kw-hr results were reported for the HMM-based system. This is

because HMM-based keyword spotting time performance is bound by decoder

performance.

5.4.7 Results

To aid discussion, the notation DMLS[U, V, W, Smax] is used to specify DMLS

configurations, where U is the number of tokens for lattice generation, V is the

number of tokens for lattice traversal, W is the pruning beamwidth, and Smax is

the sequence match score threshold (see section 5.4.3 and section 5.4.4 for further

details on these parameters). The notation HMM[α] is used when referring to

baseline SBM-KS systems where α was the duration-normalised output likelihood


threshold used. Additionally the baseline conventional lattice-based method is

referred to as CLS.

Performances for the HMM-based, lattice-based and the DMLS systems mea-

sured for the TIMIT evaluation set are shown in Table 5.2. For this set of experi-

ments, the DMLS[3,10,200,2] configuration was arbitrarily chosen as the baseline

DMLS configuration.

Method              Miss Rate   FA/kw   CPU/kw-hr
HMM[∞]                    1.6    44.2        1.58
HMM[-7580]               10.4    36.6        1.58
HMM[-7000]               39.8    16.8        1.58
CLS[3,10,200,0]          32.9     0.4           -
DMLS[3,10,200,2]         10.2    18.5        0.30

Table 5.2: Baseline keyword spotting results evaluated on TIMIT

The timing results demonstrate that as expected DMLS was significantly

faster than the SBM-KS method, running at approximately 5 times the speed.

This amounts to a baseline DMLS system capable of searching 1 hour of speech in

18 seconds. DMLS also had more favourable FA/kw performance: at 10.2% miss

rate, it had a FA/kw rate of 18.5, significantly lower than the 36.6 FA/kw rate

achieved by the HMM[-7580] system. However, the HMM system was still capa-

ble of achieving a much lower miss rate of 1.6% using the HMM[∞] configuration,

though at the expense of considerably more false alarms.

The miss rate achieved by the conventional lattice-based system was very poor

compared to that of Dynamic Match Lattice Spotting. This confirms that the

phone error robustness inherent in DMLS yields considerable detection perfor-

mance benefits. However, the false alarm rate for CLS was dramatically better

than all other systems, though with such a high miss rate, this is not surprising.

In summary, these experiments demonstrate that Dynamic Match Lattice


Spotting is well suited for very fast keyword spotting tasks. Specifically, in the

reported evaluations, DMLS was able to search 1 hour of speech in 18 seconds.

DMLS significantly outperformed the baseline lattice-based system in terms of

miss rate and also yielded considerably lower false alarm rates than the baseline

HMM-based system. Overall, Dynamic Match Lattice Spotting appears to pro-

vide a compromise between the low miss rate of HMM-based systems and the

low false alarm rate of lattice-based systems, while still providing extremely fast

keyword spotting.

5.5 Analysis of dynamic match rules

The miss rate achieved by the baseline lattice-based system was very poor com-

pared to that of DMLS. This indicates that the phone recogniser error robustness

incorporated into the DMLS search does significantly improve keyword spotting

performance. However, it is not immediately clear which aspects of the dynamic

match process are most effective in improving performance.

Specifically, improvements in performance can be attributed to the four main

cost rules used in the dynamic match process:

1. Insertion costs

2. Same letter substitution costs (eg. /d/↔ /dh/, /n/↔ /nx/)

3. Vowel substitution costs

4. Closure/stop substitution costs (eg. /b/↔ /d/, /k/↔ /p/)

As such, experiments are presented here to quantify the benefits of individual

cost rules. The evaluation set, recogniser parameters, experimental procedure

and DMLS algorithm are the same as those used in section 5.4.


5.5.1 System configurations

Specialised DMLS systems were built to evaluate the effects of individual cost

rules in isolation. These systems were implemented as follows:

• Same letter substitution rules only. Implemented using MED cost functions:

    C_d = \infty, \quad C_i = \infty, \quad
    C_s(a, b) = \begin{cases} 1 & \text{if } a \text{ and } b \text{ have the same letter base} \\ \infty & \text{otherwise} \end{cases}

• Vowel substitution rules only. Implemented using MED cost functions:

    C_d = \infty, \quad C_i = \infty, \quad
    C_s(a, b) = \begin{cases} 1 & \text{if } a \text{ and } b \text{ are both vowels} \\ \infty & \text{otherwise} \end{cases}

• Closure/stop substitution rules only. Implemented using MED cost functions:

    C_d = \infty, \quad C_i = \infty, \quad
    C_s(a, b) = \begin{cases} 1 & \text{if } a \text{ and } b \text{ are closures or stops} \\ \infty & \text{otherwise} \end{cases}


• Insertion cost rule only. Implemented using MED cost functions:

    C_d = \infty, \quad C_s = \infty, \quad C_i = 1
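The sketch below illustrates one way these rule-isolated cost functions could be expressed in code. It is not the implementation used in this work: the vowel and closure/stop sets are abbreviated placeholders rather than the exact phone groupings used here, exact phone matches are assumed to incur zero cost, and INF stands for the infinite cost that disallows an operation.

    # Illustrative sketch of the rule-isolated MED cost functions defined above.
    INF = float("inf")

    VOWELS = {"aa", "ae", "ah", "ao", "eh", "ih", "iy", "uh", "uw"}        # abbreviated placeholder
    CLOSURES_STOPS = {"b", "d", "g", "k", "p", "t",
                      "bcl", "dcl", "gcl", "kcl", "pcl", "tcl"}            # abbreviated placeholder

    def same_letter_base(a, b):
        # e.g. /d/ and /dh/, or /n/ and /nx/, share the same leading letter
        return a[0] == b[0]

    def make_costs(rule):
        """Return (Cd, Ci, Cs) cost functions that isolate a single rule."""
        def cs_same_letter(a, b):
            return 0 if a == b else (1 if same_letter_base(a, b) else INF)
        def cs_vowel(a, b):
            return 0 if a == b else (1 if a in VOWELS and b in VOWELS else INF)
        def cs_closure_stop(a, b):
            return 0 if a == b else (1 if a in CLOSURES_STOPS and b in CLOSURES_STOPS else INF)
        def cs_none(a, b):
            return 0 if a == b else INF

        if rule == "same_letter":
            return (lambda a: INF, lambda b: INF, cs_same_letter)
        if rule == "vowel":
            return (lambda a: INF, lambda b: INF, cs_vowel)
        if rule == "closure_stop":
            return (lambda a: INF, lambda b: INF, cs_closure_stop)
        if rule == "insertion":
            return (lambda a: INF, lambda b: 1, cs_none)
        raise ValueError("unknown rule: " + rule)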

5.5.2 Results

Table 5.3 shows the results of the specialised DMLS systems, baseline lattice-

based CLS system and the previously evaluated DMLS[3,10,200,2] system with

all MED rules.

Method                               Miss Rate (%)   FA/kw
CLS[3,10,200,0]                      32.9            0.4
DMLS[3,10,200,2] insertions          28.5            1.2
DMLS[3,10,200,2] same letter subst   31.0            0.5
DMLS[3,10,200,2] vowel subst         15.6            7.8
DMLS[3,10,200,2] closure/stop subst  23.5            3.0
DMLS[3,10,200,2] all rules           10.2            18.5

Table 5.3: TIMIT performance when isolating various DP rules

The experiments demonstrate that the magnitude of contributions of the vari-

ous rules to overall keyword spotting performance varies drastically. Interestingly, no single rule in isolation brought miss rate down to that of the all-rules DMLS system. This

indicates that the rules are complementary in nature and yield a combined overall

improvement in miss rate performance.

Using the same letter substitution rules only yielded a small gain in performance over the null-rule CLS system: 1.9% absolute in miss rate at the cost of only a


0.1 increase in FA/kw rate. This result suggests that the phone lattice is already robust to same letter substitutions, and as such, inclusion of this rule does not yield significant gains in performance. Empirical study of the phone lattices revealed

this to be the case in many situations. For example, typically if the phone /s/

appeared in the lattice, then it was almost guaranteed that the phone /sh/ also

appeared at a similar time location in the lattice.

The insertions-only system yielded a slightly larger gain of 4.4% absolute in miss rate at the cost of only a 0.8 increase in FA/kw rate. The result indicates that the lattices contain extraneous insertions across many of the multiple hypothesis

paths, preventing detection of the target phone sequence when insertions are

not accounted for. This observation is to be expected since phone recognisers

typically do have significant insertion error rates, even when considering multiple

levels of transcription hypotheses.

A significant absolute miss rate gain of 17.3% was observed for the vowel

substitution system. However, this gain was at the expense of a 7.4 absolute

increase in FA/kw rate. This is a pleasing gain and is supported by the fact that

vowel substitution is a frequent occurrence in the realisation of speech. As such,

incorporating support for vowel substitutions in DMLS not only corrects errors in

the phone recogniser but also accommodates this natural substitution behaviour in human

speech.

Finally, significant gains were also observed for the closure/stop substitution

system. An absolute gain of 9.4% in miss rate combined with an unfortunate

2.6 absolute increase in FA/kw rate was obtained for this system. Typically

closures and stops are shorter acoustic units and therefore more likely to yield

classification errors. As such, even though the phone lattice encodes multiple

hypotheses, it appears that it is still necessary to incorporate robustness against

closure/stop confusion for lattice-based keyword spotting.

Overall, the experiments demonstrate the benefits of the various classes of


MED rules used in the evaluated DMLS systems. It was pleasing to note that

even the simplest of these rules still provided tangible gains in performance over

the baseline lattice-based CLS system. This clearly reinforces the fact that the

dynamic matching aspects of DMLS are beneficial. The results showed that

insertion and same-letter consonant substitution rules only provided a small per-

formance benefit over a conventional lattice-based system, whereas vowel and

closure/stop substitution rules yielded considerable gains in miss rate. Gains in

miss rate were typically offset by increases in FA/kw rate, although the majority of these increases were fairly small and would most likely be justifiable in light of the resulting improvements in miss rate.

5.6 Analysis of DMLS algorithm parameters

Earlier experiments in this chapter used a fixed DMLS[3,10,200,2] configuration

to reduce the scope of experiments. In this section, the quantitative effects of

these individual parameters on keyword spotting performance are measured and

examined. Specifically the following parameters are studied:

1. Number of tokens used for lattice generation, U

2. Number of tokens used for lattice traversal, V

3. Lattice pruning beamwidth, W

4. The threshold applied to MED score, Smax

The evaluation set, recogniser parameters, experimental procedure and DMLS

algorithm are the same as used in section 5.4.


5.6.1 Number of lattice generation tokens

The number of tokens used for lattice generation, U , has a direct impact on the

maximum size of the resulting phone lattice. For example, if a value of U = 3

is used, then a lattice node can have at most 3 predecessor nodes. Whereas, if a

value of U = 5 is used, then the same node can have up to 5 predecessor nodes,

greatly increasing the size and complexity of the lattice when applied across all

nodes.

Tuning of U directly affects the number of hypotheses encoded in the lattice,

and hence the best achievable miss rate. However, using larger values of U also

increases the number of nodes in the lattice, resulting in an increased amount of

processing during DMLS searching and therefore increased execution time.

Table 5.4 shows the result of increasing U from 3 to 5. As expected, increasing

U resulted in an improvement in miss rate of 4.4% absolute but also in an increase

in execution time by a factor of 2.3. A corresponding 19.9 increase in FA/kw rate

was also observed.

The obvious benefit of tuning the number of lattice generation tokens is that

appreciable gains in miss rate can be obtained. Although this has a negative

effect on FA/kw rate, a subsequent keyword verification stage may be able to

accommodate the increase.

Method              Miss Rate (%)   FA/kw   CPU/kw-hr
DMLS[3,10,200,2]    10.2            18.5    0.30
DMLS[5,10,200,2]    5.8             38.4    0.71

Table 5.4: Effect of adjusting number of lattice generation tokens


5.6.2 Pruning beamwidth

Lattice pruning is applied to remove less likely paths from the generated phone

lattice, thus making the lattice more compact. This is typically necessary when

language model expansion is applied. For example, applying 4-gram language

model expansion to a lattice generated using a 2-gram language model results

in a significant increase in the number of nodes in the lattice, many of which

may now have much poorer likelihoods due to the additional 4-gram language model

scores.

The direct benefit of applying lattice pruning is an immediate reduction in

the size of the lattice that needs to be searched. This will give improvements

in execution time, though at the expense of losing potentially correct paths that

unfortunately did not score well linguistically.

Table 5.5 shows the effect of pruning beamwidth for four different values: 150,

200, 250 and ∞. As predicted, decreasing pruning beamwidth yielded significant

gains in execution speed at the expense of increased miss rates. Corresponding

drops in FA/kw rate were also observed.

Adjusting pruning beamwidth appears to be particularly well suited for tun-

ing execution time. The changes in CPU/kw-hr figures were dramatic, and in

comparison, the miss rate figures varied in a much smaller range.

Method              Miss Rate (%)   FA/kw   CPU/kw-hr
DMLS[3,10,150,2]    12.5            12.2    0.18
DMLS[3,10,200,2]    10.2            18.5    0.30
DMLS[3,10,250,2]    9.2             24.7    0.47
DMLS[3,10,∞,2]      7.3             60.6    2.93

Table 5.5: Effect of adjusting pruning beamwidth


5.6.3 Number of lattice traversal tokens

The number of lattice traversal tokens, V , corresponds to the number of tokens

used during the secondary Viterbi traversal. Tuning this parameter affects how

many tokens are propagated out from a node, and hence, the number of paths

entering a node that survive subsequent propagation.

[Figure: token sets entering a node under 3-token and 5-token propagation, with the emit cutoff marked.]

Figure 5.2: Effect of lattice traversal token parameter


The impact of this on DMLS is actually more subtle, and is demonstrated

by figure 5.2. In this instance, the scores of tokens propagated from the t node

are much higher than the scores from the other nodes. As such, in the 5-token

propagation case, the majority of the high-scoring tokens in the target node are

from the t node. Hence the tokens above the emission cutoff (ie. the tokens from

which observed phone sequences are generated) are mainly tokens from the t node. However,

using the same emission cutoff and 3-token propagation results in a set of top-

scoring tokens from a variety of source nodes. It is not immediately obvious

whether it is better to use a high or low number of lattice traversal tokens for

optimal DMLS performance.

Table 5.6 shows the results of experiments using three different numbers of

traversal tokens: 5, 10 and 20. It appears that all three measured performance

metrics were fairly insensitive to changes in the number of traversal tokens. There

was a slight decrease in miss rate when using a higher value of V , though this may

not be considered a dramatic enough change to justify the additional processing

burden required at the lattice building stage.

Method              Miss Rate (%)   FA/kw   CPU/kw-hr
DMLS[3,5,200,2]     10.4            17.4    0.28
DMLS[3,10,200,2]    10.2            18.5    0.30
DMLS[3,20,200,2]    9.8             18.8    0.29

Table 5.6: Effect of adjusting number of traversal tokens

5.6.4 MED cost threshold

Tuning of the MED cost threshold, Smax, is the most direct means of tuning

miss and FA/kw performance. However, if discrete MED costs are used, then


Smax itself will be a discrete variable, and as such, thresholding will not be on a

continuous scale.

Smax specifies the maximum allowable discrepancy between an observed phone

sequence and the target phone sequence. However, this single threshold does

not take into account what kind of mismatch occurred. For example, the se-

quences (/s/, /t/, /aa/, /p/) and (/s/, /sh/, /t/, /aa/, /k/, /g/) will both have a

MED score of 2 when scored against the target sequence (/s/, /p/, /aa/, /k/)

using the cost functions Ci = 1, Cs = 1 and Cd = ∞.
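The sketch below reproduces this worked example. It assumes, consistently with the example although not stated explicitly here, that the final MED score is taken as the minimum over the last row of the cost matrix, so that trailing observed phones beyond the matched region are not penalised; the code is illustrative only.

    # Illustrative MED scorer for the worked example: Ci = 1, Cs = 1 for
    # non-identical phones, Cd = infinity, with the final score taken as the
    # minimum over the last row of the cost matrix (an assumption, see above).
    INF = float("inf")

    def med_score(target, observed, ci=1, cs=1, cd=INF):
        n, m = len(target), len(observed)
        D = [[INF] * (m + 1) for _ in range(n + 1)]
        D[0][0] = 0
        for j in range(1, m + 1):
            D[0][j] = D[0][j - 1] + ci                  # leading insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if target[i - 1] == observed[j - 1] else cs
                D[i][j] = min(D[i - 1][j - 1] + sub,    # substitution / match
                              D[i - 1][j] + cd,         # deletion of a target phone
                              D[i][j - 1] + ci)         # insertion in the observed
        return min(D[n])

    target = ["s", "p", "aa", "k"]
    print(med_score(target, ["s", "t", "aa", "p"]))             # 2
    print(med_score(target, ["s", "sh", "t", "aa", "k", "g"]))  # 2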

Experiments were carried out to study the effects of changes in Smax on per-

formance. The results of these experiments are shown in table 5.7. Since thresh-

olding was applied on the result set of DMLS, there were no changes in execution

time.

Method              Miss Rate (%)   FA/kw   CPU/kw-hr
DMLS[3,10,200,0]    31.0            0.5     0.30
DMLS[3,10,200,1]    13.3            4.3     0.30
DMLS[3,10,200,2]    10.2            18.5    0.30
DMLS[3,10,200,3]    8.7             52.0    0.30

Table 5.7: Effect of adjusting MED cost threshold Smax

The experiments demonstrated that adjusting Smax gave dramatic changes in

FA/kw. In contrast, the changes in miss rate were considerably more conserva-

tive. Tuning of the MED cost threshold therefore appears to be most applicable

to adjusting the FA/kw operating point. This is intuitive since adjusting Smax

adjusts how much error an observed phone sequence is allowed to have, and as

such has a direct correlation with false alarm rate.


5.6.5 Tuned systems

Previous sections examined tuning of the various DMLS parameters in isolation.

However, it was not clear from these experiments how a system constructed using

a combination of tuned parameters would perform. In particular, it is essential

to know whether the benefits obtained from tuning the individual parameters are

complementary, resulting in even greater increases in keyword spotting perfor-

mance.

As such, two tuned systems were constructed and evaluated on the TIMIT

data set. Parameters for these systems were selected as follows:

1. The number of lattice generation tokens appeared to be well suited to ad-

justing miss rate. As such, a value of U = 5 was used for the tuned systems.

2. DMLS performance appeared insensitive to changes in the number of lattice

traversal tokens. Hence, to remain consistent with previous experiments, a

value of V = 10 was used.

3. The speed increases observed using a reduced lattice pruning beamwidth

were quite dramatic, and in comparison only resulted in a small decrease in

miss rate. Considering the anticipated gains in miss rate from the increase

in the number of lattice generation tokens, a reduced value of W = 150 was

used for the tuned systems.

4. Two values of Smax were evaluated to obtain performance at different false

alarm points. The values evaluated were Smax = 1 and Smax = 2. Although

it was anticipated that an increase in miss rate would be observed for the

lower Smax = 1 system, it was hoped that this would be compensated for by

the increase in the number of lattice generation tokens, and further justified

by the significantly lower false alarm rate.


The results of the tuned systems on the TIMIT evaluation set are shown in

table 5.8. The first system achieved a significant reduction in FA/kw rate over

the initial DMLS[3,10,200,2] system at the expense of only a small 1.3% absolute

increase in miss rate. The second system obtained a good decrease in miss rate of

2.9% with only a small 3.8 FA/kw rate increase. Both these systems maintained

the same execution speed as the initial DMLS system. It is difficult to say which

of the tuned systems is preferable, since the choice of operating point is typically application dependent.

Method              Miss Rate (%)   FA/kw   CPU/kw-hr
DMLS[3,10,200,2]    10.2            18.5    0.30
DMLS[5,10,150,1]    11.5            5.6     0.31
DMLS[5,10,150,2]    7.3             22.3    0.31

Table 5.8: Optimised DMLS configurations evaluated on TIMIT

A comparison of these systems in terms of miss rate and FA/kw rate with all

other evaluated DMLS systems is shown in figure 5.3. It can be seen from this

figure that the optimised systems are closer to the origin, indicating an overall

system improvement. More importantly, these performance gains have been made

while maintaining the same execution time.

5.6.6 Conclusions

A number of experiments were conducted to evaluate the impact of four Dynamic

Match Lattice Spotting parameters: the number of lattice generation tokens, the

number of lattice traversal tokens, the pruning beamwidth and the MED cost

threshold Smax. It was concluded that:

1. Good control of miss rate performance can be obtained by tuning the num-

ber of lattice generation tokens.


[Figure: log(miss rate) plotted against log(FA/kw rate) for the lattice generation token, lattice traversal token, pruning beamwidth and Smax tuning experiments, with the two tuned systems marked.]

Figure 5.3: Trends in miss rate and FA/kw rate performance for various types of tuning

2. System performance is fairly insensitive to the number of lattice traversal

tokens, making this a poor parameter for tuning.

3. Adjustment of the lattice pruning beamwidth gives significant changes in

execution time. As such, this is an excellent parameter for tuning system

speed. Although changes in both miss rate and FA/kw rate were also ob-

served, the magnitudes of these changes were much less than the changes

obtained for execution speed.

4. Smax tuning is particularly useful for adjusting FA/kw rates. This parame-

ter can also be used for adjusting miss rate, though the changes are not as

dramatic.

5. Improvements in miss and false alarm rates can be obtained simultaneously

by tuning of all parameters in combination, as demonstrated by the two


tuned systems that were constructed. These gains can be obtained without

sacrificing execution speed.

5.7 Conversational telephone speech experiments

Previous sections of this chapter only evaluated DMLS on the clean microphone

speech domain. The conversational telephone speech domain is a more difficult

domain but is more representative of a real-world practical application of DMLS.

As such, this section reports on experiments to evaluate the performance of DMLS

for this domain.

Specifically, experiments were performed using the Switchboard 1 telephone

speech corpus. To maintain consistency, the same baseline systems, DMLS algo-

rithms and evaluation procedure as used in section 5.4 are used here.

5.7.1 Evaluation set

The evaluation set was constructed in a similar fashion to the previously con-

structed TIMIT evaluation set. Approximately 2 hours of speech was taken from

the Switchboard corpus and labeled as evaluation speech. From this speech, 360

6-phone-length unique words were randomly chosen and marked as query words.

These query words appeared a total of 808 times in the evaluation set.

5.7.2 Recogniser parameters

The same recogniser parameters that were used for the previous TIMIT experi-

ments were used for this set of experiments. This consisted of training 16-mixture

triphone HMM acoustic models as well as a 256-mixture Gaussian Mixture Model

background model for use with the SBM-based keyword spotting experiments. A


total of 165 hours of speech taken from the Switchboard corpus was used as

training data for these models.

In addition, 2-gram and 4-gram phone-level language models were trained

on the same data set. Phone-level transcriptions were obtained using forced

alignment in conjunction with the PronLex lexicon.

All speech was parameterised using Perceptual Linear Prediction coefficient

feature extraction followed by Cepstral Mean Subtraction.

5.7.3 Results

The results for the HMM-based, conventional lattice-based, and DMLS experi-

ments on conversational speech SWB1 data are shown in table 5.9. DMLS per-

formance was measured using the baseline DMLS[3,10,200,2] system as well as a

number of tuned configurations. Tuned systems were constructed using a combi-

nation of lattice generation tokens, pruning beamwidth and Smax tuning.

Method              Miss Rate (%)   FA/kw   CPU/kw-hr
HMM[-7500]          8.0             366.9   1.77
HMM[-7300]          14.1            319.6   1.77
CLS[3,10,200,0]     38.4            3.2     -.--
DMLS[3,10,200,2]    17.5            59.0    0.51
DMLS[5,10,150,2]    11.0            83.6    0.72
DMLS[5,10,150,1]    14.2            23.0    0.72
DMLS[5,10,100,2]    13.9            36.1    0.18

Table 5.9: Keyword spotting results on SWB1

Of note is the dramatic increase in FA/kw rates for all systems compared to

those observed for the TIMIT evaluations. This is an expected result, since the

conversational telephone speech domain is a more difficult domain for recognition.

For DMLS, this increase in false alarm rate is a result of the increased complexity


of the lattices. It was found that the lattices generated for the Switchboard data

were significantly larger than those generated for the TIMIT data when using

the same pruning beamwidth. This meant that there were more paths with high

likelihoods, indicating a greater degree of confusability within the lattices. As a

result, more false alarms were generated.

Increases in miss rate in the vicinity of 5% absolute were also observed for all systems compared to the TIMIT evaluations. Although this is unfortunate, these degradations are still minor in light of the increased difficulty of the data.

Overall though, DMLS still achieved more favourable performance than the

baseline HMM-based and lattice-based systems. The DMLS systems not only

yielded considerably lower miss rates than CLS but also significantly lower FA/kw

and CPU/kw-hr rates than the HMM-based systems. Figure 5.4 shows a plot of

miss rate versus FA/kw rate for the evaluated systems. The plot indicates that

DMLS offers good middle ground performance.

In terms of detection performance, the two best DMLS systems were the

DMLS[5,10,150,1] and the DMLS[5,10,100,2] configurations. Both had lower false

alarm rates than the other DMLS systems and still maintained fairly low miss

rates. However, the execution speed of the DMLS[5,10,100,2] configuration was 4

times faster than the DMLS[5,10,150,1] system. In fact, this system was capable

of searching 1 hour of speech in 10 seconds. For applications requiring very fast

search speeds, the DMLS[5,10,100,2] system would clearly be the better choice.

Overall, the experiments demonstrate that DMLS is capable of delivering

good keyword spotting performance on the more difficult conversational telephone

speech domain. Although there was some degradation in performance compared

to the clean speech microphone domain, the losses were in line with what would

be expected. Also, DMLS offered much faster performance than the HMM-based

system and considerably lower miss rates than the conventional lattice-based

system.


[Figure: log(miss rate) plotted against log(FA/kw rate) for the HMM, CLS and DMLS configurations listed in table 5.9.]

Figure 5.4: Plot of miss rate versus FA/kw rate for HMM, CLS and DMLS systems evaluated on Switchboard

5.8 Non-destructive optimisations

Experiments on the TIMIT and Switchboard databases have clearly demonstrated

that DMLS is capable of obtaining very fast keyword spotting speeds. Although

these speeds are impressive, further gains in throughput can be obtained through

optimisation of the MED calculations.

MED calculations are in fact the most costly operations performed during the DMLS search stage. The basic MED algorithm is an $O(N^2)$ algorithm and

hence not particularly suitable for high-speed calculation. However, within the

DMLS search context, two specific optimisations can be applied to reduce the

computational cost of these MED calculations. These optimisations are the prefix

sequence optimisation and the early stopping optimisation.


5.8.1 Prefix sequence optimisation

The prefix sequence optimisation utilises the similarities in the MED cost matrix

of two observed phone sequences that share a common prefix sequence.

Let A = (a_1, a_2, ..., a_N) and B = (b_1, b_2, ..., b_M). Also let B′ be defined as the first-order prefix sequence of B, given by B′ = (b_1, b_2, ..., b_{M−1}). Finally, let the MED cost matrix between two sequences be defined as Ω(X,Y).

From the basic definition of the MED cost matrix, the (N + 1) × M cost

matrix Ω(A,B′) is equal to the first M columns of the cost matrix Ω(A,B). This

is because B′ is equal to the first M − 1 elements of B.

Therefore, given the cost matrix Ω(A,B′), it is only necessary to calculate

the values of the (M + 1)th column of Ω(A,B) to obtain the full cost matrix

Ω(A,B). This is demonstrated in figure 5.5.

[Figure: the MED cost matrices Ω(A,B′) and Ω(A,B) side by side, showing that Ω(A,B) differs from Ω(A,B′) only in its final column.]

Figure 5.5: The relationship between cost matrices for subsequences

The argument extends to even shorter prefix sequences of B. For example, let B′′′ be defined as the third-order prefix sequence of B, given by B′′′ = (b_1, b_2, ..., b_{M−3}). Then, given Ω(A,B′′′), it is only necessary to calculate the (M − 1)th, Mth and (M + 1)th columns of Ω(A,B) to obtain the full cost matrix.

Now, given that the MED cost matrix Ω(A,B) is known, consider the task of

calculating the MED cost matrix Ω(A,C). Let P (B,C) return the longest prefix

sequence of B that is also a prefix sequence of C. Then, Ω(A,C) can be obtained


by taking Ω(A,B) and recalculating the last |C| − |P (B,C)| columns.

In DMLS, an utterance is represented by a collection of observed phone se-

quences. Typically, there is a degree of prefix similarity between sequences from

the same temporal location, and in particular between sequences emitted from

the same node. As demonstrated above, knowledge of prefix similarity will allow

a significant reduction in the number of MED calculations required.

The simplest means of obtaining this knowledge is to simply sort the phone

sequences of an utterance lexically during the lattice building stage. Then the

degree of prefix similarity between each sequence and its predecessor can be

calculated and stored. For this purpose, the degree of prefix similarity is defined

as the length of the longest common prefix subsequence of two sequences.

Then, during the DMLS search stage, all that is required is to step through the

sequence collection and use the predetermined prefix similarity value to determine

what portion of the MED cost matrix needs to be calculated, as demonstrated in

figure 5.6. As such, only changed portions of the MED cost matrix are iteratively

updated, greatly reducing computational burden.

[Figure: a lexically sorted set of observed phone sequences annotated with the prefix similarity to the preceding sequence and the resulting number of MED columns that must be recalculated.]

Figure 5.6: Demonstration of the MED prefix optimisation algorithm

The resulting algorithm can be summarised as follows:


1. Initialise a MED cost matrix of size (N + 1) × (M + 1), where N is the

length of the target phone sequence and M is the maximum length of the

observed phone sequences.

2. For each sequence in the observed phone sequence collection

(a) Let k be defined as the previously computed degree of prefix similarity

metric between this sequence and the previous sequence

(b) Recalculate the last M − k columns of the MED cost matrix

(c) Obtain S = BESTMED(. . .) in the normal fashion given this MED

cost matrix
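A minimal sketch of the prefix-similarity bookkeeping performed at lattice building time is shown below, assuming the observed phone sequences are available as lists of phone labels; the function names are illustrative rather than taken from this work.

    # Illustrative sketch: sort the observed phone sequences lexically and record
    # how many leading phones each sequence shares with its predecessor.
    def common_prefix_length(a, b):
        k = 0
        for x, y in zip(a, b):
            if x != y:
                break
            k += 1
        return k

    def annotate_prefix_similarity(sequences):
        """Return (ordered, sims): sims[i] is the number of leading phones that
        ordered[i] shares with ordered[i-1]; sims[0] is 0."""
        ordered = sorted(sequences)
        sims = [0]
        for prev, cur in zip(ordered, ordered[1:]):
            sims.append(common_prefix_length(prev, cur))
        return ordered, sims

During the search stage, the columns of the MED cost matrix covering the shared prefix can then be carried over from the previous sequence and only the trailing columns recalculated.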

5.8.2 Early stopping optimisation

The early stopping optimisation uses knowledge about the Smax threshold to limit

the extent of the MED matrix that has to be calculated.

From MED theory, the element Ω(X,Y)_{i,j} of the MED cost matrix Ω(X,Y) corresponds to the minimum cost of transforming the sequence (x_1, ..., x_i) to the sequence (y_1, ..., y_j). For convenience, the notation Ω is used to represent Ω(X,Y). The value of Ω_{i,j} is given by the recursive expression

    \Omega_{i,j} = \min\left( \Omega_{i-1,j-1} + C_s(x_i, y_j),\ \Omega_{i-1,j} + C_d(x_i),\ \Omega_{i,j-1} + C_i(y_j) \right)    (5.1)

Given the above formulation, and assuming non-negative cost functions, the value of Ω_{i,j} has a lower bound governed by

    \mathrm{LowerBound}(\Omega_{i,j}) \ge \min\left( \{\Omega_{k,j}\}_{k=1}^{i-1} \cup \{\Omega_{k,j-1}\}_{k=1}^{|X|} \right)    (5.2)


That is, it is bounded by the minimum value of column j−1 and all values above row i in column j. This states that the lower bound of Ω_{i,j} is a function of Ω_{i−1,j}, which implies the recursive formulation

    \mathrm{LowerBound}(\Omega_{i,j}) \ge \min\left( \{\mathrm{LowerBound}(\Omega_{i-1,j})\} \cup \{\Omega_{k,j-1}\}_{k=1}^{|X|} \right)    (5.3)

This states that the lower bound of Ω_{i,j} is governed by all entries in the previous column and the lower bound of the element directly above it in the cost matrix. If the recursion is continuously unrolled, then the lower bound reduces to being only a function of the previous column and the very first element in column j, that is

    \mathrm{LowerBound}(\Omega_{i,j}) \ge \min\left( \{\mathrm{LowerBound}(\Omega_{1,j})\} \cup \{\Omega_{k,j-1}\}_{k=1}^{|X|} \right)    (5.4)

Now MED theory states that Ω_{1,j} = j × C_i(y_j) for all values of j. This means that for a positive insertion cost function

    \mathrm{LowerBound}(\Omega_{1,j}) \ge \mathrm{LowerBound}(\Omega_{1,j-1})    (5.5)

Substituting this back into equation 5.4 gives

    \mathrm{LowerBound}(\Omega_{i,j}) \ge \min\left( \{\mathrm{LowerBound}(\Omega_{1,j})\} \cup \{\Omega_{k,j-1}\}_{k=1}^{|X|} \right)    (5.6)

                                     \ge \min\left( \{\mathrm{LowerBound}(\Omega_{1,j-1})\} \cup \{\Omega_{k,j-1}\}_{k=1}^{|X|} \right)    (5.7)

This reduces to the simple relationship

    \mathrm{LowerBound}(\Omega_{i,j}) \ge \min\left( \{\Omega_{k,j-1}\}_{k=1}^{|X|} \right)    (5.8)

It has therefore been demonstrated that the lower bound of Ω_{i,j} is only a function


of the values of the previous column of the MED matrix. This lends itself to a

significant optimisation within the DMLS framework.

Since Smax is fixed prior to the DMLS search, there is an upper bound on the

MED score of observed phone sequences that are to be considered as putative

hits. When calculating columns of the MED matrix, the relationship in equation

5.8 can be used to predict what the lower bound of the current column is. If this

lower bound exceeds Smax then it is not necessary to calculate the current or any

subsequent columns of the cost matrix, since all elements will exceed Smax.

This is a very powerful optimisation, particularly when comparing two se-

quences that are very different. It means that in many cases only the first few

columns will need to be calculated before it can be declared that a sequence is

not a putative occurrence.

The resulting algorithm can be summarised as follows. For each column j of

the MED cost matrix

1. Determine the minimum score, MinScore(j−1) in column j−1 of the cost

matrix

2. If MinScore(j − 1) > Smax then declare this sequence as not being a puta-

tive occurrence and stop processing

3. Calculate all elements for column j of the cost matrix
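Assuming the cost matrix is filled column by column, the test itself reduces to a single check on the previously completed column, as in the illustrative sketch below.

    # Illustrative early stopping test: once every entry of the previously
    # completed column exceeds Smax, no later element can fall below Smax,
    # so the observed sequence can be rejected without further computation.
    def should_stop_early(previous_column, s_max):
        return min(previous_column) > s_max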

5.8.3 Combining optimisations

The early stopping optimisation and the prefix sequence optimisation can be

easily combined to give even greater speed improvements. Essentially the prefix

sequence optimisation uses prior information to eliminate computation of the

starting columns of the cost matrix, while the early stopping optimisation uses

prior information to prevent unnecessary computation of the final columns of the


cost matrix.

When combined, all that remains during MED costing is to calculate the nec-

essary in-between columns of the cost matrix. As such, the combined algorithm

is given by:

1. Initialise a MED cost matrix of size (N + 1) × (M + 1), where N is the

length of the target phone sequence and M is the maximum length of the

observed phone sequences.

2. For each sequence in the observed phone sequence collection

(a) Let k be defined as the previously computed degree of prefix similarity

metric between this sequence and the previous sequence

(b) Using the prefix sequence optimisation, it is only necessary to update

the trailing columns of the MED matrix. Thus, for each column, j,

from (M + 1)− k + 1 to M + 1 of the MED cost matrix

i. Determine the minimum score, MinScore(j − 1) in column j − 1

of the cost matrix

ii. If MinScore(j − 1) > Smax then using the early stopping opti-

misation, this sequence can be declared as not being a putative

occurrence and processing can stop

iii. Calculate all elements for column j of the cost matrix

(c) Obtain S = BESTMED(. . .) in the normal fashion given this MED

cost matrix
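A sketch of the combined procedure is given below. It stores the cost matrix as a list of columns so that the columns covering the shared prefix can be reused, stops extending a sequence as soon as the early stopping test fires, and assumes that the BESTMED score is the minimum over the last row of the matrix. The cost functions and all names are illustrative assumptions rather than the actual implementation used in this work.

    # Illustrative column-wise MED search combining the prefix sequence and
    # early stopping optimisations.  'sequences' and 'similarities' are the
    # lexically sorted observed phone sequences and their prefix similarities.
    INF = float("inf")

    def search_sequences(target, sequences, similarities, s_max,
                         ci=1, cd=INF, cs=lambda a, b: 0 if a == b else 1):
        n = len(target)
        cols = []                                   # cols[j] = column j of the cost matrix
        hits = []
        for seq, k in zip(sequences, similarities):
            del cols[k + 1:]                        # keep the columns covering the shared prefix
            if not cols:
                cols.append([0.0] + [INF] * n)      # column 0: empty observed prefix
            for j in range(len(cols), len(seq) + 1):
                prev = cols[-1]
                if min(prev) > s_max:               # early stopping: no later column
                    break                           # can yield a score <= Smax
                col = [prev[0] + ci]                # row 0: insertions only so far
                for i in range(1, n + 1):
                    col.append(min(prev[i - 1] + cs(target[i - 1], seq[j - 1]),
                                   col[i - 1] + cd,
                                   prev[i] + ci))
                cols.append(col)
            score = min(c[n] for c in cols)         # BESTMED over the computed last-row entries
            if score <= s_max:
                hits.append((seq, score))
        return hits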

5.9 Optimised system timings

Experiments were performed to evaluate the execution time benefits of the prefix

sequence and early stopping optimisations. Five systems were evaluated:


1. NOPT: DMLS system without prefix sequence and early stopping optimi-

sations

2. ESOPT: DMLS system with early stopping optimisation

3. PSOPT: DMLS system with prefix sequence optimisation

4. COPT: DMLS system with combined early stopping and prefix sequence

optimisations

5. CXOPT: The COPT system with miscellaneous coding optimisations ap-

plied such as removal of dynamic memory allocation, more efficient passing

of data, etc.

5.9.1 Experimental procedure

Experiments were performed using 10 randomly selected utterances from the

Switchboard evaluation set detailed in 5.7.1. Single word keyword spotting was

performed for each utterance using a 6-phone-length target word.

Each utterance was processed repeatedly for the same word 1400 times and

the total execution time was measured for all passes. The total time was then

summed across all tested utterances to obtain the total time required to perform

10× 1400 passes.

The relative speeds were calculated by finding the ratio between the measured

speed of the tested system and the measured speed of the baseline NOPT system.

The entire evaluation was then repeated a total of 10 times and the average

relative speed factor was calculated. Execution time was measured on a single

3GHz Pentium 4 processor.

Additionally, the final putative occurrence result sets were examined to en-

sure that exactly the same miss rate and FA/kw rates were obtained across all

methods, since both optimisations should not affect these metrics.
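A sketch of the timing procedure under this protocol is shown below; the search function and the utterance and keyword objects are placeholders, and only the measurement structure follows the description above.

    # Illustrative timing harness: 10 utterances x 1400 passes per evaluation,
    # repeated 10 times, with speeds reported relative to the NOPT baseline.
    import time

    def total_search_time(search_fn, utterances, keyword, passes=1400):
        start = time.perf_counter()
        for utt in utterances:
            for _ in range(passes):
                search_fn(utt, keyword)
        return time.perf_counter() - start

    def average_speed_factor(search_fn, baseline_fn, utterances, keyword, repeats=10):
        factors = []
        for _ in range(repeats):
            t_sys = total_search_time(search_fn, utterances, keyword)
            t_base = total_search_time(baseline_fn, utterances, keyword)
            factors.append(t_sys / t_base)          # < 1.0 means faster than NOPT
        return sum(factors) / len(factors)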


5.9.2 Results

Table 5.10 shows the speed of each system relative to the baseline unoptimised

NOPT system. Tests were performed using Smax values of 2 and 4, since the

benefits of the early stopping optimisation depend on the value of Smax.

Smax   System   Speed factor
2      NOPT     1.00
2      PSOPT    0.60
2      ESOPT    0.42
2      COPT     0.25
2      CXOPT    0.16
4      NOPT     1.00
4      PSOPT    0.60
4      ESOPT    0.64
4      COPT     0.32
4      CXOPT    0.21

Table 5.10: Relative speeds of optimised DMLS systems

The results clearly demonstrate that both optimisations yielded significant

speed benefits. An even more pleasing result was that the two optimisations

combined effectively to reduce execution time by a factor of 4 for the Smax = 2

tests, and by a factor of 3 for the Smax = 4 tests. Overall the fully optimised

CXOPT system ran about 5 to 6 times faster than the original unoptimised

system.

Table 5.11 shows the execution time of the unoptimised DMLS system eval-

uated in section 5.7 as well as the anticipated CPU/kw-hr figure for the same

system incorporating the early stopping and prefix sequence optimisations. It

can be seen that the resultant CPU/kw-hr figure is 0.03. This corresponds to be-

ing capable of searching one hour of speech in 1.8 seconds. This is an impressive


result and clearly emphasises the suitability of DMLS for very fast large database

keyword spotting applications.

Method                        Miss Rate (%)   FA/kw   CPU/kw-hr
DMLS[5,10,100,2]              13.9            36.1    0.18
DMLS[5,10,100,2] with CXOPT   13.9            36.1    0.03

Table 5.11: Performance of a fully optimised DMLS system on Switchboard data

5.10 Summary

This chapter presented a novel unrestricted vocabulary audio document index-

ing method named Dynamic Match Lattice Spotting. Through experimentation,

it was demonstrated that this method was capable of searching hours of data

using only seconds of processing time, while maintaining excellent detection per-

formance.

The lack of robustness to subevent recogniser error was identified as a reason

for the poor detection performance of pre-existing unrestricted vocabulary audio

indexing techniques. It was postulated that incorporating prior knowledge of

subevent recogniser errors would be a means of improving detection rates. The

DMLS method was proposed as a means of doing this.

Initial experiments using DMLS demonstrated that it outperformed pre-existing

techniques for the clean microphone speech domain. Compared to SBM-KS,

DMLS was significantly faster and also obtained considerably lower false alarm

rates. Comparisons with the conventional lattice-based technique demonstrated

the miss rate performance of DMLS to be vastly superior.

An analysis of the contributions of dynamic matching rules to DMLS perfor-

mance was presented, to rationalise the benefits of DMLS over the conventional


lattice-based technique. It was found that the vowel substitution and closure/stop

substitution rules contributed significantly to improving miss rate performance,

while the same-letter substitution and insertion rules only offered small improve-

ments. Nevertheless, in all cases, inclusion of any given dynamic matching rule

offered clear benefits over the null-rule conventional lattice-based method.

A study of key parameters of DMLS was also presented. It was found that

careful tuning of these parameters offered the ability to significantly enhance

DMLS performance. In particular:

1. Lattice generation token tuning was excellent for adjusting the miss rate

operating point

2. The pruning beamwidth was useful for tailoring execution speed

3. Smax tuning was suitable for adjusting the false alarm operating point

Through careful adjustment of these parameters, it was possible to construct a

tuned DMLS system that outperformed the previously evaluated baseline DMLS

system.

Evaluation results were also provided for the conversational telephone speech

domain. As would be expected, there was some degradation in performance

compared to the clean microphone speech domain. Nevertheless the performance

of DMLS was still excellent compared to that of the evaluated baseline techniques.

Finally, two key algorithmic optimisations to increase the speed of DMLS were

presented. It was shown that these optimisations could be combined to further

improve the execution speed of DMLS by a factor of 5 to 6 times.

In summary, this chapter has demonstrated that DMLS is an excellent can-

didate for very fast audio document indexing. The search speeds offered by this

method are exceptional considering the low miss rates it offers. The key results

of this chapter are summarised in table 5.12.


Domain   Method                    Miss Rate (%)   FA/kw   CPU/kw-hr   Secs. to search 1 h
TIMIT    HMM[-7580]                10.4            36.6    1.58        95
TIMIT    CLS[3,10,200,0]           32.9            0.4     -.--        -.--
TIMIT    DMLS[3,10,200,2]          10.2            18.5    0.30        18
TIMIT    DMLS[5,10,150,1]          11.5            5.6     0.31        18
TIMIT    DMLS[5,10,150,2]          7.3             22.3    0.31        18
SWB1     HMM[-7300]                14.1            319.6   1.77        106
SWB1     CLS[3,10,200,0]           38.4            3.2     -.--        -.--
SWB1     DMLS[3,10,200,2]          17.5            59.0    0.51        31
SWB1     DMLS[5,10,100,2]          13.9            36.1    0.18        10
SWB1     DMLS[5,10,100,2] CXOPT    13.9            36.1    0.03        1.8

Table 5.12: Summary of key results


Chapter 6

Non-English Spotting

6.1 Introduction

With the recent increase in global security awareness, non-English speech recog-

nition has emerged as a major topic of interest. One problem that has hindered

the development of robust non-English speech recognition is the lack of large

transcribed non-English speech databases.

A lack of available training data has been reported to considerably degrade

the performance of speech recognition. For example, Moore [24] reported losses of

as much as 15% absolute for large vocabulary speech recognition when using

training databases smaller than 100 hours (it is noted that the reported losses

in Moore [24] are interpolated results; however, they provide an approximation of

anticipated loss).

Losses in performance would also be expected for keyword spotting. However,

keyword spotting is a very different task from speech transcription, and as such,

the magnitude and nature of these losses are likely to be different. In particular,

keyword spotting is a more constrained task, attempting to discriminate between

a much smaller number of classes. It is possible then that it will be less affected by


reduced amounts of training data. If so, keyword spotting techniques may provide

a viable short-term solution for the development and deployment of non-English

data mining applications.

Primarily, this chapter investigates the effects of limited training resources on

the performance of keyword spotting systems. Experiments and discussion are

presented to assess the benefits of large training corpora for keyword spotting,

and to determine whether the benefits from increased amounts of training data

provide sufficient gain to motivate the collection of this data.

Trends in English and Spanish keyword spotting performance are examined

with regard to changes in training database size. Given these trends, extrapola-

tions are made to anticipate the loss in performance from using reduced training

data sizes for the low resourced language of Indonesian.

6.2 The issue of limited resources

The amount of transcribed training data available for training speech recognition

systems has been demonstrated to have a marked impact on performance. In par-

ticular, these effects are considerably greater when using small training database

sizes.

Figure 6.1 illustrates these effects on word error rate as evaluated by Moore

[24]. The plots clearly depict that considerable gains in performance are obtained

especially up to the first 100 hours of training data. For example, an anticipated

gain of 10-15% absolute is observed when increasing training database size from

10 hours to 100 hours.

A significant barrier to the research and development of speech recognition

technologies for many non-English languages is the lack of resources. Some of the

key resources required for this development are well-transcribed speech databases

and sizable pronunciation lexicons. Although such resources are slowly becoming


Figure 6.1: Effect of training dataset size on speech recognition [24]

more easily available, such as the OGI Multilingual Corpus [9] and the CALL-

HOME Database [7], the amount of data is still very small in comparison to that

available for English.

As such, the performances of many reported low resource non-English recog-

nisers have been very poor. For example, in a recent publication, Walker et al.

[34] reported phone recognition error rates as high as 60-70% for languages such

as Japanese, Spanish and Farsi. These results are very poor compared to the

typical 20-40% error rates frequently reported for English.

Such poor error rates have hindered the deployment of non-English speech

recognition technologies. Many consumers already regard commercially deployed English-based speech recognition applications as unreliable and error-prone. Given the substantially poorer accuracy of current non-English speech recognition technologies, it is likely that non-English speech recognition will be even less well received.

A long term solution for this problem is to simply transcribe more data and

build the required pronunciation lexicons. Apart from the immense cost and time


required to do this, such an approach also leaves many poorer non-English-speaking countries

without speech recognition technology. Additionally, non-English speech research

is also of interest in many English-speaking countries, for example, in applications

such as telephone conversation monitoring and security surveillance. These, and

many other reasons, provide a motivation for the immediate development and

deployment of non-English speech technologies.

6.3 The role of keyword spotting

Keyword spotting is a considerably simpler task than speech transcription. For

example, the single-keyword spotting task is a 2-class transcription task, attempt-

ing to segment speech into sequences of target speech and non-target speech. As

such, it is likely to require less prior information and hence less training data

to achieve acceptable performance compared to a large vocabulary STT system.

Therefore, keyword spotting may be less affected by limited amounts of training

resources.

There is a plethora of applications for which keyword spotting provides a suf-

ficient solution. These include data mining, real-time monitoring and dialogue

systems. Although a full large vocabulary STT system may be able to provide

further functionality, such as improved speech understanding capability for dia-

logue system applications, a keyword spotting system can provide a viable short-term

solution if it is less affected by limited training data.

6.4 Experiment setup

To evaluate the effects of reduced training database size, a number of keyword

spotting systems were trained for the languages of interest. Details of these

systems, and how they were evaluated are given in this section.


6.4.1 Database design

Data was sourced from three language specific databases: the English Switch-

board database, the CALLHOME Spanish database, and the Indonesian split of

the OGI Multilingual corpus. All three databases consisted of narrowband tele-

phone speech. After removing all utterances containing out-of-vocabulary words,

there was a total of 165 hours of English data, 10.2 hours of Spanish data, and

3.5 hours of Indonesian data. Due to the limited amount of data available for

the non-English languages, only 40 minutes was designated as test data while the

remaining data was used for training.

Since there was a large amount of data available for English, three differ-

ent sized training datasets were constructed. The first used the entire training

database, consisting of 164.05 hours of speech. The second was an intermediary sized database consisting of 15.4 hours of speech, and the final database was very small, made up of only 4.15 hours of data.

Training databases for the non-English languages were not selected to be the

same size as the English training datasets. Instead it was decided to match

the databases based on a hours per phone metric. This was done because the

number of phones used in English was 44 while the number used for Spanish and

Indonesian was only 28 (taken from the WorldBet phone set). As such, datasets

had to be sized to ensure that there was an equal number of training examples

per phone across all languages to avoid unfairly penalising any one language.

Additionally, database sizes for the non-English languages were limited by the

amount of available data. As such, it was only possible to create intermediary and

small sized databases for Spanish, and only a small sized database for Indonesian.

Sizes were matched using the hours per phone metric as discussed above. Table

6.1 shows a summary of all training datasets.


To avoid confusion, the codes in table 6.1 are used when referring to the in-

dividual training data sets. The S1 training sets correspond to the 0.1 h/phone

training data sets and exist for all three languages. The S2 training sets cor-

respond to the 0.35 h/phone training data sets and only exist for English and

Spanish. Finally the S3E set corresponds to the full sized English training data

set and was included to provide insight into spotting and verification performance

for systems trained using very large databases.

Code   Language     Hours of speech   Hours per phone
S1E    English      4.15              0.095
S1S    Spanish      2.82              0.10
S1I    Indonesian   2.78              0.099
S2E    English      15.4              0.35
S2S    Spanish      9.59              0.34
S3E    English      164.05            3.73

Table 6.1: Summary of training data sets

All data was parameterised using Perceptual Linear Prediction coefficient fea-

ture extraction. Utterance based Cepstral Mean Subtraction was applied to re-

duce the effects of channel/speaker mismatch.

6.4.2 Model architectures

When using limited size training databases, it is often necessary to use simpler

model architectures to avoid data sparsity issues. As such, for this set of ex-

periments, three different HMM phone model architectures were built for each

training data set. These were:

1. 16-mixture triphone - It was anticipated that the triphone architecture

would provide the greatest performance when using the large training data


sets but would have reduced performance for smaller training data sets due

to data sparsity.

2. 16-mixture monophone - This architecture was included to address the

data sparsity issues of the triphone architecture, although it was expected

that for the large datasets, these models would be too simplistic.

3. 32-mixture monophone - This provides a compromise between the very

high data requirements of the triphone architecture and the modeling sim-

plicity of the 16-mixture monophone set.

In addition, a 256-mixture Gaussian Mixture Model SBM was trained on each dataset for use as the background model in HMM-based keyword spotting.

To ease reference to the various model sets, the codes shown in table 6.2 are used when referring to individual model architectures. Furthermore, when referring to a model trained on a specific training set, the name of the training set is appended to the model label. Hence, a 16-mixture triphone model set trained on the S2S training set is referred to as the T16S2S model set, whereas the 32-mixture monophone model set trained on the S1I set is referred to as the M32S1I model set.

Code   Description
T16    16-mixture triphone
M16    16-mixture monophone
M32    32-mixture monophone

Table 6.2: Codes used to refer to model architectures


6.4.3 Evaluation set design

The evaluation data sets consisted of approximately 40 minutes of speech for each language. As stated before, it was not possible to use a larger evaluation set because of the limited amount of data available for Indonesian and Spanish.

Target words were restricted to 6-phone keywords. This was done to minimise variations in performance across the evaluation sets for the different languages due to non-identical distributions of keyword lengths. For English and Spanish, 180 unique 6-phone words were randomly selected for each language and designated as the evaluation query word set. Unfortunately, it was not possible to find 180 unique 6-phone words in the Indonesian evaluation set, and as such, only 153 words were used in this set.

Table 6.3 summarises each evaluation set. The # words in eval data column

corresponds to the number of instances of the words in the query word set that

occurred in the evaluation data — that is, the total number of hits required to

obtain a miss rate of 0%.

Code   Language     Mins of speech   # query words   # words in eval data
EE     English      43.62            180             298
ES     Spanish      39.60            180             353
EI     Indonesian   41.40            150             349

Table 6.3: Summary of evaluation data sets.

6.4.4 Evaluation procedure

Experiments were performed for each language to evaluate keyword spotting performance for every combination of model architecture and training database. A 2-stage keyword spotting system was used, consisting of an SBM-based keyword spotting front-end (see section 2.7.1) followed by SBM-based keyword verification (see section 2.9.4). The procedure used was as follows:

1. Keyword spotting was performed to generate a putative occurrence set.

2. Stage 1 miss and FA/kw rates were calculated using reference word-level transcriptions (a sketch of this calculation is given after the list).

3. Keyword verification was performed to generate a confidence-scored putative occurrence set.

4. Stage 2 miss and false alarm probabilities were calculated across a cross-section of thresholds. Additionally, the equal error rate statistic and DET plots were generated.
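As an illustration of the bookkeeping in step 2, the sketch below computes stage 1 miss and FA/kw figures from a scored putative occurrence set. The function and the example numbers are assumptions made for this illustration; the exact normalisation used by the evaluation scripts in this thesis may differ.

def stage1_rates(num_putative, num_hits, num_true_occurrences, num_query_words):
    """Stage 1 keyword spotting rates.

    num_putative:         putative occurrences emitted by the spotter
    num_hits:             putative occurrences matching the reference
                          word-level transcription
    num_true_occurrences: occurrences of query words in the evaluation
                          data (e.g. 298 for the EE set in table 6.3)
    num_query_words:      size of the query word set (e.g. 180 for EE)

    Returns the miss rate as a percentage and the number of false alarms
    per keyword.
    """
    misses = num_true_occurrences - num_hits
    miss_rate = 100.0 * misses / num_true_occurrences
    fa_per_kw = (num_putative - num_hits) / num_query_words
    return miss_rate, fa_per_kw

# Hypothetical example roughly on the scale of the English evaluation set.
miss, fa_kw = stage1_rates(num_putative=125_000, num_hits=286,
                           num_true_occurrences=298, num_query_words=180)
print(f"miss rate = {miss:.1f}%, FA/kw = {fa_kw:.1f}")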

6.5 English and Spanish stage 1 evaluations

Experiments were first performed to evaluate the impact of limited training data on the stage 1 miss and FA/kw rates. Of particular interest was the effect of training database size on the stage 1 miss rate, as this sets a lower bound on the achievable miss rate for a subsequent keyword verification stage. Table 6.4 shows the results of these experiments.

A number of observations can be made regarding the stage 1 spotting rates. Of

particular note is that the Spanish miss rates were much higher than the English

miss rates. The most likely explanation for this is that the Spanish data was

simply harder to recognise. Informal listening to a sample of the Spanish utterances revealed many adverse factors, such as significant background noise and very fast speaking rates.

The trend curves shown in figures 6.2 and 6.3 clearly demonstrate that in most cases increased training database size resulted in decreased miss rates and increased FA/kw rates.


[Figure 6.2: Trends in miss rate across training database size. Curves: M16E, M32E, T16E, M16S, M32S, T16S.]

[Figure 6.3: Trends in FA/kw rate across training database size. Curves: M16E, M32E, T16E, M16S, M32S, T16S.]


English

Model     Miss rate (%)   FA/kw
M16S1E    4.0             675.528
M16S2E    2.7             702.064
M16S3E    3.7             687.451
M32S1E    2.3             882.869
M32S2E    2.0             989.52
M32S3E    2.3             999.334
T16S1E    5.7             268.045
T16S2E    5.0             223.189
T16S3E    1.0             215.992

Spanish

Model     Miss rate (%)   FA/kw
M16S1S    7.6             539.946
M16S2S    6.2             606.144
M32S1S    4.5             733.128
M32S2S    3.7             872.19
T16S1S    11.9            201.63
T16S2S    10.8            208.758

Table 6.4: Stage 1 spotting rates for various model sets and database sizes

A decrease in miss rate is beneficial, as it reduces the lower bound on the achievable miss rate for a subsequent keyword verification stage. Interestingly though, the absolute gains in miss rate were not particularly large. Apart from the gain observed for the T16S3E system, the gains were below 2% absolute, and in most cases below 1%. This implies that, in terms of absolute changes, the miss rate of HMM-based keyword spotting is not dramatically affected by training database size.

The only cases where increasing the training database size did not decrease the miss rate were the M16S2E → M16S3E and M32S2E → M32S3E monophone cases. As stated earlier, it was expected that for very large training database sizes the simplistic monophone architectures would not be able to sufficiently model the increased number of modalities in the data, and would therefore become too generalised and hence poor discriminators.

The triphone architectures also provided significantly lower FA/kw rates than the monophone architectures for all training data set sizes. One may argue that this is simply a trade-off in performance - a lower FA/kw rate in exchange for a higher miss rate. This appears to be the case for the Spanish experiments. However, in the English experiments, both miss rate and FA/kw rate decreased as the amount of training data was increased. From this limited set of experiments, it is not possible to determine whether the triphone architecture truly provides an improvement in both rates or simply a trade-off between the two measures.

Overall, increased training database size does yield improved stage 1 miss rates, though the gains are not dramatic unless very large databases are used. For S1 and S2 sized databases, the monophone architectures yielded more favourable stage 1 miss rates at the expense of significantly higher stage 1 FA/kw rates.

6.6 English and Spanish post keyword verification

Post verification performance was evaluated for the various English and Spanish

training databases and model architectures. The aim of these experiments was

to determine the effect of training database size on the final keyword spotting

performance for a multi-stage system, not just the effect on the keyword verification stage in isolation. This is because in practice the same data sets would

be used when training models for the spotting and verification stages. As such,

identical model architectures and database sizes were used for the spotting and

verification stages, and the final system performance was measured.

Table 6.5 shows the EERs after keyword verification. Figures 6.4, 6.5 and 6.6

show the detection error trade-off plots for the T16, M16 and M32 experiments

respectively. A number of interesting characteristics can be seen in these results.
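For reference, the equal error rate reported in table 6.5 is the operating point at which the miss probability equals the false alarm probability. The sketch below shows one straightforward way such a figure can be read off a confidence-scored putative occurrence set; it is an illustrative re-implementation rather than the scoring code used for these experiments.

def equal_error_rate(scores, is_true_hit):
    """Approximate equal error rate (%) from verification confidence scores.

    scores:      confidence score assigned to each putative occurrence
    is_true_hit: True for a genuine keyword occurrence, False for a false
                 alarm (judged against the reference transcription)

    Every distinct score is tried as a threshold; the EER is taken at the
    threshold where miss and false alarm probabilities are closest.
    """
    genuine = [s for s, hit in zip(scores, is_true_hit) if hit]
    impostor = [s for s, hit in zip(scores, is_true_hit) if not hit]
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        miss = 100.0 * sum(s < threshold for s in genuine) / len(genuine)
        fa = 100.0 * sum(s >= threshold for s in impostor) / len(impostor)
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2.0
    return eer

# Toy usage with hypothetical scores (higher score = more confident).
scores = [0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, True, False, False]
print(equal_error_rate(scores, labels))   # 25.0 for this toy data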


[Figure 6.4: DET plot for T16 experiments. 1=T16S3E, 2=T16S2E, 3=T16S1E, 4=T16S2S, 5=T16S1S. Axes: false alarm probability (%) against miss probability (%).]

[Figure 6.5: DET plot for M16 experiments. 1=M16S3E, 2=M16S2E, 3=M16S1E, 4=M16S2S, 5=M16S1S. Axes: false alarm probability (%) against miss probability (%).]

[Figure 6.6: DET plot for M32 experiments. 1=M32S3E, 2=M32S2E, 3=M32S1E, 4=M32S2S, 5=M32S1S. Axes: false alarm probability (%) against miss probability (%).]


English

Model     EER (%)
M16S1E    22.2
M16S2E    19.8
M16S3E    20.5
M32S1E    19.1
M32S2E    17.8
M32S3E    18.5
T16S1E    18.1
T16S2E    17.8
T16S3E    13.0

Spanish

Model     EER (%)
M16S1S    25.8
M16S2S    25.2
M32S1S    24.4
M32S2S    22.6
T16S1S    28.7
T16S2S    26.9

Table 6.5: Equal error rates after keyword verification for various model sets and training database sizes

The plots in figure 6.7 demonstrate that the trends in post-verification EER are similar to the trends in stage 1 miss rate. This is reassuring, as it demonstrates consistency in performance between the two stages.

Of note is the gain in performance between the S1 and S2 systems given a fixed model architecture. In most cases, increasing the amount of training data from the S1 to the S2 database size resulted in absolute gains of approximately 1-2% in EER. Further increasing the database size, as done in the S3 experiments, resulted in gains for the triphone system only (4.8% absolute).

This is a positive result, indicating that the relatively small increase in training database size between S1 and S2 provided a tangible gain in performance. Furthermore, the fact that a significantly larger training database yielded only a 4.8% absolute gain for the T16S3E experiment suggests diminishing returns with increases in training database size. That is, the gain per hour of extra training data diminishes as the total database size increases.


[Figure 6.7: Trends in EER across training dataset size. Curves: M16E, M32E, T16E, M16S, M32S, T16S.]

This observation has important ramifications for the development and deployment of keyword spotting systems. It indicates that keyword spotting systems trained on relatively small databases are able to achieve performance well within an order of magnitude of systems trained using significantly larger databases. Depending on the target application, the resulting loss in performance may be an acceptable trade-off against the time and monetary costs of obtaining larger databases.

Another interesting result is the difference in EER gains observed for the English triphone systems over the English monophone systems compared with those observed for the equivalent Spanish systems. In all cases, the English triphone systems markedly outperformed the monophone systems, whereas for Spanish, the triphone systems yielded considerably higher EERs than the monophone systems. Further analysis of the data revealed that for the S1S and S2S evaluations, the M32 systems outperformed the T16 systems at all operating points, as shown in figure 6.8.

One possible explanation for this disparity in performance gains is the decision tree clustering process used during triphone training. The question set used for the English decision tree clustering was well established and well tested, whereas the Spanish question set was relatively new, having been constructed for this particular set of experiments. Although much care was taken in building the Spanish question set and in removing any errors, it is possible that the phonetic questions asked, though relevant and applicable to English, were not suitable for Spanish decision tree clustering.

In summary, the experiments demonstrate that although some gains in performance were achieved using larger training databases, the magnitude of these gains was not dramatic and may not justify the costs of obtaining such databases. For smaller databases, the M32 architecture resulted in more robust performance for Spanish keyword spotting, though this may be due to issues with the triphone training procedure for Spanish.

[Figure 6.8: DET plot for S2S experiments. 1=T16S2S, 2=M16S2S, 3=M32S2S. Axes: false alarm probability (%) against miss probability (%).]


6.7 Indonesian spotting and verification

Given the results and trends observed for English and Spanish, experiments were

performed using the small amount of available Indonesian data to obtain baseline

keyword spotting performance. Table 6.6 and figure 6.9 show the stage 1 and

stage 2 results of these experiments.

Model     Stage 1 miss rate (%)   Stage 1 FA/kw   Post-verifier EER (%)
M16S1I    3.4                     271.308         22.0
M32S1I    3.0                     302.979         21.0
T16S1I    3.4                     272.412         22.0

Table 6.6: Stage 1 spotting and stage 2 post-verification results for the S1I experiments

[Figure 6.9: DET plot for S1I experiments. 1=T16S1I, 2=M16S1I, 3=M32S1I. Axes: false alarm probability (%) against miss probability (%).]

Stage 1 results were not as diverse as those observed for English and Spanish - all models yielded similar miss rates and comparable FA/kw rates. In contrast, the trends in post-verification EER were similar to those observed for Spanish, with the M32 architecture yielding the best EER and in fact the best performance at most other operating points. Ultimately though, as demonstrated by figure 6.9, the post-verification performance of all model types was very close, being within 1% absolute in most cases.

6.8 Extrapolating Indonesian performance

The original goal of this research was to examine how training dataset size affected keyword spotting performance and, more importantly, whether the gains obtained were sufficiently large to justify the collection of more data. The experiments performed on English and Spanish data provide some insight into how performance varies with changes in dataset size.

Given these observations, it is possible to perform a degree of extrapolation to predict keyword spotting performance for other languages, and in particular the anticipated gains from increasing the amount of training data. It must be noted that these extrapolations carry only a low degree of confidence, since they are based on only a few data points. Nevertheless, the predictions still give a general indication of expected performance.

Of particular interest is the keyword spotting performance of the previously

evaluated M16S1I, M32S1I and T16S1I Indonesian systems. Given the consistent

1-2% EER gain observed when increasing from S1 to S2 sized training data sets

for the English and Spanish experiments, it is reasonable to postulate that similar

gains in EER would be observed in Indonesian. As such, one would expect EER

rates in the vicinity of 19-20%.

Further extrapolations regarding even larger training databases can be made given observations from the English S3E experiments. However, these extrapolations may be problematic, since conflicting results were observed across the three evaluated systems - a decrease in miss rate for the T16S3E system compared to an increase in miss rate for the M16S3E and M32S3E systems. Realistically though, when developing an Indonesian keyword spotting system with an S3 sized database, it is likely that a triphone architecture would be used, since monophone systems would be too simplistic to model such a large amount of data.

Since gains in EER remained consistent across English and Spanish for the S1 to S2 experiments, it is not unreasonable to assume that similar consistency would be maintained going from an S1 sized set to an S3 sized set. As such, one would expect gains in EER in the vicinity of 5-7% for an Indonesian S3 triphone system, given the 6.1% gain observed for the T16S3E system over the T16S1E system. This gives a likely absolute Indonesian EER of approximately 15-17% for an S3 sized database.

[Figure 6.10: Extrapolations of Indonesian keyword spotting performance using larger sized databases. Axis: equal error rate (%).]


6.9 Summary and Conclusions

The reported experiments demonstrate a number of interesting results regarding the effects of training database size. Most importantly, they indicate that the sensitivity of keyword spotting performance to training database size is significantly less than that of speech transcription. Specifically, it was found that decreasing the amount of English training data from 160 hours to 4 hours resulted in a loss of only 6.1% in equal error rate. This is significantly less than the approximately 18% loss in word error rate reported by Moore [24] for speech transcription.

It was also found that monophone-based keyword spotting yielded better miss rate performance than triphone modeling for limited training database sizes. This is most likely because the triphone systems suffered from data sparsity issues in the limited data cases. However, the monophone systems did have significantly higher FA/kw rates, which translate to a greater number of actual false alarms in the final system output. Given this, a triphone system may still be more appropriate for a limited training data system, even though its miss rate is slightly poorer.

Low-confidence extrapolations were also made regarding expected equal error

rate gains for an Indonesian keyword spotting system trained on a large database.

A system trained on 2.8 hours of training data yielded an EER of 21.0% using a

32-mixture monophone model set. Trends seen in English and Spanish implied

an EER gain for Indonesian of 1-2% using a 9.6 hour database and a gain of 5-7%

using a significantly larger training database.

Overall, the research demonstrates that keyword spotting using limited size training databases is feasible. Such systems are capable of achieving keyword spotting performance within an order of magnitude of systems trained on significantly larger databases, while requiring far fewer resources. This has ramifications for the immediate development and deployment of speech-enabled systems for low-resourced non-English languages.


Chapter 7

Summary, Conclusions and Future Work

This chapter provides a summary of the work presented in this thesis, together with the primary conclusions arising from it and a discussion of possible future research directions.

7.1 HMM-based Spotting and Verification

Chapter 3 presented a comprehensive study of HMM-based spotting and verification techniques. In particular, the methods were considered in terms of their suitability for real-time monitoring applications.

7.1.1 Conclusions

• Of the evaluated methods, the SBM-based approach was found to be the most appropriate for real-time monitoring applications. This was because it obtained excellent miss rates as well as fast execution speeds. Although this method was also hindered by very high false alarm rates, it was argued and subsequently demonstrated that a well-performing keyword verification stage would be capable of culling a significant portion of these false alarms.

• An analysis of the effect of target word length demonstrated that keyword spotting and verification performance was noticeably poorer for shorter keywords. This highlighted the need for techniques that specifically addressed the issue of short-word spotting and verification.

• The tuning of HMM-based spotting using techniques such as output score thresholding and target word insertion penalties was demonstrated to be inappropriate. This was because any attempts to significantly improve either miss rate or false alarm rate resulted in considerable losses in the complementary performance metric.

• A neural network based decision boundary estimate was proposed as an alternative to the traditional log-likelihood ratio. It was found that such an approach yielded considerable gains in performance for SBM-based keyword verification, particularly for short keywords. This suggested that a similar approach could be used to improve the robustness of many other log-likelihood ratio based confidence score measures.

7.1.2 Future Work

The reported experiments highlighted the need for well-performing short-word keyword spotting and verification techniques. Poor performance is typically encountered for short words because of the reduced number of observations available for scoring. As such, techniques that examine additional sources of information from the observation sequence, such as linguistic or orthogonal feature set information, may yield improved performance. Additionally, more appropriate decision boundary modeling techniques, such as the proposed neural network approach, may provide an avenue for further improvements.

7.2 Cohort Word Verification

A novel technique of keyword verification was presented in chapter 4. This

method combined high level linguistic information with cohort-based verification

techniques to yield significant improvements in verification performance, particularly for the problematic class of short-duration target words. Additionally, the

fusion of multiple keyword verifiers was investigated and found to provide further

gains in performance.

7.2.1 Conclusions

• The reported evaluations compared the performance of cohort word verification and SBM-based verification for the conversational telephone speech

domain. It was found that cohort word verification provided excellent gains

for short to medium length target words but was markedly poorer for long

words. Further analysis demonstrated that this poor performance was a

result of reduced cohort word set sizes for long words.

• It was found that the fusion of cohort word verification and SBM verification

provided some excellent gains in performance over the unfused systems. In

particular, this architecture was well suited to medium-length target word verification.

• The fusion of multiple cohort word verifiers was also examined and demonstrated to provide considerable gains for short word verification. This was a pleasing result considering the problematic nature of short word verification.


• Multiple formulations of the cohort word confidence score were presented

and investigated. Of these, it was found that the N-class hybrid approach

provided the best compromise between error rate and execution speed.

• The large number of cohort word parameters was rationalised through a detailed analysis of their effects. It was found that the main parameters of importance were dmax and the amount of cohort word set downsampling. Other parameters provided only minor changes in performance.

7.2.2 Future Work

• It was demonstrated in chapter 3 that considerable gains in performance

could be obtained by using a more robust decision boundary estimate, such

as a neural network classifier. Future work could examine the application of

discriminative decision boundary estimates to the cohort word confidence

score as a means of further improving performance.

• Execution speeds for cohort word verification were not reported in this

thesis. However, this is an important metric that needs to be considered

when applying this method to speed-critical tasks such as audio document

indexing. Speed improvements may be obtained in a variety of ways, for

example, through the use of aggressive cohort word set size downsampling

or tighter decoding pruning beamwidths. A study of the execution speed of

cohort word verification, and the investigation of techniques for improving it, is a possible avenue for future research.

7.3 Dynamic Match Lattice Spotting

Chapter 5 presented a novel technique for fast and accurate unrestricted vocabulary audio document indexing. This method was evaluated on conversational telephone speech and found to provide significant improvements over conventional lattice-based and HMM-based techniques.

7.3.1 Conclusions

• The chapter presented a novel unrestricted vocabulary audio document indexing method named Dynamic Match Lattice Spotting. Through experimentation, it was demonstrated that this method was capable of searching hours of data using only seconds of processing time, while maintaining excellent detection performance. The proposed method provided significant improvements in detection rate and execution speed over the baseline conventional lattice-based and SBM-based systems.

• The lack of robustness to erroneous lattice realisations was identified as a

weakness in conventional lattice-based techniques. Experiments reported in that chapter highlighted that significant gains in miss rate performance

could be obtained by incorporating robustness to such errors within the

lattice search process.

• Two methods of improving the speed of DMLS were investigated and implemented. It was found that these provided considerable gains in search

speed without affecting miss or false alarm rates. As a result, a DMLS

system was constructed that could search at speeds of 33 hours per minute

with good miss and false alarm rates.

• Individual dynamic match rules were evaluated within the context of the proposed technique. It was found that even the simplest of rules provided some tangible gains in performance over the conventional lattice-based technique. In particular, the vowel-substitution and closure/stop substitution rules provided dramatic gains, though at the expense of increased false alarm rates.

• An analysis of the parameters of DMLS demonstrated that the technique

could be easily tuned to obtain low miss rates or low false alarm rates while maintaining its fast execution speed.

7.3.2 Future Work

• The dynamic match rules that were proposed and evaluated were derived

empirically. Although these rules provided good performance, they were

unlikely to be optimal. Future work could examine the use of probabilistic

rules, for example, derived from the phone recogniser confusion matrix.

• The MED score used in DMLS was a discrete variable, and as such thresholding on this value resulted in a discontinuous tuning curve. Smoother tuning would be possible if a continuous probabilistic output score could be derived. One possible solution is to use a combination of the MED score and the acoustic score of the putative occurrence as estimated from the lattice. Fusion of these values may result in an output score that is more useful for continuous tuning.

7.4 Non-English Spotting

Chapter 6 examined the application of keyword spotting to non-English languages

and assessed the impact of limited training data on system performance.

7.4.1 Conclusions

• It was found that the sensitivity of keyword spotting performance to training database size was considerably less than that previously reported for speech transcription. This finding supports the argument for the development of speech applications that use keyword spotting instead of speech transcription to satisfy the immediate need for non-English speech-enabled applications.

• Analysis of the experimental results demonstrated that keyword spotters

trained on limited amounts of training data could achieve performances

well within an order of magnitude of systems trained on very large database

sizes.

• It was demonstrated that monophone-based systems were more effective than triphone-based systems in terms of miss rate when using limited amounts of training data. This was because the triphone models suffered from data sparsity issues for small training databases.

7.5 Final Comments

A number of novel contributions to the field of keyword spotting have been generated by this research. A considerable amount of this research has been used in the development of data mining applications that are being actively trialled by external bodies. It is believed that this demonstrates that the work is not only theoretically sound but also practically viable.


Bibliography

[1] J. Alvarez-Cercadillo, J. Ortega-Garcia, and L. A. Hernandez-Gomez, “Context modeling using RNN for keyword detection,” in Proceedings of the 1993

IEEE International Conference on Acoustics, Speech, and Signal Processing

(ICASSP), 1993.

[2] I. Bazzi and J. R. Glass, “Modeling out-of-vocabulary words for robust

speech recognition,” in Proceedings of the 2000 International Conference

on Spoken Language Processing (ICSLP), 2000.

[3] S. Bengio, “Learning the decision function for speaker verification,” in Proceedings of the 2001 IEEE International Conference on Acoustics, Speech,

and Signal Processing (ICASSP), 2001.

[4] G. Bernadis and H. Bourlard, “Improving posterior-based confidence measures in hybrid HMM/ANN speech recognition systems,” in Proceedings of

the 1998 International Conference on Spoken Language Processing (ICSLP),

1998.

[5] H. Bourlard, B. D’hoore, and J. M. Boite, “Optimizing recognition and

rejection performance in wordspotting systems,” in Proceedings of the 1994

IEEE International Conference on Acoustics, Speech, and Signal Processing

(ICASSP), 1994.


[6] J. S. Bridle, “An efficient elastic-template method for detecting given words

in running speech,” British Acoustic Society Meeting, pp. 1–4, 1973.

[7] A. Canavan and G. Zipperlen, “CALLHOME Spanish Speech.”

http://www.ldc.upenn.edu, 2005.

[8] B. Chigier, “Rejection and keyword spotting algorithms for a directory

assistance city name recognition application,” in Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. 93–96, 1992.

[9] R. Cole and Y. Muthusamy, “OGI Multilanguage Corpus.”

http://www.ldc.upenn.edu, 2005.

[10] S. Dharanipragada and S. Roukos, “A multistage algorithm for spotting

new words in speech,” IEEE Transactions on Speech and Audio Processing,

vol. 10, no. 8, pp. 542–550, 2002.

[11] Google, “The Google Internet Search Engine.” http://www.google.com.

[12] Q. Gou, Y. H. Yan, Z. W. Lin, B. S. Yuan, Q. W. Zhao, and J. Liu, “Keyword

spotting in auto-attendant system,” in Proceedings of the 2000 International

Conference on Spoken Language Processing (ICSLP), 2000.

[13] A. L. Higgins and R. E. Wohlford, “Keyword recognition using template concatenation,” in Proceedings of the 1985 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1985.

[14] D. A. James and S. J. Young, “A fast lattice-based approach to vocabulary

independent wordspotting,” in Proceedings of the 1994 IEEE International

Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1,

(Adelaide, Australia), pp. 377–380, 1994.


[15] P. Jeanrenaud, K. Ng, M. Siu, J. R. Rohlicek, and H. Gish, “Phonetic based

word spotter: various configurations and applications to event spotting,” in

Proceedings of the 1993 European Conference on Speech Communication and

Technology (EUROSPEECH), 1993.

[16] P. Jeanrenaud, M. H. Siu, J. R. Rohlicek, M. Meteer, and G. Gish, “Spotting

events in continuous speech,” in Proceedings of the 1994 IEEE International

Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1,

pp. 381–384, 1994.

[17] Z. Jianlai, L. Jian, S. Yantao, and Y. Tiecheng, “Keyword spotting based

on recurrent neural network,” in Proceedings of the 1998 IEEE International

Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1998.

[18] A. Kenji, T. Kazushige, O. Kazunari, O. Sumio, and F. Hiroya, “A new

method for dialogue management in an intelligent system for information

retrieval,” in Proceedings of the 2000 International Conference on Spoken

Language Processing (ICSLP), 2000.

[19] P. Kingsbury, S. Strassel, C. McLemore, and R. MacIntyre, “CALLHOME

American English Lexicon (PRONLEX).” http://www.ldc.upenn.edu, 2005.

[20] L. K. Leung and P. Fung, “A more efficient and optimal LLR for decoding

and verification,” in Proceedings of the 1999 IEEE International Conference

on Acoustics, Speech, and Signal Processing (ICASSP), 1999.

[21] R. P. Lippmann and E. Singer, “Hybrid neural-network/HMM approaches

to wordspotting,” in Proceedings of the 1993 IEEE International Conference

on Acoustics, Speech, and Signal Processing (ICASSP), 1993.


[22] J. Liu and X. Zhu, “Utterance verification based on dynamic garbage evaluation approach,” in Proceedings of the 2000 IEEE International Conference

on Acoustics, Speech, and Signal Processing (ICASSP), 2000.

[23] M. Mohri, F. Pereira, and M. Riley, “Weighted finite-state transducers in

speech recognition,” Computer Speech and Language, vol. 16, no. 1, pp. 69–88, 2002.

[24] R. Moore, “A comparison of the data requirements of automatic speech

recognition systems and human listeners,” in Proceedings of the 2003

European Conference on Speech Communication and Technology (EUROSPEECH), 2003.

[25] J. Ou, K. Chen, X. Wang, and Z. Li, “Utterance verification of short keywords using hybrid neural-network/HMM approach,” in Proceedings of the 2001 IEEE International Conference on Info-tech and Info-net (ICII), (Beijing, China), 2001.

[26] D. Reynolds et al., “The SuperSID Project: Exploiting High-level Information for High-accuracy Speaker Recognition,” in Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003.

[27] J. R. Rohlicek, P. Jeanrenaud, K. Ng, H. Gish, B. Musicus, and M. Siu,

“Phonetic training and language modeling for word spotting,” in Proceedings

of the 1993 IEEE International Conference on Acoustics, Speech, and Signal

Processing (ICASSP), vol. 2, pp. 459–462, 1993.

[28] R. C. Rose and D. B. Paul, “A Hidden Markov Model based keyword recognition system,” in Proceedings of the 1990 IEEE International Conference

on Acoustics, Speech, and Signal Processing (ICASSP), pp. 129–132, 1990.


[29] H. Sakoe and S. Chiba, “A dynamic programming approach to continuous

speech recognition,” in Proceedings of Seventh International Congress on

Acoustics, 1971.

[30] A. Sethy and S. Narayanan, “A syllable based approach for improved recognition of spoken names,” in Pronunciation Modeling and Lexicon Adaptation

for Spoken Language Technology, 2002.

[31] M. Silaghi and H. Bourlard, “A new keyword spotting approach based on

iterative dynamic programming,” in Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP),

2000.

[32] R. A. Sukkar and C. H. Lee, “Vocabulary independent discriminative utterance verification for nonkeyword rejection in subword based speech recognition,” in Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1996.

[33] E. Trentin, Y. Bengio, C. Furlanello, and R. D. Mori, Spoken Dialogues

with Computers, ch. Neural Networks for Speech Recognition, pp. 343–347.

Academic Press, 1998.

[34] B. Walker, B. C. Lackey, J. S. Muller, and P. J. Schone, “Language-

Reconfigurable Universal Phone Recognition,” in Proceedings of the 2003

European Conference on Speech Communication and Technology (EUROSPEECH), 2003.

[35] J. G. Wilpon, L. R. Rabiner, C. H. Lee, and E. R. Goldman, “Automatic

recognition of keywords in unconstrained speech using Hidden Markov Models,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38,

pp. 1870–1878, 1990.


[36] C. H. Wu, Y. J. Chen, and G. L. Yan, “Integration of phonetic and prosodic

information for robust utterance verification,” Vision, Image and Signal Processing, vol. 147, pp. 55–61, 2000.

[37] L. Xin and B. Wang, “Utterance verification for spontaneous Mandarin

speech keyword spotting,” in Proceedings of the 2001 IEEE International

Conference on Info-tech and Info-net (ICII), 2001.

[38] C. Yining, L. Jing, Z. Lin, L. Jia, and L. Runsheng, “Keyword spotting

based on mixed grammar model,” in Proceedings of Intelligent Multimedia,

Video and Speech Processing 2001, 2001.

[39] S. J. Young and M. G. Brown, “Acoustic indexing for multimedia retrieval

and browsing,” in Proceedings of the 1997 IEEE International Conference

on Acoustics, Speech, and Signal Processing (ICASSP), 1997.

[40] T. Zeppenfeld, “A hybrid neural network, dynamic programming word spotter,” in Proceedings of the 1992 IEEE International Conference on Acoustics,

Speech, and Signal Processing (ICASSP), 1992.

[41] V. Zue, J. Glass, M. Phillips, and S. Seneff, “The MIT SUMMIT Speech

Recognition System: A Progress Report,” in Proceedings of the First

DARPA Speech and Natural Language Workshop, pp. 178–189, 1989.


Appendix A

The Levenstein Distance

A.1 Introduction

The Levenstein distance measures the minimum cost of transforming one string into another. The transformation is performed by successive applications of one of four operations: match, substitution, insertion and deletion. Typically, each operation has an associated cost, and the Levenstein algorithm must therefore implicitly discover which sequence of operations results in the cheapest total transformation cost.

A.2 Applications

Applications of the Levenstein Distance, also known as the Minimum Edit Distance (MED), span a plethora of fields. In biology, the algorithm is used to identify similar sequences of nucleic acids in DNA or amino acids in proteins. Web search engines have used the method to detect similarity between phrases and query terms. Less obvious is the use of the edit distance to discover similarities between documents for the purpose of detecting plagiarism.


In speech research, the Levenstein distance is particularly useful in the analysis of phonetic and word sequences. For example, the word error rate of a speech transcription system can be calculated using this method. The phonetic similarity between two pronunciations can also be measured using the Levenstein distance, for example for the purpose of finding similarly pronounced words.

A.3 Algorithm

A basic implementation of the Levenstein algorithm uses a cost matrix to accumulate transformation costs. A recursive process is used to update successive

elements of this matrix in order to discover the overall minimum transformation

cost.

Let the sequence $P = (p_1, p_2, \ldots, p_M)$ be defined as the source sequence and the sequence $Q = (q_1, q_2, \ldots, q_N)$ be defined as the target sequence. Additionally, three transformation cost functions are defined:

• $C_s(x, y)$ - the cost of transforming symbol $x$ in $P$ into symbol $y$ in $Q$. Typically this has a cost of 0 if $x = y$, i.e. a match operation.

• $C_i(y)$ - the cost of inserting the symbol $y$ into sequence $P$.

• $C_d(x)$ - the cost of deleting the symbol $x$ from sequence $P$.

The element at row $i$ and column $j$ of the cost matrix represents the minimum cost of transforming the subsequence $(p_k)_1^i$ to $(q_k)_1^j$. Hence the bottom-right element of the cost matrix represents the total minimum cost of transforming the entire source sequence $P$ to the target sequence $Q$.

The basic premise of the Levenstein algorithm is that the minimum cost of transforming the sequence $(p_k)_1^i$ to $(q_k)_1^j$ is either:

1. the cost of transforming $(p_k)_1^i$ to $(q_k)_1^{j-1}$ plus the cost of inserting $q_j$;

2. the cost of transforming $(p_k)_1^{i-1}$ to $(q_k)_1^j$ plus the cost of deleting $p_i$; or

3. the cost of transforming $(p_k)_1^{i-1}$ to $(q_k)_1^{j-1}$ plus the cost of substituting $p_i$ with $q_j$. If $p_i = q_j$ then this is usually taken to have a cost of 0.

In this way, the cost matrix can be filled from the top-left corner to the bottom-right corner in an iterative fashion.

The Levenstein algorithm is then as follows:

1. Initialise an $(M + 1) \times (N + 1)$ matrix $\Omega$. This is called the Levenstein cost matrix.

2. The top-left element $\Omega_{0,0}$ represents the cost of transforming the empty sequence to the empty sequence. It is therefore initialised to 0.

3. The first row of the cost matrix represents a sequence of successive insertions. Hence it can be initialised to be

   $$\Omega_{0,j} = j \cdot C_i(q_j) \qquad \text{(A.1)}$$

4. The first column of the cost matrix represents successive deletions. It can therefore also be immediately initialised to be

   $$\Omega_{i,0} = i \cdot C_d(p_i) \qquad \text{(A.2)}$$

5. Update the elements of the cost matrix from the top-left down to the bottom-right using the Levenstein update equation

   $$\Omega_{i,j} = \min \left\{ \begin{array}{l} \Omega_{i,j-1} + C_i(q_j), \\ \Omega_{i-1,j} + C_d(p_i), \\ \Omega_{i-1,j-1} + C_s(p_i, q_j) \end{array} \right\} \qquad \text{(A.3)}$$
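The following is a minimal Python sketch of the algorithm above, using unit insertion, deletion and substitution costs and a zero-cost match (the same constant costs as in the example of figure A.1); it is an illustration only, not the implementation used elsewhere in this thesis.

def levenstein_distance(source, target,
                        sub_cost=1, ins_cost=1, del_cost=1):
    """Minimum edit (Levenstein) distance between two symbol sequences.

    Builds the (M+1) x (N+1) cost matrix described in section A.3:
    row 0 and column 0 hold the costs of pure insertion and deletion,
    and each remaining cell is the cheapest of an insertion, a deletion
    or a substitution/match ending at that cell.
    """
    M, N = len(source), len(target)
    # Steps 1-2: initialise the cost matrix with Omega[0][0] = 0.
    omega = [[0] * (N + 1) for _ in range(M + 1)]
    # Step 3: first row = successive insertions.
    for j in range(1, N + 1):
        omega[0][j] = j * ins_cost
    # Step 4: first column = successive deletions.
    for i in range(1, M + 1):
        omega[i][0] = i * del_cost
    # Step 5: fill the matrix using the Levenstein update equation (A.3).
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            match_or_sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            omega[i][j] = min(omega[i][j - 1] + ins_cost,
                              omega[i - 1][j] + del_cost,
                              omega[i - 1][j - 1] + match_or_sub)
    return omega[M][N]

# Reproduces the worked example of figure A.1.
assert levenstein_distance("deranged", "hanged") == 3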


Figure A.1 shows an example of the cost matrix obtained using the MED method for transforming the word deranged into the word hanged using constant transformation cost functions. It shows that the cheapest transformation cost is 3. There are multiple ways of obtaining this minimum cost. For example, both the operation sequences (del, del, subst, match, match, match, match, match) and (subst, del, del, match, match, match, match, match) have a cost of 3.

      /   h   a   n   g   e   d
  /   0   1   2   3   4   5   6
  d   1   1   2   3   4   5   5
  e   2   2   2   3   4   4   5
  r   3   3   3   3   4   5   5
  a   4   4   3   4   4   5   6
  n   5   5   4   3   4   5   6
  g   6   6   5   4   3   4   5
  e   7   7   6   5   4   3   4
  d   8   8   7   6   5   4   3

Figure A.1: Example of the cost matrix calculated using the Levenstein algorithm for transforming deranged to hanged. The costs of substitutions, deletions and insertions are all fixed at 1; the cost of a match is fixed at 0.