TRANSCRIPT
DESIGN OF A KEYWORD SPOTTING SYSTEM USING MODIFIED CROSS-CORRELATION IN THE TIME AND
MFCC DOMAIN
Presented by: Olakunle Anifowose
Thesis Advisor: Dr. Robert Yantorno
Committee Members: Dr. Joseph Picone, Dr. Dennis Silage
Research effort partially sponsored by the US Air Force Research Laboratories, Rome, NY
Speech Processing Laboratory, Temple University
Outline
Introduction to keyword spotting
Motivation for this work
Experimental Conditions
Common approach to Keyword Spotting
Method used: Time Domain
Method used: MFCC Domain
Conclusions
Future Work
Keyword Spotting
Identify a keyword in a spoken utterance or written document
Determine if the keyword is present in the utterance
Locate the keyword in the utterance
Possible operational results: Hits, False Alarms, Misses
Keyword Spotting
Speaker dependent / independent (speech recognition)
Speaker Dependent: single speaker; lacks flexibility and is not speaker adaptable; easier to develop
Speaker Independent: multi-speaker; flexible; harder to develop
Monitor conversations for flag words.
Automated response system.
Security.
Automatically search through speeches for certain words or phrases.
Voice command/dialing.
Applications
Motivation for this work
Typical Large Vocabulary Continuous Speech Recognizer (LVCSR) / Hidden Markov Model (HMM) based approaches require a garbage model
to train the system on non-keyword speech data.
The better the garbage model, the better the keyword spotting performance
Use of LVCSR techniques can introduce
Computational load, complexity.
Need for training data.
Research Objectives
Development of a simple keyword spotting system based on cross-correlation.
Maximize hits while keeping false alarms and misses low.
Speech Database Used
Call Home Database
Contains more than 40 telephone conversations between male and female speakers.
Conversations are 30 minutes long.
Switchboard Database
Two-sided conversations collected from various speakers in the United States.
Experimental Setup
Conversations are split into single channels.
Call Home database: 60 utterances ranging from 30 seconds to 2 minutes. Keywords of interest were college, university,
language, something, student, school, zero, relationship, necessarily, really, think, English, program, tomorrow, bizarre, conversation and circumstance.
Switchboard database: 30 utterances ranging from 30 seconds to 2 minutes. Keywords of interest were always, money and something.
Hidden Markov Model
Statistical model – hidden states / observable outputs
Emission probability – p(x|q1)
Transition probability – p(q2|q1)
Common Approaches
First order Markov Process – probability of next state depends only on current state.
Infer output given the underlying system.
HMM for Speech Recognition: each word is a sequence of unobservable states with certain emission probabilities (features) and transition probabilities (to the next state).
Estimate the model for each word in the training vocabulary.
For each test keyword, the model that maximizes the likelihood is selected as a match – Viterbi Algorithm.
KWS directly built using HMM based Large Vocabulary Continuous Speech Recognizer (LVCSR).
Hidden Markov Models
Limitations
Large amount of training data required.
Training data has to be transcribed in word level and/or phone level.
Transcribed data costs time and money.
Not available in all languages.
Various Keyword Spotting Systems
HMM: context-dependent state-of-the-art phoneme recognizer; keyword model; garbage model.
Evaluated on the Conversational Telephone Speech database.
Accuracy varies with the keyword: 52.6% with the keyword “because”; 94.5% with the keyword “zero”.
Ketabdar et al., 2006
Various Keyword Spotting Systems
Spoken Term Detection using Phonetic Posteriorgrams
Trained on acoustic phonetic models; compared using dynamic time warping.
Trained on the Switchboard Cellular corpus; tested on the Fisher English development test set from NIST.
Average precision for the top 10 hits was 63.3%.
Hazen et al., 2009
Various Keyword Spotting Systems
S-DTW
Evaluated on the Switchboard corpus; 75% accuracy for all keywords tested.
Jansen et al., 2010
Contributions
We have proposed a novel approach to keyword spotting in both the time and MFCC domains using cross-correlation.
The design of a global keyword for cross-correlation in the time domain.
Cross-correlation
Measure of similarity between two signals.
Two signals are compared by:
Sliding one signal by a certain time lag
Multiplying the overlapping regions and taking the sum
Repeating the process and adding the products until there is no more overlap
If both signals are exactly the same, there is a maximum peak at lag = 0, and the rest of the correlation signal tapers off to zero.
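The sliding-multiply-sum procedure above can be sketched in a few lines of Python (a minimal illustration using NumPy; the normalization by signal energies is added here so a perfect match peaks at 1):

```python
import numpy as np

def normalized_xcorr(keyword, segment):
    """Slide one signal over the other, multiply the overlapping samples
    and sum the products at each lag, then normalize by the two signal
    energies so a perfect match peaks at 1."""
    corr = np.correlate(segment, keyword, mode="full")
    norm = np.sqrt(np.dot(keyword, keyword) * np.dot(segment, segment))
    return corr / norm

# A signal correlated with itself has its maximum at zero lag
# (the centre of the 'full'-mode output).
sig = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 200))
corr = normalized_xcorr(sig, sig)
peak_lag = int(np.argmax(corr)) - (len(sig) - 1)
```

Correlating a signal with itself puts the global maximum at `peak_lag == 0`, which is exactly the detection cue the slides describe.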
Research Using Cross-correlation
The identification of cover songs
Search a musical database and determine songs that are similar but performed by different artists with different instruments.
Features of choice: chroma features, a representation for music in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones.
Cross-correlation is used to determine similarities between songs based on their chroma features.
Cases Considered
Time Domain
Initial approach
Modified approach
MFCC Domain
Time Domain Initial Approach
[Plots: keyword waveform and utterance waveform]
1. Let the length of the keyword or phrase be n. The cross-correlation (xcorr) of the keyword and the first n samples of the utterance is computed.
2. Observe the position of the peak to see if it is around the zero lag. Yes: keyword. No: not keyword.
3. Shift the observed portion by a small amount and repeat the process.
If a portion is reached where the peak is close to the zero lag, then that is where the keyword is. If not, the utterance does not contain the keyword.
The power around the “zero” lag is obtained and compared to the power in the rest of the correlation signal. This ratio is referred to as Zero lag to Rest Ratio (ZRR).
If the ZRR is greater than a certain threshold (2.5), then that segment of the utterance contains the keyword or phrase.
The test utterance is shifted and the process is repeated
If there is no segment with a ZRR greater than 2.5, the utterance does not contain the keyword
ZRR-Zero lag to Rest Ratio
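The ZRR computation described above can be sketched as follows (a minimal illustration; the window width around the zero lag and the use of average rather than total power are assumptions for this sketch, not values from the thesis):

```python
import numpy as np

def zero_lag_to_rest_ratio(corr, zero_lag, width=50):
    """Average power in a window around the zero lag, divided by the
    average power in the rest of the correlation signal.

    The window width is an assumed parameter for illustration.
    """
    power = corr ** 2
    lo, hi = max(0, zero_lag - width), min(len(power), zero_lag + width + 1)
    window = power[lo:hi]
    rest = power.sum() - window.sum()
    if rest == 0.0:
        return np.inf
    return window.mean() / (rest / (len(power) - len(window)))

# A signal correlated with itself concentrates power at zero lag,
# so the ratio clears the 2.5 detection threshold used in the thesis.
sig = np.hanning(400)
corr = np.correlate(sig, sig, mode="full")
zrr = zero_lag_to_rest_ratio(corr, zero_lag=len(sig) - 1)
detected = zrr > 2.5
```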
• Same Speaker
• Keyword part of the utterance
• Different Speaker
• Keyword from different speaker
Test Cases
Results (utterance: male 1, keyword: male 1)
[Plot: ratio vs. shift count]
Results (utterance: male 1, keyword: male 2)
[Plot: ratio vs. shift count]
Results (utterance: female 1, keyword: female 1)
[Plot: ratio vs. shift count]
Results (utterance: female 1, keyword: female 2)
[Plot: ratio vs. shift count]
Results: Speaker Dependent, Initial Time Domain Approach
Tested on 30 utterances, with single instances of the following keywords extracted from the same speaker: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow.
Hits: 86%
False Alarms: 14%
Misses: 14%
Results: Speaker Independent, Initial Time Domain Approach
Tested on 40 utterances, with multiple instances of the following keywords extracted from various speakers: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow.
Hits: 26%
False Alarms: 65%
Misses: 26%
Keyword: REALLY
* Speakers 13 and 14: same gender (female)
Challenge
Time Domain Modified
Block diagram: the utterance and a global keyword (from Quantized Dynamic Time Warping) each undergo pitch smoothing.
Cross-correlate both signals and compute the Zero lag to Rest Ratio (ZRR) on a frame-by-frame basis.
The highest ZRR marks the location of the keyword.
A perceptual measure of frequency level.
A change in pitch results in a change in the fundamental frequency of speech.
A difference in pitch between the keyword and the utterance increases detection errors.
Pitch
Pitch is a form of speaker information.
Limit the effects pitch has on a speech system.
Kawahara Algorithm
STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of Weighted Spectrum) algorithm to modify pitch.
It reduces periodic variation in time caused by excitation
Pitch Normalization
Utterance Pitch Normalization
STRAIGHT Algorithm: elimination of periodicity interference.
Temporal interference around peaks can be removed by constructing a new timing window based on a cardinal B-spline basis function that is adaptive to the fundamental period.
F0 extraction: natural speech is not purely periodic.
Speech resynthesis: the extracted F0 is then used to resynthesize speech.
Modeling a Global Keyword
Compute MFCC features for each keyword
Perform Quantized Dynamic Time Warping (DTW) on several keyword templates.
MFCC
Take the Fourier transform of (a windowed portion of) a signal.
Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
Take the log of the power at each of the mel frequencies.
Take the Discrete Cosine Transform (DCT) of the mel log powers, as if it were a signal.
The MFCCs are the amplitudes of the resulting spectrum.
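The five steps above can be sketched in Python for a single windowed frame (a minimal illustration; the FFT length, filterbank size and coefficient count are common defaults, not values stated in the thesis):

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=13):
    """Steps 1-5 for one frame: FFT power spectrum, triangular mel
    filterbank, log of the band powers, DCT, keep the first amplitudes."""
    n_fft = 512
    # 1. Fourier transform of the windowed frame -> power spectrum.
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # 2. Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 3. Log of the power in each mel band (small floor avoids log(0)).
    log_energy = np.log(fbank @ power + 1e-10)
    # 4-5. DCT of the log mel powers; the MFCCs are the first amplitudes.
    return dct(log_energy, type=2, norm="ortho")[:n_coeffs]

sr = 8000
t = np.arange(0, 0.025, 1 / sr)            # one 25 ms frame
coeffs = mfcc_frame(np.sin(2 * np.pi * 440 * t), sr)
```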
Dynamic Time Warping
Time stretching and contracting one signal so that it aligns with the other signal.
Time-series similarity measure. Reference and test keywords are arranged along two sides of the grid.
Template keyword – vertical axis, test keyword – horizontal.
Each block in the grid – distance between corresponding feature vectors.
Best match – path through the grid that minimizes cumulative distance.
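The grid computation above can be sketched as follows (a minimal illustration of classic DTW with Euclidean frame distances and the usual match/insert/delete step pattern):

```python
import numpy as np

def dtw_distance(ref, test):
    """Minimum cumulative path cost through the DTW grid.

    ref, test: feature sequences of shape (frames, coefficients).
    Each grid cell holds the Euclidean distance between a reference
    frame and a test frame.
    """
    n, m = len(ref), len(test)
    # Local distances between every pair of feature vectors.
    d = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Step pattern: diagonal match, insertion, or deletion.
            acc[i, j] = d[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                              acc[i - 1, j],
                                              acc[i, j - 1])
    return acc[n, m]

rng = np.random.default_rng(0)
a = rng.normal(size=(20, 13))
b = np.repeat(a, 2, axis=0)   # same content, stretched in time
```

A sequence matched against itself costs 0, and a time-stretched copy (each frame repeated) also aligns at zero cost, which is the point of the warping.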
Quantized Dynamic Time Warping
The MFCC features extracted from various instances of a keyword will be divided into 2 sets: A and B. Each reference template Ai will be paired with only one Bi.
For each pair Ai and Bi the optimal path will be computed (using the classic DTW algorithm).
The new vector Ci = (c1, c2, …, cNc) will be generated. Repeat the process considering the pair (Ci, Ci+1) as a new Ai and Bi pair.
The result is a single reference vector Cy. Invert the vector into a time-domain signal.
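One pairwise merge step can be sketched as follows (a sketch under assumptions: the merge rule here averages the two templates along their DTW alignment, which is one plausible reading of how each Ci is generated; the thesis does not spell out the rule):

```python
import numpy as np

def dtw_path(ref, test):
    """Backtrack the optimal DTW path through the classic cost grid."""
    n, m = len(ref), len(test)
    d = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = d[i-1, j-1] + min(acc[i-1, j-1], acc[i-1, j], acc[i, j-1])
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i-1, j-1], acc[i-1, j], acc[i, j-1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def merge_pair(a, b):
    """Average two templates frame-by-frame along their DTW alignment
    (assumed merge rule, for illustration only)."""
    return np.array([(a[i] + b[j]) / 2.0 for i, j in dtw_path(a, b)])

rng = np.random.default_rng(2)
a = rng.normal(size=(10, 13))
merged = merge_pair(a, a)   # merging identical templates returns the template
```

Repeating `merge_pair` over successive (Ci, Ci+1) pairs collapses the template set to a single reference, matching the procedure the slide describes.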
Sample Results
Results Using a Global Keyword and Pitch-Normalized Utterances and Keywords
Tested on 60 utterances using a global keyword, with 10 utterances associated with each keyword.
Keywords of interest: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow, computer, college, university, zero, student, school, language, program.
Hits: 41.2%
False Alarms: 37%
Misses: 42%
Results Analysis
Results differ from keyword to keyword. The best performing keyword was bizarre, with a hit rate of 60%.
The time domain is not suitable due to the uneven statistical behavior of the signals.
MFCC Domain
Steps for cross-correlating the keyword and utterance in the MFCC domain:
Step 1: Pitch-normalize utterances and keywords using the STRAIGHT algorithm.
Step 2: Estimate the length of the keyword (n) and compute its MFCC features.
Step 3: Compute the MFCC features of the first n samples of the utterance.
Step 4: Normalize the MFCC features of the utterance and keyword and cross-correlate them.
Step 5: Store a single value from the cross-correlation result in a matrix, shift along the utterance by a couple of samples, and repeat steps 3–5 until the end of the utterance.
Step 6: Identify the maximum value in the matrix as the location of the keyword.
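Steps 3–6 can be sketched as a sliding search (a minimal illustration: the toy feature function, hop size, and frame layout are assumptions standing in for the real MFCC front end):

```python
import numpy as np

def unit_norm(features):
    """Divide a feature matrix by the square root of the sum of its
    squares, so the zero-lag correlation behaves like cosine similarity."""
    return features / (np.linalg.norm(features) + 1e-12)

def sliding_match(utterance, keyword, feat, hop=160):
    """Slide an n-sample window along the utterance, cross-correlate
    normalized features, keep one value per position (steps 3-5), and
    return the position of the maximum (step 6).

    `feat` maps raw samples to a feature matrix; a toy extractor stands
    in for MFCCs here.
    """
    n = len(keyword)
    kw_feat = unit_norm(feat(keyword))
    scores = []
    for start in range(0, len(utterance) - n + 1, hop):
        seg_feat = unit_norm(feat(utterance[start:start + n]))
        # Zero-lag correlation of unit-norm features = cosine similarity.
        scores.append(float(np.sum(kw_feat * seg_feat)))
    return int(np.argmax(scores)) * hop, scores

rng = np.random.default_rng(1)
keyword = rng.normal(size=800)
utterance = np.concatenate([0.1 * rng.normal(size=1600), keyword,
                            0.1 * rng.normal(size=1600)])
toy_feat = lambda x: x.reshape(-1, 160)   # stand-in for real MFCC features
loc, scores = sliding_match(utterance, keyword, toy_feat)
```

The embedded keyword starts at sample 1600, and the maximum matching score lands exactly there, with a score of 1 for the perfect match.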
Normalizing MFCC Features
Divide the features by the square root of the sum of the squares of each vector.
Similar to dividing a vector by its norm to obtain a unit vector.
Reason: so the MFCC features range from zero to one.
Interpreting cross-correlation result of MFCC features
Similar to a cosine similarity measure. If two vectors are exactly the same, the angle between them is zero and its cosine is one.
The closer the cross-correlation result of two vectors is to one, the more likely they are to be a match.
Vectors that are dissimilar will have a wider angle, and their cross-correlation results will be much less than one.
Distance Between MFCC Features for Different Keywords
Keyword        Distance from "College"
College        1.1×10^-8
University     2.1×10^-4
Something      8.3×10^-3
Conversation   1.98×10^-5
School         1.1×10^-2
Zero           0.98×10^-3
Program        2.1×10^-6
Language       1.5×10^-4
Bizarre        7.8×10^-5
Circumstance   2.6×10^-4
Really         3.2×10^-2
Speaker Dependent
Tests were conducted on 30 utterances.
Keywords were extracted from the same speaker: college, university, student, school, bizarre.
The maximum matching score corresponds to the location of the keyword.
Speaker Independent Test
Test samples: an average of 5 utterances associated with each keyword, and an average of 5 versions of each keyword.
20–25 trials; 13 keywords.
More Results
The maximum matching score corresponds to the location of the keyword.
More Results
[Plot: matching score vs. segment count for the utterance]
The maximum matching score is the location of the keyword "University".
The second highest score is the word "Universities".
Results Using Cross-correlation in the MFCC Domain
Average considering results from every keyword: 20–25 trials per keyword.
13 keywords considered in the Call Home database; 3 keywords considered in the Switchboard database.
Hits: 66%
False Alarms: 12%
Misses: 23%
Keyword Dependence
Keyword Accuracy
College 0.83
Circumstance 0.77
Conversation 0.63
English 0.35
Computer 0.50
Always 0.85
School 0.63
College 0.63
Language 0.40
Student 0.62
Money 0.7
Program 0.62
Something 0.68
System Performance Threshold
[Plot: system accuracy (%) for Hits, False Alarms and Misses vs. threshold (0.89–0.97)]
Conclusions
Cross-correlation in the time domain is not very accurate for a speaker-independent system because of the behavior of the signals in the time domain.
There was improvement with the use of a global keyword and pitch normalization, but not enough to deem it a success.
Cross-correlation of MFCC features is a very viable alternative for keyword spotting.
Future Work
Experiments using more keywords.
Use a larger dataset to optimize system performance.
Test cross-correlation in other domains.
THANK YOU!
Any Questions?