speaker id smorgasbord or how i spent my summer at icsi
DESCRIPTION
Speaker ID Smorgasbord or How I spent My Summer at ICSI. Kofi A. Boakye International Computer Science Institute. Outline. Keyword System Enhancements Monophone System Hybrid HMM/SVM Score Combinations Possible Directions. Keyword System: A Review. Motivation - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/1.jpg)
9/20/2004 Speech Group Lunch Talk
Speaker ID Smorgasbordor
How I spent My Summer at ICSI
Kofi A. Boakye
International Computer Science Institute
![Page 2: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/2.jpg)
9/20/2004 Speech Group Lunch Talk
Outline
• Keyword System
• Enhancements
• Monophone System
• Hybrid HMM/SVM
• Score Combinations
• Possible Directions
![Page 3: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/3.jpg)
9/20/2004 Speech Group Lunch Talk
Keyword System: A ReviewMotivation
I. Text-dependent systems have high performance, but limited flexibility when compared to text-independent systems
Capitalize on advantages of text-dependent systems in this text-independent domain by limiting words of interest to a select group:
Backchannels (yeah, uhhuh) , filled pauses (um, uh), discourse markers (like, well, now…)
=> high frequency and high speaker-characteristic quality
II. GMMs assume frames are independent and fail to take advantage of sequential information
=> Use HMMs instead to model the evolution of speech in time
![Page 4: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/4.jpg)
9/20/2004 Speech Group Lunch Talk
Keyword System: A ReviewApproach
Model each speaker using a collection of keyword HMMs
Speaker models generated via adaptation of background models trained from a development data set
Use standard likelihood ratio approach:
Compute log likelihood ratio scores using accumulated log probabilities from keyword HMMs
Use a speech recognizer to:
1) Locate words in the speech stream
2) Align speech frames to the HMM
3) Generate acoustic likelihood scores
Word Extractor
HMM-UBMN
HMM-UBM2
HMM-UBM1
signal
Com
bin
atio
n
![Page 5: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/5.jpg)
9/20/2004 Speech Group Lunch Talk
Keyword System: A ReviewKeywordsDiscourse markers: {actually, anyway, like, see, well, now, you_know, you_see, i_think, i_mean}
Filled pauses: {um, uh}
Backchannels: {yeah, yep, okay, uhhuh, right, i_see, i_know }
Keyword ModelsSimple left-to-right (whole word) HMMs with self-loops and no skips
4 Gaussian components per state
Number of states related to number of phones and median number of frames for word
HMMs trained and scored using HTK
Acoustic features: 19 mel-cepstra, zeroth cepstrum, and their first differences
![Page 6: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/6.jpg)
9/20/2004 Speech Group Lunch Talk
System Performance
Switchboard 1 Dev Set
Data partitioned into 6 splits
Tests use jack-knifing procedure:
Test on splits 1 - 3 using background model trained on splits 4 – 6 (and vice versa)
For development, tested primarily on split 1 with 8-side training
Result:EER = 0.83%
![Page 7: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/7.jpg)
9/20/2004 Speech Group Lunch Talk
System Performance
0
10
20
30
40
50
60
EER(%)
EER by Word
0
5000
10000
15000
20000
25000
30000
35000
Word Frequencies
Observations:
Well-performing bigrams have comparable EERs
Poorly-performing bigrams suffer from a paucity of data
•Suggests possibility of frequency threshold for performance
Single word ‘yeah’ yields EER of 4.62%
![Page 8: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/8.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: WordsExamine the performance of other words
Sturim et al. propose word sets for text-constrained GMM system
1) Full set: 50 words that occur in > 70% of conversation sides
![Page 9: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/9.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: WordsExamine the performance of other words
Sturim et al. propose word sets for text-constrained GMM system
1) Full set: 50 words that occur in > 70% of conversation sides
{ and, I , that, yeah, you, just like, uh, to, think, the, have, so, know, in, but, they, really, it, well, is, not, because, my, that’s, on, its, about, do, for, was, don’t, one, get, all, with, oh, a, we, be, there, of, this, I’m, what, out, or, if, are, at }
![Page 10: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/10.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: WordsExamine the performance of other words
Sturim et al. propose word sets for text-constrained GMM system
1) Full set: 50 words that occur in > 70% of conversation sides
2) Min set: 11 words that yield the lowest word-specific EERs
![Page 11: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/11.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: WordsExamine the performance of other words
Sturim et al. propose word sets for text-constrained GMM system
1) Full set: 50 words that occur in > 70% of conversation sides
2) Min set: 11 words that yield the lowest word-specific EERs
{and, I, that, yeah, you, just, like, uh, to, think, the}
![Page 12: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/12.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: Words
Performance
Full set:
EER = 1.16%
My set Full set =
{ yeah, like, uh, well, I,
think, you }
![Page 13: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/13.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: Words
0
5
10
15
20
25
30
EER (% )
EER by Word
0
20000
40000
60000
80000
100000
120000
140000
Word Count
Observations:
Some poorly performing words occur quite frequently
•Such words may simply not be highly discriminative in nature
Single word ‘and’ yields EER of 2.48% !!
![Page 14: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/14.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: Words
Performance
Min set:
EER = 0.99%
My set Min set =
{yeah, like, uh, I, you, think}
![Page 15: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/15.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: Words
0
1
2
3
4
5
6
7
EER (% )
EER by Word
0
20000
40000
60000
80000
100000
120000
140000
Word Count
Observations:
Except for ‘and’, min set words have comparable performance
Most can fall into one of the three categories of filled pause, discourse marker, or backchannel, either in isolation or conjunction
![Page 16: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/16.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: HNormTarget model scores have different distributions for utterances based on handset type
Perform mean and variance normalization of scores based on estimated impostor score distribution
For split 1, use impostor utterances from splits 2 and 3
•75 females
•86 males
tgt1
tgt2
elec
elec
carb
carb
LR scores HNorm Scores
![Page 17: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/17.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: HNorm
Performance
EER = 1.65%
Performance worsened!
Possible issue in HNorm implementation?
![Page 18: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/18.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: HNorm
Examine effect of HNorm on particular speaker scores
Speakers of interest: Those generating the most errors
3 Speakers each generating 4 errors
![Page 19: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/19.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: HNorm
![Page 20: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/20.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: HNorm
![Page 21: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/21.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: HNorm
![Page 22: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/22.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: HNormConclusion: HNorm works…but doesn’t
One possibility: Look at computed devs…
Old NewSpeaker 1316 Impostor 0.588 1.333
Target 1.039 2.122Speaker 1437 Impostor 0.342 0.786
Target 0.381 0.766Speaker 1530 Impostor 0.554 1.375
Target 0.42 0.992
Distributions are widening in some cases
![Page 23: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/23.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: Deltas
Problem: System performance differs significantly by gender
Hypothesis: Higher deltas for females may be noisier
Solution: Use longer window for delta computation to smooth
![Page 24: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/24.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: Deltas
Extended window size from 2->3
Result:
EER = 0.83%
Performance nearly indistinguishable
![Page 25: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/25.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: Deltas
Extended window size from 2->3
Result:
Male and female disparity remains
![Page 26: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/26.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: Deltas
Extended window size from 3->5
Result:
EER = 1.32%
Performance worsens!
![Page 27: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/27.jpg)
9/20/2004 Speech Group Lunch Talk
Enhancements: Deltas
Extended window size from 3->5
Result:
Male female disparity widens
Further investigation necessary
![Page 28: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/28.jpg)
9/20/2004 Speech Group Lunch Talk
Monophone SystemMotivation
Keyword system, with its use of HMMs, appears to have good performance
However, we are only using a small amount (~10%) of the total data available
=>Get full coverage by using phone HMMs rather than word HMMs
System represents a trade-off between token coverage and “sharpness” of modeling
![Page 29: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/29.jpg)
9/20/2004 Speech Group Lunch Talk
Monophone SystemImplementation
System implemented similarly to keyword system, with phones replacing words
Background models differ in that:
1) All models have 3 states, with 128 Gaussians per state
2) Models trained by successive splitting and Baum-Welch re-estimation, starting with a single Gaussian
![Page 30: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/30.jpg)
9/20/2004 Speech Group Lunch Talk
Monophone System
Performance
EER = 1.16%
Similar performance to keyword system
Uses a lot more data!
![Page 31: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/31.jpg)
9/20/2004 Speech Group Lunch Talk
Hybrid HMM/SVM SystemMotivation
SVMs have been shown to yield good performance in speaker recognition systems
Features used:
• Frames
• Phone and word n-gram counts/frequencies
• Phone lattices
![Page 32: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/32.jpg)
9/20/2004 Speech Group Lunch Talk
Hybrid HMM/SVM SystemMotivation
Keyword system looks at “distance” between target and background models as measured by log-probabilities
Look at distance between models more explicitly
=> Use model parameters as features
![Page 33: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/33.jpg)
9/20/2004 Speech Group Lunch Talk
Hybrid HMM/SVM SystemApproach
Use concatenated mixture means as features for SVM
Positive examples obtained by adapting background HMM to each of 8 training conversations
Negative examples obtained by adapting background HMM to each conversation in the background set
Keyword-level SVM outputs combined to give final score
-Presently simple linear combination with equal weighting is used (though clearly suboptimal)
![Page 34: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/34.jpg)
9/20/2004 Speech Group Lunch Talk
Hybrid HMM/SVM System
Performance
EER = 1.82%
Promising first start
![Page 35: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/35.jpg)
9/20/2004 Speech Group Lunch Talk
Score Combination
We have three independent systems, so let’s see how they combine…
Perform post facto (read: cheating) linear combination
System EER (%)Baseline 0.83Monophone 1.16SVM 1.82
System EER (%) WeightsBaseline + Monophone 0.66 0.5, 0.5Baseline + SVM 0.66 0.6, 0.4Monophone + SVM 0.66 0.6, 0.4
Each best combination yields same EER
=>Possibly approaching EER limit for data set
![Page 36: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/36.jpg)
9/20/2004 Speech Group Lunch Talk
Possible Directions• Develop on SWB2
• Create word “master list” for keyword system
• TNorm
• Modify features to address gender-specific performance disparity
• Score combination for hybrid system
• Modified hybrid system
• Tuning
• Plowing
![Page 37: Speaker ID Smorgasbord or How I spent My Summer at ICSI](https://reader036.vdocument.in/reader036/viewer/2022062423/56814d0b550346895dba4413/html5/thumbnails/37.jpg)
9/20/2004 Speech Group Lunch Talk
Fin