An Overview of Robustness Related Issues in Speaker Recognition
a plenary overview talk at APSIPA ASC 2014
Thomas Fang Zheng
CSLT, RIIT, Tsinghua University
APSIPA ASC 2014, Dec 9-12, 2014, Siem Reap, Cambodia
1
Outline
• Introduction
• Environment-Related Issues
• Speaker-Related Issues
• Application-Oriented Issues
• Summary
• References
2
Introduction
• Automatic speaker recognition (identification and verification)
  Active research areas (cross-channel, noise, …)
  Wide applications (telephone banking, forensics, …)
• A lot of challenges in practical applications
• Three categories of robustness issues
  Environment-related issues
  Speaker-related issues
  Application-oriented issues
3
Three Issues
4
Environment-Related Issues – Noise robustness
• Factors:
  Recording / environmental noises
• Two research directions:
  Feature level:
    Spectral subtraction (Boll 1979) / RASTA filtering (Hermansky 1994)
    PCA (Kocsor 2000) / LDA (Lomax 2007) / HLDA (Saon 2000) in the feature domain
  Model level:
    Model compensation algorithms (Gales 1996)
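As an illustration of the feature-level direction, here is a minimal sketch of magnitude-domain spectral subtraction in the spirit of (Boll 1979). It assumes the per-frame magnitude spectra are already computed and that the first few frames contain noise only; the function name and the `noise_frames` and `floor` parameters are assumptions made for this sketch, not details from the talk.

```python
def spectral_subtraction(frames, noise_frames=2, floor=0.01):
    """Subtract an average noise magnitude spectrum from every frame.

    frames: list of magnitude spectra (lists of non-negative floats).
    noise_frames: number of leading frames assumed to be noise only (assumption).
    floor: fraction of the original magnitude kept as a lower bound (assumption).
    """
    if not 0 < noise_frames <= len(frames):
        raise ValueError("noise_frames must be between 1 and len(frames)")
    n_bins = len(frames[0])
    # Average the leading frames to estimate the noise magnitude per bin.
    noise = [sum(f[k] for f in frames[:noise_frames]) / noise_frames
             for k in range(n_bins)]
    cleaned = []
    for f in frames:
        # Subtract the noise estimate, never going below the spectral floor.
        cleaned.append([max(f[k] - noise[k], floor * f[k]) for k in range(n_bins)])
    return cleaned


# Two noise-only frames followed by one frame with energy in the first bin.
frames = [[1.0, 1.0], [1.0, 1.0], [5.0, 1.0]]
cleaned = spectral_subtraction(frames)
```

The floor keeps bins from becoming zero or negative after subtraction, which is one common way to limit artifacts in the cleaned spectrum.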
5
Environment-Related Issues – Channel mismatch
• Factors:
  Various types of microphones / transmission channels
• Three research directions:
  Feature transformation:
    CMS / CMN (Furui 1981); feature mapping (Reynolds 2003)
  Model compensation:
    SMS (Speaker Model Synthesis) (Teunen 2000); subspace projection
  Score normalization:
    Z-Norm, H-Norm, T-Norm, ...
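Two of the directions above are simple enough to sketch in a few lines: cepstral mean subtraction (CMS) on the feature side and Z-Norm on the score side. The function names and the toy data are assumptions made for this sketch; real systems work on MFCC frames and on scores from a trained model.

```python
from statistics import mean, pstdev


def cepstral_mean_subtraction(frames):
    """Remove the per-utterance mean of each cepstral coefficient.

    A stationary channel adds a constant offset in the cepstral domain,
    so subtracting the utterance mean removes that offset.
    """
    dim = len(frames[0])
    means = [mean(f[d] for f in frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def z_norm(raw_score, impostor_scores):
    """Normalize a score with the impostor score statistics of the target model."""
    mu = mean(impostor_scores)
    sigma = pstdev(impostor_scores)
    if sigma == 0:
        raise ValueError("impostor scores have zero variance")
    return (raw_score - mu) / sigma


normalized_frames = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 6.0]])
normalized_score = z_norm(3.0, [0.0, 2.0])
```

T-Norm applies the same formula, but the mean and standard deviation come from scoring the test utterance against a cohort of impostor models rather than from scoring impostor utterances against the target model.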
6
Environment-Related Issues – Channel mismatch
• State-of-the-art approaches
  JFA (Joint Factor Analysis) (Kenny 2007): a more comprehensive statistical approach, which models the speaker and channel variations as two independent random variables.
  i-vector (Dehak 2011): a single low-rank total variability space is defined to represent both speaker and channel variations at the same time.
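For reference, the two models are usually written as follows. The symbols are the ones commonly used in the literature, not taken from the slides; M denotes the speaker- and channel-dependent GMM mean supervector.

```latex
% JFA: separate speaker and channel subspaces
M = m + V y + U x + D z, \qquad x, y, z \sim \mathcal{N}(0, I)

% i-vector: a single total variability subspace
M = m + T w, \qquad w \sim \mathcal{N}(0, I)
```

Here m is the UBM mean supervector, V and U are the low-rank eigenvoice and eigenchannel matrices, D is a diagonal residual matrix, T is the low-rank total variability matrix, and w is the i-vector.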
7
Environment-Related Issues – Channel mismatch
• Inter-channel compensation methods
  Because of channel variations, the i-vector provides less discrimination among speakers, so many inter-channel compensation methods have been proposed to accentuate speaker information.
  NAP (Nuisance Attribute Projection) (Solomonoff 2004): finds an optimized projection.
  WCCN (Within-Class Covariance Normalization) (Hatch 2006): a linear transform.
  LDA (Linear Discriminant Analysis) (Dehak 2011) / PLDA (Probabilistic LDA) (Ioffe 2006): PLDA is a generative model and has achieved great success.
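The projection idea behind NAP can be sketched as applying P = I - U U^T to a vector, where the columns of U span the nuisance (channel) directions. The sketch below assumes those directions have already been estimated and are orthonormal; how they are estimated is not shown, and the function name and toy vectors are assumptions made for illustration.

```python
def nap_project(vector, nuisance_basis):
    """Remove the components of `vector` that lie along the nuisance directions.

    nuisance_basis: list of orthonormal direction vectors (assumed given).
    """
    projected = list(vector)
    for u in nuisance_basis:
        # Coefficient of the original vector along this nuisance direction.
        coeff = sum(ui * vi for ui, vi in zip(u, vector))
        projected = [p - coeff * ui for p, ui in zip(projected, u)]
    return projected


# Two toy vectors from the same speaker that differ only along the channel axis.
channel_axis = [[1.0, 0.0]]
session_a = nap_project([3.0, 4.0], channel_axis)
session_b = nap_project([1.0, 4.0], channel_axis)
```

After projection the two sessions coincide, which is the effect the compensation aims for: variation along the nuisance directions no longer separates utterances of the same speaker.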
8
Three Issues
9
Speaker-Related Issues
• Gender
• Physical conditions (cold or laryngitis)
• Speaking style (emotion /speaking rate /volume /idiom)
• Cross-Lingual (language mismatch)
• Ageing (voice changes with time/age)
10
Speaker-Related Issues – Gender
• Better scenario: training with gender dependent (GD) features and recognizing with
known gender information. In applications, gender info is often not available.
• Approaches: To design a gender independent system, and then
Pairwise discriminative training based on i-vector (Cumani 2012)
Source-normalization for variation to separate genders as a pre-processing step based
on a PLDA classifier (McLaren 2012)
Males and females are physiologically different, so their speech should be processed and analyzed differently (FFT size, frame shift (resolution), UBM, ...); the authors’ preliminary results show significant improvement when this is done.
11
Speaker-Related Issues – Physical conditions
• Speech is a behavioral signal.
• Variability of the speaker’s physical conditions
  Cold / nasal congestion / laryngitis, etc.
• “cold-affected” speech in speaker recognition (Tull 1996)
• Research in this direction is still rare, and speech databases are difficult to collect and organize.
• But research on it has practical importance.
12
Speaker-Related Issues – Speaking style (Emotion)
• Emotion: an intrinsic nature of human beings.
• Categories:
  Analysis of various emotion-related acoustic factors:
    Prosody / voice quality / pitch / duration / sound intensity
  Emotion-compensation methods:
    Emotion-added model training method (Wu 2005)
    Supra-segmental HMM (Shahin 2009)
    Emotion-dependent CMLLR transformations (Bie 2013)
13
Speaker-Related Issues – Speaking style (Rate)
• Speaking rate: another high-level speaker-related variable that has a big impact on speaker verification performance.
• Rate mismatch between training and test utterances
• Speech recognition
  A probabilistic method to estimate speaking rate (Yasuda 2012)
  A speech rate classifier (SRC) (Martinez 1998)
• Speaker recognition
  Non-linear time alignment or DTW (Dynamic Time Warping): effective or not?
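For readers unfamiliar with DTW, a minimal sketch of the classic dynamic-programming recursion is given here. It only shows how non-linear time alignment absorbs a rate difference between two sequences and does not settle whether DTW helps speaker recognition. The one-dimensional sequences and the absolute-difference local cost are assumptions made for this sketch; real systems align feature vectors.

```python
def dtw_distance(a, b):
    """Return the minimum accumulated cost of aligning sequence a with sequence b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


fast = [1.0, 2.0, 3.0]
slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]   # same contour at half the rate
same_contour = dtw_distance(fast, slow)
shifted = dtw_distance(fast, [2.0, 3.0, 4.0])
```

The half-rate copy aligns at zero cost, while a sequence with different values does not, so the distance reflects content rather than speaking rate.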
14
Speaker-Related Issues – Speaking style (Idiom)
• Idiom: a person’s personal style of word usage and a high-level inter-speaker characteristic. It is actually a kind of discriminative information rather than a robustness issue, but it helps to improve recognition performance.
• Human brain: self-learning with idioms
• Important threads:
  Idiosyncratic word usage: high-level feature
  Idiosyncratic pronunciation feature: low-level feature
15
Speaker-Related Issues – Speaking style (X-lingual)
• Language mismatch results in performance degradation.
• Previous work:
  Training a pooled model from multi-lingual corpora (Ma 2004)
  Language normalization (Akbacak 2007)
  Language factor compensation (Lu 2009)
  Feature combination (Nagaraja 2013)
16
Speaker-Related Issues – Speaking style (Ageing)
• Does the voice change significantly with time?
• Performance degradation has been observed in the presence of time intervals.
• From the point of view of pattern recognition:
  Enrollment data (for model training) and test utterances for verification are separated by some period of time.
17
Speaker-Related Issues – Speaking style (Ageing)
• Model domain:
  Data augmentation (Beigi 2009): speaker re-enrollment
  MAP/MLLR adaptation (Lamel 2000): model adaptation
• Score domain:
  A classifier with an ageing-dependent decision boundary (Kelly 2011)
• Feature domain:
  F-ratio measure (Lu 2007)
  Frequency warping and filter output weighting to emphasize speaker-sensitive and time-insensitive sub-bands (Wang 2012)
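The F-ratio mentioned under the feature domain can be sketched in its textbook form: the variance of the per-speaker means divided by the average within-speaker variance, computed for one feature or sub-band at a time. This is a generic sketch under that assumption, not the exact procedure of (Lu 2007) or (Wang 2012); the function name and toy values are hypothetical.

```python
from statistics import mean, pvariance


def f_ratio(values_by_speaker):
    """F-ratio of one feature: between-speaker variance over within-speaker variance.

    values_by_speaker: one list of feature values per speaker.
    """
    speaker_means = [mean(values) for values in values_by_speaker]
    between = pvariance(speaker_means)
    within = mean([pvariance(values) for values in values_by_speaker])
    if within == 0:
        raise ValueError("within-speaker variance is zero")
    return between / within


# Sub-band energies of two speakers: well separated versus overlapping.
separated = f_ratio([[1.0, 3.0], [5.0, 7.0]])
overlapping = f_ratio([[1.0, 7.0], [3.0, 5.0]])
```

A higher value marks a sub-band that separates speakers well relative to its variation within a speaker. The same ratio computed across recording sessions instead of speakers would indicate how sensitive a sub-band is to time.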
18
Three Issues
19
APP-Oriented Issues – Main applications
• User Authentication
  Commercial transactions / access control / online shopping
• Public Security and Judicature
  Parolee monitoring / in-prison call monitoring / forensics
• Speaker Adaptation in Speech Recognition
  Speaker-dependent speech recognizer
• Multi-Speaker Environments
  Speaker detection / tracking / segmentation / diarization
20
APP-Oriented Issues – SUSR
• Short utterance speaker recognition (SUSR)
  Unsatisfactory performance on GMM-UBM (NIST), JFA (Kenny 2004) and i-vector (Vogt 2008).
• Challenges (Zhang 2014)
  Discriminative information is inadequate and confusable
• Research directions
  To select more discriminative data: Fisher-voice based feature fusion method combined with PCA and LDA (Zhang 2013).
  To train more accurate models with high-level information: JFA and i-vector / phoneme-specific multi-model method (Zhang 2012).
  Better algorithms for scoring: ULS (Parris 1998) / WBLS (Malegaonkar 2008).
21
APP-Oriented Issues – Many others
• Coding mismatch
  G.711 / G.729 / WeChat-specific format / ...
• Integration of speech recognition and speaker recognition
  Speech recognition: more speaker/dialect-independent
  Speaker recognition: more speaker-dependent
• Voice quality control:
  VAD and higher-discriminative feature/segment retrieval
  High-quality speech vs distorted speech (noisy, clipped, ...)
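As a rough illustration of the VAD step, the sketch below marks a frame as speech when its mean energy exceeds a fixed threshold. This is the crudest form of VAD and is shown only to make the idea concrete; the frame length, threshold and function name are assumptions, and practical systems use adaptive or model-based detectors.

```python
def energy_vad(samples, frame_len=4, threshold=0.01):
    """Return one True/False speech decision per non-overlapping frame."""
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Mean squared amplitude of the frame.
        energy = sum(s * s for s in frame) / frame_len
        decisions.append(energy > threshold)
    return decisions


# One silent frame followed by one frame with clear signal energy.
signal = [0.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.5, -0.5]
decisions = energy_vad(signal)
```

Only the frames marked True would be passed on to feature extraction and scoring.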
22
Summary
• An overview of speaker recognition technologies with an emphasis
on dealing with robustness issues.
• Three categories:
  Environment-related issues
  Speaker-related issues
  Application-oriented issues
• Some directions have been touched by researchers while others may
be future focuses.
Thank you
23
References
• M. Akbacak, J. H. Hansen (Akbacak 2007), “Language normalization for bilingual speaker recognition systems,” Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. IEEE, 4: IV-257-IV-260.
• H. Beigi (Beigi 2009), “Effects of time lapse on speaker recognition results,” Proc. of 16th International Conference on Digital Signal Processing, pp. 1-6, 2009.
• F.-H. Bie, D. Wang, T. F. Zheng, J. Tejedor, R. Chen (Bie 2013), “Emotional adaptive training for speaker verification,” Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific. IEEE, 2013: 1-4.
• S. F. Boll (Boll 1979), “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech and Signal Processing, 1979, 27:113-120.
• S. Cumani, O. Glembek, N. Brummer, E. de Villiers, P. Laface (Cumani 2012), “Gender independent discriminative speaker recognition in i-vector space,” ICASSP, 2012.
• N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (Dehak 2011), “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.
• S. Furui (Furui 1981), “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoust. Speech Signal Processing, 1981. 29(2):254-272.
• M. J. F. Gales and S. J. Young (Gales 1996), “Robust continuous speech recognition using parallel model combination,” IEEE Transactions on Speech and Audio Processing, 1996, 4(5): 352-359.
• A. O. Hatch, S. S. Kajarekar, and A. Stolcke (Hatch 2006), “Within-class covariance normalization for SVM-based speaker recognition,” in INTERSPEECH’06, 2006.
24
References
• H. Hermansky and N. Morgan (Hermansky 1994), “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, 1994. 2(4): 578-589.
• S. Ioffe (Ioffe 2006), “Probabilistic linear discriminant analysis,” in ECCV2006, 2006, pp. 531–542.
• F. Kelly and N. Harte (Kelly 2011), “Effects of long-term ageing on speaker verification,” Biometrics and ID Management, Volume 6583 of Lecture Notes in Computer Science, pp. 113-124, Springer Berlin/Heidelberg, 2011.
• P. Kenny, P. Dumouchel (Kenny 2004), “Experiments in Speaker Verification using Factor Analysis Likelihood Ratios,” in Proceedings of Odyssey04 - Speaker and Language Recognition Workshop, Toledo, Spain, 2004.
• P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel (Kenny 2007), “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007.
• A. Kocsor, L. Toth, A. Kuba, K. Kovacs, M. Jelasity, T. Gyimothy, J. Csirik (Kocsor 2000), “A comparative study of several feature transformation and learning methods for phoneme classification,” International Journal of Speech Technology, 2000. 3(3): 263-276.
• L. Lamel and J. Gauvin (Lamel 2000), “Speaker verification over the telephone,” Speech Communication, Volume 2000, Issue 31, pp. 141-154, 2000.
• R. G. Lomax and D. L. Hahs-Vaughn (Lomax 2007), “Statistical concepts: a second course,” Lawrence Erlbaum Associates, 2007.
• X. Lu and J. Dang (Lu 2007), “Physiological feature extraction for text independent speaker identification using non-uniform subband processing,” Proc. of ICASSP 2007, pp. 461-464, 2007.
• L. Lu, Y. Dong, X. Zhao, J. Liu, H. Wang (Lu 2009), “The effect of language factors for robust speaker recognition,” Acoustics, Speech and Signal Processing, 2009. ICASSP 2009.
25
References
• B. Ma, and H.-L. Meng (Ma 2004), “English-Chinese bilingual text-independent speaker verification,” Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP'04). IEEE International Conference on. Vol. 5, 2004.
• A. Malegaonkar, A. Ariyaeeinia, P. Sivakumaran and J. Fortuna (Malegaonkar 2008), “On the enhancement of speaker identification accuracy using weighted bilateral scoring,” IEEE International Carnahan Conference on Security Technology (ICCST): 254-258, 2008.
• F. Martinez, D. Tapias, J. Alvarez (Martinez 1998), “Towards speech rate independence in large vocabulary continuous speech recognition,” Acoustics, Speech and Signal Processing, 1998.
• M. McLaren and D. A. van Leeuwen (McLaren 2012), “Gender-independent speaker recognition using source normalization,” in Proc. ICASSP, 2012, pp. 4373-4376.
• B. G. Nagaraja, H. S. Jayanna (Nagaraja 2013), “Combination of Features for Multilingual Speaker Identification with the Constraint of Limited Data,” International Journal of Computer Applications, 2013, Vol. 70 (6), pp. 1-6.
• NIST Speaker Recognition Evaluation Plan (NIST), available online: http://www.nist.gov/speech/tests/sre/.
• E. S. Parris and M. J. Carey (Parris 1998), “Multilateral techniques for speaker recognition,” International Conference on Spoken Language Processing (ICSLP), 1998.
• D. A. Reynolds (Reynolds 2003), “Channel robust speaker verification via feature mapping,” ICASSP, 2003, (2): 53-56.
• G. Saon, M. Padmanabhan, R. Gopinath, S. Chen (Saon 2000), “Maximum likelihood discriminant feature spaces,” Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2000. 2: 1129-1132.
• I. Shahin (Shahin 2009), “Speaker identification in emotional environments,” Iranian Journal of Electrical and Computer Engineering, vol. 8, no. 1, pp. 41–46, 2009.
26
References
• A. Solomonoff, C. Quillen, and W. M. Campbell (Solomonoff 2004), “Channel compensation for SVM speaker recognition,” in Proc. Odyssey Speaker and Language Recognition Workshop, 2004, pp. 57–62.
• R. Teunen, B. Shahshahani, and L. Heck (Teunen 2000), “A model-based transformational approach to robust speaker recognition,” in Proc. ICSLP’00, 2000, pp. 495–498.
• R. G. Tull and J. C. Rutledge (Tull 1996), “‘Cold Speech’ for Automatic Speaker Recognition,” Acoustical Society of America 131st Meeting Lay Language Papers, May, 1996.
• R. Vogt, B. Baker, and S. Sridharan (Vogt 2008), “Factor analysis subspace estimation for speaker verification with short utterances,” in Interspeech, Brisbane, 2008.
• L.-L. Wang, X.-J. Wu, T. F. Zheng and C.-H. Zhang (Wang 2012), “An Investigation into Better Frequency Warping for Time-Varying Speaker Recognition,” APSIPA ASC, 2012.
• T. Wu, Y.-C. Yang, and Z.-H. Wu (Wu 2005), “Improving speaker recognition by training on emotion-added models,” in Proc. Affective Computing and Intelligent Interaction, 2005, pp. 382–389.
• H. Yasuda and M. Kudo (Yasuda 2012), “Speech rate change detection in martingale framework,” in Proc. ISDA, 2012, pp. 859-864.
• C.-H. Zhang, X.-J. Wu, T. F. Zheng and L.-L. Wang (Zhang 2012), “A K-phoneme-class based multi-model method for short utterance speaker recognition,” The 4th Asia-Pacific Signal and Information Processing Association, Annual Summit and Conference, APSIPA ASC, 2012.
• C.-H. Zhang and T. F. Zheng (Zhang 2013), “A fishervoice based feature fusion method for short utterance speaker recognition,” IEEE China Summit and International Conference on Signal and Information Processing, ChinaSIP, 2013.
• C.-H. Zhang (Zhang 2014), “Research on Short Utterance Speaker Recognition,” PhD thesis, Tsinghua University, April 2014.