
An Overview of Robustness Related Issues in Speaker Recognition

A plenary overview talk at APSIPA ASC 2014

Thomas Fang Zheng

CSLT, RIIT, Tsinghua University

[email protected]

APSIPA ASC 2014, Dec 9-12, 2014, Siem Reap, Cambodia


Outline

• Introduction

• Environment-Related Issues

• Speaker-Related Issues

• Application-Oriented Issues

• Summary

• References


Introduction

• Automatic speaker recognition (identification & verification)

Active research areas (cross-channel, noise, …)

Wide applications (telephone banking, forensics, …)

• A lot of challenges in practical applications

• Three categories of robustness issues

Environment-related issues

Speaker-related issues

Application-oriented issues


Three Issues


Environment-Related Issues – Noise robustness

• Factors: recording / environmental noises

• Two research directions:

Feature level:

Spectral subtraction (Boll 1979) / RASTA filtering (Hermansky 1994)

PCA (Kocsor 2000) / LDA (Lomax 2007) / HLDA (Saon 2000) in the feature domain

Model level:

Model compensation algorithms (Gales 1996)
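As an illustration of the feature-level direction, the following is a minimal sketch of magnitude spectral subtraction in the spirit of (Boll 1979). It is not the exact algorithm from the cited paper: the function name, the frame settings, the spectral floor, and the assumption that the first few frames contain only noise are all choices made for this example.

```python
import numpy as np

def spectral_subtraction(signal, noise_frames=5, frame_len=256, hop=128, floor=0.01):
    """Subtract an average noise magnitude spectrum from every frame.

    Assumption for this sketch: the first `noise_frames` frames are noise only.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Noise estimate: mean magnitude over the leading noise-only frames.
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract, keeping a small spectral floor instead of negative magnitudes.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    # Overlap-add resynthesis (Hann window at 50% overlap sums to roughly 1).
    out = np.zeros(len(signal))
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += clean[i]
    return out
```

The noisy phase is reused unchanged, which is the usual simplification; only the magnitude is modified.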


Environment-Related Issues – Channel mismatch

• Factors: various types of microphones / transmission channels

• Three research directions:

Feature transformation

CMS / CMN (Furui 1981); feature mapping (Reynolds 2003)

Model compensation

SMS (Speaker Model Synthesis) (Teunen 2000); subspace projection

Score normalization

Z-Norm, H-Norm, T-Norm, ...
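Two of the techniques listed above are simple enough to sketch directly. The code below shows cepstral mean normalization (CMS/CMN) and the common form of Z-Norm and T-Norm. The function names and array layouts are assumptions for this example; the slides give no implementation details.

```python
import numpy as np

def cepstral_mean_normalization(features):
    """Remove the per-utterance mean from each cepstral coefficient.

    features: array of shape (n_frames, n_coeffs). A stationary channel
    shows up as a constant offset in the cepstral domain, so subtracting
    the mean cancels it.
    """
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)

def z_norm(raw_score, impostor_scores):
    """Z-Norm: normalize with statistics of impostor utterances scored
    against the target model (estimated offline, once per model)."""
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    return (raw_score - impostor_scores.mean()) / impostor_scores.std()

def t_norm(raw_score, cohort_scores):
    """T-Norm: normalize with scores of the test utterance against a
    cohort of impostor models (computed at test time)."""
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()
```

Both score normalizations use the same formula; they differ only in where the impostor statistics come from.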



• State-of-the-art approaches

JFA (Joint Factor Analysis) (Kenny 2007): a more comprehensive statistical approach, which models the speaker and channel variations as two independent random variables.

i-vector (Dehak 2011): a single low-rank total variability space is defined to represent both speaker and channel variations at the same time.
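The difference between the two models can be written compactly. The notation below follows the form commonly used for these models and is not taken from the slides: M is the speaker- and session-dependent GMM mean supervector, m is the UBM supervector, and the latent factors x, y, z and w are assumed to have standard normal priors.

```latex
\begin{align}
% JFA: separate speaker (V, D) and channel (U) terms
\mathbf{M} &= \mathbf{m} + \mathbf{V}\mathbf{y} + \mathbf{U}\mathbf{x} + \mathbf{D}\mathbf{z} \\
% i-vector: one low-rank total variability matrix T
\mathbf{M} &= \mathbf{m} + \mathbf{T}\mathbf{w}, \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
\end{align}
```

Here V spans the speaker subspace, U the channel subspace, and D is a diagonal residual term. In the second equation the i-vector w carries both kinds of variation.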


• Inter-channel compensation methods

Because the i-vector also captures channel variations, it is less discriminative among speakers, so many inter-channel compensation methods have been proposed to accentuate speaker information.

NAP (Nuisance Attribute Projection) (Solomonoff 2004): finds an optimized projection.

WCCN (Within-Class Covariance Normalization) (Hatch 2006): a linear transform.

LDA (Linear Discriminant Analysis) (Dehak 2011) / PLDA (Probabilistic LDA) (Ioffe 2006): PLDA is a generative model and has achieved great success.
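To make the "linear transform" in WCCN concrete, here is a small sketch under common assumptions: the within-class (per-speaker) covariance W is averaged over speakers, and the transform B is the Cholesky factor of its inverse, so that projected vectors have identity within-class covariance. The function names are hypothetical, and cosine scoring is added only as a typical companion to WCCN on i-vectors.

```python
import numpy as np

def within_class_covariance(vectors, labels):
    """Average of the per-speaker covariance matrices."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    dim = vectors.shape[1]
    W = np.zeros((dim, dim))
    for c in classes:
        centered = vectors[labels == c] - vectors[labels == c].mean(axis=0)
        W += centered.T @ centered / len(centered)
    return W / len(classes)

def wccn_transform(vectors, labels):
    """Return B with B @ B.T == inv(W); project row vectors with `vectors @ B`."""
    W = within_class_covariance(vectors, labels)
    return np.linalg.cholesky(np.linalg.inv(W))

def cosine_score(a, b):
    """Cosine similarity between two (projected) vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

After projection, directions with large within-speaker (channel) spread are scaled down, so they contribute less to the cosine score.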



Three Issues


Speaker-Related Issues

• Gender

• Physical conditions (cold or laryngitis)

• Speaking style (emotion /speaking rate /volume /idiom)

• Cross-Lingual (language mismatch)

• Ageing (voice changes with time/age)


Speaker-Related Issues – Gender

• Better scenario: training with gender-dependent (GD) features and recognizing with known gender information. In applications, gender information is often not available.

• Approaches: to design a gender-independent system, and then

Pairwise discriminative training based on i-vectors (Cumani 2012)

Source normalization of variation, separating genders as a pre-processing step, based on a PLDA classifier (McLaren 2012)

Male and female speakers are physiologically different, so their speech should be processed and analyzed differently (FFT size, frame shift (resolution), UBM, ...); the authors’ preliminary results show significant improvement when doing so.


Speaker-Related Issues – Physical conditions

• Speech is a behavioral signal.

• Variability of the speaker’s physical conditions

Cold / nasal congestion / laryngitis, etc.

• “Cold-affected” speech in speaker recognition (Tull 1996)

• Research in this direction is still rare, and speech databases are difficult to collect and organize.

• But research on it has practical importance.


Speaker-Related Issues – Speaking style (Emotion)

• Emotion: an intrinsic nature of human beings.

• Categories:

Analysis of various emotion-related acoustic factors

Prosody / voice quality / pitch / duration / sound intensity

Emotion-compensation methods

Emotion-added model training method (Wu 2005)

Supra-segmental HMM (Shahin 2009)

Emotion-dependent CMLLR transformations (Bie 2013)


Speaker-Related Issues – Speaking style (Rate)

• Speaking rate: another high-level speaker-related variable that has a big impact on speaker verification performance.

• Rate mismatch between training and test utterances

• Speech recognition

A probabilistic method to estimate speaking rate (Yasuda 2012)

A speech rate classifier (SRC) (Martinez 1998)

• Speaker recognition

Non-linear time alignment or DTW (Dynamic Time Warping), effective or not?
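For readers unfamiliar with DTW, the sketch below shows the textbook dynamic-programming recurrence on two feature sequences. It only illustrates how non-linear time alignment absorbs a rate difference; it does not answer the question posed on the slide, and the function name and Euclidean local distance are choices made for this example.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best monotonic alignment between two sequences.

    Each sequence is (n_frames, dim); 1-D input is treated as scalar frames.
    """
    a = np.asarray(seq_a, dtype=float)
    b = np.asarray(seq_b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])
            # Allowed steps: advance in a, advance in b, or advance in both.
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])
```

A sequence and a copy of it spoken "twice as slowly" (every frame repeated) align with zero cost, whereas a frame-by-frame comparison would not even be defined for the unequal lengths.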


Speaker-Related Issues – Speaking style (Idiom)

• Idiom: a person’s personal style of word usage and a high-level inter-speaker characteristic. It is actually a kind of discriminative information rather than a robustness issue, but it helps to improve the recognition performance.

• Human brain: self-learning with idioms

• Important threads:

Idiosyncratic word usage: high-level feature

Idiosyncratic pronunciation: low-level feature


Speaker-Related Issues – Speaking style (X-lingual)

• Language mismatch results in performance degradation.

• Previous work:

Training a pooled model from multi-lingual corpora (Ma 2004)

Language normalization (Akbacak 2007)

Language factor compensation (Lu 2009)

Feature combination (Nagaraja 2013)


Speaker-Related Issues – Speaking style (Ageing)

• Does the voice change significantly with time?

• Performance degradation has been observed in the presence of time intervals.

• From the point of view of pattern recognition:

Enrollment data (training model) and test utterances for verification are separated by some period of time.


• Model domain:

Data augmentation (Beigi 2009): speaker re-enrollment

MAP/MLLR adaptation (Lamel 2000): model adaptation

• Score domain:

A classifier with an ageing-dependent decision boundary (Kelly 2011)

• Feature domain:

F-ratio measure (Lu 2007)

Frequency warping and filter output weighting to emphasize speaker-sensitive and time-insensitive sub-bands (Wang 2012)
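The F-ratio mentioned above can be sketched as follows, assuming the classic definition: for each sub-band, the variance of the speaker means divided by the average within-speaker variance. The function name and data layout are hypothetical, and the code does not reproduce the specific procedures of (Lu 2007) or (Wang 2012).

```python
import numpy as np

def f_ratio(features, speaker_ids):
    """Per-sub-band F-ratio: between-speaker variance / within-speaker variance.

    features: (n_samples, n_subbands), e.g. filter-bank output energies.
    speaker_ids: length n_samples, the speaker of each sample.
    """
    features = np.asarray(features, dtype=float)
    ids = np.asarray(speaker_ids)
    speakers = np.unique(ids)
    means = np.stack([features[ids == s].mean(axis=0) for s in speakers])
    within = np.stack([features[ids == s].var(axis=0) for s in speakers]).mean(axis=0)
    between = means.var(axis=0)
    return between / within
```

A high value marks a speaker-sensitive sub-band. Under the same assumption, grouping samples by recording session instead of by speaker would give a comparable measure of time sensitivity.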


Three Issues


APP-Oriented Issues – Main applications

• User authentication

Commercial transactions / access control / online shopping

• Public security and judicature

Parolee monitoring / in-prison call monitoring / forensics

• Speaker adaptation in speech recognition

Speaker-dependent speech recognizer

• Multi-speaker environments

Speaker detection / tracking / segmentation / diarization


APP-Oriented Issues – SUSR

• Short utterance speaker recognition (SUSR)

Unsatisfactory performance with GMM-UBM (NIST), JFA (Kenny 2004) and i-vector (Vogt 2008).

• Challenges (Zhang 2014)

Discriminative information is inadequate and confusable

• Research directions

To select more discriminative data: Fisher-voice based feature fusion method combined with PCA and LDA (Zhang 2013).

To train more accurate models with high-level information: JFA and i-vector / phoneme-specific multi-model method (Zhang 2012).

Better algorithms for scoring: ULS (Parris 1998) / WBLS (Malegaonkar 2008).


APP-Oriented Issues – Many others

• Coding mismatch

G.711 / G.729 / WeChat-specific format / ...

• Integration of speech recognition and speaker recognition

Speech recognition: more speaker/dialect-independent

Speaker recognition: more speaker-dependent

• Voice quality control:

VAD and retrieval of more discriminative features/segments

High-quality speech vs distorted speech (noisy, clipped, ...)
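As a rough illustration of voice quality control, the sketch below combines a very simple energy-based VAD with a clipping check. Both are assumptions for this example rather than methods from the talk: the threshold, the frame length, and the function names are hypothetical, and practical systems typically use more robust detectors.

```python
import numpy as np

def energy_vad(signal, frame_len=160, threshold_db=-30.0):
    """Flag frames whose energy is within `threshold_db` of the loudest frame."""
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Small constant avoids log(0) on silent frames.
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return energy_db > energy_db.max() + threshold_db

def clipping_ratio(signal, full_scale=1.0, tol=1e-3):
    """Fraction of samples at (or within `tol` of) full scale."""
    signal = np.asarray(signal, dtype=float)
    return float(np.mean(np.abs(signal) >= full_scale - tol))
```

An utterance could then be rejected or down-weighted when too few frames are flagged as speech or when the clipping ratio is high.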


Summary

• An overview of speaker recognition technologies with an emphasis on dealing with robustness issues.

• Three categories:

Environment-related issues

Speaker-related issues

Application-oriented issues

• Some directions have been touched on by researchers, while others may be future focuses.

Thank you


References

• M. Akbacak, J. H. Hansen (Akbacak 2007), “Language normalization for bilingual speaker recognition systems,” Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. IEEE, 4: IV-257-IV-260.

• H. Beigi (Beigi 2009), “Effects of time lapse on speaker recognition results,” Proc. of 16th International Conference on Digital Signal Processing, pp. 1-6, 2009.

• F.-H. Bie, D. Wang, T. F. Zheng, J. Tejedor, R. Chen (Bie 2013), “Emotional adaptive training for speaker verification,” Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific. IEEE, 2013: 1-4.

• S. F. Boll (Boll 1979), “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Transactions on Acoustics, Speech and Signal Processing, 1979, 27:113-120.

• S. Cumani, O. Glembek, N. Brummer, E. de Villiers, P. Laface (Cumani 2012), “Gender independent discriminative speaker recognition in i-vector space,” ICASSP, 2012.

• N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet (Dehak 2011), “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

• S. Furui (Furui 1981), “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoust. Speech Signal Processing, 1981. 29(2):254-272.

• M. J. F. Gales and S. J. Young (Gales 1996), “Robust continuous speech recognition using parallel model combination,” IEEE Transactions on Speech and Audio Processing, 1996, 4(5): 352-359.

• A. O. Hatch, S. S. Kajarekar, and A. Stolcke (Hatch 2006), “Within-class covariance normalization for SVM-based speaker recognition,” in INTERSPEECH’06, 2006.

• H. Hermansky and N. Morgan (Hermansky 1994), “RASTA processing of speech,” IEEE Transactions on Speech and Audio Processing, 1994. 2(4): 578-589.

• S. Ioffe (Ioffe 2006), “Probabilistic linear discriminant analysis,” in ECCV2006, 2006, pp. 531–542.

• F. Kelly and N. Harte (Kelly 2011), “Effects of long-term ageing on speaker verification,” Biometrics and ID Management, Volume 6583 of Lecture Notes in Computer Science, pp. 113-124, Springer Berlin/Heidelberg, 2011.

• P. Kenny, P. Dumouchel (Kenny 2004), “Experiments in Speaker Verification using Factor Analysis Likelihood Ratios,” in Proceedings of Odyssey04 - Speaker and Language Recognition Workshop, Toledo, Spain, 2004.

• P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel (Kenny 2007), “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435–1447, 2007.

• A. Kocsor, L. Toth, A. Kuba, K. Kovacs, M. Jelasity, T. Gyimothy, J. Csirik (Kocsor 2000), “A comparative study of several feature transformation and learning methods for phoneme classification,” International Journal of Speech Technology, 2000. 3(3): 263-276.

• L. Lamel and J. Gauvin (Lamel 2000), “Speaker verification over the telephone,” Speech Communication, vol. 31, pp. 141-154, 2000.

• R. G. Lomax and D. L. Hahs-Vaughn (Lomax 2007), “Statistical concepts: a second course,” Lawrence Erlbaum Associates, 2007.

• X. Lu and J. Dang (Lu 2007), “Physiological feature extraction for text independent speaker identification using non-uniform subband processing,” Proc. of ICASSP 2007, pp. 461-464, 2007.

• L. Lu, Y. Dong, X. Zhao, J. Liu, H. Wang (Lu 2009), “The effect of language factors for robust speaker recognition,” Acoustics, Speech and Signal Processing, 2009. ICASSP 2009.

• B. Ma and H.-L. Meng (Ma 2004), “English-Chinese bilingual text-independent speaker verification,” Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP'04). IEEE International Conference on. Vol. 5, 2004.

• A. Malegaonkar, A. Ariyaeeinia, P. Sivakumaran and J. Fortuna (Malegaonkar 2008), “On the enhancement of speaker identification accuracy using weighted bilateral scoring,” IEEE International Carnahan Conference on Security Technology (ICCST): 254-258, 2008.

• F. Martinez, D. Tapias, J. Alvarez (Martinez 1998), “Towards speech rate independence in large vocabulary continuous speech recognition,” Acoustics, Speech and Signal Processing, 1998.

• M. McLaren and D. A. van Leeuwen (McLaren 2012), “Gender-independent speaker recognition using source normalization,” in Proc. ICASSP, 2012, pp. 4373-4376.

• B. G. Nagaraja, H. S. Jayanna (Nagaraja 2013), “Combination of Features for Multilingual Speaker Identification with the Constraint of Limited Data,” International Journal of Computer Applications, 2013, Vol. 70 (6), pp. 1-6.

• NIST Speaker Recognition Evaluation Plan (NIST), available online: http://www.nist.gov/speech/tests/sre/.

• E. S. Parris and M. J. Carey (Parris 1998), “Multilateral techniques for speaker recognition,” International Conference on Spoken Language Processing (ICSLP), 1998.

• D. A. Reynolds (Reynolds 2003), “Channel robust speaker verification via feature mapping,” ICASSP, 2003, (2): 53-56.

• G. Saon, M. Padmanabhan, R. Gopinath, S. Chen (Saon 2000), “Maximum likelihood discriminant feature spaces,” Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2000. 2: 1129-1132.

• I. Shahin (Shahin 2009), “Speaker identification in emotional environments,” Iranian Journal of Electrical and Computer Engineering, vol. 8, no. 1, pp. 41–46, 2009.

• A. Solomonoff, C. Quillen, and W. M. Campbell (Solomonoff 2004), “Channel compensation for SVM speaker recognition,” in Proc. Odyssey Speaker and Language Recognition Workshop, 2004, pp. 57–62.

• R. Teunen, B. Shahshahani, and L. Heck (Teunen 2000), “A model-based transformational approach to robust speaker recognition,” in Proc. ICSLP’00, 2000, pp. 495–498.

• R. G. Tull and J. C. Rutledge (Tull 1996), “‘Cold Speech’ for Automatic Speaker Recognition,” Acoustical Society of America 131st Meeting Lay Language Papers, May, 1996.

• R. Vogt, B. Baker, and S. Sridharan (Vogt 2008), “Factor analysis subspace estimation for speaker verification with short utterances,” in Interspeech, Brisbane, 2008.

• L.-L. Wang, X.-J. Wu, T. F. Zheng and C.-H. Zhang (Wang 2012), “An Investigation into Better Frequency Warping for Time-Varying Speaker Recognition,” APSIPA ASC, 2012.

• T. Wu, Y.-C. Yang, and Z.-H. Wu (Wu 2005), “Improving speaker recognition by training on emotion-added models,” in Proc. Affective Computing and Intelligent Interaction, 2005, pp. 382–389.

• H. Yasuda and M. Kudo (Yasuda 2012), “Speech rate change detection in martingale framework,” in Proc. ISDA, 2012, pp. 859-864.

• C.-H. Zhang, X.-J. Wu, T. F. Zheng and L.-L. Wang (Zhang 2012), “A K-phoneme-class based multi-model method for short utterance speaker recognition,” The 4th Asia-Pacific Signal and Information Processing Association, Annual Summit and Conference, APSIPA ASC, 2012.

• C.-H. Zhang and T. F. Zheng (Zhang 2013), “A fishervoice based feature fusion method for short utterance speaker recognition,” IEEE China Summit and International Conference on Signal and Information Processing, ChinaSIP, 2013.

• C.-H. Zhang (Zhang 2014), “Research on Short Utterance Speaker Recognition,” PhD thesis, Tsinghua University, April 2014.
