Building a Robust Speaker Recognition System
Oldřich Plchot, Ondřej Glembek, Pavel Matějka. December 9th, 2012.


Slide 1 (title slide)

Slide 2: Building a Robust Speaker Recognition System
Oldřich Plchot, Ondřej Glembek, Pavel Matějka. December 9th, 2012.

Slide 3: The PRISM Team
- SRI International: Harry Bratt, Lukas Burget, Luciana Ferrer, Martin Graciarena, Aaron Lawson, Yun Lei, Nicolas Scheffer, Sachin Kajarekar, Elizabeth Shriberg, Andreas Stolcke
- Brno University of Technology: Jan H. Černocký, Ondřej Glembek, Pavel Matějka, Oldřich Plchot

Slide 4: PRISM Robustness
How did we achieve these results? What are the outstanding research issues?
[Figure: error rates lowered across conditions]
BEST Phase I PI conference, Nov. 29th, 2011

Slide 5: Robustness
- A need for effectiveness in non-ideal conditions: move beyond biometric evaluation on clean, controlled acquisition environments; extract robust, discriminative biometric features that are invariant to such variability types.
- A need for predictability: a system claiming 99% accuracy should not give 80% on unseen data, unless the system itself warns otherwise.

Slide 6: A comprehensive approach
- Multi-stream high-order and low-order features
- Advanced speaker modeling and system combination
- Prediction of difficult scenarios (QM vector)
- Robustness vs. the unknown: carefully test on held-out data, and beware of overtraining

Slide 7: A comprehensive approach
- Multi-stream high-order and low-order features: prosody, MLLR, constraints, and MFCC, PLP. Multiple HOFs bring new complementary information; multiple LOFs do likewise, plus redundancy for increased robustness.
- Advanced speaker modeling and system combination: a unified modeling framework, i-vector / probabilistic linear discriminant analysis (PLDA); a robust variation-compensation scheme for multiple features and variability types; the i-vector/PLDA framework adapted to all high- and low-level features; discriminative training for more compact, and thus more robust, systems.

Slide 8: The magic?
- i-vectors: the i-vector extractor is a model similar to JFA, but with a single subspace T. It is easier to train: no speaker labels are needed, so the subspace can be trained on large amounts of unlabeled recordings. We assume a standard normal prior on the factors i, so the supervector for a recording is modeled as M = m + T i. A point estimate of i, the i-vector, can then be extracted for every recording as its low-dimensional, fixed-length representation (typically around 200 dimensions). However, the i-vector contains information about both the speaker and the channel; hopefully these can be separated by the subsequent classifier.
Dehak, N., et al., "Support Vector Machines versus Fast Scoring in the Low-Dimensional Total Variability Space for Speaker Verification," in Proc. Interspeech 2009, Brighton, UK, September 2009.

Slide 9: Illustration
A low-dimensional vector can represent complex patterns in a multi-dimensional space.
[Figure: components of the mean supervector m shifted along the columns of T, illustrating M = m + T i with a low-dimensional i.]

Slide 10: Probabilistic Linear Discriminant Analysis (PLDA)
Let every speech recording be represented by an i-vector. What is now the appropriate probabilistic model for verification?
- i-vectors are assumed to be normally distributed.
- An i-vector still contains channel information, so our model should account for both speaker and channel variability, just as in JFA. The natural choice is a simplified JFA model with only a single Gaussian. Such a model is known as PLDA and is described by the familiar equation
    φ = μ + V y + U x + ε,
  with speaker factors y, channel factors x, and residual ε.

Slide 11
For our low-dimensional i-vectors, we usually choose U to be a full-rank matrix, so there is no need to consider the residual. We can then rewrite the definition of PLDA as
    φ = μ + V y + U x,  y ~ N(0, I),  x ~ N(0, I),
or equivalently as
    φ | y ~ N(μ + V y, U Uᵀ),  y ~ N(0, I).
Why PLDA? These are the familiar LDA assumptions!

Slide 12: PLDA-based verification
Let us again consider the verification score given by the log-likelihood ratio between the same-speaker and different-speaker hypotheses, now in the context of modeling i-vectors with PLDA:
    score = log p(φ1, φ2 | same speaker) - log [ p(φ1) p(φ2) ].
Before, this was intractable; with i-vectors, it is feasible.
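Going back to Slide 8: under the model M = m + T i with a standard normal prior on i, the i-vector point estimate has a closed form, i = (I + Tᵀ Σ⁻¹ N T)⁻¹ Tᵀ Σ⁻¹ f, where N and f are the zero- and centered first-order Baum-Welch statistics. A minimal numpy sketch under those assumptions; the function name and argument shapes are illustrative, not the actual PRISM code:

```python
import numpy as np

def extract_ivector(T, Sigma, N, f):
    """MAP point estimate of the i-vector.

    T     : (C*F, R) total-variability matrix
    Sigma : (C*F,)   diagonal of the UBM covariance
    N     : (C,)     zero-order statistics per UBM component
    f     : (C*F,)   centered first-order statistics
    """
    CF, R = T.shape
    C = N.shape[0]
    # Expand zero-order stats to supervector dimension.
    N_exp = np.repeat(N, CF // C)                  # (C*F,)
    T_w = T / Sigma[:, None]                       # Sigma^-1 T
    # Posterior precision: I + T' Sigma^-1 N T
    precision = np.eye(R) + T_w.T @ (N_exp[:, None] * T)
    # Posterior mean: precision^-1 T' Sigma^-1 f
    return np.linalg.solve(precision, T_w.T @ f)
```

With empty statistics the posterior collapses to the prior mean (zero), which makes the shrinkage behavior of the prior easy to verify.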
All the integrals are now convolutions of Gaussians and can be solved analytically, giving, after some manipulation, a closed-form score. FAST!

Slide 13: Performance compared to eigenchannels and JFA
NIST SRE 2010, tel-tel (condition 5). Systems compared: baseline (relevance MAP), eigenchannel adaptation, JFA, i-vector+PLDA.
The i-vector+PLDA system is simpler to implement than JFA, allows extremely fast verification, and provides significant improvements, especially in the important low-false-alarm region.

Slide 14: i-vector+PLDA enhancements
NIST SRE 2010, tel-tel (condition 5). Enhancements over the baseline i-vector+PLDA: full-covariance UBM; LDA to 150 dimensions plus length normalization; condition-based mean normalization.
Ideas behind the enhancements:
- Make the task easier for PLDA by preprocessing the data with LDA.
- Make the heavy-tailed i-vectors more Gaussian (length normalization).
- Help channel compensation a little more with condition-based mean normalization.

Slide 15: Diverse systems unified
New technologies for prosody modeling, e.g. subspace multinomial modeling. All features are now modeled using the i-vector paradigm, even for combination.
[Figure: %FA @ 10% miss per feature stream]
BEST Phase I Final review, Nov. 3rd, 2011

Slide 16: BEST evaluation submissions
Complex multi-feature combinations of low- and high-level systems.
[Figure: % false alarms @ 10% miss for the PRISM MFCC system vs. the PRIMARY submission (early i-vector fusion, optimal)]
Should we look at another operating point, given how low this one is for the evaluation?

Slide 17: A comprehensive approach
- Multi-stream high-order and low-order features: prosody, MLLR, constraints, and MFCC, PLP.
- Advanced speaker modeling and system combination: unified modeling framework.
- Prediction of difficult scenarios: universal audio characterization for system combination. Detect the difficulty of the problem (e.g. enroll on noise, test on telephone) and react appropriately (e.g. calibrate scores for sound decisions).
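The closed-form score claimed on Slide 12 is easiest to see in the two-covariance formulation of PLDA, where Σ_ac = V Vᵀ is the across-speaker covariance and Σ_wc the within-speaker (channel) covariance. A hedged sketch of that log-likelihood ratio for mean-centered i-vectors, not the exact PRISM implementation:

```python
import numpy as np

def plda_llr(phi1, phi2, Sigma_ac, Sigma_wc):
    """log p(phi1, phi2 | same speaker) - log p(phi1) p(phi2)
    for the two-covariance PLDA model (mean-centered i-vectors)."""
    Sigma_tot = Sigma_ac + Sigma_wc
    # Joint covariance of the stacked pair under the same-speaker
    # hypothesis: the shared speaker factor couples the two i-vectors.
    joint = np.vstack([np.hstack([Sigma_tot, Sigma_ac]),
                       np.hstack([Sigma_ac, Sigma_tot])])
    x = np.concatenate([phi1, phi2])

    def logpdf(v, S):
        # Zero-mean Gaussian log-density.
        _, logdet = np.linalg.slogdet(S)
        return -0.5 * (len(v) * np.log(2 * np.pi) + logdet
                       + v @ np.linalg.solve(S, v))

    return logpdf(x, joint) - logpdf(phi1, Sigma_tot) - logpdf(phi2, Sigma_tot)
```

Two sanity checks fall out of the model: with zero across-speaker covariance every trial is uninformative (LLR of exactly zero), and the score is symmetric in the two i-vectors.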
Slide 18: Predicting challenging scenarios
Unified acoustic characterization: a novel approach to extract any metadata in a unified way, designed with the BEST program goal in mind: the ability to handle unseen data or compounded variability types, while avoiding the unnecessary burden of developing a new system for each new type of metadata. An identification system in which the training data is divided into conditions: microphone, noise 20 dB, noise 15 dB, reverb 0.3, reverb 0.5, reverb 0.7, telephone noise 8 dB. We are investigating how to integrate intrinsic conditions: language and vocal effort.
[Figure: condition-detection scores per condition]

Slide 19: Robust calibration / fusion
Condition-prediction features act as new higher-order information for calibration. Calibration scales and shifts scores for sound decision making at all operating points; confidence under matched vs. mismatched conditions will differ. We use discriminative training of the bilinear form; the model provides a bias for each condition type. Further research: assess generalization; affect the system fusion weights, not just calibration; include the information earlier in the pipeline.

Slide 20: Fusion with QM
The fused score combines an offset, a linear combination (with per-system weights) of the score from each system k, and a bilinear combination matrix applied to the vectors of metadata.

Slide 21: A comprehensive approach
- Multi-stream high-order and low-order features: prosody, MLLR, constraints, and MFCC, PLP.
- Advanced speaker modeling and system combination: unified modeling framework.
- Prediction of difficult scenarios: unified condition prediction for system combination.
- Robustness vs. the unknown: the PRISM data set. Expose systems to a diverse enough set of the variability types of interest; aim for generalization to non-ideal or unseen data scenarios; use advanced strategies to compensate for these degradations.
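The legend on Slide 20 (offset, linear combination weights, per-system scores, metadata vectors, bilinear combination matrix) suggests a fused score of the form s = α + Σ_k β_k s_k + q_eᵀ B q_t, with quality-measure (QM) vectors for the enrollment and test sides. A small sketch under that assumption; all names and shapes here are hypothetical:

```python
import numpy as np

def qm_calibrated_score(scores, q_enroll, q_test, alpha, beta, B):
    """Bilinear condition-aware score fusion.

    scores   : per-system raw scores s_k
    q_enroll : metadata (QM) vector for the enrollment side
    q_test   : metadata (QM) vector for the test side
    alpha    : scalar offset
    beta     : linear combination weights, one per system
    B        : bilinear combination matrix coupling the QM vectors
    """
    return alpha + beta @ scores + q_enroll @ B @ q_test
```

The bilinear term lets the calibration shift its operating point per enrollment/test condition pair, which is exactly the matched-vs-mismatched confidence adjustment Slide 19 motivates.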
Slide 22: The PRISM data set
A multi-variability, large-scale speaker recognition evaluation set; an unprecedented design effort across many data sets. Simulation of extrinsic variability types (reverb and noise); incorporation of intrinsic and cross-language variability. 1000 speakers, 30K audio files, and more than 70M trials. Open design: recipe published at the SRE11 analysis workshop [Ferrer11].
Extrinsic data simulation: degradation of a clean interview data set from SRE08 and SRE10 (close microphones); a variety of degradations aiming at generalization, with a diversity of SNRs and reverbs to cover unseen data.
- Reverb data set: uses RIR + Fconv; three RT30 values (0.3, 0.5, 0.7); 15 different room configurations (9 for training, 3 for enrollment, 3 for test).
- Noisy data set: noises from freesound.org, mixed using FaNT (Aurora); real noise samples (cocktail-party type, office noises); different noises for training and evaluation.

Slide 23: A comprehensive approach
- Multi-stream high-order and low-order features: prosody, MLLR, constraints, and MFCC, PLP.
- Advanced speaker modeling and system combination: unified modeling framework.
- Prediction of difficult scenarios: unified condition prediction for system combination.
- Robustness vs. the unknown: the PRISM data set.
Executing this plan: the BEST evaluation is an order of magnitude bigger than other known evaluations; we developed a very fast speaker recognition system; we leverage SRI's rapid application development framework (the SRI Idento system) for efficient idea assessment and system delivery; a diversely skilled team.
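The two extrinsic degradations on Slide 22 boil down to two signal operations: mixing noise at a target SNR (what FaNT does, with additional filtering and weighting that this sketch omits) and convolving with a room impulse response. A simplified numpy version, using plain full-signal energies rather than FaNT's speech-level measure:

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix a noise signal into speech at a target SNR in dB."""
    noise = np.resize(noise, speech.shape)        # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that p_speech / (scale^2 * p_noise) hits the target.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Simulate a room by convolving speech with its impulse response."""
    return np.convolve(speech, rir)[: len(speech)]
```

Using different noise samples and room configurations for training vs. evaluation, as the slide prescribes, is what makes the resulting set a test of generalization rather than memorization.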
Slide 24: Research opportunities
- Multi-feature systems: use novel low-level features for noise robustness; noise- and reverb-robust pitch extraction algorithms.
- Deeper understanding of combination, aiming for simpler systems: information fusion at an earlier stage than score level; new speech feature design?
- Acoustic characterization: deep integration of condition prediction in the pipeline; affecting fusion weights during system combination; integrating language and intrinsic variations; assessing improvements on unseen data and compounded variations.
- Hard extrinsic variations bring in domains of expertise borrowed from speech recognition and elsewhere (noise-robust modeling, speech enhancement: de-reverberation, de-noising, binary masks, ...).

Slide 25: Research opportunities
- Relaxing constraints even more. Compounded variations: reverb + noise + language switch.
- Exploring new types of variations: new kinds of intrinsic variation such as vocal effort (furtive speech, oration), aging, sickness; naturally occurring reverberant and noisy speech.
- Other parametric relaxations: unconstrained duration for speaker enrollment and testing (as low as a second?); robustness to multi-speaker audio in enrollment and testing, another kind of variability that is very important for interview data processing.

Slide 26

Slide 27: Expanding the boundaries of speaker recognition
- More similar trials (close relatives, same dialect area).
- Change verification into large-scale speaker search / tracking: enroll as many speakers as possible from large multi-speaker corpora; for evaluation, find the enrolled speakers in another large-scale corpus (HVI).
- How much can you learn from a speaker? Enrolling a familiar voice: plentiful enrollment data, limited test data. Introduce new high-level features, moving towards social analytics.
Slide 28: Research opportunities
- Understanding combination: leaner systems; information fusion at an earlier stage than score level; new feature creation / early-stage fusion methods.
- New opportunities. Need: the enabling technologies are understudied. Benefit: combining SRI + BUT gives a 2x improvement.
- A niche for robustness: BUT and SRI worked for two years on a common setup and dataset, yet BUT still has a better system on telephone data with the same lists, data, and technology, except for the VAD.

Slide 29: HLF systems explored on BEST
- Prosodic systems: syllable-based features; contour modeling.
- Phonetic systems: MLLR from speech recognition; constraints (analysis based on linguistically motivated regions).
- Acoustic characterization.