TRANSCRIPT
-
© 2019 SRI International
Speaker Recognition: A Retrospective
VoxSRC Workshop 2019, September 2019 | Graz, Austria
Mitchell McLaren1, Doug Reynolds2
1Speech Technology and Research Laboratory, SRI International, California, USA
2MIT Lincoln Laboratory (MITLL), USA
-
Outline
• Speaker Recognition Benchmarking
• The Evolution of Speaker Recognition Technology
• Future Challenges in Speaker Recognition
-
Speaker Recognition Benchmarking
-
Speaker Recognition Benchmarking
• Why do we need evaluations?
− Motivate research into current or evolving problems
− Provide a common benchmark available to researchers in order to gauge improvements in technology over time
• Proprietary data benchmarks offer limited insight to the research community
• What is needed to run an evaluation?
− Funding
− Data
− Motivation/Passion
− People to pull the strings, and keep the cogs turning
− A broad interest from the community
− Meaningful metrics and outcomes
-
This presentation is concerned with text-independent (Free speech) speaker verification
Speaker Recognition Tasks
• Identification: Whose voice is this?
• Verification/Authentication: Is this Bob’s voice?
• Segmentation and Clustering (Diarization)
-
Benchmarking: Metrics
• Metrics are important
− Teams may spend months driving down a metric for an eval
• What makes a good metric?
− It’s meaningful for the intended application
− It reflects problems associated with mis-calibration/domain shift
-
Benchmarking: The Detection Error Tradeoff (DET) Curve
[DET curve: probability of false reject (%) vs. probability of false accept (%); example system with Equal Error Rate (EER) = 1%]
• Equal Error Rate (EER) is often quoted as a summary performance measure
• The application operating point depends on the relative costs of the two errors:
− High security (e.g. access control): false acceptance is very costly; users may tolerate rejections for security
− High convenience (e.g. purchase fraud): false rejections alienate customers; any fraud rejection is beneficial
− Balance: a trade-off between the two
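The EER quoted above can be estimated directly from a set of target and impostor trial scores. A minimal sketch (the function name and the simple threshold sweep are illustrative, not from the slides):

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Estimate the Equal Error Rate: the operating point where the
    false-accept and false-reject rates cross."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    # Sweep every pooled score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([np.mean(impostor_scores >= t) for t in thresholds])  # false accepts
    frr = np.array([np.mean(target_scores < t) for t in thresholds])     # false rejects
    # Report the point where the two error curves are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```

With well-separated score distributions the estimate goes to zero; with overlapping scores it lands somewhere on the DET curve between the two extremes.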
-
Benchmarking: Common Metrics
• Equal Error Rate (EER)
− A theoretical operating point demonstrating discriminative power
• Decision Cost Function (DCF)
− Weighted sum of miss and false alarm errors, each with a cost
− Tailored to a specific operating point
• Cost of Log Likelihood Ratio (Cllr)
− Reflects discrimination and calibration performance across all operating points
• R-Precision
− Average position of the true speaker ID in an LLR-ranked list per test file
− Indicative of a real-use triage system for identifying a speaker of interest in the top of the list
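The DCF in the list above can be sketched as a weighted sum of the two error rates at one fixed threshold. The default costs and prior here are illustrative only; each NIST evaluation specifies its own:

```python
import numpy as np

def dcf(target_scores, impostor_scores, threshold,
        c_miss=1.0, c_fa=1.0, p_target=0.01):
    """Detection Cost Function: weighted sum of the miss and
    false-alarm rates at one decision threshold."""
    p_miss = np.mean(np.asarray(target_scores) < threshold)   # rejected targets
    p_fa = np.mean(np.asarray(impostor_scores) >= threshold)  # accepted impostors
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
```

Because the cost depends on a fixed threshold, a system that is mis-calibrated for the deployment conditions pays the penalty here even when its EER looks good.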
-
Benchmarking: NIST Evaluations
• Annual NIST evaluations of speaker verification technology (since 1995)
• Aim: Provide a common paradigm for comparing technologies
• Focus: Conversational telephone speech (text-independent)

[Evaluation cycle: NIST (evaluation coordinator) and the Linguistic Data Consortium (data provider) enable comparison of technologies on a common task; technology developers evaluate and improve; technology consumers/funders set the application domain and parameters]
-
Benchmarking: Community-Driven (publicly available data)
• Robust Automatic Transcription of Speech (RATS) [2011-2016]
− Concerned with heavily degraded audio from radio communications, but constrained to three primes
• Speakers in the Wild (SITW) [2016]
− The first large-scale eval dealing with open-source data and multiple speakers with very heterogeneous conditions
− Now a standard benchmarking dataset in the literature
• VOiCES [2019]
− Focused on a distant speech collection with naturally degraded conditions (noise, babble, etc.)
• Fearless Steps Challenge [2019]
− Dealing with data from the Apollo missions
• VoxSRC [2019]
− Large-scale training data using automated harvesting
− VoxCeleb 1 & 2 are now common datasets for system training
-
Benchmarking: Items of concern
• Metrics:
− Equal error rate is useful in limited applications
− Expertise is required to calibrate a system for held-out data/conditions (multiple DCF points, Cllr, etc.)
• Data Licenses: Publicly available data enables research by newcomers without the initial overhead of license fees
• Development Data: Teams should be aware of the intended conditions of the evaluation and given a small amount of data reflecting those conditions
• Leaderboards: Allowing 1000 submissions enables fine-tuning of a system, but inspires continuous development during an eval
• Statistical Significance: Few speakers or limited trials must be considered when making claims that system X is better than system Y
-
The Evolution of Speaker Recognition Technology
-
Speaker Recognition Evolution
• Speech technology research accelerated with increases in
− Common corpora and evaluations
− Compute and storage capacity
− Application demand

[Timeline, 1930 to present: aural and spectrogram matching → template matching → Dynamic Time-Warping → Vector Quantization → Hidden Markov Models → Gaussian Mixture Models → Support Vector Machines / GMM supervectors → JFA, i-vectors → DNNs. Evaluation moved from small corpora of clean, controlled speech on application corpora to large corpora of realistic, unconstrained speech with common corpora and evaluations.]
-
Speaker Recognition: Framework
[Framework diagram]
• System Training: annotated and unannotated data → feature extraction → training algorithm → system models (system parameters, model configuration, etc.; classes 1..N)
• Speaker Enrollment: Speaker A → feature extraction → speaker model training → speaker model
• Speaker Testing: unknown speaker → feature extraction → scoring against the speaker model → score
-
Speaker Recognition: Features
• Humans use several levels of perceptual cues for speaker recognition
• There are no exclusive speaker identity cues
• Low-level acoustic cues are most applicable for automatic systems

Hierarchy of Perceptual Cues, from high-level learned traits (difficult to extract automatically) to low-level physical traits (easy to extract automatically):
• Semantics, diction, pronunciations, idiosyncrasies (socio-economic status, education, place of birth)
• Prosodics, rhythm, speed, intonation, volume modulation (personality type, parental influence)
• Acoustic aspects of speech: nasal, deep, breathy, rough (anatomical structure of the vocal apparatus)
-
Speaker Recognition: Spectral Features
• Speech is a continuous evolution of the vocal tract
− Need to extract a time series of spectra
− Use a sliding window: 25 ms window, 10 ms shift
• Produces the time-frequency evolution of the spectrum

[Spectrogram: frequency (Hz) vs. time (sec), computed via the Fourier transform magnitude]
-
Speaker Recognition: Spectral Features
[Feature pipeline: Fourier transform magnitude → log() → cosine transform]
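The sliding-window magnitude → log → cosine-transform pipeline can be sketched as follows. This is a simplified illustration (the 25 ms / 10 ms framing follows the earlier slide; a full MFCC front end would also insert a mel filterbank before the log, which is omitted here, and the function name is my own):

```python
import numpy as np

def cepstral_features(signal, sr=16000, win_ms=25, shift_ms=10, n_ceps=13):
    """Sliding-window cepstral features: per frame, Fourier transform
    magnitude -> log() -> cosine transform (keeping n_ceps coefficients)."""
    signal = np.asarray(signal, dtype=float)
    win = int(sr * win_ms / 1000)      # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)  # 160 samples at 16 kHz
    feats = []
    for start in range(0, len(signal) - win + 1, shift):
        frame = signal[start:start + win] * np.hamming(win)
        log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
        # Type-II DCT as a matrix multiply: decorrelates the log spectrum.
        n = len(log_spec)
        k = np.arange(n_ceps)[:, None]
        dct = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
        feats.append(dct @ log_spec)
    return np.array(feats)  # shape: (num_frames, n_ceps)
```

One second of 16 kHz audio yields 98 frames of 13 coefficients, i.e. the time series of spectra the previous slide calls for.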
-
Speaker Recognition: Variability
• Variability refers to changes in effects between training and successive detection attempts
• Variability compensation is applied at several levels (signal, features, classifiers)
− Many are learned from prior similar speech data
• The biggest challenge to practical use of speech-based recognition systems is data variability
• Extrinsic variability: microphones, acoustic environment, transmission channel
• Intrinsic variability: health, stress, role
-
Speaker Recognition: Modelling Approaches
• The evolution of speaker modelling over the last two decades:
− 2000s: GMM-UBM (GMM)
− 2003: GMM-SVM (GMM supervectors)
− 2007: JFA, i-vectors
− 2014: DNN i-vectors (DNN/GMM hybrid)
− 2015: Bottleneck i-vectors
− 2017: X-vectors (DNN)
• Every 3-4 years, a major breakthrough is made and the whole community gets behind it and drives it forward
-
Speaker Recognition: GMM-UBM
• A GMM is a weighted sum of Gaussian distributions:

$$p(\vec{x}\,|\,\lambda_s) = \sum_{i=1}^{M} p_i\, b_i(\vec{x})$$

$$b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(\vec{x}-\vec{\mu}_i)'\,\Sigma_i^{-1}\,(\vec{x}-\vec{\mu}_i)\right)$$

$$\lambda_s = \{p_i, \vec{\mu}_i, \Sigma_i\}$$

where $p_i$ is the mixture weight (Gaussian prior probability), $\vec{\mu}_i$ the mixture mean vector, and $\Sigma_i$ the mixture covariance matrix.
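The likelihood above can be sketched for the diagonal-covariance case most systems used in practice (a hypothetical helper, not SRI code; the sum over components is done in the log domain for numerical stability):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | lambda) = log sum_i p_i * b_i(x) for a
    diagonal-covariance GMM."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    log_terms = []
    for p, mu, var in zip(weights, means, variances):
        diff = x - mu
        # Log of the Gaussian density b_i(x) with diagonal covariance.
        log_b = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var))
                        + np.sum(diff ** 2 / var))
        log_terms.append(np.log(p) + log_b)
    return np.logaddexp.reduce(log_terms)
```

For a single standard-normal component evaluated at its mean, this reduces to $-\frac{1}{2}\log(2\pi)$, a handy sanity check.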
-
Speaker Recognition: GMM-UBM
(1) Extract the feature vector sequence from the speech signal
(2) Train the UBM with speech from many speakers using EM
(3) Adapt the target model from the UBM
(4) Compute the likelihood ratio of the test data:

$$LLR(X) = \log p(X\,|\,\lambda_{target}) - \log p(X\,|\,\lambda_{ubm})$$

$$\log p(X\,|\,\lambda) = \sum_{n=1}^{N} \log\!\left(\sum_{i=1}^{M} p_i\, b_i(\vec{x}_n)\right)$$
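Steps (1)-(4) can be sketched end-to-end for diagonal-covariance GMMs. Each model here is a hypothetical (weights, means, variances) triple; EM training of the UBM and MAP adaptation of the target are assumed to have happened elsewhere:

```python
import numpy as np

def seq_log_likelihood(X, weights, means, variances):
    """log p(X|lambda) = sum_n log sum_i p_i * b_i(x_n), diagonal GMM."""
    total = 0.0
    for x in X:
        x = np.asarray(x, dtype=float)
        d = x.shape[0]
        terms = []
        for p, mu, var in zip(weights, means, variances):
            diff = x - mu
            log_b = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var))
                            + np.sum(diff ** 2 / var))
            terms.append(np.log(p) + log_b)
        total += np.logaddexp.reduce(terms)
    return total

def llr(X, target_model, ubm_model):
    """Step (4): LLR(X) = log p(X|target) - log p(X|ubm)."""
    return seq_log_likelihood(X, *target_model) - seq_log_likelihood(X, *ubm_model)
```

A positive LLR means the test data fits the adapted target model better than the background model.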
-
Speaker Recognition: GMM Supervectors
• For each utterance we can train a GMM by adapting the means of a UBM:

$$\lambda_X = \left(p_i^{ubm}, \vec{\mu}_i^{X}, \Sigma_i^{ubm}\right)$$

• We can then stack the mean vectors for each mixture together to represent the utterance as a supervector $[\vec{\mu}_1; \vec{\mu}_2; \ldots; \vec{\mu}_M]$
• For a 60-dim feature vector and 2048 mixtures, the supervector dimension is 122,880
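The stacking operation is literally a reshape. A short check of the dimensions quoted on the slide (the adapted means here are a zero-filled placeholder standing in for the MAP-adapted means described above):

```python
import numpy as np

# Hypothetical adapted GMM means: M mixtures of D-dimensional features.
M, D = 2048, 60
adapted_means = np.zeros((M, D))

# The supervector is the per-mixture mean vectors stacked end to end.
supervector = adapted_means.reshape(-1)
```

This gives a single fixed-length vector per utterance (2048 x 60 = 122,880 dimensions), which is what makes the SVM and factor-analysis methods on the following slides possible.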
-
Speaker Recognition: GMM-SVM
• GMM supervectors were fed to Support Vector Machines (SVM) for speaker recognition
• A hyperplane was found in a high dimensional space that separated the speakers’ training vector(s) from background vectors
• Variability compensation via Nuisance Attribute Projection (NAP) or Within-Class Covariance Normalization (WCCN) was shown to be effective with GMM supervectors
-
Speaker Recognition: Joint Factor Analysis (JFA)
• We know there are other sources of variability than just the observation noise (number of frames in X)
− For example, linear channels are an additive source
− We can also treat speaker information as an additive source relative to the UBM:

$$\vec{m}_X = \vec{\mu}_{ubm} + \vec{s}_X + \vec{c}_X + \vec{\eta}_X, \quad \vec{s} \sim N(\vec{0}, \Sigma_s),\ \vec{c} \sim N(\vec{0}, \Sigma_c),\ \vec{\eta} \sim N(\vec{0}, \Sigma_n)$$

• Due to the large dimension of the supervector space, we cannot work with full-rank covariance matrices, so we use a lower-dimension approximation:

$$\vec{m}_X = \vec{\mu}_{ubm} + V\vec{y} + U\vec{x} + D\vec{z}, \quad \vec{x}, \vec{y}, \vec{z} \sim N(\vec{0}, I),\ \Sigma_s = VV',\ \Sigma_c = UU',\ \Sigma_n = DD'$$
-
Speaker Recognition: i-vectors
• It was shown that the session factors of JFA contained speaker information
• This led to the advent of total variability modeling:

$$\vec{m}_X = \vec{\mu}_{ubm} + T\vec{w}$$

• T is a low-rank rectangular matrix (the total variability matrix) and $\vec{w}$ is the i-vector
-
Speaker Recognition: Backend Scoring
• Pre-processing of i-vectors into a suitable space for speaker comparisons was essential:

Test/enroll i-vectors → LDA → mean norm → length norm → cosine or PLDA → calibration → test score

• Cosine scores were initially used (simple and fast)
• We borrowed Probabilistic Linear Discriminant Analysis (PLDA) from the field of face recognition
− Still the most widely used backend scoring process
• Calibration transforms raw scores into calibrated log likelihood ratios (LLRs)
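The "simple and fast" cosine backend amounts to length-normalizing the two i-vectors and taking a dot product. A minimal sketch (mean subtraction and LDA from the pipeline above are omitted; function names are illustrative):

```python
import numpy as np

def length_norm(v):
    """Project an i-vector onto the unit sphere (length normalization)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cosine_score(enroll_ivec, test_ivec):
    """Cosine similarity between an enrollment and a test i-vector:
    the dot product of the two length-normalized vectors."""
    return float(np.dot(length_norm(enroll_ivec), length_norm(test_ivec)))
```

PLDA replaces this geometric comparison with a probabilistic model of within- and between-speaker variability, which is why it took over as the standard backend.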
-
Speaker Recognition: DNN i-vectors
• Replace the UBM with a DNN that predicts phone posteriors
− Leverages the speaker-specific phone pronunciation
• A major advantage (~50%) in English telephone speech

[Pipeline: speech → feature extraction (separate front ends for the DNN and the i-vector system) → supervector extraction using state posteriors (“t”, “uh”, “k”, …) from a speech-recognition-trained DNN → i-vector extraction (i-vector model) → scoring → match score]
-
Speaker Recognition: Tandem Bottleneck Features
• A simple and more robust hybrid system

[Pipeline: feature extraction, stacked with bottleneck features (BNF) from a speech-recognition-trained DNN to form tandem features → supervector extraction → i-vector extraction (i-vector model) → scoring → match score]
-
Speaker Recognition: X-Vectors
• In 2017, i-vectors were completely replaced with a DNN-internal vector extraction (speaker embeddings / x-vectors)
• Time-Delay NN: incremental context in the first few layers
• Key ingredient: a statistics pooling layer to summarize frame-level activations into segment-level statistics
• Trained to discriminate speakers directly at the output layer using a large training speaker set

[Architecture: features → frame-level layers → stats layer → segment-level layers yielding embeddings #1 and #2 → output layer; embeddings → scoring → match score]
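The statistics pooling layer is the piece that maps a variable-length utterance to a fixed-size representation. In numpy terms (a conceptual sketch of the operation only, not the actual TDNN/Kaldi implementation):

```python
import numpy as np

def stats_pooling(frame_activations):
    """X-vector-style statistics pooling: summarize T x H frame-level
    activations into one fixed 2H-dimensional segment-level vector by
    concatenating the per-dimension mean and standard deviation."""
    a = np.asarray(frame_activations, dtype=float)
    return np.concatenate([a.mean(axis=0), a.std(axis=0)])
```

Whatever the number of frames T, the output has a fixed dimension 2H, so the segment-level layers above it can be ordinary fully connected layers.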
-
Future Challenges in Speaker Recognition
-
Future Challenges: Conditions
• Calibration
− Ensuring a system meets expected error rates after deployment
• Distant Speech
− Low-SNR audio & reverberation
• The Cocktail Party Problem
− Heavy speaker babble, source separation
• Multi-speaker Audio
− Speaker detection among several dominant speakers
− Diarization is a means to an end
-
Future Challenges: Data
In the era of DNNs, data is gold:
• How to get annotations cheaply? (VoxCeleb)
• How to exploit unannotated data?
• How to automatically detect data shift?
• How to transfer knowledge and information between data domains?
• How to combine multiple modalities (face+voice+text)?
• How to use adversarial learning to generate synthetic data or learn about the speech manifold?
-
Future Challenges: Modelling
• End-to-End Speaker Recognition
− The goal of many research groups
− Difficulty in generalizing to unseen conditions
− Difficulty in producing calibrated likelihood ratios
• Dynamic modelling
− Accounting for the conditions of the ‘current’ comparison
• Intelligent systems
− Knowing when it’s out of the comfort/known zone
− Explaining to a user the internal decisions of the system
-
Headquarters333 Ravenswood AvenueMenlo Park, CA 94025+1.650.859.2000
Princeton, NJ201 Washington RoadPrinceton, NJ 08540+1.609.734.2553
Additional U.S. and international locations
www.sri.com
Thank You