TRANSCRIPT
-
© 2019 SRI International
Speaker Recognition: A Retrospective
VoxSRC Workshop 2019, September 2019 | Graz, Austria
Mitchell McLaren1, Doug Reynolds2
1Speech Technology and Research Laboratory, SRI International, California, USA
2MIT Lincoln Laboratory (MITLL), USA
-
Outline
• Speaker Recognition Benchmarking
• The Evolution of Speaker Recognition Technology
• Future Challenges in Speaker Recognition
-
Speaker Recognition Benchmarking
-
Speaker Recognition Benchmarking
• Why do we need evaluations?
− Motivate research into current or evolving problems
− Provide a common benchmark available to researchers in order to gauge improvements in technology over time
• Proprietary data benchmarks offer limited insight to the research community
• What is needed to run an evaluation?
− Funding
− Data
− Motivation/Passion
− People to pull the strings, and keep the cogs turning
− A broad interest from the community
− Meaningful metrics and outcomes
-
This presentation is concerned with text-independent (Free speech) speaker verification
Speaker Recognition Tasks
• Identification: Whose voice is this?
• Verification/Authentication: Is this Bob’s voice?
• Segmentation and Clustering (Diarization)
-
Benchmarking: Metrics
• Metrics are important
− Teams may spend months driving down a metric for an eval
• What makes a good metric?
− It’s meaningful for the intended application
− It reflects problems associated with mis-calibration/domain shift
-
Benchmarking: The Detection Error Tradeoff (DET) Curve
[DET curve: probability of false reject (%) vs. probability of false accept (%); example system with Equal Error Rate (EER) = 1%]
• Equal Error Rate (EER) is often quoted as a summary performance measure
• The application operating point depends on the relative costs of the two errors:
− High security (e.g. access control): false acceptance is very costly; users may tolerate rejections for security
− High convenience (e.g. purchase fraud): false rejections alienate customers; any fraud rejection is beneficial
− Balance: a trade-off between the two
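The EER quoted above can be estimated directly from a set of target and impostor trial scores. A minimal sketch (the function name and the simple threshold sweep are illustrative, not from the slides):

```python
import numpy as np

def eer(target_scores, impostor_scores):
    """Estimate the Equal Error Rate: the operating point where the
    false-accept and false-reject rates cross."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    # Sweep every pooled score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([np.mean(impostor_scores >= t) for t in thresholds])  # false accepts
    frr = np.array([np.mean(target_scores < t) for t in thresholds])     # false rejects
    # Report the point where the two error curves are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```

With well-separated score distributions the estimate goes to zero; with overlapping scores it lands somewhere on the DET curve between the two extremes.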
-
Benchmarking: Common Metrics
• Equal Error Rate (EER)
− A theoretical operating point demonstrating discriminative power
• Decision Cost Function (DCF)
− Weighted sum of miss and false alarm errors, each with a cost
− Tailored to a specific operating point
• Cost of Log Likelihood Ratio (Cllr)
− Reflects discrimination and calibration performance across all operating points
• R-Precision
− Average position of the true speaker ID in an LLR-ranked list per test file
− Indicative of a real-use triage system for identifying a speaker of interest in the top of the list
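The DCF in the list above can be sketched as a weighted sum of the two error rates at one fixed threshold. The default costs and prior here are illustrative only; each NIST evaluation specifies its own:

```python
import numpy as np

def dcf(target_scores, impostor_scores, threshold,
        c_miss=1.0, c_fa=1.0, p_target=0.01):
    """Detection Cost Function: weighted sum of the miss and
    false-alarm rates at one decision threshold."""
    p_miss = np.mean(np.asarray(target_scores) < threshold)   # rejected targets
    p_fa = np.mean(np.asarray(impostor_scores) >= threshold)  # accepted impostors
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
```

Because the cost depends on a fixed threshold, a system that is mis-calibrated for the deployment conditions pays the penalty here even when its EER looks good.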
-
Benchmarking: NIST Evaluations
• Annual NIST evaluations of speaker verification technology (since 1995)
• Aim: Provide a common paradigm for comparing technologies
• Focus: Conversational telephone speech (text-independent)

[Evaluation cycle: NIST (evaluation coordinator) and the Linguistic Data Consortium (data provider) enable comparison of technologies on a common task; technology developers evaluate and improve; technology consumers/funders set the application domain and parameters]
-
Benchmarking: Community-Driven (publicly available data)
• Robust Automatic Transcription of Speech (RATS) [2011-2016]
− Concerned with heavily degraded audio from radio communications, but constrained to three primes
• Speakers in the Wild (SITW) [2016]
− The first large-scale eval dealing with open-source data and multiple speakers with very heterogeneous conditions
− Now a standard benchmarking dataset in the literature
• VOiCES [2019]
− Focused on a distant speech collection with naturally degraded conditions (noise, babble, etc.)
• Fearless Steps Challenge [2019]
− Dealing with data from the Apollo missions
• VoxSRC [2019]
− Large-scale training data using automated harvesting
− VoxCeleb 1 & 2 are now common datasets for system training
-
Benchmarking: Items of concern
• Metrics:
− Equal error rate is useful in limited applications
− Expertise is required to calibrate a system for held-out data/conditions (multiple DCF points, Cllr, etc.)
• Data Licenses: Publicly available data enables research by newcomers without the initial overhead of license fees
• Development Data: Teams should be aware of the intended conditions of the evaluation and given a small amount of data reflecting those conditions
• Leaderboards: Allowing 1000 submissions enables fine-tuning of a system, but inspires continuous development during an eval
• Statistical Significance: Few speakers or limited trials must be considered when making claims that system X is better than system Y
-
The Evolution of Speaker Recognition Technology
-
Speaker Recognition Evolution
• Speech technology research accelerated with increases in
− Common corpora and evaluations
− Compute and storage capacity
− Application demand

[Timeline, 1930 to present: aural and spectrogram matching → template matching → Dynamic Time-Warping → Vector Quantization → Hidden Markov Models → Gaussian Mixture Models → Support Vector Machines / GMM supervectors → JFA, i-vectors → DNNs. Evaluation moved from small corpora of clean, controlled speech on application corpora to large corpora of realistic, unconstrained speech with common corpora and evaluations.]
-
Speaker Recognition: Framework
[Framework diagram]
• System Training: annotated and unannotated data → feature extraction → training algorithm → system models (system parameters, model configuration, etc.; classes 1..N)
• Speaker Enrollment: Speaker A → feature extraction → speaker model training → speaker model
• Speaker Testing: unknown speaker → feature extraction → scoring against the speaker model → score
-
Speaker Recognition: Features
• Humans use several levels of perceptual cues for speaker recognition
• There are no exclusive speaker identity cues
• Low-level acoustic cues are most applicable for automatic systems

Hierarchy of Perceptual Cues, from high-level learned traits (difficult to extract automatically) to low-level physical traits (easy to extract automatically):
• Semantics, diction, pronunciations, idiosyncrasies (socio-economic status, education, place of birth)
• Prosodics, rhythm, speed, intonation, volume modulation (personality type, parental influence)
• Acoustic aspects of speech: nasal, deep, breathy, rough (anatomical structure of the vocal apparatus)
-
Speaker Recognition: Spectral Features
• Speech is a continuous evolution of the vocal tract
− Need to extract a time series of spectra
− Use a sliding window: 25 ms window, 10 ms shift
• Produces the time-frequency evolution of the spectrum

[Spectrogram: frequency (Hz) vs. time (sec), computed via the Fourier transform magnitude]
-
Speaker Recognition: Spectral Features
[Feature pipeline: Fourier transform magnitude → log() → cosine transform]
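The sliding-window magnitude → log → cosine-transform pipeline can be sketched as follows. This is a simplified illustration (the 25 ms / 10 ms framing follows the earlier slide; a full MFCC front end would also insert a mel filterbank before the log, which is omitted here, and the function name is my own):

```python
import numpy as np

def cepstral_features(signal, sr=16000, win_ms=25, shift_ms=10, n_ceps=13):
    """Sliding-window cepstral features: per frame, Fourier transform
    magnitude -> log() -> cosine transform (keeping n_ceps coefficients)."""
    signal = np.asarray(signal, dtype=float)
    win = int(sr * win_ms / 1000)      # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)  # 160 samples at 16 kHz
    feats = []
    for start in range(0, len(signal) - win + 1, shift):
        frame = signal[start:start + win] * np.hamming(win)
        log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
        # Type-II DCT as a matrix multiply: decorrelates the log spectrum.
        n = len(log_spec)
        k = np.arange(n_ceps)[:, None]
        dct = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
        feats.append(dct @ log_spec)
    return np.array(feats)  # shape: (num_frames, n_ceps)
```

One second of 16 kHz audio yields 98 frames of 13 coefficients, i.e. the time series of spectra the previous slide calls for.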
-
Speaker Recognition: Variability
• Variability refers to changes in effects between training and successive detection attempts
• Variability compensation is applied at several levels (signal, features, classifiers)
− Many are learned from prior similar speech data
• The biggest challenge to practical use of speech-based recognition systems is data variability
• Extrinsic variability: microphones, acoustic environment, transmission channel
• Intrinsic variability: health, stress, role
-
Speaker Recognition: Modelling Approaches
• The evolution of speaker modelling over the last two decades:
− 2000s: GMM-UBM (GMM)
− 2003: GMM-SVM (GMM supervectors)
− 2007: JFA, i-vectors
− 2014: DNN i-vectors (DNN/GMM hybrid)
− 2015: Bottleneck i-vectors
− 2017: X-vectors (DNN)
• Every 3-4 years, a major breakthrough is made and the whole community gets behind it and drives it forward
-
Speaker Recognition: GMM-UBM
• A GMM is a weighted sum of Gaussian distributions:

$$p(\vec{x}\,|\,\lambda_s) = \sum_{i=1}^{M} p_i\, b_i(\vec{x})$$

$$b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(\vec{x}-\vec{\mu}_i)'\,\Sigma_i^{-1}\,(\vec{x}-\vec{\mu}_i)\right)$$

$$\lambda_s = \{p_i, \vec{\mu}_i, \Sigma_i\}$$

where $p_i$ is the mixture weight (Gaussian prior probability), $\vec{\mu}_i$ the mixture mean vector, and $\Sigma_i$ the mixture covariance matrix.
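The likelihood above can be sketched for the diagonal-covariance case most systems used in practice (a hypothetical helper, not SRI code; the sum over components is done in the log domain for numerical stability):

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | lambda) = log sum_i p_i * b_i(x) for a
    diagonal-covariance GMM."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    log_terms = []
    for p, mu, var in zip(weights, means, variances):
        diff = x - mu
        # Log of the Gaussian density b_i(x) with diagonal covariance.
        log_b = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var))
                        + np.sum(diff ** 2 / var))
        log_terms.append(np.log(p) + log_b)
    return np.logaddexp.reduce(log_terms)
```

For a single standard-normal component evaluated at its mean, this reduces to $-\frac{1}{2}\log(2\pi)$, a handy sanity check.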
-
Speaker Recognition: GMM-UBM
(1) Extract the feature vector sequence from the speech signal
(2) Train the UBM with speech from many speakers using EM
(3) Adapt the target model from the UBM
(4) Compute the likelihood ratio of the test data:

$$LLR(X) = \log p(X\,|\,\lambda_{target}) - \log p(X\,|\,\lambda_{ubm})$$

$$\log p(X\,|\,\lambda) = \sum_{n=1}^{N} \log\!\left(\sum_{i=1}^{M} p_i\, b_i(\vec{x}_n)\right)$$
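Steps (1)-(4) can be sketched end-to-end for diagonal-covariance GMMs. Each model here is a hypothetical (weights, means, variances) triple; EM training of the UBM and MAP adaptation of the target are assumed to have happened elsewhere:

```python
import numpy as np

def seq_log_likelihood(X, weights, means, variances):
    """log p(X|lambda) = sum_n log sum_i p_i * b_i(x_n), diagonal GMM."""
    total = 0.0
    for x in X:
        x = np.asarray(x, dtype=float)
        d = x.shape[0]
        terms = []
        for p, mu, var in zip(weights, means, variances):
            diff = x - mu
            log_b = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var))
                            + np.sum(diff ** 2 / var))
            terms.append(np.log(p) + log_b)
        total += np.logaddexp.reduce(terms)
    return total

def llr(X, target_model, ubm_model):
    """Step (4): LLR(X) = log p(X|target) - log p(X|ubm)."""
    return seq_log_likelihood(X, *target_model) - seq_log_likelihood(X, *ubm_model)
```

A positive LLR means the test data fits the adapted target model better than the background model.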
-
Speaker Recognition: GMM Supervectors
• For each utterance we can train a GMM by adapting the means of a UBM:

$$\lambda_X = \left(p_i^{ubm}, \vec{\mu}_i^{X}, \Sigma_i^{ubm}\right)$$

• We can then stack the mean vectors for each mixture together to represent the utterance as a supervector $[\vec{\mu}_1; \vec{\mu}_2; \ldots; \vec{\mu}_M]$
• For a 60-dim feature vector and 2048 mixtures, the supervector dimension is 122,880
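The stacking operation is literally a reshape. A short check of the dimensions quoted on the slide (the adapted means here are a zero-filled placeholder standing in for the MAP-adapted means described above):

```python
import numpy as np

# Hypothetical adapted GMM means: M mixtures of D-dimensional features.
M, D = 2048, 60
adapted_means = np.zeros((M, D))

# The supervector is the per-mixture mean vectors stacked end to end.
supervector = adapted_means.reshape(-1)
```

This gives a single fixed-length vector per utterance (2048 x 60 = 122,880 dimensions), which is what makes the SVM and factor-analysis methods on the following slides possible.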
-
Speaker Recognition: GMM-SVM
• GMM supervectors were fed to Support Vector Machines (SVM) for speaker recognition
• A hyperplane was found in a high dimensional space that separated the speakers’ training vector(s) from background vectors
• Variability compensation via Nuisance Attribute Projection (NAP) or Within-Class Covariance Normalization (WCCN) was shown to be effective with GMM supervectors
-
Speaker Recognition: Joint Factor Analysis (JFA)
• We know there are other sources of variability than just the observation noise (number of frames in X)
− For example, linear channels are an additive source
− We can also treat speaker information as an additive source relative to the UBM:

$$\vec{m}_X = \vec{\mu}_{ubm} + \vec{s}_X + \vec{c}_X + \vec{\eta}_X, \quad \vec{s} \sim N(\vec{0}, \Sigma_s),\ \vec{c} \sim N(\vec{0}, \Sigma_c),\ \vec{\eta} \sim N(\vec{0}, \Sigma_n)$$

• Due to the large dimension of the supervector space, we cannot work with full-rank covariance matrices, so we use a lower-dimension approximation:

$$\vec{m}_X = \vec{\mu}_{ubm} + V\vec{y} + U\vec{x} + D\vec{z}, \quad \vec{x}, \vec{y}, \vec{z} \sim N(\vec{0}, I),\ \Sigma_s = VV',\ \Sigma_c = UU',\ \Sigma_n = DD'$$
-
Speaker Recognition: i-vectors
• It was shown that the session factors of JFA contained speaker information
• This led to the advent of total variability modeling:

$$\vec{m}_X = \vec{\mu}_{ubm} + T\vec{w}$$

• T is a low-rank rectangular matrix (the total variability matrix) and $\vec{w}$ is the i-vector
-
Speaker Recognition: Backend Scoring
• Pre-processing of i-vectors into a suitable space for speaker comparisons was essential:

Test/enroll i-vectors → LDA → mean norm → length norm → cosine or PLDA → calibration → test score

• Cosine scores were initially used (simple and fast)
• We borrowed Probabilistic Linear Discriminant Analysis (PLDA) from the field of face recognition
− Still the most widely used backend scoring process
• Calibration transforms raw scores into calibrated log likelihood ratios (LLRs)
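The "simple and fast" cosine backend amounts to length-normalizing the two i-vectors and taking a dot product. A minimal sketch (mean subtraction and LDA from the pipeline above are omitted; function names are illustrative):

```python
import numpy as np

def length_norm(v):
    """Project an i-vector onto the unit sphere (length normalization)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def cosine_score(enroll_ivec, test_ivec):
    """Cosine similarity between an enrollment and a test i-vector:
    the dot product of the two length-normalized vectors."""
    return float(np.dot(length_norm(enroll_ivec), length_norm(test_ivec)))
```

PLDA replaces this geometric comparison with a probabilistic model of within- and between-speaker variability, which is why it took over as the standard backend.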
-
Speaker Recognition: DNN i-vectors
• Replace the UBM with a DNN that predicts phone posteriors
− Leverages the speaker-specific phone pronunciation
• A major advantage (~50%) in English telephone speech

[Pipeline: speech → feature extraction (separate front ends for the DNN and the i-vector system) → supervector extraction using state posteriors (“t”, “uh”, “k”, …) from a speech-recognition-trained DNN → i-vector extraction (i-vector model) → scoring → match score]
-
Speaker Recognition: Tandem Bottleneck Features
• A simple and more robust hybrid system

[Pipeline: feature extraction, stacked with bottleneck features (BNF) from a speech-recognition-trained DNN to form tandem features → supervector extraction → i-vector extraction (i-vector model) → scoring → match score]
-
Speaker Recognition: X-Vectors
• In 2017, i-vectors were completely replaced with a DNN-internal vector extraction (speaker embeddings / x-vectors)
• Time-Delay NN: incremental context in the first few layers
• Key ingredient: a statistics pooling layer to summarize frame-level activations into segment-level statistics
• Trained to discriminate speakers directly at the output layer using a large training speaker set

[Architecture: features → frame-level layers → stats layer → segment-level layers yielding embeddings #1 and #2 → output layer; embeddings → scoring → match score]
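The statistics pooling layer is the piece that maps a variable-length utterance to a fixed-size representation. In numpy terms (a conceptual sketch of the operation only, not the actual TDNN/Kaldi implementation):

```python
import numpy as np

def stats_pooling(frame_activations):
    """X-vector-style statistics pooling: summarize T x H frame-level
    activations into one fixed 2H-dimensional segment-level vector by
    concatenating the per-dimension mean and standard deviation."""
    a = np.asarray(frame_activations, dtype=float)
    return np.concatenate([a.mean(axis=0), a.std(axis=0)])
```

Whatever the number of frames T, the output has a fixed dimension 2H, so the segment-level layers above it can be ordinary fully connected layers.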
-
Future Challenges in Speaker Recognition
-
Future Challenges: Conditions
• Calibration
− Ensuring a system meets expected error rates after deployment
• Distant Speech
− Low-SNR audio & reverberation
• The Cocktail Party Problem
− Heavy speaker babble, source separation
• Multi-speaker Audio
− Speaker detection among several dominant speakers
− Diarization is a means to an end
-
Future Challenges: Data
In the era of DNNs, data is gold:
• How to get annotations cheaply? (VoxCeleb)
• How to exploit unannotated data?
• How to automatically detect data shift?
• How to transfer knowledge and information between data domains?
• How to combine multiple modalities (face+voice+text)?
• How to use adversarial learning to generate synthetic data or learn about the speech manifold?
-
Future Challenges: Modelling
• End-to-End Speaker Recognition
− The goal of many research groups
− Difficulty in generalizing to unseen conditions
− Difficulty in producing calibrated likelihood ratios
• Dynamic modelling
− Accounting for the conditions of the ‘current’ comparison
• Intelligent systems
− Knowing when it’s out of the comfort/known zone
− Explaining to a user the internal decisions of the system
-
Headquarters333 Ravenswood AvenueMenlo Park, CA 94025+1.650.859.2000
Princeton, NJ201 Washington RoadPrinceton, NJ 08540+1.609.734.2553
Additional U.S. and international locations
www.sri.com
Thank You