
© 2019 SRI International
Speaker Recognition: A Retrospective
VoxSRC Workshop 2019, September 2019 | Graz, Austria
Mitchell McLaren (1), Doug Reynolds (2)
(1) Speech Technology and Research Laboratory, SRI International, California, USA; (2) MIT Lincoln Laboratory, USA




    Outline

    • Speaker Recognition Benchmarking

    • The Evolution of Speaker Recognition Technology

    • Future Challenges in Speaker Recognition


Speaker Recognition Benchmarking

• Why do we need evaluations?
  − Motivate research into current or evolving problems
  − Provide a common benchmark so researchers can gauge improvements in technology over time
  − Proprietary data benchmarks offer limited insight to the research community

• What is needed to run an evaluation?
  − Funding
  − Data
  − Motivation/passion
  − People to pull the strings and keep the cogs turning
  − Broad interest from the community
  − Meaningful metrics and outcomes


Speaker Recognition Tasks

This presentation is concerned with text-independent (free speech) speaker verification.

• Identification: whose voice is this?
• Verification/Authentication: is this Bob's voice?
• Segmentation and Clustering (Diarization)


    Benchmarking: Metrics

• Metrics are important
  − Teams may spend months driving down a metric for an eval

• What makes a good metric?
  − It is meaningful for the intended application
  − It reflects problems associated with mis-calibration/domain shift

Benchmarking: The Detection Error Tradeoff (DET) Curve

[DET curve: probability of false accept (in %) vs. probability of false reject (in %); Equal Error Rate (EER) = 1%; operating regions annotated from high convenience through balance to high security]

• Access control: false acceptance is very costly; users may tolerate rejections for security
• Purchase fraud: false rejections alienate customers; any fraud rejection is beneficial
• Equal Error Rate (EER) is often quoted as a summary performance measure
• The application operating point depends on the relative costs of the two errors
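As an illustration of the EER summary point, a minimal sketch (not from the slides) of computing it from target and non-target trial scores:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Find the operating point where false-accept and false-reject rates cross."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]  # ascending: low scores rejected first
    # At threshold index i: misses = targets at/below i, false accepts = nontargets above i.
    fr = np.cumsum(labels) / labels.sum()                 # false-reject rate
    fa = 1 - np.cumsum(1 - labels) / (1 - labels).sum()   # false-accept rate
    idx = np.argmin(np.abs(fr - fa))
    return (fr[idx] + fa[idx]) / 2

tgt = np.array([2.0, 1.5, 1.8, 0.2])   # made-up same-speaker trial scores
non = np.array([-1.0, -0.5, 0.5, -2.0])  # made-up different-speaker trial scores
print(eer(tgt, non))  # 0.25
```

With these toy scores one target falls below the crossing threshold and one non-target above it, giving matched error rates of 25%.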


    Benchmarking: Common Metrics

• Equal Error Rate (EER)
  − A theoretical operating point demonstrating discriminative power

• Decision Cost Function (DCF)
  − Weighted sum of miss and false alarm errors, each with a cost
  − Tailored to a specific operating point

• Cost of Log Likelihood Ratio (Cllr)
  − Reflects discrimination and calibration performance across all operating points

• R-Precision
  − Average position of the true speaker ID in an LLR-ranked list per test file
  − Indicative of a real-use triage system for identifying a speaker of interest at the top of the list
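A minimal sketch of the DCF and Cllr computations described above; the cost parameters and scores below are illustrative, not the official NIST values:

```python
import numpy as np

def dcf(miss_rate, fa_rate, c_miss=1.0, c_fa=1.0, p_target=0.05):
    """Weighted sum of miss and false-alarm errors at one operating point."""
    return c_miss * p_target * miss_rate + c_fa * (1 - p_target) * fa_rate

def cllr(target_llrs, nontarget_llrs):
    """Cost of log-likelihood ratio: penalizes poor discrimination and poor
    calibration, averaged over both trial types (log base 2)."""
    c_tgt = np.mean(np.log2(1 + np.exp(-target_llrs)))
    c_non = np.mean(np.log2(1 + np.exp(nontarget_llrs)))
    return 0.5 * (c_tgt + c_non)

print(round(dcf(0.1, 0.02), 6))  # 0.024
print(round(cllr(np.array([2.0, 3.0]), np.array([-2.0, -3.0])), 3))
```

A perfectly calibrated, perfectly discriminating system drives Cllr toward 0; a system that always outputs LLR = 0 scores exactly 1 bit.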

Benchmarking: NIST Evaluations

    • Annual NIST evaluations of speaker verification technology (since 1995)

    • Aim: Provide a common paradigm for comparing technologies

    • Focus: Conversational telephone speech (text-independent)

[Diagram: NIST as evaluation coordinator and the Linguistic Data Consortium as data provider drive an evaluate-improve cycle; technology developers compare technologies on a common task, while technology consumers/funders supply the application domain and parameters]

Benchmarking: Community-Driven

• Robust Automatic Transcription of Speech (RATS) [2011-2016]
  − Concerned with heavily degraded audio from radio communications, but constrained to three primes

• Speakers in the Wild (SITW) [2016]
  − The first large-scale eval dealing with open-source, publicly available data and multiple speakers in very heterogeneous conditions
  − Now a standard benchmarking dataset in the literature

• VOiCES [2019]
  − Focused on a distant-speech collection with naturally degraded conditions (noise, babble, etc.)

• Fearless Steps Challenge [2019]
  − Dealing with data from the Apollo missions

• VoxSRC [2019]
  − Large-scale training data using automated harvesting
  − VoxCeleb 1 & 2 are now common datasets for system training

Benchmarking: Items of Concern

• Metrics:
  − Equal error rate is useful only in limited applications
  − Expertise is required to calibrate a system for held-out data/conditions (multiple DCF points, Cllr, etc.)

• Data licenses: publicly available data enables research by newcomers without the initial overhead of license fees

• Development data: teams should be made aware of the intended conditions of the evaluation and given a small amount of data reflecting those conditions

• Leaderboards: allowing 1000 submissions enables fine-tuning of a system, but it also inspires continuous development during an eval

• Statistical significance: few speakers or limited trials must be considered when claiming that system X is better than system Y


The Evolution of Speaker Recognition Technology

Speaker Recognition Evolution

• Speech technology research accelerated with increases in:
  − Common corpora and evaluations
  − Compute and storage capacity
  − Application demand

[Timeline, 1930 to present: aural and spectrogram matching → template matching → dynamic time warping → vector quantization → hidden Markov models → Gaussian mixture models → support vector machines / GMM supervectors → JFA and i-vectors → DNNs. Evaluation data moved from small, clean, controlled corpora through common corpora and evaluations to large, realistic, unconstrained speech, with evaluation on application corpora]

Speaker Recognition: Framework

[Diagram: System training: annotated and unannotated data pass through feature extraction and a training algorithm to produce system models (system parameters, model configuration, etc.). Speaker enrollment: feature extraction on Speaker A's audio plus speaker model training yields a speaker model. Speaker testing: feature extraction on audio from an unknown speaker, scored against the enrolled speaker model, yields a score]

Speaker Recognition: Features

• Humans use several levels of perceptual cues for speaker recognition
• There are no exclusive speaker identity cues
• Low-level acoustic cues are the most applicable for automatic systems

Hierarchy of perceptual cues, from high-level (learned traits, difficult to automatically extract) to low-level (physical traits, easy to automatically extract):
  − Semantics, diction, pronunciations, idiosyncrasies (socio-economic status, education, place of birth)
  − Prosodics, rhythm, speed, intonation, volume modulation (personality type, parental influence)
  − Acoustic aspects of speech: nasal, deep, breathy, rough (anatomical structure of the vocal apparatus)


    Speaker Recognition: Spectral Features

• Speech is a continuous evolution of the vocal tract
  − Need to extract a time series of spectra
  − Use a sliding window: 25 ms window, 10 ms shift
• Produces the time-frequency evolution of the spectrum

[Spectrogram: Fourier transform magnitude, frequency (Hz) vs. time (sec)]

Pipeline: Fourier transform magnitude → log() → cosine transform
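The sliding-window pipeline above can be sketched as follows; this is an illustrative simplification (it omits the mel filterbank and other details of production front-ends):

```python
import numpy as np
from scipy.fft import dct

def cepstral_features(signal, sr=16000, win_ms=25, shift_ms=10, n_ceps=20):
    """Sliding-window cepstra: FFT magnitude -> log -> cosine transform."""
    win = int(sr * win_ms / 1000)      # 25 ms window -> 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)  # 10 ms shift  -> 160 samples
    n_frames = 1 + (len(signal) - win) // shift
    frames = np.stack([signal[i * shift:i * shift + win] for i in range(n_frames)])
    frames = frames * np.hamming(win)            # taper window edges
    mag = np.abs(np.fft.rfft(frames, axis=1))    # Fourier transform magnitude
    log_mag = np.log(mag + 1e-10)                # compress dynamic range
    return dct(log_mag, type=2, axis=1, norm='ortho')[:, :n_ceps]

x = np.random.randn(16000)  # 1 s of noise as a stand-in for speech
feats = cepstral_features(x)
print(feats.shape)  # (98, 20): ~100 frames at a 10 ms shift
```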

Speaker Recognition: Variability

• Variability refers to changes in conditions between training and successive detection attempts
• Variability compensation is applied at several levels (signal, features, classifiers)
  − Many compensation methods are learned from prior similar speech data
• The biggest challenge to practical use of speech-based recognition systems is data variability
  − Extrinsic variability: microphones, acoustic environment, transmission channel
  − Intrinsic variability: health, stress, role

Speaker Recognition: Modelling Approaches

• The evolution of speaker modelling over the last two decades:
  − 2000s: GMM-UBM (GMM)
  − 2003: GMM-SVM (GMM supervectors)
  − 2007: JFA, i-vectors
  − 2014: DNN i-vectors (DNN/GMM hybrid)
  − 2015: bottleneck i-vectors
  − 2017: x-vectors (DNN)
• Every 3-4 years a major breakthrough is made, and the whole community gets behind it and drives it forward

Speaker Recognition: GMM-UBM

• A GMM is a weighted sum of Gaussian distributions:

  p(\vec{x} \mid \lambda_s) = \sum_{i=1}^{M} p_i \, b_i(\vec{x})

  b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (\vec{x} - \vec{\mu}_i)' \, \Sigma_i^{-1} (\vec{x} - \vec{\mu}_i) \right)

  \lambda_s = \{ p_i, \vec{\mu}_i, \Sigma_i \}

  where p_i = mixture weight (prior probability), \vec{\mu}_i = mixture mean vector, \Sigma_i = mixture covariance matrix.
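A minimal sketch of evaluating the GMM density above; all parameters here are invented for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative GMM: p(x | lambda) = sum_i p_i * b_i(x).
weights = np.array([0.6, 0.4])                       # mixture priors p_i
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # mixture means mu_i
covs = [np.eye(2), 2.0 * np.eye(2)]                   # mixture covariances Sigma_i

def gmm_pdf(x, weights, means, covs):
    """Weighted sum of Gaussian component densities b_i(x)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(gmm_pdf(np.array([0.5, -0.5]), weights, means, covs))
```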

Speaker Recognition: GMM-UBM

(1) Extract the feature vector sequence from the speech signal
(2) Train the UBM with speech from many speakers using EM
(3) Adapt the target model from the UBM
(4) Compute the likelihood ratio of the test data:

  LLR(X) = \log p(X \mid \lambda_{target}) - \log p(X \mid \lambda_{ubm})

  \log p(X \mid \lambda) = \sum_{n=1}^{N} \log \left( \sum_{i=1}^{M} p_i \, b_i(\vec{x}_n) \right)
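Steps (2)-(4) can be sketched as follows, using sklearn's GaussianMixture and a simplified relevance-MAP adaptation of the means only; data, relevance factor, and model sizes are invented for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# (2) Train a small UBM on pooled background features from many speakers.
background = rng.normal(0.0, 1.0, size=(2000, 4))
ubm = GaussianMixture(n_components=8, covariance_type='diag',
                      random_state=0).fit(background)

# (3) Relevance-MAP adaptation of the UBM means toward the target's data.
def map_adapt_means(ubm, feats, r=16.0):
    post = ubm.predict_proba(feats)            # responsibilities per mixture
    n_i = post.sum(axis=0)                     # soft counts n_i
    e_i = (post.T @ feats) / np.maximum(n_i, 1e-8)[:, None]  # posterior means E_i
    alpha = (n_i / (n_i + r))[:, None]         # data-dependent adaptation weight
    return alpha * e_i + (1 - alpha) * ubm.means_

enroll = rng.normal(0.5, 1.0, size=(300, 4))   # enrollment data for the target speaker
target = GaussianMixture(n_components=8, covariance_type='diag')
target.weights_ = ubm.weights_                 # copy UBM parameters...
target.covariances_ = ubm.covariances_
target.precisions_cholesky_ = ubm.precisions_cholesky_
target.means_ = map_adapt_means(ubm, enroll)   # ...and swap in the adapted means

# (4) LLR: average frame log-likelihood under target model minus under UBM.
test = rng.normal(0.5, 1.0, size=(200, 4))
llr = target.score(test) - ubm.score(test)
print(round(llr, 3))  # same-speaker trial; expect a positive score
```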

Speaker Recognition: GMM Supervectors

• For each utterance we can train a GMM by adapting the means of a UBM:

  \lambda_X = \{ p_i^{ubm}, \vec{\mu}_i^{X}, \Sigma_i^{ubm} \}

• We can then stack the mean vectors of each mixture together to represent the utterance as a supervector [\vec{\mu}_1; \vec{\mu}_2; \ldots; \vec{\mu}_M]
• For a 60-dim feature vector and 2048 mixtures, the supervector dimension is 122,880
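The stacking step amounts to a simple concatenation; a sketch with the dimensions quoted above:

```python
import numpy as np

# Stacking adapted mixture means into a supervector (sizes from the slide).
n_mixtures, feat_dim = 2048, 60
adapted_means = np.zeros((n_mixtures, feat_dim))   # mu_1 ... mu_M from MAP adaptation
supervector = adapted_means.reshape(-1)            # concatenate [mu_1; mu_2; ...; mu_M]
print(supervector.shape)  # (122880,)
```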

Speaker Recognition: GMM-SVM

• GMM supervectors were fed to Support Vector Machines (SVMs) for speaker recognition
• A hyperplane was found in a high-dimensional space that separated a speaker's training vector(s) from background vectors
• Variability compensation via Nuisance Attribute Projection (NAP) or Within-Class Covariance Normalization (WCCN) was shown to be effective with GMM supervectors
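A toy sketch of the one-speaker-vs-background SVM formulation; the vectors here are low-dimensional random stand-ins for supervectors, and NAP/WCCN compensation is omitted:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

# A few enrollment "supervectors" for the target speaker vs. many background vectors.
background = rng.normal(0.0, 1.0, size=(200, 50))
target = rng.normal(0.3, 1.0, size=(3, 50))  # small offset stands in for speaker identity

X = np.vstack([target, background])
y = np.array([1] * len(target) + [0] * len(background))
# class_weight='balanced' compensates for the 3-vs-200 class imbalance.
svm = LinearSVC(C=1.0, class_weight='balanced', max_iter=10000).fit(X, y)

test_vec = rng.normal(0.3, 1.0, size=(1, 50))
print(svm.decision_function(test_vec))  # signed distance from the separating hyperplane
```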

Speaker Recognition: Joint Factor Analysis (JFA)

• We know there are other sources of variability than just the observation noise (number of frames in X)
  − For example, linear channels are an additive source
  − We can also treat speaker information as an additive source relative to the UBM:

  \vec{m}_X = \vec{\mu}_{ubm} + \vec{s} + \vec{c} + \vec{\eta}, \quad \vec{s} \sim N(\vec{0}, \Sigma_s), \; \vec{c} \sim N(\vec{0}, \Sigma_c), \; \vec{\eta} \sim N(\vec{0}, \Sigma_n)

• Due to the large dimension of the supervector space, we cannot work with full-rank covariance matrices, so we use lower-dimensional approximations:

  \vec{m}_X = \vec{\mu}_{ubm} + V\vec{y} + U\vec{x} + D\vec{z}, \quad \vec{x}, \vec{y}, \vec{z} \sim N(\vec{0}, I), \; \Sigma_s = VV', \; \Sigma_c = UU', \; \Sigma_n = DD'

Speaker Recognition: i-vectors

    • It was shown that the session factors of JFA contained speaker information

    • This led to the advent of total variability modeling

    • T is a rectangular matrix with rank
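For reference, the total variability model alluded to here is conventionally written as follows (standard formulation from the i-vector literature, stated here for completeness):

```latex
% Total variability model: one low-rank rectangular matrix T captures both
% speaker and session variability; the posterior mean of w is the i-vector.
\vec{m}_X = \vec{\mu}_{ubm} + T\vec{w}, \qquad \vec{w} \sim N(\vec{0}, I)
```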

Speaker Recognition: Backend Scoring

• Pre-processing of i-vectors into a suitable space for speaker comparisons was essential: LDA → mean norm → length norm
• Cosine scores were initially used (simple and fast)
• We borrowed Probabilistic Linear Discriminant Analysis (PLDA) from the field of face recognition
  − Still the most widely used backend scoring process
• Calibration transforms raw scores into calibrated log-likelihood ratios (LLRs)

[Pipeline: test/enroll i-vectors → LDA → mean norm → length norm → cosine or PLDA → calibration → test score]
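A sketch of the mean norm, length norm, and cosine scoring steps (LDA, PLDA, and calibration omitted); the vectors here are random stand-ins for i-vectors:

```python
import numpy as np

def length_norm(v):
    """Project an embedding onto the unit sphere (the 'length norm' step)."""
    return v / np.linalg.norm(v)

def cosine_score(enroll, test):
    """Cosine similarity between length-normalized vectors."""
    return float(np.dot(length_norm(enroll), length_norm(test)))

rng = np.random.default_rng(0)
mean = rng.normal(size=400)                  # global mean from held-out data (made up)
enroll = rng.normal(size=400) + mean
test = enroll + 0.1 * rng.normal(size=400)   # near-duplicate: same-speaker stand-in
other = rng.normal(size=400) + mean          # independent: different-speaker stand-in

# Mean-normalize before length norm, as in the pipeline above.
s_same = cosine_score(enroll - mean, test - mean)
s_diff = cosine_score(enroll - mean, other - mean)
print(s_same > s_diff)  # True: the matching pair scores higher
```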

Speaker Recognition: DNN i-vectors

• Replace the UBM with a DNN that predicts phone posteriors
  − Leverages speaker-specific phone pronunciation
• A major advantage (~50%) on English telephone speech

[Diagram: a speech-recognition-trained DNN produces state posteriors ("t", "uh", "k", ...) used for supervector extraction; feature extraction, i-vector extraction with the i-vector model, and scoring yield the match score]

Speaker Recognition: Tandem Bottleneck Features

• A simple and more robust hybrid system

[Diagram: bottleneck features (BNF) from a speech-recognition-trained DNN are stacked with acoustic features ("tandem" features) before supervector extraction, i-vector extraction with the i-vector model, and scoring to produce the match score]

Speaker Recognition: X-Vectors

• In 2017, i-vectors were completely replaced with embeddings extracted from an internal layer of a DNN (speaker embeddings / x-vectors)
• Time-Delay NN: incremental context in the first few layers
• Key ingredient: a statistics pooling layer summarizes frame-level activations into a segment-level representation
• Trained to discriminate speakers directly at the output layer using a large training speaker set

[Diagram: features pass through frame-level layers, a stats pooling layer, and segment-level layers to the output layer; embeddings #1 and #2 are taken from the segment-level layers, and scoring of embeddings produces the match score]
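The statistics pooling step can be sketched as follows (dimensions are illustrative):

```python
import numpy as np

def stats_pooling(frame_activations):
    """Summarize variable-length frame-level activations (T x D) into a fixed
    segment-level vector by concatenating the mean and std over time."""
    mu = frame_activations.mean(axis=0)
    sigma = frame_activations.std(axis=0)
    return np.concatenate([mu, sigma])

# Two utterances of different lengths map to the same fixed dimension.
short = np.random.randn(120, 512)   # 120 frames, 512-dim activations
long_ = np.random.randn(900, 512)   # 900 frames, same activation dim
print(stats_pooling(short).shape, stats_pooling(long_).shape)  # (1024,) (1024,)
```

This length-invariance is what lets the later segment-level layers operate on whole utterances.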


    Future Challenges in Speaker Recognition

Future Challenges: Conditions

• Calibration
  − Ensuring a system meets expected error rates after deployment
• Distant speech
  − Low-SNR audio and reverberation
• The cocktail party problem
  − Heavy speaker babble, source separation
• Multi-speaker audio
  − Speaker detection among several dominant speakers
  − Diarization is a means to an end

Future Challenges: Data

    In the era of DNNs, data is gold:

    • How to get annotations cheaply? (VoxCeleb)

    • How to exploit unannotated data?

    • How to automatically detect data shift?

    • How to transfer knowledge and information between data domains?

    • How to combine multiple modalities (face+voice+text)?

    • How to use adversarial learning to generate synthetic data or learn about the speech manifold?


Future Challenges: Modelling

• End-to-end speaker recognition
  − The goal of many research groups
  − Difficulty generalizing to unseen conditions
  − Difficulty producing calibrated likelihood ratios
• Dynamic modelling
  − Accounting for the conditions of the 'current' comparison
• Intelligent systems
  − Knowing when the system is outside its comfort/known zone
  − Explaining to a user the internal decisions of the system

Thank You

Headquarters: 333 Ravenswood Avenue, Menlo Park, CA 94025, +1.650.859.2000
Princeton, NJ: 201 Washington Road, Princeton, NJ 08540, +1.609.734.2553
Additional U.S. and international locations
www.sri.com