
Speaker recognition system adaptation to unseen and mismatched recording devices in the NFI-FRIDA database

Finnian Kelly¹, Anil Alexander¹, Oscar Forth¹, and David van der Vloed²

¹Oxford Wave Research Ltd, UK
²Speech and Audio Research, Netherlands Forensic Institute, The Hague, Netherlands

Motivation

Current automatic speaker recognition systems are trained on large quantities of diverse speaker recordings:

• performance is good for forensic casework material involving typical microphone or telephone recordings
• for unseen recording types, such as those involving a new covert surveillance recorder or a new transmission condition, performance may be impacted negatively


How can we adapt a well-trained automatic system to the unseen and mismatched conditions of a new case?

Levels of mismatch

In order of increasing difficulty:

1. Matched, seen conditions
2. Mismatched, seen conditions
3. Matched, unseen conditions
4. Mismatched, unseen conditions

The challenge of mismatch

[Figure: H0/H1 score distributions under matched, seen conditions and under mismatched, unseen conditions. H0: same-speaker scores; H1: different-speaker scores.]

Existing solutions in VOCALISE

• Train the system from scratch with relevant data
  • Data hungry: 1000s of speakers required
• Re-train the system's LDA/PLDA stages with relevant data
  • Data hungry: 100s of speakers required
• Apply score normalisation
  • Will help, but is limited


Here we introduce a new method of adapting a well-trained system to unseen conditions on the fly, using small* quantities of data => forensically realistic

*10s of speakers

VOCALISE i-vector framework

[Diagram: speech → feature extraction → UBM (high-dimensional, universal speaker space) → i-vector extraction → i-vector (low-dimensional, speaker-specific space)]
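To make the pipeline concrete, here is a hedged numpy sketch of classical i-vector extraction for a diagonal-covariance UBM: accumulate zeroth- and first-order Baum-Welch statistics against the UBM, then take the MAP point estimate of the latent factor under the total-variability matrix T. All names, shapes, and the toy setup are illustrative assumptions, not the VOCALISE implementation.

```python
import numpy as np

def extract_ivector(X, weights, means, covs, T):
    """i-vector for features X (frames x D), given a diagonal-covariance
    UBM (weights (C,), means (C, D), covs (C, D)) and a total-variability
    matrix T (C*D x ivec_dim). Illustrative sketch only."""
    C, D = means.shape
    # Frame posteriors gamma (frames x C) under the diagonal-cov GMM.
    log_g = np.stack([
        np.log(weights[c])
        - 0.5 * (np.sum(np.log(2 * np.pi * covs[c]))
                 + np.sum((X - means[c]) ** 2 / covs[c], axis=1))
        for c in range(C)], axis=1)
    g = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    g /= g.sum(axis=1, keepdims=True)
    N = g.sum(axis=0)  # zeroth-order stats: one occupancy per component
    # First-order stats, centred on the UBM means and stacked to (C*D,).
    F = np.concatenate([g[:, c] @ (X - means[c]) for c in range(C)])
    # MAP estimate: w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F.
    Sigma_inv = 1.0 / np.concatenate(covs)      # stacked diagonal, (C*D,)
    TtSi = T.T * Sigma_inv                      # T' Sigma^-1
    precision = np.eye(T.shape[1]) + (TtSi * np.repeat(N, D)) @ T
    return np.linalg.solve(precision, TtSi @ F)
```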

Comparing i-vectors

[Diagram: i-vector A and i-vector B → LDA / PLDA → comparison score for i-vectors A and B]

Post-processing i-vectors

We could compare ‘raw’ i-vectors directly, but it is beneficial to first post-process i-vectors to increase their discriminatory power.

LDA (linear discriminant analysis) is an important post-processing step that:

1. Increases inter-speaker separability
2. Reduces dimensionality

Linear Discriminant Analysis (LDA)

• LDA projects i-vectors into a new space in which:
  • within-speaker variability is minimised
  • between-speaker separation is maximised
• Requires a set of training i-vectors and their speaker labels
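As a concrete illustration of this step, a minimal sketch using scikit-learn's LinearDiscriminantAnalysis on toy data; the dimensions, speaker counts, and variable names are illustrative assumptions, not the VOCALISE implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Toy training set: 200 i-vectors of dimension 400 from 20 labelled speakers.
n_speakers, ivecs_per_spk, dim = 20, 10, 400
labels = np.repeat(np.arange(n_speakers), ivecs_per_spk)
ivectors = rng.normal(size=(n_speakers * ivecs_per_spk, dim))

# Fit the LDA projection from labelled training i-vectors.
# n_components is capped at n_speakers - 1 by the method itself.
lda = LinearDiscriminantAnalysis(n_components=n_speakers - 1)
lda.fit(ivectors, labels)

# Project i-vectors into the lower-dimensional, more
# speaker-discriminative space before comparison/PLDA scoring.
projected = lda.transform(ivectors)
print(projected.shape)  # (200, 19)
```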

LDA training

• The LDA transformation is generally learned using the same training data as the other models in the i-vector framework (UBM and TV).

• Can we leverage LDA for adapting a system to new conditions?

Condition adaptation via LDA

[Diagram: system development i-vectors (N ≈ 50,000) → well-trained LDA transformation; adaptation i-vectors (N ≈ 100) → adapted LDA transformation]
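The slides do not spell out how the adapted transformation is estimated, so the following is only one plausible sketch: blend the within- and between-speaker scatter statistics of the large development set with those of the small, condition-matched adaptation set, then re-solve the LDA eigenproblem. The blending weight alpha, the ridge term, and all names are assumptions for illustration, not the published method.

```python
import numpy as np
from scipy.linalg import eigh

def scatter_matrices(X, y):
    """Within-class (Sw) and between-class (Sb) scatter of i-vectors X."""
    mu, dim = X.mean(axis=0), X.shape[1]
    Sw, Sb = np.zeros((dim, dim)), np.zeros((dim, dim))
    for spk in np.unique(y):
        Xs = X[y == spk]
        mu_s = Xs.mean(axis=0)
        Sw += (Xs - mu_s).T @ (Xs - mu_s)
        Sb += len(Xs) * np.outer(mu_s - mu, mu_s - mu)
    return Sw / len(X), Sb / len(X)

def adapted_lda(X_dev, y_dev, X_adapt, y_adapt, n_components, alpha=0.5):
    """LDA basis from an alpha-blend of development and adaptation scatters."""
    Sw_d, Sb_d = scatter_matrices(X_dev, y_dev)
    Sw_a, Sb_a = scatter_matrices(X_adapt, y_adapt)
    Sw = (1 - alpha) * Sw_d + alpha * Sw_a
    Sb = (1 - alpha) * Sb_d + alpha * Sb_a
    # Generalised eigenproblem Sb v = lambda Sw v; keep the top directions.
    # The small ridge keeps Sw invertible when few adaptation speakers exist.
    evals, evecs = eigh(Sb, Sw + 1e-6 * np.eye(Sw.shape[0]))
    return evecs[:, np.argsort(evals)[::-1][:n_components]]
```

With only N ≈ 100 adaptation i-vectors, the adaptation-set scatters are noisy on their own; leaning on the development statistics (and regularising Sw) is what would keep such an adapted projection stable.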

Probabilistic LDA (PLDA)

• PLDA compares two post-LDA i-vectors and returns a comparison score
• The score is calculated based on the most discriminative parts of an i-vector:
  • achieved by learning a subspace that describes the dominant directions of change in the i-vectors of different speakers
• PLDA therefore requires a set of post-LDA training i-vectors and their speaker labels

We supplement our LDA condition adaptation by re-training PLDA with all adapted i-vectors.
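For intuition about what the PLDA score represents, here is a hedged sketch of pairwise scoring under the simplified two-covariance model: speaker means drawn from a between-speaker Gaussian, i-vectors scattered around them with within-speaker covariance. Production PLDA implementations differ in detail, and the covariances and toy data below are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def two_cov_llr(x1, x2, mu, B, W):
    """Log-likelihood ratio that post-LDA i-vectors x1 and x2 share a speaker.
    Model: speaker mean y ~ N(mu, B); i-vector x | y ~ N(y, W)."""
    T = B + W  # marginal covariance of a single i-vector
    # Same-speaker hypothesis: the shared latent speaker mean induces
    # cross-covariance B between the two i-vectors.
    joint_mean = np.concatenate([mu, mu])
    joint_cov = np.block([[T, B], [B, T]])
    log_same = mvn.logpdf(np.concatenate([x1, x2]), joint_mean, joint_cov)
    log_diff = mvn.logpdf(x1, mu, T) + mvn.logpdf(x2, mu, T)
    return log_same - log_diff

# Toy usage with illustrative (assumed) covariances.
dim = 5
mu, B, W = np.zeros(dim), 2.0 * np.eye(dim), 1.0 * np.eye(dim)
rng = np.random.default_rng(1)
spk = rng.normal(scale=np.sqrt(2.0), size=dim)      # one speaker's mean
x1, x2 = spk + rng.normal(size=dim), spk + rng.normal(size=dim)
print(two_cov_llr(x1, x2, mu, B, W))  # positive => same-speaker support
```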

Reference normalisation

Reference (or score) normalisation is an established technique for adapting the output of a system to new conditions.

[Diagram: i-vector A and i-vector B → raw comparison score; each i-vector is also compared against a set of reference i-vectors, yielding reference scores A and reference scores B, which are used to normalise the raw comparison score.]

Can only shift scores up or down; less powerful than LDA/PLDA adaptation…
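As an illustration, a sketch of one common form of reference normalisation, symmetric score normalisation (S-norm). The cohort handling and names are assumptions; the normalisation actually used in VOCALISE may differ in detail.

```python
import numpy as np

def s_norm(raw_score, ref_scores_a, ref_scores_b):
    """S-norm: average the raw score z-normalised under each side's cohort.
    ref_scores_a: scores of i-vector A against the reference i-vectors;
    ref_scores_b: likewise for i-vector B."""
    za = (raw_score - np.mean(ref_scores_a)) / np.std(ref_scores_a)
    zb = (raw_score - np.mean(ref_scores_b)) / np.std(ref_scores_b)
    return 0.5 * (za + zb)
```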

Mismatched condition experiments

System: iVOCALISE 2017B

• TEL-only and TEL-MIC sessions
• Condition adaptation
• Reference normalisation

Alexander, A., Forth, O., Atreya, A. A. and Kelly, F. (2016). VOCALISE: A forensic automatic speaker recognition system supporting spectral, phonetic, and user-provided features. Speaker Odyssey 2016, Bilbao, Spain.

Outline

NFI-FRIDA recap

Experiments with NFI-FRIDA

• 40 test speakers
• 3 recordings per speaker from each of the following devices:
  • d1: Headset microphone
  • d2: Close microphone A
  • d3: Close microphone B
  • d4: Far microphone
  • d5: Telephone intercept
• 1 additional recording per speaker from device d1
• Cross-device (mismatched) performance, relative to d1, was evaluated for all devices d1-d5
• #H0 (same-speaker) comparisons = 120 (40 speakers × 3 recordings)
• #H1 (different-speaker) comparisons = 4680 (40 speakers × 39 other speakers × 3 recordings)

Condition adaptation & reference normalisation experiments

• 15 training speakers (no overlap with the 40 test speakers)
• 2 recordings per speaker from each of the following devices:
  • d1: Headset microphone
  • d2: Close microphone A
  • d3: Close microphone B
  • d4: Far microphone
  • d5: Telephone intercept
• For condition adaptation, 2 recordings from each of the devices under comparison were used => 2 recordings × 2 devices × 15 speakers = 60 recordings*
• For reference normalisation, 2 recordings from the other (not d1) device were used => 2 recordings × 1 device × 15 speakers = 30 recordings

* With the exception of d1-d1, where only 30 recordings were used

Cross-condition performance (EER%): Telephone-only session data

[Figure: EER% relative to d1 for devices d1: Headset mic, d2: Close mic A, d3: Close mic B, d4: Far mic, d5: Tel intercept; 15 speakers for adaptation/normalisation.]

Cross-condition performance (EER%): Telephone+Microphone session data

[Figure: EER% relative to d1 for devices d1-d5; 15 speakers for adaptation/normalisation.]

Revisiting matched comparison EER%

[Figure: matched-comparison EER% for devices d1-d5; 15 speakers for adaptation/normalisation.]

Cross-condition performance (Cllr-min): Telephone+Microphone session data

• Cllr-min, like the EER, measures the ability of the system to discriminate between speakers
• Unlike the EER, it considers the discriminatory power of the system across all possible score thresholds
• Cllr-min, or minimum log-likelihood-ratio cost, is the optimal Cllr value achievable by a system
• Like the EER, lower is better:
  • Cllr-min = 0 for a perfect system
  • Cllr-min = 1 for a useless system
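For completeness, a sketch of how EER and Cllr-min can be computed from H0 (same-speaker) and H1 (different-speaker) score lists. The pool-adjacent-violators (isotonic regression) recipe for Cllr-min is standard in the forensic-voice-comparison literature, but this is not the evaluation code behind the results shown here.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def eer(h0_scores, h1_scores):
    """Equal error rate: where the miss and false-alarm rates cross."""
    h0, h1 = np.asarray(h0_scores), np.asarray(h1_scores)
    for t in np.sort(np.concatenate([h0, h1])):
        miss, fa = np.mean(h0 < t), np.mean(h1 >= t)
        if miss >= fa:
            return 0.5 * (miss + fa)

def cllr(tar_llrs, non_llrs):
    """Cllr in bits, for log-likelihood ratios in natural log."""
    c_tar = np.mean(np.log2(1 + np.exp(-np.asarray(tar_llrs))))
    c_non = np.mean(np.log2(1 + np.exp(np.asarray(non_llrs))))
    return 0.5 * (c_tar + c_non)

def cllr_min(h0_scores, h1_scores, eps=1e-6):
    """Optimal Cllr after PAV (isotonic) calibration of the raw scores."""
    scores = np.concatenate([h0_scores, h1_scores])
    labels = np.concatenate([np.ones(len(h0_scores)), np.zeros(len(h1_scores))])
    post = IsotonicRegression(y_min=eps, y_max=1 - eps).fit_transform(scores, labels)
    # Posterior -> LLR by removing the empirical prior log-odds.
    llrs = np.log(post / (1 - post)) - np.log(len(h0_scores) / len(h1_scores))
    return cllr(llrs[labels == 1], llrs[labels == 0])
```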

[Figure: Cllr-min relative to d1 for devices d1-d5; 15 speakers for adaptation/normalisation.]

Revisiting matched comparison Cllr-min

[Figure: matched-comparison Cllr-min for devices d1-d5; 15 speakers for adaptation/normalisation.]

Varying the number of adaptation & normalisation speakers

• The original set of 15 training speakers was increased to 38 speakers (again, no overlap with the 40 test speakers)
• 2 recordings from each device d1-d5 were used
• Condition adaptation and reference normalisation proceeded as before, increasing the number of training speakers in increments of 5, from 5 to 38
• Results presented for d1-d4 and d1-d5 only

d1-d4: Close mic - Far mic, Telephone+Microphone session data

[Figure: Cllr-min vs. number of adaptation/reference speakers.]

d1-d5: Close mic - Telephone intercept, Telephone+Microphone session data

[Figure: Cllr-min vs. number of adaptation/reference speakers.]

d1-d5: Close mic - Telephone intercept, condition adaptation variance with 5 speakers

[Figure: Cllr-min vs. number of adaptation/reference speakers.]

d1-d5: Close mic - Telephone intercept, reference normalisation variance with 5 speakers

[Figure: Cllr-min vs. number of adaptation/reference speakers.]

d1-d5: Close mic - Telephone intercept, condition adaptation variance with 20 speakers

[Figure: Cllr-min vs. number of adaptation/reference speakers.]

d1-d5: Close mic - Telephone intercept, reference normalisation variance with 20 speakers

[Figure: Cllr-min vs. number of adaptation/reference speakers.]

Conclusions

• The baseline performance of a well-trained automatic system on unseen and mismatched conditions is good: <4% EER
• Condition adaptation can provide consistent and stable performance improvement with a very small number of speakers (≈30) => applicable to forensic casework
• Condition adaptation has scope to exploit additional speakers and recordings if they are available
• Here we have used condition adaptation and reference normalisation in isolation; they can be used in combination

Thanks!
