
AUTOMATIC SPEAKER RECOGNITION AND DE-IDENTIFICATION

JEAN-FRANÇOIS BONASTRE

IC1206 TRAINING SCHOOL

LAS PALMAS, FEBRUARY 13-16 2017


INTRODUCTION

De-identification:

• To destroy or mask the information about the « identity » of a person

• « Identity »? Identity alone, or all personal/private elements?

Modality of interest:

• VOICE (more than speech or « speaker »)

Two potentially targeted receivers:

• The human brain, through the human hearing system

• Perceived « identity »

• Analytics, through computers

• Information retrieval, QA…

INTRODUCTION

MENU OF THE TALK

Focused on « Voice » and « Analytics », and therefore on « identity » in the sense of automatic speaker recognition (ASpR)

Entrada (starter):

What ASpR systems are really doing today

Plato principal (main course):

A deeper look at the phonetic information used by ASpR systems

Postre (dessert):

Brainstorming on « identity » and voice


ENTRADA

UNDERSTAND ASpR

Amazing progress in performance over the last decade(s)

• Not really in terms of error rates, but definitely in terms of operational context and scalability

• From monolingual prompted text, in a clean environment, with a single microphone

• To text-independent real-life conversations, gathered from multiple real-world situations with hundreds of cellphones

• (And further progress is expected from DNNs…)

But still the same architecture and information processing

• ~UBM plus statistical analysis, iVector systems


ENTRADA

UNDERSTAND ASpR

And what is UBM/GMM? A reminder

Corpus-based training

Generic model trained on a large collection

High adaptation abilities

Makes it possible to learn the interesting variability versus the uninteresting variability, using a task-driven principle

ENTRADA

UNDERSTAND ASpR

And what is a UBM/GMM system? A reminder

Supervectors

• Acoustic space of dimension ~50, UBM with 500 to 4000 components = supervector dimension ~100,000

UBM-GMM (~1998)

MIT-LL, Reynolds, D. A. et al., 2000, Digital Signal Processing

The « shifts » define the model of speaker X!
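As a rough illustration of the « shifts » idea, the sketch below adapts the means of a toy UBM to one speaker's frames with relevance-MAP adaptation (in the spirit of Reynolds et al., 2000) and stacks the adapted means into a supervector. The function name `map_adapt_means`, the relevance factor of 16 and the toy sizes are illustrative assumptions, not values taken from the talk.

```python
import math

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Relevance-MAP adaptation of the means of a diagonal-covariance GMM (the UBM)."""
    n_comp, dim = len(means), len(means[0])
    n = [0.0] * n_comp                              # soft frame counts per component
    first = [[0.0] * dim for _ in range(n_comp)]    # posterior-weighted sums of frames
    for x in frames:
        # log(weight * Gaussian density) of this frame for each component
        logp = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                ll -= 0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
            logp.append(ll)
        top = max(logp)
        post = [math.exp(l - top) for l in logp]
        total = sum(post)
        for c in range(n_comp):
            g = post[c] / total                     # posterior of component c
            n[c] += g
            for d in range(dim):
                first[c][d] += g * x[d]
    adapted = []
    for c in range(n_comp):
        alpha = n[c] / (n[c] + relevance)           # more data -> trust the speaker more
        adapted.append([
            alpha * (first[c][d] / n[c] if n[c] > 0 else means[c][d])
            + (1.0 - alpha) * means[c][d]
            for d in range(dim)
        ])
    return adapted

# Toy example: 2 components, 1 dimension (assumed values, for illustration only)
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0], [10.0]]
ubm_vars = [[1.0], [1.0]]
speaker_frames = [[1.0]] * 16

adapted = map_adapt_means(speaker_frames, ubm_weights, ubm_means, ubm_vars)
supervector = [v for mean in adapted for v in mean]
ubm_supervector = [v for mean in ubm_means for v in mean]
shifts = [a - b for a, b in zip(supervector, ubm_supervector)]
```

With the orders of magnitude quoted on the slide (about 50 acoustic dimensions and a few thousand components), the same stacking gives a supervector of roughly 100,000 values.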

ENTRADA

UNDERSTAND ASpR

And what is an iVector system? A reminder

Factor Analysis (probabilistic approach)

[Figure: factor analysis diagram with the labels s, c, m, z, « Speaker variability », « Session variability » and « Modelling noise »]

With training data, learn

• The good variability subspace

• The bad variability subspace

Kenny, P., Boulianne, G., & Dumouchel, P. (2005). Eigenvoice modeling with sparse training data. IEEE Transactions on Speech and Audio Processing.

ENTRADA

UNDERSTAND ASpR

So! What is an iVector system?

It is the Total Variability space!

• Same as factor analysis, but with a single space / matrix T

• Concentrates the variability in a “small” space

• Similar to PCA, but follows a probabilistic estimation

In the “small” space (dimension ~400)

• A recording is represented by one vector, called the “iVector”

• Discriminant techniques such as PLDA (or LDA) are applied

• (Plus “normalization”/conditioning)
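For reference, the total variability model is usually written as below. This is the standard formulation from the iVector literature, not an equation shown on the slides, and the symbol names are the conventional ones rather than the speaker's.

```latex
M = m + T\,w, \qquad w \sim \mathcal{N}(0, I)
```

Here M is the supervector of one recording, m is the UBM mean supervector, T is the low-rank total variability matrix, and w is the iVector. With the orders of magnitude quoted in this talk, M and m have about 100,000 entries and w about 400, so T is roughly a 100,000 × 400 matrix.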

ENTRADA

UNDERSTAND ASpR

Handwriting recognition metaphor

Free-text handwriting writer recognition

[Handwriting sample: « las palmas, c’est le Fun ! » (“Las Palmas is fun!”)]


ENTRADA

UNDERSTAND ASpR

You could use the UBM-GMM / iVector approach

• No deep knowledge of handwriting

• 99% corpus-based approach

• Learn diversity and variability from example data

The progress of ASpR is easy to visualize

[Figure: handwriting samples, early 1990’s versus 2017]

• Early 1990’s: prompted text, « Write “My name is <your first name>” in the box, using the blue pen », giving « My name is Jean-Francois »

• 2017: free text, such as « c’est top ici, à las palmas! » (“it’s great here, in Las Palmas!”), « las palmas, it is so nice ! », ...


ENTRADA

UNDERSTAND ASpR

Observing a good EER/DCF/Cllr does not mean that the system is doing writer recognition

Adding factors does not always make the task more difficult!

• Add a paper color (~ a phone), if only a small number of users use it consistently

• Add a pen color and/or a language, if only…

• Add a context/content (Las Palmas!), if only…








ENTRADA

UNDERSTAND ASpR

With ASpR (but not only), in order to de-identify it is mandatory to

• Know which task the systems actually perform

• Know in depth which information the system uses

And to know that system design can influence de-identification

• Something true today may no longer be true tomorrow

• This is particularly important if we don’t know (well enough) what the speaker-specific information is


ENTRADA

UNDERSTAND ASpR

[Figure values: 7396, 8049; NCFB_A: -1.94, 4.84, 0.46, 5.47]

Condition        F. Accept.   F. Rejection
1 - baseline     0.88 %       27.45 %
2 - !=           49.72 %      27.45 %
3 - =            96.55 %      27.45 %

Bonastre, J. F. et al. (2007). Artificial impostor voice transformation effects on false acceptance rates. INTERSPEECH

See also Federico Alegre et al. 2012

ENTRADA

UNDERSTAND ASpR

Comments…

Important to define the receiver…

Important to define the task

For the discussion later, do not forget…

• The differences between handwriting writer recognition, signature recognition and Xprints

• The fact that we have not spoken, stricto sensu, about biometrics until now…

PLATO PRINCIPAL

A deeper look at the phonetic information used by ASpR systems

One PhD (Moez Ajili) + a previous PhD (Juliette Kahn)

Recipe:

• Start with a baseline system (iVector/ALIZE)

• Find a good database… with enough intraspeaker variability

• Build a database (FABIOLE), with some drawbacks but designed for the job

• Develop a protocol

• Do the experiments and try to understand the results

PLATO PRINCIPAL

FABIOLE

Recordings from various radio and TV shows (similar to ESTER, REPERE, ETAPE). FABIOLE has two sets of (male-only) speakers

• T: 30 speakers with 100 recordings each (30s minimum), taken from different shows (different days and/or channels)

• I: 100 speakers with 1 recording each (30s minimum)

Here, we used only the T set, keeping I for other experiments

• For each T speaker, 4950 same-speaker pairs and 290K different-speaker pairs (multiply by 30 for the total number of pairs)
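The pair counts quoted for the T set follow from simple combinatorics. The short check below reproduces them; the remark about double counting is an observation added here, not a statement from the slides.

```python
from math import comb

n_speakers = 30       # T set
n_recordings = 100    # recordings per T speaker

# Same-speaker pairs for one speaker: unordered pairs among its 100 recordings
same_per_speaker = comb(n_recordings, 2)

# Different-speaker pairs involving one speaker: each of its recordings
# against every recording of the 29 other speakers
diff_per_speaker = n_recordings * (n_speakers - 1) * n_recordings

total_same = n_speakers * same_per_speaker
# Multiplying by 30 counts each different-speaker pair twice
# (once for each of the two speakers involved)
total_diff_per_speaker_view = n_speakers * diff_per_speaker
total_diff_unique = total_diff_per_speaker_view // 2
```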


PLATO PRINCIPAL

BASELINE SYSTEM

LIA SpkDet system, using the ALIZE/SpkDet open-source toolkit

• 19 LFCC, their first derivatives and 11 second-order derivatives

• Bandwidth restricted to 300-3400 Hz (first part)

• UBM with 512 components

• UBM and T matrix trained on the ESTER 1&2, REPERE and ETAPE databases, 7690 sessions from 2906 speakers

• Inter-session matrix W estimated on a subset (>=2 sessions) using 3410 sessions from 617 speakers

• iVector dimension is 400

• PLDA scoring model


PLATO PRINCIPAL

PROTOCOL FOR A PHONETIC VIEW

Automatic phonetic alignment using LIA tools

• Automatic transcription using Speeral, the LIA automatic transcription system (WER ~29% on REPERE)

• Plus verification with orthographic transcriptions

Main principles

• Withdraw the information of interest and observe the performance degradation

• Compare withdrawing the information of interest with random pruning of the same number of acoustic frames

• Express the results as a relative loss/gain in %
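The two pruning conditions described above can be sketched as follows. This is a minimal illustration on toy data: the helper names `prune_class` and `prune_random`, the frame stand-ins and the label sequence are all hypothetical, and the actual LIA tooling is not shown in the slides.

```python
import random

def prune_class(frames, labels, target_class):
    """Drop every frame whose aligned phoneme class equals target_class."""
    return [f for f, lab in zip(frames, labels) if lab != target_class]

def prune_random(frames, n_remove, rng):
    """Drop n_remove frames chosen uniformly at random (control condition)."""
    removed = set(rng.sample(range(len(frames)), n_remove))
    return [f for i, f in enumerate(frames) if i not in removed]

# Toy alignment: one phoneme-class label per acoustic frame
labels = ["OV", "P", "OV", "NV", "F", "OV", "L", "NC"]
frames = list(range(len(labels)))   # stand-ins for acoustic feature vectors

# Condition 1: withdraw the class of interest (here oral vowels)
without_ov = prune_class(frames, labels, "OV")
n_removed = len(frames) - len(without_ov)

# Condition 2: remove the same number of frames at random, several times
rng = random.Random(0)
random_controls = [prune_random(frames, n_removed, rng) for _ in range(10)]
```

Comparing against the random control separates the effect of losing a specific phonetic content from the effect of simply having less speech.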


PLATO PRINCIPAL


Measure the performance using Cllr

N. Brümmer, J. du Preez, Application-independent evaluation of speaker detection, Computer Speech & Language 20 (2006) 230–275.

Cllr: combines two loss functions and measures a loss of information
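For reference, the log-likelihood-ratio cost from the cited Brümmer and du Preez paper is usually written as below, with one logarithmic loss averaged over target trials and one over non-target trials. The index and set names are notation chosen here.

```latex
C_{llr} = \frac{1}{2}\left[
  \frac{1}{N_{tar}} \sum_{i \in tar} \log_2\!\left(1 + \frac{1}{LR_i}\right)
  + \frac{1}{N_{non}} \sum_{j \in non} \log_2\!\left(1 + LR_j\right)
\right]
```

A system that always outputs LR = 1 (no information) obtains a Cllr of 1 bit, and a perfect, well-calibrated system tends towards 0.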


PLATO PRINCIPAL


Measure the performance using Cllr

But divide the Cllr into tar (matched pairs) and non (unmatched pairs) components

Hypothesis

• The Tar part is mainly linked to intraspeaker variability

• The Non part is mainly linked to interspeaker variability
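A minimal sketch of this split is given below. It assumes natural-log likelihood ratios as input, and it keeps the two components as plain averages that are halved only when combined; whether the factor 1/2 is folded into each component is a convention that the slides do not specify.

```python
import math

def _log2_1p_exp(x):
    """Numerically stable log2(1 + exp(x))."""
    return (max(x, 0.0) + math.log1p(math.exp(-abs(x)))) / math.log(2.0)

def cllr_tar(target_llrs):
    """Average loss over same-speaker (target) trials: log2(1 + 1/LR)."""
    return sum(_log2_1p_exp(-llr) for llr in target_llrs) / len(target_llrs)

def cllr_non(nontarget_llrs):
    """Average loss over different-speaker (non-target) trials: log2(1 + LR)."""
    return sum(_log2_1p_exp(llr) for llr in nontarget_llrs) / len(nontarget_llrs)

def cllr(target_llrs, nontarget_llrs):
    """Overall Cllr as the mean of the two components."""
    return 0.5 * (cllr_tar(target_llrs) + cllr_non(nontarget_llrs))
```

Keeping the two components separate makes it possible to attribute a loss either to same-speaker trials or to different-speaker trials, which is what the hypothesis above relies on.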


PLATO PRINCIPAL


Define Relative Cllr

We now have Cllr in Tar and Non versions

We have the phonemic pruning system which, for each pair,

• Removes the frames tied to a specific phoneme

• Randomly removes the same number of frames (10 times)

We need the relative Cllr…
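The slides do not give the formula, so the sketch below is an assumption: it expresses the Cllr obtained without a phoneme class relative to the average Cllr of the random-pruning runs, in percent. The function name `relative_cllr` and the example values are hypothetical.

```python
def relative_cllr(cllr_without_class, cllr_random_runs):
    """Relative change (in %) of Cllr when a phoneme class is removed,
    compared with removing the same number of frames at random."""
    reference = sum(cllr_random_runs) / len(cllr_random_runs)
    return 100.0 * (cllr_without_class - reference) / reference

# Hypothetical values: removing the class raises Cllr from 0.2 to 0.5
loss = relative_cllr(0.5, [0.2, 0.2])
# Hypothetical values: removing the class lowers Cllr from 0.2 to 0.12
gain = relative_cllr(0.12, [0.2, 0.2])
```

Under this sign convention a positive value is a loss (the removed class carried useful information) and a negative value is a gain (the system does better without that class).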


PLATO PRINCIPAL


Phoneme classification

We work mainly on phoneme classes to avoid lack-of-data problems

We use, classically, the following classes:

• Oral vowels (OV)

• Nasal vowels (NV)

• Nasal consonants (NC)

• Plosives (P)

• Fricatives (F)

• Liquids (L)

PLATO PRINCIPAL

RESULTS

Global, by phonetic category

Results similar to the literature: nasals and vowels are particularly speaker specific (nasal vowels most of all)

In contradiction with L. F. Gallardo, M. Wagner, S. Moller, I-vector speaker verification based on phonetic information under transmission channel effects, in: INTERSPEECH, pp. 696–700: low importance of fricatives

• But the bandwidth could explain this…

PLATO PRINCIPAL

RESULTS

By speaker and Tar, Non

Even with a controlled protocol, large differences between speakers are observed (mainly on the Tar part)

PLATO PRINCIPAL

RESULTS

By class and speaker using relative Cllr

Same general tendency, but with a large variability depending on the speaker!!!

• Speaker 2 has a loss of 175% without oral vowels

• Speaker 28 has a gain of about 40% in the same situation

PLATO PRINCIPAL

RESULTS

Same for different-speaker pairs only (Non)

Reinforces the importance of some phonetic classes (like oral vowels) in terms of speaker-specific information (“identity” information)

Good! Simple!

PLATO PRINCIPAL

RESULTS

Repeat with same-speaker pairs only (Tar)

Ohhhh!!!!

Oral vowels are now negative, on average…

Per-speaker variability is huge

Intraspeaker variability seems very important!

PLATO PRINCIPAL

RESULTS

Phonetic view and Tar/Non

Statistical relevance? Checked with ANOVA

• Differences are significant for both non-target and target trials

• The phonemic category explains about 60% of the variance of Cllr for Non and 10.2% for Tar

• Large effect for Non, medium for Tar (using Eta-square)

Nasals are highly effective for speaker comparison (contribution of the nasal/paranasal cavities)

Oral vowels

• Bring the largest part in terms of speaker discrimination

• And, at the same time, show a large intraspeaker variability, which conveys a large part of the loss
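The Eta-square effect size used on this slide is the share of the total variance explained by the grouping factor (here, the phonemic category). A minimal one-way sketch, with a hypothetical function name and toy data rather than the study's values:

```python
def eta_squared(groups):
    """One-way ANOVA effect size: between-group sum of squares / total sum of squares."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total

# Toy data: each inner list holds the measurements for one category
toy_effect = eta_squared([[1.0, 2.0], [2.0, 3.0]])
```

Under Cohen's commonly quoted thresholds (roughly 0.06 for a medium effect and 0.14 for a large one), values of 0.60 and 0.102 fall in the large and medium ranges, which agrees with the wording of the slide.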

PLATO PRINCIPAL

BONUS…

Undisclosed results on formants…

% of variability explained by the different factors (Eta-square)

• Large contribution of F1-F3 to phoneme discrimination

• ~No contribution of F1-F3 to speaker discrimination

• F4 has a significant contribution to speaker discrimination

• F4 is mainly linked to nasality, as shown in: Y. Lavner, I. Gath, J. Rosenhouse, The effects of acoustic modifications on the identification of familiar voices speaking isolated vowels, Speech Communication 30 (2000) 9–26.

POSTRE

Brainstorming??

• On « identity » and voice

• Your turn… time to work…



BRAINSTORMING

POSTRE

WHAT DID WE FIND?

Automatic speaker recognition systems use all of the available information, globally

• Not only “speaker-specific” information

Designing a database in order to study intraspeaker variability

• Shows a large “speaker effect”

• Shows that intraspeaker variability is responsible for ~2/3 of the system’s losses

• Shows that the information is not uniformly distributed among the phonological units

• (and killed some ideas about formants)

Even if the database is still very limited

POSTRE

LESSONS AND QUESTIONS

1- To de-identify, we need to know the target: human or ASpR

2- It is not possible to de-identify without knowing which information is used

3- Are we talking about biometrics?

• No for gait, speech….

4- Is it so important for de-identification?

• No… If the paper color is what the system uses, de-identifying through it could work

5- As a consequence, some de-identification approaches have a limited life expectancy (due to technology changes)


POSTRE

LESSONS AND QUESTIONS

6- Is “identity” (in the biometric sense) the only element to withdraw for de-identification?

• (Re)opens the general question of privacy

• Voice conveys a lot of information

• Gender, Age, Mother tongue, Accent, Education, Stress, “Emotion”, Health, Opinions…

• And… what you did yesterday evening…

• Speech also conveys information (“Las palmas”)

• Huge interest in this “paralinguistic information”

• Special sessions, challenges, big players’ views

7- “Identity” is certainly the easiest aspect to deal with…

CREDITS, THANKS AND REFERENCES

The technical part of this presentation comes from Moez Ajili’s PhD thesis work

• With a large contribution of Solange Rossato (LIG) and Juliette Kahn (LNE)

Part of the presented results can be found here:

Moez Ajili, Jean-François Bonastre, Waad Ben Kheder, Solange Rossato, Juliette Kahn (2017). Phonological content impact on wrongful convictions in forensic voice comparison context. ICASSP 2017

Moez Ajili, Jean-François Bonastre, Waad Ben Kheder, Solange Rossato, Juliette Kahn (2016). 2016 IEEE Workshop on Spoken Language Technology (SLT), 13–16 December. IEEE, San Diego, USA

Ajili, M., Bonastre, J. F., Rossato, S., Kahn, J. (2016, March). Inter-speaker variability in forensic voice comparison: a preliminary evaluation, ICASSP 2016, Shanghai, China.

Ajili, Moez and Bonastre, Jean-François and Kahn, Juliette and Rossato, Solange and Bernard, Guillaume (2016). Fabiole, a speech database for forensic speaker comparison, Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC (pp 23-28)

OTHER REFERENCES

J. P. Campbell, W. Shen, W. M. Campbell, R. Schwartz, J.-F. Bonastre, D. Matrouf, Forensic speaker recognition, Signal Processing, 2009

J. Kahn, N. Audibert, J.-F. Bonastre, S. Rossato, Inter and intraspeaker variability in French: an analysis of oral vowels and its implication for automatic speaker verification, in: International Congress of Phonetic Sciences (ICPhS), pp. 1002–1005.

K. Amino, T. Osanai, T. Kamada, H. Makinae, T. Arai, Effects of the phonological contents and transmission channels on forensic speaker recognition, in: Forensic Speaker Recognition, Springer, 2012, pp. 275–308.

J. P. Eatock, J. S. Mason, A quantitative assessment of the relative speaker discriminating properties of phonemes, ICASSP-94

U. Hofker, Auros-automatic recognition of speakers by computers: phoneme ordering for speaker recognition, in: Proc. 9th International Congress on Acoustics

K. Amino, T. Sugawara, T. Arai, Idiosyncrasy of nasal sounds in human speaker identification and their acoustic properties, Acoustical science and technology 27 (2006) 233–235.

J. H. Hansen, T. Hasan, Speaker recognition by machines and humans: a tutorial review, IEEE Signal Processing Magazine 32 (2015) 74–99.

S. S. Kajarekar, H. Bratt, E. Shriberg, R. De Leon, A study of intentional voice modifications for evading automatic speaker recognition, in: 2006 IEEE Odyssey-The Speaker and Language Recognition Workshop, IEEE, pp. 1–6.

C. Schindler, C. Draxler, The influence of bandwidth limitation on the speaker discriminating potential of nasals and fricatives, International Association for Forensic Phonetics and Acoustics (IAFPA) (2013).
