AUTOMATIC SPEAKER RECOGNITION AND DE-IDENTIFICATION
JEAN-FRANÇOIS BONASTRE
IC1206 TRAINING SCHOOL
LAS PALMAS, FEBRUARY 13-16 2017
INTRODUCTION
De-identification:
• To destroy or mask the information about the « identity » of a person
• Why the quotation marks? Identity only, or all personal/private elements?
Modality of interest:
• VOICE (more than speech or « speaker »)
Two potential target receivers:
• The human brain, via the human auditory system
• Perceived « identity »
• Analytics, via computers
• Information retrieval, QA…
INTRODUCTION
MENU OF THE TALK
Focus on « Voice » and « Analytics », and therefore on « identity » in the sense of automatic speaker recognition (ASpR)
Entrada (starter):
What ASpR systems are really doing today
Plato principal (main course):
A deeper look at the phonetic information used by ASpR systems
Postre (dessert):
Brainstorming on « identity » and voice
ENTRADA
UNDERSTAND ASpR
Impressive progress in performance over the last decade(s)
• Not really in terms of error rates, but definitely in terms of operational context and scalability
• From monolingual prompted text in a clean environment with a single microphone
• To text-independent, real-life conversations gathered from many real-world situations through hundreds of cellphones
• (And further progress is expected from DNNs…)
But still the same architecture and information processing
• ~UBM plus statistical analysis, iVector systems
ENTRADA
UNDERSTAND ASpR
And what is UBM/GMM? A reminder
Corpus-based training
Generic model trained on a large collection
High adaptation abilities
Makes it possible to learn which variability is of interest and which is not, using a task-driven principle
ENTRADA
UNDERSTAND ASpR
And what is a UBM/GMM system? A reminder
UBM-GMM (~1998): MIT-LL, Reynolds, D. A. et al., 2000, Digital Signal Processing
Supervectors: acoustic space (~50 dimensions) × UBM (500 to 4000 components) = dimension ~100,000
The « shifts » define the model of speaker X!
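As a rough illustration of the « shifts » idea, here is a minimal sketch of mean-only MAP adaptation of a UBM toward one speaker's data, in the spirit of the Reynolds et al. (2000) reference. Everything in it is an illustrative assumption: the toy 1-D, two-component UBM, the made-up frames, and the relevance factor of 16 (a commonly used value, not one stated in the talk). Real systems work with ~50-dimensional features and hundreds of components.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a 1-D GMM (the UBM) to a speaker's frames."""
    n_comp = len(means)
    occupancy = [0.0] * n_comp    # soft count of frames per component
    first_order = [0.0] * n_comp  # posterior-weighted sum of frames
    for x in frames:
        likes = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        for c in range(n_comp):
            post = likes[c] / total
            occupancy[c] += post
            first_order[c] += post * x
    adapted = []
    for c in range(n_comp):
        if occupancy[c] > 0.0:
            # More data on a component means more trust in the speaker's own mean.
            alpha = occupancy[c] / (occupancy[c] + relevance)
            data_mean = first_order[c] / occupancy[c]
            adapted.append(alpha * data_mean + (1.0 - alpha) * means[c])
        else:
            adapted.append(means[c])
    return adapted

# Toy UBM (hypothetical values).
ubm_weights = [0.5, 0.5]
ubm_means = [-2.0, 2.0]
ubm_vars = [1.0, 1.0]

# Hypothetical frames from "speaker X", sitting slightly above each UBM mean.
speaker_frames = [-1.5] * 40 + [2.6] * 40

speaker_means = map_adapt_means(speaker_frames, ubm_weights, ubm_means, ubm_vars)
# The supervector stacks the adapted means; the shifts are its offset from the UBM.
shifts = [s - u for s, u in zip(speaker_means, ubm_means)]
print("adapted means:", speaker_means)
print("shifts:", shifts)
```

Each adapted mean lands between the UBM mean and the speaker's data, and the vector of shifts is what characterizes the speaker in this toy setting.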
ENTRADA
UNDERSTAND ASpR
And what is an iVector system? A reminder
Factor Analysis
[Figure: decomposition of a recording into speaker variability, session variability and modelling noise; diagram labels: s, c, m, z]
With training data, learn
• Good variability subspace
• Bad variability subspace
Probabilistic approach
Kenny, P., Boulianne, G., & Dumouchel, P. (2005). Eigenvoice modeling with sparse training data. IEEE Transactions on Speech and Audio Processing.
ENTRADA
UNDERSTAND ASpR
So! What is an iVector system?
It is the Total Variability space!
• Same as factor analysis, but with a single space / matrix T
• Concentrates the variability in a “small” space
• Similar to PCA, but with probabilistic estimation
In the “small” space (~400 dimensions)
• A recording is represented by one vector, denoted “iVector”
• Discriminant techniques such as PLDA (or LDA) are applied
• (Plus “normalization”/conditioning)
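For reference, the total variability model behind the iVector is usually written as follows. This is the standard notation from the literature, not a formula taken from the slide:

```latex
M = m + T w, \qquad w \sim \mathcal{N}(0, I)
```

Here M is the supervector of one recording, m is the UBM mean supervector, T is the low-rank total variability matrix, and w is the low-dimensional latent vector whose point estimate is the iVector (dimension ~400 in this talk).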
ENTRADA
UNDERSTAND ASpR
Handwriting recognition metaphor
Free-text handwriting writer recognition
[Handwriting sample: « las palmas, c’est le Fun ! »]
ENTRADA
UNDERSTAND ASpR
You could use the UBM-GMM / iVector approach
• No deep knowledge of handwriting
• 99% corpus-based approach
• Learn diversity and variability from data examples
The progress made in ASpR is easy to visualize
Handwriting recognition metaphor
[Figure: early 1990s vs 2017. Early 1990s, prompted text: « Write "My name is <your first name>" in the box, using the blue pen », e.g. « My name is Jean-Francois ». 2017, free text: « las palmas, c’est le Fun ! », « c’est top ici, à las palmas! », « las palmas, it is so nice ! », ...]
ENTRADA
UNDERSTAND ASpR
A good EER/DCF/Cllr does not mean that the system is doing writer recognition
Adding factors does not always make the task more difficult!
• Add a paper color (~ a phone), if only a small number of users consistently use it
• Add a pen color and/or a language, if only…
• Add a context/content (Las Palmas!), if only…
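The paper-color point can be made concrete with a toy simulation. The sketch below is purely illustrative: the writers, the colors and the "recognizer" (which never looks at the handwriting) are all hypothetical. It shows perfect error rates when each writer sticks to a personal paper color, and a collapse once everybody uses the same paper.

```python
from itertools import combinations

def color_only_score(sample_a, sample_b):
    """A "recognizer" that ignores the handwriting and compares paper color only."""
    return 1.0 if sample_a["paper"] == sample_b["paper"] else 0.0

def error_rates(samples, score_fn, threshold=0.5):
    """Return (false acceptance rate, false rejection rate) over all pairs."""
    false_accept = false_reject = n_target = n_non = 0
    for a, b in combinations(samples, 2):
        same_writer = a["writer"] == b["writer"]
        accepted = score_fn(a, b) > threshold
        if same_writer:
            n_target += 1
            false_reject += not accepted
        else:
            n_non += 1
            false_accept += accepted
    return false_accept / n_non, false_reject / n_target

# Hypothetical corpus A: each writer consistently uses a personal paper color.
biased = [{"writer": w, "paper": p}
          for w, p in [("ana", "blue"), ("ben", "pink"), ("eva", "green")]
          for _ in range(3)]

# Hypothetical corpus B: same writers, everybody writes on white paper.
neutral = [{"writer": w, "paper": "white"} for w in ("ana", "ben", "eva") for _ in range(3)]

far_biased, frr_biased = error_rates(biased, color_only_score)
far_neutral, frr_neutral = error_rates(neutral, color_only_score)
print("paper color tied to writer:", far_biased, frr_biased)
print("same paper for everyone:  ", far_neutral, frr_neutral)
```

On corpus A the scores look excellent although no writer recognition is happening; on corpus B the same system accepts every impostor pair.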
Handwriting recognition metaphor
[Figure: handwriting samples, shown as repeated placeholder text]
ENTRADA
UNDERSTAND ASpR
With ASpR (but not only), de-identification requires
• Knowing which task the systems actually perform
• Knowing in depth which information the system uses
And knowing that the system design can influence de-identification
• Something true today may not be true tomorrow
• This is particularly important if we do not know (well enough) what the speaker-specific information is
ENTRADA
UNDERSTAND ASpR
7396 8049
NCFB_A -1.94 4.84 0.46 5.47
Condition      F. Accept.   F. Rejection
1 - baseline   0.88 %       27.45 %
2 - !=         49.72 %      27.45 %
3 - =          96.55 %      27.45 %
Bonastre, J.-F. et al. (2007). Artificial impostor voice transformation effects on false acceptance rates. INTERSPEECH.
See also Federico Alegre et al. 2012
ENTRADA
UNDERSTAND ASpR
Important to define the receiver…
Important to define the task
For later, in the discussion, do not forget…
• The differences between handwriting writer recognition, signature recognition and Xprints
• The fact that, strictly speaking, we have not talked about biometrics until now…
Comments…
PLATO PRINCIPAL
A deeper look at the phonetic information used by ASpR systems
One PhD (Moez Ajili) + a previous PhD (Juliette Kahn)
Recipe:
• Start with a baseline system (iVector/ALIZE)
• Find a good database… with enough intraspeaker variability
• Build a database (FABIOLE), with some drawbacks but designed for the job
• Develop a protocol
• Do the experiments and try to understand the results
PLATO PRINCIPAL
FABIOLE
Recordings from various radio and TV shows (similar to ESTER, REPERE, ETAPE). FABIOLE has two sets of (male only) speakers
• T: 30 speakers with 100 recordings each (30 s minimum), taken from different shows (different days and/or channels)
• I: 100 speakers with 1 recording each (30 s minimum)
Here, we used only the T set, keeping I for other experiments
• For each T speaker, 4950 same-speaker pairs and 290K different-speaker pairs (× 30 for the total number of pairs)
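The pair counts for the T set follow directly from the set sizes, and can be checked with a few lines of arithmetic. The variable names are illustrative. Whether the « × 30 » total counts each different-speaker pair once or twice is not stated in the talk, so the sketch reports the number of unique unordered pairs as one possible reading.

```python
from math import comb

n_speakers = 30      # T set
n_recordings = 100   # recordings per T speaker

# Same-speaker pairs for one speaker: unordered pairs among 100 recordings.
same_per_speaker = comb(n_recordings, 2)

# Different-speaker pairs for one speaker: each of the speaker's recordings
# against every recording of the 29 other speakers.
diff_per_speaker = n_recordings * (n_speakers - 1) * n_recordings

same_total = n_speakers * same_per_speaker
# Summing the per-speaker count over 30 speakers sees each unordered pair twice.
diff_total_unique = n_speakers * diff_per_speaker // 2

print(same_per_speaker, diff_per_speaker, same_total, diff_total_unique)
```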
PLATO PRINCIPAL
BASELINE SYSTEM
LIA SpkDet system, using the ALIZE/SpkDet open-source toolkit
• 19 LFCC, their first derivatives and 11 second-order derivatives
• Bandwidth restricted to 300-3400 Hz (first part)
• UBM with 512 components
• UBM and T matrix trained on the ESTER 1&2, REPERE and ETAPE databases: 7,690 sessions from 2,906 speakers
• Inter-session matrix W estimated on a subset (speakers with >= 2 sessions): 3,410 sessions from 617 speakers
• iVector dimension is 400
• PLDA scoring model
PLATO PRINCIPAL
PROTOCOL FOR A PHONETIC VIEW
Automatic phonetic alignment using LIA tools
• Automatic transcription using Speeral, the LIA automatic transcription system (WER ~29% on REPERE)
• Plus verification against orthographic transcriptions
Main principles
• Remove the information of interest and observe the performance degradation
• Compare the removal of the information of interest with random pruning of the same number of acoustic frames
• Express the results as a relative loss/gain in %
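The two pruning conditions can be sketched as below. This is a minimal illustration, not the LIA tooling: the frame representation, the class labels and the frame counts are hypothetical, and running ten random controls per recording is an assumption about how the random condition is repeated.

```python
import random

def prune_phoneme_class(frames, target_class):
    """Drop every frame labelled with the class of interest."""
    return [f for f in frames if f["label"] != target_class]

def prune_random(frames, n_remove, rng):
    """Drop n_remove frames chosen uniformly at random (control condition)."""
    removed = set(rng.sample(range(len(frames)), n_remove))
    return [f for i, f in enumerate(frames) if i not in removed]

# Hypothetical aligned recording: one dict per acoustic frame.
labels = ["OV"] * 40 + ["NV"] * 10 + ["NC"] * 10 + ["P"] * 15 + ["F"] * 15 + ["L"] * 10
frames = [{"index": i, "label": lab} for i, lab in enumerate(labels)]

without_ov = prune_phoneme_class(frames, "OV")
n_removed = len(frames) - len(without_ov)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
random_controls = [prune_random(frames, n_removed, rng) for _ in range(10)]

print(len(frames), len(without_ov), [len(c) for c in random_controls])
```

Both conditions leave the same number of frames, so any difference in system performance comes from which frames were removed, not from how many.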
PLATO PRINCIPAL
PROTOCOL FOR A PHONETIC VIEW
Measure the performance using Cllr
N. Brümmer, J. du Preez, Application-independent evaluation of speaker detection, Computer Speech & Language 20 (2006) 230–275.
Cllr: two loss functions -> gives a loss of information
PLATO PRINCIPAL
PROTOCOL FOR A PHONETIC VIEW
Measure the performance using Cllr
But divide the Cllr into Tar (matched, same-speaker pairs) and Non (unmatched, different-speaker pairs) components
Hypothesis
• The Tar part is mainly linked to intraspeaker variability
• The Non part is mainly linked to interspeaker variability
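The Cllr and its split into Tar and Non parts can be sketched as follows, using the standard definition from Brümmer and du Preez. The scores are hypothetical natural-log likelihood ratios. Returning the two averages without the 1/2 factor is one possible convention; the talk does not say which one it uses.

```python
import math

def cllr_components(target_llrs, nontarget_llrs):
    """Return (Cllr, Cllr_tar, Cllr_non) from natural-log likelihood ratios."""
    # Cost of target trials: large when the LLR of a same-speaker pair is low.
    c_tar = sum(math.log2(1.0 + math.exp(-llr)) for llr in target_llrs) / len(target_llrs)
    # Cost of non-target trials: large when the LLR of a different-speaker pair is high.
    c_non = sum(math.log2(1.0 + math.exp(llr)) for llr in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (c_tar + c_non), c_tar, c_non

# A system that always outputs LLR = 0 (no information) has Cllr = 1 bit.
uninformative = cllr_components([0.0, 0.0], [0.0, 0.0, 0.0])

# Hypothetical scores: mostly correct, with one misleading same-speaker pair.
cllr, cllr_tar, cllr_non = cllr_components([4.0, 3.0, -1.0], [-5.0, -2.0, -3.0])
print(uninformative, cllr, cllr_tar, cllr_non)
```

In the hypothetical example, the single misleading same-speaker pair dominates the Tar part, which is the kind of intraspeaker effect the hypothesis above is about.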
PLATO PRINCIPAL
PROTOCOL FOR A PHONETIC VIEW
Define Relative Cllr
We now have Cllr in Tar and Non versions
We have the phonemic pruning system which, for each pair,
• Removes the frames tied to a specific phoneme
• Randomly removes the same number of frames (repeated 10 times)
We need the relative Cllr…
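One plausible way to write the relative Cllr is the change, in percent, between the Cllr after removing a phoneme class and the mean Cllr over the random-pruning runs. The formula below is an assumption made for illustration (the slide's own definition is not in the transcript), and the Cllr values are hypothetical.

```python
def relative_cllr(cllr_phoneme_removed, cllr_random_runs):
    """Relative change (in %) versus the mean of the random-pruning controls.

    Assumed definition, not taken from the slides. Positive: removing the
    phoneme class costs more than random pruning (loss). Negative: gain.
    """
    reference = sum(cllr_random_runs) / len(cllr_random_runs)
    return 100.0 * (cllr_phoneme_removed - reference) / reference

# Hypothetical Cllr values for one speaker.
random_runs = [0.20, 0.22, 0.18, 0.21, 0.19, 0.20, 0.20, 0.21, 0.19, 0.20]
loss_case = relative_cllr(0.30, random_runs)   # class removal hurts
gain_case = relative_cllr(0.15, random_runs)   # class removal helps
print(loss_case, gain_case)
```

Under this reading, a positive value is a loss and a negative value is a gain, which matches the loss/gain vocabulary used for the results.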
PLATO PRINCIPAL
PROTOCOL FOR A PHONETIC VIEW
Phoneme classification
We work mainly on phoneme classes to avoid lack-of-data problems
We use, classically, the following classes:
• Oral vowels (OV)
• Nasal vowels (NV)
• Nasal consonants (NC)
• Plosives (P)
• Fricatives (F)
• Liquids (L)
PLATO PRINCIPAL
RESULTS
Global, by phonetic category
Results similar to the literature: nasals and vowels are particularly speaker-specific (+++ nasal vowels)
Low importance of fricatives, in contradiction with L. F. Gallardo, M. Wagner, S. Möller, I-vector speaker verification based on phonetic information under transmission channel effects, in: INTERSPEECH, pp. 696–700
• But the bandwidth restriction may explain this…
PLATO PRINCIPAL
RESULTS
By speaker and Tar, Non
Even with a controlled protocol, large differences between speakers are observed (mainly on the Tar part)
PLATO PRINCIPAL
RESULTS
By class and speaker using relative Cllr
The same general tendency, but with large variability depending on the speaker!!!
• speaker 2 shows a loss of 175% without oral vowels
• speaker 28 shows a gain of about 40% in the same situation
PLATO PRINCIPAL
RESULTS
Same for different-speakers pairs only (Non)
Reinforces the importance of some phonetic classes (like oral vowels) in terms of speaker-specific information (“identity” information)
Good! Simple!
PLATO PRINCIPAL
RESULTS
Repeat with same-speaker pairs only (Tar)
Ohhhh!!!!
Oral vowels are now negative, on average…
Per-speaker variability is huge
Intra-speaker variability seems very important!
PLATO PRINCIPAL
RESULTS
Phonetic view and Tar/Non
Statistical relevance? Checked with ANOVA
• Differences are significant for both non-target and target trials
• Phonemic category explains about 60% of the variance of Cllr Non and 10.2% for Tar
• Large effect for Non, medium for Tar (using Eta-square)
Nasals are highly effective for speaker comparison (contribution of the nasal/paranasal cavities)
Oral vowels
• Bring the largest part of the speaker discrimination
• And, at the same time, show a large intra-speaker variability, which accounts for a large part of the loss
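For readers unfamiliar with Eta-square, it is the share of the total variance explained by the factor in a one-way ANOVA. The sketch below computes it on hypothetical per-class values and labels the effect size with the commonly cited Cohen conventions (0.01 small, 0.06 medium, 0.14 large). These thresholds are an assumption here; the talk does not say which convention it applied.

```python
def eta_squared(groups):
    """One-way ANOVA effect size: between-group sum of squares / total sum of squares."""
    values = [v for g in groups.values() for v in g]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    return ss_between / ss_total

def effect_label(eta2):
    """Cohen's conventional thresholds (0.01 small, 0.06 medium, 0.14 large)."""
    if eta2 >= 0.14:
        return "large"
    if eta2 >= 0.06:
        return "medium"
    if eta2 >= 0.01:
        return "small"
    return "negligible"

# Hypothetical relative-Cllr values (in %) per phoneme class, for illustration only.
toy = {
    "OV": [30.0, 34.0, 32.0],
    "NV": [20.0, 22.0, 18.0],
    "F":  [2.0, 4.0, 0.0],
}
eta2 = eta_squared(toy)
print(round(eta2, 3), effect_label(eta2))
print(effect_label(0.60), effect_label(0.102))  # the shares reported in the talk
```

With these thresholds, the reported shares of about 60% (Non) and 10.2% (Tar) fall in the large and medium bands respectively.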
PLATO PRINCIPAL
BONUS…
Undisclosed results on formants…
% of variability explained by the different factors (Eta-square)
• Large contribution of F1-F3 to phoneme discrimination
• ~No contribution of F1-F3 to speaker discrimination
• F4 has a significant contribution to speaker discrimination
• F4 is mainly linked to nasality, as shown in:
Y. Lavner, I. Gath, J. Rosenhouse, The effects of acoustic modifications on the identification of familiar voices speaking isolated vowels, Speech Communication 30 (2000) 9–26.
POSTRE
Brainstorming??
• On « identity » and voice
• Your turn… to work…
BRAINSTORMING
POSTRE
WHAT DID WE FIND?
Automatic speaker recognition systems use all the available information, globally
• Not only “speaker specific” information
Designing a database to study intraspeaker variability
• Shows a large “speaker effect”
• Shows that intraspeaker variability is responsible for ~2/3 of the system’s losses
• Shows that the information is not uniformly distributed among the phonological units
• (and killed some ideas about formants)
Even if the database is still very limited
POSTRE
LESSONS AND QUESTIONS
1- To de-identify, we need to know the target: human or ASpR
2- It is not possible to de-identify without knowing which information is used
3- Are we talking about biometrics?
• No for gait, speech…
4- Is it so important for de-identification?
• No… If the paper color is what is used, de-identifying through it could work
5- As a consequence, some de-identification approaches have a limited life expectancy (due to technology changes)
[Handwriting sample: « las palmas, c’est le Fun ! »]
POSTRE
LESSONS AND QUESTIONS
6- Is “identity” (in the biometric sense) the only element to remove for de-identification?
• (Re)opens the general question of privacy
• Voice conveys a lot of information
• Gender, age, native language, accent, education, stress, “emotion”, health, opinions…
• And… what you did yesterday evening…
• Speech also conveys information (“Las Palmas”)
• Huge interest in this “paralinguistic information”
• Special sessions, challenges, big players’ views
7- “Identity” is certainly the easiest aspect to deal with…
CREDITS, THANKS AND REFERENCES
The technical part of this presentation comes from Moez Ajili’s PhD thesis work
• With a large contribution from Solange Rossato (LIG) and Juliette Kahn (LNE)
Some of the presented results are available here:
Moez Ajili, Jean-François Bonastre, Waad Ben Kheder, Solange Rossato, Juliette Kahn (2017). Phonological content impact on wrongful convictions in forensic voice comparison context. ICASSP 2017
Moez Ajili, Jean-François Bonastre, Waad Ben Kheder, Solange Rossato, Juliette Kahn (2016). 2016 IEEE Workshop on Spoken Language Technology (SLT), 13–16 December. IEEE, San-Diego, USA
Moez Ajili, Jean-François Bonastre, Solange Rossato, Juliette Kahn (2016, March). Inter-speaker variability in forensic voice comparison: a preliminary evaluation. ICASSP 2016, Shanghai, China.
Moez Ajili, Jean-François Bonastre, Juliette Kahn, Solange Rossato, Guillaume Bernard (2016). Fabiole, a speech database for forensic speaker comparison. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), pp. 23-28
OTHER REFERENCES
J. P. Campbell, W. Shen, W. M. Campbell, R. Schwartz, J.-F. Bonastre, D. Matrouf, Forensic speaker recognition, Signal Processing, 2009
J. Kahn, N. Audibert, J.-F. Bonastre, S. Rossato, Inter and intraspeaker variability in French: an analysis of oral vowels and its implication for automatic speaker verification, in: International Congress of Phonetic Sciences (ICPhS), pp. 1002–1005.
K. Amino, T. Osanai, T. Kamada, H. Makinae, T. Arai, Effects of the phonological contents and transmission channels on forensic speaker recognition, in: Forensic Speaker Recognition, Springer, 2012, pp. 275–308.
J. P. Eatock, J. S. Mason, A quantitative assessment of the relative speaker discriminating properties of phonemes, ICASSP-94
U. Hofker, Auros-automatic recognition of speakers by computers: phoneme ordering for speaker recognition, in: Proc. 9th International Congress on Acoustics
K. Amino, T. Sugawara, T. Arai, Idiosyncrasy of nasal sounds in human speaker identification and their acoustic properties, Acoustical science and technology 27 (2006) 233–235.
J. H. Hansen, T. Hasan, Speaker recognition by machines and humans: a tutorial review, IEEE Signal Processing Magazine 32 (2015) 74–99.
S. S. Kajarekar, H. Bratt, E. Shriberg, R. De Leon, A study of intentional voice modifications for evading automatic speaker recognition, in: 2006 IEEE Odyssey-The Speaker and Language Recognition Workshop, IEEE, pp. 1–6.
C. Schindler, C. Draxler, The influence of bandwidth limitation on the speaker discriminating potential of nasals and fricatives, International Association for Forensic Phonetics and Acoustics (IAFPA) (2013).