Speech Research Activities @ University of Eastern Finland
Management Committee Meeting of COST Action IC1206, Zagreb, July 8-9, 2013
Dr. Tomi H. Kinnunen
School of Computing, UEF
-
University of Eastern Finland
• Three campuses: Joensuu, Kuopio, Savonlinna
• Four faculties: Science and forestry, health science, philosophical, business studies
• 15000 students, 2800 staff members
School of Computing
• Annually 100 MSc + 10 PhD degrees
• International master’s program (IMPIT)
• Multimedia computing is one of the 3 focus areas
-
Speech & Image Processing Unit (SIPU)
Prof. Pasi Fränti (team leader)
Dr. Tomi Kinnunen (leader of speech processing topics)
Dr. Ville Hautamäki (postdoc)
Dr. Padmanabhan Rajan (postdoc)
Dr. Rahim Saeidi (postdoc @ Radboud Univ, the Netherlands)
Dr. Cemal Hanilci (collaborator @ Uludag Univ, Turkey)
+ Several PhD students working on speech technology, data clustering and location-based systems (see the webpage for details)
Core team
http://www.uef.fi/en/sipu
http://www.uef.fi/en/sipu/people
-
Research topics
Speech processing
• Speaker recognition
• Feature extraction
• Voice activity detection
• Voice conversion
Clustering methods
• Clustering algorithms
• Clustering validity
• Graph clustering
• Gaussian mixture models
Location-based application
• Mobile data collection
• Route reduction and compression
• Photo collections and social networks
• Location-aware services & search engine
Image processing & compression
• Lossless compression and data reduction
• Image denoising
• Ultrasonic, medical and HDR imaging
-
Speaker recognition activities
• Odyssey 2014: The Speaker and Language Recognition Workshop in Joensuu
• Theses: Four PhD theses on speaker recognition: Tomi Kinnunen (2005), Ville Hautamäki
(2008), Rahim Saeidi (2011), Evgeny Karpov (2011)
• Funding: National TEKES-funded PUMS project (2003-2007) of Fränti, 3-year postdoc
projects of Kinnunen (speaker recognition, 2010-2012) and Hautamäki (dialect and accent
recognition, 2012-2014, ongoing), and 4-year Academy project of Kinnunen, ”Reliable
Speaker Recognition and Modification” (2012-2015, ongoing)
• Main collaborators: I2R and NTU (Singapore), Aalto university (Finland), Georgia Tech
(USA), Aalborg University (Denmark), Lund University (Sweden), Uludag University (Turkey),
CRIM (Canada), Speech Technology Center (Russia)
• NIST SRE: Participation in 2006, 2008, 2010, 2012
• Selected publications:
• V. Hautamäki, T. Kinnunen, F. Sedlak, K.A. Lee, B. Ma, H. Li, “Sparse Classifier Fusion for Speaker
Verification”, IEEE Trans. Audio, Speech and Language Processing, 21(8): 1622-1631, August 2013.
• T. Kinnunen, R. Saeidi, F. Sedlak, K.A. Lee, J. Sandberg, M. Hansson-Sandsten, H. Li, “Low-Variance
Multitaper MFCC Features: a Case Study in Robust Speaker Verification”, IEEE Trans. Audio, Speech and
Language Processing, 20 (7), 1990-2001, 2012.
• C. Hanilci, T. Kinnunen, F. Ertas, R. Saeidi, J. Pohjalainen, P. Alku, “Regularized All-Pole Models for Speaker
Verification Under Noisy Environments”, IEEE Signal Processing Letters 19(3), 163-166, March 2012
• T. Kinnunen and H. Li, ”An overview of text-independent speaker recognition: from features to supervectors”,
Speech Communication, 52(1): 12--40, January 2010.
• T. Kinnunen, E. Karpov, P. Fränti, “Real-Time Speaker Identification and Verification”, IEEE Transactions on
Audio, Speech and Language Processing, 14(1): 277--288, Jan 2006.
-
I4U submission to NIST SRE 2012
1. ValidSoft Ltd, UK
2. Swansea University, UK
3. University of Avignon, France
4. Radboud University Nijmegen, the Netherlands
5. University of Texas at Dallas, USA
6. University of Eastern Finland, Finland
7. Institute for Infocomm Research, Singapore
8. IDIAP Research Institute, Switzerland
[ R. Saeidi et al. “I4U submission to NIST SRE 2012: a large-scale collaborative effort
for noise-robust speaker verification”, Interspeech 2013 (to appear) ]
-
International summer schools and seminars
19th International Summer School in Novel Computing
June 18-21, 2012, UEF, Joensuu Campus
Recent Advances in Probabilistic Modeling for Pattern Recognition
Dr. Patrick Kenny, Lead researcher, CRIM, Montreal, Canada.
Attendance: 43 participants
Social Computing
Dr. Rosta Farzan, Postdoc researcher, Human Computer Interaction Institute, Carnegie Mellon University, USA.
Attendance: 22 participants
Winter workshop on Data Mining and Pattern Recognition
March 4-6, 2013, UEF, Mekrijärvi research station, Ilomantsi
Scientific Presentations Skills
June 10-12, 2013, UEF, Joensuu Campus
Dr. Jean-Luc Lebrun
Scientific Writing Skills
August 6-8, 2012, UEF, Joensuu Campus
Dr. Jean-Luc Lebrun
Attendance: 96 participants
16th International Summer School in Novel Computing
August 10 - August 14, 2009, Joensuu
Speaker and Language Recognition
Dr. Douglas A. Reynolds, Lincoln Laboratory, MIT
Attendance: 34 participants
Platforms for Stories-based Learning in Future Schools
Prof. Paul De Bra, Eindhoven University of Technology, The Netherlands
Attendance: 18 participants
http://cs.joensuu.fi/ecse/2012/
http://www.crim.ca/en/r-d/bottin_recherche/?fiche=/fr/r-d/reconnaissance_parole/equipe/fiche07.html
http://rosta-farzan.net/
http://cs.joensuu.fi/ecse/dmpr2013/
http://cs.joensuu.fi/ecse/SciPre2013/
http://www.scientific-writing.com/
http://cs.joensuu.fi/ecse/SciWri2012/
http://cs.joensuu.fi/ecse/16issnc/index.html
http://www.ll.mit.edu/mission/communications/ist/biographies/reynolds-bio.html
http://wwwis.win.tue.nl/~debra/
-
Activities relevant to
COST action IC1206
1) Speaker recognition
2) Voice conversion
3) Spoofing and anti-spoofing for speaker
recognition
-
Spoofing speaker recognizers
Human imitators (Lau et al., 2005; Farrus et al., 2008)
Playback attacks (Lindberg & Blomberg, Eurospeech 1999; Villalba & Lleida, FALA 2010)
Speaker-adapted speech synthesis (Pellom & Hansen, ICASSP 1999; Satoh et al., Eurospeech 2001; DeLeon et al., Speaker Odyssey 2010)
Voice conversion (Jin et al., ICASSP 2008; Bonastre et al., Interspeech 2007 + many more)
SPECIAL SESSION: ”Spoofing and countermeasures for automatic
speaker verification”
Organizers: Nick Evans (EURECOM), Tomi Kinnunen (UEF), Junichi
Yamagishi (Univ. Edinburgh), Sebastien Marcel (IDIAP)
-
Voice Conversion
Feature extraction
• Spectrum extraction using SPTK toolkit, 30 mel-cepstral coefficients (MCEP)
• F0 extraction using the RAPT algorithm
Mapping function
• Joint density GMM (Kain & Macon, ICASSP 1998)
Frame alignment
• VQ codebook mapping (Sundermann et al, Interspeech 2004)
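The joint density GMM mapping function can be illustrated with a minimal sketch. It assumes scalar (1-dimensional) source and target features so that no matrix inversion is needed; the slides use 30-dimensional MCEP vectors. The function name `convert_frame` and all parameter names are hypothetical, and the model parameters are taken as given (training on aligned frame pairs is not shown). The converted value is the conditional expectation of the target feature given the source feature under the joint model.

```python
import math

def convert_frame(x, weights, mu_x, mu_y, var_xx, cov_yx):
    """Joint-density GMM conversion of one scalar source feature x.

    Each component k carries a source mean mu_x[k], a target mean mu_y[k],
    a source variance var_xx[k] and a target-source covariance cov_yx[k].
    Returns sum_k P(k | x) * (mu_y[k] + cov_yx[k] / var_xx[k] * (x - mu_x[k])).
    """
    # Log of weight times source-side Gaussian likelihood, per component.
    logs = [math.log(w) - 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
            for w, m, v in zip(weights, mu_x, var_xx)]
    top = max(logs)
    unnorm = [math.exp(l - top) for l in logs]
    total = sum(unnorm)
    y = 0.0
    for p, mx, my, v, c in zip(unnorm, mu_x, mu_y, var_xx, cov_yx):
        y += (p / total) * (my + c / v * (x - mx))
    return y

# Single-component toy model (illustrative values): reduces to linear regression.
print(convert_frame(4.0, [1.0], [0.0], [1.0], [2.0], [1.0]))
```

With a single component the mapping is a straight line; with several components the posterior weights blend one local line per component.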
-
Speaker modeling: selected approaches
Type | Approach | Reference
Frame-based | 1. Gaussian mixture model with universal background model (GMM-UBM) | Reynolds et al, 2000
Frame-based | 2. Vector quantizer with universal background model (VQ-UBM) | Hautamäki et al, 2008
Utterance-based | 3. Generalized linear discriminant sequence support vector machine (GLDS-SVM) | Campbell et al, 2006a
Utterance-based | 4. GMM supervectors with support vector machine (GMM-SVM) | Campbell et al, 2006b
Utterance-based | 5. GMM with joint factor analysis (GMM-JFA) | Kenny et al, 2005, 2006, 2008
-
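Gaussian mixture model (GMM)
$p(x \mid \lambda) = \sum_{k=1}^{K} P_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
$p(x \mid \lambda)$: density
$P_k$: prior probability
$\mu_k$: mean vector
$\Sigma_k$: cov. matrix
$\mathcal{N}(x \mid \mu_k, \Sigma_k)$: multivariate Gaussian density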
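A minimal sketch of evaluating the GMM density, assuming diagonal covariance matrices (the slide does not say which covariance structure is used). The function names and the toy parameter values are hypothetical. The sum over components is computed in the log domain to avoid underflow.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian whose covariance is diagonal (variances in var)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)

def gmm_log_density(x, weights, means, variances):
    """log of sum_k P_k N(x | mu_k, Sigma_k), using the log-sum-exp trick."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Toy 2-component model over 1-dimensional features (illustrative values).
weights = [0.5, 0.5]
means = [[0.0], [4.0]]
variances = [[1.0], [1.0]]
print(gmm_log_density([0.0], weights, means, variances))
```

In a GMM-UBM system, such per-frame log densities would typically be averaged over the frames of an utterance for both the speaker model and the UBM.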
-
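Maximum a Posteriori (MAP) Adaptation of GMMs
$\hat{\mu}_k = \alpha_k \bar{x}_k + (1 - \alpha_k) \mu_k$
$\hat{\mu}_k$: adapted mean vector
$\bar{x}_k$: mean of training data
$\mu_k$: prior mean from univ. background model (UBM)
$\alpha_k = \frac{n_k}{n_k + r}$: adaptation coefficient (r = relevance factor, usually r = 16)
$\bar{x}_k = \frac{1}{n_k} \sum_{t=1}^{T} P(k \mid x_t) \, x_t$: mean of training data assigned to kth Gaussian
$n_k = \sum_{t=1}^{T} P(k \mid x_t)$: soft count of vectors assigned to kth Gaussian
$P(k \mid x_t) = \frac{P_k \, \mathcal{N}(x_t \mid \mu_k, \Sigma_k)}{\sum_{m=1}^{K} P_m \, \mathcal{N}(x_t \mid \mu_m, \Sigma_m)}$: posterior probability of the kth Gaussian for one feature vector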
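The adaptation steps above can be sketched as follows. This is a mean-only adaptation under the assumption of diagonal covariances, with weights and variances copied from the UBM unchanged; the function names are hypothetical, and r = 16 is the value quoted on the slide.

```python
import math

def component_posteriors(x, weights, means, variances):
    """P(k | x) for every Gaussian k of a diagonal-covariance GMM."""
    logs = []
    for w, mean, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mean, var):
            ll -= 0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        logs.append(ll)
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def map_adapt_means(data, weights, means, variances, r=16.0):
    """Return MAP-adapted mean vectors; the UBM means act as the prior."""
    K, d = len(means), len(means[0])
    n = [0.0] * K                                 # soft counts n_k
    weighted_sum = [[0.0] * d for _ in range(K)]  # sum_t P(k | x_t) x_t
    for x in data:
        post = component_posteriors(x, weights, means, variances)
        for k in range(K):
            n[k] += post[k]
            for j in range(d):
                weighted_sum[k][j] += post[k] * x[j]
    adapted = []
    for k in range(K):
        if n[k] == 0.0:
            adapted.append(list(means[k]))        # no data: keep the prior mean
            continue
        alpha = n[k] / (n[k] + r)                 # adaptation coefficient
        x_bar = [s / n[k] for s in weighted_sum[k]]
        adapted.append([alpha * xb + (1.0 - alpha) * mu
                        for xb, mu in zip(x_bar, means[k])])
    return adapted

# 16 identical frames and r = 16 give alpha = 0.5: halfway between prior and data.
print(map_adapt_means([[2.0]] * 16, [1.0], [[0.0]], [[1.0]]))
```

A Gaussian that sees little data keeps a mean close to the UBM mean, while a Gaussian with a large soft count moves toward the speaker's data.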
-
Vector Quantization (VQ)
One of the “classical” speaker recognition methods; performance similar to GMM at reduced computation
Speaker model = codebook $C = \{c_1, c_2, \ldots, c_K\}$, where $c_k$ are the centroid vectors. Usually these are obtained using K-means
$c_k = \alpha_k \bar{x}_k + (1 - \alpha_k) u_k$
$\bar{x}_k = \frac{1}{|S_k|} \sum_{x_n \in S_k} x_n$
$\alpha_k = \frac{|S_k|}{|S_k| + r}$
where $S_k = \{ x : \|x - c_k\| \le \|x - c_m\| \ \forall m \}$
r = relevance factor (fixed constant)
$\bar{x}_k$: centroid of speaker’s training vectors that are mapped to UBM vector $u_k$
$u_k$: UBM centroid
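A minimal sketch of the centroid adaptation, assuming a single pass in which the training vectors are partitioned by their nearest UBM centroid (that is, the centroids start at the UBM values). The function names are hypothetical and the relevance factor in the usage line is an arbitrary illustrative value.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def adapt_codebook(data, ubm, r):
    """Adapt a UBM codebook toward a speaker's training vectors."""
    K, d = len(ubm), len(ubm[0])
    cells = [[] for _ in range(K)]
    for x in data:
        # Map the vector to its nearest UBM centroid.
        k = min(range(K), key=lambda m: sq_dist(x, ubm[m]))
        cells[k].append(x)
    adapted = []
    for k in range(K):
        size = len(cells[k])
        if size == 0:
            adapted.append(list(ubm[k]))       # empty cell: keep the UBM centroid
            continue
        x_bar = [sum(x[j] for x in cells[k]) / size for j in range(d)]
        alpha = size / (size + r)
        adapted.append([alpha * xb + (1.0 - alpha) * u
                        for xb, u in zip(x_bar, ubm[k])])
    return adapted

# Both vectors fall in the first cell; the second centroid stays at its UBM value.
print(adapt_codebook([[1.0], [3.0]], [[0.0], [10.0]], r=2.0))
```

Compared with GMM adaptation, the soft posterior counts are replaced by hard cell sizes, which is where the computational saving comes from.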
-
Sequence-kernel SVMs
-
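Generalized Linear Discriminant Sequence SVM (GLDS-SVM) [Campbell et al, 2006]
Use monomials (up to a certain degree) to expand feature vectors, and use the average to represent utterances:
$x_t \rightarrow b(x_t)$
$\{x_1, \ldots, x_T\} \rightarrow \bar{b} = \frac{1}{T} \sum_{t=1}^{T} b(x_t)$
Note: dimensionality = (d + M)! / (d! × M!) for d-dimensional features and monomials up to degree M
Use linear kernel SVM with these supervectors
Example: 2nd-order expansion from a 2-dimensional input space, $x = (x_1, x_2)$:
$b(x) = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$, i.e. 4! / (2! × 2!) = 6 terms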
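The monomial expansion and the utterance-level average can be sketched in a few lines. The function names are hypothetical, and any normalization applied before the SVM is not shown.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def monomials(x, degree):
    """All monomials of the entries of x up to the given degree, constant first."""
    terms = []
    for order in range(degree + 1):
        for idx in combinations_with_replacement(range(len(x)), order):
            terms.append(prod(x[i] for i in idx))  # empty product is 1
    return terms

def utterance_vector(frames, degree):
    """Average of the expanded frames: one fixed-length vector per utterance."""
    expanded = [monomials(f, degree) for f in frames]
    return [sum(col) / len(frames) for col in zip(*expanded)]

# 2-dimensional input, 2nd-order expansion: [1, x1, x2, x1^2, x1*x2, x2^2]
b = monomials([2.0, 3.0], 2)
print(b, len(b) == comb(2 + 2, 2))
```

The length of the expanded vector matches the dimensionality formula, since (d + M)! / (d! × M!) is the binomial coefficient computed by `comb(d + M, M)`.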
-
GMM Supervectors [Campbell et al, 2006]
Speech utterance → Feature extraction → MAP adaptation (of the universal background model) → adapted means $\mu_1, \mu_2, \ldots, \mu_K$
$\mu = [\mu_1, \mu_2, \ldots, \mu_K]$
GMM mean supervector of dimensionality K × d, where K = number of Gaussians, d = number of acoustic features
-
Joint Factor Analysis (JFA)
Assume that each supervector is composed as a sum of two statistically independent components:
$M = s + c$
$M$: speaker- and channel-dependent supervector
$s$: speaker supervector
$c$: channel supervector
$s = m + Vy + Dz$
$c = Ux$
$m$: UBM supervector
$V$: eigenvoice matrix
$y$: speaker factors
$Dz$: residual term
$U$: eigenchannel matrix
$x$: channel factors
[Kenny et al, 2005, 2007, 2008, http://www.crim.ca/perso/patrick.kenny/]
V, D and U are the model hyperparameters trained beforehand on large
datasets ; x, y and z need to be estimated for a given training sample
JFA cookbook from Brno Univ. Tech (BUT),
http://speech.fit.vutbr.cz/software/joint-factor-analysis-matlab-demo
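A minimal sketch of the JFA decomposition, composing a supervector from given factors. It assumes D is diagonal and stored as a vector; the function names and the tiny dimensions are hypothetical. Estimating the hyperparameters and the factors x, y and z from data is not shown.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def jfa_supervector(m, V, y, D_diag, z, U, x):
    """Compose M = s + c with s = m + V y + D z and c = U x."""
    s = [mi + vy + di * zi
         for mi, vy, di, zi in zip(m, matvec(V, y), D_diag, z)]
    c = matvec(U, x)
    return [si + ci for si, ci in zip(s, c)]

# Toy example: 3-dimensional supervector, one speaker factor, one channel factor.
m = [1.0, 2.0, 3.0]
V = [[1.0], [0.0], [0.0]]
U = [[0.0], [1.0], [0.0]]
print(jfa_supervector(m, V, [2.0], [0.5, 0.5, 0.5], [2.0, 0.0, 0.0], U, [-1.0]))
```

Setting x to zero in this composition yields the speaker supervector s alone, which is the idea behind channel compensation.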
-
Spoofing results
Classifier | Intersession compensation | Baseline accuracy (EER, %) | Accuracy on spoofed samples (EER, %)
GMM-UBM | - | 7.63 | 24.99
VQ-UBM | - | 7.56 | 22.62
GLDS-SVM | - | 7.16 | 25.17
GMM-SVM | Nuisance attribute projection (NAP) | 3.74 | 12.58
GMM-JFA | Joint factor analysis (JFA) | 3.24 | 7.61
T. Kinnunen, Z.-Z. Wu, K. A. Lee, F. Sedlak, E. S. Chng, H. Li, “Vulnerability of Speaker Verification Systems
Against Voice Conversion Spoofing Attacks: the Case of Telephone Speech”, Proc. ICASSP 2012
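The equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of estimating it from two lists of verification scores follows; it assumes that higher scores mean "accept", scans only thresholds that occur among the scores, and uses made-up toy scores. The function names are hypothetical.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) when trials scoring >= threshold are accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr

def equal_error_rate(genuine, impostor):
    """Return (EER, threshold) at the score where FAR and FRR are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]

genuine = [2.0, 3.0, 4.0, 5.0]    # target-speaker trials (toy scores)
impostor = [0.0, 1.0, 2.5, 1.5]   # non-target trials (toy scores)
print(equal_error_rate(genuine, impostor))
```

With few trials the two error rates may never be exactly equal, which is why the sketch reports their average at the closest point.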
-
Calibration breaks down even for the advanced classifiers
False acceptance rates (%, threshold = EER point on baseline)
Condition | GMM-SVM | GMM-JFA
Baseline (no spoofing) | 3.74 | 3.24
Voice conversion spoofing (JD-GMM) | 41.54 | 17.33
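The evaluation protocol behind such numbers can be sketched as follows: the decision threshold is fixed beforehand at the baseline EER point, and the false acceptance rate is then measured again on spoofed trials without moving the threshold. The scores and the threshold below are made-up toy values, and the function name is hypothetical.

```python
def false_acceptance_rate(scores, threshold):
    """Fraction of non-target trials accepted (score >= threshold)."""
    return sum(s >= threshold for s in scores) / len(scores)

# Threshold assumed to have been fixed at the baseline EER point beforehand.
threshold = 2.5
baseline_impostor = [0.0, 1.0, 2.5, 1.5]   # ordinary impostor trials (toy scores)
spoofed_impostor = [2.0, 3.0, 2.6, 1.0]    # same impostors after conversion (toy scores)

print(false_acceptance_rate(baseline_impostor, threshold))
print(false_acceptance_rate(spoofed_impostor, threshold))
```

Because the spoofed scores shift toward the genuine score range, a threshold that was well placed on the baseline admits many more impostors.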
-
Spoofing i-vector systems
Z. Wu, T. Kinnunen, E.S. Chng, H. Li, E. Ambikairajah, ”A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case”, APSIPA 2012, Hollywood, US, December 2012
False acceptance rates (%, threshold = EER point on baseline)
Condition | GMM-JFA | i-vector PLDA
Baseline (no spoofing) | 3.24 | 2.99
JD-GMM conversion | 17.36 | 19.29
Unit selection conversion | 32.54 | 41.25
-
Study with a human impersonator
False acceptance rates (%, threshold = EER point on baseline)
Condition | GMM-UBM | i-vector PLDA
Baseline (no spoofing) | 11.11 | 9.03
Mimicry attack | 9.68 | 11.61
R. Gonzales Hautamäki, Tomi Kinnunen, Ville Hautamäki, Timo Leino, Anne-Maria
Laukkanen, ”I-vectors meet imitators: on vulnerability of speaker verification systems against
voice mimicry”, Interspeech 2013 (to appear)
-
Acknowledgements
COST Action IC 1206 and members
Academy of Finland for partial funding
Zhizheng Wu, Eng Siong Chng, Haizhou Li (I2R and NTU,
Singapore) for the joint studies on spoofing
Nick Evans, Junichi Yamagishi, Sebastien Marcel for the
joint organization of Interspeech 2013 special session
Other colleagues at SIPU who contributed to the studies
and material presented herein