Speech Research Activities @ University of Eastern Finland

Management Committee Meeting of COST Action IC1206, Zagreb, July 8-9, 2013

Dr. Tomi H. Kinnunen

School of Computing, UEF

[email protected]

University of Eastern Finland

• Three campuses: Joensuu, Kuopio, Savonlinna

• Four faculties: Science and forestry, health science, philosophical, business studies

• 15000 students, 2800 staff members

School of Computing

• Annually 100 MSc + 10 PhD degrees

• International master’s program (IMPIT)

• Multimedia computing is one of the 3 focus areas

Speech & Image Processing Unit (SIPU)

Prof. Pasi Fränti (team leader)

Dr. Tomi Kinnunen (leader of speech processing topics)

Dr. Ville Hautamäki (postdoc)

Dr. Padmanabhan Rajan (postdoc)

Dr. Rahim Saeidi (postdoc @ Radboud Univ, the Netherlands)

Dr. Cemal Hanilci (collaborator @ Uludag Univ, Turkey)

+ Several PhD students working on speech technology, data clustering and location-based systems (see the webpage for details)

Core team

http://www.uef.fi/en/sipu

http://www.uef.fi/en/sipu/people

Research topics

Speech processing

Clustering methods

Location-based application

Clustering algorithms

Clustering validity

Graph clustering

Gaussian mixture models

Speaker recognition

Feature extraction

Voice activity detection

Voice conversion

Mobile data collection

Route reduction and compression

Photo collections and social networks

Location-aware services & search engine

Lossless compression and data reduction

Image denoising

Ultrasonic, medical and HDR imaging

Image processing & compression

Speaker recognition activities

• Odyssey 2014: The Speaker and Language Recognition Workshop in Joensuu

• Theses: Four PhD theses on speaker recognition: Tomi Kinnunen (2005), Ville Hautamäki (2008), Rahim Saeidi (2011), Evgeny Karpov (2011)

• Funding: National TEKES-funded PUMS project (2003-2007) of Fränti, 3-year postdoc projects of Kinnunen (speaker recognition, 2010-2012) and Hautamäki (dialect and accent recognition, 2012-2014, ongoing), and 4-year Academy project of Kinnunen, “Reliable Speaker Recognition and Modification” (2012-2015, ongoing)

• Main collaborators: I2R and NTU (Singapore), Aalto University (Finland), Georgia Tech (USA), Aalborg University (Denmark), Lund University (Sweden), Uludag University (Turkey), CRIM (Canada), Speech Technology Center (Russia)

• NIST SRE: Participation in 2006, 2008, 2010, 2012

• Selected publications:

• V. Hautamäki, T. Kinnunen, F. Sedlak, K.A. Lee, B. Ma, H. Li, “Sparse Classifier Fusion for Speaker Verification”, IEEE Trans. Audio, Speech and Language Processing, 21(8): 1622-1681, August 2013.

• T. Kinnunen, R. Saeidi, F. Sedlak, K.A. Lee, J. Sandberg, M. Hansson-Sandsten, H. Li, “Low-Variance Multitaper MFCC Features: a Case Study in Robust Speaker Verification”, IEEE Trans. Audio, Speech and Language Processing, 20(7): 1990-2001, 2012.

• C. Hanilci, T. Kinnunen, F. Ertas, R. Saeidi, J. Pohjalainen, P. Alku, “Regularized All-Pole Models for Speaker Verification Under Noisy Environments”, IEEE Signal Processing Letters, 19(3): 163-166, March 2012.

• T. Kinnunen and H. Li, “An overview of text-independent speaker recognition: from features to supervectors”, Speech Communication, 52(1): 12-40, January 2010.

• T. Kinnunen, E. Karpov, P. Fränti, “Real-Time Speaker Identification and Verification”, IEEE Trans. Audio, Speech and Language Processing, 14(1): 277-288, January 2006.

I4U submission to NIST SRE 2012

1. ValidSoft Ltd, UK

2. Swansea University, UK

3. University of Avignon, France

4. Radboud University Nijmegen, the Netherlands

5. University of Texas at Dallas, USA

6. University of Eastern Finland, Finland

7. Institute for Infocomm Research, Singapore

8. IDIAP Research Institute, Switzerland

[ R. Saeidi et al. “I4U submission to NIST SRE 2012: a large-scale collaborative effort

for noise-robust speaker verification”, Interspeech 2013 (to appear) ]

International summer schools and seminars

19th International Summer School in Novel Computing

June 18-21, 2012, UEF, Joensuu Campus

Recent Advances in Probabilistic Modeling for Pattern Recognition

Dr. Patrick Kenny, Lead researcher, CRIM, Montreal, Canada.

Attendance: 43 participants

Social Computing

Dr. Rosta Farzan, Postdoc researcher, Human Computer Interaction Institute, Carnegie Mellon University, USA.

Attendance: 22 participants

Winter workshop on Data Mining and Pattern Recognition

March 4-6, 2013, UEF, Mekrijärvi research station, Ilomantsi

Scientific Presentations Skills

June 10-12, 2013, UEF, Joensuu Campus

Dr. Jean-Luc Lebrun

Scientific Writing Skills

August 6-8, 2012, UEF, Joensuu Campus

Dr. Jean-Luc Lebrun


16th International Summer School in Novel Computing

August 10 - August 14, 2009, Joensuu

Speaker and Language Recognition

Dr. Douglas A. Reynolds, Lincoln Laboratory, MIT


Platforms for Stories-based Learning in Future Schools

Prof. Paul De Bra, Eindhoven University of Technology, The Netherlands


http://cs.joensuu.fi/ecse/2012/

http://www.crim.ca/en/r-d/bottin_recherche/?fiche=/fr/r-d/reconnaissance_parole/equipe/fiche07.html

http://rosta-farzan.net/

http://cs.joensuu.fi/ecse/dmpr2013/

http://cs.joensuu.fi/ecse/SciPre2013/

http://www.scientific-writing.com/

http://cs.joensuu.fi/ecse/SciWri2012/

http://cs.joensuu.fi/ecse/16issnc/index.html

http://www.ll.mit.edu/mission/communications/ist/biographies/reynolds-bio.html

http://wwwis.win.tue.nl/~debra/

Activities relevant to COST Action IC1206

1) Speaker recognition

2) Voice conversion

3) Spoofing and anti-spoofing for speaker recognition

Spoofing speaker recognizers

Human imitators (Lau et al., 2005; Farrus et al., 2008)

Playback attacks (Lindberg & Blomberg, Eurospeech 1999; Villalba & Lleida, FALA 2010)

Speaker-adapted speech synthesis (Pellom & Hansen, ICASSP 1999; Satoh et al., Eurospeech 2001; De Leon et al., Speaker Odyssey 2010)

Voice conversion (Jin et al., ICASSP 2008; Bonastre et al., Interspeech 2007 + many more)

Interspeech 2013 special session: “Spoofing and countermeasures for automatic speaker verification”

Organizers: Nick Evans (EURECOM), Tomi Kinnunen (UEF), Junichi Yamagishi (Univ. Edinburgh), Sebastien Marcel (IDIAP)

Voice Conversion

Feature extraction: spectrum extraction using the SPTK toolkit, 30 mel-cepstral coefficients (MCEP); F0 extraction using the RAPT algorithm

Mapping function: joint-density GMM (Kain & Macon, ICASSP 1998)

Frame alignment: VQ codebook mapping (Sundermann et al., Interspeech 2004)

Speaker modeling: selected approaches

Type | Approach | Reference
Frame-based | 1. Gaussian mixture model with universal background model (GMM-UBM) | Reynolds et al, 2000
Frame-based | 2. Vector quantizer with universal background model (VQ-UBM) | Hautamäki et al, 2008
Utterance-based | 3. Generalized linear discriminant sequence support vector machine (GLDS-SVM) | Campbell et al, 2006a
Utterance-based | 4. GMM supervectors with support vector machine (GMM-SVM) | Campbell et al, 2006b
Utterance-based | 5. GMM with joint factor analysis (GMM-JFA) | Kenny et al, 2005, 2006, 2008

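Gaussian mixture model (GMM)

Density: p(x \mid \lambda) = \sum_{k=1}^{K} P_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

P_k = prior probability, \mu_k = mean vector, \Sigma_k = cov. matrix of the kth component

\mathcal{N}(x \mid \mu_k, \Sigma_k) = multivariate Gaussian density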
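As a rough illustration of how the mixture density on this slide is evaluated, here is a minimal NumPy sketch. The function name `gmm_log_density` and its argument layout are assumptions made for this example, not part of the original material. It works in the log domain to avoid underflow.

```python
import numpy as np

def gmm_log_density(x, weights, means, covs):
    """Log of sum_k weights[k] * N(x | means[k], covs[k]) for one vector x.

    Hypothetical helper: weights are the prior probabilities, means the
    mean vectors and covs the full covariance matrices of the components.
    """
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    log_terms = []
    for w, mu, cov in zip(weights, means, covs):
        diff = x - np.asarray(mu, dtype=float)
        cov = np.asarray(cov, dtype=float)
        # Mahalanobis distance and log-determinant of the covariance matrix.
        maha = diff @ np.linalg.solve(cov, diff)
        _, logdet = np.linalg.slogdet(cov)
        log_gauss = -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)
        log_terms.append(np.log(w) + log_gauss)
    # Log-sum-exp over the components for numerical stability.
    return np.logaddexp.reduce(log_terms)
```

In a GMM-UBM system, the same function would typically be evaluated per frame for both the speaker model and the background model, and the difference of the average log-densities used as the score.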

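Maximum a Posteriori (MAP) Adaptation of GMMs

\hat{\mu}_k = \alpha_k \bar{x}_k + (1 - \alpha_k) \mu_k

\hat{\mu}_k = adapted mean vector, \bar{x}_k = mean of training data, \mu_k = prior mean from univ. background model (UBM)

\alpha_k = \frac{n_k}{n_k + r} = adaptation coefficient (r = relevance factor, usually r = 16)

\bar{x}_k = \frac{1}{n_k} \sum_{t=1}^{T} P(k \mid x_t) \, x_t = mean of training data assigned to kth Gaussian

n_k = \sum_{t=1}^{T} P(k \mid x_t) = soft count of vectors assigned to kth Gaussian

P(k \mid x_t) = \frac{P_k \, \mathcal{N}(x_t \mid \mu_k, \Sigma_k)}{\sum_{m=1}^{K} P_m \, \mathcal{N}(x_t \mid \mu_m, \Sigma_m)} = posterior probability of the kth Gaussian for one feature vector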
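The quantities listed on this slide (posteriors, soft counts, data means, adaptation coefficients) can be put together as in the sketch below. It assumes a diagonal-covariance UBM and adapts the means only. The function name `map_adapt_means` and the array shapes are choices made for this illustration, not taken from the original slides.

```python
import numpy as np

def map_adapt_means(X, weights, means, variances, r=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM (sketch).

    X: (T, d) training vectors; weights: (K,) priors;
    means, variances: (K, d) UBM means and diagonal variances;
    r: relevance factor.
    """
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    # log( weights[k] * N(x_t | means[k], diag(variances[k])) ), shape (T, K).
    diff = X[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss

    # Posterior probability of each Gaussian for each feature vector.
    log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    post = np.exp(log_joint - log_norm)

    n = post.sum(axis=0)                                   # soft counts, (K,)
    x_bar = (post.T @ X) / np.maximum(n, 1e-10)[:, None]   # data means, (K, d)
    alpha = n / (n + r)                                    # adaptation coefficients

    # Interpolate between the data mean and the UBM prior mean.
    return alpha[:, None] * x_bar + (1.0 - alpha[:, None]) * means
```

With little data for a given Gaussian, its soft count is small, the adaptation coefficient is close to 0, and the adapted mean stays near the UBM mean.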

Vector Quantization (VQ)

One of the “classical” speaker recognition methods; performance similar to GMM with reduced computation

Speaker model = codebook C = {c_1, c_2, ..., c_K}, where c_k are the centroid vectors. Usually these are obtained using K-means

c_k = \alpha_k \bar{x}_k + (1 - \alpha_k) u_k

\bar{x}_k = \frac{1}{|S_k|} \sum_{x_n \in S_k} x_n

\alpha_k = \frac{|S_k|}{|S_k| + r}

where S_k = \{ x : \|x - c_k\| \le \|x - c_m\| \ \forall m \}

r = relevance factor (fixed constant)

\bar{x}_k = centroid of speaker’s training vectors that are mapped to UBM vector u_k

u_k = UBM centroid
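A single pass of the centroid adaptation described on this slide might look like the following sketch. It assumes nearest-centroid assignment with squared Euclidean distance against the UBM codebook. The function name `map_adapt_codebook` is hypothetical, and the relevance factor `r` is left as a required argument because the slide gives no value for it.

```python
import numpy as np

def map_adapt_codebook(X, ubm_centroids, r):
    """One pass of MAP adaptation of a VQ codebook from a UBM codebook (sketch).

    X: (T, d) speaker training vectors; ubm_centroids: (K, d); r: relevance factor.
    """
    X = np.asarray(X, dtype=float)
    ubm = np.asarray(ubm_centroids, dtype=float)

    # Map every training vector to its nearest UBM centroid.
    dist = ((X[:, None, :] - ubm[None, :, :]) ** 2).sum(axis=2)
    nearest = dist.argmin(axis=1)

    adapted = ubm.copy()
    for k in range(len(ubm)):
        cell = X[nearest == k]
        if len(cell) == 0:
            continue  # no speaker data in this cell: keep the UBM centroid
        alpha = len(cell) / (len(cell) + r)
        # Interpolate between the cell centroid and the UBM centroid.
        adapted[k] = alpha * cell.mean(axis=0) + (1.0 - alpha) * ubm[k]
    return adapted
```

The hard assignment plays the role that the soft posterior counts play in GMM adaptation, which is one reason the VQ variant needs less computation.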

Sequence-kernel SVMs

Generalized Linear Discriminant Sequence SVM (GLDS-SVM) [Campbell et al, 2006]

Use monomials (up to a certain degree) to expand the feature vectors, and use the average of the expanded vectors to represent an utterance

Note: dimensionality = (d + M)! / (d! × M!), where d = feature dimension and M = expansion order

Use a linear kernel SVM with these supervectors

Example: 2nd-order expansion of a 2-dimensional input: (x1, x2) → (1, x1, x2, x1², x1x2, x2²)
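The monomial expansion and utterance averaging described above can be sketched as follows. The function names `monomial_expansion` and `glds_supervector` are made up for this illustration. The sketch includes the constant term, which is what makes the output length match the (d + M)! / (d! × M!) formula on the slide.

```python
from itertools import combinations_with_replacement
from math import comb

import numpy as np

def monomial_expansion(x, degree):
    """All monomials of the entries of x up to `degree`, including the constant 1."""
    terms = []
    for deg in range(degree + 1):
        # Each index multiset of size `deg` corresponds to one monomial.
        for idx in combinations_with_replacement(range(len(x)), deg):
            terms.append(np.prod([x[i] for i in idx]))
    return np.array(terms, dtype=float)

def glds_supervector(X, degree):
    """Average of the expanded feature vectors of one utterance X of shape (T, d)."""
    return np.mean([monomial_expansion(x, degree) for x in X], axis=0)

# Expansion of a 2-dimensional vector with a 2nd-order expansion: 6 terms.
expanded = monomial_expansion(np.array([2.0, 3.0]), 2)
assert len(expanded) == comb(2 + 2, 2)
```

The dimensionality grows quickly with d and M, so in practice the expansion order is kept low.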

GMM Supervectors [Campbell et al, 2006]

Speech utterance → Feature extraction → MAP adaptation (of the universal background model) → μ = [μ1, μ2, ..., μK]

μ = GMM mean supervector of dimensionality K × d, where K = number of Gaussians, d = number of acoustic features

Joint Factor Analysis (JFA)

Assume that each supervector is composed as a sum of two statistically independent components:

M = s + c

M = speaker- and channel-dependent supervector, s = speaker supervector, c = channel supervector

s = m + Vy + Dz

m = UBM supervector, V = eigenvoice matrix, y = speaker factors, Dz = residual term

c = Ux

U = eigenchannel matrix, x = channel factors

[Kenny et al, 2005, 2007, 2008, http://www.crim.ca/perso/patrick.kenny/]

V, D and U are the model hyperparameters trained beforehand on large datasets; x, y and z need to be estimated for a given training sample

JFA cookbook from Brno Univ. Tech (BUT), http://speech.fit.vutbr.cz/software/joint-factor-analysis-matlab-demo
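To make the roles of the hyperparameters and factors concrete, the sketch below only composes a supervector from given JFA terms. It does not train V, D, U or estimate x, y, z, which is the hard part covered by the cookbook linked above. The function name `jfa_supervector` is hypothetical, and D is assumed to be diagonal and stored as a vector.

```python
import numpy as np

def jfa_supervector(m, V, y, D, z, U, x):
    """Compose a speaker- and channel-dependent supervector (sketch).

    m: (Kd,) UBM supervector; V: (Kd, Ry) eigenvoice matrix; y: (Ry,) speaker factors;
    D: (Kd,) diagonal of the residual matrix; z: (Kd,) residual factors;
    U: (Kd, Rx) eigenchannel matrix; x: (Rx,) channel factors.
    """
    s = m + V @ y + D * z   # speaker supervector
    c = U @ x               # channel supervector
    return s + c
```

Because the channel part is isolated in `U @ x`, a verification system can discard it and compare utterances on the speaker part alone, which is the intersession compensation effect.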

Spoofing results

Classifier | Intersession compensation | Baseline accuracy (EER, %) | Accuracy on spoofed samples (EER, %)
GMM-UBM | - | 7.63 | 24.99
VQ-UBM | - | 7.56 | 22.62
GLDS-SVM | - | 7.16 | 25.17
GMM-SVM | Nuisance attribute projection (NAP) | 3.74 | 12.58
GMM-JFA | Joint factor analysis (JFA) | 3.24 | 7.61

T. Kinnunen, Z.-Z. Wu, K. A. Lee, F. Sedlak, E. S. Chng, H. Li, “Vulnerability of Speaker Verification Systems Against Voice Conversion Spoofing Attacks: the Case of Telephone Speech”, Proc. ICASSP 2012

Calibration breaks down even for the advanced classifiers

False acceptance rates (%) (threshold = EER point on baseline)

System | GMM-SVM | GMM-JFA
Baseline (no spoofing) | 3.74 | 3.24
Voice conversion spoofing (JD-GMM) | 41.54 | 17.33
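The evaluation protocol behind these numbers (fix the threshold at the baseline EER point, then count how many spoofed trials are accepted) can be sketched as below. This is a simplified illustration, not the evaluation code used in the cited studies: the function names are hypothetical, and the EER is approximated by scanning observed scores rather than interpolating the error curves.

```python
import numpy as np

def eer_and_threshold(target_scores, impostor_scores):
    """Approximate EER and the threshold where FAR and FRR are closest."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    best = None
    for th in np.sort(np.concatenate([target_scores, impostor_scores])):
        frr = np.mean(target_scores < th)      # genuine trials rejected
        far = np.mean(impostor_scores >= th)   # impostor trials accepted
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, th)
    return best[1], best[2]

def false_acceptance_rate(scores, threshold):
    """Fraction of (impostor or spoofed) trials accepted at a fixed threshold."""
    return float(np.mean(np.asarray(scores, dtype=float) >= threshold))
```

A typical use would be to call `eer_and_threshold` on the baseline target and zero-effort impostor scores, then pass the scores of the voice-converted trials to `false_acceptance_rate` with that threshold.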

Spoofing i-vector systems

Z. Wu, T. Kinnunen, E.S. Chng, H. Li, E. Ambikairajah, “A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case”, APSIPA 2012, Hollywood, US, December 2012

Attack | GMM-JFA | i-vector PLDA
JD-GMM conversion | 17.36 | 19.29
Unit selection conversion | 32.54 | 41.25

Study with a human impersonator

Attack | GMM-UBM | i-vector PLDA
Mimicry attack | 9.68 | 11.61

R. Gonzales Hautamäki, Tomi Kinnunen, Ville Hautamäki, Timo Leino, Anne-Maria Laukkanen, “I-vectors meet imitators: on vulnerability of speaker verification systems against voice mimicry”, Interspeech 2013 (to appear)

Acknowledgements

COST Action IC 1206 and members

Academy of Finland for partial funding

Zhizheng Wu, Eng Siong Chng, Haizhou Li (I2R and NTU, Singapore) for the joint studies on spoofing

Nick Evans, Junichi Yamagishi, Sebastien Marcel for the joint organization of Interspeech 2013 special session

Other colleagues at SIPU who contributed to the studies and material presented herein
