speaker recognition

DESCRIPTION

University of Joensuu, Department of Computer Science. PUMS 2003-2004 seminar, 14.10.2004, Turku. Speaker Recognition. Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen. Research Group. PUMS project. Juhani Saastamoinen, Project manager.

TRANSCRIPT

University of Joensuu, Dept. of Computer Science, P.O. Box 111, FIN-80101 Joensuu, Tel. +358 13 251 7959, fax +358 13 251 7955, www.cs.joensuu.fi

Speaker Recognition

University of Joensuu, Department of Computer Science

PUMS 2003-2004 seminar, 14.10.2004, Turku

Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen

Research Group

Pasi Fränti, Professor
Juhani Saastamoinen, Project manager
Evgeny Karpov, Project researcher
Ville Hautamäki, Project researcher
Tomi Kinnunen, Researcher
Ismo Kärkkäinen, Clustering algorithms

PUMS project

PUMS & JoY

• Speaker Recognition
• PUMS season 2003-2004:
– Identification, no verification
– Port to a mobile phone
– Feature fusion
– Real-time

• http://cs.joensuu.fi/pages/pums

Application Scenarios

Speaker Recognition

• Speaker Verification: voice + claim, "Is this Bob’s voice?" A false claim is rejected: "Imposter!"
• Speaker Identification: "Whose voice is this?"

Identification System

[Block diagram] Speech audio → Signal processing → Feature vectors → Speaker modelling → Speaker profile database → Decision

• Training: add trained speaker profiles to the database
• Recognition: use all profiles in recognition; min. MSE within DB over input speech
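The decision rule on this slide (minimum MSE within the database over the input speech) can be sketched roughly as follows. This is only an illustration: it assumes that each speaker profile is a small codebook of vectors, as the VQ results later in the deck suggest, and the function names and toy data are hypothetical rather than taken from the project code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def avg_distortion(features, codebook):
    """Mean squared error between each feature vector and its nearest code vector."""
    return sum(min(sq_dist(f, c) for c in codebook) for f in features) / len(features)

def identify(features, profiles):
    """Return the speaker whose profile gives the smallest average distortion."""
    return min(profiles, key=lambda name: avg_distortion(features, profiles[name]))

# Toy 2-D data, purely illustrative (a real system would use e.g. MFCC vectors).
profiles = {
    "alice": [(0.0, 0.0), (1.0, 1.0)],
    "bob": [(5.0, 5.0), (6.0, 6.0)],
}
test_features = [(0.9, 1.1), (0.1, -0.1), (1.2, 0.8)]
best = identify(test_features, profiles)
```

Every profile in the database is scored here, which is the full search that the real-time slides later try to speed up.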

Results 2003-2004

[Architecture diagram]
• Components: sprofiler, SpeakerProfiler, Winsprofiler, Epocsprofiler, srlib, Fusion, Speech features (HY), ProfMatch, Real-time, DB
• User interfaces and platforms: console UI, Windows, Series60, TCL/TK (HY)
• Common speaker recognition app. interface

Planned Results

[Architecture diagram]
• Components: sprofiler, SpeakerProfiler, Winsprofiler, Epocsprofiler, srlib, Fusion, Speech features (HY), ProfMatch, Real-time, DB, Segmentation, VAD, Verification
• Common speaker recognition app. interface
• Applications: Access control, Teleconference, Large scale database, Mobile phone login?
• Label on diagram: Results 2003-2004

System in Mobile Phone

Port to Symbian OS with Series 60 UI platform

Symbian Phones

• Series 60 phone features:
– 16 MB ROM
– 8 MB RAM
– 176 x 208 display
– 32-bit ARM processor
– No floating-point unit!
• Symbian UI platforms: Series 60, Series 80, UIQ

FFTGEN

• Multiplication results must fit in 32 bits: truncate multiplication inputs
• FFTGEN: truncate to 16/16 bits (“16/16 FFT”)

[Diagram] FFT layer input (16-bit integer) × FFT twiddle factor (16-bit integer) = 32-bit multiplication result: 16 used bits, 16 crop-off bits. FFT layer output (part of it): 16-bit integer. Crop-off for next layer: 16 bits!
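The bit budget described above can be illustrated with a small integer-only sketch. It is not the FFTGEN code: the helper names are hypothetical, and the example only checks that two truncated inputs whose widths sum to 32 bits give a product that still fits in a signed 32-bit word before the low bits are cropped off.

```python
def fits_signed(value, bits):
    """True if value is representable as a signed two's-complement integer of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

def truncated_multiply(signal, twiddle, signal_bits, op_bits):
    """Multiply two truncated integers and crop off the low `op_bits` bits of the product."""
    assert fits_signed(signal, signal_bits), "signal input too wide"
    assert fits_signed(twiddle, op_bits), "twiddle factor too wide"
    product = signal * twiddle
    assert fits_signed(product, 32), "product must fit in 32 bits"
    return product >> op_bits  # arithmetic shift: keeps the high bits

# 16/16 split: 16-bit signal value, 16-bit twiddle factor, 16 bits cropped off.
out_16_16 = truncated_multiply(1 << 14, 1 << 14, signal_bits=16, op_bits=16)

# A 22/10 split obeys the same 32-bit bound but keeps more signal bits.
out_22_10 = truncated_multiply(1 << 20, 1 << 8, signal_bits=22, op_bits=10)
```

With a 16/16 split only 16 bits of the result survive each layer, whereas a split that gives the signal more bits keeps a wider result at the cost of a coarser twiddle factor.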

Proposed Information Preserving “22/10 FFT”

• Approximate DFT operator F with G
• Increase ||F-G||, preserve more signal information
– Minimize maximum relative error in scaled sine values with respect to scale; 980 is good for FFT sizes up to 1024
– Truncate multiplication inputs to 22/10 bits (signal/op)

[Diagram] FFT layer input (32-bit integer, 22 bits used) × FFT twiddle factor (16-bit integer, 10 bits used) = 32-bit multiplication result: 22 used bits, 10 crop-off bits. FFT layer output (part of it): 32-bit integer. Crop-off for next layer: 10 bits
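One possible reading of "minimize maximum relative error in scaled sine values with respect to scale" is sketched below: for a candidate integer scale, round each scaled sine value used as a twiddle factor and record the worst relative rounding error. The exact criterion used by the authors is not given on the slide, so the function names, the choice of first-quadrant angles, and the candidate scales are assumptions.

```python
import math

def max_relative_error(scale, fft_size=1024):
    """Largest relative rounding error of round(scale * sin) over first-quadrant twiddle angles."""
    worst = 0.0
    for k in range(1, fft_size // 4 + 1):
        exact = scale * math.sin(2.0 * math.pi * k / fft_size)
        worst = max(worst, abs(round(exact) - exact) / exact)
    return worst

def best_scale(candidates, fft_size=1024):
    """Candidate scale with the smallest worst-case relative error."""
    return min(candidates, key=lambda s: max_relative_error(s, fft_size))

# Compare the scale quoted on the slide with a nearby round number.
err_980 = max_relative_error(980)
err_1000 = max_relative_error(1000)
```

The smallest sine values dominate the worst case, so a scale that lands them close to integers (as 980 does for a 1024-point FFT) gives a lower maximum error than a nearby scale such as 1000.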

Scale of Error in Proposed FFT

Log10 of relative error in FFT elements

                      FFTGEN (16/16)   22/10 FFT
average               -0.775           -2.118
standard deviation     0.797            0.590
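As a reading aid for the table, the sketch below shows how such statistics could be computed from a reference FFT and a fixed-point approximation, and converts the two averages back to plain relative errors. How the authors handled zero-valued or exactly matching elements is not stated; skipping them, like the function names, is an assumption.

```python
import math
import statistics

def log10_relative_errors(reference, approximate):
    """log10(|approx - ref| / |ref|) per element, skipping zero references and exact matches."""
    logs = []
    for ref, approx in zip(reference, approximate):
        if ref != 0 and approx != ref:
            logs.append(math.log10(abs(approx - ref) / abs(ref)))
    return logs

def summarize(reference, approximate):
    """Average and (population) standard deviation of the log10 relative errors."""
    logs = log10_relative_errors(reference, approximate)
    return statistics.mean(logs), statistics.pstdev(logs)

# An average log10 error corresponds to a geometric-mean relative error of 10**average.
fftgen_typical = 10 ** -0.775    # roughly 17 %
proposed_typical = 10 ** -2.118  # below 1 %
```

Read this way, the 22/10 FFT reduces the typical relative error by more than an order of magnitude compared with the 16/16 FFT.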

Mobile Phone Results

TIMIT, 100 speakers       recog. rate (%)   std. dev. (%)
FLOAT                     100.0             N/A
FFTGEN                      9.7             1.6
FIXED                      95.8             1.2
MIXED                     100.0             N/A
MIXED2                     98.0             0.6

implementation, signal    recog. rate (%)   std. dev. (%)
FLOAT, Symbian audio       83.2             4.38
FLOAT, PC audio           100.0             N/A
FIXED, Symbian audio       76.0             2.83
FIXED, PC audio           100.0             N/A

Improving Accuracy by Information Fusion

[Diagram] Speech waveform → feature vector, split into feature sets:
• Feature set 1 (e.g. 5 MFCCs) → Classifier 1 → score 1
• Feature set 2 (e.g. F0 + ΔF0) → Classifier 2 → score 2
• Feature set 3 (e.g. formants F1, F2, F3) → Classifier 3 → score 3

Scores 1 to 3 → Score combiner → Decision
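The score combiner in the diagram can be sketched as a weighted sum of per-classifier scores. The slide does not say how scores are normalized or weighted, so the sum-to-one normalization, the equal weights, and all names and numbers below are assumptions made for illustration.

```python
def normalize(scores):
    """Scale one classifier's per-speaker distances so that they sum to 1."""
    total = sum(scores.values())
    return {spk: s / total for spk, s in scores.items()}

def fuse_scores(score_sets, weights=None):
    """Weighted sum of normalized scores; a lower fused score means a better match."""
    weights = weights or [1.0] * len(score_sets)
    fused = {}
    for w, scores in zip(weights, score_sets):
        for spk, s in normalize(scores).items():
            fused[spk] = fused.get(spk, 0.0) + w * s
    return fused

def decide(fused):
    """Pick the speaker with the smallest fused distance."""
    return min(fused, key=fused.get)

# Hypothetical distances from three classifiers (one per feature set).
mfcc_scores = {"alice": 2.0, "bob": 6.0}
f0_scores = {"alice": 5.0, "bob": 5.0}
formant_scores = {"alice": 3.0, "bob": 2.0}

fused = fuse_scores([mfcc_scores, f0_scores, formant_scores])
winner = decide(fused)
```

In this toy case the formant classifier alone would pick the other speaker, but the combined score follows the stronger evidence from the first classifier.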

Information Fusion Results

Decision-level fusion

Score-level fusion

Feature-level fusion

BASELINE:

Best individual

Feature set combination

14.6   15.8   16.8   MFCC + ΔMFCC

15.2

52.0

16.8

14.7

12.6   21.2   16.0   All feature sets

29.9   19.4   FMT + ΔFMT

18.2   17.1   ARCSIN + ΔARCSIN

19.8   16.0   LPCC + ΔLPCC

Fusion successful

Fusion sucks

N/A

N/A

N/A

N/A

Real-Time Speaker Identification

Processing loop:
1. Speech input stream: fill buffer with new data
2. Frame blocking → all frames
3. Silence detection → non-silent frames
4. Feature extraction → feature vectors
5. Pre-quantization → reduced set of vectors
6. Matching against the speaker database (Speaker 1 model ... Speaker N model)
7. Database pruning: the list of candidate speakers is split into active speakers and pruned speakers
8. Decision? Yes: END. No: fill buffer with new data and repeat.

Speed-up methods:
• Reducing # vectors (pre-quantization): 1. Averaging, 2. Random sampling, 3. Decimation, 4. Clustering (LBG)
• Speed up NN search: vantage-point tree (VPT) indexing of the code vectors
• Reduce # speakers (pruning): 1. Static pruning, 2. Hierarchical pruning, 3. Adaptive pruning, 4. Confidence-based pruning

Results: Baseline System (TIMIT)

(Average length of test utterance = 8.9 s)

Real-time requirement satisfied

4 x realtime

Results: Pre-Quantization (TIMIT) (Codebook size = 64)

• Averaging performs worst, clustering best

• About 2:1 speed-up over full search (no pre-quantization) without degradation in accuracy

9 x realtime
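Three of the pre-quantization methods compared here (averaging, decimation, random sampling) are simple enough to sketch directly; clustering (LBG) is omitted for brevity. The function names, block sizes, and toy vectors are hypothetical, and the deck does not give the parameters actually used.

```python
import random

def pq_average(vectors, block):
    """Replace each run of `block` consecutive vectors by its mean vector."""
    out = []
    for i in range(0, len(vectors), block):
        chunk = vectors[i:i + block]
        out.append(tuple(sum(col) / len(chunk) for col in zip(*chunk)))
    return out

def pq_decimate(vectors, step):
    """Keep every `step`-th vector."""
    return vectors[::step]

def pq_random(vectors, count, seed=0):
    """Keep `count` vectors chosen at random without replacement."""
    return random.Random(seed).sample(vectors, count)

# Toy feature vectors; each method halves the number of vectors to match.
vectors = [(0.0, 0.0), (2.0, 2.0), (4.0, 4.0), (6.0, 6.0)]
averaged = pq_average(vectors, 2)
decimated = pq_decimate(vectors, 2)
sampled = pq_random(vectors, 2)
```

Each method reduces the number of vectors that reach the matching step, which is where the speed-up comes from.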

Results: Pruning Variants (TIMIT) (Codebook size = 64)

11 x realtime

• Recommended method: adaptive pruning (AP)
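The slide names adaptive pruning without defining it. The sketch below assumes one plausible form: after each block of input, speakers whose accumulated distance exceeds the mean plus `eta` standard deviations of the active set are dropped. The threshold rule, the parameter `eta`, and all names and numbers are assumptions, not the authors' stated method.

```python
import statistics

def adaptive_prune(active, eta=1.0):
    """Drop speakers whose accumulated distance exceeds mean + eta * std of the active set."""
    if len(active) < 2:
        return dict(active)
    values = list(active.values())
    threshold = statistics.mean(values) + eta * statistics.pstdev(values)
    return {spk: d for spk, d in active.items() if d <= threshold}

def identify_with_pruning(block_distances, eta=1.0):
    """block_distances: one {speaker: distance} dict per block of input frames."""
    active = {spk: 0.0 for spk in block_distances[0]}
    for block in block_distances:
        for spk in active:
            active[spk] += block[spk]  # accumulate only for surviving speakers
        active = adaptive_prune(active, eta)
        if len(active) == 1:
            break  # a single candidate is left: decide early
    return min(active, key=active.get)

# Hypothetical per-block match distances for three speakers.
blocks = [
    {"alice": 1.0, "bob": 2.0, "carol": 9.0},
    {"alice": 1.0, "bob": 3.0, "carol": 9.0},
]
result = identify_with_pruning(blocks)
```

Because pruned speakers are no longer matched against new input, the cost per block shrinks as the candidate list gets shorter.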

Results: PQ, Pruning and PQP (TIMIT) (Codebook size = 64)

33 x realtime

• Recommended method: combination of pre-quantization and pruning (PQP)

Results: VQ vs. GMM (TIMIT)

VQ GMM

13:1 speed-up without degradation

9:1 to 10:1 speed-up without degradation

Best time: 0.27 s = 33 x realtime @ error rate 0.32 %
Smallest error: 0.00 % @ 0.31 s = 28 x realtime

Best time: 0.18 s = 49 x realtime @ error rate 0.16 %
Smallest error: 0.16 % @ 0.18 s = 49 x realtime

(Average length of test utterance = 8.9 s)

Results: VQ vs. GMM (NIST-1999)

VQ GMM

13:1 to 16:1 speed-up with minor degradation

23:1 to 34:1 speed-up with minor degradation

Best time: 0.48 s = 63 x realtime @ error rate 19.22 %
Smallest error: 17.34 % @ 11.4 s = 3 x realtime

Best time: 0.82 s = 37 x realtime @ error rate 19.36 %
Smallest error: 16.90 % @ 37.9 s = 0.8 x realtime

(Average length of test utterance = 30.4 s)
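The "x realtime" figures on the results slides are consistent with dividing the average utterance length by the processing time. A minimal sketch of that arithmetic, with hypothetical variable names:

```python
def realtime_factor(utterance_seconds, processing_seconds):
    """How many times faster than real time the utterance was processed."""
    return utterance_seconds / processing_seconds

# TIMIT: average test utterance 8.9 s, processed in 0.27 s.
timit_fast = realtime_factor(8.9, 0.27)

# NIST-1999: average test utterance 30.4 s, processed in 0.48 s and in 37.9 s.
nist_fast = realtime_factor(30.4, 0.48)
nist_slow = realtime_factor(30.4, 37.9)
```

A factor below 1, as in the 37.9 s case, means the system needs longer than the utterance itself and so does not meet the real-time requirement.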
