speaker recognition
DESCRIPTION
University of Joensuu, Department of Computer Science. PUMS 2003-2004 seminar, 14.10.2004, Turku. Speaker Recognition. Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen. Research Group. PUMS project. Juhani Saastamoinen, project manager.
University of Joensuu, Dept. of Computer Science, P.O. Box 111, FIN-80101 Joensuu. Tel. +358 13 251 7959, fax +358 13 251 7955, www.cs.joensuu.fi
Speaker Recognition
University of Joensuu, Department of Computer Science
PUMS 2003-2004 seminar, 14.10.2004, Turku
Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen
Research Group
Pasi Fränti, Professor
Juhani Saastamoinen, Project manager
Evgeny Karpov, Project researcher
Ville Hautamäki, Project researcher
Tomi Kinnunen, Researcher
Ismo Kärkkäinen, Clustering algorithms
PUMS project
PUMS & JoY
• Speaker Recognition
• PUMS season 2003-2004:
– Identification, no verification
– Port it to a mobile phone
– Feature fusion
– Real-time
• http://cs.joensuu.fi/pages/pums
Application Scenarios
Speaker Recognition
• Speaker Verification: “Is this Bob’s voice?” (voice + claim); result: verification, or “Imposter!”
• Speaker Identification: “Whose voice is this?” (identity unknown); result: identification
Identification System
[Diagram] Speech audio → Signal processing → Feature vectors → Speaker modelling → Speaker profile database → Decision
• Training: add trained speaker profiles to the speaker profile database
• Recognition: use all profiles in recognition; min. MSE within DB over input speech
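The "min. MSE within DB" decision rule can be illustrated with a minimal sketch. It assumes each speaker profile is a small codebook of vectors and that the matching score is the mean squared distance from each input feature vector to its nearest code vector. The names `avg_quantization_error`, `identify` and the toy data are hypothetical and are not taken from the project code.

```python
def avg_quantization_error(features, codebook):
    """Mean squared distance from each feature vector to its nearest code vector."""
    total = 0.0
    for x in features:
        total += min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook)
    return total / len(features)


def identify(features, profile_db):
    """Return the name of the profile with the smallest average distortion."""
    return min(profile_db,
               key=lambda name: avg_quantization_error(features, profile_db[name]))


# Toy 2-D vectors stand in for real speech features (assumption for illustration).
profile_db = {
    "alice": [(0.0, 0.0), (1.0, 1.0)],
    "bob": [(5.0, 5.0), (6.0, 4.0)],
}
test_features = [(5.1, 4.9), (5.9, 4.2), (5.5, 4.5)]
best = identify(test_features, profile_db)
print(best)
```

Every profile in the database is scored, which is why the later slides look at ways of reducing the number of vectors and speakers.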
Results 2003-2004
[Diagram: system architecture]
• Applications: sprofiler, SpeakerProfiler, Winsprofiler, Epocsprofiler
• UIs and platforms: console UI, Windows, Series60, TCL/TK (HY)
• common speaker recognition app. interface
• Components: srlib, Real-time, Fusion, ProfMatch, Speech features (HY), DB
Planned Results
[Diagram: the Results 2003-2004 architecture with planned extensions]
• Results 2003-2004: sprofiler, SpeakerProfiler, Winsprofiler, Epocsprofiler; srlib, Real-time, Fusion, ProfMatch, Speech features (HY), DB; common speaker recognition app. interface
• Additional components: Segmentation, VAD, Verification
• Applications: Access control, Teleconference, Large scale database, Mobile phone login?
System in Mobile Phone
Port to Symbian OS with Series 60 UI platform
Symbian Phones
• Series 60 phone features:
– 16 MB ROM
– 8 MB RAM
– 176 x 208 display
– 32-bit ARM processor
– No floating-point unit!
• Pictured UI platforms: Series 80, Series 60, UIQ
FFTGEN
• Multiplication results must fit in 32 bits: truncate multiplication inputs
• FFTGEN: truncate to 16/16 bits (“16/16 FFT”)
[Diagram]
– FFT layer input (16-bit integer) x FFT twiddle factor (16-bit integer)
– 32-bit multiplication result: 16 used bits, 16 crop-off bits
– FFT layer output (part of it): 16-bit integer; crop-off for next layer: 16 bits!
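The effect of the 16/16 truncation can be seen in a small sketch of one fixed-point multiplication. It assumes the twiddle factor is stored as a signed 16-bit integer scaled by 32767 and that the low 16 bits of the 32-bit product are cropped off. The function name `mul_16_16` and the scaling are assumptions for illustration, not the FFTGEN source.

```python
import math


def mul_16_16(x, w):
    """Multiply a 16-bit sample by a 16-bit twiddle factor and crop 16 bits.

    Both inputs must fit in signed 16-bit integers so that the product
    fits in a signed 32-bit integer (assumed layout, for illustration).
    """
    assert -(1 << 15) <= x < (1 << 15)
    assert -(1 << 15) <= w < (1 << 15)
    product = x * w                      # fits in 32 bits
    assert -(1 << 31) <= product < (1 << 31)
    return product >> 16                 # keep 16 bits, crop off 16 bits


# Twiddle factor cos(pi/4) scaled to a 16-bit integer.
twiddle = round(math.cos(math.pi / 4) * 32767)

large = mul_16_16(20000, twiddle)   # strong signal: most information survives
small = mul_16_16(1, twiddle)       # weak signal: cropped to zero
print(large, small)
```

Small sample values vanish after the crop, which is the kind of information loss that motivates giving more bits to the signal.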
Proposed Information Preserving “22/10 FFT”
• Approximate the DFT operator F with G
• Allow ||F-G|| to increase in order to preserve more signal information
– Choose the twiddle-factor scale that minimizes the maximum relative error of the scaled sine values; a scale of 980 is good for FFT sizes up to 1024
– Truncate multiplication inputs to 22/10 bits (signal/operator)
[Diagram]
– FFT layer input (32-bit integer, 22 bits used) x FFT twiddle factor (16-bit integer, 10 bits used)
– 32-bit multiplication result: 22 used bits, 10 crop-off bits
– FFT layer output (part of it): 32-bit integer; crop-off for next layer: 10 bits
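A rough sketch of the 22/10 idea follows, under stated assumptions: twiddle factors are sine values multiplied by an integer scale (980 here) and rounded, the signal uses at most 22 bits including sign, and 10 bits are cropped after each multiplication. The helper names are hypothetical, and the sketch leaves the resulting gain of scale/1024 uncompensated, which may differ from what the actual implementation does.

```python
import math


def twiddle_table(n, scale):
    """Rounded integer sine values scale*sin(2*pi*k/n) for the first quadrant."""
    return [round(scale * math.sin(2 * math.pi * k / n)) for k in range(n // 4 + 1)]


def max_relative_error(n, scale):
    """Largest relative rounding error over the nonzero scaled sine values."""
    worst = 0.0
    for k in range(1, n // 4 + 1):
        exact = scale * math.sin(2 * math.pi * k / n)
        worst = max(worst, abs(round(exact) - exact) / exact)
    return worst


def mul_22_10(x, w):
    """Multiply a 22-bit signal value by a 10-bit twiddle factor and crop 10 bits."""
    assert -(1 << 21) <= x < (1 << 21)   # 22 bits used, sign included (assumption)
    assert abs(w) < (1 << 10)            # 10 magnitude bits used
    product = x * w                      # magnitude below 2**31, fits in 32 bits
    return product >> 10                 # crop off 10 bits for the next layer


table = twiddle_table(1024, 980)
err = max_relative_error(1024, 980)
print(max(table), err, mul_22_10(1 << 20, 980))
```

`max_relative_error` can be evaluated for several candidate scales to compare them; the sketch makes no claim about which scale such a comparison would select.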
Scale of Error in Proposed FFT
[Figure: two panels, 16/16 and 22/10; Log10 of relative error in FFT elements]
                      FFTGEN    22/10 FFT
average               -0.775    -2.118
standard deviation     0.797     0.590
Mobile Phone Results
TIMIT, 100 speakers       recog. rate (%)   std. dev. (%)
FLOAT                     100.0             N/A
FFTGEN                      9.7             1.6
FIXED                      95.8             1.2
MIXED                     100.0             N/A
MIXED2                     98.0             0.6

implementation, signal    recog. rate (%)   std. dev. (%)
FLOAT, Symbian audio       83.2             4.38
FLOAT, PC audio           100.0             N/A
FIXED, Symbian audio       76.0             2.83
FIXED, PC audio           100.0             N/A
Improving Accuracy by Information Fusion
[Diagram: speech waveform → feature vector → three parallel classifiers → score combiner → decision]
• Feature set 1 (e.g. 5 MFCCs) → Classifier 1 → score 1
• Feature set 2 (e.g. F0 + ΔF0) → Classifier 2 → score 2
• Feature set 3 (e.g. formants F1, F2, F3) → Classifier 3 → score 3
• Score combiner → Decision
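As a minimal sketch of score-level fusion, the example below combines per-speaker distance scores from three classifiers with a weighted sum. It assumes that each score is a distance (smaller is better) and that scores are normalized to sum to one per classifier before combining. The function names, weights and numbers are made up for illustration and do not describe the combiner used in the project.

```python
def normalize(scores):
    """Scale one classifier's distances so that they sum to 1 over all speakers."""
    total = sum(scores.values())
    return {spk: s / total for spk, s in scores.items()}


def fuse_scores(score_sets, weights):
    """Weighted sum of normalized distances, one fused score per speaker."""
    fused = {}
    for scores, w in zip(score_sets, weights):
        for spk, s in normalize(scores).items():
            fused[spk] = fused.get(spk, 0.0) + w * s
    return fused


def decide(fused):
    """Pick the speaker with the smallest fused distance."""
    return min(fused, key=fused.get)


# Hypothetical distances from three classifiers (MFCC, F0, formants).
mfcc = {"alice": 2.0, "bob": 1.0}
f0 = {"alice": 1.0, "bob": 3.0}
formants = {"alice": 4.0, "bob": 2.0}

winner = decide(fuse_scores([mfcc, f0, formants], [1.0, 1.0, 1.0]))
print(winner)
```

Changing the weights changes the outcome: with all weight on the F0 classifier the same scores would select the other speaker.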
Information Fusion Results
Decision-level fusion
Score-level fusion
Feature-level fusion
BASELINE:
Best individual
Feature set combination
14.6  15.8  16.8  MFCC + MFCC
15.2
52.0
16.8
14.7
12.6  21.2  16.0  All feature sets
29.9  19.4  FMT + FMT
18.2  17.1  ARCSIN + ARCSIN
19.8  16.0  LPCC + LPCC
Fusion successful
Fusion sucks
N/A
N/A
N/A
N/A
Real-Time Speaker Identification
[Flowchart]
1. Speech input stream: fill buffer with new data
2. Frame blocking → all frames
3. Silence detection → non-silent frames
4. Feature extraction → feature vectors
5. Pre-quantization → reduced set of vectors
6. Matching against the speaker database (Speaker 1 model ... Speaker N model)
7. Database pruning: list of candidate speakers, split into active speakers and pruned speakers
8. Decision? Yes → END; No → fill buffer with new data
Speed-up techniques:
• Reducing # vectors (pre-quantization):
– 1. Averaging
– 2. Random sampling
– 3. Decimation
– 4. Clustering (LBG)
• Speed up NN search: vantage-point tree (VPT) indexing of the code vectors
• Reduce # speakers (database pruning):
– 1. Static pruning
– 2. Hierarchical pruning
– 3. Adaptive pruning
– 4. Confidence-based pruning
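The three simpler pre-quantization methods listed above can be sketched as follows. The sketch assumes that averaging replaces consecutive segments of vectors by their mean, that decimation keeps every k-th vector, and that random sampling draws vectors without replacement. These readings and the function names are assumptions, and the LBG clustering variant is left out.

```python
import random


def pq_averaging(vectors, m):
    """Split the sequence into m consecutive segments and keep each segment's mean."""
    n = len(vectors)
    reduced = []
    for i in range(m):
        segment = vectors[i * n // m:(i + 1) * n // m]
        if segment:
            dim = len(segment[0])
            reduced.append(tuple(sum(v[d] for v in segment) / len(segment)
                                 for d in range(dim)))
    return reduced


def pq_random_sampling(vectors, m, seed=0):
    """Keep m vectors drawn at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(vectors, min(m, len(vectors)))


def pq_decimation(vectors, m):
    """Keep every k-th vector so that about m vectors remain."""
    step = max(1, len(vectors) // m)
    return vectors[::step][:m]


# 100 toy feature vectors reduced to 10 before matching.
vectors = [(float(i), float(2 * i)) for i in range(100)]
averaged = pq_averaging(vectors, 10)
sampled = pq_random_sampling(vectors, 10)
decimated = pq_decimation(vectors, 10)
print(len(averaged), len(sampled), len(decimated))
```

Each method hands a shorter sequence to the matching step, so the matching cost drops roughly in proportion to the reduction.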
Results: Baseline System (TIMIT)
(Average length of test utterance = 8.9 s)
Real-time requirement satisfied
4 x realtime
Results: Pre-Quantization (TIMIT) (Codebook size = 64)
• Averaging performs worst, clustering best
• About 2:1 speed-up over full search (no pre-quantization) with no degradation in accuracy
9 x realtime
Results: Pruning Variants (TIMIT) (Codebook size = 64)
11 x realtime
• Recommended method: adaptive pruning (AP)
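The slides do not spell out the pruning rules, so the sketch below shows only the simplest variant listed earlier, static pruning, under an assumed reading: after a fixed number of input vectors, all but the best-matching speakers are dropped and the remaining vectors are matched against the survivors only. The parameter names `prune_after` and `keep` are hypothetical. Adaptive pruning, the recommended method, would replace the fixed cut with a data-dependent one.

```python
def nearest_dist(x, codebook):
    """Squared distance from x to its nearest code vector."""
    return min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook)


def identify_static_pruning(features, profile_db, prune_after, keep):
    """Identify a speaker, dropping all but `keep` candidates after `prune_after` vectors."""
    active = {name: 0.0 for name in profile_db}   # accumulated distance per speaker
    for count, x in enumerate(features, start=1):
        for name in active:
            active[name] += nearest_dist(x, profile_db[name])
        if count == prune_after:
            survivors = sorted(active, key=active.get)[:keep]
            active = {name: active[name] for name in survivors}
    return min(active, key=active.get)


# Toy database and input (assumptions for illustration).
profile_db = {
    "alice": [(0.0, 0.0), (1.0, 1.0)],
    "bob": [(5.0, 5.0), (6.0, 4.0)],
    "carol": [(9.0, 9.0), (8.0, 9.5)],
}
features = [(5.2, 4.8), (5.8, 4.1), (5.4, 4.6), (6.1, 4.0)]
result = identify_static_pruning(features, profile_db, prune_after=2, keep=2)
print(result)
```

After the cut, each remaining vector is compared against fewer codebooks, which is where the speed-up comes from.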
Results: PQ, Pruning and PQP (TIMIT) (Codebook size = 64)
33 x realtime
• Recommended method: combination of pre-quantization and pruning (PQP)
Results: VQ vs. GMM (TIMIT)
                    VQ                                  GMM
Speed-up            13:1 without degradation            9:1 to 10:1 without degradation
Best time           0.27 s = 33 x realtime              0.18 s = 49 x realtime
  @ error rate      0.32 %                              0.16 %
Smallest error      0.00 %                              0.16 %
  @ time            0.31 s = 28 x realtime              0.18 s = 49 x realtime
(Average length of test utterance = 8.9 s)
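The "x realtime" figures appear to be the ratio of the average utterance length to the processing time. A short check under that assumption, using the TIMIT numbers above (the function name is hypothetical):

```python
def realtime_factor(utterance_seconds, processing_seconds):
    """How many times faster than real time the utterance was processed."""
    return utterance_seconds / processing_seconds


AVERAGE_UTTERANCE_S = 8.9  # average TIMIT test utterance length from the slide

vq_best = realtime_factor(AVERAGE_UTTERANCE_S, 0.27)
gmm_best = realtime_factor(AVERAGE_UTTERANCE_S, 0.18)
print(round(vq_best), round(gmm_best))
```

A factor below 1 means the system is slower than real time.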
Results: VQ vs. GMM (NIST-1999)
                    VQ                                  GMM
Speed-up            13:1 to 16:1, minor degradation     23:1 to 34:1, minor degradation
Best time           0.48 s = 63 x realtime              0.82 s = 37 x realtime
  @ error rate      19.22 %                             19.36 %
Smallest error      17.34 %                             16.90 %
  @ time            11.4 s = 3 x realtime               37.9 s = 0.8 x realtime
(Average length of test utterance = 30.4 s)