Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation
Man-Wai MAK and Hon-Bill YU
The Hong Kong Polytechnic University
[email protected] http://www.eie.polyu.edu.hk/~mwmak/
Outline
Speaker Verification
  Speaker Verification Process
  Voice Activity Detection (VAD) in Speaker Verification
  Effect of VAD on Acoustic Features
Characteristics of Interview-Speech in NIST Speaker Recognition Evaluation
VAD for NIST Speaker Recognition Evaluation
Experiments on NIST SRE 2008
Preliminary Results on NIST SRE 2010
Speaker Verification Process
To verify the identity of a claimant based on his or her own voice
Is this Mary’s voice?
I am Mary
Speaker Verification Process
A two-class hypothesis problem:
H0: MFCC sequence X^(c) comes from the true speaker
H1: MFCC sequence X^(c) comes from an impostor
Verification score is a log-likelihood ratio:
$$\text{Score} = \log \frac{p(X^{(c)} \mid H_0)}{p(X^{(c)} \mid H_1)} = \log p(X^{(c)} \mid \Lambda^{(s)}) - \log p(X^{(c)} \mid \Lambda^{(\text{ubm})})$$
[Figure: Feature extraction produces X^(c), which is scored against the speaker model Λ^(s) and the background model Λ^(ubm). The difference of the two log-likelihoods is the score, which the decision stage uses to accept or reject the claim.]
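The score on this slide, the log-likelihood of the claimant's MFCC sequence under the speaker model minus that under the background model (UBM), can be sketched as follows. This is a minimal illustration, assuming diagonal-covariance GMMs stored as hypothetical `(weights, means, variances)` tuples and a caller-supplied decision threshold; none of these names come from the slides.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Total log-likelihood of frames X (T x D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D). Layout is an assumption.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]                       # T x M x D
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_mix = np.log(weights) + log_gauss                          # T x M
    m = log_mix.max(axis=1, keepdims=True)                         # log-sum-exp
    frame_ll = m[:, 0] + np.log(np.exp(log_mix - m).sum(axis=1))
    return frame_ll.sum()

def verification_score(X, speaker_gmm, ubm):
    """Score = log p(X | speaker model) - log p(X | UBM)."""
    return gmm_loglik(X, *speaker_gmm) - gmm_loglik(X, *ubm)

def decide(score, threshold):
    """Accept the claim when the score exceeds the threshold."""
    return "accept" if score > threshold else "reject"
```

In practice the score is often divided by the number of frames so that the threshold does not depend on utterance length; that variant is not shown here.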
Voice Activity Detection in Speaker Verification
[Figure: Speech passes through the VAD to give speech segments; feature extraction (log|X(ω)|, DCT) then produces the acoustic features (MFCC).]
Effect of VAD on Acoustic Features
[Figure: MFCC feature vectors (dim 1 vs. dim 2) plotted for a speech region and for a non-speech region.]
Interview-Speech in NIST SRE
[Figure: Interview room layout showing the interviewer, the interviewee, and the desk. Source: NIST SRE 2008 Workshop]
Interview-Speech in NIST SRE
Far-field and desktop microphones were used for collecting interview speech.
Some interview-speech files are very noisy, making it difficult to differentiate speech segments from non-speech segments.
[Figure: A typical interview-speech file in NIST SRE 2008, shown as a spectrogram (frequency vs. time) and a waveform (amplitude vs. time), with non-speech and speech regions marked.]
Interview-Speech in NIST SRE
Some files have very low SNR.
[Figure: Spectrogram (frequency), waveform (amplitude), and segmentation of a whole file over time. S: speech; h#: non-speech.]
Interview-Speech in NIST SRE
Some files contain spiky signals, which lead to a wrong VAD decision threshold.
[Figure: Waveform (amplitude vs. time) with a spiky signal marked.]
Interview-Speech in NIST SRE
Some files contain a low-energy speech signal superimposed on periodic background noise.
[Figure: Spectrogram (frequency), waveform (amplitude), and segmentation over time, with a non-speech region detected as speech.]
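VAD for NIST Speaker Recognition Evaluation
Use speech enhancement as a pre-processing step.
[Figure: Block diagram. Noisy speech enters the Spectral-Subtraction VAD (SVAD), which yields denoised speech and speech-segment information (segments marked S). Feature extraction produces MFCCs, which are scored against the speaker model and the impostor model; decision making compares the score with a decision threshold to accept or reject.]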
VAD for NIST Speaker Recognition Evaluation
Use speech enhancement as a pre-processing step.

Signal                      Frequency spectrum
Clean speech x(n,m)         X(ω,m)
Noisy speech y(n,m)         Y(ω,m)
Background speech b(n,m)    B(ω,m)

These values were set such that we remove as much noise as possible.
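The spectral-subtraction formula and its parameter values did not survive in this transcript, so the following is only a generic sketch of magnitude spectral subtraction using the slide's notation (Y for the noisy spectrum, B for the background spectrum). The over-subtraction factor `alpha` and spectral floor `beta` are illustrative assumptions, not the values used by the authors.

```python
import numpy as np

def spectral_subtract(frames, noise_frames, alpha=2.0, beta=0.01):
    """Generic magnitude spectral subtraction (not the authors' exact recipe).

    frames:       (num_frames, frame_len) noisy speech y(n, m)
    noise_frames: (num_noise_frames, frame_len) frames assumed to hold background only
    alpha, beta:  assumed over-subtraction factor and spectral floor
    """
    frames = np.asarray(frames, dtype=float)
    Y = np.fft.rfft(frames, axis=1)                                # Y(w, m)
    B_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0) # average |B(w)|
    Y_mag = np.abs(Y)
    # Subtract the scaled background magnitude, but never go below the floor.
    X_mag = np.maximum(Y_mag - alpha * B_mag, beta * Y_mag)
    # Reuse the noisy phase when returning to the time domain.
    X_hat = X_mag * np.exp(1j * np.angle(Y))
    return np.fft.irfft(X_hat, n=frames.shape[1], axis=1)
```

A larger `alpha` removes more background at the cost of more speech distortion, which matters less when the denoised signal is used to locate speech segments than when it is used for listening.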
VAD for NIST Speaker Recognition Evaluation
[Figure: Waveforms (amplitude vs. time) without denoising and with denoising.]
VAD for NIST Speaker Recognition Evaluation
[Figure: Segmentation with denoising, comparing SS-VAD with the VAD in the ETSI-AMR speech coder. S: speech; h#: non-speech.]
VAD for NIST Speaker Recognition Evaluation
Speech-segment-length to speech-file-length ratio of 3 VADs
6249 speech files (NIST'05-08), each labelled as speech / non-speech by the three VADs.
Example:
total duration: 10 s
total speech segments: 3 s
speech-segment-length to speech-file-length ratio = 3/10
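The ratio in the example above can be computed directly from a VAD's segment boundaries. A small sketch, assuming segments are given as hypothetical `(start, end)` pairs in seconds:

```python
def speech_ratio(segments, file_duration):
    """Speech-segment-length to speech-file-length ratio.

    segments:      iterable of (start, end) times in seconds marked as speech
    file_duration: total length of the file in seconds
    """
    if file_duration <= 0:
        raise ValueError("file_duration must be positive")
    speech_length = sum(end - start for start, end in segments)
    return speech_length / file_duration

# Mirrors the slide's example: 3 s of speech in a 10 s file.
ratio = speech_ratio([(1.0, 2.0), (4.0, 5.5), (8.0, 8.5)], 10.0)
```

Collecting this ratio over all files for each VAD gives the kind of distribution that can be compared across detectors.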
VAD for NIST Speaker Recognition Evaluation
Speech-segment-length to speech-file-length ratio of 3 VADs
[Figure: Distributions of the ratio for the ordinary energy-based VAD, the spectral-subtraction VAD, and the VAD in the ETSI AMR coder.]
A high frequency of occurrence is marked, suggesting that many non-speech segments are mistakenly detected as speech segments.
Experiments on NIST SRE 2008
Dataset: NIST'05 and NIST'06 (development); NIST'08 (performance evaluation)

Common Condition   Train/Test Condition                                                  No. of Targets   No. of Trials
1                  All interview speech                                                  622              14405
2                  Interview speech, same microphone type for training and test          125              731
3                  Interview speech, different microphone types for training and test    622              13674
4                  Interview speech for training, telephone speech for test              622              5048

Speaker Modeling: GMM-SVM
Score Normalization: T-norm
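T-norm normalizes a trial score using the scores that the same test utterance obtains against a cohort of impostor models. The sketch below shows the standard form; the function and argument names are illustrative, and the cohort used in these experiments is not specified at this point in the slides.

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    """Test normalization: (score - cohort mean) / cohort standard deviation.

    raw_score:     score of the test utterance against the claimed speaker model
    cohort_scores: scores of the same test utterance against impostor (cohort) models
    """
    cohort = np.asarray(cohort_scores, dtype=float)
    mu, sigma = cohort.mean(), cohort.std()
    if sigma == 0.0:
        raise ValueError("cohort scores have zero variance")
    return (raw_score - mu) / sigma
```

Because the mean and standard deviation are computed per test utterance, the normalized scores from different utterances become comparable under a single decision threshold.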
Results on NIST 2008 SRE
ETSI-AMR: VAD in AMR coder
Baseline: energy-based VAD without SS (γ = 0.99)
SS-VAD: spectral subtraction VAD
3.57 → 1.12 (69% reduction)
Preliminary Results on NIST 2010

VAD                     EER (%)   Normalized minDCF
Energy-based VAD        11.72     0.99
SS-VAD                  4.45      0.58
SMB                     5.83      0.75
SS-SMB                  4.62      0.60
NIST ASR Transcripts    8.58      0.85
ETSI-AMR                8.05      0.85

Common Condition 2: All trials involving interview speech from different microphones
SMB: Statistical-Model-Based VAD. Sohn et al., "A statistical model-based voice activity detection," IEEE Signal Processing Letters, 1999.
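For readers unfamiliar with the EER column, the equal error rate is the operating point at which the miss rate on target trials equals the false-alarm rate on impostor trials. The following sketch estimates it from two lists of scores by sweeping the observed scores as thresholds; it is a simple approximation, not the official NIST scoring tool, and it does not cover minDCF.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: a trial is accepted when its score >= threshold."""
    tgt = np.sort(np.asarray(target_scores, dtype=float))
    imp = np.sort(np.asarray(impostor_scores, dtype=float))
    thresholds = np.unique(np.concatenate([tgt, imp]))
    # Miss rate: fraction of target scores strictly below the threshold.
    p_miss = np.searchsorted(tgt, thresholds, side="left") / tgt.size
    # False-alarm rate: fraction of impostor scores at or above the threshold.
    p_fa = 1.0 - np.searchsorted(imp, thresholds, side="left") / imp.size
    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])
```

Multiplying the returned value by 100 gives the percentage form used in the table.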
Conclusions
Noise reduction is of primary importance for VAD under extremely low SNR.
It is important to remove the sinusoidal background found in NIST SRE sound files, as this kind of background signal could lead to many false detections in energy-based VAD.
Using noise reduction as a pre-processing step leads to a VAD that outperforms the VAD in ETSI-AMR (Option 2).
VAD for NIST Speaker Recognition Evaluation
Threshold Determination and VAD Decision Logic
[Figure: Threshold determination and VAD decision logic. Labels: sample-based amplitude ranking (a_p1, ..., a_pL; spike), frame-based amplitude, L, 500 preset non-speech frames, μ_b.]
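The details of the threshold logic are not recoverable from this slide, so the sketch below is only one plausible reading of its labels, not the authors' algorithm. It assumes that the background level μ_b is the mean amplitude of caller-designated non-speech frames, that the peak level is the mean of the ranked sample amplitudes after discarding the largest few (so that spikes do not inflate it), and that the threshold lies between the two. All parameter names and default values are hypothetical.

```python
import numpy as np

def amplitude_threshold_vad(signal, noise_frame_idx, frame_len=160,
                            n_skip=10, n_top=100, weight=0.5):
    """Frame-level speech/non-speech decisions from a spike-robust threshold.

    noise_frame_idx: indices of frames assumed to contain no speech
    n_skip:          number of largest sample amplitudes discarded as spikes
    n_top:           number of next-largest amplitudes averaged as the peak level
    weight:          position of the threshold between background and peak
    """
    x = np.abs(np.asarray(signal, dtype=float))
    n_frames = x.size // frame_len
    frame_amp = x[:n_frames * frame_len].reshape(n_frames, frame_len).mean(axis=1)
    mu_b = frame_amp[list(noise_frame_idx)].mean()       # background level
    ranked = np.sort(x)[::-1]                             # amplitudes, largest first
    peak = ranked[n_skip:n_skip + n_top].mean()           # spike-robust peak level
    threshold = mu_b + weight * (peak - mu_b)
    return frame_amp > threshold, threshold
```

Taking the peak from the raw maximum instead would let a single spike push the threshold above every speech frame, which is the failure mode described earlier for spiky files.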
Experiments on NIST SRE 2008
[Figure: GMM-SVM training phase. A UBM (NIST'05 & 06) is used to obtain GMM-supervectors of target speakers (NIST'08 utterances) and GMM-supervectors of 300 impostors (300 background speakers, NIST'06); both sets pass through NAP before the speaker's GMM-SVM is trained.]