robust voice activity detection for interview speech in nist speaker recognition evaluation man-wai...

Robust Voice Activity Detection for Interview Speech in NIST Speaker Recognition Evaluation

Man-Wai MAK and Hon-Bill YU
The Hong Kong Polytechnic University

[email protected]
http://www.eie.polyu.edu.hk/~mwmak/

2

Outline

Speaker Verification
    Speaker Verification Process
    Voice Activity Detection (VAD) in Speaker Verification
    Effect of VAD on Acoustic Features
Characteristics of Interview-Speech in NIST Speaker Recognition Evaluation
VAD for NIST Speaker Recognition Evaluation
Experiments on NIST SRE 2008
Preliminary Results on NIST SRE 2010

3

Speaker Verification Process

To verify the identity of a claimant based on his/her own voice

Is this Mary’s voice?

I am Mary

4

A 2-class hypothesis problem:

H0: MFCC sequence $X^{(c)}$ comes from the true speaker

H1: MFCC sequence $X^{(c)}$ comes from an impostor

Verification score is a log-likelihood ratio:

\[
\mathrm{Score} = \log \frac{p(X^{(c)} \mid H_0)}{p(X^{(c)} \mid H_1)}
= \log p(X^{(c)} \mid \Lambda^{(s)}) - \log p(X^{(c)} \mid \Lambda^{(\mathrm{ubm})})
\]

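[Block diagram: the claimant's speech passes through Feature Extraction to give X^(c); the Speaker Model Λ^(s) gives log p(X^(c) | Λ^(s)) and the Background Model Λ^(ubm) gives log p(X^(c) | Λ^(ubm)); the two are combined (+/−) into the Score, and the Decision block outputs accept or reject according to the Score]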
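The score above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: both models are taken to be diagonal-covariance GMMs given as hypothetical `(weights, means, variances)` tuples, and the log-likelihoods are averaged over frames, which is a common normalization rather than something stated on the slide.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of X (T x D) under a diagonal GMM.

    weights: (K,), means: (K, D), variances: (K, D). All names are illustrative.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]                        # (T, K, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_weighted = log_gauss + np.log(weights)                      # (T, K)
    # log-sum-exp over mixture components, done stably
    m = log_weighted.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.exp(log_weighted - m).sum(axis=1))
    return frame_ll.mean()

def llr_score(X, speaker_gmm, ubm):
    """Score = log p(X | speaker model) - log p(X | background model)."""
    return gmm_loglik(X, *speaker_gmm) - gmm_loglik(X, *ubm)

def decide(score, threshold=0.0):
    """Accept the claim when the score exceeds an assumed decision threshold."""
    return "accept" if score > threshold else "reject"
```

In practice the threshold would be tuned on development data; the default of 0.0 here is a placeholder.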

5

Voice Activity Detection in Speaker Verification

[Block diagram: Speech → VAD → Speech segments → Feature Extraction (|X(ω)| → Log → DCT → MFCC) → Acoustic Features (MFCC)]
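A minimal sketch of the pipeline on this slide: frames are first gated by a VAD, and only the surviving frames go through |X(ω)| → log → DCT. The fixed-threshold energy VAD, the frame sizes, and the omission of a mel filterbank are all simplifying assumptions, so the output is a plain cepstrum rather than true MFCCs.

```python
import numpy as np

def frame_signal(x, frame_len=200, hop=80):
    """Slice a 1-D signal (at least frame_len samples long) into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def energy_vad(frames, threshold):
    """Mark a frame as speech when its mean energy exceeds a fixed threshold."""
    return np.mean(frames ** 2, axis=1) > threshold

def cepstra(frames, n_ceps=12):
    """|X(w)| -> log -> DCT-II, with no mel filterbank (a simplification)."""
    window = np.hamming(frames.shape[1])
    log_spec = np.log(np.abs(np.fft.rfft(frames * window, axis=1)) + 1e-10)
    n_bins = log_spec.shape[1]
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_bins)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_bins))
    return log_spec @ dct_basis.T

def vad_then_features(x, threshold):
    """Extract features only from frames that the VAD labels as speech."""
    frames = frame_signal(np.asarray(x, dtype=float))
    keep = energy_vad(frames, threshold)
    return cepstra(frames[keep]), keep
```

Frames rejected by the VAD never reach the feature extractor, which is why VAD errors feed directly into the acoustic features.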

6

Effect of VAD on Acoustic Features

[Figure: speech waveform with a non-speech region marked, and two scatter plots of MFCC feature vectors (dim 1 vs. dim 2)]




8

Interview-Speech in NIST SRE

[Figure: layout of the interview room, showing the interviewer, the interviewee, and the desk. Source: NIST SRE 2008 Workshop]

9

Far-field and desktop microphones were used for collecting interview speech

Some interview-speech files are very noisy, causing difficulty in differentiating speech segments from non-speech segments

[Figure: frequency and amplitude against time for a typical interview-speech file in NIST SRE 2008, with non-speech and speech regions marked]


10

Some files have very low SNR

[Figure: frequency, amplitude, and segmentation against time for a whole file. S: speech; h#: non-speech]

11

Some files contain spiky signals, which cause the VAD decision threshold to be set wrongly

[Figure: amplitude against time, with a spiky signal marked]


12

Some files contain low-energy speech signal superimposed on periodic background noise.

[Figure: frequency, amplitude, and segmentation against time; non-speech is detected as speech]





14

VAD for NIST Speaker Recognition Evaluation

Use speech enhancement as a pre-processing step

[Block diagram: Noisy Speech enters the Spectral-Subtraction VAD (SVAD), which produces Denoised Speech and Speech Segment Info (segments marked S); Feature Extraction yields MFCCs; Scoring uses the Speaker Model and the Impostor Model; Decision Making applies the Decision Threshold to give Accept/Reject]

15

Use speech enhancement as a pre-processing step

                    Signal    Frequency Spectrum
Clean speech        x(n,m)    X(ω,m)
Noisy speech        y(n,m)    Y(ω,m)
Background speech   b(n,m)    B(ω,m)

These values were set so that as much noise as possible is removed.
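As a rough illustration of the enhancement step, here is a generic magnitude spectral-subtraction sketch using the slide's notation (Y for the noisy spectrum, B for the background, X for the clean estimate). It is not the authors' exact formula: the over-subtraction factor `alpha`, the spectral floor `beta`, the non-overlapping frames, and the use of the first few frames as the background estimate are all assumptions made for the example.

```python
import numpy as np

def spectral_subtract(y, frame_len=256, n_noise_frames=10, alpha=2.0, beta=0.01):
    """Generic magnitude spectral subtraction on non-overlapping frames.

    alpha (over-subtraction) and beta (spectral floor) are illustrative
    parameters; the background spectrum is estimated from the first
    n_noise_frames frames, which are assumed to contain no speech.
    """
    y = np.asarray(y, dtype=float)
    n_frames = len(y) // frame_len
    frames = y[:n_frames * frame_len].reshape(n_frames, frame_len)

    Y = np.fft.rfft(frames, axis=1)            # Y(w, m)
    mag, phase = np.abs(Y), np.angle(Y)
    B_mag = mag[:n_noise_frames].mean(axis=0)  # estimate of |B(w)|

    # Subtract the scaled background magnitude, but never go below the floor.
    X_mag = np.maximum(mag - alpha * B_mag, beta * mag)

    # Reuse the noisy phase to return to the time domain.
    X = X_mag * np.exp(1j * phase)
    return np.fft.irfft(X, n=frame_len, axis=1).reshape(-1)
```

A larger `alpha` removes more noise at the cost of more distortion in the speech, which is usually an acceptable trade when the output is only used to locate speech segments.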

16


[Figure: amplitude against time, without denoising and with denoising]

17


[Figure: segmentation without denoising. S: speech; h#: non-speech]

18


[Figure: segmentation with denoising, comparing SS-VAD with the VAD in the ETSI-AMR speech coder. S: speech; h#: non-speech]

19


Speech-segment-length to speech-file-length ratio of 3 VADs

6249 speech files (NIST’05-08), each segmented into speech / non-speech by the three VADs

Example:
total duration: 10 s
total speech segments: 3 s
speech-segment-length to speech-file-length ratio = 3/10
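The ratio in the example can be computed directly from a VAD's output. The sketch below assumes the VAD returns speech segments as hypothetical `(start, end)` pairs in seconds.

```python
def speech_ratio(segments, file_duration):
    """Speech-segment-length to speech-file-length ratio.

    segments: iterable of (start, end) times in seconds (assumed non-overlapping).
    file_duration: total file length in seconds.
    """
    if file_duration <= 0:
        raise ValueError("file_duration must be positive")
    speech_length = sum(end - start for start, end in segments)
    return speech_length / file_duration

# Two segments totalling 3 s in a 10 s file, as in the example above.
print(speech_ratio([(1.0, 2.5), (6.0, 7.5)], 10.0))  # 0.3
```

A ratio close to 1 for a file known to contain long pauses is a sign that the VAD is labelling non-speech as speech.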

20


Speech-segment-length to speech-file-length ratio of 3 VADs

[Figure: histograms of the ratio for the Ordinary Energy-based VAD, the Spectral-Subtraction VAD, and the VAD in the ETSI AMR Coder]

Annotation: a high frequency of occurrence suggests that many non-speech segments are mistakenly detected as speech segments.




22

Experiments on NIST SRE 2008

Dataset: NIST’05 & NIST’06 (development), NIST’08 (performance evaluations)

Common Condition   Train/Test Condition                                                  No. of Targets   No. of Trials
1                  All interview speech                                                  622              14405
2                  Interview speech, same microphone type for training and test          125              731
3                  Interview speech, different microphone types for training and test    622              13674
4                  Interview speech for training, telephone speech for test              622              5048

Speaker Modeling: GMM-SVM
Score Normalization: T-norm

23

Results on NIST 2008 SRE

ETSI-AMR: VAD in AMR coder
Baseline: energy-based VAD without SS (γ = 0.99)
SS-VAD: spectral subtraction VAD

3.57 → 1.12 (69% relative reduction)
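The 69% figure is consistent with reading 3.57 and 1.12 as before and after values of the same metric (an assumption, since the extracted slide does not label them). The arithmetic is:

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before

print(round(100 * relative_reduction(3.57, 1.12)))  # 69
```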

24

Results on NIST 2008 SRE

[Figure: results for Common Condition 1, with legend entries SS-VAD, VAD, and ETSI AMR]

25

Preliminary Results on NIST 2010

Common Condition 2: All trials involving interview speech from different microphones

VAD                     EER (%)   Normalized minDCF
Energy-based VAD        11.72     0.99
SS-VAD                  4.45      0.58
SMB                     5.83      0.75
SS-SMB                  4.62      0.60
NIST ASR Transcripts    8.58      0.85
ETSI-AMR                8.05      0.85

SMB: Statistical-Model Based VAD. Sohn et al., “A statistical model-based voice activity detection”, IEEE Signal Processing Letters, 1999.
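For readers unfamiliar with the EER column, the following sketch shows one simple way to estimate an equal error rate from lists of target and impostor scores. It sweeps thresholds over the observed scores and takes the point where miss and false-alarm rates are closest; official NIST scoring tools interpolate more carefully, so this is an approximation for illustration only.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER by sweeping thresholds over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))

    # Miss: a target trial scoring below the threshold.
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False alarm: an impostor trial scoring at or above the threshold.
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])

    i = np.argmin(np.abs(p_miss - p_fa))
    return 0.5 * (p_miss[i] + p_fa[i])
```

Multiplying the returned fraction by 100 gives a percentage comparable to the table's EER column.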

26

Conclusions

Noise reduction is of primary importance for VAD under extremely low SNR.

It is important to remove the sinusoidal background found in NIST SRE sound files, as this kind of background signal could lead to many false detections in energy-based VAD.

Using noise reduction as a pre-processing step leads to a VAD that outperforms the VAD in ETSI-AMR (Option 2).

27


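Threshold Determination and VAD Decision Logic

[Figure: threshold determination. Labels: spike; frame; amplitude; sample-based amplitude ranking with ranked amplitudes a_p1, ..., a_pL; frame-based background level μ_b from 500 preset non-speech frames]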
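The figure labels suggest a threshold built from ranked sample amplitudes and a frame-based background level. The sketch below is one hypothetical way to combine them and should not be read as the authors' exact rule: it averages the L largest amplitudes as a peak level, takes μ_b from the quietest frames (standing in for the "preset non-speech frames"), and mixes the two with a weighting factor γ. The formula `eta = gamma * mu_b + (1 - gamma) * peak_level` is an assumption.

```python
import numpy as np

def rank_based_vad(x, frame_len=80, n_bkg_frames=500, L=100, gamma=0.99):
    """Hypothetical threshold from amplitude ranking and a background level.

    Returns (eta, decisions), where decisions[i] is True for speech frames.
    """
    amp = np.abs(np.asarray(x, dtype=float))

    # Sample-based amplitude ranking: a_p1 >= a_p2 >= ... >= a_pL.
    top = np.sort(amp)[::-1][:L]
    peak_level = top.mean()  # averaging dilutes the effect of isolated spikes

    # Frame-based background level mu_b from the quietest frames (assumption).
    n_frames = len(amp) // frame_len
    frame_amp = amp[:n_frames * frame_len].reshape(n_frames, frame_len).mean(axis=1)
    mu_b = np.sort(frame_amp)[:n_bkg_frames].mean()

    # Assumed combination rule with weighting factor gamma.
    eta = gamma * mu_b + (1.0 - gamma) * peak_level
    return eta, frame_amp > eta
```

With `L=1` the peak level is the single largest sample, so one spike can push the threshold above every speech frame; a larger `L` keeps the threshold closer to the typical speech level.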

28

Results

To find the optimum weighting factor, γ

29


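GMM-SVM: Training phase

[Block diagram: background utterances (NIST’05 & 06) are used for the UBM; speaker utterances (NIST’08) give the GMM-supervectors of target speakers; 300 background speakers (NIST’06) give the GMM-supervectors of 300 impostors; both sets of supervectors pass through NAP before the speaker's SVM is trained]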
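NAP (nuisance attribute projection) removes directions of session variability from the supervectors before SVM training. A generic sketch follows, assuming the nuisance subspace is estimated from within-speaker variation in a labelled development set; the function name, the SVD-based estimate, and the choice of `n_dims` are illustrative and not taken from the slides.

```python
import numpy as np

def nap_projection(supervectors, speaker_ids, n_dims):
    """Return P = I - U U^T, where U spans the top within-speaker directions."""
    sv = np.asarray(supervectors, dtype=float)
    ids = np.asarray(speaker_ids)

    # Remove each speaker's mean so that only session variability remains.
    centered = sv.copy()
    for s in np.unique(ids):
        centered[ids == s] -= sv[ids == s].mean(axis=0)

    # Right singular vectors give orthonormal directions of largest variability.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    U = vt[:n_dims].T

    return np.eye(sv.shape[1]) - U @ U.T
```

Applying `P @ m` to a session-dependent supervector `m` yields a supervector with the estimated nuisance directions removed.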

30


Verification phase

[Block diagram: MFCCs X^(c) of a test utterance from claimant c → MAP and Mean Stacking (using the UBM) → session-dependent supervector m^(c,h) → NAP → session-independent supervector m^(c) → SVM Scoring (SVM of target-speaker s) → score S(X^(c)) → T-Norm (using the Tnorm Models) → normalized score S~(X^(c))]
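The final T-norm step can be sketched as follows. This is the standard test-normalization formula: the raw score is shifted and scaled by the mean and standard deviation of the scores that the same test utterance obtains against a cohort of T-norm models. The function and argument names are illustrative.

```python
import numpy as np

def t_norm(raw_score, tnorm_scores):
    """Normalize a raw score using the test utterance's T-norm cohort scores."""
    tnorm_scores = np.asarray(tnorm_scores, dtype=float)
    mu = tnorm_scores.mean()
    sigma = tnorm_scores.std()
    if sigma == 0:
        raise ValueError("T-norm cohort scores have zero variance")
    return (raw_score - mu) / sigma
```

Because the cohort scores are computed per test utterance, the normalization compensates for utterance-dependent score shifts before the decision threshold is applied.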
