TRANSCRIPT
HMM-BASED PSEUDO-CLEAN SPEECH SYNTHESIS FOR SPLICE ALGORITHM
Jun Du, Yu Hu, Li-Rong Dai, Ren-Hua Wang
Wen-Yi Chu
Department of Computer Science & Information Engineering
National Taiwan Normal University
Outline
Introduction
Review of SPLICE
Our modifications
Experiments and results
Conclusions
Introduction
• The motivation of our approach is to relax the constraint of recorded stereo data from a new viewpoint: pseudo-clean features generated by an HMM-based synthesis method are used to replace the ideal clean features from one of the stereo channels in SPLICE.
• Moreover, a simple ML-based bias adaptation algorithm is proposed to handle the mismatch between training and testing; it yields consistent improvements on the different testing sets of Aurora2.
• As an extension, this method of generating pseudo-clean features can be used in any algorithm, like SPLICE, where stereo data is needed.
Review of SPLICE(1/2)
Two assumptions of speech modeling and degradation
• The first assumption is that the noisy speech cepstral vector $y_t$ follows a mixture-of-Gaussians distribution:

$$p(y_t) = \sum_m p(y_t \mid m)\, p(m), \qquad p(y_t \mid m) = N(y_t;\, \mu_m,\, \Sigma_m)$$

• The second assumption is that the conditional distribution of the clean vector $x_t$ given the noisy speech vector $y_t$ in each component is Gaussian, whose mean vector is a linear transformation of $y_t$ with the bias vector $r_m$:

$$p(x_t \mid y_t, m) = N(x_t;\, y_t + r_m,\, \Gamma_m)$$
Review of SPLICE(2/2)
SPLICE training

$$r_m = \frac{\sum_t p(m \mid y_t)\,(x_t - y_t)}{\sum_t p(m \mid y_t)}, \qquad p(m \mid y_t) = \frac{p(y_t \mid m)\, p(m)}{\sum_l p(y_t \mid l)\, p(l)}$$

Environment selection

$$E^* = \arg\max_E p(E \mid Y) = \arg\max_E p(Y \mid E)\, p(E)$$

where $Y$ is the sequence of the noisy speech vectors in the current utterance, and $p(E)$ is set equally for all $E$.

Cepstral Enhancement

$$\hat{x}_t = E[x_t \mid y_t] = y_t + \sum_m p(m \mid y_t)\, r_m$$
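The training and enhancement steps above can be sketched with a small numpy implementation. This is a minimal sketch, assuming a pre-trained diagonal-covariance GMM on the noisy channel and paired clean/noisy (or pseudo-clean/noisy) features; the function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def gmm_log_posteriors(Y, weights, means, variances):
    """Per-frame log p(m | y_t) for a diagonal-covariance GMM.
    Y: (T, D) noisy features; weights: (M,); means, variances: (M, D)."""
    # log N(y_t; mu_m, Sigma_m) for every frame/mixture pair
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)       # (M,)
    diff = Y[:, None, :] - means[None, :, :]                          # (T, M, D)
    log_lik = log_norm - 0.5 * (diff ** 2 / variances).sum(axis=2)    # (T, M)
    log_joint = np.log(weights) + log_lik
    return log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)

def splice_train(X, Y, weights, means, variances):
    """Bias vectors r_m = sum_t p(m|y_t)(x_t - y_t) / sum_t p(m|y_t)."""
    post = np.exp(gmm_log_posteriors(Y, weights, means, variances))   # (T, M)
    num = post.T @ (X - Y)                                            # (M, D)
    return num / post.sum(axis=0)[:, None]

def splice_enhance(Y, r, weights, means, variances):
    """Cepstral enhancement: x_hat_t = y_t + sum_m p(m|y_t) r_m."""
    post = np.exp(gmm_log_posteriors(Y, weights, means, variances))
    return Y + post @ r
```

With a single mixture the estimated bias reduces to the average clean-minus-noisy difference, which is a convenient sanity check.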
State-level forced alignment
• First, using the multi-training speech features, we obtain ML-trained HMMs.
• Then state-level forced alignment of the multi-training features is performed.
HMM-based speech synthesis(1/2)
• Multi-training HMMs are denoted as λ, and the speech parameter vector sequence to be determined is described as:

$$O = [o_1^T, o_2^T, \ldots, o_T^T]^T, \qquad o_t = [c_t^T, \Delta c_t^T, \Delta^2 c_t^T]^T$$

where the dynamic and static features should satisfy the constraint

$$O = WC$$

Here $C$ is the sequence of static cepstral vectors, and the transformation matrix $W$ is decided by the relation between the static and dynamic features.

• From the state-level forced alignment, the state sequence for all frames can be given:

$$S = \{s_1, s_2, \ldots, s_T\}$$
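The constraint O = WC can be illustrated by constructing a small window matrix. This is a sketch with commonly used delta and delta-delta window coefficients, which are an assumption here (the slides do not specify them), and one-dimensional features for clarity.

```python
import numpy as np

def build_window_matrix(T, delta=(-0.5, 0.0, 0.5), accel=(1.0, -2.0, 1.0)):
    """W mapping static c_{1..T} to o_t = [c_t, delta c_t, delta^2 c_t].
    One-dimensional features for clarity; window taps are clipped at the
    utterance boundaries."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                                   # static row
        for tau, w in zip((-1, 0, 1), delta):               # delta row
            W[3 * t + 1, min(max(t + tau, 0), T - 1)] += w
        for tau, w in zip((-1, 0, 1), accel):               # delta-delta row
            W[3 * t + 2, min(max(t + tau, 0), T - 1)] += w
    return W
```

Applying W to a static sequence stacks the static, delta, and delta-delta values frame by frame, which is exactly the relation the constraint encodes.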
HMM-based speech synthesis(2/2)
• We calculate the mean and covariance of state $s_t$ as follows:

$$\bar{\mu}_{s_t} = \sum_m p(m \mid o_t)\, \mu_{s_t, m}$$

$$\bar{U}_{s_t} = \sum_m p(m \mid o_t)\, (U_{s_t, m} + \mu_{s_t, m}\mu_{s_t, m}^T) - \bar{\mu}_{s_t}\bar{\mu}_{s_t}^T$$

• For given λ and S, our target is to maximize the likelihood function $p(O \mid S, \lambda)$ with respect to O. The objective function can be written as:

$$\log p(O \mid S, \lambda) = -\frac{1}{2} O^T U^{-1} O + O^T U^{-1} M + K \qquad (1)$$

where

$$U^{-1} = \mathrm{diag}[\bar{U}_{s_1}^{-1}, \bar{U}_{s_2}^{-1}, \ldots, \bar{U}_{s_T}^{-1}], \qquad M = [\bar{\mu}_{s_1}^T, \bar{\mu}_{s_2}^T, \ldots, \bar{\mu}_{s_T}^T]^T$$

and the constant K is independent of O. By optimizing Eq. 1 under the constraint $O = WC$, we obtain a set of equations:

$$W^T U^{-1} W C = W^T U^{-1} M$$
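The set of equations above is a linear system in the static sequence C and can be solved directly. The sketch below assumes W, the stacked state means M, and the diagonal of U⁻¹ are already available; in practice the system is banded and solved more efficiently, but a dense solve shows the math.

```python
import numpy as np

def generate_parameters(W, M, U_inv_diag):
    """Solve W^T U^{-1} W C = W^T U^{-1} M for the static sequence C.
    W: (3T, T) window matrix; M: (3T,) stacked state means;
    U_inv_diag: (3T,) diagonal of U^{-1}."""
    WtU = W.T * U_inv_diag                 # W^T U^{-1} via column scaling
    return np.linalg.solve(WtU @ W, WtU @ M)
```

When M is itself consistent with some static sequence (M = W C_true) the solver recovers that sequence exactly, since the system is the normal equations of a weighted least-squares fit.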
Post-processing
• As post-processing, gain normalization is performed, based on the observation that the dynamic ranges of the clean speech features, the noisy speech features, and the synthesized speech features differ considerably:

$$x_t^{\text{syn}} = \left( o_t - \frac{1}{T}\sum_{t=1}^{T} o_t \right) \Big/ \left( \max_{1 \le t \le T} o_t - \min_{1 \le t \le T} o_t \right)$$

where $x_t^{\text{syn}}$ is the final synthesized pseudo-clean feature vector applied to the SPLICE algorithm.
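The gain normalization step can be sketched as mean removal followed by dynamic-range scaling over the utterance. Treating each feature dimension independently is an assumption here, not stated in the slides.

```python
import numpy as np

def gain_normalize(O):
    """Remove the utterance mean and scale each feature dimension by its
    dynamic range (max - min) over the utterance. Per-dimension treatment
    is an assumption; zero-range dimensions are left unscaled."""
    mean = O.mean(axis=0)
    span = O.max(axis=0) - O.min(axis=0)
    return (O - mean) / np.where(span > 0, span, 1.0)
```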
ML-based bias adaptation
• In order to handle the mismatch between training and testing, a simple ML-based bias adaptation algorithm is proposed.
$$\hat{r}_m = r_m + b$$

where $\hat{r}_m$ is the adapted bias vector for the mixture $m$, and $b$ is the global bias shift describing the mismatch between training and testing.
• $b$ can be iteratively estimated using the maximum likelihood criterion:

$$b_i = \frac{\sum_t \sum_m p(m \mid y_t, b)\, (y_{t,i} - \mu_{m,i}) / \sigma_{m,i}^2}{\sum_t \sum_m p(m \mid y_t, b) / \sigma_{m,i}^2}$$

$$p(m \mid y_t, b) = \frac{p(m)\, N(y_t;\, \mu_m + b,\, \Sigma_m)}{\sum_l p(l)\, N(y_t;\, \mu_l + b,\, \Sigma_l)}$$

where the subscript $i$ denotes the dimension index, and $y_t$ is the feature vector of the current utterance at frame $t$.
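The iterative ML estimation of the global bias shift can be sketched as follows for a diagonal-covariance GMM: posteriors are computed under the means shifted by the current b, then b is re-estimated as a variance-weighted average of the residuals. A minimal sketch with illustrative names.

```python
import numpy as np

def estimate_bias(Y, weights, means, variances, n_iter=10):
    """Iterative ML estimate of a global bias b for a diagonal GMM:
    b_i = sum_{t,m} p(m|y_t,b)(y_{t,i}-mu_{m,i})/var_{m,i}
          / sum_{t,m} p(m|y_t,b)/var_{m,i}."""
    b = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        # posteriors p(m | y_t, b) under means shifted by the current b
        diff = Y[:, None, :] - (means + b)[None, :, :]           # (T, M, D)
        log_lik = -0.5 * ((diff ** 2 / variances).sum(axis=2)
                          + np.log(2 * np.pi * variances).sum(axis=1))
        log_joint = np.log(weights) + log_lik
        post = np.exp(log_joint
                      - np.logaddexp.reduce(log_joint, axis=1, keepdims=True))
        # variance-weighted residual average gives the new b
        num = np.einsum('tm,tmd->d', post, (Y[:, None, :] - means) / variances)
        den = (post @ (1.0 / variances)).sum(axis=0)             # (D,)
        b = num / den
    return b
```

With a single mixture the estimate reduces to the mean residual between the data and the model mean, which converges in one iteration.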
Experiments and results
• CMN is applied before the SPLICE algorithm.
• The bias parameters are trained using the pseudo-clean and real multi-style data for each of the 17 noise conditions.
• The noisy speech model consists of a mixture of 256 Gaussians with diagonal covariance matrices.
Conclusions
• First, we remove the constraint of stereo data by exploiting an HMM-based synthesis method to generate the pseudo-clean speech parameters.
• Then a simple ML-based bias adaptation algorithm is proposed to handle the mismatch between training and testing; it yields consistent improvements on the different testing sets of Aurora2.
• In our future work, we will further study the noise robustness of HMM-based speech parameter generation, which can be combined with other robust techniques for noisy speech recognition.