TRANSCRIPT
HMM-BASED PSEUDO-CLEAN SPEECH SYNTHESIS FOR SPLICE ALGORITHM
Jun Du, Yu Hu, Li-Rong Dai, Ren-Hua Wang
Wen-Yi Chu
Department of Computer Science & Information Engineering
National Taiwan Normal University
Outline
Introduction
Review of SPLICE
Our modifications
Experiments and results
Conclusions
Introduction
• The motivation of our approach is to relax the constraint of recorded stereo data from a new viewpoint: pseudo-clean features generated by an HMM-based synthesis method are used to replace the ideal clean features from one of the stereo channels in SPLICE.
• Moreover, a simple ML-based bias adaptation algorithm is proposed to handle the mismatch between training and testing; it yields consistent improvements on the different testing sets of Aurora2.
• As an extension, this method of generating pseudo-clean features can be used in any algorithm, like SPLICE, where stereo data is needed.
Review of SPLICE(1/2)
Two assumptions of speech modeling and degradation
• The first assumption is that the noisy speech cepstral vector $y_t$ follows a mixture-of-Gaussians distribution:

$$p(y_t) = \sum_m p(y_t \mid m)\, p(m), \qquad p(y_t \mid m) = N(y_t;\, \mu_m,\, \Sigma_m)$$

• The second assumption is that the conditional distribution of the clean vector $x_t$ given the noisy speech vector $y_t$ in each component is Gaussian, whose mean vector is a linear transformation of $y_t$ with the bias vector $r_m$:

$$p(x_t \mid y_t, m) = N(x_t;\, y_t + r_m,\, \Gamma_m)$$
Review of SPLICE(2/2)
SPLICE training

$$r_m = \frac{\sum_t p(m \mid y_t)\,(x_t - y_t)}{\sum_t p(m \mid y_t)}, \qquad p(m \mid y_t) = \frac{p(y_t \mid m)\, p(m)}{\sum_l p(y_t \mid l)\, p(l)}$$

Environment selection

$$E^* = \arg\max_E p(E \mid Y) = \arg\max_E p(Y \mid E)\, p(E)$$

where $Y$ is the sequence of the noisy speech vectors in the current utterance, and $p(E)$ is set equally for all $E$.

Cepstral Enhancement

$$\hat{x}_t = E[x_t \mid y_t] = y_t + \sum_m p(m \mid y_t)\, r_m$$
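The training and enhancement steps above can be sketched with a small numpy implementation. This is a minimal sketch, assuming a pre-trained diagonal-covariance GMM on the noisy channel and paired clean/noisy (or pseudo-clean/noisy) features; the function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def gmm_log_posteriors(Y, weights, means, variances):
    """Per-frame log p(m | y_t) for a diagonal-covariance GMM.
    Y: (T, D) noisy features; weights: (M,); means, variances: (M, D)."""
    # log N(y_t; mu_m, Sigma_m) for every frame/mixture pair
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)       # (M,)
    diff = Y[:, None, :] - means[None, :, :]                          # (T, M, D)
    log_lik = log_norm - 0.5 * (diff ** 2 / variances).sum(axis=2)    # (T, M)
    log_joint = np.log(weights) + log_lik
    return log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True)

def splice_train(X, Y, weights, means, variances):
    """Bias vectors r_m = sum_t p(m|y_t)(x_t - y_t) / sum_t p(m|y_t)."""
    post = np.exp(gmm_log_posteriors(Y, weights, means, variances))   # (T, M)
    num = post.T @ (X - Y)                                            # (M, D)
    return num / post.sum(axis=0)[:, None]

def splice_enhance(Y, r, weights, means, variances):
    """Cepstral enhancement: x_hat_t = y_t + sum_m p(m|y_t) r_m."""
    post = np.exp(gmm_log_posteriors(Y, weights, means, variances))
    return Y + post @ r
```

With a single mixture the estimated bias reduces to the average clean-minus-noisy difference, which is a convenient sanity check.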
State-level forced alignment
• First, using the multi-training speech features, we obtain ML-trained HMMs.
• Then state-level forced alignment of the multi-training features is performed.
HMM-based speech synthesis(1/2)
• Multi-training HMMs are denoted as λ, and the speech parameter vector sequence to be determined is described as:

$$O = [o_1^T, o_2^T, \ldots, o_T^T]^T, \qquad o_t = [c_t^T, \Delta c_t^T, \Delta^2 c_t^T]^T$$

where the dynamic and static features should satisfy the constraint

$$O = WC$$

Here $C$ is the sequence of static cepstral vectors, and the transformation matrix $W$ is decided by the relation between the static and dynamic features.

• From the state-level forced alignment, the state sequence for all frames can be given:

$$S = \{s_1, s_2, \ldots, s_T\}$$
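The constraint O = WC can be illustrated by constructing a small window matrix. This is a sketch with commonly used delta and delta-delta window coefficients, which are an assumption here (the slides do not specify them), and one-dimensional features for clarity.

```python
import numpy as np

def build_window_matrix(T, delta=(-0.5, 0.0, 0.5), accel=(1.0, -2.0, 1.0)):
    """W mapping static c_{1..T} to o_t = [c_t, delta c_t, delta^2 c_t].
    One-dimensional features for clarity; window taps are clipped at the
    utterance boundaries."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                                   # static row
        for tau, w in zip((-1, 0, 1), delta):               # delta row
            W[3 * t + 1, min(max(t + tau, 0), T - 1)] += w
        for tau, w in zip((-1, 0, 1), accel):               # delta-delta row
            W[3 * t + 2, min(max(t + tau, 0), T - 1)] += w
    return W
```

Applying W to a static sequence stacks the static, delta, and delta-delta values frame by frame, which is exactly the relation the constraint encodes.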
HMM-based speech synthesis(2/2)
• We calculate the mean and covariance of state $s_t$ as follows:

$$\bar{\mu}_{s_t} = \sum_m p(m \mid o_t)\, \mu_{s_t, m}$$

$$\bar{U}_{s_t} = \sum_m p(m \mid o_t)\, (U_{s_t, m} + \mu_{s_t, m}\mu_{s_t, m}^T) - \bar{\mu}_{s_t}\bar{\mu}_{s_t}^T$$

• For given λ and S, our target is to maximize the likelihood function $p(O \mid S, \lambda)$ with respect to O. The objective function can be written as:

$$\log p(O \mid S, \lambda) = -\frac{1}{2} O^T U^{-1} O + O^T U^{-1} M + K \qquad (1)$$

where

$$U^{-1} = \mathrm{diag}[\bar{U}_{s_1}^{-1}, \bar{U}_{s_2}^{-1}, \ldots, \bar{U}_{s_T}^{-1}], \qquad M = [\bar{\mu}_{s_1}^T, \bar{\mu}_{s_2}^T, \ldots, \bar{\mu}_{s_T}^T]^T$$

and the constant K is independent of O. By optimizing Eq. 1 under the constraint $O = WC$, we obtain a set of equations:

$$W^T U^{-1} W C = W^T U^{-1} M$$
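The set of equations above is a linear system in the static sequence C and can be solved directly. The sketch below assumes W, the stacked state means M, and the diagonal of U⁻¹ are already available; in practice the system is banded and solved more efficiently, but a dense solve shows the math.

```python
import numpy as np

def generate_parameters(W, M, U_inv_diag):
    """Solve W^T U^{-1} W C = W^T U^{-1} M for the static sequence C.
    W: (3T, T) window matrix; M: (3T,) stacked state means;
    U_inv_diag: (3T,) diagonal of U^{-1}."""
    WtU = W.T * U_inv_diag                 # W^T U^{-1} via column scaling
    return np.linalg.solve(WtU @ W, WtU @ M)
```

When M is itself consistent with some static sequence (M = W C_true) the solver recovers that sequence exactly, since the system is the normal equations of a weighted least-squares fit.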
Post-processing
• As post-processing, gain normalization is performed, based on the observation that the dynamic ranges of the clean speech features, the noisy speech features, and the synthesized speech features differ considerably:

$$x_t^{\text{syn}} = \left( o_t - \frac{1}{T}\sum_{t=1}^{T} o_t \right) \Big/ \left( \max_{1 \le t \le T} o_t - \min_{1 \le t \le T} o_t \right)$$

where $x_t^{\text{syn}}$ is the final synthesized pseudo-clean feature vector applied to the SPLICE algorithm.
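The gain normalization step can be sketched as mean removal followed by dynamic-range scaling over the utterance. Treating each feature dimension independently is an assumption here, not stated in the slides.

```python
import numpy as np

def gain_normalize(O):
    """Remove the utterance mean and scale each feature dimension by its
    dynamic range (max - min) over the utterance. Per-dimension treatment
    is an assumption; zero-range dimensions are left unscaled."""
    mean = O.mean(axis=0)
    span = O.max(axis=0) - O.min(axis=0)
    return (O - mean) / np.where(span > 0, span, 1.0)
```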
ML-based bias adaptation
• In order to handle the mismatch between training and testing, a simple ML-based bias adaptation algorithm is proposed.
$$\hat{r}_m = r_m + b$$

where $\hat{r}_m$ is the adapted bias vector for the mixture $m$, and $b$ is the global bias shift describing the mismatch between training and testing.
• $b$ can be iteratively estimated using the maximum likelihood criterion:

$$b_i = \frac{\sum_t \sum_m p(m \mid y_t, b)\, (y_{t,i} - \mu_{m,i}) / \sigma_{m,i}^2}{\sum_t \sum_m p(m \mid y_t, b) / \sigma_{m,i}^2}$$

$$p(m \mid y_t, b) = \frac{p(m)\, N(y_t;\, \mu_m + b,\, \Sigma_m)}{\sum_l p(l)\, N(y_t;\, \mu_l + b,\, \Sigma_l)}$$

where the subscript $i$ denotes the dimension index, and $y_t$ is the feature vector of the current utterance at frame $t$.
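The iterative ML estimation of the global bias shift can be sketched as follows for a diagonal-covariance GMM: posteriors are computed under the means shifted by the current b, then b is re-estimated as a variance-weighted average of the residuals. A minimal sketch with illustrative names.

```python
import numpy as np

def estimate_bias(Y, weights, means, variances, n_iter=10):
    """Iterative ML estimate of a global bias b for a diagonal GMM:
    b_i = sum_{t,m} p(m|y_t,b)(y_{t,i}-mu_{m,i})/var_{m,i}
          / sum_{t,m} p(m|y_t,b)/var_{m,i}."""
    b = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        # posteriors p(m | y_t, b) under means shifted by the current b
        diff = Y[:, None, :] - (means + b)[None, :, :]           # (T, M, D)
        log_lik = -0.5 * ((diff ** 2 / variances).sum(axis=2)
                          + np.log(2 * np.pi * variances).sum(axis=1))
        log_joint = np.log(weights) + log_lik
        post = np.exp(log_joint
                      - np.logaddexp.reduce(log_joint, axis=1, keepdims=True))
        # variance-weighted residual average gives the new b
        num = np.einsum('tm,tmd->d', post, (Y[:, None, :] - means) / variances)
        den = (post @ (1.0 / variances)).sum(axis=0)             # (D,)
        b = num / den
    return b
```

With a single mixture the estimate reduces to the mean residual between the data and the model mean, which converges in one iteration.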
Experiments and results
• CMN is applied before the SPLICE algorithm.
• The bias parameters are trained using the pseudo-clean and real multi-style data for each of the 17 noise conditions.
• The noisy speech model consists of a mixture of 256 Gaussians with diagonal covariance matrices.
Conclusions
• First, we remove the constraint of stereo data by exploiting an HMM-based synthesis method to generate the pseudo-clean speech parameters.
• Then a simple ML-based bias adaptation algorithm is proposed to handle the mismatch between training and testing; it yields consistent improvements on the different testing sets of Aurora2.
• In our future work, we will further study the noise robustness of HMM-based speech parameter generation, which can be combined with other robust techniques for noisy speech recognition.