Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis
João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi
The Centre for Speech Technology Research, The University of Edinburgh
2
Outline
• Introduction
• Voice source model
• System
• Perceptual evaluation
• Concluding remarks
• Future work
3
Introduction: HMM-based speech synthesizer [Tokuda et al]
[Figure: block diagram. Training part: training speech, F0 extraction, spectral features estimation, HMMs. Synthesis part: text, text analysis, HMMs, F0 and spectrum, pulse train + noise component, synthesis filter, synthetic speech.]
4
Voice source model: Obtaining the glottal source signal
• Source-filter model: Source (Ug) → Vocal tract A(z) → Lip radiation (d/dz) → Speech
• Inverse filtering: Speech → Inverse filter 1/A(z) → $d\hat{U}_g$ → Lip radiation cancellation (∫) → $\hat{U}_g$
5
Voice source model: Liljencrants-Fant model (LF-model)
T : period
to : opening instant
tp : instant of max airflow
te : instant of max excitation
ta : return phase duration
tc : closing instant
Ee : excitation amplitude
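As an illustration of how these parameters define one period of the glottal flow derivative, here is a minimal Python sketch. It assumes the standard LF-model equations (an exponentially growing sinusoid up to te, then an exponential return phase up to tc) with to = 0 and zero net flow over the period. The function name `lf_pulse`, the root-finding brackets and the default sampling rate are choices made for this sketch, not details taken from the slides.

```python
import numpy as np
from scipy.optimize import brentq


def lf_pulse(T, tp, te, ta, tc, Ee, fs=16000):
    """One period of the LF-model glottal flow derivative (t_o = 0).

    Times are in seconds. Returns (t, e) sampled at fs.
    """
    if not (0 < tp < te < tc <= T) or te >= 2 * tp or ta >= tc - te:
        raise ValueError("inconsistent LF timing parameters")
    wg = np.pi / tp              # angular frequency of the open-phase sinusoid
    tb = tc - te                 # duration of the return phase

    # Return-phase decay constant: eps * ta = 1 - exp(-eps * tb)
    eps = brentq(lambda x: x * ta + np.expm1(-x * tb), 1e-6 / ta, 1.0 / ta)

    # Area under the return phase (negative), in closed form
    area_return = -(Ee / (eps * ta)) * (
        (1.0 - np.exp(-eps * tb)) / eps - tb * np.exp(-eps * tb)
    )

    s, c = np.sin(wg * te), np.cos(wg * te)

    def net_flow(alpha):
        # Area under the open phase, with E0 eliminated through E(te) = -Ee
        area_open = -Ee * (alpha * s - wg * c + wg * np.exp(-alpha * te)) / (
            (alpha ** 2 + wg ** 2) * s
        )
        return area_open + area_return

    # Growth factor alpha chosen so that the flow returns to zero at tc
    alpha = brentq(net_flow, -50.0 / te, 50.0 / te)
    E0 = -Ee / (np.exp(alpha * te) * s)

    t = np.arange(int(round(T * fs))) / fs
    e = np.zeros_like(t)
    opening = t <= te
    returning = (t > te) & (t <= tc)
    e[opening] = E0 * np.exp(alpha * t[opening]) * np.sin(wg * t[opening])
    e[returning] = -(Ee / (eps * ta)) * (
        np.exp(-eps * (t[returning] - te)) - np.exp(-eps * tb)
    )
    return t, e


t, e = lf_pulse(T=0.010, tp=0.004, te=0.0055, ta=0.0003, tc=0.008, Ee=1.0)
```

The waveform reaches its negative peak of amplitude Ee at te and is zero between tc and T (closed phase).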
6
Voice source model: Other parameters of the LF-model
Open quotient: $OQ = \dfrac{t_e + t_a}{T}$
Speed quotient: $SQ = \dfrac{t_p}{t_e - t_p}$
Return quotient: $RQ = \dfrac{t_a}{T}$
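A short worked example of the three quotients, assuming the definitions OQ = (te + ta)/T, SQ = tp/(te − tp) and RQ = ta/T with the opening instant at to = 0. The function name and the timing values are illustrative only.

```python
def lf_quotients(T, tp, te, ta):
    """Return (OQ, SQ, RQ) for LF timing parameters in seconds, with t_o = 0."""
    oq = (te + ta) / T      # open quotient
    sq = tp / (te - tp)     # speed quotient
    rq = ta / T             # return quotient
    return oq, sq, rq


# Hypothetical values for a 100 Hz voice (T = 10 ms)
oq, sq, rq = lf_quotients(T=0.010, tp=0.004, te=0.0055, ta=0.0003)
print(f"OQ = {oq:.2f}, SQ = {sq:.2f}, RQ = {rq:.2f}")
```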
7
Voice source model: Description of the LF-model spectrum
Linear stylization of the LF-model spectrum [Doval and d’Alessandro]
Fg : glottal spectral peak
Fc : spectral tilt
8
Voice source model: Feature extraction
• utterances sampled at 16 kHz
• pitch-synchronous analysis (ESPS tools)
• LPC coefficients calculated over 20 ms windows centered at the glottal epochs
• inverse filtering to estimate DGS
• pre-emphasis filter (α = 0.97)
• low-pass filtering of the residual at 4 kHz
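The analysis steps listed above can be sketched in Python as follows. This is a minimal illustration, not the ESPS-based procedure used in the work: the LPC order, the Hamming window, the autocorrelation method, the Butterworth low-pass filter and the choice of applying the inverse filter to the original (not pre-emphasized) frame are all assumptions. The code uses the common convention in which A(z) is the prediction-error (inverse) filter, so the vocal tract corresponds to 1/A(z).

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import butter, filtfilt, lfilter


def lpc(frame, order):
    """Autocorrelation-method LPC; returns [1, a1, ..., ap] of the error filter."""
    w = frame * np.hamming(len(frame))
    n = len(w)
    r = np.correlate(w, w, mode="full")[n - 1 : n + order]  # lags 0..order
    a = solve_toeplitz(r[:order], -r[1 : order + 1])         # normal equations
    return np.concatenate(([1.0], a))


def inverse_filter(speech, epoch, fs=16000, order=18, win_s=0.020,
                   alpha=0.97, cutoff_hz=4000.0):
    """Estimate the residual around one glottal epoch (sample index)."""
    pre = lfilter([1.0, -alpha], [1.0], speech)      # pre-emphasis 1 - alpha z^-1
    half = int(round(win_s * fs / 2))                # 20 ms window centered at epoch
    start, stop = max(epoch - half, 0), min(epoch + half, len(speech))
    a = lpc(pre[start:stop], order)                  # LPC on the pre-emphasized frame
    residual = lfilter(a, [1.0], speech[start:stop])  # inverse filtering
    b_lp, a_lp = butter(4, cutoff_hz / (fs / 2))     # low-pass at 4 kHz
    return a, filtfilt(b_lp, a_lp, residual)
```

In a pitch-synchronous analysis, `inverse_filter` would be called once per glottal epoch, with the epochs supplied by a pitch-marking tool.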
9
Voice source model: Estimation of te and Ee
te and Ee are estimated from the pitch-marks
10
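Voice source model: Estimation of tc, tp and to
$t_o = \dfrac{2\,(U_{max} - U_{min})}{E_{max}}$
$t_c$ : instant of the minimum glottal flow, $U_{min}$
$t_p$ : instant of the maximum glottal flow, $U_{max}$
[Gobl & Chasaide]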
11
Voice source model: Estimation of ta
$t_a = \dfrac{E_e}{m\,F_s}$
Fs : sampling frequency
m : slope of the tangent at t=te
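A small numeric sketch of this estimate, assuming ta = Ee / (m · Fs) with m measured in amplitude units per sample. Approximating the tangent at te by a one-sample forward difference is an assumption of this example, as is the function name.

```python
import numpy as np


def estimate_ta(dgs, te_index, fs=16000):
    """Estimate the return phase duration ta (seconds) from the residual."""
    dgs = np.asarray(dgs, dtype=float)
    Ee = -dgs[te_index]                       # excitation amplitude (positive)
    m = dgs[te_index + 1] - dgs[te_index]     # slope at te, per sample
    if Ee <= 0 or m <= 0:
        raise ValueError("expected a negative peak followed by a rising slope")
    return Ee / (m * fs)


# Toy waveform: negative peak of -1.0 at index 2, rising by 0.2 per sample,
# so the tangent reaches zero after 5 samples (5 / 16000 s).
ta = estimate_ta([0.0, -0.5, -1.0, -0.8, -0.6, -0.4, -0.2, 0.0], te_index=2)
```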
12
Voice source model: Examples of the estimated parameters
[Figure: curves of the LF-parameters for two voiced regions of an utterance]
13
System: General description
- Nitech-HTS 2005 system
- STRAIGHT method for analysis and synthesis
- mixed multi-band excitation with phase manipulation /
pulse train
- Mel Log Spectrum Approximation (MLSA) filter
How was the LF-model integrated in the synthesizer?
14
System: Generation of the periodic excitation (pulse signal)
• Pulse centered within the frame
• multiplied by asymmetric windows
• summed with Gaussian noise
15
System: Periodic excitation with the LF-model
• Two LF-waveforms centered at the instant te
• multiplied by asymmetric windows
• summed with Gaussian noise
16
System: Technical problem
Problem: the synthesis filter assumes the excitation to have a flat spectrum like the pulse train
Solution: Post-filter
Linear phase FIR filter:
-6dB/dec 1Hz ≤ f ≤ Fg (Hz)
+6dB/dec Fg < f ≤ Fc (Hz)
+12dB/dec Fc < f ≤ 16 kHz
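One possible way to realize such a piecewise-slope, linear-phase FIR filter is sketched below with `scipy.signal.firwin2`. The number of taps, the frequency grid, the 0 dB reference at Fg, the example values of Fg and Fc and the function name are assumptions. The sketch also only covers frequencies up to the Nyquist frequency fs/2 of the chosen sampling rate, and it holds the gain constant below 1 Hz.

```python
import numpy as np
from scipy.signal import firwin2


def post_filter(fg, fc, fs=16000, numtaps=513, slopes_db=(-6.0, 6.0, 12.0)):
    """Linear-phase FIR with three slopes (dB per decade), 0 dB at fg."""
    nyq = fs / 2.0
    f = np.linspace(0.0, nyq, 1024)
    fclip = np.maximum(f, 1.0)               # the specified range starts at 1 Hz
    s1, s2, s3 = slopes_db
    gain_db = np.where(
        fclip <= fg,
        s1 * np.log10(fclip / fg),            # 1 Hz .. Fg
        np.where(
            fclip <= fc,
            s2 * np.log10(fclip / fg),        # Fg .. Fc
            s2 * np.log10(fc / fg) + s3 * np.log10(fclip / fc),  # above Fc
        ),
    )
    gain = 10.0 ** (gain_db / 20.0)
    # Odd numtaps gives a symmetric (type I) filter, hence linear phase
    return firwin2(numtaps, f / nyq, gain)


h = post_filter(fg=150.0, fc=3000.0)          # hypothetical Fg and Fc values
```

The excitation would then be filtered with `scipy.signal.lfilter(h, [1.0], excitation)` before being passed to the synthesis filter.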
17
System: Effect of the post-filtering
18
Perceptual evaluation: Generation of the stimuli
• Built a US-English voice from the EM001 data provided by ATR for the Blizzard Challenge
• Glottal parameters were measured in 8 utterances and the mean values were calculated
• Simple excitation, without multi-band noise or phase manipulation
• Ten utterances were synthesized, using the LF-model and the pulse model
19
Perceptual evaluation: Experiment
• Forced-choice test
• Presented through a web browser interface
• Subjects were asked if they used headphones or speakers, and if they were native speakers (U.K./U.S.)
• 18 listeners (7 native speakers of English)
• The listener panel consisted mainly of university students and staff
Examples of test speech signals: pulse and LF-model [audio]
20
Perceptual evaluation: Results
Preference by excitation type:
                         LF-model      Pulse train
Non-native speakers      61%           39%
Native speakers          68.6%         31.4%
Total (with 95% CI)      64% ± 6.7%    36% ± 6.7%
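As a consistency check on these numbers, the sketch below pools the two group scores. It assumes 11 non-native listeners (18 listeners minus the 7 native speakers) and that every listener judged each of the ten utterance pairs once. The normal-approximation interval is likewise an assumption: it gives a half-width of roughly 7 percentage points, close to but not identical to the reported ±6.7%, so the interval in the table was presumably computed with a different procedure.

```python
import math


def pooled_preference(groups):
    """groups: list of (number_of_listeners, preference_fraction)."""
    total = sum(n for n, _ in groups)
    return sum(n * p for n, p in groups) / total


def normal_ci_halfwidth(p, n_trials, z=1.96):
    """Half-width of a normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n_trials)


groups = [(11, 0.61), (7, 0.686)]          # (listeners, LF-model preference)
p_lf = pooled_preference(groups)           # close to the reported 64%
half = normal_ci_halfwidth(p_lf, n_trials=18 * 10)
print(f"LF-model preference: {100 * p_lf:.1f}% +/- {100 * half:.1f}%")
```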
21
Conclusions
• The Nitech-HTS 2005 speech synthesizer was implemented with the LF-model for the voice source
• Results showed that the LF-model can give better speech quality than the traditionally used pulse train
• Direct methods used for the estimation of the mean LF-parameters seemed to perform well
• A technical problem with the integration of the LF-model in the system was solved using a post-filter
22
Future work
• To find better analysis/synthesis methods to use with the LF-model in HMM-based speech synthesis
• To evaluate the speech quality when using the mixed excitation with the LF-model
• To implement voice quality transformations using the LF-model
• To evaluate the parameterization methods
• To model the glottal parameters with HMMs
23
Acknowledgements
This work was financially supported by the Marie Curie EdSST programme.
Thank you!