Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis

João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi

The Centre for Speech Technology Research, The University of Edinburgh

Outline

• Introduction

• Voice source model

• System

• Perceptual evaluation

• Concluding remarks

• Future work

Introduction: HMM-based speech synthesizer [Tokuda et al]

[Figure: block diagram of the HMM-based speech synthesizer. Labels: training speech, F0 extraction, spectral features estimation, HMMs, text, text analysis, F0, spectrum, pulse train, noise component, synthesis filter, synthetic speech.]

Voice source model: Obtaining the glottal source signal

• Source-filter model: Source ($U_g$) → Vocal tract A(z) → Lip radiation (d/dz) → Speech

• Inverse filtering: Speech → Inverse filter 1/A(z) → $\hat{U}_{gd}$ → Lip radiation cancellation (∫) → $\hat{U}_g$
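The lip radiation cancellation step (∫) can be approximated in discrete time by a leaky integrator. The following is a minimal sketch, not the authors' implementation; the function name and the default leak coefficient of 0.99 are assumptions made for illustration.

```python
def cancel_lip_radiation(dgs, leak=0.99):
    """Integrate a derivative glottal signal with a leaky integrator.

    Implements y[n] = x[n] + leak * y[n-1]. A leak slightly below 1
    (an assumed value) keeps low-frequency drift bounded.
    """
    flow = []
    prev = 0.0
    for sample in dgs:
        prev = sample + leak * prev
        flow.append(prev)
    return flow


# With leak=1.0 the filter reduces to a plain running sum.
demo_flow = cancel_lip_radiation([1.0, 1.0, -2.0], leak=1.0)
print(demo_flow)
```

A leak of exactly 1.0 gives ideal integration but lets any DC offset in the inverse-filtered signal accumulate, which is why a value just below 1 is a common choice.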

Voice source model: Liljencrants-Fant model (LF-model)

T : period

to : opening instant

tp : instant of max airflow

te : instant of max excitation

ta : return phase duration

tc : closing instant

Ee : excitation amplitude

Voice source model: Other parameters of the LF-model

Open quotient: $OQ = \frac{t_e + t_a}{T}$

Speed quotient: $SQ = \frac{t_p}{t_e - t_p}$

Return quotient: $RQ = \frac{t_a}{T}$
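As a small worked example, the three quotients can be computed from the LF timing parameters, taking OQ = (te + ta)/T, SQ = tp/(te − tp) and RQ = ta/T with the opening instant to at time zero. The function name and the numeric values below are hypothetical and are not taken from the slides.

```python
def lf_quotients(t_p, t_e, t_a, period):
    """Return (OQ, SQ, RQ) from LF timing parameters (t_o assumed to be 0)."""
    oq = (t_e + t_a) / period   # open quotient
    sq = t_p / (t_e - t_p)      # speed quotient
    rq = t_a / period           # return quotient
    return oq, sq, rq


# Hypothetical values in seconds for a 100 Hz voice (period = 10 ms).
oq, sq, rq = lf_quotients(t_p=0.004, t_e=0.0055, t_a=0.0005, period=0.010)
print(f"OQ={oq:.2f} SQ={sq:.2f} RQ={rq:.2f}")
```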

Voice source model: Description of the LF-model spectrum

Linear stylization of the LF-model spectrum [Doval and d’Alessandro]

Fg : glottal spectral peak

Fc : spectral tilt

Voice source model: Feature extraction

• utterances sampled at 16 kHz

• pitch-synchronous analysis (ESPS tools)

• LPCs calculated with windows of 20 ms duration centered at the glottal epochs

• inverse filtering to estimate DGS

• pre-emphasis filter (α=0.97)

• low-pass filtering of the residual at 4 kHz
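The analysis chain listed above (pre-emphasis, LPC analysis on a 20 ms window, inverse filtering) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the ESPS-based procedure used in the work: the Hamming window, the LPC order of 4, and the synthetic test signal are all hypothetical choices, `a` holds the coefficients of the prediction-error filter, and the final 4 kHz low-pass step is omitted.

```python
import math
import random


def pre_emphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; the first sample is passed through."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def lpc(frame, order):
    """Prediction-error coefficients [1, a1, ..., ap] via Levinson-Durbin."""
    n = len(frame)
    # Hamming window: an assumed choice, the slides do not name the window.
    w = [frame[i] * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
         for i in range(n)]
    r = [sum(w[i] * w[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # updated prediction error power
    return a


def inverse_filter(x, a):
    """Residual e[n] = sum_j a[j] * x[n-j], with zero initial history."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


fs = 16000
frame_len = int(0.020 * fs)                 # 20 ms window = 320 samples
random.seed(0)
speech = [0.0, 0.0]
for _ in range(frame_len):                  # synthetic AR(2) test signal
    speech.append(1.3 * speech[-1] - 0.6 * speech[-2] + random.gauss(0.0, 1.0))
speech = speech[2:]

emphasized = pre_emphasis(speech)
coeffs = lpc(emphasized, order=4)
residual = inverse_filter(emphasized, coeffs)
print(len(residual), [round(c, 3) for c in coeffs])
```

In a pitch-synchronous setup, `lpc` would be called once per glottal epoch on the 20 ms window centered at that epoch, instead of once on a single frame as here.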

Voice source model: Estimation of te and Ee

te and Ee are estimated from the pitch-marks

Voice source model: Estimation of tc, tp and to

$t_p = \arg\max_t U(t)$

$t_c = \arg\min_t U(t)$

$t_o = 2\,\frac{U_{max} - U_{min}}{E_{max}}$

[Gobl & Chasaide]
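Locating tp and tc from the extrema of the estimated glottal flow can be sketched as below for a single period. The function name and the sample values are hypothetical, and restricting the search for the minimum to samples after the peak is an assumption added so that tc follows tp; the estimation of to is not covered here.

```python
def flow_extrema_instants(flow, fs):
    """Return (t_p, t_c, u_max, u_min) for one period of glottal flow.

    t_p is the time of the flow maximum and t_c the time of the flow
    minimum, in seconds from the first sample of the period.
    """
    i_max = max(range(len(flow)), key=lambda i: flow[i])
    # Assumption: only look for the minimum after the peak.
    i_min = min(range(i_max, len(flow)), key=lambda i: flow[i])
    return i_max / fs, i_min / fs, flow[i_max], flow[i_min]


# Hypothetical flow samples for one period at 16 kHz.
period_flow = [0.0, 0.4, 0.9, 1.0, 0.6, 0.1, 0.0, 0.0]
t_p, t_c, u_max, u_min = flow_extrema_instants(period_flow, fs=16000)
print(t_p, t_c, u_max, u_min)
```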

Voice source model: Estimation of ta

$t_a = \frac{E_e}{m \, F_s}$

Fs : sampling frequency

m : slope of the tangent at t=te
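A small numeric sketch of this estimate, reading the relation as ta = Ee / (m · Fs) with m expressed in amplitude units per sample. Approximating the tangent slope by a forward difference is an assumption, and the function name and sample values are hypothetical.

```python
def estimate_ta(dgs, i_e, fs):
    """Estimate t_a = E_e / (m * F_s) from a derivative glottal signal.

    dgs : samples of the derivative glottal signal
    i_e : sample index of t_e (the main negative peak)
    fs  : sampling frequency in Hz
    """
    e_e = abs(dgs[i_e])
    # Slope of the tangent at t_e in amplitude units per sample,
    # approximated by a forward difference (an assumption).
    m = dgs[i_e + 1] - dgs[i_e]
    if m <= 0:
        raise ValueError("signal does not rise after t_e")
    return e_e / (m * fs)


# Hypothetical samples: the negative peak of amplitude 1.0 is at index 2.
t_a = estimate_ta([0.0, -0.5, -1.0, -0.75, -0.5], i_e=2, fs=16000)
print(t_a)
```

Multiplying m by Fs converts the slope from amplitude per sample to amplitude per second, so the result is in seconds.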

Voice source model: Examples of the estimated parameters

[Figure: curves of the LF-parameters for 2 voiced regions of an utterance]

System: General description

- Nitech-HTS 2005 system

- STRAIGHT method for analysis and synthesis

- mixed multi-band excitation with phase manipulation / pulse train

- Mel Log Spectrum Approximation (MLSA) filter

How was the LF-model integrated in the synthesizer?

System: Generation of the periodic excitation (pulse signal)

• Pulse centered within the frame

• multiplied by asymmetric windows

• summed with Gaussian noise

System: Periodic excitation with the LF-model

• 2 LF-waveforms centered at the instant te

• multiplied by asymmetric windows

• summed with Gaussian noise

System: Technical problem

Problem: the synthesis filter assumes that the excitation has a flat spectrum, like the pulse train

Solution: post-filter

Linear phase FIR filter:

-6 dB/dec for 1 Hz ≤ f ≤ Fg

+6 dB/dec for Fg < f ≤ Fc

+12 dB/dec for Fc < f ≤ 16 kHz
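The piecewise magnitude specification of the post-filter can be written as a target gain curve. The sketch below only evaluates that target response and does not design the FIR filter itself. It assumes a 0 dB reference at 1 Hz, continuity at the breakpoints, and Fg < Fc; the function name and the Fg and Fc values in the example are hypothetical.

```python
import math


def postfilter_gain_db(f, fg, fc):
    """Target gain in dB at frequency f (Hz), with 0 dB assumed at 1 Hz."""
    if f < 1.0:
        raise ValueError("the response is specified from 1 Hz upwards")
    gain = -6.0 * math.log10(min(f, fg))            # -6 dB/dec up to Fg
    if f > fg:
        gain += 6.0 * math.log10(min(f, fc) / fg)   # +6 dB/dec from Fg to Fc
    if f > fc:
        gain += 12.0 * math.log10(f / fc)           # +12 dB/dec above Fc
    return gain


# Hypothetical breakpoints: Fg = 100 Hz, Fc = 1000 Hz.
for freq in (1.0, 100.0, 1000.0, 10000.0):
    print(freq, round(postfilter_gain_db(freq, fg=100.0, fc=1000.0), 2))
```

Sampling this curve on a frequency grid would give the desired magnitude response for a linear-phase FIR design, for example by frequency sampling.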

System: Effect of the post-filtering

Perceptual evaluation: Generation of the stimuli

• Built US-English voice EM001 provided by ATR for the Blizzard Challenge

• Glottal parameters were measured in 8 utterances and the mean values were calculated

• Simple excitation, without multi-band noise or phase manipulation

• Ten utterances were synthesized, using the LF-model and the pulse model

Perceptual evaluation: Experiment

• Forced-choice test

• Presented via a web browser interface

• Subjects were asked if they used headphones or speakers, and if they were native speakers (U.K./U.S.)

• 18 listeners (7 native speakers of English)

• The listener panel consisted mainly of university students and staff

Example of test speech signals: Pulse, LF-model

Perceptual evaluation: Results

Preference by excitation type:

Listeners              LF-model       Pulse train
Non-native speakers    61%            39%
Native speakers        68.6%          31.4%
Total (95% CI)         64% ± 6.7%     36% ± 6.7%
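The total score is consistent with a listener-weighted average of the two groups (11 non-native and 7 native listeners), assuming every listener judged the same number of pairs. The sketch below checks that arithmetic and adds a helper for a normal-approximation confidence interval. The slides do not say how the interval was computed or how many comparisons were pooled, so the helper is an assumed method and takes the trial count as a parameter.

```python
import math


def pooled_preference(groups):
    """Listener-weighted mean preference from (n_listeners, preference) pairs."""
    total = sum(n for n, _ in groups)
    return sum(n * p for n, p in groups) / total


def normal_ci_halfwidth(p, n_trials, z=1.96):
    """Half-width of a normal-approximation 95% CI for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n_trials)


# 18 listeners in total, 7 native, hence 11 non-native.
p_lf = pooled_preference([(11, 0.61), (7, 0.686)])
print(f"pooled LF-model preference: {p_lf:.3f}")
```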

Conclusions

• The Nitech-HTS 2005 speech synthesizer was implemented with the LF-model for the voice source

• Results showed that the LF-model can give better speech quality than the traditionally used pulse train

• Direct methods used for the estimation of the mean LF-parameters seemed to perform well

• A technical problem with the integration of the LF-model in the system was solved using a post-filter

Future work

• To find better analysis/synthesis methods to use with the LF-model in HMM-based speech synthesis

• To evaluate the speech quality when using the mixed excitation with the LF-model

• To implement voice quality transformations using the LF-model

• To evaluate the parameterization methods

• To model the glottal parameters with HMMs

Acknowledgements

This work was financially supported by the Marie Curie EdSST programme.

Thank you!
