Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi The Centre for Speech Technology Research The University of Edinburgh

Page 1: Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis

Towards an Improved Modeling of the Glottal Source in Statistical Parametric Speech Synthesis

João P. Cabral, Steve Renals, Korin Richmond and Junichi Yamagishi

The Centre for Speech Technology Research, The University of Edinburgh

Page 2

Outline

• Introduction
• Voice source model
• System
• Perceptual evaluation
• Concluding remarks
• Future work

Page 3

Introduction: HMM-based speech synthesizer [Tokuda et al.]

[Figure: block diagram of the HMM-based speech synthesizer. Block labels: training speech, F0 extraction, spectral features estimation, HMMs, text, text analysis, F0, spectrum, pulse train + noise component, synthesis filter, synthetic speech.]

Page 4

Voice source model: Obtaining the glottal source signal

• Source-filter model: Source ($U_g$) → Vocal tract ($A(z)$) → Lip radiation ($d/dz$) → Speech

• Inverse filtering: Speech → Inverse filter ($1/A(z)$) → $d\hat{U}_g$ → Lip radiation cancellation ($\int$) → $\hat{U}_g$

Page 5

Voice source model: Liljencrants-Fant model (LF-model)

T : period

to : opening instant

tp : instant of max airflow

te : instant of max excitation

ta : return phase duration

tc : closing instant

Ee : excitation amplitude
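The parameters above are enough to generate one period of the LF-model waveform. The sketch below uses the LF equations as they are commonly stated in the literature (an exponentially growing sinusoid up to te, followed by an exponential return phase), not code from the slides. The function name, the numeric solvers and the example values are assumptions made for illustration, and the opening instant is taken as to = 0.

```python
import math

def lf_waveform(tp, te, ta, tc, Ee, T, fs):
    """Sample one period of the LF-model (glottal flow derivative), with to = 0.

    Assumes tp < te < 2 * tp, so that sin(pi * te / tp) is negative.
    """
    wg = math.pi / tp          # angular frequency of the open-phase sinusoid
    tb = tc - te               # time available to the return phase

    # Return-phase constant: solve eps * ta = 1 - exp(-eps * tb) by fixed-point iteration.
    eps = 1.0 / ta
    for _ in range(100):
        eps = (1.0 - math.exp(-eps * tb)) / ta

    # Area under the return phase (closed form); it is negative.
    ret_area = -(Ee / (eps * ta)) * (
        (1.0 - math.exp(-eps * tb)) / eps - tb * math.exp(-eps * tb)
    )

    def open_area(alpha):
        # Area under E0 * exp(alpha * t) * sin(wg * t) on [0, te],
        # with E0 chosen so that the waveform equals -Ee at te.
        e0 = -Ee / (math.exp(alpha * te) * math.sin(wg * te))
        integral = (
            math.exp(alpha * te)
            * (alpha * math.sin(wg * te) - wg * math.cos(wg * te))
            + wg
        ) / (alpha ** 2 + wg ** 2)
        return e0 * integral

    # Growth factor alpha: bisection on the zero-net-area condition.
    lo, hi = -20000.0, 20000.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if open_area(mid) + ret_area > 0.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    e0 = -Ee / (math.exp(alpha * te) * math.sin(wg * te))

    wave = []
    for n in range(int(round(T * fs))):
        t = n / fs
        if t <= te:
            wave.append(e0 * math.exp(alpha * t) * math.sin(wg * t))
        elif t <= tc:
            wave.append(
                -(Ee / (eps * ta)) * (math.exp(-eps * (t - te)) - math.exp(-eps * tb))
            )
        else:
            wave.append(0.0)
    return wave

# Illustrative values only: a 100 Hz period sampled at 16 kHz.
wave = lf_waveform(tp=0.0045, te=0.006, ta=0.0003, tc=0.01, Ee=1.0, T=0.01, fs=16000)
```

The bisection bounds on alpha are arbitrary and wide enough for the example values; other parameter ranges may need different bounds.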

Page 6

Voice source model: Other parameters of the LF-model

Open quotient: $OQ = \frac{t_e + t_a}{T}$

Speed quotient: $SQ = \frac{t_p}{t_e - t_p}$

Return quotient: $RQ = \frac{t_a}{T}$
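As a small worked example, the three quotients can be computed directly from the timing parameters. The function name and the numeric values are illustrative assumptions, and the opening instant is taken as to = 0.

```python
def lf_quotients(tp, te, ta, T):
    """Return (OQ, SQ, RQ) from LF timing parameters given in seconds."""
    oq = (te + ta) / T      # open quotient
    sq = tp / (te - tp)     # speed quotient
    rq = ta / T             # return quotient
    return oq, sq, rq

# Illustrative values for a 10 ms period (F0 = 100 Hz).
oq, sq, rq = lf_quotients(tp=0.0045, te=0.006, ta=0.0003, T=0.01)
print(round(oq, 3), round(sq, 3), round(rq, 3))
```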

Page 7

Voice source model: Description of the LF-model spectrum

Linear stylization of the LF-model spectrum [Doval and d’Alessandro]

Fg : glottal spectral peak

Fc : spectral tilt

Page 8

Voice source model: Feature extraction

• utterances sampled at 16 kHz

• pitch-synchronous analysis (ESPS tools)

• LPCs calculated with 20 ms windows centered at the glottal epochs

• inverse filtering to estimate DGS

• pre-emphasis filter (α=0.97)

• low-pass filtering of the residual at 4 kHz
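The core of the steps listed above can be sketched as follows. This is a minimal illustration, not the ESPS-based procedure used by the authors: the function names are hypothetical, the LPC coefficients come from the autocorrelation method with the Levinson-Durbin recursion, and the pitch-synchronous windowing and the 4 kHz low-pass filter are left out. Applying pre-emphasis before LPC analysis is a common choice and is assumed here; the slides do not state the order.

```python
def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def lpc(frame, order):
    """LPC by the autocorrelation method (Levinson-Durbin).

    Returns a[1..order] such that x[n] is predicted by sum_k a[k] * x[n-k].
    """
    r = [
        sum(frame[n] * frame[n + k] for n in range(len(frame) - k))
        for k in range(order + 1)
    ]
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k                 # updated prediction error
    return a[1:]

def inverse_filter(x, coeffs):
    """Prediction residual e[n] = x[n] - sum_k a[k] * x[n-k]."""
    out = []
    for n in range(len(x)):
        pred = sum(c * x[n - k - 1] for k, c in enumerate(coeffs) if n - k - 1 >= 0)
        out.append(x[n] - pred)
    return out
```

With a 16 kHz signal, a 20 ms analysis window corresponds to 320 samples; the residual returned by `inverse_filter` plays the role of the estimated source signal for that frame.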

Page 9

Voice source model: Estimation of te and Ee

te and Ee are estimated from the pitch-marks

Page 10

Voice source model: Estimation of tc, tp and to

$t_o = 2\,\frac{U_{\max} - U_{\min}}{E_{\max}}$

$t_c$: instant of $U_{\min}$

$t_p$: instant of $U_{\max}$

[Gobl & Chasaide]

Page 11

Voice source model: Estimation of ta

$t_a = \frac{E_e}{m F_s}$

Fs : sampling frequency

m : slope of the tangent at t=te
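A minimal sketch of this estimate is given below. It assumes that m is measured in amplitude units per sample (so that m·Fs is a slope per second) and approximates the tangent at te with a forward difference; both the function name and the difference scheme are assumptions, not details taken from the slides.

```python
def estimate_ta(dgs, te_index, fs):
    """Estimate the return phase duration ta (seconds) as Ee / (m * Fs).

    dgs      : samples of the estimated glottal source derivative
    te_index : sample index of the instant of maximum excitation te
    fs       : sampling frequency in Hz
    """
    Ee = abs(dgs[te_index])                  # excitation amplitude
    m = dgs[te_index + 1] - dgs[te_index]    # slope per sample just after te
    return Ee / (m * fs)

# Toy return phase rising linearly from -1.0 by 0.2 per sample at 16 kHz.
ta = estimate_ta([0.0, -1.0, -0.8, -0.6, -0.4], te_index=1, fs=16000)
```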

Page 12

Voice source model: Examples of the estimated parameters

[Figure: curves of the LF-parameters for 2 voiced regions of an utterance]

Page 13

System: General description

- Nitech-HTS 2005 system

- STRAIGHT method for analysis and synthesis

- mixed multi-band excitation with phase manipulation / pulse train

- Mel Log Spectrum Approximation (MLSA) filter

How was the LF-model integrated in the synthesizer?

Page 14

System: Generation of the periodic excitation (pulse signal)

• Pulse centered within the frame

• multiplied by asymmetric windows

• summed with Gaussian noise

Page 15

System: Periodic excitation with the LF-model

• 2 LF-waveforms centered at the instant te

• multiplied by asymmetric windows

• summed with Gaussian noise

Page 16

System: Technical problem

Problem: the synthesis filter assumes the excitation to have a flat spectrum like the pulse train

Solution: Post-filter

Linear phase FIR filter:

-6dB/dec 1Hz ≤ f ≤ Fg (Hz)

+6dB/dec Fg < f ≤ Fc (Hz)

+12dB/dec Fc < f ≤ 16 kHz
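The target magnitude response of the post-filter can be written as a piecewise-linear function of log-frequency. The sketch below only evaluates that target gain using the slopes given above; it does not design the linear phase FIR filter itself. The 0 dB reference at 1 Hz, the function name and the example values of Fg and Fc are assumptions.

```python
import math

def postfilter_gain_db(f, fg, fc):
    """Target post-filter gain in dB at frequency f (Hz, f >= 1).

    Slopes: -6 dB/dec up to fg, +6 dB/dec from fg to fc, +12 dB/dec above fc.
    The gain is taken to be 0 dB at 1 Hz (an assumed reference point).
    """
    def segment(f_lo, f_hi, slope_db_per_dec):
        return slope_db_per_dec * math.log10(f_hi / f_lo)

    if f <= fg:
        return segment(1.0, f, -6.0)
    gain = segment(1.0, fg, -6.0)
    if f <= fc:
        return gain + segment(fg, f, 6.0)
    return gain + segment(fg, fc, 6.0) + segment(fc, f, 12.0)

# Illustrative corner frequencies: Fg = 100 Hz, Fc = 1000 Hz.
for f in (100.0, 1000.0, 8000.0):
    print(f, round(postfilter_gain_db(f, fg=100.0, fc=1000.0), 2))
```

Sampling this curve on a frequency grid would give a magnitude specification that a frequency-sampling FIR design could approximate.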

Page 17

System: Effect of the post-filtering

Page 18

Perceptual evaluation: Generation of the stimuli

• Built the US-English voice EM001, provided by ATR for the Blizzard Challenge

• Glottal parameters were measured in 8 utterances and the mean values were calculated

• Simple excitation, without multi-band noise or phase manipulation

• Ten utterances were synthesized, using the LF-model and the pulse model

Page 19

Perceptual evaluation: Experiment

• Forced-choice test

• Presented via a web browser interface

• Subjects were asked if they used headphones or speakers, and if they were native speakers (U.K./U.S.)

• 18 listeners (7 native speakers of English)

• The listener panel was mainly university students and staff

[Audio examples of test speech signals: pulse, LF-model]

Page 20

Perceptual evaluation: Results

Listeners                  LF-model excitation    Pulse train excitation
Non-native speakers        61%                    39%
Native speakers            68.6%                  31.4%
Total scores and 95% CI    64% ± 6.7%             36% ± 6.7%
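The total score is consistent with pooling the two listener groups by size (11 non-native and 7 native listeners). The sketch below checks that arithmetic and also computes a normal-approximation 95% half-width. The trial count of 180 (18 listeners, each assumed to judge 10 utterance pairs) is not stated in the slides, and the authors may have computed their interval differently, so the half-width is not expected to match the reported ±6.7% exactly.

```python
import math

def pooled_preference(groups):
    """Weighted mean preference over (group_size, proportion) pairs."""
    total = sum(n for n, _ in groups)
    return sum(n * p for n, p in groups) / total

def normal_ci_halfwidth(p, n_trials, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n_trials)

# 11 non-native listeners at 61% and 7 native listeners at 68.6% for the LF-model.
p_total = pooled_preference([(11, 0.61), (7, 0.686)])
half_width = normal_ci_halfwidth(0.64, n_trials=180)  # 180 trials is an assumption
print(round(p_total, 3), round(half_width, 3))
```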

Page 21

Conclusions

• The Nitech-HTS 2005 speech synthesizer was implemented with the LF-model for the voice source

• Results showed that the LF-model can give better speech quality than the traditionally used pulse train

• The direct methods used for the estimation of the mean LF-parameters seemed to perform well

• A technical problem with the integration of the LF-model in the system was solved using a post-filter

Page 22

Future work

• To find better analysis/synthesis methods to use with the LF-model in HMM-based speech synthesis

• To evaluate the speech quality when using the mixed excitation with the LF-model

• To implement voice quality transformations using the LF-model

• To evaluate the parameterization methods

• To model the glottal parameters with HMMs

Page 23

Acknowledgements

This work was financially supported by the Marie Curie EdSST programme.

Thank you!