
Page 1: High-Quality Statistical Parametric Speech Synthesis Using Generative Adversarial Networks · sython.org › papers › thesis › saito17_MS_presen.pdf · 2019-06-06

©Yuki Saito, 2018/01/25

High-Quality Statistical Parametric Speech Synthesis Using

Generative Adversarial Networks

48-166605 Yuki Saito

Supervisor: Prof. Hiroshi Saruwatari

Research Field: Speech Synthesis

Speech Synthesis

– Technique for synthesizing speech using a computer

Applications

– Speech communication assistance (e.g., speech translation)

– Entertainment (e.g., singing voice conversion)

DNN-based speech synthesis* [Zen et al., 2013]

– High flexibility but low speech quality

[Figure: Text-To-Speech (TTS) converts text ("Hello") into speech [Sagisaka et al., 1988]; Voice Conversion (VC) converts input speech into output speech [Stylianou et al., 1988].]

*DNN: Deep Neural Network

Thesis Overview

[Figure: map of methods. Horizontal axis: smoothness of speech parameters (overly smoothed → naturally smoothed). Vertical axis: speech parameterization unit (vocoder-derived parameters, STFT* spectra, waveform). Goal: natural speech.]

Chapter 2: DNNs w/ vocoders [Zen et al., 2013], DNNs w/o vocoders [Takaki et al., 2017], Global Variance (GV) [Toda et al., 2007]

Chapter 3 (proposed) and Chapter 4 (proposed): Generative Adversarial Nets (GANs)

*STFT: Short-Term Fourier Transform


Table of Contents

Chapter 1. Introduction

Chapter 2. Speech Synthesis Using DNNs

Chapter 3. Speech Synthesis Using GANs w/ Vocoders

Chapter 4. Speech Synthesis Using GANs w/o Vocoders

Chapter 5. Conclusion

Speech Analysis and Parameter Extraction

[Figure: a speech signal is parameterized in one of two ways.]

– STFT analysis → STFT amplitude spectra (e.g., 513 dim.) → speech synthesis w/o vocoders (Sec. 2.3)

– Vocoder-based parameterization [Kawahara et al., 1999] → vocoder-derived speech params. (e.g., 32 dim.): mel-cepstral coefficients (timbre) and prosodic feats. (pitch, hoarseness) → speech synthesis w/ vocoders (Sec. 2.2)
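As a rough illustration of the STFT analysis branch, the sketch below computes the amplitude spectrum of one windowed frame with a naive DFT. The slides do not state the analysis settings; an FFT length of 1024 is only an assumption that happens to give the 513 bins quoted above, and the Hann window and the function names are hypothetical choices.

```python
import cmath
import math

def num_bins(n_fft):
    """Number of non-negative frequency bins for a real frame of length n_fft."""
    return n_fft // 2 + 1

def amplitude_spectrum(frame):
    """Amplitude spectrum of one frame via a naive DFT (Hann window applied)."""
    n = len(frame)
    # A Hann window reduces spectral leakage at the frame edges.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(num_bins(n)):
        acc = sum(w * cmath.exp(-2j * math.pi * k * i / n)
                  for i, w in enumerate(windowed))
        spectrum.append(abs(acc))
    return spectrum

# Toy frame: a sinusoid that falls exactly on bin 3 of a 16-point DFT.
frame = [math.sin(2 * math.pi * 3 * i / 16) for i in range(16)]
amps = amplitude_spectrum(frame)
peak_bin = max(range(len(amps)), key=lambda k: amps[k])
```

A practical implementation would use an FFT routine and slide the window over the whole signal; the naive DFT is used here only to keep the example dependency-free.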

DNN-based Speech Synthesis w/ Vocoders [Zen et al., 2013]

[Figure: acoustic models (DNN) map the linguistic feature sequence $\boldsymbol{x} = [\boldsymbol{x}_1, \cdots, \boldsymbol{x}_T]$ to the speech parameter sequence $\boldsymbol{y} = [\boldsymbol{y}_1, \cdots, \boldsymbol{y}_T]$.]

– Linguistic feats. (0/1-coded in the figure): phoneme (e.g., a, i, u), accent (e.g., 1, 2, 3), frame position, etc.

– Speech params.: mel-cepstral coefficients and prosodic feats.
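To make the input side concrete, here is a minimal sketch of how one frame-level linguistic feature vector might be assembled from a one-hot phoneme code, a one-hot accent code, and a relative frame position. The toy inventories, the position encoding, and the function names are assumptions for illustration; the actual feature set used in the thesis is not given on the slide.

```python
PHONEMES = ["a", "i", "u"]  # toy phoneme inventory (assumption)
ACCENTS = [1, 2, 3]         # toy accent categories (assumption)

def one_hot(value, inventory):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1.0 if item == value else 0.0 for item in inventory]

def linguistic_features(phoneme, accent, frame_index, num_frames):
    """Concatenate one-hot phoneme, one-hot accent, and relative frame position."""
    position = frame_index / max(num_frames - 1, 1)
    return one_hot(phoneme, PHONEMES) + one_hot(accent, ACCENTS) + [position]

# Frame 2 of 5 inside phoneme "i" with accent category 1.
x_t = linguistic_features("i", 1, frame_index=2, num_frames=5)
```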

DNN-based Speech Synthesis w/o Vocoders [Takaki et al., 2017]

[Figure: acoustic models (DNN) map the input sequence $\boldsymbol{x} = [\boldsymbol{x}_1, \cdots, \boldsymbol{x}_T]$ to the output sequence $\boldsymbol{y} = [\boldsymbol{y}_1, \cdots, \boldsymbol{y}_T]$.]

– Input: linguistic feats. (phoneme, accent, frame position, etc.) + F0 (prosodic feats.)

– Output: amplitude spectra

Minimum Generation Error (MGE) Training Algorithm [Wu et al., 2016]

[Figure: acoustic models map linguistic feats. $\boldsymbol{x}$ to $\hat{\boldsymbol{Y}}$; MLPG* converts $\hat{\boldsymbol{Y}}$ into generated speech params. $\hat{\boldsymbol{y}}$, which are compared with natural speech params. $\boldsymbol{y}$ through $L_{\mathrm{MGE}}(\boldsymbol{y}, \hat{\boldsymbol{y}})$.]

$L_{\mathrm{MGE}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = \frac{1}{T} (\hat{\boldsymbol{y}} - \boldsymbol{y})^{\top} (\hat{\boldsymbol{y}} - \boldsymbol{y}) \rightarrow \text{Minimize}$

*MLPG: Maximum Likelihood Parameter Generation [Tokuda et al., 2000]
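The MGE loss itself is a squared error summed over all frames and dimensions and divided by the number of frames T. The sketch below assumes the generated parameters are already available as a list of frames, so the MLPG step is omitted; the function name is hypothetical.

```python
def mge_loss(y_nat, y_gen):
    """(1/T) * (y_gen - y_nat)^T (y_gen - y_nat), with sequences given as lists of frames."""
    num_frames = len(y_nat)
    total = 0.0
    for nat_frame, gen_frame in zip(y_nat, y_gen):
        # Squared error accumulated over the dimensions of one frame.
        total += sum((g - n) ** 2 for n, g in zip(nat_frame, gen_frame))
    return total / num_frames

# Two frames of two-dimensional toy parameters.
y_nat = [[1.0, 2.0], [3.0, 4.0]]
y_gen = [[1.0, 1.0], [2.0, 4.0]]
loss = mge_loss(y_nat, y_gen)  # squared errors 0, 1, 1, 0 -> 2 / 2 frames
```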

Issue of DNN-based Speech Synthesis: Over-smoothing of Generated Speech Parameters

[Figure: scatter plots of the 21st mel-cepstral coefficient (horizontal axis) against the 23rd mel-cepstral coefficient (vertical axis) for Natural and MGE. The MGE distribution is narrow.]

These distributions are significantly different... (GV [Toda et al., 2007] explicitly compensates for the 2nd moment.)
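The narrowing can be reproduced on a toy trajectory: smoothing a parameter sequence over time lowers its variance, which is the second-moment statistic that GV targets. In the sketch below a moving average stands in for over-smoothing; it is only an assumption for illustration, not the actual effect of MGE training, and all values are made up.

```python
from statistics import pvariance

def moving_average(seq, width=3):
    """Smooth a trajectory with a centered moving average (shorter windows at the edges)."""
    half = width // 2
    out = []
    for t in range(len(seq)):
        window = seq[max(0, t - half): t + half + 1]
        out.append(sum(window) / len(window))
    return out

# Toy trajectory of one speech parameter over eight frames.
natural = [0.0, 1.0, -1.0, 2.0, -2.0, 1.0, 0.0, -1.0]
smoothed = moving_average(natural)

gv_natural = pvariance(natural)    # variance over time of the original trajectory
gv_smoothed = pvariance(smoothed)  # smaller: the smoothed trajectory is "narrow"
```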


Table of Contents

Chapter 1. Introduction

Chapter 2. Speech Synthesis Using DNNs

Chapter 3. Speech Synthesis Using GANs w/ Vocoders

Chapter 4. Speech Synthesis Using GANs w/o Vocoders

Chapter 5. Conclusion

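Generative Adversarial Nets (GANs) [Goodfellow et al., 2014]

[Figure: the generator maps a latent variable $\boldsymbol{z}$, drawn from a prior (e.g., $N(\boldsymbol{0}, \boldsymbol{I})$), to a fake sample $\hat{\boldsymbol{y}}$. The discriminator $D(\cdot)$ receives a true sample $\boldsymbol{y}$ or a fake sample $\hat{\boldsymbol{y}}$ and outputs 1: true or 0: fake.]

$L_{\mathrm{D}}^{\mathrm{GAN}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = L_{\mathrm{D},1}^{\mathrm{GAN}}(\boldsymbol{y}) + L_{\mathrm{D},0}^{\mathrm{GAN}}(\hat{\boldsymbol{y}}) \rightarrow \text{Minimize}$

– $L_{\mathrm{D},1}^{\mathrm{GAN}}(\boldsymbol{y})$: loss to recognize true sample as true

– $L_{\mathrm{D},0}^{\mathrm{GAN}}(\hat{\boldsymbol{y}})$: loss to recognize fake sample as fake

$L_{\mathrm{D}}^{\mathrm{GAN}}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ is equivalent to the cross-entropy function.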
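Written as cross-entropy, the discriminator loss penalizes low outputs on true samples and high outputs on fake samples. The sketch below assumes the discriminator outputs are probabilities in (0, 1) and averages over samples; the function name and the numbers are hypothetical.

```python
import math

def discriminator_loss(d_true, d_fake):
    """Cross-entropy loss with true samples labeled 1 and fake samples labeled 0."""
    # Loss to recognize true samples as true: -mean log D(y).
    loss_true = -sum(math.log(p) for p in d_true) / len(d_true)
    # Loss to recognize fake samples as fake: -mean log(1 - D(y_hat)).
    loss_fake = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return loss_true + loss_fake

undecided = discriminator_loss([0.5, 0.5], [0.5, 0.5])  # discriminator cannot tell
confident = discriminator_loss([0.9, 0.8], [0.1, 0.2])  # discriminator mostly right
```

A discriminator that outputs 0.5 everywhere scores 2 ln 2, and the loss falls as its decisions become more accurate.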

Generative Adversarial Nets (GANs) [Goodfellow et al., 2014]

[Figure: the generator maps a latent variable $\boldsymbol{z}$, drawn from a prior (e.g., $N(\boldsymbol{0}, \boldsymbol{I})$), to a fake sample $\hat{\boldsymbol{y}}$, and is trained so that the discriminator $D(\cdot)$ outputs 1: true for $\hat{\boldsymbol{y}}$.]

$L_{\mathrm{ADV}}^{\mathrm{GAN}}(\hat{\boldsymbol{y}}) = L_{\mathrm{D},1}^{\mathrm{GAN}}(\hat{\boldsymbol{y}}) \rightarrow \text{Minimize}$

– $L_{\mathrm{ADV}}^{\mathrm{GAN}}(\hat{\boldsymbol{y}})$: loss to recognize fake sample as true

Minimize approx. JS* divergence betw. dists. of $\boldsymbol{y}$ and $\hat{\boldsymbol{y}}$.

*JS: Jensen-Shannon
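The generator-side loss reuses the "recognize as true" term, now evaluated on fake samples. A minimal sketch, again assuming discriminator outputs in (0, 1] and a hypothetical function name:

```python
import math

def adversarial_loss(d_fake):
    """Loss to recognize fake samples as true: -mean log D(y_hat)."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

fooled = adversarial_loss([0.9, 0.95])  # discriminator believes the fakes
caught = adversarial_loss([0.1, 0.05])  # discriminator rejects the fakes
```

The loss is small when the discriminator is fooled, so minimizing it pushes generated samples toward the region the discriminator regards as true.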

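Proposed Method: Acoustic Model Training Using GANs

[Figure: acoustic models (generator) map linguistic feats. $\boldsymbol{x}$ to $\hat{\boldsymbol{Y}}$, and MLPG yields generated params. $\hat{\boldsymbol{y}}$. $L_{\mathrm{MGE}}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ compares $\hat{\boldsymbol{y}}$ with natural params. $\boldsymbol{y}$. The discriminator $D(\cdot)$ scores $\hat{\boldsymbol{y}}$ (1: natural), giving $L_{\mathrm{ADV}}^{\mathrm{GAN}}(\hat{\boldsymbol{y}})$, the loss to recognize generated params. as natural.]

$L_{\mathrm{G}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = L_{\mathrm{MGE}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) + \omega_{\mathrm{D}} \frac{E_{L_{\mathrm{MGE}}}}{E_{L_{\mathrm{ADV}}}} L_{\mathrm{ADV}}^{\mathrm{GAN}}(\hat{\boldsymbol{y}}) \rightarrow \text{Minimize}$

$\omega_{\mathrm{D}}$: weight, $E_{L_{*}}$: expectation values of $L_{*}$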
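The combined objective can be sketched as a plain function of the two loss values. The ratio of expectation values rescales the adversarial term to the magnitude of the MGE term before the weight is applied. How the expectations are estimated is not stated on the slide, so they are passed in as arguments; the function name and numbers are hypothetical.

```python
def generator_loss(l_mge, l_adv, e_mge, e_adv, omega_d=1.0):
    """L_G = L_MGE + omega_D * (E_MGE / E_ADV) * L_ADV."""
    scale = e_mge / e_adv  # brings the adversarial loss to the scale of the MGE loss
    return l_mge + omega_d * scale * l_adv

# Toy values: the adversarial loss is ten times larger than the MGE loss.
combined = generator_loss(l_mge=0.2, l_adv=2.0, e_mge=0.2, e_adv=2.0, omega_d=1.0)
mge_only = generator_loss(l_mge=0.2, l_adv=2.0, e_mge=0.2, e_adv=2.0, omega_d=0.0)
```

With a weight of 0 the objective reduces to plain MGE training; with a weight of 1 both terms contribute equally on average.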

Distributions of Speech Parameters

[Figure: scatter plots of the 21st mel-cepstral coefficient (horizontal axis) against the 23rd mel-cepstral coefficient (vertical axis) for Natural, MGE, and Proposed. MGE: narrow. Proposed: as wide as natural speech.]

The proposed algorithm alleviates the over-smoothing effect!

GANs = minimizing divergence betw. two distributions


Discussions

Compensating for distribution differences

– The proposed method generalizes the conventional methods such as the GV.

Integrating voice anti-spoofing techniques

– Features that are effective for detecting synthetic speech can be used (Sec. 3.4.8).

Changing the divergence to be minimized

– Earth mover’s distance (Wasserstein GAN [Arjovsky et al., 2017]) was the best for improving synthetic speech quality (Sec. 3.4.10).

Applying to various speech synthesis tasks

– Not only TTS (next slides), but also VC (Sec. 3.5).
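For intuition about the earth mover's distance mentioned above: between two equal-size one-dimensional samples it equals the mean absolute difference of the sorted values. The sketch below computes only that closed form. It is not the Wasserstein GAN training procedure, which approximates the distance with a learned critic, and the function name is hypothetical.

```python
def earth_movers_distance_1d(a, b):
    """Earth mover's distance between two equal-size 1-D samples."""
    if len(a) != len(b):
        raise ValueError("samples must have the same size")
    # Optimal transport in 1-D matches the i-th smallest of a to the i-th smallest of b.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

shifted = earth_movers_distance_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
identical = earth_movers_distance_1d([0.0, 1.0, 2.0], [2.0, 0.0, 1.0])
```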

Experimental Conditions

Train / evaluate data: 450 sentences / 53 sentences (16 kHz sampling)

Linguistic feats.: 442-dimensional vector

Speech params.: Mel-cepstral coefficients and prosodic features

Optimizer: AdaGrad [Duchi et al., 2011]

Acoustic models: Feed-Forward 442 – 3x512 (ReLU) – 94 (linear)

Discriminator: Feed-Forward 26 – 3x256 (ReLU) – 1 (sigmoid)

Weight $\omega_{\mathrm{D}}$: 1.0 (Secs. 3.4.2 and 3.4.4)

Methods: MGE [Wu et al., 2016], GV [Toda et al., 2007], Proposed
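The two feed-forward architectures in the table can be sketched as plain forward passes. Only the layer sizes and activations come from the table; the weights below are random and untrained, the initialization is an assumption, and the helper names are hypothetical.

```python
import math
import random

def init_layers(sizes, rng):
    """Create (weights, biases) pairs with small random weights for consecutive layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def forward(x, layers, output="linear"):
    """Feed-forward pass: ReLU on hidden layers, linear or sigmoid on the output layer."""
    h = x
    for i, (weights, biases) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            h = [max(0.0, v) for v in h]                 # ReLU
        elif output == "sigmoid":
            h = [1.0 / (1.0 + math.exp(-v)) for v in h]  # probability of "natural"
    return h

rng = random.Random(0)
acoustic_model = init_layers([442, 512, 512, 512, 94], rng)  # 442 - 3x512 - 94
discriminator = init_layers([26, 256, 256, 256, 1], rng)     # 26 - 3x256 - 1

x_t = [rng.random() for _ in range(442)]  # one frame of linguistic feats. (random stand-in)
y_t = forward(x_t, acoustic_model)        # 94-dimensional output for that frame
d_in = [rng.gauss(0.0, 1.0) for _ in range(26)]
d_out = forward(d_in, discriminator, output="sigmoid")
```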

Subjective Evaluations in Terms of Speech Quality

Preference AB test (select the better-sounding speech)

[Figure: preference scores (horizontal axis, 0.0 to 1.0) for (a) MGE vs. Proposed and (b) GV vs. Proposed. Both panels are annotated "Improved". Error bars denote 95% confidence intervals.]

Proposed method improves synthetic speech quality!
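A preference score and its 95% confidence interval can be computed from the raw vote counts of an AB test. The slides do not say how the intervals were obtained, so the normal approximation to the binomial used below is an assumption, and the counts are made up.

```python
import math

def preference_ci(wins, total, z=1.96):
    """Preference score with a normal-approximation 95% confidence interval."""
    p = wins / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, p - half_width, p + half_width

# Hypothetical result: 60 of 100 listeners' choices prefer one method.
score, low, high = preference_ci(60, 100)
```

When the lower bound stays above 0.5, as in this made-up example, the preference differs from chance at the 5% level under the approximation.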


Table of Contents

Chapter 1. Introduction

Chapter 2. Speech Synthesis Using DNNs

Chapter 3. Speech Synthesis Using GANs w/ Vocoders

Chapter 4. Speech Synthesis Using GANs w/o Vocoders

Chapter 5. Conclusion


Issue in Speech Synthesis w/o Vocoders

Over-smoothing of generated STFT amplitude spectra

– Formants (spectral peaks) tend to be weakened.

– The method proposed in Chap. 3 cannot be applied directly.

• Difficulties in modeling a highly complex distribution

To deal with the issue...

– Dimensionality reduction retaining spectral structures

[Figure: natural amplitudes plotted over time (horizontal axis) and frequency (vertical axis, e.g., 513 bins).]

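Acoustic Model Training Using Low-resolution GANs

[Figure: acoustic models map $\boldsymbol{x}$ (linguistic feats. + F0) to generated $\hat{\boldsymbol{y}}$, which $L_{\mathrm{MGE}}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ compares with natural $\boldsymbol{y}$. Average pooling $\boldsymbol{\phi}(\cdot)$ converts $\boldsymbol{y}$ and $\hat{\boldsymbol{y}}$ into low-resolution $\boldsymbol{y}^{(\mathrm{L})}$ and $\hat{\boldsymbol{y}}^{(\mathrm{L})}$. The low-resolution discriminator $D^{(\mathrm{L})}(\cdot)$ scores $\hat{\boldsymbol{y}}^{(\mathrm{L})}$ (1: natural), giving $L_{\mathrm{ADV}}^{\mathrm{GAN}}(\hat{\boldsymbol{y}}^{(\mathrm{L})})$.]

$L_{\mathrm{G}}^{(\mathrm{L})}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = L_{\mathrm{MGE}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) + \omega_{\mathrm{D}}^{(\mathrm{L})} \frac{E_{L_{\mathrm{MGE}}}}{E_{L_{\mathrm{ADV}}}} L_{\mathrm{ADV}}^{\mathrm{GAN}}(\hat{\boldsymbol{y}}^{(\mathrm{L})}) \rightarrow \text{Minimize}$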
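Average pooling reduces the dimensionality of an amplitude spectrum while keeping its coarse shape. The slide does not give the pooling axes, window, or stride, so the sketch below assumes non-overlapping windows along the frequency axis of a single frame; the function name and the width of 8 are hypothetical.

```python
def average_pool(spectrum, width):
    """Non-overlapping average pooling along the frequency axis of one frame."""
    pooled = []
    for start in range(0, len(spectrum), width):
        window = spectrum[start:start + width]  # the last window may be shorter
        pooled.append(sum(window) / len(window))
    return pooled

low_res = average_pool([1.0, 3.0, 5.0, 7.0], width=2)
low_res_frame = average_pool([0.0] * 513, width=8)  # 513 bins -> 65 pooled values
```

Under these assumptions the discriminator would see 65 values per frame instead of 513, which is the kind of reduction the low-resolution discriminator relies on.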

Examples of Natural and Generated Amplitude Spectra

[Figure: amplitude spectra for Natural, MGE, Proposed (Chap. 3), and Proposed (Chap. 4).]

Low-resolution GANs capture differences in formants!

Subjective Evaluations in Terms of Speech Quality

Preference AB test (select the better-sounding speech)

[Figure: preference scores (horizontal axis, 0.0 to 1.0) for (a) MGE vs. Proposed (Chap. 3) and (b) MGE vs. Proposed (Chap. 4). The panels carry the annotations "Improved" and "Degraded". Error bars denote 95% confidence intervals.]

Low-resolution GAN improves synthetic speech quality!


Table of Contents

Chapter 1. Introduction

Chapter 2. Speech Synthesis Using DNNs

Chapter 3. Speech Synthesis Using GANs w/ Vocoders

Chapter 4. Speech Synthesis Using GANs w/o Vocoders

Chapter 5. Conclusion


Conclusion

Purpose: improving synthetic speech quality of SPSS

Proposed: acoustic model training algorithms using GANs

• They compensate for the distribution differences betw. natural / generated speech parameters.

Results

– The proposed algorithms improved synthetic speech quality compared to conventional methods.

Future work

– Investigating anti-spoofing techniques

– Further improving speech quality using STFT spectra
