speech synthesis as a machine learning problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. ·...

47
Speech Synthesis as A Machine Learning Problem Keiichi Tokuda Nagoya Institute of Technology

Upload: others

Post on 04-Oct-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Speech Synthesis asA Machine Learning Problem

Keiichi TokudaNagoya Institute of Technology

Page 2: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Introduction

Rule-based, formant synthesis (~’90s)– Hand-crafting each phonetic units by rules

Corpus-based, concatenative synthesis (’90s~)– Concatenate speech units (waveform) from a database

• Single inventory: diphone synthesis• Multiple inventory: unit selection synthesis

Corpus-based, statistical parametric synthesis– Source-filter model + statistical acoustic model

• Hidden Markov model: HMM-based synthesis

2

How we can formulate and understand the whole corpus-based speech synthesis process in a unified statistical framework?

Page 3: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Problem of speech synthesis (1/2)

3

We have a speech database, i.e., a set of texts and corresponding speech waveforms.Given a text to be synthesized, what is the speech waveform corresponding to the text?

: speech waveform

: speech waveforms

: texts

: text to be synthesized

databaseGiven

?

w

WX

x

Page 4: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Problem of speech synthesis (2/2)

Problem of statistical parametric speech synthesis:

4

( )LOlooo

,,|maxargˆ P=

( ) ( ) λLOλλloo

dPP∫= ,|,|maxarg

: Synthesis data (speech parameter sequence)

: Training data (speech parameter sequence)

: Label sequence for training data(pronunciation, stress, POS, pause position, etc.)

: Label sequence for synthesis data

: Model parameters (HMM parameters)λ

l

L

Ο

o

Page 5: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Approximation

Maximum A Posterior (MAP) & ML estimation

Speech parameter generation using estimated parameters

5

( )LOλλλ

,|maxargˆMAP P=

( ) ( )λλLOλ

PP ,|maxarg=

( )λLOλλ

,|maxargˆML P=

MAP estimation

ML estimation

( )λlooo

ˆ,|maxargˆ P= Speech parametergeneration

Page 6: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

HMM-based speech synthesis system

SPEECHDATABASE

ExcitationParameterextraction

SpectralParameterExtraction

Excitationgeneration

Synthesis Filter

TEXT

Text analysis

SYNTHESIZEDSPEECH

Training HMMs

Parameter generationfrom HMMs

Context-dependent HMMs& state duration models

Labels Excitationparameters

Excitation

Spectralparameters

Spectralparameters

Excitationparameters

Labels

Speech signal Training part

Synthesis part6

Text analysis

Page 7: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

HMM-based speech synthesis system

SPEECHDATABASE

ExcitationParameterextraction

SpectralParameterExtraction

Excitationgeneration

Synthesis Filter

TEXT

Text analysis

SYNTHESIZEDSPEECH

Parameter generationfrom HMMsLabels Excitation

parametersExcitation

Spectralparameters

Speech signal Training part

Synthesis part7

Text analysis

Training HMMs

Context-dependent HMMs& state duration models

Spectralparameters

Excitationparameters

Labels

),|(maxargˆ λLOλλ

P=

Page 8: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Hidden Markov model (HMM)

11a 22a 33a

12a 23a

)(1 tb o )(2 tb o )(3 tb o

1o 2o 3o 4o 5o To ・ ・

1 2 3

1 1 1 1 2 2 3 3

o

q

Observation sequence

State sequence

8

ija

)( tqb o

: state transition probability

: output probability

Page 9: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Output probability of HMM

1 2 3 4 5 6 7 8

i

=T t

∑∏∑=

−==

qqoλqολο

T

ttqqq ttt

baPP1

)()|,()|(1

9

33a

1

2

3

23a )( 73 ob

Model parameters can be estimated by an EM algorithm

Page 10: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

HMM-based speech synthesis system

SPEECHDATABASE

ExcitationParameterextraction

SpectralParameterExtraction

Excitationgeneration

Synthesis Filter

TEXT

Text analysis

SYNTHESIZEDSPEECH

Training HMMs

Excitation

Spectralparameters

Excitationparameters

Labels

Speech signal Training part

Synthesis part10

Text analysis

Parameter generationfrom HMMs

Context-dependent HMMs& state duration models

Labels Excitationparameters

Spectralparameters

)ˆ,|(maxargˆ λlooo

P=

Page 11: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Speech parameter generation algorithm

)|(),|(max

)|(),|()|(

λqλqo

λqλqoλo

q

q

PP

PPP

=∑

),ˆ|(maxargˆ

),|(maxargˆ

λqoo

λlqq

o

q

p

p

=

=

11

For given HMM , determine a speech parameter vectorSequence which maximizes

λTTTT ],,,[ 21 Toooo =

Page 12: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Determination of state sequence

Observation sequence

State sequence

4 10 5dState duration

The state sequence can be determined by state durations12

11a 22a 33a

12a 23a

)(1 tb o )(2 tb o )(3 tb o

1o 2o 3o 4o 5o To ・ ・

1 2 3

1 1 1 1 2 2 3 3

ija

)( tqb o

: state transition probability

: output probability

o

q

Page 13: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

HMM (Hidden Markov Model)– State duration prob. depends only on transition prob.– State duration probability exponentially decreases

HSMM (Hidden Semi Markov Model)– HMM + explicit duration model ⇒ HSMM

State durations are given by means of Gaussians.

Hidden Semi Markov Model

13

1O 2O 3O 4O 5O 6O 7O 9O8O

1q 2q 3q

1d 2d 3d

Introduce state duration probability

)|( qdP

Gaussian

Page 14: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Speech parameter generation algorithm

)|(),|(max

)|(),|()|(

λqλqo

λqλqoλo

q

q

PP

PPP

=∑

),ˆ|(maxargˆ

),|(maxargˆ

λqoo

λlqq

o

q

p

p

=

=

14

For given HMM ,determine a speech parameter vectorSequence which maximizes

λTTTT ],,,[ 21 Toooo =

Page 15: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Generated feature sequence

Mean Variance

becomes a sequence of mean vectors⇒ discontinuous outputs between stateso

15

frame

a i

Page 16: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Dynamic features

112

22

11

2

)(5.0

−+

−+

+−≈∂∂

=∆

−≈∂∂

=∆

ttt

t

ttt

tccccc

cctcc

t

t

tc1−tc 1+tc

tc∆

tc1−tc 1+tc

tc2∆

16

1−∆ tc 1+∆ tc 1−∆ tc 1+∆ tc

5.0− 5.0 1 12−

Page 17: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Integration of dynamic features

Relationship between speech parm. vec. & static feat

17

o cW

Page 18: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Solution for the problem (1/2)

0c

qWc=

∂∂ ),ˆ|(log λP

qqq μWWcW ˆ1

ˆ1

ˆ−− = ΣΣ TT

TTTT ],,,[ 21 Tcccc =TTTT ],,,[ ˆˆˆˆ 21 Tqqqq μμμμ =

TTTT ],,,[ ˆˆˆˆ 21 Tqqqq ΣΣΣΣ =

By setting

we obtain

where

18

Page 19: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Generated speech parameter trajectory

Mean Variance c

19

frame

c∆

c

a i

Page 20: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Generated spectra

200

12

34

5(kH

z)0

12

34

5(kH

z)

a i silsil

w/o dyn

w dyn

Spectra changing smoothly between phonemes

Page 21: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Conventional HMM Trajectory HMM

Training

Synthesis

Trajectory HMM

Solve inconsistency between training & synthesis

21

),|( λLCP

)ˆ,|( λlcP

),|( λLOP

Wcoλlo =|)ˆ,|(P

⇒ improving the model accuracy

is not a proper distribution of c),|(),|( λlWcλlo PP =

Page 22: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

HMM-based speech synthesis system

TEXT

Text analysis

Training HMMs

Parameter generationfrom HMMs

Context-dependent HMMs& state duration models

Labels

Labels

Training part

Synthesis part22

Text analysis

SPEECHDATABASE

ExcitationParameterextraction

SpectralParameterExtraction

Spectralparameters

Excitationparameters

Speech signal

Excitationgeneration

Synthesis Filter

SYNTHESIZEDSPEECH

Excitationparameters

Excitation

Spectralparameters

Page 23: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

∑=

−=M

m

mzmczH0

)(exp)(

Mel-cepstral representation of speech spectra

ML-estimation of mel-cepstrum

ML estimation of spectral parameter

)|(maxarg cxcc

p=

23

xc

: speech waveform (Gaussian process): mel-cepstrum

War

ped

frequ

ency

(rad

)

Frequency (rad)

warped frequencymel-scale frequency

ω

αα ~

1

11

1~ je

zzz −

−− =

−−

=

∑=

−=M

m

mzmczH0

~)(exp)(

ω

π

π0 2/π

Page 24: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Synthesis filter

Overview of speech vocoding

Original speech

Mel-cepstrumUnvoiced/voicedF0

24

)(zH)(ne )(nxexcitation

pulse train

white noise

speech

These speech parameter modeled by HMM

Page 25: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

HMM-based speech synthesis system

SPEECHDATABASE

ExcitationParameterextraction

SpectralParameterExtraction

Excitationgeneration

Synthesis Filter

TEXT

Text analysis

SYNTHESIZEDSPEECH

Training HMMs

Parameter generationfrom HMMs

Context-dependent HMMs& state duration models

Labels Excitationparameters

Excitation

Spectralparameters

Spectralparameters

Excitationparameters

Labels

Speech signal Training part

Synthesis part25

Text analysis

Page 26: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Observation of F0

Time

Log

Freq

uenc

y

Unable to model by continuous or discrete distributions⇒ Multi-space distribution HMM (MSD-HMM)

26

11 R=Ω

voiced0

2 R=Ωunvoiced

Page 27: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Structure of MSD-HMM

27

2,1w 2,2w 2,3w2

2nR=Ω

)(12 xN )(22 xN )(32 xN

3,1w 3,2w 3,3w3

3nR=Ω

)(13 xN )(23 xN )(33 xN

1 2 3

1,1w 1,2w 1,3w1

1nR=Ω

)(11 xN )(21 xN )(31 xN

Page 28: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

MSD-HMM for F0 modeling

1 2 3HMM for F0

0,1w 0,2w 0,3w

1,1w 1,2w 1,3w

unvoiced

voiced

unvoiced

voiced

unvoiced

voiced

・ ・ ・01 R=Ω

12 R=Ω

voiced/unvoiced weights

28

Page 29: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Generated F0

29

Page 30: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Speech samples

Mel-cepstrum

w/ dyn. w/o dyn.

log F0w/ dyn.

w/o dyn.

30

Page 31: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Inclusion of speech analysis & waveform reconstruction

Problem of statistical parametric speech synthesis

31

( )LXlxxx

,,|maxargˆ P=

( )∑∑∫=c Cx

cx |maxarg P ( )λlc ,|P

( )LCλ ,|P× ( ) λXC dP |

Waveform reconstruction Speech parameter generation

HMM parameter estimation Speech analysis

c and consist of mel-cepstrum & F0 parametersC

Page 32: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Context clustering factors preceding, succeeding two phonemes Current phoneme Position of current phoneme in current syllable # of phonemes at preceding, current, succeeding syllable accent, stress of preceding, current, succeeding syllable Position of current syllable in current word # of preceding, succeeding accented, stressed syllable in current phrase # of syllables from previous, to next accented, stressed syllable Vowel within current syllable Guess at part of speech of preceding, current, succeeding word # of syllables in preceding, current, succeeding word Position of current word in current phrase # of preceding, succeeding content words in current phrase # of words from previous, to next content word # of syllables in preceding, current, succeeding phrase…

Vast # of combinations ⇒ Difficult to have possible models32

Page 33: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Decision tree-based state clustering

k-a+b

t-a+h

……

yes

yesyesyes

yes no

no

no

no

no

w-a+t

w-a+sil w-a+sh

gy-a+sil

gy-a+pau

g-a+pau

leaf nodes

synthesizedstates

R=silence?

L=“gy”?

L=voice?

L=“w”?

R=silence?

33

Page 34: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Stream-dependent tree-based clustering

Decision treesfor

mel-cepstrum

Decision treesfor F0

HMM

State durationmodel

Decision treefor state dur.

models

34

Page 35: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Inclusion of model structure parameter

Problem of statistical parametric speech synthesis

35

( )LOlooo

,,|maxargˆ P=

( )∑∫=m

mP ,,|maxarg λloo

( )LOλ ,,| mP× ( ) λLO dmP ,|

Generate coefficient

Posterior ofmodel parameters

Posterior ofmodel structure

Usually fixed is usedm

Page 36: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

HMM-based speech synthesis system

SPEECHDATABASE

ExcitationParameterextraction

SpectralParameterExtraction

Excitationgeneration

Synthesis Filter

SYNTHESIZEDSPEECH

Excitationparameters

Excitation

Spectralparameters

Speech signal Training part

Synthesis part36

),|(maxargˆ ΛWLLL

P=

TEXT

Text analysisParameter generation

from HMMsLabels

),|(maxargˆ Λwlll

P=

Training HMMs

Context-dependent HMMs& state duration models

Spectralparameters

Excitationparameters

Labels

Text analysis

Page 37: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Inclusion of text analysis

Problem of statistical parametric speech synthesis

37

( )WOwooo

,,|maxargˆ P=

( )∑∑∫ ∫=l Lo

λlo ,|maxarg P ( )Λ,| wlP

( )LOλ ,|P× ( ) ( ) ΛΛΛ ddPP λWL ,|

Speech parameter generation

HMM parameter estimation Text processing

Text processing

Prior

Page 38: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

HMM-based speech synthesis system

SPEECHDATABASE

ExcitationParameterextraction

SpectralParameterExtraction

Excitationgeneration

Synthesis Filter

TEXT

Text analysis

SYNTHESIZEDSPEECH

Training HMMs

Parameter generationfrom HMMs

Context-dependent HMMs& state duration models

Labels Excitationparameters

Excitation

Spectralparameters

Spectralparameters

Excitationparameters

Labels

Speech signal Training part

Synthesis part38

Text analysis

Page 39: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Inclusion of all components

Problem of statistical parametric speech synthesis

39

( )WXwxxx

,,|maxargˆ P=

( )∑∑∑∫ ∫=Cc Llx

cx, ,

|maxargm

P ( )mP ,,| λlc ( )Λ,| wlP

Waveform generation Parameter generation Text processing

Posterior of model parameter

Text processing Prior

( ) ΛΛ ddP λ( )Λ,|WLP× ( )XC|P

Speech analysis

( )LCλ ,,| mP× ( )LC ,|mP

Posterior of model structure

Page 40: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Emotional Speech Synthesis

text neutral angry

「授業中に携帯いじってんじゃねえよ!

電源切っとけ!」

“Don’t touch your cell phone during a class! Turn off it!”

「ミーティングには毎週参加しなさい!」

“You must attend the weekly meeting!”

trained with 200 utterances

Page 41: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Speaker Adaptation (mimicking voices)

w/o adaptation (initial model) Adapted with 4 utterances Adapted with 50 utterances Speaker-dependent model

Speaker-independent

Adaptationdata

Adapted model

MLLR-based adaptation

adaptation

Page 42: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Speaker Interpolation (mixing voices)

Linear combination of two speaker-dependent models

Model A Model BInterpolated model

1.00 0.75 0.50 0.25 0.00

0.00 0.25 0.50 0.75 1.00

A:

B:

Page 43: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Voice Morphing

A BTwo voices:

Four voices:A B

Page 44: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Interpolation of Speaking Styles

Neutral High Tension

Interpolation extrapolation

Base model A Base model B

Page 45: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Multilingual Speech Synthesis Japanese American English Chinese (Mandarin) (by ATR) Brazilian Portuguese (by Nitech, and UFRJ) European Portuguese

(by Nitech, Univ of Porto, and UFRJ) Slovenian

(by Bostjan Vesnicer, University of Ljubljana, Slovenia ) Swedish

(by Anders Lundgren, KTH, Sweden) German (by University of Bonn, and Nitech) Korean (by Sang-Jin Kim, ETRI, Korea) Hungarian (by Budapest University of Technology and

Economics) Finish (by TKK, Finland) Baby English (by Univ of Edinburgh, UK) Polish, Slovak, Finnish, Arabic, Farsi, Polyglot, etc.

Page 46: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Singing Voice Synthesis

Singing voice database

Musical score

Trained HMMs

Singing voice forany piece of music

male:

female:

male:

female:

Page 47: Speech Synthesis as A Machine Learning Problemtokuda/tokuda_o-cocosda2010.pdf · 2010. 12. 1. · Problem of speech synthesis (1/2) 3 We have a speech database, i.e., a set of texts

Conclusion

• Whole speech synthesis process is described as a statistical framework

• It gives a unified view and reveals what is correct and what is wrong

• Importance of the databaseFuture work• Still we have many problems should be solved:

• Speech waveform modeling• Combination with text processing part

47

Statistical approach to speech synthesis