regression approaches to voice quality control based on one-to-many eigenvoice conversion kumi ohta,...

25
Regression Approaches to Voice Quality Control Based on One-to-M any Eigenvoice Conversion Kumi Ohta, Yamato Ohtani, Tomoki T oda, Hiroshi Saruwatari, and Kiyoh iro Shikano Nara Institute of Science and Technolog y (NAIST), Japan August 23rd, 2007

Upload: meagan-wright

Post on 01-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Regression Approaches to Voice Quality Control Based on One-to-Many Eigenvoice Co

nversion

Kumi Ohta, Yamato Ohtani, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano

Nara Institute of Science and Technology (NAIST), Japan

August 23rd, 2007

2

– Amusement device– Speech enhancement device

• for a speaking aid system recovering a disabled person’s voice

• for a hearing aid system to make speech sounds more intelligible

Voice Quality Control

• Technique for converting user’s voice quality into another one

Applications

Development of voice quality control with high quality and high controllability is desired!

Controller

Hello.

3

Contents

1. Conventional voice quality control methods

2. Proposed voice quality control methods

3. Experimental verification

4. Conclusions

1. Conventional voice quality control methods

2. Proposed voice quality control methods

3. Experimental verification

4. Conclusions

4

Arbitrary speakers

Multiple pre-stored target speakers

Conversion

Training

Source speaker

Hello.Thank you.

Hello.Thank you.

Hello.Thank you.

Hello.Thank you.

Let’s convert. Let’s convert.

Eigenvoice GMM (EV-GMM)

Manually setting

Parallel data

One-to-Many Eigenvoice Conversion (EVC)[Toda et al., 2006]

• A source speaker’s voice is statistically converted into an arbitrary speaker’s one.

5

• Converted voice quality is controlled by weights for eigenvectors.

Eigenvoice GMM (EV-GMM)

Weight

Mean vector

Covariance matrix Eigenvectors

(for eigenvoices)

Bias vector(for average voice)

Parameters of the i th mixture

Source mean vector

Target mean vector

=+

)0()()(

)()(

Yi

Yi

XiZ

ibwB

μμ

i

)()(

)()()(

YYi

YXi

XYi

XXiZZ

iΣΣ

ΣΣΣ

Weights for eigenvoices(free parameters)

Problem: eigenvoices do NOT represent a specific physical meaning (such as a masculine voice or a clear voice). Intuitive control of the converted voice quality is difficult!

: Speaker independent parameters

: Free parameters

6

Contents

1. Conventional voice quality control methods

2. Proposed voice quality control methods

3. Experimental verification

4. Conclusions

7

Proposed Framework

We would like to intuitively control the converted voice quality!

We propose multiple regression approaches to one-to-many EVC.

Converted voice quality is controlled with the voice quality control vector.* Similar approaches have been proposed in HMM-based speech synthesis [Tachibana et al., 2006].

8

Process of Proposed Framework

1. Preparing multiple parallel data sets2. Setting the voice quality control vector for

every pre-stored target speaker3. Modeling the target mean vectors with

voice quality control vector

1. Preparing multiple parallel data sets2. Setting the voice quality control vector for

every pre-stored target speaker3. Modeling the target mean vectors with

voice quality control vector

9

Setting Voice Quality Control Vector

• We manually assign scores for expression word pairs to each pre-stored target speaker.

• Assigned scores are used as components of the voice quality control vector.

Tense

Hoarse

Masculine

Elderly

Thin

Feminine

Clear

Youthful

Deep

Lax

-3 -2 -1 0 1 2 3

Very VeryQuite QuiteSome-what

Some-what

No preference

2

1

1

-2

-1Voice quality control vectorfor the speaker A

Assigned scores for the speaker A

10

Process of Proposed Framework

1. Preparing multiple parallel data sets2. Setting the voice quality control vector for

every pre-stored target speaker3. Modeling the target mean vectors with

voice quality control vector

We propose 3 regression methods.

11

Proposed Method A

.)()( rRwp se

s Regression parameters

Principal componentsfor the sth target speaker

Modeling principal components is modeled by

.minarg,2

1

)()(

S

s

se

s rRwprR

Minimizing the following error function:

Error of principal components for the sth pre-stored target speaker

)(sp

Total error over all pre-stored target speakers

Least-squares (LS) estimation of regression parameters converting the voice quality control vector into principal components

Voice quality control vectorfor the sth target speaker

12

Resulting EV-GMM in Method A

=

Weight

Mean vector

)0()()(

)()(

Yie

Yi

XiZ

ibrwRB

μμ

i

Covariance matrix

)()(

)()()(

YYi

YXi

XYi

XXiZZ

iΣΣ

ΣΣΣ

Eigenvectors

Bias vector

Parameters of the i th mixture

Target mean vector Regressionparameters

Voice quality control vector

++

Problem: the desired voice characteristics might not be represented as a linear combination of eigenvectors. Changing the eigenvectors themselves is necessary!

: Training parameters

: Speaker independent EV-GMM parameters

13

Proposed Method B

.)0()(minarg)0(,2

1

)()()()()()(

S

s

Yse

YYYY s bwBμbB

Minimizing the following error function:

.)0(

)()()()(

)(

Yse

Y

Y s

bwB

μ

Target mean vector is modeled by)()( sYμ

Error of target mean vectors for the sth pre-stored target speaker

Total error over all pre-stored target speakers

LS estimation of a regression parameters converting the voice quality control vector into the target mean vectors

= +

Regression parameters

Target mean vectorfor the sth target speaker

Voice quality control vectorfor the sth target speaker

14

Resulting EV-GMM in Method B

=

Weight

Mean vector

+

)0(ˆˆ )()(

)()(

Yie

Yi

XiZ

ibwB

μμ

i

Covariance matrix

)()(

)()()(

YYi

YXi

XYi

XXiZZ

iΣΣ

ΣΣΣ

Regression parameters

Parameters of the i th mixture

Target mean vector Voice quality control vector

Problem: the desired voice quality might not be obtained because the converted voice quality is affected by all EV-GMM parameters.

: Training parameters

: Speaker independent EV-GMM parameters

15

Proposed Method C

.,logmaxarg )()()(

1 1

)(

)(

se

EVst

S

s

T

t

EV Ps

EVwλZλ

λ

Maximizing the following likelihood function:

* This process is considered as speaker adaptive training (SAT) of EV-GMM [Ohtani et al., Interspeech 2007].

Likelihood of the adapted EV-GMM for each pre-stored target speaker

Maximum Likelihood (ML) estimation of all EV-GMM parameters while fixing the voice quality control vector

Total likelihood over all pre-stored target speakers

.)0(

)()()()(

)(

Yse

Y

Y s

bwB

μ

Target mean vector is modeled by)()( sYμ

= +

Regression parameters

Target mean vectorfor the sth target speaker

Voice quality control vectorfor the sth target speaker

16

Resulting EV-GMM in Method C

=

Weight

Mean vector

+

)0(ˆˆ )()(

)()(

Yie

Yi

XiZ

ibwB

μμ

i

Covariance matrix

)()(

)()()(

YYi

YXi

XYi

XXiZZ

iΣΣ

ΣΣΣ

Parameters of the i th mixture

Target mean vectorVoice quality control vector

Regression parameters

: Training parameters

17

Comparison of Proposed Methods

Dependent variablesTied parameters of EV-GMM

Training criterion

Method A

Principal componentsSpeaker

independentLS

Method B

Target mean vectorsSpeaker

independentLS

Method C

Target mean vectors Optimized ML

18

Contents

1. Conventional voice quality control methods

2. Proposed voice quality control methods

3. Experimental verification

4. Conclusions

19

Verification of Proposed Methods

• Objective verification

• Subjective verification

Source speaker One female

Pre-stored target speakers

15 males and 15 females

Sentences Phonetically balanced 50 sentences per a speaker

Expression word pairs masculine / feminine, hoarse / clear, elderly / youthful, thin / deep, lax / tense

Number of mixtures 128

Number of Eigenvectors 29 (no loss of information)

Experimental conditions

20

Objective VerificationIs a correspondence of the voice quality control vector into the converted voice quality appropriately modeled?

• For each pre-stored target speaker in the training data, the following two voice quality control vectors were compared.

1. Manually assigned one

2. Adjusted one on the trained EV-GMM so that the converted voice quality becomes similar to the target

* approximately determined by maximum likelihood eigen-decomposition for EV-GMM [Toda et al., 2006] using two sentences

• Euclidean distance and correlation coefficient between those two vectors were calculated as objective measures.

21

Results of Objective Verification

* Reassigned: assigned scores by the same listener a second time on a different day

Worse

Better

Better

Worse

Better!

1. The method A does not work at all.1. The method A does not work at all.2. The method B works but not so good.1. The method A does not work at all.2. The method B works but not so good.3. The method C works reasonably well.

Too consistent compared with

human judgment? Better!

22

Subjective Verification

• Preference test on the converted speech quality was conducted.– Comparison of average voices* by the trained EV-GMMs * converted voices when setting every component of the voice q

uality control vector to zero

Test sentences 50 sentences not included in training data

Number of subjects 5

Experimental conditions

Having very similar speaker individuality in both method B and C

Which is better, the method B or the method C?

23

Result of Subjective Verification

• The method B outperforms the method C.

Possibility to be thought– The EV-GMM parameters trained in EM algorithm

converged to local optima due to using inappropriate initial model (i.e., the target independent GMM).

24

Contents

1. Conventional voice quality control methods

2. Proposed voice quality control methods

3. Experimental verification

4. Conclusions

25

Conclusions

• Proposal of regression approaches to the voice quality control based on one-to-many eigenvoice conversion (EVC)– Based on a statistical conversion framework– Allowing intuitive control of converted voice quality with voi

ce quality control vector

• Experimental verification– Showing the possibility that voice quality control with hig

h quality and high controllability is realized.