
Nearly Perfect Detection of Continuous F0 Contour and Frame Classification for TTS Synthesis

Thomas Ewender

2

Outline

• Motivation
• Detection algorithm of continuous F0 contour
• Frame classification algorithm

3

Motivation

• Detection algorithm of continuous F0 contour
  – Prediction of target prosody
  – Prosodic modification of segments

• Frame classification algorithm
  – Prosodic modification of segments

4

T0 contour estimation from a high resolution cepstrogram

Easy to “see” the T0 contour for humans:


6

Ingredients to solve the problem

• High resolution cepstrogram
• Statistical description of the T0 contour form
• Dynamic programming for global optimisation

7

Cepstrum with standard resolution

Resolution: t_s · N / M; standard resolution: M = N, i.e. resolution t_s = 1 / f_s

8

Cepstrum with high resolution

Resolution: t_s · N / M; standard resolution: M = N; high resolution: M > N
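Reading the resolution formula as: the quefrency grid becomes finer by a factor N/M when the N-point log spectrum is evaluated on an M-point inverse transform. A minimal numpy sketch under that reading (frame length, window and padding factor are illustrative assumptions, not the author's implementation):

```python
import numpy as np

def cepstrum(frame, pad_factor=1):
    """Real cepstrum of one windowed frame.

    pad_factor = 1 gives the standard quefrency spacing of one sampling
    period t_s; pad_factor > 1 (M = pad_factor * N) zero-pads the log
    spectrum before the inverse FFT and refines the spacing to t_s * N / M.
    """
    n = len(frame)
    log_mag = np.log(np.abs(np.fft.rfft(frame * np.hanning(n))) + 1e-12)
    m = pad_factor * n
    return np.fft.irfft(log_mag, n=m)   # length-M, high-resolution cepstrum
```

Sliding this over successive frames and stacking the results gives the (high resolution) cepstrogram used below.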

9

Statistical description of T0 contour form

• Used 17 hours of speech data
• Various speakers and languages
• Created a statistical model

10

Probability distribution of local gradient and curvature

Define the local gradient d(t) and the curvature c(t) of the discrete time sequence q(t):

d(t) = (q(t) − q(t − 2T_s)) / (2 T_s),   c(t) = (q(t) − 2 q(t − T_s) + q(t − 2T_s)) / T_s²
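A minimal numpy sketch of these two finite differences, assuming q holds the sequence sampled every T_s seconds (names are illustrative):

```python
import numpy as np

def gradient_curvature(q, T_s):
    """Local gradient d(t) and curvature c(t) from q(t), q(t - T_s), q(t - 2 T_s)."""
    q_t  = q[2:]      # q(t)
    q_t1 = q[1:-1]    # q(t - T_s)
    q_t2 = q[:-2]     # q(t - 2 T_s)
    d = (q_t - q_t2) / (2.0 * T_s)           # gradient
    c = (q_t - 2.0 * q_t1 + q_t2) / T_s**2   # curvature
    return d, c
```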

11

Empirical probability distribution of the T0 contour

[Figure: joint distribution over gradient [%/s] and curvature [%/s]]

12

Probability density function of the GMM approximation

[Figure: density over gradient [%/s] and curvature [%/s]]
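The slides do not give the GMM configuration; as an illustration, a joint density over (gradient, curvature) pairs could be fitted with scikit-learn (component count and the placeholder data are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# d, c: gradient and curvature samples pooled over the 17-hour corpus
# (random placeholders here; in practice computed from the T0 contours).
d = np.random.randn(10000)
c = np.random.randn(10000)
X = np.column_stack([d, c])

gmm = GaussianMixture(n_components=8, covariance_type="full").fit(X)
log_p = gmm.score_samples(X[:5])   # log p(d, c) for new (d, c) pairs
```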

13

Finding the globally most probable T0 contour

Compute p(d, c) from the sequence of logarithmic quefrencies q(t).

[Figure: cepstrogram with candidate quefrencies at times t − 2T_s, t − T_s and t]

14

Finding the globally most probable T0 contour

Local score: α(t, l) = p(d, c) · e^(w · C(t, l))

where C(t, l): log cepstrogram, t: discrete time, l: log quefrency, w: weighting factor.
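A sketch of this local score in log form (so it can later be combined by addition in the dynamic programming step), assuming a GMM such as the one above and a log cepstrogram array indexed as C[t, l]; the weighting factor w is a free parameter:

```python
import numpy as np

def local_log_score(C, t, l, d, c, gmm, w=1.0):
    """log alpha(t, l) = log p(d, c) + w * C[t, l].

    d, c are the gradient and curvature implied by the candidate quefrency
    at (t, l) together with the two preceding candidates.
    """
    log_p = gmm.score_samples(np.array([[d, c]]))[0]
    return log_p + w * C[t, l]
```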

15

Finding the globally most probable T0 contour

Compute the optimal sequence using dynamic programming:

δ(t, l) = max_k { δ(t − T_s, k) · α(t, l) }

where δ(t, l) = 1 for t ≤ 0.
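A simplified Viterbi-style sketch of this recursion in the log domain. It assumes the transition score can be tabulated as a matrix over pairs of quefrency candidates (standing in for log p(d, c)); in the recursion above the density also looks one further step back through the curvature, which a full implementation would handle by enlarging the state. Everything here is illustrative, not the author's code:

```python
import numpy as np

def most_probable_contour(C, log_trans, w=1.0):
    """Globally best quefrency track through a (T, L) log cepstrogram C.

    log_trans[k, l]: log transition score from candidate k to candidate l
    (a first-order stand-in for log p(d, c)).

    Recursion: delta[t, l] = max_k(delta[t-1, k] + log_trans[k, l]) + w * C[t, l]
    """
    T, L = C.shape
    delta = np.full((T, L), -np.inf)
    backptr = np.zeros((T, L), dtype=int)
    delta[0] = w * C[0]

    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # scores[k, l]
        backptr[t] = np.argmax(scores, axis=0)       # best predecessor k per l
        delta[t] = scores[backptr[t], np.arange(L)] + w * C[t]

    # Backtrack the optimal candidate sequence.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path
```

Mapping each candidate index back to its quefrency (and taking the reciprocal) yields the continuous T0, respectively F0, contour.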

16

Resulting T0 contour in the cepstrogram

17

Achievements of the F0 detection

• F0 detection based on a clear mathematical model and on statistical properties

• Global optimisation gives more robust results (in contrast to piecewise approaches)

• No post-processing of the resulting F0 or T0 contours is needed

18

ANN-based frame classification for prosody modification

Obtain the class information required for prosody modification (e.g. with PSOLA) according to signal properties:

• voiced
• unvoiced
• mixed
• irregular
• silence


20

Mixed segments

Speech with voicing and noise

21

Irregularly glottalised segments

Speech with irregularly spaced glottal pulses, no significant fricative components

22

Classification features

• Zero crossing rate
• Speech signal power (in logarithmic scale)
• Spectral tilt (first mel frequency cepstral coefficient)
• Dominance of central frequencies (second mel frequency cepstral coefficient)
• Value of the cepstrogram at quefrency T0
• Amplitude of fundamental wave
• Regularity of fundamental wave frequency increase (dynamics)
• Regularity of fundamental wave shape
• Irregularity of amplitude increase
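The first two features in this list are simple to compute; a minimal numpy sketch (frame and hop sizes are arbitrary choices, not taken from the slides):

```python
import numpy as np

def zcr_and_log_power(signal, frame_len=512, hop=256):
    """Per-frame zero crossing rate and log power."""
    zcr, log_power = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Fraction of sign changes between consecutive samples.
        zcr.append(np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:])))
        # Mean power on a logarithmic (dB) scale.
        log_power.append(10.0 * np.log10(np.mean(frame ** 2) + 1e-12))
    return np.array(zcr), np.array(log_power)
```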

23

Classification features

[Figure: speech signal and the extracted fundamental wave]

24

Classifier and training

• 2-layer ANN classifier
• Training: 20 different voices covering a range of 12 European and Asian languages
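The slides give no architecture details beyond "2-layer ANN"; as an illustration, a network with one hidden layer mapping a per-frame feature vector to the five classes could be set up like this (hidden size, feature dimension and the placeholder data are assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

CLASSES = ["voiced", "unvoiced", "mixed", "irregular", "silence"]

# X: one feature vector per frame (the features listed earlier),
# y: one class label per frame; random placeholders here.
X = np.random.randn(5000, 10)
y = np.random.choice(len(CLASSES), size=5000)

# One hidden layer plus output layer, i.e. a 2-layer network.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)
predicted = [CLASSES[i] for i in clf.predict(X[:3])]
```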

25

Example: voiced/unvoiced transition

26

Example: creaky voices

Norwegian voice

Mandarin voice

27

Classification evaluation

[Table: classification accuracy in %, not reproduced]

• Voiced/unvoiced decision is nearly perfect
• Confusions occur mainly between classes that share signal qualities (such as unvoiced and mixed, or voiced and irregular)
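The accuracy table itself is lost here; per-class accuracy and the confusions described above are typically read off a confusion matrix, e.g. (labels and predictions are placeholders):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# y_true, y_pred: frame class indices from a labelled test set (placeholders).
y_true = np.random.choice(5, size=1000)
y_pred = y_true.copy()

cm = confusion_matrix(y_true, y_pred)                        # rows: true class
per_class_accuracy = 100.0 * cm.diagonal() / cm.sum(axis=1)  # in %
```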

28

Conclusion

Realise prosodic modification without a perceptible negative impact

29

References

1. C. Traber, “SVOX: The implementation of a text-to-speech system for German,” Ph.D. dissertation, No. 11064, Computer Engineering and Networks Laboratory, ETH Zurich, TIK-Schriftenreihe Nr. 7 (ISBN 3 7281 2239 4), March 1995.

2. H. Romsdorfer, “Polyglot text-to-speech synthesis: Text analysis & prosody control,” Ph.D. dissertation, No. 18210, Shaker Verlag Aachen (ISBN 978-3-8322-8090-1), February 2009.

3. F. Charpentier and E. Moulines, “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones,” in Proceedings of Eurospeech’89, 1989, pp. 13–19.

4. A. de Cheveigné and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.

5. H. Kawahara, A. de Cheveigné, et al., “Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT,” in Proceedings of Interspeech. ISCA, 2005, pp. 537–540.

6. D. Joho, M. Bennewitz, and S. Behnke, “Pitch estimation using models of voiced speech on three levels,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, Honolulu, Hawaii, USA, April 2007, pp. 1077–1080.