Nearly Perfect Detection of Continuous F0 Contour and Frame Classification for TTS Synthesis
Thomas Ewender
Outline
• Motivation
• Detection algorithm of continuous F0 contour
• Frame classification algorithm
Motivation
• Detection algorithm of continuous F0 contour
  – Prediction of target prosody
  – Prosodic modification of segments
• Frame classification algorithm
  – Prosodic modification of segments
Ingredients to solve problem
• High-resolution cepstrogram
• Statistical description of T0 contour form
• Dynamic programming for global optimisation
Cepstrum with high resolution
Standard resolution: t_s = 1/f_s (inverse DFT of length N); high resolution: t_s/M = 1/(M·f_s) (inverse DFT of length M·N)
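As a sketch of one way to obtain such a high-resolution cepstrum (assuming the finer quefrency grid comes from zero-padding the log spectrum by a factor M before the inverse DFT; the slide does not spell out the exact procedure):

```python
import numpy as np

def high_resolution_cepstrum(frame, M=4):
    """Real cepstrum of a windowed frame. Zero-padding the log
    spectrum by a factor M before the inverse DFT refines the
    quefrency spacing from 1/fs to 1/(M*fs)."""
    N = len(frame)
    spectrum = np.fft.rfft(frame * np.hanning(N))
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # floor avoids log(0)
    # irfft with n = M*N zero-pads the log spectrum, which amounts to
    # band-limited interpolation of the standard N-point cepstrum.
    return np.fft.irfft(log_mag, n=M * N)
```

Zero-padding in the inverse transform only interpolates the standard cepstrum on a denser grid; it adds no new information, but it lets the later contour search place T0 candidates between the original quefrency bins.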
Statistical description of T0 contour form
• Used 17 hours of speech data
• Various speakers and languages
• Created a statistical model
Probability distribution of local gradient and curvature
Define the local gradient d(t) and the curvature c(t) of the discrete time sequence q(t):

d(t) = (q(t) − q(t − 2T_s)) / (2T_s),   c(t) = (q(t) − 2·q(t − T_s) + q(t − 2T_s)) / T_s²
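The two difference formulas can be sketched directly; here the step T_s is taken in samples, and values before the first two samples are left undefined:

```python
import numpy as np

def gradient_curvature(q, Ts=1):
    """Local gradient d(t) and curvature c(t) of a discrete sequence
    q(t), per the slide's definitions:
      d(t) = (q(t) - q(t - 2*Ts)) / (2*Ts)
      c(t) = (q(t) - 2*q(t - Ts) + q(t - 2*Ts)) / Ts**2
    Entries for t < 2*Ts are returned as NaN."""
    q = np.asarray(q, dtype=float)
    d = np.full_like(q, np.nan)
    c = np.full_like(q, np.nan)
    d[2*Ts:] = (q[2*Ts:] - q[:-2*Ts]) / (2 * Ts)
    c[2*Ts:] = (q[2*Ts:] - 2 * q[Ts:-Ts] + q[:-2*Ts]) / Ts**2
    return d, c
```

On a quadratic sequence q(t) = t² with Ts = 1, this gives d(t) = 2t − 2 and a constant curvature c(t) = 2, as the central second difference should.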
Finding the globally most probable T0 contour
Compute p(d,c) from the sequence of logarithmic quefrencies q(t)
Finding the globally most probable T0 contour

Local score: α(t, l) = p(d, c) · e^(w·C(t, l))
  C(t, l): log cepstrogram
  t: discrete time
  l: log quefrency
  w: weighting factor
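For a single grid point the local score is a one-liner; the function below is a minimal sketch where `C_tl` and `p_dc` are assumed to be precomputed from the cepstrogram and the statistical model:

```python
import numpy as np

def local_score(C_tl, p_dc, w=1.0):
    """Local score alpha(t, l) = p(d, c) * exp(w * C(t, l)).
    C_tl: log cepstrogram value at discrete time t, log quefrency l;
    p_dc: probability of the local gradient and curvature implied by
    the candidate contour; w: weighting factor between the terms."""
    return p_dc * np.exp(w * C_tl)
```

The exponentiation puts the log cepstrogram value on the same multiplicative probability scale as p(d, c), with w trading off the two knowledge sources.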
Finding the globally most probable T0 contour
Compute the optimal sequence using dynamic programming:

δ(t, l) = max_k { δ(t − T_s, k) · α(t, l) }

where δ(t, l) = 1 for t ≤ 0.
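The recursion is a Viterbi-style search over a time × log-quefrency grid. The sketch below is a simplified instance: it factors the gradient/curvature prior p(d, c) into a hypothetical pairwise transition table `p_trans[k, l]` (in the talk, p(d, c) depends on the two previous quefrencies, so the real coupling is higher-order), and it works with raw products rather than log scores:

```python
import numpy as np

def most_probable_contour(logC, p_trans, w=1.0):
    """Dynamic programming for the globally most probable T0 contour.
    logC[t, l]: log cepstrogram on a (time x log-quefrency) grid.
    p_trans[k, l]: assumed pairwise stand-in for the p(d, c) prior,
    scoring a step from quefrency bin k to bin l.
    Implements delta(t, l) = max_k { delta(t - Ts, k) * alpha(t, l) }
    with delta = 1 before the first frame, then backtracks."""
    T, L = logC.shape
    emit = np.exp(w * logC)                 # e^(w*C(t, l)) term
    delta = np.zeros((T, L))
    psi = np.zeros((T, L), dtype=int)       # backpointers
    delta[0] = emit[0]                      # delta = 1 for t <= 0
    for t in range(1, T):
        cand = delta[t - 1][:, None] * p_trans   # cand[k, l]
        psi[t] = np.argmax(cand, axis=0)         # best predecessor k
        delta[t] = cand[psi[t], np.arange(L)] * emit[t]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 1, 0, -1):
        path[t - 1] = psi[t, path[t]]
    return path                              # one quefrency bin per frame
```

For long utterances a practical implementation would accumulate log scores instead of products to avoid underflow; the structure of the recursion is unchanged.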
Achievements F0 detection
• F0 detection based on a clear mathematical model and on statistical properties
• Global optimisation gives more robust results (in contrast to piecewise optimisation)
• No post-processing of resulting F0 or T0 contours needed
ANN-based frame classification for prosody modification
Obtain class information required for prosody modification (e.g. with PSOLA) according to signal properties:
• voiced
• unvoiced
• mixed
• irregular
• silence
Irregularly glottalised segments
Speech with irregularly spaced glottal pulses, no significant fricative components
Classification features
• Zero crossing rate
• Speech signal power (in logarithmic scale)
• Spectral tilt (first mel-frequency cepstral coefficient)
• Dominance of central frequencies (second mel-frequency cepstral coefficient)
• Value of the cepstrogram at quefrency T0
• Amplitude of fundamental wave
• Regularity of fundamental wave frequency increase (dynamics)
• Regularity of fundamental wave shape
• Irregularity of amplitude increase
[Figure: speech signal and its fundamental wave]
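Two of the listed features can be computed directly from a frame of samples; the sketch below covers only this illustrative subset, not the talk's full feature set:

```python
import numpy as np

def frame_features(frame):
    """Zero crossing rate and log signal power for a single frame
    (illustrative subset of the classification features)."""
    frame = np.asarray(frame, dtype=float)
    # Fraction of consecutive sample pairs whose sign changes
    zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
    # Log power; a small floor avoids log(0) on silent frames
    log_power = np.log10(np.mean(frame ** 2) + 1e-12)
    return zcr, log_power
```

A fully alternating frame has a zero crossing rate of 1.0, while a silent frame is dominated by the floor term and yields a large negative log power.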
Classifier and training
• 2-layer ANN classifier
• Training: 20 different voices covering a range of 12 European and Asian languages
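A 2-layer ANN (one hidden layer) mapping a frame's feature vector to the five class posteriors can be sketched as below; the hidden size, tanh activation, and softmax output are assumptions, since the talk only states that a 2-layer ANN is used:

```python
import numpy as np

CLASSES = ["voiced", "unvoiced", "mixed", "irregular", "silence"]

def ann_classify(x, W1, b1, W2, b2):
    """Forward pass of a 2-layer ANN frame classifier: tanh hidden
    layer followed by a softmax over the five frame classes."""
    h = np.tanh(W1 @ x + b1)            # hidden layer
    logits = W2 @ h + b2                # output layer
    p = np.exp(logits - logits.max())   # numerically stable softmax
    return p / p.sum()                  # class posteriors
```

At run time the predicted class would be `CLASSES[np.argmax(p)]`; training the weights (e.g. by backpropagation on the 20-voice corpus) is outside this sketch.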
Classification evaluation
[Table: classification accuracy in %]
• Voiced/unvoiced decision nearly perfect
• Confusions occur mainly between classes that share signal qualities (such as unvoiced/mixed and voiced/irregular)
References
1. C. Traber, “SVOX: The implementation of a text-to-speech system for German,” Ph.D. dissertation, No. 11064, Computer Engineering and Networks Laboratory, ETH Zurich, TIK-Schriftenreihe Nr. 7 (ISBN 3 7281 2239 4), March 1995.
2. H. Romsdorfer, “Polyglot text-to-speech synthesis: Text analysis & prosody control,” Ph.D. dissertation, No. 18210, Shaker Verlag Aachen (ISBN 978-3-8322-8090-1), February 2009.
3. F. Charpentier and E. Moulines, “Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones,” in Proceedings of Eurospeech’89, 1989, pp. 13–19.
4. A. de Cheveigné and H. Kawahara, “YIN, a fundamental frequency estimator for speech and music,” Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
5. H. Kawahara, A. de Cheveigné, et al., “Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT,” in Proceedings of Interspeech. ISCA, 2005, pp. 537–540.
6. D. Joho, M. Bennewitz, and S. Behnke, “Pitch estimation using models of voiced speech on three levels,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, Honolulu, Hawaii, USA, April 2007, pp. 1077–1080.