Synthesis of Child Speech With HMM Adaptation and Voice Conversion
Oliver Watts, Junichi Yamagishi, Member, IEEE, Simon King, Senior Member, IEEE, and Kay Berkling, Senior Member, IEEE
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 5, JULY 2010
Adviser: Dr. Yeou-Jiunn Chen
Presenter: Ming-Da Lee
Outline
  Introduction
  Child speech data
  The systems
  Evaluation
  Conclusion
  Reference
Introduction
The synthesis of child speech presents special difficulties for data-driven speech synthesis systems.
  The type of child speech corpus typically available
Two types of data-driven synthesis:
  Unit selection synthesis
  Statistical parametric approaches
Introduction
Unit selection synthesis
  Produces waveforms for arbitrary novel utterances.
  Reuses existing sections of waveform from a database.
If the database is imperfect:
  There is a direct impact on the quality of the synthesized speech.
  Typical imperfections: speaker inconsistency, background noise, and poor phonetic coverage.
Introduction
Statistical parametric approaches to speech synthesis
  Hidden Markov model (HMM)-based speech synthesis
Introduction
Base HMMs
  Trained on cleanly recorded data
  Rich in phonetic contexts
  High-quality speech
The adaptation data is noisy and sparse.
Introduction
Adaptation techniques
  Data-driven synthesizer of child speech
This work with fuller analysis
  HMM adaptation techniques and techniques from voice conversion, applied to adapt an existing synthesizer to a child speaker.
Child speech data
Type-Token Ratios (TTR)
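A type-token ratio is the number of distinct word types in a text divided by the total number of word tokens, so a lower value indicates more repetition in the corpus. The sketch below is only an illustration of that definition: the function name `type_token_ratio`, the lowercasing, and the tokenization rule are assumptions, not the procedure used in the paper.

```python
import re


def type_token_ratio(text):
    """Return distinct word types divided by total word tokens.

    Assumption: tokens are lowercase runs of letters and apostrophes.
    The paper's own tokenization is not given in these slides.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # avoid division by zero on empty input
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    # Made-up sentence, not taken from the child speech corpus.
    print(type_token_ratio("the cat saw the dog"))
```

In the made-up sentence there are five tokens and four distinct types, because "the" occurs twice.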
The systems
Speaker-Dependent Systems (A, C, E)
Speaker Adaptive Systems (B, D, F): CMU-ARCTIC
Systems M, N, and O were all designed to be compared with system L.
Systems Q, R, and S were all designed to be compared with system P.
Evaluation
We used sentences from the corpus for this part of the test.
48 paid listeners, all native speakers of English between the ages of 18 and 25.
Evaluation
Results of pairwise Wilcoxon signed rank tests between systems; a black square shows a significant difference between systems with α = 0.01 (with Bonferroni correction).
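The kind of comparison grid described in this caption can be sketched as follows. This is a minimal illustration, not the paper's analysis code: it assumes each system has one score per listener judgment in matching order, uses the normal approximation to the Wilcoxon signed rank statistic (without continuity correction), and divides α by the number of system pairs as the Bonferroni correction. The function names and the scores are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def wilcoxon_signed_rank_p(x, y):
    """Two-sided p-value for paired samples (normal approximation)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48  # tie-corrected
    z = (w_plus - mean) / sqrt(var)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def pairwise_significance(scores, alpha=0.01):
    """Map each system pair to True if it differs at the corrected level."""
    pairs = list(combinations(sorted(scores), 2))
    corrected_alpha = alpha / len(pairs)  # Bonferroni correction
    return {
        (a, b): wilcoxon_signed_rank_p(scores[a], scores[b]) < corrected_alpha
        for a, b in pairs
    }


if __name__ == "__main__":
    # Hypothetical ratings for three systems, 20 paired judgments each.
    scores = {"A": [1, 2] * 10, "B": [5, 4] * 10, "C": [2, 1] * 10}
    for pair, significant in pairwise_significance(scores).items():
        print(pair, "significant" if significant else "not significant")
```

Each True entry corresponds to one black square in a grid like the one the caption describes. For small samples an exact test would be more appropriate than the normal approximation used here.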
Evaluation
Results of XAB test for speaker individuality; comparisons among systems F, I, J, and K. Vertical lines show 95% confidence intervals (with Bonferroni correction).
Evaluation
Results of XAB test for speaker individuality; comparisons among systems L–S. Vertical lines show 95% confidence intervals (with Bonferroni correction).
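Confidence intervals of the kind drawn in the XAB plots can be approximated as below. This is a sketch under stated assumptions: each XAB comparison is treated as a binomial proportion (the share of trials in which one system was chosen), the interval uses the normal approximation, and the Bonferroni correction divides the error rate by the number of comparisons. The function name `preference_interval` and the counts are hypothetical, and the paper may have computed its intervals differently.

```python
from math import sqrt
from statistics import NormalDist


def preference_interval(successes, trials, n_comparisons=1, confidence=0.95):
    """Return (low, high) bounds on a preference proportion.

    The per-comparison error rate is (1 - confidence) / n_comparisons,
    which is the Bonferroni correction for n_comparisons intervals.
    """
    alpha = (1 - confidence) / n_comparisons
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = successes / trials
    half_width = z * sqrt(p * (1 - p) / trials)
    return max(0.0, p - half_width), min(1.0, p + half_width)


if __name__ == "__main__":
    # Hypothetical counts: one system chosen in 60 of 100 XAB trials.
    print(preference_interval(60, 100))
    print(preference_interval(60, 100, n_comparisons=6))
```

An interval that excludes 0.5 suggests a preference for one system over the other in that comparison; the corrected interval is wider than the uncorrected one, so fewer comparisons reach that threshold.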
Conclusion
When the adaptation data was restricted to 15 min, there was no significant preference for either HMM adaptation or voice conversion methods.
HMM adaptation was preferred in every case when using the full target speaker corpus.
  This is because relatively large amounts of data enable extensive use of the decision tree, which incorporates high-level linguistic and prosodic information in speaker adaptation.
Thank you
Reference
Junichi Yamagishi, Member, IEEE, Takashi Nose, Heiga Zen, Zhen-Hua Ling, Tomoki Toda, Member, IEEE, Keiichi Tokuda, Member, IEEE, Simon King, Senior Member, IEEE, and Steve Renals, Member, IEEE, "Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis," IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 6, AUGUST 2009