December 2006
Cairo University
Faculty of Computers and Information
HMM Based Speech Synthesis
Presented by Ossama Abdel-Hamid Mohamed
Agenda
– Speech Synthesis
– HMM Based Speech Synthesis
– Proposed System
– Challenges
Speech Synthesis
What is speech synthesis?
– Generating human-like speech using computers.
Applications
– Text To Speech.
– Conversation systems.
– Speech to speech translation.
– Concept to speech.
Systems built since the late 1970s.
– MITalk 1979
– Klattalk 1980
Speech Synthesis, Cont.
Challenges:
– Intelligibility.
– Naturalness.
– Pleasantness.
– Emotions.
Speech Synthesis, Techniques
•Formant Based
– Rule based
– Difficult to make
– Machine-like
•Concatenative
– Instance based
– Based on corpus
– Better quality
– Not flexible
•HMM Based
– Statistical based
– Based on corpus
– Newest technique
– More flexible
HMM Based Speech Synthesis Overview
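HMMs have been used successfully in speech recognition.
In recognition, the observations O are given and the best model is sought:
$$\lambda^{*} = \arg\max_{\lambda} P(O \mid \lambda)$$
In speech synthesis, the model is given and the best observations are sought:
$$O^{*} = \arg\max_{O} P(O \mid \lambda)$$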
HMM Based Speech Synthesis Overview, Cont.
Include delta and acceleration features to get smooth output.
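To illustrate why the delta and acceleration features smooth the output, here is a minimal sketch of maximum-likelihood parameter generation for a single one-dimensional stream. It assumes diagonal covariances, per-frame means and variances already read off a known state sequence, and hypothetical 3-tap windows for the dynamic features. None of these choices are stated in the slides, and the function names are illustrative only.

```python
import numpy as np

# Hypothetical windows (assumptions, not from the slides):
# static, delta, and acceleration.
WINDOWS = [
    np.array([0.0, 1.0, 0.0]),
    np.array([-0.5, 0.0, 0.5]),
    np.array([1.0, -2.0, 1.0]),
]

def window_matrix(num_frames, window):
    """Return the T x T matrix that applies a 3-tap window to a static trajectory."""
    W = np.zeros((num_frames, num_frames))
    for t in range(num_frames):
        for k, w in enumerate(window):
            tau = t + k - 1  # taps cover frames t-1, t, t+1
            if 0 <= tau < num_frames:
                W[t, tau] = w
    return W

def generate_trajectory(means, variances):
    """Solve (W' S^-1 W) c = W' S^-1 mu for the static trajectory c.

    means, variances: arrays of shape (3, T) with per-frame Gaussian
    statistics for the static, delta, and acceleration features.
    """
    num_frames = means.shape[1]
    W = np.vstack([window_matrix(num_frames, w) for w in WINDOWS])  # (3T, T)
    mu = means.reshape(-1)                  # stacked in the same order as W
    precision = 1.0 / variances.reshape(-1)
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mu)
    return np.linalg.solve(A, b)

# Toy input: two "states" with static means 0 and 1, five frames each,
# and zero-mean dynamic features.
static_mean = np.repeat([0.0, 1.0], 5)
means = np.vstack([static_mean, np.zeros(10), np.zeros(10)])
variances = np.ones((3, 10))
trajectory = generate_trajectory(means, variances)
```

Without the dynamic features, the most likely output would simply be the piecewise-constant sequence of state means. With them, the solver trades closeness to the static means against small deltas and accelerations, which rounds off the step between the two states.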
The Overall System
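Training Part
– Speech Database → F0 Extraction → f0
– Speech Database → Mel-Cepstral Analysis → mel-cepstrum
– Text Analysis → labels and context features
– f0, mel-cepstrum, labels and context features → HMM Training → Models
Synthesis Part
– Text → Text Analysis → labels and context features
– Labels and context features + Models → Parameters Generation → f0, mel-cepstrum
– f0 → Pulse or Noise Excitation → excitation
– Excitation + mel-cepstrum → MLSA filter → Speech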
– f0 is modeled using MSD-HMM.
– The spectrum is represented by 25 mel-cepstral coefficients.
– Context-dependent models.
– Each model has 5 states.
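As a rough illustration of a 5-state model, the sketch below builds a left-to-right transition matrix without skips. The no-skip topology and the self-loop probability of 0.8 are assumptions made for the example; the slides only state the number of states.

```python
import numpy as np

def left_to_right_transitions(num_states=5, self_loop=0.8):
    """Transition matrix with self-loops and single forward steps only."""
    A = np.zeros((num_states, num_states))
    for i in range(num_states):
        A[i, i] = self_loop
        if i + 1 < num_states:
            A[i, i + 1] = 1.0 - self_loop
    # The last row sums to self_loop: the remaining mass leaves the model.
    return A

A = left_to_right_transitions()
# With a geometric duration model, a state lasts 1 / (1 - self_loop) frames on average.
expected_frames_per_state = 1.0 / (1.0 - 0.8)
```

The geometric state duration implied by the self-loop is a known weakness of plain HMMs for synthesis, which is why explicit duration modeling is often added.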
– Each frame is either voiced or unvoiced.
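The binary voiced/unvoiced decision can be sketched as follows: voiced frames receive a pulse train at the pitch period, unvoiced frames receive white noise. The frame length, sample rate, pulse scaling, and function name are assumptions chosen for the example, not values from the slides.

```python
import numpy as np

def pulse_or_noise_excitation(f0_per_frame, frame_len=80, sample_rate=16000, seed=0):
    """Build an excitation signal frame by frame.

    f0_per_frame: F0 in Hz for voiced frames, 0 for unvoiced frames.
    """
    rng = np.random.default_rng(seed)
    excitation = np.zeros(len(f0_per_frame) * frame_len)
    next_pulse = 0.0  # sample position of the next pitch pulse
    for i, f0 in enumerate(f0_per_frame):
        start, end = i * frame_len, (i + 1) * frame_len
        if f0 > 0:
            # Voiced: pulse train with period sample_rate / f0.
            period = sample_rate / f0
            next_pulse = max(next_pulse, start)
            while next_pulse < end:
                excitation[int(next_pulse)] = np.sqrt(period)  # roughly unit power
                next_pulse += period
        else:
            # Unvoiced: white Gaussian noise.
            excitation[start:end] = rng.standard_normal(frame_len)
    return excitation

# Two unvoiced frames, three voiced frames at 100 Hz, one unvoiced frame.
excitation = pulse_or_noise_excitation([0, 0, 100, 100, 100, 0])
```

Because each frame is either all pulses or all noise, sounds that mix both kinds of excitation cannot be represented with this scheme.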
Advantages
1. Its voice characteristics can be easily modified.
2. It can be applied to various languages with little modification.
3. A variety of speaking styles or emotional speech can be synthesized using a small amount of speech data.
4. Techniques developed in ASR can be easily applied.
5. Its footprint is relatively small.
An HMM-based TTS system produced the best results in the Blizzard Challenge.
Problems we tried to solve
1. Marking each frame as either voiced or unvoiced degrades quality, because most voiced speech contains some unvoiced components, and some phonemes have mixed excitation.
2. The speech signal analysis/synthesis techniques and parameters used also degrade quality.
Multi-Band Excitation
In MBE (Multi-Band Excitation), speech is divided into a number of frequency bands, and voicing is estimated in each band (we used 17 bands).
Mixed Excitation
In synthesis, periodic and noise excitations are mixed according to the voicing parameters.
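One way to picture this mixing is in the frequency domain: each band takes a weighted blend of the periodic and noise spectra. The sketch below assumes equal-width bands, voicing values between 0 and 1, and a simple amplitude weighting; these details and the function name are illustrative assumptions, not the system's actual design.

```python
import numpy as np

def mix_excitation(pulse_frame, noise_frame, band_voicing):
    """Mix periodic and noise excitation per frequency band.

    band_voicing: one value in [0, 1] per band (1 = fully voiced).
    Bands are assumed to be equal-width slices of the spectrum.
    """
    pulse_spec = np.fft.rfft(pulse_frame)
    noise_spec = np.fft.rfft(noise_frame)
    num_bins = len(pulse_spec)
    num_bands = len(band_voicing)
    # Map every FFT bin to the index of the band that contains it.
    band_of_bin = np.arange(num_bins) * num_bands // num_bins
    v = np.asarray(band_voicing, dtype=float)[band_of_bin]
    mixed_spec = v * pulse_spec + (1.0 - v) * noise_spec
    return np.fft.irfft(mixed_spec, n=len(pulse_frame))

frame_len = 512
pulse = np.zeros(frame_len)
pulse[::160] = 1.0  # pulse train with a 160-sample period
noise = np.random.default_rng(0).standard_normal(frame_len)
# 17 bands: low bands voiced, high bands noisy (made-up values).
voicing = np.array([1.0] * 8 + [0.0] * 9)
mixed = mix_excitation(pulse, noise, voicing)
```

Setting every band to 1 or every band to 0 recovers the pure pulse or pure noise excitation of the baseline, so the binary decision becomes a special case.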
Spectral Envelope Estimation
Find values for a fixed number of samples
Use sinusoidal model for synthesis
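A minimal sketch of sinusoidal synthesis from envelope samples is shown below. It assumes the envelope is sampled at frequencies evenly spaced from 0 Hz to the Nyquist frequency, uses linear interpolation to read amplitudes at the harmonics of f0, and sets all phases to zero. The slides do not specify any of these details, so treat them as placeholders.

```python
import numpy as np

def synthesize_harmonics(envelope_samples, f0, num_samples, sample_rate=16000):
    """Sum sinusoids at multiples of f0 with amplitudes read off the envelope.

    envelope_samples: amplitudes at frequencies evenly spaced from 0 Hz to
    Nyquist (an assumed layout).
    """
    nyquist = sample_rate / 2.0
    sample_freqs = np.linspace(0.0, nyquist, len(envelope_samples))
    harmonic_freqs = np.arange(f0, nyquist, f0)
    amplitudes = np.interp(harmonic_freqs, sample_freqs, envelope_samples)
    t = np.arange(num_samples) / sample_rate
    # Zero phase for every harmonic is a simplification.
    phases = 2.0 * np.pi * harmonic_freqs[:, None] * t
    return (amplitudes[:, None] * np.cos(phases)).sum(axis=0)

# Flat envelope with 32 samples, 100 Hz pitch, two pitch periods of output.
voiced = synthesize_harmonics(np.ones(32), 100.0, 320)
```

Keeping the number of envelope samples fixed gives every frame a feature vector of the same length regardless of pitch, which is what HMM training requires.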
Modified System
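Training Part
– Speech Database → F0 Extraction → f0
– Speech Database → Spectral Envelope Analysis → spectral envelope samples
– Speech Database → Bands Voicing detection → bands voicing
– Text Analysis → labels and context features
– f0, spectral envelope samples, bands voicing, labels and context features → HMM Training → Models
Synthesis Part
– Text → Text Analysis → labels and context features
– Labels and context features + Models → Parameters Generation → spectral envelope samples, f0, bands voicing
– Spec. env. samples + f0 → Harmonics Synthesis → voiced speech
– Noise + STFT filter → unvoiced speech
– Voiced speech, unvoiced speech, bands voicing → Bands Mixing → Speech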
Result
[Figure: MOS scores (y-axis: Score, 1 to 5) for the Baseline System, Baseline + MBE, and the Proposed System.]
Other Challenges
Speech is overly smoothed.
– Use global variance.
Modeling accuracy: the system uses the same modeling as recognition.
– Hidden semi-Markov models (duration).
– Trajectory HMMs.
– Minimum generation error training.
– More state clusters and use of acoustic context (under research).
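For the over-smoothing item above, the global variance of an utterance is the per-dimension variance of a parameter trajectory over all frames. The sketch below only computes that statistic on made-up toy data; it does not implement generation that takes global variance into account.

```python
import numpy as np

def global_variance(trajectory):
    """Per-dimension variance over all frames of one utterance (shape T x D)."""
    return np.var(np.asarray(trajectory, dtype=float), axis=0)

# Toy one-dimensional trajectories (illustrative values only).
natural = np.array([[0.0], [2.0], [0.0], [2.0]])
smoothed = np.array([[0.5], [1.5], [0.5], [1.5]])
gv_natural = global_variance(natural)
gv_smoothed = global_variance(smoothed)
```

A smoothed trajectory keeps the same mean but has a smaller global variance, which is the symptom that global variance methods try to correct.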
More State Clusters
Instead of computing one Gaussian per state, we store all occurrences and record the context of each occurrence.
At synthesis, we get the best sequence using dynamic programming.
Previous Current Next
…
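The dynamic programming step can be sketched as a Viterbi-style search that picks one stored occurrence per state. The two cost functions are hypothetical placeholders supplied by the caller; the slides do not say how occurrences are scored, so this shows only the general search pattern.

```python
def select_best_sequence(candidates, target_cost, join_cost):
    """Pick one occurrence per state so that the total cost is minimal.

    candidates: list with one list of stored occurrences per state.
    target_cost(state_index, occ) and join_cost(prev_occ, occ) are
    hypothetical cost functions (assumptions, not from the slides).
    """
    best = [target_cost(0, c) for c in candidates[0]]
    back = []
    for s in range(1, len(candidates)):
        new_best, pointers = [], []
        for c in candidates[s]:
            costs = [best[j] + join_cost(p, c)
                     for j, p in enumerate(candidates[s - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min] + target_cost(s, c))
            pointers.append(j_min)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final occurrence.
    idx = min(range(len(best)), key=best.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [candidates[s][i] for s, i in enumerate(path)], min(best)

# Toy example: occurrences are single numbers, targets are made up.
targets = [1.0, 2.0, 3.0]
candidates = [[0.0, 1.0], [5.0, 2.0], [3.0, 9.0]]
sequence, total_cost = select_best_sequence(
    candidates,
    target_cost=lambda s, occ: abs(occ - targets[s]),
    join_cost=lambda prev, occ: abs(occ - prev),
)
```

The search is linear in the number of states and quadratic in the number of occurrences per state, so pruning the candidate lists matters when many occurrences are stored.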
Thank You