04/08/04 why speech synthesis is hard chris brew the ohio state university

04/08/04

Why Speech Synthesis is Hard

Chris Brew

The Ohio State University

Issues for text-to-speech

It should sound like a person

AND it should sound like a person who can read

AND it should sound like a person who understands what they are reading

Credits

FESTIVAL: Alan W. Black, Paul Taylor, Simon King, Kevin Lenzo

Huang, Acero and Hon: Spoken Language Processing

Many web-based demos

– http://www.ims.uni-stuttgart.de/~moehler/synthspeech/examples.html

– http://www.icsi.berkeley.edu/eecs225d/klatt.html

Text-to-speech

Text and Phonetic Analysis: What to say

Prosody: How to say it

Waveform synthesis: Making it sound right

Text and phonetic processing

Homographs

Letter-to-sound

Abbreviations

Prosody

Pauses

Pitch

Speech rate / relative duration

Waveform generation

Articulatory Synthesis

– Simulation of mechanics of speech production

Formant Synthesis

– Source/filter model.

Concatenative synthesis

– Limited domain waveform concatenation

– No waveform modification

– With waveform modification

Waveform generation

Use linear predictive coding (LPC) to analyse the signal into a filter and a residual, then excite the filter with an appropriate residual. Main benefit: compression.
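
The filter/residual split can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the autocorrelation method with the Levinson-Durbin recursion (the slide does not say which LPC variant is meant). The function names, the test signal, and the order of 4 are all made up for the example.

```python
import math
import random

def autocorrelation(signal, max_lag):
    """r[k] = sum over n of signal[n] * signal[n - k], for k = 0..max_lag."""
    n = len(signal)
    return [sum(signal[i] * signal[i - k] for i in range(k, n))
            for k in range(max_lag + 1)]

def lpc_coefficients(signal, order):
    """Levinson-Durbin recursion. Returns a[1..order] such that
    signal[n] is predicted by sum_k a[k] * signal[n - k]."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:]

def predict(history, coeffs, n):
    """Predict sample n from earlier samples (treated as zero before the start)."""
    return sum(c * history[n - 1 - j]
               for j, c in enumerate(coeffs) if n - 1 - j >= 0)

def analyse(signal, coeffs):
    """Inverse filtering: the residual is what the predictor fails to explain."""
    return [signal[n] - predict(signal, coeffs, n) for n in range(len(signal))]

def synthesise(residual, coeffs):
    """Excite the filter with the residual to rebuild the waveform."""
    out = []
    for n, e in enumerate(residual):
        out.append(e + predict(out, coeffs, n))
    return out

# Hypothetical test signal: two sinusoids plus a little noise.
rng = random.Random(0)
signal = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) + 0.01 * rng.uniform(-1, 1)
          for n in range(200)]
coeffs = lpc_coefficients(signal, order=4)
residual = analyse(signal, coeffs)
rebuilt = synthesise(residual, coeffs)
```

Exciting the filter with the exact residual gives back the original signal. The residual carries less energy than the signal, which is where the compression benefit comes from.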

One slide of speech acoustics

Formants - bands of strong energy in the speech signal

Spectrogram - representation of the relation between time (x-axis), frequency (y-axis) and intensity (darkness or color)

The speech organs consist of a noise source and some resonant cavities. We speak by changing the shape of the cavities, making some parts of the source come out strong, others weaker.
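
As a rough sketch of how a spectrogram is computed: cut the signal into overlapping frames, window each frame, and take the magnitude of its Fourier transform. The frame length, hop size, Hann window, and the synthetic two-tone input are all assumptions made for this example. A direct DFT is used to keep it dependency-free; real tools use an FFT.

```python
import cmath
import math

def spectrogram(signal, frame_len=64, hop=32):
    """Return a list of frames; each frame is a list of magnitudes,
    one per frequency bin from 0 up to frame_len // 2."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Hann window to reduce leakage at the frame edges.
        windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1)))
                    for i, x in enumerate(frame)]
        magnitudes = []
        for k in range(frame_len // 2 + 1):
            total = sum(x * cmath.exp(-2j * math.pi * k * i / frame_len)
                        for i, x in enumerate(windowed))
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames

def peak_bin(magnitudes):
    """Index of the strongest frequency bin in one frame."""
    return max(range(len(magnitudes)), key=lambda k: magnitudes[k])

# Hypothetical input: a tone at bin 8 that switches to bin 20 halfway through.
tone = [math.sin(2 * math.pi * (8 if n < 128 else 20) * n / 64)
        for n in range(256)]
frames = spectrogram(tone)
```

Plotting `frames` with time on the x-axis and bin index on the y-axis would show a band of strong energy that jumps upward halfway through, a crude stand-in for a moving formant.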

Sound like a person

Get a person to record the whole vocabulary, then splice together the words to make sentences.

But: speech is hard to cut up in such a way that it sews back together nicely.
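
One simple way to soften a join is a linear crossfade over a short overlap. This is only an illustrative sketch with a made-up function name and toy sample values. It is not how any particular synthesizer does it, and it does nothing about mismatched pitch or formants at the join, which is the hard part.

```python
def crossfade_concat(left, right, overlap):
    """Join two waveform units, blending the last `overlap` samples of
    `left` with the first `overlap` samples of `right`."""
    if overlap <= 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be between 1 and the shorter unit length")
    out = list(left[:-overlap])
    tail = left[-overlap:]
    for i in range(overlap):
        weight = (i + 1) / (overlap + 1)  # ramps from near 0 to near 1
        out.append((1 - weight) * tail[i] + weight * right[i])
    out.extend(right[overlap:])
    return out

# Toy units: a constant 1.0 segment followed by a constant 0.0 segment.
joined = crossfade_concat([1.0] * 10, [0.0] * 10, overlap=4)
```

The hard cut from 1.0 to 0.0 becomes a four-sample ramp, and the result is 16 samples long because the overlap is shared.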

Sound like a person who can read

Grapheme to phoneme conversion.

Input: text

Output: phoneme string + annotations for stress and intonation.

Spelling rules get you some of the way, but even in languages with regular spelling (English is not among these) exceptions require the use of a dictionary.
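
The "dictionary first, rules as fallback" arrangement can be sketched as follows. The letter-to-sound table is a deliberately crude toy and the ARPAbet-style symbols are an assumption. Neither is taken from Festival or any real lexicon, and stress annotation is left out.

```python
# Hypothetical exception dictionary for words the rules get wrong.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
}

# Toy letter-to-sound rules: two-letter graphemes are tried before single letters.
DIGRAPHS = {"sh": "SH", "ch": "CH", "th": "TH", "ng": "NG"}
LETTERS = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "r": "R", "s": "S",
    "t": "T", "u": "AH", "v": "V", "w": "W", "y": "Y", "z": "Z",
}

def rules_only(word):
    """Apply the spelling rules left to right, ignoring the dictionary."""
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        else:
            if word[i] in LETTERS:  # letters without a rule are skipped
                phones.append(LETTERS[word[i]])
            i += 1
    return phones

def to_phonemes(word):
    """Look the word up in the exception dictionary, else fall back to rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return rules_only(word)
```

The rules handle "ship" and "cat", but `rules_only("two")` gives T W AA, so the dictionary entry has to override it.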

Text Normalization

Henry V Part I, Act II scene 11, Mr. X is, I believe V.I. Lenin and not Charles I.
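
The sentence packs several ambiguities together: "V" and "I" as regnal numbers, "I" and "II" as section numbers, "I" as a pronoun, "X" as a letter, "V.I." as initials, and "Mr." as an abbreviation. Below is a minimal sketch of context rules that happen to handle this one sentence. The word lists and rules are invented for the example and would fail quickly on other text.

```python
import re

ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "X": 10}
CARDINAL = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
            10: "ten", 11: "eleven"}
ORDINAL = {1: "the first", 2: "the second", 3: "the third",
           4: "the fourth", 5: "the fifth", 10: "the tenth"}
SECTION_WORDS = {"part", "act", "scene", "chapter"}  # hypothetical list
REGNAL_NAMES = {"henry", "charles"}                  # hypothetical list
TITLES = {"mr": "mister", "dr": "doctor"}

def normalize(text):
    out = []
    prev = ""
    for token in text.split():
        core = token.rstrip(",;:.")
        punct = token[len(core):]
        if re.fullmatch(r"(?:[A-Z]\.){2,}", token):
            # Initials such as "V.I." are spelled out letter by letter.
            spoken = " ".join(token.replace(".", " ").split())
            punct = ""
        elif core.lower() in TITLES and punct.startswith("."):
            spoken = TITLES[core.lower()]
            punct = punct[1:]  # the period belonged to the abbreviation
        elif core in ROMAN and prev in SECTION_WORDS:
            spoken = CARDINAL[ROMAN[core]]   # "Part I" -> "Part one"
        elif core in ROMAN and prev in REGNAL_NAMES:
            spoken = ORDINAL[ROMAN[core]]    # "Henry V" -> "Henry the fifth"
        elif core.isdigit() and int(core) in CARDINAL:
            spoken = CARDINAL[int(core)]
        else:
            spoken = core  # pronoun "I", letter "X", ordinary words
        out.append(spoken + punct)
        prev = core.lower()
    return " ".join(out)

result = normalize("Henry V Part I, Act II scene 11, Mr. X is, "
                   "I believe V.I. Lenin and not Charles I.")
```

Every rule depends on the neighbouring word, which is why normalization cannot be done token by token in isolation.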

Specialized text types

Smith,Bobbie Q,3337 St Laurence St, Fort Worth,TX 71611-5484 (817) 839-3689

Anderson, W, 445 Sycamore Way NE, Lincoln, NE 98125-5108,(212)404-9998

Raw

Address
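
The two records show why a dedicated address mode helps: "St" is "Saint" before a name but "Street" at the end of a street line, and "NE" is "Northeast" in the street line but "Nebraska" in the state field. Here is a small sketch of field-aware expansion. The tables and function names are hypothetical, and it assumes the record has already been split into fields.

```python
DIRECTIONS = {"NE": "Northeast", "NW": "Northwest",
              "SE": "Southeast", "SW": "Southwest"}
STATES = {"NE": "Nebraska", "TX": "Texas"}  # only the states in the examples

def expand_street(street):
    """Expand abbreviations inside the street field of an address."""
    words = street.split()
    out = []
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if word == "St":
            # "St" closing the street name (or before a direction) is "Street";
            # "St" in front of a name is "Saint".
            if is_last or words[i + 1] in DIRECTIONS:
                out.append("Street")
            else:
                out.append("Saint")
        elif word in DIRECTIONS and i > 0:
            out.append(DIRECTIONS[word])
        else:
            out.append(word)
    return " ".join(out)

def expand_state(code):
    """Expand a two-letter state code, leaving unknown codes untouched."""
    return STATES.get(code, code)
```

With these rules `expand_street("3337 St Laurence St")` gives "3337 Saint Laurence Street", while the same "NE" token becomes "Northeast" in the street field and "Nebraska" in the state field.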

SABLE

See rinss-slides

Sound like you understand

Lexical stress and intonation matter very much, and tie in with pragmatics.

The system doesn’t in fact understand enough to get this right.

Best you can do is fake it. There are lots of cues available in the text, but mistakes are inevitable.

Rumpke Advert

Rhetorical Systems

Definitely wrong

Possibly good enough

Multilingual and flexible

Festival is open-architecture, and has been extended by lots of people

It can even (easily) be made to speak in your voice.

Prosody

Boston

It will be rainy today in Boston

Challenges for speech synthesis

Improve overall speech quality

Refine ways of organizing and collecting speech databases

Improve the quality of the control signal

Sounds
