6. Text-to-Speech Synthesis - Xavier Anguera
6. Text-to-Speech Synthesis
(Most of these slides come from Dan Jurafsky’s course at Stanford)
Tractament Digital de la Parla
History of Speech Synthesis
• In 1779, the Danish scientist Christian Kratzenstein builds models of the human vocal tract that can produce the five long vowel sounds.
• In 1791, Wolfgang von Kempelen (the creator of the Turk, a chess-playing machine) devises the bellows-operated “automatic-mechanical speech machine”. It added models of the tongue and lips that allowed it to produce vowels and consonants.
• In 1837, Charles Wheatstone produces a "speaking machine" based on von Kempelen's design.
• In the 1930s, Bell Labs developed the VOCODER, an electronic speech analyzer and synthesizer that was said to be clearly intelligible. Homer Dudley then refined it into the VODER, a keyboard-operated synthesizer that was exhibited at the 1939 New York World's Fair.
Von Kempelen:
• Small whistles controlled consonants
• Rubber mouth and nose; nose had to be covered with two fingers for non-nasals
• Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air
From Traunmüller’s web site
Von Kempelen’s speaking machine
Bell Labs VOCODER machine
Homer Dudley 1939 VODER
• Synthesizing speech by electrical means
• 1939 World’s Fair
Homer Dudley’s VODER
• Manually controlled through complex keyboard
• Operator training was a problem
One of the first “talking” computers
Closer to a natural vocal tract: Riesz 1937
The UK Speaking Clock
• July 24, 1936
• Photographic storage on 4 glass disks
• 2 disks for minutes, 1 for hours, 1 for seconds
• Other words in sentence distributed across 4 disks, so all 4 used at once
• Voice of “Miss J. Cain”
The 1936 UK Speaking Clock
From http://web.ukonline.co.uk/freshwater/clocks/spkgclock.htm
The UK speaking clock
Gunnar Fant’s OVE synthesizer
• Of the Royal Institute of Technology, Stockholm
• Formant synthesizer for vowels
• F1 and F2 could be controlled
From Traunmüller’s web site
Cooper’s Pattern Playback
• Haskins Labs for investigating speech perception
• Works like an inverse of a spectrograph
• Light from a lamp goes through a rotating disk, then through the spectrogram into photovoltaic cells
• Thus amount of light that gets transmitted at each frequency band corresponds to amount of acoustic energy at that band
Cooper’s Pattern Playback
Modern TTS systems
• 1960s: first full TTS: Umeda et al. (1968)
• 1970s
  – Joe Olive 1977: concatenation of linear-prediction diphones
  – Texas Instruments Speak and Spell
    • June 1978
    • Paul Breedlove
• 1980s
  – 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
• 1990s-present
  – Diphone synthesis
  – Unit selection synthesis
  – HMM synthesis
Speak and spell demo
TTS Demos (Unit-Selection)
• Cereproc
  – Catalan
  – Spanish
• Festival (open source)
  – English
• AT&T
  – English
  – South American
Types of Waveform Synthesis
• Articulatory Synthesis:
  – Model movements of articulators and acoustics of vocal tract
• Formant Synthesis:
  – Start with acoustics, create rules/filters to create each formant
• Concatenative Synthesis:
  – Use databases of stored speech to assemble new utterances
    • Diphone
    • Unit Selection
• Statistical (HMM) Synthesis:
  – Trains parameters on databases of speech
Text modified from Richard Sproat slides
1st gen. synthesis: Formant Synthesis
• Were the most common commercial systems when computers were slow and had little memory.
• 1979 MIT MITalk (Allen, Hunnicutt, Klatt)
• 1983 DECtalk system
  – “Perfect Paul” (the voice of Stephen Hawking)
  – “Beautiful Betty”
2nd Generation Synthesis
• Diphone Synthesis
  – Units are diphones: middle of one phone to middle of next
  – Why? Middle of phone is steady state
  – Record 1 speaker saying each diphone
  – ~1400 recordings
  – Paste them together and modify prosody
3rd Generation Synthesis
• All current commercial systems.
• Unit Selection Synthesis
  – Larger units of variable length
  – Record one speaker speaking 10 hours or more
    • Have multiple copies of each unit
  – Use search to find best sequence of units
• Hidden Markov Model Synthesis
  – Train a statistical model on large amounts of data
General TTS architecture
Raw text in
Text analysis:
• Text normalization
• Part-of-speech tagging
• Homograph disambiguation
Phonetic analysis:
• Dictionary lookup
• Grapheme-to-phoneme (LTS)
Prosodic analysis:
• Boundary placement
• Pitch accent assignment
• Duration computation
Waveform synthesis
Speech out
1. Text Normalization
• Analysis of raw text into pronounceable words
• Sample problems:
  – He stole $100 million from the bank
  – It's 13 St. Andrews St.
  – The home page is http://www.stanford.edu
  – yes, see you the following tues, that's 11/12/01
• Steps:
  – Identify tokens in text
  – Chunk tokens into reasonably sized sections
  – Map tokens to words
  – Identify types for words
Identify tokens and chunk them
• Whitespace can be viewed as separators
• Punctuation can be separated from the raw tokens
• Festival converts text into
  – ordered list of tokens
  – each with features:
    • its own preceding whitespace
    • its own succeeding punctuation
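The token structure described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea (an ordered list of tokens, each remembering its preceding whitespace and its trailing punctuation); it is not Festival's actual implementation, and the `Token` class, its field names and the punctuation set are assumptions made for the example.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    name: str        # token text with trailing punctuation stripped
    whitespace: str  # whitespace that preceded the token
    punc: str        # punctuation that followed the token

# Assumed patterns: a chunk is optional whitespace plus a run of non-space
# characters; trailing punctuation is drawn from a small hand-picked set.
_CHUNK = re.compile(r"(\s*)(\S+)")
_TRAILING_PUNC = re.compile(r"^(.*?)([.,;:!?)\]]*)$")

def tokenize(text):
    """Split text into an ordered list of Token objects."""
    tokens = []
    for whitespace, chunk in _CHUNK.findall(text):
        body, punc = _TRAILING_PUNC.match(chunk).groups()
        if not body:  # the chunk was punctuation only, keep it as the name
            body, punc = chunk, ""
        tokens.append(Token(body, whitespace, punc))
    return tokens

for token in tokenize("He stole $100 million, from the bank."):
    print(repr(token.whitespace), token.name, repr(token.punc))
```

Note that "St." would come out as the token "St" with punctuation ".", which is exactly the ambiguity the end-of-utterance step has to resolve.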
End-of-utterance detection
• Relatively simple if utterance ends in ? or !
• But what about the ambiguity of “.”?
• Ambiguous between end-of-utterance and end-of-abbreviation
  – My place on Winfield St. is around the corner.
  – I live at 151 Winfield St.
  – (Not “I live at 151 Winfield St..”)
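A minimal hand-written heuristic for this ambiguity might look as follows. It is a sketch under stated assumptions, not the method of any particular system: the abbreviation list is a tiny hypothetical one, and the rule "a period after an abbreviation ends the utterance only if nothing follows or the next word is capitalized" is just one plausible choice.

```python
# Hypothetical and far from complete abbreviation list.
ABBREVIATIONS = {"st.", "dr.", "mr.", "mrs."}

def split_utterances(text):
    """Split text into utterances using a simple period heuristic."""
    words = text.split()
    utterances, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        following = words[i + 1] if i + 1 < len(words) else None
        if word.endswith(("?", "!")):
            is_end = True
        elif word.endswith("."):
            if word.lower() in ABBREVIATIONS:
                # Abbreviation: treat as an end only if nothing follows
                # or the next word starts with a capital letter.
                is_end = following is None or following[0].isupper()
            else:
                is_end = True
        else:
            is_end = False
        if is_end:
            utterances.append(" ".join(current))
            current = []
    if current:
        utterances.append(" ".join(current))
    return utterances

print(split_utterances(
    "My place on Winfield St. is around the corner. I live at 151 Winfield St."
))
```

The heuristic handles both Winfield St. examples, but it would wrongly split "13 St. Andrews St." after the first "St.", which is why real systems tend to use richer context or learned classifiers.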
Identify Types of Tokens, and Convert Tokens to Words
• Pronunciation of numbers often depends on type. 3 ways to pronounce 1776:
  – 1776 date: seventeen seventy six
  – 1776 phone number: one seven seven six
  – 1776 quantifier: one thousand seven hundred (and) seventy six
  – Also:
    • 25 day: twenty-fifth
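The three readings of 1776 can be reproduced with a small type-driven expander. The sketch below assumes the token type is already known; the type labels ("date", "phone", "quantity") and all function names are made up for the example, and the year rule only covers simple cases (a year such as 2000 would need extra rules).

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def two_digits(n):
    """Spell out a number in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")

def as_quantity(n):
    """Spell out a number in the range 0..9999 as a cardinal."""
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    parts = []
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)

def as_year(n):
    """Read a four-digit year as two pairs of digits (simple cases only)."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)

def as_digits(text):
    """Read every digit separately, as in a phone number."""
    return " ".join(ONES[int(ch)] for ch in text)

# Hypothetical type labels mapped to their expansion rule.
EXPANDERS = {
    "date": lambda token: as_year(int(token)),
    "phone": as_digits,
    "quantity": lambda token: as_quantity(int(token)),
}

def expand(token, token_type):
    return EXPANDERS[token_type](token)

for token_type in ("date", "phone", "quantity"):
    print(token_type, "->", expand("1776", token_type))
```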
Step 2: Classify token into 1 of 20 types
• EXPN: abbreviations, contractions (adv, N.Y., mph, gov’t)
• LSEQ: letter sequence (CIA, D.C., CDs)
• ASWD: read as word, e.g. CAT, proper names
• MSPL: misspelling
• NUM: number (cardinal) (12, 45, 1/2, 0.6)
• NORD: number (ordinal), e.g. May 7, 3rd, Bill Gates II
• NTEL: telephone (or part), e.g. 212-555-4523
• NDIG: number as digits, e.g. Room 101
• NIDE: identifier, e.g. 747, 386, I5, PC110
• NADDR: number as street address, e.g. 5000 Pennsylvania
• NZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT, URL, etc.
• SLNT: not spoken (KENT*REALTY)
POS Tagging: Definition
• The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
WORDS: the koala put the keys on the table
TAGS: N, V, P, DET
Part of speech tagging
• 8 (ish) traditional parts of speech
  – Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
  – This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
  – Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS
  – We’ll use POS most frequently
POS examples
• N (noun): chair, bandwidth, pacing
• V (verb): study, debate, munch
• ADJ (adjective): purple, tall, ridiculous
• ADV (adverb): unfortunately, slowly
• P (preposition): of, by, to
• PRO (pronoun): I, me, mine
• DET (determiner): the, a, that, those
POS Tagging example
WORD tag
the DET
koala N
put V
the DET
keys N
on P
the DET
table N
POS tagging: Choosing a tagset
• There are so many parts of speech and potential distinctions we can draw
• To do POS tagging, we need to choose a standard set of tags to work with
• Could pick very coarse tagsets
  – N, V, Adj, Adv.
• More commonly used set is finer grained, the “UPenn TreeBank tagset”, 45 tags
  – PRP$, WRB, WP$, VBG
• Even more fine-grained tagsets exist
Penn TreeBank POS Tag set
POS importance
• Part of speech tagging plays an important role in TTS
• Most algorithms get 96-97% tag accuracy
• Not a lot of studies on whether remaining error tends to cause problems in TTS
1/5/07
Phonetic analysis
• Deals with the pronunciation of each word
• Finds, for each word, the way to pronounce it
• Straightforward method would be a dictionary lookup, but:
  – Unknown words and proper names will be missing
  – Words in other languages
Phoneme sets
Converting from words to phones
• Two methods:
  – Dictionary-based
  – Rule-based (through letter-to-sound, LTS, rules)
• All early systems used LTS. The first to use a dictionary was MITalk (10K words)
• A good dictionary example is the CMU dictionary, with 127K words
When dictionaries aren’t sufficient
• Unknown words
  – Seem to be linear with number of words in unseen text
  – Mostly person, company, product names
  – But also foreign words, etc.
  – From a Black et al. analysis
    • Of 39K tokens in part of the Wall Street Journal
    • 1775 (4.6%) were not in the OALD dictionary
• So commercial systems have a 3-part system:
  – Big dictionary
  – Special code for handling names
  – Machine-learned LTS system for other unknown words
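The 3-part fallback chain can be sketched as below. Everything in the sketch is a toy assumption: the lexicon and name entries are invented for illustration, the capitalization test for "looks like a name" is a placeholder, and `toy_lts` simply emits one symbol per letter where a real system would apply learned letter-to-sound rules.

```python
# Toy data, invented for illustration only.
LEXICON = {
    "the": ["dh", "ax"],
    "table": ["t", "ey", "b", "ax", "l"],
}
NAMES = {
    "winfield": ["w", "ih", "n", "f", "iy", "l", "d"],
}

def name_handler(word):
    """Stand-in for special name-handling code; returns None if unknown."""
    return NAMES.get(word.lower())

def toy_lts(word):
    """Placeholder for a learned LTS model: one symbol per letter."""
    return [ch for ch in word.lower() if ch.isalpha()]

def pronounce(word):
    """Return (phones, source) using dictionary, then names, then LTS."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key], "dictionary"
    if word[:1].isupper():  # crude guess that the word may be a name
        phones = name_handler(word)
        if phones is not None:
            return phones, "names"
    return toy_lts(word), "lts"

for word in ("the", "Winfield", "blorp"):
    print(word, pronounce(word))
```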
Learning LTS rules
• Rules for Spanish are straightforward
  – The grapheme sequence (written form) is very similar to the phoneme sequence
• In the past, most big systems employed many linguists (or PhD students) to manually write LTS rules
• Now they are mostly induced automatically from a dictionary of the language
  – An important step is the phoneme-grapheme alignment
Homograph disambiguation
• A homograph is a word (or group of words) that shares its written form with another but has a different meaning.
• Homographs may be pronounced the same (homonyms) or differently (heteronyms).
• It is very important to detect and disambiguate them for correct pronunciation
• Examples:
  – English: wood (material) / wood (forest), bear (to endure) / bear (the animal), read (present tense) / read (past tense)
  – Spanish: all homographs are homonyms
Next steps after text analysis
• After text analysis we have a string of phones. We also have some semantic information for prosodic analysis, so the next steps are:
• Prosody
  – Desired F0 for entire utterance
  – Duration for each phone
  – Stress value for each phone, possibly accent value
• Generate waveforms
Prosody
Prosody is in charge of converting words+phones into boundaries, accent, F0 and duration information
• Prosodic phrasing
  – Need to break utterances into phrases
  – Punctuation is useful, not sufficient
• Accents:
  – Prediction of accents: which syllables should be accented
  – Realization of F0 contour: given accents/tones, generate F0 contour
• Duration:
  – Predicting duration of each phone
Graphic representation of F0
[Figure: F0 contour (in Hertz, axis from 50 to 400) over time for "legumes are a good source of VITAMINS"]
Slide from Jennifer Venditti
Alternative accents
• We might want to give prominence to different parts of a sentence depending on the context.
• For example, imagine answering a question on the previous example
Q1: What types of foods are a good source of vitamins?
[Figure: F0 contour (axis from 50 to 400 Hz) for "LEGUMES are a good source of vitamins"]
Slide from Jennifer Venditti
Q2: Are legumes a source of vitamins?
[Figure: F0 contour (axis from 50 to 400 Hz) for "Legumes are a GOOD source of vitamins"]
Slide from Jennifer Venditti
Q3: I’ve heard that legumes are healthy, but what are they a good source of ?
[Figure: F0 contour (axis from 50 to 400 Hz) for "legumes are a good source of VITAMINS"]
Slide from Jennifer Venditti
Duration
• Simplest: fixed size for all phones (100 ms)
• Next simplest: average duration for that phone (from training data). Samples from SWBD in ms:
  – aa 118, b 68
  – ax 59, d 68
  – ay 138, dh 44
  – eh 87, f 90
  – ih 77, g 66
• Next next simplest: add in phrase-final and initial lengthening plus stress
• Lots of fancy models of duration prediction:
  – Using Z-scores and other clever normalizations
  – Sum-of-products model
  – New features like word predictability
    • Words with higher bigram probability are shorter
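The first three duration models on this slide can be sketched as a single function: a 100 ms default, the per-phone SWBD averages listed above, and multiplicative lengthening for stressed and phrase-final phones. The two lengthening factors are hypothetical values chosen for the example, not figures from the slides.

```python
# Average durations in ms, taken from the SWBD samples on the slide.
MEAN_MS = {"aa": 118, "ax": 59, "ay": 138, "eh": 87, "ih": 77,
           "b": 68, "d": 68, "dh": 44, "f": 90, "g": 66}

def predict_durations(phones, stressed=(), default_ms=100,
                      stress_factor=1.2, final_factor=1.4):
    """Predict a duration in ms for each phone of one phrase.

    stressed: indices of phones that carry stress.
    stress_factor and final_factor are assumed values for illustration.
    """
    durations = []
    last = len(phones) - 1
    for i, phone in enumerate(phones):
        ms = MEAN_MS.get(phone, default_ms)  # fall back to the fixed size
        if i in stressed:
            ms *= stress_factor
        if i == last:
            ms *= final_factor  # phrase-final lengthening
        durations.append(round(ms))
    return durations

print(predict_durations(["b", "eh", "d"], stressed={1}))
```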
Waveform synthesis generation
• Given:
  – String of phones
  – Prosody
    • Desired F0 for entire utterance
    • Duration for each phone
    • Stress value for each phone, possibly accent value
• Generate:
  – Waveforms
The hourglass architecture
Internal Representation: Input to Waveform Synthesis
Waveform synthesis
• Concatenative synthesis systems
  – Diphone synthesis: join diphones to form the waveform.
  – Unit selection synthesis: join the longest unit possible.
• HMM-based synthesis
Diphones
• Mid-phone is more stable than edge
• Need O(phones²) number of units
  – Some combinations don’t exist (hopefully)
  – AT&T (Olive et al. 1998) system had 43 phones
    • 1849 possible diphones
    • Phonotactics ([h] only occurs before vowels), don’t need to keep diphones across silence
    • Only 1172 actual diphones
  – May include stress, consonant clusters
    • So could have more
  – Lots of phonetic knowledge in design
• Database relatively small (by today’s standards)
  – Around 8 megabytes for English (16 kHz, 16 bit)
Slide from Richard Sproat
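The numbers on this slide can be checked with a little arithmetic. The sketch below assumes mono audio and 1 megabyte = 10^6 bytes, neither of which is stated on the slide, so the resulting durations are rough estimates only.

```python
phones = 43
possible_diphones = phones ** 2          # every ordered pair of phones
actual_diphones = 1172                   # figure quoted on the slide

sample_rate_hz = 16_000
bytes_per_sample = 2                     # 16 bit, assuming mono
database_bytes = 8_000_000               # "around 8 megabytes", assuming 10**6 bytes per MB

total_seconds = database_bytes / (sample_rate_hz * bytes_per_sample)
seconds_per_diphone = total_seconds / actual_diphones

print(possible_diphones)                 # 1849
print(total_seconds)                     # 250.0
print(round(seconds_per_diphone, 3))     # about 0.213
```

Under these assumptions the whole database holds roughly four minutes of audio, or about a fifth of a second per stored diphone.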
Diphones
• Mid-phone is more stable than edge:
Diphone synthesis
• Training:
  – Choose units (kinds of diphones)
  – Record 1 speaker saying 1 example of each diphone
  – Mark the boundaries of each diphone
    • Cut each diphone out and create a diphone database
• Synthesizing an utterance:
  – Grab relevant sequence of diphones from database
  – Concatenate the diphones, doing slight signal processing at boundaries
  – Use signal processing to change the prosody (F0, energy, duration) of selected sequence of diphones
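The lookup-and-concatenate part of the synthesis step can be sketched as follows. It is a simplified illustration: the diphone naming scheme ("left-right"), the "pau" silence symbol and the in-memory database of sample lists are all assumptions, and the boundary smoothing and prosody modification steps are left out.

```python
def to_diphones(phones, silence="pau"):
    """Turn a phone string into diphone names, padding with silence."""
    padded = [silence, *phones, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

def concatenate(phones, database):
    """Fetch each diphone's samples from the database and join them."""
    samples = []
    for name in to_diphones(phones):
        # Raises KeyError if this diphone was never recorded.
        samples.extend(database[name])
    return samples

# Toy database: each diphone maps to a (very short) list of samples.
toy_database = {"pau-k": [0, 1], "k-ae": [2], "ae-t": [3, 4], "t-pau": [5]}

print(to_diphones(["k", "ae", "t"]))
print(concatenate(["k", "ae", "t"], toy_database))
```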
Diphone labelling
• Recorded sentences need to be labelled with phone start, end and middle (stable part). This can be done automatically with:
  – ASR forced alignment
  – Audio-to-synthetic-audio auto-alignment
Diphone auto-alignment
• Given
  – Synthesized prompts
  – Human speech of same prompts
• Do a dynamic time warping alignment of the two
  – Using Euclidean distance
• Works very well: 95%+
  – Errors are typically large (easy to fix)
  – Maybe even automatically detected
Slide from Richard Sproat
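A textbook dynamic time warping alignment with Euclidean frame distance can be written in plain Python as below. This is a generic sketch, not the specific tool used for the results quoted above; it assumes both signals have already been converted to sequences of feature vectors of equal dimension, and the function name `dtw` is chosen for the example.

```python
import math

def dtw(a, b):
    """Align two sequences of feature vectors.

    Returns (total_cost, path), where path is a list of (i, j) index
    pairs matching frame i of a with frame j of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of b
                                 cost[i][j - 1],      # skip a frame of a
                                 cost[i - 1][j - 1])  # match both
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path

synthetic = [(0.0,), (1.0,), (2.0,)]
human = [(0.0,), (1.0,), (1.0,), (2.0,)]
total, path = dtw(synthetic, human)
print(total, path)
```

Once the path is known, the phone boundaries of the synthesized prompt (which are known exactly) can be mapped through it onto the human recording.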
Dynamic Time Warping
Slide from Richard Sproat
Putting it all together
Once we have the diphones we want to join, we need to:
• Join them together to form the words and sentences
• Modify the speaking speed to reflect the target duration
• Modify the pitch to reflect the target intonation
Joining diphones
Given any group of diphones to join together, we can follow a set of techniques:
• Dumb:
– just join -> we will get many artifacts
– join at zero crossings
• Algorithms that allow joining and modifying the duration at the same time
– SOLA
– TD-PSOLA (Time-domain pitch-synchronous overlap-and-add)
– Necessary prerequisite in voiced signals: epoch labelling
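As a rough illustration of the "join at zero crossings" option, the sketch below trims the left unit at its last rising zero crossing and the right unit at its first one, so the waveform passes through zero at the join. The function names are hypothetical and the signals are plain Python lists of samples; this is only a sketch of the idea, and it does nothing about pitch or spectral mismatches, which is why the overlap-add methods are preferred.

```python
def last_rising_zero_crossing(x):
    """Index of the last sample where the signal goes from negative to non-negative."""
    for n in range(len(x) - 1, 0, -1):
        if x[n - 1] < 0 <= x[n]:
            return n
    return len(x)  # no crossing found: keep the whole unit

def first_rising_zero_crossing(x):
    """Index of the first sample where the signal goes from negative to non-negative."""
    for n in range(1, len(x)):
        if x[n - 1] < 0 <= x[n]:
            return n
    return 0  # no crossing found: keep the whole unit

def join_at_zero_crossings(left, right):
    """Concatenate two units so that the cut falls on a rising zero crossing."""
    cut_left = last_rising_zero_crossing(left)
    cut_right = first_rising_zero_crossing(right)
    # left ends just below zero, right resumes just above zero
    return left[:cut_left] + right[cut_right:]

left = [0.0, 1.0, 0.5, -0.5, -1.0, -0.2, 0.3, 0.9]
right = [-0.8, -0.4, -0.1, 0.2, 0.7]
joined = join_at_zero_crossings(left, right)
```

With these toy samples a plain concatenation would jump from 0.9 to -0.8, while the zero-crossing join only steps from -0.2 to 0.2.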
![Page 62: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/62.jpg)
Epoch-labeling
• An example of epoch-labeling using “SHOW PULSES” in Praat:
It can be easily done using an EGG, or (less accurately) via signal processing.
![Page 63: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/63.jpg)
Epoch-labeling: Electroglottograph (EGG)
• Also called laryngograph or Lx
– Device that straps onto the speaker’s neck near the larynx
– Sends a small high-frequency current through the Adam’s apple
– Human tissue conducts well; air not as well
– Transducer detects how open the glottis is (i.e. the amount of air between the folds) by measuring impedance.
Picture from UCLA Phonetics Lab
![Page 64: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/64.jpg)
Duration/pitch modification
If we play back a sound faster than usual (for example by resampling the signal), both pitch and length are affected. To alter only one of them:
• Time stretching: algorithms used to change the length/speed of a signal without altering its pitch
• Pitch scaling/shifting: algorithms used to change the pitch without affecting the length/speed.
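The coupling between pitch and length under naive resampling can be checked with a small experiment. The sketch below is only an illustration under simple assumptions: a pure 100 Hz tone sampled at 8 kHz, a 2x speed-up done by keeping every second sample, and a crude pitch estimate from the spacing of rising zero crossings. All function names are hypothetical.

```python
import math

def tone(freq_hz, fs, n_samples, phase=0.1):
    """Pure sine tone; the small phase offset avoids samples that are exactly zero."""
    return [math.sin(2 * math.pi * freq_hz * n / fs + phase) for n in range(n_samples)]

def speed_up_2x(x):
    """Naive 2x speed-up: keep every second sample (no anti-alias filter)."""
    return x[::2]

def estimate_f0(x, fs):
    """Crude pitch estimate from the mean spacing of rising zero crossings."""
    marks = [n for n in range(1, len(x)) if x[n - 1] < 0 <= x[n]]
    spacing = (marks[-1] - marks[0]) / (len(marks) - 1)
    return fs / spacing

fs = 8000
slow = tone(100, fs, 800)   # 0.1 s at 100 Hz
fast = speed_up_2x(slow)    # played back at the same sampling rate
f0_slow = estimate_f0(slow, fs)
f0_fast = estimate_f0(fast, fs)
```

The sped-up signal has half as many samples and its estimated pitch is twice as high, which is exactly the side effect that time-stretching and pitch-shifting algorithms are designed to avoid.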
![Page 65: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/65.jpg)
Speech as Short Term signals
Alan Black
![Page 66: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/66.jpg)
Duration modification • Duplicate/remove short term signals
![Page 67: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/67.jpg)
Pitch Modification • Move short-term signals closer together/further apart
Slide from Richard Sproat
![Page 68: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/68.jpg)
Windowing
• To avoid artifacts at the edges we first apply a window to the short-term signals
• y[n] = w[n]s[n]
![Page 69: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/69.jpg)
Windowing
• Multiply value of signal at sample number n by the value of a windowing function
• y[n] = w[n]s[n]
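The product y[n] = w[n]s[n] is a sample-by-sample multiplication. The slides do not say which window is used, so the sketch below assumes a Hann window purely for illustration; `hann_window` and `apply_window` are hypothetical helper names.

```python
import math

def hann_window(length):
    """Symmetric Hann window: 0 at both edges, 1 in the middle (odd lengths)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def apply_window(frame, window):
    """y[n] = w[n] * s[n] for every sample of the short-term signal."""
    return [w * s for w, s in zip(window, frame)]

frame = [1.0] * 9                       # a constant short-term signal s[n]
windowed = apply_window(frame, hann_window(9))
```

The windowed frame fades smoothly to zero at both ends while the centre sample is left unchanged, so overlapping frames can be added without clicks at the edges.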
![Page 70: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/70.jpg)
Pitch/duration modification algorithms
• OLA (Overlap-add synthesis): applies a certain overlap to each short-term signal (if we desire a pitch modification) and adds them. If we want to lengthen the signal we just repeat some of the periods. It can create glitches when the short-term boundaries are not well defined.
The samples s_i(n) only differ from zero on an interval dependent on the recovering factor, defined as the ratio of the size L of the analysis window w(n) to the pitch period.
In practice, we choose a recovering factor of 2, for which the spectrum of the s_i(n) signal approximates the spectrum of the s(n) signal. Then the concatenation process changes the pitch without affecting the formants' frequencies. The use of a different recovering factor causes strong degradation of the synthesized speech, e.g. buzzing or a metallic-voice effect.
The concept of synthesizer based on the EMU database
The conceptual scheme of the diphone synthesizer is shown in the figure. There are three basic phases of synthesis. In the first phase, it is necessary to create a database to store samples of human language. Then, the particular segments of the corpus, such as phonemes, diphones, periods, etc. are automatically recognized. This step is done by the special recognizing software, which is part of our system. Time marks for the particular segments of the corpus, the output of the software, are stored in the database.
The second phase is the speech synthesis. The user writes the text he wants to synthesize with the desired frequency. Firstly, the text is analysed. Then the database is searched and, according to the rules of the language, appropriate diphones are marked. Finally, they are synthesized according to the TD PSOLA algorithm. Synthesized speech is stored in the database for further processing (the third phase), or it can be played over an output device.
[Figure: Conceptual scheme of diphone synthesizer. Blocks: User command; Command analysis; Searching for segments; Speech synthesis based on TD PSOLA; Acoustical output; Samples of speech; Automatic recognition of phonemes, diphones and periods; EMU database system]
Original period Desired period
![Page 71: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/71.jpg)
Pitch/duration modification algorithms
• PSOLA (Pitch synchronous overlap-add synthesis): Through a good pitch/F0 detection it defines the boundaries of the short-term signals to be amplitude peaks containing 2-3 pitch periods. Then it uses OLA to join them.
• TD-PSOLA (Time-domain PSOLA): Patented by France Telecom, it is an efficient time-domain version of PSOLA (no FFT required, suitable for real-time TTS). It can modify pitch by a factor of up to 2 (or down to 0.5).
• MBR-PSOLA (Multi-band re-synthesis PSOLA): tries to solve the problems of TD-PSOLA at boundaries between diphones due to phase, pitch or spectral envelope mismatches by normalizing the whole database a priori. It later turned into the open project called MBROLA, which focuses on minimizing the database storage.
• LP-PSOLA (Linear predictors PSOLA): modification of PSOLA where the LP coefficients are stored instead of the signal itself.
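The pitch-synchronous overlap-add idea can be sketched as follows. This is a simplified illustration, not the patented TD-PSOLA algorithm: it assumes a constant pitch period, epochs that are already labelled and lie at least one period away from the signal edges, and two-period Hann windows. The function name `psola_sketch` and its parameters are hypothetical. Duration is changed by repeating or skipping windowed periods; pitch is changed by placing them closer together or further apart (`out_period`).

```python
import math

def psola_sketch(signal, epochs, period, factor, out_period=None):
    """Overlap-add two-period windows centred on the epochs.

    factor     > 1 repeats periods (longer output), < 1 drops periods.
    out_period spacing of the output epochs; smaller means higher pitch.
    """
    if out_period is None:
        out_period = period                      # keep the original pitch
    win_len = 2 * period
    # Periodic Hann window: copies shifted by one period sum to exactly 1.
    win = [0.5 - 0.5 * math.cos(2 * math.pi * k / win_len) for k in range(win_len)]
    n_marks = max(1, round(len(epochs) * factor))
    out = [0.0] * ((n_marks - 1) * out_period + win_len)
    for j in range(n_marks):
        i = min(len(epochs) - 1, int(j / factor))  # analysis epoch to reuse
        src = epochs[i] - period                   # window starts one period earlier
        dst = j * out_period
        for k in range(win_len):
            out[dst + k] += win[k] * signal[src + k]
    return out

T = 20                                            # pitch period in samples
sig = [math.sin(2 * math.pi * n / T) for n in range(200)]
epochs = list(range(T, 200, T))                   # one epoch per period
longer = psola_sketch(sig, epochs, T, factor=2.0)               # same pitch, ~2x length
higher = psola_sketch(sig, epochs, T, factor=1.0, out_period=10)  # epochs twice as dense
```

For this perfectly periodic toy signal the stretched output reproduces the original waveform period by period, just for roughly twice as long. Real speech needs per-epoch period estimates and a mapping between analysis and synthesis epochs that follows the target duration and intonation.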
![Page 72: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/72.jpg)
[Figure: summed signals without overlap; signals with overlap]
Conclusions and perspectives
Although the speech synthesized by the TD PSOLA algorithm is quite understandable, the human ear perceives it as synthetic. Furthermore, a great elongation of the synthesized speech signal by adding periods causes an echo effect. Contraction of the signal does not cause unfavourable auditory perceptions. However, omitting more periods in a diphone causes the loss of speech information content.
The TD PSOLA algorithm was implemented in the TCL scripting language. As a result of belonging to the group of interpreted languages, TCL is relatively slow. Even though the programme was optimized, the synthesis of a sentence lasts several seconds. I suppose that the use of a compiled language, e.g. C++, combined with hardware performance growth in the future, could shorten this time to hundreds of milliseconds. The undeniable advantage of the TCL language remains the possibility of porting TCL scripts among operating systems.
The fundamental problem of the EMU TCL console is the large number of errors in its source code. Many functions do not work properly, and some of them, in special cases, do not work at all. Therefore some of them had to be newly implemented.
Despite the fact that the speech created by concatenation of diphones is superior to concatenation of phonemes, this approach does not enable synthesis of high quality human speech. This is because not even a huge diphone database is able to cover the great variety of human speech. The use of Multiband Excitation Resynthesis on the database might slightly improve the quality of the speech, but articulatory synthesizers remain the vision for the future.
![Page 73: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/73.jpg)
Problems with diphone synthesis
• Signal processing leaves artifacts, making the speech sound unnatural
• Diphone synthesis only captures local effects
– But there are many more global effects (syllable structure, stress pattern, word-level effects)
![Page 74: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/74.jpg)
Unit Selection Synthesis
• Generalization of the diphone intuition
– Larger units
• From diphones to sentences
– Many, many copies of each unit
• 10 hours of speech instead of 1500 diphones (a few minutes of speech)
– Little or no signal processing applied to each unit
• Unlike diphones
![Page 75: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/75.jpg)
Why Unit Selection Synthesis
• Natural data solves problems with diphones
– Diphone databases are carefully designed but:
• Speaker makes errors
• Speaker doesn’t speak intended dialect
• Require database design to be right
– If it’s automatic
• Labeled with what the speaker actually said
• Coarticulation, schwas, flaps are natural
• “There’s no data like more data”
– Lots of copies of each unit mean you can choose just the right one for the context
– Larger units mean you can capture wider effects
![Page 76: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/76.jpg)
Unit Selection Intuition
• Given a big database
• For each segment (diphone) that we want to synthesize
– Find the unit in the database that is the best to synthesize this target segment
• What does “best” mean?
– “Target cost”: Closest match to the target description, in terms of
• Phonetic context
• F0, stress, phrase position
– “Join cost”: Best join with neighboring units
• Matching formants + other spectral characteristics
• Matching energy
• Matching F0
$$C(t_1^n, u_1^n) = \sum_{i=1}^{n} C^{\text{target}}(t_i, u_i) + \sum_{i=2}^{n} C^{\text{join}}(u_{i-1}, u_i)$$
![Page 77: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/77.jpg)
Search of best units using Viterbi algorithm
$$\hat{u}_1^n = \operatorname*{argmin}_{u_1, \ldots, u_n} C(t_1^n, u_1^n)$$
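The search for the unit sequence with the lowest total cost can be sketched as a small dynamic-programming (Viterbi) routine. This is only an illustration under simplifying assumptions: each target has a short list of candidate units, and the target and join costs are passed in as plain functions. The names `select_units`, `target_cost` and `join_cost`, and the toy costs based on F0 alone, are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Return (total cost, chosen units) minimising target plus join costs."""
    # best[i][k] = (cost of the best path ending in candidates[i][k], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            prev_cost, prev_k = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], u), prev_k))
        best.append(row)
    # Pick the cheapest final unit and follow the backpointers.
    k = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    total = best[-1][k][0]
    chosen = []
    for i in range(len(targets) - 1, -1, -1):
        chosen.append(candidates[i][k])
        k = best[i][k][1]
    chosen.reverse()
    return total, chosen

# Toy example: targets are desired F0 values, units are (name, F0) pairs.
targets = [100, 110]
candidates = [[("a1", 100), ("a2", 130)],
              [("b1", 140), ("b2", 112)]]
target_cost = lambda t, u: abs(t - u[1])      # mismatch with the target F0
join_cost = lambda p, u: abs(p[1] - u[1])     # F0 jump at the join
total, chosen = select_units(targets, candidates, target_cost, join_cost)
```

Here the cheapest sequence is a1 followed by b2: a target cost of 0, a join cost of 12 and a target cost of 2. A real system would combine phonetic context, stress, spectral distance and energy in the two cost functions.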
![Page 78: 6.#Textto(Speech#Synthesis# - Xavier Anguera · that allowed it to produce voewls and consonants. • In 1837 Charles Wheatstone produces a "speaking machine" based on von Kempelen's](https://reader033.vdocument.in/reader033/viewer/2022042209/5eada78038f82364831d1254/html5/thumbnails/78.jpg)
HMM synthesis
• It is a totally different approach from concatenative synthesis
– It is much closer to ASR
• Hidden Markov Models (HMM) are trained from labeled data to learn how each phone is pronounced in each condition
– It also learns its prosody
• Then, given a desired phoneme sequence and prosody pattern, it outputs the most probable audio sequence.
Samples ~2007 Sample 2010 NITech