1 ling 180 autumn 2007 linguist 180: introduction to computational linguistics dan jurafsky lecture...

108
1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

Upload: nigel-norton

Post on 17-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

1 LING 180 Autumn 2007

LINGUIST 180: Introduction to Computational Linguistics

Dan JurafskyLecture 17: Text-to-Speech

Page 2: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

2 LING 180 Autumn 2007

Outline

1) Arpabet2) TTS Architectures3) TTS Components

• Text Analysis• Text Normalization• Homonym Disambiguation• Grapheme-to-Phoneme (Letter-to-Sound)• Intonation

• Waveform Generation• Unit Selection• Diphones

Page 3: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

3 LING 180 Autumn 2007

Dave Barry on TTS

“And computers are getting smarter all the time; scientists tell us that soon they will be able to talk with us.

(By "they", I mean computers; I doubt scientists will ever be able to talk to us.)

Page 4: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

4 LING 180 Autumn 2007

ARPAbet

http://www.stanford.edu/class/linguist180/arpabet.html

Page 5: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

5 LING 180 Autumn 2007

ARPAbet Vowels

b_d ARPA b_d ARPA

1 bead iy 9 bode ow

2 bid ih 10 booed uw

3 bayed ey 11 bud ah

4 bed eh 12 bird er

5 bad ae 13 bide ay

6 bod(y) aa 14 bowed aw

7 bawd ao 15 Boyd oy

8 Budd(hist) uh

Sounds from Ladefoged

Page 6: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

6 LING 180 Autumn 2007

Brief Historical Interlude

• Pictures and some text from Hartmut Traunmüller’s web site:• http://www.ling.su.se/staff/hartmut/kemplne.htm

• Von Kempeln 1780 b. Bratislava 1734 d. Vienna 1804

• Leather resonator manipulated by the operator to try and copy vocal tract configuration during sonorants (vowels, glides, nasals)

• Bellows provided air stream, counterweight provided inhalation

• Vibrating reed produced periodic pressure wave

Page 7: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

7 LING 180 Autumn 2007

Von Kempelen:

• Small whistles controlled consonants

• Rubber mouth and nose; nose had to be covered with two fingers for non-nasals

• Unvoiced sounds: mouth covered, auxiliary bellows driven by string provides puff of air

From Traunmüller’s web site

Page 8: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

8 LING 180 Autumn 2007

Modern TTS systems

1960’s first full TTS: Umeda et al (1968)1970’s

Joe Olive 1977 concatenation of linear-prediction diphonesSpeak and Spell

1980’s1979 MIT MITalk (Allen, Hunnicut, Klatt)

1990’s-presentDiphone synthesisUnit selection synthesis

Page 9: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

9 LING 180 Autumn 2007

2. Overview of TTS:Architectures of Modern Synthesis

Articulatory Synthesis:Model movements of articulators and acoustics of vocal tract

Formant Synthesis:Start with acoustics, create rules/filters to create each formant

Concatenative Synthesis:Use databases of stored speech to assemble new utterances.

Text from Richard Sproat slides

Page 10: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

10 LING 180 Autumn 2007

Formant Synthesis

Were the most common commercial systems while computers were relatively underpowered.1979 MIT MITalk (Allen, Hunnicut, Klatt)1983 DECtalk systemThe voice of Stephen Hawking

Page 11: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

11 LING 180 Autumn 2007

Concatenative Synthesis

All current commercial systems.Diphone Synthesis

Units are diphones; middle of one phone to middle of next.Why? Middle of phone is steady state.Record 1 speaker saying each diphone

Unit Selection SynthesisLarger unitsRecord 10 hours or more, so have multiple copies of each unitUse search to find best sequence of units

Page 12: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

12 LING 180 Autumn 2007

TTS Demos (all are Unit-Selection)

Festivalhttp://www-2.cs.cmu.edu/~awb/festival_demos/index.html

Cepstralhttp://www.cepstral.com/cgi-bin/demos/general

IBMhttp://www-306.ibm.com/software/pervasive/tech/demos/tts.shtml

Page 13: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

13 LING 180 Autumn 2007

Architecture

The three types of TTSConcatenativeFormantArticulatory

Only cover the segments+f0+duration to waveform part.A full system needs to go all the way from random text to sound.

Page 14: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

14 LING 180 Autumn 2007

Two steps

PG&E will file schedules on April 20.

TEXT ANALYSIS: Text into intermediate representation:

WAVEFORM SYNTHESIS: From the intermediate representation into waveform

Page 15: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

15 LING 180 Autumn 2007

Page 16: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

16 LING 180 Autumn 2007

1. Text Normalization

Analysis of raw text into pronounceable words:

Sentence TokenizationText Normalization

Identify tokens in textChunk tokens into reasonably sized sectionsMap tokens to wordsIdentify types for words

Page 17: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

17 LING 180 Autumn 2007

Rules for end-of-utterance detection

A dot with one or two letters is an abbrevA dot with 3 cap letters is an abbrev.An abbrev followed by 2 spaces and a capital letter is an end-of-utteranceNon-abbrevs followed by capitalized word are breaksThis fails for

Cog. Sci. NewsletterLots of cases at end of line.Badly spaced/capitalized sentences

Page 18: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

18 LING 180 Autumn 2007

Determining if a word is end-of-utterance: a Decision Tree

Page 19: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

19 LING 180 Autumn 2007

Learning Decision Trees

DTs are rarely built by handHand-building only possible for very simple features, domainsLots of algorithms for DT induction

Page 20: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

20 LING 180 Autumn 2007

Next Step: Identify Types of Tokens, and Convert Tokens to Words

Pronunciation of numbers often depends on type:

1776 date: seventeen seventy six.1776 phone number: one seven seven six1776 quantifier: one thousand seven hundred (and) seventy six25 day: twenty-fifth

Page 21: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

21 LING 180 Autumn 2007

Classify token into 1 of 20 types

EXPN: abbrev, contractions (adv, N.Y., mph, gov’t)LSEQ: letter sequence (CIA, D.C., CDs)ASWD: read as word, e.g. CAT, proper namesMSPL: misspellingNUM: number (cardinal) (12,45,1/2, 0.6)NORD: number (ordinal) e.g. May 7, 3rd, Bill Gates IINTEL: telephone (or part) e.g. 212-555-4523NDIG: number as digits e.g. Room 101NIDE: identifier, e.g. 747, 386, I5, PC110NADDR: number as stresst address, e.g. 5000 PennsylvaniaNZIP, NTIME, NDATE, NYER, MONEY, BMONY, PRCT,URL,etc

SLNT: not spoken (KENT*REALTY)

Page 22: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

22 LING 180 Autumn 2007

More about the types

4 categories for alphabetic sequences:EXPN: expand to full word or word seq (fplc for fireplace, NY for New York)

LSEQ: say as letter sequence (IBM)

ASWD: say as standard word (either OOV or acronyms)

5 main ways to read numbers:Cardinal (quantities)

Ordinal (dates)

String of digits (phone numbers)

Pair of digits (years)

Trailing unit: serial until last non-zero digit: 8765000 is “eight seven six five thousand” (some phone numbers, long addresses)

But still exceptions: (947-3030, 830-7056)

Page 23: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

23 LING 180 Autumn 2007

Finally: expanding NSW Tokens

Type-specific heuristicsASWD expands to itselfLSEQ expands to list of words, one for each letterNUM expands to string of words representing cardinalNYER expand to 2 pairs of NUM digits…NTEL: string of digits with silence for puncutationAbbreviation:

– use abbrev lexicon if it’s one we’ve seen– Else use training set to know how to expand– Cute idea: if “eat in kit” occurs in text, “eat-in kitchen”

will also occur somewhere.

Page 24: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

24 LING 180 Autumn 2007

2. Homograph disambiguation

use319

increase 230close 215record 195house 150contract 143lead 131live

130lives 105protest 94

survey 91project 90separate 87present 80read 72subject 68rebel 48finance 46estimate 46

19 most frequent homographs, from Liberman and Church

Not a huge problem, but still important

Page 25: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

25 LING 180 Autumn 2007

POS Tagging for homograph disambiguation

Many homographs can be distinguished by POS

use y uw s y uw z

close k l ow s k l ow z

house h aw s h aw z

live l ay v l ih v

REcord reCORD

INsult inSULT

OBject obJECT

OVERflow overFLOW

DIScount disCOUNT

CONtent conTENT

Page 26: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

26 LING 180 Autumn 2007

3. Letter-to-Sound: Getting from words to phones

Two methods:Dictionary-basedRule-based (Letter-to-sound=LTS)

Early systems, all LTSMITalk was radical in having huge 10K word dictionaryNow systems use a combination

Page 27: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

27 LING 180 Autumn 2007

Pronunciation Dictionaries: CMU

CMU dictionary: 127K wordshttp://www.speech.cs.cmu.edu/cgi-bin/cmudict

Some problems:Has errorsOnly American pronunciationsNo syllable boundariesDoesn’t tell us which pronunciation to use for which homophones

– (no POS tags)Doesn’t distinguish case

– The word US has 2 pronunciations [AH1 S] and [Y UW1 EH1 S]

Page 28: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

28 LING 180 Autumn 2007

Pronunciation Dictionaries: UNISYN

UNISYN dictionary: 110K words (Fitt 2002)http://www.cstr.ed.ac.uk/projects/unisyn/

Benefits:Has syllabification, stress, some morphological boundariesPronunciations can be read off in

– General American– RP British– Australia– Etc

(Other dictionaries like CELEX not used because too small, British-only)

Page 29: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

29 LING 180 Autumn 2007

Dictionaries aren’t sufficient

Unknown words (= OOV = “out of vocabulary”)Increase with the (sqrt of) number of words in unseen textBlack et al (1998) OALD on 1st section of Penn Treebank:Out of 39923 word tokens,

– 1775 tokens were OOV: 4.6% (943 unique types):

So commercial systems have 4-part system:Big dictionaryNames handled by special routinesAcronyms handled by special routines (previous lecture)Machine learned g2p algorithm for other unknown words

names unknown Typos/other

1360 351 64

76.6% 19.8% 3.6%

Page 30: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

30 LING 180 Autumn 2007

Names

Big problem area is namesNames are common

20% of tokens in typical newswire text will be names1987 Donnelly list (72 million households) contains about 1.5 million namesPersonal names: McArthur, D’Angelo, Jiminez, Rajan, Raghavan, Sondhi, Xu, Hsu, Zhang, Chang, NguyenCompany/Brand names: Infinit, Kmart, Cytyc, Medamicus, Inforte, Aaon, Idexx Labs, Bebe

Page 31: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

31 LING 180 Autumn 2007

Names

Methods: Can do morphology (Walters -> Walter, Lucasville)Can write stress-shifting rules (Jordan -> Jordanian)Rhyme analogy: Plotsky by analogy with Trostsky (replace tr with pl)Liberman and Church: for 250K most common names, got 212K (85%) from these modified-dictionary methods, used LTS for rest.Can do automatic country detection (from letter trigrams) and then do country-specific rulesCan train g2p system specifically on names

– Or specifically on types of names (brand names, Russian names, etc)

Page 32: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

32 LING 180 Autumn 2007

Acronyms

We saw aboveUse machine learning to detect acronyms

EXPNASWORDLETTERS

Use acronym dictionary, hand-written rules to augment

Page 33: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

33 LING 180 Autumn 2007

Letter-to-Sound Rules

Earliest algorithms: handwritten Chomsky+Halle-style rules:

Festival version of such LTS rules:(LEFTCONTEXT [ ITEMS] RIGHTCONTEXT = NEWITEMS )

Example:( # [ c h ] C = k )( # [ c h ] = ch )

# denotes beginning of wordC means all consonantsRules apply in order

“christmas” pronounced with [k]But word with ch followed by non-consonant pronounced [ch]

– E.g., “choice”

Page 34: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

34 LING 180 Autumn 2007

Stress rules in hand-written LTS

English famously evil: one from Allen et al 1987

Where X must contain all prefixes:Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final syllable containing a short vowel and 0 or more consonants (e.g. difficult)Assign 1-stress to the vowel in a syllable preceding a weak syllable followed by a morpheme-final vowel (e.g. oregano)etc

Page 35: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

35 LING 180 Autumn 2007

Modern method: Learning LTS rules automatically

Induce LTS from a dictionary of the language Black et al. 1998Applied to English, German, FrenchTwo steps:

alignment (CART-based) rule-induction

Page 36: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

36 LING 180 Autumn 2007

Alignment

Letters: c h e c k e dPhones: ch _ eh _ k _ tBlack et al Method 1:

First scatter epsilons in all possible ways to cause letters and phones to alignThen collect stats for P(phone|letter) and select best to generate new stats

This iterated a number of times until settles (5-6)This is EM (expectation maximization) alg

Page 37: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

37 LING 180 Autumn 2007

Alignment

Black et al method 2

Page 38: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

38 LING 180 Autumn 2007

Hand specify which letters can be rendered as which phones

C goes to k/ch/s/shW goes to w/v/f, etcAn actual list:

Once mapping table is created, find all valid alignments, find p(letter|phone), score all alignments, take best

Page 39: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

39 LING 180 Autumn 2007

Alignment

Some alignments will turn out to be really bad.These are just the cases where pronunciation doesn’t match letters:

Dept d ih p aa r t m ah n tCMU s iy eh m y uwLieutenant l eh f t eh n ax n t (British)

Also foreign wordsThese can just be removed from alignment training

Page 40: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

40 LING 180 Autumn 2007

Building CART trees

Build a CART tree for each letter in alphabet (26 plus accented) using context of +-3 letters# # # c h e c -> chc h e c k e d -> _

Page 41: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

41 LING 180 Autumn 2007

Add more features

Even more: for French liaison, we need to know what the next word is, and whether it starts with a vowelFrench six

[s iy s] in j’en veux six [s iy z] in six enfants[s iy] in six filles

Page 42: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

42 LING 180 Autumn 2007

Prosody:from words+phones to boundaries, accent, F0, duration

Prosodic phrasing Need to break utterances into phrasesPunctuation is useful, not sufficient

Accents:Predictions of accents: which syllables should be accentedRealization of F0 contour: given accents/tones, generate F0 contour

Duration:Predicting duration of each phone

Page 43: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

43 LING 180 Autumn 2007

Defining Intonation

Ladd (1996) “Intonational phonology”“The use of suprasegmental phonetic featuresSuprasegmental = above and beyond the segment/phone

F0Intensity (energy)Duration

to convey sentence-level pragmatic meanings”I.e. meanings that apply to phrases or utterances as a whole, not lexical stress, not lexical tone.

Page 44: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

44 LING 180 Autumn 2007

Three aspects of prosody

Prominence: some syllables/words are more prominent than othersStructure/boundaries: sentences have prosodic structure

Some words group naturally togetherOthers have a noticeable break or disjuncture between them

Tune: the intonational melody of an utterance.

From Ladd (1996)

Page 45: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

45 LING 180 Autumn 2007

Prosodic Prominence: Pitch Accents

A: What types of foods are a good source of vitamins?

B1: Legumes are a good source of VITAMINS.B2: LEGUMES are a good source of vitamins.

• Prominent syllables are:• Louder• Longer• Have higher F0 and/or sharper changes in F0 (higher F0 velocity)

Slide from Jennifer Venditti

Page 46: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

46 LING 180 Autumn 2007

Stress vs. accent (2)

The speaker decides to make the word vitamin more prominent by accenting it.Lexical stress tell us that this prominence will appear on the first syllable, hence VItamin.

Page 47: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

47 LING 180 Autumn 2007

Which word receives an accent?

It depends on the context. For example, the ‘new’ information in the answer to a question is often accented, while the ‘old’ information usually is not.

Q1: What types of foods are a good source of vitamins?A1: LEGUMES are a good source of vitamins.

Q2: Are legumes a source of vitamins?A2: Legumes are a GOOD source of vitamins.

Q3: I’ve heard that legumes are healthy, but what are they a good source of ?A3: Legumes are a good source of VITAMINS.

Slide from Jennifer Venditti

Page 48: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

48 LING 180 Autumn 2007

Factors in accent prediction

Part of speech:Content words are usually accentedFunction words are rarely accented

– Of, for, in on, that, the, a, an, no, to, and but or will may would can her is their its our there is am are was were, etc

Page 49: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

49 LING 180 Autumn 2007

Complex Noun Phrase Structure

Sproat, R. 1994. English noun-phrase accent prediction for text-to-speech. Computer Speech and Language 8:79-94.

Proper Names, stress on right-most wordNew York CITY; Paris, FRANCE

Adjective-Noun combinations, stress on nounLarge HOUSE, red PEN, new NOTEBOOK

Noun-Noun compounds: stress left nounHOTdog (food) versus HOT DOG (overheated animal)WHITE house (place) versus WHITE HOUSE (made of stucco)

examples:MEDICAL Building, APPLE cake, cherry PIE.What about: Madison avenue, Park street ???

Some Rules:Furniture+Room -> RIGHT (e.g., kitchen TABLE)Proper-name + Street -> LEFT (e.g. PARK street)

Page 50: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

50 LING 180 Autumn 2007

State of the art

Hand-label large training setsUse CART, SVM, CRF, etc to predict accentLots of rich features from context (parts of speech, syntactic structure, information structure, contrast, etc.)Classic lit:

Hirschberg, Julia. 1993. Pitch Accent in context: predicting intonational prominence from text. Artificial Intelligence 63, 305-340

Page 51: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

51 LING 180 Autumn 2007

Levels of prominence

Most phrases have more than one accentThe last accent in a phrase is perceived as more prominent

Called the Nuclear Accent

Emphatic accents like nuclear accent often used for semantic purposes, such as indicating that a word is contrastive, or the semantic focus.

The kind of thing you represent via ***s in IM, or capitalized letters‘I know SOMETHING interesting is sure to happen,’ she said to herself.

Can also have words that are less prominent than usualReduced words, especially function words.

Often use 4 classes of prominence:emphatic accent, pitch accent, unaccented, reduced

Page 52: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

52 LING 180 Autumn 2007

50

100

150

200

250

300

350

400

450

500

550

are legumes a good source of VITAMINS

Yes-No question

Rise from the main accent to the end of the sentence.

Slide from Jennifer Venditti

Page 53: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

53 LING 180 Autumn 2007

‘Surprise-redundancy’ tune

legumes are a good source of vitamins

Low beginning followed by a gradual rise to a high at the end.

[How many times do I have to tell you ...]

50

100

150

200

250

300

350

400

Slide from Jennifer Venditti

Page 54: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

54 LING 180 Autumn 2007

50

100

150

200

250

300

350

400

‘Contradiction’ tune

linguini isn’t a good source of vitamins

Sharp fall at the beginning, flat and low, then rising at the end.

“I’ve heard that linguini is a good source of vitamins.”

[... how could you think that?]

Slide from Jennifer Venditti

Page 55: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

55 LING 180 Autumn 2007

Duration

Simplest: fixed size for all phones (100 ms)Next simplest: average duration for that phone (from training data). Samples from SWBD in ms:

aa 118 b 68ax 59 d 68ay 138 dh 44eh 87 f 90ih 77 g 66

Next Next Simplest: add in phrase-final and initial lengthening plus stress:

Page 56: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

56 LING 180 Autumn 2007

Intermediate representation:using Festival

Do you really want to see all of it?

Page 57: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

57 LING 180 Autumn 2007

Waveform Synthesis

Given:String of phonesProsody

– Desired F0 for entire utterance– Duration for each phone– Stress value for each phone, possibly accent value

Generate:Waveforms

Page 58: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

58 LING 180 Autumn 2007

Diphone TTS architecture

Training:Choose units (kinds of diphones)Record 1 speaker saying 1 example of each diphoneMark the boundaries of each diphones,

– cut each diphone out and create a diphone database

Synthesizing an utterance, grab relevant sequence of diphones from databaseConcatenate the diphones, doing slight signal processing at boundariesuse signal processing to change the prosody (F0, energy, duration) of selected sequence of diphones

Page 59: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

59 LING 180 Autumn 2007

Diphones

Mid-phone is more stable than edge:

Page 60: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

60 LING 180 Autumn 2007

Diphones

mid-phone is more stable than edgeNeed O(phone2) number of units

Some combinations don’t exist (hopefully)ATT (Olive et al. 1998) system had 43 phones

– 1849 possible diphones– Phonotactics ([h] only occurs before vowels), don’t need

to keep diphones across silence – Only 1172 actual diphones

May include stress, consonant clusters– So could have more

Lots of phonetic knowledge in design

Database relatively small (by today’s standards)

Around 8 megabytes for English (16 KHz 16 bit)

Slide from Richard Sproat

Page 61: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

61 LING 180 Autumn 2007

Voice

SpeakerCalled a voice talent

Diphone databaseCalled a voice

Page 62: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

62 LING 180 Autumn 2007

Prosodic Modification

Modifying pitch and duration independentlyChanging sample rate modifies both:

Chipmunk speech

Duration: duplicate/remove parts of the signalPitch: resample to change pitch

Text from Alan Black

Page 63: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

63 LING 180 Autumn 2007

Speech as Short Term signals

Alan Black

Page 64: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

64 LING 180 Autumn 2007

Duration modification

Duplicate/remove short term signals

Slide from Richard Sproat

Page 65: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

65 LING 180 Autumn 2007

Duration modification

Duplicate/remove short term signals

Page 66: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

66 LING 180 Autumn 2007

Pitch Modification

Move short-term signals closer together/further apart

Slide from Richard Sproat

Page 67: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

67 LING 180 Autumn 2007

TD-PSOLA ™

Time-Domain Pitch Synchronous Overlap and AddPatented by France Telecom (CNET)Very efficient

No FFT (or inverse FFT) required

Can modify Hz up to two times or by half

Slide from Richard Sproat

Page 68: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

68 LING 180 Autumn 2007

TD-PSOLA ™

Time-Domain Pitch Synchronous Overlap and AddPatented by France Telecom (CNET)

WindowedPitch-synchronousOverlap-and-addVery efficientCan modify Hz up to two times or by half

Page 69: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

69 LING 180 Autumn 2007

Unit Selection Synthesis

Generalization of the diphone intuitionLarger units

– From diphones to sentences

Many many copies of each unit– 10 hours of speech instead of 1500 diphones (a few

minutes of speech)

Page 70: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

70 LING 180 Autumn 2007

Unit Selection Intuition

Given a big databaseFind the unit in the database that is the best to synthesize some target segmentWhat does “best” mean?

“Target cost”: Closest match to the target description, in terms of

– Phonetic context– F0, stress, phrase position

“Join cost”: Best join with neighboring units– Matching formants + other spectral characteristics– Matching energy– Matching F0

Page 71: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

71 LING 180 Autumn 2007

Targets and Target Costs

A measure of how well a particular unit in the database matches the internal representation produced by the prior stagesFeatures, costs, and weightsExamples:

/ih-t/ from stressed syllable, phrase internal, high F0, content word/n-t/ from unstressed syllable, phrase final, low F0, content word/dh-ax/ from unstressed syllable, phrase initial, high F0, from function word “the”

Slide from Paul Taylor

Page 72: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

72 LING 180 Autumn 2007

Target Costs

Comprised of k subcostsStressPhrase positionF0Phone durationLexical identity

Target cost for a unit:

C t (ti,ui) wktCk

t (k1

p

ti,ui)

Slide from Paul Taylor

Page 73: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

73 LING 180 Autumn 2007

Join (Concatenation) Cost

Measure of smoothness of joinMeasured between two database units (target is irrelevant)Features, costs, and weightsComprised of k subcosts:

Spectral featuresF0Energy

Join cost:

C j (ui 1,ui) wkjCk

j (k1

p

ui 1,ui)

Slide from Paul Taylor

Page 74: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

74 LING 180 Autumn 2007

Total Costs

Hunt and Black 1996We now have weights (per phone type) for features set between target and database unitsFind best path of units through database that minimize:

Standard problem solvable with Viterbi search with beam width constraint for pruning

C(t1n,u1

n ) C target (i1

n

ti,ui) C join (i2

n

ui 1,ui)

ˆ u 1n argmin

u1 ,...,un

C(t1n,u1

n )

Slide from Paul Taylor

Page 75: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

75 LING 180 Autumn 2007

Page 76: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

76 LING 180 Autumn 2007

Unit Selection Summary

AdvantagesQuality is far superior to diphonesNatural prosody selection sounds better

Disadvantages:Quality can be very bad in places

– HCI problem: mix of very good and very bad is quite annoying

Synthesis is computationally expensiveCan’t synthesize everything you want:

– Diphone technique can move emphasis– Unit selection gives good (but possibly incorrect) result

Slide from Richard Sproat

Page 77: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

77 LING 180 Autumn 2007

Evaluation of TTS

Intelligibility TestsDiagnostic Rhyme Test (DRT)

– Humans do listening identification choice between two words differing by a single phonetic feature

Voicing, nasality, sustenation, sibilation– 96 rhyming pairs– Veal/feel, meat/beat, vee/bee, zee/thee, etc

Subject hears “veal”, chooses either “veal or “feel” Subject also hears “feel”, chooses either “veal” or “feel”

– % of right answers is intelligibility score.

Overall Quality TestsHave listeners rate space on a scale from 1 (bad) to 5 (excellent)

Preference Tests (prefer A, prefer B)

Huang, Acero, Hon

Page 78: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

78 LING 180 Autumn 2007

Recent stuff

Problems with Unit Selection SynthesisCan’t modify signal(mixing modified and unmodified sounds bad)But database often doesn’t have exactly what you want

Solution: HMM (Hidden Markov Model) Synthesis

Won the last TTS bakeoff.Sounds less natural to researchersBut naïve subjects preferred itHas the potential to improve on both diphone and unit selection.

Page 79: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

79 LING 180 Autumn 2007

HMM Synthesis

Unit selection (Roger)HMM (Roger)

Unit selection (Nina)HMM (Nina)

Page 80: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

80 LING 180 Autumn 2007

Summary

1) Arpabet2) TTS Architectures3) TTS Components

• Text Analysis• Text Normalization• Homonym Disambiguation• Grapheme-to-Phoneme (Letter-to-Sound)• Intonation

• Waveform Generation• Diphones• Unit Selection

Page 81: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

81 LING 180 Autumn 2007

Optional: Articulatory Phonetics

Page 82: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

82 LING 180 Autumn 2007

Speech Production Process

Respiration:We (normally) speak while breathing out. Respiration provides airflow. “Pulmonic egressive airstream”

PhonationAirstream sets vocal folds in motion. Vibration of vocal folds produces sounds. Sound is then modulated by:

Articulation and ResonanceShape of vocal tract, characterized by:Oral tract

– Teeth, soft palate (velum), hard palate– Tongue, lips, uvula

Nasal tract

Text adopted from Sharon Rose

Page 83: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

83 LING 180 Autumn 2007

Nasal Cavity

Pharynx

Vocal Folds (within the Larynx)

Trachea

Lungs

Text copyright J. J. Ohala, Sept 2001, from Sharon Rose slide

Sagittal section of the vocal tract (Techmer 1880)

Page 84: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

84 LING 180 Autumn 2007

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

From Mark Liberman’s website, from Ultimate Visual Dictionary

Page 85: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

85 LING 180 Autumn 2007

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

From Mark Liberman’s Web Site, from Language Files (7th ed)

Page 86: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

86 LING 180 Autumn 2007

Larynx and Vocal Folds

The Larynx (voice box)A structure made of cartilage and muscleLocated above the trachea (windpipe) and below the pharynx (throat)Contains the vocal folds(adjective for larynx: laryngeal)

Vocal Folds (older term: vocal cords)Two bands of muscle and tissue in the larynxCan be set in motion to produce sound (voicing)

Text from slides by Sharon Rose UCSD LING 111 handout

Page 87: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

87 LING 180 Autumn 2007

The larynx, external structure, from front

Figure thnx to John Coleman!!

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Page 88: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

88 LING 180 Autumn 2007

Vertical slice through larynx, as seen from back

Figure thnx to John Coleman!!

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Page 89: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

89 LING 180 Autumn 2007

Voicing:

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

•Air comes up from lungs•Forces its way through vocal cords, pushing open (2,3,4)•This causes air pressure in glottis to fall, since:

• when gas runs through constricted passage, its velocity increases (Venturi tube effect)• this increase in velocity results in a drop in pressure (Bernoulli principle)

•Because of drop in pressure, vocal cords snap together again (6-10)•Single cycle: ~1/100 of a second.

Figure & text from John Coleman’s web site

Page 90: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

90 LING 180 Autumn 2007

Voicelessness

When vocal cords are open, air passes through unobstructedVoiceless sounds: p/t/k/s/f/sh/th/chIf the air moves very quickly, the turbulence causes a different kind of phonation: whisper

Page 91: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

91 LING 180 Autumn 2007

Vocal folds open during breathing

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

From Mark Liberman’s web site, from Ultimate Visual Dictionary

Page 92: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

92 LING 180 Autumn 2007

Vocal Fold Vibration

UCLA Phonetics Lab Demo

Page 93: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

93 LING 180 Autumn 2007

Consonants and Vowels

Consonants: phonetically, sounds with audible noise produced by a constrictionVowels: phonetically, sounds with no audible noise produced by a constriction

(it’s more complicated than this, since we have to consider syllabic function, but this will do for now)

Text adapted from John Coleman

Page 94: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

94 LING 180 Autumn 2007

Place of Articulation

Consonants are classified according to the location where the airflow is most constricted.This is called place of articulationThree major kinds of place articulation:

Labial (with lips)Coronal (using tip or blade of tongue)Dorsal (using back of tongue)

Page 95: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

95 LING 180 Autumn 2007

Places of articulation

labial

dentalalveolar post-alveolar/palatal

velar

uvular

pharyngeal

laryngeal/glottal

Figure thanks to Jennifer Venditti

Page 96: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

96 LING 180 Autumn 2007

Labial place

bilabial

labiodental

Figure thanks to Jennifer Venditti

Bilabial:p, b, m

Labiodental:f, v

Page 97: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

97 LING 180 Autumn 2007

Coronal place

dentalalveolar post-alveolar/palatal

Figure thanks to Jennifer Venditti

Dental:th/dh

Alveolar:t/d/s/z/l

Post:sh/zh/y

Page 98: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

98 LING 180 Autumn 2007

Dorsal Place

velar

uvular

pharyngeal

Figure thanks to Jennifer Venditti

Velar:k/g/ng

Page 99: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

99 LING 180 Autumn 2007

Manner of Articulation

Stop: complete closure of articulators, so no air escapes through mouthOral stop: palate is raised, no air escapes through nose. Air pressure builds up behind closure, explodes when released

p, t, k, b, d, g

Nasal stop: oral closure, but palate is lowered, air escapes through nose.

m, n, ng

Page 100: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

100 LING 180 Autumn 2007

Oral vs. Nasal Sounds

Thanks to Jong-bok Kim for this figure!

Page 101: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

101 LING 180 Autumn 2007

More on Manner of articulation of consonants

FricativesClose approximation of two articulators, resulting in turbulent airflow between them, producing a hissing sound.

– f, v, s, z, th, dh

ApproximantNot quite-so-close approximation of two articulators, so no turbulence

– y, r

Lateral approximantObstruction of airstream along center of oral tract, with opening around sides of tongue.

– l

Text from Ladefoged “A Course in Phonetics”

Page 102: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

102 LING 180 Autumn 2007

More on manner of articulation of consonants

Tap or flapTongue makes a single tap against the alveolar ridge

– dx in “butter”

AffricateStop immediately followed by a fricative

– ch, jh

Page 103: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

103 LING 180 Autumn 2007

Articulatory parameters for English consonants (in ARPAbet)

PLACE OF ARTICULATION

bilabial labio-dental

inter-dental

alveolar palatal velar glottal

stop p b t d k g q

fric. f v th dh s z sh zh h

affric. ch jh

nasal m n ng

approx w l/r y

flap dxMA

NN

ER

OF

AR

TIC

ULA

TIO

N

VOICING:voiceless voicedfrom Jennifer Venditt!i

Page 104: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

104 LING 180 Autumn 2007

Tongue position for vowels

QuickTime™ and aVideo decompressor

are needed to see this picture.

Page 105: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

105 LING 180 Autumn 2007

Vowels

IY AA UW

Fig. from Eric Keller

Page 106: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

106 LING 180 Autumn 2007

American English Vowel Space

FRONT BACK

HIGH

LOW

eyow

aw

oy

ay

iy

ih

eh

ae aa

ao

uw

uh

ah

ax

ix ux

Figure from Jennifer Venditti

Page 107: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

107 LING 180 Autumn 2007

More phonetic structure

SyllablesComposed of vowels and consonants. Not well defined. Something like a “vowel nucleus with some of its surrounding consonants”.

StressSome syllables have more energy than othersStressed syllables versus unstressed syllables(an) ‘INsult vs. (to) in’SULT(an) ‘OBject vs. (to) ob’JECTUnstressed vowels are generally transcribed as schwa:

– ax

Page 108: 1 LING 180 Autumn 2007 LINGUIST 180: Introduction to Computational Linguistics Dan Jurafsky Lecture 17: Text-to-Speech

108 LING 180 Autumn 2007

Where to go for more info

Ladefoged, Peter. 1993. A Course in PhoneticsMark Liberman’s site

http://www.ling.upenn.edu/courses/Spring_2001/ling001/phonetics.html

John Coleman’s sitehttp://www.phon.ox.ac.uk/%7Ejcoleman/mst_mphil_phonetics_course_index.html