LSA 352: Speech Recognition and Synthesis
LSA 352 Summer 2007
LSA 352: Speech Recognition and Synthesis
Dan Jurafsky
Lecture 1: 1) Overview of Course 2) Refresher: Intro to Probability 3) Language Modeling
IP notice: some slides for today from: Josh Goodman, Dan Klein, Bonnie Dorr, Julia Hirschberg, Sandiway Fong
Outline
Overview of Course
Probability
Language Modeling
Language Modeling means “probabilistic grammar”
Definitions
Speech Recognition = Speech-to-Text
– Input: a wavefile
– Output: a string of words
Speech Synthesis = Text-to-Speech
– Input: a string of words
– Output: a wavefile
Automatic Speech Recognition (ASR) / Automatic Speech Understanding (ASU)
Applications:
Dictation
Telephone-based information (directions, air travel, banking, etc.)
Hands-free (in car)
Second language ('L2') (accent reduction)
Audio archive searching
Linguistic research
– Automatically computing word durations, etc.
Applications of Speech Synthesis/Text-to-Speech (TTS)
Games
Telephone-based information (directions, air travel, banking, etc.)
Eyes-free (in car)
Reading/speaking for the disabled
Education: reading tutors
Education: L2 learning
Applications of Speaker/Language Recognition
Language recognition for call routing
Speaker recognition:
Speaker verification (binary decision)
– Voice password, telephone assistant
Speaker identification (one of N)
– Criminal investigation
History: foundational insights 1900s-1950s
Automaton:
Markov 1911
Turing 1936
McCulloch-Pitts neuron (1943)
– http://marr.bsee.swin.edu.au/~dtl/het704/lecture10/ann/node1.html
– http://diwww.epfl.ch/mantra/tutorial/english/mcpits/html/
Shannon (1948): link between automata and Markov models
Human speech processing:
Fletcher at Bell Labs (1920s)
Probabilistic/information-theoretic models:
Shannon (1948)
Synthesis precursors
Von Kempelen: mechanical (bellows, reeds) speech-production simulacrum
1929: channel vocoder (Dudley)
History: Early Recognition
1920s: Radio Rex
Celluloid dog with iron base, held within its house by an electromagnet against the force of a spring
Current to the magnet flowed through a bridge which was sensitive to energy at 500 Hz
500 Hz energy caused the bridge to vibrate, interrupting the current, making the dog spring forward
The sound “e” (ARPAbet [eh]) in “Rex” has a 500 Hz component
History: early ASR systems
1950s: early speech recognizers
1952: Bell Labs single-speaker digit recognizer
– Measured energy from two bands (formants)
– Built with analog electrical components
– 2% error rate for a single speaker, isolated digits
1958: Dudley built a classifier that used the continuous spectrum rather than just formants
1959: Denes ASR combining grammar and acoustic probability
1960s:
FFT: Fast Fourier Transform (Cooley and Tukey 1965)
LPC: linear prediction (1968)
1969: John Pierce letter “Whither Speech Recognition?”
– Random tuning of parameters
– Lack of scientific rigor, no evaluation metrics
– Need to rely on higher-level knowledge
ASR: 1970s and 1980s
Hidden Markov Model (1972)
Independent application by Baker (CMU) and the Jelinek/Bahl/Mercer lab (IBM), following work of Baum and colleagues at IDA
ARPA project 1971-1976
5-year speech understanding project: 1000-word vocab, continuous speech, multi-speaker
SDC, CMU, BBN
Only 1 CMU system achieved the goal
1980s+:
Annual ARPA “Bakeoffs”
Large corpus collection
– TIMIT
– Resource Management
– Wall Street Journal
State of the Art
ASR (speaker-independent, continuous, no noise), world’s best research systems:
– Human-human speech: ~13-20% Word Error Rate (WER)
– Human-machine speech: ~3-5% WER
TTS (demo next week)
LVCSR Overview
Large Vocabulary Continuous (Speaker-Independent) Speech Recognition
Build a statistical model of the speech-to-words process
Collect lots of speech and transcribe all the words
Train the model on the labeled speech
Paradigm: supervised machine learning + search
Unit Selection TTS Overview
Collect lots of speech (5-50 hours) from one speaker; transcribe very carefully: all the syllables and phones and whatnot
To synthesize a sentence, patch together syllables and phones from the training data
Paradigm: search
Requirements and Grading
Readings: selected chapters on the web from
– Jurafsky & Martin. 2000. Speech and Language Processing.
– Taylor, Paul. 2007. Text-to-Speech Synthesis.
Grading:
Homework: 75% (3 homeworks, 25% each)
Participation: 25%
You may work in groups
Overview of the course
http://nlp.stanford.edu/courses/lsa352/
Introduction to Probability
Experiment (trial)
Repeatable procedure with well-defined possible outcomes
Sample Space (S)
– the set of all possible outcomes
– finite or infinite
Example
– coin toss experiment
– possible outcomes: S = {heads, tails}
Example
– die toss experiment
– possible outcomes: S = {1,2,3,4,5,6}
Slides from Sandiway Fong
Introduction to Probability
The definition of the sample space depends on what we are asking
Sample Space (S): the set of all possible outcomes
Example
– die toss experiment for whether the number is even or odd
– possible outcomes: {even, odd}, not {1,2,3,4,5,6}
More definitions
Events
An event is any subset of outcomes from the sample space
Example
Die toss experiment: let A represent the event that the outcome of the die toss is divisible by 3
A = {3,6} is a subset of the sample space S = {1,2,3,4,5,6}
Example
Draw a card from a deck
– suppose sample space S = {heart, spade, club, diamond} (four suits)
Let A represent the event of drawing a heart
Let B represent the event of drawing a red card
A = {heart}
B = {heart, diamond}
Introduction to Probability
Some definitions
Counting
– suppose operation oi can be performed in ni ways; then
– a sequence of k operations o1 o2 ... ok
– can be performed in n1 × n2 × ... × nk ways
Example
– die toss experiment: 6 possible outcomes
– two dice are thrown at the same time
– number of sample points in the sample space = 6 × 6 = 36
Definition of Probability
The probability law assigns to an event A a nonnegative number
Called P(A)
Also called the probability of A
That encodes our knowledge or belief about the collective likelihood of all the elements of A
The probability law must satisfy certain properties
Probability Axioms
Nonnegativity:
P(A) >= 0, for every event A
Additivity:
If A and B are two disjoint events, then the probability of their union satisfies:
P(A U B) = P(A) + P(B)
Normalization:
The probability of the entire sample space S is equal to 1, i.e., P(S) = 1.
An example
An experiment involving a single coin toss
There are two possible outcomes, H and T
Sample space S is {H,T}
If the coin is fair, we should assign equal probabilities to the 2 outcomes
Since they have to sum to 1:
P({H}) = 0.5
P({T}) = 0.5
P({H,T}) = P({H}) + P({T}) = 1.0
Another example
Experiment involving 3 coin tosses
The outcome is a 3-long string of H or T
S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
Assume each outcome is equiprobable: a “uniform distribution”
What is the probability of the event that exactly 2 heads occur?
A = {HHT, HTH, THH}
P(A) = P({HHT}) + P({HTH}) + P({THH}) = 1/8 + 1/8 + 1/8 = 3/8
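The uniform-distribution computation above can be verified by enumerating the sample space; a small Python sketch:

```python
from itertools import product
from fractions import Fraction

# The 8 equiprobable outcomes of three coin tosses
outcomes = [''.join(seq) for seq in product('HT', repeat=3)]

# Event A: exactly two heads occur
A = [o for o in outcomes if o.count('H') == 2]

# Uniform distribution, so P(A) = |A| / |S|
p_A = Fraction(len(A), len(outcomes))
print(p_A)  # 3/8
```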
Probability definitions
In summary:
Probability of drawing a spade from 52 well-shuffled playing cards: 13/52 = 1/4 = .25
Probabilities of two events
If two events A and B are independent, then
P(A and B) = P(A) × P(B)
If we flip a fair coin twice, what is the probability that both flips come up heads?
If we draw a card from a deck, put it back, then draw a card from the deck again, what is the probability that both drawn cards are hearts?
How about non-uniform probabilities? An example
A biased coin, twice as likely to come up tails as heads, is tossed twice
What is the probability that at least one head occurs?
Sample space = {hh, ht, th, tt} (h = heads, t = tails)
Sample points/probabilities for the event:
ht: 1/3 × 2/3 = 2/9
hh: 1/3 × 1/3 = 1/9
th: 2/3 × 1/3 = 2/9
tt: 2/3 × 2/3 = 4/9
Answer: 2/9 + 1/9 + 2/9 = 5/9 ≈ 0.56
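The same arithmetic, done by summing the weighted sample points with exact fractions (a sketch; names are my own):

```python
from itertools import product
from fractions import Fraction

# Biased coin: tails twice as likely as heads
p = {'h': Fraction(1, 3), 't': Fraction(2, 3)}

# Tosses are independent, so each two-toss outcome has probability
# p(first) * p(second); sum the outcomes containing at least one head.
p_at_least_one_head = sum(p[a] * p[b]
                          for a, b in product('ht', repeat=2)
                          if 'h' in (a, b))
print(p_at_least_one_head)  # 5/9
```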
Moving toward language
What’s the probability of drawing a 2 from a deck of 52 cards with four 2s?
P(drawing a two) = 4/52 = 1/13 ≈ .077
What’s the probability of a random word (from a random dictionary page) being a verb?
P(drawing a verb) = (# of ways to get a verb) / (all words)
Probability and part-of-speech tags
What’s the probability of a random word (from a random dictionary page) being a verb?
P(drawing a verb) = (# of ways to get a verb) / (all words)
How to compute each of these?
All words = just count all the words in the dictionary
# of ways to get a verb: number of words which are verbs!
If a dictionary has 50,000 entries, and 10,000 are verbs, P(V) is 10000/50000 = 1/5 = .20
Conditional Probability
A way to reason about the outcome of an experiment based on partial information
In a word-guessing game, the first letter of the word is a “t”. What is the likelihood that the second letter is an “h”?
How likely is it that a person has a disease given that a medical test was negative?
A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
More precisely
Given an experiment, a corresponding sample space S, and a probability law
Suppose we know that the outcome is within some given event B
We want to quantify the likelihood that the outcome also belongs to some other given event A
We need a new probability law that gives us the conditional probability of A given B:
P(A|B)
An intuition
A is “it’s raining now”
P(A) in dry California is .01
B is “it was raining ten minutes ago”
P(A|B) means “what is the probability of it raining now if it was raining 10 minutes ago”
P(A|B) is probably way higher than P(A)
Perhaps P(A|B) is .10
Intuition: the knowledge about B should change our estimate of the probability of A
Conditional probability
One of the following 30 items is chosen at random
What is P(X), the probability that it is an X?
What is P(X|red), the probability that it is an X given that it is red?
Conditional Probability
Let A and B be events
p(B|A) = the probability of event B occurring given that event A occurs
Definition: p(B|A) = p(A ∩ B) / p(A)
Conditional probability
P(A|B) = P(A ∩ B) / P(B)
Or: P(A|B) = P(A,B) / P(B)
Note: P(A,B) = P(A|B) · P(B)
Also: P(A,B) = P(B,A)
Independence
What is P(A,B) if A and B are independent?
P(A,B) = P(A) · P(B) iff A, B independent
P(heads, tails) = P(heads) · P(tails) = .5 · .5 = .25
Note: P(A|B) = P(A) iff A, B independent
Also: P(B|A) = P(B) iff A, B independent
Bayes Theorem
P(A|B) = P(B|A) P(A) / P(B)
Swap the conditioning
Sometimes it is easier to estimate one kind of dependence than the other
Deriving Bayes Rule
P(B|A) = P(A ∩ B) / P(A)
P(A|B) = P(A ∩ B) / P(B)
P(B|A) P(A) = P(A ∩ B)
P(A|B) P(B) = P(A ∩ B)
P(A|B) P(B) = P(B|A) P(A)
P(A|B) = P(B|A) P(A) / P(B)
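As a numeric sanity check of the rule just derived, using the earlier card example (A = drawing a heart, B = drawing a red card, on the suit sample space):

```python
from fractions import Fraction

# From the card example: A = heart, B = red card,
# on the suit sample space {heart, spade, club, diamond}
p_A = Fraction(1, 4)          # P(heart)
p_B = Fraction(1, 2)          # P(red)
p_A_given_B = Fraction(1, 2)  # P(heart | red)

# Bayes rule with the conditioning swapped: P(B|A) = P(A|B) P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(p_B_given_A)  # 1: every heart is red
```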
Summary
Probability
Conditional probability
Independence
Bayes rule
How many words?
I do uh main- mainly business data processing
Fragments
Filled pauses
Are cat and cats the same word?
Some terminology:
Lemma: a set of lexical forms having the same stem, major part of speech, and rough word sense
– cat and cats = same lemma
Wordform: the full inflected surface form
– cat and cats = different wordforms
How many words?
they picnicked by the pool then lay back on the grass and looked at the stars
16 tokens
14 types
SWBD: ~20,000 wordform types, 2.4 million wordform tokens
Brown et al. (1992), large corpus:
583 million wordform tokens
293,181 wordform types
Let N = number of tokens, V = vocabulary = number of types
General wisdom: V > O(sqrt(N))
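The token/type counts for the example sentence can be reproduced in a couple of lines:

```python
sentence = ("they picnicked by the pool then lay back "
            "on the grass and looked at the stars")

tokens = sentence.split()  # wordform tokens
types = set(tokens)        # wordform types (the vocabulary)
print(len(tokens), len(types))  # 16 14
```

The only repeated wordform is “the” (three occurrences), which is why the type count is two lower than the token count.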
Language Modeling
We want to compute P(w1,w2,w3,w4,w5...wn), the probability of a sequence
Alternatively, we want to compute P(w5|w1,w2,w3,w4): the probability of a word given some previous words
The model that computes P(W) or P(wn|w1,w2...wn-1) is called the language model
A better term for this would be “the grammar”
But “language model” or LM is standard
Computing P(W)
How to compute this joint probability:
P(“the”, “other”, “day”, “I”, “was”, “walking”, “along”, “and”, “saw”, “a”, “lizard”)
Intuition: let’s rely on the Chain Rule of Probability
The Chain Rule of Probability
Recall the definition of conditional probabilities:
P(A|B) = P(A ∩ B) / P(B)
Rewriting:
P(A ∩ B) = P(A|B) P(B)
More generally:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
In general:
P(x1,x2,x3,...xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1...xn-1)
The Chain Rule Applied to joint probability of words in sentence
P(“the big red dog was”) =
P(the) × P(big|the) × P(red|the big) × P(dog|the big red) × P(was|the big red dog)
Very easy estimate
How to estimate P(the | its water is so transparent that)?
P(the | its water is so transparent that) =
C(its water is so transparent that the) / C(its water is so transparent that)
Unfortunately
There are a lot of possible sentences
We’ll never be able to get enough data to compute the statistics for those long prefixes:
P(lizard | the,other,day,I,was,walking,along,and,saw,a)
Or
P(the | its water is so transparent that)
Markov Assumption
Make the simplifying assumption
P(lizard | the,other,day,I,was,walking,along,and,saw,a) = P(lizard | a)
Or maybe
P(lizard | the,other,day,I,was,walking,along,and,saw,a) = P(lizard | saw,a)
So for each component in the product, replace it with the approximation (assuming a prefix of N):
P(wn | w1...wn-1) ≈ P(wn | wn-N+1...wn-1)
Bigram version:
P(wn | w1...wn-1) ≈ P(wn | wn-1)
Estimating bigram probabilities
The Maximum Likelihood Estimate:
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
P(wi | wi-1) = c(wi-1, wi) / c(wi-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
P(I|<s>) = 2/3, P(Sam|<s>) = 1/3, P(am|I) = 2/3
P(</s>|Sam) = 1/2, P(Sam|am) = 1/2, P(do|I) = 1/3
This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set | Model)
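A minimal sketch of the MLE computation on this three-sentence corpus (the function and variable names are my own, not from the slides):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

# Collect unigram and bigram counts
unigram, bigram = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigram.update(words)
    bigram.update(zip(words, words[1:]))

def p_mle(w, prev):
    # P(w | prev) = c(prev, w) / c(prev)
    return bigram[(prev, w)] / unigram[prev]

print(p_mle("I", "<s>"))   # 0.666... (= 2/3)
print(p_mle("am", "I"))    # 0.666... (= 2/3)
print(p_mle("Sam", "am"))  # 0.5
```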
Maximum Likelihood Estimates
The maximum likelihood estimate of some parameter of a model M from a training set T
is the estimate that maximizes the likelihood of the training set T given the model M
Suppose the word “Chinese” occurs 400 times in a corpus of a million words (Brown corpus)
What is the probability that a random word from some other text will be “Chinese”?
The MLE estimate is 400/1000000 = .0004
This may be a bad estimate for some other corpus
But it is the estimate that makes it most likely that “Chinese” will occur 400 times in a million-word corpus
More examples: Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9222 sentences
Raw bigram probabilities
Normalize by unigrams:
Result:
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
P(i|<s>) × P(want|i) × P(english|want) × P(food|english) × P(</s>|food)
= .25 × .33 × .0011 × 0.5 × 0.68 = .000031
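The product can be checked mechanically, using the bigram estimates quoted on these slides:

```python
# Bigram estimates quoted on these slides
p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

# Multiply the conditional probability of each word given its predecessor
words = ["<s>", "i", "want", "english", "food", "</s>"]
prob = 1.0
for prev, w in zip(words, words[1:]):
    prob *= p[(prev, w)]
print(f"{prob:.6f}")  # 0.000031
```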
What kinds of knowledge?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat|to) = .28
P(food|to) = 0
P(want|spend) = 0
P(i|<s>) = .25
The Shannon Visualization Method
Generate random sentences:
Choose a random bigram (<s>, w) according to its probability
Now choose a random bigram (w, x) according to its probability
And so on until we choose </s>
Then string the words together
<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
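A sketch of the Shannon visualization method; the bigram table here is a toy stand-in, not the actual Berkeley Restaurant Project model:

```python
import random

# Toy bigram model; the probabilities are illustrative only
model = {
    "<s>": [("i", 1.0)],
    "i": [("want", 1.0)],
    "want": [("to", 0.7), ("chinese", 0.3)],
    "to": [("eat", 1.0)],
    "eat": [("chinese", 1.0)],
    "chinese": [("food", 1.0)],
    "food": [("</s>", 1.0)],
}

def generate(model, max_len=20):
    """Follow the bigram chain from <s>, sampling each next word
    according to its probability, until </s> is drawn."""
    word, out = "<s>", []
    for _ in range(max_len):
        words, probs = zip(*model[word])
        word = random.choices(words, probs)[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate(model))
```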
Shakespeare as corpus
N = 884,647 tokens, V = 29,066
Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table)
Quadrigrams are worse: what’s coming out looks like Shakespeare because it is Shakespeare
The Wall Street Journal is not Shakespeare (no offense)
Evaluation
We train the parameters of our model on a training set.
How do we evaluate how well our model works?
We look at the model’s performance on some new data
This is what happens in the real world; we want to know how our model performs on data we haven’t seen
So we need a test set: a dataset which is different from our training set
Then we need an evaluation metric to tell us how well our model is doing on the test set.
One such metric is perplexity (to be introduced below)
Unknown words: open versus closed vocabulary tasks
If we know all the words in advance:
Vocabulary V is fixed
Closed vocabulary task
Often we don’t know this:
Out Of Vocabulary = OOV words
Open vocabulary task
Instead: create an unknown word token <UNK>
Training of <UNK> probabilities:
– Create a fixed lexicon L of size V
– At the text normalization phase, any training word not in L is changed to <UNK>
– Now we train its probabilities like a normal word
At decoding time:
– If text input: use <UNK> probabilities for any word not in training
Evaluating N-gram models
Best evaluation for an N-gram:
Put model A in a speech recognizer
Run recognition, get word error rate (WER) for A
Put model B in the speech recognizer, get word error rate for B
Compare WER for A and B
“In-vivo” evaluation
Difficulty of in-vivo evaluation of N-gram models
In-vivo evaluation:
This is really time-consuming
Can take days to run an experiment
So:
As a temporary solution, in order to run experiments
To evaluate N-grams we often use an approximation called perplexity
But perplexity is a poor approximation unless the test data looks just like the training data
So it is generally only useful in pilot experiments (generally not sufficient to publish)
But it is helpful to think about.
Perplexity
Perplexity is the probability of the test set (assigned by the language model), normalized by the number of words:
PP(W) = P(w1 w2 ... wN)^(-1/N)
Chain rule:
PP(W) = (product over i of 1 / P(wi | w1...wi-1))^(1/N)
For bigrams:
PP(W) = (product over i of 1 / P(wi | wi-1))^(1/N)
Minimizing perplexity is the same as maximizing probability
The best language model is one that best predicts an unseen test set
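A sketch of the perplexity computation, done in log space to avoid underflow; the digit model below is a toy uniform model (names are my own) matching the digit-task intuition that follows:

```python
import math

def perplexity(text, model):
    """PP(W) = P(w1 ... wN) ** (-1/N) under a bigram model."""
    words = ["<s>"] + text.split() + ["</s>"]
    log_prob = sum(math.log(model[(prev, w)])
                   for prev, w in zip(words, words[1:]))
    n = len(words) - 1  # number of predicted words
    return math.exp(-log_prob / n)

# Toy model: every digit (and </s>) equally likely with P = 0.1
digits = "0123456789"
model = {(a, b): 0.1
         for a in ["<s>"] + list(digits)
         for b in list(digits) + ["</s>"]}
print(round(perplexity("3 1 4 1 5", model), 6))  # 10.0
```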
A totally different perplexity intuition
How hard is the task of recognizing the digits ‘0,1,2,3,4,5,6,7,8,9,oh’? Easy: perplexity 11 (or, if we ignore ‘oh’, perplexity 10)
How hard is recognizing (30,000) names at Microsoft? Hard: perplexity = 30,000
If a system has to recognize:
Operator (1 in 4)
Sales (1 in 4)
Technical Support (1 in 4)
30,000 names (1 in 120,000 each)
Perplexity is 54
Perplexity is the weighted equivalent branching factor
Slide from Josh Goodman
Perplexity as branching factor
Lower perplexity = better model
Training 38 million words, test 1.5 million words, WSJ
Lesson 1: the perils of overfitting
N-grams only work well for word prediction if the test corpus looks like the training corpus
In real life, it often doesn’t
We need to train robust models, adapt to the test set, etc.
Lesson 2: zeros or not?

Zipf's Law:
– A small number of events occur with high frequency
– A large number of events occur with low frequency
– You can quickly collect statistics on the high-frequency events
– You might have to wait an arbitrarily long time to get valid statistics on low-frequency events

Result: our estimates are sparse! We have no counts at all for the vast bulk of things we want to estimate. Some of the zeros in the table are really zeros, but others are simply low-frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN!

How to address this? Answer: estimate the likelihood of unseen N-grams!

Slide adapted from Bonnie Dorr and Julia Hirschberg
Smoothing is like Robin Hood: steal from the rich and give to the poor (in probability mass).
Slide from Dan Klein
Laplace smoothing

Also called add-one smoothing. Just add one to all the counts! Very simple.

MLE estimate:
P(wi | wi-1) = c(wi-1, wi) / c(wi-1)

Laplace estimate:
P_Laplace(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)

Reconstructed counts:
c*(wi-1, wi) = (c(wi-1, wi) + 1) × c(wi-1) / (c(wi-1) + V)
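A minimal sketch of Laplace-smoothed bigram estimation (the tiny corpus and function name are made up for illustration):

```python
from collections import Counter

def laplace_bigram_prob(bigrams, unigrams, vocab_size):
    """P_Laplace(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)."""
    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob

corpus = "i want to eat i want chinese food".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size = 6

p = laplace_bigram_prob(bigrams, unigrams, V)
print(p("i", "want"))  # (2 + 1) / (2 + 6) = 0.375
print(p("i", "food"))  # unseen bigram still gets (0 + 1) / (2 + 6) = 0.125
```

Note that every unseen bigram now gets nonzero probability, at the cost of taking mass away from the seen bigrams.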
Laplace smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Note the big change to the counts

C(want to) went from 608 to 238! P(to|want) went from .66 to .26! Discount d = c*/c:
d for "chinese food" = .10, a 10x reduction!
So in general Laplace is a blunt instrument; we could use a more fine-grained method (add-k).
Laplace smoothing is not used for N-grams, as we have much better methods. Despite its flaws, Laplace (add-k) is still used to smooth other probabilistic models in NLP, especially:
– for pilot studies
– in domains where the number of zeros isn't so huge.
Better discounting algorithms

The intuition used by many smoothing algorithms (Good-Turing, Kneser-Ney, Witten-Bell) is to use the count of things we've seen once to help estimate the count of things we've never seen.
Good-Turing: Josh Goodman's intuition

Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass.
You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.
How likely is it that the next species is new (i.e., catfish or bass)? 3/18.
Assuming so, how likely is it that the next species is trout? It must be less than 1/18.
Slide adapted from Josh Goodman
Good-Turing intuition

Notation: Nc is the frequency of frequency c, the number of types seen exactly c times. So N10 = 1, N1 = 3, etc.
To estimate the total probability of unseen species, use the count of species (words) we've seen once:
p0 = N1/N = 3/18
All other estimates are adjusted down, using c* = (c+1) N(c+1) / Nc, to free up probability mass for the unseen:
c*(trout) = (1+1) × N2/N1 = 2 × 1/3 = 2/3
P_GT(trout) = (2/3) / 18 ≈ 1/27, less than the MLE's 1/18, as required.
Slide from Josh Goodman
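The fishing example can be worked through in a few lines (a sketch; the variable and function names are made up):

```python
from collections import Counter

# The catch above: counts per species.
catch = {"carp": 10, "perch": 3, "whitefish": 2,
         "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())                 # 18 fish
freq_of_freq = Counter(catch.values())  # N_1 = 3, N_2 = 1, N_3 = 1, N_10 = 1

# Probability mass reserved for unseen species: p0 = N_1 / N
p0 = freq_of_freq[1] / N
print(p0)  # 3/18 ≈ 0.167

# Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c
def gt_count(c):
    return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

print(gt_count(1))      # 2 * N_2 / N_1 = 2 * 1/3 ≈ 0.667
print(gt_count(1) / N)  # P_GT(trout) ≈ 0.037, down from 1/18 ≈ 0.056
```

The singleton count 1 is discounted to 2/3, and the mass thus freed is exactly what p0 hands to the unseen species.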
Bigram frequencies of frequencies and GT re-estimates
Complications

In practice, we assume large counts (c > k for some k) are reliable and leave them unsmoothed. That complicates c*, making it (for 1 ≤ c ≤ k):

c* = [ (c+1) N(c+1)/Nc − c (k+1) N(k+1)/N1 ] / [ 1 − (k+1) N(k+1)/N1 ]

Also, we assume singleton counts (c = 1) are unreliable, so we treat N-grams with a count of 1 as if their count were 0. And the Nc must be non-zero, so we need to smooth (interpolate) the Nc counts before computing c* from them.
Backoff and Interpolation
Another really useful source of knowledge: if we are estimating the trigram p(z|xy) but c(xyz) is zero, use info from the bigram p(z|y), or even the unigram p(z).
How do we combine the trigram/bigram/unigram info?
Backoff versus interpolation

Backoff: use the trigram if you have it, otherwise the bigram, otherwise the unigram.
Interpolation: mix all three.
Interpolation

Simple interpolation:

P̂(wn | wn-2 wn-1) = λ1 P(wn | wn-2 wn-1) + λ2 P(wn | wn-1) + λ3 P(wn),  with λ1 + λ2 + λ3 = 1

Lambdas conditional on context:

P̂(wn | wn-2 wn-1) = λ1(wn-2 wn-1) P(wn | wn-2 wn-1) + λ2(wn-2 wn-1) P(wn | wn-1) + λ3(wn-2 wn-1) P(wn)
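A sketch of simple interpolation with fixed lambdas (all probabilities and names below are hypothetical placeholders):

```python
def interpolated_prob(trigram_p, bigram_p, unigram_p, lambdas):
    """P_hat(z | x, y) = l1*P(z|x,y) + l2*P(z|y) + l3*P(z), l1+l2+l3 = 1."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    def prob(x, y, z):
        return (l1 * trigram_p.get((x, y, z), 0.0)
                + l2 * bigram_p.get((y, z), 0.0)
                + l3 * unigram_p.get(z, 0.0))
    return prob

# Hypothetical component probabilities, just to show the mixing:
tri = {("i", "want", "to"): 0.5}
bi = {("want", "to"): 0.6}
uni = {"to": 0.1}
p = interpolated_prob(tri, bi, uni, (0.6, 0.3, 0.1))
print(p("i", "want", "to"))  # 0.6*0.5 + 0.3*0.6 + 0.1*0.1 ≈ 0.49
```

Because the lambdas sum to 1 and each component is a proper distribution, the mixture is itself a proper distribution.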
How to set the lambdas?

Use a held-out corpus: choose the lambdas which maximize the probability of some held-out data.
I.e., fix the N-gram probabilities, then search for the lambda values that, when plugged into the previous equation, give the largest probability for the held-out set. We can use EM to do this search.
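The search can be sketched with a brute-force grid over the lambda simplex rather than EM (purely illustrative; the held-out probability triples below are hypothetical):

```python
import math
from itertools import product

def heldout_loglik(lambdas, heldout_probs):
    """Log probability of the held-out data; heldout_probs holds a fixed
    (trigram, bigram, unigram) probability triple per held-out token."""
    l1, l2, l3 = lambdas
    total = 0.0
    for t, b, u in heldout_probs:
        mix = l1 * t + l2 * b + l3 * u
        if mix <= 0:  # a held-out token became impossible
            return -math.inf
        total += math.log(mix)
    return total

def grid_search(heldout_probs, step=0.1):
    """Try every lambda triple on a grid; EM would be used in practice."""
    best, best_ll = None, -math.inf
    steps = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in product(steps, steps):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue
        l3 = max(l3, 0.0)
        ll = heldout_loglik((l1, l2, l3), heldout_probs)
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best

# Hypothetical fixed N-gram probabilities for three held-out tokens:
data = [(0.5, 0.3, 0.1), (0.0, 0.4, 0.1), (0.6, 0.2, 0.1)]
print(grid_search(data))
```

Note how the second token, whose trigram probability is zero, pushes the search away from putting all the weight on the trigram.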
Katz Backoff

P_katz(z | x, y) = P*(z | x, y) if c(x, y, z) > 0; otherwise α(x, y) P_katz(z | y)
P_katz(z | y) = P*(z | y) if c(y, z) > 0; otherwise α(y) P(z)

where P* is the discounted probability and α distributes the left-over probability mass over the lower-order distribution.
Why discounts P* and alpha?

MLE probabilities sum to 1. So if we used MLE probabilities but backed off to a lower-order model whenever the MLE probability is zero, we would be adding extra probability mass, and the total probability would be greater than 1.
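The over-counting is easy to demonstrate on a toy corpus (a sketch; the corpus and names are invented):

```python
from collections import Counter

# Back off to the unigram MLE whenever the bigram MLE is zero,
# WITHOUT any discounting: the resulting "distribution" over the
# next word sums to more than 1.
corpus = "a b a c".split()
vocab = sorted(set(corpus))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def naive_backoff(prev, w):
    if bigrams[(prev, w)] > 0:
        return bigrams[(prev, w)] / unigrams[prev]
    return unigrams[w] / len(corpus)  # un-discounted backoff

total = sum(naive_backoff("a", w) for w in vocab)
print(total)  # 1.5 — the seen-bigram mass already sums to 1
```

The seen bigrams after "a" already account for probability 1, so the backed-off unigram mass for the unseen continuation comes on top of it; discounting (P*) and the normalizer α are what make the total come back to 1.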
GT smoothed bigram probs
Intuition of backoff + discounting

How much probability should we assign to all the zero trigrams? Use GT or another discounting algorithm to tell us.
How do we divide that probability mass among different contexts? Use the (N-1)-gram estimates to tell us.
What do we do for unigram words not seen in training? These are Out-Of-Vocabulary (OOV) words.
OOV words: the <UNK> word

Out-Of-Vocabulary = OOV words. We don't use GT smoothing for these, because GT assumes we know the number of unseen events. Instead: create an unknown word token <UNK>.
Training of <UNK> probabilities:
– Create a fixed lexicon L of size V
– At the text normalization phase, any training word not in L is changed to <UNK>
– Now we train its probabilities like a normal word
At decoding time:
– If text input: use the <UNK> probabilities for any word not seen in training
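A sketch of the training-side substitution (choosing the lexicon L as the most frequent words is one common choice, assumed here for illustration; the function name is made up):

```python
from collections import Counter

def with_unk(tokens, vocab_size):
    """Replace every token outside a fixed lexicon L (here: the
    vocab_size most frequent training words) with <UNK>; <UNK>'s
    probabilities are then trained like any normal word."""
    counts = Counter(tokens)
    lexicon = {w for w, _ in counts.most_common(vocab_size)}
    return [w if w in lexicon else "<UNK>" for w in tokens]

train = "the cat sat on the mat the cat ran".split()
print(with_unk(train, 2))
# ['the', 'cat', '<UNK>', '<UNK>', 'the', '<UNK>', 'the', 'cat', '<UNK>']
```

At decoding time the same mapping is applied to the input, so any word outside L reuses the trained <UNK> statistics.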
Practical Issues

We do everything in log space:
– to avoid underflow
– (also, adding is faster than multiplying)
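A quick demonstration of why (a sketch):

```python
import math

# Multiplying many small probabilities underflows to 0.0;
# summing their logs stays well within floating-point range.
probs = [1e-5] * 80

naive = 1.0
for p in probs:
    naive *= p
print(naive)  # 0.0 — underflowed (1e-400 is below double range)

log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # ≈ -921.03, still perfectly usable
```

Log probabilities can be compared and combined directly; you only exponentiate (if ever) at the very end.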
ARPA format
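The slide's example is an image; as an illustrative sketch (the words and log10 values below are made up), an ARPA-format file lists each N-gram with its log10 probability and, where applicable, a log10 backoff weight:

```
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0000  <s>   -0.3010
-0.6990  i     -0.2218
-1.0000  want  -0.3979

\2-grams:
-0.3010  <s> i
-0.1761  i want

\end\
```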
Language Modeling Toolkits

SRILM
CMU-Cambridge LM Toolkit
Google N-Gram Release
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
Advanced LM stuff

Current best smoothing algorithm: Kneser-Ney smoothing
Other stuff:
Variable-length n-grams
Class-based n-grams
– Clustering
– Hand-built classes
Cache LMs
Topic-based LMs
Sentence mixture models
Skipping LMs
Parser-based LMs
Summary

LM
N-grams
Discounting: Good-Turing
Katz backoff with Good-Turing discounting
Interpolation
Unknown words
Evaluation:
– Entropy, entropy rate, cross-entropy
– Perplexity
Advanced LM algorithms