Natural Language Processing
Language Model
Language Models
• Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences in a language
• For NLP, a probabilistic model that assigns every string a probability of being in the language is more useful
• To specify a correct probability distribution, the probability of all sentences in a language must sum to 1
Uses of Language Models
• Speech recognition – "I ate a cherry" is a more likely sentence than "Eye eight uh Jerry"
• OCR & handwriting recognition – more probable sentences are more likely correct readings
• Machine translation – more likely sentences are probably better translations
• Generation – more likely sentences are probably better NL generations
• Context-sensitive spelling correction – "Their are problems wit this sentence."
Completion Prediction
• A language model also supports predicting the completion of a sentence
– Please turn off your cell _____
– Your program does not ______
• Predictive text input systems can guess what you are typing and give choices on how to complete it
Probability
• P(X) means the probability that X is true
– P(baby is a boy) = 0.5 (fraction of all babies that are boys)
– P(baby is named John) = 0.001 (fraction of all babies named John)

[Venn diagram: the set of babies contains the set of baby boys, which contains the set of boys named John]
Probability
• P(X|Y) means the probability that X is true when we already know Y is true
– P(baby is named John | baby is a boy) = 0.002
– P(baby is a boy | baby is named John) = 1
Probability
• P(X|Y) = P(X, Y) / P(Y)
– P(baby is named John | baby is a boy) = P(baby is named John, baby is a boy) / P(baby is a boy) = 0.001 / 0.5 = 0.002
Bayes Rule
• Bayes rule: P(X|Y) = P(Y|X) P(X) / P(Y)
• P(named John | boy) = P(boy | named John) P(named John) / P(boy)
Word Sequence Probabilities
• Given a word sequence (sentence) w_1 … w_n (written w_1^n), its probability is:

P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) … P(w_n | w_1^{n-1}) = ∏_{k=1}^{n} P(w_k | w_1^{k-1})

• The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N−1)th-order Markov model.
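The chain rule can be sketched numerically; the conditional probabilities below are hypothetical toy values, not estimates from any corpus:

```python
import math

# Chain rule: P(w_1..w_n) is the product of each word's conditional
# probability given all preceding words. Toy values for a 4-word sentence.
cond_probs = [0.25, 0.32, 0.05, 0.4]  # P(w_k | w_1 .. w_{k-1}) for k = 1..4

p_sentence = math.prod(cond_probs)
```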
N-Gram Models
• Estimate probability of each word given prior context
– P(phone | Please turn off your cell)
• The number of parameters required grows exponentially with the number of words of prior context
• An N-gram model uses only N−1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
• Bigram approximation:

P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})

• N-gram approximation:

P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-N+1}^{k-1})
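A minimal sketch of the bigram approximation; the bigram table is hypothetical, with probabilities made up for illustration:

```python
# Bigram approximation: P(w_1^n) is approximated by the product of
# P(w_k | w_{k-1}), padding the sentence with <s> and </s>.
bigram_p = {
    ("<s>", "please"): 0.2,
    ("please", "turn"): 0.1,
    ("turn", "off"): 0.4,
    ("off", "</s>"): 0.05,
}

def bigram_sentence_prob(words, probs):
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for w1, w2 in zip(padded, padded[1:]):
        p *= probs.get((w1, w2), 0.0)  # an unseen bigram zeroes the product
    return p
```

Note that a single unseen bigram makes the whole product zero, which is the sparseness problem the smoothing sections address.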
Estimating Probabilities
• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.
Bigram:

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

N-gram:

P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})
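The relative-frequency estimate can be sketched directly from counts; the two-sentence corpus here is a toy example:

```python
from collections import Counter

# MLE bigram estimates: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}),
# with <s> and </s> appended to every sentence.
corpus = ["I want to eat", "I want Chinese food"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.lower().split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]
```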
Generative Model and MLE
• An N-gram model can be seen as a probabilistic automaton for generating sentences:

Initialize sentence with N−1 <s> symbols
Until </s> is generated do:
  Stochastically pick the next word based on the conditional probability of each word given the previous N−1 words.

• Relative frequency estimates can be proven to be maximum likelihood estimates (MLE), since they maximize the probability that the model M will generate the training corpus T:

θ̂ = argmax_θ P(T | M(θ))
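The generation loop above can be sketched as follows; the bigram model (N = 2, so a single <s> symbol) and its probabilities are hypothetical:

```python
import random

# Each entry maps the previous word to (next word, probability) choices.
model = {
    "<s>": [("i", 0.5), ("please", 0.5)],
    "i": [("want", 1.0)],
    "want": [("food", 1.0)],
    "food": [("</s>", 1.0)],
    "please": [("stop", 1.0)],
    "stop": [("</s>", 1.0)],
}

def generate(model, seed=0):
    rng = random.Random(seed)
    sentence, prev = [], "<s>"
    while True:
        words, weights = zip(*model[prev])
        nxt = rng.choices(words, weights=weights)[0]  # stochastic pick
        if nxt == "</s>":
            return sentence
        sentence.append(nxt)
        prev = nxt
```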
Example
• Estimate the likelihood of the sentence: I want to eat Chinese food

P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)

• What do we need to calculate these likelihoods?
– Bigram probabilities for each word pair sequence in the sentence
– Calculated from a large corpus
Corpus
• A language model must be trained on a large corpus of text to estimate good parameter values
• Model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate)
• Ideally, the training (and test) corpus should be representative of the actual application data
Terminology
• Types: number of distinct words in a corpus (vocabulary size)
• Tokens: total number of words
Early Bigram Probabilities from BERP
| Bigram | P | Bigram | P |
|---|---|---|---|
| Eat British | .001 | Eat today | .03 |
| Eat dessert | .007 | Eat Indian | .04 |
| Eat tomorrow | .01 | Eat a | .04 |
| Eat Mexican | .02 | Eat at | .04 |
| Eat Chinese | .02 | Eat dinner | .05 |
| Eat in | .02 | Eat lunch | .06 |
| Eat breakfast | .03 | Eat some | .06 |
| Eat Thai | .03 | Eat on | .16 |
| Bigram | P | Bigram | P |
|---|---|---|---|
| British lunch | .01 | Want a | .05 |
| British cuisine | .01 | Want to | .65 |
| British restaurant | .15 | I have | .04 |
| British food | .60 | I don’t | .08 |
| To be | .02 | I would | .29 |
| To spend | .09 | I want | .32 |
| To have | .14 | <start> I’m | .02 |
| To eat | .26 | <start> Tell | .04 |
| Want Thai | .01 | <start> I’d | .06 |
| Want some | .04 | <start> I | .25 |
Back to our sentence…
• I want to eat Chinese food
Bigram counts (rows: first word, columns: second word):

|  | I | Want | To | Eat | Chinese | Food | Lunch |
|---|---|---|---|---|---|---|---|
| I | 8 | 1087 | 0 | 13 | 0 | 0 | 0 |
| Want | 3 | 0 | 786 | 0 | 6 | 8 | 6 |
| To | 3 | 0 | 10 | 860 | 3 | 0 | 12 |
| Eat | 0 | 0 | 2 | 0 | 19 | 2 | 52 |
| Chinese | 2 | 0 | 0 | 0 | 0 | 120 | 1 |
| Food | 19 | 0 | 17 | 0 | 0 | 0 | 0 |
| Lunch | 4 | 0 | 0 | 0 | 0 | 1 | 0 |
Relative Frequencies
• Normalization: divide each row's bigram counts by the unigram count of the first word w_{n-1}
• Computing the bigram probability of I I:
– C(I I) / C(I)
– p(I | I) = 8 / 3437 = .0023

Unigram counts:

| I | Want | To | Eat | Chinese | Food | Lunch |
|---|---|---|---|---|---|---|
| 3437 | 1215 | 3256 | 938 | 213 | 1506 | 459 |
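Reproducing the computation from the counts above:

```python
# Relative frequencies from the BERP counts:
# p(I | I) = C(I I) / C(I), p(want | I) = C(I want) / C(I)
c_I = 3437       # unigram count of "I"
c_I_I = 8        # bigram count of "I I"
c_I_want = 1087  # bigram count of "I want"

p_I_given_I = c_I_I / c_I
p_want_given_I = c_I_want / c_I  # matches the .32 in the probability table
```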
Approximating Shakespeare
• Generating sentences with random unigrams...
– Every enter now severally so, let
– Hill he late speaks; or! a more to leg less first you enter
• With bigrams...
– What means, sir. I confess she? then all sorts, he is trim, captain.
– Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
• Trigrams...
– Sweet prince, Falstaff shall die.
– This shall forbid it should be branded, if renown made it empty.
Approximating Shakespeare
• Quadrigrams
– What! I will go seek the traitor Gloucester.
– Will you not tell me who I am?
– What's coming out here looks like Shakespeare because it is Shakespeare
• Note: As we increase the value of N, the accuracy of an n-gram model increases, since the choice of the next word becomes increasingly constrained
Evaluation
• Perplexity and entropy: how do you estimate how well your language model fits a corpus once you’re done?
![Page 23: Natural Language Processing Language Model. Language Models Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences](https://reader035.vdocument.in/reader035/viewer/2022062304/56649de45503460f94adc446/html5/thumbnails/23.jpg)
Random Variables
• A variable defined by the probabilities of each possible value in the population
• Discrete Random Variable – Whole Number (0, 1, 2, 3 etc.)– Countable, Finite Number of Values
• Jump from one value to the next and cannot take any values in between
• Continuous Random Variables– Whole or Fractional Number– Obtained by Measuring– Infinite Number of Values in Interval
• Too many to list, unlike a discrete variable
![Page 24: Natural Language Processing Language Model. Language Models Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences](https://reader035.vdocument.in/reader035/viewer/2022062304/56649de45503460f94adc446/html5/thumbnails/24.jpg)
Discrete Random Variables
• For example:
# of girls in the family
# of correct answers on a given exam
…
Probability Mass Function
• Probability
– x = value of random variable (outcome)
– p(x) = probability associated with value
• Mutually exclusive (no overlap)
• Collectively exhaustive (nothing left out)
• 0 ≤ p(x) ≤ 1
• Σ p(x) = 1
Measures
• Expected Value
– Mean of probability distribution
– Weighted average of all possible values
– μ = E(X) = Σ x p(x)
• Variance
– Weighted average squared deviation about the mean
– σ² = V(X) = E[(X − μ)²] = Σ (x − μ)² p(x)
– σ² = V(X) = E(X²) − [E(X)]²
• Standard Deviation
– σ = √σ² = SD(X)
Perplexity and Entropy
• Information theoretic metrics– Useful in measuring how well a grammar or language model
(LM) models a natural language or a corpus
• Entropy: With 2 LMs and a corpus, which LM is the better match for the corpus? How much information is there (in e.g. a grammar or LM) about what the next word will be?
• For a random variable X ranging over e.g. bigrams and a probability function p(x), the entropy of X is the expected negative log probability:

H(X) = −Σ_x p(x) log₂ p(x)
Entropy - Example
• Horse race: 8 horses, and we want to send a bet to the bookie
• In the naive way: a 3-bit message
• Can we do better?
• Suppose we know the distribution of the bets placed, i.e.:

| Horse | P | Horse | P |
|---|---|---|---|
| Horse 1 | 1/2 | Horse 5 | 1/64 |
| Horse 2 | 1/4 | Horse 6 | 1/64 |
| Horse 3 | 1/8 | Horse 7 | 1/64 |
| Horse 4 | 1/16 | Horse 8 | 1/64 |
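The entropy of this bet distribution works out to 2 bits, which is what makes a better-than-3-bit code possible; a quick check:

```python
import math

# H(X) = -sum_x p(x) * log2 p(x) over the eight horses
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

H = -sum(p * math.log2(p) for p in probs)  # 2.0 bits
```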
Entropy - Example
• The entropy of the random variable X gives a lower bound on the average number of bits needed to encode the bet
Perplexity
• The weighted average number of choices a random variable has to make: perplexity(X) = 2^{H(X)}
• In the previous example, with the naive uniform encoding over 8 horses, it is 2³ = 8
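Assuming the standard definition perplexity = 2^H, the two cases from the horse-race example can be compared:

```python
import math

# Uniform bets over 8 horses: H = log2(8) = 3 bits, perplexity 8.
perplexity_uniform = 2 ** math.log2(8)

# The skewed bet distribution has H = 2 bits, so perplexity 4:
# effectively a choice among 4 equally likely outcomes.
perplexity_skewed = 2 ** 2.0
```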
Entropy of a Language
• Measuring all the sequences of length n in a language L:

H(w_1, …, w_n) = −Σ_{w_1^n ∈ L} p(w_1^n) log p(w_1^n)

• Entropy rate: the per-word entropy (1/n) H(w_1, …, w_n); the entropy of the language is the limit as n → ∞
Cross Entropy and Perplexity
• Given an approximation of the language's true distribution p and a model m, we want to measure how well m predicts p:

H(p, m) = lim_{n→∞} −(1/n) Σ_{w_1^n} p(w_1^n) log m(w_1^n)

• Perplexity = 2^{H(p, m)}
Perplexity
• Better models m of the unknown distribution p will tend to assign higher probabilities m(xi) to the test events. Thus, they have lower perplexity: they are less surprised by the test sample
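A sketch of measuring perplexity on held-out data; the per-word model probabilities below are hypothetical:

```python
import math

# Estimate per-word cross-entropy as -(1/n) * sum log2 m(w_i | history),
# then perplexity = 2 ** cross-entropy.
def perplexity(word_probs):
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** cross_entropy

better = perplexity([0.25, 0.5, 0.25, 0.5])  # model assigns higher probs
worse = perplexity([0.05, 0.1, 0.05, 0.1])   # model is more "surprised"
```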
Example
Slide from Philipp Koehn
Comparison 1-4 grams LM
Slide from Philipp Koehn
Smoothing
• Words follow a Zipfian distribution
– A small number of words occur very frequently
– A large number are seen only once
• A zero probability for a single bigram makes the probability of the entire sentence zero
• So... how do we estimate the likelihood of unseen n-grams?
Steal from the rich and give to the poor (in probability mass)
Slide from Dan Klein
Add-One Smoothing
• For all possible n-grams, add a count of one:

P = (C + 1) / (N + V)

– C = count of n-gram in corpus
– N = count of history
– V = vocabulary size
• But there are many more unseen n-grams than seen n-grams
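A minimal sketch of add-one smoothing with hypothetical counts, using V = 1616 as in the slides' vocabulary:

```python
from collections import Counter

# P(w | prev) = (C(prev w) + 1) / (C(prev) + V)
V = 1616
bigram_counts = Counter({("eat", "lunch"): 6})  # hypothetical counts
history_counts = Counter({"eat": 20})

def p_add_one(w, prev):
    return (bigram_counts[(prev, w)] + 1) / (history_counts[prev] + V)
```

Every unseen bigram now gets probability 1 / (20 + 1616) instead of zero.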
Add-One Smoothing
• Unsmoothed bigram counts (rows: 1st word, columns: 2nd word) and unsmoothed normalized bigram probabilities were shown as tables on the slide
• Vocabulary = 1616
Add-One Smoothing
• Add-one smoothed bigram counts and add-one normalized bigram probabilities were shown as tables on the slide
Add-One Problems
• Bigrams starting with Chinese are boosted by a factor of 8! (1829 / 213: with add-one, the denominator grows from C(Chinese) = 213 to 213 + 1616 = 1829)
• Compare the unsmoothed and add-one smoothed bigram counts (tables on slide)
Add-One Problems
• Every previously unseen n-gram is given a low probability, but there are so many of them that too much probability mass is given to unseen events
• Adding 1 to a frequent bigram changes it little, but adding 1 to rare bigrams (including unseen ones) boosts them too much!
• In NLP applications that are very sparse, Add-One actually gives far too much of the probability space to unseen events
![Page 43: Natural Language Processing Language Model. Language Models Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences](https://reader035.vdocument.in/reader035/viewer/2022062304/56649de45503460f94adc446/html5/thumbnails/43.jpg)
Witten-Bell Smoothing
• Intuition:
– An unseen n-gram is one that just did not occur yet
– When it does happen, it will be its first occurrence
– So give unseen n-grams the probability of seeing a new n-gram
Witten-Bell – Unigram Case
• N: number of tokens
• T: number of types (distinct observed words), which can differ from V (the number of words in the dictionary)
• Probability mass reserved for unseen unigrams: T / (N + T)
• Probability of a seen unigram w: C(w) / (N + T)
Witten-Bell – Bigram Case
• Condition on the previous word w
• Probability mass reserved for unseen bigrams starting with w: T(w) / (N(w) + T(w))
• Probability of a seen bigram (w, w′): C(w, w′) / (N(w) + T(w))
Witten-Bell Example
• The original counts were shown in the earlier bigram count table
• T(w) = number of distinct bigram types seen starting with w
• We have a vocabulary of 1616 words, so we can compute Z(w) = number of unseen bigram types starting with w: Z(w) = 1616 − T(w)
• N(w) = number of bigram tokens starting with w
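A sketch of the bigram case using the slide's notation (T(w), N(w), Z(w)); the counts here are hypothetical stand-ins:

```python
from collections import Counter

V = 1616  # vocabulary size, as in the slides
bigram_counts = Counter({("eat", "lunch"): 6, ("eat", "dinner"): 5})

def witten_bell(w, prev):
    starts = {pair: c for pair, c in bigram_counts.items() if pair[0] == prev}
    T = len(starts)           # seen bigram types starting with prev
    N = sum(starts.values())  # bigram tokens starting with prev
    Z = V - T                 # unseen bigram types starting with prev
    if (prev, w) in bigram_counts:
        return bigram_counts[(prev, w)] / (N + T)  # seen: discounted MLE
    return T / (Z * (N + T))  # unseen: equal share of the reserved mass
```

The probabilities sum to 1 for each history: seen bigrams get N/(N+T) in total, and the Z unseen types share the reserved T/(N+T).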
Witten-Bell Example
• WB smoothed probabilities (table on slide)
Back-off
• So far, we gave the same probability to all unseen n-grams
– We have never seen the bigrams:
• journal of: P_unsmoothed(of | journal) = 0
• journal from: P_unsmoothed(from | journal) = 0
• journal never: P_unsmoothed(never | journal) = 0
– All models so far will give the same probability to all 3 bigrams
• But intuitively, “journal of” is more probable because...
– “of” is more frequent than “from” and “never”
– unigram probability P(of) > P(from) > P(never)
Back-off
• Observation:
– unigram model suffers less from data sparseness than bigram model
– bigram model suffers less from data sparseness than trigram model
– …
• So use a lower-order model estimate to estimate the probability of unseen n-grams
• If we have several models of how the history predicts what comes next, we can combine them in the hope of producing an even better model
Linear Interpolation
• Solve the sparseness in a trigram model by mixing with bigram and unigram models
• Also called: – linear interpolation– finite mixture models – deleted interpolation
• Combine linearly:

P_li(w_n | w_{n−2}, w_{n−1}) = λ₁ P(w_n) + λ₂ P(w_n | w_{n−1}) + λ₃ P(w_n | w_{n−2}, w_{n−1})

– where 0 ≤ λᵢ ≤ 1 and Σᵢ λᵢ = 1
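A sketch of the interpolation, with hypothetical component probabilities and lambda weights:

```python
# P_li = l1 * P(w) + l2 * P(w | prev) + l3 * P(w | prev2, prev)
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # the weights must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

p = interpolate(0.001, 0.01, 0.2)
```

In practice the lambda weights are tuned on held-out data rather than fixed by hand.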
Back-off Smoothing
• Smoothing of Conditional Probabilities
p(Angeles | to, Los)
• If “to Los Angeles” is not in the training corpus, the smoothed probability p(Angeles | to, Los) is identical to p(York | to, Los)
• However, the actual probability is probably close to the bigram probability p(Angeles | Los)
Back-off Smoothing
• (Wrong) back-off smoothing of trigram probabilities:
– if C(w′, w″, w) > 0: P*(w | w′, w″) = P(w | w′, w″)
– else if C(w″, w) > 0: P*(w | w′, w″) = P(w | w″)
– else if C(w) > 0: P*(w | w′, w″) = P(w)
– else: P*(w | w′, w″) = 1 / #words
Back-off Smoothing
• Problem: not a probability distribution
• Solution: combination of back-off and smoothing

if C(w1, …, wk, w) > 0:
  P(w | w1, …, wk) = C*(w1, …, wk, w) / N
else:
  P(w | w1, …, wk) = α(w1, …, wk) · P(w | w2, …, wk)
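A rough sketch of this combined back-off; the counts, the discount, and the alpha weight are all hypothetical (a real α(history) is computed so that the conditional probabilities sum to 1):

```python
trigram_counts = {("to", "los", "angeles"): 4}
bigram_counts = {("los", "angeles"): 9}
N = 1000        # token count (hypothetical)
ALPHA = 0.4     # back-off weight (hypothetical)
DISCOUNT = 0.5  # reserves mass via discounted counts: C* = C - DISCOUNT

def p_backoff(w, history):
    full = tuple(history) + (w,)
    if full in trigram_counts:
        return (trigram_counts[full] - DISCOUNT) / N  # C* / N
    shorter = tuple(history[1:]) + (w,)
    if shorter in bigram_counts:
        return ALPHA * (bigram_counts[shorter] - DISCOUNT) / N
    return ALPHA * ALPHA / N  # keep backing off (sketch only)
```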