Introduction to Language Modeling (Parsing and Machine Translation Reading Group)
Christian Scheible
October 23, 2008
Christian Scheible Introduction to Language Modeling October 23, 2008 1 / 26
Outline
1. Introduction
2. Smoothing Methods
Language Models
What?
Probability distribution for word sequences: p(w1, ..., wn) = p(w1^n)
⇒ How likely is a sentence to be produced?
Why?
• Machine translation
• Speech recognition
• ...
N-gram Models
• Conditional probabilities for words given their N predecessors
• Independence assumption!
→ Approximation
Example
p(My dog finds a bone) ≈ p(My) p(dog|My) p(finds|My, dog) p(a|dog, finds) p(bone|finds, a)
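A minimal sketch of this chain-rule approximation in Python; the trigram probabilities and the `<s>` padding symbols are toy values chosen only for illustration:

```python
# Trigram approximation of a sentence probability: each word is
# conditioned on at most its two predecessors.
# The probabilities below are made-up toy values, not estimates
# from any real corpus.
probs = {
    ("<s>", "<s>", "My"): 0.2,
    ("<s>", "My", "dog"): 0.5,
    ("My", "dog", "finds"): 0.1,
    ("dog", "finds", "a"): 0.4,
    ("finds", "a", "bone"): 0.3,
}

def sentence_prob(words, trigram_probs):
    """Multiply p(w_i | w_{i-2}, w_{i-1}) over the sentence."""
    padded = ["<s>", "<s>"] + words
    p = 1.0
    for i in range(2, len(padded)):
        p *= trigram_probs[(padded[i - 2], padded[i - 1], padded[i])]
    return p

print(sentence_prob(["My", "dog", "finds", "a", "bone"], probs))
```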
Why Smoothing?
Smoothing: Adjusting maximum likelihood estimates to produce more accurate probabilities
Why Smoothing?
• Probabilities: relative frequencies (from corpus)

p(w1, ..., wn) = f(w1, ..., wn) / N

• But not all combinations of w1, ..., wn are covered
→ Data sparseness leads to probability 0 for many word sequences
Example
p(My dog found two bones)
p(My dog found three bones)
p(My dog found seventy bones)
p(My dog found eighty sausages)
p(My dog found ... ...)
Smoothing Models – Overview
• Laplace Smoothing
• Additive Smoothing
• Good-Turing Estimate
• Katz Backoff
• Interpolation (Jelinek-Mercer)
• Absolute Discounting
• Witten-Bell Smoothing
• Kneser-Ney Smoothing
Laplace Smoothing (1)
• Add 1 to the numerator
• Add the size of the vocabulary (|V|) to the denominator
p(w1, ..., wn) = (f(w1, ..., wn) + 1) / (N + |V|)
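A minimal sketch of add-one smoothing over unigram counts; the toy corpus is an assumption made only for illustration:

```python
from collections import Counter

# Add-one (Laplace) smoothing over unigram counts from a toy corpus.
corpus = "my dog finds a bone my dog finds a stick".split()
counts = Counter(corpus)
N = len(corpus)    # total tokens: 10
V = len(counts)    # observed vocabulary size: 6

def p_laplace(word):
    # (f(w) + 1) / (N + |V|); unseen words get 1 / (N + |V|)
    return (counts[word] + 1) / (N + V)

print(p_laplace("dog"))  # seen twice: (2 + 1) / (10 + 6)
print(p_laplace("cat"))  # unseen: (0 + 1) / (10 + 6)
```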
Laplace Smoothing (2)
Problems
• Intended for uniform distributions, but language data usually follows a Zipf distribution
• Overestimates new items, underestimates known items
• Unrealistically alters the probabilities of high-frequency items

MLE   Empirical   Laplace
0     0.000027    0.000137
1     0.448       0.000247
2     1.25        0.000411
3     2.24        0.000548
4     3.23        0.000685
5     4.21        0.000822
6     5.23        0.000959
7     6.21        0.00109
8     7.21        0.00123
9     8.26        0.00137
Additive Smoothing
• Add λ to the numerator
• Add λ · |V| to the denominator
p(w1, ..., wn) = (f(w1, ..., wn) + λ) / (N + λ|V|)
Problems
• λ needs to be determined
• Same issues as Laplace smoothing
Good-Turing Estimate (1)
• Idea: Reallocate the probability mass of events that occur n + 1 times to the events that occur n times
• For each frequency f , produce an adjusted (expected) frequency f ∗
f* ≈ (f + 1) · E(n_{f+1}) / E(n_f) ≈ (f + 1) · n_{f+1} / n_f

pGT(w1, ..., wn) = f*(w1, ..., wn) / Σ_X f*(X)
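The adjusted-frequency computation can be sketched as follows; the toy corpus is an assumption, and the sketch simply assumes the relevant n_f are nonzero:

```python
from collections import Counter

# Good-Turing adjusted frequencies f* = (f + 1) * n_{f+1} / n_f,
# where n_f is the number of distinct events seen exactly f times.
counts = Counter("a a a b b c c d e f".split())
n = Counter(counts.values())   # frequency-of-frequencies: n_1=3, n_2=2, n_3=1

def f_star(f):
    # Assumes n_f and n_{f+1} are nonzero; real implementations must
    # smooth the n_f counts themselves to handle gaps.
    return (f + 1) * n[f + 1] / n[f]

print(f_star(1))  # singletons: (1 + 1) * n_2 / n_1
print(f_star(2))  # doubletons: (2 + 1) * n_3 / n_2
```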
Good-Turing Estimate (2)
MLE   Test Set    Laplace     Good-Turing
0     0.000027    0.000137    0.000027
1     0.448       0.000274    0.446
2     1.25        0.000411    1.26
3     2.24        0.000548    2.24
4     3.23        0.000685    3.24
5     4.21        0.000822    4.22
6     5.23        0.000959    5.19
7     6.21        0.00109     6.21
8     7.21        0.00123     7.24
9     8.26        0.00137     8.25
Problems
• What if n_f is 0?
→ The n_f counts need to be smoothed themselves
• Cannot be combined with other distributions
Katz Backoff (1)
• Combines probability distributions
• Idea: Higher- and lower-order distributions
• Recursive smoothing up to unigrams
p(wi | wk, ..., wi−1) =
    δ · f(wk, ..., wi) / f(wk, ..., wi−1)    if f(wk, ..., wi) > 0
    α · p(wi | wk+1, ..., wi−1)              otherwise

α ensures that the sum of all probabilities equals 1
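A toy sketch of the backoff scheme from bigrams to unigrams; the counts, the discount δ, and the backoff weight α below are made-up values (Katz derives the discount from Good-Turing estimates and α from the normalization constraint):

```python
# Minimal Katz-style backoff from bigrams to unigrams.
# All counts and parameters are hypothetical toy values.
bigrams = {("two", "bones"): 3}
unigrams = {"two": 5, "bones": 4, "some": 2}
N = sum(unigrams.values())
d, alpha = 0.8, 0.4   # hand-picked discount and backoff weight

def p_backoff(w, prev):
    if bigrams.get((prev, w), 0) > 0:
        # seen bigram: discounted relative frequency
        return d * bigrams[(prev, w)] / unigrams[prev]
    # unseen bigram: back off to the (scaled) unigram estimate
    return alpha * unigrams.get(w, 0) / N

print(p_backoff("bones", "two"))   # seen: 0.8 * 3 / 5
print(p_backoff("bones", "some"))  # unseen: 0.4 * 4 / 11
```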
Katz Backoff (2)
Example
Seen: p(two bones)
Unseen: p(some bones)

Known: p(bones|two)
Unknown: p(bones|some)

Known: p(bones)!

The higher-order frequency is related to the lower-order frequency.
Important: p(bones|some) < p(bones|two)
Interpolation (Jelinek-Mercer)
• Another method to combine probability distributions
• Weighted average
p(wi | wk, ..., wi−1) = Σ_{j=0..k} λj · p(wi | wk+j, ..., wi−1)
About λj
• The sum of all λj is 1
• Use held-out data to determine the best set of values
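A minimal worked example of the weighted average; the component estimates and λ values below are hypothetical (in practice the λj are tuned on held-out data):

```python
# Jelinek-Mercer interpolation of trigram, bigram, and unigram
# estimates with weights that sum to 1 (all values are toy numbers).
lambdas = [0.6, 0.3, 0.1]
p_tri, p_bi, p_uni = 0.5, 0.2, 0.05   # hypothetical component estimates

p = lambdas[0] * p_tri + lambdas[1] * p_bi + lambdas[2] * p_uni
print(p)  # 0.6*0.5 + 0.3*0.2 + 0.1*0.05
```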
Absolute Discounting (1)
• Subtract a fixed number from all frequencies
• Uniformly assign the sum of these numbers to new events
p(w1, ..., wn) =
    (f(w1, ..., wn) − δ) / N     if f(w1, ..., wn) − δ > 0
    (A − n0) · δ / (n0 · N)      otherwise

nn: the number of events that occurred n times (A: the number of possible events)
Absolute Discounting (2)
How to determine δ?
δ = n1 / (n1 + 2n2)
Problems
• Still gives wrong results for events with frequency 1
→ Solution: Different discounts for n1, n2, n3+
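The discount estimate can be computed directly from frequency-of-frequency counts; the toy corpus below is an assumption:

```python
from collections import Counter

# Estimate the absolute discount delta = n1 / (n1 + 2*n2) from a
# toy corpus: n_f is the number of distinct events seen f times.
counts = Counter("a a a b b c c d e f".split())
n = Counter(counts.values())   # n_1 = 3, n_2 = 2, n_3 = 1

delta = n[1] / (n[1] + 2 * n[2])
print(delta)  # 3 / (3 + 4)
```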
Witten-Bell Smoothing (1)
• Instance of Interpolation
• λk : Probability of using the higher-order model
pWB(wi | wk, ..., wi−1) = λk · pML(wi | wk, ..., wi−1) + (1 − λk) · pWB(wi | wk+1, ..., wi−1)
Witten-Bell Smoothing (2)
• Thus: 1− λk : Probability of using the lower-order model
• Idea: This is related to the number of different words that follow the history wk, ..., wi−1

N1+(wk, ..., wi−1, •) = number of different words that follow wk, ..., wi−1

1 − λk = N1+(wk, ..., wi−1, •) / ( N1+(wk, ..., wi−1, •) + Σ_{wi} f(wk, ..., wi) )
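A sketch of the weight computation for a single history; the follower counts are made-up toy values:

```python
from collections import Counter

# Witten-Bell weight for one history: 1 - lambda is the number of
# distinct continuation words over that number plus the history's
# total count. Toy data: words observed after the history "two".
followers = Counter({"bones": 3, "sticks": 1})
n1_plus = len(followers)          # N1+ : distinct continuations = 2
total = sum(followers.values())   # f(history, *) = 4

one_minus_lambda = n1_plus / (n1_plus + total)
print(one_minus_lambda)  # 2 / (2 + 4)
```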
Kneser-Ney Smoothing
• Extension of Absolute Discounting: new way to build the lower-order distribution
Problem Case
"San Francisco":
• Francisco is common, but it only occurs after San
• Absolute Discounting will yield a high p(Francisco|new word)
• Not intended!
• A word's unigram probability should not be proportional to its frequency but to the number of different words it follows
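A sketch of this continuation-count idea on a toy bigram list; the data is invented only to reproduce the "San Francisco" effect:

```python
from collections import Counter

# Kneser-Ney idea: score a unigram by how many distinct left
# contexts it appears in, not by its raw frequency.
bigrams = [("San", "Francisco"), ("San", "Francisco"),
           ("San", "Francisco"), ("a", "dog"), ("the", "dog")]

continuation = Counter()
for prev, w in set(bigrams):   # each distinct bigram type counts once
    continuation[w] += 1

print(continuation["Francisco"])  # frequent, but only ever after "San"
print(continuation["dog"])        # rarer, but follows two different words
```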
Modified Kneser-Ney Smoothing (1)
Idea
Similarities to Absolute Discounting
• Interpolation with lower-order model
• Discounts
pKN(wi | wk, ..., wi−1) = (f(wk, ..., wi) − D(f(wk, ..., wi))) / Σ_{wi} f(wk, ..., wi) + γ(wk, ..., wi−1) · pKN(wi | wk+1, ..., wi−1)
Modified Kneser-Ney Smoothing (2)
Discount function D(f)
• Different discounts for frequencies 1, 2 and 3+
D(f) =
    0                       if f = 0
    D1 = 1 − 2Y · n2/n1     if f = 1
    D2 = 2 − 3Y · n3/n2     if f = 2
    D3+ = 3 − 4Y · n4/n3    if f ≥ 3

where Y = n1 / (n1 + 2n2)
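A sketch of the three discounts computed from hypothetical n1..n4 values:

```python
# Modified Kneser-Ney discounts D1, D2, D3+ from the
# frequency-of-frequency counts n1..n4 (toy values below),
# with Y = n1 / (n1 + 2*n2).
n1, n2, n3, n4 = 100, 50, 30, 20
Y = n1 / (n1 + 2 * n2)            # 0.5 for these values

D1 = 1 - 2 * Y * n2 / n1
D2 = 2 - 3 * Y * n3 / n2
D3 = 3 - 4 * Y * n4 / n3

print(D1, D2, D3)
```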
Modified Kneser-Ney Smoothing (3)
Gamma
Ensures that the distribution sums to 1.
γ(wk, ..., wi−1) = ( D1 · N1(wk, ..., wi−1, •) + D2 · N2(wk, ..., wi−1, •) + D3+ · N3+(wk, ..., wi−1, •) ) / Σ_{wi} f(wk, ..., wi)
Evaluation
Cross-Entropy Evaluation
• Cross-entropy (H): the average number of bits needed to identify an event from a set of possibilities
• In our case: How good is the approximation our smoothing method produces?
• Calculate the cross-entropy for each method on test data
H(p, q) = −Σ_x p(x) · log q(x)
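A minimal cross-entropy computation with toy distributions (base-2 logarithm, so the result is in bits):

```python
import math

# Cross-entropy H(p, q) = -sum_x p(x) * log2(q(x)) between a true
# distribution p and a model q (toy distributions over four events).
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]

H = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
print(H)  # every q(x) = 1/4 costs 2 bits, so H = 2.0
```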
Evaluation
[Figure: Relative performance of the smoothing algorithms, shown as the difference in test cross-entropy from the jelinek-mercer baseline (bits/token) against training set size (sentences). Four panels: WSJ/NAB corpus (2-gram and 3-gram) and Broadcast News corpus (2-gram and 3-gram). Algorithms plotted: katz, kneser-ney, kneser-ney-mod, abs-disc-interp, jelinek-mercer, witten-bell-backoff.]
References

1. Stanley F. Chen and Joshua Goodman: An empirical study of smoothing techniques for language modeling. Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, 1996
2. Joshua Goodman: The State of the Art in Language Model Smoothing. http://research.microsoft.com/~joshuago/lm-tutorial-public.ppt
3. Bill MacCartney: NLP Lunch Tutorial: Smoothing. http://nlp.stanford.edu/~wcmac/papers/20050421-smoothing-tutorial.pdf
4. Christopher Manning and Hinrich Schütze: Foundations of Statistical Natural Language Processing. MIT Press, 1999
5. Helmut Schmid: Statistische Methoden in der Maschinellen Sprachverarbeitung. http://www.ims.uni-stuttgart.de/~schmid/ParsingII/current/snlp-folien.pdf