Language Models
CS6370: Natural Language Processing
TRANSCRIPT, 7/27/2019
http://slidepdf.com/reader/full/language-models1
Why model language?
- Validation/verification
  - Check for syntax
  - Better "understanding"
- Generation
  - Q&A
  - Dialogue systems
- Prediction
  - Speech recognition
  - Spelling correction
- Discrimination
  - Topic detection
  - Authorship verification
Language model: Grammar?
- Model: rules of a "language" + a dictionary
- Context-free grammars
  - Constructing a full parse tree: expensive
  - Ambiguity: PCFG
  - Specified a priori: not data driven
  - Overkill: a complete grammar specification is not needed for several tasks
    - spell check, prediction, discrimination, etc.
What model do you use?
- We turned ___ the TV to watch the Cricket __________.
- We turned on the TV to watch the Cricket match.
- We turned in the TV to watch the Cricket game.
- We turned on the TV to watch the Cricket tournament.
- We turned off the TV to watch the Cricket hop.
Why do they help?
- "match Cricket watched the I" is less common than "I watched the Cricket match".
- Similarly, "match" is a more likely completion for "I watched the Cricket" than "game" or "hop".
- You probably used this idea in your spell-check assignment.
- A simple way of modeling this is to use N-gram statistics, or N-gram models.
N-gram model intuition
- An N-gram is a sequence of N consecutive words.
  - "the cricket match" is a 3-gram (trigram).
- Look at the relative frequency of various N-grams in your training corpus.
- P(I watched the Cricket match) = P(I | . .) × P(watched | . I) × P(the | I watched) × P(Cricket | watched the) × P(match | the Cricket) × P(. | Cricket match) × P(. | match .)
N-gram computation
- Cannot do this for the entire sequence of words:
  - computational issues
  - scalability
- So estimate for N-grams and use the chain rule.

P(the | I watched) = C(I watched the) / C(I watched)
Estimating the bigram model
- Relative frequency
- Maximum-likelihood estimate

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w) = C(w_{n-1} w_n) / C(w_{n-1})

P(I | <s>) = 2/3;  P(Sam | <s>) = 1/3;  P(am | I) = 2/3
P(</s> | Sam) = 1/2;  P(Sam | am) = 1/2;  P(do | I) = 1/3
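A minimal sketch of these maximum-likelihood bigram estimates on the toy corpus above (the `p_mle` helper name is my own choice):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
tokens = [sent.split() for sent in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

def p_mle(w, prev):
    # maximum-likelihood estimate: C(prev w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))   # 2/3
print(p_mle("do", "I"))    # 1/3
```

The same counts reproduce all six probabilities on the slide.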
Data sets: Some issues
- Usually a corpus is split into training and test sets.
  - Use various measures for evaluation.
  - Can also use a held-out set and a development set.
- Performance depends on the training corpora, as with all data-driven methods.
- Other issue: closed vs. open vocabulary
  - Closed: only a certain set of pre-determined words can occur.
  - Open: allow for an unlimited vocabulary.
    - Choose a vocabulary to model.
    - Replace out-of-vocabulary words with <UNK>.
    - Estimate the probability of <UNK> from the training data.
    - Alternative: treat the first occurrence of every word as <UNK>.
Counting words: Some issues
- What is a word?
  - cat vs. cats
  - eat vs. ate vs. eating
  - President of the United States vs. POTUS
  - ahh, umm
  - . , ? ! ; etc.
- Lemmatization, stemming, tokenization
- Task dependent
  - All distinctions are needed in speech recognition.
Evaluating a Language Model: Perplexity
- How well does the model fit the test data?
- The higher the probability of the test data, the lower the perplexity.
- A measure of branching factor.
- Related to entropy.

PP(W) = P(w_1 w_2 ... w_N)^(-1/N) = (1 / P(w_1 w_2 ... w_N))^(1/N)

For a bigram model:

PP(W) = ( Π_{i=1}^{N} 1 / P(w_i | w_{i-1}) )^(1/N)

Count </s> but not <s>. Why?
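The bigram perplexity can be computed directly. This sketch reuses the toy Sam corpus from the estimation slide; the `perplexity` helper name is my own, and N counts `</s>` but not `<s>`, as the slide suggests:

```python
import math
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

def perplexity(sentence):
    # PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N); N counts </s> but not <s>
    words = ["<s>"] + sentence.split() + ["</s>"]
    n = len(words) - 1
    log_p = sum(math.log(bigrams[(prev, w)] / unigrams[prev])
                for prev, w in zip(words, words[1:]))
    return math.exp(-log_p / n)

print(perplexity("I am Sam"))  # (1/9)^(-1/4) = 9 ** 0.25 ~ 1.732
```

Working in log space avoids underflow for longer test sets.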
Using the N-gram model
- Validity
  - P(Sam I am) = P(Sam | <s>) × P(I | Sam) × P(am | I) × P(</s> | am)
  - If P(Sam I am) < threshold, invalid sentence!
- Prediction
  - "Bring me green eggs and ___"
  - argmax_w [ P(w | and) × P(</s> | w) ]
- Discrimination
  - "I like green eggs and ham"
  - P_seuss(I like green eggs and ham) > P_shakespeare(I like green eggs and ham)
Using the N-gram model
- Generation
  - Fix w_{n-N+1} to w_{n-1}; sample w_n from P(· | w_{n-N+1} w_{n-N+2} ... w_{n-1}).
  - Treat punctuation, end of sentence, etc., as words.
  - Unigram: To him swallowed confess hear both
  - Bigram: What means, sir. I confess she?
  - Trigram: Therefore the sadness of parting, as they say, 'tis done.
  - Quadrigram: What! I shall go seek the traitor Gloucester
- Limited applicability
  - Combine with other methods such as back-off and interpolation.
  - Most success in speech recognition and discrimination.
  - Useful in non-grammatical settings!
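The sampling step can be sketched as follows on the toy corpus; the `successors` table and `generate` helper are illustrative names, and this is a bigram (N = 2) instance of the scheme:

```python
import random
from collections import defaultdict

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]

# successors[prev] lists every word that followed prev, with multiplicity,
# so a uniform draw from it is a draw from the MLE bigram P(. | prev)
successors = defaultdict(list)
for sent in corpus:
    words = sent.split()
    for prev, w in zip(words, words[1:]):
        successors[prev].append(w)

def generate(seed=0, max_len=20):
    rng = random.Random(seed)
    out, prev = [], "<s>"
    while len(out) < max_len:
        w = rng.choice(successors[prev])
        if w == "</s>":          # end-of-sentence token sampled: stop
            break
        out.append(w)
        prev = w
    return " ".join(out)

print(generate())
```

Seeding the generator makes the sampled sentence reproducible.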
Smoothing

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

- What is the probability of "I do not like Sam I am"?
  - P(Sam | like) = 0; hence this is also 0.
- Sparseness of data
  - Do not assign zero probability even to unseen bigrams.
  - Smoothing helps handle low- or zero-count cases
    - ...that arise due to sparseness.
  - How about really rare or nonsense combinations?
    - Sam green ham
Laplace smoothing
- Increment the counts of all words by 1:
  - add-one smoothing
  - add-δ smoothing: less dramatic.

Unigram case (V: size of vocabulary):

P_Laplace(w_i) = (C(w_i) + 1) / (N + V)

Bigram case:

P_Laplace(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
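A sketch of add-one smoothing on the toy corpus. Here V = 11 (all word types plus `</s>`, excluding `<s>`), which matches the denominators in the adjusted estimates on the next slide; the `p_laplace` name is mine:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

V = len(unigrams) - 1   # 11 word types: everything except <s>

def p_laplace(w, prev):
    # add-one estimate: (C(prev w) + 1) / (C(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_laplace("I", "<s>"))     # 3/14
print(p_laplace("Sam", "like"))  # 1/12: an unseen bigram, no longer zero
```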
Laplace adjusted estimates

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Smoothed (V = 11):

P*(I | <s>) = 3/14;  P*(Sam | <s>) = 2/14;  P*(am | I) = 3/14
P*(</s> | Sam) = 2/13;  P*(Sam | am) = 2/13;  P*(do | I) = 2/14
P*(Sam | like) = 1/12

Unsmoothed, for comparison:

P(I | <s>) = 2/3;  P(Sam | <s>) = 1/3;  P(am | I) = 2/3
P(</s> | Sam) = 1/2;  P(Sam | am) = 1/2;  P(do | I) = 1/3
Laplace adjusted counts
- Divide by N (or C(w_{n-1})) to get the smoothed estimates.
- Discounted counts!

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Unigram: C*(w_i) = (C(w_i) + 1) · N / (N + V)
Bigram:  C*(w_{n-1} w_n) = (C(w_{n-1} w_n) + 1) · C(w_{n-1}) / (C(w_{n-1}) + V)

C(I) = 3;  C*(I) = 4 × 17/28 = 2.43
C(I am) = 2;  C*(I am) = 3 × 3/14 = 0.64
C*(like Sam) = 1 × 1/12 = 0.083
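The adjusted counts can be checked mechanically; here N = 17 word tokens (counting `</s>` but not `<s>`) and V = 11, which reproduces the slide's numbers:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

V = len(unigrams) - 1                                   # 11 word types
N = sum(c for w, c in unigrams.items() if w != "<s>")   # 17 tokens

def c_star_unigram(w):
    # C*(w) = (C(w) + 1) * N / (N + V)
    return (unigrams[w] + 1) * N / (N + V)

def c_star_bigram(prev, w):
    # C*(prev w) = (C(prev w) + 1) * C(prev) / (C(prev) + V)
    return (bigrams[(prev, w)] + 1) * unigrams[prev] / (unigrams[prev] + V)

print(round(c_star_unigram("I"), 2))           # 2.43
print(round(c_star_bigram("I", "am"), 2))      # 0.64
print(round(c_star_bigram("like", "Sam"), 3))  # 0.083
```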
Good-Turing Discounting
- Laplace is simple but smooths too much.
- Use the frequency of events that occurred once to estimate the frequency of unseen events.
- Let N_c be the number of items that occur exactly c times: the "frequency of frequency" c.
- The adjusted count is given by:

  c* = (c + 1) · N_{c+1} / N_c

- The total probability of zero-frequency items:

  P_GT(items in N_0) = N_1 / N

Developed by Turing and Good during World War II as part of their work on deciphering German codes (Enigma).
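A sketch of the adjustment on a hypothetical count table (the frequency-of-frequency table N_c is computed from the counts themselves; the function name is mine):

```python
from collections import Counter

def good_turing(counts):
    # adjusted count c* = (c + 1) * N_{c+1} / N_c for each observed item
    n = Counter(counts.values())          # N_c: frequency of frequency c
    adjusted = {}
    for item, c in counts.items():
        if n[c + 1] > 0:                  # plain GT needs a non-zero N_{c+1}
            adjusted[item] = (c + 1) * n[c + 1] / n[c]
    return adjusted

# hypothetical counts: three singletons, two doubletons, one triple
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
print(good_turing(counts))                    # c=1 -> 4/3, c=2 -> 3/2
print(3 / sum(counts.values()))               # P_GT(unseen) = N_1 / N = 0.3
```

Note that "f" gets no adjusted count because N_4 = 0, which is exactly the gap Simple Good-Turing (next slides) fills.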
Good-Turing: some issues
- The potential N-grams are known,
  - so the number of unseen N-grams can be calculated.
- Assumes that each N-gram's distribution is binomial.
- The probability of a bigram occurring once is given by the GT estimate,
  - so the observed counts could be from bigrams of a different frequency.
Simple Good-Turing (Gale and Sampson)
- What happens when N_{c+1} is zero?
- We need a way of estimating the missing N_c.
- Assumption: N_c = a · c^b
  - log N_c = log a + b log c
- Linear regression on a log-log scale: log c vs. log N_c.
- Alternatively: fit adjusted counts.
- Not a good fit for small c; hence use N_c as-is if available.
- Switch from actual counts to estimated counts when the error is small.

N_c* = N_c / (0.5 (c_+ − c_−)),  where c_− and c_+ are the consecutive non-zero frequencies around c.
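The log-log regression can be sketched with a plain least-squares fit. The data below are synthetic, generated from N_c ≈ 1000 · c^(-2) with N_4 deliberately missing; the helper names are mine:

```python
import math

def fit_loglog(nc):
    # least-squares fit of log N_c = log a + b * log c
    xs = [math.log(c) for c in nc]
    ys = [math.log(n) for n in nc.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

def smoothed_nc(nc, c):
    # read the missing N_c off the fitted power law
    a, b = fit_loglog(nc)
    return a * c ** b

nc = {1: 1000, 2: 250, 3: 111, 5: 40}   # N_4 is missing
a, b = fit_loglog(nc)
print(round(b, 2))                       # slope close to -2
print(round(smoothed_nc(nc, 4)))         # plausible estimate for N_4
```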
Good-Turing: Katz's correction
- Assume that low-frequency items are really zero frequency.
- Assume that the counts of very high-frequency items are correct and do not have to be discounted.
- k = 5 is suggested by Katz.

c* = [ (c+1) N_{c+1}/N_c − c (k+1) N_{k+1}/N_1 ] / [ 1 − (k+1) N_{k+1}/N_1 ],  for 1 ≤ c ≤ k
Combining Estimators
- Another technique for handling sparseness.
- Estimate the N-gram probability by using the estimates for the constituent grams:
  - a trigram using trigram, bigram, and unigram estimates.
- Yields better models than smoothing a fixed N-gram model.
- We look at two methods:
  - Interpolation
  - Back-off
Simple Linear Interpolation
- A linear combination of shorter grams.
- A finite mixture model.
- Weights determined by EM,
  - or empirically through a held-out set.
- Discounts higher-order probabilities.

P_li(w_n | w_{n-2} w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2} w_{n-1})

0 ≤ λ_i ≤ 1,  Σ_i λ_i = 1
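A bigram-unigram instance of this mixture on the toy corpus; the λ values below are arbitrary placeholders (in practice they would come from EM or a held-out set, as above):

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))
N = sum(c for w, c in unigrams.items() if w != "<s>")  # tokens, <s> excluded

def p_interp(w, prev, l1=0.3, l2=0.7):
    # P_li(w | prev) = l1 * P1(w) + l2 * P2(w | prev), with l1 + l2 = 1
    p1 = unigrams[w] / N
    p2 = bigrams[(prev, w)] / unigrams[prev]
    return l1 * p1 + l2 * p2

# the unseen bigram "like Sam" now gets probability from the unigram term
print(p_interp("Sam", "like"))  # 0.3 * (2/17) + 0.7 * 0
```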
General Linear Interpolation
- The combining weights are functions of the history.
  - Can give higher weight to a longer history if its counts are high.
- Histories are not treated individually but binned.
  - Same frequencies.
  - The weight of the (N-1)-gram model is determined by the average number of non-zero N-grams that follow this (N-1)-gram.
- Takes care of "grammatical zeroes".

P_li(w | h) = Σ_{i=1}^{k} λ_i(h) P_i(w | h),  where ∀h: 0 ≤ λ_i(h) ≤ 1 and Σ_i λ_i(h) = 1
Katz back-off for trigrams

With x ≡ w_{i-2}, y ≡ w_{i-1}, z ≡ w_i:

P_bo(z | x, y) =
  P*(z | x, y),        if C(x, y, z) > k
  α_xy · P_bo(z | y),  else if C(x, y) > 0
  P*(z),               otherwise

P_bo(z | y) =
  P*(z | y),    if C(y, z) > k
  α_y · P*(z),  otherwise

α_xy = [ 1 − Σ_{z: C(x,y,z)>k} P*(z | x, y) ] / [ 1 − Σ_{z: C(x,y,z)>k} P_bo(z | y) ]

α_y = [ 1 − Σ_{z: C(y,z)>k} P*(z | y) ] / [ 1 − Σ_{z: C(y,z)>k} P*(z) ]
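A hedged, bigram-only sketch of the back-off idea: for simplicity it takes k = 0 and a fixed absolute discount D in place of the Good-Turing-discounted P*, but the α normalization follows the same pattern of redistributing the left-over mass:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))
vocab = [w for w in unigrams if w != "<s>"]
N = sum(unigrams[w] for w in vocab)
D = 0.5  # fixed discount, standing in for Good-Turing discounting

def p_uni(w):
    return unigrams[w] / N

def p_backoff(w, prev):
    # discounted bigram if seen; otherwise alpha(prev) spreads the
    # left-over discount mass over unseen words, weighted by P(w)
    if bigrams[(prev, w)] > 0:
        return (bigrams[(prev, w)] - D) / unigrams[prev]
    seen_mass = sum((bigrams[(prev, z)] - D) / unigrams[prev]
                    for z in vocab if bigrams[(prev, z)] > 0)
    unseen_mass = sum(p_uni(z) for z in vocab if bigrams[(prev, z)] == 0)
    alpha = (1.0 - seen_mass) / unseen_mass
    return alpha * p_uni(w)

# the distribution still sums to one for any history
print(sum(p_backoff(w, "like") for w in vocab))
```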
Katz back-off: some issues
- If the (N-1)-gram was never seen, then α is 1.
- Can start from a quadrigram and go down to a unigram!
- Generally performs well.
- Can have problems with grammatical zeroes:
  - if w is very frequent but not part of a trigram, it is a potential zero;
  - backing off would estimate it to be some fraction of the bigram probability.
- Estimates can change dramatically with new data.
Absolute Discounting
- Instead of a multiplicative reduction of the higher-order N-gram probability, do an additive reduction (subtract a fixed discount D).
- Limits the total mass subtracted.

P_absolute(w_i | w_{i-1}) =
  (C(w_{i-1} w_i) − D) / C(w_{i-1}),  if C(w_{i-1} w_i) > 0
  α(w_{i-1}) · P(w_i),                otherwise
Kneser-Ney Discounting
- The unigram probability is used only when the bigram probability is not available.
- "Francisco" is more frequent than "glasses",
  - but it appears only in "San Francisco";
  - hence the unigram probability assigned to "Francisco" should be low!
- If it is not after "San", then the probability of "Francisco" is small.
- Continuation probability!
Kneser-Ney Discounting
- The unigram count depends on the number of different bigrams in which the word occurs.

Continuation probability:

P*(w_i) = |{ w_{i-1} : C(w_{i-1} w_i) > 0 }| / Σ_w |{ w_{i-1} : C(w_{i-1} w) > 0 }|

Back-off form:

P_KN(w_i | w_{i-1}) =
  (C(w_{i-1} w_i) − D) / C(w_{i-1}),  if C(w_{i-1} w_i) > 0
  α(w_{i-1}) · P*(w_i),               otherwise

Interpolated form:

P_KN(w_i | w_{i-1}) = (C(w_{i-1} w_i) − D) / C(w_{i-1}) + λ(w_{i-1}) · P*(w_i)
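The continuation probability P*(w_i) can be sketched by counting distinct left contexts; the denominator equals the total number of distinct bigram types. Names are my own:

```python
from collections import Counter, defaultdict

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

# left_contexts[w] = set of distinct words that ever precede w
left_contexts = defaultdict(set)
for prev, w in bigrams:
    left_contexts[w].add(prev)

bigram_types = len(bigrams)  # equals the sum of |left_contexts[w]| over all w

def p_continuation(w):
    # P*(w) = |{prev : C(prev w) > 0}| / number of distinct bigram types
    return len(left_contexts[w]) / bigram_types

print(p_continuation("I"))   # "I" is preceded by <s> and by Sam -> 2/15
```

A word like "Francisco" that follows only one context would get a single-element set here, and hence a small P*, regardless of its raw frequency.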
Other N-gram models
- Use longer "spheres" of influence.
- Skip N-grams
  - "Green Eggs" from "Green Duck Eggs"
  - Variable skip length
- Variable-length N-grams
  - "Green Eggs and" and "and Ham": both "bigrams"
  - Use semantic information to guide the formation of longer N-grams.
- Trigger based
  - Only after a trigger word
  - "like…ham", "like…Sam", "like…cricket"
  - Within a window from the trigger
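Skip-bigram extraction with a variable skip length can be sketched as follows (the function name is mine); "Green Duck Eggs" yields the ordinary bigrams plus ("Green", "Eggs"):

```python
def skip_bigrams(words, max_skip=1):
    # all ordered pairs (w_i, w_j) with 0..max_skip words skipped in between
    pairs = []
    for i, w in enumerate(words):
        for skip in range(max_skip + 1):
            j = i + 1 + skip
            if j < len(words):
                pairs.append((w, words[j]))
    return pairs

print(skip_bigrams("Green Duck Eggs".split()))
# ordinary bigrams plus ('Green', 'Eggs'), with "Duck" skipped
```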
N-gram models: Summary
- Frequentist
  - Uses MLE probabilities estimated from a corpus.
- Simple, yet effective.
- Smoothing to handle sparseness of data:
  - add-one, add-delta
  - Good-Turing
- Combining estimators for better models:
  - Interpolation
  - Back-off