Language Models
CS6370: Natural Language Processing
TRANSCRIPT, 7/27/2019
http://slidepdf.com/reader/full/language-models1
Why model language?
- Validation/verification
  - Check for syntax
  - Better "understanding"
- Generation
  - Q&A
  - Dialogue systems
- Prediction
  - Speech recognition
  - Spelling correction
- Discrimination
  - Topic detection
  - Authorship verification
Language model: Grammar?
- Model: rules of a "language" + a dictionary
- Context-free grammars
  - Constructing a full parse tree: expensive
  - Ambiguity: PCFG
  - Specified a priori: not data driven
  - Overkill: a complete grammar specification is not needed for several tasks
    - spell check, prediction, discrimination, etc.
What model do you use?
- We turned ___ the TV to watch the Cricket __________.
- We turned on the TV to watch the Cricket match.
- We turned in the TV to watch the Cricket game.
- We turned on the TV to watch the Cricket tournament.
- We turned off the TV to watch the Cricket hop.
Why do they help?
- "match Cricket watched the I" is less common than "I watched the Cricket match".
- Similarly, "match" is a more likely completion for "I watched the Cricket" than "game" or "hop".
- You probably used this idea in your spell-check assignment.
- A simple way of modeling this is to use N-gram statistics, or N-gram models.
N-gram model intuition
- An N-gram is a sequence of N consecutive words.
  - "the cricket match" is a 3-gram (trigram).
- Look at the relative frequency of various N-grams in your training corpus.
- P(I watched the Cricket match) = P(I | . .) × P(watched | . I) × P(the | I watched) × P(Cricket | watched the) × P(match | the Cricket) × P(. | Cricket match) × P(. | match .)
N-gram computation
- Cannot do this for the entire sequence of words:
  - computational issues
  - scalability
- So estimate for N-grams and use the chain rule.

P(the | I watched) = C(I watched the) / C(I watched)
Estimating the bigram model
- Relative frequency
- Maximum-likelihood estimate

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w) = C(w_{n-1} w_n) / C(w_{n-1})

P(I | <s>) = 2/3;  P(Sam | <s>) = 1/3;  P(am | I) = 2/3
P(</s> | Sam) = 1/2;  P(Sam | am) = 1/2;  P(do | I) = 1/3
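A minimal sketch of these maximum-likelihood bigram estimates on the toy corpus above (the `p_mle` helper name is my own choice):

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
tokens = [sent.split() for sent in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

def p_mle(w, prev):
    # maximum-likelihood estimate: C(prev w) / C(prev)
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))   # 2/3
print(p_mle("do", "I"))    # 1/3
```

The same counts reproduce all six probabilities on the slide.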
Data sets: Some issues
- Usually a corpus is split into training and test sets.
  - Use various measures for evaluation.
  - Can also use a held-out set and a development set.
- Performance depends on the training corpora, as with all data-driven methods.
- Other issue: closed vs. open vocabulary
  - Closed: only a certain set of pre-determined words can occur.
  - Open: allow for an unlimited vocabulary.
    - Choose a vocabulary to model.
    - Replace out-of-vocabulary words with <UNK>.
    - Estimate the probability of <UNK> from the training data.
    - Alternative: treat the first occurrence of every word as <UNK>.
Counting words: Some issues
- What is a word?
  - cat vs. cats
  - eat vs. ate vs. eating
  - President of the United States vs. POTUS
  - ahh, umm
  - . , ? ! ; etc.
- Lemmatization, stemming, tokenization
- Task dependent
  - All distinctions are needed in speech recognition.
Evaluating a Language Model: Perplexity
- How well does the model fit the test data?
- The higher the probability of the test data, the lower the perplexity.
- A measure of branching factor.
- Related to entropy.

PP(W) = P(w_1 w_2 ... w_N)^(-1/N) = (1 / P(w_1 w_2 ... w_N))^(1/N)

For a bigram model:

PP(W) = ( Π_{i=1}^{N} 1 / P(w_i | w_{i-1}) )^(1/N)

Count </s> but not <s>. Why?
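The bigram perplexity can be computed directly. This sketch reuses the toy Sam corpus from the estimation slide; the `perplexity` helper name is my own, and N counts `</s>` but not `<s>`, as the slide suggests:

```python
import math
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

def perplexity(sentence):
    # PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N); N counts </s> but not <s>
    words = ["<s>"] + sentence.split() + ["</s>"]
    n = len(words) - 1
    log_p = sum(math.log(bigrams[(prev, w)] / unigrams[prev])
                for prev, w in zip(words, words[1:]))
    return math.exp(-log_p / n)

print(perplexity("I am Sam"))  # (1/9)^(-1/4) = 9 ** 0.25 ~ 1.732
```

Working in log space avoids underflow for longer test sets.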
Using the N-gram model
- Validity
  - P(Sam I am) = P(Sam | <s>) × P(I | Sam) × P(am | I) × P(</s> | am)
  - If P(Sam I am) < threshold, invalid sentence!
- Prediction
  - "Bring me green eggs and ___"
  - argmax_w [ P(w | and) × P(</s> | w) ]
- Discrimination
  - "I like green eggs and ham"
  - P_seuss(I like green eggs and ham) > P_shakespeare(I like green eggs and ham)
Using the N-gram model
- Generation
  - Fix w_{n-N+1} to w_{n-1}; sample w_n from P(· | w_{n-N+1} w_{n-N+2} ... w_{n-1}).
  - Treat punctuation, end of sentence, etc., as words.
  - Unigram: To him swallowed confess hear both
  - Bigram: What means, sir. I confess she?
  - Trigram: Therefore the sadness of parting, as they say, 'tis done.
  - Quadrigram: What! I shall go seek the traitor Gloucester
- Limited applicability
  - Combine with other methods such as back-off and interpolation.
  - Most success in speech recognition and discrimination.
  - Useful in non-grammatical settings!
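The sampling step can be sketched as follows on the toy corpus; the `successors` table and `generate` helper are illustrative names, and this is a bigram (N = 2) instance of the scheme:

```python
import random
from collections import defaultdict

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]

# successors[prev] lists every word that followed prev, with multiplicity,
# so a uniform draw from it is a draw from the MLE bigram P(. | prev)
successors = defaultdict(list)
for sent in corpus:
    words = sent.split()
    for prev, w in zip(words, words[1:]):
        successors[prev].append(w)

def generate(seed=0, max_len=20):
    rng = random.Random(seed)
    out, prev = [], "<s>"
    while len(out) < max_len:
        w = rng.choice(successors[prev])
        if w == "</s>":          # end-of-sentence token sampled: stop
            break
        out.append(w)
        prev = w
    return " ".join(out)

print(generate())
```

Seeding the generator makes the sampled sentence reproducible.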
Smoothing

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

- What is the probability of "I do not like Sam I am"?
  - P(Sam | like) = 0; hence this is also 0.
- Sparseness of data
  - Do not assign zero probability even to unseen bigrams.
  - Smoothing helps handle low- or zero-count cases
    - ...that arise due to sparseness.
  - How about really rare or nonsense combinations?
    - Sam green ham
Laplace smoothing
- Increment the counts of all words by 1:
  - add-one smoothing
  - add-δ smoothing: less dramatic.

Unigram case (V: size of vocabulary):

P_Laplace(w_i) = (C(w_i) + 1) / (N + V)

Bigram case:

P_Laplace(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)
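A sketch of add-one smoothing on the toy corpus. Here V = 11 (all word types plus `</s>`, excluding `<s>`), which matches the denominators in the adjusted estimates on the next slide; the `p_laplace` name is mine:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

V = len(unigrams) - 1   # 11 word types: everything except <s>

def p_laplace(w, prev):
    # add-one estimate: (C(prev w) + 1) / (C(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_laplace("I", "<s>"))     # 3/14
print(p_laplace("Sam", "like"))  # 1/12: an unseen bigram, no longer zero
```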
Laplace adjusted estimates

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Smoothed (V = 11):

P*(I | <s>) = 3/14;  P*(Sam | <s>) = 2/14;  P*(am | I) = 3/14
P*(</s> | Sam) = 2/13;  P*(Sam | am) = 2/13;  P*(do | I) = 2/14
P*(Sam | like) = 1/12

Unsmoothed, for comparison:

P(I | <s>) = 2/3;  P(Sam | <s>) = 1/3;  P(am | I) = 2/3
P(</s> | Sam) = 1/2;  P(Sam | am) = 1/2;  P(do | I) = 1/3
Laplace adjusted counts
- Divide by N (or C(w_{n-1})) to get the smoothed estimates.
- Discounted counts!

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Unigram: C*(w_i) = (C(w_i) + 1) · N / (N + V)
Bigram:  C*(w_{n-1} w_n) = (C(w_{n-1} w_n) + 1) · C(w_{n-1}) / (C(w_{n-1}) + V)

C(I) = 3;  C*(I) = 4 × 17/28 = 2.43
C(I am) = 2;  C*(I am) = 3 × 3/14 = 0.64
C*(like Sam) = 1 × 1/12 = 0.083
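The adjusted counts can be checked mechanically; here N = 17 word tokens (counting `</s>` but not `<s>`) and V = 11, which reproduces the slide's numbers:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

V = len(unigrams) - 1                                   # 11 word types
N = sum(c for w, c in unigrams.items() if w != "<s>")   # 17 tokens

def c_star_unigram(w):
    # C*(w) = (C(w) + 1) * N / (N + V)
    return (unigrams[w] + 1) * N / (N + V)

def c_star_bigram(prev, w):
    # C*(prev w) = (C(prev w) + 1) * C(prev) / (C(prev) + V)
    return (bigrams[(prev, w)] + 1) * unigrams[prev] / (unigrams[prev] + V)

print(round(c_star_unigram("I"), 2))           # 2.43
print(round(c_star_bigram("I", "am"), 2))      # 0.64
print(round(c_star_bigram("like", "Sam"), 3))  # 0.083
```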
Good-Turing Discounting
- Laplace is simple but smooths too much.
- Use the frequency of events that occurred once to estimate the frequency of unseen events.
- Let N_c be the number of items that occur exactly c times: the "frequency of frequency" c.
- The adjusted count is given by:

  c* = (c + 1) · N_{c+1} / N_c

- The total probability of zero-frequency items:

  P_GT(items in N_0) = N_1 / N

Developed by Turing and Good during World War II as part of their work on deciphering German codes (Enigma).
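A sketch of the adjustment on a hypothetical count table (the frequency-of-frequency table N_c is computed from the counts themselves; the function name is mine):

```python
from collections import Counter

def good_turing(counts):
    # adjusted count c* = (c + 1) * N_{c+1} / N_c for each observed item
    n = Counter(counts.values())          # N_c: frequency of frequency c
    adjusted = {}
    for item, c in counts.items():
        if n[c + 1] > 0:                  # plain GT needs a non-zero N_{c+1}
            adjusted[item] = (c + 1) * n[c + 1] / n[c]
    return adjusted

# hypothetical counts: three singletons, two doubletons, one triple
counts = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 3}
print(good_turing(counts))                    # c=1 -> 4/3, c=2 -> 3/2
print(3 / sum(counts.values()))               # P_GT(unseen) = N_1 / N = 0.3
```

Note that "f" gets no adjusted count because N_4 = 0, which is exactly the gap Simple Good-Turing (next slides) fills.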
Good-Turing: some issues
- The potential N-grams are known,
  - so the number of unseen N-grams can be calculated.
- Assumes that each N-gram's distribution is binomial.
- The probability of a bigram occurring once is given by the GT estimate,
  - so the observed counts could be from bigrams of a different frequency.
Simple Good-Turing (Gale and Sampson)
- What happens when N_{c+1} is zero?
- We need a way of estimating the missing N_c.
- Assumption: N_c = a · c^b
  - log N_c = log a + b log c
- Linear regression on a log-log scale: log c vs. log N_c.
- Alternatively: fit adjusted counts.
- Not a good fit for small c; hence use N_c as-is if available.
- Switch from actual counts to estimated counts when the error is small.

N_c* = N_c / (0.5 (c_+ − c_−)),  where c_− and c_+ are the consecutive non-zero frequencies around c.
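The log-log regression can be sketched with a plain least-squares fit. The data below are synthetic, generated from N_c ≈ 1000 · c^(-2) with N_4 deliberately missing; the helper names are mine:

```python
import math

def fit_loglog(nc):
    # least-squares fit of log N_c = log a + b * log c
    xs = [math.log(c) for c in nc]
    ys = [math.log(n) for n in nc.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

def smoothed_nc(nc, c):
    # read the missing N_c off the fitted power law
    a, b = fit_loglog(nc)
    return a * c ** b

nc = {1: 1000, 2: 250, 3: 111, 5: 40}   # N_4 is missing
a, b = fit_loglog(nc)
print(round(b, 2))                       # slope close to -2
print(round(smoothed_nc(nc, 4)))         # plausible estimate for N_4
```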
Good-Turing: Katz's correction
- Assume that low-frequency items are really zero frequency.
- Assume that the counts of very high-frequency items are correct and do not have to be discounted.
- k = 5 is suggested by Katz.

c* = [ (c+1) N_{c+1}/N_c − c (k+1) N_{k+1}/N_1 ] / [ 1 − (k+1) N_{k+1}/N_1 ],  for 1 ≤ c ≤ k
Combining Estimators
- Another technique for handling sparseness.
- Estimate the N-gram probability by using the estimates for the constituent grams:
  - a trigram using trigram, bigram, and unigram estimates.
- Yields better models than smoothing a fixed N-gram model.
- We look at two methods:
  - Interpolation
  - Back-off
Simple Linear Interpolation
- A linear combination of shorter grams.
- A finite mixture model.
- Weights determined by EM,
  - or empirically through a held-out set.
- Discounts higher-order probabilities.

P_li(w_n | w_{n-2} w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2} w_{n-1})

0 ≤ λ_i ≤ 1,  Σ_i λ_i = 1
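A bigram-unigram instance of this mixture on the toy corpus; the λ values below are arbitrary placeholders (in practice they would come from EM or a held-out set, as above):

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))
N = sum(c for w, c in unigrams.items() if w != "<s>")  # tokens, <s> excluded

def p_interp(w, prev, l1=0.3, l2=0.7):
    # P_li(w | prev) = l1 * P1(w) + l2 * P2(w | prev), with l1 + l2 = 1
    p1 = unigrams[w] / N
    p2 = bigrams[(prev, w)] / unigrams[prev]
    return l1 * p1 + l2 * p2

# the unseen bigram "like Sam" now gets probability from the unigram term
print(p_interp("Sam", "like"))  # 0.3 * (2/17) + 0.7 * 0
```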
General Linear Interpolation
- The combining weights are functions of the history.
  - Can give higher weight to a longer history if its counts are high.
- Histories are not treated individually but binned.
  - Same frequencies.
  - The weight of the (N-1)-gram model is determined by the average number of non-zero N-grams that follow this (N-1)-gram.
- Takes care of "grammatical zeroes".

P_li(w | h) = Σ_{i=1}^{k} λ_i(h) P_i(w | h),  where ∀h: 0 ≤ λ_i(h) ≤ 1 and Σ_i λ_i(h) = 1
Katz back-off for trigrams

With x ≡ w_{i-2}, y ≡ w_{i-1}, z ≡ w_i:

P_bo(z | x, y) =
  P*(z | x, y),        if C(x, y, z) > k
  α_xy · P_bo(z | y),  else if C(x, y) > 0
  P*(z),               otherwise

P_bo(z | y) =
  P*(z | y),    if C(y, z) > k
  α_y · P*(z),  otherwise

α_xy = [ 1 − Σ_{z: C(x,y,z)>k} P*(z | x, y) ] / [ 1 − Σ_{z: C(x,y,z)>k} P_bo(z | y) ]

α_y = [ 1 − Σ_{z: C(y,z)>k} P*(z | y) ] / [ 1 − Σ_{z: C(y,z)>k} P*(z) ]
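A hedged, bigram-only sketch of the back-off idea: for simplicity it takes k = 0 and a fixed absolute discount D in place of the Good-Turing-discounted P*, but the α normalization follows the same pattern of redistributing the left-over mass:

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))
vocab = [w for w in unigrams if w != "<s>"]
N = sum(unigrams[w] for w in vocab)
D = 0.5  # fixed discount, standing in for Good-Turing discounting

def p_uni(w):
    return unigrams[w] / N

def p_backoff(w, prev):
    # discounted bigram if seen; otherwise alpha(prev) spreads the
    # left-over discount mass over unseen words, weighted by P(w)
    if bigrams[(prev, w)] > 0:
        return (bigrams[(prev, w)] - D) / unigrams[prev]
    seen_mass = sum((bigrams[(prev, z)] - D) / unigrams[prev]
                    for z in vocab if bigrams[(prev, z)] > 0)
    unseen_mass = sum(p_uni(z) for z in vocab if bigrams[(prev, z)] == 0)
    alpha = (1.0 - seen_mass) / unseen_mass
    return alpha * p_uni(w)

# the distribution still sums to one for any history
print(sum(p_backoff(w, "like") for w in vocab))
```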
Katz back-off: some issues
- If the (N-1)-gram was never seen, then α is 1.
- Can start from a quadrigram and go down to a unigram!
- Generally performs well.
- Can have problems with grammatical zeroes:
  - if w is very frequent but not part of a trigram, it is a potential zero;
  - backing off would estimate it to be some fraction of the bigram probability.
- Estimates can change dramatically with new data.
Absolute Discounting
- Instead of a multiplicative reduction of the higher-order N-gram probability, do an additive reduction (subtract a fixed discount D).
- Limits the total mass subtracted.

P_absolute(w_i | w_{i-1}) =
  (C(w_{i-1} w_i) − D) / C(w_{i-1}),  if C(w_{i-1} w_i) > 0
  α(w_{i-1}) · P(w_i),                otherwise
Kneser-Ney Discounting
- The unigram probability is used only when the bigram probability is not available.
- "Francisco" is more frequent than "glasses",
  - but it appears only in "San Francisco";
  - hence the unigram probability assigned to "Francisco" should be low!
- If it is not after "San", then the probability of "Francisco" is small.
- Continuation probability!
Kneser-Ney Discounting
- The unigram count depends on the number of different bigrams in which the word occurs.

Continuation probability:

P*(w_i) = |{ w_{i-1} : C(w_{i-1} w_i) > 0 }| / Σ_w |{ w_{i-1} : C(w_{i-1} w) > 0 }|

Back-off form:

P_KN(w_i | w_{i-1}) =
  (C(w_{i-1} w_i) − D) / C(w_{i-1}),  if C(w_{i-1} w_i) > 0
  α(w_{i-1}) · P*(w_i),               otherwise

Interpolated form:

P_KN(w_i | w_{i-1}) = (C(w_{i-1} w_i) − D) / C(w_{i-1}) + λ(w_{i-1}) · P*(w_i)
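The continuation probability P*(w_i) can be sketched by counting distinct left contexts; the denominator equals the total number of distinct bigram types. Names are my own:

```python
from collections import Counter, defaultdict

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
tokens = [s.split() for s in corpus]
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

# left_contexts[w] = set of distinct words that ever precede w
left_contexts = defaultdict(set)
for prev, w in bigrams:
    left_contexts[w].add(prev)

bigram_types = len(bigrams)  # equals the sum of |left_contexts[w]| over all w

def p_continuation(w):
    # P*(w) = |{prev : C(prev w) > 0}| / number of distinct bigram types
    return len(left_contexts[w]) / bigram_types

print(p_continuation("I"))   # "I" is preceded by <s> and by Sam -> 2/15
```

A word like "Francisco" that follows only one context would get a single-element set here, and hence a small P*, regardless of its raw frequency.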
Other N-gram models
- Use longer "spheres" of influence.
- Skip N-grams
  - "Green Eggs" from "Green Duck Eggs"
  - Variable skip length
- Variable-length N-grams
  - "Green Eggs and" and "and Ham": both "bigrams"
  - Use semantic information to guide the formation of longer N-grams.
- Trigger based
  - Only after a trigger word
  - "like…ham", "like…Sam", "like…cricket"
  - Within a window from the trigger
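Skip-bigram extraction with a variable skip length can be sketched as follows (the function name is mine); "Green Duck Eggs" yields the ordinary bigrams plus ("Green", "Eggs"):

```python
def skip_bigrams(words, max_skip=1):
    # all ordered pairs (w_i, w_j) with 0..max_skip words skipped in between
    pairs = []
    for i, w in enumerate(words):
        for skip in range(max_skip + 1):
            j = i + 1 + skip
            if j < len(words):
                pairs.append((w, words[j]))
    return pairs

print(skip_bigrams("Green Duck Eggs".split()))
# ordinary bigrams plus ('Green', 'Eggs'), with "Duck" skipped
```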
N-gram models: Summary
- Frequentist
  - Uses MLE probabilities estimated from a corpus.
- Simple, yet effective.
- Smoothing to handle sparseness of data:
  - add-one, add-delta
  - Good-Turing
- Combining estimators for better models:
  - Interpolation
  - Back-off