
Language Models

CS6370: Natural Language Processing


2

Why model language?

- Validation/Verification
  - Check for syntax
  - Better "understanding"
- Generation
  - Q & A
  - Dialogue systems
- Prediction
  - Speech recognition
  - Spelling correction
- Discrimination
  - Topic detection
  - Authorship verification


3

Language model: Grammar?

- Model: rules of a "language" + a dictionary
- Context-Free Grammars
  - Constructing a full parse tree: expensive
  - Ambiguity: PCFG
  - Specified a priori: not data driven
  - Overkill: a complete grammar specification is not needed for several tasks
    - spell check, prediction, discrimination, etc.


4

What model do you use?

- We turned ___ the TV to watch the Cricket __________.
  - We turned on the TV to watch the Cricket match.
  - We turned in the TV to watch the Cricket game.
  - We turned on the TV to watch the Cricket tournament.
  - We turned off the TV to watch the Cricket hop.


6

Why do they help?

- "match Cricket watched the I" is less common than "I watched the Cricket match"
- Similarly, "match" is a more likely completion for "I watched the Cricket" than "game" or "hop"
- You probably used this idea in your spell check assignment
- A simple way of modeling this is to use N-gram statistics, or N-gram models


7

N-gram model intuition

- An N-gram is a sequence of N consecutive words.
  - "the cricket match" is a 3-gram (trigram)
- Look at the relative frequency of the various N-grams in your training corpus
- P(I watched the Cricket match) = P(I | <s> <s>) x P(watched | <s> I) x P(the | I watched) x P(Cricket | watched the) x P(match | the Cricket) x P(</s> | Cricket match) x P(</s> | match </s>)


8

N-gram computation

- Cannot do this for the entire sequence of words
  - Computational issues
  - Scalability
- So estimate probabilities for N-grams and use the chain rule.

P(the | I watched) = C(I watched the) / C(I watched)


10

Estimating the bigram model

- Relative frequency
- Maximum Likelihood estimate

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / \sum_w C(w_{n-1} w) = C(w_{n-1} w_n) / C(w_{n-1})

P(I | <s>) = 2/3;   P(Sam | <s>) = 1/3;   P(am | I) = 2/3
P(</s> | Sam) = 1/2;   P(Sam | am) = 1/2;   P(do | I) = 1/3
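
To make the estimate concrete, here is a minimal Python sketch of the maximum-likelihood bigram estimate on the toy corpus above. The names (unigrams, bigrams, p_mle) are illustrative, not from the slides; later sketches reuse them.

    from collections import Counter

    # Toy corpus from the slide.
    sentences = [
        "<s> I am Sam </s>",
        "<s> Sam I am </s>",
        "<s> I do not like green eggs and ham </s>",
    ]

    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = s.split()
        unigrams.update(tokens)                  # C(w)
        bigrams.update(zip(tokens, tokens[1:]))  # C(w_{n-1} w_n)

    def p_mle(w, prev):
        """Relative-frequency estimate P(w | prev) = C(prev w) / C(prev)."""
        return bigrams[(prev, w)] / unigrams[prev]

    print(p_mle("I", "<s>"))    # 2/3
    print(p_mle("Sam", "<s>"))  # 1/3
    print(p_mle("am", "I"))     # 2/3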


11

Data sets: Some issues

- Usually a corpus is split into training and test sets
  - Use various measures for evaluation
  - Can also use a hold-out set and a development set
- Performance depends on the training corpus, as with all data-driven methods
- Other issue: closed vs. open vocabulary
  - Closed: only a certain set of pre-determined words can occur
  - Open: allow for an unlimited vocabulary
    - Choose a vocabulary to model
    - Replace out-of-vocabulary words with <UNK>
    - Estimate the probability of <UNK> from the training data
    - Alternatively, treat the first occurrence of each word as <UNK>


12

Counting words: Some issues

- What is a word?
  - cat vs. cats
  - eat vs. ate vs. eating
  - President of the United States vs. POTUS
  - ahh, umm
  - . , ? ! ; etc.
- Lemmatization, stemming, tokenization
  - Task dependent
  - All distinctions are needed in speech recognition.


13

Evaluating a Language Model: Perplexity

- How well does the model fit the test data?
- The higher the probability of the test data, the lower the perplexity
- A measure of branching factor
- Related to entropy

PP(W) = P(w_1 w_2 ... w_N)^{-1/N}
      = ( 1 / P(w_1 w_2 ... w_N) )^{1/N}
      = ( \prod_{i=1}^{N} 1 / P(w_i | w_{i-1}) )^{1/N}   (bigram case)

Count </s> but not <s>. Why?
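
A minimal sketch of this computation for the bigram case, reusing the p_mle estimator from the earlier toy-corpus sketch; as the slide suggests, N counts </s> but not <s>, since </s> is predicted by the model while <s> is only conditioned on.

    import math

    def perplexity(sentence, prob):
        """PP(W), computed in log space as exp(-(1/N) * sum_i log P(w_i | w_{i-1}))."""
        tokens = sentence.split()
        log_p = sum(math.log(prob(w, prev)) for prev, w in zip(tokens, tokens[1:]))
        n = len(tokens) - 1                      # count </s> but not <s>
        return math.exp(-log_p / n)

    print(perplexity("<s> I am Sam </s>", p_mle))   # 9 ** 0.25, about 1.73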


14

Using the N-gram model

- Validity
  - P(Sam I am) = P(Sam | <s>) x P(I | Sam) x P(am | I) x P(</s> | am)
  - If P(Sam I am) < threshold, invalid sentence!
- Prediction (see the sketch below)
  - "Bring me green eggs and ___"
  - argmax_w [ P(w | and) x P(</s> | w) ]
- Discrimination
  - "I like green eggs and ham"
  - P_Seuss(I like green eggs and ham) > P_Shakespeare(I like green eggs and ham)
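
A sketch of the prediction rule above on the toy corpus, reusing unigrams, bigrams and p_mle from the earlier sketch; vocab is an illustrative helper.

    vocab = set(unigrams) - {"<s>"}      # candidate next words (never predict <s>)

    def predict_next(prev, prob):
        """argmax_w  P(w | prev) x P(</s> | w)."""
        return max(vocab, key=lambda w: prob(w, prev) * prob("</s>", w))

    print(predict_next("and", p_mle))    # 'ham' on the toy corpus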


15

Using the N-gram model

- Generation
  - Fix w_{n-N+1} ... w_{n-1}. Sample w_n from P(· | w_{n-N+1} w_{n-N+2} ... w_{n-1})
  - Treat punctuation, end of sentence, etc., as words.
  - Unigram: To him swallowed confess hear both
  - Bigram: What means, sir. I confess she?
  - Trigram: Therefore the sadness of parting, as they say, 'tis done.
  - Quadrigram: What! I shall go seek the traitor Gloucester
- Limited applicability
  - Combine with other methods such as back-off and interpolation
  - Most success in speech recognition and discrimination
  - Useful in non-grammatical settings!
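
A minimal sketch of bigram generation as described above: repeatedly sample the next word from P(· | previous word) until </s> is produced. It reuses the bigrams counter from the earlier toy-corpus sketch; names are illustrative.

    import random

    def sample_next(prev):
        """Sample w in proportion to C(prev w), i.e. from the MLE P(w | prev)."""
        candidates = [(w, c) for (p, w), c in bigrams.items() if p == prev]
        words, weights = zip(*candidates)
        return random.choices(words, weights=weights, k=1)[0]

    def generate(max_len=20):
        word, output = "<s>", []
        while word != "</s>" and len(output) < max_len:
            word = sample_next(word)
            output.append(word)
        return " ".join(output)

    print(generate())   # e.g. "I am Sam </s>"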


16

Smoothing

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

- What is the probability of "I do not like Sam I am"?
  - P(Sam | like) = 0; hence the sentence probability is also 0
  - Sparseness of data
- Do not assign zero probability even to unseen bigrams
  - Smoothing helps handle the low or zero counts that arise due to sparseness
- What about really rare or nonsense combinations?
  - "Sam green ham"


17

Laplace smoothing

- Increment the counts of all words by 1
  - add-one smoothing
  - add-δ smoothing: less dramatic

Unigram case:  P_Laplace(w_i) = ( C(w_i) + 1 ) / ( N + V ),   V: size of the vocabulary

Bigram case:   P_Laplace(w_n | w_{n-1}) = ( C(w_{n-1} w_n) + 1 ) / ( C(w_{n-1}) + V )
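
A minimal sketch of the bigram formula above, reusing the unigrams and bigrams counters from the earlier toy-corpus sketch; for that corpus V = 11 (every word type plus </s>, but not <s>).

    V = len(set(unigrams) - {"<s>"})     # vocabulary size; 11 for the toy corpus

    def p_laplace(w, prev):
        """P_Laplace(w | prev) = (C(prev w) + 1) / (C(prev) + V)."""
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

    print(p_laplace("I", "<s>"))      # 3/14
    print(p_laplace("Sam", "like"))   # 1/12 -- no longer zero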


18

Laplace adjusted estimates

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Smoothed (V = 11):
P*(I | <s>) = 3/14;   P*(Sam | <s>) = 2/14;   P*(am | I) = 3/14
P*(</s> | Sam) = 2/13;   P*(Sam | am) = 2/13;   P*(do | I) = 2/14
P*(Sam | like) = 1/12

Unsmoothed, for comparison:
P(I | <s>) = 2/3;   P(Sam | <s>) = 1/3;   P(am | I) = 2/3
P(</s> | Sam) = 1/2;   P(Sam | am) = 1/2;   P(do | I) = 1/3


19

Laplace adjusted counts

- Divide by N or C(w_{n-1}) to get the smoothed estimates
- Discounted counts!

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Unigram:  C*(w_i) = ( C(w_i) + 1 ) · N / ( N + V )
Bigram:   C*(w_{n-1} w_n) = ( C(w_{n-1} w_n) + 1 ) · C(w_{n-1}) / ( C(w_{n-1}) + V )

C(I) = 3;   C*(I) = 4 × 17/28 = 2.43
C(I am) = 2;   C*(I am) = 3 × 3/14 = 0.64
C*(like Sam) = 1 × 1/12 = 0.083


20

Good-Turing Discounting

- Laplace is simple but smooths too much
- Use the frequency of events that occurred once to estimate the frequency of unseen events
- Let N_c be the number of items that occur exactly c times: the "frequency of frequency" c
- The adjusted count is given by:

  c* = ( c + 1 ) N_{c+1} / N_c

- The probability of zero-frequency items:

  P_GT(items in N_0) = N_1 / N

Developed by Turing and Good during World War II as part of their work on deciphering the German Enigma codes.
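
A minimal sketch of these two formulas, computing adjusted counts from a frequency-of-frequencies table; the function names are illustrative. Where N_{c+1} is zero the sketch simply keeps the raw count, which is exactly the gap that Simple Good-Turing (next slide) addresses.

    from collections import Counter

    def good_turing_counts(counts):
        """counts: dict n-gram -> raw count c.  Returns c* = (c+1) N_{c+1} / N_c."""
        n_c = Counter(counts.values())            # frequency of frequencies, N_c
        adjusted = {}
        for ngram, c in counts.items():
            if n_c[c + 1] > 0:
                adjusted[ngram] = (c + 1) * n_c[c + 1] / n_c[c]
            else:
                adjusted[ngram] = float(c)        # N_{c+1} = 0: keep the raw count
        return adjusted

    def p_unseen(counts):
        """Probability mass reserved for all zero-frequency items: N_1 / N."""
        n_c = Counter(counts.values())
        return n_c[1] / sum(counts.values())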


21

Good-Turing: some issues

- The set of potential N-grams is known
  - So the number of unseen N-grams can be calculated
- Assumes that each N-gram's distribution is binomial
- The probability of a bigram occurring once is given by the GT estimate
  - So the observed counts could be from bigrams of a different frequency


22

Simple Good-Turing (Gale and Sampson)

- What happens when N_{c+1} is zero?
- We need a way of estimating the missing N_c
  - Assumption: N_c = a · c^b, i.e. log N_c = log a + b log c
  - Linear regression on a log-log scale: log c vs. log N_c
  - Alternatively: fit the adjusted counts
- Not a good fit for small c, hence use N_c as-is if available
  - Switch from actual counts to estimated counts when the error is small

N_c* = N_c / ( 0.5 (c_+ - c_-) ),   where c_- and c_+ are the consecutive non-zero frequencies around c
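
A minimal sketch of the regression step only: fit log N_c = log a + b log c by least squares and read off smoothed values of N_c. It assumes numpy is available and, for brevity, skips the N_c* averaging transform from the slide.

    import numpy as np

    def fit_loglog(n_c):
        """n_c: dict c -> N_c (non-zero entries only). Returns a smoothed N_c estimator."""
        cs = np.array(sorted(n_c), dtype=float)
        ns = np.array([n_c[c] for c in sorted(n_c)], dtype=float)
        b, log_a = np.polyfit(np.log(cs), np.log(ns), 1)   # slope b, intercept log a
        return lambda c: float(np.exp(log_a + b * np.log(c)))

    smoothed = fit_loglog({1: 120, 2: 40, 3: 24, 5: 10})   # made-up N_c values
    print(smoothed(4))   # an estimate for the missing N_4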


23

Good-Turing: Katz's correction

- Assume that low-frequency items are really zero-frequency
- Assume that the counts of very high-frequency items are correct and do not have to be discounted
- A k of 5 is suggested by Katz.

c* = [ (c + 1) N_{c+1} / N_c  -  c (k + 1) N_{k+1} / N_1 ] / [ 1 - (k + 1) N_{k+1} / N_1 ],   for 1 <= c <= k


24

Combining Estimators

- Another technique for handling sparseness
- Estimate an N-gram probability by using the estimates for the constituent grams
  - a trigram using trigram, bigram, and unigram estimates
- Yields better models than smoothing a fixed N-gram model
- We look at two methods:
  - Interpolation
  - Back-off


25

Simple Linear Interpolation

- Linear combination of shorter grams
- Finite mixture model
- Weights determined by EM
  - Or empirically through a hold-out set
- Discounts the higher-order probabilities

P_li(w_n | w_{n-2} w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2} w_{n-1})

0 <= λ_i <= 1,   \sum_i λ_i = 1
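
A minimal sketch of the mixture above with fixed placeholder weights; in practice the lambdas would be set by EM or tuned on a hold-out set, as the slide notes. The component models p_uni, p_bi and p_tri are assumed to be supplied by the caller (e.g. MLE estimators like the ones sketched earlier).

    LAMBDAS = (0.1, 0.3, 0.6)    # placeholder weights; each in [0, 1], summing to 1

    def p_interpolated(w, prev2, prev1, p_uni, p_bi, p_tri):
        """lambda_1 P1(w) + lambda_2 P2(w | prev1) + lambda_3 P3(w | prev2 prev1)."""
        l1, l2, l3 = LAMBDAS
        return l1 * p_uni(w) + l2 * p_bi(w, prev1) + l3 * p_tri(w, prev2, prev1)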


26

General Linear Interpolation

- The combining weights are functions of the history
  - Can give a higher weight to a longer history if its counts are high
- Histories are not treated individually but binned
  - e.g. histories with the same frequency share a bin
- The weight of the (N-1)-gram model is determined by the average number of non-zero N-grams that follow this (N-1)-gram
  - Takes care of "grammatical zeroes"

P_li(w | h) = \sum_i λ_i(h) P_i(w | h),   where for all h: 0 <= λ_i(h) <= 1 and \sum_i λ_i(h) = 1


28

Katz back-off for trigrams

(x ≡ w_{i-2},   y ≡ w_{i-1},   z ≡ w_i)

P_bo(z | x, y) = P*(z | x, y),        if C(x, y, z) > k
               = α_xy P_bo(z | y),    else if C(x, y) > 0
               = P*(z),               otherwise

P_bo(z | y)    = P*(z | y),           if C(y, z) > k
               = α_y P*(z),           otherwise

α_xy = [ 1 - \sum_{z: C(x,y,z) > k} P*(z | x, y) ] / [ 1 - \sum_{z: C(x,y,z) > k} P_bo(z | y) ]

α_y  = [ 1 - \sum_{z: C(y,z) > k} P*(z | y) ] / [ 1 - \sum_{z: C(y,z) > k} P*(z) ]
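
A minimal sketch of the bigram-level case P_bo(z | y) above; the discounted estimates P* (for example from Good-Turing with Katz's correction) and the bigram counts are assumed to be supplied by the caller, and the names are illustrative.

    def alpha_y(y, vocab, count_bi, p_star_bi, p_star_uni, k=5):
        """Left-over mass of context y, normalised over the words that get backed off."""
        kept = [z for z in vocab if count_bi(y, z) > k]
        return ((1.0 - sum(p_star_bi(z, y) for z in kept)) /
                (1.0 - sum(p_star_uni(z) for z in kept)))

    def p_backoff(z, y, vocab, count_bi, p_star_bi, p_star_uni, k=5):
        """P_bo(z | y): P*(z | y) if C(y, z) > k, else alpha_y * P*(z)."""
        if count_bi(y, z) > k:
            return p_star_bi(z, y)
        return alpha_y(y, vocab, count_bi, p_star_bi, p_star_uni, k) * p_star_uni(z)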


29

Katz back-off: some issues

- If the (N-1)-gram was never seen, then α is 1
- Can start from a quadrigram and go all the way down to a unigram!
- Generally performs well
- Can have problems with grammatical zeroes
  - If w is very frequent but never part of a particular trigram, that trigram is a potential grammatical zero
  - Backing off would still estimate it to be some fraction of the bigram probability
- Estimates can change dramatically with new data


30

Absolute Discounting

- Instead of a multiplicative reduction of the higher-order N-gram probability, do an additive reduction
- Limits the total mass subtracted

P_absolute(w_i | w_{i-1}) = ( C(w_{i-1} w_i) - D ) / C(w_{i-1}),   if C(w_{i-1} w_i) > 0
                          = α(w_{i-1}) P(w_i),                     otherwise


31

Kneser-Ney Discounting

- The unigram probability is used only when the bigram probability is not available
- "Francisco" is more frequent than "glasses"
  - But it appears only as "San Francisco"
  - Hence the unigram probability given to "Francisco" should be low!
- If the context is not "San", then the probability of "Francisco" is small
- Continuation probability!


32

Kneser-Ney Discounting

- The unigram weight depends on the number of different bigrams in which the word occurs

P*(w_i) = |{ w_{i-1} : C(w_{i-1} w_i) > 0 }| / \sum_w |{ w_{i-1} : C(w_{i-1} w) > 0 }|

Back-off form:

P_KN(w_i | w_{i-1}) = ( C(w_{i-1} w_i) - D ) / C(w_{i-1}),   if C(w_{i-1} w_i) > 0
                    = α(w_{i-1}) P*(w_i),                    otherwise

Interpolated form:

P_KN(w_i | w_{i-1}) = ( C(w_{i-1} w_i) - D ) / C(w_{i-1}) + λ(w_{i-1}) P*(w_i)
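
A minimal sketch of the continuation probability P*(w_i) defined above, reusing the bigrams counter from the earlier toy-corpus sketch: count the distinct left contexts of a word and normalise by the total number of distinct bigram types.

    def p_continuation(w, bigrams):
        """|{prev : C(prev w) > 0}| / sum over w' of |{prev : C(prev w') > 0}|."""
        contexts = {prev for (prev, nxt) in bigrams if nxt == w}
        return len(contexts) / len(bigrams)   # denominator = number of bigram types

    print(p_continuation("Sam", bigrams))     # 2/15 on the toy corpus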


34

Other N-gram models

- Use longer "spheres" of influence
  - Skip N-grams (see the sketch below)
    - "Green Eggs" from "Green Duck Eggs"
    - Variable skip length
  - Variable-length N-grams
    - "Green Eggs and" and "and Ham" - both "bigrams"
    - Use semantic information to guide the formation of longer N-grams
  - Trigger based
    - Only after a trigger word
    - "like ... ham", "like ... Sam", "like ... cricket"
    - Within a window from the trigger
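
An illustrative sketch of the skip-N-gram idea in the first bullet: collect word pairs that co-occur within a small window rather than only adjacently, so that "Green Eggs" is extracted from "Green Duck Eggs". The function name and the fixed window parameter are assumptions for illustration; the slide's "variable skip length" variant would let max_skip vary.

    def skip_bigrams(tokens, max_skip=1):
        """All ordered pairs (w_i, w_j) with at most max_skip words in between."""
        pairs = []
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
                pairs.append((w, tokens[j]))
        return pairs

    print(skip_bigrams("Green Duck Eggs".split()))
    # [('Green', 'Duck'), ('Green', 'Eggs'), ('Duck', 'Eggs')]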


35

N-gram models: Summary

- Frequentist
  - Use MLE probabilities estimated from a corpus
- Simple, yet effective
- Smoothing to handle sparseness of data
  - Add-one, add-delta
  - Good-Turing
- Combine estimators for better models
  - Interpolation
  - Back-off