Natural Language Processing
Language Model
Language Models
• Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences in a language
• For NLP, a probabilistic model that assigns every string a probability of being in the language is more useful
• To specify a correct probability distribution, the probability of all sentences in a language must sum to 1
Uses of Language Models
• Speech recognition – "I ate a cherry" is a more likely sentence than "Eye eight uh Jerry"
• OCR & handwriting recognition – more probable sentences are more likely correct readings
• Machine translation – more likely sentences are probably better translations
• Generation – more likely sentences are probably better NL generations
• Context-sensitive spelling correction – "Their are problems wit this sentence."
Completion Prediction
• A language model also supports predicting the completion of a sentence
– Please turn off your cell _____
– Your program does not ______
• Predictive text input systems can guess what you are typing and give choices on how to complete it
Probability
• P(X) means the probability that X is true
– P(baby is a boy) = 0.5 (fraction of all babies that are boys)
– P(baby is named John) = 0.001 (fraction of all babies named John)

[Venn diagram: the set of babies contains the set of baby boys, which contains the set of boys named John]
Probability
• P(X|Y) means the probability that X is true when we already know Y is true
– P(baby is named John | baby is a boy) = 0.002
– P(baby is a boy | baby is named John) = 1
Probability
• P(X|Y) = P(X, Y) / P(Y)
– P(baby is named John | baby is a boy) = P(baby is named John, baby is a boy) / P(baby is a boy) = 0.001 / 0.5 = 0.002
Bayes Rule
• Bayes rule: P(X|Y) = P(Y|X) P(X) / P(Y)
• P(named John | boy) = P(boy | named John) P(named John) / P(boy)
Word Sequence Probabilities
• Given a word sequence (sentence) w_1 … w_n (written w_1^n), its probability is:

P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) … P(w_n | w_1^{n-1}) = ∏_{k=1}^{n} P(w_k | w_1^{k-1})

• The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N−1)th-order Markov model.
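The chain rule can be sketched numerically; the conditional probabilities below are hypothetical toy values, not estimates from any corpus:

```python
import math

# Chain rule: P(w_1..w_n) is the product of each word's conditional
# probability given all preceding words. Toy values for a 4-word sentence.
cond_probs = [0.25, 0.32, 0.05, 0.4]  # P(w_k | w_1 .. w_{k-1}) for k = 1..4

p_sentence = math.prod(cond_probs)
```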
N-Gram Models
• Estimate probability of each word given prior context
– P(phone | Please turn off your cell)
• The number of parameters required grows exponentially with the number of words of prior context
• An N-gram model uses only N−1 words of prior context.
– Unigram: P(phone)
– Bigram: P(phone | cell)
– Trigram: P(phone | your cell)
• Bigram approximation:

P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})

• N-gram approximation:

P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-N+1}^{k-1})
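A minimal sketch of the bigram approximation; the bigram table is hypothetical, with probabilities made up for illustration:

```python
# Bigram approximation: P(w_1^n) is approximated by the product of
# P(w_k | w_{k-1}), padding the sentence with <s> and </s>.
bigram_p = {
    ("<s>", "please"): 0.2,
    ("please", "turn"): 0.1,
    ("turn", "off"): 0.4,
    ("off", "</s>"): 0.05,
}

def bigram_sentence_prob(words, probs):
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for w1, w2 in zip(padded, padded[1:]):
        p *= probs.get((w1, w2), 0.0)  # an unseen bigram zeroes the product
    return p
```

Note that a single unseen bigram makes the whole product zero, which is the sparseness problem the smoothing sections address.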
Estimating Probabilities
• N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences.
• To have a consistent probabilistic model, append a unique start (<s>) and end (</s>) symbol to every sentence and treat these as additional words.
Bigram:

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

N-gram:

P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})
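The relative-frequency estimate can be sketched directly from counts; the two-sentence corpus here is a toy example:

```python
from collections import Counter

# MLE bigram estimates: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}),
# with <s> and </s> appended to every sentence.
corpus = ["I want to eat", "I want Chinese food"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = ["<s>"] + sent.lower().split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]
```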
Generative Model and MLE
• An N-gram model can be seen as a probabilistic automaton for generating sentences:

Initialize sentence with N−1 <s> symbols
Until </s> is generated do:
  Stochastically pick the next word based on the conditional probability of each word given the previous N−1 words.

• Relative frequency estimates can be proven to be maximum likelihood estimates (MLE), since they maximize the probability that the model M will generate the training corpus T:

θ̂ = argmax_θ P(T | M(θ))
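The generation loop above can be sketched as follows; the bigram model (N = 2, so a single <s> symbol) and its probabilities are hypothetical:

```python
import random

# Each entry maps the previous word to (next word, probability) choices.
model = {
    "<s>": [("i", 0.5), ("please", 0.5)],
    "i": [("want", 1.0)],
    "want": [("food", 1.0)],
    "food": [("</s>", 1.0)],
    "please": [("stop", 1.0)],
    "stop": [("</s>", 1.0)],
}

def generate(model, seed=0):
    rng = random.Random(seed)
    sentence, prev = [], "<s>"
    while True:
        words, weights = zip(*model[prev])
        nxt = rng.choices(words, weights=weights)[0]  # stochastic pick
        if nxt == "</s>":
            return sentence
        sentence.append(nxt)
        prev = nxt
```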
Example
• Estimate the likelihood of the sentence: I want to eat Chinese food

P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)

• What do we need to calculate these likelihoods?
– Bigram probabilities for each word pair sequence in the sentence
– Calculated from a large corpus
Corpus
• A language model must be trained on a large corpus of text to estimate good parameter values
• Model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate)
• Ideally, the training (and test) corpus should be representative of the actual application data
Terminology
• Types: number of distinct words in a corpus (vocabulary size)
• Tokens: total number of words
Early Bigram Probabilities from BERP
| Bigram | P | Bigram | P |
|---|---|---|---|
| Eat British | .001 | Eat today | .03 |
| Eat dessert | .007 | Eat Indian | .04 |
| Eat tomorrow | .01 | Eat a | .04 |
| Eat Mexican | .02 | Eat at | .04 |
| Eat Chinese | .02 | Eat dinner | .05 |
| Eat in | .02 | Eat lunch | .06 |
| Eat breakfast | .03 | Eat some | .06 |
| Eat Thai | .03 | Eat on | .16 |
| Bigram | P | Bigram | P |
|---|---|---|---|
| British lunch | .01 | Want a | .05 |
| British cuisine | .01 | Want to | .65 |
| British restaurant | .15 | I have | .04 |
| British food | .60 | I don’t | .08 |
| To be | .02 | I would | .29 |
| To spend | .09 | I want | .32 |
| To have | .14 | <start> I’m | .02 |
| To eat | .26 | <start> Tell | .04 |
| Want Thai | .01 | <start> I’d | .06 |
| Want some | .04 | <start> I | .25 |
Back to our sentence…
• I want to eat Chinese food
Bigram counts (rows: first word, columns: second word):

|  | I | Want | To | Eat | Chinese | Food | Lunch |
|---|---|---|---|---|---|---|---|
| I | 8 | 1087 | 0 | 13 | 0 | 0 | 0 |
| Want | 3 | 0 | 786 | 0 | 6 | 8 | 6 |
| To | 3 | 0 | 10 | 860 | 3 | 0 | 12 |
| Eat | 0 | 0 | 2 | 0 | 19 | 2 | 52 |
| Chinese | 2 | 0 | 0 | 0 | 0 | 120 | 1 |
| Food | 19 | 0 | 17 | 0 | 0 | 0 | 0 |
| Lunch | 4 | 0 | 0 | 0 | 0 | 1 | 0 |
Relative Frequencies
• Normalization: divide each row's bigram counts by the unigram count of the first word w_{n-1}
• Computing the bigram probability of I I:
– C(I I) / C(I)
– p(I | I) = 8 / 3437 = .0023

Unigram counts:

| I | Want | To | Eat | Chinese | Food | Lunch |
|---|---|---|---|---|---|---|
| 3437 | 1215 | 3256 | 938 | 213 | 1506 | 459 |
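Reproducing the computation from the counts above:

```python
# Relative frequencies from the BERP counts:
# p(I | I) = C(I I) / C(I), p(want | I) = C(I want) / C(I)
c_I = 3437       # unigram count of "I"
c_I_I = 8        # bigram count of "I I"
c_I_want = 1087  # bigram count of "I want"

p_I_given_I = c_I_I / c_I
p_want_given_I = c_I_want / c_I  # matches the .32 in the probability table
```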
Approximating Shakespeare
• Generating sentences with random unigrams...
– Every enter now severally so, let
– Hill he late speaks; or! a more to leg less first you enter
• With bigrams...
– What means, sir. I confess she? then all sorts, he is trim, captain.
– Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
• Trigrams...
– Sweet prince, Falstaff shall die.
– This shall forbid it should be branded, if renown made it empty.
Approximating Shakespeare
• Quadrigrams
– What! I will go seek the traitor Gloucester.
– Will you not tell me who I am?
– What's coming out here looks like Shakespeare because it is Shakespeare
• Note: As we increase the value of N, the accuracy of an n-gram model increases, since the choice of the next word becomes increasingly constrained
Evaluation
• Perplexity and entropy: how do you estimate how well your language model fits a corpus once you’re done?
![Page 23: Natural Language Processing Language Model. Language Models Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences](https://reader035.vdocument.in/reader035/viewer/2022062304/56649de45503460f94adc446/html5/thumbnails/23.jpg)
Random Variables
• A variable defined by the probabilities of each possible value in the population
• Discrete Random Variable – Whole Number (0, 1, 2, 3 etc.)– Countable, Finite Number of Values
• Jump from one value to the next and cannot take any values in between
• Continuous Random Variables– Whole or Fractional Number– Obtained by Measuring– Infinite Number of Values in Interval
• Too many to list, unlike a discrete variable
![Page 24: Natural Language Processing Language Model. Language Models Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences](https://reader035.vdocument.in/reader035/viewer/2022062304/56649de45503460f94adc446/html5/thumbnails/24.jpg)
Discrete Random Variables
• For example:
# of girls in the family
# of correct answers on a given exam
…
Probability Mass Function
• Probability
– x = value of random variable (outcome)
– p(x) = probability associated with value
• Mutually exclusive (no overlap)
• Collectively exhaustive (nothing left out)
• 0 ≤ p(x) ≤ 1
• Σ p(x) = 1
Measures
• Expected Value
– Mean of probability distribution
– Weighted average of all possible values
– μ = E(X) = Σ x p(x)
• Variance
– Weighted average squared deviation about the mean
– σ² = V(X) = E[(X − μ)²] = Σ (x − μ)² p(x)
– σ² = V(X) = E(X²) − [E(X)]²
• Standard Deviation
– σ = √σ² = SD(X)
Perplexity and Entropy
• Information theoretic metrics– Useful in measuring how well a grammar or language model
(LM) models a natural language or a corpus
• Entropy: With 2 LMs and a corpus, which LM is the better match for the corpus? How much information is there (in e.g. a grammar or LM) about what the next word will be?
• For a random variable X ranging over e.g. bigrams and a probability function p(x), the entropy of X is the expected negative log probability:

H(X) = −Σ_x p(x) log₂ p(x)
Entropy - Example
• Horse race: 8 horses, and we want to send a bet to the bookie
• In the naive way: a 3-bit message
• Can we do better?
• Suppose we know the distribution of the bets placed, i.e.:

| Horse | P | Horse | P |
|---|---|---|---|
| Horse 1 | 1/2 | Horse 5 | 1/64 |
| Horse 2 | 1/4 | Horse 6 | 1/64 |
| Horse 3 | 1/8 | Horse 7 | 1/64 |
| Horse 4 | 1/16 | Horse 8 | 1/64 |
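The entropy of this bet distribution works out to 2 bits, which is what makes a better-than-3-bit code possible; a quick check:

```python
import math

# H(X) = -sum_x p(x) * log2 p(x) over the eight horses
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

H = -sum(p * math.log2(p) for p in probs)  # 2.0 bits
```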
Entropy - Example
• The entropy of the random variable X gives a lower bound on the average number of bits needed to encode the bet
Perplexity
• The weighted average number of choices a random variable has to make: perplexity(X) = 2^{H(X)}
• In the previous example, with the naive uniform encoding over 8 horses, it is 2³ = 8
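Assuming the standard definition perplexity = 2^H, the two cases from the horse-race example can be compared:

```python
import math

# Uniform bets over 8 horses: H = log2(8) = 3 bits, perplexity 8.
perplexity_uniform = 2 ** math.log2(8)

# The skewed bet distribution has H = 2 bits, so perplexity 4:
# effectively a choice among 4 equally likely outcomes.
perplexity_skewed = 2 ** 2.0
```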
Entropy of a Language
• Measuring all the sequences of length n in a language L:

H(w_1, …, w_n) = −Σ_{w_1^n ∈ L} p(w_1^n) log p(w_1^n)

• Entropy rate: the per-word entropy (1/n) H(w_1, …, w_n); the entropy of the language is the limit as n → ∞
Cross Entropy and Perplexity
• Given an approximation of the language's true distribution p and a model m, we want to measure how well m predicts p:

H(p, m) = lim_{n→∞} −(1/n) Σ_{w_1^n} p(w_1^n) log m(w_1^n)

• Perplexity = 2^{H(p, m)}
Perplexity
• Better models m of the unknown distribution p will tend to assign higher probabilities m(xi) to the test events. Thus, they have lower perplexity: they are less surprised by the test sample
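A sketch of measuring perplexity on held-out data; the per-word model probabilities below are hypothetical:

```python
import math

# Estimate per-word cross-entropy as -(1/n) * sum log2 m(w_i | history),
# then perplexity = 2 ** cross-entropy.
def perplexity(word_probs):
    n = len(word_probs)
    cross_entropy = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** cross_entropy

better = perplexity([0.25, 0.5, 0.25, 0.5])  # model assigns higher probs
worse = perplexity([0.05, 0.1, 0.05, 0.1])   # model is more "surprised"
```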
Example
Slide from Philipp Koehn
Comparison 1-4 grams LM
Slide from Philipp Koehn
Smoothing
• Words follow a Zipfian distribution
– A small number of words occur very frequently
– A large number are seen only once
• A zero probability for a single bigram makes the probability of the entire sentence zero
• So... how do we estimate the likelihood of unseen n-grams?
Steal from the rich and give to the poor (in probability mass)
Slide from Dan Klein
Add-One Smoothing
• For all possible n-grams, add a count of one:

P = (C + 1) / (N + V)

– C = count of n-gram in corpus
– N = count of history
– V = vocabulary size
• But there are many more unseen n-grams than seen n-grams
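A minimal sketch of add-one smoothing with hypothetical counts, using V = 1616 as in the slides' vocabulary:

```python
from collections import Counter

# P(w | prev) = (C(prev w) + 1) / (C(prev) + V)
V = 1616
bigram_counts = Counter({("eat", "lunch"): 6})  # hypothetical counts
history_counts = Counter({"eat": 20})

def p_add_one(w, prev):
    return (bigram_counts[(prev, w)] + 1) / (history_counts[prev] + V)
```

Every unseen bigram now gets probability 1 / (20 + 1616) instead of zero.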
Add-One Smoothing
• Unsmoothed bigram counts (rows: 1st word, columns: 2nd word) and unsmoothed normalized bigram probabilities were shown as tables on the slide
• Vocabulary = 1616
Add-One Smoothing
• Add-one smoothed bigram counts and add-one normalized bigram probabilities were shown as tables on the slide
Add-One Problems
• Bigrams starting with Chinese are boosted by a factor of 8! (1829 / 213: with add-one, the denominator grows from C(Chinese) = 213 to 213 + 1616 = 1829)
• Compare the unsmoothed and add-one smoothed bigram counts (tables on slide)
Add-One Problems
• Every previously unseen n-gram is given a low probability, but there are so many of them that too much probability mass is given to unseen events
• Adding 1 to a frequent bigram changes it little, but adding 1 to rare bigrams (including unseen ones) boosts them too much!
• In NLP applications that are very sparse, Add-One actually gives far too much of the probability space to unseen events
![Page 43: Natural Language Processing Language Model. Language Models Formal grammars (e.g. regular, context free) give a hard “binary” model of the legal sentences](https://reader035.vdocument.in/reader035/viewer/2022062304/56649de45503460f94adc446/html5/thumbnails/43.jpg)
Witten-Bell Smoothing
• Intuition:
– An unseen n-gram is one that just did not occur yet
– When it does happen, it will be its first occurrence
– So give unseen n-grams the probability of seeing a new n-gram
Witten-Bell – Unigram Case
• N: number of tokens
• T: number of types (distinct observed words), which can differ from V (the number of words in the dictionary)
• Probability mass reserved for unseen unigrams: T / (N + T)
• Probability of a seen unigram w: C(w) / (N + T)
Witten-Bell – Bigram Case
• Condition on the previous word w
• Probability mass reserved for unseen bigrams starting with w: T(w) / (N(w) + T(w))
• Probability of a seen bigram (w, w′): C(w, w′) / (N(w) + T(w))
Witten-Bell Example
• The original counts were shown in the earlier bigram count table
• T(w) = number of distinct bigram types seen starting with w
• We have a vocabulary of 1616 words, so we can compute Z(w) = number of unseen bigram types starting with w: Z(w) = 1616 − T(w)
• N(w) = number of bigram tokens starting with w
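A sketch of the bigram case using the slide's notation (T(w), N(w), Z(w)); the counts here are hypothetical stand-ins:

```python
from collections import Counter

V = 1616  # vocabulary size, as in the slides
bigram_counts = Counter({("eat", "lunch"): 6, ("eat", "dinner"): 5})

def witten_bell(w, prev):
    starts = {pair: c for pair, c in bigram_counts.items() if pair[0] == prev}
    T = len(starts)           # seen bigram types starting with prev
    N = sum(starts.values())  # bigram tokens starting with prev
    Z = V - T                 # unseen bigram types starting with prev
    if (prev, w) in bigram_counts:
        return bigram_counts[(prev, w)] / (N + T)  # seen: discounted MLE
    return T / (Z * (N + T))  # unseen: equal share of the reserved mass
```

The probabilities sum to 1 for each history: seen bigrams get N/(N+T) in total, and the Z unseen types share the reserved T/(N+T).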
Witten-Bell Example
• WB smoothed probabilities (table on slide)
Back-off
• So far, we gave the same probability to all unseen n-grams
– We have never seen the bigrams:
• journal of: P_unsmoothed(of | journal) = 0
• journal from: P_unsmoothed(from | journal) = 0
• journal never: P_unsmoothed(never | journal) = 0
– All models so far will give the same probability to all 3 bigrams
• But intuitively, “journal of” is more probable because...
– “of” is more frequent than “from” and “never”
– unigram probability P(of) > P(from) > P(never)
Back-off
• Observation:
– unigram model suffers less from data sparseness than bigram model
– bigram model suffers less from data sparseness than trigram model
– …
• So use a lower-order model estimate to estimate the probability of unseen n-grams
• If we have several models of how the history predicts what comes next, we can combine them in the hope of producing an even better model
Linear Interpolation
• Solve the sparseness in a trigram model by mixing with bigram and unigram models
• Also called: – linear interpolation– finite mixture models – deleted interpolation
• Combine linearly:

P_li(w_n | w_{n−2}, w_{n−1}) = λ₁ P(w_n) + λ₂ P(w_n | w_{n−1}) + λ₃ P(w_n | w_{n−2}, w_{n−1})

– where 0 ≤ λᵢ ≤ 1 and Σᵢ λᵢ = 1
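A sketch of the interpolation, with hypothetical component probabilities and lambda weights:

```python
# P_li = l1 * P(w) + l2 * P(w | prev) + l3 * P(w | prev2, prev)
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # the weights must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

p = interpolate(0.001, 0.01, 0.2)
```

In practice the lambda weights are tuned on held-out data rather than fixed by hand.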
Back-off Smoothing
• Smoothing of Conditional Probabilities
p(Angeles | to, Los)
• If “to Los Angeles” is not in the training corpus, the smoothed probability p(Angeles | to, Los) is identical to p(York | to, Los)
• However, the actual probability is probably close to the bigram probability p(Angeles | Los)
Back-off Smoothing
• (Wrong) back-off smoothing of trigram probabilities:
– if C(w′, w″, w) > 0: P*(w | w′, w″) = P(w | w′, w″)
– else if C(w″, w) > 0: P*(w | w′, w″) = P(w | w″)
– else if C(w) > 0: P*(w | w′, w″) = P(w)
– else: P*(w | w′, w″) = 1 / #words
Back-off Smoothing
• Problem: not a probability distribution
• Solution: combination of back-off and smoothing

if C(w1, …, wk, w) > 0:
  P(w | w1, …, wk) = C*(w1, …, wk, w) / N
else:
  P(w | w1, …, wk) = α(w1, …, wk) · P(w | w2, …, wk)
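A rough sketch of this combined back-off; the counts, the discount, and the alpha weight are all hypothetical (a real α(history) is computed so that the conditional probabilities sum to 1):

```python
trigram_counts = {("to", "los", "angeles"): 4}
bigram_counts = {("los", "angeles"): 9}
N = 1000        # token count (hypothetical)
ALPHA = 0.4     # back-off weight (hypothetical)
DISCOUNT = 0.5  # reserves mass via discounted counts: C* = C - DISCOUNT

def p_backoff(w, history):
    full = tuple(history) + (w,)
    if full in trigram_counts:
        return (trigram_counts[full] - DISCOUNT) / N  # C* / N
    shorter = tuple(history[1:]) + (w,)
    if shorter in bigram_counts:
        return ALPHA * (bigram_counts[shorter] - DISCOUNT) / N
    return ALPHA * ALPHA / N  # keep backing off (sketch only)
```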