NLP Language Models: Language Models (LM), Noisy Channel Model, Simple Markov Models, Smoothing, Statistical Language Models


Page 1:

NLP Language Models 1

• Language Models, LM

• Noisy Channel model

• Simple Markov Models

• Smoothing

Statistical Language Models

Page 2:

NLP Language Models 2

Two Main Approaches to NLP

• Knowledge-based (AI)

• Statistical models

- Inspired by speech recognition: the probability of the next word is based on the previous ones

- Statistical models are different from statistical methods

Page 3:

NLP Language Models 3

Probability Theory

• Let X be the uncertain outcome of some event; X is called a random variable

• V(X) is the finite set of possible outcomes (not necessarily real numbers)

• P(X = x) is the probability of the particular outcome x (x ∈ V(X))

• Example: X is the disease of your patient, V(X) the set of all possible diseases

Page 4:

NLP Language Models 4

Conditional probability: the probability of the outcome of an event given the outcome of a second event.

We pick two adjacent words at random from a book. We know the first word is "the"; we want the probability that the second word is "dog":

P(W2 = dog | W1 = the) = |W1 = the, W2 = dog| / |W1 = the|, i.e. the count of "the dog" pairs divided by the count of "the"

Bayes’s law: P(x|y) = P(x) P(y|x) / P(y)

Probability Theory
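The estimate above can be reproduced directly from raw text by counting. The following is a minimal sketch (not part of the original slides); the toy sentence standing in for "a book" is an illustrative assumption.

    from collections import Counter

    # Toy stand-in for "a book"; purely illustrative.
    text = "the dog barked at the cat while the dog slept"
    words = text.split()

    unigram_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))

    # P(W2 = dog | W1 = the) ~ C(the, dog) / C(the)
    p_dog_given_the = bigram_counts[("the", "dog")] / unigram_counts["the"]
    print(p_dog_given_the)  # 2/3 for this toy text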

Page 5:

NLP Language Models 5

Bayes’s law: P(x|y) = P(x) P(y|x) / P(y)

P(disease | symptom) = P(disease) P(symptom | disease) / P(symptom)

P(w1,n| speech signal) = P(w1,n)P(speech signal | w1,n)/ P(speech signal)

Since the denominator does not depend on the word sequence, we only need to maximize the numerator

P(speech signal | w1,n) expresses how well the speech signal fits the sequence of words w1,n

Probability Theory

Page 6:

NLP Language Models 6

Useful generalizations of Bayes's law

- To find the probability of something happening, calculate the probability that it happens given some second event, times the probability of the second event

- P(w,x|y,z) = P(w,x) P(y,z|w,x) / P(y,z)

where w, x, y, z are separate events (e.g., each the occurrence of a word)

- P(w1,w2,…,wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1,…,wn-1)

also when conditioned on some event:

P(w1,w2,…,wn|x) = P(w1|x) P(w2|w1,x) … P(wn|w1,…,wn-1,x)

Probability Theory

Page 7:

NLP Language Models 7

Statistical Model of a Language

• Statistical models of the words of sentences are called language models

• They assign a probability to all possible sequences of words

• For sequences of words of length n, assign a number to P(W1,n = w1,n), where w1,n is a sequence of n random variables W1, W2, …, Wn, each taking any word as its value

• Computing the probability of the next word is related to computing the probability of sequences:

P(the big dog) = P(the) P(big|the) P(dog|the big)

P(the pig dog) = P(the) P(pig|the) P(dog|the pig)
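As a rough illustration of how the chain rule turns conditional probabilities into a sequence probability, here is a small sketch; the conditional probability values are made-up placeholders, not estimates from any corpus.

    # Hand-picked conditional probabilities, purely illustrative.
    cond_probs = {
        ("the", ""): 0.05,            # P(the)
        ("big", "the"): 0.01,         # P(big | the)
        ("dog", "the big"): 0.002,    # P(dog | the big)
    }

    p_sentence = 1.0
    history = []
    for word in ["the", "big", "dog"]:
        p_sentence *= cond_probs[(word, " ".join(history))]
        history.append(word)

    print(p_sentence)  # 0.05 * 0.01 * 0.002 = 1e-06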

Page 8:

NLP Language Models 8

Ngram Model

• Simple but durable statistical model

• Useful to identify words in noisy, ambiguous input

• Speech recognition, where many input speech sounds are similar and confusable

• Machine translation, spelling correction, handwriting recognition, predictive text input

• Other NLP tasks: part of speech tagging, NL generation, word similarity

Page 9:

NLP Language Models 9

CORPORA

• Corpora (singular corpus) are online collections of text or speech.

• Brown Corpus: a 1-million-word collection of samples from 500 written texts from different genres (newspaper, novels, academic).

• Punctuation can be treated as words.

• Switchboard corpus: 2,430 telephone conversations averaging 6 minutes each, 240 hours of speech and 3 million words

There is no punctuation.

There are disfluencies: filled pauses and fragments

“I do uh main- mainly business data processing”

Page 10:

NLP Language Models 10

Training and Test Sets

• The probabilities of an N-gram model come from the corpus it is trained on

• The data in the corpus is divided into a training set (or training corpus) and a test set (or test corpus)

• Perplexity: a measure used to compare statistical models on the test set
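Perplexity is not defined on this slide, so as a hedged sketch: it is commonly computed as the inverse probability of the test set, normalized by the number of words. The per-word probabilities below are made-up placeholders.

    import math

    # Hypothetical model probabilities P(w_i | history) for a 4-word test text.
    word_probs = [0.1, 0.05, 0.2, 0.08]

    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    perplexity = math.exp(-log_prob / n)
    print(perplexity)  # lower is better: the model is less "surprised" by the test set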

Page 11:

NLP Language Models 11

Ngram Model

• How can we compute probabilities of entire sequences P(w1,w2,..,wn)

• Decomposition using the chain rule of probability

P(w1,w2,…,wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1,…,wn-1)

• Assigns a conditional probability to possible next words considering the history.

• Markov assumption : we can predict the probability of some future unit without looking too far into the past.

• Bigrams consider only the previous unit; trigrams, the two previous units; n-grams, the n-1 previous units

Page 12:

NLP Language Models 12

Ngram Model

• Assigns a conditional probability to possible next words. Only the n-1 previous words have an effect on the probability of the next word

• For n = 3, trigrams: P(wn|w1,…,wn-1) = P(wn|wn-2,wn-1)

• How do we estimate these trigram or N-gram probabilities?

Maximum Likelihood Estimation (MLE): maximize the likelihood of the training set T given the model M, i.e. P(T|M)

• To create the model, use a training text (corpus): take counts and normalize them so they lie between 0 and 1

Page 13:

NLP Language Models 13

Ngram Model

• For n = 3, Trigrams

P(wn|w1..,wn-1) = P(wn|wn-2,wn-1)

• To create the model, use a training text and record which pairs and triples of words appear in the text and how many times

P(wi|wi-2,wi-1) = C(wi-2,wi-1,wi) / C(wi-2,wi-1)

P(submarine|the, yellow) = C(the, yellow, submarine) / C(the, yellow)

Relative frequency: the observed frequency of a particular sequence divided by the observed frequency of its prefix
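A minimal sketch of this relative-frequency (MLE) estimate, assuming a tiny toy training text; the text is an illustrative assumption, not a corpus from the slides.

    from collections import Counter

    text = "we all live in the yellow submarine the yellow submarine the yellow submarine"
    words = text.split()

    bigram_counts = Counter(zip(words, words[1:]))
    trigram_counts = Counter(zip(words, words[1:], words[2:]))

    def p_mle(w, w_prev2, w_prev1):
        # P(w | w_prev2, w_prev1) = C(w_prev2, w_prev1, w) / C(w_prev2, w_prev1)
        prefix = bigram_counts[(w_prev2, w_prev1)]
        return trigram_counts[(w_prev2, w_prev1, w)] / prefix if prefix else 0.0

    print(p_mle("submarine", "the", "yellow"))  # 1.0 in this toy text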

Page 14:

NLP Language Models 14

• Statistical Models

• Language Models (LM)

• Vocabulary (V), word w ∈ V

• Language (L), sentence s ∈ L, L ⊆ V* (usually infinite)

• s = w1, …, wN

• Probability of s: P(s)

Language Models

Page 15:

NLP Language Models 15

[Diagram] Message W → encoder → X (input to channel) → channel p(y|x) → Y (output from channel) → decoder → W* (attempt to reconstruct the message based on the output)

Noisy Channel Model

Page 16:

NLP Language Models 16

• In NLP we do not usually act on the encoding. The problem reduces to decoding: finding the most likely input given the output,

î = argmax_i p(i) p(o|i)

[Diagram] I → noisy channel p(o|i) → O → decoder → Î

Noisy Channel Model in NLP

Page 17:

NLP Language Models 17

î = argmax_i p(i) p(o|i)

p(i): Language Model    p(o|i): Channel Model
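A minimal sketch of this decoding rule for a spelling-correction style setting (as in the later slides); the candidate list, language-model probabilities, and channel probabilities are all made-up illustrations.

    # Pick the input i that maximizes p(i) * p(o|i) for an observed output o.
    observed = "teh dog"

    candidates = {
        # candidate i: (p_language_model(i), p_channel(observed | i))
        "the dog": (1e-4, 0.05),
        "teh dog": (1e-9, 0.90),
        "tea dog": (1e-7, 0.02),
    }

    best = max(candidates, key=lambda i: candidates[i][0] * candidates[i][1])
    print(best)  # "the dog": 1e-4 * 0.05 beats the other products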

Page 18:

NLP Language Models 18

[Diagram] Real language X → noisy channel → observed language Y

We want to retrieve X from Y

Page 19:

NLP Language Models 19

[Diagram] Correct text (real language X) → noisy channel (errors) → text with errors (observed language Y)

Page 20:

NLP Language Models 20

[Diagram] Correct text (real language X) → noisy channel (space removal) → text without spaces (observed language Y)

Page 21:

NLP Language Models 21

[Diagram] Real language X (text, described by the language model) → noisy channel (acoustic model) → observed language Y (speech, spelling)

Page 22:

NLP Language Models 22

[Diagram] Real language X (source language) → noisy channel (translation) → observed language Y (target language)

Page 23:

NLP Language Models 23

example: ASR, Automatic Speech Recognition

[Diagram] Acoustic chain X1 … XT → ASR (language model + acoustic model) → word chain w1 … wN

Page 24:

NLP Language Models 24

example: Machine Translation

[Diagram] Target language model + translation model

Page 25:

NLP Language Models 25

• Naive implementation

- Enumerate s ∈ L

- Compute p(s)

- Parameters of the model: |L|

• But …

- L is usually infinite

- How to estimate the parameters?

• Simplifications

- History: hi = {w1, …, wi-1}

- Markov Models

Page 26:

NLP Language Models 26

• Markov Models: an n-gram model is a Markov model of order n-1

• P(wi|hi) = P(wi|wi-n+1, … wi-1)

• 0-gram

• 1-gram: P(wi|hi) = P(wi)

• 2-gram: P(wi|hi) = P(wi|wi-1)

• 3-gram: P(wi|hi) = P(wi|wi-2,wi-1)

Page 27:

NLP Language Models 27

• n large: more context information (more discriminative power)

• n small: more cases in the training corpus (more reliable)

• Selecting n: e.g., for |V| = 20,000

n             number of parameters
2 (bigrams)   400,000,000
3 (trigrams)  8,000,000,000,000
4 (4-grams)   1.6 × 10^17
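The figures in the table follow directly from |V|^n; a quick check:

    # Number of parameters |V|**n for |V| = 20,000
    V = 20_000
    for n in (2, 3, 4):
        print(n, V ** n)
    # 2 -> 400,000,000
    # 3 -> 8,000,000,000,000
    # 4 -> 160,000,000,000,000,000 (1.6 x 10^17)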

Page 28:

NLP Language Models 28

• Parameters of an n-gram model: |V|^n

• MLE estimation from a training corpus

• Problem of sparseness

Page 29:

NLP Language Models 29

• 1-gram model: P_MLE(w) = C(w) / N, with N the total number of words in the training corpus

• 2-gram model: P_MLE(wi|wi-1) = C(wi-1 wi) / C(wi-1)

• 3-gram model: P_MLE(wi|wi-2,wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)

Page 30:

NLP Language Models 30

Page 31:

NLP Language Models 31

Page 32:

NLP Language Models 32

[Figure] True probability distribution

Page 33:

NLP Language Models 33

The seen cases are overestimated; the unseen ones have a null probability

Page 34:

NLP Language Models 34

Reserve part of the probability mass from the seen cases and assign it to the unseen ones

SMOOTHING

Page 35:

NLP Language Models 35

• Some methods operate on the counts: Laplace, Lidstone, Jeffreys-Perks

• Some methods operate on the probabilities: Held-Out, Good-Turing, Discounting

• Some methods combine models: linear interpolation, back-off

Page 36:

NLP Language Models 36

P_Laplace(w1…wn) = (C(w1…wn) + 1) / (N + B)

P = probability of an n-gram

C = count of the n-gram in the training corpus

N = total number of n-grams in the training corpus

B = number of parameters of the model (number of possible n-grams)

Laplace (add 1)

Page 37:

NLP Language Models 37

P_Lid(w1…wn) = (C(w1…wn) + λ) / (N + B·λ)

λ = a small positive number

MLE: λ = 0    Laplace: λ = 1    Jeffreys-Perks: λ = ½

Lidstone (generalization of Laplace)
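A minimal sketch of Lidstone smoothing applied to bigram counts (λ = 1 gives Laplace, λ = ½ gives Jeffreys-Perks); the toy text and vocabulary are illustrative assumptions.

    from collections import Counter

    text = "the yellow submarine the yellow dog"
    words = text.split()
    vocab = set(words)

    bigram_counts = Counter(zip(words, words[1:]))
    N = len(words) - 1       # total number of bigram tokens in the training text
    B = len(vocab) ** 2      # number of possible bigrams (model parameters)

    def p_lidstone(bigram, lam=1.0):
        return (bigram_counts[bigram] + lam) / (N + B * lam)

    print(p_lidstone(("the", "yellow")))     # seen bigram
    print(p_lidstone(("dog", "submarine")))  # unseen bigram still gets probability mass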

Page 38:

NLP Language Models 38

• Compute the percentage of the probability mass that has to be reserved for the n-grams unseen in the training corpus

• We set aside part of the training corpus as a held-out corpus

• We compute how many n-grams unseen in the training corpus occur in the held-out corpus

• An alternative to using a held-out corpus is cross-validation: held-out interpolation, deleted interpolation

Held-Out

Page 39:

NLP Language Models 39

T_r = Σ over {w1…wn : C1(w1…wn) = r} of C2(w1…wn)

P_ho(w1…wn) = T_r / (N_r · N)

Let w1…wn be an n-gram, and r = C(w1…wn)

C1(w1…wn): count of the n-gram in the training set

C2(w1…wn): count of the n-gram in the held-out set

N_r: number of n-grams with count r in the training set

Held-Out
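A rough sketch of the held-out estimate above, using unigrams for brevity; the toy training and held-out texts are illustrative assumptions, N is taken here as the held-out token count (an assumption; the slide does not define N), and the r = 0 case is simplified (a full implementation would count all possible unseen n-grams in N_0).

    from collections import Counter, defaultdict

    train = "a a a b b c".split()
    held_out = "a a b b c d".split()

    c1 = Counter(train)       # counts in the training set
    c2 = Counter(held_out)    # counts in the held-out set
    N = len(held_out)         # total tokens in the held-out set (assumption)

    # N_r: number of types with training count r; T_r: their total held-out count
    n_r, t_r = defaultdict(int), defaultdict(int)
    for w in set(train) | set(held_out):
        r = c1[w]
        n_r[r] += 1
        t_r[r] += c2[w]

    def p_held_out(word):
        r = c1[word]
        return t_r[r] / (n_r[r] * N)

    print(p_held_out("c"))  # a word seen once in training
    print(p_held_out("d"))  # a word unseen in training (r = 0)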

Page 40:

NLP Language Models 40

r* = adjusted count

N_r = number of n-gram types occurring r times

E(Nr) = expected value

E(Nr+1) < E(Nr)

Zipf law

r* = (r + 1) · E(N_{r+1}) / E(N_r)

P_GT = r* / N

Good-Turing
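A small sketch of Good-Turing adjusted counts, using the raw frequencies of frequencies N_r in place of the expected values E(N_r); the toy data is illustrative only.

    from collections import Counter

    train = "a a a a b b b c c d d e f g".split()

    counts = Counter(train)
    N = sum(counts.values())                  # total tokens
    freq_of_freq = Counter(counts.values())   # N_r: number of types occurring r times

    def r_star(r):
        # r* = (r + 1) * N_{r+1} / N_r  (only defined when N_r > 0)
        return (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]

    def p_gt(word):
        return r_star(counts[word]) / N

    print(r_star(1), p_gt("e"))  # adjusted count and probability for a count-1 word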

Page 41:

NLP Language Models 41

Combination of models:

Linear combination (interpolation)

• Linear combination of the 1-gram, 2-gram, 3-gram, … models

• Estimation of the λ weights using a development corpus

P_li(wn|wn-2,wn-1) = λ1 P1(wn) + λ2 P2(wn|wn-1) + λ3 P3(wn|wn-2,wn-1)
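A minimal sketch of the interpolated estimate; the component probabilities and λ weights are made-up placeholders (in practice the λ's are tuned on the development corpus and sum to 1).

    # Linear interpolation of unigram, bigram and trigram estimates.
    def p_interpolated(p1, p2, p3, lambdas=(0.1, 0.3, 0.6)):
        l1, l2, l3 = lambdas
        return l1 * p1 + l2 * p2 + l3 * p3

    # Hypothetical P1(dog), P2(dog|big), P3(dog|the big)
    print(p_interpolated(0.0003, 0.01, 0.2))  # 0.1*0.0003 + 0.3*0.01 + 0.6*0.2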

Page 42:

NLP Language Models 42

• Start with an n-gram model

• Back off to the (n-1)-gram for null (or low) counts

• Proceed recursively

Katz’s Backing-Off
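A simplified back-off sketch in the spirit of the recursion above: it falls back from trigram to bigram to unigram when counts are zero, but omits the discounting and α normalization weights that full Katz backing-off uses to keep the distribution summing to 1. The toy text is an illustrative assumption.

    from collections import Counter

    text = "the yellow submarine sails the yellow sea".split()

    uni = Counter(text)
    bi = Counter(zip(text, text[1:]))
    tri = Counter(zip(text, text[1:], text[2:]))
    N = len(text)

    def p_backoff(w, w2, w1):
        # Use the trigram estimate if seen, else back off to bigram, then unigram.
        if tri[(w2, w1, w)] > 0:
            return tri[(w2, w1, w)] / bi[(w2, w1)]
        if bi[(w1, w)] > 0:
            return bi[(w1, w)] / uni[w1]
        return uni[w] / N

    print(p_backoff("submarine", "the", "yellow"))  # trigram observed
    print(p_backoff("sea", "yellow", "submarine"))  # backs off to the unigram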

Page 43:

NLP Language Models 43

• Operating on the history

• Class-based Models: clustering (or classifying) words into classes (POS, syntactic, semantic)

• Rosenfeld, 2000:

- P(wi|wi-2,wi-1) = P(wi|Ci) P(Ci|wi-2,wi-1)

- P(wi|wi-2,wi-1) = P(wi|Ci) P(Ci|wi-2,Ci-1)

- P(wi|wi-2,wi-1) = P(wi|Ci) P(Ci|Ci-2,Ci-1)

- P(wi|wi-2,wi-1) = P(wi|Ci-2,Ci-1)

Page 44:

NLP Language Models 44

• Structured Language Models (Jelinek and Chelba, 1999): including the syntactic structure in the history

• Ti are the syntactic structures: binarized lexicalized trees