probability & language modeling · the film got a great opening and the film went on to become...

173
Probability & Language Modeling CMSC 473/673 UMBC Some slides adapted from 3SLP, Jason Eisner

Upload: others

Post on 27-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probability & Language Modeling

CMSC 473/673

UMBC

Some slides adapted from 3SLP, Jason Eisner

Page 2: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

orthography

morphology

lexemes

syntax

semantics

pragmatics

discourse

VISION

AUDIO

prosody

intonation

color

Page 3: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •
Page 4: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.

score( )

Page 5: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.

pθ( )

Page 6: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.

pθ( )

what’s a probability?

Page 7: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today.

pθ( )

what do we estimate?Documents? Sentences? Words? Characters?

Page 8: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining

Path attack today.

pθ( )

what’s a word?how to deal with morphology and orthography

Page 9: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Tree people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of an Shining Path attack today.

pθ( )

how do we estimate robustly?

Page 10: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of an ISISattack today.

pθ( )

how do we generalize?

Page 11: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Outline

Probability review

Words

Defining Language Models

Breaking & Fixing Language Models

Evaluating Language Models

Page 12: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Outline

Probability review

Words

Defining Language Models

Breaking & Fixing Language Models

Evaluating Language Models

Page 13: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probability Takeaways

Basic probability axioms and definitions

Probabilistic Independence

Definition of joint probability

Definition of conditional probability

Bayes rule

Probability chain rule

Page 14: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Kinds of Statistics

Descriptive

Confirmatory

Predictive

The average grade on this

assignment is 83.

Page 15: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Interpretations of Probability

Past performance58% of the past 100 flips were heads

Hypothetical performanceIf I flipped the coin in many parallel universes…

Subjective strength of beliefWould pay up to 58 cents for chance to win $1

Output of some computable formula?p(heads) vs q(heads)

Page 16: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probabilities Measure Sets

all (known) outcomes involving coin being flipped

coin coming

up heads

Page 17: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probabilities Measure Sets

all (known) outcomes involving coin being flipped

coin is ancient

coin coming

up heads

Page 18: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probabilities Measure Sets

all (known) outcomes involving coin being flipped

defectiveminting process

coin is ancient

coin coming

up heads

Page 19: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probabilities Measure Sets

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

Page 20: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

(Most) Probability Axioms

p(everything) = 1

everything

Page 21: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

(Most) Probability Axioms

p(everything) = 1

p(φ) = 0everything

Page 22: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

(Most) Probability Axioms

p(everything) = 1

p(φ) = 0

p(A) ≤ p(B), when A ⊆ B everything

B

A

Page 23: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

(Most) Probability Axioms

p(everything) = 1

p(φ) = 0

p(A) ≤ p(B), when A ⊆ B

p(A ∪ B) = p(A) + p(B),

when A ∩ B = φ

everything

A B

Page 24: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

(Most) Probability Axioms

p(everything) = 1

p(φ) = 0

p(A) ≤ p(B), when A ⊆ B

p(A ∪ B) = p(A) + p(B),

when A ∩ B = φ

everything

A B

p(A ∪ B) ≠ p(A) + p(B)

Page 25: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

(Most) Probability Axioms

p(everything) = 1

p(φ) = 0

p(A) ≤ p(B), when A ⊆ B

p(A ∪ B) = p(A) + p(B),

when A ∩ B = φ

everything

A B

p(A ∪ B) = p(A) + p(B) – p(A ∩ B)

Page 26: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probabilities of Independent Events Multiply

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝(ancient coin AND defective minting process)

Page 27: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probabilities of Independent Events Multiply

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 ancient coin AND defective minting process =𝑝 ancient coin ∗ 𝑝(defective minting process)

Page 28: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probabilities of Independent Events Multiply

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

comma represents AND

𝑝 ancient coin, defective minting process =𝑝 ancient coin ∗ 𝑝(defective minting process)

Page 29: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Joint Probabilities Are (Should Be) Symmetric

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 defective minting process, ancient coin =𝑝 defective minting process ∗ 𝑝 ancient coin

But the arguments to

joint probabilities can have an order

Page 30: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Conditional Probabilities (Also) Measure Sets

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads defective minting process)

Page 31: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Conditional Probabilities (Also) Measure Sets

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads defective minting process) =𝑝(heads AND defective minting process)

𝑝(defective minting process)

Page 32: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Conditional Probabilities (Also) Measure Sets

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads defective minting process) =𝑝(heads AND defective minting process)

𝑝(defective minting process)

defective process favors tails

Page 33: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Conditional Probabilities (Also) Measure Sets

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

coin is ancient

𝑝 heads defective minting process) =𝑝(heads AND defective minting process)

𝑝(defective minting process)

defective process favors heads

defectiveminting process

coin coming

up heads

Page 34: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Conditional Probabilities Are Probabilities

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads egg salad) 𝑝 heads NOT egg salad)vs.

Page 35: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Conditional Probabilities Are Probabilities

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads egg salad) 𝑝 heads NOT egg salad)vs.

𝑝 heads egg salad) 𝑝 tails egg salad)vs.

Page 36: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Conditional Probabilities Are Probabilities

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads egg salad) 𝑝 heads NOT egg salad)vs.

𝑝 heads egg salad) 𝑝 tails egg salad)vs.

𝑝 heads egg salad) 𝑝 tails NOT egg salad)vs.

Page 37: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Bayes Rule

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads defective minting process) =𝑝(heads AND defective minting process)

𝑝(defective minting process)

Page 38: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Bayes Rule

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads AND defective minting process =𝑝 heads defective minting process) ∗ 𝑝(defective minting process)

Page 39: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Bayes Rule

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads AND defective minting process =𝑝 heads defective minting process) ∗ 𝑝(defective minting process)

Page 40: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Bayes Rule

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads AND defective minting process =𝑝(defective minting process | heads) ∗ 𝑝(heads)

Page 41: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Bayes Rule

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads AND defective minting process =𝑝(defective minting process | heads) ∗ 𝑝(heads)

𝑝 heads AND defective minting process =𝑝 heads defective minting process) ∗ 𝑝(defective minting process)

Page 42: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Bayes Rule

all (known) outcomes involving coin being flipped

cafeteria serves egg

salad

defectiveminting process

coin is ancient

coin coming

up heads

𝑝 heads defective minting process) =

𝑝(defective minting process | heads) ∗ 𝑝(heads)

𝑝(defective minting process)

Page 43: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Bayes Rule

𝑝 𝑋 𝑌) =𝑝 𝑌 𝑋) ∗ 𝑝(𝑋)

𝑝(𝑌)

Page 44: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Bayes Rule

𝑝 𝑋 𝑌) =𝑝 𝑌 𝑋) ∗ 𝑝(𝑋)

𝑝(𝑌)posterior probability

Page 45: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Bayes Rule

𝑝 𝑋 𝑌) =𝑝 𝑌 𝑋) ∗ 𝑝(𝑋)

𝑝(𝑌)posterior probability

likelihood

Page 46: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Bayes Rule

𝑝 𝑋 𝑌) =𝑝 𝑌 𝑋) ∗ 𝑝(𝑋)

𝑝(𝑌)posterior probability

likelihoodprior

probability

Page 47: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Bayes Rule

𝑝 𝑋 𝑌) =𝑝 𝑌 𝑋) ∗ 𝑝(𝑋)

𝑝(𝑌)posterior probability

likelihoodprior

probability

marginal likelihood (probability)

Page 48: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Changing the Left

0

1

p(A)

Page 49: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Changing the Left

0

1

p(A, B)

p(A)

Page 50: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Changing the Left

0

1

p(A, B, C)

p(A, B)

p(A)

Page 51: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Changing the Left

p(A, B, C, D)

0

1

p(A, B, C)

p(A, B)

p(A)

Page 52: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Changing the Left

p(A, B, C, D)

0

1

p(A, B, C)

p(A, B)

p(A)

p(A, B, C, D, E)

Page 53: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Changing the Right

0

1

p(A | B)

p(A)

Page 54: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Changing the Right

0

1

p(A | B)p(A)

Page 55: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Changing the Right

0

1 p(A | B)

p(A)

Page 56: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Changing the Right

Bias vs. Variance

Lower bias: More specific to what we care about

Higher variance: For fixed observations, estimates become less reliable

Page 57: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probability Chain Rule

𝑝 𝑥1, 𝑥2 = 𝑝 𝑥1 𝑝 𝑥2 𝑥1)Bayes rule

Page 58: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probability Chain Rule

𝑝 𝑥1, 𝑥2, … , 𝑥𝑆 =𝑝 𝑥1 𝑝 𝑥2 𝑥1)𝑝 𝑥3 𝑥1, 𝑥2)⋯𝑝 𝑥𝑆 𝑥1, … , 𝑥𝑖

Page 59: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probability Chain Rule

𝑝 𝑥1, 𝑥2, … , 𝑥𝑆 =𝑝 𝑥1 𝑝 𝑥2 𝑥1)𝑝 𝑥3 𝑥1, 𝑥2)⋯𝑝 𝑥𝑆 𝑥1, … , 𝑥𝑖 =

𝑖

𝑆

𝑝 𝑥𝑖 𝑥1, … , 𝑥𝑖−1)

Page 60: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probability Takeaways

Basic probability axioms and definitions

Probabilistic Independence

Definition of joint probability

Definition of conditional probability

Bayes rule

Probability chain rule

Page 61: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Outline

Probability review

Words

Defining Language Models

Breaking & Fixing Language Models

Evaluating Language Models

Page 62: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

What Are Words?

Linguists don’t agree

(Human) Language-dependent

White-space separation is a sometimes okay (for written English longform)

Social media? Spoken vs. written? Other languages?

Page 63: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

What Are Words?

bat

http://www.freepngimg.com/download/bat/9-2-bat-png-hd.png

Page 64: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

What Are Words?

bats

http://www.freepngimg.com/download/bat/9-2-bat-png-hd.png

Page 65: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

What Are Words?

Fledermaus

flutter mouse

http://www.freepngimg.com/download/bat/9-2-bat-png-hd.png

Page 66: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

What Are Words?

pişirdiler

They cooked it.

pişmişlermişlerdi

They had it cooked it.

Page 67: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

What Are Words?

my leg is hurting nasty ):

):

Page 68: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Examples of Text Normalization

Segmenting or tokenizing words

Normalizing word formats

Segmenting sentences in running text

Page 69: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

What Are Words? Tokens vs. TypesThe film got a great opening and the film went on to become a hit .

Tokens: an instance of

that type in running text.

• The• film• got• a• great• opening• and• the• film• went• on• to• become• a• hit• .

Types: an element of the

vocabulary.

• The• film• got• a• great• opening• and• the• went• on• to• become• hit• .

Page 70: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Some Issues with Tokenization

mph, MPH, M.D.

MD, M.D.

Baltimore’s mayor

I’m, won’t

state-of-the-art

San Francisco

Page 71: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

CaSE inSensitive?

Replace all letters with lower case version

Can be useful for information retrieval (IR), machine translation, language modeling

cat vs Cat (there are other ways to signify beginning)

Page 72: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

CaSE inSensitive?

Replace all letters with lower case version

Can be useful for information retrieval (IR), machine translation, language modeling

But… case can be useful

Sentiment analysis, machine translation, information extraction

cat vs Cat (there are other ways to signify beginning)

US vs us

Page 73: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

cat ≟ cats

Lemma: same stem, part of speech, rough word sense

cat and cats: same lemma

Word form: the fully inflected surface form

cat and cats: different word forms

Page 74: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Lemmatization

Reduce inflections or variant forms to base formam, are, is be

car, cars, car's, cars' car

the boy's cars are different colors

the boy car be different color

Page 75: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Morphosyntax

Morphemes: The small meaningful units that make up words

Stems: The core meaning-bearing units

Affixes: Bits and pieces that adhere to stems

Page 76: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Morphosyntax

Morphemes: The small meaningful units that make up words

Stems: The core meaning-bearing units

Affixes: Bits and pieces that adhere to stems

Inflectional:

(they) look (they) looked

(they) ran (they) run

Derivational:

(a) run running (of the Bulls)

code codeable

Page 77: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Morphosyntax

Morphemes: The small meaningful units that make up words

Stems: The core meaning-bearing units

Affixes: Bits and pieces that adhere to stems

Syntax: Contractions can rewrite and reorder a sentence

Baltimore’s [mayor’s {campaign} ]

[ {the campaign} of the mayor] of Baltimore

Inflectional:

(they) look (they) looked

(they) ran (they) run

Derivational:

(a) run running (of the Bulls)

code codeable

Page 78: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Words vs. Sentences

!, ? are relatively unambiguous

Period “.” is quite ambiguous

Sentence boundary

Abbreviations like Inc. or Dr.

Numbers like .02% or 4.3

Solution: write rules, build a classifier

Page 79: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Outline

Probability review

Words

Defining Language Models

Breaking & Fixing Language Models

Evaluating Language Models

Page 80: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

[…text..]pθ( )

Goal of Language Modeling

Learn a probabilistic model of text

Accomplished through observing text and updating model parameters to make text more likely

Page 81: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

[…text..]pθ( )

Goal of Language Modeling

Learn a probabilistic model of text

Accomplished through observing text and updating model parameters to make

text more likely

0 ≤ 𝑝𝜃 [… 𝑡𝑒𝑥𝑡 … ] ≤ 1

𝑡:𝑡 is valid text

𝑝𝜃 𝑡 = 1

Page 82: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

“The Unreasonable Effectiveness of Recurrent Neural Networks”

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Page 83: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

“The Unreasonable Effectiveness of Character-level Language Models”

(and why RNNs are still cool)http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139

“The Unreasonable Effectiveness of Recurrent Neural Networks”

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Page 84: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Simple Count-Based

𝑝 item

Page 85: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Simple Count-Based

𝑝 item ∝ 𝑐𝑜𝑢𝑛𝑡(item)

“proportional to”

Page 86: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Simple Count-Based

“proportional to”

𝑝 item ∝ 𝑐𝑜𝑢𝑛𝑡 item

=𝑐𝑜𝑢𝑛𝑡(item)

σany other item𝑦 𝑐𝑜𝑢𝑛𝑡(y)

Page 87: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Simple Count-Based

𝑝 item ∝ 𝑐𝑜𝑢𝑛𝑡 item

=𝑐𝑜𝑢𝑛𝑡(item)

σany other item𝑦 𝑐𝑜𝑢𝑛𝑡(y)

“proportional to”

constant

Page 88: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Simple Count-Based

𝑝 item ∝ 𝑐𝑜𝑢𝑛𝑡(item)

sequence of characters pseudo-words

sequence of words pseudo-phrases

Page 89: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Shakespearian Sequences of Characters

Page 90: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Shakespearian Sequences of Words

Page 91: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Novel Words, Novel Sentences

“Colorless green ideas sleep furiously” –Chomsky (1957)

Let’s observe and record all sentences with our big, bad supercomputer

Red ideas? Read ideas?

Page 92: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probability Chain Rule

𝑝 𝑥1, 𝑥2, … , 𝑥𝑆 =𝑝 𝑥1 𝑝 𝑥2 𝑥1)𝑝 𝑥3 𝑥1, 𝑥2)⋯𝑝 𝑥𝑆 𝑥1, … , 𝑥𝑖

Page 93: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Probability Chain Rule

𝑝 𝑥1, 𝑥2, … , 𝑥𝑆 =𝑝 𝑥1 𝑝 𝑥2 𝑥1)𝑝 𝑥3 𝑥1, 𝑥2)⋯𝑝 𝑥𝑆 𝑥1, … , 𝑥𝑖 =

𝑖

𝑆

𝑝 𝑥𝑖 𝑥1, … , 𝑥𝑖−1)

Page 94: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

Maintaining an entire inventory over sentences could be too much to ask

Store “smaller” pieces?

p(Colorless green ideas sleep furiously)

Page 95: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

Maintaining an entire joint inventory over sentences could be too much to ask

Store “smaller” pieces?

p(Colorless green ideas sleep furiously) =p(Colorless) *

Page 96: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

Maintaining an entire joint inventory over sentences could be too much to ask

Store “smaller” pieces?

p(Colorless green ideas sleep furiously) =p(Colorless) *

p(green | Colorless) *

Page 97: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

Maintaining an entire joint inventory over sentences could be too much to ask

Store “smaller” pieces?

p(Colorless green ideas sleep furiously) =p(Colorless) *

p(green | Colorless) *p(ideas | Colorless green) *

p(sleep | Colorless green ideas) *p(furiously | Colorless green ideas sleep)

Page 98: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

Maintaining an entire joint inventory over sentences could be too much to ask

Store “smaller” pieces?

p(Colorless green ideas sleep furiously) =p(Colorless) *

p(green | Colorless) *p(ideas | Colorless green) *

p(sleep | Colorless green ideas) *p(furiously | Colorless green ideas sleep)

apply the chain rule

Page 99: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

Maintaining an entire joint inventory over sentences could be too much to ask

Store “smaller” pieces?

p(Colorless green ideas sleep furiously) =p(Colorless) *

p(green | Colorless) *p(ideas | Colorless green) *

p(sleep | Colorless green ideas) *p(furiously | Colorless green ideas sleep)

apply the chain rule

Page 100: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

p(furiously | Colorless green ideas sleep)

How much does “Colorless” influence the choice of “furiously?”

Page 101: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

p(furiously | Colorless green ideas sleep)

How much does “Colorless” influence the choice of “furiously?”

Remove history and contextual info

Page 102: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

p(furiously | Colorless green ideas sleep)

How much does “Colorless” influence the choice of “furiously?”

Remove history and contextual info

p(furiously | Colorless green ideas sleep) ≈

p(furiously | Colorless green ideas sleep)

Page 103: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

p(furiously | Colorless green ideas sleep)

How much does “Colorless” influence the choice of “furiously?”

Remove history and contextual info

p(furiously | Colorless green ideas sleep) ≈

p(furiously | ideas sleep)

Page 104: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

p(Colorless green ideas sleep furiously) =p(Colorless) *

p(green | Colorless) *p(ideas | Colorless green) *

p(sleep | Colorless green ideas) *p(furiously | Colorless green ideas sleep)

Page 105: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Grams

p(Colorless green ideas sleep furiously) =p(Colorless) *

p(green | Colorless) *p(ideas | Colorless green) *

p(sleep | Colorless green ideas) *p(furiously | Colorless green ideas sleep)

Page 106: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Trigrams

p(Colorless green ideas sleep furiously) =p(Colorless) *

p(green | Colorless) *p(ideas | Colorless green) *

p(sleep | green ideas) *p(furiously | ideas sleep)

Page 107: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Trigrams

p(Colorless green ideas sleep furiously) =p(Colorless) *

p(green | Colorless) *p(ideas | Colorless green) *

p(sleep | green ideas) *p(furiously | ideas sleep)

Page 108: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Trigrams

p(Colorless green ideas sleep furiously) =p(Colorless | <BOS> <BOS>) *p(green | <BOS> Colorless) *p(ideas | Colorless green) *

p(sleep | green ideas) *p(furiously | ideas sleep)

Consistent notation: Pad the left with <BOS> (beginning of sentence) symbols

Page 109: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Trigrams

p(Colorless green ideas sleep furiously) =p(Colorless | <BOS> <BOS>) *p(green | <BOS> Colorless) *p(ideas | Colorless green) *

p(sleep | green ideas) *p(furiously | ideas sleep) *p(<EOS> | sleep furiously)

Consistent notation: Pad the left with <BOS> (beginning of sentence) symbolsFully proper distribution: Pad the right with a single <EOS> symbol

Page 110: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Gram Terminology

nCommonly

calledHistory Size

(Markov order)Example

1 unigram 0 p(furiously)

Page 111: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Gram Terminology

nCommonly

calledHistory Size

(Markov order)Example

1 unigram 0 p(furiously)

2 bigram 1 p(furiously | sleep)

Page 112: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Gram Terminology

nCommonly

calledHistory Size

(Markov order)Example

1 unigram 0 p(furiously)

2 bigram 1 p(furiously | sleep)

3trigram

(3-gram)2 p(furiously | ideas sleep)

Page 113: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Gram Terminology

nCommonly

calledHistory Size

(Markov order)Example

1 unigram 0 p(furiously)

2 bigram 1 p(furiously | sleep)

3trigram

(3-gram)2 p(furiously | ideas sleep)

4 4-gram 3 p(furiously | green ideas sleep)

n n-gram n-1 p(wi | wi-n+1 … wi-1)

Page 114: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

N-Gram Probability

𝑝 𝑤1, 𝑤2, 𝑤3, ⋯ ,𝑤𝑆 =

𝑖=1

𝑆

𝑝 𝑤𝑖 𝑤𝑖−𝑁+1, ⋯ ,𝑤𝑖−1)

Page 115: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Count-Based N-Grams (Unigrams)

𝑝 item ∝ 𝑐𝑜𝑢𝑛𝑡(item)

Page 116: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Count-Based N-Grams (Unigrams)

𝑝 z ∝ 𝑐𝑜𝑢𝑛𝑡(z)

Page 117: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Count-Based N-Grams (Unigrams)

𝑝 z ∝ 𝑐𝑜𝑢𝑛𝑡 z

=𝑐𝑜𝑢𝑛𝑡 z

σ𝑣 𝑐𝑜𝑢𝑛𝑡(v)

Page 118: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Count-Based N-Grams (Unigrams)

𝑝 z ∝ 𝑐𝑜𝑢𝑛𝑡 z

=𝑐𝑜𝑢𝑛𝑡 z

σ𝑣 𝑐𝑜𝑢𝑛𝑡(v)

word type word type

word type

Page 119: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Count-Based N-Grams (Unigrams)

𝑝 z ∝ 𝑐𝑜𝑢𝑛𝑡 z

=𝑐𝑜𝑢𝑛𝑡 z

𝑊

word type word type

number of tokens observed

Page 120: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Count-Based N-Grams (Unigrams)The film got a great opening and the film went on to become a hit .

Word (Type) Raw Count Normalization Probability

The 1

film 2

got 1

a 2

great 1

opening 1

and 1

the 1

went 1

on 1

to 1

become 1

hit 1

. 1

Page 121: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Count-Based N-Grams (Unigrams)The film got a great opening and the film went on to become a hit .

Word (Type) Raw Count Normalization Probability

The 1

16

film 2

got 1

a 2

great 1

opening 1

and 1

the 1

went 1

on 1

to 1

become 1

hit 1

. 1

Page 122: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Count-Based N-Grams (Unigrams)The film got a great opening and the film went on to become a hit .

Word (Type) Raw Count Normalization Probability

The 1

16

1/16

film 2 1/8

got 1 1/16

a 2 1/8

great 1 1/16

opening 1 1/16

and 1 1/16

the 1 1/16

went 1 1/16

on 1 1/16

to 1 1/16

become 1 1/16

hit 1 1/16

. 1 1/16

Page 123: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Count-Based N-Grams (Trigrams)

𝑝 z|x, y ∝ 𝑐𝑜𝑢𝑛𝑡(x, y, z)

order matters in conditioning

order matters in count

Page 124: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Count-Based N-Grams (Trigrams)

𝑝 z|x, y ∝ 𝑐𝑜𝑢𝑛𝑡(x, y, z)

order matters in conditioning

order matters in count

count(x, y, z) ≠ count(x, z, y) ≠ count(y, x, z) ≠ …

Page 125: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Count-Based N-Grams (Trigrams)

𝑝 z|x, y ∝ 𝑐𝑜𝑢𝑛𝑡 x, y, z

=𝑐𝑜𝑢𝑛𝑡 x, y, z

σ𝑣 𝑐𝑜𝑢𝑛𝑡(x, y, v)

Page 126: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Context Word (Type) Raw Count Normalization Probability

The film The 0

1

0/1

The film film 0 0/1

The film got 1 1/1

The film went 0 0/1

a great great 0

1

0/1

a great opening 1 1/1

a great and 0 0/1

a great the 0 0/1

Count-Based N-Grams (Trigrams)

The film got a great opening and the film went on to become a hit .

Page 127: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Context Word (Type) Raw Count Normalization Probability

the film the 0

2

0/1

the film film 0 0/1

the film got 1 1/2

the film went 1 1/2

a great great 0

1

0/1

a great opening 1 1/1

a great and 0 0/1

a great the 0 0/1

Count-Based N-Grams (Lowercased Trigrams)

the film got a great opening and the film went on to become a hit .

Page 128: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Outline

Probability review

Words

Defining Language Models

Breaking & Fixing Language Models

Evaluating Language Models

Page 129: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Maximum Likelihood Estimates

Maximizes the likelihood of the training set

Do different corpora look the same?

Low(er) bias, high(er) variance

For large data: can actually do reasonably well

𝑝 item ∝ 𝑐𝑜𝑢𝑛𝑡(item)

Page 130: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

n = 1

, , land of in , a teachers The , wilds the and gave a Etienne any

two beginning without probably heavily that other useless the the

a different . the able mines , unload into in foreign the thebe either other Britain finally avoiding , for of have the cure , the Gutenberg-tm ; of being can as country in authority deviates as d seldom and They employed about from business marshal materials than in , they

Page 131: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

n = 2

These varied with it to the civil wars , therefore , it did not for the company had the East India , the mechanical , the sum which were by barter , vol. i , and , conveniencies of all made to purchase a council of landlords , constitute a sum as an argument , having thus forced abroad , however , and influence in the one , or banker , will there was encouraged and more common trade to corrupt , profit , it ; but a master does not , twelfth year the consent that of volunteers and […]

, the other hand , it certainly it very earnestly entreat both nations .

In opulent nations in a revenue of four parts of production .

Page 132: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

n = 3

His employer , if silver was regulated according to the temporary and occasional event .

What goods could bear the expense of defending themselves , than in the value of different sorts of goods , and placed at a much greater , there have been the effects of self-deception , this attention , but a very important ones , and which , having become of less than they ever were in this agreement for keeping up the business of weighing .

After food , clothes , and a few months longer credit than is wanted , there must be sufficient to keep by him , are of such colonies to surmount .

They facilitated the acquisition of the empire , both from the rents of land and labour of those pedantic pieces of silver which he can afford to take from the duty upon every quarter which they have a more equable distribution of employment .

Page 133: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

n = 4

To buy in one market , in order to have it ; but the 8th of George III .

The tendency of some of the great lords , gradually encouraged their villains to make upon the prices of corn , cattle , poultry , etc .

Though it may , perhaps , in the mean time , that part of the governments of New England , the market , trade cannot always be transported to so great a number of seamen , not inferior to those of other European nations from any direct trade to America .

The farmer makes his profit by parting with it .

But the government of that country below what it is in itself necessarily slow , uncertain , liable to be interrupted by the weather .

Page 134: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Maximum Likelihood Estimates

Maximizes the likelihood of the training set

Do different corpora look the same?

For large data: can actually do reasonably well

𝑝 item ∝ 𝑐𝑜𝑢𝑛𝑡(item)

Page 135: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

0s Are Not Your (Language Model’s) Friend

𝑝 item ∝ 𝑐𝑜𝑢𝑛𝑡 item = 0 →𝑝 item = 0

Page 136: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

0s Are Not Your (Language Model’s) Friend

0 probability item is impossible0s annihilate: x*y*z*0 = 0

Language is creative:new words keep appearingexisting words could appear in known contexts

How much do you trust your data?

𝑝 item ∝ 𝑐𝑜𝑢𝑛𝑡 item = 0 →𝑝 item = 0

Page 137: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Add-λ estimation

Laplace smoothing, Lidstone smoothing

Pretend we saw each word λ more times

than we did

Add λ to all the counts

Page 138: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Add-λ estimation

Laplace smoothing, Lidstone smoothing

Pretend we saw each word λ more times

than we did

Add λ to all the counts

𝑝 z ∝ 𝑐𝑜𝑢𝑛𝑡 z + 𝜆

Page 139: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Add-λ estimation

Laplace smoothing, Lidstone smoothing

Pretend we saw each word λ more times

than we did

Add λ to all the counts

𝑝 z ∝ 𝑐𝑜𝑢𝑛𝑡 z + 𝜆

=𝑐𝑜𝑢𝑛𝑡 z + 𝜆

σ𝑣(𝑐𝑜𝑢𝑛𝑡 v + 𝜆)

Page 140: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Add-λ estimation

Laplace smoothing, Lidstone smoothing

Pretend we saw each word λ more times

than we did

Add λ to all the counts

𝑝 z ∝ 𝑐𝑜𝑢𝑛𝑡 z + 𝜆

=𝑐𝑜𝑢𝑛𝑡 z + 𝜆

𝑊 + 𝑉𝜆

Page 141: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Add-λ N-Grams (Unigrams)The film got a great opening and the film went on to become a hit .

Word (Type) Raw Count Norm Prob. Add-λ Count Add-λ Norm. Add-λ Prob.

The 1

16

1/16

film 2 1/8

got 1 1/16

a 2 1/8

great 1 1/16

opening 1 1/16

and 1 1/16

the 1 1/16

went 1 1/16

on 1 1/16

to 1 1/16

become 1 1/16

hit 1 1/16

. 1 1/16

Page 142: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Add-1 N-Grams (Unigrams)The film got a great opening and the film went on to become a hit .

Word (Type) Raw Count Norm Prob. Add-1 Count Add-1 Norm. Add-1 Prob.

The 1

16

1/16 2

film 2 1/8 3

got 1 1/16 2

a 2 1/8 3

great 1 1/16 2

opening 1 1/16 2

and 1 1/16 2

the 1 1/16 2

went 1 1/16 2

on 1 1/16 2

to 1 1/16 2

become 1 1/16 2

hit 1 1/16 2

. 1 1/16 2

Page 143: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Add-1 N-Grams (Unigrams)The film got a great opening and the film went on to become a hit .

Word (Type) Raw Count Norm Prob. Add-1 Count Add-1 Norm. Add-1 Prob.

The 1

16

1/16 2

16 + 14*1 =30

film 2 1/8 3

got 1 1/16 2

a 2 1/8 3

great 1 1/16 2

opening 1 1/16 2

and 1 1/16 2

the 1 1/16 2

went 1 1/16 2

on 1 1/16 2

to 1 1/16 2

become 1 1/16 2

hit 1 1/16 2

. 1 1/16 2

Page 144: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Add-1 N-Grams (Unigrams)The film got a great opening and the film went on to become a hit .

Word (Type) Raw Count Norm Prob. Add-1 Count Add-1 Norm. Add-1 Prob.

The 1

16

1/16 2

16 + 14*1 =30

=1/15

film 2 1/8 3 =1/10

got 1 1/16 2 =1/15

a 2 1/8 3 =1/10

great 1 1/16 2 =1/15

opening 1 1/16 2 =1/15

and 1 1/16 2 =1/15

the 1 1/16 2 =1/15

went 1 1/16 2 =1/15

on 1 1/16 2 =1/15

to 1 1/16 2 =1/15

become 1 1/16 2 =1/15

hit 1 1/16 2 =1/15

. 1 1/16 2 =1/15

Page 145: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Backoff and Interpolation

Sometimes it helps to use less context

condition on less context for contexts you haven’t learned much

Page 146: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Backoff and Interpolation

Sometimes it helps to use less context

condition on less context for contexts you haven’t learned much about

Backoff:

use trigram if you have good evidence

otherwise bigram, otherwise unigram

Page 147: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Backoff and Interpolation

Sometimes it helps to use less contextcondition on less context for contexts you haven’t learned much about

Backoff: use trigram if you have good evidence

otherwise bigram, otherwise unigram

Interpolation: mix (average) unigram, bigram, trigram

Page 148: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Linear Interpolation

Simple interpolation

𝑝 𝑦 𝑥) = 𝜆𝑝2 𝑦 𝑥) + 1 − 𝜆 𝑝1 𝑦

0 ≤ 𝜆 ≤ 1

Page 149: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Linear Interpolation

Simple interpolation

Condition on context

𝑝 𝑦 𝑥) = 𝜆𝑝2 𝑦 𝑥) + 1 − 𝜆 𝑝1 𝑦

0 ≤ 𝜆 ≤ 1

𝑝 𝑧 𝑥, 𝑦) =𝜆3 𝑥, 𝑦 𝑝3 𝑧 𝑥, 𝑦) + 𝜆2(𝑦)𝑝2 𝑧 | 𝑦 + 𝜆1𝑝1(𝑧)

Page 150: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Backoff

Trust your statistics, up to a point

𝑝 𝑧 𝑥, 𝑦) ∝ ቊ𝑝3 𝑧 𝑥, 𝑦) if count 𝑥, 𝑦, 𝑧 > 0

𝑝2 z y otherwise

Page 151: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Discounted Backoff

Trust your statistics, up to a point

𝑝 𝑧 𝑥, 𝑦) ∝ ቊ𝑝3 𝑧 𝑥, 𝑦) − 𝑑 if count 𝑥, 𝑦, 𝑧 > 0

𝛽 𝑥, 𝑦 𝑝2 z y otherwise

Page 152: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Discounted Backoff

Trust your statistics, up to a point

𝑝 𝑧 𝑥, 𝑦) = ቊ𝑝3 𝑧 𝑥, 𝑦) − 𝑑 if count 𝑥, 𝑦, 𝑧 > 0

𝛽 𝑥, 𝑦 𝑝2 z y otherwise

discount constant

context-dependent normalization constant

Page 153: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Setting Hyperparameters

Use a development corpus

Choose λs to maximize the probability of dev data:– Fix the N-gram probabilities (on the training data)

– Then search for λs that give largest probability to held-out set:

Training DataDevData

Test Data

Page 154: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Implementation: Unknown words

Create an unknown word token <UNK>

Training:

1. Create a fixed lexicon L of size V

2. Change any word not in L to <UNK>

3. Train LM as normal

Evaluation:Use UNK probabilities for any word not in training

Page 155: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Other Kinds of Smoothing

Interpolated (modified) Kneser-Ney

Idea: How “productive” is a context?

How many different word types v appear in a context x, y

Good-Turing

Partition words into classes of occurrence

Smooth class statistics

Properties of classes are likely to predict properties of other classes

Witten-Bell

Idea: Every observed type was at some point novel

Give MLE prediction for novel type occurring

Page 156: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Outline

Probability review

Words

Defining Language Models

Breaking & Fixing Language Models

Evaluating Language Models

Page 157: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Evaluating Language Models

What is “correct?”

What is working “well?”

Page 158: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Evaluating Language Models

What is “correct?”

What is working “well?”

Training DataDevData

Test Data

Page 159: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Evaluating Language Models

What is “correct?”

What is working “well?”

Training DataDevData

Test Data

acquire primary statistics for learning model parameters

fine-tune any secondary (hyper)parameters

perform final evaluation

Page 160: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Evaluating Language Models

What is “correct?”

What is working “well?”

Training DataDevData

Test Data

acquire primary statistics for learning model parameters

fine-tune any secondary (hyper)parameters

perform final evaluation

DO NOT ITERATE ON THE TEST DATA

Page 161: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Evaluating Language Models

What is “correct?”

What is working “well?”

Extrinsic: Evaluate LM in downstream task

Test an MT, ASR, etc. system and see which LM does better

Propagate & conflate errors

Page 162: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Evaluating Language Models

What is “correct?”

What is working “well?”

Extrinsic: Evaluate LM in downstream task

Test an MT, ASR, etc. system and see which LM does better

Propagate & conflate errors

Intrinsic: Treat LM as its own downstream task

Use perplexity (from information theory)

Page 163: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Perplexity

Lower is better : lower perplexity less surprised

More outcomes More surprised

Fewer outcomes Less surprised

Page 164: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Perplexity

Lower is better : lower perplexity less surprised

perplexity = exp(−1

𝑀σ𝑖=1𝑀 log 𝑝 𝑤𝑖 ℎ𝑖))

n-gram history(n-1 items)

Page 165: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Perplexity

Lower is better : lower perplexity less surprised

perplexity = exp(−1

𝑀σ𝑖=1𝑀 log 𝑝 𝑤𝑖 ℎ𝑖))

≥ 0, ≤ 1: higher

Page 166: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Perplexity

Lower is better : lower perplexity less surprised

perplexity = exp(−1

𝑀σ𝑖=1𝑀 log 𝑝 𝑤𝑖 ℎ𝑖))

≥ 0, ≤ 1: higher

≤ 0: higher

Page 167: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Perplexity

Lower is better : lower perplexity less surprised

perplexity = exp(−1

𝑀σ𝑖=1𝑀 log 𝑝 𝑤𝑖 ℎ𝑖))

≥ 0, ≤ 1: higher

≤ 0: higher

≤ 0, higher

Page 168: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Perplexity

Lower is better : lower perplexity less surprised

perplexity = exp(−1

𝑀σ𝑖=1𝑀 log 𝑝 𝑤𝑖 ℎ𝑖))

≥ 0, ≤ 1: higher

≤ 0: higher

≤ 0, higher

≥ 0, lower is better

Page 169: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Perplexity

Lower is better : lower perplexity less surprised

perplexity = exp(−1

𝑀σ𝑖=1𝑀 log 𝑝 𝑤𝑖 ℎ𝑖))

≥ 0, ≤ 1: higher

≤ 0: higher

≤ 0, higher

≥ 0, lower is better

≥ 0, lower

Page 170: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Perplexity

Lower is better : lower perplexity less surprised

perplexity = exp(−1

𝑀σ𝑖=1𝑀 log 𝑝 𝑤𝑖 ℎ𝑖))

≥ 0, ≤ 1: higher

≤ 0: higher

≤ 0, higher

≥ 0, lower is better

≥ 0, lower

base must be the same

Page 171: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Perplexity

Lower is better : lower perplexity less surprised

perplexity = exp(−1

𝑀σ𝑖=1𝑀 log 𝑝 𝑤𝑖 ℎ𝑖))

=𝑀ς𝑖=1

1

𝑝 𝑤𝑖 ℎ𝑖)

weighted geometric

average

Page 172: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Perplexity

Lower is better : lower perplexity less surprised

perplexity =𝑀ς𝑖=1

1

𝑝 𝑤𝑖 ℎ𝑖)

471/671: Branching factor

Page 173: Probability & Language Modeling · The film got a great opening and the film went on to become a hit . Tokens: an instance of that type in running text. • The • film • got •

Outline

Probability review

Words

Defining Language Models

Breaking & Fixing Language Models

Evaluating Language Models