lm tutorial v8


Page 1: Lm Tutorial v8

SA1-1

The State of The Art in Language Modeling

Joshua Goodman
Microsoft Research, Machine Learning Group
http://www.research.microsoft.com/~joshuago

Eugene Charniak
Brown University, Department of Computer Science
http://www.cs.brown.edu/people/ec

Page 2: Lm Tutorial v8

SA1-2

A bad language model

Page 3: Lm Tutorial v8

SA1-3

A bad language model

Page 4: Lm Tutorial v8

SA1-4

A bad language model

[Cartoon. Caption: Herman is reprinted with permission from LaughingStock Licensing Inc., Ottawa, Canada. All rights reserved.]

Page 5: Lm Tutorial v8

SA1-5

A bad language model

Page 6: Lm Tutorial v8

SA1-6

What's a Language Model

A Language model is a probability distribution over word sequences

P("And nothing but the truth") ≈ 0.001
P("And nuts sing on the roof") ≈ 0

Page 7: Lm Tutorial v8

SA1-7

What's a language model for?

Speech recognition
Handwriting recognition
Spelling correction
Optical character recognition
Machine translation

(and anyone doing statistical modeling)

Page 8: Lm Tutorial v8

SA1-8

Really Quick Overview

Humor
What is a language model?
Really quick overview
Two minute probability overview
How language models work (trigrams)
Real overview
Smoothing, caching, skipping, sentence-mixture models, clustering, parsing language models, applications, tools

Page 9: Lm Tutorial v8

SA1-9

Everything you need to know about probability – definition

P(X) means probability that X is true
• P(baby is a boy) ≈ 0.5 (% of total that are boys)
• P(baby is named John) ≈ 0.001 (% of total named John)

[Venn diagram: Babies ⊃ baby boys ⊃ John]

Page 10: Lm Tutorial v8

SA1-10

Everything about probability: Joint probabilities

P(X, Y) means probability that X and Y are both true, e.g. P(brown eyes, boy)

[Venn diagram: Babies, baby boys, John, brown eyes]

Page 11: Lm Tutorial v8

SA1-11

Everything about probability: Conditional probabilities

P(X|Y) means probability that X is true when we already know Y is true
• P(baby is named John | baby is a boy) ≈ 0.002
• P(baby is a boy | baby is named John) ≈ 1

[Venn diagram: Babies ⊃ baby boys ⊃ John]

Page 12: Lm Tutorial v8

SA1-12

Everything about probabilities: math

P(X|Y) = P(X, Y) / P(Y)
• P(baby is named John | baby is a boy) = P(baby is named John, baby is a boy) / P(baby is a boy) = 0.001 / 0.5 = 0.002

[Venn diagram: Babies ⊃ baby boys ⊃ John]

Page 13: Lm Tutorial v8

SA1-13

Everything about probabilities: Bayes Rule

Bayes rule: P(X|Y) = P(Y|X) × P(X) / P(Y)

P(named John | boy) = P(boy | named John) × P(named John) / P(boy)

[Venn diagram: Babies ⊃ baby boys ⊃ John]

Page 14: Lm Tutorial v8

SA1-14

THE Equation

argmax over word sequences w of P(w | acoustics) = argmax over w of P(acoustics | w) × P(w)

Page 15: Lm Tutorial v8

SA1-15

How Language Models work

Hard to compute P("And nothing but the truth")

Step 1: Decompose probability
P("And nothing but the truth") = P("And") × P("nothing|and") × P("but|and nothing") × P("the|and nothing but") × P("truth|and nothing but the")

Page 16: Lm Tutorial v8

SA1-16

The Trigram Approximation

Assume each word depends only on the previous two words (three words total – tri means three, gram means writing)

P("the|… whole truth and nothing but") ≈ P("the|nothing but")
P("truth|… whole truth and nothing but the") ≈ P("truth|but the")

Page 17: Lm Tutorial v8

SA1-17

Trigrams, continued

How do we find probabilities? Get real text, and start counting!
• P("the | nothing but") ≈ C("nothing but the") / C("nothing but")
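
To make the counting concrete, here is a minimal Python sketch (my illustration, not part of the tutorial; the tiny corpus is made up) that estimates P(z | x y) as C(x y z) / C(x y):

from collections import Counter

def trigram_mle(corpus_tokens):
    """Count trigrams and bigrams, then estimate P(z | x y) = C(x y z) / C(x y)."""
    trigrams = Counter(zip(corpus_tokens, corpus_tokens[1:], corpus_tokens[2:]))
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    def p(z, x, y):
        if bigrams[(x, y)] == 0:
            return 0.0  # unseen context: no estimate at all (this is what smoothing fixes)
        return trigrams[(x, y, z)] / bigrams[(x, y)]
    return p

tokens = "and nothing but the truth and nothing but the whole truth".split()
p = trigram_mle(tokens)
print(p("the", "nothing", "but"))   # C("nothing but the") / C("nothing but")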

Page 18: Lm Tutorial v8

SA1-18

Language Modeling is "AI Complete"

What is
p(205 | "213 minus 6 is")
p(207 | "213 minus 6 is")
p(plates | "Fred ate several red")
p(apples | "Fred ate several red")

Page 19: Lm Tutorial v8

SA1-19

Real Overview Overview

Basics: probability, language model definition
Real Overview (8 slides)
Evaluation
Smoothing
Caching, Skipping
Clustering
Sentence-mixture models
Parsing language models
Applications
Tools

Page 20: Lm Tutorial v8

SA1-20

Real Overview: Evaluation

Need to compare different language models
Speech recognition word error rate
Perplexity
Entropy
Coding theory

Page 21: Lm Tutorial v8

SA1-21

Real Overview: Smoothing

Got trigram for P("the" | "nothing but") from C("nothing but the") / C("nothing but")

What about P("sing" | "and nuts") = C("and nuts sing") / C("and nuts")?
Probability would be 0: very bad!

Page 22: Lm Tutorial v8

SA1-22

Real Overview: Caching

If you say something, you are likely to say it again later

Page 23: Lm Tutorial v8

SA1-23

Real Overview: Skipping

Trigram uses last two words
Other words are useful too – 3-back, 4-back
Words are useful in various combinations (e.g. 1-back (bigram) combined with 3-back)

Page 24: Lm Tutorial v8

SA1-24

Real Overview: Clustering

What is the probability P("Tuesday | party on")?
Similar to P("Monday | party on")
Similar to P("Tuesday | celebration on")
Put words in clusters:
• WEEKDAY = Sunday, Monday, Tuesday, …
• EVENT = party, celebration, birthday, …

Page 25: Lm Tutorial v8

SA1-25

Real Overview: Sentence Mixture Models

In Wall Street Journal, many sentences: "In heavy trading, Sun Microsystems fell 25 points yesterday"
In Wall Street Journal, many sentences: "Nathan Mhyrvold, vice president of Microsoft, took a one year leave of absence."
Model each sentence type separately.

Page 26: Lm Tutorial v8

SA1-26

Real Overview: Parsing Language Models

Language has structure – noun phrases, verb phrases, etc.
"The butcher from Albuquerque slaughtered chickens" – even though slaughtered is far from butcher, it is predicted by butcher, not by Albuquerque
Recent, somewhat promising models

Page 27: Lm Tutorial v8

SA1-27

Real Overview: Applications

In Machine Translation we break up the problem in two: proposing possible phrases, and then fitting them together. The second is the language model.
The same is true for optical character recognition (just substitute "letter" for "phrase").
A lot of other problems also fit this mold.

Page 28: Lm Tutorial v8

SA1-28

Real Overview: Tools

You can make your own language models with tools freely available for research
CMU language modeling toolkit
SRI language modeling toolkit

Page 29: Lm Tutorial v8

SA1-29

Evaluation

How can you tell a good language model from a bad one?
Run a speech recognizer (or your application of choice), calculate word error rate
• Slow
• Specific to your recognizer

Page 30: Lm Tutorial v8

SA1-30

Evaluation: Perplexity Intuition

Ask a speech recognizer to recognize digits: "0, 1, 2, 3, 4, 5, 6, 7, 8, 9" – easy – perplexity 10
Ask a speech recognizer to recognize names at Microsoft – hard – 30,000 names – perplexity 30,000
Ask a speech recognizer to recognize "Operator" (1 in 4), "Technical support" (1 in 4), "sales" (1 in 4), 30,000 names (1 in 120,000 each) – perplexity 54
Perplexity is weighted equivalent branching factor.

Page 31: Lm Tutorial v8

SA1-31

Evaluation: perplexity

"A, B, C, D, E, F, G…Z": perplexity is 26
"Alpha, bravo, charlie, delta…yankee, zulu": perplexity is 26
Perplexity measures language model difficulty, not acoustic difficulty.

Page 32: Lm Tutorial v8

SA1-32

Perplexity: Math

Perplexity is geometric average inverse probability

Imagine model: "Operator" (1 in 4), "Technical support" (1 in 4), "sales" (1 in 4), 30,000 names (1 in 120,000 each)
Imagine data: All 30,003 equally likely

Example: Perplexity of test data, given model, is 119,829

Remarkable fact: the true model for data has the lowest possible perplexity
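
As a sanity check on that number, here is a small Python sketch (mine, not from the slides) that computes the geometric-average inverse probability of this model on test data where all 30,003 outcomes are equally likely; it also shows the entropy/perplexity relation used a few slides later:

import math

# Model: 3 keywords at 1/4 each, 30,000 names at 1/120,000 each.
probs = [0.25] * 3 + [1.0 / 120000] * 30000

# Test data: all 30,003 outcomes equally likely (each counted once).
cross_entropy = -sum(math.log2(p) for p in probs) / len(probs)   # average bits per word
perplexity = 2 ** cross_entropy                                  # geometric average inverse probability
print(cross_entropy, perplexity)   # about 16.9 bits, i.e. a perplexity near the slide's 119,829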

Page 33: Lm Tutorial v8

SA1-33

Perplexity: Math

Imagine model: "Operator" (1 in 4), "Technical support" (1 in 4), "sales" (1 in 4), 30,000 names (1 in 120,000)
Imagine data: All 30,003 equally likely
Can compute three different perplexities
• Model (ignoring test data): perplexity 54
• Test data (ignoring model): perplexity 30,003
• Model on test data: perplexity 119,829
When we say perplexity, we mean "model on test"

Remarkable fact: the true model for data has the lowest possible perplexity

Page 34: Lm Tutorial v8

SA1-34

Perplexity: Is lower better?

Remarkable fact: the true model for data has the lowest possible perplexity
Lower the perplexity, the closer we are to true model.
Typically, perplexity correlates well with speech recognition word error rate
• Correlates better when both models are trained on same data
• Doesn't correlate well when training data changes

Page 35: Lm Tutorial v8

SA1-35

Perplexity: The Shannon Game

Ask people to guess the next letter, given context. Compute perplexity.
• (when we get to entropy, the "100" column corresponds to the "1 bit per character" estimate)

Char n-gram   Low char   Upper char   Low word    Upper word
1             9.1        16.3         191,237     4,702,511
5             3.2        6.5          653         29,532
10            2.0        4.3          45          2,998
15            2.3        4.3          97          2,998
100           1.5        2.5          10          142

Page 36: Lm Tutorial v8

SA1-36

Evaluation: entropy

Entropy = log2(perplexity)

Should be called "cross-entropy of model on test data."
Remarkable fact: entropy is the average number of bits per word required to encode test data using this probability model and an optimal coder. The units are called bits.

Page 37: Lm Tutorial v8

SA1-37

Smoothing: None

P(z|xy) = C(xyz) / C(xy)

Called Maximum Likelihood estimate.
Lowest perplexity trigram on training data.
Terrible on test data: If no occurrences of C(xyz), probability is 0.

Page 38: Lm Tutorial v8

SA1-38

Smoothing: Add One

What is P(sing|nuts)? Zero? Leads to infinite perplexity!
Add one smoothing: P(z|xy) = (C(xyz) + 1) / (C(xy) + V), where V is the vocabulary size
Works very badly. DO NOT DO THIS
Add delta smoothing: P(z|xy) = (C(xyz) + δ) / (C(xy) + δV)
Still very bad. DO NOT DO THIS

Page 39: Lm Tutorial v8

SA1-39

Smoothing: Simple Interpolation

Trigram is very context specific, very noisy
Unigram is context-independent, smooth
Interpolate Trigram, Bigram, Unigram for best combination:
P_interpolate(z|xy) = λ P(z|xy) + μ P(z|y) + (1 − λ − μ) P(z)
Find 0 < λ, μ < 1 by optimizing on "held-out" data
Almost good enough
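
A minimal sketch of this interpolation in Python (my own illustration; the weights shown are placeholders that would be tuned on held-out data, as the next slides describe):

from collections import Counter

class InterpolatedTrigram:
    """Simple interpolation: P(z|xy) = l3*Pml(z|xy) + l2*Pml(z|y) + (1-l3-l2)*Pml(z)."""
    def __init__(self, tokens, l3=0.6, l2=0.3):   # placeholder weights; tune on held-out data
        self.l3, self.l2 = l3, l2
        self.uni = Counter(tokens)
        self.bi = Counter(zip(tokens, tokens[1:]))
        self.tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
        self.n = len(tokens)

    def prob(self, z, x, y):
        p3 = self.tri[(x, y, z)] / self.bi[(x, y)] if self.bi[(x, y)] else 0.0
        p2 = self.bi[(y, z)] / self.uni[y] if self.uni[y] else 0.0
        p1 = self.uni[z] / self.n
        return self.l3 * p3 + self.l2 * p2 + (1 - self.l3 - self.l2) * p1

lm = InterpolatedTrigram("and nothing but the truth so help me".split())
print(lm.prob("the", "nothing", "but"))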

Page 40: Lm Tutorial v8

SA1-40

Smoothing: Finding parameter values

Split data into training, "heldout", test
Try lots of different values for λ on heldout data, pick best
Test on test data
Sometimes, can use tricks like "EM" (expectation maximization) to find values
I prefer to use a generalized search algorithm, "Powell search" – see Numerical Recipes in C

Page 41: Lm Tutorial v8

SA1-41

Smoothing digression: Splitting data

How much data for training, heldout, test?
Some people say things like "1/3, 1/3, 1/3" or "80%, 10%, 10%". They are WRONG.
Heldout should have (at least) 100-1000 words per parameter.
Answer: enough test data to be statistically significant. (1000s of words perhaps)

Page 42: Lm Tutorial v8

SA1-42

Smoothing digression: Splitting data

Be careful: WSJ data divided into stories. Some are easy, with lots of numbers, financial, others much harder. Use enough to cover many stories.
Be careful: Some stories repeated in data sets.
Can take data from end – better – or randomly from within training. Temporal effects like "Elian Gonzalez"

Page 43: Lm Tutorial v8

SA1-43

Smoothing: Jelinek-Mercer

Simple interpolation:
P_smooth(z|xy) = λ P(z|xy) + (1 − λ) P_smooth(z|y)
Better: let λ depend on the context – smooth a little after "The Dow", lots after "Adobe acquired"

Page 44: Lm Tutorial v8

SA1-44

Smoothing: Jelinek-Mercer continued

Put λs into buckets by count
Find λs by cross-validation on held-out data
Also called "deleted-interpolation"

Page 45: Lm Tutorial v8

SA1-45

Smoothing: Good Turing

Imagine you are fishing
You have caught 10 Carp, 3 Cod, 2 tuna, 1 trout, 1 salmon, 1 eel.
How likely is it that the next species is new? 3/18
How likely is it that the next is tuna? Less than 2/18

Page 46: Lm Tutorial v8

SA1-46

Smoothing: Good Turing

How many species (words) were seen once? Estimate for how many are unseen.
All other estimates are adjusted (down) to give probabilities for unseen

Page 47: Lm Tutorial v8

SA1-47

Smoothing: Good Turing Example

10 Carp, 3 Cod, 2 tuna, 1 trout, 1 salmon, 1 eel.
How likely is new data (p0)? Let n1 be the number of species occurring once (3), N be the total (18). p0 = 3/18
How likely is eel? Use the adjusted count 1*:
n1 = 3, n2 = 1
1* = 2 × n2/n1 = 2 × 1/3 = 2/3
P(eel) = 1*/N = (2/3)/18 = 1/27
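
The same arithmetic in a short Python sketch (my illustration of the fishing example above):

from collections import Counter

catch = ["carp"] * 10 + ["cod"] * 3 + ["tuna"] * 2 + ["trout", "salmon", "eel"]
counts = Counter(catch)                       # species -> count
N = sum(counts.values())                      # 18 fish total
count_of_counts = Counter(counts.values())    # n_r: how many species were seen r times

p_new = count_of_counts[1] / N                # P(next species is unseen) = n1/N = 3/18
r = counts["eel"]                             # r = 1
r_star = (r + 1) * count_of_counts[r + 1] / count_of_counts[r]   # 1* = 2 * n2/n1 = 2/3
p_eel = r_star / N                            # (2/3)/18 = 1/27
print(p_new, p_eel)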

Page 48: Lm Tutorial v8

SA1-48

Smoothing: Katz

Use Good-Turing estimate, with backoff:
P_Katz(z|xy) = C*(xyz) / C(xy) if C(xyz) > 0, otherwise α(xy) P_Katz(z|y)

Works pretty well.
Not good for 1 counts
α is calculated so probabilities sum to 1

Page 49: Lm Tutorial v8

SA1-49

Smoothing: Absolute Discounting

Assume fixed discount D:
P_absolute(z|xy) = (C(xyz) − D) / C(xy) if C(xyz) > 0, otherwise α(xy) P_absolute(z|y)

Works pretty well, easier than Katz.
Not so good for 1 counts

Page 50: Lm Tutorial v8

SA1-50

Smoothing: Interpolated Absolute Discount

Backoff: ignore bigram if have trigram
Interpolated: always combine bigram, trigram:
P_interpolate(z|xy) = max(C(xyz) − D, 0) / C(xy) + λ(xy) P_interpolate(z|y)

Page 51: Lm Tutorial v8

SA1-51

Smoothing: Interpolated Multiple Absolute Discounts

One discount is good
Different discounts for different counts
Multiple discounts: for 1 count, 2 counts, >2

Page 52: Lm Tutorial v8

SA1-52

Smoothing: Kneser-Ney

P(Francisco | eggplant) vs P(stew | eggplant)
"Francisco" is common, so backoff, interpolated methods say it is likely
But it only occurs in context of "San"
"Stew" is common, and in many contexts
Weight backoff by number of contexts word occurs in

Page 53: Lm Tutorial v8

SA1-53

Smoothing: Kneser-Ney

Interpolated
Absolute-discount
Modified backoff distribution
Consistently best technique
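
A rough Python sketch of interpolated Kneser-Ney for bigrams (my own simplification of the idea above, with a single fixed discount D = 0.75 as a placeholder; real implementations handle higher orders and multiple discounts):

from collections import Counter, defaultdict

def interpolated_kneser_ney_bigram(tokens, D=0.75):
    # P(z|y) = max(C(yz) - D, 0)/C(y .) + lambda(y) * Pcontinuation(z),
    # where Pcontinuation(z) is proportional to how many distinct words precede z.
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])                 # C(y .): y counted as a left context
    preceding = defaultdict(set)                    # z -> distinct words seen before z
    following = defaultdict(set)                    # y -> distinct words seen after y
    for y, z in bigrams:
        preceding[z].add(y)
        following[y].add(z)
    total_bigram_types = len(bigrams)

    def p(z, y):
        p_cont = len(preceding[z]) / total_bigram_types
        if contexts[y] == 0:
            return p_cont                           # unseen context: use continuation prob only
        lam = D * len(following[y]) / contexts[y]   # weight reserved for the backoff part
        return max(bigrams[(y, z)] - D, 0) / contexts[y] + lam * p_cont
    return p

p = interpolated_kneser_ney_bigram("san francisco stew san francisco".split())
print(p("francisco", "san"), p("stew", "san"))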

Page 54: Lm Tutorial v8

SA1-54

Smoothing: Chart

Page 55: Lm Tutorial v8

SA1-55

Caching

If you say something, you are likely to say it again later.

Interpolate trigram with cache:
P_cache(z|xy) = λ P_smooth(z|xy) + (1 − λ) C(z in history) / (length of history)
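
A minimal sketch of a unigram cache wrapped around any base model (my illustration; the interpolation weight and cache size are arbitrary placeholders):

from collections import Counter, deque

class CachedLM:
    """P(z|xy) = lam * P_base(z|xy) + (1 - lam) * C_cache(z) / |cache|."""
    def __init__(self, base_prob, lam=0.9, cache_size=500):   # lam, cache_size are illustrative
        self.base_prob = base_prob        # callable: base_prob(z, x, y)
        self.lam = lam
        self.cache = deque(maxlen=cache_size)
        self.counts = Counter()

    def observe(self, word):
        """Call on each word the user actually produced; oldest words fall out of the cache."""
        if len(self.cache) == self.cache.maxlen:
            self.counts[self.cache[0]] -= 1
        self.cache.append(word)
        self.counts[word] += 1

    def prob(self, z, x, y):
        cache_p = self.counts[z] / len(self.cache) if self.cache else 0.0
        return self.lam * self.base_prob(z, x, y) + (1 - self.lam) * cache_p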

Page 56: Lm Tutorial v8

SA1-56

Caching: Real Life

Someone says "I swear to tell the truth"
System hears "I swerve to smell the soup"
Cache remembers!
Person says "The whole truth", and, with cache, system hears "The whole soup." – errors are locked in.
Caching works well when users correct as they go, poorly or even hurts without correction.

Page 57: Lm Tutorial v8

SA1-57

Caching: Variations

N-gram caches: e.g. P_cache(z|xy) = C(xyz in history) / C(xy in history)
Conditional n-gram cache: use n-gram cache only if xy is in the history
Remove function-words like "the", "to"

Page 58: Lm Tutorial v8

SA1-58

Cache Results

[Chart: perplexity reduction (0–40%) vs. training data size (100,000 / 1,000,000 / 10,000,000 / all words) for unigram, bigram, and trigram caches, a unigram + conditional trigram cache, and a unigram + conditional bigram + conditional trigram cache]

Page 59: Lm Tutorial v8

SA1-59

5-grams

Why stop at 3-grams?
If P(z|…rstuvwxy) ≈ P(z|xy) is good, then
P(z|…rstuvwxy) ≈ P(z|vwxy) is better!
Very important to smooth well
Interpolated Kneser-Ney works much better than Katz on 5-grams, more than on 3-grams

Page 60: Lm Tutorial v8

SA1-60

N-gram versus smoothing algorithm

[Chart: entropy (5.5–10 bits) vs. n-gram order (1–10, 20) for Katz and Kneser-Ney (KN) smoothing, trained on 100,000 / 1,000,000 / 10,000,000 / all words]

Page 61: Lm Tutorial v8

SA1-61

Speech recognizer mechanics

Keep many hypotheses alive
Find acoustic, language model scores
• P(acoustics | truth) = .3, P(truth | tell the) = .1
• P(acoustics | soup) = .2, P(soup | smell the) = .01

"…tell the" (.01)    "…smell the" (.01)
"…tell the truth" (.01 × .3 × .1)    "…smell the soup" (.01 × .2 × .01)

Page 62: Lm Tutorial v8

SA1-62

Speech recognizer slowdowns

Speech recognizer uses tricks (dynamic programming) to merge hypotheses

Trigram:
"…tell the"
"…smell the"

Fivegram:
"…swear to tell the"    "…swerve to smell the"
"…swear too tell the"    "…swerve too smell the"
"…swerve to tell the"    "…swerve too tell the"

Page 63: Lm Tutorial v8

SA1-63

Speech recognizer vs. n-gram

Recognizer can threshold out bad hypotheses
Trigram works so much better than bigram, better thresholding, no slow-down
4-gram, 5-gram start to become expensive

Page 64: Lm Tutorial v8

SA1-64

Speech recognizer with language model

In theory:

argmax over word sequences of P(acoustics | word sequence) × P(word sequence)

In practice, the language model is a better predictor -- acoustic probabilities aren't "real" probabilities

In practice, penalize insertions:

argmax over word sequences of P(acoustics | word sequence) × P(word sequence)^8 × (insertion penalty)^length(word sequence)

Page 65: Lm Tutorial v8

SA1-65

Skipping

P(z|…rstuvwxy) ≈ P(z|vwxy)
Why not P(z|v_xy) – a "skipping" n-gram – skips the value of the 3-back word.
Example: P(time | show John a good) -> P(time | show ____ a good)
P(z|…rstuvwxy) ≈ λ P(z|vwxy) + μ P(z|vw_y) + (1 − λ − μ) P(z|v_xy)
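
A tiny sketch of that combination (my illustration; p is assumed to be any smoothed estimator keyed on a context tuple, and the weights are placeholders to be tuned on held-out data):

def skipping_prob(z, v, w, x, y, p, lam1=0.5, lam2=0.3):
    """P(z|vwxy) ~ lam1*P(z|vwxy) + lam2*P(z|vw_y) + (1-lam1-lam2)*P(z|v_xy).
    '_' marks the skipped position; p(z, context) is any smoothed estimator."""
    return (lam1 * p(z, (v, w, x, y))
            + lam2 * p(z, (v, w, "_", y))
            + (1 - lam1 - lam2) * p(z, (v, "_", x, y)))
# Usage: build p so that it was trained with the same '_' convention for skipped words.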

Page 66: Lm Tutorial v8

SA1-66

5-gram Skipping Results

[Chart: perplexity reduction (0–7%) vs. training size (10,000 to 1,000,000,000 words) for 5-gram skipping/rearranging combinations: vwyx, vxyw, wxyv, vywx, yvwx, xvwy, wvxy + vw_y, v_xy, vwx_ (skipping); xvwy, wvxy, yvwx (rearranging); vwyx, vxyw, wxyv (rearranging); vwyx, vywx, yvwx (rearranging); vw_y, v_xy (skipping). Best trigram skipping result: 11% reduction]

Page 67: Lm Tutorial v8

SA1-67

Clustering

CLUSTERING = CLASSES (same thing)
What is P("Tuesday | party on")?
Similar to P("Monday | party on")
Similar to P("Tuesday | celebration on")
Put words in clusters:
• WEEKDAY = Sunday, Monday, Tuesday, …
• EVENT = party, celebration, birthday, …

Page 68: Lm Tutorial v8

SA1-68

Clustering overview

Major topic, useful in many fields
Kinds of clustering:
• Predictive clustering
• Conditional clustering
• IBM-style clustering
How to get clusters:
• Be clever or it takes forever!

Page 69: Lm Tutorial v8

SA1-69

Predictive clustering

Let "z" be a word, "Z" be its cluster
One cluster per word: hard clustering
• WEEKDAY = Sunday, Monday, Tuesday, …
• MONTH = January, February, April, May, June, …
P(z|xy) = P(Z|xy) × P(z|xyZ)
P(Tuesday | party on) = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY)
P_smooth(z|xy) ≈ P_smooth(Z|xy) × P_smooth(z|xyZ)

Page 70: Lm Tutorial v8

SA1-70

Predictive clustering example

Find P(Tuesday | party on):
• P_smooth(WEEKDAY | party on) × P_smooth(Tuesday | party on WEEKDAY)
• C(party on Tuesday) = 0
• C(party on Wednesday) = 10
• C(arriving on Tuesday) = 10
• C(on Tuesday) = 100
P_smooth(WEEKDAY | party on) is high
P_smooth(Tuesday | party on WEEKDAY) backs off to P_smooth(Tuesday | on WEEKDAY)
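
A small Python sketch of the decomposition P(z|xy) ≈ P(Z|xy) × P(z|xyZ) using raw counts (my illustration; the cluster map is hand-built and smoothing is omitted for brevity):

from collections import Counter

word2cluster = {"Tuesday": "WEEKDAY", "Wednesday": "WEEKDAY", "Monday": "WEEKDAY",
                "party": "EVENT", "celebration": "EVENT"}    # toy hand-built clusters

def cluster_of(w):
    return word2cluster.get(w, w)     # words outside any cluster act as their own cluster

def predictive_cluster_prob(z, x, y, trigrams, bigrams):
    """P(z|xy) ~ P(Z|xy) * P(z|xyZ), estimated from raw counts (no smoothing here).
    trigrams = Counter(zip(toks, toks[1:], toks[2:])); bigrams = Counter(zip(toks, toks[1:]))."""
    Z = cluster_of(z)
    ctx = bigrams[(x, y)]
    # P(Z|xy): how often the context x y is followed by any word in cluster Z
    followed_by_Z = sum(c for (a, b, w), c in trigrams.items()
                        if (a, b) == (x, y) and cluster_of(w) == Z)
    p_cluster = followed_by_Z / ctx if ctx else 0.0
    # P(z|xyZ): of those, how often it is z specifically
    p_word = trigrams[(x, y, z)] / followed_by_Z if followed_by_Z else 0.0
    return p_cluster * p_word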

Page 71: Lm Tutorial v8

SA1-71

Conditional clustering

P(z|xy) ≈ P(z|XY)
P(Tuesday | party on) ≈ P(Tuesday | EVENT PREPOSITION)
P_smooth(z|xy) ≈ P_smooth(z|XY) = λ P_ML(Tuesday | EVENT PREPOSITION) + μ P_ML(Tuesday | PREPOSITION) + (1 − λ − μ) P_ML(Tuesday)

Page 72: Lm Tutorial v8

SA1-72

IBM Clustering

P(z|xy) ≈ P_smooth(Z|XY) × P(z|Z)
≈ P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY)
Small, very smooth, mediocre perplexity

P(z|xy) ≈ λ P_smooth(z|xy) + (1 − λ) P_smooth(Z|XY) × P(z|Z)
Bigger, better than no clusters

Page 73: Lm Tutorial v8

SA1-73

Cluster Results

[Chart: perplexity reduction (−20% to +20%) over a Kneser-Ney trigram vs. training size (100,000 / 1,000,000 / 10,000,000 words) for Predict, IBM, Full IBM–Predict, and All-Combine clusterings]

Page 74: Lm Tutorial v8

SA1-74

Clustering by Position

"A" and "AN": same cluster or different cluster?
Same cluster for predictive clustering
Different clusters for conditional clustering
Small improvement by using different clusters for conditional and predictive

Page 75: Lm Tutorial v8

SA1-75

Clustering: how to get them

Build them by hand
• Works ok when almost no data
Part of Speech (POS) tags
• Tends not to work as well as automatic
Automatic Clustering
• Swap words between clusters to minimize perplexity

Page 76: Lm Tutorial v8

SA1-76

Clustering: automatic

Minimize perplexity of P(z|Y)
Mathematical tricks speed it up
Use top-down splitting, not bottom up merging!

Page 77: Lm Tutorial v8

SA1-77

Two actual WSJ classes

MONDAYS FRIDAYS THURSDAY MONDAY EURODOLLARS SATURDAY WEDNESDAY FRIDAY TENTERHOOKS TUESDAY SUNDAY CONDITION

PARTY FESCO CULT NILSON PETA CAMPAIGN WESTPAC FORCE CONRAN DEPARTMENT PENH GUILD

Page 78: Lm Tutorial v8

SA1-78

Sentence Mixture Models

Lots of different sentence types:
• Numbers (The Dow rose one hundred seventy three points)
• Quotations (Officials said "quote we deny all wrong doing "quote)
• Mergers (AOL and Time Warner, in an attempt to control the media and the internet, will merge)
Model each sentence type separately

Page 79: Lm Tutorial v8

SA1-79

Sentence Mixture Models

Roll a die to pick sentence type, s_k, with probability σ_k
Probability of sentence, given s_k: ∏_{i=1}^{n} P(w_i | w_{i-2} w_{i-1} s_k)
Probability of sentence across types:

∑_{k=1}^{m} σ_k ∏_{i=1}^{n} P(w_i | w_{i-2} w_{i-1} s_k)

Page 80: Lm Tutorial v8

SA1-80

Sentence Model Smoothing

Each topic model is smoothed with the overall model.
The sentence mixture model is smoothed with the overall model (sentence type 0).

∑_{k=0}^{m} σ_k ∏_{i=1}^{n} [ λ P(w_i | w_{i-2} w_{i-1} s_k) + (1 − λ) P(w_i | w_{i-2} w_{i-1}) ]
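
A compact Python sketch of the smoothed mixture above (my illustration; the per-type and overall models are assumed to be smoothed trigram estimators passed in as callables, and λ is a placeholder):

import math

def sentence_mixture_logprob(words, type_models, global_model, sigmas, lam=0.7):
    """P(sentence) = sum_k sigma_k * prod_i [lam*P_k(w_i|w_{i-2}w_{i-1}) + (1-lam)*P(w_i|w_{i-2}w_{i-1})].
    type_models[k] and global_model are callables p(z, x, y) returning nonzero (smoothed) probabilities;
    sigmas are the sentence-type probabilities and should sum to 1."""
    padded = ["<s>", "<s>"] + words
    total = 0.0
    for sigma, p_k in zip(sigmas, type_models):
        log_p = 0.0
        for i in range(2, len(padded)):
            x, y, z = padded[i - 2], padded[i - 1], padded[i]
            log_p += math.log(lam * p_k(z, x, y) + (1 - lam) * global_model(z, x, y))
        total += sigma * math.exp(log_p)      # fine for a sketch; long sentences would need log-sum-exp
    return math.log(total)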

Page 81: Lm Tutorial v8

SA1-81

Sentence Mixture Results

[Chart: perplexity reduction (0–20%) vs. number of sentence types (1–128) for 3-gram and 5-gram models trained on 100,000 / 1,000,000 / 10,000,000 / all words]

Page 82: Lm Tutorial v8

SA1-82

Sentence Clustering

Same algorithm as word clustering
Assign each sentence to a type, s_k
Minimize perplexity of P(z|s_k) instead of P(z|Y)

Page 83: Lm Tutorial v8

SA1-83

Topic Examples - 0 (Mergers and acquisitions)

JOHN BLAIR &AMPERSAND COMPANY IS CLOSE TO AN AGREEMENT TO SELL ITS T. V. STATION ADVERTISING REPRESENTATION OPERATION AND PROGRAM PRODUCTION UNIT TO AN INVESTOR GROUP LED BY JAMES H. ROSENFIELD ,COMMA A FORMER C. B. S. INCORPORATED EXECUTIVE ,COMMA INDUSTRY SOURCES SAID .PERIOD

INDUSTRY SOURCES PUT THE VALUE OF THE PROPOSED ACQUISITION AT MORE THAN ONE HUNDRED MILLION DOLLARS .PERIOD

JOHN BLAIR WAS ACQUIRED LAST YEAR BY RELIANCE CAPITAL GROUP INCORPORATED ,COMMA WHICH HAS BEEN DIVESTING ITSELF OF JOHN BLAIR'S MAJOR ASSETS .PERIOD

JOHN BLAIR REPRESENTS ABOUT ONE HUNDRED THIRTY LOCAL TELEVISION STATIONS IN THE PLACEMENT OF NATIONAL AND OTHER ADVERTISING .PERIOD

MR. ROSENFIELD STEPPED DOWN AS A SENIOR EXECUTIVE VICE PRESIDENT OF C. B. S. BROADCASTING IN DECEMBER NINETEEN EIGHTY FIVE UNDER A C. B. S. EARLY RETIREMENT PROGRAM .PERIOD

Page 84: Lm Tutorial v8

SA1-84

Topic Examples - 1 (production, promotions, commas)

MR. DION ,COMMA EXPLAINING THE RECENT INCREASE IN THE STOCK PRICE ,COMMA SAID ,COMMA "DOUBLE-QUOTE OBVIOUSLY ,COMMA IT WOULD BE VERY ATTRACTIVE TO OUR COMPANY TO WORK WITH THESE PEOPLE .PERIOD

BOTH MR. BRONFMAN AND MR. SIMON WILL REPORT TO DAVID G. SACKS ,COMMA PRESIDENT AND CHIEF OPERATING OFFICER OF SEAGRAM .PERIOD

JOHN A. KROL WAS NAMED GROUP VICE PRESIDENT ,COMMA AGRICULTURE PRODUCTS DEPARTMENT ,COMMA OF THIS DIVERSIFIED CHEMICALS COMPANY ,COMMA SUCCEEDING DALE E. WOLF ,COMMA WHO WILL RETIRE MAY FIRST .PERIOD

MR. KROL WAS FORMERLY VICE PRESIDENT IN THE AGRICULTURE PRODUCTS DEPARTMENT .PERIOD

RAPESEED ,COMMA ALSO KNOWN AS CANOLA ,COMMA IS CANADA'S MAIN OILSEED CROP .PERIOD

YALE E. KEY IS A WELL -HYPHEN SERVICE CONCERN .PERIOD

Page 85: Lm Tutorial v8

SA1-85

Topic Examples - 2 (Numbers)

SOUTH KOREA POSTED A SURPLUS ON ITS CURRENT ACCOUNT OF FOUR HUNDRED NINETEEN MILLION DOLLARS IN FEBRUARY ,COMMA IN CONTRAST TO A DEFICIT OF ONE HUNDRED TWELVE MILLION DOLLARS A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD

THE CURRENT ACCOUNT COMPRISES TRADE IN GOODS AND SERVICES AND SOME UNILATERAL TRANSFERS .PERIOD

COMMERCIAL -HYPHEN VEHICLE SALES IN ITALY ROSE ELEVEN .POINT FOUR %PERCENT IN FEBRUARY FROM A YEAR EARLIER ,COMMA TO EIGHT THOUSAND ,COMMA EIGHT HUNDRED FORTY EIGHT UNITS ,COMMA ACCORDING TO PROVISIONAL FIGURES FROM THE ITALIAN ASSOCIATION OF AUTO MAKERS .PERIOD

INDUSTRIAL PRODUCTION IN ITALY DECLINED THREE .POINT FOUR %PERCENT IN JANUARY FROM A YEAR EARLIER ,COMMA THE GOVERNMENT SAID .PERIOD

CANADIAN MANUFACTURERS' NEW ORDERS FELL TO TWENTY .POINT EIGHT OH BILLION DOLLARS (LEFT-PAREN CANADIAN )RIGHT-PAREN IN JANUARY ,COMMA DOWN FOUR %PERCENT FROM DECEMBER'S TWENTY ONE .POINT SIX SEVEN BILLION DOLLARS ON A SEASONALLY ADJUSTED BASIS ,COMMA STATISTICS CANADA ,COMMA A FEDERAL AGENCY ,COMMA SAID .PERIOD

THE DECREASE FOLLOWED A FOUR .POINT FIVE %PERCENT INCREASE IN DECEMBER .PERIOD

Page 86: Lm Tutorial v8

SA1-86

Topic Examples – 3 (quotations)

NEITHER MR. ROSENFIELD NOR OFFICIALS OF JOHN BLAIR COULD BE REACHED FOR COMMENT .PERIOD

THE AGENCY SAID THERE IS "DOUBLE-QUOTE SOME INDICATION OF AN UPTURN "DOUBLE-QUOTE IN THE RECENT IRREGULAR PATTERN OF SHIPMENTS ,COMMA FOLLOWING THE GENERALLY DOWNWARD TREND RECORDED DURING THE FIRST HALF OF NINETEEN EIGHTY SIX .PERIOD

THE COMPANY SAID IT ISN'T AWARE OF ANY TAKEOVER INTEREST .PERIOD

THE SALE INCLUDES THE RIGHTS TO GERMAINE MONTEIL IN NORTH AND SOUTH AMERICA AND IN THE FAR EAST ,COMMA AS WELL AS THE WORLDWIDE RIGHTS TO THE DIANE VON FURSTENBERG COSMETICS AND FRAGRANCE LINES AND U. S. DISTRIBUTION RIGHTS TO LANCASTER BEAUTY PRODUCTS .PERIOD

BUT THE COMPANY WOULDN'T ELABORATE .PERIOD

HEARST CORPORATION WOULDN'T COMMENT ,COMMA AND MR. GOLDSMITH COULDN'T BE REACHED .PERIOD

A MERRILL LYNCH SPOKESMAN CALLED THE REVISED QUOTRON AGREEMENT "DOUBLE-QUOTE A PRUDENT MANAGEMENT MOVE --DASH IT GIVES US A LITTLE FLEXIBILITY .PERIOD

Page 87: Lm Tutorial v8

SA1-87

Parsing

A parser maps the sentence "Alice ate yellow squash." to a tree:

(S (NP (N Alice)) (VP (V ate) (NP (Adj yellow) (N squash))))

Page 88: Lm Tutorial v8

SA1-88

The Importance of Parsing

In the hotel fake property was sold to tourists.

What does “fake” modify?

What does “In the hotel” modify?

Page 89: Lm Tutorial v8

SA1-89

Ambiguity

"Salesmen sold the dog biscuits" has two parses:

(S (NP (N Salesmen)) (VP (V sold) (NP (Det the) (N dog)) (NP (N biscuits))))
(S (NP (N Salesmen)) (VP (V sold) (NP (Det the) (N dog) (N biscuits))))

Page 90: Lm Tutorial v8

SA1-90

Probabilistic Context-free Grammars (PCFGs)

S → NP VP         1.0
VP → V NP         0.5
VP → V NP NP      0.5
NP → Det N        0.5
NP → Det N N      0.5
N → salespeople   0.3
N → dog           0.4
N → biscuits      0.3
V → sold          1.0

[Parse tree: (S (NP (N Salesmen)) (VP (V sold) (NP (Det the) (N dog) (N biscuits))))]

Page 91: Lm Tutorial v8

SA1-91

Producing a Single "Best" Parse

The parser finds the most probable parse tree given the sentence (s)
For a PCFG we have the following, where r varies over the rules used in the tree τ:

p(τ, s) = ∏_{r ∈ τ} p(r)

Page 92: Lm Tutorial v8

SA1-92

The Penn Wall Street Journal Tree-bank

About one million words.
Average sentence length is 23 words and punctuation.

In an Oct. 19 review of “The Misanthrope” at Chicago’s Goodman Theatre (“Revitalized Classics Take the Stage in Windy City,” Leisure & Arts), the role of Celimene, played by Kim Cattrall, was mistakenly attributed to Christina Hagg.

Page 93: Lm Tutorial v8

SA1-93

"Learning" a PCFG from a Tree-Bank

(S (NP (N Salespeople))
   (VP (V sold)
       (NP (Det the)
           (N dog)
           (N biscuits)))
   (. .))

S → NP VP .
VP → V NP
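
A short Python sketch of reading rule probabilities off a treebank tree like the one above (my illustration; trees are written as nested tuples):

from collections import Counter, defaultdict

# The bracketed tree above, written as nested tuples: (label, child, child, ...)
tree = ("S",
        ("NP", ("N", "Salespeople")),
        ("VP", ("V", "sold"),
               ("NP", ("Det", "the"), ("N", "dog"), ("N", "biscuits"))),
        (".", "."))

def count_rules(t, rule_counts):
    """Add one count for each rule LHS -> children used in tree t."""
    label, children = t[0], t[1:]
    if len(children) == 1 and isinstance(children[0], str):
        rule_counts[(label, (children[0],))] += 1          # lexical rule, e.g. V -> sold
        return
    rhs = tuple(child[0] for child in children)
    rule_counts[(label, rhs)] += 1                          # e.g. S -> NP VP .
    for child in children:
        count_rules(child, rule_counts)

rule_counts = Counter()
count_rules(tree, rule_counts)

lhs_totals = defaultdict(int)
for (lhs, rhs), c in rule_counts.items():
    lhs_totals[lhs] += c
rule_prob = {(lhs, rhs): c / lhs_totals[lhs] for (lhs, rhs), c in rule_counts.items()}
print(rule_prob[("S", ("NP", "VP", "."))])   # 1.0, since this toy treebank has only one tree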

Page 94: Lm Tutorial v8

SA1-94

Lexicalized Parsing

To do better, it is necessary to condition probabilities on the actual words of the sentence. This makes the probabilities much tighter:

p(VP → V NP NP)        = 0.00151
p(VP → V NP NP | said) = 0.00001
p(VP → V NP NP | gave) = 0.01980

Page 95: Lm Tutorial v8

SA1-95

Lexicalized Probabilities for Heads

p(prices | n-plural) = .013
p(prices | n-plural, NP) = .013
p(prices | n-plural, NP, S) = .025
p(prices | n-plural, NP, S, v-past) = .052
p(prices | n-plural, NP, S, v-past, fell) = .146

Page 96: Lm Tutorial v8

SA1-96

Statistical Parser Improvement

[Chart: labeled precision & recall (%), roughly 84–90, vs. year, 1995–2001]

Page 97: Lm Tutorial v8

SA1-97

Parsers and Language Models

Generative parsers are of particular interest because they can be turned into language models.

Here, of course, p(s) = ∑_τ p(τ, s), summing over the parses τ of the sentence s.

Page 98: Lm Tutorial v8

SA1-98

Parsing vs. Trigram

Model Perplexity

Trigram poor smoothing 167

Trigram deleted-interpolation 155

Trigram Kneser-Ney 145

Parsing 119

All experiments are trained on one million words of Penn tree-bank data, and tested on 80,000 words.

Page 99: Lm Tutorial v8

SA1-99

Language Modeling Applications

Speech Recognition
Machine Translation
Language Analysis
Fuzzy Keyboard
Information Retrieval
(Spelling correction, handwriting recognition, telephone keypad entry, Chinese/Japanese text entry)

Page 100: Lm Tutorial v8

SA1-100

Application in Machine Translation

Let f be a French sentence we wish to translate into English.
Let e be a possible English translation.

argmax over e of P(e | f) = argmax over e of P(f | e) × P(e)

P(e) is the language model; P(f | e) is the translation model.

Page 101: Lm Tutorial v8

SA1-101

Why is p(f|e) Easier than p(e|f)?

[Diagram: arrows pairing good/bad English sentences with good/bad French sentences]

With anything less than a perfect model of p(e|f) we will generally pick out bad English for Good French.

Page 102: Lm Tutorial v8

SA1-102

Why is p(f|e) Easier than p(e|f)?

[Diagram: arrows pairing good/bad English sentences with good/bad French sentences]

With a bad p(f|e) we will pick out lots of bad French, but that’s OK, because we already know the correct French!

Page 103: Lm Tutorial v8

SA1-103

MT and Parsing Language Models

Finds tree fragments that match the French.

Puts the fragments together into a parse.

Page 104: Lm Tutorial v8

SA1-104

Some Successes

Correct: This is not possible.

Trigram: Impossibility.

Parser: This is impossible.

Correct: This is the globalization of production.

Trigram: This globalization of production.

Parser: This is globalization of production.

Page 105: Lm Tutorial v8

SA1-105

Some Less than Successes

Correct: He said he often eats Chinese dishes.

Trigram: He said China frequently tastes food.

Parser: He said recurrent taste of Chinese cuisine.

Correct: Wishful thinking out of touch with reality.

Trigram: Divorce practical delusion.

Parser: Practical delusion divorced.

Page 106: Lm Tutorial v8

SA1-106

Measurement

So far we have asked how much good does, say, caching do, and taken it as a fact about language modeling. Alternatively we could take it as a measurement about caching.

Also, we have asked questions about the perplexity of "all English". Alternatively we could ask about smaller units.

Page 107: Lm Tutorial v8

SA1-107

Measurement by Sentence

A basic fact about, say, a newspaper article is that it has a first sentence, a second, etc.
How does per-word perplexity vary as a function of sentence number?

[Chart placeholder: perplexity vs. sentence number]

Page 108: Lm Tutorial v8

SA1-108

Perplexity Increases with Sentence Number

[Chart: per-word perplexity (roughly 100–220) vs. sentence number 1–15]

Page 109: Lm Tutorial v8

SA1-109

Theory: Sentence Perplexity is Constant.

We want to measure sentence perplexity given all previous information.
In fact, models like trigram, or even parsing language models, only look at local context.
There are useful clues from previous sentences that are not being used, and these clues increase with sentence number.

Page 110: Lm Tutorial v8

SA1-110

Which Words, How they are Used

We can categorize possible contextual influences into those that affect which words are used and those that affect how they are used.

Page 111: Lm Tutorial v8

SA1-111

Open vs. Closed Class Words

Closed class words like prepositions ("of", "at", "for" etc.) or determiners ("the", "a", "some" etc.) should remain constant over the sentences.
Open class words, like nouns ("piano", "trial", "concert"), should change depending on context.

Page 112: Lm Tutorial v8

SA1-112

Perplexity of Closed Class Items

[Chart: perplexity (roughly 100–220) vs. sentence number 1–15 for closed class items]

Page 113: Lm Tutorial v8

SA1-113

Perplexity of Open Class Items

[Chart: noun perplexity (roughly 200–550) vs. sentence number 1–10]

Page 114: Lm Tutorial v8

SA1-114

Contextual Lexical Effects as Detected by Caching Models

[Chart: preposition perplexity (roughly 25–41) vs. sentence number 1–10]

Page 115: Lm Tutorial v8

SA1-115

Effect of Caching on Nouns

[Chart: noun perplexity (roughly 220–570) vs. sentence number 1–10]

Page 116: Lm Tutorial v8

SA1-116

Fuzzy Keyboard

A soft keyboard is an image of a keyboard on a Palm Pilot or Windows CE device.
Very small – users can type on key boundary, or hit the wrong key easily

Page 117: Lm Tutorial v8

SA1-117

Fuzzy Keyboard Idea

If a user hits the "Q" key, then hits between the "U" and "I" keys, assume he meant "QU"
If a user hits between "Q" and "W" and then hits the "E", assume he meant "WE"
In general, we can use a language model to help decide.

Page 118: Lm Tutorial v8

SA1-118

Fuzzy Keyboard: Language model and Pen Positions

Math: Language Model times Pen Position:

argmax over letter sequences of P(letter sequence) × P(pen down positions | letter sequence)

For pen down positions, collect data, and compute a simple Gaussian distribution.
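
A sketch of that decision rule (my illustration; the key centers, the Gaussian spread, and the candidate generator are all assumed/hypothetical):

import math

def log_gaussian_2d(point, center, sigma=0.4):
    """Log density of an isotropic 2-D Gaussian over pen position; sigma (in key widths) is made up."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return -(dx * dx + dy * dy) / (2 * sigma * sigma) - math.log(2 * math.pi * sigma * sigma)

def decode_taps(taps, key_centers, letter_lm, candidates):
    """Pick argmax over candidate letter sequences of
    log P(letters) + sum_i log P(pen position i | intended key)."""
    best, best_score = None, float("-inf")
    for letters in candidates:                       # e.g. sequences built from keys near each tap
        if len(letters) != len(taps):
            continue
        score = letter_lm(letters)                   # log P(letter sequence), any character n-gram model
        score += sum(log_gaussian_2d(t, key_centers[c]) for t, c in zip(taps, letters))
        if score > best_score:
            best, best_score = letters, score
    return best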

Page 119: Lm Tutorial v8

SA1-119

Fuzzy Keyboard Results

40% fewer errors, same speed.
See poster at AAAI
Can be applied to eye typing, etc.

Page 120: Lm Tutorial v8

SA1-120

Information Retrieval

For a given query, what document is most likely?

argmax over documents of P(document | query) = argmax over documents of P(query | document) × P(document)

P(query | document): use a language model. P(document): ignore, or use uniform, or learn somehow.

Page 121: Lm Tutorial v8

SA1-121

Probability of Query

Use a simple unigram language model smoothed with a global model.

P(q_i | document) = λ C(q_i in document) / doclength + (1 − λ) C(q_i in all documents) / everything

P(query | document) = P(q_1 | document) × P(q_2 | document) × … × P(q_n | document)

Works about as well as TF/IDF (standard, simple IR techniques).
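
A minimal Python sketch of this query-likelihood scoring (my illustration; λ is a placeholder):

import math
from collections import Counter

def query_likelihood(query_terms, doc_tokens, collection_counts, collection_len, lam=0.5):
    """log P(query | doc), each term smoothed against the whole collection:
    P(q | doc) = lam * C(q in doc)/|doc| + (1 - lam) * C(q in collection)/|collection|."""
    doc_counts = Counter(doc_tokens)
    score = 0.0
    for q in query_terms:
        p = (lam * doc_counts[q] / len(doc_tokens)
             + (1 - lam) * collection_counts[q] / collection_len)
        if p == 0.0:
            return float("-inf")     # query term unseen anywhere in the collection
        score += math.log(p)
    return score

# Rank documents by this score; the document with the highest log P(query | doc) comes first.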

Page 122: Lm Tutorial v8

SA1-122

Clusters for IR

Use predictive clustering:
P(q_i | document) = P(Q_i | document) × P(q_i | document, Q_i)

Example: search for Honda, find Accord, because in same cluster. Word model is smoothed with global model.
In some experiments, works better than unigram.

Page 123: Lm Tutorial v8

SA1-123

Other Language Model Uses

Handwriting recognition
• P(observed ink | words) × P(words)

Telephone keypad input
• P(numbers | words) × P(words)

Spelling correction
• P(observed keys | words) × P(words)

Chinese/Japanese text entry
• P(phonetic representation | characters) × P(characters)
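All of these uses share the same noisy-channel shape: pick the hidden text that maximizes a channel model times a language model. A tiny illustrative Python sketch follows; the candidate list and both probability tables are toy placeholders, not anything from the tutorial.

```python
import math

def noisy_channel_decode(candidates, channel_logprob, lm_logprob):
    """Return the candidate maximizing
    log P(observation | candidate) + log P(candidate)."""
    return max(candidates, key=lambda c: channel_logprob(c) + lm_logprob(c))

# Toy spelling-correction usage: observed typo "teh", candidate corrections.
candidates = ["teh", "the", "ten"]
channel = lambda c: math.log({"teh": 0.05, "the": 0.04, "ten": 0.001}[c])  # P(observed | c), made up
lm = lambda c: math.log({"teh": 1e-7, "the": 0.05, "ten": 0.001}[c])       # P(c), made up
print(noisy_channel_decode(candidates, channel, lm))  # -> "the"
```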


SA1-124

Tools: CMU Language Modeling Toolkit

Can handle bigrams, trigrams, and more
Can handle different smoothing schemes
Many separate tools – output of one tool is input to the next: easy to use
Free for research purposes
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html


SA1-125

Using the CMU LM Tools


SA1-126

Tools: SRI Language Modeling Toolkit

More powerful than the CMU toolkit
Can handle clusters, lattices, n-best lists, hidden tags
Free for research use
http://www.speech.sri.com/projects/srilm


SA1-127

Tools: Text Normalization

What about “$3,100,000”? → convert to “Three million one hundred thousand dollars”, etc.

Need to do this for dates, numbers, maybe abbreviations.

Some text-normalization tools come with the Wall Street Journal corpus, from the LDC (Linguistic Data Consortium)

Not much available
Write your own (use Perl!)
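The tutorial suggests writing your own normalizer (in Perl); purely as an illustration of the idea, here is a small Python sketch that spells out dollar amounts such as “$3,100,000”. Real text normalization also has to cover dates, abbreviations, ordinals, and many special cases that this sketch ignores.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

def small_to_words(n):
    """Spell out 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])
    rest = n % 100
    return ONES[n // 100] + " hundred" + ("" if rest == 0 else " " + small_to_words(rest))

def number_to_words(n):
    """Spell out a non-negative integer (up to billions is enough for this sketch)."""
    if n == 0:
        return "zero"
    parts = []
    for scale, name in [(10**9, "billion"), (10**6, "million"), (10**3, "thousand"), (1, "")]:
        if n >= scale:
            chunk, n = n // scale, n % scale
            parts.append(small_to_words(chunk) + (" " + name if name else ""))
    return " ".join(p.strip() for p in parts)

def normalize_dollars(text):
    """Replace '$3,100,000'-style amounts with spelled-out words."""
    def repl(m):
        amount = int(m.group(1).replace(",", ""))
        return number_to_words(amount) + " dollars"
    return re.sub(r"\$([\d,]+)", repl, text)

print(normalize_dollars("The deal was worth $3,100,000 last year."))
# -> "The deal was worth three million one hundred thousand dollars last year."
```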


SA1-128

Small enough

Real language models are often huge
5-gram models are typically larger than the training data
Use count cutoffs (eliminate parameters with fewer counts) or, better,
Use Stolcke pruning – finds counts that contribute least to perplexity reduction, e.g.
• P(City | New York) ≈ P(City | York)
• P(Friday | God it’s) ≈ P(Friday | it’s)
Remember, Kneser-Ney helped most when there were lots of 1-counts
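As a rough illustration of the two options, here is a Python sketch of a count cutoff and of a simplified relative-entropy criterion in the spirit of Stolcke pruning: it scores each trigram by how much replacing it with its bigram backoff would change the model, and it deliberately ignores the backoff-weight renormalization that the full algorithm performs. All probabilities below are made up.

```python
import math

def count_cutoff(ngram_counts, cutoff=1):
    """Drop n-grams whose count is at or below the cutoff."""
    return {ng: c for ng, c in ngram_counts.items() if c > cutoff}

def prune_by_relative_entropy(trigram_p, bigram_p, history_p, threshold=1e-6):
    """Keep a trigram (w1, w2, w3) only if replacing P(w3 | w1, w2) with the
    backoff estimate P(w3 | w2) would change the model by more than `threshold`,
    measured (approximately) by weighted relative entropy:
        D = P(w1, w2) * P(w3 | w1, w2) * log[ P(w3 | w1, w2) / P(w3 | w2) ]."""
    kept = {}
    for (w1, w2, w3), p in trigram_p.items():
        backoff = bigram_p[(w2, w3)]
        d = history_p[(w1, w2)] * p * math.log(p / backoff)
        if abs(d) > threshold:
            kept[(w1, w2, w3)] = p
    return kept

# Toy usage: "new york city" is well approximated by its backoff, so it is pruned;
# the second trigram is not, so it survives.
trigram_p = {("new", "york", "city"): 0.30, ("the", "dow", "jones"): 0.60}
bigram_p = {("york", "city"): 0.29, ("dow", "jones"): 0.05}
history_p = {("new", "york"): 0.001, ("the", "dow"): 0.002}
print(prune_by_relative_entropy(trigram_p, bigram_p, history_p, threshold=1e-4))
print(count_cutoff({("a", "b", "c"): 1, ("d", "e", "f"): 3}))  # -> keeps only the count-3 trigram
```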


SA1-129

Some Experiments

I re-implemented all techniques
Trained on 260,000,000 words of WSJ
Optimize parameters on heldout
Test on separate test section
Some combinations extremely time-consuming (days of CPU time)
• Don’t try this at home, or in anything you want to ship
Rescored N-best lists to get results
• Maximum possible improvement from 10% word error rate absolute to 5%


SA1-130

Overall Results: Perplexity

[Scatter plot: perplexity (70–115) versus word error rate (8.8–10%) for each technique: Katz, Katz 5-gram, Katz Cluster, Katz Sentence, Katz skip, KN, KN 5gram, KN Cluster, KN Sentence, KN Skip, all-cache, all-cache-5gram, all-cache-cluster, all-cache-sentence, all-cache-skip, all-cache-KN]

SA1-131

Conclusions

Use trigram models
Use any reasonable smoothing algorithm (Katz, Kneser-Ney)
Use caching if you have correction information.
Parsing is a promising technique.
Clustering, sentence mixtures, skipping not usually worth the effort.


SA1-132

Shannon Revisited

People can make GREAT use of long context

With 100 characters, computers get very roughly 50% word perplexity reduction.

Char n-gram   Low char   Upper char   Low word   Upper word
1             9.1        16.3         191,237    4,702,511
5             3.2        6.5          653        29,532
10            2.0        4.3          45         2,998
15            2.3        4.3          97         2,998
100           1.5        2.5          10         142


SA1-133

The Future?

Sentence mixture models need more exploration
Structured language models
Topic-based models
Integrating domain knowledge with the language model
Other ideas?
In the end, we need real understanding


SA1-134

More Resources

Joshua’s web page: www.research.microsoft.com/~joshuago
• Smoothing technical report: good introduction to smoothing and lots of details too.
• “A Bit of Progress in Language Modeling,” which is the journal version of much of this talk.
• Papers on fuzzy keyboard, language model compression, and maximum entropy.
• Clustering tool


SA1-135

More Resources

Eugene’s web page: http://www.cs.brown.edu/people/ec
• Papers on statistical parsing for its own sake and for language modeling, as well as using language modeling to measure contextual influence.
• Pointers to software for statistical parsing, as well as statistical parsers optimized for language modeling


SA1-136

More Resources: Books

Books (all are OK, none focus on language models)
• Statistical Language Learning by Eugene Charniak
• Speech and Language Processing by Dan Jurafsky and Jim Martin (especially Chapter 6)
• Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze
• Statistical Methods for Speech Recognition by Frederick Jelinek
• Spoken Language Processing by Huang, Acero, and Hon


SA1-137

More Resources

Sentence mixture models (also, caching):
• Rukmini Iyer, EE Ph.D. thesis, 1998, “Improving and predicting performance of statistical language models in sparse domains”
• Rukmini Iyer and Mari Ostendorf. Modeling long distance dependence in language: Topic mixtures versus dynamic cache models. IEEE Transactions on Acoustics, Speech and Audio Processing, 7:30–39, January 1999.

Caching: the above, plus
• R. Kuhn. Speech recognition and the frequency of recently used words: A modified Markov model for natural language. In 12th International Conference on Computational Linguistics, pages 348–350, Budapest, August 1988.
• R. Kuhn and R. De Mori. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(6):570–583, 1990.
• R. Kuhn and R. De Mori. Correction to “A cache-based natural language model for speech recognition.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(6):691–692, 1992.


SA1-138

More Resources: Clustering

The seminal reference
• P. F. Brown, V. J. DellaPietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, December 1992.

Two-sided clustering
• H. Yamamoto and Y. Sagisaka. Multi-class composite n-gram based on connection direction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Phoenix, Arizona, May 1999.

Fast clustering
• D. R. Cutting, D. R. Karger, J. R. Pedersen, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In SIGIR 92, 1992.

Other:
• R. Kneser and H. Ney. Improved clustering techniques for class-based statistical language modeling. In Eurospeech 93, volume 2, pages 973–976, 1993.


SA1-139

More Resources

Structured language models
• Eugene’s web page
• Ciprian Chelba’s web page: http://www.clsp.jhu.edu/people/chelba/

Maximum entropy
• Roni Rosenfeld’s home page and thesis: http://www.cs.cmu.edu/~roni/
• Joshua’s web page

Stolcke pruning
• A. Stolcke (1998), Entropy-based pruning of backoff language models. Proc. DARPA Broadcast News Transcription and Understanding Workshop, pp. 270–274, Lansdowne, VA. NOTE: get the corrected version from http://www.speech.sri.com/people/stolcke


SA1-140

More Resources: Skipping

Skipping:
• X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang, K.-F. Lee, and R. Rosenfeld. The SPHINX-II speech recognition system: An overview. Computer, Speech, and Language, 2:137–148, 1993.

Lots of stuff:
• S. Martin, C. Hamacher, J. Liermann, F. Wessel, and H. Ney. Assessment of smoothing methods and complex stochastic language modeling. In 6th European Conference on Speech Communication and Technology, volume 5, pages 1939–1942, Budapest, Hungary, September 1999.
• H. Ney, U. Essen, and R. Kneser. On structuring probabilistic dependences in stochastic language modeling. Computer, Speech, and Language, 8:1–38, 1994.