
Page 1

Natural Language Processing (6)

Zhao Hai 赵海

Department of Computer Science and Engineering, Shanghai Jiao Tong University

zhaohai@cs.sjtu.edu.cn

Revised from Joshua Goodman (Microsoft Research) and Michael Collins (MIT)

Page 2

(Statistical) Language Model

Outline

Page 3

A bad language model

Page 4

A bad language model

Page 5

A bad language model

Page 6

A bad language model

Page 7

Really Quick Overview

• Humor
• What is a language model?
• Really quick overview
  – Two minute probability overview
  – How language models work (trigrams)

Page 8

What’s a Language Model

• A language model is a probability distribution over word sequences
• P(“And nothing but the truth”) ≈ 0.001
• P(“And nuts sing on the roof”) ≈ 0

Page 9

What’s a language model for?

• Speech recognition
• Handwriting recognition
• Spelling correction
• Optical character recognition
• Machine translation

• (and anyone doing statistical modeling)

Page 10

Really Quick Overview

• Humor
• What is a language model?
• Really quick overview
  – Two minute probability overview
  – How language models work (trigrams)

Page 11

Everything you need to know about probability – definition

• P(X) means probability that X is true
  – P(baby is a boy) ≈ 0.5 (% of total that are boys)
  – P(baby is named John) ≈ 0.001 (% of total named John)

[Venn diagram: Babies ⊃ Baby boys ⊃ John]

Page 12

Everything about probability: Joint probabilities

• P(X, Y) means probability that X and Y are both true, e.g. P(brown eyes, boy)

[Venn diagram: Babies, Baby boys, John, Brown eyes]

Page 13

Everything about probability: Conditional probabilities

• P(X|Y) means probability that X is true when we already know Y is true
  – P(baby is named John | baby is a boy) ≈ 0.002
  – P(baby is a boy | baby is named John) ≈ 1


Page 14

Everything about probabilities: math

• P(X|Y) = P(X, Y) / P(Y)

P(baby is named John | baby is a boy)

= P(baby is named John, baby is a boy) / P(baby is a boy)

= 0.001 / 0.5 = 0.002


Page 15

Everything about probabilities: Bayes Rule

• Bayes rule:

P(X|Y) = P(Y|X) P(X) / P(Y)
• P(named John | boy) = P(boy | named John) P(named John) / P(boy)


Page 16

Really Quick Overview

• Humor
• What is a language model?
• Really quick overview
  – Two minute probability overview
  – How language models work (trigrams)

Page 17

THE Equation

  argmax_{word sequence} P(word sequence | acoustics)

  = argmax_{word sequence} P(acoustics | word sequence) × P(word sequence) / P(acoustics)

  = argmax_{word sequence} P(acoustics | word sequence) × P(word sequence)

Page 18

How Language Models work

• Hard to compute P(“And nothing but the truth”)

• Step 1: Decompose probability

P(“And nothing but the truth”) =
P(“And”) × P(“nothing | and”) × P(“but | and nothing”) × P(“the | and nothing but”) × P(“truth | and nothing but the”)

Page 19

The Trigram Approximation

Step 2:Make Markov Independence Assumptions

Assume each word depends only on the previous two words (three words total – tri means three, gram means writing)

P(“the | … whole truth and nothing but”) ≈ P(“the | nothing but”)
P(“truth | … whole truth and nothing but the”) ≈ P(“truth | but the”)

Page 20

Trigrams, continued

• How do we find probabilities?
• Get real text, and start counting!

  P(“the” | “nothing but”) ≈ C(“nothing but the”) / C(“nothing but”)
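To make the counting concrete, the sketch below (illustrative Python, not from the original slides; the toy corpus and function names are invented) estimates a trigram probability exactly this way, by dividing a trigram count by its bigram-context count.

```python
from collections import Counter

def train_trigram_counts(tokens):
    """Count trigrams and their bigram contexts in a token list."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    return trigrams, bigrams

def p_ml(z, x, y, trigrams, bigrams):
    """Maximum-likelihood P(z | x y) = C(x y z) / C(x y)."""
    if bigrams[(x, y)] == 0:
        return 0.0
    return trigrams[(x, y, z)] / bigrams[(x, y)]

corpus = "and nothing but the truth and nothing but lies".split()
trigrams, bigrams = train_trigram_counts(corpus)
# C("nothing but the") / C("nothing but") = 1 / 2
print(p_ml("the", "nothing", "but", trigrams, bigrams))
```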

Page 21

Real Overview

• Basics: probability, language model definition
• Evaluation
• Smoothing
• More techniques
  – Caching
  – Skipping
  – Clustering
  – Sentence-mixture models
  – Structured language models
• Tools

Page 22

Evaluation

• How can you tell a good language model from a bad one?

• Run a speech recognizer (or your application of choice), calculate word error rate
  – Slow
  – Specific to your recognizer

Page 23

Evaluation: Perplexity Intuition

• Ask a speech recognizer to recognize digits: “0, 1, 2, 3, 4, 5, 6, 7, 8, 9” – easy – perplexity 10

• Ask a speech recognizer to recognize names at Microsoft – hard – 30,000 – perplexity 30,000

• Ask a speech recognizer to recognize “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), 30,000 names (1 in 120,000) each – perplexity 54

• Perplexity is weighted equivalent branching factor.

Page 24

Evaluation: perplexity

• “A, B, C, D, E, F, G…Z”: – perplexity is 26

• “Alpha, bravo, charlie, delta…yankee, zulu”: – perplexity is 26

• Perplexity measures language model difficulty, not acoustic difficulty.

Page 25

Perplexity: Math

• Perplexity is geometric average inverse probability:

  Perplexity = [ ∏_{i=1..n} P(w_i | w_1 … w_{i-1}) ]^(−1/n)

• Imagine model: “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), 30,000 names (1 in 120,000 each)
• Imagine data: All 30,004 equally likely
• Example:

  [ (1/4) × (1/4) × (1/4) × (1/120,000)^30,000 ]^(−1/30,004) ≈ 119,829

• Perplexity of test data, given model, is 119,829
• Remarkable fact: the true model for data has the lowest possible perplexity
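As a sanity check on the arithmetic, here is a small illustrative Python sketch (not part of the slides) that computes perplexity as the geometric-average inverse probability, working in log space to avoid underflow; it reproduces the call-routing example above.

```python
import math

def perplexity(log_probs):
    """Perplexity = exp(-(1/n) * sum_i log P(w_i | history))."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Model: operator / technical support / sales at 1/4 each, 30,000 names at 1/120,000 each.
# Test data: each of the 30,004 outcomes occurs exactly once.
log_probs = [math.log(1 / 4)] * 3 + [math.log(1 / 120000)] * 30000
print(perplexity(log_probs))   # ~119,830, i.e. the slides' 119,829 up to rounding
```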

Page 26

Perplexity: Math

• Imagine model: “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), 30,000 names (1 in 120,000 each)
• Imagine data: All 30,004 equally likely
• Can compute three different perplexities
  – Model (ignoring test data): perplexity 54
  – Test data (ignoring model): perplexity 30,004
  – Model on test data: perplexity 119,829
• When we say perplexity, we mean “model on test”
• Remarkable fact: the true model for data has the lowest possible perplexity

Page 27

Perplexity: Is lower better?

• Remarkable fact: the true model for data has the lowest possible perplexity

• The lower the perplexity, the closer we are to the true model.
• Typically, perplexity correlates well with speech recognition word error rate
  – Correlates better when both models are trained on the same data
  – Doesn’t correlate well when training data changes

Page 28

Perplexity: The Shannon Game

• Ask people to guess the next letter, given context. Compute perplexity.

– (when we get to entropy, the “100” column corresponds to the “1 bit per character” estimate)

Char n-gram   Low char   Upper char   Low word    Upper word
1             9.1        16.3         191,237     4,702,511
5             3.2        6.5          653         29,532
10            2.0        4.3          45          2,998
15            2.3        4.3          97          2,998
100           1.5        2.5          10          142

Page 29

Evaluation: Cross Entropy

• Entropy = log2 perplexity

Should be called “cross-entropy of model on test data.” Remarkable fact: entropy is the average number of bits per word required to encode the test data using this probability model and an optimal coder.

  Cross-entropy = (1/n) Σ_{i=1..n} log2 [ 1 / P(w_i | w_1 … w_{i-1}) ]

Page 30

Real Overview

• Basics: probability, language model definition
• Evaluation
• Smoothing
• More techniques
  – Caching
  – Skipping
  – Clustering
  – Sentence-mixture models
  – Structured language models
• Tools

Page 31

Smoothing: None

• Called Maximum Likelihood estimate.
• Lowest perplexity trigram on training data.
• Terrible on test data: if there are no occurrences of C(xyz), the probability is 0.

  P_ML(z | xy) = C(xyz) / Σ_w C(xyw) = C(xyz) / C(xy)

Page 32

Smoothing: Add One

• What is P(sing|nuts)? Zero? Leads to infinite perplexity!
• Add-one smoothing (V = vocabulary size):

  P(z | xy) = (C(xyz) + 1) / (C(xy) + V)

• Works very badly. DO NOT DO THIS
• Add-delta smoothing:

  P(z | xy) = (C(xyz) + δ) / (C(xy) + δV)

• Still very bad. DO NOT DO THIS

Page 33

Smoothing: Simple Interpolation

• Trigram is very context-specific, very noisy
• Unigram is context-independent, smooth
• Interpolate trigram, bigram, unigram for the best combination
• Find 0 < λ < 1 by optimizing on “held-out” data
• Almost good enough

  P_interp(z | xy) = λ C(xyz)/C(xy) + μ C(yz)/C(y) + (1 − λ − μ) C(z)/C(•)
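A minimal sketch of this interpolation in Python (illustrative; the weights lam and mu are hand-picked placeholders rather than values optimized on held-out data, which the next slides discuss).

```python
from collections import Counter

def interpolated_trigram(tokens, lam=0.6, mu=0.3):
    """P(z | x y) = lam*C(xyz)/C(xy) + mu*C(yz)/C(y) + (1-lam-mu)*C(z)/C(.)"""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens)
    total = len(tokens)

    def p(z, x, y):
        p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0
        p2 = bi[(y, z)] / uni[y] if uni[y] else 0.0
        p1 = uni[z] / total
        return lam * p3 + mu * p2 + (1 - lam - mu) * p1

    return p

p = interpolated_trigram("and nothing but the truth and nothing but lies".split())
print(p("the", "nothing", "but"))   # mixes trigram, bigram and unigram evidence
print(p("sing", "nothing", "but"))  # still 0 here only because "sing" is unseen everywhere
```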

Page 34

Smoothing: Finding parameter values

• Split data into training, “held out”, test
• Try lots of different values for λ on heldout data, pick best
• Test on test data
• Sometimes, can use tricks like “EM” (expectation maximization) to find the λ values
• Goodman suggests using a generalized search algorithm, “Powell search” – see Numerical Recipes in C

Page 35

An Iterative Method

• Initialization: Pick arbitrary/random values for λ1, λ2, λ3
• Step 1: Calculate the following quantities (the expected count each component earns on the held-out data), where P1 = C(xyz)/C(xy), P2 = C(yz)/C(y), P3 = C(z)/C(•) are the ML estimates at each held-out position:

  c_j = Σ over held-out positions of  λ_j P_j / (λ1 P1 + λ2 P2 + λ3 P3)

• Step 2: Re-estimate λ's as  λ_j = c_j / (c1 + c2 + c3)
• Step 3: If λ's have not converged, go to Step 1.
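The loop above can be sketched in a few lines; this is the standard EM-style re-estimation for mixture weights, written under the assumption that p1, p2, p3 are the trigram, bigram and unigram ML probabilities evaluated at each held-out position (illustrative code, not from the lecture).

```python
def reestimate_lambdas(heldout, lambdas, iterations=20):
    """heldout: list of (p1, p2, p3) = (trigram, bigram, unigram) ML probabilities
    for each held-out token; lambdas: initial [l1, l2, l3]."""
    for _ in range(iterations):
        c = [0.0, 0.0, 0.0]
        for probs in heldout:
            denom = sum(l * p for l, p in zip(lambdas, probs))
            if denom == 0.0:
                continue
            for i in range(3):
                c[i] += lambdas[i] * probs[i] / denom   # expected count credited to model i
        total = sum(c)
        lambdas = [ci / total for ci in c]
    return lambdas

# Toy held-out data: one (p1, p2, p3) triple per token.
heldout = [(0.5, 0.2, 0.01), (0.0, 0.1, 0.02), (0.0, 0.0, 0.05)]
print(reestimate_lambdas(heldout, [1 / 3, 1 / 3, 1 / 3]))
```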

Page 36

Smoothing digression: Splitting data

• How much data for training, heldout, test?
• Some people say things like “1/3, 1/3, 1/3” or “80%, 10%, 10%”. They are WRONG.
• Heldout should have (at least) 100-1000 words per parameter.
• Answer: enough test data to be statistically significant. (1000s of words perhaps)

Page 37

Smoothing digression: Splitting data

• Be careful: WSJ data divided into stories. Some are easy, with lots of numbers, financial, others much harder. Use enough to cover many stories.

• Be careful: Some stories repeated in data sets.

• Can take data from end – better – or randomly from within training.

Page 38

Smoothing: Jelinek-Mercer

• Simple interpolation:

  P_smooth(z | xy) = λ C(xyz)/C(xy) + (1 − λ) P_smooth(z | y)

• Better: smooth a little after “The Dow”, lots after “Adobe acquired” – let λ depend on how often the context was seen:

  P_smooth(z | xy) = λ(C(xy)) C(xyz)/C(xy) + (1 − λ(C(xy))) P_smooth(z | y)

Page 39

Smoothing: Jelinek-Mercer continued

• Find λs by cross-validation on held-out data
• Also called “deleted interpolation”

  P_smooth(z | xy) = λ(C(xy)) C(xyz)/C(xy) + (1 − λ(C(xy))) P_smooth(z | y)

Page 40

Smoothing: Good Turing

• Invented during WWII by Alan Turing (and Good?), later published by Good. Frequency estimates were needed within the Enigma code-breaking effort.

• Define n_r = number of elements x for which Count(x) = r.
• Modified count for any x with Count(x) = r and r > 0: (r+1) n_{r+1} / n_r
• Leads to the following estimate of “missing mass”: n_1 / N, where N is the size of the sample. This is the estimate of the probability of seeing a new element x on the (N+1)'th draw.

Page 41

Smoothing: Good Turing

• Imagine you are fishing
• You have caught 10 Carp, 3 Cod, 2 tuna, 1 trout, 1 salmon, 1 eel.

• How likely is it that next species is new? 3/18

• How likely is it that next is tuna? Less than 2/18

Page 42

Smoothing: Good Turing

• How many species (words) were seen once? Estimate for how many are unseen.

• All other estimates are adjusted (down) to give probabilities for unseen

  r* = (r + 1) n_{r+1} / n_r        (adjusted count for things seen r times)

  p_0 = n_1 / N                     (probability of seeing something new)

Page 43

Smoothing: Good Turing Example

• 10 Carp, 3 Cod, 2 tuna, 1 trout, 1 salmon, 1 eel.
• How likely is new data (p_0)? Let n_1 be the number of species occurring once (3), N be the total (18). p_0 = n_1/N = 3/18
• How likely is eel? Use the adjusted count 1*:
  – n_1 = 3, n_2 = 1
  – 1* = 2 × n_2/n_1 = 2 × 1/3 = 2/3
  – P(eel) = 1*/N = (2/3)/18 = 1/27
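The fishing example can be checked mechanically with a few lines of illustrative Python applying r* = (r+1)·n_{r+1}/n_r and p_0 = n_1/N to the catch above.

```python
from collections import Counter

catch = {"carp": 10, "cod": 3, "tuna": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(catch.values())           # 18 fish in total
n = Counter(catch.values())       # n[r] = number of species seen exactly r times

p0 = n[1] / N                     # mass reserved for unseen species = 3/18
one_star = (1 + 1) * n[2] / n[1]  # adjusted count for species seen once = 2/3
p_eel = one_star / N              # (2/3)/18 = 1/27

print(p0, one_star, p_eel)        # 0.1666..., 0.6666..., 0.0370...
```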

Page 44

Smoothing: Katz

• Use Good-Turing estimate C*
• Works pretty well.
• Not good for 1 counts
• α(xy) is calculated so probabilities sum to 1

  P_Katz(z | xy) = C*(xyz)/C(xy)           if C(xyz) > 0
                 = α(xy) P_Katz(z | y)     otherwise

  α(xy) = [ 1 − Σ_{z: C(xyz)>0} C*(xyz)/C(xy) ] / Σ_{z: C(xyz)=0} P_Katz(z | y)

Page 45

Smoothing: Absolute Discounting

• Assume fixed discount

• Works pretty well, easier than Katz.

• Not so good for 1 counts

  P_absolute(z | xy) = (C(xyz) − D)/C(xy)         if C(xyz) > 0
                     = α(xy) P_absolute(z | y)    otherwise

Page 46

Smoothing: Interpolated Absolute Discount

• Backoff: ignore bigram if have trigram

  P_absolute(z | xy) = (C(xyz) − D)/C(xy)         if C(xyz) > 0
                     = α(xy) P_absolute(z | y)    otherwise

• Interpolated: always combine bigram, trigram

  P_absinterp(z | xy) = (C(xyz) − D)/C(xy) + λ(xy) P_absinterp(z | y)
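A sketch of the interpolated variant in Python (illustrative; the discount D = 0.5 and the choice λ(xy) = D·|{z : C(xyz) > 0}|/C(xy), which makes the distribution sum to one over seen words, are assumptions, and the model backs off to a plain ML bigram for brevity).

```python
from collections import Counter, defaultdict

def abs_interp_model(tokens, D=0.5):
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens)
    followers = defaultdict(set)              # distinct words z seen after context xy
    for (x, y, z) in tri:
        followers[(x, y)].add(z)

    def p(z, x, y):
        p_bigram = bi[(y, z)] / uni[y] if uni[y] else 1 / len(uni)
        cxy = bi[(x, y)]
        if cxy == 0:
            return p_bigram                   # unseen context: use the lower-order model
        discounted = max(tri[(x, y, z)] - D, 0) / cxy
        lam = D * len(followers[(x, y)]) / cxy   # mass freed up by the discounting
        return discounted + lam * p_bigram

    return p

p = abs_interp_model("and nothing but the truth and nothing but lies".split())
print(p("the", "nothing", "but"), p("lies", "nothing", "but"))
```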

Page 47

Smoothing: Interpolated Multiple Absolute Discounts

• One discount is good

• Different discounts for different counts

• Multiple discounts: for 1 count, 2 counts, >2

  P_absinterp(z | xy) = (C(xyz) − D)/C(xy) + λ(xy) P_absinterp(z | y)

  P_absinterp(z | xy) = (C(xyz) − D_{C(xyz)})/C(xy) + λ(xy) P_absinterp(z | y)

Page 48

Smoothing: Kneser-Ney

P(Francisco | eggplant) vs P(stew | eggplant)

• “Francisco” is common, so backoff, interpolated methods say it is likely

• But it only occurs in context of “San”

• “Stew” is common, and in many contexts

• Weight backoff by number of contexts word occurs in

Page 49

Smoothing: Kneser-Ney

• Interpolated

• Absolute-discount

• Modified backoff distribution

• Consistently best technique

  P_KN(z | xy) = (C(xyz) − D)/C(xy) + λ(xy) × |{w : C(wyz) > 0}| / Σ_v |{w : C(wyv) > 0}|
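The modified backoff distribution is the only unusual ingredient, so the illustrative sketch below computes just the continuation term |{w : C(wyz) > 0}| / Σ_v |{w : C(wyv) > 0}|; a full interpolated Kneser-Ney model would combine it with the discounted trigram term as in the formula above (corpus and names are invented).

```python
from collections import defaultdict

def kn_continuation(tokens):
    """P_cont(z | y) = |{w : C(w y z) > 0}| / sum_v |{w : C(w y v) > 0}|."""
    left = defaultdict(set)                   # (y, z) -> set of distinct left words w
    for w, y, z in zip(tokens, tokens[1:], tokens[2:]):
        left[(y, z)].add(w)

    def p_cont(z, y):
        num = len(left[(y, z)])
        den = sum(len(ws) for (yy, _), ws in left.items() if yy == y)
        return num / den if den else 0.0

    return p_cont

tokens = "san francisco is in california and francisco is a name".split()
p_cont = kn_continuation(tokens)
# Two distinct left contexts ("san", "and") precede "francisco is", and "is" is the
# only word ever following "francisco" here, so P_cont("is" | "francisco") = 1.0.
print(p_cont("is", "francisco"))
```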

Page 50

Smoothing: Chart

Page 51

Real Overview

• Basics: probability, language model definition
• Evaluation
• Smoothing
• More techniques
  – Caching
  – Skipping
  – Clustering
  – Sentence-mixture models
  – Structured language models
• Tools

Page 52

Caching

• If you say something, you are likely to say it again later.

• Interpolate trigram with cache

  P_cache(z | history) = C(z ∈ history) / length(history)

  P(z | history) = λ P_smooth(z | xy) + (1 − λ) P_cache(z | history)
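A minimal illustrative sketch of a unigram cache mixed with a background model; the constant background probability and the 0.9/0.1 weights are placeholders, not values from the slides.

```python
from collections import Counter

def cached_prob(z, history, p_background, lam=0.9):
    """P(z | history) = lam * P_background(z) + (1 - lam) * C(z in history) / len(history)."""
    cache = Counter(history)
    p_cache = cache[z] / len(history) if history else 0.0
    return lam * p_background(z) + (1 - lam) * p_cache

background = lambda z: 1e-4                        # stand-in for a smoothed trigram P(z | xy)
history = "i swear to tell the truth the whole truth".split()
print(cached_prob("truth", history, background))   # boosted: "truth" is in the cache
print(cached_prob("soup", history, background))    # essentially just the background probability
```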

Page 53

Caching: Real Life

• Someone says “I swear to tell the truth”
• System hears “I swerve to smell the soup”
• Cache remembers!
• Person says “The whole truth”, and, with cache, system hears “The whole soup.” – errors are locked in.
• Caching works well when users correct as they go, poorly or even hurts without correction.

Page 54

Caching: Variations

• N-gram caches:

  P_cache(z | history) = C(xyz ∈ history) / C(xy ∈ history)

• Conditional n-gram cache: use n-gram cache only if xy ∈ history
• Remove function-words from cache, like “the”, “to”

Page 55

5-grams

• Why stop at 3-grams?

• If P(z|…rstuvwxy) ≈ P(z|xy) is good, then
• P(z|…rstuvwxy) ≈ P(z|vwxy) is better!

• Very important to smooth well

• Interpolated Kneser-Ney works much better than Katz on 5-gram, more than on 3-gram

Page 56

N-gram versus smoothing algorithm

n-gram   Katz   Kneser-Ney
2        134    132
3        80     74
4        75     65
5        78     62

Page 57

Speech recognizer mechanics

• Keep many hypotheses alive

• Find acoustic, language model scores
  – P(acoustics | truth) = .3, P(truth | tell the) = .1
  – P(acoustics | soup) = .2, P(soup | smell the) = .01

  “…tell the” (.01)                      “…smell the” (.01)
  “…tell the truth” (.01 × .3 × .1)      “…smell the soup” (.01 × .2 × .01)

Page 58

Speech recognizer slowdowns

• Speech recognizer uses tricks (dynamic programming) to merge hypotheses

Trigram:
  “…tell the”
  “…smell the”

Fivegram:
  “…swear to tell the”
  “…swerve to smell the”
  “swear too tell the”
  “swerve too smell the”
  “swerve to tell the”
  “swerve too tell the”

Page 59

Speech recognizer vs. n-gram

• Recognizer can threshold out bad hypotheses

• Trigram works so much better than bigram, better thresholding, no slow-down

• 4-gram, 5-gram start to become expensive

Page 60

Real Overview

• Basics: probability, language model definition
• Evaluation
• Smoothing
• More techniques
  – Caching
  – Skipping
  – Clustering
  – Sentence-mixture models
  – Structured language models
• Tools

Page 61

Skipping

• P(z|…rstuvwxy) ≈ P(z|vwxy)
• Why not P(z|v_xy) – a “skipping” n-gram – which skips the value of the 3-back word?
• Example: P(time | show John a good) -> P(time | show ____ a good)
• P(z|…rstuvwxy) ≈ λ P(z|vwxy) + μ P(z|vw_y) + (1 − λ − μ) P(z|v_xy)
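An illustrative sketch of the skipping mixture (the component models and the weights lam, mu are placeholders; in practice each component would itself be a smoothed n-gram estimate).

```python
def skipping_prob(z, v, w, x, y, p_full, p_skip_x, p_skip_w, lam=0.5, mu=0.3):
    """P(z | v w x y) ~ lam*P(z | v w x y) + mu*P(z | v w _ y) + (1-lam-mu)*P(z | v _ x y),
    where '_' marks a skipped position."""
    return (lam * p_full(z, v, w, x, y)
            + mu * p_skip_x(z, v, w, y)
            + (1 - lam - mu) * p_skip_w(z, v, x, y))

# Placeholder component models; each would really be a smoothed n-gram estimate.
p_full   = lambda z, v, w, x, y: 0.0   # "show John a good ___" never seen in training
p_skip_x = lambda z, v, w, y: 0.2      # "show John _ good ___" was seen
p_skip_w = lambda z, v, x, y: 0.1      # "show _ a good ___" was seen
print(skipping_prob("time", "show", "John", "a", "good", p_full, p_skip_x, p_skip_w))
```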

Page 62

Real Overview

• Basics: probability, language model definition
• Evaluation
• Smoothing
• More techniques
  – Caching
  – Skipping
  – Clustering
  – Sentence-mixture models
  – Structured language models
• Tools

Page 63

Clustering

• CLUSTERING = CLASSES (same thing)

• What is P(“Tuesday | party on”)

• Similar to P(“Monday | party on”)

• Similar to P(“Tuesday | celebration on”)

• Put words in clusters:
  – WEEKDAY = Sunday, Monday, Tuesday, …
  – EVENT = party, celebration, birthday, …

Page 64

Clustering overview

• Major topic, useful in many fields

• Kinds of clustering
  – Predictive clustering
  – Conditional clustering
  – IBM-style clustering
• How to get clusters
  – Be clever or it takes forever!

Page 65

Predictive clustering

• Let “z” be a word, “Z” be its cluster
• One cluster per word: hard clustering
  – WEEKDAY = Sunday, Monday, Tuesday, …
  – MONTH = January, February, April, May, June, …
• P(z|xy) = P(Z|xy) × P(z|xyZ)
• P(Tuesday | party on) = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY)
• Psmooth(z|xy) ≈ Psmooth(Z|xy) × Psmooth(z|xyZ)
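An illustrative sketch of the predictive-clustering decomposition with toy counts; the cluster map and the counts are invented for the example, and no smoothing is applied, which is why the unseen word still gets probability zero here.

```python
from collections import Counter

cluster_of = {"Tuesday": "WEEKDAY", "Wednesday": "WEEKDAY", "dinner": "EVENT"}

def predictive_cluster_prob(z, context, counts, context_counts):
    """P(z | context) = P(cluster(z) | context) * P(z | context, cluster(z))."""
    Z = cluster_of[z]
    c_context = context_counts[context]
    c_cluster = sum(c for (ctx, w), c in counts.items()
                    if ctx == context and cluster_of.get(w) == Z)
    if c_context == 0 or c_cluster == 0:
        return 0.0
    return (c_cluster / c_context) * (counts[(context, z)] / c_cluster)

counts = Counter({(("party", "on"), "Wednesday"): 10, (("party", "on"), "Tuesday"): 0})
context_counts = Counter({("party", "on"): 10})
# P(WEEKDAY | party on) is high (10/10), but the within-cluster ML estimate for
# "Tuesday" is still 0/10; the slides fix this by smoothing each factor, e.g.
# backing the second factor off to P(Tuesday | on WEEKDAY).
print(predictive_cluster_prob("Tuesday", ("party", "on"), counts, context_counts))
```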

Page 66

Predictive clustering example

• Find P(Tuesday | party on)
  – Psmooth(WEEKDAY | party on) × Psmooth(Tuesday | party on WEEKDAY)
  – C(party on Tuesday) = 0
  – C(party on Wednesday) = 10
  – C(arriving on Tuesday) = 10
  – C(on Tuesday) = 100
• Psmooth(WEEKDAY | party on) is high
• Psmooth(Tuesday | party on WEEKDAY) backs off to Psmooth(Tuesday | on WEEKDAY)

Page 67

Conditional clustering

• P(z|xy) = P(z|xXyY)
• P(Tuesday | party on) = P(Tuesday | party EVENT on PREPOSITION)
• Psmooth(z|xy) ≈ Psmooth(z|xXyY)
  = λ PML(Tuesday | party EVENT on PREPOSITION)
  + μ PML(Tuesday | EVENT on PREPOSITION)
  + ν PML(Tuesday | on PREPOSITION)
  + κ PML(Tuesday | PREPOSITION)
  + (1 − λ − μ − ν − κ) PML(Tuesday)

Page 68

Conditional clustering example

  λ P(Tuesday | party EVENT on PREPOSITION) + μ P(Tuesday | EVENT on PREPOSITION) + ν P(Tuesday | on PREPOSITION) + κ P(Tuesday | PREPOSITION) + (1 − λ − μ − ν − κ) P(Tuesday)

= λ P(Tuesday | party on) + μ P(Tuesday | EVENT on) + ν P(Tuesday | on) + κ P(Tuesday | PREPOSITION) + (1 − λ − μ − ν − κ) P(Tuesday)

Page 69

Combined clustering

• P(z|xy) ≈ Psmooth(Z|xXyY) × Psmooth(z|xXyYZ)

  P(Tuesday | party on) ≈ Psmooth(WEEKDAY | party EVENT on PREPOSITION) × Psmooth(Tuesday | party EVENT on PREPOSITION WEEKDAY)

• Much larger than unclustered, somewhat lower perplexity.

Page 70

IBM Clustering

• P(z|xy) ≈ Psmooth(Z|XY) × P(z|Z)
• P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY)
• Small, very smooth, mediocre perplexity
• P(z|xy) ≈ λ Psmooth(z|xy) + (1 − λ) Psmooth(Z|XY) × P(z|Z)
• Bigger, better than no clusters, better than combined clustering.
• Improvement: use P(z|XYZ) instead of P(z|Z)

Page 71

Clustering by Position

• “A” and “AN”: same cluster or different cluster?

• Same cluster for predictive clustering

• Different clusters for conditional clustering

• Small improvement by using different clusters for conditional and predictive

Page 72

Clustering: how to get them

• Build them by hand
  – Works ok when almost no data
• Part of Speech (POS) tags
  – Tends not to work as well as automatic
• Automatic Clustering
  – Swap words between clusters to minimize perplexity

Page 73

Clustering: automatic

• Minimize perplexity of P(z|Y)
• Mathematical tricks speed it up
• Use top-down splitting, not bottom-up merging!

Page 74

Real Overview

• Basics: probability, language model definition
• Evaluation
• Smoothing
• More techniques
  – Caching
  – Skipping
  – Clustering
  – Sentence-mixture models
  – Structured language models
• Tools

Page 75

Sentence Mixture Models

• Lots of different sentence types:
  – Numbers (The Dow rose one hundred seventy three points)
  – Quotations (Officials said “quote we deny all wrong doing ”quote)
  – Mergers (AOL and Time Warner, in an attempt to control the media and the internet, will merge)
• Model each sentence type separately

Page 76

Sentence Mixture Models

• Roll a die to pick sentence type s_k with probability λ_k
• Probability of sentence, given s_k:

  P(sentence | s_k) = ∏_{i=1..n} P(w_i | w_{i-2} w_{i-1}, s_k)

• Probability of sentence across types:

  P(sentence) = Σ_{k=1..m} λ_k ∏_{i=1..n} P(w_i | w_{i-2} w_{i-1}, s_k)
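An illustrative sketch of the mixture computation in log space; the two per-type models are stand-in functions rather than trained trigram components, and the weights are placeholders.

```python
import math

def sentence_mixture_logprob(words, lambdas, type_models):
    """log P(sentence) = log sum_k lambda_k * prod_i P(w_i | w_{i-2} w_{i-1}, s_k)."""
    per_type = []
    for lam, model in zip(lambdas, type_models):
        logp = math.log(lam)
        for i, w in enumerate(words):
            context = tuple(words[max(0, i - 2):i])
            logp += math.log(model(w, context))
        per_type.append(logp)
    m = max(per_type)                                      # log-sum-exp for stability
    return m + math.log(sum(math.exp(lp - m) for lp in per_type))

# Two stand-in sentence types: one likes digits, one is generic.
numbers_model = lambda w, ctx: 0.05 if w.isdigit() else 0.001
general_model = lambda w, ctx: 0.01
print(sentence_mixture_logprob("the dow rose 173 points".split(),
                               [0.3, 0.7], [numbers_model, general_model]))
```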

Page 77

Sentence Model Smoothing

• Each topic model is smoothed with overall model.

• Sentence mixture model is smoothed with overall model (sentence type 0).

  P(sentence) = Σ_{k=0..m} λ_k ∏_{i=1..n} [ β P(w_i | w_{i-2} w_{i-1}, s_k) + (1 − β) P(w_i | w_{i-2} w_{i-1}) ]

Page 78

Sentence Mixture Results

[Chart: Sentence mixture models (10,000,000 words training). Perplexity (roughly 108–126) vs. log2 number of mixtures (0–7); the sentence-mixture curve falls below the baseline, about a 13% perplexity reduction.]

Page 79

Sentence Clustering

• Same algorithm as word clustering

• Assign each sentence to a type, sk

• Minimize perplexity of P(z|sk ) instead of P(z|Y)

Page 80

Real Overview

• Basics: probability, language model definition
• Evaluation
• Smoothing
• More techniques
  – Caching
  – Skipping
  – Clustering
  – Sentence-mixture models
  – Structured language models
• Tools

Page 81

Structured Language Model

“The contract ended with a loss of 7 cents after”

Page 82

How to get structure data?

• Use a Treebank (a collection of sentences with structure hand annotated) like Wall Street Journal, Penn Tree Bank.
• Problem: need a treebank.
• Or – use a treebank (WSJ) to train a parser; then parse new training data (e.g. Broadcast News)
• Re-estimate parameters to get lower perplexity models.

Page 83

Structured Language Models

• Use structure of language to detect long distance information

• Promising results

• But: time consuming
  – Replacement: 5-grams, skipping, capture similar information.

Page 84

Real Overview

• Basics: probability, language model definition
• Evaluation
• Smoothing
• More techniques
  – Caching
  – Skipping
  – Clustering
  – Sentence-mixture models
  – Structured language models
• Tools

Page 85

Tools: CMU Language Modeling Toolkit

• Can handle bigram, trigrams, more
• Can handle different smoothing schemes
• Many separate tools – output of one tool is input to next: easy to use
• Free for research purposes
  – http://www.speech.cs.cmu.edu/SLM_info.html

Page 86

Tools: SRI Language Modeling Toolkit

• More powerful than CMU toolkit

• Can handle clusters, lattices, n-best lists, hidden tags
• Free for research use
  – http://www.speech.sri.com/projects/srilm

Page 87

IRSTLM Toolkit

• Friendlier on the copyright/licensing side
• Recommended by the standard SMT package, Moses.

• IRSTLM Toolkit– http://hlt.fbk.eu/en/irstlm

– http://sourceforge.net/projects/irstlm


Page 88

Tools: Text normalization

• What about “$3,100,000”? Convert to “three million one hundred thousand dollars”, etc.

• Need to do this for dates, numbers, maybe abbreviations.

• Some text-normalization tools come with Wall Street Journal corpus, from LDC (Linguistic Data Consortium)

• Not much available
• Write your own (use Perl!)

Page 89

Small enough

• Real language models are often huge
• 5-gram models typically larger than the training data
  – Consider Google’s web language model
• Use count-cutoffs (eliminate parameters with fewer counts) or, better,
• Use Stolcke pruning – finds counts that contribute least to perplexity reduction
  – P(City | New York) ≈ P(City | York)
  – P(Friday | God it’s) ≈ P(Friday | it’s)
• Remember, Kneser-Ney helped most when there were lots of 1 counts

Page 90

Some Experiments

• Goodman re-implemented all techniques
• Trained on 260,000,000 words of WSJ
• Optimize parameters on heldout
• Test on separate test section
• Some combinations extremely time-consuming (days of CPU time)
  – Don’t try this at home, or in anything you want to ship
• Rescored N-best lists to get results
  – Maximum possible improvement from 10% word error rate absolute to 5%

Page 91

Overall Results: Perplexity

Page 92

Overall Results: Word Accuracy

Accuracy rates – all, no punctuation

            Katz+    KN+      All-cache-
Accuracy    90.31    90.4     91.11
%improve             45.02%   8.26%
skip        1.03%    2.40%    1.24%
5-gram      -0.52%   2.81%    1.46%
sentence    -0.41%   -0.51%   1.35%
cluster     1.55%    3.44%
cache       -2.99%   -1.35%
KN          0.93%    7.54%

Page 93

Conclusions

• Use trigram models

• Use any reasonable smoothing algorithm (Katz, Kneser-Ney)

• Use caching if you have correction information
• Clustering, sentence mixtures, skipping are not usually worth the effort

Page 94

References

• Joshua Goodman’s web page (smoothing, introduction, more)
  – http://www.research.microsoft.com/~joshuago
  – Contains smoothing technical report: good introduction to smoothing and lots of details too.
  – Will contain journal paper of this talk, updated results.
• Books (all are OK, none focus on language models)
  – Speech and Language Processing, by Dan Jurafsky and Jim Martin (especially Chapter 6)
  – Foundations of Statistical Natural Language Processing, by Chris Manning and Hinrich Schütze
  – Statistical Methods for Speech Recognition, by Frederick Jelinek

Page 95

References

• Structured Language Models
  – Ciprian Chelba’s web page: http://www.clsp.jhu.edu/people/chelba/
• Maximum Entropy
  – Roni Rosenfeld’s home page and thesis: http://www.cs.cmu.edu/~roni/
• Stolcke Pruning
  – A. Stolcke (1998). Entropy-based pruning of backoff language models. Proc. DARPA Broadcast News Transcription and Understanding Workshop, pp. 270-274, Lansdowne, VA. NOTE: get the corrected version from http://www.speech.sri.com/people/stolcke

Page 96

References: Further Reading

• “An Empirical Study of Smoothing Techniques for Language Modeling”. Stanley Chen and Joshua Goodman. 1998. Harvard Computer Science Technical Report TR-10-98.
  – (Gives a very thorough evaluation and description of a number of methods.)
• “On the Convergence Rate of Good-Turing Estimators”. David McAllester and Robert E. Schapire. In Proceedings of COLT 2000.
  – (A pretty technical paper, giving confidence intervals on Good-Turing estimators. Theorems 1, 3 and 9 are useful in understanding the motivation for Good-Turing discounting.)