
Page 1: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Maximum Entropy
Advanced Statistical Methods in NLP

Ling 572
January 31, 2012

Page 2: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

2

Maximum Entropy

Page 3: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Roadmap

Maximum entropy:
Overview
Generative & discriminative models: Naïve Bayes vs. MaxEnt
Maximum entropy principle
Feature functions
Modeling
Decoding
Summary

Page 4: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

4

Maximum Entropy

“MaxEnt”:
Popular machine learning technique for NLP

First uses in NLP circa 1996 – Rosenfeld, Berger

Applied to a wide range of tasks:
sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.


Page 7: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

7

Readings & Comments

Several readings:
(Berger, 1996), (Ratnaparkhi, 1997)
(Klein & Manning, 2003): tutorial

Note: some of these are very ‘dense’.
Don’t spend huge amounts of time on every detail; take a first pass before class and review after lecture.

Going forward: the techniques become more complex.
Goal: understand the basic model and concepts. Training is especially complex – we’ll discuss it, but not implement it.

Page 8: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

8

Notation Note

Notation is not entirely consistent across the readings:

We’ll use: input = x; output = y; pair = (x, y)
(consistent with Berger, 1996)

Ratnaparkhi, 1996: input = h; output = t; pair = (h, t)

Klein & Manning, 2003: input = d; output = c; pair = (c, d)


Page 14: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

14

Joint vs Conditional Models

Assuming some training data {(x,y)}, we need to learn a model Θ such that, given a new x, we can predict the label y.

Different types of models:

Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ).
Most models so far: n-gram, Naïve Bayes, HMM, etc.
Conceptually easy to compute weights: relative frequency (see the sketch below).

Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ).
Models going forward: MaxEnt, SVM, CRF, …
Computing the weights is more complex.
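To make “relative frequency” concrete, here is a minimal sketch (not from the slides) of estimating a joint model by counting; the tiny labeled corpus is invented for illustration.

```python
from collections import Counter

# Hypothetical toy corpus of (word, class) pairs
data = [("cuts", "politics"), ("budget", "politics"), ("cuts", "politics"),
        ("game", "sports"), ("score", "sports")]

# Joint model P(x, y) by relative frequency: count / total
joint = Counter(data)
total = sum(joint.values())
P_xy = {pair: c / total for pair, c in joint.items()}

# Class prior P(y) and class-conditional P(x|y), also by relative frequency
class_counts = Counter(y for _, y in data)
P_y = {y: c / total for y, c in class_counts.items()}
P_x_given_y = {(x, y): c / class_counts[y] for (x, y), c in joint.items()}

print(P_xy[("cuts", "politics")])         # 2/5 = 0.4
print(P_x_given_y[("cuts", "politics")])  # 2/3 ≈ 0.667
```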

Page 15: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Naïve Bayes Model

Naïve Bayes Model assumes features f are independent of each other, given the class C

[Diagram: class node c with arrows to feature nodes f1, f2, f3, …, fk]


Page 20: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

20

Naïve Bayes Model

Makes the assumption of conditional independence of the features given the class.

However, this is generally unrealistic:

P(“cuts”|politics) = p_cuts

What about P(“cuts”|politics,“budget”)? Is it still = p_cuts?

We would like a model that doesn’t make this assumption.

Page 21: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Model Parameters

Our model: c* = argmax_c P(c) Π_j P(f_j|c)

Types of parameters – two:
P(c): class priors
P(f_j|c): class-conditional feature probabilities

|C| + |V|·|C| parameters in total, if the features are the words in vocabulary V
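A minimal sketch of decoding with these two parameter types, c* = argmax_c P(c) Π_j P(f_j|c); the class names, priors, and likelihoods below are made-up toy values, not from the lecture.

```python
import math

# Hypothetical toy parameters (made up for illustration)
priors = {"politics": 0.6, "sports": 0.4}                        # P(c)
likelihood = {                                                    # P(f_j | c)
    "politics": {"cuts": 0.02, "budget": 0.03, "game": 0.001},
    "sports":   {"cuts": 0.001, "budget": 0.002, "game": 0.05},
}

def nb_decode(words, priors, likelihood):
    """c* = argmax_c P(c) * prod_j P(f_j | c), computed in log space."""
    best_c, best_score = None, float("-inf")
    for c, p_c in priors.items():
        score = math.log(p_c)
        for w in words:
            score += math.log(likelihood[c].get(w, 1e-6))  # tiny floor for unseen words
        if score > best_score:
            best_c, best_score = c, score
    return best_c

print(nb_decode(["budget", "cuts"], priors, likelihood))  # -> 'politics'
```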

Page 22: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

22

Weights in Naïve Bayes

        c1          c2          c3          …    ck
f1      P(f1|c1)    P(f1|c2)    P(f1|c3)    …    P(f1|ck)
f2      P(f2|c1)    P(f2|c2)    …
…       …
f|V|    P(f|V| | c1)


Page 28: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Weights in Naïve Bayes and Maximum Entropy

Naïve Bayes:
P(f|y) are probabilities in [0,1]; these are the weights.
P(y|x) ∝ P(y) Π_j P(f_j|y)

MaxEnt:
Weights are real numbers; any magnitude, any sign.
P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Σ_y' exp(Σ_j λ_j f_j(x,y'))

28


Page 32: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

MaxEnt Overview

Prediction: P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x), where Z(x) = Σ_y' exp(Σ_j λ_j f_j(x,y'))

f_j(x,y): binary feature function, indicating the presence of feature j in instance x with class y

λ_j: feature weights, learned in training

Prediction: compute P(y|x) for each y, pick the highest-scoring y

32
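Below is a small sketch of the prediction step under the formula above. The feature functions, weights, and class names are invented toy values; in practice the λ_j would come from training.

```python
import math

# Hypothetical toy feature functions f_j(x, y): 1 if the condition holds, else 0
def f0(x, y): return 1 if y == "guns" and "rifle" in x else 0
def f1(x, y): return 1 if y == "mideast" and "peace" in x else 0
feature_functions = [f0, f1]

# Toy weights lambda_j (normally learned during training)
lambdas = [1.8, 0.9]

def p_y_given_x(x, y, classes):
    """P(y|x) = exp(sum_j lambda_j f_j(x,y)) / Z(x)."""
    def score(yy):
        return math.exp(sum(l * f(x, yy) for l, f in zip(lambdas, feature_functions)))
    return score(y) / sum(score(yy) for yy in classes)

classes = ["guns", "mideast", "misc"]
x = ["ban", "on", "rifle", "sales"]
pred = max(classes, key=lambda y: p_y_given_x(x, y, classes))
print(pred)  # -> 'guns'
```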

Page 33: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Weights in MaxEnt

        c1    c2    c3    …    ck
f1      λ1    λ8    …
f2      λ2    …
…       …
f|V|    λ6

33


Page 38: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Maximum Entropy Principle

Intuitively, model all that is known, and assume as little as possible about what is unknown

Maximum entropy = minimum commitment

Related to concepts like Occam’s razor

Laplace’s “Principle of Insufficient Reason”: when one has no information to distinguish between the probabilities of two events, the best strategy is to consider them equally likely.

38


Page 43: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Example I (K&M 2003)

Consider a coin flip, with entropy
H(X) = - P(X=H) log P(X=H) - P(X=T) log P(X=T)

What values of P(X=H), P(X=T) maximize H(X)?
P(X=H) = P(X=T) = 1/2
With no prior information, the best guess is a fair coin.

What if you know P(X=H) = 0.3?
Then P(X=T) = 0.7.

43
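A quick numerical check of the claims above (not from the slides): the uniform coin maximizes H(X), and fixing P(X=H) = 0.3 forces P(X=T) = 0.7.

```python
import math

def H(p_heads):
    """Entropy (in bits) of a coin with P(X=H) = p_heads."""
    p = [p_heads, 1.0 - p_heads]
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Entropy is maximized at the uniform distribution ...
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=H)
print(best, H(best))   # -> 0.5 1.0 (one bit)

# ... and once P(X=H) = 0.3 is fixed, the remaining mass must go to tails:
print(H(0.3))          # entropy of the constrained model, with P(X=T) = 0.7
```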


Page 46: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Example II: MT (Berger, 1996)

Task: English→French machine translation; specifically, translating ‘in’.

Suppose we’ve seen ‘in’ translated as: {dans, en, à, au cours de, pendant}

Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1

If there is no other constraint, what is the maxent model?
p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5

46


Page 52: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Example II: MT (Berger, 1996)

What if we find out that the translator uses dans or en 30% of the time?
Constraint: p(dans) + p(en) = 3/10

Now what is the maxent model?
p(dans) = p(en) = 3/20
p(à) = p(au cours de) = p(pendant) = 7/30

What if we also know the translator picks à or dans 50% of the time?
Add a new constraint: p(à) + p(dans) = 0.5
Now what is the maxent model?
Not intuitively obvious…

52
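Since the answer to the last question is not obvious by inspection, here is a sketch that solves it numerically: maximize entropy subject to the three constraints. It assumes scipy is available; the variable ordering [dans, en, à, au cours de, pendant] is my own choice, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)           # avoid log(0)
    return float(np.sum(p * np.log(p)))  # minimizing this maximizes H(p)

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # probabilities sum to 1
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 0.5},  # p(à) + p(dans) = 1/2
]
p0 = np.full(5, 0.2)  # start from the uniform distribution
res = minimize(neg_entropy, p0, bounds=[(0, 1)] * 5,
               constraints=constraints, method="SLSQP")
print(dict(zip(["dans", "en", "à", "au cours de", "pendant"], res.x.round(4))))
```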

Page 53: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

53

Example III: POS (K&M, 2003)


Page 58: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

58

Example III

Problem: too uniform.

What else do we know? Nouns are more common than verbs.
So let fN = {NN, NNS, NNP, NNPS}, and E[fN] = 32/36.

Also, proper nouns are more frequent than common nouns, so E[NNP, NNPS] = 24/36.

Etc.


Page 60: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Maximum Entropy Principle: Summary

Among all probability distributions p in P that satisfy the set of constraints, select the p* that maximizes H(p):

p* = argmax_{p in P} H(p)

Questions:
1) How do we model the constraints?
2) How do we select the distribution?

Page 61: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

MaxEnt Modeling


Page 67: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Basic Approach

Given some training data:

Collect training examples (x,y) in (X,Y), where
x in X: the training data samples
y in Y: the labels (classes) to predict

For example, in text classification:
X: words in documents
Y: text categories: guns, mideast, misc…

Compute P(y|x); select the y with the highest P(y|x) as the classification.

Page 68: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Maximum Entropy Model

Compute P(y|x).

Select the distribution that maximizes entropy subject to the constraints from the training data:

p* = argmax_p H(p)  subject to  E_p[f_j] = E_p̃[f_j] for every feature function f_j

Page 69: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Modeling Components

Feature functions

Computing expectations

Calculating P(y|x) and P(x,y)

Page 70: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Feature Functions


Page 74: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Feature Functions

A feature function is a binary-valued indicator function.

In text classification, j refers to a specific (feature, class) pair, such that the feature is present in x and y is that class:

f_j(x,y) = { 1 if y = “guns” and x includes “rifle”
           { 0 otherwise


Page 79: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Feature Functions

Feature functions can be complex. For example, features could be:

class = y and a conjunction of features,
e.g. class = “guns” and currwd = “rifles” and prevwd = “allow”

class = y and a disjunction of features,
e.g. class = “guns” and (currwd = “rifles” or currwd = “guns”)

and so on.

Many are simple.

Feature selection will be an issue.
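A minimal sketch (not from the slides) of the conjunction and disjunction feature functions just described; the dictionary-style instance with 'currwd'/'prevwd' keys is an assumed representation used only for illustration.

```python
def f_conj(x, y):
    """1 if class is 'guns' AND currwd='rifles' AND prevwd='allow', else 0."""
    return int(y == "guns" and x.get("currwd") == "rifles" and x.get("prevwd") == "allow")

def f_disj(x, y):
    """1 if class is 'guns' AND (currwd='rifles' OR currwd='guns'), else 0."""
    return int(y == "guns" and x.get("currwd") in {"rifles", "guns"})

x = {"currwd": "rifles", "prevwd": "allow"}
print(f_conj(x, "guns"), f_disj(x, "guns"))   # -> 1 1
print(f_conj(x, "misc"), f_disj(x, "misc"))   # -> 0 0
```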


Page 81: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Key Points

Feature functions are:
binary
correspond to a (feature, class) pair

MaxEnt training learns the weights associated with the feature functions.

Page 82: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Computing Expectations


Page 85: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Computing Expectations

1: Consider a coin-flipping experiment:
Heads: win 100
Tails: lose 50
What is the expected yield?
P(X=H)*100 + P(X=T)*(-50)

2: Consider the more general case, outcome_i: yield = v_i.
What is the expected yield?


Page 87: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Calculating Expectations

Let P(X=i) be the distribution of a random variable X.
Let f(x) be a function of x.
Let E_p(f) be the expectation of f(x) based on P(x):

E_p(f) = Σ_x P(x) f(x)
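A small sketch (not from the slides) of E_p(f) = Σ_x P(x) f(x), using the coin-yield example that follows: f(H) = +100, f(T) = -50.

```python
def expectation(p, f):
    """Expectation of f under distribution p, both given as dicts keyed by outcome."""
    return sum(p[x] * f[x] for x in p)

yield_fn = {"H": 100, "T": -50}

empirical = {"H": 3/4, "T": 1/4}   # from observing H, T, H, H
model     = {"H": 1/2, "T": 1/2}   # a fair-coin model

print(expectation(empirical, yield_fn))  # -> 62.5  (empirical expectation)
print(expectation(model, yield_fn))      # -> 25.0  (model expectation)
```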


Page 91: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Empirical Expectation

The empirical distribution is denoted p̃.

Example: toss a coin 4 times.
Result: H, T, H, H
Average return: (100 - 50 + 100 + 100)/4 = 62.5

Empirical distribution: p̃(H) = 3/4, p̃(T) = 1/4

Empirical expectation = ¾*100 + ¼*(-50) = 62.5

Example due to F. Xia


Page 94: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Model Expectation

Model: p(x)

Assume a fair coin: P(X=H) = ½; P(X=T) = ½

Model expectation: ½*100 + ½*(-50) = 25

Page 95: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Empirical Expectation

Page 96: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Example

Training data:
x1 c1 t1 t2 t3
x2 c2 t1 t4
x3 c1 t3 t4
x4 c3 t1 t3

Example due to F. Xia

Raw counts:
      t1   t2   t3   t4
c1    1    1    2    1
c2    1    0    0    1
c3    1    0    1    0

Page 97: Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

Example

Training data:
x1 c1 t1 t2 t3
x2 c2 t1 t4
x3 c1 t3 t4
x4 c3 t1 t3

Example due to F. Xia

Empirical distribution (each raw count divided by the number of training instances, N = 4):
      t1    t2    t3    t4
c1    1/4   1/4   2/4   1/4
c2    1/4   0     0     1/4
c3    1/4   0     1/4   0
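A sketch (not from the slides) that reproduces the table above: each cell is the count of a (class, feature) indicator f_{t,c}(x,y) over the training data, divided by the number of training instances, N = 4.

```python
from collections import Counter

training = [                       # (class, features) pairs from the example above
    ("c1", {"t1", "t2", "t3"}),    # x1
    ("c2", {"t1", "t4"}),          # x2
    ("c1", {"t3", "t4"}),          # x3
    ("c3", {"t1", "t3"}),          # x4
]

N = len(training)
counts = Counter()
for c, feats in training:
    for t in feats:
        counts[(c, t)] += 1

for c in ["c1", "c2", "c3"]:
    row = [counts[(c, t)] / N for t in ["t1", "t2", "t3", "t4"]]
    print(c, row)
# c1 [0.25, 0.25, 0.5, 0.25]
# c2 [0.25, 0.0, 0.0, 0.25]
# c3 [0.25, 0.0, 0.25, 0.0]
```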
