
Page 1: Intro to Maxent

Adwait Ratnaparkhi

Yahoo! Labs

Introduction to maximum entropy models

Page 2: Intro to Maxent


Introduction

• This talk is geared for Technical Yahoos…
– Who are unfamiliar with machine learning or maximum entropy
– Who are familiar with machine learning but not maximum entropy
– Who want to see an example of machine learning on the grid
– Who are looking for an introductory talk

• The lab session is geared for Technical Yahoos…
– Who brought their laptop
– Who want to obtain more numerical intuition about the optimization algorithms
– Who don't mind coding a little Perl
– Who have some patience in case things go awry!

Page 3: Intro to Maxent


Outline

• What is maximum entropy modeling?
• Example applications of the maximum entropy framework
• Lab session
– Look at the mapper and reducer of a maximum entropy parameter estimation algorithm

Page 4: Intro to Maxent


What is a maximum entropy model?

• A framework for combining evidence into a probability model
• The probability can then be used for classification, or as input to the next component
– Content match: P(click | page, ad, user)
– Part-of-speech tagging: P(part-of-speech-tag | word)
– Categorization: P(category | page)

• The principle of maximum entropy is used as the optimization criterion
– Relationship to maximum likelihood optimization
– Same model form as multi-class logistic regression

Page 5: Intro to Maxent


Machine learning approach

• Data with labels
– Automatic: ad clicks
• sponsored search, content match, graphical ads
– Manual: editorial process
• page categories, part-of-speech tags

• Objective
– Training phase
• Find a (large) training set
• Use the training set to construct a classifier or probability model
– Test phase
• Find an (honest) test set
• Use the model to assign labels to previously unseen data

Page 6: Intro to Maxent


Why machine learning? Why maxent?

• Why machine learning?
– Pervasive ambiguity
• Any task with natural language features
– For these tasks, it is easier to collect and annotate data than to hand-code an expert system
– For certain web tasks, annotation is free (e.g., clicks)

• Why maxent?
– Little restriction on the kinds of evidence
• No independence assumption
– Works well with sparse features
– Works well in parallel settings, like Hadoop
– Appeal of the maximum entropy interpretation

Page 7: Intro to Maxent


Sentence Boundary Detection

• Which “.” denotes a sentence boundary?

Mr. B. Green from X.Y.Z. Corp. said that the U.S. budget deficit hit $1.42 trillion for the year that ended Sept. 30. The previous year’s deficit was $459 billion.

• Model: P({yes|no} | candidate boundary)

[Slide callout marks the correct sentence boundary in the example.]

Page 8: Intro to Maxent


Part-of-speech tagging

• What is the part of speech of "flies"?
– Fruit flies like a banana.
– Time flies like an arrow.

• Model: P(tag | current word, surrounding words, …)
– The end goal is a sequence of tags

Page 9: Intro to Maxent


Content match ad clicks

P(click|page, ad, user)

Page 10: Intro to Maxent


Ambiguity resolution: an artificial example

• Tagging unknown, or "out-of-vocabulary", words:
– Given a word, predict its POS tag based on spelling features

• Assume all words are either:
– Proper Noun (NNP)
– Verb Gerund (VBG)

• Find a model: P(tag | word) where tag is in {NNP, VBG}
• Assume a training set
– Editorially derived (word, tag) pairs
– Training data tags are {NNP, VBG}
– Reminder: artificial example!

Page 11: Intro to Maxent


Ambiguity resolution (cont’d)

• Ask the data
– q1: Is the first letter of the word capitalized?
– q2: Does the word end in "ing"?

• Evidence for (q1, NNP)
– Clinton, Bush, Reagan as NNP

• Evidence for (q2, VBG)
– trading, offering, operating as VBG

• Rules based on q1, q2 can conflict
– Boeing

• How to choose in ambiguous cases?

Page 12: Intro to Maxent


Features represent evidence

• a = what we are predicting
• b = what we observe
• Terminology
– Here, the input to a feature is an (a, b) pair
– But in much of the ML literature, the input to a feature is just (b)

f_{y,q}(a,b) = \begin{cases} 1 & \text{if } a = y \text{ and } q(b) = \text{true} \\ 0 & \text{otherwise} \end{cases}

f_1(a,b) = 1 \text{ if } a = \mathrm{NNP} \text{ and } q_1(b) = \text{true}

f_2(a,b) = 1 \text{ if } a = \mathrm{VBG} \text{ and } q_2(b) = \text{true}
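To make the questions and features concrete, here is a minimal Perl sketch (illustrative code, not from the talk; q1/q2/f1/f2 mirror the definitions above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# q1: is the first letter of the word capitalized?
sub q1 { my ($b) = @_; return $b =~ /^[A-Z]/ ? 1 : 0; }

# q2: does the word end in "ing"?
sub q2 { my ($b) = @_; return $b =~ /ing$/ ? 1 : 0; }

# f1(a,b) = 1 if a = NNP and q1(b) = true, 0 otherwise
sub f1 { my ($a, $b) = @_; return ($a eq 'NNP' && q1($b)) ? 1 : 0; }

# f2(a,b) = 1 if a = VBG and q2(b) = true, 0 otherwise
sub f2 { my ($a, $b) = @_; return ($a eq 'VBG' && q2($b)) ? 1 : 0; }

# "Boeing" triggers both q1 and q2, so both features can fire: the conflict case.
printf "f1(NNP,Boeing)=%d  f2(VBG,Boeing)=%d\n",
       f1('NNP', 'Boeing'), f2('VBG', 'Boeing');
```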

Page 13: Intro to Maxent


Combine the features: probability model

• Probability is product of weights of active features

• Why this form?

p(a \mid b) = \frac{1}{Z(b)} \prod_{j=1}^{k} \alpha_j^{f_j(a,b)}

Z(b) = \sum_{a'} \prod_{j=1}^{k} \alpha_j^{f_j(a',b)}

f_j(a,b) \in \{0,1\} : \text{features}

\alpha_j > 0 : \text{parameters}
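For numerical intuition, here is a self-contained Perl sketch of this model form on the NNP/VBG example; the parameter values and the active_features helper are invented for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %alpha = ( f1 => 9.0, f2 => 4.0 );   # made-up parameter values

# Names of the binary features active on (outcome a, word b).
sub active_features {
    my ($a, $b) = @_;
    my @on;
    push @on, 'f1' if $a eq 'NNP' && $b =~ /^[A-Z]/;
    push @on, 'f2' if $a eq 'VBG' && $b =~ /ing$/;
    return @on;
}

# p(a|b) = (1/Z(b)) prod_j alpha_j^{f_j(a,b)}: an outcome's score is the
# product of the weights of its active features; Z(b) sums over outcomes a'.
sub prob {
    my ($a, $b, @outcomes) = @_;
    my %score;
    my $Z = 0;
    for my $o (@outcomes) {
        my $s = 1;
        $s *= $alpha{$_} for active_features($o, $b);
        $score{$o} = $s;
        $Z += $s;
    }
    return $score{$a} / $Z;
}

# For "Boeing" both questions fire, so p(NNP|Boeing) = alpha1/Z = 9/13.
printf "p(NNP|Boeing) = %.3f\n", prob('NNP', 'Boeing', 'NNP', 'VBG');
```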

Page 14: Intro to Maxent


Probabilities for ambiguity resolution

• How do we find optimal parameter values?

p(\mathrm{NNP} \mid \mathrm{Boeing}) = \frac{1}{Z} \alpha_1^{f_1(a,b)} \alpha_2^{f_2(a,b)} = \frac{1}{Z} \alpha_1^{f_1(a,b)}

p(\mathrm{VBG} \mid \mathrm{Boeing}) = \frac{1}{Z} \alpha_1^{f_1(a,b)} \alpha_2^{f_2(a,b)} = \frac{1}{Z} \alpha_2^{f_2(a,b)}

Page 15: Intro to Maxent


Maximum likelihood estimation

Q = \left\{\, p \;\middle|\; p(a \mid b) = \frac{1}{Z(b)} \prod_{j=1}^{k} \alpha_j^{f_j(a,b)} \,\right\}

r(a,b) = \{\text{normalized frequency of } (a,b) \text{ in the training set}\}

L(p) = \sum_{a,b} r(a,b) \log p(a \mid b)

p_{ML} = \operatorname*{argmax}_{p \in Q} L(p)
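A sketch of L(p) in Perl, assuming the prob() routine from the earlier sketch and training events stored as [a, b, count] triples (a hypothetical layout):

```perl
# L(p) = sum_{a,b} r(a,b) log p(a|b), where r(a,b) is the normalized
# frequency of (a,b) in the training set.
sub log_likelihood {
    my ($events, @outcomes) = @_;       # $events: arrayref of [a, b, count]
    my ($total, $L) = (0, 0);
    $total += $_->[2] for @$events;
    for my $e (@$events) {
        my ($a, $b, $count) = @$e;
        $L += ($count / $total) * log( prob($a, $b, @outcomes) );
    }
    return $L;
}
```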

Page 16: Intro to Maxent


Principle of maximum entropy (Jaynes, 1957)

• Use the probability model that is maximally uncertain w.r.t. the observed evidence

• Why? Anything else assumes a fact you have not observed.

P = \{\text{models consistent with the evidence}\}

H(p) = \{\text{entropy of } p\}

p_{ME} = \operatorname*{argmax}_{p \in P} H(p)

Page 17: Intro to Maxent


Maxent example

• Task: estimate a joint distribution p(A,B)
– A is in {x, y}
– B is in {0, 1}

• Define a feature f
– Assume some expected value over a training set

p(A,B)    B=0    B=1
A=x        ?      ?
A=y        ?     0.7

f(a,b) = 1 \text{ iff } (a = y \text{ and } b = 1), \text{ 0 otherwise}

E[f] = 7/10 = 0.7

\sum_{a,b} p(a,b) = 1

Page 18: Intro to Maxent


Maxent example (cont’d)

• Define entropy:

H(p) = -\sum_{a,b} p(a,b) \log p(a,b)

• One way to meet the constraints, H(p) = 1.25:

p(A,B)    B=0    B=1
A=x       0.05   0.2
A=y       0.05   0.7

• The maxent way to meet the constraints, H(p) = 1.35:

p(A,B)    B=0    B=1
A=x       0.1    0.1
A=y       0.1    0.7
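A quick Perl check of the two fillings (a sketch; computing H in bits gives about 1.257 and 1.357, which the slide rounds to 1.25 and 1.35):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# H(p) = -sum_{a,b} p(a,b) log2 p(a,b)
sub entropy {
    my $H = 0;
    $H -= $_ * log($_) / log(2) for grep { $_ > 0 } @_;
    return $H;
}

printf "one way:    H = %.3f\n", entropy(0.05, 0.2, 0.05, 0.7);  # ~1.257
printf "maxent way: H = %.3f\n", entropy(0.1, 0.1, 0.1, 0.7);    # ~1.357
```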

Page 19: Intro to Maxent


Conditional maximum entropy (Berger et al, 1995)

E_r f_j = \{\text{observed expectation of } f_j\} = \sum_{a,b} r(a,b)\, f_j(a,b)

E_p f_j = \{\text{model's expectation of } f_j\} = \sum_{a,b} r(b)\, p(a \mid b)\, f_j(a,b)

P = \{\, p \mid E_p f_j = E_r f_j,\; j = 1 \ldots k \,\}

H(p) = -\sum_{a,b} r(b)\, p(a \mid b) \log p(a \mid b)

p_{ME} = \operatorname*{argmax}_{p \in P} H(p)
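Both expectations can be accumulated in one pass over the training events. A sketch, reusing prob() and the [a, b, count] event layout assumed earlier:

```perl
# Returns (E_r f, E_p f) for one feature, given a feature coderef $f.
# Summing r(a,b) over the events that share a context b yields r(b), so
# weighting the inner sum over outcomes by r(a,b) is equivalent to
# weighting it by r(b).
sub expectations {
    my ($events, $f, @outcomes) = @_;
    my $total = 0;
    $total += $_->[2] for @$events;
    my ($obs, $model) = (0, 0);
    for my $e (@$events) {
        my ($a, $b, $count) = @$e;
        my $r = $count / $total;                 # r(a,b)
        $obs += $r * $f->($a, $b);
        for my $o (@outcomes) {
            $model += $r * prob($o, $b, @outcomes) * $f->($o, $b);
        }
    }
    return ($obs, $model);
}
```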

Page 20: Intro to Maxent


Duality of ML and ME

• Under ME, it must be the case that:

p_{ME}(a \mid b) = \frac{1}{Z(b)} \prod_{j=1}^{k} \alpha_j^{f_j(a,b)}

• The ML and ME solutions are the same
– p_ME = p_ML
– ML: the form is assumed without justification
– ME: the constraints are assumed, and the form is derived

Page 21: Intro to Maxent


Extensions: Minimum divergence modeling

• Kullback-Leibler divergence
– Measures the "distance" between two probability distributions
– Not symmetric!
– See (Cover and Thomas, Elements of Information Theory)

D(p,q) = \sum_{a,b} r(b)\, p(a \mid b) \log \frac{p(a \mid b)}{q(a \mid b)}

D(p,q) \geq 0

D(p,q) = 0 \text{ iff } p = q

Page 22: Intro to Maxent


Extensions: Minimum divergence models (cont'd)

• Minimum divergence framework:
– Start with a prior model q
– From the set of consistent models P, minimize the KL divergence to q

• Parameters will reflect deviation from the prior model
– Use case: the prior model is static

• Same as maximizing entropy when q is uniform
• See (Della Pietra et al, 1992) for an example in language modeling

P = \{\, p \mid E_p f_j = E_r f_j,\; j = 1 \ldots k \,\}

p_{MD} = \operatorname*{argmin}_{p \in P} D(p,q)

p_{MD}(a \mid b) = \frac{q(a \mid b) \prod_{j=1}^{k} \alpha_j^{f_j(a,b)}}{\sum_{a'} q(a' \mid b) \prod_{j=1}^{k} \alpha_j^{f_j(a',b)}}

Page 23: Intro to Maxent


Parameter estimation (an incomplete list)

• Generalized Iterative Scaling (Darroch & Ratcliff, 1972)
– Find a correction feature and constant
– Iterative updates

• Improved Iterative Scaling (Della Pietra et al., 1997)
• Conjugate gradient
• Sequential conditional GIS (Goodman, 2002)
• Correction-free GIS (Curran and Clark, 2003)

Define the correction feature:

f_{k+1}(a,b) = C - \sum_{j=1}^{k} f_j(a,b)

GIS:

\alpha_j^{(0)} = 1

\alpha_j^{(n)} = \alpha_j^{(n-1)} \left[ \frac{E_r f_j}{E_{p^{(n-1)}} f_j} \right]^{1/C}
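The GIS update then reduces to one multiplicative step per feature. A minimal sketch, assuming hashrefs of observed and model expectations keyed by feature name:

```perl
# alpha_j^(n) = alpha_j^(n-1) * ( E_r f_j / E_{p^(n-1)} f_j )^(1/C)
sub gis_update {
    my ($alpha, $obs, $model, $C) = @_;   # hashrefs keyed by feature name
    for my $j (keys %$alpha) {
        next unless $model->{$j} > 0;     # skip features with no model mass
        $alpha->{$j} *= ( $obs->{$j} / $model->{$j} ) ** (1 / $C);
    }
}
```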

Page 24: Intro to Maxent


Comparisons

• Same model form as multi-class logistic regression
• Handles diverse forms of evidence
• Compared to decision trees:
– Advantage: no data fragmentation
– Disadvantage: no feature construction

• Compared to naïve Bayes:
– No independence assumptions

• Scales well on sparse feature sets
– Parameter estimation (GIS) is O([# of training samples] × [# of predictions] × [avg. # of features per training event])

Page 25: Intro to Maxent


Disadvantages

• "Perfect" predictors cause parameters to diverge
– Suppose the word "the" only occurred with the tag DT
– The estimation algorithm forces p(a|b) = 1 in order to meet the constraints
• The parameter for (the, DT) will diverge to infinity
• It may beat out other parameters estimated from many examples!

• A remedy
– Gaussian priors, or "fuzzy maximum entropy" (Chen & Rosenfeld, 2000)
– Discount the observed expectations

Page 26: Intro to Maxent


How to specify a maxent model

• Outcomes
– What are we predicting?

• Questions
– What information is useful for predicting?

• Feature selection
– The candidate feature set consists of all (outcome, question) pairs
– Given the candidate feature set, what subset do we use?

Page 27: Intro to Maxent


Finding the part-of-speech

• Part-of-Speech (POS) Tagging
– Return a sequence of POS tags

Input: Fruit flies like a banana
Output: N N V D N

Input: Time flies like an arrow
Output: N V P D N

• Train maxent models from the POS tags of the Penn Treebank (Marcus et al, 1993)
• Use a heavily pruned search procedure to find the highest-probability tag sequence

Page 28: Intro to Maxent


Model for POS tagging (Ratnaparkhi, 1996)

• Outcomes
– 45 POS tags (Penn Treebank)

• Question patterns:
– Common words: word identity
– Rare words: presence of prefix, suffix, capitalization, hyphens, and numbers
– Previous 2 tags
– Surrounding 2 words

• Feature selection
– Count cutoff of 10

Page 29: Intro to Maxent


A training event

• Example:
– ...stories about well-heeled communities and ...
        NNS    IN     ???

• Outcome: JJ (adjective)
• Questions:
– w[i-1]=about, w[i-2]=stories, w[i+1]=communities, w[i+2]=and, t[i-1]=IN, t[i-2][i-1]=NNS IN, pre[1]=w, pre[2]=we, pre[3]=wel, pre[4]=well, suf[1]=d, suf[2]=ed, suf[3]=led, suf[4]=eled
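A sketch of how these question patterns could be generated for a rare word (a hypothetical helper, assuming the position is at least two words from either end of the sentence):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Questions for a rare word at position $i, given the word list and the
# tags assigned so far (left-to-right).
sub rare_word_questions {
    my ($i, $words, $tags) = @_;
    my $w = $words->[$i];
    my @q = ( "w[i-1]=$words->[$i-1]",  "w[i-2]=$words->[$i-2]",
              "w[i+1]=$words->[$i+1]",  "w[i+2]=$words->[$i+2]",
              "t[i-1]=$tags->[$i-1]",
              "t[i-2][i-1]=$tags->[$i-2] $tags->[$i-1]" );
    for my $n (1 .. 4) {                 # prefixes/suffixes up to length 4
        push @q, "pre[$n]=" . substr($w, 0, $n);
        push @q, "suf[$n]=" . substr($w, -$n);
    }
    return @q;
}

# Reproduces the questions for "well-heeled" in the example above.
my @words = qw(stories about well-heeled communities and);
my @tags  = qw(NNS IN);
print join("\n", rare_word_questions(2, \@words, \@tags)), "\n";
```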

Page 30: Intro to Maxent


Finding the best POS sequence

a_1^* \ldots a_n^* = \operatorname*{argmax}_{a_1 \ldots a_n} \prod_{i=1}^{n} p(a_i \mid b_i)

• Find the maximum-probability sequence of n tags
– Use a "top K" breadth-first search
– Tag left-to-right, but maintain only the top K ranked hypotheses (sketched below)

• The best-ranked hypothesis is not guaranteed to be optimal
• Alternative: conditional random fields (Lafferty et al, 2001)
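A minimal sketch of this top-K search; the probability function is passed in as a coderef standing in for the trained maxent model:

```perl
# Tag left-to-right; keep only the K highest-probability hypotheses.
# $prob_tag->($tag, $i, \@words, \@tags_so_far) returns p(a_i | b_i).
sub tag_topk {
    my ($words, $tagset, $K, $prob_tag) = @_;
    my @beam = ( { tags => [], logp => 0 } );
    for my $i (0 .. $#$words) {
        my @next;
        for my $h (@beam) {
            for my $t (@$tagset) {
                my $p = $prob_tag->($t, $i, $words, $h->{tags});
                push @next, { tags => [ @{ $h->{tags} }, $t ],
                              logp => $h->{logp} + log($p) };
            }
        }
        @next = sort { $b->{logp} <=> $a->{logp} } @next;
        $#next = $K - 1 if $#next >= $K;        # prune to the top K
        @beam  = @next;
    }
    return @{ $beam[0]{tags} };   # best-ranked, not guaranteed optimal
}
```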

Page 31: Intro to Maxent


Performance

Domain                                            Word accuracy   Unknown word accuracy   Sentence accuracy
English: Wall St. Journal                         96.6%           85.3%                   47.3%
Spanish: CRATER corpus (with re-mapped tagset)    97.7%           83.3%                   60.4%

Page 32: Intro to Maxent


Summary

• Errors:
– are typically on words that are difficult to annotate
• that, about, more

• Architecture
– Can be ported easily to similar tasks with different tag sets, esp. named entity detection
• Name detector
– Tags = {begin_Name, continue_Name, other}
– The sequence probability can be used downstream

• Available for download
– MXPOST & MXTERMINATOR

Page 33: Intro to Maxent


Maxent model for Keystone content-match

• Yahoo!'s Keystone content-match system uses a click model
• Use features of the page, ad, and user to predict clicks
• Use cases:
– Select ads by (page, ad) cosine score, and use the click model to re-rank
– Select ads directly by click model score

P(\text{click} \mid \text{page}, \text{ad}, \text{user}) = \frac{1}{Z(\text{page}, \text{ad}, \text{user})} \prod_j \alpha_j^{f_j(\text{click}, \text{page}, \text{ad}, \text{user})}

Page 34: Intro to Maxent


Maxent model for Keystone content-match

• Outcomes: click (1) or no click (0)
• Questions:
– Unigrams, phrases, and categories on the page side
– Unigrams, phrases, and categories on the ad side
– The user's BT category

• Feature selection
– Count cutoff, mutual information

Page 35: Intro to Maxent


Some recent work

• Recent work
– Using the (page, ad) cosine score (Opal) as a feature
– Using page-domain-to-ad-bid-phrase mappings
– Using user-BT-to-ad-bid-phrase mappings
– Using user-age+gender-to-bid-phrase mappings

• Contact
– Andy Hatch (aohatch)
– Abraham Bagherjeiran (abagher)

Page 36: Intro to Maxent


Maxent on the grid

• Several grid maxent implementations exist at Yahoo!
• Correction-free GIS (Curran and Clark, 2003)
– Implemented for verification and the lab session
– Not product-ready code!

• Input for each iteration
– Training data format: [label] [weight] [q1 … qN]
– Feature set format: [q] [label] [parameter]

• Output for each iteration
– New feature set format: [q] [label] [new parameter]

Page 37: Intro to Maxent


Maxent on the grid (cont’d)

• Map phase (parallelization across the training data)
– Collect observed feature expectations
– Collect the model's feature expectations w.r.t. the current model
– Use the feature name as the key

• Reduce phase (parallelization across model parameters)
– For each key (feature):
• Sum up the observed and model feature expectations
• Do the parameter update
– Write the new model

• Repeat for N iterations

Page 38: Intro to Maxent


Maxent on the grid (cont’d)

• maxent_mapper
– args: [file of model params]
– stdin: [training data, one instance per line]
– stdout: [feature name] [observed val] [current param] [model val]

• maxent_reducer
– args: [iteration] [correction constant]
– stdin: (input from maxent_mapper, sorted by key)
– stdout: [feature name] [new param]

• Uses Hadoop streaming; can also be run off the grid
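To make the data flow concrete, here is a sketch of the reducer side in the spirit of maxent_reducer (illustrative only, not the product code; it assumes tab-separated mapper output in the format above):

```perl
#!/usr/bin/perl
# Sum the observed and model expectations per feature key, then apply the
# GIS-style multiplicative update. Field handling is illustrative.
use strict;
use warnings;

my ($iteration, $C) = @ARGV;    # args: [iteration] [correction constant]
                                # ($iteration is unused here; kept to mirror the args)
my (%obs, %model, %param);
while (my $line = <STDIN>) {
    chomp $line;                # [feature name] [observed] [param] [model]
    my ($feat, $o, $p, $m) = split /\t/, $line;
    $obs{$feat}   += $o;
    $model{$feat} += $m;
    $param{$feat}  = $p;        # same current parameter on every record
}
for my $feat (sort keys %obs) {
    next unless $model{$feat} > 0;
    my $new = $param{$feat} * ($obs{$feat} / $model{$feat}) ** (1 / $C);
    print "$feat\t$new\n";
}
```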

Page 39: Intro to Maxent


Lab session

• Log in to a UNIX machine or Mac
• mkdir maxent_lab; cd maxent_lab
• svn checkout svn+ssh://svn.corp.yahoo.com/yahoo/adsciences/contextualadvertising/streaming_maxent/trunk/GIS .
• cd src; make clean all
• cd ../unit_test
• ./doloop

Page 40: Intro to Maxent


Lab Exercise: Sentence Boundary Detection

• Problem: given a "." in free text, classify it:
– Yes, it is a sentence boundary
– No, it is not

• Not a super-hard problem, but not super-easy!
– A hand-coded baseline can achieve a high score
– With foreign languages, hand-coding is tougher

• cd data
• The Penn Treebank corpus: ptb.txt.gz
– gunzip –c ptb.txt.gz | head
– A newline indicates a sentence boundary
– The Penn Treebank tokenizes text for NLP
• The tokenization was undone as much as possible for this exercise

Page 41: Intro to Maxent


Lab: data generation

• cd ../lab/data
• Look at mkData.sh
– Creates the train / development test / test splits
– Feature extraction format:
• [true label] [weight] [q1] … [qN]

no 1.0 *default* prefix=Nov
yes 1.0 *default* prefix=29

• *default* is always on
– It lets the model estimate the prior probability of Yes and No

• Prefix: the character sequence before the "." back to the preceding space character (see the sketch below)

• Run ./mkData.sh
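For intuition, a sketch of the prefix question (a hypothetical stand-in for the extraction done by mkData.sh):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Emit one training event for a candidate "." at offset $dot_pos in $text.
# The prefix is the character sequence before the "." back to the last space.
sub boundary_event {
    my ($text, $dot_pos, $label) = @_;       # $label is "yes" or "no"
    my ($prefix) = substr($text, 0, $dot_pos) =~ /(\S*)$/;
    return "$label 1.0 *default* prefix=$prefix";
}

print boundary_event('deficit hit $1.42 trillion', 14, 'no'), "\n";
# -> no 1.0 *default* prefix=$1
```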

Page 42: Intro to Maxent


Lab: feature selection and training

• cd ../train
• Look at dotrain
– selectfeatures
• Select features whose frequency ≥ cutoff
– train
• Find the correction constant
• Iterate (each iteration is one map/reduce job)
– maxent_mapper: collect statistics
– maxent_reducer: find the update

• Look at dotest
– accuracy.pl: classifies as "yes" if prob > 0.5
– Evaluation: # of correctly classified test instances

• Run ./dotrain

Page 43: Intro to Maxent


Lab: matching expectations

• GIS should bring the model expectation closer to the observed expectation

• After the 1st map (1.mapped):

Feature          Observed   Parameter    Model
prefix=$1 no     462        1            233
prefix=$1 yes    4          1            233

• After the 9th map (9.mapped):

Feature          Observed   Parameter    Model
prefix=$1 no     462        1.51773      461.574
prefix=$1 yes    4          0.00731677   4.42558

Page 44: Intro to Maxent


Lab: results

• The log-likelihood of the training data must increase

• Accuracy:
– Train: 46670 correct out of 47771, or 97.6952544430722%
– Dev: 14940 correct out of 15579, or 95.8983246678221%

[Chart: log-likelihood of the training data over iterations 1–9, rising from about -35000 toward 0]

Page 45: Intro to Maxent


Lab: your turn!!!

• Beat this result (on the development set only)!
• Things to try
– Feature extraction
• Look at the data and find other things that would be useful for sentence boundary detection
• data/extractfeatures.pl
– Suffix features
– Feature classes
– Feature selection
• Pay attention to the number of features
– Number of iterations
– Pay attention to train vs. test accuracy

Page 46: Intro to Maxent


Lab: Now let’s try the real test set

• Take your best model, and try it on the test set

./dotest N.features ../data/te

• Did you beat the baseline?

13937 correct out of 14586, or 95.5505279034691%

• Who has the highest result?