
Machine Learning in Natural Language Processing

Fernando Pereira, University of Pennsylvania
NASSLLI, June 2002

Thanks to: William Bialek, John Lafferty, Andrew McCallum, Lillian Lee, Lawrence Saul, Yves Schabes, Stuart Shieber, Naftali Tishby

Introduction

Why ML in NLP
- Examples are easier to create than rules
- Rule writers miss low-frequency cases
- Many factors involved in language interpretation
- People do it
  • AI
  • Cognitive science
- Let the computer do it
  • Moore's law
  • storage
  • lots of data

Classification
- Document topic: politics, business, national, environment
- Word sense: treasury bonds vs. chemical bonds

Analysis
- Tagging
- Parsing
[Figure: POS-tagged example "causing symptoms that show up decades later" and lexicalized parse tree for "workers dumped sacks into a bin"]

Language Modeling
- Is this a likely English sentence?
  P(colorless green ideas sleep furiously) / P(furiously sleep ideas green colorless) ≈ 2 × 10^5
- Disambiguate noisy transcription:
  It's easy to wreck a nice beach
  It's easy to recognize speech

Inference
- Translation
  ligações covalentes → covalent bonds
  obrigações do tesouro → treasury bonds
- Information extraction

  Sara Lee to Buy 30% of DIM

  Chicago, March 3 - Sara Lee Corp said it agreed to buy a 30 percent interest in Paris-based DIM S.A., a subsidiary of BIC S.A., at cost of about 20 million dollars. DIM S.A., a hosiery manufacturer, had sales of about 2 million dollars.

  The investment includes the purchase of 5 million newly issued DIM shares valued at about 5 million dollars, and a loan of about 15 million dollars, it said. The loan is convertible into an additional 16 million DIM shares, it noted.

  The proposed agreement is subject to approval by the French government, it said.

  [Annotation labels: acquirer, acquired]

Machine Learning Approach
- Algorithms that write programs
- Specify
  • Form of output programs
  • Accuracy criterion
- Input: set of training examples
- Output: program that performs as accurately as possible on the training examples
- But will it work on new examples?

Fundamental Questions
- Generalization: is the learned program useful on new examples?
  • Statistical learning theory: quantifiable tradeoffs between number of examples, complexity of program class, and generalization error
- Computational tractability: can we find a good program quickly?
  • If not, can we find a good approximation?
- Adaptation: can the program learn quickly from new evidence?
  • Information-theoretic analysis: relationship between adaptation and compression

Learning Tradeoffs
[Figure: error of the best program vs. program class complexity, for training data and for testing on new examples; regions labeled "Overfitting" and "Rote learning"]

Machine Learning Methods
- Classifiers
  • Document classification
  • Disambiguation
- Structured models
  • Tagging
  • Parsing
  • Extraction
- Unsupervised learning
  • Generalization
  • Structure induction

Jargon
- Instance: event type of interest
  • Document and its class
  • Sentence and its analysis
  • …
- Supervised learning: learn classification function from hand-labeled instances
- Unsupervised learning: exploit correlations to organize training instances
- Generalization: how well does it work on unseen data
- Features: map instance to set of elementary events

Classification Ideas
- Represent instances by feature vectors
  • Content
  • Context
- Learn function from feature vectors
  • Class
  • Class-probability distribution
- Redundancy is our friend: many weak clues

Structured Model Ideas
- Interdependent decisions
  • Successive parts of speech
  • Parsing/generation steps
  • Lexical choice (parsing, translation)
- Combining decisions
  • Sequential decisions
  • Generative models
  • Constraint satisfaction

Unsupervised Learning Ideas
- Clustering: class induction
- Latent variables: I'm thinking of sports → more sporty words
- Distributional regularities
  • Know words by the company they keep
- Data compression
- Infer dependencies among variables: structure learning

Methodological Detour
- Empiricist/information-theoretic view: words combine following their associations in previous material
- Rationalist/generative view: words combine according to a formal grammar in the class of possible natural-language grammars

Chomsky's Challenge to Empiricism

(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.

"… It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." (Chomsky 1957)

The Return of Empiricism
- Empiricist methods work:
  • Markov models can capture a surprising fraction of the unpredictability in language
  • Statistical information retrieval methods beat alternatives
  • Statistical parsers are more accurate than competitors based on rationalist methods
  • Machine-learning, statistical techniques are close to human performance in part-of-speech tagging and sense disambiguation
- Just engineering tricks?

Unseen Events
- Chomsky's implicit assumption: any model must assign zero probability to unseen events
  • naïve estimation of Markov model probabilities from frequencies
  • no latent (hidden) events
- Any such model overfits the data: many events are likely to be missing in any finite sample
⇒ The learned model cannot generalize to unseen data
⇒ Support for poverty-of-the-stimulus arguments

The Science of Modeling
- Probability estimates can be smoothed to accommodate unseen events
- Redundancy in language supports effective statistical inference procedures ⇒ the stimulus is richer than it might seem
- Statistical learning theory: the generalization ability of a model class can be measured independently of model representation
- Beyond Markov models: effects of latent conditioning variables can be estimated from data

Richness of the Stimulus
- Information about: mutual information
  • between linguistic and non-linguistic events
  • between parts of a linguistic event
- Global coherence: banks can now sell stocks and bonds
- Word statistics carry more information than it might seem
  • Markov models in speech recognition
  • Success of the bag-of-words model in information retrieval
  • Statistical machine translation
- How far can these methods go?

Questions
- Generative or discriminative?
- Structured models: local classification or global constraint satisfaction?
- Does unsupervised learning help?

Classification

Generative or Discriminative?
- Generative models
  • Estimate the instance-label distribution $p(x, y)$
- Discriminative models
  • Estimate the label-given-instance distribution $p(y \mid x)$
  • Minimize an upper bound on the training error $\sum_i [[f(x_i) \neq y_i]]$

Simple Generative Model
- Binary naïve Bayes: represent instances by sets of binary features
  • Does the word occur in the document
  • …
- Finite predefined set of classes
[Graphical model: class node C with independent feature nodes F1, F2, F3, F4]

  $P(F_1,\ldots,F_n,C) = P(C)\prod_i P(F_i \mid C)$
  $P(c \mid F_1,\ldots,F_n) \propto P(c)\prod_i P(F_i \mid c)$
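The product rule above is easy to turn into code. Below is a minimal sketch in Python, assuming a tiny invented corpus, word-presence features, and add-one smoothing; none of the data or class names come from the slides.

```python
import math

# Minimal binary naive Bayes: P(c | f) proportional to P(c) * prod_i P(f_i | c).
# Toy training data (hypothetical): documents as sets of words, with topic labels.
train = [
    ({"treasury", "bonds", "rates"}, "business"),
    ({"chemical", "bonds", "atoms"}, "science"),
    ({"stocks", "bonds", "market"}, "business"),
    ({"atoms", "molecules", "energy"}, "science"),
]
vocab = set().union(*(doc for doc, _ in train))
classes = {c for _, c in train}

# Estimate P(c) and P(f_i = 1 | c) by counting, with add-one smoothing.
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}
cond = {c: {} for c in classes}
for c in classes:
    docs_c = [doc for doc, y in train if y == c]
    for w in vocab:
        cond[c][w] = (sum(1 for d in docs_c if w in d) + 1) / (len(docs_c) + 2)

def classify(doc):
    """Return the class maximizing log P(c) + sum_i log P(f_i | c)."""
    scores = {}
    for c in classes:
        s = math.log(prior[c])
        for w in vocab:  # every binary feature contributes, present or absent
            p = cond[c][w]
            s += math.log(p if w in doc else 1.0 - p)
        scores[c] = s
    return max(scores, key=scores.get)

print(classify({"treasury", "bonds"}))   # expected: business
print(classify({"chemical", "atoms"}))   # expected: science
```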

Generative Claims
- Easy to train: just count
- Language modeling: probability of observed forms
- More robust
  • Small training sets
  • Label noise
- Full advantage of probabilistic methods

Discriminative Models
- Define a functional form for $p(y \mid x; \theta)$
- Binary classification: define a discriminant function $y = \mathrm{sign}\, h(x;\theta)$
- Adjust parameter(s) θ to maximize the probability of the training labels / minimize error

Simple Discriminative Forms
- Linear discriminant function:
  $h(x;\theta_0,\theta_1,\ldots,\theta_n) = \theta_0 + \sum_i \theta_i f_i(x)$
- Logistic form:
  $P(+1 \mid x) = \dfrac{1}{1 + \exp -h(x;\theta)}$
- Multi-class exponential form (maxent):
  $h(x,y;\theta_0,\theta_1,\ldots,\theta_n) = \theta_0 + \sum_i \theta_i f_i(x,y)$
  $P(y \mid x;\theta) = \dfrac{\exp h(x,y;\theta)}{\sum_{y'} \exp h(x,y';\theta)}$
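As a small illustration of these forms, here is a sketch of the linear discriminant and its logistic transformation; the feature names and weights for the word-sense example are made up, not trained values.

```python
import math

# Logistic form P(+1 | x) = 1 / (1 + exp(-h(x; theta))) built on the linear
# discriminant h(x; theta) = theta_0 + sum_i theta_i f_i(x).

def h(f, theta0, theta):
    return theta0 + sum(theta.get(k, 0.0) * v for k, v in f.items())

def p_positive(f, theta0, theta):
    return 1.0 / (1.0 + math.exp(-h(f, theta0, theta)))

# Word-sense toy example: features of one occurrence of "bonds" (invented).
f = {"near_treasury": 1.0, "near_chemical": 0.0}
theta = {"near_treasury": 2.0, "near_chemical": -2.0}
theta0 = -0.5
print(h(f, theta0, theta))                     # linear score
print(p_positive(f, theta0, theta))            # probability of the +1 class
print(+1 if h(f, theta0, theta) > 0 else -1)   # y = sign h(x; theta)
```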

Discriminative Claims
- Focus modeling resources on the instance-to-label mapping
- Avoid restrictive probabilistic assumptions on the instance distribution
- Optimize what you care about
- Higher accuracy

Classification Tasks
- Document categorization
  • News categorization
  • Message filtering
  • Web page selection
- Tagging
  • Named entity
  • Part of speech
  • Sense disambiguation
- Syntactic decisions
  • Attachment

Document Models
- Binary vector: $f_t(d) \equiv [t \in d]$
- Frequency vector:
  raw frequency: $r_t(d) = \mathrm{tf}(d,t)$, where $\mathrm{tf}(d,t)$ counts occurrences of $t$ in $d$ and $\mathrm{idf}(t) = \dfrac{|D|}{|\{d \in D : t \in d\}|}$
  TF*IDF: $x_t(d) = \log(1+\mathrm{tf}(d,t))\,\log(1+\mathrm{idf}(t))$
- N-gram language model:
  $p(d \mid c) = p(|d| \mid c)\prod_{i=1}^{|d|} p(d_i \mid d_1\ldots d_{i-1};c)$, with $p(d_i \mid d_1\ldots d_{i-1};c) \approx p(d_i \mid d_{i-n}\ldots d_{i-1};c)$

Term Weighting and Feature Selection
- Select or weight the most informative features
- TF*IDF: adjust term weight by how document-specific the term is
- Feature selection:
  • Remove low, unreliable counts
  • Mutual information
  • Information gain
  • Other statistics
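A short sketch of the TF*IDF weighting as reconstructed above, on an invented three-document corpus; the particular weighting variant and the corpus are assumptions, not the tutorial's data.

```python
import math
from collections import Counter

# TF*IDF weighting: x_t(d) = log(1 + tf(d, t)) * log(1 + idf(t)),
# with idf(t) = |D| / |{d in D : t in d}|.
corpus = [
    "treasury bonds fell as rates rose".split(),
    "chemical bonds hold atoms together".split(),
    "the market watched treasury rates".split(),
]

def idf(term, docs):
    df = sum(1 for d in docs if term in d)
    return len(docs) / df if df else 0.0

def tfidf_vector(doc, docs):
    tf = Counter(doc)
    return {t: math.log(1 + tf[t]) * math.log(1 + idf(t, docs)) for t in tf}

for d in corpus:
    # print the three highest-weighted terms of each document
    print(sorted(tfidf_vector(d, corpus).items(), key=lambda kv: -kv[1])[:3])
```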

Documents vs. Vectors (1)
- Many documents have the same binary or frequency vector
- Document multiplicity must be handled correctly in probability models
- Binary naïve Bayes:
  $p(f \mid c) = \prod_t \left[ f_t\, p(t \mid c) + (1 - f_t)(1 - p(t \mid c)) \right]$
- Multiplicity is not recoverable

Documents vs. Vectors (2)
- Document probability (unigram language model):
  $p(d \mid c) = p(|d| \mid c)\prod_{i=1}^{|d|} p(d_i \mid c)$
- Raw frequency vector probability, with $r_t = \mathrm{tf}(d,t)$:
  $p(r \mid c) = p(L \mid c)\, L! \prod_t \dfrac{p(t \mid c)^{r_t}}{r_t!}$, where $L = \sum_t r_t$

Documents vs. Vectors (3)
- Unigram model:
  $p(c \mid d) = \dfrac{p(c)\, p(|d| \mid c)\prod_{i=1}^{|d|} p(d_i \mid c)}{\sum_{c'} p(c')\, p(|d| \mid c')\prod_{i=1}^{|d|} p(d_i \mid c')}$
- Vector model:
  $p(c \mid r) = \dfrac{p(c)\, p(L \mid c)\prod_t p(t \mid c)^{r_t}}{\sum_{c'} p(c')\, p(L \mid c')\prod_t p(t \mid c')^{r_t}}$

Linear Classifiers
- Embedding into a high-dimensional vector space
- Geometric intuitions and techniques
- Easier separability
  • Increase dimension with interaction terms
  • Nonlinear embeddings (kernels)
- Swiss Army knife

Kinds of Linear Classifiers
- Naïve Bayes
- Exponential models
- Large margin classifiers
  • Support vector machines (SVM)
  • Boosting
- Online methods
  • Perceptron
  • Winnow

Learning Linear Classifiers
- Rocchio:
  $w_k = \max\left(0,\ \dfrac{\sum_{x \in c} x_k}{|c|} - \dfrac{\sum_{x \notin c} x_k}{|D| - |c|}\right)$
- Widrow-Hoff:
  $w \leftarrow w - 2\eta\,(w \cdot x_i - y_i)\,x_i$
- (Balanced) Winnow:
  $y = \mathrm{sign}(w_+ \cdot x - w_- \cdot x - \theta)$
  positive error: $w_+ \leftarrow \alpha w_+,\ w_- \leftarrow \beta w_-$, with $\alpha > 1 > \beta > 0$
  negative error: $w_+ \leftarrow \beta w_+,\ w_- \leftarrow \alpha w_-$

Linear Classification
- Linear discriminant function:
  $h(x) = w \cdot x + b = \sum_k w_k x_k + b$
[Figure: two classes of points separated by a hyperplane with normal w and offset b]

Margin
- Instance margin: $\gamma_i = y_i\,(w \cdot x_i + b)$
- Normalized (geometric) margin: $\gamma_i = y_i\left(\dfrac{w}{\|w\|}\cdot x_i + \dfrac{b}{\|w\|}\right)$
- Training set margin γ
[Figure: separating hyperplane with instance margins γ_i, γ_j and the training set margin γ]

Perceptron Algorithm
- Given
  • Linearly separable training set S
  • Learning rate η > 0

  w ← 0; b ← 0; R = max_i ||x_i||
  repeat
    for i = 1…N
      if y_i (w · x_i + b) ≤ 0
        w ← w + η y_i x_i
        b ← b + η y_i R²
  until there are no mistakes
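The pseudocode translates almost line for line into Python; the toy linearly separable data set below is an assumption for illustration.

```python
# Primal perceptron as in the pseudocode above: repeated passes over the data,
# updating w and b on each margin violation until no mistakes remain.

def perceptron(data, eta=1.0, max_epochs=100):
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    R = max(sum(xk * xk for xk in x) ** 0.5 for x, _ in data)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * (sum(wk * xk for wk, xk in zip(w, x)) + b) <= 0:
                w = [wk + eta * y * xk for wk, xk in zip(w, x)]
                b += eta * y * R * R
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

# y in {-1, +1}; points on either side of the line x1 + x2 = 3 (toy data).
data = [([1.0, 1.0], -1), ([2.0, 0.5], -1), ([3.0, 2.0], +1), ([2.5, 3.0], +1)]
w, b = perceptron(data)
print(w, b)
print([1 if sum(wk * xk for wk, xk in zip(w, x)) + b > 0 else -1 for x, _ in data])
```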

Duality
- The final hypothesis is a linear combination of training points:
  $w = \sum_i \alpha_i y_i x_i,\quad \alpha_i \ge 0$
- Dual perceptron algorithm:

  α ← 0; b ← 0; R = max_i ||x_i||
  repeat
    for i = 1…N
      if y_i (Σ_j α_j y_j x_j · x_i + b) ≤ 0
        α_i ← α_i + 1
        b ← b + y_i R²
  until there are no mistakes

Why Maximize the Margin?
- There is a constant c such that for any data distribution D with support in a ball of radius R and any training sample S of size N drawn from D,
  $P\left[\mathrm{err}(h) \le \dfrac{c}{N}\left(\dfrac{R^2}{\gamma^2}\log^2 N + \log\dfrac{1}{\delta}\right)\right] \ge 1 - \delta$
  where γ is the margin of h on S.

Canonical Hyperplanes
- Multiple representations for the same hyperplane: $(\lambda w, \lambda b)$, $\lambda > 0$
- Canonical hyperplane: functional margin = 1
- Geometric margin for a canonical hyperplane:
  $\gamma = \dfrac{1}{2}\left(\dfrac{w}{\|w\|}\cdot x_+ - \dfrac{w}{\|w\|}\cdot x_-\right) = \dfrac{1}{2\|w\|}\left(w\cdot x_+ - w\cdot x_-\right) = \dfrac{1}{\|w\|}$

Convex Optimization (1)
- Constrained optimization problem:
  $\min_{w \in \Omega \subseteq \mathbb{R}^n} f(w)$ subject to $g_i(w) \le 0,\ h_j(w) = 0$
- Lagrangian function:
  $L(w,\alpha,\beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w)$
- Dual problem:
  $\max_{\alpha,\beta}\ \inf_{w \in \Omega} L(w,\alpha,\beta)$ subject to $\alpha_i \ge 0$

Convex Optimization (2)
- Kuhn-Tucker conditions
  • f convex
  • g_i, h_j affine (h(w) = Aw − b)
  • a solution w*, α*, β* must satisfy:
    $\dfrac{\partial L(w^*,\alpha^*,\beta^*)}{\partial w} = 0$, $\ \dfrac{\partial L(w^*,\alpha^*,\beta^*)}{\partial \beta} = 0$, $\ \alpha_i^* g_i(w^*) = 0$, $\ g_i(w^*) \le 0$, $\ \alpha_i^* \ge 0$
- Complementarity condition: a parameter is non-zero iff its constraint is active

Maximizing the Margin (1)
- Given a separable training sample:
  $\min_{w,b} \|w\| = \sqrt{w \cdot w}$ subject to $y_i(w \cdot x_i + b) \ge 1$
- Lagrangian:
  $L(w,b,\alpha) = \tfrac{1}{2}\, w \cdot w - \sum_i \alpha_i\left[y_i(w \cdot x_i + b) - 1\right]$
  $\dfrac{\partial L(w,b,\alpha)}{\partial w} = w - \sum_i y_i \alpha_i x_i = 0$
  $\dfrac{\partial L(w,b,\alpha)}{\partial b} = \sum_i y_i \alpha_i = 0$

Maximizing the Margin (2)
- Dual Lagrangian at the stationary point:
  $W(\alpha) = L(w^*,b^*,\alpha) = \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j\, x_i \cdot x_j$
- Dual maximization problem:
  $\max_\alpha W(\alpha)$ subject to $\alpha_i \ge 0,\ \sum_i y_i \alpha_i = 0$
- Maximum margin weight vector:
  $w^* = \sum_i y_i \alpha_i^* x_i$ with margin $\gamma = \dfrac{1}{\|w^*\|} = \left(\sum_{i \in \mathrm{sv}} \alpha_i^*\right)^{-1/2}$

Building the Classifier
- Computing the offset (from the primal constraints):
  $b^* = -\dfrac{\max_{y_i = -1} w^* \cdot x_i + \min_{y_i = 1} w^* \cdot x_i}{2}$
- Decision function:
  $h(x) = \mathrm{sgn}\left(\sum_i y_i \alpha_i^*\, x_i \cdot x + b^*\right)$

Consequences
- The complementarity condition yields the support vectors:
  $\alpha_i\left[y_i(w^* \cdot x_i + b) - 1\right] = 0$
  $\alpha_i > 0 \Rightarrow w^* \cdot x_i + b = y_i$
- A functional margin of 1 implies minimum geometric margin $\gamma = \dfrac{1}{\|w^*\|}$

General SVM Form
- Margin maximization for an arbitrary kernel K:
  $\max_\alpha \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j K(x_i, x_j)$ subject to $\alpha_i \ge 0,\ \sum_i y_i \alpha_i = 0$
- Decision rule:
  $h(x) = \mathrm{sgn}\left(\sum_i y_i \alpha_i^* K(x_i, x) + b^*\right)$
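A sketch of this decision rule with an RBF kernel. Solving the dual QP is not shown; the support vectors, multipliers, and offset below are made-up values, so this only illustrates how a solved model would be applied.

```python
import math

# SVM decision rule h(x) = sgn(sum_i y_i a_i K(x_i, x) + b) for a kernel K.

def rbf_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((ui - vi) ** 2 for ui, vi in zip(u, v)))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    s = sum(a * y * kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels)) + b
    return 1 if s > 0 else -1

support_vectors = [[1.0, 1.0], [3.0, 3.0]]   # hypothetical support vectors
alphas = [0.7, 0.7]                          # hypothetical dual multipliers
labels = [-1, +1]
b = 0.0
print(svm_decision([0.8, 1.2], support_vectors, alphas, labels, b))  # -> -1
print(svm_decision([2.9, 3.1], support_vectors, alphas, labels, b))  # -> +1
```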

Soft Margin
- Handles the non-separable case
- Primal problem (2-norm):
  $\min_{w,b,\xi}\ w \cdot w + C\sum_i \xi_i^2$ subject to $y_i(w \cdot x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0$
- Dual problem:
  $\max_\alpha \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j\left(x_i \cdot x_j + \tfrac{1}{C}\delta_{ij}\right)$ subject to $\alpha_i \ge 0,\ \sum_i y_i \alpha_i = 0$

Conditional Maxent Model
- Model form:
  $p(y \mid x; \Lambda) = \dfrac{\exp \sum_k \lambda_k f_k(x,y)}{Z(x;\Lambda)}$, $\quad Z(x;\Lambda) = \sum_y \exp \sum_k \lambda_k f_k(x,y)$
- Useful properties
  • Multi-class
  • May use different features for different classes
  • Training is a convex optimization
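Since training is a convex optimization, even plain gradient ascent on the conditional log-likelihood works on toy data. The sketch below assumes indicator features keyed by (word, label) pairs and an invented two-document training set.

```python
import math

# Tiny conditional maxent model p(y | x; L) = exp(sum_k l_k f_k(x, y)) / Z(x; L),
# trained by gradient ascent on the conditional log-likelihood.

def features(x, y):
    # f_k(x, y): indicator features keyed by (word, label).
    return {(w, y): 1.0 for w in x}

def p_y_given_x(x, labels, lam):
    scores = {y: math.exp(sum(lam.get(k, 0.0) * v
                              for k, v in features(x, y).items()))
              for y in labels}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def train(data, labels, steps=200, lr=0.5):
    lam = {}
    for _ in range(steps):
        grad = {}
        for x, y in data:
            p = p_y_given_x(x, labels, lam)
            for k, v in features(x, y).items():          # empirical counts
                grad[k] = grad.get(k, 0.0) + v
            for yp in labels:                            # expected counts
                for k, v in features(x, yp).items():
                    grad[k] = grad.get(k, 0.0) - p[yp] * v
        for k, g in grad.items():
            lam[k] = lam.get(k, 0.0) + lr * g
    return lam

data = [(["treasury", "bonds"], "finance"), (["chemical", "bonds"], "chemistry")]
labels = ["finance", "chemistry"]
lam = train(data, labels)
print(p_y_given_x(["treasury", "bonds"], labels, lam))
```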

Duality
- Maximize the conditional log-likelihood:
  $\tilde{\Lambda} = \arg\max_\Lambda \sum_i \log p(y_i \mid x_i; \Lambda)$
- Maximizing the conditional entropy
  $\tilde{p} = \arg\max_p \sum_i \left[-\sum_y p(y \mid x_i)\log p(y \mid x_i)\right]$
  subject to the constraints
  $\sum_i f_k(x_i, y_i) = \sum_i \sum_y p(y \mid x_i)\, f_k(x_i, y)$
  yields
  $\tilde{p}(y \mid x) = p(y \mid x; \tilde{\Lambda})$

Relationship to (Binary) Logistic Discrimination
  $p(+1 \mid x) = \dfrac{\exp\sum_k \lambda_k f_k(x,+1)}{\exp\sum_k \lambda_k f_k(x,+1) + \exp\sum_k \lambda_k f_k(x,-1)} = \dfrac{1}{1 + \exp-\sum_k \lambda_k\left(f_k(x,+1) - f_k(x,-1)\right)} = \dfrac{1}{1 + \exp-\sum_k \lambda_k g_k(x)}$

Relationship to Linear Discrimination
- Decision rule:
  $\mathrm{sign}\left(\log\dfrac{p(+1 \mid x)}{p(-1 \mid x)}\right) = \mathrm{sign}\left(\sum_k \lambda_k g_k(x)\right)$
- Bias term: parameter for an "always on" feature
- Question: relationship to other trainers for linear discriminant functions

Solution Techniques (1)
- Generalized iterative scaling (GIS)
- Parameter updates:
  $\lambda_k \leftarrow \lambda_k + \dfrac{1}{C}\log\dfrac{\sum_i f_k(x_i,y_i)}{\sum_i \sum_y p(y \mid x_i;\Lambda)\, f_k(x_i,y)}$
- Requires that the features add up to a constant independent of instance or label (add a slack feature):
  $\sum_k f_k(x_i, y) = C \quad \forall i, y$
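A sketch of the GIS update on the same kind of toy problem, including the slack feature that pads every feature vector up to the constant sum C; the data, features, and number of iterations are all assumptions.

```python
import math

# One pass of Generalized Iterative Scaling:
# lam_k += (1/C) * log(empirical_k / expected_k).

LABELS = ["finance", "chemistry"]
DATA = [(["treasury", "bonds", "rates"], "finance"), (["chemical", "bonds"], "chemistry")]

def base_features(x, y):
    return {(w, y): 1.0 for w in x}

C = max(sum(base_features(x, y).values()) for x, _ in DATA for y in LABELS)

def features(x, y):
    f = base_features(x, y)
    f[("__slack__", y)] = C - sum(f.values())   # pad feature sums up to C
    return f

def p_y_given_x(x, lam):
    s = {y: math.exp(sum(lam.get(k, 0.0) * v for k, v in features(x, y).items()))
         for y in LABELS}
    z = sum(s.values())
    return {y: v / z for y, v in s.items()}

def gis_step(lam):
    empirical, expected = {}, {}
    for x, y in DATA:
        for k, v in features(x, y).items():
            empirical[k] = empirical.get(k, 0.0) + v
        p = p_y_given_x(x, lam)
        for yp in LABELS:
            for k, v in features(x, yp).items():
                expected[k] = expected.get(k, 0.0) + p[yp] * v
    return {k: lam.get(k, 0.0) + (1.0 / C) * math.log(empirical[k] / expected[k])
            for k in empirical if empirical[k] > 0}

lam = {}
for _ in range(50):
    lam = gis_step(lam)
print(p_y_given_x(["treasury", "bonds", "rates"], lam))
```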

Solution Techniques (2)
- Improved iterative scaling (IIS)
- Parameter updates: $\lambda_k \leftarrow \lambda_k + \delta_k$, where $\delta_k$ solves
  $\sum_i f_k(x_i,y_i) = \sum_i \sum_y p(y \mid x_i;\Lambda)\, f_k(x_i,y)\, e^{\delta_k f^\#(x_i,y)}$, with $f^\#(x,y) = \sum_k f_k(x,y)$
- For binary features this reduces to solving a polynomial with positive coefficients
- Reduces to GIS if the feature sum is constant

Deriving IIS (1)
- Conditional log-likelihood: $l(\Lambda) = \sum_i \log p(y_i \mid x_i;\Lambda)$
- Log-likelihood update:
  $l(\Lambda+\Delta) - l(\Lambda) = \sum_i\left[\Delta \cdot f(x_i,y_i) - \log\dfrac{Z(x_i;\Lambda+\Delta)}{Z(x_i;\Lambda)}\right]$
  $= \sum_i \Delta \cdot f(x_i,y_i) - \sum_i \log\sum_y \dfrac{e^{(\Lambda+\Delta)\cdot f(x_i,y)}}{Z(x_i;\Lambda)}$
  $= \sum_i \Delta \cdot f(x_i,y_i) - \sum_i \log\sum_y p(y \mid x_i;\Lambda)\, e^{\Delta\cdot f(x_i,y)}$
  $\ge \sum_i \Delta \cdot f(x_i,y_i) + N - \sum_i \sum_y p(y \mid x_i;\Lambda)\, e^{\Delta\cdot f(x_i,y)} \equiv A(\Delta)$  (using $\log x \le x - 1$)

Deriving IIS (2)
- By Jensen's inequality:
  $A(\Delta) \ge \sum_i \Delta \cdot f(x_i,y_i) + N - \sum_i \sum_y p(y \mid x_i;\Lambda) \sum_k \dfrac{f_k(x_i,y)}{f^\#(x_i,y)}\, e^{\delta_k f^\#(x_i,y)} \equiv B(\Delta)$
- Maximize the lower bound on the update:
  $\dfrac{\partial B(\Delta)}{\partial \delta_k} = \sum_i f_k(x_i,y_i) - \sum_i \sum_y p(y \mid x_i;\Lambda)\, f_k(x_i,y)\, e^{\delta_k f^\#(x_i,y)}$

Solution Techniques (3)
- GIS is very slow if the slack variable takes large values
- IIS is faster, but still problematic
- Recent suggestion: use standard convex optimization techniques
  • E.g. conjugate gradient
  • Some evidence of faster convergence

Gaussian Prior
- Log-likelihood gradient:
  $\dfrac{\partial l(\Lambda)}{\partial \lambda_k} = \sum_i f_k(x_i,y_i) - \sum_i \sum_y p(y \mid x_i;\Lambda)\, f_k(x_i,y) - \dfrac{\lambda_k}{\sigma_k^2}$
- Modified IIS update: $\lambda_k \leftarrow \lambda_k + \delta_k$, where $\delta_k$ solves
  $\sum_i f_k(x_i,y_i) = \sum_i \sum_y p(y \mid x_i;\Lambda)\, f_k(x_i,y)\, e^{\delta_k f^\#(x_i,y)} + \dfrac{\lambda_k + \delta_k}{\sigma_k^2}$, with $f^\#(x,y) = \sum_k f_k(x,y)$

Instance Representation
- Fixed-size instance (PP attachment): binary features
  • Word identity
  • Word class
- Variable-size instance (document classification)
  • Word identity
  • Word relative frequency in document

Enriching Features
- Word n-grams
- Sparse word n-grams
- Character n-grams (noisy transcriptions: speech, OCR)
- Unknown-word features: suffixes, capitalization
- Feature combinations (cf. n-grams)

"I understood each and every word you said but not the order in which they appeared."

Structured Models: Finite State

Structured Model Applications
- Language modeling
- Story segmentation
- POS tagging
- Information extraction (IE)
- (Shallow) parsing

Structured Models
- Assign a labeling to a sequence
  • Story segmentation
  • POS tagging
  • Named entity extraction
  • (Shallow) parsing

Constraint Satisfaction in Structured Models
- Train to minimize the labeling loss:
  $\hat{\theta} = \arg\min_\theta \sum_i \mathrm{Loss}(x_i, y_i \mid \theta)$
- Computing the best labeling:
  $\arg\min_y \mathrm{Loss}(x, y \mid \hat{\theta})$
- Efficient minimization requires:
  • A common currency for local labeling decisions
  • An efficient algorithm to combine the decisions

Local Classification Models
- Train to minimize the per-decision loss in context:
  $\hat{\theta} = \arg\min_\theta \sum_i \sum_{0 \le j < |x_i|} \mathrm{loss}(y_{i,j} \mid x_i, y_i(j); \theta)$
- Apply by guessing the context and finding each lowest-loss label:
  $\arg\min_{y_j} \mathrm{loss}(y_j \mid x, \hat{y}(j); \hat{\theta})$

Structured Model Claims
- Constraint satisfaction
  • Principled
  • Probabilistic interpretation allows model composition
  • Efficient optimal decoding
- Local classification
  • Wider range of models
  • More efficient training
  • Heuristic decoding comparable to pruning in global models

Example: Language Modeling
- n-gram (order n−1 Markov) model:
  $P(w_k \mid w_1 \ldots w_{k-1}) \approx P(w_k \mid w_{k-n+1} \ldots w_{k-1})$
- Example: character n-grams (Shannon, Lucky):
  0  XFOML RXKHRJFFJUJ ZLPWCFWKCYJ
  1  OCRO HLI RGWR NMIELWIS EU LL
  2  ON IE ANTSOUTINYS ARE T INCTORE ST
  3  IN NO IST LAT WHEY CRATICT FROURE
  4  ABOVERY UPONDULTS WELL THE CODERST
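In the spirit of the character n-gram examples, here is a minimal character bigram sampler; the training string and random seed are arbitrary choices.

```python
import random
from collections import Counter, defaultdict

# Character bigram model: estimate P(c_k | c_{k-1}) by counting, then sample text.
text = "the cat sat on the mat and the rat sat on the cat "

counts = defaultdict(Counter)
for prev, cur in zip(text, text[1:]):
    counts[prev][cur] += 1

def sample_next(prev):
    chars, freqs = zip(*counts[prev].items())
    return random.choices(chars, weights=freqs)[0]

random.seed(0)
out = "t"
for _ in range(60):
    out += sample_next(out[-1])
print(out)
```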

Markov's Unreasonable Effectiveness
- Entropy estimates for English:

  Model                                Bits/char
  human prediction (Cover & King 78)   1.34
  word trigrams (Brown et al 92)       1.75
  compress                             4.43

- Local word relations dominate statistics (Jelinek):
  1 The are to know the issues necessary role
  2 This will have this problems data thing
  2 One the understand these the information that
  7 Please need use problem people point
  9 We insert all tools issues
  98 resolve old
  1641 important

Limits of Markov Models
- No dependency
- Likelihoods based on sequencing, not dependency
[Same ranked word table as above]

Unseen Events (1)
- What's the probability of unseen events?
- Bias forces nonzero probabilities for some unseen events
- Typical bias: tie probabilities of related events
  • specific unseen event → general seen event: eat pineapple → eat _
  • event decomposition: event → event1 ∧ event2: eat pineapple → eat _ ∧ _ pineapple
- Factoring via latent variables:
  $P(\text{eat} \mid \text{pineapple}) \approx \sum_C P(\text{eat} \mid C)\, P(C \mid \text{pineapple})$

Unseen Events (2)
- Discount estimates for seen events
- Use the leftover for unseen events
- How to allocate the leftover?
  • Back off from an unseen event to less specific seen events: n-gram to (n−1)-gram
  • Hypothesize a hidden cause for unseen events: latent variable model
  • Relate an unseen event to distributionally similar seen events

Important Detour: Latent Variable Models

Expectation-Maximization (EM)
- Latent (hidden) variable models:
  $p(y, x, z \mid \Lambda)$, $\quad p(y, x \mid \Lambda) = \sum_z p(y, x, z \mid \Lambda)$
- Examples:
  • Mixture models
  • Class-based models (hidden classes)
  • Hidden Markov models

Maximizing Likelihood
- Data log-likelihood:
  $D = \{(x_1,y_1),\ldots,(x_N,y_N)\}$
  $L(D \mid \Lambda) = \sum_i \log p(x_i, y_i) = \sum_{x,y} \tilde{p}(x,y)\log p(x,y \mid \Lambda)$, $\quad \tilde{p}(x,y) = \dfrac{|\{i : x_i = x,\ y_i = y\}|}{N}$
- Find the parameters that maximize the (log-)likelihood:
  $\hat{\Lambda} = \arg\max_\Lambda \sum_{x,y} \tilde{p}(x,y)\log p(x,y \mid \Lambda)$

Convenient Lower Bounds (1)
- Convex function
- Jensen's inequality: if f is convex and p is a probability density,
  $f\left(\sum_x p(x)\, x\right) \le \sum_x p(x)\, f(x)$
[Figure: a convex function f, with the chord between $f(x_0)$ and $f(x_1)$ lying above the curve: $f(\alpha x_0 + (1-\alpha)x_1) \le \alpha f(x_0) + (1-\alpha) f(x_1)$]

Convenient Lower Bounds (2)
[Figure: log-likelihood $L(D \mid \lambda)$ as a function of λ, with successive lower bounds; EM alternates E steps (E0, E1) and M steps (M0)]

Auxiliary Function
- Find a convenient non-negative function that lower-bounds the likelihood increase:
  $L(D \mid \Lambda') - L(D \mid \Lambda) \ge Q(\Lambda', \Lambda) \ge 0$
- Maximize the lower bound:
  $\Lambda_{i+1} = \arg\max_{\Lambda'} Q(\Lambda', \Lambda_i)$

Comments
- Likelihood keeps increasing, but
  • Can get stuck in a local maximum (or saddle point!)
  • Can oscillate between different local maxima with the same log-likelihood
- If maximizing the auxiliary function is too hard, find some Λ that increases the likelihood: generalized EM (GEM)
- The sum over hidden variable values can be exponential if not done carefully (sometimes it is not possible)

Example: Mixture Model
- Base distributions: $p_i(y),\ 1 \le i \le m$
- Mixture coefficients: $\lambda_i \ge 0,\ \sum_{i=1}^m \lambda_i = 1$
- Mixture distribution: $p(y \mid \Lambda) = \sum_i \lambda_i p_i(y)$

Auxiliary Quantities
- Mixture coefficient i = prior probability of being in class i
- Joint probability: $p(c, y \mid \Lambda) = \lambda_c\, p_c(y)$
- Auxiliary function:
  $Q(\Lambda', \Lambda) = \sum_y \tilde{p}(y) \sum_c p(c \mid y, \Lambda)\log\dfrac{p(y, c \mid \Lambda')}{p(y, c \mid \Lambda)}$

Solution
- E step: $C_i = \sum_y \tilde{p}(y)\,\dfrac{\lambda_i p_i(y)}{\sum_j \lambda_j p_j(y)}$
- M step: $\lambda_i \leftarrow \dfrac{C_i}{\sum_j C_j}$
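These E and M steps are short enough to run directly. The sketch below keeps the base distributions p_i(y) fixed and re-estimates only the mixture coefficients; all numbers are invented.

```python
# EM for the mixture weights of a two-component mixture over a discrete alphabet.

components = [
    {"a": 0.7, "b": 0.2, "c": 0.1},   # p_1(y)
    {"a": 0.1, "b": 0.3, "c": 0.6},   # p_2(y)
]
# Empirical distribution p~(y) of the observed data (toy values).
p_tilde = {"a": 0.5, "b": 0.2, "c": 0.3}

lam = [0.5, 0.5]
for _ in range(100):
    # E step: C_i = sum_y p~(y) * lam_i p_i(y) / sum_j lam_j p_j(y)
    C = [0.0, 0.0]
    for y, py in p_tilde.items():
        denom = sum(lam[j] * components[j][y] for j in range(2))
        for i in range(2):
            C[i] += py * lam[i] * components[i][y] / denom
    # M step: lam_i <- C_i / sum_j C_j
    total = sum(C)
    lam = [c / total for c in C]

print(lam)  # mixture weights that (locally) maximize the likelihood
```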

More Finite-State Models

Example: Information Extraction
- Given: the types of entities and relationships we are interested in
  • People, places, organizations, dates, amounts, materials, processes, …
  • Employed by, located in, used for, arrived when, …
- Find all entities and relationships of the given types in the source material
- Collect them in a suitable database

IE Example
- Rely on:
  • Syntactic structure
  • Phrase classification

  "Nance, who is also a paid consultant to ABC News, said …"
  [Annotations: person ("Nance"), person-descriptor ("a paid consultant to ABC News"), organization ("ABC News"), employee relation, co-reference]

IE Methods
- Partial matching:
  • Hand-built patterns
  • Automatically trained hidden Markov models
  • Cascaded finite-state transducers
- Parsing-based:
  • Parse the whole text: shallow parser (chunking) or automatically induced grammar
  • Classify phrases and phrase relations as the desired entities and relationships

Global Constraint Models
- Train to minimize the labeling loss:
  $\hat{\theta} = \arg\min_\theta \sum_i \mathrm{Loss}(x_i, y_i \mid \theta)$
- Computing the best labeling:
  $\arg\min_y \mathrm{Loss}(x, y \mid \hat{\theta})$
- Efficient minimization requires:
  • A common currency for local labeling decisions
  • A dynamic programming algorithm to combine the decisions

Local Classification Models
- Train to minimize the per-symbol loss in context:
  $\hat{\theta} = \arg\min_\theta \sum_i \sum_{0 \le j < |x_i|} \mathrm{loss}(y_{i,j} \mid x_i, y_i(j); \theta)$
- Apply by guessing the context and finding each lowest-loss label:
  $\arg\min_{y_j} \mathrm{loss}(y_j \mid x, \hat{y}(j); \hat{\theta})$

Structured Model Claims
- Global constraint
  • Principled
  • Probabilistic interpretation allows model composition
  • Efficient optimal decoding
- Local classifier
  • Wider range of models
  • More efficient training
  • Heuristic decoding comparable to pruning in global models

Generative vs. Discriminative
- Hidden Markov models (HMMs): generative, global
- Conditional exponential models (MEMMs, CRFs): discriminative, global
- Boosting, winnow: discriminative, local

Generative Models
- Stochastic process that generates instance-label pairs
  • Process structure
  • Process parameters
- (Hypothesize structure)
- Estimate parameters from training data

Model Structure
- Decompose the generation of instances into elementary steps
- Define dependencies between steps
- Parameterize the dependencies
- Useful descriptive language: graphical models

Binary Naïve Bayes
- Represent instances by sets of binary features
  • Does the word occur in the document
  • …
- Finite predefined set of classes
[Graphical model: class node C with independent feature nodes F1, F2, F3, F4]

  $P(F_1,\ldots,F_n,C) = P(C)\prod_i P(F_i \mid C)$
  $P(c \mid F_1,\ldots,F_n) \propto P(c)\prod_i P(F_i \mid c)$

Discrete Hidden Markov Model
- Instances: symbol sequences
- Labels: class sequences
[Graphical model: class chain C0 → C1 → C2 → … with emissions X0, X1, X2, …]

  $P(X, C) = P(C_0)\, P(X_0 \mid C_0) \prod_i P(C_i \mid C_{i-1})\, P(X_i \mid C_i)$
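A sketch of this factorization, plus the forward recursion that sums it over all state sequences; the two-state parameters are invented for illustration.

```python
# Discrete HMM: joint probability P(X, C) and the marginal P(X) by the forward
# recursion. All parameters below are toy values.

states = ["N", "V"]                      # hypothetical tag states
start = {"N": 0.6, "V": 0.4}             # P(C_0)
trans = {"N": {"N": 0.3, "V": 0.7},      # P(C_i | C_{i-1})
         "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"ideas": 0.5, "sleep": 0.2, "furiously": 0.3},   # P(X_i | C_i)
        "V": {"ideas": 0.1, "sleep": 0.6, "furiously": 0.3}}

def joint(xs, cs):
    p = start[cs[0]] * emit[cs[0]][xs[0]]
    for i in range(1, len(xs)):
        p *= trans[cs[i - 1]][cs[i]] * emit[cs[i]][xs[i]]
    return p

def forward(xs):
    """Marginal P(X) = sum over all state sequences, by dynamic programming."""
    alpha = {s: start[s] * emit[s][xs[0]] for s in states}
    for x in xs[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][x]
                 for s in states}
    return sum(alpha.values())

xs = ["ideas", "sleep", "furiously"]
print(joint(xs, ["N", "V", "V"]))
print(forward(xs))
```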

Generating Multiple Features
- Instances: sequences of feature sets
  • Word identity
  • Word properties (e.g. spelling, capitalization)
- Labels: class sequences
[Graphical model: class chain C0 … C4, each class emitting several features F_{i,0}, F_{i,1}, F_{i,2}]
- Limitation: conditionally independent features

Independence or Intractability
- Trees are good: each node has a single immediate ancestor, and the joint probability is computed in linear time
- But that forces features to be conditionally independent given the class
- Unrealistic
  • Suffixes and capitalization
  • "San" and "Francisco" in a document

Score Card
✓ No independence assumptions
✓ Richer features: combinations of existing features
− Optimization problem for parameters
− Limited probabilistic interpretation
− Insensitive to input distribution

Information Extraction with HMMs
- Parameters = P(s|s′), P(o|s) for all states in S = {s1, s2, …}
- Observations: words
- Training: maximize the probability of the observations (+ prior)
- For IE, states indicate the "database field"
[Seymore & McCallum '99; Freitag & McCallum '99]

Problems with HMMs
1. Would prefer a richer representation of text: multiple overlapping features, whole chunks of text
   • Example line features: length of line; line is centered; percent of non-alphabetics; total amount of white space; line contains two verbs; line begins with a number; line is grammatically a question
   • Example word features: identity of word; word is in all caps; word ends in "-tion"; word is part of a noun phrase; word is in bold font; word is on left hand side of page; word is under node X in WordNet
2. HMMs are generative models. Generative models do not easily handle overlapping, non-independent features. Would prefer a conditional model: P({s…}|{o…}).

Solution: Conditional Model
- Hidden Markov model: P(s|s′), P(o|s)
- Maximum entropy Markov model: P(s|o,s′) (represented by an exponential model)
- For the time being, capture the dependency on s′ with |S| independent functions, P_s′(s|o)
- Each state contains a "next-state classifier" black box that, given the next observation, produces a probability distribution over possible next states, P_s′(s|o)

Two Sequence Models
- HMM: P(o|s), P(s|s′)
- MEMM: P(s|o,s′)
[Graphical models: in the HMM, s_{t-1} → s_t and s_t → o_t; in the MEMM, s_{t-1} → s_t and o_t → s_t]
- Standard belief propagation: forward-backward procedure
- Viterbi and Baum-Welch follow naturally

Transition Features
- Model P_s′(s|o) in terms of multiple arbitrary overlapping (binary) features
- Example observation predicates:
  • o is the word "apple"
  • o is capitalized
  • o is on a left-justified line
- A feature f depends on both a predicate b and a destination state s:
  $f_{\langle b,s\rangle}(o, s') = \begin{cases} 1 & \text{if } b(o) \text{ is true and } s' = s \\ 0 & \text{otherwise} \end{cases}$

Next-State Classifier
- Per-state conditional maxent model:
  $P_{s'}(s \mid o) = \dfrac{1}{Z(o, s')}\exp\left(\sum_{\langle b,q\rangle} \lambda_{\langle b,q\rangle}\, f_{\langle b,q\rangle}(o, s)\right)$
- Training: each state model is trained independently from labeled sequences

Example: Q-A Pairs from a FAQ

  X-NNTP-Poster: NewsHound v1.33
  Archive-name: acorn/faq/part2
  Frequency: monthly

  2.6) What configuration of serial cable should I use?

  Here follows a diagram of the necessary connections for common terminal programs to work properly. They are as far as I know the informal standard agreed upon by commercial comms software developers for the Arc.

  Pins 1, 4, and 8 must be connected together inside the 9 pin plug. This is to avoid the well known serial port chip bugs. The modem's DCD (Data Carrier Detect) signal has been re-routed to the Arc's RI (Ring Indicator) most modems broadcast a software RING signal anyway, and even then it's really necessary to detect it for the model to answer the call.

  2.7) The sound from the speaker port seems quite muffled. How can I get unfiltered sound from an Acorn machine?

  All Acorn machine are equipped with a sound filter designed to remove high frequency harmonics from the sound output. To bypass the filter, hook into the Unfiltered port. You need to have a capacitor. Look for LM324 (chip 39) and and hook the capacitor like this:

Experimental Data
- 38 files belonging to 7 UseNet FAQs
- Procedure: for each FAQ, train on one file, test on the others; average.

  <head> X-NNTP-Poster: NewsHound v1.33
  <head> Archive-name: acorn/faq/part2
  <head> Frequency: monthly
  <head>
  <question> 2.6) What configuration of serial cable should I use?
  <answer>
  <answer> Here follows a diagram of the necessary connection
  <answer> programs to work properly. They are as far as I know
  <answer> agreed upon by commercial comms software developers fo
  <answer>
  <answer> Pins 1, 4, and 8 must be connected together inside
  <answer> is to avoid the well known serial port chip bugs. The

Features in Experiments
begins-with-number, begins-with-ordinal, begins-with-punctuation, begins-with-question-word, begins-with-subject, blank, contains-alphanum, contains-bracketed-number, contains-http, contains-non-space, contains-number, contains-pipe, contains-question-mark, contains-question-word, ends-with-question-mark, first-alpha-is-capitalized, indented, indented-1-to-4, indented-5-to-10, more-than-one-third-space, only-punctuation, prev-is-blank, prev-begins-with-ordinal, shorter-than-30

Models Tested
- ME-Stateless: a single maximum entropy classifier applied to each line independently.
- TokenHMM: a fully connected HMM with four states, one for each of the line categories, each of which generates individual tokens (groups of alphanumeric characters and individual punctuation characters).
- FeatureHMM: identical to TokenHMM, only the lines in a document are first converted to sequences of features.
- MEMM: maximum entropy Markov model.

Results

  Learner        Segmentation precision   Segmentation recall
  ME-Stateless   0.038                    0.362
  TokenHMM       0.276                    0.140
  FeatureHMM     0.413                    0.529
  MEMM           0.867                    0.681

Label Bias Problem
- Example (after Bottou '91):
[Figure: a finite-state model with start state 0 and two paths, 0–1–2–3 reading "rib" and 0–4–5–6 reading "rob"]
- Bias toward states with fewer outgoing transitions.
- Per-state normalization does not allow the required score(1,2|ro) << score(1,2|ri):
  $P(1,2 \mid ro) = P(1 \mid r)\,P(2 \mid o,1) = P(1 \mid r)\cdot 1 = P(1 \mid r)\,P(2 \mid i,1) = P(1,2 \mid ri)$

Proposed Solutions
- Determinization:
  • not always possible
  • state-space explosion
- Fully connected models:
  • lack prior structural knowledge
- Conditional random fields (CRFs):
  • Allow some transitions to vote more strongly than others
  • Whole-sequence rather than per-state normalization
[Figure: determinized automaton for "rib"/"rob" sharing the initial "r" transition]

From HMMs to CRFs
  $s = s_1 \ldots s_n, \quad o = o_1 \ldots o_n$

  HMM: $P(s \mid o) \propto \prod_{t=1}^n P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$

  MEMM: $P(s \mid o) = \prod_{t=1}^n P(s_t \mid s_{t-1}, o_t) = \prod_{t=1}^n \dfrac{1}{Z(s_{t-1}, o_t)}\exp\left(\sum_f \lambda_f f(s_t, s_{t-1}) + \sum_g \mu_g\, g(s_t, o_t)\right)$

  CRF: $P(s \mid o) = \dfrac{1}{Z(o)}\prod_{t=1}^n \exp\left(\sum_f \lambda_f f(s_t, s_{t-1}) + \sum_g \mu_g\, g(s_t, o_t)\right)$

CRF General Form
- The state sequence is a Markov random field conditioned on the observation sequence.
- Model form:
  $P(s \mid o) = \dfrac{1}{Z(o)}\exp\left(\sum_{t=1}^n \sum_f \lambda_f f(s_{t-1}, s_t, o, t)\right)$
- Features represent the dependency between successive states conditioned on the observations
- Dependence on the whole observation sequence o (not possible in HMMs)

Efficient Estimation
- Matrix notation:
  $M_t(s', s \mid o) = \exp \Lambda_t(s', s \mid o)$
  $\Lambda_t(s', s \mid o) = \sum_f \lambda_f f(s', s, o, t)$
  $P_\Lambda(s \mid o) = \dfrac{1}{Z_\Lambda(o)}\prod_t M_t(s_{t-1}, s_t \mid o)$
  $Z_\Lambda(o) = \left(M_1(o)\, M_2(o) \cdots M_{n+1}(o)\right)_{\mathrm{start},\mathrm{stop}}$
- Efficient normalization: forward-backward algorithm

Forward-Backward Calculations
- For any path function $G(s) = \sum_t g_t(s_{t-1}, s_t)$:
  $E_\Lambda G = \sum_s P_\Lambda(s \mid o)\, G(s) = \sum_{t, s', s} \dfrac{\alpha_t(s' \mid o)\, g_{t+1}(s', s)\, M_{t+1}(s', s \mid o)\, \beta_{t+1}(s \mid o)}{Z_\Lambda(o)}$
  $\alpha_t(o) = \alpha_{t-1}(o)\, M_t(o)$
  $\beta_t(o)^\top = M_{t+1}(o)\, \beta_{t+1}(o)$
  $Z_\Lambda(o) = \alpha_{n+1}(\mathrm{end} \mid o) = \beta_0(\mathrm{start} \mid o)$
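A sketch of the matrix computation of Z_Λ(o), checked against a brute-force sum over all state sequences; the states, features, and weights are toy choices, not a trained model.

```python
import math
from itertools import product

STATES = ["start", "A", "B", "stop"]
obs = ["x1", "x2", "x3"]

def score(prev, cur, o, t):
    # Hand-made weighted features lambda_f * f(s_{t-1}, s_t, o, t).
    s = 0.0
    if prev == cur:
        s += 0.5                                  # transition feature
    if t < len(o) and cur == "A" and o[t] == "x2":
        s += 1.0                                  # observation-dependent feature
    return s

def M(t, o):
    """Matrix M_t(s', s | o), masked so paths run start -> A/B ... -> stop."""
    n = len(o)
    def allowed(p, c):
        ok_p = (p == "start") if t == 0 else p in ("A", "B")
        ok_c = (c == "stop") if t == n else c in ("A", "B")
        return ok_p and ok_c
    return {(p, c): math.exp(score(p, c, o, t)) if allowed(p, c) else 0.0
            for p in STATES for c in STATES}

def Z_by_matrices(o):
    # alpha_t = alpha_{t-1} M_t; Z is the (start, stop) entry of the full product.
    alpha = {s: 1.0 if s == "start" else 0.0 for s in STATES}
    for t in range(len(o) + 1):
        m = M(t, o)
        alpha = {c: sum(alpha[p] * m[(p, c)] for p in STATES) for c in STATES}
    return alpha["stop"]

def Z_brute_force(o):
    n = len(o)
    total = 0.0
    for seq in product(["A", "B"], repeat=n):
        path = ["start"] + list(seq) + ["stop"]
        total += math.exp(sum(score(path[t], path[t + 1], o, t)
                              for t in range(n + 1)))
    return total

print(Z_by_matrices(obs), Z_brute_force(obs))   # the two values should agree
```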

Training
- Maximize $L(\Lambda) = \sum_k \log P_\Lambda(s^k \mid o^k)$
- Log-likelihood gradient:
  $\dfrac{\partial L(\Lambda)}{\partial \lambda_f} = \sum_k \#_f(s^k \mid o^k) - \sum_k E_\Lambda\, \#_f(S \mid o^k)$, where $\#_f(s \mid o) = \sum_t f(s_{t-1}, s_t, o, t)$
- Methods: iterative scaling, conjugate gradient
- Comparable to standard Baum-Welch

Label Bias Experiment
- Data source: a noisy version of the "rib"/"rob" automaton, with P(intended symbol) = 29/30, P(other) = 1/30.
- Train both an MEMM and a CRF with identical topologies on data from the source.
- Compute decoding error: CRF 4.6%, MEMM 42% (2,000 training samples, 500 test).

Mixed-Order Sources
- Data generated by mixing sparse first- and second-order HMMs with varying mixing coefficient.
- Modeled by a first-order HMM, an MEMM, and a CRF (without contextual or overlapping features).

Part-of-Speech Tagging
- Trained on 50% of the 1.1 million words in the Penn Treebank. In this set, 5.45% of the words occur only once, and were mapped to "oov".
- Experiments with two different sets of features:
  • traditional: just the words
  • take advantage of the power of conditional models: use words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.

POS Tagging Results

  Model   Error   OOV error
  HMM     5.69%   45.99%
  MEMM    6.37%   54.61%
  CRF     5.55%   48.05%
  MEMM+   4.81%   26.99%
  CRF+    4.27%   23.76%

Structured Models: Stochastic Grammars

Beyond Finite State
- Constituency and meaningful relations: What sleeps? How does it sleep? What kind of ideas?
- Related sentences: Sleep furiously is all that colorless green ideas do.
[Figure: parse tree for "revolutionary new ideas advance slowly"]

Stochastic Context-Free Grammars
[Figure: parse tree for "revolutionary new ideas advance slowly"]

  S → NP VP     1.0
  NP → Det N′   0.2
  NP → N′       0.8
  N′ → Adj N′   0.3
  N′ → N        0.7
  VP → VP Adv   0.4
  VP → V        0.6
  N → ideas     0.1
  N → people    0.2
  V → sleep     0.3
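Under the context-freeness assumption, the probability of a derivation is just the product of its rule probabilities. The sketch below uses the rule table above plus some assumed lexical rule probabilities that are not on the slide.

```python
# Probability of an SCFG derivation as the product of the probabilities of the
# rules it uses. Trees are nested (label, children) tuples.

rules = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N'")): 0.2,
    ("NP", ("N'",)): 0.8,
    ("N'", ("Adj", "N'")): 0.3,
    ("N'", ("N",)): 0.7,
    ("VP", ("VP", "Adv")): 0.4,
    ("VP", ("V",)): 0.6,
    ("N", ("ideas",)): 0.1,
    ("N", ("people",)): 0.2,
    ("V", ("sleep",)): 0.3,
    # Assumed lexical probabilities (not from the slide):
    ("Adj", ("revolutionary",)): 0.05,
    ("Adj", ("new",)): 0.1,
    ("V", ("advance",)): 0.2,
    ("Adv", ("slowly",)): 0.1,
}

def tree_prob(tree):
    label, children = tree
    if isinstance(children[0], str):            # preterminal -> word
        return rules[(label, children)]
    p = rules[(label, tuple(c[0] for c in children))]
    for child in children:
        p *= tree_prob(child)
    return p

# Tree for "revolutionary new ideas advance slowly".
n_ideas = ("N", ("ideas",))
n3 = ("N'", (n_ideas,))
n2 = ("N'", (("Adj", ("new",)), n3))
n1 = ("N'", (("Adj", ("revolutionary",)), n2))
np = ("NP", (n1,))
vp = ("VP", (("VP", (("V", ("advance",)),)), ("Adv", ("slowly",))))
tree = ("S", (np, vp))
print(tree_prob(tree))
```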

Stochastic CFG Inference
- Inside-outside algorithm (Baker 79): find rule probabilities that locally maximize the likelihood of a training corpus (an instance of EM)
- Extended inside-outside algorithm: use information about the training corpus phrase structure to guide rule probability reestimation
  • Better modeling of phrase structure
  • Improved convergence
  • Improved computational complexity

SCFG Derivations
- Context-freeness ⇒ independence of derivation steps
[Figure: the probability of the parse tree for "revolutionary new ideas advance slowly" decomposes as P(N′ → Adj N′) times the probabilities of the subtrees it combines]

Inside-Outside Reestimation
- Ratios of expected rule frequencies to expected phrase frequencies:
  $P_{n+1}(N' \to \text{Adj } N') = \dfrac{E_n(\#\,N' \to \text{Adj } N')}{E_n(\#\,N')}$
- Computed from the probabilities of each phrase and phrase context in the training corpus
- Iterate
[Figure: a phrase and its context within the parse tree for "revolutionary new ideas advance slowly"]

Problems with I-O Reestimation
- Hill-climbing procedure: sensitivity to initial rule probabilities
- Does not learn grammar structure directly: structure is only implicit in the rule probabilities
- Linguistically inadequate grammars: high mutual-information sequences are grouped into phrases
  ((What (((is the) cheapest) fare)) ((I can) (get ?)))
  Contrast: Is $300 the cheapest fare?

(Partially) Bracketed Text
- Hand-annotated text with (some) phrase boundaries:
  (((List (the fares (for ((flight) (number 891)))))).)
- Use only derivations compatible with the training bracketing

Predictive Power
[Figure: cross entropy vs. iteration on training and test corpora, comparing training on raw text with training on bracketed text]

Bracketing Accuracy
- Accuracy criterion: proportion of phrases in the most likely analysis compatible with the treebank bracketing
[Figure: bracketing accuracy vs. iteration for raw and bracketed test data]
- Conclusion: structure is not evident from distribution alone

Limitations of SCFGs
- Likelihoods are independent of particular words
- Markovian assumption on syntactic categories
[Figure: Markov (sequential) vs. hierarchical models of "We need to resolve the issue"]

Lexicalization
- Markov model: [sequential word-to-word dependencies for "We need to resolve the issue"]
- Dependency model: [head-word dependencies for "We need to resolve the issue"]

Best Current Models
- Representation: surface trees with head-word propagation
- Generative power still context-free
- Model variables: head word, dependency type, argument vs. adjunct, heaviness, slash
- Main challenge: smoothing method for unseen dependencies
- Learned from hand-parsed text (treebank)
- Around 90% constituent accuracy

Lexicalized Tree (Collins 98)
[Figure: lexicalized parse tree for "workers dumped sacks into a bin", with head words propagated up the tree]

  Dependency         Direction   Relation
  workers → dumped   L           ⟨NP, S, VP⟩
  sacks → dumped     R           ⟨NP, VP, V⟩
  into → dumped      R           ⟨PP, VP, V⟩
  a → bin            L           ⟨D, NP, N⟩
  bin → into         R           ⟨NP, PP, P⟩

Inducing Representations

Unsupervised Learning
- Latent variable models
  • Model observables from latent variables
  • Search for a "good" set of latent variables
- Information bottleneck
  • Find an efficient compression of some observables
  • … preserving the information about other observables

Do Induced Classes Help?
- Generalization
  • Better statistics for coarser events
- Dimensionality reduction
  • Smaller models
  • Improved classification accuracy

Chomsky's Challenge to Empiricism

(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.

"… It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not." (Chomsky 1957)

Complex Events
- What Chomsky was talking about: Markov models, whose state is just a record of observations
- But statistical models can have hidden state:
  • a representation of past experience
  • uncertainty about the correct grammar
  • uncertainty about the correct interpretation of experience: ambiguity
- Probabilistic relationships involving hidden variables can be induced from observable data alone: the EM algorithm

In "Any Model"?
- Factored bigram model:
  $P(w_{i+1} \mid w_i) \approx \sum_{c=1}^{16} P(w_{i+1} \mid c)\, P(c \mid w_i)$
  $P(w_1 \cdots w_n) \approx P(w_1)\prod_{i=2}^{n} P(w_i \mid w_{i-1})$
- Trained for large-vocabulary speech recognition from newswire text by EM
- $\dfrac{P(\text{colorless green ideas sleep furiously})}{P(\text{furiously sleep ideas green colorless})} \approx 2 \times 10^5$

Distributional Clustering
- Automatic grouping of words according to the contexts in which they appear
- Approach to data sparseness: approximate the distribution of a relatively rare event (word) by the collective distribution of similar events (cluster)
- Sense ambiguity ≈ membership in several "soft" clusters
- Case study: cluster nouns according to the verbs that take them as direct objects

Training Data
- Universe: two word classes V and N, and a single relation between them (e.g. main verb – head noun of the verb's direct object)
- Data: frequencies f_vn of (v, n) pairs extracted from text by parsing or pattern matching

Distributional Representation
- Describing n ∈ N: use the conditional distribution p(V | n)
[Figure: relative frequencies of the verbs buy, hold, issue, own, purchase, select, sell, tender, trade as predicates of "stock" and "bond"]

Reminder: Bottleneck Model
- Markov condition: $p(\tilde{n} \mid v) = \sum_n p(\tilde{n} \mid n)\, p(n \mid v)$
- Find $p(\tilde{N} \mid N)$ to maximize the mutual information $I(\tilde{N}, V)$ for fixed $I(\tilde{N}, N)$, where
  $I(\tilde{N}, V) = \sum_{\tilde{n}, v} p(\tilde{n}, v)\log\dfrac{p(\tilde{n}, v)}{p(\tilde{n})\, p(v)}$
- Solution:
  $p(\tilde{n} \mid n) = \dfrac{p(\tilde{n})}{Z_n}\exp\left(-\beta\, D_{\mathrm{KL}}\!\left(p(V \mid n)\,\|\,p(V \mid \tilde{n})\right)\right)$
[Diagram: N → Ñ → V with the information quantities I(Ñ,N), I(Ñ,V), I(N,V)]
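A sketch of this soft assignment: given verb distributions for a few nouns and two cluster centroids (all invented numbers), compute p(ñ | n) at a given β.

```python
import math

# Soft cluster membership from the bottleneck solution:
# p(nt | n) = p(nt) / Z_n * exp(-beta * KL(p(V | n) || p(V | nt))).

verbs = ["buy", "sell", "fire", "hold"]
p_v_given_n = {
    "stock": {"buy": 0.4, "sell": 0.4, "fire": 0.0, "hold": 0.2},
    "bond":  {"buy": 0.5, "sell": 0.3, "fire": 0.0, "hold": 0.2},
    "gun":   {"buy": 0.2, "sell": 0.1, "fire": 0.6, "hold": 0.1},
}
centroids = {
    "c_securities": {"buy": 0.45, "sell": 0.35, "fire": 0.02, "hold": 0.18},
    "c_weapons":    {"buy": 0.2, "sell": 0.1, "fire": 0.55, "hold": 0.15},
}
p_centroid = {"c_securities": 0.5, "c_weapons": 0.5}   # p(nt)

def kl(p, q):
    return sum(p[v] * math.log(p[v] / q[v]) for v in verbs if p[v] > 0)

def membership(noun, beta):
    w = {c: p_centroid[c] * math.exp(-beta * kl(p_v_given_n[noun], centroids[c]))
         for c in centroids}
    z = sum(w.values())
    return {c: x / z for c, x in w.items()}

for noun in p_v_given_n:
    print(noun, membership(noun, beta=5.0))   # memberships sharpen as beta grows
```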

Search for Cluster Solutions
- The scale parameter β ("inverse temperature") determines how much a noun contributes to nearby centroids
- As β increases, clusters split ⇒ hierarchical clustering
[Figure: tree of cluster splits as β increases from β₀]

Small Example
- Cluster the 64 most common direct objects of "fire" in the 1988 Associated Press newswire:

  missile 0.835, rocket 0.850, bullet 0.917, gun 0.940
  officer 0.484, aide 0.612, chief 0.649, manager 0.651
  shot 0.858, bullet 0.925, rocket 0.930, missile 1.037
  gun 0.758, missile 0.786, weapon 0.862, rocket 0.875

Mutual Information Ratios
[Figure: I(Ñ,V)/I(Ñ,N) vs. I(Ñ,N)/H(N) as β increases]

Using Clusters for Prediction
- Model verb-object associations through object clusters:
  $\hat{p}(v \mid n) = \sum_{\tilde{n}} p(v \mid \tilde{n})\, p(\tilde{n} \mid n)$   (used in experiments)
  $\hat{p}(v \mid n) = \dfrac{\prod_{\tilde{n}} p(v \mid \tilde{n})^{p(\tilde{n} \mid n)}}{Z_n}$   (more appropriate)
- Depends on β
- Intuition: the associations of a word are a mixture of the associations of the sense classes to which the word belongs

Evaluation
- Relative entropy of held-out data to the asymmetric model
- Decision task: which of two verbs is more likely to take a noun as direct object, estimated from the model for training data in which the pairs relating the noun to one of the verbs have been deleted

Relative Entropy
[Figure: average relative entropy (bits) vs. number of clusters, for training, test, and new data]
- Train: 756,721 verb-object pair training set
- Test: 81,240-pair held-out test set
- New: held-out data for 1000 nouns not in the training data
- Verb-object pairs from the 1988 AP Newswire

Decision Task
[Figure: decision error vs. number of clusters, for all test pairs and for exceptional pairs]
- Which of v1 and v2 is more likely to take object o
- Held out: 104 (v2, o) pairs – need to guess from p_c(v2)
- Testing: compare all v1 and v2 such that (v2, o) was held out and (v1, o) occurs
- All: all test data
- Exceptional: verb pairs whose frequency ratio with o is reversed from their overall frequency ratio