Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data

Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing, Johns Hopkins University
{nasmith,jason}@cs.jhu.edu

ACL 2005. 35 slides.


TRANSCRIPT

Page 1:

Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data

Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu

Page 2: Nutshell Version

Contrastive estimation with lattice neighborhoods brings together:
• tractable training
• unannotated text
• "max ent" features
• sequence models

Experiments on unlabeled data:
• POS tagging: 46% error rate reduction (relative to EM)
• "Max ent" features make it possible to survive damage to the tag dictionary
• Dependency parsing: 21% attachment error reduction (relative to EM)

Page 3:

"Red leaves don't hide blue jays."

Page 4: Maximum Likelihood Estimation (Supervised)

x = red leaves don't hide blue jays
y = JJ NNS MD VB JJ NNS

Make the observed pair (x, y) likely. The numerator is p(x, y); the denominator sums over the entire space Σ* × Λ* (every word sequence paired with every tag sequence).

Page 5: Maximum Likelihood Estimation (Unsupervised)

x = red leaves don't hide blue jays
y = ? ? ? ? ? ?

The tags are hidden, so the numerator is p(x), summed over all taggings of x; the denominator is still Σ* × Λ*. This is what EM does.
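The two objectives side by side, in my rendering of the talk's notation (not copied from the slides):

```latex
\text{supervised MLE: } \max_\theta \prod_i p_\theta(x_i, y_i)
\qquad
\text{unsupervised MLE (EM): } \max_\theta \prod_i \sum_{y \in \Lambda^*} p_\theta(x_i, y)
```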

Page 6: Focusing Probability Mass

[Diagram: estimation moves probability mass away from the denominator set and onto the numerator set.]


Page 8: Conditional Estimation (Supervised)

x = red leaves don't hide blue jays
y = JJ NNS MD VB JJ NNS

A different denominator! The numerator is still the observed (x, y), but the denominator is {x} × Λ*: the observed sentence paired with every possible tagging (? ? ? ? ? ?).
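In other words, conditional estimation maximizes conditional likelihood (again my rendering, where u_θ is the model's unnormalized score):

```latex
\max_\theta \prod_i p_\theta(y_i \mid x_i)
= \max_\theta \prod_i
  \frac{u_\theta(x_i, y_i)}{\sum_{y' \in \Lambda^*} u_\theta(x_i, y')}
```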

Page 9: Objective Functions

Objective                 | Optimization Algorithm | Numerator    | Denominator
MLE                       | Count & Normalize*     | tags & words | Σ* × Λ*
MLE with hidden variables | EM*                    | words        | Σ* × Λ*
Conditional Likelihood    | Iterative Scaling      | tags & words | (words) × Λ*
Perceptron                | Backprop               | tags & words | hypothesized tags & words

*For generative models.

Page 10: Objective Functions

The same table, with a new row:

Objective              | Optimization Algorithm                                 | Numerator                                                                               | Denominator
Contrastive Estimation | generic numerical solvers (in this talk, LMVM L-BFGS)  | observed data (in this talk, the raw word sequence, summed over all possible taggings) | ?

Page 11:

This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

Page 12: Language Learning (Syntax)

red leaves don't hide blue jays

Why didn't he say "birds fly" or "dancing granola" or "the wash dishes" or any other sequence of words? (This is the question EM asks.)

Page 13: Language Learning (Syntax)

red leaves don't hide blue jays

Why did he pick that sequence for those words? Why not say "leaves red ..." or "... hide don't ..." or ...?

Page 14:

What is a syntax model supposed to explain? Each learning hypothesis corresponds to a denominator / neighborhood.

Page 15: The Job of Syntax: "Explain why each word is necessary."

→ DEL1WORD neighborhood (a code sketch follows below)

red leaves don't hide blue jays
leaves don't hide blue jays
red don't hide blue jays
red leaves hide blue jays
red leaves don't blue jays
red leaves don't hide jays
red leaves don't hide blue
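As a concrete illustration, here is a minimal Python sketch that enumerates this neighborhood by brute force (the paper encodes neighborhoods compactly as lattices rather than as explicit lists):

```python
def del1word(words):
    """DEL1WORD: the sentence itself plus every sentence obtained by
    deleting exactly one word -- n+1 sequences for an n-word sentence."""
    yield list(words)                      # delete nothing
    for i in range(len(words)):
        yield words[:i] + words[i + 1:]    # delete word i

for neighbor in del1word("red leaves don't hide blue jays".split()):
    print(" ".join(neighbor))
```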

Page 16: The Job of Syntax: "Explain the (local) order of the words."

→ TRANS1 neighborhood (sketched in code below)

red leaves don't hide blue jays
leaves red don't hide blue jays
red don't leaves hide blue jays
red leaves hide don't blue jays
red leaves don't blue hide jays
red leaves don't hide jays blue
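The corresponding brute-force sketch for TRANS1:

```python
def trans1(words):
    """TRANS1: the sentence itself plus every sentence obtained by
    transposing one adjacent word pair -- n sequences for n words."""
    yield list(words)                      # transpose nothing
    for i in range(len(words) - 1):
        swapped = list(words)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        yield swapped
```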

Page 17:

Numerator: the observed sentence, red leaves don't hide blue jays, summed over all taggings (? ? ? ? ? ?).

Denominator: every sentence in the TRANS1 neighborhood (the transpositions plus the original sentence), each summed over all of its taggings.

Page 18:

[Diagram: the TRANS1 neighborhood of "red leaves don't hide blue jays" packed into a single lattice (with any tagging); each path through the lattice is one sentence in the neighborhood.]

Page 19: The New Modeling Imperative

"Make the good sentence likely, at the expense of those bad neighbors."

A good sentence hints that a set of bad ones is nearby: the numerator is the good sentence, and the denominator is its neighborhood. (The objective is written out below.)
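Written out (my reconstruction in the talk's notation, where u_θ is the unnormalized score of a (sentence, tagging) pair and N(x) is the neighborhood of sentence x):

```latex
\max_{\theta}\; \sum_i \log
  \frac{\sum_{y \in \Lambda^*} u_\theta(x_i, y)}
       {\sum_{x' \in N(x_i)} \sum_{y \in \Lambda^*} u_\theta(x', y)}
```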

Page 20:

This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

Page 21: Log-Linear Models

p(x, y) = u(x, y) / Z, where u(x, y) = exp(θ · f(x, y)) is the score of (x, y) and Z is the partition function.

Computing Z is undesirable: it sums over all possible taggings of all possible sentences!

Conditional estimation (supervised) instead sums over the taggings of 1 sentence; contrastive estimation (unsupervised) sums over the taggings of a few sentences.
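A toy end-to-end sketch tying this to the brute-force neighborhood functions above; it scores whole sentences with word-bigram features only, whereas the real model scores (sentence, tagging) pairs, sums over taggings by dynamic programming on a lattice, and is fit with L-BFGS:

```python
import math
from collections import Counter

def log_u(words, theta):
    """Unnormalized log-score theta . f(x); here f counts word bigrams
    (a toy stand-in for the talk's tag-based features)."""
    feats = Counter(zip(words, words[1:]))
    return sum(theta.get(f, 0.0) * n for f, n in feats.items())

def ce_log_likelihood(words, neighborhood, theta):
    """Contrastive log-likelihood of one sentence:
    log u(x) - log sum over x' in N(x) of u(x')."""
    log_num = log_u(words, theta)
    denom = sum(math.exp(log_u(n, theta)) for n in neighborhood(words))
    return log_num - math.log(denom)

# e.g. ce_log_likelihood("red leaves don't hide blue jays".split(),
#                        trans1, theta={})
```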

Page 22: A Big Picture: Sequence Model Estimation

Three things we want: tractable sums, overlapping features, and the ability to use unannotated data.
• generative, MLE: p(x, y): tractable sums, but needs labels and lacks overlapping features
• log-linear, conditional estimation: p(y | x): tractable sums and overlapping features, but needs labels
• generative, EM: p(x): tractable sums and unannotated data, but lacks overlapping features
• log-linear, MLE: p(x, y), and log-linear, EM: p(x): overlapping features, but intractable sums
• log-linear, CE with lattice neighborhoods: all three

Page 23: Contrastive Neighborhoods

• Guide the learner toward models that do what syntax is supposed to do.
• Lattice representation → efficient algorithms.

There is an art to choosing neighborhood functions.

Page 24: Neighborhoods

neighborhood    | size  | lattice arcs | perturbations
DEL1WORD        | n+1   | O(n)         | delete up to 1 word
TRANS1          | n     | O(n)         | transpose any bigram
DELORTRANS1     | O(n)  | O(n)         | DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE | O(n²) | O(n²)        | delete any contiguous subsequence
Σ* (EM)         | ∞     | -            | replace each word with anything
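In the brute-force style of the earlier sketches, DELORTRANS1 is just the deduplicated union of the two generators:

```python
def delortrans1(words):
    """DELORTRANS1 = DEL1WORD union TRANS1, deduplicated."""
    seen = set()
    for n in list(del1word(words)) + list(trans1(words)):
        if tuple(n) not in seen:
            seen.add(tuple(n))
            yield n
```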

Page 25: The Merialdo (1994) Task

Given unlabeled text and a POS dictionary (which tells all possible tags for each word type; a form of supervision), learn to tag.

Page 26: Trigram Tagging Model

red leaves don't hide blue jays
JJ NNS MD VB JJ NNS

Feature set: tag trigrams; tag/word pairs from a POS dictionary. (Sketched below.)
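A sketch of the two feature templates; the padding symbols and feature-name strings are my own choices, not from the talk:

```python
from collections import Counter

def tagging_features(words, tags):
    """Counts for the trigram tagging model's features: tag trigrams
    plus tag/word pairs."""
    feats = Counter()
    padded = ["<s>", "<s>"] + list(tags) + ["</s>"]   # assumed padding
    for i in range(len(padded) - 2):
        feats["tri:" + " ".join(padded[i:i + 3])] += 1
    for w, t in zip(words, tags):
        feats[f"emit:{t}/{w}"] += 1
    return feats

# e.g. tagging_features("red leaves don't hide blue jays".split(),
#                       "JJ NNS MD VB JJ NNS".split())
```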

Page 27:

[Bar chart: tagging accuracy on ambiguous words (y-axis 30.0 to 100.0). Setup: 96K words, full POS dictionary, uninformative initializer, best of 8 smoothing conditions. Bars: random; EM (Merialdo, 1994); DAEM (Smith & Eisner, 2004); the CE neighborhoods DEL1SUBSEQUENCE, DEL1WORD, TRANS1, DELORTRANS1, and LENGTH (≈ log-linear EM); and supervised HMM and CRF baselines, one annotated "10 × data". Values plotted: 35.1, 58.7, 60.4, 62.1, 66.6, 70.0, 78.8, 79.0, 79.3, 97.2, 99.5. The best CE neighborhoods reach roughly 79%, versus roughly 60% for EM and 97.2 / 99.5 for the supervised models.]

Page 28: What if we damage the POS dictionary?

Dictionary includes: ■ all words ■ words from the 1st half of the corpus ■ words with count ≥ 2 ■ words with count ≥ 3. The dictionary excludes OOV words, which can get any tag.

Page 29:

[Bar chart: tagging accuracy on all words (y-axis 50.0 to 95.0) for DELORTRANS1, LENGTH, EM, and random under the four dictionary conditions from the previous slide. Setup: 96K words, 17 coarse POS tags, uninformative initializer. Values plotted, in groups of four: 90.1, 90.4, 84.4, 69.5; 84.8, 78.3, 80.5, 60.5; 81.3, 75.2, 70.9, 56.6; 77.2, 72.3, 66.5, 51.0. Every method loses accuracy as the dictionary is damaged.]

Page 30: Trigram Tagging Model + Spelling

red leaves don't hide blue jays
JJ NNS MD VB JJ NNS

Feature set: tag trigrams; tag/word pairs from a POS dictionary; 1- to 3-character suffixes; contains hyphen; contains digit. (The spelling templates are sketched below.)
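Extending the earlier feature sketch with the spelling templates; conjoining each with the word's tag is my assumption about how they are used:

```python
from collections import Counter

def spelling_features(words, tags):
    """Spelling templates: 1- to 3-character suffixes, contains-hyphen,
    and contains-digit, each paired with the word's tag (assumed)."""
    feats = Counter()
    for w, t in zip(words, tags):
        for k in (1, 2, 3):
            if len(w) >= k:
                feats[f"suf{k}:{t}/{w[-k:]}"] += 1
        if "-" in w:
            feats[f"hyphen:{t}"] += 1
        if any(c.isdigit() for c in w):
            feats[f"digit:{t}"] += 1
    return feats
```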

Page 31:

[Bar chart: the previous slide's results (DELORTRANS1, LENGTH, EM, random under the four dictionary conditions) with two new systems added, DELORTRANS1 + spelling and LENGTH + spelling. New values plotted: 91.1, 91.9, 90.8, 83.2, 90.3, 73.8, 89.8, 73.6.]

Spelling features aided recovery, but only with a smart neighborhood.

Page 32:

The model need not be finite-state.

Page 33: Unsupervised Dependency Parsing

[Bar chart: attachment accuracy (y-axis 0.0 to 50.0) for EM, LENGTH, and TRANS1, each with a clever and an uninformative initializer, compared against Klein & Manning (2004). Values plotted: 23.6, 33.8, 48.7, 35.2, 42.1, 37.4.]

See our paper at the IJCAI 2005 Grammatical Inference workshop.

Page 34: To Sum Up ...

Contrastive Estimation means picking your own denominator, for tractability or for accuracy (or, as in our case, for both). Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners). It's a particularly good fit for log-linear models: unsupervised sequence models with "max ent" features, all in time for ACL 2006.
