Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data

Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing, Johns Hopkins University
{nasmith,jason}@cs.jhu.edu

ACL 2005. 35 slides.


TRANSCRIPT

Page 1:

Contrastive Estimation: (Efficiently) Training Log-Linear Models (of Sequences) on Unlabeled Data

Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
{nasmith,jason}@cs.jhu.edu

Page 2: Nutshell Version

Contrastive estimation with lattice neighborhoods brings together:
• tractable training
• unannotated text
• "max ent" features
• sequence models

Experiments on unlabeled data:
• POS tagging: 46% error rate reduction (relative to EM)
• "Max ent" features make it possible to survive damage to the tag dictionary
• Dependency parsing: 21% attachment error reduction (relative to EM)

Page 3:

"Red leaves don't hide blue jays."

Page 4: Maximum Likelihood Estimation (Supervised)

x = red leaves don't hide blue jays
y = JJ NNS MD VB JJ NNS

Make the observed pair (x, y) likely. The numerator is p(x, y); the denominator sums over the entire space Σ* × Λ* (every word sequence paired with every tag sequence).

Page 5: Maximum Likelihood Estimation (Unsupervised)

x = red leaves don't hide blue jays
y = ? ? ? ? ? ?

The tags are hidden, so the numerator is p(x), summed over all taggings of x; the denominator is still Σ* × Λ*. This is what EM does.
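The two objectives side by side, in my rendering of the talk's notation (not copied from the slides):

```latex
\text{supervised MLE: } \max_\theta \prod_i p_\theta(x_i, y_i)
\qquad
\text{unsupervised MLE (EM): } \max_\theta \prod_i \sum_{y \in \Lambda^*} p_\theta(x_i, y)
```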

Page 6: Focusing Probability Mass

[Diagram: estimation moves probability mass away from the denominator set and onto the numerator set.]


Page 8: Conditional Estimation (Supervised)

x = red leaves don't hide blue jays
y = JJ NNS MD VB JJ NNS

A different denominator! The numerator is still the observed (x, y), but the denominator is {x} × Λ*: the observed sentence paired with every possible tagging (? ? ? ? ? ?).
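In other words, conditional estimation maximizes conditional likelihood (again my rendering, where u_θ is the model's unnormalized score):

```latex
\max_\theta \prod_i p_\theta(y_i \mid x_i)
= \max_\theta \prod_i
  \frac{u_\theta(x_i, y_i)}{\sum_{y' \in \Lambda^*} u_\theta(x_i, y')}
```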

Page 9: Objective Functions

Objective                 | Optimization Algorithm | Numerator    | Denominator
MLE                       | Count & Normalize*     | tags & words | Σ* × Λ*
MLE with hidden variables | EM*                    | words        | Σ* × Λ*
Conditional Likelihood    | Iterative Scaling      | tags & words | (words) × Λ*
Perceptron                | Backprop               | tags & words | hypothesized tags & words

*For generative models.

Page 10: Objective Functions

The same table, with a new row:

Objective              | Optimization Algorithm                                 | Numerator                                                                               | Denominator
Contrastive Estimation | generic numerical solvers (in this talk, LMVM L-BFGS)  | observed data (in this talk, the raw word sequence, summed over all possible taggings) | ?

Page 11:

This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

Page 12: Language Learning (Syntax)

red leaves don't hide blue jays

Why didn't he say "birds fly" or "dancing granola" or "the wash dishes" or any other sequence of words? (This is the question EM asks.)

Page 13: Language Learning (Syntax)

red leaves don't hide blue jays

Why did he pick that sequence for those words? Why not say "leaves red ..." or "... hide don't ..." or ...?

Page 14:

What is a syntax model supposed to explain? Each learning hypothesis corresponds to a denominator / neighborhood.

Page 15: The Job of Syntax: "Explain why each word is necessary."

→ DEL1WORD neighborhood (a code sketch follows below)

red leaves don't hide blue jays
leaves don't hide blue jays
red don't hide blue jays
red leaves hide blue jays
red leaves don't blue jays
red leaves don't hide jays
red leaves don't hide blue
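As a concrete illustration, here is a minimal Python sketch that enumerates this neighborhood by brute force (the paper encodes neighborhoods compactly as lattices rather than as explicit lists):

```python
def del1word(words):
    """DEL1WORD: the sentence itself plus every sentence obtained by
    deleting exactly one word -- n+1 sequences for an n-word sentence."""
    yield list(words)                      # delete nothing
    for i in range(len(words)):
        yield words[:i] + words[i + 1:]    # delete word i

for neighbor in del1word("red leaves don't hide blue jays".split()):
    print(" ".join(neighbor))
```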

Page 16: The Job of Syntax: "Explain the (local) order of the words."

→ TRANS1 neighborhood (sketched in code below)

red leaves don't hide blue jays
leaves red don't hide blue jays
red don't leaves hide blue jays
red leaves hide don't blue jays
red leaves don't blue hide jays
red leaves don't hide jays blue
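The corresponding brute-force sketch for TRANS1:

```python
def trans1(words):
    """TRANS1: the sentence itself plus every sentence obtained by
    transposing one adjacent word pair -- n sequences for n words."""
    yield list(words)                      # transpose nothing
    for i in range(len(words) - 1):
        swapped = list(words)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        yield swapped
```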

Page 17:

Numerator: the observed sentence, red leaves don't hide blue jays, summed over all taggings (? ? ? ? ? ?).

Denominator: every sentence in the TRANS1 neighborhood (the transpositions plus the original sentence), each summed over all of its taggings.

Page 18:

[Diagram: the TRANS1 neighborhood of "red leaves don't hide blue jays" packed into a single lattice (with any tagging); each path through the lattice is one sentence in the neighborhood.]

Page 19: The New Modeling Imperative

"Make the good sentence likely, at the expense of those bad neighbors."

A good sentence hints that a set of bad ones is nearby: the numerator is the good sentence, and the denominator is its neighborhood. (The objective is written out below.)
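Written out (my reconstruction in the talk's notation, where u_θ is the unnormalized score of a (sentence, tagging) pair and N(x) is the neighborhood of sentence x):

```latex
\max_{\theta}\; \sum_i \log
  \frac{\sum_{y \in \Lambda^*} u_\theta(x_i, y)}
       {\sum_{x' \in N(x_i)} \sum_{y \in \Lambda^*} u_\theta(x', y)}
```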

Page 20:

This talk is about denominators ... in the unsupervised case. A good denominator can improve accuracy and tractability.

Page 21: Log-Linear Models

p(x, y) = u(x, y) / Z, where u(x, y) = exp(θ · f(x, y)) is the score of (x, y) and Z is the partition function.

Computing Z is undesirable: it sums over all possible taggings of all possible sentences!

Conditional estimation (supervised) instead sums over the taggings of 1 sentence; contrastive estimation (unsupervised) sums over the taggings of a few sentences.
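A toy end-to-end sketch tying this to the brute-force neighborhood functions above; it scores whole sentences with word-bigram features only, whereas the real model scores (sentence, tagging) pairs, sums over taggings by dynamic programming on a lattice, and is fit with L-BFGS:

```python
import math
from collections import Counter

def log_u(words, theta):
    """Unnormalized log-score theta . f(x); here f counts word bigrams
    (a toy stand-in for the talk's tag-based features)."""
    feats = Counter(zip(words, words[1:]))
    return sum(theta.get(f, 0.0) * n for f, n in feats.items())

def ce_log_likelihood(words, neighborhood, theta):
    """Contrastive log-likelihood of one sentence:
    log u(x) - log sum over x' in N(x) of u(x')."""
    log_num = log_u(words, theta)
    denom = sum(math.exp(log_u(n, theta)) for n in neighborhood(words))
    return log_num - math.log(denom)

# e.g. ce_log_likelihood("red leaves don't hide blue jays".split(),
#                        trans1, theta={})
```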

Page 22: A Big Picture: Sequence Model Estimation

Three things we want: tractable sums, overlapping features, and the ability to use unannotated data.
• generative, MLE: p(x, y): tractable sums, but needs labels and lacks overlapping features
• log-linear, conditional estimation: p(y | x): tractable sums and overlapping features, but needs labels
• generative, EM: p(x): tractable sums and unannotated data, but lacks overlapping features
• log-linear, MLE: p(x, y), and log-linear, EM: p(x): overlapping features, but intractable sums
• log-linear, CE with lattice neighborhoods: all three

Page 23: Contrastive Neighborhoods

• Guide the learner toward models that do what syntax is supposed to do.
• Lattice representation → efficient algorithms.

There is an art to choosing neighborhood functions.

Page 24: Neighborhoods

neighborhood    | size  | lattice arcs | perturbations
DEL1WORD        | n+1   | O(n)         | delete up to 1 word
TRANS1          | n     | O(n)         | transpose any bigram
DELORTRANS1     | O(n)  | O(n)         | DEL1WORD ∪ TRANS1
DEL1SUBSEQUENCE | O(n²) | O(n²)        | delete any contiguous subsequence
Σ* (EM)         | ∞     | -            | replace each word with anything
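In the brute-force style of the earlier sketches, DELORTRANS1 is just the deduplicated union of the two generators:

```python
def delortrans1(words):
    """DELORTRANS1 = DEL1WORD union TRANS1, deduplicated."""
    seen = set()
    for n in list(del1word(words)) + list(trans1(words)):
        if tuple(n) not in seen:
            seen.add(tuple(n))
            yield n
```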

Page 25: The Merialdo (1994) Task

Given unlabeled text and a POS dictionary (which tells all possible tags for each word type; a form of supervision), learn to tag.

Page 26: Trigram Tagging Model

red leaves don't hide blue jays
JJ NNS MD VB JJ NNS

Feature set: tag trigrams; tag/word pairs from a POS dictionary. (Sketched below.)
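A sketch of the two feature templates; the padding symbols and feature-name strings are my own choices, not from the talk:

```python
from collections import Counter

def tagging_features(words, tags):
    """Counts for the trigram tagging model's features: tag trigrams
    plus tag/word pairs."""
    feats = Counter()
    padded = ["<s>", "<s>"] + list(tags) + ["</s>"]   # assumed padding
    for i in range(len(padded) - 2):
        feats["tri:" + " ".join(padded[i:i + 3])] += 1
    for w, t in zip(words, tags):
        feats[f"emit:{t}/{w}"] += 1
    return feats

# e.g. tagging_features("red leaves don't hide blue jays".split(),
#                       "JJ NNS MD VB JJ NNS".split())
```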

Page 27:

[Bar chart: tagging accuracy on ambiguous words (y-axis 30.0 to 100.0). Setup: 96K words, full POS dictionary, uninformative initializer, best of 8 smoothing conditions. Bars: random; EM (Merialdo, 1994); DAEM (Smith & Eisner, 2004); the CE neighborhoods DEL1SUBSEQUENCE, DEL1WORD, TRANS1, DELORTRANS1, and LENGTH (≈ log-linear EM); and supervised HMM and CRF baselines, one annotated "10 × data". Values plotted: 35.1, 58.7, 60.4, 62.1, 66.6, 70.0, 78.8, 79.0, 79.3, 97.2, 99.5. The best CE neighborhoods reach roughly 79%, versus roughly 60% for EM and 97.2 / 99.5 for the supervised models.]

Page 28: What if we damage the POS dictionary?

Dictionary includes: ■ all words ■ words from the 1st half of the corpus ■ words with count ≥ 2 ■ words with count ≥ 3. The dictionary excludes OOV words, which can get any tag.

Page 29:

[Bar chart: tagging accuracy on all words (y-axis 50.0 to 95.0) for DELORTRANS1, LENGTH, EM, and random under the four dictionary conditions from the previous slide. Setup: 96K words, 17 coarse POS tags, uninformative initializer. Values plotted, in groups of four: 90.1, 90.4, 84.4, 69.5; 84.8, 78.3, 80.5, 60.5; 81.3, 75.2, 70.9, 56.6; 77.2, 72.3, 66.5, 51.0. Every method loses accuracy as the dictionary is damaged.]

Page 30: Trigram Tagging Model + Spelling

red leaves don't hide blue jays
JJ NNS MD VB JJ NNS

Feature set: tag trigrams; tag/word pairs from a POS dictionary; 1- to 3-character suffixes; contains hyphen; contains digit. (The spelling templates are sketched below.)
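Extending the earlier feature sketch with the spelling templates; conjoining each with the word's tag is my assumption about how they are used:

```python
from collections import Counter

def spelling_features(words, tags):
    """Spelling templates: 1- to 3-character suffixes, contains-hyphen,
    and contains-digit, each paired with the word's tag (assumed)."""
    feats = Counter()
    for w, t in zip(words, tags):
        for k in (1, 2, 3):
            if len(w) >= k:
                feats[f"suf{k}:{t}/{w[-k:]}"] += 1
        if "-" in w:
            feats[f"hyphen:{t}"] += 1
        if any(c.isdigit() for c in w):
            feats[f"digit:{t}"] += 1
    return feats
```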

Page 31:

[Bar chart: the previous slide's results (DELORTRANS1, LENGTH, EM, random under the four dictionary conditions) with two new systems added, DELORTRANS1 + spelling and LENGTH + spelling. New values plotted: 91.1, 91.9, 90.8, 83.2, 90.3, 73.8, 89.8, 73.6.]

Spelling features aided recovery, but only with a smart neighborhood.

Page 32:

The model need not be finite-state.

Page 33: Unsupervised Dependency Parsing

[Bar chart: attachment accuracy (y-axis 0.0 to 50.0) for EM, LENGTH, and TRANS1, each with a clever and an uninformative initializer, compared against Klein & Manning (2004). Values plotted: 23.6, 33.8, 48.7, 35.2, 42.1, 37.4.]

See our paper at the IJCAI 2005 Grammatical Inference workshop.

Page 34: To Sum Up ...

Contrastive Estimation means picking your own denominator, for tractability or for accuracy (or, as in our case, for both). Now we can use the task to guide the unsupervised learner (like discriminative techniques do for supervised learners). It's a particularly good fit for log-linear models: unsupervised sequence models with "max ent" features, all in time for ACL 2006.
