
Page 1: Log-Linear Models in NLP

Log-Linear Models in NLP

Noah A. Smith
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University
[email protected]

Page 2: Log-Linear Models in NLP

Outline

• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi’s tagger
• Conditional random fields
• Smoothing
• Feature Selection

Page 3: Log-Linear Models in NLP

Data

For now, we’re just talking about modeling data. No task.

How to assign probability to each shape type?

Page 4: Log-Linear Models in NLP

Maximum Likelihood

[Figure: the 12 shape types with their maximum-likelihood probability estimates; several types get probability 0.]

11 degrees of freedom (12 – 1).

How to smooth? Fewer parameters?

Page 5: Log-Linear Models in NLP

Some other kinds of models

[Graphical model with nodes Color, Shape, Size]

Pr(Color, Shape, Size) = Pr(Color) • Pr(Shape | Color) • Pr(Size | Color, Shape)

Pr(Color): 0.5, 0.5
Pr(Shape | Color): 0.125, 0.375, 0.500 for each color (these two are the same!)
Pr(Size | Color, Shape), one table per color–shape pair:
  large 0.000, small 1.000
  large 0.333, small 0.667
  large 0.250, small 0.750
  large 1.000, small 0.000
  large 0.000, small 1.000
  large 0.000, small 1.000

11 degrees of freedom (1 + 4 + 6).

Page 6: Log-Linear Models in NLP

Some other kinds of models

[Graphical model with nodes Color, Shape, Size]

Pr(Color, Shape, Size) = Pr(Color) • Pr(Shape) • Pr(Size | Color, Shape)

Pr(Color): 0.5, 0.5
Pr(Shape): 0.125, 0.375, 0.500
Pr(Size | Color, Shape), one table per color–shape pair:
  large 0.000, small 1.000
  large 0.333, small 0.667
  large 0.250, small 0.750
  large 1.000, small 0.000
  large 0.000, small 1.000
  large 0.000, small 1.000

9 degrees of freedom (1 + 2 + 6).

Page 7: Log-Linear Models in NLP

Some other kinds of models

[Graphical model with nodes Color, Shape, Size]

Pr(Color, Shape, Size) = Pr(Size) • Pr(Shape | Size) • Pr(Color | Size)

Pr(Size): large 0.375, small 0.625
Pr(Shape | Size): large: 0.333, 0.333, 0.333; small: 0.077, 0.385, 0.538
Pr(Color | Size): large: 0.667, 0.333; small: 0.462, 0.538

7 degrees of freedom (1 + 2 + 4).

No zeroes here ...

Page 8: Log-Linear Models in NLP

Some other kinds of models

[Graphical model with nodes Color, Shape, Size]

Pr(Color, Shape, Size) = Pr(Size) • Pr(Shape) • Pr(Color)

Pr(Size): large 0.375, small 0.625
Pr(Shape): 0.125, 0.375, 0.500
Pr(Color): 0.5, 0.5

4 degrees of freedom (1 + 2 + 1).

Page 9: Log-Linear Models in NLP

This is difficult.

Different factorizations affect:
• smoothing
• # parameters (model size)
• model complexity
• “interpretability”
• goodness of fit
• ...

Usually, this isn’t done empirically, either!

Page 10: Log-Linear Models in NLP

Desiderata

• You decide which features to use.
• Some intuitive criterion tells you how to use them in the model.
• Empirical.

Page 11: Log-Linear Models in NLP

Maximum Entropy

“Make the model as uniform as possible ...

but I noticed a few things that I want to model ...

so pick a model that fits the data on those things.”

Page 12: Log-Linear Models in NLP

Occam’s Razor

One should not increase, beyond what is necessary, the number of entities required to explain anything.

Page 13: Log-Linear Models in NLP

Uniform model

small 0.083 0.083 0.083

small 0.083 0.083 0.083

large 0.083 0.083 0.083

large 0.083 0.083 0.083

Page 14: Log-Linear Models in NLP

Constraint: Pr(small) = 0.625

small 0.104 0.104 0.104
small 0.104 0.104 0.104
large 0.063 0.063 0.063
large 0.063 0.063 0.063

(The small cells sum to 0.625.)

Page 15: Log-Linear Models in NLP

Constraint: Pr([one shape], small) = 0.048

small 0.024 0.144 0.144
small 0.024 0.144 0.144
large 0.063 0.063 0.063
large 0.063 0.063 0.063

(The small cells still sum to 0.625; the constrained shape’s small cells sum to 0.048.)

Page 16: Log-Linear Models in NLP

Constraint: Pr(large, [one shape]) = 0.125

small 0.024 0.144 0.144
small 0.024 0.144 0.144
large 0.063 0.063 0.063
large 0.063 0.063 0.063

(The first two constraints still hold, but how should the large cells change to satisfy the new one?)

Page 17: Log-Linear Models in NLP

Questions

• Does a solution always exist? What to do if it doesn’t?
• Is there a way to express the model succinctly?
• Is there an efficient way to solve this problem?

Page 18: Log-Linear Models in NLP

Entropy

• A statistical measurement on a distribution.
• Measured in bits: between 0 and log₂ of the number of outcomes.
• High entropy: close to uniform.
• Low entropy: close to deterministic.
• Concave in p.

[Figure: the entropy of a two-outcome distribution as a function of p, peaking at 1 bit at p = 0.5.]
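As a quick, illustrative aside (not part of the original slides), a few lines of Python compute entropy in bits for a discrete distribution and confirm the range and extremes described above:

```python
import math

def entropy_bits(p):
    """Entropy, in bits, of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy_bits([1/12] * 12))        # uniform over 12 shape types: log2(12) ≈ 3.585 bits (the maximum)
print(entropy_bits([0.9, 0.05, 0.05]))  # nearly deterministic: ≈ 0.569 bits
print(entropy_bits([1.0]))              # fully deterministic: 0.0 bits
```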

Page 19: Log-Linear Models in NLP

The Max Ent Problem

[Figure: surface plot of the entropy H over distributions (p1, p2), with the maximum marked.]

Page 20: Log-Linear Models in NLP

The Max Ent Problem

Annotations on the constrained optimization problem:
• the objective function is H
• we are picking a distribution: probabilities sum to 1 ... and are nonnegative
• n constraints: expected feature value under the model = expected feature value from the data
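The equations themselves were images on the slide; a standard way to write the problem those annotations describe (with features f_1, ..., f_n and empirical distribution p̃) is:

```latex
\begin{aligned}
\max_{p}\quad & H(p) = -\sum_{x} p(x)\,\log_2 p(x) \\
\text{subject to}\quad & \sum_{x} p(x) = 1, \qquad p(x) \ge 0 \ \ \forall x, \\
& \sum_{x} p(x)\, f_i(x) \;=\; \sum_{x} \tilde{p}(x)\, f_i(x), \qquad i = 1, \dots, n .
\end{aligned}
```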

Page 21: Log-Linear Models in NLP

The Max Ent Problem

[Figure: the entropy surface over (p1, p2) again, as on the earlier slide.]

Page 22: Log-Linear Models in NLP

About feature constraints

Example binary features:
• 1 if x is small, 0 otherwise
• 1 if x is a small [shape from the figure], 0 otherwise
• 1 if x is large and light, 0 otherwise

Page 23: Log-Linear Models in NLP

Mathematical Magic

A constrained maximization: |X| variables (p), concave in p
⇔ an unconstrained maximization: N variables (θ), concave in θ

Page 24: Log-Linear Models in NLP

What’s the catch?

The model takes on a specific, parameterized form.

It can be shown that any max-ent model must take this form.

Page 25: Log-Linear Models in NLP

Outline

• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi’s tagger
• Conditional random fields
• Smoothing
• Feature Selection

Page 26: Log-Linear Models in NLP

Log-linear models

Log linear

Page 27: Log-Linear Models in NLP

Log-linear models

Annotations on the formula:
• unnormalized probability, or weight
• partition function
• one parameter (θi) for each feature
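The formula being annotated was an image; the standard log-linear form it refers to is:

```latex
p_\theta(x) \;=\; \frac{u_\theta(x)}{Z(\theta)}, \qquad
u_\theta(x) \;=\; \exp \sum_{i=1}^{N} \theta_i f_i(x) \quad \text{(unnormalized weight)}, \qquad
Z(\theta) \;=\; \sum_{x'} u_\theta(x') \quad \text{(partition function)}.
```

Taking logs, log pθ(x) = Σi θi fi(x) − log Z(θ): linear in the features, hence the name.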

Page 28: Log-Linear Models in NLP

Mathematical Magic

The max ent problem: constrained, |X| variables (p), concave in p
⇔ the log-linear ML problem: unconstrained, N variables (θ), concave in θ

Page 29: Log-Linear Models in NLP

What does MLE mean?

• independence among examples
• the arg max is the same in the log domain
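The derivation itself was an image on the slide; in standard notation, for training examples x_1, ..., x_m:

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \prod_{j=1}^{m} p_\theta(x_j)
\;=\; \arg\max_{\theta} \sum_{j=1}^{m} \log p_\theta(x_j),
```

where the product uses independence among examples and the second equality holds because the arg max is the same in the log domain.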

Page 30: Log-Linear Models in NLP

MLE: Then and Now

Directed models: concave; constrained (to the simplex); “count and normalize” (closed-form solution).
Log-linear models: concave; unconstrained; iterative methods.

Page 31: Log-Linear Models in NLP

Iterative Methods

• Generalized Iterative Scaling• Improved Iterative Scaling• Gradient Ascent• Newton/Quasi-Newton Methods

– Conjugate Gradient– Limited-Memory Variable Metric– ...

All of these methods are

correct and will converge to the right answer; it’s just a matter of

how fast.
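As an illustrative sketch (not from the slides), here is plain gradient ascent for maximum-likelihood training of a log-linear model over a small finite outcome space; the gradient is the familiar difference between observed and expected feature values:

```python
import numpy as np

def fit_loglinear(F, counts, lr=0.5, iters=2000):
    """Fit p(x) proportional to exp(theta . F[x]) by gradient ascent on the average log-likelihood.

    F:      (num_outcomes, num_features) feature matrix
    counts: (num_outcomes,) observed count of each outcome
    """
    theta = np.zeros(F.shape[1])
    observed = counts @ F / counts.sum()         # empirical feature expectations
    for _ in range(iters):
        scores = F @ theta
        weights = np.exp(scores - scores.max())  # unnormalized weights (numerically stabilized)
        p = weights / weights.sum()              # model distribution
        expected = p @ F                         # model feature expectations
        theta += lr * (observed - expected)      # gradient of the avg. log-likelihood
    return theta, p

# Hypothetical toy data: 3 outcomes described by 2 binary features.
F = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
counts = np.array([3.0, 5.0, 2.0])
theta, p = fit_loglinear(F, counts)
print(theta, p)   # after training, model feature expectations are close to the empirical ones
```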

Page 32: Log-Linear Models in NLP

Questions

• Does a solution always exist? Yes, if the constraints come from the data.
• Is there a way to express the model succinctly? Yes, a log-linear model.
• Is there an efficient way to solve this problem? Yes, many iterative methods.

Page 33: Log-Linear Models in NLP

Outline

• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi’s tagger
• Conditional random fields
• Smoothing
• Feature Selection

Page 34: Log-Linear Models in NLP

Conditional Estimation

Classification Rule: choose the most probable label for a given example.
Training Objective: the conditional likelihood of the observed labels given their examples.
(Here x ranges over examples and y over labels; both formulas are written out below.)
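Both formulas were images on the slide; in the usual notation:

```latex
\text{classification rule:}\quad \hat{y}(x) \;=\; \arg\max_{y} \; p_\theta(y \mid x),
\qquad
\text{training objective:}\quad \max_{\theta} \; \sum_{j} \log p_\theta(y_j \mid x_j).
```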

Page 35: Log-Linear Models in NLP

Maximum Likelihood

object

label

Page 39: Log-Linear Models in NLP

Conditional Likelihood

object

label

Page 40: Log-Linear Models in NLP

Remember:

log-linear models

conditional estimation

Page 41: Log-Linear Models in NLP

The Whole Picture

• MLE: directed models get “count & normalize”; log-linear models need unconstrained concave optimization.
• CLE: directed models need constrained concave optimization; log-linear models need unconstrained concave optimization.

Page 42: Log-Linear Models in NLP

Log-linear models: MLE vs. CLE

• MLE: the normalizer sums over all example types and all labels.
• CLE: the normalizer sums over all labels (for each example).
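The two objectives were images on the slide; for a log-linear model with joint features f(x, y), they differ only in what the normalizer sums over:

```latex
\text{MLE:}\quad \sum_{j} \log \frac{\exp\, \theta \cdot f(x_j, y_j)}{\sum_{x'} \sum_{y'} \exp\, \theta \cdot f(x', y')}
\qquad\qquad
\text{CLE:}\quad \sum_{j} \log \frac{\exp\, \theta \cdot f(x_j, y_j)}{\sum_{y'} \exp\, \theta \cdot f(x_j, y')}
```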

Page 43: Log-Linear Models in NLP

Classification Rule

Pick the most probable label y.

We don’t need to compute the partition function at test time! But it does need to be computed during training.
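The rule itself was an image on the slide; since Z(x) does not depend on the label y, it reduces to comparing unnormalized scores:

```latex
\hat{y} \;=\; \arg\max_{y} \; p_\theta(y \mid x)
\;=\; \arg\max_{y} \; \frac{\exp\, \theta \cdot f(x, y)}{Z_\theta(x)}
\;=\; \arg\max_{y} \; \theta \cdot f(x, y).
```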

Page 44: Log-Linear Models in NLP

Outline

• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi’s tagger
• Conditional random fields
• Smoothing
• Feature Selection

Page 45: Log-Linear Models in NLP

Ratnaparkhi’s POS Tagger (1996)

• Probability model (sketched in standard form below).
• Assume unseen words behave like rare words.
  – Rare words ≡ count < 5
• Training: GIS
• Testing/Decoding: beam search
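The probability model was an image on the slide; the standard description of Ratnaparkhi’s tagger is a left-to-right factorization whose local distributions are conditional log-linear models over a history of nearby words and the two previous tags (the feature templates appear on the next two slides):

```latex
p(t_1 \dots t_n \mid w_1 \dots w_n) \;=\; \prod_{i=1}^{n} p(t_i \mid h_i),
\qquad h_i = (w_{i-2}, \dots, w_{i+2},\, t_{i-2}, t_{i-1}),
\qquad
p(t \mid h) \;=\; \frac{\exp \sum_k \theta_k f_k(h, t)}{\sum_{t'} \exp \sum_k \theta_k f_k(h, t')} .
```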

Page 46: Log-Linear Models in NLP

Features: common words

the stories about well-heeled communities and developers
DT NNS IN JJ NNS CC NNS

Features that fire when tagging “about” as IN:
• current word = about & IN
• previous word = stories & IN
• word two back = the & IN
• next word = well-heeled & IN
• word two ahead = communities & IN
• previous tag = NNS & IN
• previous two tags = DT NNS & IN

Page 47: Log-Linear Models in NLP

Features: rare words

the stories about well-heeled communities and developers
DT NNS IN JJ NNS CC NNS

“well-heeled” is rare, so its identity is not used. Features that fire when tagging it as JJ:
• previous word = about & JJ
• word two back = stories & JJ
• next word = communities & JJ
• word two ahead = and & JJ
• previous tag = IN & JJ
• previous two tags = NNS IN & JJ
• prefixes: w & JJ, we & JJ, wel & JJ, well & JJ
• suffixes: d & JJ, ed & JJ, led & JJ, eled & JJ
• contains a hyphen & JJ

Page 48: Log-Linear Models in NLP

The “Label Bias” Problem

Training data: 4 examples of “born to wealth” (tagged VBN IN NN) and 6 examples of “born to run” (tagged VBN TO VB).

Page 49: Log-Linear Models in NLP

The “Label Bias” Problem

[Figure: the tagger’s lattice for “born to ...”: start → VBN (“born”), then either VBN,IN → IN,NN (for “wealth”) or VBN,TO → TO,VB (for “run”).]

For the input “born to wealth”:
Pr(VBN | born) · Pr(IN | VBN, to) · Pr(NN | VBN, IN, wealth) = 1 × .4 × 1
Pr(VBN | born) · Pr(TO | VBN, to) · Pr(VB | VBN, TO, wealth) = 1 × .6 × 1

So the locally normalized tagger prefers the second (wrong) path, no matter what the third word is.

Page 50: Log-Linear Models in NLP

Is this symptomatic of log-linear models?

No!

Page 51: Log-Linear Models in NLP

Tagging Decisions

[Figure: a left-to-right lattice of tagging decisions (tag1, tag2, ..., tagn), with candidate tags A, B, C, D at each position.]

At each decision point, the total weight is 1. Choose the path with the greatest weight.

You must choose tag2 = B, even if B is a terrible tag for word2: Pr(tag2 = B | anything at all!) = 1. You never pay a penalty for it!

Page 52: Log-Linear Models in NLP

Tagging Decisions in an HMM

[Figure: the same lattice of tagging decisions.]

At each decision point, the total weight can be 0. Choose the path with the greatest weight.

You may choose to discontinue this path if B can’t tag word2, or pay a high cost.

Page 53: Log-Linear Models in NLP

Outline

• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi’s tagger
• Conditional random fields
• Smoothing
• Feature Selection

Page 54: Log-Linear Models in NLP

Conditional Random Fields

• Lafferty, McCallum, and Pereira (2001)
• Whole-sentence model with local features (sketched below):
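The model equation was an image on the slide; the linear-chain CRF it describes scores whole tag sequences with locally scoped features and normalizes once per sentence:

```latex
p_\theta(\mathbf{t} \mid \mathbf{w}) \;=\; \frac{1}{Z_\theta(\mathbf{w})}\,
\exp \sum_{i=1}^{n} \sum_{k} \theta_k\, f_k(t_{i-1}, t_i, \mathbf{w}, i),
\qquad
Z_\theta(\mathbf{w}) \;=\; \sum_{\mathbf{t}'} \exp \sum_{i=1}^{n} \sum_{k} \theta_k\, f_k(t'_{i-1}, t'_i, \mathbf{w}, i).
```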

Page 55: Log-Linear Models in NLP

Simple CRFs as Graphs

[Figure: a chain-structured graph over the tagged sentence, drawn twice, once as a CRF and once as an HMM.]

PRP$ NN VBZ ADV
My cat begs silently

CRF: weights, added together. Compare with an HMM: log-probs, added together.

Page 56: Log-Linear Models in NLP

What can CRFs do that HMMs can’t?

PRP$ NN VBZ ADV
My cat begs silently

Features that look inside the word, e.g. ADV & ends in -ly, VBZ & ends in -s.

Page 57: Log-Linear Models in NLP

An Algorithmic Connection

What is the partition function?

Total weight of all paths.

Page 58: Log-Linear Models in NLP

CRF weight training

• Maximize log-likelihood. The partition function is the total weight of all paths, computed by the forward algorithm (an illustrative sketch follows below).
• Gradient: requires expected feature values, computed by the forward-backward algorithm.
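As an illustrative sketch (not from the slides) of the first of these computations, the forward algorithm below returns log Z(w) for a linear-chain CRF; the array layout (`log_potentials[i, s, t]` holding θ·f for moving from tag s at position i-1 to tag t at position i, with row 0 encoding the start transition) is my own convention:

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log of a sum of exponentials along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def crf_log_partition(log_potentials):
    """log Z(w) for a linear-chain CRF, via the forward algorithm.

    log_potentials: (n, T, T) array; entry [i, s, t] is theta . f for tagging
    position i with t when position i-1 has tag s.  Only row [0, 0, :] is used
    at position 0 (a single start state).
    """
    n, T, _ = log_potentials.shape
    alpha = log_potentials[0, 0, :]                       # forward log-weights after position 0
    for i in range(1, n):
        # alpha_new[t] = log sum_s exp(alpha[s] + score of s -> t at position i)
        alpha = logsumexp(alpha[:, None] + log_potentials[i], axis=0)
    return logsumexp(alpha, axis=0)                       # total log-weight of all tag paths

# Tiny usage example: 4 words, 3 tags, random scores.
rng = np.random.default_rng(0)
print(crf_log_partition(rng.normal(size=(4, 3, 3))))
```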

Page 59: Log-Linear Models in NLP

Forward, Backward, and Expectations

• fk is the total number of firings; each firing happens at some position.
• The Markovian property lets the expectation decompose over positions.
• Each position’s term combines a forward weight and a backward weight.

Page 60: Log-Linear Models in NLP

Forward, Backward, and Expectations

• forward weight × backward weight
• the forward weight to the final state = the weight of all paths

Page 61: Log-Linear Models in NLP

Forward-Backward’s Clients

Training a CRF: supervised (labeled data); concave; converges to the global max; max p(y | x) (conditional training).
Baum-Welch: unsupervised; bumpy; converges to a local max; max p(x) (y unknown).

Page 62: Log-Linear Models in NLP

A Glitch

• Suppose we notice that -ly words are always adverbs.
• Call this feature 7.

In the training data, -ly words are all ADV, so the observed value of feature 7 is maximal. The model’s expectation can’t exceed that maximum (it can’t even reach it), so the gradient with respect to θ7 is always positive, and θ7 grows without bound.

Page 63: Log-Linear Models in NLP

The Dark Side of Log-Linear Models

Page 64: Log-Linear Models in NLP

Outline

• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi’s tagger
• Conditional random fields
• Smoothing
• Feature Selection

Page 65: Log-Linear Models in NLP

Regularization

• θs shouldn’t have huge magnitudes.
• The model must generalize to test data.
• Example: quadratic penalty (written out below).
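The penalized objective was an image on the slide; with a quadratic penalty it has the form

```latex
\max_{\theta} \;\; \sum_{j} \log p_\theta(y_j \mid x_j) \;-\; \sum_{k} \frac{\theta_k^2}{2\sigma^2},
```

where σ² controls how strongly large weights are penalized.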

Page 66: Log-Linear Models in NLP

Bayesian Regularization: Maximum A Posteriori Estimation

Page 67: Log-Linear Models in NLP

Independent Gaussians Prior (Chen and Rosenfeld, 2000)

• Independence across the parameters.
• Each Gaussian has mean 0 and identical variance.
• The resulting log-prior is exactly a quadratic penalty (spelled out below)!
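The derivation was an image on the slide; spelled out, MAP estimation with this prior is

```latex
\max_{\theta} \;\; \sum_{j} \log p_\theta(y_j \mid x_j) \;+\; \sum_{k} \log \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\theta_k^2 / 2\sigma^2}
\;=\;
\max_{\theta} \;\; \sum_{j} \log p_\theta(y_j \mid x_j) \;-\; \sum_{k} \frac{\theta_k^2}{2\sigma^2} \;+\; \text{const},
```

which is exactly the quadratic penalty of the previous slide.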

Page 68: Log-Linear Models in NLP

Alternatives

• Different variances for different parameters.
• Laplacian prior (1-norm): not differentiable at zero.
• Exponential prior (Goodman, 2004): requires all θk ≥ 0.
• Relax the constraints (Kazama & Tsujii, 2003).

Page 69: Log-Linear Models in NLP

Effect of the penalty

[Figure: plot over θk comparing the unsmoothed objective with Goodman’s smoothing.]

Page 70: Log-Linear Models in NLP

Kazama & Tsujii’s box constraints

The primal Max Ent problem:
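The problem statement was an image on the slide; roughly, Kazama & Tsujii relax each equality constraint of the max-ent problem to a pair of inequalities (the widths A_k and B_k below are my labels, not necessarily the paper’s notation):

```latex
\max_{p} \;\; H(p) \qquad \text{subject to} \qquad
-B_k \;\le\; \sum_x \tilde{p}(x)\, f_k(x) - \sum_x p(x)\, f_k(x) \;\le\; A_k
\quad \text{for each feature } k,
```

with A_k, B_k ≥ 0; features whose constraints are not tight end up with weight exactly zero, which is the source of the sparsity discussed on the next slides.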

Page 71: Log-Linear Models in NLP

Sparsity

• Fewer features → better generalization

• E.g., support vector machines

• Kazama & Tsujii’s prior, and Goodman’s, give sparsity.

Page 72: Log-Linear Models in NLP

Sparsity

[Figure: the penalty as a function of θk for Gaussian smoothing, Goodman’s smoothing, and Kazama & Tsujii’s smoothing. The Gaussian penalty has gradient 0 at θk = 0; the other two have a cusp there (the function is not differentiable), which is what drives parameters to exactly zero.]

Page 73: Log-Linear Models in NLP

Outline

• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi’s tagger
• Conditional random fields
• Smoothing
• Feature Selection

Page 74: Log-Linear Models in NLP

Feature Selection

• Sparsity from priors is one way to pick the features. (Maybe not a good way.)

• Della Pietra, Della Pietra, and Lafferty (1997) gave another way.

Page 75: Log-Linear Models in NLP

Back to the original example.

Page 76: Log-Linear Models in NLP

Nine features.

• f1, ..., f8: each is 1 for one particular configuration shown in the figure, 0 otherwise.
• f9 = 1 unless some other feature fires; θ9 << 0.

θi = log counti

What’s wrong here?

Page 77: Log-Linear Models in NLP

The Della Pietras’ & Lafferty’s Algorithm

1. Start out with no features.
2. Consider a set of candidates:
   • atomic features
   • current features conjoined with atomic features.
3. Pick the candidate g with the greatest gain.
4. Add g to the model.
5. Retrain all parameters.
6. Go to 2.

(A toy sketch of this loop follows.)
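A toy sketch of the greedy loop (my illustration, not the paper’s algorithm verbatim): the candidate pool is a fixed feature matrix, and the gain of a candidate is approximated by refitting the enlarged model with `fit_loglinear` from the earlier sketch and comparing log-likelihoods, rather than by the cheaper one-dimensional gain computation the paper uses:

```python
import numpy as np

def log_likelihood(F, counts, theta):
    """Average log-likelihood of the data under p(x) proportional to exp(theta . F[x])."""
    scores = F @ theta
    log_z = scores.max() + np.log(np.exp(scores - scores.max()).sum())
    return counts @ (scores - log_z) / counts.sum()

def select_features(candidates, counts, num_to_add):
    """Greedy selection in the spirit of Della Pietra, Della Pietra, and Lafferty (1997).

    candidates: (num_outcomes, num_candidates) matrix of candidate feature values
    counts:     (num_outcomes,) observed counts
    """
    selected = []
    for _ in range(num_to_add):
        best_ll, best_j = -np.inf, None
        for j in range(candidates.shape[1]):
            if j in selected:
                continue
            F_try = candidates[:, selected + [j]]
            theta_try, _ = fit_loglinear(F_try, counts)   # retrains all parameters
            ll = log_likelihood(F_try, counts, theta_try)
            # Comparing fitted log-likelihoods ranks candidates the same way as
            # comparing gains, since the current model's likelihood is a constant.
            if ll > best_ll:
                best_ll, best_j = ll, j
        selected.append(best_j)                           # add g to the model
    return selected
```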

Page 78: Log-Linear Models in NLP

Feature Induction: Example

PRP$ NN VBZ ADV
My cat begs silently

Atomic features: the individual tags and words above.
Selected features: [highlighted in the figure].
Other candidates: conjunctions such as NN VBZ, PRP$ NN, and NN with “cat”.

Page 79: Log-Linear Models in NLP

Outline

• Maximum Entropy principle
• Log-linear models
• Conditional modeling for classification
• Ratnaparkhi’s tagger
• Conditional random fields
• Smoothing
• Feature Selection

Page 80: Log-Linear Models in NLP

Conclusions

Probabilistic models: robustness, data-oriented, mathematically understood.
Hacks: explanatory power, exploit the expert’s choice of features, (can be) more data-oriented.

Log-linear models: the math is beautiful and easy to implement. You pick the features; the rest is just math!

Page 81: Log-Linear Models in NLP

Thank you!