
Page 1: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Log-Linear Models with Structured Outputs

Natural Language Processing, CS 6120—Spring 2013

Northeastern University

David Smith (some slides from Andrew McCallum)

1

Page 2: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Overview

• Sequence labeling task (cf. POS tagging)

• Independent classifiers

• HMMs

• (Conditional) Maximum Entropy Markov Models

• Conditional Random Fields

• Beyond Sequence Labeling

2

Page 3: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Sequence Labeling

• Inputs: x = (x1, …, xn)

• Labels: y = (y1, …, yn)

• Typical goal: Given x, predict y

• Example sequence labeling tasks

– Part-of-speech tagging

– Named-entity recognition (NER)

• Label people, places, organizations

3

Page 4: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER Example:

4

Page 5: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

First Solution: Maximum Entropy Classifier

• Conditional model p(y | x).

– Do not waste effort modeling p(x), since x is given at test time anyway.

– Allows more complicated input features, since we do not need to model dependencies between them.

• Feature functions f(x, y):

– f1(x, y) = { word is Boston & y = Location }

– f2(x, y) = { first letter capitalized & y = Name }

– f3(x, y) = { x is an HTML link & y = Location }
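To make the conditional model concrete, here is a minimal sketch (not from the slides) of a MaxEnt classifier with hand-written binary feature functions like the ones above; the label set, feature names, and weights are all made up for illustration.

```python
import math

LABELS = ["Location", "Name", "Other"]          # hypothetical label set

def features(x, y):
    """Binary feature functions f(x, y), mirroring f1-f3 above."""
    feats = []
    if x["word"] == "Boston" and y == "Location":
        feats.append("word=Boston&y=Location")
    if x["word"][0].isupper() and y == "Name":
        feats.append("capitalized&y=Name")
    if x.get("is_html_link") and y == "Location":
        feats.append("html_link&y=Location")
    return feats

def p_y_given_x(x, weights):
    """p(y | x) = exp(sum_k w_k f_k(x, y)) / Z(x); we never model p(x)."""
    scores = {y: math.exp(sum(weights.get(f, 0.0) for f in features(x, y))) for y in LABELS}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

w = {"word=Boston&y=Location": 2.0, "capitalized&y=Name": 0.5}   # toy weights
print(p_y_given_x({"word": "Boston", "is_html_link": False}, w))
```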

5

Page 6: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

First Solution: MaxEnt Classifier

• How should we choose a classifier?

• Principle of maximum entropy

– We want a classifier that:

• Matches feature constraints from training data.

• Predictions maximize entropy.

• There is a unique, exponential-family distribution that meets these criteria.

6

Page 7: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

First Solution: MaxEnt Classifier

• Problem with using a maximum entropy classifier for sequence labeling:

• It makes decisions at each position independently!

7

Page 8: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Second Solution: HMM

• Defines a generative process.

• Can be viewed as a weighted finite-state machine.

P(y, x) = ∏_t P(y_t | y_{t-1}) P(x_t | y_t)
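A minimal sketch (illustrative, not the deck's code) of that factorization: the joint probability multiplies a transition term and an emission term at each position. The tag set, words, and probabilities are made up.

```python
import math

trans = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.7, ("NN", "VB"): 0.5}    # P(y_t | y_{t-1})
emit = {("DT", "the"): 0.5, ("NN", "dog"): 0.1, ("VB", "barks"): 0.2}  # P(x_t | y_t)

def log_joint(tags, words):
    """log P(y, x) = sum_t [ log P(y_t | y_{t-1}) + log P(x_t | y_t) ]."""
    lp, prev = 0.0, "<s>"
    for y, x in zip(tags, words):
        lp += math.log(trans[(prev, y)]) + math.log(emit[(y, x)])
        prev = y
    return lp

print(log_joint(["DT", "NN", "VB"], ["the", "dog", "barks"]))
```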

8

Page 9: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Second Solution: HMM

• How can we represent multiple features in an HMM?

– Treat them as conditionally independent given the class label?

• The example features we talked about are not independent.

– Try to model a more complex generative process of the input features?

• We may lose tractability (i.e., lose dynamic programming for exact inference).

9

Page 10: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Second Solution: HMM

• Let’s use a conditional model instead.

10

Page 11: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Third Solution: MEMM

• Use a series of maximum entropy classifiers that know the previous label.

• Define a Viterbi algorithm for inference.

P(y | x) = ∏_t P_{y_{t-1}}(y_t | x)
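A minimal sketch (illustrative) of MEMM decoding: each factor P_{y_{t-1}}(y_t | x) is a locally normalized log-linear model, and Viterbi keeps the best-scoring label sequence ending in each label. Feature names and weights are made up.

```python
import math

TAGS = ["B", "I", "O"]

def local_log_prob(prev_tag, tag, x, t, weights):
    """log P_{prev_tag}(tag | x): a per-state, locally normalized log-linear model."""
    def score(y):
        feats = [f"word={x[t]}&y={y}", f"prev={prev_tag}&y={y}"]
        return sum(weights.get(f, 0.0) for f in feats)
    log_z = math.log(sum(math.exp(score(y)) for y in TAGS))
    return score(tag) - log_z

def viterbi(x, weights):
    """argmax_y of prod_t P_{y_{t-1}}(y_t | x)."""
    best = {y: (local_log_prob("<s>", y, x, 0, weights), [y]) for y in TAGS}
    for t in range(1, len(x)):
        best = {y: max((lp + local_log_prob(py, y, x, t, weights), path + [y])
                       for py, (lp, path) in best.items())
                for y in TAGS}
    return max(best.values())[1]

print(viterbi(["Red", "Sea", "trade"], {"word=Red&y=B": 1.0, "prev=B&y=I": 1.0}))
```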

11

Page 12: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Third Solution: MEMM

• Combines the advantages of maximum entropy and HMM!

• But there is a problem…

12

Page 13: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Problem with MEMMs: Label Bias

• In some state-space configurations, MEMMs essentially completely ignore the inputs.

• Example (ON BOARD).

• This is not a problem for HMMs, because the input sequence is generated by the model.

…maximum entropy framework. Previously published experimental results show MEMMs increasing recall and doubling precision relative to HMMs in a FAQ segmentation task.

MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Markov models (Bottou, 1991), share a weakness we call here the label bias problem: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. In probabilistic terms, transition scores are the conditional probabilities of possible next states given the current state and the observation sequence. This per-state normalization of transition scores implies a “conservation of score mass” (Bottou, 1991) whereby all the mass that arrives at a state must be distributed among the possible successor states. An observation can affect which destination states get the mass, but not how much total mass to pass on. This causes a bias toward states with fewer outgoing transitions. In the extreme case, a state with a single outgoing transition effectively ignores the observation. In those cases, unlike in HMMs, Viterbi decoding cannot downgrade a branch based on observations after the branch point, and models with state-transition structures that have sparsely connected chains of states are not properly handled. The Markovian assumptions in MEMMs and similar state-conditional models insulate decisions at one state from future decisions in a way that does not match the actual dependencies between consecutive states.

This paper introduces conditional random fields (CRFs), a sequence modeling framework that has all the advantages of MEMMs but also solves the label bias problem in a principled way. The critical difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Therefore, the weights of different features at different states can be traded off against each other.

We can also think of a CRF as a finite state model with unnormalized transition probabilities. However, unlike some other weighted finite-state approaches (LeCun et al., 1998), CRFs assign a well-defined probability distribution over possible labelings, trained by maximum likelihood or MAP estimation. Furthermore, the loss function is convex,² guaranteeing convergence to the global optimum. CRFs also generalize easily to analogues of stochastic context-free grammars that would be useful in such problems as RNA secondary structure prediction and natural language processing.

²In the case of fully observable states, as we are discussing here; if several states have the same label, the usual local maxima of Baum-Welch arise.

Figure 1. Label bias example, after (Bottou, 1991). For conciseness, we place observation-label pairs o:l on transitions rather than states; the symbol ‘_’ represents the null output label.

We present the model, describe two training procedures and sketch a proof of convergence. We also give experimental results on synthetic data showing that CRFs solve the classical version of the label bias problem, and, more significantly, that CRFs perform better than HMMs and MEMMs when the true data distribution has higher-order dependencies than the model, as is often the case in practice. Finally, we confirm these results as well as the claimed advantages of conditional models by evaluating HMMs, MEMMs and CRFs with identical state structure on a part-of-speech tagging task.

2. The Label Bias Problem

Classical probabilistic automata (Paz, 1971), discriminative Markov models (Bottou, 1991), maximum entropy taggers (Ratnaparkhi, 1996), and MEMMs, as well as non-probabilistic sequence tagging and segmentation models with independently trained next-state classifiers (Punyakanok & Roth, 2001) are all potential victims of the label bias problem.

For example, Figure 1 represents a simple finite-state model designed to distinguish between the two words rib and rob. Suppose that the observation sequence is r i b. In the first time step, r matches both transitions from the start state, so the probability mass gets distributed roughly equally among those two transitions. Next we observe i. Both states 1 and 4 have only one outgoing transition. State 1 has seen this observation often in training, state 4 has almost never seen this observation; but like state 1, state 4 has no choice but to pass all its mass to its single outgoing transition, since it is not generating the observation, only conditioning on it. Thus, states with a single outgoing transition effectively ignore their observations. More generally, states with low-entropy next-state distributions will take little notice of observations. Returning to the example, the top path and the bottom path will be about equally likely, independently of the observation sequence. If one of the two words is slightly more common in the training set, the transitions out of the start state will slightly prefer its corresponding transition, and that word’s state sequence will always win. This behavior is demonstrated experimentally in Section 5.

Leon Bottou (1991) discussed two solutions for the label bias problem. One is to change the state-transition structure…

13

Page 14: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Fourth Solution:

Conditional Random Field

• Conditionally-trained, undirected graphical model.

• For a standard linear-chain structure:

P(y | x) = (1/Z(x)) ∏_t Φ(y_t, y_{t-1}, x)

Φ(y_t, y_{t-1}, x) = exp( Σ_k λ_k f_k(y_t, y_{t-1}, x) )
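A minimal sketch (illustrative) of scoring one labeling with those potentials: the unnormalized score multiplies Φ(y_t, y_{t-1}, x) over positions, and each log Φ is just a dot product of weights with features. Dividing by Z(x) would give p(y | x); feature names and weights are made up.

```python
def log_potential(y_t, y_prev, x, t, lam):
    """log Φ(y_t, y_{t-1}, x) = sum_k λ_k f_k(y_t, y_{t-1}, x)."""
    feats = [f"trans={y_prev}->{y_t}", f"emit={x[t]}/{y_t}"]
    return sum(lam.get(f, 0.0) for f in feats)

def unnormalized_log_score(y, x, lam):
    """log of prod_t Φ(y_t, y_{t-1}, x); subtract log Z(x) to get log p(y | x)."""
    return sum(log_potential(y[t], y[t - 1] if t > 0 else "<s>", x, t, lam)
               for t in range(len(x)))

lam = {"trans=B-L->I-L": 1.5, "emit=Sea/I-L": 2.0}
print(unnormalized_log_score(["O", "B-L", "I-L"], ["the", "Red", "Sea"], lam))
```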

14

Page 15: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Fourth Solution: CRF

• Have the advantages of MEMMs, but avoid the label bias problem.

• CRFs are globally normalized, whereas MEMMs are locally normalized.

• Widely used and applied. CRFs give state-of-the-art results in many domains.

15

Page 16: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Fourth Solution: CRF

• Have the advantages of MEMMs, but avoid the label bias problem.

• CRFs are globally normalized, whereas MEMMs are locally normalized.

• Widely used and applied. CRFs give state-of-the-art results in many domains.

Remember, Z is the normalization constant. How do we compute it?
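One standard answer, sketched here with the same made-up potentials as above (this is the usual forward algorithm, not code from the slides): Z(x) sums over all label sequences, but dynamic programming over positions and labels computes it in O(n · |tags|²) instead of enumerating them.

```python
import math

TAGS = ["B-L", "I-L", "O"]

def log_potential(y_t, y_prev, x, t, lam):
    feats = [f"trans={y_prev}->{y_t}", f"emit={x[t]}/{y_t}"]
    return sum(lam.get(f, 0.0) for f in feats)

def log_Z(x, lam):
    """Forward algorithm: alpha[y] = log-sum of all prefixes ending in label y."""
    alpha = {y: log_potential(y, "<s>", x, 0, lam) for y in TAGS}
    for t in range(1, len(x)):
        alpha = {y: math.log(sum(math.exp(alpha[yp] + log_potential(y, yp, x, t, lam))
                                 for yp in TAGS))
                 for y in TAGS}
    return math.log(sum(math.exp(a) for a in alpha.values()))

print(log_Z(["the", "Red", "Sea"], {"emit=Sea/I-L": 2.0}))
```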

15

Page 17: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

CRF Applications

• Part-of-speech tagging

• Named entity recognition

• Document layout (e.g. table) classification

• Gene prediction

• Chinese word segmentation

• Morphological disambiguation

• Citation parsing

• Etc., etc.

16

Page 18: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

17

The Phoenicians came from the Red Sea

17

Page 19: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

17

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

17

Page 20: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

17

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

Capitalized word

17

Page 21: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

17

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

Capitalized word

Ends in “s”

17

Page 22: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

17

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

Capitalized word

Ends in “s” Ends in “ans”

17

Page 23: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

17

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

Capitalized word

Ends in “s” Ends in “ans”

Previous word “the”

17

Page 24: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

17

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

Capitalized word

Ends in “s” Ends in “ans”

Previous word “the”

“Phoenicians” in gazetteer

17

Page 25: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

18

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

18

Page 26: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

18

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

Not capitalized

18

Page 27: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

18

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

Not capitalized

Tagged as “VB”

18

Page 28: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

18

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

Not capitalized

Tagged as “VB”

B-E to right

18

Page 29: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

19

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

19

Page 30: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

19

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

Word “sea”

19

Page 31: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

19

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

Word “sea”

Word “sea” preceded by “the ADJ”

19

Page 32: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

NER as Sequence Tagging

19

The Phoenicians came from the Red Sea

O B-E O O O B-L I-L

Word “sea”

Word “sea” preceded by “the ADJ”

Hard constraint: I-L must follow B-L or I-L
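Hard constraints like this are usually enforced in the decoder: a transition that violates them gets a score of minus infinity (equivalently, the Viterbi lattice simply omits that edge). A minimal sketch (illustrative) of the check:

```python
def allowed_transition(prev_tag, tag):
    """Hard constraint: I-X may only follow B-X or I-X of the same entity type X."""
    if tag.startswith("I-"):
        entity_type = tag[2:]
        return prev_tag in (f"B-{entity_type}", f"I-{entity_type}")
    return True

assert allowed_transition("B-L", "I-L")       # "Red Sea": I-L after B-L is fine
assert not allowed_transition("O", "I-L")     # an I-L cannot start an entity
```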

19

Page 33: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Overview

• What computations do we need?

• Smoothing log-linear models

• MEMMs vs. CRFs again

• Action-based parsing and dependency parsing

20

Page 34: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Recipe for Conditional Training of p(y | x)

1. Gather constraints/features from training data.

2. Initialize all parameters to zero.

3. Classify training data with current parameters; calculate expectations.

4. Gradient is: observed feature counts minus expected feature counts (under the current model).

5. Take a step in the direction of the gradient.

6. Repeat from step 3 until convergence.

21

Page 35: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Recipe for Conditional Training of p(y | x)

1. Gather constraints/features from training data.

2. Initialize all parameters to zero.

3. Classify training data with current parameters; calculate expectations.

4. Gradient is: observed feature counts minus expected feature counts (under the current model).

5. Take a step in the direction of the gradient.

6. Repeat from step 3 until convergence.

Where have we seen expected counts before?

21

Page 36: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Recipe for Conditional Training of p(y | x)

1. Gather constraints/features from training data.

2. Initialize all parameters to zero.

3. Classify training data with current parameters; calculate expectations.

4. Gradient is: observed feature counts minus expected feature counts (under the current model).

5. Take a step in the direction of the gradient.

6. Repeat from step 3 until convergence.

Where have we seen expected counts before? EM!
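A minimal sketch (illustrative, not the deck's code) of one pass of this recipe for a conditional MaxEnt classifier. The gradient for each weight is the observed count of its feature minus the count expected under the current model, which is exactly where the EM-style expected counts reappear. Labels, features, and data are made up.

```python
import math
from collections import defaultdict

LABELS = ["spam", "ham"]

def features(x, y):
    return [f"{tok}&y={y}" for tok in x]

def p_y_given_x(x, w):
    scores = {y: math.exp(sum(w[f] for f in features(x, y))) for y in LABELS}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

def gradient(data, w):
    """Observed feature counts minus expected feature counts under the current model."""
    g = defaultdict(float)
    for x, y in data:
        for f in features(x, y):              # observed counts
            g[f] += 1.0
        probs = p_y_given_x(x, w)
        for y2 in LABELS:                     # expected counts
            for f in features(x, y2):
                g[f] -= probs[y2]
    return g

data = [(["cheap", "pills"], "spam"), (["meeting", "today"], "ham")]
w = defaultdict(float)                        # step 2: start at zero
for step in range(50):                        # steps 3-6: repeated gradient steps
    for f, v in gradient(data, w).items():
        w[f] += 0.1 * v
```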

21

Page 37: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Gradient-Based Training

• λ := λ + rate * Gradient(F)

• After all training examples? (batch)

• After every example? (on-line)

• Use second derivative for faster learning?

• A big field: numerical optimization

22

Page 38: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Overfitting

• If we have too many features, we can choose weights to model the training data perfectly.

• If we have a feature that only appears in spam training, not ham training, it will get weight ∞ to maximize p(spam | feature) at 1.

• These behaviors

• Overfit the training data

• Will probably do poorly on test data

23

Page 39: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Solutions to Overfitting

• Throw out rare features.

• Require every feature to occur > 4 times, and > 0 times with ling, and > 0 times with spam.

• Only keep, e.g., 1000 features.

• Add one at a time, always greedily picking the one that most improves performance on held-out data.

• Smooth the observed feature counts.

• Smooth the weights by using a prior.

• max_λ p(λ | data) = max_λ p(λ, data) = max_λ p(λ) p(data | λ)

• decree p(λ) to be high when most weights close to 0

24

Page 40: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Smoothing with Priors

• What if we had a prior expectation that parameter values wouldn’t be very large?

• We could then balance evidence suggesting large (or infinite) parameters against our prior expectation.

• The evidence would never totally defeat the prior, and parameters would be smoothed (and kept finite)

• We can do this explicitly by changing the optimization objective to maximum posterior likelihood:

log P(y, λ | x) = log P(λ) + log P(y | x, λ)

(posterior = prior + likelihood)
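With a zero-mean Gaussian prior, log P(λ) = −Σ_k λ_k² / (2σ²) + constant, so the only change to the training recipe above is subtracting λ_k / σ² from each component of the gradient. A minimal sketch (illustrative):

```python
def add_gaussian_prior_gradient(grad, weights, sigma2=1.0):
    """Gradient of the log-prior: d/dλ_k [ -λ_k²/(2σ²) ] = -λ_k/σ²."""
    return {f: g - weights.get(f, 0.0) / sigma2 for f, g in grad.items()}

# A large weight is pulled back toward zero, so evidence never drives it to infinity.
print(add_gaussian_prior_gradient({"word=Boston&y=Location": 0.4},
                                  {"word=Boston&y=Location": 3.0}))
```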

25

Page 41: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)


26

Page 42: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Parsing as Structured Prediction

27

Page 43: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Shift-reduce parsing

Stack               Input remaining     Action
()                  Book that flight    shift
(Book)              that flight         reduce, Verb → book (choice #1 of 2)
(Verb)              that flight         shift
(Verb that)         flight              reduce, Det → that
(Verb Det)          flight              shift
(Verb Det flight)                       reduce, Noun → flight
(Verb Det Noun)                         reduce, NOM → Noun
(Verb Det NOM)                          reduce, NP → Det NOM
(Verb NP)                               reduce, VP → Verb NP
(Verb)                                  reduce, S → V
(S)                                     SUCCESS!

Ambiguity may lead to the need for backtracking.

28

Page 44: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Shift-reduce parsing


Ambiguity may lead to the need for backtracking.

28

Page 45: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Shift-reduce parsing


Ambiguity may lead to the need for backtracking.

Train log-linear model of p(action | context)

28

Page 46: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Shift-reduce parsing


Ambiguity may lead to the need for backtracking.

Train log-linear model of p(action | context)

Compare to an MEMM
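A minimal sketch (illustrative, not the deck's code) of this idea: a greedy shift-reduce loop where each action is chosen by a locally normalized log-linear model p(action | context), which is what makes it comparable to an MEMM. The context features, weights, and the toy "reduce" operation are all made up.

```python
import math

ACTIONS = ["shift", "reduce"]

def context_features(stack, buffer):
    top = stack[-1] if stack else "<empty>"
    nxt = buffer[0] if buffer else "<end>"
    return [f"top={top}", f"next={nxt}"]

def p_action(stack, buffer, weights):
    """Log-linear model p(action | context), normalized over the action set."""
    feats = context_features(stack, buffer)
    scores = {a: math.exp(sum(weights.get((f, a), 0.0) for f in feats)) for a in ACTIONS}
    Z = sum(scores.values())
    return {a: s / Z for a, s in scores.items()}

def greedy_parse(words, weights):
    stack, buffer = [], list(words)
    while buffer or len(stack) > 1:
        probs = p_action(stack, buffer, weights)
        best = max(probs, key=probs.get)
        if best == "shift" and buffer:
            stack.append(buffer.pop(0))
        elif len(stack) >= 2:
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))   # toy "reduce": real parsers apply a grammar rule
        else:
            stack.append(buffer.pop(0))   # cannot reduce yet, so shift
    return stack[0]

print(greedy_parse(["Book", "that", "flight"], {}))
```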

28

Page 47: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

29

Structured Log-Linear Models

29

Page 48: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

29

Structured Log-Linear Models

• Linear model for scoring structures

score(out, in) = θ · features(out, in)

29

Page 49: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

29

Structured Log-Linear Models

• Linear model for scoring structures

• Get a probability distribution by normalizing

score(out, in) = θ · features(out, in)

p(out | in) = (1/Z) e^score(out, in)

Z = Σ_{out′ ∈ GEN(in)} e^score(out′, in)
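A minimal sketch (illustrative) of this normalization when GEN(in) is small enough to enumerate explicitly; for real structured outputs that sum is the expensive part and is computed with dynamic programming or approximations, as the later slides note. All names and numbers are made up.

```python
import math

def score(out, inp, theta):
    """score(out, in) = θ · features(out, in), with made-up indicator features."""
    return sum(theta.get(f"{o}|{w}", 0.0) for o, w in zip(out, inp))

def p_out_given_in(out, inp, gen, theta):
    """p(out | in) = e^score(out, in) / Σ_{out' ∈ GEN(in)} e^score(out', in)."""
    Z = sum(math.exp(score(o, inp, theta)) for o in gen)
    return math.exp(score(out, inp, theta)) / Z

inp = ["Red", "Sea"]
gen = [["B-L", "I-L"], ["B-L", "O"], ["O", "O"]]      # a tiny, explicit GEN(in)
theta = {"B-L|Red": 1.0, "I-L|Sea": 1.5}
print(p_out_given_in(["B-L", "I-L"], inp, gen, theta))
```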

29

Page 50: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

29

Structured Log-Linear Models

• Linear model for scoring structures

• Get a probability distribution by normalizing
✤ Viz. logistic regression, Markov random fields, undirected graphical models

score(out, in) = θ · features(out, in)

p(out | in) = (1/Z) e^score(out, in)

Z = Σ_{out′ ∈ GEN(in)} e^score(out′, in)

29

Page 51: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

29

Structured Log-Linear Models

• Linear model for scoring structures

• Get a probability distribution by normalizing
✤ Viz. logistic regression, Markov random fields, undirected graphical models

score(out, in) = θ · features(out, in)

p(out | in) = (1/Z) e^score(out, in)

Z = Σ_{out′ ∈ GEN(in)} e^score(out′, in)   ← usually the bottleneck in NLP

29

Page 52: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

29

Structured Log-Linear Models

• Linear model for scoring structures

• Get a probability distribution by normalizing
✤ Viz. logistic regression, Markov random fields, undirected graphical models

• Inference: sampling, variational methods, dynamic programming, local search, ...

score(out, in) = θ · features(out, in)

p(out | in) = (1/Z) e^score(out, in)

Z = Σ_{out′ ∈ GEN(in)} e^score(out′, in)   ← usually the bottleneck in NLP

29

Page 53: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

29

Structured Log-Linear Models

• Linear model for scoring structures

• Get a probability distribution by normalizing
✤ Viz. logistic regression, Markov random fields, undirected graphical models

• Inference: sampling, variational methods, dynamic programming, local search, ...

• Training: maximum likelihood, minimum risk, etc.

score(out, in) = θ · features(out, in)

p(out | in) = (1/Z) e^score(out, in)

Z = Σ_{out′ ∈ GEN(in)} e^score(out′, in)   ← usually the bottleneck in NLP

29

Page 54: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

30

• Several layers of linguistic structure

• Unknown correspondences

• Naturally handled by probabilistic framework

• Several inference setups, for example:

With latent variables

Structured Log-Linear Models

p(out1 | in) = Σ_{out2, alignment} p(out1, out2, alignment | in)
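A minimal sketch (illustrative) of that marginalization: the probability of the output we care about sums the joint conditional model over all values of the unobserved second output and alignment. The toy joint table is made up.

```python
from itertools import product

joint = {("t1", "s1", "a1"): 0.3, ("t1", "s1", "a2"): 0.1,   # p(out1, out2, alignment | in)
         ("t1", "s2", "a1"): 0.2, ("t1", "s2", "a2"): 0.1}

def p_out1(out1, out2_values, alignments):
    """p(out1 | in) = Σ_{out2, alignment} p(out1, out2, alignment | in)."""
    return sum(joint.get((out1, o2, a), 0.0) for o2, a in product(out2_values, alignments))

print(p_out1("t1", ["s1", "s2"], ["a1", "a2"]))   # 0.7
```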

30

Page 55: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

30

• Several layers of linguistic structure

• Unknown correspondences

• Naturally handled by probabilistic framework

• Several inference setups, for example:

With latent variables

Another computational problem

Structured Log-Linear Models

p(out1 | in) = Σ_{out2, alignment} p(out1, out2, alignment | in)

30

Page 56: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

31

Edge-Factored Parsers

• No global features of a parse (McDonald et al. 2005)

• Each feature is attached to some edge

• MST or CKY-like DP for fast O(n²) or O(n³) parsing

Byl jasný studený dubnový den a hodiny odbíjely třináctou
“It was a bright cold day in April and the clocks were striking thirteen”

31

Page 57: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

32

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• Is this a good edge?

“It was a bright cold day in April and the clocks were striking thirteen”

32

Page 58: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

32

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• Is this a good edge?

yes, lots of positive features ...

“It was a bright cold day in April and the clocks were striking thirteen”

32

Page 59: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

33

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• Is this a good edge?

“It was a bright cold day in April and the clocks were striking thirteen”

33

Page 60: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

33

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• Is this a good edge?

jasný ← den (“bright day”)

“It was a bright cold day in April and the clocks were striking thirteen”

33

Page 61: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

34

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• Is this a good edge?

jasný ← den (“bright day”)

V A A A N J N V C

“It was a bright cold day in April and the clocks were striking thirteen”

34

Page 62: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

34

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• Is this a good edge?

jasný ← den (“bright day”)

jasný ← N (“bright NOUN”)

V A A A N J N V C

“It was a bright cold day in April and the clocks were striking thirteen”

34

Page 63: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

35

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• Is this a good edge?

jasný ← den (“bright day”)

jasný ← N (“bright NOUN”)

V A A A N J N V C

“It was a bright cold day in April and the clocks were striking thirteen”

35

Page 64: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

35

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• Is this a good edge?

jasný ← den (“bright day”)

jasný ← N (“bright NOUN”)

V A A A N J N V C

A ← N

“It was a bright cold day in April and the clocks were striking thirteen”

35

Page 65: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

36

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• Is this a good edge?

jasný ← den (“bright day”)

jasný ← N (“bright NOUN”)

V A A A N J N V C

A ← N

“It was a bright cold day in April and the clocks were striking thirteen”

36

Page 66: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

36

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• Is this a good edge?

jasný ← den (“bright day”)

jasný ← N (“bright NOUN”)

V A A A N J N V C

A ← N

A ← N preceding conjunction

“It was a bright cold day in April and the clocks were striking thirteen”

36

Page 67: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

37

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• How about this competing edge?

V A A A N J N V C

“It was a bright cold day in April and the clocks were striking thirteen”

37

Page 68: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

37

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• How about this competing edge?

V A A A N J N V C

not as good, lots of red ...

“It was a bright cold day in April and the clocks were striking thirteen”

37

Page 69: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

38

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• How about this competing edge?

V A A A N J N V C

“It was a bright cold day in April and the clocks were striking thirteen”

38

Page 70: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

38

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• How about this competing edge?

V A A A N J N V C

jasný ← hodiny (“bright clocks”)

“It was a bright cold day in April and the clocks were striking thirteen”

38

Page 71: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

38

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• How about this competing edge?

V A A A N J N V C

jasný ← hodiny (“bright clocks”)

... undertrained ...

“It was a bright cold day in April and the clocks were striking thirteen”

38

Page 72: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

39

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• How about this competing edge?

V A A A N J N V C

byl jasn stud dubn den a hodi odbí třin

jasný ← hodiny (“bright clocks”)

... undertrained ...

“It was a bright cold day in April and the clocks were striking thirteen”

39

Page 73: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

39

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• How about this competing edge?

V A A A N J N V C

byl jasn stud dubn den a hodi odbí třin

jasný ← hodiny (“bright clocks”)

... undertrained ...

jasn ← hodi (“bright clock,” stems only)

“It was a bright cold day in April and the clocks were striking thirteen”

39

Page 74: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

40

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• How about this competing edge?

V A A A N J N V C

jasn ← hodi (“bright clock,” stems only)

byl jasn stud dubn den a hodi odbí třin

jasný ← hodiny (“bright clocks”)

... undertrained ...

“It was a bright cold day in April and the clocks were striking thirteen”

40

Page 75: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

40

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• How about this competing edge?

V A A A N J N V C

jasn ← hodi (“bright clock,” stems only)

byl jasn stud dubn den a hodi odbí třin

A_singular ← N_plural

jasný ← hodiny (“bright clocks”)

... undertrained ...

“It was a bright cold day in April and the clocks were striking thirteen”

40

Page 76: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

41

jasný ← hodiny (“bright clocks”)

... undertrained ...

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• How about this competing edge?

V A A A N J N V C

jasn ← hodi (“bright clock,” stems only)

byl jasn stud dubn den a hodi odbí třin

A_singular ← N_plural

“It was a bright cold day in April and the clocks were striking thirteen”

41

Page 77: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

41

jasný ← hodiny (“bright clocks”)

... undertrained ...

Edge-Factored Parsers

Byl jasný studený dubnový den a hodiny odbíjely třináctou

• How about this competing edge?

V A A A N J N V C

jasn ← hodi (“bright clock,” stems only)

byl jasn stud dubn den a hodi odbí třin

A_singular ← N_plural

A ← N where N follows a conjunction

“It was a bright cold day in April and the clocks were striking thirteen”

41

Page 78: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

42

jasný

Edge-Factored Parsers

Byl studený dubnový den a hodiny odbíjely třináctou

V A A A N J N V C

byl jasn stud dubn den a hodi odbí třin

• Which edge is better?

• “bright day” or “bright clocks”?

“It was a bright cold day in April and the clocks were striking thirteen”

42

Page 79: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

43

jasný

Edge-Factored Parsers

Byl studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

V A A A N J N V C

byl jasn stud dubn den a hodi odbí třin

• Which edge is better?

• Score of an edge e = θ ⋅ features(e)

• Standard algos → valid parse with max total score
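A minimal sketch (illustrative, not McDonald et al.'s code) of the edge-factored idea: score every candidate head → dependent edge as θ · features(e); a maximum-spanning-tree algorithm (Chu-Liu/Edmonds) or an Eisner-style dynamic program then picks the valid parse with the highest total score. The features and weights are made up.

```python
def edge_features(head, dep, tags):
    """Hypothetical edge features: word pair and tag pair (cf. 'jasný ← den', 'A ← N')."""
    return [f"head_word={head}&dep_word={dep}",
            f"head_tag={tags[head]}&dep_tag={tags[dep]}"]

def edge_score(head, dep, tags, theta):
    """score(e) = θ · features(e)."""
    return sum(theta.get(f, 0.0) for f in edge_features(head, dep, tags))

words = ["jasný", "den", "hodiny"]
tags = {"jasný": "A", "den": "N", "hodiny": "N"}
theta = {"head_word=den&dep_word=jasný": 2.0, "head_tag=N&dep_tag=A": 0.5}

# Score matrix over all candidate edges; the tree-finding algorithm consumes this.
scores = {(h, d): edge_score(h, d, tags, theta)
          for h in words for d in words if h != d}
print(max(scores, key=scores.get))   # ('den', 'jasný'): "bright day" beats "bright clocks"
```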

43

Page 80: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

43

jasný

Edge-Factored Parsers

Byl studený dubnový den a hodiny odbíjely třináctou

“It was a bright cold day in April and the clocks were striking thirteen”

V A A A N J N V C

byl jasn stud dubn den a hodi odbí třin

• Which edge is better?

• Score of an edge e = θ ⋅ features(e)

• Standard algos → valid parse with max total score

our current weight vector

43

Page 81: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

44

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

Local factors in a graphical model

……

find preferred tags

44

Page 82: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

44

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

Local factors in a graphical model

……

find preferred tags

Observed input sentence (shaded)

44

Page 83: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

44

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

Local factors in a graphical model

……

find preferred tags

v v v

Possible tagging (i.e., assignment to remaining variables)

Observed input sentence (shaded)

44

Page 84: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

45

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

Possible tagging (i.e., assignment to remaining variables)
Another possible tagging

Observed input sentence (shaded)

45

Page 85: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

45

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

v a n

Possible tagging (i.e., assignment to remaining variables)
Another possible tagging

Observed input sentence (shaded)

45

Page 86: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

46

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

“Binary” factor that measures compatibility of 2 adjacent tags

46

Page 87: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

46

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

“Binary” factor that measures compatibility of 2 adjacent tags

Model reuses same parameters at this position

46

Page 88: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

47

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

v 0.2, n 0.2, a 0

47

Page 89: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

47

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

v 0.2, n 0.2, a 0

“Unary” factor evaluates this tag

47

Page 90: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

47

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

v 0.2, n 0.2, a 0

“Unary” factor evaluates this tag
Its values depend on corresponding word

47

Page 91: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

47

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

v 0.2, n 0.2, a 0

“Unary” factor evaluates this tag
Its values depend on corresponding word

can’t be adj

v 0.2, n 0.2, a 0

47

Page 92: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

48

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

v 0.2, n 0.2, a 0

“Unary” factor evaluates this tag
Its values depend on corresponding word

48

Page 93: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

48

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

v 0.2, n 0.2, a 0

“Unary” factor evaluates this tag
Its values depend on corresponding word
(could be made to depend on entire observed sentence)

48

Page 94: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

49

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

v 0.2, n 0.2, a 0

v 0.3, n 0.02, a 0

v 0.3, n 0, a 0.1

49

Page 95: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

49

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

v 0.2, n 0.2, a 0

“Unary” factor evaluates this tag
Different unary factor at each position

v 0.3, n 0.02, a 0

v 0.3, n 0, a 0.1

49

Page 96: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

50

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

Binary factor (adjacent tags):
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

Unary factor: v 0.3, n 0.02, a 0

Binary factor (adjacent tags):
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

Unary factor: v 0.3, n 0, a 0.1

Unary factor: v 0.2, n 0.2, a 0

50

Page 97: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

50

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

Binary factor (adjacent tags):
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

Unary factor: v 0.3, n 0.02, a 0

Binary factor (adjacent tags):
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

Unary factor: v 0.3, n 0, a 0.1

Unary factor: v 0.2, n 0.2, a 0

v a n

p(v a n) is proportional to the product of all factors’ values on v a n

50

Page 98: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

51

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

Binary factor (adjacent tags):
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

Unary factor: v 0.3, n 0.02, a 0

Binary factor (adjacent tags):
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

Unary factor: v 0.3, n 0, a 0.1

Unary factor: v 0.2, n 0.2, a 0

v a n

p(v a n) is proportional to the product of all factors’ values on v a n

51

Page 99: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

51

Local factors in a graphical model

• First, a familiar example
✤ Conditional Random Field (CRF) for POS tagging

……

find preferred tags

Binary factor (adjacent tags):
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

Unary factor: v 0.3, n 0.02, a 0

Binary factor (adjacent tags):
    v  n  a
v   0  2  1
n   2  1  0
a   0  3  1

Unary factor: v 0.3, n 0, a 0.1

Unary factor: v 0.2, n 0.2, a 0

v a n

p(v a n) is proportional to the product of all factors’ values on v a n
= … 1*3*0.3*0.1*0.2 …
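A minimal sketch (illustrative) of that computation: the unnormalized probability of a tag assignment is the product of every unary and binary factor value it touches, exactly the 1 * 3 * 0.3 * 0.1 * 0.2 pattern above.

```python
# Factor tables from the slide: binary[left_tag][right_tag], one unary table per position.
binary = {"v": {"v": 0, "n": 2, "a": 1},
          "n": {"v": 2, "n": 1, "a": 0},
          "a": {"v": 0, "n": 3, "a": 1}}
unary = [{"v": 0.3, "n": 0.02, "a": 0},   # "find"
         {"v": 0.3, "n": 0, "a": 0.1},    # "preferred"
         {"v": 0.2, "n": 0.2, "a": 0}]    # "tags"

def unnormalized_p(tags):
    """Product of all factor values on this assignment; divide by Z to get p(tags | x)."""
    p = 1.0
    for t in range(len(tags)):
        p *= unary[t][tags[t]]
        if t > 0:
            p *= binary[tags[t - 1]][tags[t]]
    return p

print(unnormalized_p(["v", "a", "n"]))   # 0.3 * 1 * 0.1 * 3 * 0.2 = 0.018
```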

51

Page 100: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

52

• First, a labeling example
✤ CRF for POS tagging

• Now let’s do dependency parsing!
✤ O(n²) boolean variables for the possible links

v a n

Graphical Models for Parsing

find preferred links ……

52

Page 101: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

52

• First, a labeling example
✤ CRF for POS tagging

• Now let’s do dependency parsing!
✤ O(n²) boolean variables for the possible links

v a n

Graphical Models for Parsing

find preferred links ……

52

Page 102: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

52

• First, a labeling example
✤ CRF for POS tagging

• Now let’s do dependency parsing!
✤ O(n²) boolean variables for the possible links

v a n

Graphical Models for Parsing

find preferred links ……

52

Page 103: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

53

• First, a labeling example
✤ CRF for POS tagging

• Now let’s do dependency parsing!
✤ O(n²) boolean variables for the possible links

v a n

Graphical Models for Parsing

find preferred links ……

Possible parse...

53

Page 104: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

53

• First, a labeling example
✤ CRF for POS tagging

• Now let’s do dependency parsing!
✤ O(n²) boolean variables for the possible links

v a n

Graphical Models for Parsing

find preferred links ……

tf

ff

ft

Possible parse...

... and variable

assignment

53

Page 105: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

54

• First, a labeling example
✤ CRF for POS tagging

• Now let’s do dependency parsing!
✤ O(n²) boolean variables for the possible links

v a n

Graphical Models for Parsing

find preferred links ……

Another parse...

54

Page 106: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

54

• First, a labeling example
✤ CRF for POS tagging

• Now let’s do dependency parsing!
✤ O(n²) boolean variables for the possible links

v a n

Graphical Models for Parsing

find preferred links ……

ff

tf

tf

Another parse...

... and variable

assignment

54

Page 107: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

55

• First, a labeling example
✤ CRF for POS tagging

• Now let’s do dependency parsing!
✤ O(n²) boolean variables for the possible links

v a n

Graphical Models for Parsing

find preferred links ……

ft

tf

tf

An illegal parse...

... with a cycle!

55

Page 108: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

56

• First, a labeling example
✤ CRF for POS tagging

• Now let’s do dependency parsing!
✤ O(n²) boolean variables for the possible links

v a n

Graphical Models for Parsing

find preferred links ……

ft

tf

tt

An illegal parse...

... with a cycle and multiple parents!

56

Page 109: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

57

• What factors determine parse probability?

Local Factors for Parsing

find preferred links ……

57

Page 110: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

57

• What factors determine parse probability?
✤ Unary factors to score each link in isolation

Local Factors for Parsing

find preferred links ……

t 2, f 1

57

Page 111: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

57

• What factors determine parse probability?
✤ Unary factors to score each link in isolation

Local Factors for Parsing

find preferred links ……

t 2, f 1

t 1, f 2

t 1, f 2

t 1, f 6

t 1, f 3

t 1, f 8

57

Page 112: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

57

• What factors determine parse probability?
✤ Unary factors to score each link in isolation

• But what if the best assignment isn’t a tree?

Local Factors for Parsing

find preferred links ……

t 2f 1

t 1f 2

t 1f 2

t 1f 6

t 1f 3

t 1f 8

57
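To make the “best assignment isn’t a tree” point concrete, here is a small sketch (my own illustration, with made-up link scores rather than the numbers on the slide): score every link in isolation, switch each one on or off independently, and then check the single-head property that a dependency tree needs.

# Minimal sketch: unary (per-link) factors only, decided independently.
# The scores below are hypothetical; links are (head, child) pairs.
words = ["ROOT", "find", "preferred", "links"]
phi = {
    (0, 1): (2.0, 1.0),   # (value if link = t, value if link = f)
    (0, 3): (1.5, 1.0),
    (1, 2): (0.5, 1.0),
    (1, 3): (2.0, 1.0),
    (2, 3): (2.5, 1.0),
    (3, 2): (2.5, 1.0),
}

# Link-by-link MAP assignment: turn a link on iff its t-score beats its f-score.
on_links = [link for link, (t, f) in phi.items() if t > f]
print("links switched on:", on_links)

# A dependency tree needs exactly one head per word; check that property.
heads = {}
legal = True
for head, child in on_links:
    if child in heads:            # a word got two heads
        legal = False
    heads[child] = head
legal = legal and all(w in heads for w in range(1, len(words)))
print("single-head property satisfied?", legal)

With these illustrative scores the independent decisions give the word “links” several heads at once, which is exactly the failure mode the next slides fix with a global TREE factor.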

Page 113: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

58

• What factors determine parse probability?
✤ Unary factors to score each link in isolation

Global Factors for Parsing

find preferred links ……

58

Page 114: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

58

• What factors determine parse probability?
✤ Unary factors to score each link in isolation
✤ Global TREE factor to require links to form a legal tree

Global Factors for Parsing

find preferred links …

[TREE factor: ffffff→0, ffffft→0, fffftf→0, …, fftfft→1, …, tttttt→0]

58

Page 115: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

58

• What factors determine parse probability?
✤ Unary factors to score each link in isolation
✤ Global TREE factor to require links to form a legal tree
• A hard constraint: potential is either 0 or 1

Global Factors for Parsing

find preferred links …

[TREE factor: ffffff→0, ffffft→0, fffftf→0, …, fftfft→1, …, tttttt→0]

58

Page 116: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

59

• What factors determine parse probability?
✤ Unary factors to score each link in isolation
✤ Global TREE factor to require links to form a legal tree
• A hard constraint: potential is either 0 or 1

Global Factors for Parsing

find preferred links …

[figure: a t/f assignment over the links, with the TREE factor: ffffff→0, ffffft→0, fffftf→0, …, fftfft→1, …, tttttt→0]

we’re legal!

59

Page 117: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

59

• What factors determine parse probability?
✤ Unary factors to score each link in isolation
✤ Global TREE factor to require links to form a legal tree
• A hard constraint: potential is either 0 or 1

Global Factors for Parsing

find preferred links …

[figure: a t/f assignment over the links, with the TREE factor (64 entries, 0/1): ffffff→0, ffffft→0, fffftf→0, …, fftfft→1, …, tttttt→0]

optionally require the tree to be projective (no crossing links)

So far, this is equivalent to edge-factored parsing

we’re legal!

59

Page 118: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

59

• What factors determine parse probability?
✤ Unary factors to score each link in isolation
✤ Global TREE factor to require links to form a legal tree
• A hard constraint: potential is either 0 or 1

Global Factors for Parsing

find preferred links …

[figure: a t/f assignment over the links, with the TREE factor (64 entries, 0/1): ffffff→0, ffffft→0, fffftf→0, …, fftfft→1, …, tttttt→0]

optionally require the tree to be projective (no crossing links)

So far, this is equivalent to edge-factored parsing

we’re legal!

Note: traditional parsers don’t loop through this table to consider exponentially many trees one at a time. They use combinatorial algorithms; so should we!

59
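A hedged sketch of the TREE factor as a hard 0/1 potential (the candidate links and helper names below are mine, not from the slides): it returns 1 exactly when the links that are switched on form a tree rooted at ROOT, and brute-force enumeration over all joint assignments, like the 64-row table above, is exponential, which is why real parsers use combinatorial algorithms instead.

from itertools import product

def tree_factor(on_links, n_words):
    """Return 1 if on_links is a dependency tree over words 1..n_words rooted at 0, else 0."""
    heads = {}
    for head, child in on_links:
        if child in heads:              # more than one head -> not a tree
            return 0
        heads[child] = head
    if set(heads) != set(range(1, n_words + 1)):
        return 0                        # some word has no head
    for child in heads:                 # every word must reach ROOT (no cycles)
        node, seen = child, set()
        while node != 0:
            if node in seen:
                return 0
            seen.add(node)
            node = heads[node]
    return 1

# Six hypothetical candidate links over a 3-word sentence, i.e. 2**6 = 64 joint
# assignments, matching the "64 entries (0/1)" size of the table on the slide.
links = [(0, 1), (0, 2), (1, 2), (2, 1), (1, 3), (2, 3)]
n_legal = 0
for bits in product([False, True], repeat=len(links)):
    on = [link for link, b in zip(links, bits) if b]
    n_legal += tree_factor(on, 3)
print("assignments with TREE potential 1:", n_legal, "of", 2 ** len(links))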

Page 119: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

60

• What factors determine parse probability?
✤ Unary factors to score each link in isolation

✤ Global TREE factor to require links to form a legal tree

• A hard constraint: potential is either 0 or 1

✤ Second order effects: factors on 2 variables

• Grandparent–parent–child chains

Local Factors for Parsing

find preferred links ……

60

Page 120: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

60

• What factors determine parse probability?
✤ Unary factors to score each link in isolation

✤ Global TREE factor to require links to form a legal tree

• A hard constraint: potential is either 0 or 1

✤ Second order effects: factors on 2 variables

• Grandparent–parent–child chains

Local Factors for Parsing

find preferred links ……

[pairwise factor on two links:
   f  t
f  1  1
t  1  3]

60

Page 121: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

61

Local Factors for Parsing

find preferred links ……

• What factors determine parse probability?
✤ Unary factors to score each link in isolation

✤ Global TREE factor to require links to form a legal tree

• A hard constraint: potential is either 0 or 1

✤ Second order effects: factors on 2 variables

• Grandparent–parent–child chains

• No crossing links

• Siblings

✤ Hidden morphological tags

✤ Word senses and subcategorization frames

61

Page 122: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

62

Great Ideas in ML: Message Passing

adapted from MacKay (2003) textbook

62

Page 123: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

62

Great Ideas in ML: Message Passing

adapted from MacKay (2003) textbook

Count the soldiers

62

Page 124: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

62

Great Ideas in ML: Message Passing

adapted from MacKay (2003) textbook

Count the soldiers

3 behind

you

2 behind

you

1 behind

you

62

Page 125: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

62

Great Ideas in ML: Message Passing

there’s 1 of me

adapted from MacKay (2003) textbook

Count the soldiers

3 behind

you

2 behind

you

1 behind

you

62

Page 126: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

62

Great Ideas in ML: Message Passing

4 behind

you

5 behind

you

there’s 1 of me

adapted from MacKay (2003) textbook

Count the soldiers

3 behind

you

2 behind

you

1 behind

you

62

Page 127: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

62

Great Ideas in ML: Message Passing

4 behind

you

5 behind

you

1 before

you

2 before

you

adapted from MacKay (2003) textbook

Count the soldiers

3 behind

you

2 behind

you

1 behind

you

62

Page 128: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

62

Great Ideas in ML: Message Passing

4 behind

you

5 behind

you

1 before

you

2 before

you

3 before

you

4 before

you

5 before

you

adapted from MacKay (2003) textbook

Count the soldiers

3 behind

you

2 behind

you

1 behind

you

62

Page 129: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

63

Great Ideas in ML: Message Passing

2 before

you

there’s 1 of me

only see my incoming messages

Count the soldiers

adapted from MacKay (2003) textbook

3 behind

you

63

Page 130: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

63

Great Ideas in ML: Message Passing

2 before

you

there’s 1 of me

Belief: Must be
2 + 1 + 3 = 6 of us

only see my incoming messages

Count the soldiers

adapted from MacKay (2003) textbook

3 behind

you

63

Page 131: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

63

Great Ideas in ML: Message Passing

2 before

you

there’s 1 of me

Belief: Must be
2 + 1 + 3 = 6 of us

only see my incoming messages

[incoming messages: 2, 1, 3]

Count the soldiers

adapted from MacKay (2003) textbook

3 behind

you

63

Page 132: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

64

Belief: Must be
2 + 1 + 3 = 6 of us

[incoming messages: 2, 1, 3]

Great Ideas in ML: Message Passing

1 before

you

there’s 1 of me

only see my incoming messages

Count the soldiers

adapted from MacKay (2003) textbook

4 behind

you

64

Page 133: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

64

Great Ideas in ML: Message Passing

1 before

you

there’s 1 of me

only see my incoming messages

Belief: Must be
1 + 1 + 4 = 6 of us

[incoming messages: 1, 1, 4]

Count the soldiers

adapted from MacKay (2003) textbook

4 behind

you

64
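The soldier-counting trick is easy to write down for a chain; the sketch below (mine, not MacKay’s code) passes “how many are behind you” messages in both directions and lets every node total the line from its two incoming messages plus itself.

def count_on_chain(n):
    # forward[i]: message arriving at i from the left = soldiers before i
    forward = [0] * n
    for i in range(1, n):
        forward[i] = forward[i - 1] + 1
    # backward[i]: message arriving at i from the right = soldiers after i
    backward = [0] * n
    for i in range(n - 2, -1, -1):
        backward[i] = backward[i + 1] + 1
    # belief at i: before + after + "there's 1 of me"
    return [forward[i] + backward[i] + 1 for i in range(n)]

print(count_on_chain(6))   # every position concludes "6 of us", e.g. 2 + 1 + 3 = 6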

Page 134: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

65

Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

65

Page 135: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

65

Great Ideas in ML: Message Passing

7 here

3 here

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

65

Page 136: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

65

Great Ideas in ML: Message Passing

7 here

3 here

11 here (= 7+3+1)

1 of me

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

65

Page 137: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

66

Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

66

Page 138: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

66

Great Ideas in ML: Message Passing

3 here

3 here

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

66

Page 139: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

66

Great Ideas in ML: Message Passing

3 here

3 here

7 here (= 3+3+1)

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

66

Page 140: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

67

Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

67

Page 141: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

67

Great Ideas in ML: Message Passing

7 here

3 here

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

67

Page 142: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

67

Great Ideas in ML: Message Passing

7 here

3 here

11 here (= 7+3+1)

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

67

Page 143: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

68

Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

68

Page 144: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

68

Great Ideas in ML: Message Passing

Belief: Must be 14 of us

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

68

Page 145: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

68

Great Ideas in ML: Message Passing

7 here

3 here

3 here

Belief: Must be 14 of us

Each soldier receives reports from all branches of tree

adapted from MacKay (2003) textbook

68

Page 146: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

69

Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree

7 here

3 here

3 here

Belief: Must be 14 of us

adapted from MacKay (2003) textbook

69

Page 147: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

69

Great Ideas in ML: Message Passing
Each soldier receives reports from all branches of tree

7 here

3 here

3 here

Belief: Must be 14 of us

wouldn’t work correctly with a “loopy” (cyclic) graph

adapted from MacKay (2003) textbook

69

Page 148: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

70

……

find preferred tags

Great ideas in ML: Forward-Backward

• In the CRF, message passing = forward-backward

[tag–tag factor, shown on each edge of the chain:
   v  n  a
v  0  2  1
n  2  1  0
a  0  3  1]

70

Page 149: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

70

……

find preferred tags

Great ideas in ML: Forward-Backward

[belief: v 1.8, n 0, a 4.2]

• In the CRF, message passing = forward-backward

[tag–tag factor, shown on each edge of the chain:
   v  n  a
v  0  2  1
n  2  1  0
a  0  3  1]

70

Page 150: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

70

……

find preferred tags

Great ideas in ML: Forward-Backward

[belief: v 1.8, n 0, a 4.2]

[α and β messages: (v 2, n 1, a 7), (v 3, n 1, a 6)]

• In the CRF, message passing = forward-backward

[tag–tag factor, shown on each edge of the chain:
   v  n  a
v  0  2  1
n  2  1  0
a  0  3  1]

70

Page 151: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

70

……

find preferred tags

Great ideas in ML: Forward-Backward

[v 0.3, n 0, a 0.1]

[belief: v 1.8, n 0, a 4.2]

[α and β messages: (v 2, n 1, a 7), (v 3, n 1, a 6)]

• In the CRF, message passing = forward-backward

[tag–tag factor, shown on each edge of the chain:
   v  n  a
v  0  2  1
n  2  1  0
a  0  3  1]

70

Page 152: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

70

……

find preferred tags

Great ideas in ML: Forward-Backward

[v 0.3, n 0, a 0.1]

[belief: v 1.8, n 0, a 4.2]

[α and β messages: (v 2, n 1, a 7), (v 7, n 2, a 1), (v 3, n 1, a 6)]

• In the CRF, message passing = forward-backward

[tag–tag factor, shown on each edge of the chain:
   v  n  a
v  0  2  1
n  2  1  0
a  0  3  1]

70

Page 153: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

70

……

find preferred tags

Great ideas in ML: Forward-Backward

[v 0.3, n 0, a 0.1]

[belief: v 1.8, n 0, a 4.2]

[α and β messages: (v 2, n 1, a 7), (v 7, n 2, a 1), (v 3, n 1, a 6)]

• In the CRF, message passing = forward-backward

[tag–tag factor, shown on each edge of the chain:
   v  n  a
v  0  2  1
n  2  1  0
a  0  3  1]

70

Page 154: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

70

……

find preferred tags

Great ideas in ML: Forward-Backward

[v 0.3, n 0, a 0.1]

[belief: v 1.8, n 0, a 4.2]

[α and β messages: (v 2, n 1, a 7), (v 7, n 2, a 1), (v 3, n 1, a 6)]

• In the CRF, message passing = forward-backward

70

Page 155: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

70

……

find preferred tags

Great ideas in ML: Forward-Backward

[v 0.3, n 0, a 0.1]

[belief: v 1.8, n 0, a 4.2]

[α and β messages: (v 2, n 1, a 7), (v 7, n 2, a 1), (v 3, n 1, a 6)]

• In the CRF, message passing = forward-backward

[tag–tag factor, shown on each edge of the chain:
   v  n  a
v  0  2  1
n  2  1  0
a  0  3  1]

70

Page 156: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

70

……

find preferred tags

Great ideas in ML: Forward-Backward

[v 0.3, n 0, a 0.1]

[belief: v 1.8, n 0, a 4.2]

[α and β messages: (v 2, n 1, a 7), (v 7, n 2, a 1), (v 3, n 1, a 6)]

• In the CRF, message passing = forward-backward

[β message computed through the tag–tag factor: v 3, n 6, a 1]

[tag–tag factor:
   v  n  a
v  0  2  1
n  2  1  0
a  0  3  1]

70

Page 157: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

70

……

find preferred tags

Great ideas in ML: Forward-Backward

[v 0.3, n 0, a 0.1]

[belief: v 1.8, n 0, a 4.2]

[α and β messages: (v 2, n 1, a 7), (v 7, n 2, a 1), (v 3, n 1, a 6)]

• In the CRF, message passing = forward-backward

[β message computed through the tag–tag factor: v 3, n 6, a 1]

[tag–tag factor, shown on each edge of the chain:
   v  n  a
v  0  2  1
n  2  1  0
a  0  3  1]

70

Page 158: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

71

……

find preferred tags

Great ideas in ML: Forward-Backward

[α and β messages: (v 3, n 1, a 6), (v 2, n 1, a 7)]

[v 0.3, n 0, a 0.1]

71
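A minimal sketch of message passing as forward-backward on a linear-chain CRF, using the tag set {v, n, a} and the tag–tag factor from these slides; the per-word potentials and the helper code are my own assumptions, so the printed numbers are illustrative rather than the slide’s.

import numpy as np

tags = ["v", "n", "a"]
trans = np.array([[0., 2., 1.],   # rows and columns ordered v, n, a, as on the slide
                  [2., 1., 0.],
                  [0., 3., 1.]])

# Hypothetical per-word (unary) potentials for "... find preferred tags ..."
unary = {"find":      np.array([3.0, 1.0, 0.1]),
         "preferred": np.array([0.3, 0.1, 2.0]),
         "tags":      np.array([0.1, 3.0, 0.5])}
words = ["find", "preferred", "tags"]

# Forward (alpha) messages: sum over the previous tag through the factor.
alpha = [unary[words[0]]]
for w in words[1:]:
    alpha.append((alpha[-1] @ trans) * unary[w])

# Backward (beta) messages: sum over the next tag through the factor.
beta = [np.ones(3)]
for w in reversed(words[1:]):
    beta.insert(0, trans @ (unary[w] * beta[0]))

# Belief (unnormalized marginal) at each position = alpha * beta.
for i, w in enumerate(words):
    print(w, dict(zip(tags, np.round(alpha[i] * beta[i], 2))))

Each position’s belief sums to the same total (the partition function), which is the chain-CRF version of the soldiers all agreeing on the same count.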

Page 159: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

71

……

find preferred tags

Great ideas in ML: Forward-Backward

[α and β messages: (v 3, n 1, a 6), (v 2, n 1, a 7)]

[v 0.3, n 0, a 0.1]

71

Page 160: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

71

• Extend CRF to “skip chain” to capture non-local factor

……

find preferred tags

Great ideas in ML: Forward-Backward

[α and β messages: (v 3, n 1, a 6), (v 2, n 1, a 7)]

[v 0.3, n 0, a 0.1]

71

Page 161: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

71

• Extend CRF to “skip chain” to capture non-local factor
✤ More influences on belief ☺

……

find preferred tags

Great ideas in ML: Forward-Backward

[α and β messages: (v 3, n 1, a 6), (v 2, n 1, a 7)]

[skip-chain message: v 3, n 1, a 6]

[new belief: v 5.4, n 0, a 25.2]

[v 0.3, n 0, a 0.1]

71

Page 162: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

72

• Extend CRF to “skip chain” to capture non-local factor
✤ More influences on belief ☺
✤ Graph becomes loopy ☹

……

find preferred tags

Great ideas in ML: Forward-Backward

[α and β messages: (v 3, n 1, a 6), (v 2, n 1, a 7)]

[skip-chain message: v 3, n 1, a 6]

[new belief: v 5.4, n 0, a 25.2]

[v 0.3, n 0, a 0.1]

72

Page 163: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

72

• Extend CRF to “skip chain” to capture non-local factor
✤ More influences on belief ☺
✤ Graph becomes loopy ☹

……

find preferred tags

Great ideas in ML: Forward-Backward

[α and β messages: (v 3, n 1, a 6), (v 2, n 1, a 7)]

[skip-chain message: v 3, n 1, a 6]

[new belief: v 5.4, n 0, a 25.2]

[v 0.3, n 0, a 0.1]

Red messages not independent? Pretend they are!

72

Page 164: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

72

• Extend CRF to “skip chain” to capture non-local factor
✤ More influences on belief ☺
✤ Graph becomes loopy ☹

……

find preferred tags

Great ideas in ML: Forward-Backward

[α and β messages: (v 3, n 1, a 6), (v 2, n 1, a 7)]

[skip-chain message: v 3, n 1, a 6]

[new belief: v 5.4, n 0, a 25.2]

[v 0.3, n 0, a 0.1]

Red messages not independent? Pretend they are! “Loopy Belief Propagation”

72
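The arithmetic behind the changed belief on this slide is just one more elementwise multiplication; the snippet below uses the numbers shown on the slide.

import numpy as np

chain_belief = np.array([1.8, 0.0, 4.2])   # v, n, a: belief from the chain alone
skip_message = np.array([3.0, 1.0, 6.0])   # message arriving over the skip edge
print(chain_belief * skip_message)         # -> [ 5.4  0.  25.2]
# The skip edge closes a cycle, so these incoming messages are no longer
# independent; loopy BP simply pretends they are and keeps iterating.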

Page 165: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

73

Terminological Clarification

propagation

73

Page 166: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

73

Terminological Clarification

belief propagation

73

Page 167: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

73

Terminological Clarification

beliefloopy propagation

73

Page 168: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

73

Terminological Clarification

beliefloopy propagation

73

Page 169: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

73

Terminological Clarification

beliefloopy propagation

73

Page 170: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

73

Terminological Clarification

beliefloopy propagation

loopy belief propagation

73

Page 171: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

73

Terminological Clarification

beliefloopy propagation

loopy belief propagation

73

Page 172: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

74

• Loopy belief propagation is easy for local factors

• How do combinatorial factors (like TREE) compute the message to the link in question?
✤ “Does the TREE factor think the link is probably t given the messages it receives from all the other links?”

Propagating Global Factors

find preferred links ……

?

74

Page 173: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

74

• Loopy belief propagation is easy for local factors

• How do combinatorial factors (like TREE) compute the message to the link in question?
✤ “Does the TREE factor think the link is probably t given the messages it receives from all the other links?”

Propagating Global Factors

find preferred links ……

?

74

Page 174: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

74

• Loopy belief propagation is easy for local factors

• How do combinatorial factors (like TREE) compute the message to the link in question?
✤ “Does the TREE factor think the link is probably t given the messages it receives from all the other links?”

Propagating Global Factors

find preferred links …

[? — queried link; TREE factor: ffffff→0, ffffft→0, fffftf→0, …, fftfft→1, …, tttttt→0]

74

Page 175: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

75

• How does the TREE factor compute the message to the link in question?
✤ “Does the TREE factor think the link is probably t given the messages it receives from all the other links?”

Propagating Global Factors

find preferred links …

[? — queried link; TREE factor: ffffff→0, ffffft→0, fffftf→0, …, fftfft→1, …, tttttt→0]

75

Page 176: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

75

• How does the TREE factor compute the message to the link in question?
✤ “Does the TREE factor think the link is probably t given the messages it receives from all the other links?”

Propagating Global Factors

find preferred links …

[? — queried link; TREE factor: ffffff→0, ffffft→0, fffftf→0, …, fftfft→1, …, tttttt→0]

Old-school parsing to the rescue!

This is the outside probability of the link in an edge-factored parser!

∴ TREE factor computes all outgoing messages at once (given all incoming messages)

Projective case: total O(n³) time by inside-outside

Non-projective: total O(n³) time by inverting Kirchhoff matrix

75

Page 177: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

76

Graph Theory to the Rescue!

Tutte’s Matrix-Tree Theorem (1948)

The determinant of the Kirchhoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r.

76

Page 178: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

76

Graph Theory to the Rescue!

Tutte’s Matrix-Tree Theorem (1948)

The determinant of the Kirchhoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r.

Exactly the Z we need!

76

Page 179: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

76

Graph Theory to the Rescue!

Tutte’s Matrix-Tree Theorem (1948)

The determinant of the Kirchhoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r.

Exactly the Z we need!

O(n³) time!

76

Page 180: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

77

Kirchhoff (Laplacian) Matrix

\[
\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & 0 & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & 0 & \cdots & -s(n,2) \\
\vdots & & & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & 0
\end{bmatrix}
\]

• Negate edge scores
• Sum columns (children)
• Strike root row/col.
• Take determinant

77

Page 181: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

77

Kirchhoff (Laplacian) Matrix

\[
\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & 0 & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & 0 & \cdots & -s(n,2) \\
\vdots & & & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & 0
\end{bmatrix}
\]

• Negate edge scores
• Sum columns (children)
• Strike root row/col.
• Take determinant

77

Page 182: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

77

Kirchhoff (Laplacian) Matrix

\[
\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & 0 & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & 0 & \cdots & -s(n,2) \\
\vdots & & & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & 0
\end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & \sum_{j \neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & \sum_{j \neq 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & & & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & \sum_{j \neq n} s(n,j)
\end{bmatrix}
\]

• Negate edge scores
• Sum columns (children)
• Strike root row/col.
• Take determinant

77

Page 183: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

77

Kirchhoff (Laplacian) Matrix

\[
\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & 0 & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & 0 & \cdots & -s(n,2) \\
\vdots & & & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & 0
\end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & \sum_{j \neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & \sum_{j \neq 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & & & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & \sum_{j \neq n} s(n,j)
\end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix}
\sum_{j \neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
-s(1,2) & \sum_{j \neq 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & & \ddots & \vdots \\
-s(1,n) & -s(2,n) & \cdots & \sum_{j \neq n} s(n,j)
\end{bmatrix}
\]

• Negate edge scores
• Sum columns (children)
• Strike root row/col.
• Take determinant

77

Page 184: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

77

Kirchhoff (Laplacian) Matrix

\[
\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & 0 & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & 0 & \cdots & -s(n,2) \\
\vdots & & & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & 0
\end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & \sum_{j \neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & \sum_{j \neq 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & & & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & \sum_{j \neq n} s(n,j)
\end{bmatrix}
\;\Rightarrow\;
\begin{bmatrix}
\sum_{j \neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
-s(1,2) & \sum_{j \neq 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & & \ddots & \vdots \\
-s(1,n) & -s(2,n) & \cdots & \sum_{j \neq n} s(n,j)
\end{bmatrix}
\]

• Negate edge scores
• Sum columns (children)
• Strike root row/col.
• Take determinant

N.B.: This allows multiple children of root, but see Koo et al. 2007.

77
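Here is a small sketch of the slide’s recipe (my own code with made-up multiplicative edge scores): build the Kirchhoff matrix from the scores, strike the root row and column, and take the determinant; a brute-force sum over head assignments is included only to check the result on a tiny example.

import numpy as np
from itertools import product

def kirchhoff_Z(s, n):
    """s[(child, head)] = edge score; nodes 0..n, node 0 = ROOT."""
    L = np.zeros((n + 1, n + 1))
    for (c, h), score in s.items():
        L[h, c] -= score                  # off-diagonal: negated edge score
        L[c, c] += score                  # diagonal: column (child) sums
    return np.linalg.det(L[1:, 1:])       # strike root row/col, take determinant

# Hypothetical scores for a 3-word sentence; any head except the word itself.
n = 3
s = {(c, h): 1.0 + 0.5 * ((c + h) % 3)
     for c in range(1, n + 1) for h in range(n + 1) if h != c}

# Brute force: every word picks a head; keep only choices that form a tree.
def is_tree(heads):
    for c in range(1, n + 1):
        node, seen = c, set()
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True

Z_brute = 0.0
for choice in product(range(n + 1), repeat=n):        # heads for words 1..n
    heads = {c: choice[c - 1] for c in range(1, n + 1)}
    if all(heads[c] != c for c in heads) and is_tree(heads):
        score = 1.0
        for c, h in heads.items():
            score *= s[(c, h)]
        Z_brute += score

print(round(kirchhoff_Z(s, n), 6), round(Z_brute, 6))  # the two totals should match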

Page 185: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing

• Linear time

• Online

• Train a classifier to predict next action

• Deterministic or beam-search strategies

• But... generally less accurate

78

Page 186: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing

Parsing Methods

Arc-Eager Shift-Reduce Parsing [Nivre 2003]

Start state: ([ ], [1, …, n], { })

Final state: (S, [ ], A)

Shift: (S, i|B, A) ⇒ (S|i, B, A)

Reduce: (S|i, B, A) ⇒ (S, B, A)

Right-Arc: (S|i, j|B, A) ⇒ (S|i|j, B, A ∪ {i → j})

Left-Arc: (S|i, j|B, A) ⇒ (S, j|B, A ∪ {i ← j})

Arc-eager shift-reduce parsing (Nivre, 2003)

79
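A minimal sketch (mine) of the arc-eager transition system above, replaying the scripted action sequence for “who did you see” from the following slides; a real parser would choose each action with a trained classifier rather than a fixed script.

def arc_eager(words, actions):
    stack, buf, arcs = [], list(range(1, len(words) + 1)), set()
    for act in actions:
        if act == "shift":
            stack.append(buf.pop(0))
        elif act == "reduce":
            stack.pop()
        elif act.startswith("right-arc"):            # stack top -> buffer front
            label = act.split(":")[1]
            arcs.add((stack[-1], label, buf[0]))
            stack.append(buf.pop(0))
        elif act.startswith("left-arc"):             # buffer front -> stack top
            label = act.split(":")[1]
            arcs.add((buf[0], label, stack.pop()))
        print(act, [words[i - 1] for i in stack], [words[i - 1] for i in buf])
    return arcs

words = ["who", "did", "you", "see"]
actions = ["shift", "left-arc:OBJ", "shift", "right-arc:SBJ", "reduce", "right-arc:VG"]
arcs = arc_eager(words, actions)
print(sorted((words[h - 1], l, words[d - 1]) for h, l, d in arcs))
# -> [('did', 'OBJ', 'who'), ('did', 'SBJ', 'you'), ('did', 'VG', 'see')]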

Page 187: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003)

Parsing Methods

Parsing Example

Stack Buffer Arcs

[ ]S [who, did, you, see]B { }

who ←OBJ− did   did −SBJ→ you   did −VG→ see

80

Page 188: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003)

Parsing Methods

Parsing Example

Stack Buffer Arcs

[who]S [did, you, see]B { }

who ←OBJ− did   did −SBJ→ you   did −VG→ see

Shift

81

Page 189: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003)

Parsing Methods

Parsing Example

Stack Buffer Arcs

[ ]S [did, you, see]B { who ←OBJ− did }

did −SBJ→ you   did −VG→ see

Left-arc OBJ

82

Page 190: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003)

Parsing Methods

Parsing Example

Stack Buffer Arcs

[did]S [you, see]B { who ←OBJ− did }

did −SBJ→ you   did −VG→ see

Shift

83

Page 191: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003)

Parsing Methods

Parsing Example

Stack Buffer Arcs

[did, you]S [see]B { who ←OBJ− did,  did −SBJ→ you }

did −VG→ see

Right-arc SBJ

84

Page 192: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003)

Parsing Methods

Parsing Example

Stack Buffer Arcs

[did]S [see]B { who ←OBJ− did,  did −SBJ→ you }

did −VG→ see

Reduce

85

Page 193: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003)

Parsing Methods

Parsing Example

Stack Buffer Arcs

[did, see]S [ ]B { who ←OBJ− did,  did −SBJ→ you,  did −VG→ see }

Right-arc VG

86

Page 194: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003)

Parsing Methods

Parsing Example

Stack Buffer Arcs

[did, you]S [see]B { who ←OBJ− did,  did −SBJ→ you }

did −VG→ see

Right-arc SBJ

87

Page 195: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003)

Parsing Methods

Parsing Example

Stack Buffer Arcs

[did, you]S [see]B { who ←OBJ− did,  did −SBJ→ you }

did −VG→ see

Right-arc SBJ

Choose action w/ best classifier score
100k–1M features

87

Page 196: Log-Linear ModelsLog-Linear Models with Structured Outputs Natural Language Processing CS 6120—Spring 2013 Northeastern University David Smith (some slides from Andrew McCallum)

Transition-Based Parsing
Arc-eager shift-reduce parsing (Nivre, 2003)

Parsing Methods

Parsing Example

Stack Buffer Arcs

[did, you]S [see]B { who ←OBJ− did,  did −SBJ→ you }

did −VG→ see

Right-arc SBJ

Choose action w/ best classifier score
100k–1M features

Very fast linear-time performance
WSJ 23 (2k sentences) in 3 s

87