Decomposing Structured Prediction via Constrained Conditional Models
DESCRIPTION
Decomposing Structured Prediction via Constrained Conditional Models. Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign. With thanks to collaborators: Ming-Wei Chang, Lev Ratinov, Rajhans Samdani, Vivek Srikumar, and many others.
TRANSCRIPT
June 2013, SLG Workshop, ICML, Atlanta GA
Decomposing Structured Prediction viaConstrained Conditional Models
Dan RothDepartment of Computer ScienceUniversity of Illinois at Urbana-Champaign
Page 1
With thanks to: Collaborators: Ming-Wei Chang, Lev Ratinov, Rajhans Samdani, Vivek Srikumar,
Many others. Funding: NSF; DHS; NIH; DARPA. DASH Optimization (Xpress-MP).
Comprehension
1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
This is an Inference Problem
Page 2
Learning and Inference
Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
In current NLP we often think about simpler structured problems: parsing, information extraction, SRL, etc.
As we move up the problem hierarchy (textual entailment, QA, …), not all component models can be learned simultaneously.
We need to think about (learned) models for different sub-problems, often pipelined.
Knowledge relating sub-problems (constraints) becomes more essential and may appear only at evaluation time.
Goal: Incorporate models' information, along with prior knowledge (constraints), in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.
Page 3
Outline
Constrained Conditional Models: a formulation for global inference with knowledge modeled as expressive structural constraints
A Structured Prediction Perspective
Decomposed Learning (DecL): efficient structure learning by reducing the learning-time inference to a small output space; conditions for when DecL is provably identical to global structural learning (GL)
Page 4
Three Ideas Underlying Constrained Conditional Models
Idea 1: Separate modeling and problem formulation from algorithms (similar to the philosophy of probabilistic modeling)
Idea 2: Keep models simple, make expressive decisions (via constraints), unlike probabilistic modeling, where models become more expressive
Idea 3: Expressive structured decisions can be supported by simply learned models, amplified and minimally supervised by exploiting dependencies among models' outcomes.
Modeling
Inference
Learning
Page 5
Inference with General Constraint Structure [Roth & Yih '04, '07]: Recognizing Entities and Relations
Dole 's wife, Elizabeth , is a native of N.C.   (entities E1, E2, E3; relations R12, R23)
Key Questions: How to guide the global inference over independently learned or pipelined models? How to learn: independently, pipelined, or jointly?
[Figure: independently learned score tables for each entity over {per, loc, other} (e.g., 0.85 / 0.10 / 0.05; 0.50 / 0.45 / 0.05; 0.60 / 0.30 / 0.10) and for each relation over {spouse_of, born_in, irrelevant} (e.g., 0.05 / 0.85 / 0.10; 0.45 / 0.50 / 0.05); joint inference with the constraints picks a coherent global assignment.]
Significant performance Improvement
Models could be learned separately; constraints may come up only at decision time.
Page 6
Note: Non-Sequential Model
y = argmax_y Σ_v score(y = v) · 1[y = v]
  = argmax  score(E1 = PER) · 1[E1 = PER] + score(E1 = LOC) · 1[E1 = LOC] + …
           + score(R12 = spouse-of) · 1[R12 = spouse-of] + …
Subject to Constraints
An Objective function that incorporates learned models with knowledge (constraints)
A Constrained Conditional Model
Constrained Conditional Models
How to solve?
This is an Integer Linear Program
Solving using ILP packages gives an exact solution. Cutting Planes, Dual Decomposition & other search techniques are possible
The objective combines a weight vector for the "local" models (features, classifiers; log-linear models such as an HMM or CRF, or a combination) with a (soft) constraints component: each constraint C_k carries a penalty ρ_k for violating it, weighted by d(y, 1_{C_k}), how far y is from a "legal" assignment:
    argmax_y  w^T φ(x, y) - Σ_k ρ_k d(y, 1_{C_k(x)})
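To make "This is an Integer Linear Program" concrete, here is a minimal sketch (not from the talk) of CCM inference for the entities-and-relations example above, written with the PuLP package; the scores, the variable names (E1, E2, R12), and the two argument-type constraints are illustrative assumptions.

```python
# A minimal sketch of CCM inference as an ILP using PuLP.
# Scores and the argument-type constraints are illustrative, not the slide's exact values.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

ENT_LABELS = ["per", "loc", "other"]
REL_LABELS = ["spouse_of", "born_in", "irrelevant"]

# Locally learned scores (e.g., from independently trained classifiers).
ent_scores = {"E1": {"per": 0.85, "loc": 0.10, "other": 0.05},
              "E2": {"per": 0.50, "loc": 0.45, "other": 0.05}}
rel_scores = {"R12": {"spouse_of": 0.45, "born_in": 0.50, "irrelevant": 0.05}}

prob = LpProblem("ccm_inference", LpMaximize)

# One binary indicator per (variable, label) pair.
e = {(n, l): LpVariable(f"{n}_{l}", cat=LpBinary) for n in ent_scores for l in ENT_LABELS}
r = {(n, l): LpVariable(f"{n}_{l}", cat=LpBinary) for n in rel_scores for l in REL_LABELS}

# Objective: sum of the local scores of the chosen labels.
prob += lpSum(ent_scores[n][l] * e[n, l] for n in ent_scores for l in ENT_LABELS) + \
        lpSum(rel_scores[n][l] * r[n, l] for n in rel_scores for l in REL_LABELS)

# Each variable takes exactly one label.
for n in ent_scores:
    prob += lpSum(e[n, l] for l in ENT_LABELS) == 1
for n in rel_scores:
    prob += lpSum(r[n, l] for l in REL_LABELS) == 1

# Declarative constraints on argument types, written as linear inequalities:
# spouse_of(E1, E2) requires both arguments to be PER; born_in(E1, E2) requires E2 to be LOC.
prob += r["R12", "spouse_of"] <= e["E1", "per"]
prob += r["R12", "spouse_of"] <= e["E2", "per"]
prob += r["R12", "born_in"] <= e["E2", "loc"]

prob.solve()
chosen = [key for key, var in list(e.items()) + list(r.items()) if var.value() == 1]
print(chosen)
```

Declarative constraints of this kind only enter at inference time; the local scores can come from models trained completely separately.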
How to train?
Training is learning the objective function
Decouple? Decompose?
How to exploit the structure to minimize supervision?
Page 7
Inferning workshop
Placing in context: a crash course in structured prediction
Structured Prediction: Inference
Inference: given input x (a document, a sentence), predict the best structure y = {y1, y2, …, yn} ∈ Y (entities & relations), i.e., assign values to y1, y2, …, yn, accounting for the dependencies among the yi.
Inference is expressed as a maximization of a scoring function:
    y' = argmax_{y ∈ Y} w^T φ(x, y)
where φ(x, y) are joint features on inputs and outputs, w are the feature weights (estimated during learning), and Y is the set of allowed structures.
Inference requires, in principle, enumerating all y ∈ Y at decision time: we are given x ∈ X and attempt to determine the best y ∈ Y for it, given w. For some structures, inference is computationally easy, e.g., using the Viterbi algorithm; in general it is NP-hard (and can be formulated as an ILP).
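As an aside on the "computationally easy" case, here is a small illustrative Viterbi sketch for a linear-chain scoring model; the label set and the emission/transition scores are made up for the example, and these per-position scores stand in for the relevant parts of w^T φ(x, y).

```python
# A minimal Viterbi sketch for a linear-chain model: exact argmax over all label
# sequences in O(n * |labels|^2) time. All scores here are illustrative.
def viterbi(obs, labels, emit, trans):
    # emit[label][word] and trans[prev][label] are local scores (log-space).
    best = [{l: emit[l].get(obs[0], -10.0) for l in labels}]
    back = []
    for word in obs[1:]:
        scores, pointers = {}, {}
        for l in labels:
            prev_best = max(labels, key=lambda p: best[-1][p] + trans[p][l])
            scores[l] = best[-1][prev_best] + trans[prev_best][l] + emit[l].get(word, -10.0)
            pointers[l] = prev_best
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    last = max(labels, key=lambda l: best[-1][l])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

labels = ["PER", "OTH"]
emit = {"PER": {"Dole": 2.0, "Elizabeth": 1.5}, "OTH": {"wife": 1.0, ",": 1.0}}
trans = {"PER": {"PER": 0.5, "OTH": 0.0}, "OTH": {"PER": 0.0, "OTH": 0.5}}
print(viterbi(["Dole", ",", "wife"], labels, emit, trans))
```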
Page 8
Structured Prediction: Learning
Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss. Learning is thus driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):
    ∀ y:  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)
Page 9
Structured Prediction: Learning
Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss. Learning is thus driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):
    ∀ y:  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)
    (score of the annotated structure ≥ score of any other structure + penalty for predicting the other structure)
We call these conditions the learning constraints.
In most structured learning algorithms used today, the update of the weight vector w is done in an on-line fashion.
W.l.o.g. (almost) we can thus write the generic structured learning algorithm as follows. What follows is a Structured Perceptron, but with minor variations this procedure applies to CRFs and Linear Structured SVMs.
Page 10
In the structured case, the prediction (inference) step is often intractable and needs to be done many times
Structured Prediction: Learning Algorithm
For each example (xi, yi) do (with the current weight vector w):
    Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w^T φ(xi, y)
    Check the learning constraints: is the score of the current prediction better than that of (xi, yi)?
    If yes (a mistaken prediction): update w
    Otherwise: no need to update w on this example
EndFor
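The loop above can be written compactly as a generic structured perceptron; this is a sketch under the assumption that a problem-specific feature map `features` (the φ above) and an inference routine `argmax_infer` are supplied by the caller, not the authors' actual implementation.

```python
# Generic structured perceptron: the inference step (argmax) is called once per
# example per epoch, which is why its cost dominates training.
import numpy as np

def structured_perceptron(examples, features, argmax_infer, dim, epochs=5, lr=1.0):
    """examples: list of (x, y_gold); features(x, y) -> np.ndarray of size dim;
    argmax_infer(x, w) -> highest-scoring structure under the current w."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in examples:
            y_pred = argmax_infer(x, w)                      # Predict with current w
            if np.dot(w, features(x, y_pred)) >= np.dot(w, features(x, y_gold)):
                # Learning constraint violated (prediction scores at least as high):
                # move w toward the gold structure and away from the prediction.
                w += lr * (features(x, y_gold) - features(x, y_pred))
    return w
```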
Page 11
Structured Prediction: Learning Algorithm
For each example (xi, yi) do:
    Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
    Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    If yes (a mistaken prediction): update w
    Otherwise: no need to update w on this example
EndDo
Solution I: decompose the scoring function into EASY and HARD parts.
EASY could be feature functions that correspond to an HMM, a linear CRF, or a bank of classifiers (omitting the dependence on y at learning time). This may not be enough if the HARD part is still part of each inference step.
Page 12
Structured Prediction: Learning Algorithm
For each example (xi, yi) do:
    Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
    Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    If yes (a mistaken prediction): update w
    Otherwise: no need to update w on this example
EndDo
Solution II: Disregard some of the dependencies: assume a simple model.
Page 13
Structured Prediction: Learning Algorithm
For each example (xi, yi) do:
    Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
    Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    If yes (a mistaken prediction): update w
    Otherwise: no need to update w on this example
EndDo
(For this solution, the learning-time inference drops the HARD component: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y); the HARD dependencies are brought back only at decision time.)
This is the most commonly used solution in NLP today
Solution III: Disregard some of the dependencies during learning; take into account at decision time
Page 14
Linguistics Constraints
Cannot have both A states and B states in an output sequence.
Linguistics Constraints
If a modifier is chosen, include its head. If a verb is chosen, include its arguments.
Examples: CCM Formulations
CCMs can be viewed as a general interface to easily combine declarative domain knowledge with data driven statistical models
Sequential prediction, HMM/CRF based:  argmax Σ λ_ij x_ij
Sentence compression/summarization, language-model based:  argmax Σ λ_ijk x_ijk
Formulate NLP problems as ILP problems (inference may be done otherwise):
1. Sequence tagging (HMM/CRF + global constraints)
2. Sentence compression (language model + global constraints)
3. SRL (independent classifiers + global constraints)
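As an illustration of how such declarative constraints compile into linear (in)equalities over indicator variables, here is a hedged sketch for a tiny tagging ILP using PuLP; the label names and the particular encoding of the "not both A and B" constraint are assumptions for the example, not the talk's exact formulation.

```python
# Sketch: compiling two declarative constraints into linear inequalities over
# binary indicators x[i, label] ("token i gets this label"). PuLP is assumed;
# the label names are illustrative, and the objective (local scores) is omitted.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

n, labels = 4, ["A", "B", "O"]
prob = LpProblem("tagging", LpMaximize)
x = {(i, l): LpVariable(f"x_{i}_{l}", cat=LpBinary) for i in range(n) for l in labels}
for i in range(n):                                   # one label per token
    prob += lpSum(x[i, l] for l in labels) == 1

# "Cannot have both A states and B states in an output sequence":
# a single binary switch forces one of the two label types to be unused.
use_A = LpVariable("use_A", cat=LpBinary)
prob += lpSum(x[i, "A"] for i in range(n)) <= n * use_A
prob += lpSum(x[i, "B"] for i in range(n)) <= n * (1 - use_A)

# "If a modifier is chosen, include its head": an implication becomes an inequality
# between the corresponding indicator variables, e.g.:
# prob += x_modifier <= x_head
```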
Page 15
(Soft) constraints component is more general since constraints can be declarative, non-grounded statements.
Constrained Conditional Models allow learning a simple model (or multiple models, or pipelines) while making decisions with a more complex model. This is accomplished by directly incorporating constraints to bias/re-rank global decisions composed of simpler models' decisions. More sophisticated algorithmic approaches exist to bias the output [CoDL: Chang et al. '07, '12; PR: Ganchev et al. '10; UEM: Samdani et al. '12].
Outline
Constrained Conditional Models: a formulation for global inference with knowledge modeled as expressive structural constraints
A Structured Prediction Perspective
Decomposed Learning (DecL): efficient structure learning by reducing the learning-time inference to a small output space; conditions for when DecL is provably identical to global structural learning (GL)
Page 16
Training: independently of the constraints (L+I); jointly, in the presence of the constraints (IBT, GL); decomposed to simpler models.
Not surprisingly, decomposition is good. See [Chang et. al., Machine Learning Journal 2012]
Little can be said theoretically on the quality/generalization of predictions made with a decomposed model
Next, an algorithmic approach to decomposition that is both good, and comes with interesting guarantees.
Training Constrained Conditional Models
Decompose Model
Decompose Model from constraints
Page 17
In Global Learning, the output space is exponential in the number of variables – accurate learning can be intractable
“Standard” ways to decompose it forget some of the structure and bring it back only at decision time
Decomposed Structured Prediction
Learning is driven by the attempt to find a weight vector w such that for each given annotated example (xi , yi):
Page 18
    ∀ y:  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)   (Learning)
[Figure: an entity-relation structure over output variables y1, …, y6, with entity weights w_e and relation weights w_r, illustrating Learning and Inference.]
    y = argmax_{y ∈ Y} w^T φ(x, y)   (Inference)
Decomposed Structural Learning (DecL) [Samdani & Roth, ICML'12]
Algorithm: restrict the 'argmax' inference to a small subset of the output variables while fixing the remaining variables to their ground-truth values yi, and repeat for different subsets of the output variables: a decomposition. The resulting set of assignments considered for yi is called a neighborhood, nbr(yi).
Key contribution: we give conditions under which DecL is provably equivalent to Global Learning (GL), and show experimentally that DecL provides results close to GL when such conditions do not exactly hold.
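A minimal sketch of the neighborhood construction just described, assuming outputs are represented as dictionaries from variable names to labels and that a problem-specific `score` function is available; this illustrates the idea, it is not the paper's implementation.

```python
# DecL neighborhood: enumerate assignments over each small subset S of output
# variables, keeping every variable outside S fixed at its gold value.
from itertools import product

def neighborhood(y_gold, decomposition, label_set):
    """y_gold: dict var -> gold label; decomposition: list of variable subsets."""
    nbr = []
    for S in decomposition:
        for assignment in product(label_set, repeat=len(S)):
            y = dict(y_gold)                 # start from the ground truth
            y.update(zip(S, assignment))     # perturb only the variables in S
            nbr.append(y)
    return nbr

# Learning-time "inference" is now an argmax over this small set instead of all of Y:
def argmax_over_nbr(x, w, nbr, score):
    return max(nbr, key=lambda y: score(x, y, w))
```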
[Figure: the six output variables y1, …, y6 are decomposed into small subsets; within each subset all assignments are enumerated (00/01/10/11 for a pair, 000 through 111 for a triple) while the variables outside the subset stay at their gold values.]
Related work: Pseudolikelihood – Besag, 77; Piecewise Pseudolikelihood – Sutton and McCallum, 07; Pseudomax – Sontag et al, 10
Page 19
DecL vs. Global Learning (GL)
DecL: separate the ground truth from every y ∈ nbr(yj); in this example, only 16 outputs.
GL: separate the ground truth from every y ∈ Y; in this example, 2^6 = 64 outputs.
[Figure: the same six output variables; DecL enumerates assignments within each subset of the decomposition, while GL enumerates all 64 complete assignments.]
Likely scenario: nbr(yj) ≪ Y.
What are good neighborhoods?
w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)
Page 20
Creating Decompositions
DecL allows different decompositions Sj for different training instances yj.
Example: learning with decompositions in which all subsets of size k are considered (DecL-k). For each k-subset of the variables, enumerate its assignments and keep the remaining n - k variables at their gold values. k = 1 is Pseudomax [Sontag et al., 2010]; k = 2 is Constraint Classification [Har-Peled, Zimak, Roth 2002; Crammer, Singer 2002].
In practice, neighborhoods should be determined based on domain knowledge: put highly coupled variables in the same set.
The goal is to get results that are close to doing exact inference. Are there small and good neighborhoods?
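For DecL-k specifically, the decomposition is simply the set of all k-subsets of the output variables; a tiny illustrative sketch (using only the itertools standard library) of how the number of enumerated assignments compares to the full output space:

```python
# DecL-k: the decomposition consists of all size-k subsets of the output variables;
# each subset is enumerated while the remaining n - k variables stay at gold.
from itertools import combinations

def decl_k_decomposition(variables, k):
    return [list(S) for S in combinations(variables, k)]

# Example: 6 output variables, k = 2 gives C(6, 2) = 15 subsets, so the neighborhood
# has at most 15 * |labels|^2 assignments instead of |labels|^6 for global learning.
print(len(decl_k_decomposition(["y1", "y2", "y3", "y4", "y5", "y6"], 2)))  # -> 15
```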
Page 21
Different label space
Exactness of DecL. Key result: YES, under "reasonable conditions" DecL with small neighborhoods nbr(yj) gives the same results as Global Learning.
For analyzing the equivalence between DecL and GL, we need a notion of 'separability' of the data.
Separability: existence of a set of weights W* that satisfy
    W* = {w | w · φ(xj, yj) ≥ w · φ(xj, y) + Δ(yj, y), ∀ y ∈ Y}
Separating weights for DecL:
    Wdecl = {w | w · φ(xj, yj) ≥ w · φ(xj, y) + Δ(yj, y), ∀ y ∈ nbr(yj)}
Naturally: W* ⊆ Wdecl.
Exactness Results: The set of separating weights for DecL is equal to the set of separating weights for GL: W*=Wdecl
[Figure: in weight space, W* ⊆ Wdecl; a separating w keeps the score of the ground truth yj above the score of all non-ground-truth y.]
Page 22
Example of Exactness: Pairwise Markov Networks
Scoring function defined over a graph with edges E: a sum of singleton/vertex components φ_i(y_i) and pairwise/edge components φ_{i,k}(y_i, y_k).
Assume domain knowledge on W*: for a correct (separating) w ∈ W*, we know which of the pairwise φ_{i,k}(.; w) are
    Submodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) > φ_{i,k}(0,1) + φ_{i,k}(1,0), or
    Supermodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) < φ_{i,k}(0,1) + φ_{i,k}(1,0).
[Figure: a pairwise Markov network over output variables y1, …, y6 with singleton/vertex components and pairwise/edge components.]
Page 23
Decomposition for Pairwise Markov Networks
For an example (xj, yj), define Ej by removing edges from E where the labels disagree with the φ's.
Theorem: Decomposing the variables as connected components of Ej yields Exactness.
[Table: which edges are removed from E to form Ej, depending on whether the edge potential φ is submodular or supermodular and on the gold labels of its endpoints.]
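A sketch of the machinery behind the theorem: once Ej is built by the edge keep/drop rule summarized in the table above, the decomposition is just the set of connected components of Ej. The rule itself is left as a caller-supplied edge set here, since only its outcome matters for the component computation; this is an illustration, not the paper's code.

```python
# Given the filtered edge set Ej, the decomposition used by DecL is the set of
# connected components of Ej; each component is then enumerated as one subset.
def connected_components(variables, edges_j):
    adj = {v: [] for v in variables}
    for u, v in edges_j:
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for start in variables:
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        components.append(comp)
    return components

# E.g., if only the edges (y1, y2) and (y4, y5) survive the filter, the decomposition
# is [[y1, y2], [y3], [y4, y5], [y6]], and DecL enumerates each component separately.
```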
Page 24
Experiments: Information extraction
Prediction result of a trained HMM:
Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
(segmented into the fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE])
Violates lots of natural constraints!
Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
Page 25
Adding Expressivity via Constraints
Each field must be a consecutive list of words and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words "pp." and "pages" correspond to PAGE. Four digits of the form 20xx or 19xx are DATE. Quotations can appear only in TITLE. …
Page 26
Information Extraction with Constraints
Adding constraints, we get the correct results!
[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .
Experimental goal: investigate DecL with small neighborhoods. Note that the required theoretical conditions hold only approximately: output tokens tend to appear in contiguous blocks. We use neighborhoods similar to those of the pairwise Markov network construction.
Page 27
Typical Results: Information Extraction (Ads Data)
[Chart: F1 scores for HMM (LL), L+I, GL, and DecL; accuracy roughly in the 75 to 81 range.]
Typical Results: Information Extraction (Ads Data)
[Chart: F1 scores (roughly 75 to 81) and training time in minutes (roughly 0 to 80) for HMM (LL), L+I, GL, and DecL.]
Page 29
Conclusion
Presented Constrained Conditional Models: an ILP formulation for structured prediction that augments statistically learned models with declarative constraints as a way to incorporate knowledge and support decisions in expressive output spaces.
CCMs support joint inference while maintaining modularity and tractability of training: interdependent components are learned (independently or pipelined) and, via joint inference, support coherent decisions, modulo declarative constraints.
Presented Decomposed Learning (DecL): efficient joint learning by reducing the learning-time inference to a small output space.
Provided conditions for when DecL is provably identical to global structural learning (GL).
Interesting open questions remain in developing further understanding of how to support efficient joint inference.
Thank You!
Check out our tools, demos, tutorials
Page 30