Decomposing Structured Prediction via Constrained Conditional Models
DESCRIPTION
Decomposing Structured Prediction via Constrained Conditional Models. Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign. With thanks to collaborators: Ming-Wei Chang, Lev Ratinov, Rajhans Samdani, Vivek Srikumar, and many others.
TRANSCRIPT
June 2013, SLG Workshop, ICML, Atlanta GA
Decomposing Structured Prediction viaConstrained Conditional Models
Dan RothDepartment of Computer ScienceUniversity of Illinois at Urbana-Champaign
Page 1
With thanks to: Collaborators: Ming-Wei Chang, Lev Ratinov, Rajhans Samdani, Vivek Srikumar,
Many others. Funding: NSF; DHS; NIH; DARPA. DASH Optimization (Xpress-MP).
Comprehension
1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now.
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
This is an Inference Problem
Page 2
Learning and Inference
Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
In current NLP we often think about simpler structured problems: parsing, information extraction, SRL, etc.
As we move up the problem hierarchy (textual entailment, QA, …), not all component models can be learned simultaneously.
We need to think about (learned) models for different sub-problems, often pipelined.
Knowledge relating sub-problems (constraints) becomes more essential and may appear only at evaluation time.
Goal: Incorporate models' information, along with prior knowledge (constraints), in making coherent decisions: decisions that respect the local models as well as domain- and context-specific knowledge/constraints.
Page 3
Outline
Constrained Conditional Models: a formulation for global inference with knowledge modeled as expressive structural constraints
A Structured Prediction Perspective
Decomposed Learning (DecL): efficient structure learning by reducing the learning-time inference to a small output space; conditions for when DecL is provably identical to global structural learning (GL)
Page 4
Three Ideas Underlying Constrained Conditional Models
Idea 1: Separate modeling and problem formulation from algorithms (similar to the philosophy of probabilistic modeling)
Idea 2: Keep models simple, make expressive decisions (via constraints), unlike probabilistic modeling, where models become more expressive
Idea 3: Expressive structured decisions can be supported by simply learned models, amplified and minimally supervised by exploiting dependencies among models' outcomes.
Modeling
Inference
Learning
Page 5
Inference with General Constraint Structure [Roth & Yih '04, '07]: Recognizing Entities and Relations
Dole 's wife, Elizabeth , is a native of N.C.   (entities E1, E2, E3; relations R12, R23)
Key Questions: How to guide the global inference over independently learned or pipelined models? How to learn: independently, pipelined, or jointly?
[Figure: independently learned score tables for each entity over {per, loc, other} (e.g., 0.85 / 0.10 / 0.05; 0.50 / 0.45 / 0.05; 0.60 / 0.30 / 0.10) and for each relation over {spouse_of, born_in, irrelevant} (e.g., 0.05 / 0.85 / 0.10; 0.45 / 0.50 / 0.05); joint inference with the constraints picks a coherent global assignment.]
Significant performance Improvement
Models could be learned separately; constraints may come up only at decision time.
Page 6
Note: Non-Sequential Model
y = argmax_y Σ_v score(y = v) · 1[y = v]
  = argmax  score(E1 = PER) · 1[E1 = PER] + score(E1 = LOC) · 1[E1 = LOC] + …
           + score(R12 = spouse-of) · 1[R12 = spouse-of] + …
Subject to Constraints
An Objective function that incorporates learned models with knowledge (constraints)
A Constrained Conditional Model
Constrained Conditional Models
How to solve?
This is an Integer Linear Program
Solving using ILP packages gives an exact solution. Cutting Planes, Dual Decomposition & other search techniques are possible
The objective combines a weight vector for the "local" models (features, classifiers; log-linear models such as an HMM or CRF, or a combination) with a (soft) constraints component: each constraint C_k carries a penalty ρ_k for violating it, weighted by d(y, 1_{C_k}), how far y is from a "legal" assignment:
    argmax_y  w^T φ(x, y) - Σ_k ρ_k d(y, 1_{C_k(x)})
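To make "This is an Integer Linear Program" concrete, here is a minimal sketch (not from the talk) of CCM inference for the entities-and-relations example above, written with the PuLP package; the scores, the variable names (E1, E2, R12), and the two argument-type constraints are illustrative assumptions.

```python
# A minimal sketch of CCM inference as an ILP using PuLP.
# Scores and the argument-type constraints are illustrative, not the slide's exact values.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

ENT_LABELS = ["per", "loc", "other"]
REL_LABELS = ["spouse_of", "born_in", "irrelevant"]

# Locally learned scores (e.g., from independently trained classifiers).
ent_scores = {"E1": {"per": 0.85, "loc": 0.10, "other": 0.05},
              "E2": {"per": 0.50, "loc": 0.45, "other": 0.05}}
rel_scores = {"R12": {"spouse_of": 0.45, "born_in": 0.50, "irrelevant": 0.05}}

prob = LpProblem("ccm_inference", LpMaximize)

# One binary indicator per (variable, label) pair.
e = {(n, l): LpVariable(f"{n}_{l}", cat=LpBinary) for n in ent_scores for l in ENT_LABELS}
r = {(n, l): LpVariable(f"{n}_{l}", cat=LpBinary) for n in rel_scores for l in REL_LABELS}

# Objective: sum of the local scores of the chosen labels.
prob += lpSum(ent_scores[n][l] * e[n, l] for n in ent_scores for l in ENT_LABELS) + \
        lpSum(rel_scores[n][l] * r[n, l] for n in rel_scores for l in REL_LABELS)

# Each variable takes exactly one label.
for n in ent_scores:
    prob += lpSum(e[n, l] for l in ENT_LABELS) == 1
for n in rel_scores:
    prob += lpSum(r[n, l] for l in REL_LABELS) == 1

# Declarative constraints on argument types, written as linear inequalities:
# spouse_of(E1, E2) requires both arguments to be PER; born_in(E1, E2) requires E2 to be LOC.
prob += r["R12", "spouse_of"] <= e["E1", "per"]
prob += r["R12", "spouse_of"] <= e["E2", "per"]
prob += r["R12", "born_in"] <= e["E2", "loc"]

prob.solve()
chosen = [key for key, var in list(e.items()) + list(r.items()) if var.value() == 1]
print(chosen)
```

Declarative constraints of this kind only enter at inference time; the local scores can come from models trained completely separately.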
How to train?
Training is learning the objective function
Decouple? Decompose?
How to exploit the structure to minimize supervision?
Page 7
Inferning workshop
Placing in context: a crash course in structured prediction
Structured Prediction: Inference
Inference: given input x (a document, a sentence), predict the best structure y = {y1, y2, …, yn} ∈ Y (entities & relations), i.e., assign values to y1, y2, …, yn, accounting for the dependencies among the yi.
Inference is expressed as a maximization of a scoring function:
    y' = argmax_{y ∈ Y} w^T φ(x, y)
where φ(x, y) are joint features on inputs and outputs, w are the feature weights (estimated during learning), and Y is the set of allowed structures.
Inference requires, in principle, enumerating all y ∈ Y at decision time: we are given x ∈ X and attempt to determine the best y ∈ Y for it, given w. For some structures, inference is computationally easy, e.g., using the Viterbi algorithm; in general it is NP-hard (and can be formulated as an ILP).
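As an aside on the "computationally easy" case, here is a small illustrative Viterbi sketch for a linear-chain scoring model; the label set and the emission/transition scores are made up for the example, and these per-position scores stand in for the relevant parts of w^T φ(x, y).

```python
# A minimal Viterbi sketch for a linear-chain model: exact argmax over all label
# sequences in O(n * |labels|^2) time. All scores here are illustrative.
def viterbi(obs, labels, emit, trans):
    # emit[label][word] and trans[prev][label] are local scores (log-space).
    best = [{l: emit[l].get(obs[0], -10.0) for l in labels}]
    back = []
    for word in obs[1:]:
        scores, pointers = {}, {}
        for l in labels:
            prev_best = max(labels, key=lambda p: best[-1][p] + trans[p][l])
            scores[l] = best[-1][prev_best] + trans[prev_best][l] + emit[l].get(word, -10.0)
            pointers[l] = prev_best
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final label.
    last = max(labels, key=lambda l: best[-1][l])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

labels = ["PER", "OTH"]
emit = {"PER": {"Dole": 2.0, "Elizabeth": 1.5}, "OTH": {"wife": 1.0, ",": 1.0}}
trans = {"PER": {"PER": 0.5, "OTH": 0.0}, "OTH": {"PER": 0.0, "OTH": 0.5}}
print(viterbi(["Dole", ",", "wife"], labels, emit, trans))
```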
Page 8
Structured Prediction: Learning
Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss. Learning is thus driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):
    ∀ y:  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)
Page 9
Structured Prediction: Learning
Learning: given a set of structured examples {(x, y)}, find a scoring function w that minimizes empirical loss. Learning is thus driven by the attempt to find a weight vector w such that for each given annotated example (xi, yi):
    ∀ y:  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)
    (score of the annotated structure ≥ score of any other structure + penalty for predicting the other structure)
We call these conditions the learning constraints.
In most structured learning algorithms used today, the update of the weight vector w is done in an on-line fashion.
W.l.o.g. (almost) we can thus write the generic structured learning algorithm as follows. What follows is a Structured Perceptron, but with minor variations this procedure applies to CRFs and Linear Structured SVMs.
Page 10
In the structured case, the prediction (inference) step is often intractable and needs to be done many times
Structured Prediction: Learning Algorithm
For each example (xi, yi) do (with the current weight vector w):
    Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w^T φ(xi, y)
    Check the learning constraints: is the score of the current prediction better than that of (xi, yi)?
    If yes (a mistaken prediction): update w
    Otherwise: no need to update w on this example
EndFor
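The loop above can be written compactly as a generic structured perceptron; this is a sketch under the assumption that a problem-specific feature map `features` (the φ above) and an inference routine `argmax_infer` are supplied by the caller, not the authors' actual implementation.

```python
# Generic structured perceptron: the inference step (argmax) is called once per
# example per epoch, which is why its cost dominates training.
import numpy as np

def structured_perceptron(examples, features, argmax_infer, dim, epochs=5, lr=1.0):
    """examples: list of (x, y_gold); features(x, y) -> np.ndarray of size dim;
    argmax_infer(x, w) -> highest-scoring structure under the current w."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in examples:
            y_pred = argmax_infer(x, w)                      # Predict with current w
            if np.dot(w, features(x, y_pred)) >= np.dot(w, features(x, y_gold)):
                # Learning constraint violated (prediction scores at least as high):
                # move w toward the gold structure and away from the prediction.
                w += lr * (features(x, y_gold) - features(x, y_pred))
    return w
```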
Page 11
Structured Prediction: Learning Algorithm
For each example (xi, yi) do:
    Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
    Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    If yes (a mistaken prediction): update w
    Otherwise: no need to update w on this example
EndDo
Solution I: decompose the scoring function into EASY and HARD parts.
EASY could be feature functions that correspond to an HMM, a linear CRF, or a bank of classifiers (omitting the dependence on y at learning time). This may not be enough if the HARD part is still part of each inference step.
Page 12
Structured Prediction: Learning Algorithm
For each example (xi, yi) do:
    Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
    Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    If yes (a mistaken prediction): update w
    Otherwise: no need to update w on this example
EndDo
Solution II: Disregard some of the dependencies: assume a simple model.
Page 13
Structured Prediction: Learning Algorithm
For each example (xi, yi) do:
    Predict: perform inference with the current weight vector: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y) + w_HARD^T φ_HARD(xi, y)
    Check the learning constraint: is the score of the current prediction better than that of (xi, yi)?
    If yes (a mistaken prediction): update w
    Otherwise: no need to update w on this example
EndDo
(For this solution, the learning-time inference drops the HARD component: yi' = argmax_{y ∈ Y} w_EASY^T φ_EASY(xi, y); the HARD dependencies are brought back only at decision time.)
This is the most commonly used solution in NLP today
Solution III: Disregard some of the dependencies during learning; take into account at decision time
Page 14
Linguistics Constraints
Cannot have both A states and B states in an output sequence.
Linguistics Constraints
If a modifier is chosen, include its head. If a verb is chosen, include its arguments.
Examples: CCM Formulations
CCMs can be viewed as a general interface to easily combine declarative domain knowledge with data driven statistical models
Sequential prediction, HMM/CRF based:  argmax Σ λ_ij x_ij
Sentence compression/summarization, language-model based:  argmax Σ λ_ijk x_ijk
Formulate NLP problems as ILP problems (inference may be done otherwise):
1. Sequence tagging (HMM/CRF + global constraints)
2. Sentence compression (language model + global constraints)
3. SRL (independent classifiers + global constraints)
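As an illustration of how such declarative constraints compile into linear (in)equalities over indicator variables, here is a hedged sketch for a tiny tagging ILP using PuLP; the label names and the particular encoding of the "not both A and B" constraint are assumptions for the example, not the talk's exact formulation.

```python
# Sketch: compiling two declarative constraints into linear inequalities over
# binary indicators x[i, label] ("token i gets this label"). PuLP is assumed;
# the label names are illustrative, and the objective (local scores) is omitted.
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary

n, labels = 4, ["A", "B", "O"]
prob = LpProblem("tagging", LpMaximize)
x = {(i, l): LpVariable(f"x_{i}_{l}", cat=LpBinary) for i in range(n) for l in labels}
for i in range(n):                                   # one label per token
    prob += lpSum(x[i, l] for l in labels) == 1

# "Cannot have both A states and B states in an output sequence":
# a single binary switch forces one of the two label types to be unused.
use_A = LpVariable("use_A", cat=LpBinary)
prob += lpSum(x[i, "A"] for i in range(n)) <= n * use_A
prob += lpSum(x[i, "B"] for i in range(n)) <= n * (1 - use_A)

# "If a modifier is chosen, include its head": an implication becomes an inequality
# between the corresponding indicator variables, e.g.:
# prob += x_modifier <= x_head
```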
Page 15
(Soft) constraints component is more general since constraints can be declarative, non-grounded statements.
Constrained Conditional Models allow learning a simple model (or multiple models, or pipelines) while making decisions with a more complex model. This is accomplished by directly incorporating constraints to bias/re-rank global decisions composed of simpler models' decisions. More sophisticated algorithmic approaches exist to bias the output [CoDL: Chang et al. '07, '12; PR: Ganchev et al. '10; UEM: Samdani et al. '12].
Outline
Constrained Conditional Models: a formulation for global inference with knowledge modeled as expressive structural constraints
A Structured Prediction Perspective
Decomposed Learning (DecL): efficient structure learning by reducing the learning-time inference to a small output space; conditions for when DecL is provably identical to global structural learning (GL)
Page 16
Training: independently of the constraints (L+I); jointly, in the presence of the constraints (IBT, GL); decomposed to simpler models.
Not surprisingly, decomposition is good. See [Chang et. al., Machine Learning Journal 2012]
Little can be said theoretically on the quality/generalization of predictions made with a decomposed model
Next, an algorithmic approach to decomposition that is both good, and comes with interesting guarantees.
Training Constrained Conditional Models
Decompose Model
Decompose Model from constraints
Page 17
In Global Learning, the output space is exponential in the number of variables – accurate learning can be intractable
“Standard” ways to decompose it forget some of the structure and bring it back only at decision time
Decomposed Structured Prediction
Learning is driven by the attempt to find a weight vector w such that for each given annotated example (xi , yi):
Page 18
    ∀ y:  w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)   (Learning)
[Figure: an entity-relation structure over output variables y1, …, y6, with entity weights w_e and relation weights w_r, illustrating Learning and Inference.]
    y = argmax_{y ∈ Y} w^T φ(x, y)   (Inference)
Decomposed Structural Learning (DecL) [Samdani & Roth, ICML'12]
Algorithm: restrict the 'argmax' inference to a small subset of the output variables while fixing the remaining variables to their ground-truth values yi, and repeat for different subsets of the output variables: a decomposition. The resulting set of assignments considered for yi is called a neighborhood, nbr(yi).
Key contribution: we give conditions under which DecL is provably equivalent to Global Learning (GL), and show experimentally that DecL provides results close to GL when such conditions do not exactly hold.
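A minimal sketch of the neighborhood construction just described, assuming outputs are represented as dictionaries from variable names to labels and that a problem-specific `score` function is available; this illustrates the idea, it is not the paper's implementation.

```python
# DecL neighborhood: enumerate assignments over each small subset S of output
# variables, keeping every variable outside S fixed at its gold value.
from itertools import product

def neighborhood(y_gold, decomposition, label_set):
    """y_gold: dict var -> gold label; decomposition: list of variable subsets."""
    nbr = []
    for S in decomposition:
        for assignment in product(label_set, repeat=len(S)):
            y = dict(y_gold)                 # start from the ground truth
            y.update(zip(S, assignment))     # perturb only the variables in S
            nbr.append(y)
    return nbr

# Learning-time "inference" is now an argmax over this small set instead of all of Y:
def argmax_over_nbr(x, w, nbr, score):
    return max(nbr, key=lambda y: score(x, y, w))
```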
[Figure: the six output variables y1, …, y6 are decomposed into small subsets; within each subset all assignments are enumerated (00/01/10/11 for a pair, 000 through 111 for a triple) while the variables outside the subset stay at their gold values.]
Related work: Pseudolikelihood – Besag, 77; Piecewise Pseudolikelihood – Sutton and McCallum, 07; Pseudomax – Sontag et al, 10
Page 19
DecL vs. Global Learning (GL)
DecL: separate the ground truth from every y ∈ nbr(yj); in this example, only 16 outputs.
GL: separate the ground truth from every y ∈ Y; in this example, 2^6 = 64 outputs.
[Figure: the same six output variables; DecL enumerates assignments within each subset of the decomposition, while GL enumerates all 64 complete assignments.]
Likely scenario: nbr(yj) ≪ Y.
What are good neighborhoods?
w^T φ(xi, yi) ≥ w^T φ(xi, y) + Δ(y, yi)
Page 20
Creating Decompositions
DecL allows different decompositions Sj for different training instances yj.
Example: learning with decompositions in which all subsets of size k are considered (DecL-k). For each k-subset of the variables, enumerate its assignments and keep the remaining n - k variables at their gold values. k = 1 is Pseudomax [Sontag et al., 2010]; k = 2 is Constraint Classification [Har-Peled, Zimak, Roth 2002; Crammer, Singer 2002].
In practice, neighborhoods should be determined based on domain knowledge: put highly coupled variables in the same set.
The goal is to get results that are close to doing exact inference. Are there small and good neighborhoods?
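For DecL-k specifically, the decomposition is simply the set of all k-subsets of the output variables; a tiny illustrative sketch (using only the itertools standard library) of how the number of enumerated assignments compares to the full output space:

```python
# DecL-k: the decomposition consists of all size-k subsets of the output variables;
# each subset is enumerated while the remaining n - k variables stay at gold.
from itertools import combinations

def decl_k_decomposition(variables, k):
    return [list(S) for S in combinations(variables, k)]

# Example: 6 output variables, k = 2 gives C(6, 2) = 15 subsets, so the neighborhood
# has at most 15 * |labels|^2 assignments instead of |labels|^6 for global learning.
print(len(decl_k_decomposition(["y1", "y2", "y3", "y4", "y5", "y6"], 2)))  # -> 15
```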
Page 21
Different label space
Exactness of DecL. Key result: YES, under "reasonable conditions" DecL with small neighborhoods nbr(yj) gives the same results as Global Learning.
For analyzing the equivalence between DecL and GL, we need a notion of 'separability' of the data.
Separability: existence of a set of weights W* that satisfy
    W* = {w | w · φ(xj, yj) ≥ w · φ(xj, y) + Δ(yj, y), ∀ y ∈ Y}
Separating weights for DecL:
    Wdecl = {w | w · φ(xj, yj) ≥ w · φ(xj, y) + Δ(yj, y), ∀ y ∈ nbr(yj)}
Naturally: W* ⊆ Wdecl.
Exactness Results: The set of separating weights for DecL is equal to the set of separating weights for GL: W*=Wdecl
[Figure: in weight space, W* ⊆ Wdecl; a separating w keeps the score of the ground truth yj above the score of all non-ground-truth y.]
Page 22
Example of Exactness: Pairwise Markov Networks
Scoring function defined over a graph with edges E: a sum of singleton/vertex components φ_i(y_i) and pairwise/edge components φ_{i,k}(y_i, y_k).
Assume domain knowledge on W*: for a correct (separating) w ∈ W*, we know which of the pairwise φ_{i,k}(.; w) are
    Submodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) > φ_{i,k}(0,1) + φ_{i,k}(1,0), or
    Supermodular: φ_{i,k}(0,0) + φ_{i,k}(1,1) < φ_{i,k}(0,1) + φ_{i,k}(1,0).
[Figure: a pairwise Markov network over output variables y1, …, y6 with singleton/vertex components and pairwise/edge components.]
Page 23
Decomposition for Pairwise Markov Networks
For an example (xj, yj), define Ej by removing edges from E where the labels disagree with the φ's.
Theorem: Decomposing the variables as connected components of Ej yields Exactness.
[Table: which edges are removed from E to form Ej, depending on whether the edge potential φ is submodular or supermodular and on the gold labels of its endpoints.]
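A sketch of the machinery behind the theorem: once Ej is built by the edge keep/drop rule summarized in the table above, the decomposition is just the set of connected components of Ej. The rule itself is left as a caller-supplied edge set here, since only its outcome matters for the component computation; this is an illustration, not the paper's code.

```python
# Given the filtered edge set Ej, the decomposition used by DecL is the set of
# connected components of Ej; each component is then enumerated as one subset.
def connected_components(variables, edges_j):
    adj = {v: [] for v in variables}
    for u, v in edges_j:
        adj[u].append(v)
        adj[v].append(u)
    seen, components = set(), []
    for start in variables:
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        components.append(comp)
    return components

# E.g., if only the edges (y1, y2) and (y4, y5) survive the filter, the decomposition
# is [[y1, y2], [y3], [y4, y5], [y6]], and DecL enumerates each component separately.
```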
Page 24
Experiments: Information extraction
Prediction result of a trained HMM:
Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
(segmented into the fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE])
Violates lots of natural constraints!
Lars Ole Andersen . Program analysis and specialization for the C Programming language. PhD thesis. DIKU , University of Copenhagen, May 1994 .
Page 25
Adding Expressivity via Constraints
Each field must be a consecutive list of words and can appear at most once in a citation.
State transitions must occur on punctuation marks.
The citation can only start with AUTHOR or EDITOR.
The words "pp." and "pages" correspond to PAGE. Four digits of the form 20xx or 19xx are DATE. Quotations can appear only in TITLE. …
Page 26
Information Extraction with Constraints
Adding constraints, we get the correct results!
[AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .
Experimental goal: investigate DecL with small neighborhoods. Note that the required theoretical conditions hold only approximately: output tokens tend to appear in contiguous blocks. We use neighborhoods similar to those of the pairwise Markov network construction.
Page 27
Typical Results: Information Extraction (Ads Data)
[Chart: F1 scores for HMM (LL), L+I, GL, and DecL; accuracy roughly in the 75 to 81 range.]
Typical Results: Information Extraction (Ads Data)
[Chart: F1 scores (roughly 75 to 81) and training time in minutes (roughly 0 to 80) for HMM (LL), L+I, GL, and DecL.]
Page 29
Conclusion
Presented Constrained Conditional Models: an ILP formulation for structured prediction that augments statistically learned models with declarative constraints as a way to incorporate knowledge and support decisions in expressive output spaces.
CCMs support joint inference while maintaining modularity and tractability of training: interdependent components are learned (independently or pipelined) and, via joint inference, support coherent decisions, modulo declarative constraints.
Presented Decomposed Learning (DecL): efficient joint learning by reducing the learning-time inference to a small output space.
Provided conditions for when DecL is provably identical to global structural learning (GL).
Interesting open questions remain in developing further understanding of how to support efficient joint inference.
Thank You!
Check out our tools, demos, tutorials
Page 30