Minimum-Risk Training of Approximate CRF-Based NLP Systems. Veselin Stoyanov and Jason Eisner.
TRANSCRIPT
1
Minimum-Risk Training of Approximate CRF-Based NLP Systems
Veselin Stoyanov and Jason Eisner
2
Overview
• We will show significant improvements on three data sets.
• How do we do it? With a new training algorithm!
• Don’t be afraid of discriminative models with approximate inference!
• Use our software instead!
3
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• NLP Systems:
[Figure: an arbitrary input document (placeholder text) being fed into an NLP system]
4
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• Conditional random fields (CRFs) [Lafferty et al., 2001]
• Discriminative models of probability p(Y|X).
• Used successfully for many NLP problems.
5
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• Linear chain CRF:
• Exact inference is tractable.
• Training via maximum likelihood estimation is tractable and convex.
[Figure: a linear-chain CRF with inputs x1–x4 and outputs Y1–Y4]
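As an aside, the tractable exact inference mentioned here is the forward algorithm. Below is a minimal sketch with invented potentials (none of these numbers are from the talk): it computes the partition function in O(T·K²) and checks it against brute-force enumeration over all K^T label sequences.

```python
import math
from itertools import product

# Toy linear-chain CRF: 3 positions, 2 labels. Scores are illustrative only.
emit = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.8]]   # emit[t][y]: score of label y at position t
trans = [[0.5, 0.1], [0.2, 0.7]]              # trans[y][y2]: score of transition y -> y2

def partition_brute_force():
    # Sum exp(total score) over all 2^3 label sequences.
    total = 0.0
    for ys in product(range(2), repeat=3):
        s = sum(emit[t][y] for t, y in enumerate(ys))
        s += sum(trans[ys[t]][ys[t + 1]] for t in range(2))
        total += math.exp(s)
    return total

def partition_forward():
    # Forward algorithm: O(T * K^2) instead of O(K^T).
    alpha = [math.exp(emit[0][y]) for y in range(2)]
    for t in range(1, 3):
        alpha = [sum(alpha[yp] * math.exp(trans[yp][y]) for yp in range(2))
                 * math.exp(emit[t][y]) for y in range(2)]
    return sum(alpha)

assert abs(partition_brute_force() - partition_forward()) < 1e-9
```

The same dynamic program, with max in place of sum, gives Viterbi decoding.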
6
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• CRFs (like BNs and MRFs) are models of conditional probability.
• In NLP we are interested in making predictions.
• Build prediction systems around CRFs.
7
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• Inference: compute quantities about the distribution.
[Figure: per-token tag marginals for "The cat sat on the mat ."]
  The:  DT .9,  NN .05, …
  cat:  NN .8,  JJ .1,  …
  sat:  VBD .7, VB .1,  …
  on:   IN .9,  NN .01, …
  the:  DT .9,  NN .05, …
  mat:  NN .4,  JJ .3,  …
  . :   . .99,  , .001, …
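Marginals like those above come from summing probability over every tagging consistent with each choice. A toy sketch (sequence scores are invented for illustration, not a real model) that computes p(Y_t = y | x) by brute-force enumeration and checks that each position's marginals sum to 1:

```python
import math
from itertools import product

K, T = 2, 3  # two tags, three tokens: a toy stand-in for a sentence

def seq_score(ys):
    # Illustrative score; a real CRF derives this from features and theta.
    return (sum(0.5 * y * (t + 1) for t, y in enumerate(ys))
            + sum(0.3 * (ys[t] == ys[t + 1]) for t in range(T - 1)))

weights = {ys: math.exp(seq_score(ys)) for ys in product(range(K), repeat=T)}
Z = sum(weights.values())

def marginal(t, y):
    # p(Y_t = y | x): total probability of all sequences with label y at position t.
    return sum(w for ys, w in weights.items() if ys[t] == y) / Z

for t in range(T):
    assert abs(sum(marginal(t, y) for y in range(K)) - 1.0) < 1e-9
```

In a linear-chain CRF the same marginals are computed efficiently by forward-backward rather than enumeration.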
8
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• Decoding: coming up with predictions based on the probabilities.
[Figure: decoded tag sequence for "The cat sat on the mat .": DT NN VBD IN DT NN .]
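One common decoding rule, minimum-Bayes-risk decoding for per-token accuracy, simply picks the highest-marginal tag at each position. A sketch using the (abbreviated) marginals shown on the previous slide:

```python
# Pick the tag with the highest marginal at each position.
# Marginals are the abbreviated ones from the slide.
marginals = [
    {"DT": 0.9, "NN": 0.05},   # The
    {"NN": 0.8, "JJ": 0.1},    # cat
    {"VBD": 0.7, "VB": 0.1},   # sat
    {"IN": 0.9, "NN": 0.01},   # on
    {"DT": 0.9, "NN": 0.05},   # the
    {"NN": 0.4, "JJ": 0.3},    # mat
    {".": 0.99, ",": 0.001},   # .
]
prediction = [max(m, key=m.get) for m in marginals]
assert prediction == ["DT", "NN", "VBD", "IN", "DT", "NN", "."]
```

Note that this minimizes expected token-level loss; Viterbi decoding (argmax of the joint) minimizes expected sequence-level loss instead.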
9
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• General CRFs: Unrestricted model structure.
• Inference is intractable.
• Learning?
[Figure: a loopy CRF with variables Y1–Y4 and inputs X1–X3]
10
General CRFs
• Why sacrifice tractable inference and convex learning?
• Because a loopy model can represent the data better!
• Now you can train your loopy CRF using ERMA (Empirical Risk Minimization under Approximations)!
11
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• In linear-chain CRFs, we can use maximum likelihood estimation (MLE):
– Compute gradients of the log-likelihood by running exact inference.
– The likelihood is convex, so learning finds a global optimum.
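For reference, the CRF log-likelihood gradient is "observed features minus expected features". A minimal sketch on a single-variable model, so the expectation is exact (features and step size are invented for illustration):

```python
import math

# d log p(y*|x) / d theta_k = f_k(x, y*) - E_p[f_k(x, Y)]
features = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # indicator feature vector per class
y_star = 1                                  # the observed "gold" label

def grad(theta):
    scores = [sum(t * f for t, f in zip(theta, features[y])) for y in (0, 1)]
    Z = sum(math.exp(s) for s in scores)
    p = [math.exp(s) / Z for s in scores]
    observed = features[y_star]
    expected = [sum(p[y] * features[y][k] for y in (0, 1)) for k in range(2)]
    return [o - e for o, e in zip(observed, expected)]

theta = [0.0, 0.0]
for _ in range(100):                        # gradient ascent on the log-likelihood
    theta = [t + 0.5 * g for t, g in zip(theta, grad(theta))]

scores = [sum(t * f for t, f in zip(theta, features[y])) for y in (0, 1)]
p1 = math.exp(scores[1]) / sum(math.exp(s) for s in scores)
assert p1 > 0.9                             # likelihood of y* has risen toward 1
```

In a chain CRF the expectation term is computed by forward-backward rather than by explicit summation.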
12
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• We use CRFs with several approximations:
– Approximate inference.
– Approximate decoding.
– Mis-specified model structure.
– MAP training (vs. Bayesian).
• And should we still be maximizing data likelihood?
(These approximations can be present in linear-chain CRFs as well.)
13
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• End-to-end learning [Stoyanov, Ropson & Eisner, AISTATS 2011]:
– We should learn parameters that work well in the presence of approximations.
– Match the training and test conditions.
– Find the parameters that minimize training loss.
14
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• Select θ that minimizes training loss.
• I.e., perform Empirical Risk Minimization under Approximations (ERMA).
[Figure: x → (approx.) inference → p(y|x) → (approx.) decoding → ŷ → L(y*, ŷ); the whole pipeline is a black-box decision function parameterized by θ]
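A schematic of the training loop this slide describes, with a trivial stand-in for the inference-plus-decoding black box and a numerical gradient. (ERMA instead back-propagates through the actual approximate inference and decoding; everything here is an illustrative toy.)

```python
# "System" stand-in: a linear score plays the role of inference + decoding.
def decision(theta, x):
    return theta[0] * x[0] + theta[1] * x[1]

def loss(theta, data):
    # Squared error against gold outputs y*, standing in for the task loss L.
    return sum((decision(theta, x) - y) ** 2 for x, y in data) / len(data)

data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
theta = [0.0, 0.0]
eps, lr = 1e-5, 0.1
for _ in range(500):
    g = []
    for k in range(2):                 # numerical gradient of the training loss
        bumped = list(theta)
        bumped[k] += eps
        g.append((loss(bumped, data) - loss(theta, data)) / eps)
    theta = [t - lr * gk for t, gk in zip(theta, g)]

assert loss(theta, data) < 0.01        # training loss driven near its minimum
```

The point of ERMA is that the gradient is obtained by back-propagation at the cost of one extra pass, not by expensive numerical differencing as above.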
15
Optimization Criteria
                 Approximation Aware
                 No        Yes
Loss Aware  No
            Yes
16
Optimization Criteria
                 Approximation Aware
                 No        Yes
Loss Aware  No   MLE
            Yes
17
Optimization Criteria
                 Approximation Aware
                 No                                     Yes
Loss Aware  No   MLE
            Yes  SVMstruct [Finley and Joachims, 2008]
                 M3N [Taskar et al., 2003]
                 Softmax-margin [Gimpel & Smith, 2010]
18
Optimization Criteria
                 Approximation Aware
                 No                                     Yes
Loss Aware  No   MLE
            Yes  SVMstruct [Finley and Joachims, 2008]  ERMA
                 M3N [Taskar et al., 2003]
                 Softmax-margin [Gimpel & Smith, 2010]
19
Minimum-Risk Training of Approximate CRF-Based NLP Systems through Back Propagation
• Use back-propagation to compute gradients of the output loss with respect to the parameters
• Use a local optimizer to find the parameters that (locally) minimize training loss
20
Our Contributions
• Apply ERMA [Stoyanov, Ropson and Eisner; AISTATS 2011] to three NLP problems.
• We show that:
– General CRFs work better when they match dependencies in the data.
– Minimum-risk training results in more accurate models.
• ERMA software package available at www.clsp.jhu.edu/~ves/software
21
The Rest of this Talk
• Experimental results
• A brief explanation of the ERMA algorithm
22
Experimental Evaluation
23
Implementation
• The ERMA software package (www.clsp.jhu.edu/~ves/software)
• Includes syntax for describing general CRFs.
• Can optimize several commonly used loss functions: MSE, accuracy, F-score.
• The package is generic:
– Little effort to model new problems.
– About 1–3 days to express each problem in our formalism.
24
Specifics
• CRFs used with loopy BP for inference:
– Sum-product BP (i.e., loopy forward-backward)
– Max-product BP, annealed (i.e., loopy Viterbi)
• Two loss functions: accuracy and F1.
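A sketch of sum-product loopy BP on a tiny 3-variable binary cycle (all potentials invented for illustration): messages are iterated to empirical convergence, and beliefs are read off. The annealed max-product variant would replace the sum with a max.

```python
import math

# Pairwise potential favors agreeing neighbors; unary potential makes
# variable 0 prefer value 0. All numbers are illustrative.
psi = [[2.0, 1.0], [1.0, 2.0]]                 # psi[a][b] on every edge
phi = [[3.0, 1.0], [1.0, 1.0], [1.0, 1.0]]     # phi[i][a]: unary at variable i
edges = [(0, 1), (1, 2), (2, 0)]
dirs = edges + [(j, i) for i, j in edges]      # both message directions
msg = {e: [0.5, 0.5] for e in dirs}

def nbrs(i):
    return [j for j in range(3) if (i, j) in dirs]

for _ in range(50):                            # iterate to (empirical) convergence
    new = {}
    for (i, j) in dirs:
        # m_{i->j}(b) = sum_a psi(a,b) phi_i(a) prod_{k in N(i)\j} m_{k->i}(a)
        m = [sum(psi[a][b] * phi[i][a] *
                 math.prod(msg[(k, i)][a] for k in nbrs(i) if k != j)
                 for a in range(2)) for b in range(2)]
        z = sum(m)
        new[(i, j)] = [v / z for v in m]
    msg = new

def belief(i):
    b = [phi[i][a] * math.prod(msg[(k, i)][a] for k in nbrs(i)) for a in range(2)]
    z = sum(b)
    return [v / z for v in b]

# Variable 0's preference for value 0 propagates around the loop.
assert all(belief(i)[0] > belief(i)[1] for i in range(3))
```

On a loopy graph these beliefs are only approximations to the true marginals, which is exactly the mismatch ERMA trains through.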
25
Modeling Congressional Votes
The ConVote corpus [Thomas et al., 2006]
"First, I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary, not just for the underlying bill…"
26
Modeling Congressional Votes
The ConVote corpus [Thomas et al., 2006]
Speech (vote: Yea): "First, I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary, not just for the underlying bill…"
27
Modeling Congressional Votes
The ConVote corpus [Thomas et al., 2006]
Speaker 1 (vote: Yea): "First, I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary, not just for the underlying bill…"
Mr. Sensenbrenner (vote: Yea): "Had it not been for the heroic actions of the passengers of United flight 93 who forced the plane down over Pennsylvania, congress's ability to serve…"
29
Modeling Congressional Votes
An example from the ConVote corpus [Thomas et al., 2006]
• Predict representative votes based on debates.
30
Modeling Congressional Votes
An example from the ConVote corpus [Thomas et al., 2006]
• Predict representative votes based on debates.
[Figure: a vote variable (Y/N)]
31
Modeling Congressional Votes
An example from the ConVote corpus [Thomas et al., 2006]
• Predict representative votes based on debates.
[Figure: a text node holding the speech ("First, I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary, not just for the underlying bill…") connected to a vote variable (Y/N)]
32
Modeling Congressional Votes
An example from the ConVote corpus [Thomas et al., 2006]
• Predict representative votes based on debates.
[Figure: two vote variables (Y/N), each connected to a text node for its speaker's speech, with a context edge linking the two votes]
33
Modeling Congressional Votes
                                          Accuracy
Non-loopy baseline (2 SVMs + min-cut)     71.2
34
Modeling Congressional Votes
                                                  Accuracy
Non-loopy baseline (2 SVMs + min-cut)             71.2
Loopy CRF models (inference via loopy sum-prod BP):
35
Modeling Congressional Votes
                                                             Accuracy
Non-loopy baseline (2 SVMs + min-cut)                        71.2
Loopy CRF models (inference via loopy sum-prod BP):
  Maximum-likelihood training (with approximate inference)   78.2
36
Modeling Congressional Votes
                                                             Accuracy
Non-loopy baseline (2 SVMs + min-cut)                        71.2
Loopy CRF models (inference via loopy sum-prod BP):
  Maximum-likelihood training (with approximate inference)   78.2
  Softmax-margin (loss-aware)                                79.0
37
Modeling Congressional Votes
                                                             Accuracy
Non-loopy baseline (2 SVMs + min-cut)                        71.2
Loopy CRF models (inference via loopy sum-prod BP):
  Maximum-likelihood training (with approximate inference)   78.2
  Softmax-margin (loss-aware)                                79.0
  ERMA (loss- and approximation-aware)                       84.5*
* Significantly better than all others (p < 0.05).
38
Information Extraction from Semi-Structured Text
What: Special Seminar. Who: Prof. Klaus Sutner, Computer Science Department, Stevens Institute of Technology. Topic: "Teaching Automata Theory by Computer". Date: 12-Nov-93. Time: 12:00 pm. Place: WeH 4623. Host: Dana Scott (Asst: Rebecca Clark x8-6737).
ABSTRACT: We will demonstrate the system "automata" that implements finite state machines… After the lecture, Prof. Sutner will be glad to demonstrate and discuss the use of MathLink and his "automata" package.
CMU Seminar Announcement Corpus [Freitag, 2000]
39
Information Extraction from Semi-Structured Text
[Figure: the same announcement with fields highlighted: speaker (Prof. Klaus Sutner), start time (12:00 pm), location (WeH 4623)]
CMU Seminar Announcement Corpus [Freitag, 2000]
40
Skip-Chain CRF for Info Extraction
• Extract speaker, location, stime, and etime from seminar announcement emails
[Figure: a skip-chain CRF over the announcement: tokens "Who: Prof. Klaus Sutner … will … Prof. Sutner" labeled O (other) / S (speaker), with skip edges connecting repeated occurrences of "Prof." and "Sutner"]
CMU Seminar Announcement Corpus [Freitag, 2000]
Skip-chain CRF [Sutton and McCallum, 2005; Finkel et al., 2005]
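Skip edges in such models are typically added between repeated occurrences of the same capitalized word, so the two mentions can share label information. A hypothetical sketch of that edge-construction step (the function name and token list are illustrative, not from the ERMA package):

```python
def skip_edges(tokens):
    """Connect each repeated capitalized token to its previous occurrence."""
    last = {}
    edges = []
    for t, w in enumerate(tokens):
        if w[:1].isupper():
            if w in last:
                edges.append((last[w], t))   # skip factor linking the two labels
            last[w] = t
    return edges

tokens = ["Who:", "Prof.", "Klaus", "Sutner", "will", "be", "Prof.", "Sutner"]
assert skip_edges(tokens) == [(1, 6), (3, 7)]
```

Each returned pair becomes a pairwise factor in the CRF, which is what makes the graph loopy.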
41
Semi-Structured Information Extraction
                                                                     F1
Non-loopy baseline (linear-chain CRF)                                86.2
Non-loopy baseline + ERMA (trained for loss instead of likelihood)   87.1
42
Semi-Structured Information Extraction
                                                                     F1
Non-loopy baseline (linear-chain CRF)                                86.2
Non-loopy baseline + ERMA (trained for loss instead of likelihood)   87.1
Loopy CRF models (inference via loopy sum-prod BP):
  Maximum-likelihood training (with approximate inference)           89.5
43
Semi-Structured Information Extraction
                                                                     F1
Non-loopy baseline (linear-chain CRF)                                86.2
Non-loopy baseline + ERMA (trained for loss instead of likelihood)   87.1
Loopy CRF models (inference via loopy sum-prod BP):
  Maximum-likelihood training (with approximate inference)           89.5
  Softmax-margin (loss-aware)                                        90.2
44
Semi-Structured Information Extraction
                                                                     F1
Non-loopy baseline (linear-chain CRF)                                86.2
Non-loopy baseline + ERMA (trained for loss instead of likelihood)   87.1
Loopy CRF models (inference via loopy sum-prod BP):
  Maximum-likelihood training (with approximate inference)           89.5
  Softmax-margin (loss-aware)                                        90.2
  ERMA (loss- and approximation-aware)                               90.9*
* Significantly better than all others (p < 0.05).
45
Collective Multi-Label Classification
Reuters Corpus Version 2 [Lewis et al., 2004]
"The collapse of crude oil supplies from Libya has not only lifted petroleum prices, but added a big premium to oil delivered promptly. Before protests began in February against Muammer Gaddafi, the price of benchmark European crude for imminent delivery was $1 a barrel less than supplies to be delivered a year later. …"
Candidate labels: Oil, Libya, Sports
48
Collective Multi-Label Classification
[Ghamrawi and McCallum, 2005; Finley and Joachims, 2008]
[Figure: the same article with pairwise edges between the label variables (Oil, Libya, Sports), so that labels are predicted collectively]
49
Multi-Label Classification
                                                           F1
Non-loopy baseline (logistic regression for each label)    81.6
50
Multi-Label Classification
                                                           F1
Non-loopy baseline (logistic regression for each label)    81.6
Loopy CRF models (inference via loopy sum-prod BP):
  Maximum-likelihood training (with approximate inference) 84.0
51
Multi-Label Classification
                                                           F1
Non-loopy baseline (logistic regression for each label)    81.6
Loopy CRF models (inference via loopy sum-prod BP):
  Maximum-likelihood training (with approximate inference) 84.0
  Softmax-margin (loss-aware)                              83.8
52
Multi-Label Classification
                                                           F1
Non-loopy baseline (logistic regression for each label)    81.6
Loopy CRF models (inference via loopy sum-prod BP):
  Maximum-likelihood training (with approximate inference) 84.0
  Softmax-margin (loss-aware)                              83.8
  ERMA (loss- and approximation-aware)                     84.6*
* Significantly better than all others (p < 0.05).
53
Summary
                               Cong. Vote     Semi-str. Inf.   Multi-label
                               Modeling       Extraction       Classification
                               (Accuracy)     (F1)             (F1)
Non-loopy baseline             71.2           87.1             81.6
Loopy CRF models:
  Maximum-likelihood training  78.2           89.5             84.0
  ERMA                         84.5           90.9             84.6

ERMA also helps on a range of synthetic graphical model problems (AISTATS'11 paper).
54
ERMA training
55
Back-Propagation of Error for Empirical Risk Minimization
• Back propagation of error (automatic differentiation in the reverse mode) to compute gradients of the loss with respect to θ.
• Gradient-based local optimization method to find the θ* that (locally) minimizes the training loss.
[Figure: x → black-box decision function parameterized by θ → L(y*, ŷ)]
57
Back-Propagation of Error for Empirical Risk Minimization
[Figure: the same setup with the decision function instantiated as a neural network: x → neural network (θ) → L(y*, ŷ)]
59
Back-Propagation of Error for Empirical Risk Minimization
[Figure: the decision function instantiated as a CRF system: a loopy CRF (variables Y1–Y4, inputs X1–X3) mapping x to L(y*, ŷ)]
60
Error Back-Propagation
[Figure: back-propagation through the computation graph of loopy BP for the vote-prediction model: e.g., the message m(y1→y2) = m(y3→y1) · m(y4→y1) feeds the belief P(VoteReidbill77 = Yea | x), and gradients flow back to θ]
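For the message product in the figure, the backward pass is just the product rule. A tiny sketch with invented values, writing ð(f) for ∂L/∂f as on the later slides:

```python
# Forward step from the figure: m12 = m31 * m41 (values are made up).
m31, m41 = 0.6, 0.5
m12 = m31 * m41

# Backward step: given the upstream adjoint ð(m12), the product rule gives
# ð(m31) = ð(m12) * m41 and ð(m41) = ð(m12) * m31.
adj_m12 = 2.0                 # assumed handed down from the loss side
adj_m31 = adj_m12 * m41
adj_m41 = adj_m12 * m31

# Check ð(m31) against a finite difference of L = 2 * m12 w.r.t. m31.
eps = 1e-7
num = (2.0 * ((m31 + eps) * m41) - 2.0 * m12) / eps
assert abs(adj_m31 - num) < 1e-5
```

Every message update in BP decomposes into such elementary steps, which is what lets reverse-mode differentiation handle the whole inference procedure.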
70
Error Back-Propagation
• Apply the chain rule of differentiation over and over.
• Forward pass:
– Regular computation (inference + decoding) in the model (and remember intermediate quantities).
• Backward pass:
– Replay the forward pass in reverse, computing gradients.
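This forward-then-backward scheme can be sketched on a toy computation (the function is invented purely to illustrate the mechanics):

```python
import math

def forward(theta):
    # Forward pass: compute and remember intermediate quantities.
    a = theta * theta       # step 1
    b = math.exp(a)         # step 2
    L = b + a               # step 3: a toy stand-in for the loss
    return L, (a, b)

def backward(theta, saved):
    # Backward pass: replay the steps in reverse, applying the chain rule.
    a, b = saved
    dL_db = 1.0                    # from step 3: L = b + a
    dL_da = 1.0 + dL_db * b        # a is used directly and via b = exp(a)
    dL_dtheta = dL_da * 2.0 * theta  # from step 1: a = theta^2
    return dL_dtheta

theta = 0.5
L, saved = forward(theta)
g = backward(theta, saved)
# Analytic derivative of exp(t^2) + t^2 at t = 0.5 is 2t*exp(t^2) + 2t.
assert abs(g - (math.exp(0.25) + 1.0)) < 1e-9
```

Note how the backward pass reuses the saved intermediate b instead of recomputing it; that reuse is why the backward pass costs about the same as the forward one.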
71
The Forward Pass
• Run inference and decoding:
[Figure: θ → inference (loopy BP) → messages → beliefs → decoding → output → loss L]
72
The Backward Pass
• Replay the computation backward, calculating gradients:
[Figure: the forward pipeline θ → Inference (loopy BP) → messages → beliefs → Decoding → output → Loss L, traversed in reverse. Each quantity f gets an adjoint ð(f) = ∂L/∂f, starting from ð(L) = 1 and flowing back through ð(output), ð(beliefs), and ð(messages) to ð(θ).]
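The two passes can be sketched numerically. This toy stands in for the slides' pipeline with made-up differentiable pieces (exp for the message update, a normalizer for the belief, squared error for the loss); none of these are the talk's actual loopy-BP updates, only illustrations of "remember intermediates, then replay in reverse":

```python
import math

def forward(theta):
    """Forward pass: run the pipeline, remembering intermediate quantities."""
    message = math.exp(theta)            # stand-in for a BP message update
    belief = message / (1.0 + message)   # stand-in for belief normalization
    loss = (belief - 1.0) ** 2           # stand-in differentiable loss
    return loss, (message, belief)

def backward(theta, intermediates):
    """Backward pass: replay in reverse, computing adjoints ð(f) = dL/df."""
    message, belief = intermediates
    d_loss = 1.0                                 # ð(L) = 1
    d_belief = d_loss * 2.0 * (belief - 1.0)     # ð(belief)
    d_message = d_belief / (1.0 + message) ** 2  # ð(message)
    d_theta = d_message * math.exp(theta)        # ð(θ)
    return d_theta

theta = 0.5
loss, saved = forward(theta)
grad = backward(theta, saved)

# Finite-difference check that the hand-derived adjoints are correct.
eps = 1e-6
numeric = (forward(theta + eps)[0] - forward(theta - eps)[0]) / (2 * eps)
assert abs(grad - numeric) < 1e-6
```

The backward pass costs about the same as the forward pass, which is the key property exploited on the next slide.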
![Page 73: Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ef25503460f94c036d1/html5/thumbnails/73.jpg)
73
Gradient-Based Optimization
• Use a local optimizer to find the θ* that minimizes training loss.
• In practice, we use a second-order method, Stochastic Meta-Descent [Schraudolph, 1999].
– Some more automatic-differentiation magic is needed to compute vector-Hessian products.
• Both the gradient and vector-Hessian computations have the same complexity as the forward pass (up to a small constant factor).
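The optimization loop can be sketched with plain gradient descent standing in for Stochastic Meta-Descent; `loss_and_grad` is a hypothetical callback that would run the forward and backward passes from the previous slides:

```python
# Plain first-order gradient descent as a stand-in for the second-order
# SMD optimizer used in the talk; loss_and_grad(theta) returns the
# training loss and its gradient at theta.

def minimize(loss_and_grad, theta0, lr=0.1, steps=200):
    theta = theta0
    for _ in range(steps):
        loss, grad = loss_and_grad(theta)
        theta = theta - lr * grad  # step downhill on the training loss
    return theta

# Toy check: minimize (theta - 3)^2, whose minimum is theta* = 3.
theta_star = minimize(lambda t: ((t - 3.0) ** 2, 2.0 * (t - 3.0)), 0.0)
assert abs(theta_star - 3.0) < 1e-3
```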
![Page 74: Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ef25503460f94c036d1/html5/thumbnails/74.jpg)
74
Minimum-Risk Training of Approximate CRF-Based NLP Systems
• ERMA leads to surprisingly large gains, improving the state of the art on 3 problems.
• You should try rich CRF models for YOUR application:
– Even if you have to approximate
– Just train to minimize loss given the approximations!
– Using our ERMA software.

                      Approximation-aware
                      No                Yes
Loss-aware   No       MLE
             Yes      SVMstruct,        ERMA
                      M3N,
                      Softmax-margin
![Page 75: Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ef25503460f94c036d1/html5/thumbnails/75.jpg)
75
What can ERMA do for you?
Future Work
• Learn speed-aware models for fast test-time inference
• Learn evidence-specific structures
• Applications to relational data
• ERMA software package available at www.clsp.jhu.edu/~ves/software
![Page 76: Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ef25503460f94c036d1/html5/thumbnails/76.jpg)
76
Thank you. Questions?
![Page 77: Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ef25503460f94c036d1/html5/thumbnails/77.jpg)
77
Deterministic Annealing
• Some loss functions are not differentiable (e.g., accuracy).
• Some inference methods are not differentiable (e.g., max-product BP).
• Replace max with softmax and anneal the temperature.
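A minimal sketch of the annealing trick: replace the non-differentiable max with a temperature-controlled softmax (log-sum-exp), which approaches the true max as the temperature T is lowered toward 0. The values and schedule below are illustrative assumptions:

```python
import math

def softmax_max(values, T):
    """Smooth approximation of max(values); exact in the limit T -> 0."""
    m = max(values)  # subtract the max for numerical stability
    return m + T * math.log(sum(math.exp((v - m) / T) for v in values))

vals = [1.0, 2.0, 5.0]
approximations = [softmax_max(vals, T) for T in (1.0, 0.1, 0.01)]
# Each annealing step gets closer to max(vals) = 5.0 from above.
assert approximations[0] >= approximations[1] >= approximations[2] >= 5.0
assert abs(approximations[2] - 5.0) < 1e-3
```

Because softmax_max is differentiable at every T > 0, the back-propagation machinery of the earlier slides applies unchanged at each annealing step.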
![Page 78: Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ef25503460f94c036d1/html5/thumbnails/78.jpg)
78
Linear-Chain CRFs for Sequences
• Defined in terms of potential functions for transitions fj(yi-1, yi) and emissions fk(xi, yi):

[Figure: chain graph with observations x1 x2 x3 x4 attached to labels Y1 Y2 Y3 Y4]

p(y | x) = (1/Z(x)) · exp( Σ_{i,j} θ_j f_j(y_{i-1}, y_i) + Σ_{i,k} θ_k f_k(x_i, y_i) )
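The distribution above can be sketched by brute force on a tiny input. The binary labels, transition/emission scores, and weights below are toy assumptions; Z(x) is computed by enumerating all label sequences rather than by the forward algorithm:

```python
import math
from itertools import product

LABELS = [0, 1]

def score(x, y, trans, emit):
    """Unnormalized log-score: emission potentials plus transition potentials."""
    s = sum(emit[(xi, yi)] for xi, yi in zip(x, y))
    s += sum(trans[(y[i - 1], y[i])] for i in range(1, len(y)))
    return s

def prob(x, y, trans, emit):
    """p(y|x) with Z(x) computed by enumeration (feasible only for tiny x)."""
    Z = sum(math.exp(score(x, yp, trans, emit))
            for yp in product(LABELS, repeat=len(x)))
    return math.exp(score(x, y, trans, emit)) / Z

# Toy potentials: labels like to repeat, and to match the observation.
trans = {(a, b): (1.0 if a == b else -1.0) for a in LABELS for b in LABELS}
emit = {(x, y): (0.5 if x == y else -0.5) for x in LABELS for y in LABELS}

total = sum(prob([0, 0, 1], list(yp), trans, emit)
            for yp in product(LABELS, repeat=3))
assert abs(total - 1.0) < 1e-9  # the distribution normalizes
```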
![Page 79: Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ef25503460f94c036d1/html5/thumbnails/79.jpg)
79
Synthetic Data
• Generate a CRF at random:
– Structure
– Parameters
• Use Gibbs sampling to generate data
• Forget the parameters (but not the structure)
• Learn the parameters from the sampled data
• Evaluate using one of four loss functions
• Total of 12 models of different size and connectivity
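The data-generation step can be sketched as follows. This is a minimal Gibbs sampler for a pairwise binary model with randomly drawn edge weights; the chain structure, ±1 labels, and sweep count are toy assumptions, not the experiment's actual random structures:

```python
import math
import random

random.seed(0)
N = 5
edges = [(i, i + 1) for i in range(N - 1)]      # toy chain structure
w = {e: random.gauss(0, 1) for e in edges}      # random edge parameters

def gibbs_sample(sweeps=100):
    """Draw one labeling by repeatedly resampling each variable given its neighbors."""
    y = [random.choice([-1, 1]) for _ in range(N)]
    for _ in range(sweeps):
        for i in range(N):
            # Local field on y_i from its neighbors' current values.
            field = sum(w[e] * y[j] for e in edges for j in e
                        if i in e and j != i)
            p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
            y[i] = 1 if random.random() < p_plus else -1
    return y

data = [gibbs_sample() for _ in range(10)]      # the "sampled data"
assert all(v in (-1, 1) for sample in data for v in sample)
```

After generating such data, the experiment discards w and refits parameters on the same structure.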
![Page 80: Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ef25503460f94c036d1/html5/thumbnails/80.jpg)
80
Synthetic Data: Results
Test Loss    Train Loss   Δ Loss   wins⦁ties⦁losses
MSE          ApprLogL     .71
MSE          MSE          .05      12⦁0⦁0
Accuracy     ApprLogL     .75
Accuracy     Accuracy     .01      11⦁0⦁1
F-Score      ApprLogL     1.17
F-Score      F-Score      .08      10⦁2⦁0
ApprLogL     ApprLogL     -.31
![Page 81: Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ef25503460f94c036d1/html5/thumbnails/81.jpg)
81
Synthetic Data: Introducing Structure Mismatch
[Figure: test loss (y-axis, 0 to 0.025) vs. structure mismatch (x-axis, 10% to 40%) for four train-loss/test-loss pairings: ALogL→MSE, MSE→MSE, ALogL→F-score, F-score→F-score]
![Page 82: Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ef25503460f94c036d1/html5/thumbnails/82.jpg)
82
Synthetic Data: Varying Approximation Quality
[Figure: test loss (y-axis, 0 to 0.035) vs. max BP iterations (x-axis, from 100 down to 10) for four train-loss/test-loss pairings: ALogL→MSE, MSE→MSE, ALogL→F-score, F-score→F-score]
![Page 83: Minimum-Risk Training of Approximate CRF-Based NLP Systems Veselin Stoyanov and Jason Eisner 1](https://reader030.vdocument.in/reader030/viewer/2022032612/56649ef25503460f94c036d1/html5/thumbnails/83.jpg)
83
Automatic Differentiation in the Reverse Mode
• f(x,y) = xy²   ∂f/∂x = ?   ∂f/∂y = ?
• Forward: t1 = y, t2 = y², t3 = x, t4 = t2·t3 = xy²
• Adjoints ð(g) = ∂V/∂g (here V = t4), e.g. ∂V/∂t3 = (∂V/∂t4)·(∂t4/∂t3) = 1·t2
• Backward: ð(t4) = 1, ð(t3) = t2, ð(t2) = t3, ð(t1) = 2t1t3, ð(x) = t2, ð(y) = 2t1t3
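The slide's example, written out as code: record the intermediates t1..t4 on the forward pass, then push adjoints backward, matching the hand derivation ∂f/∂x = y² and ∂f/∂y = 2xy:

```python
def f_and_grads(x, y):
    """Reverse-mode AD of f(x, y) = x * y**2 by hand."""
    # Forward pass, remembering intermediates.
    t1 = y
    t2 = t1 * t1      # y^2
    t3 = x
    t4 = t2 * t3      # x * y^2
    # Backward pass: adjoint ð(g) = d(t4)/d(g).
    d_t4 = 1.0
    d_t2 = d_t4 * t3
    d_t3 = d_t4 * t2
    d_t1 = d_t2 * 2.0 * t1
    d_x = d_t3        # = y^2
    d_y = d_t1        # = 2*x*y
    return t4, d_x, d_y

val, gx, gy = f_and_grads(3.0, 2.0)
assert val == 12.0    # 3 * 2^2
assert gx == 4.0      # df/dx = y^2 = 4
assert gy == 12.0     # df/dy = 2xy = 12
```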