nips2007: structured prediction

Structured Prediction: A Large Margin Approach Ben Taskar University of Pennsylvania


Page 1: NIPS2007: structured prediction

Structured Prediction: A Large Margin Approach

Ben Taskar, University of Pennsylvania

Page 2: NIPS2007: structured prediction

Acknowledgments

Drago Anguelov, Vassil Chatalbashev, Carlos Guestrin, Michael Jordan,
Dan Klein, Daphne Koller, Simon Lacoste-Julien, Paul Vernaza

Page 3: NIPS2007: structured prediction

Structured Prediction

Prediction of complex outputs

Structured outputs: multivariate, correlated, constrained

Novel, general way to solve many learning problems

Page 4: NIPS2007: structured prediction

Handwriting Recognition

brace

Sequential structure

x y

Page 5: NIPS2007: structured prediction

Object Segmentation

Spatial structure

x y

Page 6: NIPS2007: structured prediction

Natural Language Parsing

The screen was a sea of red

Recursive structure

x y

Page 7: NIPS2007: structured prediction

Bilingual Word Alignment

What is the anticipated cost of collecting fees under the new proposal?

En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?

x → y: align the English sentence ("What is the anticipated cost of collecting fees under the new proposal?") with the French sentence ("En vertu de les nouvelles propositions, quel est le coût prévu de perception de les droits?")

Combinatorial structure

Page 8: NIPS2007: structured prediction

Protein Structure and Disulfide Bridges

Protein: 1IMT

AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK

Page 9: NIPS2007: structured prediction

Local Prediction

Classify using local information
Ignores correlations & constraints!

Page 10: NIPS2007: structured prediction

Local Prediction (labels: building, tree, shrub, ground)

Page 11: NIPS2007: structured prediction

Structured Prediction

Use local information
Exploit correlations

Page 12: NIPS2007: structured prediction

Structured Prediction (labels: building, tree, shrub, ground)

Page 13: NIPS2007: structured prediction

Outline

Structured prediction models: sequences (CRFs), trees (CFGs), associative Markov networks (special MRFs), matchings

Structured large margin estimation: margins and structure, min-max formulation, linear programming inference, certificate formulation

Page 14: NIPS2007: structured prediction

Structured Models

Mild assumption: the scoring function is a linear combination of features, s(x, y) = w⊤f(x, y), and prediction maximizes it over the space of feasible outputs: h(x) = argmax_{y ∈ Y(x)} w⊤f(x, y)
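The linear-scoring assumption can be sketched in a few lines; the feature map and the tiny output space below are toy illustrations, not the tutorial's models:

```python
# Minimal sketch of a linear structured model: score(x, y) = w . f(x, y),
# prediction = argmax over the feasible output space Y(x).
# Feature map and output space here are toy placeholders.

def score(w, f_xy):
    """Linear combination of features: w . f(x, y)."""
    return sum(wi * fi for wi, fi in zip(w, f_xy))

def predict(w, feasible_outputs, feature_map, x):
    """argmax_{y in Y(x)} w . f(x, y)."""
    return max(feasible_outputs, key=lambda y: score(w, feature_map(x, y)))

# Toy example: binary output, two features.
feature_map = lambda x, y: [x * y, y]      # hypothetical f(x, y)
w = [1.0, -0.5]
print(predict(w, [-1, +1], feature_map, x=2.0))   # +1 scores 1.5 vs -1.5
```

Everything in the tutorial, from CRFs to matchings, instantiates this template with a different feature map and a different (combinatorial) feasible set Y(x).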

Page 15: NIPS2007: structured prediction

Chain Markov Net (aka CRF*)

[Figure: a chain of label variables y (each ranging over a–z) conditioned on the input x]

*Lafferty et al. '01
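MAP decoding in such a chain maximizes the sum of node and edge scores, which dynamic programming (Viterbi) does exactly; a sketch with made-up toy scores:

```python
def viterbi(node_scores, edge_scores):
    """MAP decoding for a chain Markov net / CRF.
    node_scores: list of dicts label -> score, one per position.
    edge_scores: dict (label_prev, label_next) -> score, shared across edges.
    Returns the highest-scoring label sequence."""
    n = len(node_scores)
    # best[t][y] = best score of a prefix ending in y; back[t][y] = its predecessor
    best = [dict(node_scores[0])]
    back = [{}]
    for t in range(1, n):
        best.append({})
        back.append({})
        for y, ns in node_scores[t].items():
            prev = max(best[t-1], key=lambda yp: best[t-1][yp] + edge_scores[(yp, y)])
            best[t][y] = best[t-1][prev] + edge_scores[(prev, y)] + ns
            back[t][y] = prev
    # trace back from the best final label
    y = max(best[-1], key=best[-1].get)
    seq = [y]
    for t in range(n - 1, 0, -1):
        y = back[t][y]
        seq.append(y)
    return seq[::-1]

# Toy 3-position chain over labels {a, b}; edge scores reward repeated labels.
nodes = [{'a': 1.0, 'b': 0.0}, {'a': 0.0, 'b': 0.2}, {'a': 0.0, 'b': 0.0}]
edges = {('a','a'): 0.5, ('a','b'): 0.0, ('b','a'): 0.0, ('b','b'): 0.5}
print(viterbi(nodes, edges))   # ['a', 'a', 'a']
```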


Page 17: NIPS2007: structured prediction

Associative Markov Nets

Point features: spin-images, point height
Edge features: length of edge, edge orientation

"associative" restriction

Page 18: NIPS2007: structured prediction

CFG Parsing

#(NP → DT NN)

#(PP → IN NP)

#(NN → 'sea')

Page 19: NIPS2007: structured prediction

Bilingual Word Alignment

Features: position, orthography, association

[Figure: alignment variables z_jk between English words ("What is the anticipated cost of collecting fees under the new proposal?") and French words ("En vertu de les nouvelles propositions, quel est le coût prévu de perception de le droits?")]

Page 20: NIPS2007: structured prediction

Disulfide Bonds: Non-bipartite Matching

RSCCPCYWGGCPWGQNCYPEGCSGPKV (cysteines numbered 1–6)

[Figure: disulfide bridges form a non-bipartite matching over the six cysteines]

Fariselli & Casadio `01, Baldi et al. ‘04

Page 21: NIPS2007: structured prediction

Scoring Function

[Figure: each candidate bond pair is scored from the sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV, cysteines numbered 1–6]

Features: amino acid identities, phys/chem properties

Page 22: NIPS2007: structured prediction

Structured Models

Mild assumptions: the scoring function is a linear combination of features and a sum of part scores, w⊤f(x, y) = Σ_p w⊤f_p(x, y_p); prediction maximizes it over the space of feasible outputs

Page 23: NIPS2007: structured prediction

Supervised Structured Prediction

Data {(x_i, y_i)} → Learning: estimate w → Prediction: argmax_y w⊤f(x, y)
Example: weighted matching; generally, combinatorial optimization

Estimation options: likelihood (can be intractable), margin, local (ignores structure)

Page 24: NIPS2007: structured prediction

Local Estimation

Treat edges as independent decisions
Estimate w locally, use globally: e.g., naïve Bayes, SVM, logistic regression; cf. [Matusov+al 03] for matchings
Simple and cheap, but not well-calibrated for the matching model; ignores correlations & constraints

Page 25: NIPS2007: structured prediction

Conditional Likelihood Estimation

Estimate w jointly; the denominator (partition function) is #P-complete [Valiant 79, Jerrum & Sinclair 93]

Tractable model, intractable learning

Need a tractable learning method → margin-based estimation

Page 26: NIPS2007: structured prediction

Outline

Structured prediction models: sequences (CRFs), trees (CFGs), associative Markov networks (special MRFs), matchings

Structured large margin estimation: margins and structure, min-max formulation, linear programming inference, certificate formulation

Page 27: NIPS2007: structured prediction

OCR Example

We want the true word to outscore every alternative: w⊤f(x, 'brace') > w⊤f(x, y) for all y ≠ 'brace' ('aaaaa', 'aaaab', …, 'zzzzz')

Equivalently: 'brace' must win against each of exponentially many competitors — a lot of constraints!

Page 28: NIPS2007: structured prediction

Parsing Example

For 'It was red', we want the correct parse tree to outscore every alternative tree — again, a lot of constraints!

Page 29: NIPS2007: structured prediction

Alignment Example

For 'What is the' / 'Quel est le', we want the correct alignment of positions 1–3 to outscore every alternative alignment — a lot of constraints!

Page 30: NIPS2007: structured prediction

Structured Loss

Hamming-type loss counts mistakes against the truth 'brace': 'bcare' 2, 'brore' 2, 'broce' 1, 'brace' 0
Alignment example ('What is the' / 'Quel est le'): losses 0, 1, 2, 2 for the candidate matchings
Parse example ('It was red'): losses 0, 1, 2, 3 for the candidate trees
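The Hamming counts in the OCR example can be reproduced directly:

```python
def hamming_loss(y_true, y_pred):
    """Number of positions where the prediction differs from the truth."""
    return sum(a != b for a, b in zip(y_true, y_pred))

for cand in ["bcare", "brore", "broce", "brace"]:
    print(cand, hamming_loss("brace", cand))
# bcare 2, brore 2, broce 1, brace 0 -- matching the slide
```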

Page 31: NIPS2007: structured prediction

Large margin estimation

Given training examples (x_i, y_i), we want: w⊤f(x_i, y_i) > w⊤f(x_i, y) for all y ≠ y_i
Maximize the margin γ by which the truth wins
Mistake-weighted margin: require the gap to grow with ℓ_i(y), the # of mistakes in y

*Collins 02, Altun et al. 03, Taskar 03

Page 32: NIPS2007: structured prediction

Large margin estimation

Eliminate the margin variable by fixing the scale of w (minimize ½‖w‖²)

Add slacks ξ_i for the inseparable case (hinge loss)

Page 33: NIPS2007: structured prediction

Large margin estimation

Brute force enumeration vs. min-max formulation: 'plug in' a linear program for inference

Page 34: NIPS2007: structured prediction

Min-max formulation

Key step: replace the exponentially many constraints with a single max over outputs, computed by LP inference
The structured (Hamming) loss decomposes over parts, turning discrete optimization into continuous optimization

Page 35: NIPS2007: structured prediction

Alternatives: Perceptron

Simple iterative method
Unstable for structured output: fewer instances, big updates; may not converge if non-separable; noisy
Voted / averaged perceptron [Freund & Schapire 99, Collins 02]: regularize / reduce variance by aggregating over iterations
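The structured perceptron update is w += f(x, y_true) − f(x, ŷ) on each mistake, and averaging the iterates gives the voted/averaged variant. A sketch with a brute-force argmax over a tiny output space (data, features, and problem are all illustrative):

```python
def averaged_perceptron(data, outputs, feat, dim, epochs=10):
    """Structured perceptron with iterate averaging [Freund & Schapire 99, Collins 02].
    data: list of (x, y_true); outputs(x): feasible outputs; feat(x, y): feature vector."""
    w = [0.0] * dim
    w_sum = [0.0] * dim
    steps = 0
    for _ in range(epochs):
        for x, y_true in data:
            # inference: argmax_y w . f(x, y) (brute force here)
            y_hat = max(outputs(x), key=lambda y: sum(a * b for a, b in zip(w, feat(x, y))))
            if y_hat != y_true:  # mistake-driven update
                for i, (ft, fp) in enumerate(zip(feat(x, y_true), feat(x, y_hat))):
                    w[i] += ft - fp
            w_sum = [s + wi for s, wi in zip(w_sum, w)]
            steps += 1
    return [s / steps for s in w_sum]   # averaged weights reduce variance

# Toy separable problem: y in {-1, +1}, f(x, y) = [x * y].
feat = lambda x, y: [x * y]
data = [(1.0, 1), (2.0, 1), (-1.5, -1)]
w = averaged_perceptron(data, lambda x: [-1, 1], feat, dim=1)
assert w[0] > 0   # positive weight classifies all three examples correctly
```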

Page 36: NIPS2007: structured prediction

Alternatives: Constraint Generation [Collins 02; Altun et al. 03]

Add the most violated constraint
Handles several more general loss functions, but must re-solve the QP many times
Theorem: only a polynomial # of constraints is needed to achieve ε-error [Tsochantaridis et al. 04]
Worst-case # of constraints larger than factored
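The "most violated constraint" step is loss-augmented inference, i.e. a separation oracle. A brute-force sketch over a tiny output space (the output set, feature map, and loss below are toy choices, and the surrounding QP solver is omitted):

```python
from itertools import product

def most_violated(w, x, y_true, outputs, feat, loss):
    """Separation oracle for constraint generation:
    argmax_y  loss(y_true, y) + w.f(x, y) - w.f(x, y_true)."""
    def violation(y):
        gap = sum(wi * (fy - ft)
                  for wi, fy, ft in zip(w, feat(x, y), feat(x, y_true)))
        return loss(y_true, y) + gap
    y_star = max(outputs(x), key=violation)
    return y_star, violation(y_star)

# Toy: outputs are 3-letter strings over {a, b}; Hamming loss; bag-of-'a' feature.
outputs = lambda x: [''.join(p) for p in product('ab', repeat=3)]
feat = lambda x, y: [y.count('a')]
loss = lambda yt, y: sum(a != b for a, b in zip(yt, y))
y, v = most_violated([0.0], None, 'aba', outputs, feat, loss)
print(y, v)   # 'bab' 3.0 -- with w = 0, the max-loss output is most violated
```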

Page 37: NIPS2007: structured prediction

Outline

Structured prediction models: sequences (CRFs), trees (CFGs), associative Markov networks (special MRFs), matchings

Structured large margin estimation: margins and structure, min-max formulation, linear programming inference, certificate formulation

Page 38: NIPS2007: structured prediction

Matching Inference LP

Degree constraints on alignment variables z_jk between English and French words
Has integral solutions z (A is totally unimodular) [Nemhauser & Wolsey 88]
Need Hamming-like loss
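The matching LP computes the argmax over perfect bipartite matchings; for tiny instances, brute force over permutations illustrates the same argmax (scores below are made up):

```python
from itertools import permutations

def best_matching(score):
    """argmax over perfect bipartite matchings of the total edge score.
    score[j][k] = score of aligning source word j to target word k.
    Brute force (n! candidates) -- the LP / combinatorial algorithms
    on the slide compute this efficiently."""
    n = len(score)
    return max(permutations(range(n)),
               key=lambda perm: sum(score[j][perm[j]] for j in range(n)))

# Toy 3x3 score matrix: the diagonal alignment wins.
s = [[2.0, 0.1, 0.0],
     [0.3, 1.5, 0.2],
     [0.0, 0.4, 1.0]]
print(best_matching(s))   # (0, 1, 2)
```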

Page 39: NIPS2007: structured prediction

y → z Map for Markov Nets

[Figure: each node label y_j is encoded as an indicator vector z_j over the labels a–z, and each edge label pair as an indicator matrix z_jk]

Page 40: NIPS2007: structured prediction

Markov Net Inference LP

Constraints: normalization (node indicators sum to 1) and agreement (edge indicators consistent with node indicators)
Has integral solutions z for chains and (hyper)trees; can be fractional for untriangulated networks

[Chekuri+al 01, Wainwright+al 02]

Page 41: NIPS2007: structured prediction

Associative MN Inference LP

"associative" restriction
For K=2, solutions are always integral (optimal); for K>2, within a factor of 2 of optimal (results for larger cliques)

[Greig+al 89, Boykov+al 99, Kolmogorov & Zabih 02, Taskar+al 04]

Page 42: NIPS2007: structured prediction

CFG Chart

CNF tree = set of two types of parts: constituents (A, s, e) and CF rules (A → B C, s, m, e)

Page 43: NIPS2007: structured prediction

CFG Inference LP

Constraints: inside/outside consistency and a root constraint

Has integral solutions z

Page 44: NIPS2007: structured prediction

LP Duality

Linear programming duality: variables ↔ constraints, constraints ↔ variables
Optimal values are the same when both feasible regions are bounded

Page 45: NIPS2007: structured prediction

Min-max Formulation

LP duality

Page 46: NIPS2007: structured prediction

Min-max formulation summary

Formulation produces a concise QP for: low-treewidth Markov networks, associative MNs (K=2), context-free grammars, bipartite matchings; approximate for untriangulated MNs and AMNs with K>2

*Taskar et al 04

Page 47: NIPS2007: structured prediction

Unfactored Primal/Dual

QP duality

Exponentially many constraints/variables

Page 48: NIPS2007: structured prediction

Factored Primal/Dual

By QP duality

Dual inherits structure from problem-specific inference LP

Variables correspond to a decomposition of variables of the flat case

Page 49: NIPS2007: structured prediction

The Connection

[Figure: the factored dual variables are marginals of a distribution over alternative outputs ('bcare', 'brore', 'broce', 'brace', with losses 2, 2, 1, 0)]

Page 50: NIPS2007: structured prediction

Duals and Kernels

Kernel trick works in the factored dual: local functions (log-potentials) can use kernels

Page 51: NIPS2007: structured prediction

3D Mapping

Laser Range Finder

GPS

IMU

Data provided by: Michael Montemerlo & Sebastian Thrun

Labels: ground, building, tree, shrub
Training: 30 thousand points; testing: 3 million points

Page 52: NIPS2007: structured prediction

Segmentation results

Hand-labeled 180K test points

Model   Accuracy
SVM     68%
V-SVM   73%
M³N     93%

Page 57: NIPS2007: structured prediction

Fly-through

Page 58: NIPS2007: structured prediction

LAGRbot: Real-time Navigation

LAGRbot: Paul Vernaza & Dan Lee

Range of stereo vision limited to approximately 15 m or less

Page 59: NIPS2007: structured prediction

LAGRbot: Real-time Navigation

Model        Error
Local        17%
Structured   8%

160×120 images: real-time prediction/learning (~100 ms)
Current work with Paul Vernaza, Dan Lee

Page 60: NIPS2007: structured prediction

Hypertext Classification (WebKB dataset)

[Figure: test error for SVMs, RMNs, M³Ns — lower is better; RMNs use loopy belief propagation, M³Ns a relaxed LP]

Four CS department websites: 1300 pages / 3500 links
Classify each page: faculty, course, student, project, other
Train on three universities, test on the fourth

53% error reduction over SVMs; 38% error reduction over RMNs

*Taskar et al. 02

Page 61: NIPS2007: structured prediction

Word Alignment Results

Data: Hansards (Canadian Parliament); features induced on 1M unsupervised sentences; trained on 100 sentences (10,000 edges); tested on 350 sentences (35,000 edges) [Taskar+al 05]

Model                        *Error
GIZA/IBM4 [Och & Ney 03]     6.5
Local learning + matching    5.4
Our approach                 4.9
Our approach + QAP           4.5

*Error: weighted combination of precision/recall [Lacoste-Julien+Taskar+al 06]

Page 62: NIPS2007: structured prediction

Modeling First Order Effects

Monotonicity, local inversion, local fertility
QAP is NP-complete; sentences (≤30 words, ~1k vars) solve in a few seconds (Mosek)
Learning: use the LP relaxation; testing: using the LP, 83.5% of sentences and 99.85% of edges are integral

Page 63: NIPS2007: structured prediction

Outline

Structured prediction models: sequences (CRFs), trees (CFGs), associative Markov networks (special MRFs), matchings

Structured large margin estimation: margins and structure, min-max formulation, linear programming inference, certificate formulation

Page 64: NIPS2007: structured prediction

Certificate formulation

Non-bipartite matchings: O(n³) combinatorial algorithm, but no polynomial-size LP known
Spanning trees: no polynomial-size LP known, but a simple certificate of optimality
Intuition: verifying optimality is easier than optimizing
Use a compact optimality condition of the target output w.r.t. w

Page 65: NIPS2007: structured prediction

Certificate for non-bipartite matching

Alternating cycle: every other edge is in the matching
Augmenting alternating cycle: score of edges not in the matching greater than edges in the matching
Negate the score of edges not in the matching: an augmenting alternating cycle becomes a negative-length alternating cycle
Matching is optimal ⇔ no negative alternating cycles

Edmonds '65

Page 66: NIPS2007: structured prediction

Certificate for non-bipartite matching

Pick any node r as root; let d_j = length of the shortest alternating path from r to j
Triangle inequality: d_k ≤ d_j + length(j, k)
Theorem: no negative-length cycle ⇔ such a distance function d exists
Can be expressed as linear constraints: O(n) distance variables, O(n²) constraints
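The certificate rests on the classic fact that distance labels satisfying the triangle inequality exist iff the graph has no negative cycle; Bellman-Ford makes this checkable. A generic sketch on a small directed graph (not the alternating-path construction itself, and the example weights are made up):

```python
def find_distance_certificate(n, edges):
    """Bellman-Ford: return distance labels d with d[k] <= d[j] + w
    for every edge (j, k, w), or None if a negative cycle exists.
    Such labels are exactly the linear certificate on the slide."""
    d = [0.0] * n                  # virtual source at distance 0 to every node
    for _ in range(n - 1):
        for j, k, w in edges:
            if d[j] + w < d[k]:
                d[k] = d[j] + w
    # one more pass: any further improvement proves a negative cycle
    for j, k, w in edges:
        if d[j] + w < d[k]:
            return None
    return d

good = [(0, 1, 1.0), (1, 2, -0.5), (2, 0, 0.0)]    # cycle length +0.5
bad  = [(0, 1, 1.0), (1, 2, -0.5), (2, 0, -1.0)]   # cycle length -0.5
print(find_distance_certificate(3, good) is not None)  # True
print(find_distance_certificate(3, bad))               # None
```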

Page 67: NIPS2007: structured prediction

Certificate formulation

Formulation produces a compact QP for: spanning trees, non-bipartite matchings, any problem with a compact optimality condition

*Taskar et al. ‘05

Page 68: NIPS2007: structured prediction

Disulfide Bonding Prediction

Data [Swiss-Prot 39]: 450 sequences (4–10 cysteines)
Features: windows around each C–C pair, physical/chemical properties

Model                                 *Acc
Local learning + matching             41%
Recursive Neural Net [Baldi+al 04]    52%
Our approach (certificate)            55%

*Accuracy: % of proteins with all bonds correct [Taskar+al 05]

Page 69: NIPS2007: structured prediction

Formulation summary

Brute force enumeration
Min-max formulation: 'plug in' a convex program for inference
Certificate formulation: directly guarantee optimality of the target output

Page 70: NIPS2007: structured prediction

Scalable Algorithms

Convex quadratic program: # variables and constraints linear in # parameters and edges
Can solve using off-the-shelf software (Matlab, CPLEX, Mosek, etc.) with superlinear convergence
Problem: even linear size is too large; second-order methods run out of memory (quadratic)
Need scalable, memory-efficient methods (space/time tradeoff): structured SMO [Taskar+al 04], structured exponentiated gradient [Bartlett+al 04, Collins+al 07]
These don't work for matchings, min-cuts

Page 71: NIPS2007: structured prediction

Saddle-point Problem

Page 72: NIPS2007: structured prediction

Extragradient Method [Korpelevich '76]

A prediction step followed by a correction step, each a gradient step plus a Euclidean projection (with a step size)
Theorem: extragradient converges linearly
Key computation is the Euclidean projection: usually easy for the weights, harder for the inference variables
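On a toy bilinear saddle problem the prediction/correction structure is easy to see; here the projection is just clipping to a box, whereas the real algorithm projects onto the inference polytope (via min-cost flow, per the next slide):

```python
def extragradient(steps=2000, eta=0.1):
    """Extragradient [Korpelevich '76] on the toy saddle problem
    min_{w in [-1,1]} max_{z in [-1,1]} w*z, whose unique saddle point is (0, 0).
    Plain gradient descent/ascent spirals around the saddle; adding the
    prediction step makes the iterates contract toward it."""
    proj = lambda v: min(1.0, max(-1.0, v))   # Euclidean projection onto [-1, 1]
    w, z = 0.5, 0.5
    for _ in range(steps):
        wp = proj(w - eta * z)       # prediction: gradient at the current point
        zp = proj(z + eta * w)
        w = proj(w - eta * zp)       # correction: gradient at the predicted point
        z = proj(z + eta * wp)
    return w, z

w, z = extragradient()
print(abs(w) < 1e-3, abs(z) < 1e-3)   # True True
```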

Page 73: NIPS2007: structured prediction

Projection for Bipartite Matchings: Min-Cost Flow

Min-cost quadratic flow computes the projection; O(N^1.5) complexity for fixed precision (N = # edges)
Reduction to flow for min-cuts is also possible [Taskar+al 06]

[Figure: flow network s → English words → French words → t, all capacities = 1; the flow cost encodes the projection]

Page 74: NIPS2007: structured prediction

Structured Extragradient

Extragradient method [Korpelevich 76, Nesterov 03]: linear convergence; key computation is the projection — min-cost quadratic flow for matchings & cuts
Extensions (using Bregman divergence): dynamic programming for decomposable models
"Online-envy": want memory proportional to # parameters, independent of # examples
Solves problems with a million edges [Taskar+al 06]

Page 75: NIPS2007: structured prediction

Other approaches

Online methods: online updates with respect to most violated constraints [Crammer+al 05, 06]
Regression-based methods: regression from input to a transformed output space [Cortes+al 07]
Learning to search: learn a classifier to guide local search for the structured solution [Daume+al 05]
Many others

Page 76: NIPS2007: structured prediction

Generalization Bounds

“If the past is any indication of the future, he'll have a cruller.”

Page 77: NIPS2007: structured prediction

Generalization Bounds

Page 78: NIPS2007: structured prediction

Several Pointers

Perceptron bound [Collins 01]: assumes separability with margin; bound on 0-1 loss
Covering-number bound [Taskar+al 03]: bound on Hamming loss; logarithmic dependence on # variables in each y
Regret bounds [Crammer+al 06]: online-style guarantees for more general loss
PAC-Bayes bound [McAllester 07]: tighter analysis, consistency
Bounds for learning with approximate inference [Kulesza & Pereira, today]

Page 79: NIPS2007: structured prediction

Open Questions for Large-Margin Estimation

Statistical consistency: hinge loss is not consistent for non-binary output [see Tewari & Bartlett 05, McAllester 07]
Semi-supervised: Laplacian regularization [Altun+McAllester 05], co-regularization [Brefeld+al 05]
Latent variables: machine translation [Liang+al 06], CCG parsing to logical form [Zettlemoyer+Collins 07]
Learning with approximate inference

Page 80: NIPS2007: structured prediction

Learning with LP relaxations

Does constant-factor approximate inference guarantee anything a priori about learning?
No [see Kulesza & Pereira, tonight]: a simple 3-node counterexample is separable with exact inference but not separable with approximate inference
Question: what other (stronger?) approximate inference guarantees will translate into learning guarantees?

Page 81: NIPS2007: structured prediction

References

Edited collection: G. Bakir+al 07, Predicting Structured Data, MIT Press
Code: SVMstruct by Thorsten Joachims
Slides and more papers at: http://www.cis.upenn.edu/~taskar

Page 82: NIPS2007: structured prediction

Thanks!

Page 83: NIPS2007: structured prediction

Segmentation Model: Min-Cut

Local evidence + spatial smoothness over binary labels {0, 1}
Computing the MAP is hard in general, but if the edge potentials are attractive, a min-cut algorithm solves it exactly
Multiway cut for the multiclass case: use an LP relaxation

[Greig+al 89, Boykov+al 99, Kolmogorov & Zabih 02, Taskar+al 04]
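For attractive pairwise potentials, the binary MAP reduces to an s-t min-cut, computable by max-flow. A tiny Edmonds-Karp sketch; the two-pixel network and its capacities are a made-up illustration of the construction, not a real model:

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max-flow; cap[u][v] = capacity of edge u -> v.
    Returns the flow value and the source-side set of the min cut
    (which gives the MAP labeling in the segmentation reduction)."""
    n = len(cap)
    flow = [[0] * n for _ in range(n)]
    def bfs():
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q:
            u = q.popleft()
            for v in range(n):
                if parent[v] < 0 and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        return parent if parent[t] >= 0 else None
    value = 0
    while True:
        parent = bfs()
        if parent is None:
            break
        # bottleneck capacity along the augmenting path
        v, aug = t, float('inf')
        while v != s:
            u = parent[v]
            aug = min(aug, cap[u][v] - flow[u][v])
            v = u
        v = t
        while v != s:
            u = parent[v]
            flow[u][v] += aug
            flow[v][u] -= aug
            v = u
        value += aug
    # min cut: nodes still reachable from s in the residual graph
    reach = {s}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in range(n):
            if v not in reach and cap[u][v] - flow[u][v] > 0:
                reach.add(v)
                q.append(v)
    return value, reach

# Toy graph: 0 = source, 3 = sink, pixels 1 and 2 joined by a smoothness edge.
cap = [[0, 5, 1, 0],
       [0, 0, 2, 1],
       [0, 2, 0, 5],
       [0, 0, 0, 0]]
print(max_flow(cap, 0, 3))   # (4, {0, 1}): pixel 1 takes the source label, pixel 2 the sink label
```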