
Structured Prediction: A Large Margin Approach

Ben Taskar, University of Pennsylvania

Acknowledgments

Drago Anguelov, Vassil Chatalbashev, Carlos Guestrin, Michael Jordan,
Dan Klein, Daphne Koller, Simon Lacoste-Julien, Paul Vernaza

Structured Prediction

Prediction of complex outputs

Structured outputs: multivariate, correlated, constrained

Novel, general way to solve many learning problems

Handwriting Recognition

[Figure: handwritten word image x mapped to the label sequence y = "brace"]

Sequential structure

Object Segmentation

[Figure: 3D scan x mapped to per-point labels y]

Spatial structure

Natural Language Parsing

[Figure: sentence x = "The screen was a sea of red" mapped to a parse tree y]

Recursive structure

Bilingual Word Alignment

What is the anticipated cost of collecting fees under the new proposal?

En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?

[Figure: word-to-word alignment y between the English sentence x and its French translation]

Combinatorial structure

Protein Structure and Disulfide Bridges

Protein: 1IMT

AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK

Local Prediction

Classify using local information
Ignores correlations & constraints!

[Figures: OCR letters classified independently; 3D points labeled independently as building / tree / shrub / ground]

Structured Prediction

Use local information
Exploit correlations

[Figures: OCR letters decoded jointly; 3D points labeled jointly as building / tree / shrub / ground]

Outline

Structured prediction models
- Sequences (CRFs)
- Trees (CFGs)
- Associative Markov networks (special MRFs)
- Matchings

Structured large margin estimation
- Margins and structure
- Min-max formulation
- Linear programming inference
- Certificate formulation

Structured Models

Mild assumption: the scoring function is a linear combination of features,
s(x, y) = w · f(x, y),
and prediction maximizes the score over the space of feasible outputs:
y* = argmax over y in Y(x) of w · f(x, y)
(A brute-force sketch of this setup follows.)
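A minimal sketch of this setup, assuming a hypothetical joint feature map and a tiny, brute-force-enumerable output space (real structured predictors never enumerate Y(x)):

```python
import numpy as np
from itertools import product

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def features(x, y):
    """Toy joint feature map f(x, y): per-label emission sums plus label-bigram counts.
    x is an array of per-position observation vectors, y a string of labels."""
    emit = np.zeros((len(ALPHABET), x.shape[1]))
    trans = np.zeros((len(ALPHABET), len(ALPHABET)))
    for t, label in enumerate(y):
        emit[ALPHABET.index(label)] += x[t]
        if t > 0:
            trans[ALPHABET.index(y[t - 1]), ALPHABET.index(label)] += 1
    return np.concatenate([emit.ravel(), trans.ravel()])

def predict(w, x, length):
    """h(x) = argmax_{y in Y(x)} w . f(x, y), here by brute-force enumeration."""
    best_y, best_score = None, -np.inf
    for y in product(ALPHABET[:3], repeat=length):   # tiny output space for illustration
        score = w @ features(x, "".join(y))
        if score > best_score:
            best_y, best_score = "".join(y), score
    return best_y

x = np.random.rand(4, 5)                     # 4 positions, 5-dim observations
w = np.random.randn(26 * 5 + 26 * 26)
print(predict(w, x, length=4))
```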

Chain Markov Net (aka CRF*)

[Figure: chain-structured model over the OCR example; label nodes y1 ... y5, each ranging over a-z, linked in a chain and connected to the input image x]

The score w · f(x, y) decomposes into node and edge terms along the chain, so the argmax can be computed by dynamic programming (Viterbi), as sketched below.

*Lafferty et al. 01
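A minimal Viterbi decoder for such a chain model, assuming node and edge scores are given as arrays (an illustrative sketch, not the tutorial's code):

```python
import numpy as np

def viterbi(node_scores, edge_scores):
    """MAP decoding for a chain model.
    node_scores: (T, K) array, score of label k at position t.
    edge_scores: (K, K) array, score of transitioning from label i to label j.
    Returns the highest-scoring label sequence."""
    T, K = node_scores.shape
    delta = np.zeros((T, K))            # best score of any prefix ending in label k
    backp = np.zeros((T, K), dtype=int)
    delta[0] = node_scores[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + edge_scores     # rows: previous label, cols: current
        backp[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0) + node_scores[t]
    # follow back-pointers from the best final label
    y = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(backp[t, y[-1]]))
    return y[::-1]

# toy usage: 5 positions, 26 labels (a-z)
rng = np.random.default_rng(0)
labels = viterbi(rng.normal(size=(5, 26)), rng.normal(size=(26, 26)))
print("".join("abcdefghijklmnopqrstuvwxyz"[k] for k in labels))
```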


Associative Markov Nets

Point features: spin-images, point height
Edge features: length of edge, edge orientation

[Figure: neighboring scan points j, k with labels yj, yk]

"Associative" restriction: edge potentials only reward neighboring nodes that agree on a label (a tiny check of this restriction follows).
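A tiny illustration of the associative restriction with hypothetical numbers: the K x K edge potential has nonnegative scores on the diagonal (agreement) and zero elsewhere.

```python
import numpy as np

K = 4                                              # labels: ground, building, tree, shrub
edge_potential = np.diag([0.8, 1.2, 0.5, 0.3])     # reward per label for neighbors agreeing

def is_associative(theta):
    """Check the AMN restriction: off-diagonal entries zero, diagonal nonnegative."""
    off_diag = theta - np.diag(np.diag(theta))
    return np.allclose(off_diag, 0.0) and np.all(np.diag(theta) >= 0)

print(is_associative(edge_potential))  # True
```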

CFG Parsing

Features count rule occurrences in the tree, e.g.
#(NP → DT NN)
#(PP → IN NP)
#(NN → 'sea')
(A counting sketch follows.)
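A minimal sketch of extracting such rule-count features from a CNF tree, using a hypothetical nested-tuple tree representation:

```python
from collections import Counter

def rule_counts(tree):
    """Count CFG rule occurrences in a tree.
    A tree is either (label, word) for a preterminal
    or (label, left_subtree, right_subtree) for a binary rule."""
    counts = Counter()
    def walk(node):
        if len(node) == 2:                       # preterminal: (NN, 'sea')
            counts[(node[0], node[1])] += 1
        else:                                    # binary rule: (NP, left, right)
            counts[(node[0], node[1][0], node[2][0])] += 1
            walk(node[1]); walk(node[2])
    walk(tree)
    return counts

tree = ("NP", ("DT", "a"), ("NN", "sea"))
print(rule_counts(tree))   # {('NP','DT','NN'): 1, ('DT','a'): 1, ('NN','sea'): 1}
```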

Bilingual Word Alignment

Features on each candidate edge (j, k): position, orthography, association

[Figure: candidate alignment edge between English word j and French word k in the "What is the anticipated cost ..." / "En vertu de les nouvelles propositions ..." sentence pair]

Disulfide Bonds: Non-bipartite Matching

[Figure: protein sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV with its six cysteines numbered 1-6; the disulfide bridges form a perfect (non-bipartite) matching over the cysteines]

Fariselli & Casadio '01, Baldi et al. '04

Scoring Function

[Figure: a candidate bond between two cysteines of RSCCPCYWGGCPWGQNCYPEGCSGPKV (cysteines numbered 1-6)]

Features of a candidate pair: amino acid identities, phys/chem properties

Structured Models

Mild assumptions: the scoring function is a linear combination of features that decomposes into a sum of part scores,
s(x, y) = w · f(x, y) = sum over parts p of w · f_p(x, y_p),
and prediction maximizes the score over the space of feasible outputs.

Supervised Structured Prediction

Learning: estimate w from data {(x_i, y_i)}.
Prediction: maximize w · f(x, y) over feasible y; e.g. weighted matching, and generally combinatorial optimization.

Estimation options:
- Local (ignores structure)
- Likelihood (can be intractable)
- Margin

Local Estimation

Treat edges as independent decisions
Estimate w locally, use globally: e.g., naïve Bayes, SVM, logistic regression; cf. [Matusov+al 03] for matchings
Simple and cheap, but not well-calibrated for the matching model; ignores correlations & constraints

Conditional Likelihood Estimation

Estimate w jointly by maximizing the conditional likelihood
P_w(y | x) = exp(w · f(x, y)) / Z_w(x),  where Z_w(x) = sum over feasible y' of exp(w · f(x, y')).

The denominator is #P-complete to compute for matchings [Valiant 79, Jerrum & Sinclair 93]: tractable model, intractable learning.

Need a tractable learning method: margin-based estimation.
(A brute-force illustration of the blow-up follows.)
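A brute-force illustration, with hypothetical scores, of why the normalizer is intractable for matchings: summing exp-scores over all perfect bipartite matchings means summing over all n! permutations, i.e. computing a matrix permanent.

```python
import math
from itertools import permutations
import numpy as np

def matching_log_partition(edge_scores):
    """Brute-force log Z for bipartite matchings: sums exp(score) over all n! perfect
    matchings. Feasible only for tiny n; the general problem is #P-complete."""
    n = edge_scores.shape[0]
    total = 0.0
    for perm in permutations(range(n)):
        total += math.exp(sum(edge_scores[j, perm[j]] for j in range(n)))
    return math.log(total)

rng = np.random.default_rng(0)
for n in (4, 6, 8):
    scores = rng.normal(size=(n, n))
    print(n, math.factorial(n), matching_log_partition(scores))  # number of terms grows as n!
```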

Outline

Structured prediction models
- Sequences (CRFs)
- Trees (CFGs)
- Associative Markov networks (special MRFs)
- Matchings

Structured large margin estimation
- Margins and structure
- Min-max formulation
- Linear programming inference
- Certificate formulation

OCR Example

We want: w · f(x, "brace") > w · f(x, y) for every other word y.

Equivalently:
w · f(x, "brace") > w · f(x, "aaaaa")
w · f(x, "brace") > w · f(x, "aaaab")
...
w · f(x, "brace") > w · f(x, "zzzzz")
... a lot of constraints!

Parsing Example

We want: w · f('It was red', correct tree) > w · f('It was red', y) for every other parse tree y.

Equivalently, one constraint per alternative tree ... a lot of constraints!

[Figure: the correct parse of 'It was red' compared against several alternative trees]

Alignment Example

We want: w · f(x, correct alignment) > w · f(x, y) for every other alignment y of 'What is the' / 'Quel est le'.

Equivalently, one constraint per alternative alignment ... a lot of constraints!

[Figure: the correct 3-word alignment compared against several alternative alignments]

Structured Loss

Loss ℓ(y, y') counts the number of wrong parts (Hamming-like). For the OCR example with truth "brace":
"bcare" = 2, "brore" = 2, "broce" = 1, "brace" = 0

[Figure: analogous per-part losses (0, 1, 2, ...) for alternative alignments of 'What is the' / 'Quel est le' and alternative parses of 'It was red']

Large margin estimation

Given training examples {(x_i, y_i)}, we want:
w · f(x_i, y_i) > w · f(x_i, y)  for all y ≠ y_i

Maximize the margin γ (for ||w|| ≤ 1):
w · f(x_i, y_i) ≥ w · f(x_i, y) + γ

Mistake-weighted margin:
w · f(x_i, y_i) ≥ w · f(x_i, y) + γ ℓ(y_i, y),  where ℓ(y_i, y) = # of mistakes in y

*Collins 02, Altun et al 03, Taskar 03

Large margin estimation

Eliminate γ: fix the margin to ℓ(y_i, y) and minimize the norm of w,
min ½ ||w||²  s.t.  w · f(x_i, y_i) ≥ w · f(x_i, y) + ℓ(y_i, y)  for all i, y

Add slacks ξ_i for the inseparable case (hinge loss):
min ½ ||w||² + C Σ_i ξ_i  s.t.  w · f(x_i, y_i) ≥ w · f(x_i, y) + ℓ(y_i, y) − ξ_i  for all i, y
(A subgradient sketch of this objective follows.)
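A minimal sketch (not the tutorial's QP formulation) of minimizing the structured hinge loss by stochastic subgradient, assuming hypothetical `features`, `loss`, and `loss_augmented_argmax` callables, where the last solves max_y [w · f(x, y) + ℓ(y_i, y)]:

```python
import numpy as np

def structured_subgradient_step(w, x, y_true, features, loss, loss_augmented_argmax,
                                learning_rate=0.1, reg=1e-2):
    """One stochastic subgradient step on  reg/2 ||w||^2 + hinge_i(w),  where
    hinge_i(w) = max_y [w.f(x,y) + loss(y_true, y)] - w.f(x, y_true)."""
    y_hat = loss_augmented_argmax(w, x, y_true)          # most violated output
    hinge = (w @ features(x, y_hat) + loss(y_true, y_hat)) - w @ features(x, y_true)
    grad = reg * w
    if hinge > 0:
        grad = grad + features(x, y_hat) - features(x, y_true)
    return w - learning_rate * grad
```

For chains, the oracle is Viterbi with the Hamming loss folded into the node scores; for matchings, a max-weight matching with loss-adjusted edge scores.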

Large margin estimation

- Brute force enumeration
- Min-max formulation: 'plug in' a linear program for inference

Min-max formulation

Replace the exponentially many constraints per example by a single one:
w · f(x_i, y_i) ≥ max over y of [ w · f(x_i, y) + ℓ(y_i, y) ] − ξ_i

Structured loss (Hamming): decomposes over parts.
Key step: the inner maximization (loss-augmented inference) turns a discrete optimization into a continuous one via an LP.

Alternatives: Perceptron

Simple iterative method
Unstable for structured output: fewer instances, big updates
May not converge if non-separable; noisy
Voted / averaged perceptron [Freund & Schapire 99, Collins 02]: regularize / reduce variance by aggregating over iterations
(A minimal update sketch follows.)
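A sketch of the averaged structured perceptron, assuming the same hypothetical `features` map and an `argmax_decoder` such as Viterbi:

```python
import numpy as np

def structured_perceptron(data, features, argmax_decoder, dim, epochs=10):
    """Averaged structured perceptron.
    data: list of (x, y_true); argmax_decoder(w, x) returns the highest-scoring y."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    updates = 0
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = argmax_decoder(w, x)
            if y_hat != y_true:                        # mistake-driven update
                w += features(x, y_true) - features(x, y_hat)
            w_sum += w
            updates += 1
    return w_sum / updates                             # averaging reduces variance
```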

Alternatives: Constraint Generation [Collins 02; Altun et al 03]

Add the most violated constraint and re-solve the QP
Handles several more general loss functions
Need to re-solve the QP many times
Theorem: only a polynomial # of constraints is needed to achieve ε-error [Tsochantaridis et al 04]
Worst-case # of constraints is larger than in the factored formulation

Outline

Structured prediction models
- Sequences (CRFs)
- Trees (CFGs)
- Associative Markov networks (special MRFs)
- Matchings

Structured large margin estimation
- Margins and structure
- Min-max formulation
- Linear programming inference
- Certificate formulation

Matching Inference LP

max over z of Σ_jk s_jk z_jk
subject to degree constraints: Σ_k z_jk ≤ 1 for each j, Σ_j z_jk ≤ 1 for each k, 0 ≤ z_jk ≤ 1

Has integral solutions z (the constraint matrix A is totally unimodular) [Nemhauser+Wolsey 88]; see the LP sketch below.

[Figure: candidate edge (j, k) between an English and a French word in the "What is the anticipated cost ..." / "En vertu de les nouvelles propositions ..." sentence pair]

Need Hamming-like loss
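A small sketch of this inference LP using scipy, with hypothetical scores (linprog minimizes, so the scores are negated). The relaxed solution comes back integral, consistent with total unimodularity.

```python
import numpy as np
from scipy.optimize import linprog

def matching_lp(scores):
    """Solve the bipartite matching LP relaxation:
    max sum_jk s_jk z_jk  s.t. each row/column sums to <= 1, 0 <= z <= 1."""
    n, m = scores.shape
    c = -scores.ravel()                        # maximize -> minimize the negation
    A_ub, b_ub = [], []
    for j in range(n):                         # row degree constraints
        row = np.zeros(n * m); row[j * m:(j + 1) * m] = 1
        A_ub.append(row); b_ub.append(1)
    for k in range(m):                         # column degree constraints
        col = np.zeros(n * m); col[k::m] = 1
        A_ub.append(col); b_ub.append(1)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(0, 1))
    return res.x.reshape(n, m)

z = matching_lp(np.array([[3.0, 1.0, 0.2],
                          [0.5, 2.0, 1.5],
                          [0.1, 0.3, 4.0]]))
print(np.round(z, 3))    # an (essentially) 0/1 matching despite solving only the relaxation
```

To fold a Hamming-like loss against a target matching z* into the same LP, one would add (1 − 2 z*_jk) to each edge score, which is the loss-augmented objective of the min-max formulation.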

y → z Map for Markov Nets

Encode a labeling y with indicator variables: for each node j and label a in {a, ..., z}, z_j(a) = 1 iff y_j = a; for each edge (j, k) and label pair (a, b), z_jk(a, b) = 1 iff (y_j, y_k) = (a, b).

[Figure: the OCR label sequence written out as 0/1 indicator vectors per node and 0/1 indicator matrices per edge]

Markov Net Inference LP

max over z of Σ_j Σ_a s_j(a) z_j(a) + Σ_jk Σ_ab s_jk(a, b) z_jk(a, b)
subject to
normalization: Σ_a z_j(a) = 1 for each node j
agreement: Σ_b z_jk(a, b) = z_j(a) and Σ_a z_jk(a, b) = z_k(b) for each edge (j, k)

Has integral solutions z for chains and (hyper)trees; can be fractional for untriangulated networks
[Chekuri+al 01, Wainwright+al 02]
(A concrete sketch of this LP for a small chain follows.)
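A concrete sketch of this LP for a small chain using scipy's linprog, with hypothetical random scores (linprog minimizes, so the scores are negated). For a chain the relaxation is tight, so the recovered node marginals are 0/1 and match Viterbi decoding on the same scores.

```python
import numpy as np
from scipy.optimize import linprog

def chain_map_lp(node_scores, edge_scores):
    """LP relaxation of MAP inference in a chain MRF over node marginals z_j(a)
    and edge marginals z_j(a, b); integral at the optimum for chains."""
    T, K = node_scores.shape
    n_node = T * K
    nid = lambda j, a: j * K + a                      # index of z_j(a)
    eid = lambda j, a, b: n_node + j * K * K + a * K + b   # index of z_j(a, b)

    c = np.zeros(n_node + (T - 1) * K * K)
    c[:n_node] = -node_scores.ravel()
    for j in range(T - 1):
        for a in range(K):
            for b in range(K):
                c[eid(j, a, b)] = -edge_scores[a, b]

    A_eq, b_eq = [], []
    for j in range(T):                                # normalization: sum_a z_j(a) = 1
        row = np.zeros_like(c); row[nid(j, 0):nid(j, 0) + K] = 1
        A_eq.append(row); b_eq.append(1.0)
    for j in range(T - 1):                            # agreement with both endpoints
        for a in range(K):
            row = np.zeros_like(c); row[nid(j, a)] = -1
            for b in range(K): row[eid(j, a, b)] = 1
            A_eq.append(row); b_eq.append(0.0)
        for b in range(K):
            row = np.zeros_like(c); row[nid(j + 1, b)] = -1
            for a in range(K): row[eid(j, a, b)] = 1
            A_eq.append(row); b_eq.append(0.0)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
    return res.x[:n_node].reshape(T, K)

rng = np.random.default_rng(0)
print(np.round(chain_map_lp(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))), 2))
```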

Associative MN Inference LP

The same LP with the "associative" restriction on edge potentials:
For K = 2 labels, solutions are always integral (optimal)
For K > 2, within a factor of 2 of optimal (results also for larger cliques)

[Greig+al 89, Boykov+al 99, Kolmogorov & Zabih 02, Taskar+al 04]

CFG Chart

CNF tree = set of two types of parts:
- Constituents (A, s, e)
- CF rules (A → B C, s, m, e)

CFG Inference LP

Variables for constituents and rule applications; constraints tie each constituent to the rules inside and outside it and anchor the root. Has integral solutions z. (The equivalent dynamic program, CKY, is sketched below.)
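The slide's LP has integral solutions; the equivalent dynamic program for max-score parsing is CKY. A minimal sketch with a hypothetical toy grammar and scores:

```python
from collections import defaultdict

def cky(words, unary_scores, binary_scores, root="S"):
    """Max-score CNF parse via CKY.
    unary_scores: {(A, word): score}; binary_scores: {(A, B, C): score}."""
    n = len(words)
    best = defaultdict(lambda: float("-inf"))   # best[(s, e, A)] = best score of A over words[s:e]
    back = {}
    for s, w in enumerate(words):
        for (A, word), sc in unary_scores.items():
            if word == w:
                best[(s, s + 1, A)] = sc
    for span in range(2, n + 1):
        for s in range(n - span + 1):
            e = s + span
            for m in range(s + 1, e):
                for (A, B, C), sc in binary_scores.items():
                    cand = sc + best[(s, m, B)] + best[(m, e, C)]
                    if cand > best[(s, e, A)]:
                        best[(s, e, A)] = cand
                        back[(s, e, A)] = (m, B, C)
    def build(s, e, A):
        if (s, e, A) not in back:
            return (A, words[s])
        m, B, C = back[(s, e, A)]
        return (A, build(s, m, B), build(m, e, C))
    return best[(0, n, root)], build(0, n, root)

unary = {("DT", "the"): 0.0, ("NN", "screen"): 0.0, ("VBD", "was"): 0.0, ("JJ", "red"): 0.0}
binary = {("NP", "DT", "NN"): 1.0, ("VP", "VBD", "JJ"): 1.0, ("S", "NP", "VP"): 1.0}
print(cky(["the", "screen", "was", "red"], unary, binary))
```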

LP Duality

Linear programming duality:
variables ↔ constraints, constraints ↔ variables
Optimal values are the same (when both feasible regions are bounded)

Min-max Formulation

Plug the inference LP into the estimation problem; by LP duality, the inner maximization becomes a minimization, giving a single compact convex program.

Min-max formulation summary

Formulation produces a concise QP for:
- Low-treewidth Markov networks
- Associative MNs (K = 2)
- Context-free grammars
- Bipartite matchings
Approximate for untriangulated MNs and AMNs with K > 2

*Taskar et al 04

Unfactored Primal/Dual

By QP duality: exponentially many constraints in the primal, exponentially many variables in the dual.

Factored Primal/Dual

By QP duality, the dual inherits structure from the problem-specific inference LP: its variables correspond to a decomposition (marginals) of the variables of the flat case.

The Connection

[Figure: for the OCR example, dual variables over whole alternative outputs ("bcare", "brore", "broce", "brace", with losses 2, 2, 1, 0) aggregate into fractional per-letter marginals]

Duals and Kernels

Kernel trick works: in the factored dual, local functions (log-potentials) can use kernels

3D Mapping

Laser Range Finder

GPS

IMU

Data provided by: Michael Montemerlo & Sebastian Thrun

Labels: ground, building, tree, shrub
Training: 30 thousand points; Testing: 3 million points

Segmentation results

Hand-labeled 180K test points

Model   Accuracy
SVM     68%
V-SVM   73%
M3N     93%

Fly-through

LAGRbot: Real-time Navigation

LAGRbot: Paul Vernaza & Dan Lee

Range of stereo vision limited to approximately 15 m or less

LAGRbot: Real-time Navigation

Model        Error
Local        17%
Structured   8%

160x120 images: real-time prediction/learning (~100 ms)
Current work with Paul Vernaza, Dan Lee

Hypertext Classification: WebKB dataset

Four CS department websites: 1300 pages / 3500 links
Classify each page: faculty, course, student, project, other
Train on three universities, test on the fourth

53% error reduction over SVMs
38% error reduction over RMNs

[Figure: test error (%) bar chart, scale 0-20, for SVMs, RMNs, and M^3Ns (lower is better); inference labels: relaxed LP, loopy belief propagation]

*Taskar et al 02

Word Alignment Results

Data: Hansards (Canadian Parliament)
Features induced on 1 million unsupervised sentences
Trained on 100 sentences (10,000 edges); tested on 350 sentences (35,000 edges)
[Taskar+al 05]

Model                         *Error
GIZA/IBM4 [Och & Ney 03]      6.5
Local learning + matching     5.4
Our approach                  4.9
Our approach + QAP            4.5

*Error: weighted combination of precision/recall [Lacoste-Julien+Taskar+al 06]

Modeling First-Order Effects

Monotonicity, local inversion, local fertility

QAP is NP-complete; sentences (30 words, 1k vars) take a few seconds (Mosek)
Learning: use the LP relaxation
Testing: using the LP, 83.5% of sentences and 99.85% of edges are integral

Outline

Structured prediction models
- Sequences (CRFs)
- Trees (CFGs)
- Associative Markov networks (special MRFs)
- Matchings

Structured large margin estimation
- Margins and structure
- Min-max formulation
- Linear programming inference
- Certificate formulation

Certificate formulation

Non-bipartite matchings:
- O(n^3) combinatorial algorithm
- No polynomial-size LP known

Spanning trees:
- No polynomial-size LP known
- Simple certificate of optimality

Intuition: verifying optimality is easier than optimizing
Add a compact optimality condition certifying that the target output is optimal w.r.t. the learned scores

[Figure: the six-cysteine matching example with edges i, j, k, l]

Certificate for non-bipartite matching

Alternating cycle: every other edge is in the matching
Augmenting alternating cycle: score of edges not in the matching greater than that of edges in the matching
Negate the score of edges not in the matching: an augmenting alternating cycle = a negative-length alternating cycle
Matching is optimal ⇔ no negative alternating cycles [Edmonds '65]

[Figure: the six-cysteine matching example]

Certificate for non-bipartite matching

Pick any node r as root; let d_j = length of the shortest alternating path from r to j
Triangle inequality: d_k ≤ d_j + length of edge (j, k)
Theorem: no negative-length alternating cycle ⇔ such a distance function d exists
Can be expressed as linear constraints: O(n) distance variables, O(n^2) constraints

[Figure: the six-cysteine matching example]

Certificate formulation

Formulation produces a compact QP for:
- Spanning trees
- Non-bipartite matchings
- Any problem with a compact optimality condition

*Taskar et al. ‘05

Disulfide Bonding Prediction

Data: [Swiss-Prot 39], 450 sequences (4-10 cysteines)
Features: windows around each C-C pair, physical/chemical properties
[Taskar+al 05]

Model                                 *Acc
Local learning + matching             41%
Recursive Neural Net [Baldi+al '04]   52%
Our approach (certificate)            55%

*Accuracy: % of proteins with all bonds correct

Formulation summary

- Brute force enumeration
- Min-max formulation: 'plug in' a convex program for inference
- Certificate formulation: directly guarantee optimality of the target output

Scalable Algorithms

Convex quadratic program: # of variables and constraints linear in # of parameters and edges
Can solve using off-the-shelf software (Matlab, CPLEX, Mosek, etc.) with superlinear convergence
Problem: even linear size is too large; second-order methods run out of memory (quadratic)
Need scalable, memory-efficient methods (space/time tradeoff):
- Structured SMO [Taskar+al 04]
- Structured exponentiated gradient [Bartlett+al 04, Collins+al 07]
These don't work for matchings and min-cuts

Saddle-point Problem

The factored formulation is a saddle-point problem: min over w of max over z in Z of L(w, z).

Extragradient Method [Korpelevich 76]

Prediction:  w' = Π_W[w − η ∇_w L(w, z)],   z' = Π_Z[z + η ∇_z L(w, z)]
Correction:  w ← Π_W[w − η ∇_w L(w', z')],  z ← Π_Z[z + η ∇_z L(w', z')]
where Π is Euclidean projection and η the step size.

Theorem: the extragradient method converges linearly.

Key computation is the Euclidean projection: onto W usually easy, onto Z harder. (A toy sketch follows.)
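A toy sketch of the prediction/correction pattern on a simple, hypothetical saddle-point problem (a regularized bilinear game with a box-constrained z, not the tutorial's actual min-max objective):

```python
import numpy as np

def extragradient(A, c, lam=1.0, steps=500, eta=0.05):
    """Extragradient iterations for the saddle point
    min_w max_{z in [0,1]^m}  lam/2 ||w||^2 + w @ A @ z - c @ z."""
    w = np.zeros(A.shape[0])
    z = np.full(A.shape[1], 0.5)
    proj_z = lambda v: np.clip(v, 0.0, 1.0)            # Euclidean projection onto the box
    grad_w = lambda w_, z_: lam * w_ + A @ z_
    grad_z = lambda w_, z_: A.T @ w_ - c
    for _ in range(steps):
        wp = w - eta * grad_w(w, z)                    # prediction step
        zp = proj_z(z + eta * grad_z(w, z))
        w = w - eta * grad_w(wp, zp)                   # correction step: gradients at the predicted point
        z = proj_z(z + eta * grad_z(wp, zp))
    return w, z

rng = np.random.default_rng(0)
w, z = extragradient(rng.normal(size=(3, 4)), rng.normal(size=4))
print(np.round(w, 3), np.round(z, 3))
```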

Projection for Bipartite Matchings: Min-Cost Flow

Min-cost quadratic flow computes the projection
O(N^1.5) complexity for fixed precision (N = number of edges)
Reduction to flow for min-cuts is also possible [Taskar+al 06]

[Figure: flow network from source s through the English words (What, is, the, anticipated, cost) to the French words (quel, est, le, coût, prévu) to sink t; all capacities = 1; the flow cost encodes the projection]

Structured Extragradient

Extragradient method [Korpelevich 76, Nesterov 03]: linear convergence
Key computation: projection (min-cost quadratic flow for matchings & cuts)
Extensions (using Bregman divergence): dynamic programming for decomposable models
"Online-envy": want memory proportional to # of parameters, independent of # of examples
Solves problems with a million edges
[Taskar+al 06]

Other approaches

Online methods: online updates with respect to most violated constraints [Crammer+al 05, 06]
Regression-based methods: regression from input to a transformed output space [Cortes+al 07]
Learning to search: learn a classifier to guide local search for the structured solution [Daume+al 05]
Many others

Generalization Bounds

[Cartoon: "If the past is any indication of the future, he'll have a cruller."]

Generalization Bounds

Several pointers:
- Perceptron bound [Collins 01]: assumes separability with margin; bound on 0-1 loss
- Covering-number bound [Taskar+al 03]: bound on Hamming loss; logarithmic dependence on # of variables in each y
- Regret bounds [Crammer+al 06]: online-style guarantees for more general losses
- PAC-Bayes bound [McAllester 07]: tighter analysis, consistency
- Bounds for learning with approximate inference [Kulesza & Pereira, today]

Open Questions for Large-Margin Estimation

Statistical consistency: hinge loss is not consistent for non-binary output [see Tewari & Bartlett 05, McAllester 07]

Semi-supervised: Laplacian regularization [Altun+McAllester 05], co-regularization [Brefeld+al 05]

Latent variables: machine translation [Liang+al 06], CCG parsing to logical form [Zettlemoyer+Collins 07]

Learning with approximate inference / LP relaxations: does constant-factor approximate inference guarantee anything a priori about learning?
No [see Kulesza & Pereira, tonight]: a simple 3-node counterexample is separable with exact inference but not separable with approximate inference.
Question: what other (stronger?) approximate inference guarantees translate into learning guarantees?

References

Edited collection: G. Bakir+al 07, Predicting Structured Data, MIT Press
Code: SVMstruct by Thorsten Joachims
Slides and more papers at: http://www.cis.upenn.edu/~taskar

Thanks!

Segmentation Model: Min-Cut

Local evidence (per-point scores for labels 0/1)
Spatial smoothness (attractive edge potentials)

Computing the argmax is hard in general, but if the edge potentials are attractive, use a min-cut algorithm; for the multiclass case, use multiway-cut or an LP relaxation. (A min-cut sketch follows.)

[Greig+al 89, Boykov+al 99, Kolmogorov & Zabih 02, Taskar+al 04]
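A minimal sketch of binary MAP inference by min-cut (the Greig et al. style construction), using networkx and hypothetical unary/pairwise scores. Nodes ending up on the sink side of the cut get label 1.

```python
import networkx as nx

def binary_map_by_mincut(unary, pairs, smooth):
    """MAP labeling for a binary MRF with attractive (Potts) pairwise terms.
    unary[i] = (cost of label 0, cost of label 1); pairs = list of (i, j);
    smooth[(i, j)] >= 0 penalizes i and j disagreeing.  Min cut value = min energy."""
    G = nx.DiGraph()
    for i, (cost0, cost1) in enumerate(unary):
        G.add_edge("s", i, capacity=cost1)     # cut iff i is on the sink side (label 1)
        G.add_edge(i, "t", capacity=cost0)     # cut iff i is on the source side (label 0)
    for (i, j) in pairs:
        G.add_edge(i, j, capacity=smooth[(i, j)])
        G.add_edge(j, i, capacity=smooth[(i, j)])
    cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
    labels = [1 if i in sink_side else 0 for i in range(len(unary))]
    return labels, cut_value

# toy 3-point chain: point 1 slightly prefers label 1, but smoothness pulls it to its neighbors
unary = [(0.0, 2.0), (1.0, 0.9), (0.0, 2.0)]       # (cost of 0, cost of 1) per point
pairs = [(0, 1), (1, 2)]
smooth = {(0, 1): 1.5, (1, 2): 1.5}
print(binary_map_by_mincut(unary, pairs, smooth))  # point 1 flips to 0 despite its unary term
```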
