Structured Prediction Cascades Ben Taskar David Weiss, Ben Sapp, Alex Toshev University of Pennsylvania


Post on 15-Jan-2016


Page 1

Structured Prediction Cascades

Ben Taskar

David Weiss, Ben Sapp, Alex Toshev

University of Pennsylvania

Page 2

Supervised Learning

Learn from

• Regression:

• Binary Classification:

• Multiclass Classification:

• Structured Prediction:

Page 3

Handwriting Recognition

‘structured’

x y

Page 4

Machine Translation

x y

‘Ce n'est pas un autre problème de classification.’

‘This is not another classification problem.’

Page 5

Pose Estimation

x y

© Arthur Gretton

Page 6

Structured Models

space of feasible outputs

scoring function

parts = cliques, productions

Complexity of inference depends on “part” structure
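The formulas on this slide did not survive extraction. As a hedged sketch, a standard way to write a part-based structured model is shown below; the symbols (the feasible set \mathcal{Y}(x), weights \theta, part features f_c, and clique set \mathcal{C}) are assumed notation, not recovered from the slide.

```latex
\[
h(x) \;=\; \arg\max_{y \in \mathcal{Y}(x)} \theta_x(y),
\qquad
\theta_x(y) \;=\; \theta^\top f(x, y) \;=\; \sum_{c \in \mathcal{C}} \theta^\top f_c(x, y_c)
\]
```

Here \mathcal{Y}(x) plays the role of the space of feasible outputs, \theta_x(y) the scoring function, and each c a part (a clique or a production). The cost of the argmax grows with the number of assignments y_c each part can take, which is why inference complexity depends on the part structure.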

Page 7

Supervised Structured Prediction

Learning Prediction

Discriminative estimation of θ

Data

Model:

Intractable/impractical for complex models

Intractable/impractical for complex models

• 3rd order OCR model = 1 mil states * length
• Berkeley English grammar = 4 mil productions * length^3
• Tree-based pose model = 100 mil states * # joints

Page 8

Approximation vs. Computation

• Usual trade-off: approximation vs. estimation
  – Complex models need more data
  – Error from over-fitting

• New trade-off: approximation vs. computation
  – Complex models need more time/memory
  – Error from inexact inference

• This talk: enable complex models via cascades

Page 9

An Inspiration: Viola Jones Face Detector

Scanning window at every location, scale and orientation

Page 10

Classifier Cascade

• Most patches are non-face
• Filter out easy cases fast!!
• Simple features first
• Low precision, high recall
• Learned layer-by-layer
• Next layer more complex

[Diagram: classifier cascade C1 → C2 → C3 → … → Cn; each stage can reject a patch as Non-face, and a patch that passes Cn is labeled Face]
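The early-exit behaviour of a classifier cascade can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the stage functions `mean_intensity` and `contrast`, their thresholds, and the list-of-floats patch representation are all hypothetical and are not the features used by the Viola-Jones detector.

```python
def run_cascade(patch, stages):
    """Run a patch through (score_fn, threshold) stages in order.

    Returns (accepted, depth): accepted is False as soon as one stage
    scores the patch below its threshold; depth is the number of stages
    that were actually evaluated.
    """
    for depth, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(patch) < threshold:
            return False, depth  # rejected early, later stages never run
    return True, len(stages)


# Hypothetical stages: a cheap test first, a more selective one second.
def mean_intensity(patch):
    return sum(patch) / len(patch)


def contrast(patch):
    return max(patch) - min(patch)


STAGES = [(mean_intensity, 0.2), (contrast, 0.5)]

if __name__ == "__main__":
    for patch in ([0.0, 0.1, 0.1], [0.3, 0.3, 0.4], [0.1, 0.5, 0.9]):
        print(patch, run_cascade(patch, STAGES))
```

Because most inputs are rejected by the first, cheapest stage, the average cost per patch stays close to the cost of that stage alone.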

Page 11

Related Work

• Global Thresholding and Multiple-Pass Parsing. J. Goodman, 97.
• A maximum-entropy-inspired parser. E. Charniak, 00.
• Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. E. Charniak & M. Johnson, 05.
• TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. X. Carreras, M. Collins, and T. Koo, 08.
• Coarse-to-Fine Natural Language Processing. S. Petrov, 09.
• Coarse-to-fine face detection. F. Fleuret, D. Geman, 01.
• Robust real-time object detection. P. Viola, M. Jones, 02.
• Progressive search space reduction for human pose estimation. V. Ferrari, M. Marin-Jimenez, A. Zisserman, 08.

Page 12

What’s a Structured Cascade?

• What to filter?
  – Clique assignments

• What are the layers?
  – Higher order models
  – Higher resolution models

• How to learn filters?
  – Novel convex loss
  – Simple online algorithm
  – Generalization bounds

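[Diagram: filter cascade F1 → F2 → F3 → … → Fn, mirroring the classifier cascade; the rejected outputs of each filter are marked "??????" and the final output is ‘structured’]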

Page 13

Trade-off in Learning Cascades

• Accuracy: Minimize the number of errors incurred by each level

• Efficiency: Maximize the number of filtered assignments at each level

[Diagram: Filter 1 → Filter 2 → … → Filter D → Predict]

Page 14

Max-marginals (Sequences)

[Diagram: sequence lattice with four positions, each with candidate labels a, b, c, d]

Score of an output:

Compute max (*bc*)

Max marginal:
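The max-marginal of an edge assignment such as (*bc*) is the score of the best complete output that uses that assignment. The sketch below computes it for every edge of a chain with a forward and a backward max pass, and includes a brute-force version for checking. The names `node` and `edge` and the decomposition of the score into unary and pairwise terms are illustrative assumptions, not notation from the slides.

```python
from itertools import product


def edge_max_marginals(node, edge):
    """Max-marginals for all edge assignments of a chain model.

    node[t][a]    : score of label a at position t
    edge[t][a][b] : score of label a at position t followed by b at t + 1
    Returns mm with mm[t][a][b] = best total score over all label
    sequences that have label a at position t and label b at t + 1.
    """
    T, K = len(node), len(node[0])
    fwd = [list(node[0])] + [[0] * K for _ in range(T - 1)]
    bwd = [[0] * K for _ in range(T - 1)] + [list(node[T - 1])]
    for t in range(1, T):  # best prefix ending in label b at position t
        for b in range(K):
            fwd[t][b] = node[t][b] + max(
                fwd[t - 1][a] + edge[t - 1][a][b] for a in range(K))
    for t in range(T - 2, -1, -1):  # best suffix starting in label a at t
        for a in range(K):
            bwd[t][a] = node[t][a] + max(
                edge[t][a][b] + bwd[t + 1][b] for b in range(K))
    return [[[fwd[t][a] + edge[t][a][b] + bwd[t + 1][b] for b in range(K)]
             for a in range(K)] for t in range(T - 1)]


def brute_force_max_marginal(node, edge, t, a, b):
    """Same quantity by enumerating every label sequence (for checking)."""
    T, K = len(node), len(node[0])
    best = float("-inf")
    for y in product(range(K), repeat=T):
        if y[t] == a and y[t + 1] == b:
            score = sum(node[i][y[i]] for i in range(T))
            score += sum(edge[i][y[i]][y[i + 1]] for i in range(T - 1))
            best = max(best, score)
    return best


# Toy chain: 3 positions, 2 labels, integer scores.
NODE = [[1, 0], [0, 2], [3, 1]]
EDGE = [[[0, 1], [2, 0]], [[1, 0], [0, 3]]]

if __name__ == "__main__":
    print(edge_max_marginals(NODE, EDGE))
```

At every position the largest max-marginal equals the score of the best overall sequence, so a threshold below that score always leaves at least one complete path.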

Page 15

Filtering with Max-marginals

• Set threshold
• Filter clique assignment if its max-marginal falls below the threshold

[Diagram: sequence lattice with four positions, each with candidate labels a, b, c, d]

Page 16

Filtering with Max-marginals

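• Set threshold
• Filter clique assignment if its max-marginal falls below the threshold

[Diagram: the same four-position lattice over labels a, b, c, d, with the annotation "Remove edge bc"]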

Page 17

Why Max-marginals?

• Valid path guarantee
  – Filtering leaves at least one valid global assignment

• Faster inference
  – No exponentiation, O(k) vs. O(k log k) in some cases

• Convex estimation
  – Simple stochastic subgradient algorithm

• Generalization bounds for error/efficiency
  – Guarantees on expected trade-off

Page 18

Choosing a Threshold

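• Threshold must be specific to input x
  – Max-marginal scores are not normalized

• Keep top-K max-marginal assignments?
  – Hard(er) to handle and analyze

• Convex alternative: max-mean-max function:

$$t_x(\theta, \alpha) = \alpha \max_{y} \theta_x(y) + (1 - \alpha)\,\frac{1}{m} \sum_{c,\, y_c} \theta^\star_x(y_c)$$

First term: max score. Second term: mean max-marginal, where \theta^\star_x(y_c) is the max-marginal of clique assignment y_c and m = #edges.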

Page 19

Choosing a Threshold (cont)

[Figure: max-marginal scores, with markers for the mean max-marginal, the max score, and the score of the truth. Annotations: range of possible thresholds; α = efficiency level]

Page 20

Example OCR Cascade

Letter:  a    b    c     d   e   f    g   h    i    j     k    l    …
Score:   144  269  -137  80  52  -42  49  360  -27  -774  368  -24

Mean ≈ 0, Max = 368, α = 0.55, Threshold ≈ 204

Filtered: a (144), c (-137), d (80), e (52), f (-42), g (49), i (-27), j (-774), l (-24)

Kept: b h k …
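The filtering step on this slide can be replayed in Python. The sketch below assumes the threshold is a convex combination of the max score and the mean max-marginal, weighted by α; the function names are hypothetical. Only the twelve scores visible on the slide are used, so the computed mean and threshold differ from the slide's "Mean ≈ 0" and "Threshold ≈ 204", which presumably cover the elided letters as well.

```python
def max_mean_max_threshold(max_marginals, alpha):
    """alpha * (max score) + (1 - alpha) * (mean max-marginal)."""
    values = list(max_marginals.values())
    return alpha * max(values) + (1 - alpha) * sum(values) / len(values)


def keep_above(max_marginals, threshold):
    """Keep the states whose max-marginal is strictly above the threshold."""
    return [s for s, v in max_marginals.items() if v > threshold]


# Scores for the letters a..l as shown on the slide (the rest are elided).
VISIBLE_SCORES = {
    "a": 144, "b": 269, "c": -137, "d": 80, "e": 52, "f": -42,
    "g": 49, "h": 360, "i": -27, "j": -774, "k": 368, "l": -24,
}

if __name__ == "__main__":
    t_visible = max_mean_max_threshold(VISIBLE_SCORES, alpha=0.55)
    print("threshold from visible scores:", t_visible)
    print("kept at that threshold:", keep_above(VISIBLE_SCORES, t_visible))
    print("kept at the slide's threshold:", keep_above(VISIBLE_SCORES, 204))
```

With either threshold the surviving states among the visible letters are b, h and k, matching the slide.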

Page 21

Example OCR Cascade

b h k

a e n u r

a b d g

a c e g n o u

a e m n u w

b h k

Page 22

Example OCR Cascade

Surviving unigram states → bigram states, per position:

b h k → ▪b ▪h ▪k

a e n u r → ba ha ka be he …

a b d g → aa ab ad ag ea …

a c e g n o u → aa ac ae ag an …

a e m n u w → aa ae am an au …
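Moving from the first-order to the second-order level amounts to pairing each surviving letter with the surviving letters at the previous position. A minimal Python sketch, assuming the per-position survivor lists from the slide and a start symbol "▪"; the function name `expand_to_bigrams` is hypothetical.

```python
def expand_to_bigrams(unigram_states, start="▪"):
    """Build bigram states (previous letter + current letter) per position.

    unigram_states is a list with one list of surviving letters per
    position; the first position is paired with the start symbol.
    """
    levels = []
    previous = [start]
    for current in unigram_states:
        levels.append([p + c for c in current for p in previous])
        previous = current
    return levels


# Surviving letters per position, as listed on the slide.
SURVIVORS = [
    ["b", "h", "k"],
    ["a", "e", "n", "u", "r"],
    ["a", "b", "d", "g"],
    ["a", "c", "e", "g", "n", "o", "u"],
    ["a", "e", "m", "n", "u", "w"],
]

if __name__ == "__main__":
    for position, states in enumerate(expand_to_bigrams(SURVIVORS), start=1):
        print(position, len(states), states[:5])
```

The second position, for example, has 3 × 5 = 15 bigram states instead of the 26 × 26 it would have without filtering, which is what makes the higher-order level affordable.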

Page 23

Example OCR Cascade

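Bigram states with their max-marginal scores, per position:

Position 1:  ▪b 1440   ▪h 1558   ▪k 1575
Position 2:  ba 1440   ha 1558   ka 1575   be 1413   he 1553   …
Position 3:  aa 1400   ab 1480   ad 1575   ag 1397   ea 1393   …
Position 4:  aa 1257   ac 1285   ae 1302   ag 1294   an 1356   …
Position 5:  aa 1336   ae 1306   am 1390   an 1346   au 1306   …

Mean ≈ 1412, Max = 1575, α = 0.44, Threshold ≈ 1502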

Page 24

Example OCR Cascade

Filtered (max-marginal below threshold):

Position 1:  ▪b 1440
Position 2:  ba 1440   be 1413
Position 3:  aa 1400   ab 1480   ag 1397   ea 1393
Position 4:  aa 1257   ac 1285   ae 1302   ag 1294   an 1356
Position 5:  aa 1336   ae 1306   am 1390   an 1346   au 1306

Kept:

Position 1:  ▪h ▪k
Position 2:  ha ka he …
Position 3:  ad …

Mean ≈ 1412, Max = 1575, α = 0.44, Threshold ≈ 1502

Page 25

Example OCR Cascade

Surviving bigram states after filtering, per position:

Position 1:  ▪h ▪k
Position 2:  ha ka he ke kn
Position 3:  ad ed nd
Position 4:  do
Position 5:  ow om

Mean ≈ 1412, Max = 1575, α = 0.44, Threshold ≈ 1502

Page 26

Example OCR Cascade

Surviving bigram states → trigram states, per position:

▪h ▪k → ▪▪h ▪▪k

ha ka he ke kn → ▪ha ▪ka ▪he ▪ke

ad ed nd → had kad hed ked

do → ado edo ndo

ow om → dow dom

Page 27

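Example: Pictorial Structure Cascade

State-space resolution per cascade level: 10x10x12 → 10x10x24 → 40x40x24 → 80x80x24 → 110×122×24

[Diagram: pictorial structure with parts H (head), T (torso), ULA and LLA (upper and lower left arm), URA and LRA (upper and lower right arm)]

k ≈ 320,000, k² ≈ 100 mil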
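The number of states per part at each cascade level follows directly from the grid resolutions on the slide. The short sketch below assumes each triple is a discretization of two image coordinates and one orientation, which is an interpretation rather than something the slide states.

```python
# Resolutions per cascade level as listed on the slide; reading each
# triple as (x bins, y bins, angle bins) is an assumption.
LEVELS = [(10, 10, 12), (10, 10, 24), (40, 40, 24), (80, 80, 24), (110, 122, 24)]


def num_states(level):
    """Number of states per part for one resolution triple."""
    x_bins, y_bins, angle_bins = level
    return x_bins * y_bins * angle_bins


STATE_COUNTS = [num_states(level) for level in LEVELS]

if __name__ == "__main__":
    for level, count in zip(LEVELS, STATE_COUNTS):
        print(level, count)
```

The finest level gives 322,080 states per part, which agrees with the slide's k ≈ 320,000.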

Page 28

Quantifying Loss

• Filter loss

If score(truth) > threshold, all true states are safe

• Efficiency loss

Proportion of unfiltered clique assignments
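The two losses can be written out as follows. This is a hedged reconstruction from the slide's wording, using assumed notation: \theta_x(y) is the score of the true output y, \theta^\star_x(y_c) is the max-marginal of clique assignment y_c, t_x(\theta, \alpha) is the threshold, and m is the number of clique assignments.

```latex
\[
L_f(x, y; \theta, \alpha) \;=\; \mathbf{1}\big[\, \theta_x(y) \le t_x(\theta, \alpha) \,\big],
\qquad
L_e(x; \theta, \alpha) \;=\; \frac{1}{m} \sum_{c,\, y_c} \mathbf{1}\big[\, \theta^\star_x(y_c) > t_x(\theta, \alpha) \,\big]
\]
```

The filter loss is zero whenever the truth scores above the threshold: every clique assignment of the truth then has a max-marginal at least as large as the truth's score, so none of them is filtered.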

Page 29

Learning One Cascade Level

• Fix α, solve convex problem for θ

Minimize filter mistakes at efficiency level α

No filter mistakes Margin w/ slack
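The optimization problem itself was lost in extraction. One form consistent with the slide's annotations is sketched below; the regularization weight \lambda, the slack variables \xi_i and the margin terms \ell_i are assumed notation, with \theta_x(y) the score of output y and t_x(\theta, \alpha) the threshold.

```latex
\[
\min_{\theta,\; \xi \ge 0} \;\; \frac{\lambda}{2} \lVert \theta \rVert^2 + \frac{1}{n} \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
\theta_{x^i}(y^i) \;\ge\; t_{x^i}(\theta, \alpha) + \ell_i - \xi_i, \qquad i = 1, \dots, n
\]
```

With \xi_i = 0 the constraint says the truth scores above the threshold by a margin (no filter mistakes); the slack \xi_i relaxes this for hard examples. For fixed \alpha the threshold is a convex function of \theta (a maximum and an average of maxima of linear functions), so the problem is convex.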

Page 30

An Online Algorithm

• Stochastic sub-gradient update:

Features of truth

Convex combination: Features of best guess + Average features of max marginal “witnesses”
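A single stochastic subgradient step can be sketched in Python from the three ingredients named on the slide. This is an illustrative sketch, not the authors' exact update: the function name, the learning rate `eta`, the regularization weight `lam`, the unit `margin`, and the plain-list feature vectors are all assumptions. A "witness" is taken to be the best-scoring output that contains a given clique assignment.

```python
def sgd_step(theta, f_truth, f_best, f_witnesses, alpha, eta, lam, margin=1.0):
    """One subgradient step on a hinge loss between truth score and threshold.

    theta       : current weight vector
    f_truth     : features of the true output
    f_best      : features of the best-scoring output (the "best guess")
    f_witnesses : features of each max-marginal witness, one per assignment
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    m = len(f_witnesses)
    f_mean = [sum(column) / m for column in zip(*f_witnesses)]
    # Convex combination: features of best guess + average witness features.
    # Its dot product with theta equals the max-mean-max threshold.
    f_threshold = [alpha * b + (1 - alpha) * w for b, w in zip(f_best, f_mean)]

    shrunk = [(1 - eta * lam) * t for t in theta]  # step on the regularizer
    if dot(theta, f_truth) > dot(theta, f_threshold) + margin:
        return shrunk  # truth clears the threshold by the margin
    return [s + eta * (ft - fth)
            for s, ft, fth in zip(shrunk, f_truth, f_threshold)]


if __name__ == "__main__":
    print(sgd_step([0.0, 0.0], [1, 0], [0, 1], [[0, 1], [1, 0]],
                   alpha=0.5, eta=1.0, lam=0.0))
```

The update moves the weights toward the features of the truth and away from the threshold's features, which raises the truth's score relative to the threshold.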

Page 31

Generalization Bounds

• W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data

Expected loss on true distribution

Page 32

Generalization Bounds

• W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data

Empirical γ-ramp upper bound

[Figure: γ-ramp function, with axis ticks at 0 and 1]

Page 33

Generalization Bounds

• W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data

n: number of examples
m: number of clique assignments
number of cliques
B

Page 34

Generalization Bounds

• W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data

Similar bound holds for L_e and all α ∈ [0,1]

Page 35

OCR Experiments

Dataset: http://www.cis.upenn.edu/~taskar/ocr

% Error (bar chart values):

Model       Character Error   Word Error
1st order   22.5              73.35
2nd order   14.31             50.56
3rd order   12.05             26.17
4th order   7.75              15.54

Page 36

Efficiency Experiments

• POS Tagging (WSJ + CONLL datasets)
  – English, Bulgarian, Portuguese

• Compare 2nd order model filters
  – Structured perceptron (max-sum)
  – SCP w/ α ∈ [0, 0.2, 0.4, 0.6, 0.8] (max-sum)
  – CRF log-likelihood (sum-product marginals)

• Tightly controlled
  – Use random 40% of training set
  – Use development set to fit regularization parameter λ and α for all methods

Page 37

POS Tagging Results

[Plot: English. Axis label: Error Cap (%) on Development Set. Series: Max-sum (SCP), Max-sum (0-1), Sum-product (CRF)]

Page 38

POS Tagging Results

[Plot: Portuguese. Axis label: Error Cap (%) on Development Set. Series: Max-sum (SCP), Max-sum (0-1), Sum-product (CRF)]

Page 39

POS Tagging Results

[Plot: Bulgarian. Axis label: Error Cap (%) on Development Set. Series: Max-sum (SCP), Max-sum (0-1), Sum-product (CRF)]

Page 40

English POS Cascade (2nd order)

                   Full      SCP     CRF      Taglist
Accuracy (%)       96.83     96.82   96.84    ---
Filter Loss (%)    0         0.12    0.024    0.118
Test Time (ms)     173.28    1.56    4.16     10.6
Avg. Num States    1935.7    3.93    11.845   95.39

The cascade is efficient.

DT NN VRB ADJ

Page 41

Pose Estimation (Buffy Dataset)

method                torso     head     upper arms   lower arms   total
Ferrari et al. 08     --        --       --           --           61.4%
Ferrari et al. 09     --        --       --           --           74.5%
Andriluka et al. 09   98.3%     95.7%    86.8%        51.7%        78.8%
Eichner et al. 09     98.72%    97.87%   92.8%        59.79%       80.1%
CPS (ours)            100.00%   99.57%   96.81%       62.34%       86.31%

“Ferrari score”: percent of parts correct within radius

Page 42

Pose Estimation

Page 43

Cascade Efficiency vs. Accuracy

Page 44

Features: Shape/Segmentation Match

Page 45

Features: Contour Continuation

Page 46

Conclusions

• Novel loss significantly improves efficiency

• Principled learning of accurate cascades

• Deep cascades focus structured inference and allow rich models

• Open questions:
  – How to learn cascade structure
  – Dealing with intractable models
  – Other applications: factored dynamic models, grammars

Page 47

Thanks!

Structured Prediction Cascades, Weiss & Taskar, AISTATS10

Cascaded Models for Articulated Pose Estimation, Sapp, Toshev & Taskar, ECCV10