machine learning of structured outputs
TRANSCRIPT
Machine Learning of Structured Outputs
Christoph Lampert, IST Austria
(Institute of Science and Technology Austria), Klosterneuburg
Feb 2, 2011
Machine Learning of Structured Outputs
Overview:
- Introduction to Structured Learning
- Structured Support Vector Machines
- Applications in Computer Vision

Slides available at http://www.ist.ac.at/~chl
What is Machine Learning?
Definition [T. Mitchell]: Machine Learning is the study of computer algorithms that improve their performance in a certain task through experience.
Example: Backgammon
- Task: play backgammon
- Experience: self-play
- Performance measure: games won against humans

Example: Object Recognition
- Task: determine which objects are visible in images
- Experience: annotated training data
- Performance measure: objects recognized correctly
What is structured data?
Definition [ad hoc]: Data is structured if it consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Text Molecules / Chemical Structures
Documents/HyperText Images
The right tool for the problem.
Example: Machine Learning for/of Structured Data
[Figure: image, body model, model fit]

Task: human pose estimation
Experience: images with manually annotated body pose
Performance measure: number of correctly localized body parts
Other tasks:
Natural Language Processing:
- Automatic Translation (output: sentences)
- Sentence Parsing (output: parse trees)

Bioinformatics:
- RNA Structure Prediction (output: bipartite graphs)
- Enzyme Function Prediction (output: path in a tree)

Speech Processing:
- Automatic Transcription (output: sentences)
- Text-to-Speech (output: audio signal)

Robotics:
- Planning (output: sequence of actions)
This talk: only Computer Vision examples
"Normal" Machine Learning:f : X → R.
- inputs x ∈ X can be any kind of objects: images, text, audio, sequences of amino acids, . . .
- output y is a real number: classification, regression, . . .
- many ways to construct f:
  - f(x) = a · ϕ(x) + b,
  - f(x) = decision tree,
  - f(x) = neural network
Structured Output Learning:f : X → Y .
- inputs x ∈ X can be any kind of objects
- outputs y ∈ Y are complex (structured) objects: images, parse trees, folds of a protein, . . .
- how to construct f?
Predicting Structured Outputs: Image Denoising
f : x ↦ y — input: images, output: denoised images
input set X = {grayscale images} =̂ [0, 255]^{M·N}
output set Y = {grayscale images} =̂ [0, 255]^{M·N}
energy minimization: f(x) := argmin_{y∈Y} E(x, y)

E(x, y) = λ ∑_i (x_i − y_i)² + µ ∑_{i,j} |y_i − y_j|
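As an illustration, the denoising energy above can be evaluated directly. A minimal sketch for 1-D signals (the 2-D case would sum the pairwise term over grid neighbors instead); the function name and default weights are illustrative choices, not from the slides:

```python
def denoising_energy(x, y, lam=1.0, mu=0.5):
    """E(x, y) = lam * sum_i (x_i - y_i)^2 + mu * sum_{i,j} |y_i - y_j|
    for a 1-D signal, with neighbor pairs (i, i+1)."""
    data_term = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    smooth_term = sum(abs(y[i] - y[i + 1]) for i in range(len(y) - 1))
    return lam * data_term + mu * smooth_term

# The noisy input itself has zero data term but a nonzero smoothness term;
# a flat output trades data fidelity for smoothness.
x = [10, 200, 10, 10]
print(denoising_energy(x, x))                 # 0 + 0.5 * (190 + 190 + 0) = 190.0
print(denoising_energy(x, [10, 10, 10, 10]))  # 1.0 * 190**2 + 0 = 36100.0
```

Minimizing this energy over all outputs y is exactly the argmin in f(x) above.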
Predicting Structured Outputs: Human Pose Estimation
f : (image, body model) ↦ model fit
input set X = {images}
output set Y = {positions/angles of K body parts} =̂ R^{4K}

energy minimization: f(x) := argmin_{y∈Y} E(x, y)

E(x, y) = ∑_i w_i^⊤ ϕ_fit(x_i, y_i) + ∑_{i,j} w_ij^⊤ ϕ_pose(y_i, y_j)
Predicting Structured Outputs: Shape Matching
input: image pairs
output: mapping y : x_i ↔ y(x_i)

scoring function F(x, y) = ∑_i w_i^⊤ ϕ_sim(x_i, y(x_i)) + ∑_{i,j} w_ij^⊤ ϕ_dist(x_i, x_j, y(x_i), y(x_j))

predict f : X → Y by f(x) := argmax_{y∈Y} F(x, y)
[J. McAuley et al.: "Robust Near-Isometric Matching via Structured Learning of Graphical Models", NIPS, 2008]
Predicting Structured Outputs: Tracking (by Detection)
input:image
output:object position
input set X = {images}
output set Y = R² (box center) or R⁴ (box coordinates)

predict f : X → Y by f(x) := argmax_{y∈Y} F(x, y)

scoring function F(x, y) = w^⊤ ϕ(x, y), e.g. an SVM score
images: [C. L., Jan Peters, "Active Structured Learning for High-Speed Object Detection", DAGM 2009]
Predicting Structured Outputs: Summary
Image Denoising
y = argmin_y E(x, y),  E(x, y) = w_1 ∑_i (x_i − y_i)² + w_2 ∑_{i,j} |y_i − y_j|

Pose Estimation
y = argmin_y E(x, y),  E(x, y) = ∑_i w_i^⊤ ϕ(x_i, y_i) + ∑_{i,j} w_ij^⊤ ϕ(y_i, y_j)

Point Matching
y = argmax_y F(x, y),  F(x, y) = ∑_i w_i^⊤ ϕ(x_i, y_i) + ∑_{i,j} w_ij^⊤ ϕ(y_i, y_j)

Tracking
y = argmax_y F(x, y),  F(x, y) = w^⊤ ϕ(x, y)
Unified Formulation
Predict structured output by maximization

y = argmax_{y∈Y} F(x, y)

of a compatibility function

F(x, y) = ⟨w, ϕ(x, y)⟩

that is linear in a parameter vector w.
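For a finite output set, this unified formulation fits in a few lines. A minimal sketch, where `phi`, `w`, and the three-class output set are toy stand-ins (real structured problems cannot enumerate Y and need the inference algorithms on the next slide):

```python
def predict(x, w, phi, Y):
    """Structured prediction f(x) = argmax_{y in Y} <w, phi(x, y)>,
    by exhaustive search over a small, finite output set Y."""
    return max(Y, key=lambda y: sum(wi * fi for wi, fi in zip(w, phi(x, y))))

# Toy example: Y = {0, 1, 2}, phi is a one-hot indicator scaled by x.
phi = lambda x, y: [x if y == k else 0.0 for k in range(3)]
w = [0.2, -0.5, 1.0]
print(predict(3.0, w, phi, [0, 1, 2]))  # class 2 scores 3.0, the highest
```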
Structured Prediction: how to evaluate argmaxy F(x , y)?
loop-free graphs (chain, tree): Shortest-Path / Belief Propagation (BP)
loopy graphs (grid, arbitrary graph): GraphCut, approximate inference (e.g. loopy BP)
Structured Learning: how to learn F(x , y) from examples?
Machine Learning for Structured Outputs
Learning Problem:
Task: predict structured objects f : X → Y
Experience: example pairs {(x¹, y¹), . . . , (x^N, y^N)} ⊂ X × Y: typical inputs with "correct" outputs for them.
Performance measure: ∆ : Y × Y → R
Our choice:
parametric family: F(x, y; w) = ⟨w, ϕ(x, y)⟩
prediction method: f(x) = argmax_{y∈Y} F(x, y; w)
Task: determine a "good" w
Reminder: regularized risk minimization
Find w for decision function F = ⟨w, ϕ(x, y)⟩ by

min_{w∈R^d}  λ‖w‖² + ∑_{n=1}^N ℓ(y^n, F(x^n, ·; w))

Regularization + empirical loss (on training data)

Logistic Loss: Conditional Random Fields
ℓ(y^n, F(x^n, ·; w)) = log ∑_{y∈Y} exp[F(x^n, y; w) − F(x^n, y^n; w)]

Hinge Loss: Maximum Margin Training
ℓ(y^n, F(x^n, ·; w)) = max_{y∈Y} [∆(y^n, y) + F(x^n, y; w) − F(x^n, y^n; w)]

Exponential Loss: Boosting
ℓ(y^n, F(x^n, ·; w)) = ∑_{y∈Y\{y^n}} exp[F(x^n, y; w) − F(x^n, y^n; w)]
Maximum Margin Trainingof Structured Models
(Structured SVMs)
Structured Support Vector Machine
Structured Support Vector Machine:
min_{w∈R^d}  ½‖w‖² + (C/N) ∑_{n=1}^N max_{y∈Y} [∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩]

Unconstrained optimization, convex, non-differentiable objective.
[I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun. "Large Margin Methods for Structured and Interdependent
Output Variables", JMLR, 2005.]
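The unconstrained objective above can be evaluated directly when Y is small enough to enumerate. A sketch under that assumption (the function name and toy problem are illustrative; real problems replace the inner max by loss-augmented inference):

```python
def ssvm_objective(w, data, phi, delta, Y, C=1.0):
    """S-SVM objective: 1/2 ||w||^2 + C/N * sum_n max_y
    [delta(yn, y) + <w, phi(xn, y)> - <w, phi(xn, yn)>]."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    hinge = sum(
        max(delta(yn, y) + dot(w, phi(xn, y)) - dot(w, phi(xn, yn)) for y in Y)
        for xn, yn in data
    )
    return 0.5 * dot(w, w) + C / len(data) * hinge

# Toy binary problem: phi maps (x, y) to a single signed feature.
phi = lambda x, y: [x] if y == 1 else [-x]
delta = lambda yn, y: 0.0 if yn == y else 1.0
# For w=[2], x=1, yn=1: inner max over y is max(0, 1 - 2 - 2) = 0,
# so the objective is 0.5 * 2^2 = 2.0.
print(ssvm_objective([2.0], [(1.0, 1)], phi, delta, Y=[0, 1]))  # 2.0
```

Note the max always includes y = y^n, so each per-example term is non-negative, which makes the hinge an upper bound on the training loss ∆.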
S-SVM Objective Function for w ∈ R²:

[Plots: level sets of the S-SVM objective for C = 0.01, C = 0.10, C = 1.00, and C → ∞]
Structured Support Vector Machine:

min_{w∈R^d}  ½‖w‖² + (C/N) ∑_{n=1}^N max_{y∈Y} [∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩]

Unconstrained optimization, convex, non-differentiable objective.
Structured SVM (equivalent formulation):

min_{w∈R^d, ξ∈R^N_+}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N,

max_{y∈Y} [∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩] ≤ ξ_n

N non-linear constraints, convex, differentiable objective.
Structured SVM (also equivalent formulation):

min_{w∈R^d, ξ∈R^N_+}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N,

∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n  for all y ∈ Y

N·|Y| linear constraints, convex, differentiable objective.
Example: A "True" Multiclass SVM
Y = {1, 2, . . . , K},  ∆(y, y′) = 1 for y ≠ y′, 0 otherwise.

ϕ(x, y) = (⟦y = 1⟧Φ(x), ⟦y = 2⟧Φ(x), . . . , ⟦y = K⟧Φ(x)) = Φ(x)e_y^⊤, with e_y the y-th unit vector

Solve:

min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N,

⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1 − ξ_n  for all y ∈ Y.

Classification (MAP): f(x) = argmax_{y∈Y} ⟨w, ϕ(x, y)⟩
Crammer-Singer Multiclass SVM
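The joint feature map ϕ(x, y) = Φ(x)e_y^⊤ just places the feature vector Φ(x) in the y-th of K blocks. A minimal sketch of that construction (the function name is a hypothetical helper):

```python
def multiclass_phi(feat, y, K):
    """Joint feature map phi(x, y): the feature vector feat = Phi(x)
    placed in the y-th of K blocks, zeros elsewhere."""
    d = len(feat)
    out = [0.0] * (K * d)
    out[y * d:(y + 1) * d] = feat
    return out

print(multiclass_phi([1.0, 2.0], 1, 3))  # [0.0, 0.0, 1.0, 2.0, 0.0, 0.0]
```

With this map, ⟨w, ϕ(x, y)⟩ picks out the y-th block of w, so the S-SVM constraints reduce exactly to the Crammer-Singer multiclass margin conditions.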
Hierarchical Multiclass Classification
Loss function can reflect the hierarchy (cat, dog, car, bus):

∆(y, y′) := ½ (distance in tree)

∆(cat, cat) = 0, ∆(cat, dog) = 1, ∆(cat, bus) = 2, etc.
Solve:
min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N,

⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ ∆(y^n, y) − ξ_n  for all y ∈ Y.
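The tree-distance loss is easy to compute from parent pointers. A sketch assuming a hypothetical hierarchy matching the slide (cat/dog under "animal", car/bus under "vehicle"; the node names are illustrative):

```python
# Hypothetical class hierarchy, as parent pointers; "root" has no parent.
parent = {"cat": "animal", "dog": "animal", "car": "vehicle", "bus": "vehicle",
          "animal": "root", "vehicle": "root"}

def tree_delta(y, y2):
    """Delta(y, y') = 1/2 * (length of the tree path between y and y')."""
    def ancestors(n):
        path = [n]
        while n in parent:
            n = parent[n]
            path.append(n)
        return path
    a, b = ancestors(y), ancestors(y2)
    common = next(n for n in a if n in b)  # lowest common ancestor
    return 0.5 * (a.index(common) + b.index(common))

print(tree_delta("cat", "cat"))  # 0.0
print(tree_delta("cat", "dog"))  # 1.0
print(tree_delta("cat", "bus"))  # 2.0
```

These values match the examples on the slide: confusing cat with dog costs half as much as confusing cat with bus.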
Kernelized S-SVM problem:
Define
joint kernel function k : (X × Y) × (X × Y) → R,
kernel matrix K_{nn′yy′} = k((x^n, y), (x^{n′}, y′)).

max_{α∈R^{N|Y|}_+}  ∑_{n=1,...,N; y∈Y} α_{ny} ∆(y^n, y) − ½ ∑_{y,y′∈Y; n,n′=1,...,N} α_{ny} α_{n′y′} K_{nn′yy′}

subject to, for n = 1, . . . , N,

∑_{y∈Y} α_{ny} ≤ C/N.

Kernelized prediction function:

f(x) = argmax_{y∈Y} ∑_{n,y′} α_{ny′} k((x^n, y^n), (x, y))

Too many variables: train with a working set of α_{ny}.
Applicationsin Computer Vision
Example 1: Category-Level Object Localization
What objects are present? person, car
Example 1: Category-Level Object Localization
Where are the objects?
Object Localization ⇒ Scene Interpretation
A man inside of a car ⇒ he is driving. A man outside of a car ⇒ he is passing by.
Object Localization as Structured Learning:Given: training examples (xn, yn)n=1,...,N
Wanted: prediction function f : X → Y where
- X = {all images}
- Y = {all boxes}

[Figure: f_car(image) = bounding box]
Structured SVM framework
Define:
feature function ϕ : X × Y → R^d,
loss function ∆ : Y × Y → R,
routine to solve argmax_y ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩.

Solve: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n  subject to

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Result: w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩; localization function: f(x) = argmax_y F(x, y).
• M. Blaschko, C.L.: Learning to Localize Objects with Structured Output Regression, ECCV 2008.
Feature function: how to represent an (image, box)-pair (x, y)?

Observation: whether y is the right box for x depends only on x|_y, the image region within the box.

ϕ(x, y) := h(x|_y)

where h(r) is a (bag-of-visual-words) histogram representation of the region r.
[Illustration: regions showing the same object class have similar histograms, ϕ(x, y) = h(x|_y) ≈ h(x′|_{y′}) = ϕ(x′, y′); regions showing different content have dissimilar histograms, h(x|_y) ≉ h(x′|_{y′}).]
Structured SVM framework
Define:
feature function ϕ : X × Y → R^d,
loss function ∆ : Y × Y → R,
routine to solve argmax_y ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩.

Solve: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n  subject to

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Result: w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩; localization function: f(x) = argmax_y F(x, y).
• M. Blaschko, C.L.: Learning to Localize Objects with Structured Output Regression, ECCV 2008.
Loss function: how to compare two boxes y and y ′?
∆(y, y′) := 1 − area overlap between y and y′
          = 1 − area(y ∩ y′) / area(y ∪ y′)
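This intersection-over-union loss is straightforward to compute for axis-aligned boxes. A minimal sketch, using a hypothetical (x1, y1, x2, y2) corner convention:

```python
def box_delta(y, y2):
    """Delta(y, y') = 1 - area(y ∩ y') / area(y ∪ y')
    for axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ix = max(0.0, min(y[2], y2[2]) - max(y[0], y2[0]))  # intersection width
    iy = max(0.0, min(y[3], y2[3]) - max(y[1], y2[1]))  # intersection height
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(y) + area(y2) - inter
    return 1.0 - inter / union if union > 0 else 1.0

print(box_delta((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0: identical boxes
print(box_delta((0, 0, 2, 2), (1, 0, 3, 2)))  # intersection 2, union 6 -> 1 - 1/3
```

The loss is 0 for a perfect localization, 1 for disjoint boxes, and degrades gracefully in between, which is exactly what the margin constraints need.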
Structured SVM framework
Define:
feature function ϕ : X × Y → R^d,
loss function ∆ : Y × Y → R,
routine to solve argmax_y ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩.

Solve: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n  subject to

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Result: w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩; localization function: f(x) = argmax_y F(x, y).
• M. Blaschko, C.L.: Learning to Localize Objects with Structured Output Regression, ECCV 2008.
How to solve argmax_y ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩?

Option 1) Sliding Window

[Illustration: exhaustively score every candidate window, e.g. 1 − 0.3 = 0.7, 0.3 + 1.4 = 1.7, 0 + 1.5 = 1.5, . . . , and keep the maximum.]
Option 2) Branch-and-Bound Search (another talk)
• C.L., M. Blaschko, T. Hofmann: Beyond Sliding Windows: Object Localization by Efficient Subwindow Search, CVPR 2008.
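The sliding-window option amounts to scoring every candidate and keeping the best; the loss term can flip the winner away from the ground-truth window during training. A toy sketch (candidate names and scores are made up for illustration):

```python
def loss_augmented_argmax(yn, candidates, score, delta):
    """Sliding-window style search for argmax_y [delta(yn, y) + score(y)]
    over an explicit list of candidate windows."""
    return max(candidates, key=lambda y: delta(yn, y) + score(y))

# Window "b" is the ground truth yn; the wrong window "a" wins the
# loss-augmented search because its score plus loss is highest.
score = lambda y: {"a": 1.4, "b": 1.5, "c": -1.2}[y]
delta = lambda yn, y: 0.0 if yn == y else 1.0
print(loss_augmented_argmax("b", ["a", "b", "c"], score, delta))  # "a": 1 + 1.4 = 2.4
```

Such a "high-scoring but wrong" window is exactly the violated constraint that the constraint-generation training below wants to find.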
Structured Support Vector Machine
S-SVM Optimization: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N:

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Solve via constraint generation. Iterate:
- Solve minimization with working set of constraints: new w
- Identify argmax_{y∈Y} ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩
- Add violated constraints to working set and iterate

Polynomial-time convergence to any precision ε
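The constraint-generation loop can be sketched end to end. A simplified version, assuming a small enumerable Y; for brevity the inner minimization over the working set uses plain subgradient descent rather than the QP solver a real implementation would use, and all names are illustrative:

```python
def train_ssvm(data, phi, delta, Y, dim, C=1.0, rounds=10, inner_steps=50, lr=0.1):
    """Constraint-generation training of a structured SVM (sketch).
    Alternates (a) adding the most violated output per example,
    argmax_y [delta(yn, y) + <w, phi(xn, y)>], and (b) approximately
    minimizing the objective restricted to the working set."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    w = [0.0] * dim
    working = [set() for _ in data]
    for _ in range(rounds):
        # (a) most violated constraint for every training example
        for n, (xn, yn) in enumerate(data):
            working[n].add(max(Y, key=lambda y: delta(yn, y) + dot(w, phi(xn, y))))
        # (b) subgradient steps on the working-set-restricted objective
        for t in range(inner_steps):
            g = list(w)  # gradient of the regularizer 1/2 ||w||^2
            for n, (xn, yn) in enumerate(data):
                y = max(working[n], key=lambda y: delta(yn, y) + dot(w, phi(xn, y)))
                if delta(yn, y) + dot(w, phi(xn, y)) - dot(w, phi(xn, yn)) > 0:
                    g = [gi + (C / len(data)) * (a - b)
                         for gi, a, b in zip(g, phi(xn, y), phi(xn, yn))]
            w = [wi - lr / (t + 1) * gi for wi, gi in zip(w, g)]
    return w

# Toy 2-class problem: phi(x, y) places the scalar x in slot y.
phi = lambda x, y: [x, 0.0] if y == 0 else [0.0, x]
delta = lambda yn, y: 0.0 if yn == y else 1.0
w = train_ssvm([(1.0, 0), (-1.0, 1)], phi, delta, Y=[0, 1], dim=2)
print(w[0] > w[1])  # the learned weights separate the two toy classes
```

The key property is that only a small working set of the |Y|·N constraints is ever materialized, which is what makes training tractable.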
Example: Training set (x¹, y¹), . . . , (x⁴, y⁴)

Initialize: no constraints
Solve minimization with working set of constraints ⇒ w = 0
Identify argmax_{y∈Y} ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩
- ⟨w, ϕ(x^n, y)⟩ = 0 → pick any window with ∆(y, y^n) = 1

Add violated constraints to working set and iterate:

[Illustration: for each of the four training images, a constraint ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1 for the selected window y]
Working set of constraints:

[Illustration: the four constraints collected so far, each of the form ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1]
Solve minimization with working set of constraints
Identify argmax_{y∈Y} ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩
Add violated constraints to working set and iterate:

[Illustration: newly violated constraints of the form ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1, ≥ 0.9, ≥ 0.8, ≥ 0.01]
Working set of constraints:

[Illustration: the working set now holds eight constraints, with right-hand sides 1, 1, 1, 0.9, 1, 0.8, 1, 0.01]
Solve minimization with working set of constraints
Identify argmax_{y∈Y} ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩
Add violated constraints to working set and iterate, . . .
S-SVM Optimization: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N:

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Solve via constraint generation. Iterate:
- Solve minimization with working set of constraints
- Identify argmax_{y∈Y} ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩
- Add violated constraints to working set and iterate

Similar to classical bootstrap training, but:
- force a margin between correct and incorrect location scores,
- handle overlapping detections by fractional scores.
Results: PASCAL VOC 2006
Example detections for VOC 2006 bicycle, bus and cat.
Precision–recall curves for VOC 2006 bicycle, bus and cat.
Structured training improves detection accuracy.
More Recent Results (PASCAL VOC 2009)
aeroplane
More Recent Results (PASCAL VOC 2009)
horse
More Recent Results (PASCAL VOC 2009)
sheep
More Recent Results (PASCAL VOC 2009)
sofa
Why does it work?
Learned weights from binary (center) and structured training (right).
Both training methods: positive weights at the object region.
Structured training: negative weights for features just outside the bounding box position.
The posterior distribution over box coordinates becomes more peaked.
Example II: Category-Level Object Segmentation
Where exactly are the objects?
Segmentation as Structured Learning:
Given: training examples (x^n, y^n), n = 1, . . . , N
Wanted: prediction function f : X → Y with
- X = {all images}
- Y = {all binary segmentations}
Structured SVM framework
Define:
Feature functions ϕ(x, y) → R^d:
- unary terms ϕ_i(x, y_i) for each pixel i
- pairwise terms ϕ_ij(x, y_i, y_j) for neighbors (i, j)
Loss function ∆ : Y × Y → R:
- ideally decomposes like ϕ

Solve: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n  subject to

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Result: w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩; segmentation function: f(x) = argmax_y F(x, y).
Example choices:
Feature functions: unary terms c = {i}:

ϕ_i(x, y_i) = (0, h_i(x)) for y_i = 0,  (h_i(x), 0) for y_i = 1.

h_i(x) is the color histogram of the pixel i.

Feature functions: pairwise terms c = {i, j}:
ϕ_ij(x, y_i, y_j) = ⟦y_i ≠ y_j⟧.

Loss function: Hamming loss
∆(y, y′) = ∑_i ⟦y_i ≠ y′_i⟧
How to solve argmax_y [∆(y^n, y) + F(x^n, y)]?

∆(y^n, y) + F(x^n, y) = ∑_i ⟦y^n_i ≠ y_i⟧ + ∑_i w_i^⊤ h(x^n_i) + w_ij ∑_{ij} ⟦y_i ≠ y_j⟧
                      = ∑_i [w_i^⊤ h(x^n_i) + ⟦y^n_i ≠ y_i⟧] + w_ij ∑_{ij} ⟦y_i ≠ y_j⟧
If w_ij ≥ 0 (which makes sense), then E := −F is submodular: use the GraphCut algorithm to find the global optimum efficiently.
- also possible: (loopy) belief propagation, variational inference, greedy search, simulated annealing, . . .
• [M. Szummer, P. Kohli: "Learning CRFs using graph cuts", ECCV 2008]
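The point of the rearrangement above is that the Hamming loss decomposes per pixel and so folds into the unary terms. A toy sketch that makes this concrete; exhaustive search over a two-pixel image stands in for graph cut, and all names and numbers are illustrative:

```python
from itertools import product

def loss_augmented_segmentation(unary, pairs, wij, yn):
    """Loss-augmented inference for binary segmentation, by exhaustive
    search on a tiny problem (graph cut would replace this at scale).
    The Hamming loss is absorbed into the unaries: unary[i][l] + [yn[i] != l];
    the pairwise score -wij * [yi != yj] with wij >= 0 penalizes
    disagreeing neighbors, so the corresponding energy is submodular."""
    n = len(unary)
    def score(y):
        s = sum(unary[i][y[i]] + (1.0 if yn[i] != y[i] else 0.0) for i in range(n))
        s -= sum(wij if y[i] != y[j] else 0.0 for i, j in pairs)
        return s
    return max(product([0, 1], repeat=n), key=score)

# Two pixels, each preferring label 1 by a unary margin of 1;
# strong smoothness; ground truth yn = (0, 0).
print(loss_augmented_segmentation([[0.0, 1.0], [0.0, 1.0]],
                                  pairs=[(0, 1)], wij=10.0, yn=(0, 0)))
# -> (1, 1): the labels agree, and the loss term rewards deviating from yn
```

During training this routine returns the "worst" segmentation, i.e. one that scores high yet disagrees with the ground truth, which supplies the violated constraints.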
Extension: Image segmentation with connectedness constraints
Knowing that the object is connected improves segmentation quality.
[Figure: original image, ordinary segmentation, connected segmentation]
Segmentation as Structured Learning:Given: training examples (xn, yn)n=1,...,N
Wanted: prediction function f : X → Y whereI X = {all images (as superpixels)}I Y = {all connected binary segmentations}
• S. Nowozin, C.L.: Global Connectivity Potentials for Random Field Models, CVPR 2009.
Feature functions: unary terms c = {i}:

ϕ_i(x, y_i) = (0, h_i(x)) for y_i = 0,  (h_i(x), 0) for y_i = 1.

h_i(x) is the bag-of-visual-words histogram of the superpixel i.

Feature functions: pairwise terms c = {i, j}:
ϕ_ij(y_i, y_j) = ⟦y_i ≠ y_j⟧.

Loss function: Hamming loss
∆(y, y′) = ∑_i ⟦y_i ≠ y′_i⟧
How to solve f(x) = argmax_{y connected} ∆(y^n, y) + F(x^n, y)?

Linear programming relaxation with connectivity constraints:
rewrite the energy so that it is linear in new variables µ^l_i and µ^{ll′}_{ij},

F(x, y) = ∑_i [w_1^⊤ h_i(x) µ^1_i + w_2^⊤ h_i(x) µ^{−1}_i + ∑_{l≠l′} w_3 µ^{ll′}_{ij}]

subject to

µ^l_i ∈ {0, 1},  µ^{ll′}_{ij} ∈ {0, 1},  ∑_l µ^l_i = 1,  ∑_l µ^{ll′}_{ij} = µ^{l′}_j,  ∑_{l′} µ^{ll′}_{ij} = µ^l_i

relax to µ^l_i ∈ [0, 1] and µ^{ll′}_{ij} ∈ [0, 1], and solve the linear program with additional linear constraints:

µ¹_i + µ¹_j − ∑_{k∈S} µ¹_k ≤ 1  for any set S of nodes separating i and j.
Example Results:
original segmentation with connectivity
. . . still room for improvement . . .
Summary
Machine Learning of Structured Outputs
Task: predict f : X → Y for (almost) arbitrary Y
Key idea:
- learn scoring function F : X × Y → R
- predict using f(x) := argmax_y F(x, y)

Structured Support Vector Machines
Parametrize F(x, y) = ⟨w, ϕ(x, y)⟩
Learn w from training data by a maximum-margin criterion
Needs only:
- feature function ϕ(x, y)
- loss function ∆(y, y′)
- routine to solve argmax_y ∆(y^n, y) + F(x^n, y)

Applications
Many different applications in a unified framework
- Natural Language Processing: parsing
- Computational Biology: secondary structure prediction
- Computer Vision: pose estimation, object localization/segmentation
- . . .

Open Problems
Theory:
- which output structures are useful?
- (how) can we use approximate argmax_y?
Practice:
- more applications? new domains?
- training speed!
Thank you!