machine learning of structured outputs
TRANSCRIPT
Machine Learning of Structured Outputs
Christoph Lampert, IST Austria
(Institute of Science and Technology Austria), Klosterneuburg
Feb 2, 2011
Machine Learning of Structured Outputs
Overview:
- Introduction to Structured Learning
- Structured Support Vector Machines
- Applications in Computer Vision

Slides available at http://www.ist.ac.at/~chl
What is Machine Learning?
Definition [T. Mitchell]: Machine Learning is the study of computer algorithms that improve their performance in a certain task through experience.
Example: Backgammon
- Task: play backgammon
- Experience: self-play
- Performance measure: games won against humans

Example: Object Recognition
- Task: determine which objects are visible in images
- Experience: annotated training data
- Performance measure: objects recognized correctly
What is structured data?
Definition [ad hoc]: Data is structured if it consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.
Text Molecules / Chemical Structures
Documents/HyperText Images
The right tool for the problem.
Example: Machine Learning for/of Structured Data
[Figure: image, body model, model fit]

Task: human pose estimation
Experience: images with manually annotated body pose
Performance measure: number of correctly localized body parts
Other tasks:
Natural Language Processing:
- Automatic Translation (output: sentences)
- Sentence Parsing (output: parse trees)

Bioinformatics:
- RNA Structure Prediction (output: bipartite graphs)
- Enzyme Function Prediction (output: path in a tree)

Speech Processing:
- Automatic Transcription (output: sentences)
- Text-to-Speech (output: audio signal)

Robotics:
- Planning (output: sequence of actions)
This talk: only Computer Vision examples
"Normal" Machine Learning:f : X → R.
- inputs x ∈ X can be any kind of objects: images, text, audio, sequences of amino acids, . . .
- output y is a real number: classification, regression, . . .
- many ways to construct f:
  - f(x) = a · ϕ(x) + b,
  - f(x) = decision tree,
  - f(x) = neural network
Structured Output Learning:f : X → Y .
- inputs x ∈ X can be any kind of objects
- outputs y ∈ Y are complex (structured) objects: images, parse trees, folds of a protein, . . .
- how to construct f?
Predicting Structured Outputs: Image Denoising
f : x ↦ y — input: images, output: denoised images
input set X = {grayscale images} =̂ [0, 255]^{M·N}
output set Y = {grayscale images} =̂ [0, 255]^{M·N}
energy minimization: f(x) := argmin_{y∈Y} E(x, y)

E(x, y) = λ ∑_i (x_i − y_i)² + µ ∑_{i,j} |y_i − y_j|
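As an illustration, the denoising energy above can be evaluated directly. A minimal sketch for 1-D signals (the 2-D case would sum the pairwise term over grid neighbors instead); the function name and default weights are illustrative choices, not from the slides:

```python
def denoising_energy(x, y, lam=1.0, mu=0.5):
    """E(x, y) = lam * sum_i (x_i - y_i)^2 + mu * sum_{i,j} |y_i - y_j|
    for a 1-D signal, with neighbor pairs (i, i+1)."""
    data_term = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    smooth_term = sum(abs(y[i] - y[i + 1]) for i in range(len(y) - 1))
    return lam * data_term + mu * smooth_term

# The noisy input itself has zero data term but a nonzero smoothness term;
# a flat output trades data fidelity for smoothness.
x = [10, 200, 10, 10]
print(denoising_energy(x, x))                 # 0 + 0.5 * (190 + 190 + 0) = 190.0
print(denoising_energy(x, [10, 10, 10, 10]))  # 1.0 * 190**2 + 0 = 36100.0
```

Minimizing this energy over all outputs y is exactly the argmin in f(x) above.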
Predicting Structured Outputs: Human Pose Estimation
f : (image, body model) ↦ model fit
input set X = {images}
output set Y = {positions/angles of K body parts} =̂ R^{4K}

energy minimization: f(x) := argmin_{y∈Y} E(x, y)

E(x, y) = ∑_i w_i^⊤ ϕ_fit(x_i, y_i) + ∑_{i,j} w_ij^⊤ ϕ_pose(y_i, y_j)
Predicting Structured Outputs: Shape Matching
input: image pairs
output: mapping y : x_i ↔ y(x_i)

scoring function F(x, y) = ∑_i w_i^⊤ ϕ_sim(x_i, y(x_i)) + ∑_{i,j} w_ij^⊤ ϕ_dist(x_i, x_j, y(x_i), y(x_j))

predict f : X → Y by f(x) := argmax_{y∈Y} F(x, y)
[J. McAuley et al.: "Robust Near-Isometric Matching via Structured Learning of Graphical Models", NIPS, 2008]
Predicting Structured Outputs: Tracking (by Detection)
input:image
output:object position
input set X = {images}
output set Y = R² (box center) or R⁴ (box coordinates)

predict f : X → Y by f(x) := argmax_{y∈Y} F(x, y)

scoring function F(x, y) = w^⊤ ϕ(x, y), e.g. an SVM score
images: [C. L., Jan Peters, "Active Structured Learning for High-Speed Object Detection", DAGM 2009]
Predicting Structured Outputs: Summary
Image Denoising
y = argmin_y E(x, y),  E(x, y) = w_1 ∑_i (x_i − y_i)² + w_2 ∑_{i,j} |y_i − y_j|

Pose Estimation
y = argmin_y E(x, y),  E(x, y) = ∑_i w_i^⊤ ϕ(x_i, y_i) + ∑_{i,j} w_ij^⊤ ϕ(y_i, y_j)

Point Matching
y = argmax_y F(x, y),  F(x, y) = ∑_i w_i^⊤ ϕ(x_i, y_i) + ∑_{i,j} w_ij^⊤ ϕ(y_i, y_j)

Tracking
y = argmax_y F(x, y),  F(x, y) = w^⊤ ϕ(x, y)
Unified Formulation
Predict structured output by maximization

y = argmax_{y∈Y} F(x, y)

of a compatibility function

F(x, y) = ⟨w, ϕ(x, y)⟩

that is linear in a parameter vector w.
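For a finite output set, this unified formulation fits in a few lines. A minimal sketch, where `phi`, `w`, and the three-class output set are toy stand-ins (real structured problems cannot enumerate Y and need the inference algorithms on the next slide):

```python
def predict(x, w, phi, Y):
    """Structured prediction f(x) = argmax_{y in Y} <w, phi(x, y)>,
    by exhaustive search over a small, finite output set Y."""
    return max(Y, key=lambda y: sum(wi * fi for wi, fi in zip(w, phi(x, y))))

# Toy example: Y = {0, 1, 2}, phi is a one-hot indicator scaled by x.
phi = lambda x, y: [x if y == k else 0.0 for k in range(3)]
w = [0.2, -0.5, 1.0]
print(predict(3.0, w, phi, [0, 1, 2]))  # class 2 scores 3.0, the highest
```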
Structured Prediction: how to evaluate argmaxy F(x , y)?
loop-free graphs (chain, tree): Shortest-Path / Belief Propagation (BP)
loopy graphs (grid, arbitrary graph): GraphCut, approximate inference (e.g. loopy BP)
Structured Learning: how to learn F(x , y) from examples?
Machine Learning for Structured Outputs
Learning Problem:
Task: predict structured objects f : X → Y
Experience: example pairs {(x¹, y¹), . . . , (x^N, y^N)} ⊂ X × Y: typical inputs with "correct" outputs for them.
Performance measure: ∆ : Y × Y → R
Our choice:
parametric family: F(x, y; w) = ⟨w, ϕ(x, y)⟩
prediction method: f(x) = argmax_{y∈Y} F(x, y; w)
Task: determine a "good" w
Reminder: regularized risk minimization
Find w for decision function F = ⟨w, ϕ(x, y)⟩ by

min_{w∈R^d}  λ‖w‖² + ∑_{n=1}^N ℓ(y^n, F(x^n, ·; w))

Regularization + empirical loss (on training data)

Logistic Loss: Conditional Random Fields
ℓ(y^n, F(x^n, ·; w)) = log ∑_{y∈Y} exp[F(x^n, y; w) − F(x^n, y^n; w)]

Hinge Loss: Maximum Margin Training
ℓ(y^n, F(x^n, ·; w)) = max_{y∈Y} [∆(y^n, y) + F(x^n, y; w) − F(x^n, y^n; w)]

Exponential Loss: Boosting
ℓ(y^n, F(x^n, ·; w)) = ∑_{y∈Y\{y^n}} exp[F(x^n, y; w) − F(x^n, y^n; w)]
Maximum Margin Trainingof Structured Models
(Structured SVMs)
Structured Support Vector Machine
Structured Support Vector Machine:
min_{w∈R^d}  ½‖w‖² + (C/N) ∑_{n=1}^N max_{y∈Y} [∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩]

Unconstrained optimization, convex, non-differentiable objective.
[I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun. "Large Margin Methods for Structured and Interdependent
Output Variables", JMLR, 2005.]
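The unconstrained objective above can be evaluated directly when Y is small enough to enumerate. A sketch under that assumption (the function name and toy problem are illustrative; real problems replace the inner max by loss-augmented inference):

```python
def ssvm_objective(w, data, phi, delta, Y, C=1.0):
    """S-SVM objective: 1/2 ||w||^2 + C/N * sum_n max_y
    [delta(yn, y) + <w, phi(xn, y)> - <w, phi(xn, yn)>]."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    hinge = sum(
        max(delta(yn, y) + dot(w, phi(xn, y)) - dot(w, phi(xn, yn)) for y in Y)
        for xn, yn in data
    )
    return 0.5 * dot(w, w) + C / len(data) * hinge

# Toy binary problem: phi maps (x, y) to a single signed feature.
phi = lambda x, y: [x] if y == 1 else [-x]
delta = lambda yn, y: 0.0 if yn == y else 1.0
# For w=[2], x=1, yn=1: inner max over y is max(0, 1 - 2 - 2) = 0,
# so the objective is 0.5 * 2^2 = 2.0.
print(ssvm_objective([2.0], [(1.0, 1)], phi, delta, Y=[0, 1]))  # 2.0
```

Note the max always includes y = y^n, so each per-example term is non-negative, which makes the hinge an upper bound on the training loss ∆.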
S-SVM Objective Function for w ∈ R²:

[Plots: level sets of the S-SVM objective for C = 0.01, C = 0.10, C = 1.00, and C → ∞]
Structured Support Vector Machine:

min_{w∈R^d}  ½‖w‖² + (C/N) ∑_{n=1}^N max_{y∈Y} [∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩]

Unconstrained optimization, convex, non-differentiable objective.
Structured SVM (equivalent formulation):

min_{w∈R^d, ξ∈R^N_+}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N,

max_{y∈Y} [∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩] ≤ ξ_n

N non-linear constraints, convex, differentiable objective.
Structured SVM (also equivalent formulation):

min_{w∈R^d, ξ∈R^N_+}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N,

∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n  for all y ∈ Y

N·|Y| linear constraints, convex, differentiable objective.
Example: A "True" Multiclass SVM
Y = {1, 2, . . . , K},  ∆(y, y′) = 1 for y ≠ y′, 0 otherwise.

ϕ(x, y) = (⟦y = 1⟧Φ(x), ⟦y = 2⟧Φ(x), . . . , ⟦y = K⟧Φ(x)) = Φ(x)e_y^⊤, with e_y the y-th unit vector

Solve:

min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N,

⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1 − ξ_n  for all y ∈ Y.

Classification (MAP): f(x) = argmax_{y∈Y} ⟨w, ϕ(x, y)⟩
Crammer-Singer Multiclass SVM
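The joint feature map ϕ(x, y) = Φ(x)e_y^⊤ just places the feature vector Φ(x) in the y-th of K blocks. A minimal sketch of that construction (the function name is a hypothetical helper):

```python
def multiclass_phi(feat, y, K):
    """Joint feature map phi(x, y): the feature vector feat = Phi(x)
    placed in the y-th of K blocks, zeros elsewhere."""
    d = len(feat)
    out = [0.0] * (K * d)
    out[y * d:(y + 1) * d] = feat
    return out

print(multiclass_phi([1.0, 2.0], 1, 3))  # [0.0, 0.0, 1.0, 2.0, 0.0, 0.0]
```

With this map, ⟨w, ϕ(x, y)⟩ picks out the y-th block of w, so the S-SVM constraints reduce exactly to the Crammer-Singer multiclass margin conditions.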
Hierarchical Multiclass Classification
Loss function can reflect the hierarchy (cat, dog, car, bus):

∆(y, y′) := ½ (distance in tree)

∆(cat, cat) = 0, ∆(cat, dog) = 1, ∆(cat, bus) = 2, etc.
Solve:
min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N,

⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ ∆(y^n, y) − ξ_n  for all y ∈ Y.
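The tree-distance loss is easy to compute from parent pointers. A sketch assuming a hypothetical hierarchy matching the slide (cat/dog under "animal", car/bus under "vehicle"; the node names are illustrative):

```python
# Hypothetical class hierarchy, as parent pointers; "root" has no parent.
parent = {"cat": "animal", "dog": "animal", "car": "vehicle", "bus": "vehicle",
          "animal": "root", "vehicle": "root"}

def tree_delta(y, y2):
    """Delta(y, y') = 1/2 * (length of the tree path between y and y')."""
    def ancestors(n):
        path = [n]
        while n in parent:
            n = parent[n]
            path.append(n)
        return path
    a, b = ancestors(y), ancestors(y2)
    common = next(n for n in a if n in b)  # lowest common ancestor
    return 0.5 * (a.index(common) + b.index(common))

print(tree_delta("cat", "cat"))  # 0.0
print(tree_delta("cat", "dog"))  # 1.0
print(tree_delta("cat", "bus"))  # 2.0
```

These values match the examples on the slide: confusing cat with dog costs half as much as confusing cat with bus.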
Kernelized S-SVM problem:
Define
joint kernel function k : (X × Y) × (X × Y) → R,
kernel matrix K_{nn′yy′} = k((x^n, y), (x^{n′}, y′)).

max_{α∈R^{N|Y|}_+}  ∑_{n=1,...,N; y∈Y} α_{ny} ∆(y^n, y) − ½ ∑_{y,y′∈Y; n,n′=1,...,N} α_{ny} α_{n′y′} K_{nn′yy′}

subject to, for n = 1, . . . , N,

∑_{y∈Y} α_{ny} ≤ C/N.

Kernelized prediction function:

f(x) = argmax_{y∈Y} ∑_{n,y′} α_{ny′} k((x^n, y^n), (x, y))

Too many variables: train with a working set of α_{ny}.
Applicationsin Computer Vision
Example 1: Category-Level Object Localization
What objects are present? person, car
Example 1: Category-Level Object Localization
Where are the objects?
Object Localization ⇒ Scene Interpretation
A man inside of a car ⇒ he is driving. A man outside of a car ⇒ he is passing by.
Object Localization as Structured Learning:Given: training examples (xn, yn)n=1,...,N
Wanted: prediction function f : X → Y where
- X = {all images}
- Y = {all boxes}

[Figure: f_car(image) = bounding box]
Structured SVM framework
Define:
feature function ϕ : X × Y → R^d,
loss function ∆ : Y × Y → R,
routine to solve argmax_y ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩.

Solve: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n  subject to

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Result: w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩; localization function: f(x) = argmax_y F(x, y).
• M. Blaschko, C.L.: Learning to Localize Objects with Structured Output Regression, ECCV 2008.
Feature function: how to represent an (image, box)-pair (x, y)?

Observation: whether y is the right box for x depends only on x|_y, the image region within the box.

ϕ(x, y) := h(x|_y)

where h(r) is a (bag-of-visual-words) histogram representation of the region r.
[Illustration: regions showing the same object class have similar histograms, ϕ(x, y) = h(x|_y) ≈ h(x′|_{y′}) = ϕ(x′, y′); regions showing different content have dissimilar histograms, h(x|_y) ≉ h(x′|_{y′}).]
Structured SVM framework
Define:
feature function ϕ : X × Y → R^d,
loss function ∆ : Y × Y → R,
routine to solve argmax_y ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩.

Solve: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n  subject to

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Result: w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩; localization function: f(x) = argmax_y F(x, y).
• M. Blaschko, C.L.: Learning to Localize Objects with Structured Output Regression, ECCV 2008.
Loss function: how to compare two boxes y and y ′?
∆(y, y′) := 1 − area overlap between y and y′
          = 1 − area(y ∩ y′) / area(y ∪ y′)
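This intersection-over-union loss is straightforward to compute for axis-aligned boxes. A minimal sketch, using a hypothetical (x1, y1, x2, y2) corner convention:

```python
def box_delta(y, y2):
    """Delta(y, y') = 1 - area(y ∩ y') / area(y ∪ y')
    for axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ix = max(0.0, min(y[2], y2[2]) - max(y[0], y2[0]))  # intersection width
    iy = max(0.0, min(y[3], y2[3]) - max(y[1], y2[1]))  # intersection height
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(y) + area(y2) - inter
    return 1.0 - inter / union if union > 0 else 1.0

print(box_delta((0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0: identical boxes
print(box_delta((0, 0, 2, 2), (1, 0, 3, 2)))  # intersection 2, union 6 -> 1 - 1/3
```

The loss is 0 for a perfect localization, 1 for disjoint boxes, and degrades gracefully in between, which is exactly what the margin constraints need.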
Structured SVM framework
Define:
feature function ϕ : X × Y → R^d,
loss function ∆ : Y × Y → R,
routine to solve argmax_y ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩.

Solve: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n  subject to

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Result: w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩; localization function: f(x) = argmax_y F(x, y).
• M. Blaschko, C.L.: Learning to Localize Objects with Structured Output Regression, ECCV 2008.
How to solve argmax_y ∆(y^n, y) + ⟨w, ϕ(x^n, y)⟩?

Option 1) Sliding Window

[Illustration: exhaustively score every candidate window, e.g. 1 − 0.3 = 0.7, 0.3 + 1.4 = 1.7, 0 + 1.5 = 1.5, . . . , and keep the maximum.]
Option 2) Branch-and-Bound Search (another talk)
• C.L., M. Blaschko, T. Hofmann: Beyond Sliding Windows: Object Localization by Efficient Subwindow Search, CVPR 2008.
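The sliding-window option amounts to scoring every candidate and keeping the best; the loss term can flip the winner away from the ground-truth window during training. A toy sketch (candidate names and scores are made up for illustration):

```python
def loss_augmented_argmax(yn, candidates, score, delta):
    """Sliding-window style search for argmax_y [delta(yn, y) + score(y)]
    over an explicit list of candidate windows."""
    return max(candidates, key=lambda y: delta(yn, y) + score(y))

# Window "b" is the ground truth yn; the wrong window "a" wins the
# loss-augmented search because its score plus loss is highest.
score = lambda y: {"a": 1.4, "b": 1.5, "c": -1.2}[y]
delta = lambda yn, y: 0.0 if yn == y else 1.0
print(loss_augmented_argmax("b", ["a", "b", "c"], score, delta))  # "a": 1 + 1.4 = 2.4
```

Such a "high-scoring but wrong" window is exactly the violated constraint that the constraint-generation training below wants to find.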
Structured Support Vector Machine
S-SVM Optimization: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N:

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Solve via constraint generation. Iterate:
- Solve minimization with working set of constraints: new w
- Identify argmax_{y∈Y} ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩
- Add violated constraints to working set and iterate

Polynomial-time convergence to any precision ε
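The constraint-generation loop can be sketched end to end. A simplified version, assuming a small enumerable Y; for brevity the inner minimization over the working set uses plain subgradient descent rather than the QP solver a real implementation would use, and all names are illustrative:

```python
def train_ssvm(data, phi, delta, Y, dim, C=1.0, rounds=10, inner_steps=50, lr=0.1):
    """Constraint-generation training of a structured SVM (sketch).
    Alternates (a) adding the most violated output per example,
    argmax_y [delta(yn, y) + <w, phi(xn, y)>], and (b) approximately
    minimizing the objective restricted to the working set."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    w = [0.0] * dim
    working = [set() for _ in data]
    for _ in range(rounds):
        # (a) most violated constraint for every training example
        for n, (xn, yn) in enumerate(data):
            working[n].add(max(Y, key=lambda y: delta(yn, y) + dot(w, phi(xn, y))))
        # (b) subgradient steps on the working-set-restricted objective
        for t in range(inner_steps):
            g = list(w)  # gradient of the regularizer 1/2 ||w||^2
            for n, (xn, yn) in enumerate(data):
                y = max(working[n], key=lambda y: delta(yn, y) + dot(w, phi(xn, y)))
                if delta(yn, y) + dot(w, phi(xn, y)) - dot(w, phi(xn, yn)) > 0:
                    g = [gi + (C / len(data)) * (a - b)
                         for gi, a, b in zip(g, phi(xn, y), phi(xn, yn))]
            w = [wi - lr / (t + 1) * gi for wi, gi in zip(w, g)]
    return w

# Toy 2-class problem: phi(x, y) places the scalar x in slot y.
phi = lambda x, y: [x, 0.0] if y == 0 else [0.0, x]
delta = lambda yn, y: 0.0 if yn == y else 1.0
w = train_ssvm([(1.0, 0), (-1.0, 1)], phi, delta, Y=[0, 1], dim=2)
print(w[0] > w[1])  # the learned weights separate the two toy classes
```

The key property is that only a small working set of the |Y|·N constraints is ever materialized, which is what makes training tractable.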
Example: Training set (x¹, y¹), . . . , (x⁴, y⁴)

Initialize: no constraints
Solve minimization with working set of constraints ⇒ w = 0
Identify argmax_{y∈Y} ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩
- ⟨w, ϕ(x^n, y)⟩ = 0 → pick any window with ∆(y, y^n) = 1

Add violated constraints to working set and iterate:

[Illustration: for each of the four training images, a constraint ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1 for the selected window y]
Working set of constraints:

[Illustration: the four constraints collected so far, each of the form ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1]
Solve minimization with working set of constraints
Identify argmax_{y∈Y} ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩
Add violated constraints to working set and iterate:

[Illustration: newly violated constraints of the form ⟨w, ϕ(x^n, y^n)⟩ − ⟨w, ϕ(x^n, y)⟩ ≥ 1, ≥ 0.9, ≥ 0.8, ≥ 0.01]
Working set of constraints:

[Illustration: the working set now holds eight constraints, with right-hand sides 1, 1, 1, 0.9, 1, 0.8, 1, 0.01]
Solve minimization with working set of constraints
Identify argmax_{y∈Y} ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩
Add violated constraints to working set and iterate, . . .
S-SVM Optimization: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n

subject to, for n = 1, . . . , N:

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Solve via constraint generation. Iterate:
- Solve minimization with working set of constraints
- Identify argmax_{y∈Y} ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩
- Add violated constraints to working set and iterate

Similar to classical bootstrap training, but:
- force a margin between correct and incorrect location scores,
- handle overlapping detections by fractional scores.
Results: PASCAL VOC 2006
Example detections for VOC 2006 bicycle, bus and cat.
Precision–recall curves for VOC 2006 bicycle, bus and cat.
Structured training improves detection accuracy.
More Recent Results (PASCAL VOC 2009)
aeroplane
More Recent Results (PASCAL VOC 2009)
horse
More Recent Results (PASCAL VOC 2009)
sheep
More Recent Results (PASCAL VOC 2009)
sofa
Why does it work?
Learned weights from binary (center) and structured training (right).
Both training methods: positive weights at the object region.
Structured training: negative weights for features just outside the bounding box position.
The posterior distribution over box coordinates becomes more peaked.
Example II: Category-Level Object Segmentation
Where exactly are the objects?
Segmentation as Structured Learning:
Given: training examples (x^n, y^n), n = 1, . . . , N
Wanted: prediction function f : X → Y with
- X = {all images}
- Y = {all binary segmentations}
Structured SVM framework
Define:
Feature functions ϕ(x, y) → R^d:
- unary terms ϕ_i(x, y_i) for each pixel i
- pairwise terms ϕ_ij(x, y_i, y_j) for neighbors (i, j)
Loss function ∆ : Y × Y → R:
- ideally decomposes like ϕ

Solve: min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^N ξ_n  subject to

∀y ∈ Y : ∆(y, y^n) + ⟨w, ϕ(x^n, y)⟩ − ⟨w, ϕ(x^n, y^n)⟩ ≤ ξ_n

Result: w* that determines the scoring function F(x, y) = ⟨w*, ϕ(x, y)⟩; segmentation function: f(x) = argmax_y F(x, y).
Example choices:
Feature functions: unary terms c = {i}:

ϕ_i(x, y_i) = (0, h_i(x)) for y_i = 0,  (h_i(x), 0) for y_i = 1.

h_i(x) is the color histogram of the pixel i.

Feature functions: pairwise terms c = {i, j}:
ϕ_ij(x, y_i, y_j) = ⟦y_i ≠ y_j⟧.

Loss function: Hamming loss
∆(y, y′) = ∑_i ⟦y_i ≠ y′_i⟧
How to solve argmax_y [∆(y^n, y) + F(x^n, y)]?

∆(y^n, y) + F(x^n, y) = ∑_i ⟦y^n_i ≠ y_i⟧ + ∑_i w_i^⊤ h(x^n_i) + w_ij ∑_{ij} ⟦y_i ≠ y_j⟧
                      = ∑_i [w_i^⊤ h(x^n_i) + ⟦y^n_i ≠ y_i⟧] + w_ij ∑_{ij} ⟦y_i ≠ y_j⟧
If w_ij ≥ 0 (which makes sense), then E := −F is submodular: use the GraphCut algorithm to find the global optimum efficiently.
- also possible: (loopy) belief propagation, variational inference, greedy search, simulated annealing, . . .
• [M. Szummer, P. Kohli: "Learning CRFs using graph cuts", ECCV 2008]
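The point of the rearrangement above is that the Hamming loss decomposes per pixel and so folds into the unary terms. A toy sketch that makes this concrete; exhaustive search over a two-pixel image stands in for graph cut, and all names and numbers are illustrative:

```python
from itertools import product

def loss_augmented_segmentation(unary, pairs, wij, yn):
    """Loss-augmented inference for binary segmentation, by exhaustive
    search on a tiny problem (graph cut would replace this at scale).
    The Hamming loss is absorbed into the unaries: unary[i][l] + [yn[i] != l];
    the pairwise score -wij * [yi != yj] with wij >= 0 penalizes
    disagreeing neighbors, so the corresponding energy is submodular."""
    n = len(unary)
    def score(y):
        s = sum(unary[i][y[i]] + (1.0 if yn[i] != y[i] else 0.0) for i in range(n))
        s -= sum(wij if y[i] != y[j] else 0.0 for i, j in pairs)
        return s
    return max(product([0, 1], repeat=n), key=score)

# Two pixels, each preferring label 1 by a unary margin of 1;
# strong smoothness; ground truth yn = (0, 0).
print(loss_augmented_segmentation([[0.0, 1.0], [0.0, 1.0]],
                                  pairs=[(0, 1)], wij=10.0, yn=(0, 0)))
# -> (1, 1): the labels agree, and the loss term rewards deviating from yn
```

During training this routine returns the "worst" segmentation, i.e. one that scores high yet disagrees with the ground truth, which supplies the violated constraints.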
Extension: Image segmentation with connectedness constraints
Knowing that the object is connected improves segmentation quality.
[Figure: original image, ordinary segmentation, connected segmentation]
Segmentation as Structured Learning:Given: training examples (xn, yn)n=1,...,N
Wanted: prediction function f : X → Y whereI X = {all images (as superpixels)}I Y = {all connected binary segmentations}
• S. Nowozin, C.L.: Global Connectivity Potentials for Random Field Models, CVPR 2009.
Feature functions: unary terms c = {i}:

ϕ_i(x, y_i) = (0, h_i(x)) for y_i = 0,  (h_i(x), 0) for y_i = 1.

h_i(x) is the bag-of-visual-words histogram of the superpixel i.

Feature functions: pairwise terms c = {i, j}:
ϕ_ij(y_i, y_j) = ⟦y_i ≠ y_j⟧.

Loss function: Hamming loss
∆(y, y′) = ∑_i ⟦y_i ≠ y′_i⟧
How to solve f(x) = argmax_{y connected} ∆(y^n, y) + F(x^n, y)?

Linear programming relaxation with connectivity constraints:
rewrite the energy so that it is linear in new variables µ^l_i and µ^{ll′}_{ij},

F(x, y) = ∑_i [w_1^⊤ h_i(x) µ^1_i + w_2^⊤ h_i(x) µ^{−1}_i + ∑_{l≠l′} w_3 µ^{ll′}_{ij}]

subject to

µ^l_i ∈ {0, 1},  µ^{ll′}_{ij} ∈ {0, 1},  ∑_l µ^l_i = 1,  ∑_l µ^{ll′}_{ij} = µ^{l′}_j,  ∑_{l′} µ^{ll′}_{ij} = µ^l_i

relax to µ^l_i ∈ [0, 1] and µ^{ll′}_{ij} ∈ [0, 1], and solve the linear program with additional linear constraints:

µ¹_i + µ¹_j − ∑_{k∈S} µ¹_k ≤ 1  for any set S of nodes separating i and j.
Example Results:
original segmentation with connectivity
. . . still room for improvement . . .
Summary
Machine Learning of Structured Outputs
Task: predict f : X → Y for (almost) arbitrary Y
Key idea:
- learn scoring function F : X × Y → R
- predict using f(x) := argmax_y F(x, y)

Structured Support Vector Machines
Parametrize F(x, y) = ⟨w, ϕ(x, y)⟩
Learn w from training data by a maximum-margin criterion
Needs only:
- feature function ϕ(x, y)
- loss function ∆(y, y′)
- routine to solve argmax_y ∆(y^n, y) + F(x^n, y)

Applications
Many different applications in a unified framework
- Natural Language Processing: parsing
- Computational Biology: secondary structure prediction
- Computer Vision: pose estimation, object localization/segmentation
- . . .

Open Problems
Theory:
- which output structures are useful?
- (how) can we use approximate argmax_y?
Practice:
- more applications? new domains?
- training speed!
Thank you!