TRANSCRIPT
Curriculum Learning for Latent Structural SVM
M. Pawan Kumar
(under submission)
Daphne Koller
Benjamin Packer
Aim: To learn accurate parameters for latent structural SVM
Input x
Output y ∈ Y
“Deer”
Hidden Variable h ∈ H
Y = {“Bison”, “Deer”, “Elephant”, “Giraffe”, “Llama”, “Rhino”}
Aim: To learn accurate parameters for latent structural SVM
Feature Φ(x,y,h) (HOG, BoW)
(y*, h*) = argmax_{y ∈ Y, h ∈ H} wᵀΦ(x,y,h)
Parameters w
Motivation
Real Numbers
Imaginary Numbers
e^(iπ) + 1 = 0
Math is for losers!!
FAILURE … BAD LOCAL MINIMUM
Motivation
Real Numbers
Imaginary Numbers
e^(iπ) + 1 = 0
Euler was a Genius!!
SUCCESS … GOOD LOCAL MINIMUM
Curriculum Learning: Bengio et al, ICML 2009
Motivation
Start with “easy” examples, then consider “hard” ones
Easy vs. Hard
Expensive
Easy for human Easy for machine
Simultaneously estimate easiness and parameters
Easiness is a property of data sets, not single instances
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments
Latent Structural SVM
Training samples xi
Ground-truth label yi
Loss Function Δ(yi, yi(w), hi(w))
Felzenszwalb et al, 2008, Yu and Joachims, 2009
Latent Structural SVM
(yi(w), hi(w)) = argmax_{y ∈ Y, h ∈ H} wᵀΦ(xi, y, h)
min_w ||w||² + C Σi Δ(yi, yi(w), hi(w))
Non-convex Objective
Minimize an upper bound
Latent Structural SVM
min_w ||w||² + C Σi ξi
max_{hi} wᵀΦ(xi, yi, hi) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi   ∀ y, h
Still non-convex
Difference of convex
CCCP Algorithm - converges to a local minimum
(yi(w), hi(w)) = argmax_{y ∈ Y, h ∈ H} wᵀΦ(xi, y, h)
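The slack in the bound above can be evaluated directly: ξi is the loss-augmented score minus the best score achievable with the ground-truth label. A minimal sketch, with an assumed toy feature map (the hidden variable picks a "window" of x) and 0/1 loss:

```python
# Latent SSVM slack for one training pair (x_i, y_i):
# xi_i = max_{y,h} [Delta(y_i, y) + w.phi(x, y, h)] - max_h w.phi(x, y_i, h)

def dot(w, f):
    return sum(a * b for a, b in zip(w, f))

def slack(w, x, y_true, labels, hidden, phi, delta):
    aug = max(delta(y_true, y) + dot(w, phi(x, y, h))
              for y in labels for h in hidden)            # loss-augmented max
    fit = max(dot(w, phi(x, y_true, h)) for h in hidden)  # best latent for y_i
    return max(0.0, aug - fit)

def phi(x, y, h):          # toy: h selects which "window" of x the label scores
    return [x[h] * (y == 0), x[h] * (y == 1)]

delta = lambda y, yp: float(y != yp)   # 0/1 loss

w = [-1.0, 1.0]
print(slack(w, [0.0, 1.0], 1, [0, 1], [0, 1], phi, delta))  # -> 0.0 (margin exactly met)
```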
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments
Concave-Convex Procedure
Start with an initial estimate w0
Update wt+1 by solving a convex problem
min_w ||w||² + C Σi ξi
wᵀΦ(xi, yi, hi*) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi   ∀ y, h
hi* = argmax_{h ∈ H} wtᵀΦ(xi, yi, h)
Concave-Convex Procedure
Looks at all samples simultaneously
“Hard” samples will cause confusion
Start with “easy” samples, then consider “hard” ones
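One CCCP pass can be sketched numerically: impute the hidden variables with the current w, then approximately solve the convex problem. Here plain subgradient steps stand in for the exact QP solve, and the toy task and feature map are assumptions for illustration:

```python
# One CCCP-style iteration for a latent SSVM (sketch, not the talk's solver).

def dot(w, f):
    return sum(a * b for a, b in zip(w, f))

def cccp_step(w, data, labels, hidden, phi, delta, C=1.0, lr=0.05, iters=100):
    # 1) Impute h_i* = argmax_h w.phi(x_i, y_i, h) (linearizes the concave part).
    imputed = [max(hidden, key=lambda h: dot(w, phi(x, y, h))) for x, y in data]
    # 2) Subgradient descent on the convex bound ||w||^2 + C * sum_i xi_i.
    w = list(w)
    for _ in range(iters):
        g = [2.0 * wi for wi in w]                    # gradient of ||w||^2
        for (x, y), h_star in zip(data, imputed):
            yb, hb = max(((yp, hp) for yp in labels for hp in hidden),
                         key=lambda t: delta(y, t[0]) + dot(w, phi(x, t[0], t[1])))
            if delta(y, yb) + dot(w, phi(x, yb, hb)) > dot(w, phi(x, y, h_star)):
                fb, ft = phi(x, yb, hb), phi(x, y, h_star)
                g = [gi + C * (a - b) for gi, a, b in zip(g, fb, ft)]
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Toy task: the latent h selects which "window" of x the label scores.
def phi(x, y, h):
    return [x[h] * (y == 0), x[h] * (y == 1)]

delta = lambda y, yp: float(y != yp)
data = [([0.0, 1.0], 1), ([1.0, 0.0], 1), ([0.0, -1.0], 0), ([-1.0, 0.0], 0)]

w = [0.0, 0.0]
for _ in range(3):            # a few CCCP iterations
    w = cccp_step(w, data, [0, 1], [0, 1], phi, delta)
```

Each outer iteration re-imputes the hidden variables, which is exactly where a bad early imputation on a "hard" sample can lock the procedure into a poor local minimum.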
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments
Curriculum Learning
REMINDER
Simultaneously estimate easiness and parameters
Easiness is a property of data sets, not single instances
Curriculum Learning
Start with an initial estimate w0
Update wt+1 by solving a convex problem
min_w ||w||² + C Σi ξi
wᵀΦ(xi, yi, hi*) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi   ∀ y, h
hi* = argmax_{h ∈ H} wtᵀΦ(xi, yi, h)
Curriculum Learning
min_w ||w||² + C Σi ξi
wᵀΦ(xi, yi, hi*) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi   ∀ y, h
Curriculum Learning
min_w ||w||² + C Σi vi ξi
wᵀΦ(xi, yi, hi*) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi   ∀ y, h
vi ∈ {0,1}
Trivial Solution: vi = 0 for all i
Curriculum Learning
vi ∈ {0,1}
Large K → Medium K → Small K
min_w ||w||² + C Σi vi ξi − Σi vi/K
wᵀΦ(xi, yi, hi*) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi   ∀ y, h
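With w fixed, the binary vi decouple: each vi multiplies (Cξi − 1/K), so the minimizer is the threshold rule vi = 1 exactly when ξi < 1/(KC). Large K therefore admits only the lowest-slack ("easy") samples. A small sketch with toy numbers:

```python
# Closed-form selection step of self-paced / curriculum learning:
# with w fixed, v_i = 1 iff the slack xi_i falls below 1/(K*C).

def select(slacks, C, K):
    return [1 if xi < 1.0 / (K * C) else 0 for xi in slacks]

slacks = [0.05, 0.2, 0.9, 1.5]
print(select(slacks, 1.0, K=10.0))  # threshold 0.1 -> only the easiest sample
print(select(slacks, 1.0, K=2.0))   # threshold 0.5 -> the two easiest samples
print(select(slacks, 1.0, K=0.5))   # threshold 2.0 -> all samples
```

Annealing K downward raises the threshold, which is how harder samples are gradually admitted.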
Curriculum Learning
vi ∈ [0,1]
min_w ||w||² + C Σi vi ξi − Σi vi/K
wᵀΦ(xi, yi, hi*) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi   ∀ y, h
Large K → Medium K → Small K
Biconvex Problem
Curriculum Learning
Start with an initial estimate w0
Update wt+1 by solving a convex problem
min_w ||w||² + C Σi vi ξi − Σi vi/K
wᵀΦ(xi, yi, hi*) − wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) − ξi   ∀ y, h
hi* = argmax_{h ∈ H} wtᵀΦ(xi, yi, h)
Decrease K by a constant factor: K ← K/μ
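The complete loop alternates sample selection, a parameter update on the selected samples, and annealing of K. A hedged end-to-end sketch on a plain hinge-loss classifier (latent variables omitted for brevity; data, solver, and initialization scheme are illustrative assumptions, not the talk's):

```python
import statistics

# Self-paced training loop: pick easy samples (loss < 1/K), update w on them,
# then anneal K so harder samples are admitted in later rounds.

def score(w, x):
    return sum(a * b for a, b in zip(w, x))

def hinge(w, x, y):                      # labels y in {-1, +1}
    return max(0.0, 1.0 - y * score(w, x))

def sgd_pass(w, batch, lr=0.05, iters=100, reg=0.01):
    # crude hinge-loss subgradient solver standing in for the convex step
    for _ in range(iters):
        for x, y in batch:
            if hinge(w, x, y) > 0:
                w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
    return w

def self_paced_train(data, dim, mu=1.3, rounds=6):
    w = sgd_pass([0.0] * dim, data, iters=5)          # rough initialization
    losses = [hinge(w, x, y) for x, y in data]
    K = 1.0 / (statistics.median(losses) + 1e-9)      # start with ~half the samples
    for _ in range(rounds):
        easy = [(x, y) for x, y in data if hinge(w, x, y) < 1.0 / K]  # v_i = 1
        if easy:
            w = sgd_pass(w, easy)
        K /= mu                                        # anneal: admit harder samples
    return w

# Toy data: separable points plus one mislabeled outlier (the "hard" sample).
data = [([1.0, 0.2], 1), ([0.9, -0.1], 1), ([-1.0, 0.1], -1),
        ([-0.8, -0.2], -1), ([1.2, 0.0], -1)]         # last point is the outlier
w = self_paced_train(data, dim=2)
print([1 if score(w, x) > 0 else -1 for x, _ in data[:4]])  # clean points recovered
```

The outlier stays above the selection threshold for the early rounds, so the easy, consistent samples shape w first, which is the behavior the slides motivate.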
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Curriculum Learning
• Experiments
Object Detection
Feature Φ(x,y,h) - HOG
Input x - Image
Output y ∈ Y
Latent h - Box
Δ - 0/1 Loss
Y = {“Bison”, “Deer”, “Elephant”, “Giraffe”, “Llama”, “Rhino”}
Object Detection
271 images, 6 classes
90/10 train/test split
5 folds
Mammals Dataset
Object Detection: CCCP vs. Curriculum
[Bar charts over Folds 1–5: objective value (≈4.0–4.9) and test error (0–25), CCCP vs. Curriculum]
Object Detection
Handwritten Digit Recognition
Feature Φ(x,y,h) - PCA + Projection
Input x - Image
Output y ∈ Y
Y = {0, 1, … , 9}
Latent h - Rotation
MNIST Dataset
Δ - 0/1 Loss
Handwritten Digit Recognition
[Result plots over values of C; markers indicate a significant difference between CCCP and Curriculum]
Motif Finding
Feature Φ(x,y,h) - Ng and Cardie, ACL 2002
Input x - DNA Sequence
Output y ∈ Y
Y = {0, 1}
Latent h - Motif Location
Δ - 0/1 Loss
Motif Finding
40,000 sequences
50/50 train/test split
5 folds
UniProbe Dataset
Motif Finding
Average Hamming Distance of Inferred Motifs
Motif Finding
[Bar chart over Folds 1–5: objective value (0–160), CCCP vs. Curriculum]
Motif Finding
[Bar chart over Folds 1–5: test error (0–50), CCCP vs. Curriculum]
Noun Phrase Coreference
Feature Φ(x,y,h) - Yu and Joachims, ICML 2009
Input x - Nouns
Output y - Clustering
Latent h - Spanning Forest over Nouns
Noun Phrase Coreference
60 documents
50/50 train/test split
1 predefined fold
MUC6 Dataset
Noun Phrase Coreference
[Results under MITRE Loss and Pairwise Loss; markers indicate significant improvement and significant decrement]
Summary
• Automatic Curriculum Learning
• Concave-Biconvex Procedure
• Generalization to other latent models
– Expectation-Maximization
– E-step remains the same
– M-step includes indicator variables vi