Some Useful Machine Learning Tools
M. Pawan Kumar
École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France
Outline
• Part I: Supervised Learning
• Part II: Weakly Supervised Learning
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine
  – Structured output support vector machine
Image Classification
Is this an urban or rural area?
Input: x    Output: y ∈ {-1,+1}
Image Classification
Is this scan healthy or unhealthy?
Input: x    Output: y ∈ {-1,+1}
Image Classification
Which city is this?
Input: x    Output: y ∈ {1,2,…,C}
Image Classification
What type of tumor does this scan contain?
Input: x    Output: y ∈ {1,2,…,C}
Object Detection
Where is the object in the image?
Input: x    Output: y ∈ {Pixels}
Object Detection
Where is the rupture in the scan?
Input: x    Output: y ∈ {Pixels}
Segmentation
What is the semantic class of each pixel?
Input: x    Output: y ∈ {1,2,…,C}^|Pixels|
(Figure: example segmentation with per-pixel labels car, road, grass, tree, sky)
Segmentation
What is the muscle group of each pixel?
Input: x    Output: y ∈ {1,2,…,C}^|Pixels|
A Simplified View of the Pipeline
Input x → Extract Features → Features Φ(x) → Compute Scores → Scores f(Φ(x),y)
Prediction: y(f) = argmax_y f(Φ(x),y)
Learn f
http://deeplearning.net
Learning Objective
Data distribution P(x,y)
f* = argmin_f E_{P(x,y)} [Error(y(f), y)]
Error measures the quality of the prediction y(f) against the ground truth y; the expectation is over the data distribution.
Problem: the distribution is unknown.
Learning Objective
Training data {(x_i, y_i), i = 1,2,…,n}
f* = argmin_f E_{P(x,y)} [Error(y(f), y)]
The expectation is still over the (unknown) data distribution; only finite samples are available.
Learning Objective
Training data {(x_i, y_i), i = 1,2,…,n}
f* = argmin_f Σ_i Error(y_i(f), y_i)
With finite samples, the expectation is taken over the empirical distribution instead.
Learning Objective
Training data {(x_i, y_i), i = 1,2,…,n}
f* = argmin_f Σ_i Error(y_i(f), y_i) + λ R(f)
R(f) is a regularizer that compensates for the finite samples; λ is its relative weight (a hyperparameter).
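To make the objective concrete, here is a minimal sketch of the regularized empirical risk, assuming a generic predictor and a 0-1 error; all names are illustrative placeholders, not part of the original slides.

```python
def regularized_empirical_risk(predict, data, regularizer, lam):
    """Sum of per-sample errors plus lam * R(f) over the finite samples.

    `predict` maps an input x to a prediction y(f); `regularizer` returns
    R(f); the 0-1 error below is just one possible choice of Error(y(f), y).
    """
    errors = sum(float(predict(x) != y) for x, y in data)
    return errors + lam * regularizer(predict)
```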
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine
  – Structured output support vector machine
Logistic Regression
Input: x    Output: y ∈ {-1,+1}    Features: Φ(x)
f(Φ(x),y) = y θ^T Φ(x)    Prediction: sign(θ^T Φ(x))
P(y|x) = l(f(Φ(x),y)), where l(z) = 1/(1+e^{-z}) is the logistic function
Is the distribution normalized? (Yes: l(z) + l(−z) = 1, so P(+1|x) + P(−1|x) = 1.)
Logistic Regression
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ R(θ)
The first term is the negative log-likelihood; R(θ) is the regularizer.
Logistic Regression
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Convex optimization problem
Proof left as an exercise.
Hint: Prove that the Hessian H is PSD, i.e. a^T H a ≥ 0 for all a
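A minimal numpy sketch of this objective and its Hessian, following the hint; the variable names and the numerical PSD check are illustrative additions, not part of the original slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(theta, X, y, lam):
    """L(theta) = sum_i -log l(y_i theta^T Phi(x_i)) + lam ||theta||^2.
    X is the (n, d) matrix of features Phi(x_i); y holds labels in {-1, +1}."""
    return -np.sum(np.log(sigmoid(y * (X @ theta)))) + lam * theta @ theta

def hessian(theta, X, y, lam):
    """H = sum_i s_i (1 - s_i) Phi(x_i) Phi(x_i)^T + 2 lam I, where
    s_i = l(y_i theta^T Phi(x_i)). Each summand is a nonnegative weight
    times an outer product, so a^T H a >= 0 for all a."""
    s = sigmoid(y * (X @ theta))
    w = s * (1.0 - s)
    return (X * w[:, None]).T @ X + 2.0 * lam * np.eye(X.shape[1])

# Numerical sanity check of the PSD claim on random data:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 5)), rng.choice([-1.0, 1.0], size=20)
H = hessian(rng.normal(size=5), X, y, lam=0.1)
assert np.all(np.linalg.eigvalsh(H) >= -1e-10)
```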
Gradient Descent
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Start with an initial estimate θ_0
θ_{t+1} ← θ_t − μ dL(θ)/dθ, evaluated at θ = θ_t
Repeat until the decrease in the objective is below a threshold
Gradient Descent
(Figure: iterates for a small μ vs. a large μ; a small step size converges slowly, a large one can overshoot)
Gradient Descent
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Start with an initial estimate θ_0
θ_{t+1} ← θ_t − μ dL(θ)/dθ, evaluated at θ = θ_t
Repeat until the decrease in the objective is below a threshold
μ: a small constant, or chosen by line search
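A sketch of this loop for the logistic regression objective, reusing `sigmoid` and `objective` from the previous snippet; the backtracking line search is one of the two step-size options mentioned on the slide, and the details here are illustrative.

```python
import numpy as np

def gradient(theta, X, y, lam):
    """dL/dtheta = -sum_i (1 - s_i) y_i Phi(x_i) + 2 lam theta."""
    s = sigmoid(y * (X @ theta))
    return -((1.0 - s) * y) @ X + 2.0 * lam * theta

def gradient_descent(X, y, lam, tol=1e-6, max_iter=1000):
    theta = np.zeros(X.shape[1])                  # initial estimate theta_0
    obj = objective(theta, X, y, lam)
    for _ in range(max_iter):
        g = gradient(theta, X, y, lam)
        mu = 1.0                                  # backtracking line search
        while objective(theta - mu * g, X, y, lam) > obj - 0.5 * mu * (g @ g):
            mu *= 0.5
        theta = theta - mu * g
        new_obj = objective(theta, X, y, lam)
        if obj - new_obj < tol:                   # decrease below threshold
            return theta
        obj = new_obj
    return theta
```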
Newton’s Method
Minimize g(z); solution at iteration t = z_t
Define g_t(Δz) = g(z_t + Δz)
Second-order Taylor series: g_t(Δz) ≈ g(z_t) + g′(z_t) Δz + (1/2) g′′(z_t) (Δz)^2
Setting the derivative w.r.t. Δz to 0 implies g′(z_t) + g′′(z_t) Δz = 0
Solving for Δz gives the step Δz = −g′(z_t)/g′′(z_t), i.e. an effective learning rate of 1/g′′(z_t)
Newton’s Method
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Start with an initial estimate θ_0
θ_{t+1} ← θ_t − μ dL(θ)/dθ, evaluated at θ = θ_t, with μ^{-1} = d^2L(θ)/dθ^2 evaluated at θ = θ_t
Repeat until the decrease in the objective is below a threshold
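The corresponding Newton update, again reusing the earlier `gradient` and `hessian` sketches; in the multivariate case μ^{-1} becomes the Hessian matrix, so the step solves a linear system. This is a sketch under those assumptions, not the slides' own code.

```python
import numpy as np

def newton_step(theta, X, y, lam):
    """theta_{t+1} = theta_t - H^{-1} g: the inverse Hessian plays the role
    of the learning rate mu. Solving H @ step = g is cheaper and more
    stable than forming H^{-1} explicitly."""
    g = gradient(theta, X, y, lam)
    H = hessian(theta, X, y, lam)
    return theta - np.linalg.solve(H, g)
```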
Logistic Regression
Input: x    Output: y ∈ {1,2,…,C}    Features: Φ(x)
Train C 1-vs-all logistic regression binary classifiers
Prediction: Maximum probability of +1 over C classifiers
Simple extension, easy to code
Loses the probabilistic interpretation
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine
  – Structured output support vector machine
Multiclass Logistic Regression
Input: x    Output: y ∈ {1,2,…,C}    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
Ψ(x,1) = [Φ(x) 0 0 … 0]
Ψ(x,2) = [0 Φ(x) 0 … 0]
…
Ψ(x,C) = [0 0 0 … Φ(x)]
Multiclass Logistic Regression
Input: x    Output: y ∈ {1,2,…,C}    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
f(Ψ(x,y)) = θ^T Ψ(x,y)
Prediction: argmax_y θ^T Ψ(x,y)
P(y|x) = exp(f(Ψ(x,y)))/Z(x)
Partition function Z(x) = Σ_y exp(f(Ψ(x,y)))
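A sketch of the joint feature construction and the resulting softmax distribution; classes are 0-indexed here for convenience, and all names are illustrative.

```python
import numpy as np

def joint_feature(phi, y, C):
    """Psi(x, y): Phi(x) placed in the y-th of C blocks, zeros elsewhere."""
    d = phi.shape[0]
    psi = np.zeros(C * d)
    psi[y * d:(y + 1) * d] = phi
    return psi

def class_probabilities(theta, phi, C):
    """P(y|x) = exp(theta^T Psi(x,y)) / Z(x), Z(x) summing over the C classes."""
    scores = np.array([theta @ joint_feature(phi, y, C) for y in range(C)])
    scores -= scores.max()   # shift for numerical stability (probabilities unchanged)
    p = np.exp(scores)
    return p / p.sum()       # divide by the partition function Z(x)
```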
Multiclass Logistic Regression
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Convex optimization problem
Gradient Descent, Newton’s Method, and many others
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine
  – Structured output support vector machine
Regularized Maximum Likelihood
Input: x    Output: y ∈ {1,2,…,C}^m    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y), for example:
[Ψ(x,y_1); Ψ(x,y_2); …; Ψ(x,y_m)]
[Ψ(x,y_i), for all i; Ψ(x,y_i,y_j), for all i, j]
[Ψ(x,y_i), for all i; Ψ(x,y_c), where c is a subset of variables]
Regularized Maximum Likelihood
Input: x    Output: y ∈ {1,2,…,C}^m    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
f(Ψ(x,y)) = θ^T Ψ(x,y)
Prediction: argmax_y θ^T Ψ(x,y)
P(y|x) = exp(f(Ψ(x,y)))/Z(x)
Partition function Z(x) = Σ_y exp(f(Ψ(x,y)))
Regularized Maximum Likelihood
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Partition function is expensive to compute: Z(x) sums over all C^m labelings
Regularized Maximum Likelihood
Approximate inference (Nikos Komodakis’ tutorial)
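To see why Z(x) is the bottleneck, here is a brute-force version under the illustrative assumption that `score` evaluates f(Ψ(x,y)) for one complete labeling y; it touches all C^m terms.

```python
import itertools
import math

def partition_function(score, m, C):
    """Z(x) = sum over all C**m labelings y in {0,...,C-1}^m of exp(f(Psi(x,y))).

    Exhaustive enumeration: with C = 21 classes and m = 100 pixels this is
    21**100 terms, which is why approximate inference is needed in practice."""
    return sum(math.exp(score(y)) for y in itertools.product(range(C), repeat=m))

# Tiny example: 3 variables, 2 labels each, a separable score.
Z = partition_function(lambda y: 0.5 * sum(y), m=3, C=2)
```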
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine (multiclass)
  – Structured output support vector machine
Multiclass SVM
Input: x    Output: y ∈ {1,2,…,C}    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
Ψ(x,1) = [Φ(x) 0 0 … 0]
Ψ(x,2) = [0 Φ(x) 0 … 0]
…
Ψ(x,C) = [0 0 0 … Φ(x)]
Multiclass SVM
Input: x    Output: y ∈ {1,2,…,C}    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
f(Ψ(x,y)) = w^T Ψ(x,y)
Predicted output: y(w) = argmax_y w^T Ψ(x,y)
Multiclass SVM
Training data {(x_i, y_i), i = 1,2,…,n}
Δ(y_i, y_i(w)): loss function for the i-th sample
Minimize the regularized sum of the loss over the training data
Highly non-convex in w
Regularization plays no role: the predictions y(w) are invariant to the scale of w, so ||w|| can shrink freely and overfitting may occur
Multiclass SVM
Training data {(x_i, y_i), i = 1,2,…,n}
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w)) − w^T Ψ(x_i, y_i(w))
≤ w^T Ψ(x_i, y_i(w)) + Δ(y_i, y_i(w)) − w^T Ψ(x_i, y_i)
≤ max_y { w^T Ψ(x_i, y) + Δ(y_i, y) } − w^T Ψ(x_i, y_i)
Convex, and sensitive to the regularization of w
Multiclass SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y) + Δ(y_i, y) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all y
Specialized software packages freely available
http://www.cs.cornell.edu/People/tj/svm_light/svm_multiclass.html
Quadratic program with polynomial # of constraints
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine (multiclass)
  – Structured output support vector machine
Structured Output SVM
Input: x    Output: y ∈ {1,2,…,C}^m    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
f(Ψ(x,y)) = w^T Ψ(x,y)
Prediction: argmax_y w^T Ψ(x,y)
Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y) + Δ(y_i, y) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all y
Quadratic program with exponential # of constraints
Many polynomial time algorithms
Cutting Plane Algorithm
Define working sets W_i = {}
REPEAT:
  Update w by solving the restricted problem
    min_w ||w||^2 + C Σ_i ξ_i
    s.t. w^T Ψ(x_i, y) + Δ(y_i, y) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all y ∈ W_i
  Compute the most violated constraint for each sample:
    ŷ_i = argmax_y w^T Ψ(x_i, y) + Δ(y_i, y)
  Update the working sets W_i by adding ŷ_i
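A self-contained numpy sketch of this loop for the multiclass case. The inner QP is replaced here by plain subgradient descent on the equivalent hinge form, purely for self-containedness; the packages linked below solve it exactly, and all names are illustrative.

```python
import numpy as np

def most_violated(w, psi_all, y_true, delta):
    """y_hat = argmax_y w^T Psi(x,y) + Delta(y_true, y) (loss-augmented inference).
    psi_all[y] holds Psi(x, y) for each of the C labels."""
    scores = [w @ psi_all[y] + delta(y_true, y) for y in range(len(psi_all))]
    return int(np.argmax(scores))

def solve_working_set(w, data, working, C_reg, delta, steps=200):
    """Approximately minimize ||w||^2 + C sum_i xi_i subject to the working-set
    constraints, via subgradient descent on the hinge form of the problem."""
    for t in range(steps):
        grad = 2.0 * w
        for (psi_all, y_i), W_i in zip(data, working):
            if not W_i:
                continue
            y_star = max(W_i, key=lambda y: w @ psi_all[y] + delta(y_i, y))
            if w @ psi_all[y_star] + delta(y_i, y_star) - w @ psi_all[y_i] > 0:
                grad += C_reg * (psi_all[y_star] - psi_all[y_i])
        w = w - (0.1 / (1.0 + t)) * grad
    return w

def cutting_plane(data, C_reg, delta, d, eps=1e-3, max_rounds=100):
    """data: list of (psi_all, y_true) pairs; d: dimension of Psi."""
    w = np.zeros(d)
    working = [set() for _ in data]
    for _ in range(max_rounds):
        added = False
        for i, (psi_all, y_i) in enumerate(data):
            y_hat = most_violated(w, psi_all, y_i, delta)
            xi_i = max([0.0] + [w @ psi_all[y] + delta(y_i, y) - w @ psi_all[y_i]
                                for y in working[i]])      # current slack
            if w @ psi_all[y_hat] + delta(y_i, y_hat) - w @ psi_all[y_i] > xi_i + eps:
                working[i].add(y_hat)    # violated beyond eps: add the constraint
                added = True
        if not added:
            return w                     # termination criterion met
        w = solve_working_set(w, data, working, C_reg, delta)
    return w
```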
Cutting Plane Algorithm
Number of iterations = max{O(n/ε), O(C/ε^2)}
Termination criterion: violation of ŷ_i < ξ_i + ε, for all i
Ioannis Tsochantaridis et al., JMLR 2005
At each iteration, convex dual of problem increases.
Convex dual can be upper bounded.
http://svmlight.joachims.org/svm_struct.html
Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y) + Δ(y_i, y) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all y ∈ {1,2,…,C}^m
Number of constraints = n·C^m
Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y) + Δ(y_i, y) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all y ∈ Y
Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, z_i) + Δ(y_i, z_i) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all z_i ∈ Y
Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. Σ_i (w^T Ψ(x_i, z_i) + Δ(y_i, z_i) − w^T Ψ(x_i, y_i)) ≤ Σ_i ξ_i, for all Z = {z_i, i=1,…,n} ∈ Y^n
Equivalent problem to the structured output SVM
Number of constraints = C^{mn}
1-Slack Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C ξ
s.t. Σ_i (w^T Ψ(x_i, z_i) + Δ(y_i, z_i) − w^T Ψ(x_i, y_i)) ≤ ξ, for all Z = {z_i, i=1,…,n} ∈ Y^n
Cutting Plane Algorithm
Define working set W = {}
REPEAT:
  Update w by solving the restricted problem
    min_w ||w||^2 + C ξ
    s.t. Σ_i (w^T Ψ(x_i, z_i) + Δ(y_i, z_i) − w^T Ψ(x_i, y_i)) ≤ ξ, for all Z ∈ W
  Compute the most violated constraint for each sample:
    z_i = argmax_y w^T Ψ(x_i, y) + Δ(y_i, y)
  Update the working set W by adding Z = {z_i, i=1,…,n}
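The only change to the sketch above for the 1-slack case: the most violated joint constraint over Z ∈ Y^n decomposes across samples, so each z_i is found independently and the violations are summed against the single slack ξ. This reuses `most_violated` from the previous snippet and is again illustrative.

```python
def most_violated_joint(w, data, delta):
    """Z = {z_i} maximizing sum_i (w^T Psi(x_i,z_i) + Delta(y_i,z_i) - w^T Psi(x_i,y_i)).
    The sum decouples, so each z_i is a per-sample loss-augmented argmax."""
    Z, total = [], 0.0
    for psi_all, y_i in data:
        z_i = most_violated(w, psi_all, y_i, delta)
        total += w @ psi_all[z_i] + delta(y_i, z_i) - w @ psi_all[y_i]
        Z.append(z_i)
    return Z, total
```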
Cutting Plane Algorithm
Number of iterations = O(C/ε)
Termination criterion: violation of {z_i} < ξ + ε
Thorsten Joachims et al., Machine Learning 2009
At each iteration, convex dual of problem increases.
Convex dual can be upper bounded.
http://svmlight.joachims.org/svm_struct.html
Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning
Computer Vision Data
(Figure: annotation detail vs. log dataset size)
Segmentation: ~2,000 images
Bounding box: ~1 M
Image-level labels (“Car”, “Chair”): >14 M
Noisy labels: >6 B
The more detailed the annotation, the less data is available.
Data
Learn with missing information (latent variables)
Detailed annotation is expensive
Often, in medical imaging, annotation is impossible
Desired annotation keeps changing
Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning
Annotation Mismatch
Learn to classify an image
Image: x    Annotation: y = “Deer”    Latent variable: h
Desired output: y
Mismatch between desired and available annotations
Exact value of the latent variable is not “important”
Annotation Mismatch
Learn to classify a DNA sequence
Sequence: x    Annotation: y ∈ {+1, −1}    Latent variables: h
Desired output: y
Mismatch between desired and possible annotations
Exact value of the latent variable is not “important”
Output Mismatch
Learn to detect an object in an image
Image: x    Annotation: y = “Deer”    Desired output: (y, h)
Mismatch between output and available annotations
Exact value of the latent variable is important
Output Mismatch
Learn to segment an image
Available annotation: image-level label (x, y), e.g. “Bird” or “Cow”    Desired output: segmentation (y, h)
Mismatch between output and available annotations
Exact value of the latent variable is important
Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning
Expectation Maximization
Input: x    Annotation: y    Latent variables: h
Joint feature vector: Ψ(x,y,h)
f(Ψ(x,y,h)) = θ^T Ψ(x,y,h)
Prediction: argmax_y P(y|x;θ) = argmax_y Σ_h P(y,h|x;θ)
P(y,h|x;θ) = exp(f(Ψ(x,y,h)))/Z(x;θ)
Partition function Z(x;θ) = Σ_{y,h} exp(f(Ψ(x,y,h)))
Expectation Maximization
Input: x    Annotation: y    Latent variables: h
Joint feature vector: Ψ(x,y,h)
f(Ψ(x,y,h)) = θ^T Ψ(x,y,h)
Prediction: argmax_{y,h} P(y,h|x;θ)    (maximization over h in place of marginalization)
P(y,h|x;θ) = exp(f(Ψ(x,y,h)))/Z(x;θ)
Partition function Z(x;θ) = Σ_{y,h} exp(f(Ψ(x,y,h)))
Expectation Maximization
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i;θ) + λ ||θ||^2    (annotation mismatch)
For any θ′:
−log P(y|x;θ) = −E_{P(h|y,x;θ′)}[log P(y,h|x;θ)] + E_{P(h|y,x;θ′)}[log P(h|y,x;θ)]
The second term is maximized at θ = θ′ (proof left as an exercise), so dropping it gives an upper bound that touches −log P(y|x;θ) at θ = θ′; EM repeatedly minimizes this bound:
min_θ Σ_i −E_{P(h|y_i,x_i;θ′)}[log P(y_i,h|x_i;θ)] + λ ||θ||^2
Expectation Maximization
Start with an initial estimate θ_0
E-step: Compute P(h|y_i, x_i; θ_t)
M-step: Obtain θ_{t+1} by solving
  min_θ Σ_i −E_{P(h|y_i,x_i;θ_t)}[log P(y_i,h|x_i;θ)] + λ ||θ||^2
Repeat until convergence
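A toy numpy sketch of one EM iteration for this log-linear model, assuming label and latent spaces small enough to enumerate; each sample stores Ψ(x,y,h) as an array of shape (|Y|, |H|, d), the M-step uses gradient descent, and all names and shapes are illustrative.

```python
import numpy as np

def em_step(theta, data, lam, inner_steps=100, lr=0.1):
    """One EM iteration. data: list of (psi, y_obs) with psi[y, h] = Psi(x, y, h)."""
    # E-step: q_i(h) = P(h | y_i, x_i; theta_t), by enumerating h.
    posteriors = []
    for psi, y_obs in data:
        s = psi[y_obs] @ theta                      # scores over h, shape (|H|,)
        q = np.exp(s - s.max())
        posteriors.append(q / q.sum())
    # M-step: minimize sum_i -E_q[log P(y_i, h | x_i; theta)] + lam ||theta||^2.
    # The gradient of -log P(y,h|x;theta) is -Psi(x,y,h) + E_P[Psi], so the
    # expected gradient is (model mean of Psi) - (posterior mean of Psi).
    for _ in range(inner_steps):
        grad = 2.0 * lam * theta
        for (psi, y_obs), q in zip(data, posteriors):
            flat = psi.reshape(-1, psi.shape[-1])   # all (y, h) pairs
            scores = flat @ theta
            p = np.exp(scores - scores.max())
            p /= p.sum()                            # P(y, h | x; theta)
            grad += p @ flat - q @ psi[y_obs]
        theta = theta - lr * grad
    return theta
```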
Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning
Latent SVM
Input: x    Output: y ∈ Y, e.g. “Deer”    Hidden variable: h ∈ H
Y = {“Bison”, “Deer”, “Elephant”, “Giraffe”, “Llama”, “Rhino”}
Latent SVM
Features: Ψ(x,y,h) (e.g. HOG, BoW)    Parameters: w
(y(w), h(w)) = argmax_{y∈Y, h∈H} w^T Ψ(x,y,h)
Latent SVM
Training samples: x_i    Ground-truth label: y_i
Loss function: Δ(y_i, y_i(w))    (annotation mismatch)
(y(w), h(w)) = argmax_{y∈Y, h∈H} w^T Ψ(x,y,h)
Latent SVM
(y(w), h(w)) = argmax_{y∈Y, h∈H} w^T Ψ(x,y,h)
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) − w^T Ψ(x_i, y_i(w), h_i(w))    (“very” non-convex)
≤ Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) − max_{h_i} w^T Ψ(x_i, y_i, h_i)    (upper bound)
≤ max_{y,h} { w^T Ψ(x_i, y, h) + Δ(y_i, y) } − max_{h_i} w^T Ψ(x_i, y_i, h_i)
Latent SVM
min_w ||w||^2 + C Σ_i ξ_i
s.t. max_{h_i} w^T Ψ(x_i, y_i, h_i) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i
So is this convex?
Latent SVM
max_{y,h} { w^T Ψ(x_i, y, h) + Δ(y_i, y) }    (convex)
− max_{h_i} w^T Ψ(x_i, y_i, h_i)    (convex)
Difference-of-convex!!
Concave-Convex Procedure
(Figure: the objective decomposed as a convex part plus a concave part)
Replace the concave part with its linear upper bound at the current point, minimize the resulting convex function, and repeat until convergence.
Latent SVM
max_{y,h} { w^T Ψ(x_i, y, h) + Δ(y_i, y) } − max_{h_i} w^T Ψ(x_i, y_i, h_i)
Linear upper bound of the concave part at w_t: −w^T Ψ(x_i, y_i, h_i*), where h_i* = argmax_{h_i} w_t^T Ψ(x_i, y_i, h_i)
Latent SVM
min_w ||w||^2 + C Σ_i ξ_i
s.t. max_{h_i} w^T Ψ(x_i, y_i, h_i) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i
Solve using CCCP
CCCP for Latent SVM
Start with an initial estimate w_0
Update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i)
Update w_{t+1} by solving the convex problem
  min_w ||w||^2 + C Σ_i ξ_i
  s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i
Repeat until convergence
http://webdocs.cs.ualberta.ca/~chunnam/
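A compact numpy sketch of this loop. Ψ(x_i, y, h) is stored per sample as an array of shape (|Y|, |H|, d), and the convex inner problem is again handled by subgradient descent rather than an exact QP solver; everything here is illustrative rather than the authors' implementation.

```python
import numpy as np

def cccp_latent_svm(data, C_reg, delta, d, outer_iters=20, inner_steps=200):
    """data: list of (psi, y_true) with psi[y, h] = Psi(x, y, h)."""
    w = np.zeros(d)
    for _ in range(outer_iters):
        # Impute latent variables: h_i* = argmax_h w^T Psi(x_i, y_i, h)
        h_star = [int(np.argmax(psi[y_i] @ w)) for psi, y_i in data]
        # Convex step: hinge form of the QP with h_i* fixed
        for t in range(inner_steps):
            grad = 2.0 * w
            for (psi, y_i), h_i in zip(data, h_star):
                Y, H, _ = psi.shape
                flat = psi.reshape(Y * H, -1)
                aug = flat @ w + np.repeat([delta(y_i, y) for y in range(Y)], H)
                j = int(np.argmax(aug))             # loss-augmented (y, h) pair
                if aug[j] - psi[y_i, h_i] @ w > 0:  # margin constraint violated
                    y_hat, h_hat = divmod(j, H)
                    grad += C_reg * (psi[y_hat, h_hat] - psi[y_i, h_i])
            w = w - (0.1 / (1.0 + t)) * grad
    return w
```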
CCCP for Human Learning
1 + 1 = 2
1/3 + 1/6 = 1/2
e^{iπ} + 1 = 0
Math is for losers !!
FAILURE … BAD LOCAL MINIMUM
Self-Paced Learning
Euler was a genius!!
SUCCESS … GOOD LOCAL MINIMUM
1 + 1 = 2
1/3 + 1/6 = 1/2
e^{iπ} + 1 = 0
Self-Paced Learning
Start with “easy” examples, then consider “hard” ones
Easy vs. hard: hand-picking easy examples is expensive, and what is easy for a human is not necessarily easy for the machine
Instead, simultaneously estimate the easiness and the parameters
Easiness is a property of data sets, not single instances
CCCP for Latent SVM
Start with an initial estimate w_0
Update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i)
Update w_{t+1} by solving the convex problem
  min_w ||w||^2 + C Σ_i ξ_i
  s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i
Self-Paced Learning
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i
Self-Paced Learning
min_{w,v} ||w||^2 + C Σ_i v_i ξ_i
s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i
v_i ∈ {0,1}
Trivial solution: set v_i = 0 for all i
Self-Paced Learning
min_{w,v} ||w||^2 + C Σ_i v_i ξ_i − Σ_i v_i / K
s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i
v_i ∈ {0,1}
(Figure: samples selected for large, medium, and small K; as K decreases, more samples are included)
Self-Paced Learning
min_{w,v} ||w||^2 + C Σ_i v_i ξ_i − Σ_i v_i / K
s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i
v_i ∈ [0,1]    (relaxation)
Biconvex problem, solved by alternating convex search
SPL for Latent SVM
Start with an initial estimate w_0
Update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i)
Update w_{t+1} by solving the convex problem
  min_{w,v} ||w||^2 + C Σ_i v_i ξ_i − Σ_i v_i / K
  s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i
Decrease K ← K/μ for an annealing factor μ > 1
http://cvc.centrale-ponts.fr/personnel/pawan/
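With w fixed, the relaxed problem is linear in each v_i, so the selection step has a closed form. Below is a sketch of the alternating search, where `train_fn` and `losses_fn` stand in for the CCCP inner solver and the per-sample slack computation; the initial K, the annealing factor, and all names are illustrative assumptions.

```python
import numpy as np

def select_easy(losses, C_reg, K):
    """Optimal v with w fixed: the objective is linear in each v_i, so
    v_i = 1 exactly when C * loss_i < 1/K ('easy' sample), else v_i = 0."""
    return (C_reg * np.asarray(losses) < 1.0 / K).astype(float)

def self_paced_learning(losses_fn, train_fn, C_reg, w0, K=100.0, mu=1.3, rounds=10):
    """Alternate between picking the currently easy samples and retraining
    on them, then anneal K so harder samples enter in later rounds."""
    w = w0                                        # initial estimate w_0
    for _ in range(rounds):
        v = select_easy(losses_fn(w), C_reg, K)   # update easiness weights v
        w = train_fn(v)                           # update w on selected samples
        K = K / mu                                # decrease K <- K / mu
    return w
```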
Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning (if time permits)