Some Useful Machine Learning Tools
M. Pawan Kumar
École Centrale Paris, École des Ponts ParisTech, INRIA Saclay, Île-de-France
Outline
• Part I: Supervised Learning
• Part II: Weakly Supervised Learning
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine
  – Structured output support vector machine
Image Classification
Is this an urban or rural area?
Input: x    Output: y ∈ {-1,+1}
Image Classification
Is this scan healthy or unhealthy?
Input: x    Output: y ∈ {-1,+1}
Image Classification
Which city is this?
Input: x    Output: y ∈ {1,2,…,C}
Image Classification
What type of tumor does this scan contain?
Input: x    Output: y ∈ {1,2,…,C}
Object Detection
Where is the object in the image?
Input: x    Output: y ∈ {Pixels}
Object Detection
Where is the rupture in the scan?
Input: x    Output: y ∈ {Pixels}
Segmentation
What is the semantic class of each pixel?
Input: x    Output: y ∈ {1,2,…,C}^|Pixels|
(Figure: example segmentation with per-pixel labels car, road, grass, tree, sky)
Segmentation
What is the muscle group of each pixel?
Input: x    Output: y ∈ {1,2,…,C}^|Pixels|
A Simplified View of the Pipeline
Input x → Extract Features → Features Φ(x) → Compute Scores → Scores f(Φ(x),y)
Prediction: y(f) = argmax_y f(Φ(x),y)
Learn f
http://deeplearning.net
Learning Objective
Data distribution P(x,y)
f* = argmin_f E_{P(x,y)} [Error(y(f), y)]
Error measures the quality of the prediction y(f) against the ground truth y; the expectation is over the data distribution.
Problem: the distribution is unknown.
Learning Objective
Training data {(x_i, y_i), i = 1,2,…,n}
f* = argmin_f E_{P(x,y)} [Error(y(f), y)]
The expectation is still over the (unknown) data distribution; only finite samples are available.
Learning Objective
Training data {(x_i, y_i), i = 1,2,…,n}
f* = argmin_f Σ_i Error(y_i(f), y_i)
With finite samples, the expectation is taken over the empirical distribution instead.
Learning Objective
Training data {(x_i, y_i), i = 1,2,…,n}
f* = argmin_f Σ_i Error(y_i(f), y_i) + λ R(f)
R(f) is a regularizer that compensates for the finite samples; λ is its relative weight (a hyperparameter).
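To make the objective concrete, here is a minimal sketch of the regularized empirical risk, assuming a generic predictor and a 0-1 error; all names are illustrative placeholders, not part of the original slides.

```python
def regularized_empirical_risk(predict, data, regularizer, lam):
    """Sum of per-sample errors plus lam * R(f) over the finite samples.

    `predict` maps an input x to a prediction y(f); `regularizer` returns
    R(f); the 0-1 error below is just one possible choice of Error(y(f), y).
    """
    errors = sum(float(predict(x) != y) for x, y in data)
    return errors + lam * regularizer(predict)
```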
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine
  – Structured output support vector machine
Logistic Regression
Input: x    Output: y ∈ {-1,+1}    Features: Φ(x)
f(Φ(x),y) = y θ^T Φ(x)    Prediction: sign(θ^T Φ(x))
P(y|x) = l(f(Φ(x),y)), where l(z) = 1/(1+e^{-z}) is the logistic function
Is the distribution normalized? (Yes: l(z) + l(−z) = 1, so P(+1|x) + P(−1|x) = 1.)
Logistic Regression
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ R(θ)
The first term is the negative log-likelihood; R(θ) is the regularizer.
Logistic Regression
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Convex optimization problem
Proof left as an exercise.
Hint: Prove that the Hessian H is PSD, i.e. a^T H a ≥ 0 for all a
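A minimal numpy sketch of this objective and its Hessian, following the hint; the variable names and the numerical PSD check are illustrative additions, not part of the original slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(theta, X, y, lam):
    """L(theta) = sum_i -log l(y_i theta^T Phi(x_i)) + lam ||theta||^2.
    X is the (n, d) matrix of features Phi(x_i); y holds labels in {-1, +1}."""
    return -np.sum(np.log(sigmoid(y * (X @ theta)))) + lam * theta @ theta

def hessian(theta, X, y, lam):
    """H = sum_i s_i (1 - s_i) Phi(x_i) Phi(x_i)^T + 2 lam I, where
    s_i = l(y_i theta^T Phi(x_i)). Each summand is a nonnegative weight
    times an outer product, so a^T H a >= 0 for all a."""
    s = sigmoid(y * (X @ theta))
    w = s * (1.0 - s)
    return (X * w[:, None]).T @ X + 2.0 * lam * np.eye(X.shape[1])

# Numerical sanity check of the PSD claim on random data:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 5)), rng.choice([-1.0, 1.0], size=20)
H = hessian(rng.normal(size=5), X, y, lam=0.1)
assert np.all(np.linalg.eigvalsh(H) >= -1e-10)
```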
Gradient Descent
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Start with an initial estimate θ_0
θ_{t+1} ← θ_t − μ dL(θ)/dθ, evaluated at θ = θ_t
Repeat until the decrease in the objective is below a threshold
Gradient Descent
(Figure: iterates for a small μ vs. a large μ; a small step size converges slowly, a large one can overshoot)
Gradient Descent
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Start with an initial estimate θ_0
θ_{t+1} ← θ_t − μ dL(θ)/dθ, evaluated at θ = θ_t
Repeat until the decrease in the objective is below a threshold
μ: a small constant, or chosen by line search
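A sketch of this loop for the logistic regression objective, reusing `sigmoid` and `objective` from the previous snippet; the backtracking line search is one of the two step-size options mentioned on the slide, and the details here are illustrative.

```python
import numpy as np

def gradient(theta, X, y, lam):
    """dL/dtheta = -sum_i (1 - s_i) y_i Phi(x_i) + 2 lam theta."""
    s = sigmoid(y * (X @ theta))
    return -((1.0 - s) * y) @ X + 2.0 * lam * theta

def gradient_descent(X, y, lam, tol=1e-6, max_iter=1000):
    theta = np.zeros(X.shape[1])                  # initial estimate theta_0
    obj = objective(theta, X, y, lam)
    for _ in range(max_iter):
        g = gradient(theta, X, y, lam)
        mu = 1.0                                  # backtracking line search
        while objective(theta - mu * g, X, y, lam) > obj - 0.5 * mu * (g @ g):
            mu *= 0.5
        theta = theta - mu * g
        new_obj = objective(theta, X, y, lam)
        if obj - new_obj < tol:                   # decrease below threshold
            return theta
        obj = new_obj
    return theta
```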
Newton’s Method
Minimize g(z); solution at iteration t = z_t
Define g_t(Δz) = g(z_t + Δz)
Second-order Taylor series: g_t(Δz) ≈ g(z_t) + g′(z_t) Δz + (1/2) g′′(z_t) (Δz)^2
Setting the derivative w.r.t. Δz to 0 implies g′(z_t) + g′′(z_t) Δz = 0
Solving for Δz gives the step Δz = −g′(z_t)/g′′(z_t), i.e. an effective learning rate of 1/g′′(z_t)
Newton’s Method
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Start with an initial estimate θ_0
θ_{t+1} ← θ_t − μ dL(θ)/dθ, evaluated at θ = θ_t, with μ^{-1} = d^2L(θ)/dθ^2 evaluated at θ = θ_t
Repeat until the decrease in the objective is below a threshold
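The corresponding Newton update, again reusing the earlier `gradient` and `hessian` sketches; in the multivariate case μ^{-1} becomes the Hessian matrix, so the step solves a linear system. This is a sketch under those assumptions, not the slides' own code.

```python
import numpy as np

def newton_step(theta, X, y, lam):
    """theta_{t+1} = theta_t - H^{-1} g: the inverse Hessian plays the role
    of the learning rate mu. Solving H @ step = g is cheaper and more
    stable than forming H^{-1} explicitly."""
    g = gradient(theta, X, y, lam)
    H = hessian(theta, X, y, lam)
    return theta - np.linalg.solve(H, g)
```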
Logistic Regression
Input: x    Output: y ∈ {1,2,…,C}    Features: Φ(x)
Train C 1-vs-all logistic regression binary classifiers
Prediction: Maximum probability of +1 over C classifiers
Simple extension, easy to code
Loses the probabilistic interpretation
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine
  – Structured output support vector machine
Multiclass Logistic Regression
Input: x    Output: y ∈ {1,2,…,C}    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
Ψ(x,1) = [Φ(x) 0 0 … 0]
Ψ(x,2) = [0 Φ(x) 0 … 0]
…
Ψ(x,C) = [0 0 0 … Φ(x)]
Multiclass Logistic Regression
Input: x    Output: y ∈ {1,2,…,C}    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
f(Ψ(x,y)) = θ^T Ψ(x,y)
Prediction: argmax_y θ^T Ψ(x,y)
P(y|x) = exp(f(Ψ(x,y)))/Z(x)
Partition function Z(x) = Σ_y exp(f(Ψ(x,y)))
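A sketch of the joint feature construction and the resulting softmax distribution; classes are 0-indexed here for convenience, and all names are illustrative.

```python
import numpy as np

def joint_feature(phi, y, C):
    """Psi(x, y): Phi(x) placed in the y-th of C blocks, zeros elsewhere."""
    d = phi.shape[0]
    psi = np.zeros(C * d)
    psi[y * d:(y + 1) * d] = phi
    return psi

def class_probabilities(theta, phi, C):
    """P(y|x) = exp(theta^T Psi(x,y)) / Z(x), Z(x) summing over the C classes."""
    scores = np.array([theta @ joint_feature(phi, y, C) for y in range(C)])
    scores -= scores.max()   # shift for numerical stability (probabilities unchanged)
    p = np.exp(scores)
    return p / p.sum()       # divide by the partition function Z(x)
```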
Multiclass Logistic Regression
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Convex optimization problem
Gradient Descent, Newton’s Method, and many others
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine
  – Structured output support vector machine
Regularized Maximum Likelihood
Input: x    Output: y ∈ {1,2,…,C}^m    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y), for example:
[Ψ(x,y_1); Ψ(x,y_2); …; Ψ(x,y_m)]
[Ψ(x,y_i), for all i; Ψ(x,y_i,y_j), for all i, j]
[Ψ(x,y_i), for all i; Ψ(x,y_c), where c is a subset of variables]
Regularized Maximum Likelihood
Input: x    Output: y ∈ {1,2,…,C}^m    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
f(Ψ(x,y)) = θ^T Ψ(x,y)
Prediction: argmax_y θ^T Ψ(x,y)
P(y|x) = exp(f(Ψ(x,y)))/Z(x)
Partition function Z(x) = Σ_y exp(f(Ψ(x,y)))
Regularized Maximum Likelihood
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i) + λ ||θ||^2
Partition function is expensive to compute: Z(x) sums over all C^m labelings
Regularized Maximum Likelihood
Approximate inference (Nikos Komodakis’ tutorial)
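To see why Z(x) is the bottleneck, here is a brute-force version under the illustrative assumption that `score` evaluates f(Ψ(x,y)) for one complete labeling y; it touches all C^m terms.

```python
import itertools
import math

def partition_function(score, m, C):
    """Z(x) = sum over all C**m labelings y in {0,...,C-1}^m of exp(f(Psi(x,y))).

    Exhaustive enumeration: with C = 21 classes and m = 100 pixels this is
    21**100 terms, which is why approximate inference is needed in practice."""
    return sum(math.exp(score(y)) for y in itertools.product(range(C), repeat=m))

# Tiny example: 3 variables, 2 labels each, a separable score.
Z = partition_function(lambda y: 0.5 * sum(y), m=3, C=2)
```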
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine (multiclass)
  – Structured output support vector machine
Multiclass SVM
Input: x    Output: y ∈ {1,2,…,C}    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
Ψ(x,1) = [Φ(x) 0 0 … 0]
Ψ(x,2) = [0 Φ(x) 0 … 0]
…
Ψ(x,C) = [0 0 0 … Φ(x)]
Multiclass SVM
Input: x    Output: y ∈ {1,2,…,C}    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
f(Ψ(x,y)) = w^T Ψ(x,y)
Predicted output: y(w) = argmax_y w^T Ψ(x,y)
Multiclass SVM
Training data {(x_i, y_i), i = 1,2,…,n}
Δ(y_i, y_i(w)): loss function for the i-th sample
Minimize the regularized sum of the loss over the training data
Highly non-convex in w
Regularization plays no role: the predictions y(w) are invariant to the scale of w, so ||w|| can shrink freely and overfitting may occur
Multiclass SVM
Training data {(x_i, y_i), i = 1,2,…,n}
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w)) − w^T Ψ(x_i, y_i(w))
≤ w^T Ψ(x_i, y_i(w)) + Δ(y_i, y_i(w)) − w^T Ψ(x_i, y_i)
≤ max_y { w^T Ψ(x_i, y) + Δ(y_i, y) } − w^T Ψ(x_i, y_i)
Convex, and sensitive to the regularization of w
Multiclass SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y) + Δ(y_i, y) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all y
Specialized software packages freely available
http://www.cs.cornell.edu/People/tj/svm_light/svm_multiclass.html
Quadratic program with polynomial # of constraints
Outline – Part I
• Introduction to Supervised Learning
• Probabilistic Methods
  – Logistic regression
  – Multiclass logistic regression
  – Regularized maximum likelihood
• Loss-based Methods
  – Support vector machine (multiclass)
  – Structured output support vector machine
Structured Output SVM
Input: x    Output: y ∈ {1,2,…,C}^m    Features: Φ(x)
Joint feature vector of input and output: Ψ(x,y)
f(Ψ(x,y)) = w^T Ψ(x,y)
Prediction: argmax_y w^T Ψ(x,y)
Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y) + Δ(y_i, y) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all y
Quadratic program with exponential # of constraints
Many polynomial time algorithms
Cutting Plane Algorithm
Define working sets W_i = {}
REPEAT:
  Update w by solving the restricted problem
    min_w ||w||^2 + C Σ_i ξ_i
    s.t. w^T Ψ(x_i, y) + Δ(y_i, y) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all y ∈ W_i
  Compute the most violated constraint for each sample:
    ŷ_i = argmax_y w^T Ψ(x_i, y) + Δ(y_i, y)
  Update the working sets W_i by adding ŷ_i
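A self-contained numpy sketch of this loop for the multiclass case. The inner QP is replaced here by plain subgradient descent on the equivalent hinge form, purely for self-containedness; the packages linked below solve it exactly, and all names are illustrative.

```python
import numpy as np

def most_violated(w, psi_all, y_true, delta):
    """y_hat = argmax_y w^T Psi(x,y) + Delta(y_true, y) (loss-augmented inference).
    psi_all[y] holds Psi(x, y) for each of the C labels."""
    scores = [w @ psi_all[y] + delta(y_true, y) for y in range(len(psi_all))]
    return int(np.argmax(scores))

def solve_working_set(w, data, working, C_reg, delta, steps=200):
    """Approximately minimize ||w||^2 + C sum_i xi_i subject to the working-set
    constraints, via subgradient descent on the hinge form of the problem."""
    for t in range(steps):
        grad = 2.0 * w
        for (psi_all, y_i), W_i in zip(data, working):
            if not W_i:
                continue
            y_star = max(W_i, key=lambda y: w @ psi_all[y] + delta(y_i, y))
            if w @ psi_all[y_star] + delta(y_i, y_star) - w @ psi_all[y_i] > 0:
                grad += C_reg * (psi_all[y_star] - psi_all[y_i])
        w = w - (0.1 / (1.0 + t)) * grad
    return w

def cutting_plane(data, C_reg, delta, d, eps=1e-3, max_rounds=100):
    """data: list of (psi_all, y_true) pairs; d: dimension of Psi."""
    w = np.zeros(d)
    working = [set() for _ in data]
    for _ in range(max_rounds):
        added = False
        for i, (psi_all, y_i) in enumerate(data):
            y_hat = most_violated(w, psi_all, y_i, delta)
            xi_i = max([0.0] + [w @ psi_all[y] + delta(y_i, y) - w @ psi_all[y_i]
                                for y in working[i]])      # current slack
            if w @ psi_all[y_hat] + delta(y_i, y_hat) - w @ psi_all[y_i] > xi_i + eps:
                working[i].add(y_hat)    # violated beyond eps: add the constraint
                added = True
        if not added:
            return w                     # termination criterion met
        w = solve_working_set(w, data, working, C_reg, delta)
    return w
```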
Cutting Plane Algorithm
Number of iterations = max{O(n/ε), O(C/ε^2)}
Termination criterion: violation of ŷ_i < ξ_i + ε, for all i
Ioannis Tsochantaridis et al., JMLR 2005
At each iteration, convex dual of problem increases.
Convex dual can be upper bounded.
http://svmlight.joachims.org/svm_struct.html
Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y) + Δ(y_i, y) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all y ∈ {1,2,…,C}^m
Number of constraints = n·C^m
Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y) + Δ(y_i, y) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all y ∈ Y
Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, z_i) + Δ(y_i, z_i) − w^T Ψ(x_i, y_i) ≤ ξ_i, for all z_i ∈ Y
Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C Σ_i ξ_i
s.t. Σ_i (w^T Ψ(x_i, z_i) + Δ(y_i, z_i) − w^T Ψ(x_i, y_i)) ≤ Σ_i ξ_i, for all Z = {z_i, i=1,…,n} ∈ Y^n
Equivalent problem to the structured output SVM
Number of constraints = C^{mn}
1-Slack Structured Output SVM
Training data {(x_i, y_i), i = 1,2,…,n}
min_w ||w||^2 + C ξ
s.t. Σ_i (w^T Ψ(x_i, z_i) + Δ(y_i, z_i) − w^T Ψ(x_i, y_i)) ≤ ξ, for all Z = {z_i, i=1,…,n} ∈ Y^n
Cutting Plane Algorithm
Define working set W = {}
REPEAT:
  Update w by solving the restricted problem
    min_w ||w||^2 + C ξ
    s.t. Σ_i (w^T Ψ(x_i, z_i) + Δ(y_i, z_i) − w^T Ψ(x_i, y_i)) ≤ ξ, for all Z ∈ W
  Compute the most violated constraint for each sample:
    z_i = argmax_y w^T Ψ(x_i, y) + Δ(y_i, y)
  Update the working set W by adding Z = {z_i, i=1,…,n}
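The only change to the sketch above for the 1-slack case: the most violated joint constraint over Z ∈ Y^n decomposes across samples, so each z_i is found independently and the violations are summed against the single slack ξ. This reuses `most_violated` from the previous snippet and is again illustrative.

```python
def most_violated_joint(w, data, delta):
    """Z = {z_i} maximizing sum_i (w^T Psi(x_i,z_i) + Delta(y_i,z_i) - w^T Psi(x_i,y_i)).
    The sum decouples, so each z_i is a per-sample loss-augmented argmax."""
    Z, total = [], 0.0
    for psi_all, y_i in data:
        z_i = most_violated(w, psi_all, y_i, delta)
        total += w @ psi_all[z_i] + delta(y_i, z_i) - w @ psi_all[y_i]
        Z.append(z_i)
    return Z, total
```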
Cutting Plane Algorithm
Number of iterations = O(C/ε)
Termination criterion: violation of {z_i} < ξ + ε
Thorsten Joachims et al., Machine Learning 2009
At each iteration, convex dual of problem increases.
Convex dual can be upper bounded.
http://svmlight.joachims.org/svm_struct.html
Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning
Computer Vision Data
(Figure: annotation detail vs. log dataset size)
Segmentation: ~2,000 images
Bounding box: ~1 M
Image-level labels (“Car”, “Chair”): >14 M
Noisy labels: >6 B
The more detailed the annotation, the less data is available.
Data
Learn with missing information (latent variables)
Detailed annotation is expensive
Often, in medical imaging, annotation is impossible
Desired annotation keeps changing
Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning
Annotation Mismatch
Learn to classify an image
Image: x    Annotation: y = “Deer”    Latent variable: h
Desired output: y
Mismatch between desired and available annotations
Exact value of the latent variable is not “important”
Annotation Mismatch
Learn to classify a DNA sequence
Sequence: x    Annotation: y ∈ {+1, −1}    Latent variables: h
Desired output: y
Mismatch between desired and possible annotations
Exact value of the latent variable is not “important”
Output Mismatch
Learn to detect an object in an image
Image: x    Annotation: y = “Deer”    Desired output: (y, h)
Mismatch between output and available annotations
Exact value of the latent variable is important
Output Mismatch
Learn to segment an image
Available annotation: image-level label (x, y), e.g. “Bird” or “Cow”    Desired output: segmentation (y, h)
Mismatch between output and available annotations
Exact value of the latent variable is important
Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning
Expectation Maximization
Input: x    Annotation: y    Latent variables: h
Joint feature vector: Ψ(x,y,h)
f(Ψ(x,y,h)) = θ^T Ψ(x,y,h)
Prediction: argmax_y P(y|x;θ) = argmax_y Σ_h P(y,h|x;θ)
P(y,h|x;θ) = exp(f(Ψ(x,y,h)))/Z(x;θ)
Partition function Z(x;θ) = Σ_{y,h} exp(f(Ψ(x,y,h)))
Expectation Maximization
Input: x    Annotation: y    Latent variables: h
Joint feature vector: Ψ(x,y,h)
f(Ψ(x,y,h)) = θ^T Ψ(x,y,h)
Prediction: argmax_{y,h} P(y,h|x;θ)    (maximization over h in place of marginalization)
P(y,h|x;θ) = exp(f(Ψ(x,y,h)))/Z(x;θ)
Partition function Z(x;θ) = Σ_{y,h} exp(f(Ψ(x,y,h)))
Expectation Maximization
Training data {(x_i, y_i), i = 1,2,…,n}
min_θ Σ_i −log P(y_i|x_i;θ) + λ ||θ||^2    (annotation mismatch)
For any θ′:
−log P(y|x;θ) = −E_{P(h|y,x;θ′)}[log P(y,h|x;θ)] + E_{P(h|y,x;θ′)}[log P(h|y,x;θ)]
The second term is maximized at θ = θ′ (proof left as an exercise), so dropping it gives an upper bound that touches −log P(y|x;θ) at θ = θ′; EM repeatedly minimizes this bound:
min_θ Σ_i −E_{P(h|y_i,x_i;θ′)}[log P(y_i,h|x_i;θ)] + λ ||θ||^2
Expectation Maximization
Start with an initial estimate θ_0
E-step: Compute P(h|y_i, x_i; θ_t)
M-step: Obtain θ_{t+1} by solving
  min_θ Σ_i −E_{P(h|y_i,x_i;θ_t)}[log P(y_i,h|x_i;θ)] + λ ||θ||^2
Repeat until convergence
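A toy numpy sketch of one EM iteration for this log-linear model, assuming label and latent spaces small enough to enumerate; each sample stores Ψ(x,y,h) as an array of shape (|Y|, |H|, d), the M-step uses gradient descent, and all names and shapes are illustrative.

```python
import numpy as np

def em_step(theta, data, lam, inner_steps=100, lr=0.1):
    """One EM iteration. data: list of (psi, y_obs) with psi[y, h] = Psi(x, y, h)."""
    # E-step: q_i(h) = P(h | y_i, x_i; theta_t), by enumerating h.
    posteriors = []
    for psi, y_obs in data:
        s = psi[y_obs] @ theta                      # scores over h, shape (|H|,)
        q = np.exp(s - s.max())
        posteriors.append(q / q.sum())
    # M-step: minimize sum_i -E_q[log P(y_i, h | x_i; theta)] + lam ||theta||^2.
    # The gradient of -log P(y,h|x;theta) is -Psi(x,y,h) + E_P[Psi], so the
    # expected gradient is (model mean of Psi) - (posterior mean of Psi).
    for _ in range(inner_steps):
        grad = 2.0 * lam * theta
        for (psi, y_obs), q in zip(data, posteriors):
            flat = psi.reshape(-1, psi.shape[-1])   # all (y, h) pairs
            scores = flat @ theta
            p = np.exp(scores - scores.max())
            p /= p.sum()                            # P(y, h | x; theta)
            grad += p @ flat - q @ psi[y_obs]
        theta = theta - lr * grad
    return theta
```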
Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning
Latent SVM
Input: x    Output: y ∈ Y, e.g. “Deer”    Hidden variable: h ∈ H
Y = {“Bison”, “Deer”, “Elephant”, “Giraffe”, “Llama”, “Rhino”}
Latent SVM
Features: Ψ(x,y,h) (e.g. HOG, BoW)    Parameters: w
(y(w), h(w)) = argmax_{y∈Y, h∈H} w^T Ψ(x,y,h)
Latent SVM
Training samples: x_i    Ground-truth label: y_i
Loss function: Δ(y_i, y_i(w))    (annotation mismatch)
(y(w), h(w)) = argmax_{y∈Y, h∈H} w^T Ψ(x,y,h)
Latent SVM
(y(w), h(w)) = argmax_{y∈Y, h∈H} w^T Ψ(x,y,h)
Δ(y_i, y_i(w)) = Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) − w^T Ψ(x_i, y_i(w), h_i(w))    (“very” non-convex)
≤ Δ(y_i, y_i(w)) + w^T Ψ(x_i, y_i(w), h_i(w)) − max_{h_i} w^T Ψ(x_i, y_i, h_i)    (upper bound)
≤ max_{y,h} { w^T Ψ(x_i, y, h) + Δ(y_i, y) } − max_{h_i} w^T Ψ(x_i, y_i, h_i)
Latent SVM
min_w ||w||^2 + C Σ_i ξ_i
s.t. max_{h_i} w^T Ψ(x_i, y_i, h_i) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i
So is this convex?
Latent SVM
max_{y,h} { w^T Ψ(x_i, y, h) + Δ(y_i, y) }    (convex)
− max_{h_i} w^T Ψ(x_i, y_i, h_i)    (convex)
Difference-of-convex!!
Concave-Convex Procedure
(Figure: the objective decomposed as a convex part plus a concave part)
Replace the concave part with its linear upper bound at the current point, minimize the resulting convex function, and repeat until convergence.
Latent SVM
max_{y,h} { w^T Ψ(x_i, y, h) + Δ(y_i, y) } − max_{h_i} w^T Ψ(x_i, y_i, h_i)
Linear upper bound of the concave part at w_t: −w^T Ψ(x_i, y_i, h_i*), where h_i* = argmax_{h_i} w_t^T Ψ(x_i, y_i, h_i)
Latent SVM
min_w ||w||^2 + C Σ_i ξ_i
s.t. max_{h_i} w^T Ψ(x_i, y_i, h_i) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i
Solve using CCCP
CCCP for Latent SVM
Start with an initial estimate w_0
Update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i)
Update w_{t+1} by solving the convex problem
  min_w ||w||^2 + C Σ_i ξ_i
  s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i
Repeat until convergence
http://webdocs.cs.ualberta.ca/~chunnam/
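A compact numpy sketch of this loop. Ψ(x_i, y, h) is stored per sample as an array of shape (|Y|, |H|, d), and the convex inner problem is again handled by subgradient descent rather than an exact QP solver; everything here is illustrative rather than the authors' implementation.

```python
import numpy as np

def cccp_latent_svm(data, C_reg, delta, d, outer_iters=20, inner_steps=200):
    """data: list of (psi, y_true) with psi[y, h] = Psi(x, y, h)."""
    w = np.zeros(d)
    for _ in range(outer_iters):
        # Impute latent variables: h_i* = argmax_h w^T Psi(x_i, y_i, h)
        h_star = [int(np.argmax(psi[y_i] @ w)) for psi, y_i in data]
        # Convex step: hinge form of the QP with h_i* fixed
        for t in range(inner_steps):
            grad = 2.0 * w
            for (psi, y_i), h_i in zip(data, h_star):
                Y, H, _ = psi.shape
                flat = psi.reshape(Y * H, -1)
                aug = flat @ w + np.repeat([delta(y_i, y) for y in range(Y)], H)
                j = int(np.argmax(aug))             # loss-augmented (y, h) pair
                if aug[j] - psi[y_i, h_i] @ w > 0:  # margin constraint violated
                    y_hat, h_hat = divmod(j, H)
                    grad += C_reg * (psi[y_hat, h_hat] - psi[y_i, h_i])
            w = w - (0.1 / (1.0 + t)) * grad
    return w
```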
CCCP for Human Learning
1 + 1 = 2
1/3 + 1/6 = 1/2
e^{iπ} + 1 = 0
Math is for losers !!
FAILURE … BAD LOCAL MINIMUM
Self-Paced Learning
Euler was a genius!!
SUCCESS … GOOD LOCAL MINIMUM
1 + 1 = 2
1/3 + 1/6 = 1/2
e^{iπ} + 1 = 0
Self-Paced Learning
Start with “easy” examples, then consider “hard” ones
Easy vs. hard: hand-picking easy examples is expensive, and what is easy for a human is not necessarily easy for the machine
Instead, simultaneously estimate the easiness and the parameters
Easiness is a property of data sets, not single instances
CCCP for Latent SVM
Start with an initial estimate w_0
Update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i)
Update w_{t+1} by solving the convex problem
  min_w ||w||^2 + C Σ_i ξ_i
  s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i
Self-Paced Learning
min_w ||w||^2 + C Σ_i ξ_i
s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i
Self-Paced Learning
min_{w,v} ||w||^2 + C Σ_i v_i ξ_i
s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i
v_i ∈ {0,1}
Trivial solution: set v_i = 0 for all i
Self-Paced Learning
min_{w,v} ||w||^2 + C Σ_i v_i ξ_i − Σ_i v_i / K
s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i
v_i ∈ {0,1}
(Figure: samples selected for large, medium, and small K; as K decreases, more samples are included)
Self-Paced Learning
min_{w,v} ||w||^2 + C Σ_i v_i ξ_i − Σ_i v_i / K
s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y, h) − ξ_i
v_i ∈ [0,1]    (relaxation)
Biconvex problem, solved by alternating convex search
SPL for Latent SVM
Start with an initial estimate w_0
Update h_i* = argmax_{h_i ∈ H} w_t^T Ψ(x_i, y_i, h_i)
Update w_{t+1} by solving the convex problem
  min_{w,v} ||w||^2 + C Σ_i v_i ξ_i − Σ_i v_i / K
  s.t. w^T Ψ(x_i, y_i, h_i*) − w^T Ψ(x_i, y, h) ≥ Δ(y_i, y) − ξ_i
Decrease K ← K/μ for an annealing factor μ > 1
http://cvc.centrale-ponts.fr/personnel/pawan/
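With w fixed, the relaxed problem is linear in each v_i, so the selection step has a closed form. Below is a sketch of the alternating search, where `train_fn` and `losses_fn` stand in for the CCCP inner solver and the per-sample slack computation; the initial K, the annealing factor, and all names are illustrative assumptions.

```python
import numpy as np

def select_easy(losses, C_reg, K):
    """Optimal v with w fixed: the objective is linear in each v_i, so
    v_i = 1 exactly when C * loss_i < 1/K ('easy' sample), else v_i = 0."""
    return (C_reg * np.asarray(losses) < 1.0 / K).astype(float)

def self_paced_learning(losses_fn, train_fn, C_reg, w0, K=100.0, mu=1.3, rounds=10):
    """Alternate between picking the currently easy samples and retraining
    on them, then anneal K so harder samples enter in later rounds."""
    w = w0                                        # initial estimate w_0
    for _ in range(rounds):
        v = select_easy(losses_fn(w), C_reg, K)   # update easiness weights v
        w = train_fn(v)                           # update w on selected samples
        K = K / mu                                # decrease K <- K / mu
    return w
```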
Outline – Part II
• Introduction to Weakly Supervised Learning
  – Two types of problems
• Probabilistic Methods
  – Expectation maximization
• Loss-based Methods
  – Latent support vector machine
  – Dissimilarity coefficient learning (if time permits)