Optimization for Machine Learning
Lecture 4: SMO-MKL

S.V.N. (vishy) Vishwanathan
Purdue University
vishy@purdue.edu

July 11, 2012
S.V.N. Vishwanathan (Purdue University) Optimization for Machine Learning 1 / 22
Motivation

Binary Classification

[Figure: points with labels yi = −1 and yi = +1, separated by the
hyperplanes {x | ⟨w, x⟩ + b = 0}, {x | ⟨w, x⟩ + b = −1}, and
{x | ⟨w, x⟩ + b = +1}.]

For a point x1 on the +1 margin and a point x2 on the −1 margin:

    ⟨w, x1⟩ + b = +1
    ⟨w, x2⟩ + b = −1
    ⟨w, x1 − x2⟩ = 2
    ⟨w/‖w‖, x1 − x2⟩ = 2/‖w‖
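The margin derivation above can be checked numerically. A minimal numpy sketch, with an invented weight vector, offset, and boundary point (none of these values come from the slides):

```python
import numpy as np

# Invented weight vector and offset for the demo.
w = np.array([3.0, 4.0])            # ||w|| = 5
b = -1.0

# Start from a point x0 on the decision boundary (<w, x0> + b = 0) and
# step to the two margin hyperplanes along w.
x0 = np.array([0.6, -0.2])
x1 = x0 + w / (w @ w)               # <w, x1> + b = +1
x2 = x0 - w / (w @ w)               # <w, x2> + b = -1

# Projecting x1 - x2 onto the unit normal w / ||w|| recovers the margin.
margin = (w / np.linalg.norm(w)) @ (x1 - x2)
# margin == 2 / ||w|| == 0.4
```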
Motivation

Linear Support Vector Machines

Optimization Problem

    min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξi

    s.t.  yi(⟨w, xi⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
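To see this primal objective in action, here is a rough subgradient-descent sketch on toy data. The data, step size, and iteration count are assumptions made for the demo, not anything from the slides; a real solver would work on the dual, as the following slides do.

```python
import numpy as np

# Toy separable data (made up for the demo).
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [-2.0, -2.0], [-1.5, -2.5], [-3.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0

# Subgradient descent on 0.5 ||w||^2 + C sum_i max(0, 1 - y_i(<w, x_i> + b)).
w, b, lr = np.zeros(2), 0.0, 0.01
for _ in range(2000):
    margins = y * (X @ w + b)
    active = margins < 1.0                     # points with nonzero hinge loss
    w -= lr * (w - C * (y[active, None] * X[active]).sum(axis=0))
    b -= lr * (-C * y[active].sum())
```

After training, sign(⟨w, x⟩ + b) recovers the labels on this separable toy set.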
Motivation

The Kernel Trick

[Figure: data plotted in the original (x, y) coordinates is not linearly
separable; after mapping each point to the feature x² + y² it becomes
linearly separable.]
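The figure's idea can be reproduced in a few lines: an inner circle and an outer ring are not linearly separable in the plane, but a single threshold on x² + y² separates them. The radii below are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 50)

# Inner circle (one class) and outer ring (the other): not linearly
# separable in the original (x, y) plane.
inner = np.column_stack([0.5 * np.cos(theta), 0.5 * np.sin(theta)])
outer = np.column_stack([2.0 * np.cos(theta), 2.0 * np.sin(theta)])

# The slide's mapping: look at x^2 + y^2 instead.
f_inner = (inner ** 2).sum(axis=1)   # all 0.25
f_outer = (outer ** 2).sum(axis=1)   # all 4.0

# A single threshold on x^2 + y^2 now separates the two classes.
separable = f_inner.max() < 1.0 < f_outer.min()
```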
Motivation

Kernel Trick

Optimization Problem

    min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξi

    s.t.  yi(⟨w, φ(xi)⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
Motivation

Kernel Trick

Optimization Problem (dual)

    max_α  −(1/2) αᵀHα + 1ᵀα

    s.t.  0 ≤ αi ≤ C
          ∑_i αi yi = 0

where Hij = yi yj ⟨φ(xi), φ(xj)⟩ = yi yj k(xi, xj).
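Building the matrix H from a kernel is mechanical. A sketch with an RBF kernel (the bandwidth and random data are arbitrary choices for illustration); since H = diag(y) K diag(y) is congruent to K, it inherits K's symmetry and positive semi-definiteness.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = np.where(rng.random(20) < 0.5, -1.0, 1.0)

def rbf_kernel(X, gamma=0.5):
    # k(x, x') = exp(-gamma ||x - x'||^2); gamma is an arbitrary choice here.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K = rbf_kernel(X)
H = np.outer(y, y) * K           # H_ij = y_i y_j k(x_i, x_j)

# H = diag(y) K diag(y) stays symmetric positive semi-definite.
min_eig = np.linalg.eigvalsh(H).min()
```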
Motivation

Key Question

Which kernel should I use?

The Multiple Kernel Learning Answer

Cook up as many (base) kernels as you can.
Compute a data-dependent kernel function as a linear combination of base
kernels:

    k(x, x′) = ∑_k dk kk(x, x′)  s.t.  dk ≥ 0
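Concretely, the base kernels might be RBF kernels at several bandwidths plus a linear kernel (a common recipe; the bandwidths and weights below are invented, since learning the weights dk is exactly MKL's job). Any nonnegative combination of PSD kernels is again a valid kernel.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 4))

def rbf(X, gamma):
    # k(x, x') = exp(-gamma ||x - x'||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Base kernels: RBF at several (arbitrary) bandwidths plus a linear kernel.
base = [rbf(X, g) for g in (0.1, 1.0, 10.0)] + [X @ X.T]

# Hypothetical nonnegative weights d_k; MKL learns these from data.
d = np.array([0.5, 0.2, 0.0, 0.3])
K = sum(dk * Kk for dk, Kk in zip(d, base))
```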
Motivation

Object Detection

Localize a specified object of interest if it exists in a given image.

Motivation

Some Examples of MKL Detection

[Figure: example detection results.]
Motivation

Summary of Our Results

Sonar dataset with 800 kernels:

    p      Training Time (s)       # Kernels Selected
           SMO-MKL   Shogun        SMO-MKL   Shogun
    1.1       4.71    47.43          91.20   258.00
    1.33      3.21    19.94         248.20   374.20
    2.0       3.39    34.67         661.20   664.80

Web dataset (≈50,000 points, 50 kernels): ≈30 minutes.

Sonar with a hundred thousand kernels:
    Precomputed kernels: ≈8 minutes
    Kernels computed on-the-fly: ≈30 minutes
Motivation

Setting up the Optimization Problem - I

The Setup

We are given n kernel functions k1, ..., kn with corresponding feature
maps φ1(·), ..., φn(·). We are interested in deriving the stacked feature
map

    φ(x) = (√d1 φ1(x); ...; √dn φn(x))   =⇒   w = (w1; ...; wn)
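This stacked map reproduces the weighted kernel combination: ⟨φ(x), φ(x′)⟩ = ∑_k dk kk(x, x′). A quick numerical check with two explicit finite-dimensional feature maps and weights invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x, xp = rng.normal(size=4), rng.normal(size=4)

# Two explicit feature maps (invented), with k_k(x, x') = <phi_k(x), phi_k(x')>.
phis = [lambda v: v,          # linear features
        lambda v: v ** 2]     # squared features
d = np.array([0.7, 1.3])      # nonnegative kernel weights (assumed values)

def phi(v):
    # Stacked map: phi(x) = (sqrt(d1) phi1(x); sqrt(d2) phi2(x)).
    return np.concatenate([np.sqrt(dk) * f(v) for dk, f in zip(d, phis)])

lhs = phi(x) @ phi(xp)                                    # <phi(x), phi(x')>
rhs = sum(dk * (f(x) @ f(xp)) for dk, f in zip(d, phis))  # sum_k d_k k_k(x, x')
```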
Motivation

Setting up the Optimization Problem

Optimization Problem

    min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξi

    s.t.  yi(⟨w, φ(xi)⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
Motivation

Setting up the Optimization Problem

Optimization Problem

    min_{w,b,ξ,d}  (1/2) ∑_k ‖wk‖² + C ∑_{i=1}^m ξi

    s.t.  yi(∑_k √dk ⟨wk, φk(xi)⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
          dk ≥ 0  for all k
Motivation

Setting up the Optimization Problem

Optimization Problem

    min_{w,b,ξ,d}  (1/2) ∑_k ‖wk‖² + C ∑_{i=1}^m ξi + (ρ/2)(∑_k dk^p)^(2/p)

    s.t.  yi(∑_k √dk ⟨wk, φk(xi)⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
          dk ≥ 0  for all k
Motivation

Setting up the Optimization Problem

Optimization Problem

    min_{w,b,ξ,d}  (1/2) ∑_k ‖wk‖²/dk + C ∑_{i=1}^m ξi + (ρ/2)(∑_k dk^p)^(2/p)

    s.t.  yi(∑_k ⟨wk, φk(xi)⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
          dk ≥ 0  for all k
Motivation

Setting up the Optimization Problem

Optimization Problem

    min_d max_α  −(1/2) ∑_k dk αᵀHkα + 1ᵀα + (ρ/2)(∑_k dk^p)^(2/p)

    s.t.  0 ≤ αi ≤ C
          ∑_i αi yi = 0
          dk ≥ 0
Motivation

Saddle Point Problem

[Figure: the objective plotted as a surface over (d, α), showing a saddle
point.]
Motivation

Solving the Saddle Point

Saddle Point Problem

    min_d max_α  −(1/2) ∑_k dk αᵀHkα + 1ᵀα + (ρ/2)(∑_k dk^p)^(2/p)

    s.t.  0 ≤ αi ≤ C
          ∑_i αi yi = 0
          dk ≥ 0
Our Approach

The Key Insight

Eliminate d:

    D(α) := max_α  −(1/(8ρ))(∑_k (αᵀHkα)^q)^(2/q) + 1ᵀα

    s.t.  0 ≤ αi ≤ C
          ∑_i αi yi = 0

where 1/p + 1/q = 1.

Not a QP but very close to one!
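Evaluating this objective is straightforward once the Hk are in hand. A sketch (the function name and the sanity-check instance are invented for illustration):

```python
import numpy as np

def mkl_dual_objective(alpha, Hs, rho, p):
    # D(alpha) = -(1/(8 rho)) * (sum_k (alpha' H_k alpha)^q)^(2/q) + 1'alpha,
    # with q the conjugate exponent: 1/p + 1/q = 1.
    q = p / (p - 1.0)
    quad = np.array([alpha @ Hk @ alpha for Hk in Hs])
    return -((quad ** q).sum()) ** (2.0 / q) / (8.0 * rho) + alpha.sum()

# Sanity check on a made-up instance: H = I (3x3), alpha = 1, rho = 1, p = 2
# gives -(3^2)/8 + 3 = 1.875.
val = mkl_dual_objective(np.ones(3), [np.eye(3)], rho=1.0, p=2.0)
```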
Our Approach

SMO-MKL: High Level Overview

    D(α) := max_α  −(1/(8ρ))(∑_k (αᵀHkα)^q)^(2/q) + 1ᵀα

    s.t.  0 ≤ αi ≤ C
          ∑_i αi yi = 0

Algorithm

Choose two variables αi and αj to optimize.
Solve the one-dimensional reduced optimization problem.
Repeat until convergence.
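The two-variable update pattern can be sketched as follows. For clarity this uses the ordinary SVM dual (a single H, so the one-dimensional problem is quadratic with an analytic, clipped solution) rather than the SMO-MKL objective, and it picks pairs at random rather than greedily; the toy data and iteration count are assumptions.

```python
import numpy as np

def smo_pair_update(alpha, i, j, H, y, C):
    # Optimize alpha_i, alpha_j jointly, keeping sum_i alpha_i y_i fixed:
    # alpha_i += y_i * t, alpha_j -= y_j * t for a scalar step t.
    g = 1.0 - H @ alpha                          # gradient of -0.5 a'Ha + 1'a
    du = g[i] * y[i] - g[j] * y[j]               # directional derivative
    curv = H[i, i] + H[j, j] - 2 * y[i] * y[j] * H[i, j]  # directional Hessian
    t = du / max(curv, 1e-12)                    # analytic 1-D maximizer
    # Clip t so both coordinates stay inside the box [0, C].
    lo_i, hi_i = (-alpha[i], C - alpha[i]) if y[i] > 0 else (alpha[i] - C, alpha[i])
    lo_j, hi_j = (alpha[j] - C, alpha[j]) if y[j] > 0 else (-alpha[j], C - alpha[j])
    t = np.clip(t, max(lo_i, lo_j), min(hi_i, hi_j))
    alpha[i] += y[i] * t
    alpha[j] -= y[j] * t
    return alpha

# Toy separable data (made up for the demo).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(2, 0.3, (5, 2)), rng.normal(-2, 0.3, (5, 2))])
y = np.array([1.0] * 5 + [-1.0] * 5)
H = np.outer(y, y) * (X @ X.T)                   # linear kernel
alpha, C = np.zeros(10), 1.0
for _ in range(500):                             # random pairs, not greedy
    i, j = rng.choice(10, size=2, replace=False)
    alpha = smo_pair_update(alpha, i, j, H, y, C)

# The iterate stays feasible and the dual objective has improved from 0.
dual = -0.5 * alpha @ H @ alpha + alpha.sum()
```

Each update maximizes the dual exactly along the chosen feasible direction, so the objective never decreases and the equality and box constraints hold throughout.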
Our Approach

SMO-MKL: High Level Overview

Selecting the Working Set

Compute the directional derivative and directional Hessian.
Greedily select the variables.

Solving the Reduced Problem

Analytic solution for p = q = 2 (a one-dimensional quartic).
For other values of p, use Newton-Raphson.
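The Newton-Raphson step for the reduced problem amounts to iterating on the first derivative using the second. A generic one-dimensional sketch; the quartic example objective and starting point are invented for illustration, not the actual SMO-MKL reduced problem.

```python
def newton_raphson_1d(dphi, d2phi, t0, tol=1e-10, max_iter=50):
    # Newton-Raphson on the derivative: find t with dphi(t) = 0.
    t = t0
    for _ in range(max_iter):
        step = dphi(t) / d2phi(t)
        t -= step
        if abs(step) < tol:
            break
    return t

# Invented quartic objective phi(t) = -t^4 + 4t; its maximizer solves
# phi'(t) = -4 t^3 + 4 = 0, i.e. t = 1.
t_star = newton_raphson_1d(lambda t: -4 * t ** 3 + 4,
                           lambda t: -12 * t ** 2, t0=2.0)
```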
Experiments

Generalization Performance

[Figure: test accuracy (%) on the Australian dataset for
p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0}, SMO-MKL vs Shogun;
accuracies lie between 80% and 90%.]
Experiments

Generalization Performance

[Figure: test accuracy (%) on the ionosphere dataset for
p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0}, SMO-MKL vs Shogun;
accuracies lie between 88% and 94%.]
Experiments

Scaling with Training Set Size

Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Figure: CPU time in seconds vs number of training examples (log-log),
SMO-MKL vs Shogun; observed scaling roughly O(n^1.9).]
Experiments

On Another Dataset

Web: 300 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Figure: CPU time in seconds vs number of training examples (log-log).]
Experiments

Scaling with Number of Kernels

Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1

[Figure: CPU time in seconds vs number of kernels (log-log); SMO-MKL
scales roughly as O(n).]
Experiments

On Another Dataset

Real-sim: 72,309 examples, 20,958 dimensions, p = 1.33, C = 1

[Figure: CPU time in seconds vs number of training examples (log-log);
SMO-MKL scales roughly as O(n).]
References

S. V. N. Vishwanathan, Zhaonan Sun, Theera-Ampornpunt, and Manik Varma.
Multiple Kernel Learning and the SMO Algorithm. NIPS 2010, pages 2361-2369.

Code available for download from
http://research.microsoft.com/en-us/um/people/manik/code/SMO-MKL/download.html