Optimization for Machine Learning
Lecture 4: SMO-MKL

S.V.N. (vishy) Vishwanathan
Purdue University
vishy@purdue.edu

July 11, 2012
S.V.N. Vishwanathan (Purdue University) Optimization for Machine Learning 1 / 22
Motivation

Binary Classification

[Figure: points with labels yi = −1 and yi = +1, separated by the
hyperplanes {x | ⟨w, x⟩ + b = 0}, {x | ⟨w, x⟩ + b = −1}, and
{x | ⟨w, x⟩ + b = +1}.]

For a point x1 on the +1 margin and a point x2 on the −1 margin:

    ⟨w, x1⟩ + b = +1
    ⟨w, x2⟩ + b = −1
    ⟨w, x1 − x2⟩ = 2
    ⟨w/‖w‖, x1 − x2⟩ = 2/‖w‖
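The margin derivation above can be checked numerically. A minimal numpy sketch, with an invented weight vector, offset, and boundary point (none of these values come from the slides):

```python
import numpy as np

# Invented weight vector and offset for the demo.
w = np.array([3.0, 4.0])            # ||w|| = 5
b = -1.0

# Start from a point x0 on the decision boundary (<w, x0> + b = 0) and
# step to the two margin hyperplanes along w.
x0 = np.array([0.6, -0.2])
x1 = x0 + w / (w @ w)               # <w, x1> + b = +1
x2 = x0 - w / (w @ w)               # <w, x2> + b = -1

# Projecting x1 - x2 onto the unit normal w / ||w|| recovers the margin.
margin = (w / np.linalg.norm(w)) @ (x1 - x2)
# margin == 2 / ||w|| == 0.4
```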
Motivation

Linear Support Vector Machines

Optimization Problem

    min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξi

    s.t.  yi(⟨w, xi⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
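To see this primal objective in action, here is a rough subgradient-descent sketch on toy data. The data, step size, and iteration count are assumptions made for the demo, not anything from the slides; a real solver would work on the dual, as the following slides do.

```python
import numpy as np

# Toy separable data (made up for the demo).
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [-2.0, -2.0], [-1.5, -2.5], [-3.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0

# Subgradient descent on 0.5 ||w||^2 + C sum_i max(0, 1 - y_i(<w, x_i> + b)).
w, b, lr = np.zeros(2), 0.0, 0.01
for _ in range(2000):
    margins = y * (X @ w + b)
    active = margins < 1.0                     # points with nonzero hinge loss
    w -= lr * (w - C * (y[active, None] * X[active]).sum(axis=0))
    b -= lr * (-C * y[active].sum())
```

After training, sign(⟨w, x⟩ + b) recovers the labels on this separable toy set.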
Motivation

The Kernel Trick

[Figure: data plotted in the original (x, y) coordinates is not linearly
separable; after mapping each point to the feature x² + y² it becomes
linearly separable.]
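The figure's idea can be reproduced in a few lines: an inner circle and an outer ring are not linearly separable in the plane, but a single threshold on x² + y² separates them. The radii below are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 50)

# Inner circle (one class) and outer ring (the other): not linearly
# separable in the original (x, y) plane.
inner = np.column_stack([0.5 * np.cos(theta), 0.5 * np.sin(theta)])
outer = np.column_stack([2.0 * np.cos(theta), 2.0 * np.sin(theta)])

# The slide's mapping: look at x^2 + y^2 instead.
f_inner = (inner ** 2).sum(axis=1)   # all 0.25
f_outer = (outer ** 2).sum(axis=1)   # all 4.0

# A single threshold on x^2 + y^2 now separates the two classes.
separable = f_inner.max() < 1.0 < f_outer.min()
```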
Motivation

Kernel Trick

Optimization Problem

    min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξi

    s.t.  yi(⟨w, φ(xi)⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
Motivation

Kernel Trick

Optimization Problem (dual)

    max_α  −(1/2) αᵀHα + 1ᵀα

    s.t.  0 ≤ αi ≤ C
          ∑_i αi yi = 0

where Hij = yi yj ⟨φ(xi), φ(xj)⟩ = yi yj k(xi, xj).
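Building the matrix H from a kernel is mechanical. A sketch with an RBF kernel (the bandwidth and random data are arbitrary choices for illustration); since H = diag(y) K diag(y) is congruent to K, it inherits K's symmetry and positive semi-definiteness.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = np.where(rng.random(20) < 0.5, -1.0, 1.0)

def rbf_kernel(X, gamma=0.5):
    # k(x, x') = exp(-gamma ||x - x'||^2); gamma is an arbitrary choice here.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K = rbf_kernel(X)
H = np.outer(y, y) * K           # H_ij = y_i y_j k(x_i, x_j)

# H = diag(y) K diag(y) stays symmetric positive semi-definite.
min_eig = np.linalg.eigvalsh(H).min()
```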
Motivation

Key Question

Which kernel should I use?

The Multiple Kernel Learning Answer

Cook up as many (base) kernels as you can.
Compute a data-dependent kernel function as a linear combination of base
kernels:

    k(x, x′) = ∑_k dk kk(x, x′)  s.t.  dk ≥ 0
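Concretely, the base kernels might be RBF kernels at several bandwidths plus a linear kernel (a common recipe; the bandwidths and weights below are invented, since learning the weights dk is exactly MKL's job). Any nonnegative combination of PSD kernels is again a valid kernel.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 4))

def rbf(X, gamma):
    # k(x, x') = exp(-gamma ||x - x'||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Base kernels: RBF at several (arbitrary) bandwidths plus a linear kernel.
base = [rbf(X, g) for g in (0.1, 1.0, 10.0)] + [X @ X.T]

# Hypothetical nonnegative weights d_k; MKL learns these from data.
d = np.array([0.5, 0.2, 0.0, 0.3])
K = sum(dk * Kk for dk, Kk in zip(d, base))
```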
Motivation

Object Detection

Localize a specified object of interest if it exists in a given image.

Motivation

Some Examples of MKL Detection

[Figure: example detection results.]
Motivation

Summary of Our Results

Sonar dataset with 800 kernels:

    p      Training Time (s)       # Kernels Selected
           SMO-MKL   Shogun        SMO-MKL   Shogun
    1.1       4.71    47.43          91.20   258.00
    1.33      3.21    19.94         248.20   374.20
    2.0       3.39    34.67         661.20   664.80

Web dataset (≈50,000 points, 50 kernels): ≈30 minutes.

Sonar with a hundred thousand kernels:
    Precomputed kernels: ≈8 minutes
    Kernels computed on-the-fly: ≈30 minutes
Motivation

Setting up the Optimization Problem - I

The Setup

We are given n kernel functions k1, ..., kn with corresponding feature
maps φ1(·), ..., φn(·). We are interested in deriving the stacked feature
map

    φ(x) = (√d1 φ1(x); ...; √dn φn(x))   =⇒   w = (w1; ...; wn)
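This stacked map reproduces the weighted kernel combination: ⟨φ(x), φ(x′)⟩ = ∑_k dk kk(x, x′). A quick numerical check with two explicit finite-dimensional feature maps and weights invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x, xp = rng.normal(size=4), rng.normal(size=4)

# Two explicit feature maps (invented), with k_k(x, x') = <phi_k(x), phi_k(x')>.
phis = [lambda v: v,          # linear features
        lambda v: v ** 2]     # squared features
d = np.array([0.7, 1.3])      # nonnegative kernel weights (assumed values)

def phi(v):
    # Stacked map: phi(x) = (sqrt(d1) phi1(x); sqrt(d2) phi2(x)).
    return np.concatenate([np.sqrt(dk) * f(v) for dk, f in zip(d, phis)])

lhs = phi(x) @ phi(xp)                                    # <phi(x), phi(x')>
rhs = sum(dk * (f(x) @ f(xp)) for dk, f in zip(d, phis))  # sum_k d_k k_k(x, x')
```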
Motivation

Setting up the Optimization Problem

Optimization Problem

    min_{w,b,ξ}  (1/2)‖w‖² + C ∑_{i=1}^m ξi

    s.t.  yi(⟨w, φ(xi)⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
Motivation

Setting up the Optimization Problem

Optimization Problem

    min_{w,b,ξ,d}  (1/2) ∑_k ‖wk‖² + C ∑_{i=1}^m ξi

    s.t.  yi(∑_k √dk ⟨wk, φk(xi)⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
          dk ≥ 0  for all k
Motivation

Setting up the Optimization Problem

Optimization Problem

    min_{w,b,ξ,d}  (1/2) ∑_k ‖wk‖² + C ∑_{i=1}^m ξi + (ρ/2)(∑_k dk^p)^(2/p)

    s.t.  yi(∑_k √dk ⟨wk, φk(xi)⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
          dk ≥ 0  for all k
Motivation

Setting up the Optimization Problem

Optimization Problem

    min_{w,b,ξ,d}  (1/2) ∑_k ‖wk‖²/dk + C ∑_{i=1}^m ξi + (ρ/2)(∑_k dk^p)^(2/p)

    s.t.  yi(∑_k ⟨wk, φk(xi)⟩ + b) ≥ 1 − ξi  for all i
          ξi ≥ 0
          dk ≥ 0  for all k
Motivation

Setting up the Optimization Problem

Optimization Problem

    min_d max_α  −(1/2) ∑_k dk αᵀHkα + 1ᵀα + (ρ/2)(∑_k dk^p)^(2/p)

    s.t.  0 ≤ αi ≤ C
          ∑_i αi yi = 0
          dk ≥ 0
Motivation

Saddle Point Problem

[Figure: the objective plotted as a surface over (d, α), showing a saddle
point.]
Motivation

Solving the Saddle Point

Saddle Point Problem

    min_d max_α  −(1/2) ∑_k dk αᵀHkα + 1ᵀα + (ρ/2)(∑_k dk^p)^(2/p)

    s.t.  0 ≤ αi ≤ C
          ∑_i αi yi = 0
          dk ≥ 0
Our Approach

The Key Insight

Eliminate d:

    D(α) := max_α  −(1/(8ρ))(∑_k (αᵀHkα)^q)^(2/q) + 1ᵀα

    s.t.  0 ≤ αi ≤ C
          ∑_i αi yi = 0

where 1/p + 1/q = 1.

Not a QP but very close to one!
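Evaluating this objective is straightforward once the Hk are in hand. A sketch (the function name and the sanity-check instance are invented for illustration):

```python
import numpy as np

def mkl_dual_objective(alpha, Hs, rho, p):
    # D(alpha) = -(1/(8 rho)) * (sum_k (alpha' H_k alpha)^q)^(2/q) + 1'alpha,
    # with q the conjugate exponent: 1/p + 1/q = 1.
    q = p / (p - 1.0)
    quad = np.array([alpha @ Hk @ alpha for Hk in Hs])
    return -((quad ** q).sum()) ** (2.0 / q) / (8.0 * rho) + alpha.sum()

# Sanity check on a made-up instance: H = I (3x3), alpha = 1, rho = 1, p = 2
# gives -(3^2)/8 + 3 = 1.875.
val = mkl_dual_objective(np.ones(3), [np.eye(3)], rho=1.0, p=2.0)
```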
Our Approach

SMO-MKL: High Level Overview

    D(α) := max_α  −(1/(8ρ))(∑_k (αᵀHkα)^q)^(2/q) + 1ᵀα

    s.t.  0 ≤ αi ≤ C
          ∑_i αi yi = 0

Algorithm

Choose two variables αi and αj to optimize.
Solve the one-dimensional reduced optimization problem.
Repeat until convergence.
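The two-variable update pattern can be sketched as follows. For clarity this uses the ordinary SVM dual (a single H, so the one-dimensional problem is quadratic with an analytic, clipped solution) rather than the SMO-MKL objective, and it picks pairs at random rather than greedily; the toy data and iteration count are assumptions.

```python
import numpy as np

def smo_pair_update(alpha, i, j, H, y, C):
    # Optimize alpha_i, alpha_j jointly, keeping sum_i alpha_i y_i fixed:
    # alpha_i += y_i * t, alpha_j -= y_j * t for a scalar step t.
    g = 1.0 - H @ alpha                          # gradient of -0.5 a'Ha + 1'a
    du = g[i] * y[i] - g[j] * y[j]               # directional derivative
    curv = H[i, i] + H[j, j] - 2 * y[i] * y[j] * H[i, j]  # directional Hessian
    t = du / max(curv, 1e-12)                    # analytic 1-D maximizer
    # Clip t so both coordinates stay inside the box [0, C].
    lo_i, hi_i = (-alpha[i], C - alpha[i]) if y[i] > 0 else (alpha[i] - C, alpha[i])
    lo_j, hi_j = (alpha[j] - C, alpha[j]) if y[j] > 0 else (-alpha[j], C - alpha[j])
    t = np.clip(t, max(lo_i, lo_j), min(hi_i, hi_j))
    alpha[i] += y[i] * t
    alpha[j] -= y[j] * t
    return alpha

# Toy separable data (made up for the demo).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(2, 0.3, (5, 2)), rng.normal(-2, 0.3, (5, 2))])
y = np.array([1.0] * 5 + [-1.0] * 5)
H = np.outer(y, y) * (X @ X.T)                   # linear kernel
alpha, C = np.zeros(10), 1.0
for _ in range(500):                             # random pairs, not greedy
    i, j = rng.choice(10, size=2, replace=False)
    alpha = smo_pair_update(alpha, i, j, H, y, C)

# The iterate stays feasible and the dual objective has improved from 0.
dual = -0.5 * alpha @ H @ alpha + alpha.sum()
```

Each update maximizes the dual exactly along the chosen feasible direction, so the objective never decreases and the equality and box constraints hold throughout.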
Our Approach

SMO-MKL: High Level Overview

Selecting the Working Set

Compute the directional derivative and directional Hessian.
Greedily select the variables.

Solving the Reduced Problem

Analytic solution for p = q = 2 (a one-dimensional quartic).
For other values of p, use Newton-Raphson.
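The Newton-Raphson step for the reduced problem amounts to iterating on the first derivative using the second. A generic one-dimensional sketch; the quartic example objective and starting point are invented for illustration, not the actual SMO-MKL reduced problem.

```python
def newton_raphson_1d(dphi, d2phi, t0, tol=1e-10, max_iter=50):
    # Newton-Raphson on the derivative: find t with dphi(t) = 0.
    t = t0
    for _ in range(max_iter):
        step = dphi(t) / d2phi(t)
        t -= step
        if abs(step) < tol:
            break
    return t

# Invented quartic objective phi(t) = -t^4 + 4t; its maximizer solves
# phi'(t) = -4 t^3 + 4 = 0, i.e. t = 1.
t_star = newton_raphson_1d(lambda t: -4 * t ** 3 + 4,
                           lambda t: -12 * t ** 2, t0=2.0)
```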
Experiments

Generalization Performance

[Figure: test accuracy (%) on the Australian dataset for
p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0}, SMO-MKL vs Shogun;
accuracies lie between 80% and 90%.]
Experiments

Generalization Performance

[Figure: test accuracy (%) on the ionosphere dataset for
p ∈ {1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0}, SMO-MKL vs Shogun;
accuracies lie between 88% and 94%.]
Experiments

Scaling with Training Set Size

Adult: 123 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Figure: CPU time in seconds vs number of training examples (log-log),
SMO-MKL vs Shogun; observed scaling roughly O(n^1.9).]
Experiments

On Another Dataset

Web: 300 dimensions, 50 RBF kernels, p = 1.33, C = 1

[Figure: CPU time in seconds vs number of training examples (log-log).]
Experiments

Scaling with Number of Kernels

Sonar: 208 examples, 59 dimensions, p = 1.33, C = 1

[Figure: CPU time in seconds vs number of kernels (log-log); SMO-MKL
scales roughly as O(n).]
Experiments

On Another Dataset

Real-sim: 72,309 examples, 20,958 dimensions, p = 1.33, C = 1

[Figure: CPU time in seconds vs number of training examples (log-log);
SMO-MKL scales roughly as O(n).]
References

S. V. N. Vishwanathan, Zhaonan Sun, Theera-Ampornpunt, and Manik Varma.
Multiple Kernel Learning and the SMO Algorithm. NIPS 2010, pages 2361-2369.

Code available for download from
http://research.microsoft.com/en-us/um/people/manik/code/SMO-MKL/download.html