SpicyMKL: Efficient multiple kernel learning method using dual augmented Lagrangian — Taiji Suzuki, Ryota Tomioka
TRANSCRIPT
SpicyMKL: Efficient multiple kernel learning method using dual augmented Lagrangian
Taiji Suzuki, Ryota Tomioka
The University of Tokyo, Graduate School of Information Science and Technology
Department of Mathematical Informatics
23rd Oct., 2009 @ ISM
Multiple Kernel Learning + Fast Algorithm → SpicyMKL
Kernel Learning
Regression, Classification
Kernel Function
• Gaussian, polynomial, chi-square, …
  – Parameters: Gaussian width, polynomial degree
• Features
  – Computer Vision: color, gradient, sift (sift, hsvsift, huesift, scaling of sift), Geometric Blur, image regions, ...
Thousands of kernels
Multiple Kernel Learning (Lanckriet et al. 2004)
Select important kernels and combine them
Multiple Kernel Learning (MKL)
Single Kernel vs. Multiple Kernel
d_m: convex combination, sparse (many 0 components)
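As a reading aid (the symbols on the slide are not shown, so the notation here is assumed), the convex-combination view of Lanckriet et al. (2004) can be sketched as

$$\bar{k}(x, x') = \sum_{m=1}^{M} d_m\, k_m(x, x'), \qquad d_m \ge 0, \quad \sum_{m=1}^{M} d_m = 1,$$

where the weights d_m are learned together with the predictor and are typically sparse (many components exactly 0).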
Sparse learning: Lasso → (grouping) → Group Lasso → (kernelize) → Multiple Kernel Learning (MKL)
– N samples
– M kernels
– H_m: RKHS associated with k_m
– ℓ: convex loss (hinge, square, logistic)
– L1 regularization → sparse
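With that notation, a standard block-L1 form of the MKL primal problem (a sketch; the exact symbols used in the talk may differ) is

$$\min_{f_1 \in \mathcal{H}_1, \dots, f_M \in \mathcal{H}_M,\; b} \;\; \sum_{i=1}^{N} \ell\Big(y_i,\; \sum_{m=1}^{M} f_m(x_i) + b\Big) \;+\; C \sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}.$$

The sum of (unsquared) RKHS norms acts as an L1 penalty across kernels and drives many blocks f_m to exactly zero.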
K_m: Gram matrix of the m-th kernel
Representer theorem
Finite dimensional problem
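A sketch of that reduction, using coefficient vectors α_m ∈ R^N and the Gram matrices K_m (notation assumed, not taken from the slide):

$$f_m(\cdot) = \sum_{i=1}^{N} \alpha_{m,i}\, k_m(\cdot, x_i), \qquad \|f_m\|_{\mathcal{H}_m} = \sqrt{\alpha_m^\top K_m \alpha_m},$$

so the MKL problem becomes the finite-dimensional problem

$$\min_{\alpha_1, \dots, \alpha_M,\; b} \;\; \sum_{i=1}^{N} \ell\Big(y_i,\; \sum_{m=1}^{M} (K_m \alpha_m)_i + b\Big) \;+\; C \sum_{m=1}^{M} \sqrt{\alpha_m^\top K_m \alpha_m}.$$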
Proposed: SpicyMKL
• Scales well against # of kernels.
• Formulated in a general framework.
SpicyMKL
Outline
• Introduction
• Details of Multiple Kernel Learning
• Our method
  – Proximal minimization
  – Skipping inactive kernels
• Numerical Experiments
  – Benchmark datasets
  – Sparse or dense
Details of Multiple Kernel Learning (MKL)
MKL
Generalized Formulation
g: sparse regularization
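A sketch of the generalized formulation, with a sparsity-inducing function g applied to each block norm, writing ‖α_m‖_{K_m} := √(α_m^⊤ K_m α_m) (the exact symbols in the talk may differ):

$$\min_{\alpha_1, \dots, \alpha_M,\; b} \;\; \sum_{i=1}^{N} \ell\Big(y_i,\; \sum_{m=1}^{M} (K_m \alpha_m)_i + b\Big) \;+\; \sum_{m=1}^{M} g\big(\|\alpha_m\|_{K_m}\big).$$

Taking g(r) = C r recovers the plain L1 (block-norm) regularization above.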
Losses: hinge loss, logistic loss
Regularizers: L1 regularization, elastic net
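As a reading aid, the standard definitions are (the elastic-net parametrization below is one common choice and may differ from the one used in the talk):

$$\text{hinge: } \ell(y, z) = \max(0,\, 1 - yz), \qquad \text{logistic: } \ell(y, z) = \log\big(1 + e^{-yz}\big),$$

$$\text{L1: } g(r) = \lambda r, \qquad \text{elastic net: } g(r) = \lambda\big((1-\theta)\, r + \tfrac{\theta}{2}\, r^2\big), \quad \theta \in [0, 1], \; r \ge 0.$$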
Rank 1 decomposition
If g is monotonically increasing and the loss f_ℓ is differentiable at the optimum, the solution of MKL is rank 1: every component f_m is a common function scaled by a weight d_m, so the learned function uses a convex combination of the kernels.
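A sketch of the statement in the notation above (the exact display on the slide may differ): there exist a common vector α ∈ R^N and weights d_m ≥ 0 with Σ_m d_m = 1 such that

$$\hat{f}_m(\cdot) = d_m \sum_{i=1}^{N} \alpha_i\, k_m(\cdot, x_i), \qquad \text{hence} \qquad \hat{f}(\cdot) = \sum_{i=1}^{N} \alpha_i \sum_{m=1}^{M} d_m\, k_m(\cdot, x_i),$$

i.e. the learned function is built from a single convex combination of the kernels.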
Proof idea: take the derivative w.r.t. each component at the optimum.
Existing approach 1
Constraint-based formulation: [Lanckriet et al. 2004, JMLR] [Bach, Lanckriet & Jordan 2004, ICML]
Dual
• Lanckriet et al. 2004: SDP
• Bach, Lanckriet & Jordan 2004: SOCP
primal
Existing approach 2
Upper bound based formulation: [Sonnenburg et al. 2006, JMLR] [Rakotomamonjy et al. 2008, JMLR] [Chapelle & Rakotomamonjy 2008, NIPS workshop]
• Sonnenburg et al. 2006: Cutting plane (SILP)
• Rakotomamonjy et al. 2008: Gradient descent (SimpleMKL)
• Chapelle & Rakotomamonjy 2008: Newton method (HessianMKL)
primal
(Jensen’s inequality)
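The bound being referred to is the standard variational identity (a consequence of Jensen's, or equivalently the Cauchy–Schwarz, inequality), in the notation above:

$$\Big(\sum_{m=1}^{M} \|f_m\|_{\mathcal{H}_m}\Big)^2 \;=\; \min_{d_m \ge 0,\; \sum_m d_m = 1} \; \sum_{m=1}^{M} \frac{\|f_m\|_{\mathcal{H}_m}^2}{d_m},$$

so these methods alternate between updating the kernel weights d and solving a single-kernel problem with the combined kernel Σ_m d_m k_m.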
Problem of existing methods
• Do not make use of sparsity during the optimization.
→ Do not perform efficiently when # of kernels is large.
Outline
• Introduction
• Details of Multiple Kernel Learning
• Our method
  – Proximal minimization
  – Skipping inactive kernels
• Numerical Experiments
  – Benchmark datasets
  – Sparse or dense
Our method: SpicyMKL
• DAL (Tomioka & Sugiyama 09)
  – Dual Augmented Lagrangian
  – Lasso, group Lasso, trace norm regularization
• SpicyMKL = DAL + MKL
  – Kernelization of DAL
Difficulty of MKL (Lasso)
The L1-type regularizer is non-differentiable at 0.
Relax by Proximal Minimization
(Rockafellar, 1976)
Our algorithm
SpicyMKL: Proximal Minimization (Rockafellar, 1976)
Set some initial solution. Choose an increasing sequence of proximity parameters.
Iterate the following update rule until convergence:
Regularization toward the last update.
Taking its dual, the update step can be solved efficiently.
It can be shown that the iterates converge super-linearly.
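A sketch of the proximal-point update (Rockafellar, 1976) in the finite-dimensional notation above, writing F(α, b) for the regularized MKL objective and γ_t for the increasing sequence of proximity parameters (the exact weighting of the proximity term in the talk may differ):

$$(\alpha^{(t+1)}, b^{(t+1)}) \;=\; \operatorname*{arg\,min}_{\alpha,\; b} \;\Big\{ F(\alpha, b) \;+\; \frac{1}{2\gamma_t} \sum_{m=1}^{M} \big\|\alpha_m - \alpha_m^{(t)}\big\|^2 \Big\}.$$

Each subproblem is strongly convex, and, as the following slides show, its dual is smooth, which is what makes the update cheap.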
Fenchel’s duality theorem
Split into two parts
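For reference, Fenchel's duality theorem in its standard form (under a constraint qualification), with f*(y) = sup_x {⟨y, x⟩ − f(x)} the convex conjugate:

$$\inf_{x} \big\{ f(x) + g(x) \big\} \;=\; \sup_{y} \big\{ -f^*(-y) - g^*(y) \big\}.$$

Splitting the proximally regularized objective into a loss part and a regularization part and dualizing each piece is what produces the smooth dual problem on the next slides.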
The non-differentiable terms in the primal become smooth in the dual via the Moreau envelope.
Inner optimization
• Newton method is applicable.
Both terms of the dual objective are twice differentiable (one of them almost everywhere).
Even if the loss is not differentiable, we can apply almost the same algorithm.
Rigorous derivation
Convolution and duality
Moreau envelope
Moreau envelope (Moreau 65)
For a convex function f, we define its Moreau envelope as below.
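As a reading aid, the standard definition (for a parameter γ > 0) is

$$\tilde{f}_{\gamma}(x) \;=\; \min_{y} \Big\{ f(y) + \frac{1}{2\gamma}\, \|x - y\|^2 \Big\},$$

and the minimizer defines the proximal operator prox_{γf}(x). The envelope is a differentiable approximation of f with the same minimizers.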
Moreau envelope of the dual of the L1 penalty: smooth!
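As an illustration of why this is smooth (a standard computation, not necessarily the exact display on the slide): the convex conjugate of the L1-type penalty λ‖w‖ is the indicator of the ball {v : ‖v‖ ≤ λ}, and the Moreau envelope of that indicator is the squared distance to the ball,

$$\frac{1}{2\gamma}\, \mathrm{dist}\big(v,\, \{\|\cdot\| \le \lambda\}\big)^2 \;=\; \frac{1}{2\gamma}\, \big[\max\big(0,\, \|v\| - \lambda\big)\big]^2,$$

which is differentiable everywhere, unlike the penalty we started from.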
We arrived at …
Dual of update rule in proximal minimization
• Differentiable
• Only active kernels need to be calculated (next slide).
What we have observed
Soft threshold
Hard threshold
If a kernel is inactive, its term is exactly 0 and its gradient also vanishes.
Skipping inactive kernels
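A sketch of the thresholding this slide refers to, under the assumption (consistent with DAL for group-L1 penalties) that the m-th kernel contributes a soft-thresholded term of the form

$$\frac{1}{2\gamma}\, \big[\max\big(0,\, \|q_m\| - \lambda\gamma\big)\big]^2$$

to the dual objective, where q_m is a per-kernel dual quantity (our notation). Whenever ‖q_m‖ ≤ λγ, the term is exactly 0 and its gradient vanishes, so inactive kernels can be skipped entirely during the inner optimization.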
The objective function and its derivatives:
• Only active kernels contribute to the computation.
→ Efficient for a large number of kernels.
• We can apply the Newton method for the dual optimization.
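A minimal Python sketch of this active-kernel screening idea (not the authors' implementation; the objective form and the quantity q_m follow the hedged notation above):

```python
import numpy as np

def dual_objective_and_grad(alpha, Ks, alpha_prev, lam, gamma):
    """Per-kernel part of the (assumed) dual objective and its gradient.

    alpha      : dual variable, shape (N,)
    Ks         : list of M Gram matrices, each (N, N)
    alpha_prev : list of M previous primal blocks, each (N,)
    lam, gamma : regularization and proximity parameters
    """
    obj = 0.0
    grad = np.zeros_like(alpha)
    n_active = 0
    for K, a_prev in zip(Ks, alpha_prev):
        q = a_prev + gamma * (K @ alpha)   # assumed per-kernel dual quantity
        r = np.linalg.norm(q)              # Euclidean norm (an assumption)
        s = r - lam * gamma                # soft-threshold margin
        if s <= 0.0:
            continue                       # inactive kernel: zero term, zero gradient
        n_active += 1
        obj += 0.5 / gamma * s ** 2
        # gradient of 0.5/gamma * (||q|| - lam*gamma)^2 w.r.t. alpha (dq/dalpha = gamma*K)
        grad += (s / r) * (K @ q)
    return obj, grad, n_active
```

The conjugate of the loss and its derivatives would be added on top of this; the point is only that kernels failing the threshold test contribute nothing, so the per-iteration cost of the Newton steps grows with the number of active kernels rather than with M.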
Convergence result
Thm. For any proper convex loss and regularizer, and any non-decreasing sequence of proximity parameters, the iterates converge to an optimal solution.
Thm. If the loss is strongly convex and the regularizer is a sparse regularizer, then for any increasing sequence of proximity parameters, the convergence rate is super-linear.
• If the loss is non-differentiable, almost the same algorithm is available by utilizing the Moreau envelope of the loss.
primal → dual → Moreau envelope of dual
Non-differentiable loss
Dual of elastic net
primal → dual: smooth!
Naïve optimization in the dual is applicable. But we applied the proximal minimization method for small values of the quadratic (ridge) part of the elastic net to avoid numerical instability.
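A sketch of why this dual is smooth, writing the elastic net as g(r) = λ₁ r + (λ₂/2) r² with λ₂ > 0 (a parametrization we assume, not taken from the slide): its convex conjugate, as a function of the norm s of the dual variable, is

$$g^*(s) \;=\; \frac{1}{2\lambda_2}\, \big[\max(0,\, s - \lambda_1)\big]^2,$$

which is differentiable, so the dual can be optimized directly. Its curvature is of order 1/λ₂, which blows up as λ₂ → 0; this matches the numerical-instability remark above and motivates using proximal minimization in that regime.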
Why ‘Spicy’ ?
• SpicyMKL = DAL + MKL.
• DAL also means a major Indian cuisine.
→ hot, spicy
→ SpicyMKL
[Photo: "Dal" (from Wikipedia)]
Outline
• Introduction
• Details of Multiple Kernel Learning
• Our method
  – Proximal minimization
  – Skipping inactive kernels
• Numerical Experiments
  – Benchmark datasets
  – Sparse or dense
Numerical Experiments
CPU time against # of kernels (IDA data set, L1 regularization; panels: Ringnorm, Splice).
• SpicyMKL scales well against # of kernels.
CPU time against # of samples (IDA data set, L1 regularization; panels: Ringnorm, Splice).
• SpicyMKL's scaling against # of samples is almost the same as the existing methods.
[Table: UCI datasets; # of kernels: 189, 243, 918, 891, 1647.]
Comparison in elastic net (sparse vs. dense)
Sparse or Dense?
Elastic Net: the mixing parameter ranges from 0 (sparse) to 1 (dense).
Compare performances of sparse and dense solutions in a computer vision task.
caltech101 dataset (Fei-Fei et al., 2004)
• object recognition task
• classes: anchor, ant, cannon, chair, cup
• 10 classification problems (10 = 5×(5−1)/2)
• # of kernels: 1760 = 4 (features) × 22 (regions) × 2 (kernel functions) × 10 (kernel parameters)
  – sift features [Lowe99], colorDescriptor [van de Sande et al. 10]: hsvsift, sift (scale = 4px, 8px, automatic scale)
  – regions: whole region, 4 divisions, 16 divisions, spatial pyramid (computed histogram of features on each region)
  – kernel functions: 2 types, including Gaussian kernels, each with 10 parameters
[Figure: performance averaged over all classification problems; 'medium density' marked best, 'dense' shown for comparison.]
Performance of the sparse solution is comparable with the dense solution.
Conclusion and Future Work
Conclusion
• We proposed a new MKL algorithm that is efficient when # of kernels is large.
  – proximal minimization
  – neglect 'inactive' kernels
• Medium density showed the best performance, but the sparse solution also works well.
Future work
• Second order update
• Technical report
  T. Suzuki & R. Tomioka: SpicyMKL. arXiv: http://arxiv.org/abs/0909.5026
• DAL
  R. Tomioka & M. Sugiyama: Dual Augmented Lagrangian Method for Efficient Sparse Reconstruction. IEEE Signal Processing Letters, 16 (12), pp. 1067--1070, 2009.
Thank you for your attention!
Moreau envelope and proximal operator: some properties
• The Moreau envelope is differentiable (shown via the Fenchel duality theorem).
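As a reading aid, standard facts consistent with the annotations ('Fenchel duality th.', 'Differentiable!') are

$$\nabla \tilde{f}_{\gamma}(x) = \frac{1}{\gamma}\big(x - \mathrm{prox}_{\gamma f}(x)\big), \qquad \mathrm{prox}_{\gamma f}(x) + \gamma\, \mathrm{prox}_{f^*/\gamma}(x/\gamma) = x,$$

$$\tilde{f}_{\gamma}(x) \;+\; \widetilde{(f^*)}_{1/\gamma}\big(x/\gamma\big) \;=\; \frac{\|x\|^2}{2\gamma},$$

i.e. the Moreau envelope is differentiable everywhere, and the envelope and prox of a function and of its conjugate are linked through duality (Moreau decomposition).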