Kernel Methods in a Regularization Framework
Yoonkyung Lee
Department of Statistics
The Ohio State University
October 31, 2008
Korean Statistical Society Meeting
Chung-Ang University, Seoul
Overview
◮ Part I: Introduction to Kernel Methods for Statistical Learning and Modeling
◮ Part II: Theory of Reproducing Kernel Hilbert Space Methods
◮ Part III: Regularization Approach to Feature Selection
Part I: Introduction to Kernel Methods for Statistical Learning and Modeling
◮ Statistical learning problems
◮ Methods of regularization
◮ Smoothing splines
◮ Support vector machines
◮ Statistical issues
Prediction of Annual Household Income
http://www-stat.stanford.edu/∼tibs/ElemStatLearn/
Figure: Boxplots of the annual household income with education, age, gender, marital status, occupation, and householder status out of 13 demographic attributes in the data
Cancer Diagnosis with Microarray Data
◮ Microarrays measure the relative amount of mRNAs of (tens of) thousands of genes.
◮ Golub et al. (Science, 1999): Acute leukemia data
ALL (acute lymphoblastic leukemia) with subtype B-cell and T-cell lineage, and AML (acute myeloid leukemia)
Acute Leukemia Gene Expression Profiles
Patient   gene1     gene2     · · ·   gene7129    class
1         x_{1,1}   x_{1,2}   · · ·   x_{1,7129}  ALL−B
2         x_{2,1}   x_{2,2}   · · ·   x_{2,7129}  ALL−B
...
20        x_{20,1}  x_{20,2}  · · ·   x_{20,7129} ALL−T
...
28        x_{28,1}  x_{28,2}  · · ·   x_{28,7129} AML
...
38        x_{38,1}  x_{38,2}  · · ·   x_{38,7129} AML
Figure: The heat map shows the expression levels of the 40 most important genes for the training samples when they are appropriately standardized. Each row corresponds to a sample, which is grouped into the three classes, and the columns represent genes. The 40 genes are clustered in a way that the similarity within each class and the dissimilarity between classes are easily recognized.
Handwritten Digit Recognition
Figure: 16 × 16 grayscale images
Statistical Learning
◮ Multivariate function estimation
◮ A training data set {(x_i, y_i), i = 1, . . . , n}
◮ Learn the functional relationship f between x = (x_1, . . . , x_p) and y from the training data, which can be generalized to novel cases.
e.g. f(x) = E(Y|X = x)
◮ Examples include
Regression: continuous y ∈ R, and
Classification: categorical y ∈ {1, . . . , k}.
Goodness of a Statistical Procedure for Learning
◮ Accurate prediction with respect to a given loss L(y, f(x))
e.g. L(y, f(x)) = (y − f(x))^2 for regression
◮ Flexible (nonparametric) and data-adaptive
◮ Interpretability
e.g. subset (variable/feature) selection
◮ Computational ease for large p (high dimensional input) and n (large sample)
Large Model Space
◮ Large number of variables for high dimensional data
◮ Large number of basis functions for nonparametric modeling
◮ Need to deal with a large parameter (or model) space.
◮ Classical maximum likelihood estimation (MLE) or empirical risk minimization (ERM) no longer works, as the solution may not be well-defined or there may be infinitely many solutions that overfit the data.
◮ How to explore the large model space for stable model fitting and prediction?
Regularization
◮ Tikhonov regularization (1943): solving ill-posed integral equations numerically
◮ Process of modifying ill-posed problems by introducing additional information about the solution
◮ Modification of the maximum likelihood principle or the empirical risk minimization principle (Bickel & Li 2006)
◮ Smoothness, sparsity, small norm, large margin, ...
◮ Bayesian connection
Methods of Regularization (Penalization)
Find f(x) ∈ F minimizing

(1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λJ(f).

◮ Empirical risk + penalty
◮ F: a class of candidate functions
◮ J(f): complexity of the model f
◮ λ > 0: a regularization parameter
◮ Without the penalty J(f), the problem is ill-posed (a numerical sketch of this template follows below)
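To make the template concrete, here is a minimal numerical sketch with squared-error loss and a squared l_2-norm penalty, i.e. ridge regression; the variable names and simulated data are illustrative, not from the talk.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n)||y - X beta||^2 + lam * ||beta||^2 in closed form."""
    n, p = X.shape
    # Penalized normal equations: (X'X/n + lam*I) beta = X'y/n
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X @ np.arange(10.0) + rng.standard_normal(50)
beta_hat = ridge_fit(X, y, lam=0.1)  # shrunken, stable coefficient estimates
```

Without the penalty term (lam = 0), the same linear system can be singular or badly conditioned, which is the ill-posedness the penalty removes.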
Examples of Regularization Methods
◮ Ridge regression (Hoerl and Kennard 1970)
◮ LASSO (Tibshirani 1996)
◮ Smoothing splines (Wahba 1990)
◮ Support vector machines (Vapnik 1998)
◮ Regularized neural networks, boosting, logistic regression, ...
Nonparametric Regression
y_i = f(x_i) + ε_i for i = 1, . . . , n, where ε_i ∼ N(0, σ^2)
[Figure: scatter plot of the data, y versus x]
Smoothing Splines
Wahba (1990), Spline Models for Observational Data.
Find f(x) ∈ W_2[0, 1] = {f : f, f′ absolutely continuous, and f″ ∈ L_2} minimizing

(1/n) ∑_{i=1}^{n} (y_i − f(x_i))^2 + λ ∫_0^1 (f″(x))^2 dx.

◮ J(f) = ∫_0^1 (f″(x))^2 dx = ‖P_1 f‖^2: curvature of f
◮ λ → 0: interpolation
◮ λ → ∞: linear fit
◮ 0 < λ < ∞: piecewise cubic polynomial with two continuous derivatives (these regimes are illustrated in the sketch and figures below)
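As a rough illustration of the smoothing parameter's effect, SciPy's smoothing splines can be used; note that UnivariateSpline controls smoothness through a residual bound s rather than λ itself, so this is an analogy to the criterion above, not the identical estimator.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * np.pi * x) + rng.normal(0, 0.5, 100)

interp = UnivariateSpline(x, y, s=0)       # s = 0: interpolation (overfit)
smooth = UnivariateSpline(x, y, s=len(x))  # larger s: smoother, stiffer fit
```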
Small λ: Overfit
Figure: λ → 0: interpolation
Large λ: Underfit
Figure: λ → ∞: linear fit
Moderate λ
Figure: 0 < λ < ∞: piecewise cubic polynomial with two continuous derivatives
Risk Estimate

◮ Risk: E[(1/n) ∑_{i=1}^{n} (f_λ(x_i) − f(x_i))^2]
◮ The Mallows-type criterion:

U(λ) = (1/n)‖(I − A(λ))y‖^2 + (2σ^2/n) tr[A(λ)].

◮ d.f.(λ_U) = 9. (A computational sketch of U(λ) follows the figure below.)
[Figure: average prediction error and the unbiased risk estimate plotted against degrees of freedom (df)]
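A small sketch of how U(λ) could be computed for a linear smoother ŷ = A(λ)y; the ridge-type hat matrix and a known σ^2 are illustrative assumptions, not the talk's exact setup.

```python
import numpy as np

def hat_matrix(X, lam):
    """A(lambda) for a ridge-type linear smoother (illustrative choice)."""
    n, p = X.shape
    return X @ np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T)

def U(lam, X, y, sigma2):
    """Mallows-type criterion: (1/n)||(I - A)y||^2 + (2 sigma2 / n) tr(A)."""
    n = len(y)
    A = hat_matrix(X, lam)
    resid = y - A @ y
    # tr[A(lambda)] plays the role of the effective degrees of freedom
    return resid @ resid / n + 2 * sigma2 * np.trace(A) / n
```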
Classification
◮ x = (x_1, . . . , x_p) ∈ R^p
◮ y ∈ Y = {1, . . . , k}
◮ Learn a rule φ : R^p → Y from the training data {(x_i, y_i), i = 1, . . . , n}.
◮ The 0-1 loss function:

L(y, φ(x)) = I(y ≠ φ(x))
Separating Hyperplane
[Figure: separable data with hyperplanes wx + b = −1, wx + b = 0, wx + b = 1, and margin = 2/‖w‖]
Figure: y = +1 in red and y = −1 in blue
Support Vector Machines
Boser, Guyon, & Vapnik (1992)
Vapnik (1995), The Nature of Statistical Learning Theory.

◮ y_i ∈ {−1, 1}, class labels in the binary case
◮ f(x) = w^⊤x + b (real-valued discriminant function)
◮ Separating hyperplane with the maximum margin: f minimizing ‖w‖^2 subject to y_i f(x_i) ≥ 1 for all i = 1, . . . , n
◮ Classification rule: φ(x) = sign(f(x))
Linear SVM in Non-Separable Case
◮ Relax the separability condition y_i f(x_i) ≥ 1.
◮ Hinge loss: L(y, f(x)) = (1 − yf(x))_+ where (t)_+ = max(t, 0).
◮ If yf(x) < 1, the loss is proportional to the distance of x from the soft margin yf(x) = 1.
◮ Find f ∈ F = {f(x) = w^⊤x + b | w ∈ R^p and b ∈ R} minimizing

(1/n) ∑_{i=1}^{n} (1 − y_i f(x_i))_+ + λ‖w‖^2,

where J(f) = J(w^⊤x + b) = ‖w‖^2. (A library-based sketch follows below.)
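A hedged library-based sketch of the soft-margin linear SVM; scikit-learn parameterizes the problem as (1/2)‖w‖^2 + C ∑ hinge, so its C corresponds roughly to 1/(2nλ) in the formulation above. The simulated data are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 2))
y = np.sign(X[:, 0] + X[:, 1] + 0.3 * rng.standard_normal(100))

# loss="hinge" matches (1 - y f(x))_+; large C ~ small lambda (less penalty)
clf = LinearSVC(loss="hinge", C=1.0).fit(X, y)
phi = clf.predict(X)  # classification rule sign(w'x + b)
```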
Hinge Loss
[Figure: the misclassification loss [−t]_* and the hinge loss (1 − t)_+ plotted against t = yf]
Figure: (1 − yf(x))_+ is an upper bound of the misclassification loss function I(y ≠ φ(x)) = [−yf(x)]_* ≤ (1 − yf(x))_+, where [t]_* = I(t ≥ 0) and (t)_+ = max{t, 0}.
Nonlinear SVM
◮ Linear SVM solution:

f(x) = ∑_{i=1}^{n} c_i x_i^⊤ x + b

◮ Replace the Euclidean inner product x^⊤x′ with K(x, x′) = Φ(x)^⊤Φ(x′) for a mapping Φ from R^p to a higher dimensional ‘feature space.’
◮ Nonlinear kernels: K(x, x′) = (1 + x^⊤x′)^d, exp(−‖x − x′‖^2/2σ^2), ...
e.g. For p = 2 and x = (x_1, x_2), Φ(x) = (1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2) gives K(x, x′) = (1 + x^⊤x′)^2. (Verified numerically in the sketch below.)
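A quick numerical check (for p = 2, d = 2) that the explicit feature map Φ above reproduces the polynomial kernel; the test points are arbitrary.

```python
import numpy as np

def Phi(x):
    """Explicit feature map for the degree-2 polynomial kernel, p = 2."""
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
# Inner product in feature space equals the kernel evaluated on inputs
assert np.isclose(Phi(x) @ Phi(xp), (1 + x @ xp) ** 2)
```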
Classification
y_i ∈ {1: red, −1: blue}

[Figure: scatter plot of the two classes in the (x_1, x_2) plane]
Classification Boundary with a Large λ
[Figure: fitted classification boundary in the (x_1, x_2) plane]
Classification Boundary with a Small λ
[Figure: fitted classification boundary in the (x_1, x_2) plane]
Test Error Rates
Figure: Error rate over 1,000 test cases as a function of λ.
Statistical Issues
◮ Risk or generalization error estimation
◮ Model selection/averaging - choice of tuning parameter(s) (CV, GCV, resampling, risk bounds, ...)
◮ Variable or feature selection
◮ Computation
  ◮ e.g. least squares problem for smoothing splines and quadratic programming for the SVM
  ◮ Need to solve a family of optimization problems indexed by λ.
  ◮ Use the characteristics of regularized solutions for efficient algorithms.
Summary
◮ Many statistical learning methods can be cast in a regularization framework.
◮ Examples include smoothing splines and support vector machines.
◮ Regularization entails a model selection problem. Tuning parameters need to be chosen to optimize the “bias-variance tradeoff.”
◮ A more formal treatment of kernel methods is given in Part II.
Part II: Theory of Reproducing Kernel Hilbert Space Methods
◮ Regularization in RKHS
◮ Reproducing kernel Hilbert spaces
◮ Properties of kernels
◮ Examples of RKHS methods
◮ Representer Theorem
Regularization in RKHS
Find f = ∑_{ν=1}^{M} d_ν φ_ν + h with h ∈ H_K minimizing

(1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λ‖h‖^2_{H_K}.

◮ H_K: a reproducing kernel Hilbert space of functions defined on a domain, which can be arbitrary
◮ J(f) = ‖h‖^2_{H_K}: penalty
◮ The null space is spanned by {φ_ν}_{ν=1}^{M}
Why consider RKHS?
◮ Theoretical basis for some popular regularization methods
◮ Unified framework for function estimation and modeling various data
◮ Allow general domains for functions
◮ Permit geometric understanding
◮ Can do much more than estimation of function values (e.g. integrals and derivatives)
Reproducing Kernel Hilbert Spaces
◮ Consider a Hilbert space H of real valued functions on a domain X.
◮ A Hilbert space H is a complete inner product linear space.
◮ For example, the domain X could be
  ◮ {1, . . . , k}
  ◮ [0, 1]
  ◮ {A, C, G, T}
  ◮ R^p
  ◮ S: sphere.
◮ A reproducing kernel Hilbert space is a Hilbert space of real valued functions, where the evaluation functional L_x(f) = f(x) is bounded in H for each x ∈ X.
Riesz Representation Theorem
◮ For every bounded linear functional L on a Hilbert space H, there exists a unique g_L ∈ H such that L(f) = (g_L, f), ∀f ∈ H.
◮ g_L is called the representer of L.
Reproducing Kernel
Aronszajn (1950), Theory of Reproducing Kernels.

◮ By the Riesz representation theorem, there exists K_x ∈ H, the representer of L_x(·), such that (K_x, f) = f(x), ∀f ∈ H.
◮ K(x, t) = K_x(t) is called the reproducing kernel.
  ◮ K(x, ·) ∈ H for each x
  ◮ (K(x, ·), f(·)) = f(x) for all f ∈ H
◮ Note that K(x, t) = K_x(t) = (K_t(·), K_x(·)) = (K_x(·), K_t(·)) (the reproducing property)
Example of RKHS
◮ F = {f | f : {1, . . . , k} → R} = R^k with the Euclidean inner product (f, g) = f^t g = ∑_{j=1}^{k} f_j g_j
◮ Note that |f(j)| ≤ ‖f‖. That is, the evaluation functional L_j(f) = f(j) is bounded in F for each j ∈ X.
◮ L_j(f) = f(j) = (e_j, f) where e_j is the j-th column of I_k. Hence K(i, j) = δ_ij = I[i = j], or [K(i, j)] = I_k.
The Mercer-Hilbert-Schmidt Theorem
◮ If ∫_X ∫_X K^2(s, t) ds dt < ∞ for a continuous symmetric non-negative definite K, then there exists an orthonormal sequence of continuous eigenfunctions Φ_1, Φ_2, . . . in L_2[X] and eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ 0 with ∑_{i=1}^{∞} λ_i^2 < ∞, with

∫_X K(s, t) Φ_i(s) ds = λ_i Φ_i(t) and K(s, t) = ∑_{i=1}^{∞} λ_i Φ_i(s) Φ_i(t).

◮ The inner product in H of functions f with ∑_i (f_i^2/λ_i) < ∞ is

(f, g) = ∑_{i=1}^{∞} f_i g_i / λ_i, where f_i = ∫_X f(t) Φ_i(t) dt.

◮ Feature mapping: Φ(x) = (√λ_1 Φ_1(x), √λ_2 Φ_2(x), . . . )
Reproducing Kernel is Non-Negative Definite
◮ A bivariate function K(·, ·) : X × X → R is non-negative definite if for every n, every x_1, . . . , x_n ∈ X, and every a_1, . . . , a_n ∈ R,

∑_{i,j=1}^{n} a_i a_j K(x_i, x_j) ≥ 0.

In other words, letting a = (a_1, . . . , a_n)^t,

a^t [K(x_i, x_j)] a ≥ 0.

◮ For a reproducing kernel K,

∑_{i,j=1}^{n} a_i a_j K(x_i, x_j) = ‖∑_{i=1}^{n} a_i K(x_i, ·)‖^2 ≥ 0.

(A numerical check of this property appears below.)
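This property is easy to verify numerically: the Gram matrix [K(x_i, x_j)] of a valid kernel should have no negative eigenvalues (up to round-off). The Gaussian kernel and random inputs below are illustrative choices.

```python
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    """K(s, t) = exp(-||s - t||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((s - t) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(3)
xs = rng.standard_normal((20, 3))
G = np.array([[gaussian_kernel(s, t) for t in xs] for s in xs])
# A symmetric matrix is n.n.d. iff its smallest eigenvalue is >= 0
assert np.linalg.eigvalsh(G).min() > -1e-10
```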
The Moore-Aronszajn Theorem
◮ For every RKHS H of functions on X, there corresponds a unique reproducing kernel (RK) K(s, t), which is n.n.d.
◮ Conversely, for every n.n.d. function K(s, t) on X, there corresponds a unique RKHS H that has K(s, t) as its RK.
How to construct an RKHS given an n.n.d. K(s, t)

◮ For each fixed x ∈ X, define K_x(·) = K(x, ·).
◮ Taking all finite linear combinations of the form ∑_i a_i K_{x_i} for all choices of n, a_1, . . . , a_n, and x_1, . . . , x_n, construct a linear space M.
◮ Define an inner product via (K_{x_i}, K_{x_j}) = K(x_i, x_j) and extend it by linearity:

(∑_i a_i K_{x_i}, ∑_j b_j K_{t_j}) = ∑_{i,j} a_i b_j (K_{x_i}, K_{t_j}) = ∑_{i,j} a_i b_j K(x_i, t_j).

◮ For any f of the form ∑_i a_i K_{x_i}, (K_x, f) = f(x).
◮ Complete M by adjoining all the limits of Cauchy sequences of functions in M.
Sum of Reproducing Kernels
◮ The sum of two n.n.d. matrices is n.n.d.
◮ Hence, the sum of two n.n.d. functions defined on the same domain X is n.n.d.
◮ In particular, if H = H_1 ⊕ H_2 and K_j(s, t) is the RK of H_j for j = 1, 2, then K(s, t) = K_1(s, t) + K_2(s, t) is the RK for the tensor sum space H_1 ⊕ H_2.
Product of Reproducing Kernels
◮ Suppose that K_1(x_1, x_2) is n.n.d. on X_1 and K_2(t_1, t_2) is n.n.d. on X_2.
◮ Consider the tensor product of K_1 and K_2,

K((x_1, t_1), (x_2, t_2)) = K_1(x_1, x_2) K_2(t_1, t_2),

on X = X_1 × X_2. K(·, ·) is n.n.d. on X.
◮ The tensor product of two RKs K_1 and K_2 for H_1 and H_2 is an RK for the tensor product space H_1 ⊗ H_2 on X.
Constructing Kernels
◮ Use reproducing kernels on a univariate domain as building blocks.
◮ Tensor sums and products of reproducing kernels. (See the sketch below.)
◮ Systematic approach to estimating multivariate functions.
◮ Other tricks to expand and design kernels: Haussler (1999), Convolution Kernels on Discrete Structures.
◮ Learning kernels (n.n.d. matrices) from data: Lanckriet et al. (2004), Learning the Kernel Matrix with Semidefinite Programming, JMLR.
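A small sketch of the building-block idea: starting from univariate kernels, sums give RKs of tensor-sum spaces and products give RKs of tensor-product spaces. The particular univariate kernels here are illustrative choices.

```python
import numpy as np

def k_lin(s, t):
    """Univariate linear kernel."""
    return s * t

def k_rbf(s, t, sigma=1.0):
    """Univariate Gaussian kernel."""
    return np.exp(-(s - t) ** 2 / (2 * sigma ** 2))

def k_sum(x, xp):
    """Sum of kernels in separate coordinates: still n.n.d. on the product domain."""
    return k_lin(x[0], xp[0]) + k_rbf(x[1], xp[1])

def k_prod(x, xp):
    """Tensor product of kernels: RK of the tensor product space."""
    return k_lin(x[0], xp[0]) * k_rbf(x[1], xp[1])
```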
Cubic Smoothing Splines
Find f(x) ∈ W_2[0, 1] = {f : f, f′ absolutely continuous, and f″ ∈ L_2} minimizing

(1/n) ∑_{i=1}^{n} (y_i − f(x_i))^2 + λ ∫_0^1 (f″(x))^2 dx.

◮ The null space: M = 2, φ_1(x) = 1, and φ_2(x) = x.
◮ The penalized space: H_K = W_2^0[0, 1] = {f ∈ W_2[0, 1] : f(0) = 0, f′(0) = 0} is an RKHS with
i) (f, g) = ∫_0^1 f″(x) g″(x) dx
ii) ‖f‖^2 = ∫_0^1 (f″(x))^2 dx
iii) K(x, x′) = ∫_0^1 (x − u)_+ (x′ − u)_+ du.
SVM in General
Find f(x) = b + h(x) with h ∈ H_K minimizing

(1/n) ∑_{i=1}^{n} (1 − y_i f(x_i))_+ + λ‖h‖^2_{H_K}.

◮ The null space: M = 1 and φ_1(x) = 1
◮ Linear SVM: H_K = {h(x) = w^⊤x | w ∈ R^p} with
i) K(x, x′) = x^⊤x′
ii) ‖h‖^2_{H_K} = ‖w^⊤x‖^2_{H_K} = ‖w‖^2
◮ Nonlinear SVM: K(x, x′) = (1 + x^⊤x′)^d (polynomial kernel), exp(−‖x − x′‖^2/2σ^2) (Gaussian kernel), ...
Representer Theorem
Kimeldorf and Wahba (1971)

◮ The minimizer f = ∑_{ν=1}^{M} d_ν φ_ν + h with h ∈ H_K of

(1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λ‖h‖^2_{H_K}

has a representation of the form

f(x) = ∑_{ν=1}^{M} d_ν φ_ν(x) + ∑_{i=1}^{n} c_i K(x_i, x).

◮ ‖h‖^2_{H_K} = ∑_{i,j} c_i c_j K(x_i, x_j). (A worked instance with squared-error loss follows below.)
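A worked instance of the theorem with squared-error loss (kernel ridge regression): the possibly infinite-dimensional RKHS problem reduces to an n-dimensional linear system for the coefficients c. The null space term is omitted here to keep the sketch short.

```python
import numpy as np

def krr_fit(X, y, kernel, lam):
    """Solve (K + n*lam*I) c = y, the stationarity condition of
    (1/n)||y - Kc||^2 + lam * c'Kc for f(x) = sum_i c_i K(x_i, x)."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    n = len(y)
    c = np.linalg.solve(K + n * lam * np.eye(n), y)
    return c

def krr_predict(Xnew, X, c, kernel):
    """Evaluate f(x) = sum_i c_i K(x_i, x) at new points."""
    return np.array([sum(ci * kernel(xi, x) for ci, xi in zip(c, X))
                     for x in Xnew])
```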
Sketch of the Proof
◮ Write h ∈ H_K as

h(x) = ∑_{i=1}^{n} c_i K(x_i, x) + ρ(x),

where ρ(x) is some element in H_K perpendicular to K(x_i, x) for i = 1, . . . , n.
◮ Note that h(x_i) = (K(x_i, ·), h(·))_{H_K} does not depend on ρ, and ‖h‖^2_{H_K} = ∑_{i=1}^{n} ∑_{j=1}^{n} c_i c_j K(x_i, x_j) + ‖ρ‖^2_{H_K}.
◮ Then ‖ρ‖^2_{H_K} needs to be zero.
◮ Hence, the minimizer f is of the form

f(x) = ∑_{ν=1}^{M} d_ν φ_ν(x) + ∑_{i=1}^{n} c_i K(x_i, x).
Remarks on the Representer Theorem
◮ It holds for an arbitrary loss function L.
◮ The minimizer of an RKHS method resides in a finite dimensional space.
◮ So the solution is computable even if the RKHS has infinite dimension.
◮ The resulting optimization problems with the representation of f depend on L and the penalty J(f).
Loss Function and its Risk Minimizer f
L(y, f(x))                                  argmin E[L(Y, f(X))|X = x]

Regression
(y − f(x))^2                                E(Y|X = x)
|y − f(x)|                                  median(Y|X = x)

Classification with y = ±1
SVM: (1 − yf(x))_+                          sign{p(x) − 1/2}
Logistic regression: log{1 + exp(−yf(x))}   log p(x)/(1 − p(x))
Boosting: exp(−yf(x))                       (1/2) log p(x)/(1 − p(x))

where p(x) = P(Y = 1|X = x)
Hinge vs. Negative Log-Likelihood
Figure: Solid: 2p(x) − 1; dotted: 2p_LG(x) − 1; dashed: f_SVM(x)
Smoothing Splines for Modeling Non-Gaussian Response

◮ Generalized linear models for data with y from exponential families can be extended nonparametrically.
◮ The main idea of the extension of smoothing splines to the non-Gaussian case is to replace the squared error loss by the negative log likelihood function associated with y.
◮ Thus, the penalized least squares method becomes the method of penalized negative log likelihood.
◮ Non-Gaussian data can be treated in the framework of RKHS methods in a unified way, yet they require some computational modification.
Logistic Regression
◮ Suppose that we have independent data points (x_i, y_i), i = 1, . . . , n, and Y_i|x_i ∼ Bernoulli(p(x_i)), where p(x) = P(Y = 1|X = x). Note that y_i ∈ {0, 1}.
◮ The likelihood of the data conditional on x is

∏_{i=1}^{n} p(x_i)^{y_i} (1 − p(x_i))^{1−y_i}.

◮ Model the logit function of p(x), log{p(x)/(1 − p(x))}, by f ∈ H.
◮ In terms of the logit f(x), the likelihood is given by

∏_{i=1}^{n} exp(y_i f(x_i) − log(1 + e^{f(x_i)})) = exp(∑_{i=1}^{n} y_i f(x_i) − log(1 + e^{f(x_i)})).
Formulation of Penalized Logistic Regression
◮ Use the negative log likelihood as a loss function.
◮ Find f_λ ∈ H = H_0 ⊕ H_1 minimizing

(1/n) ∑_{i=1}^{n} {−y_i f(x_i) + log(1 + e^{f(x_i)})} + λ‖P_1 f‖^2.

◮ By the representer theorem,

f_λ(x) = ∑_{ν=1}^{M} d_ν φ_ν(x) + ∑_{i=1}^{n} c_i K(x_i, x).

◮ The optimization problem of finding d = (d_1, . . . , d_M)^t and c = (c_1, . . . , c_n)^t entails iterations.
◮ The solution can be approximated by iteratively reweighted least squares. (A sketch follows below.)
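A hedged sketch of that iteration for the kernelized problem, with f = Kc, penalty λ c^⊤Kc, and the intercept/null space omitted for brevity; each pass performs one Newton (IRLS) step on the penalized negative log likelihood.

```python
import numpy as np

def kernel_logistic(K, y, lam, iters=25):
    """y in {0,1}; K is the n x n Gram matrix; returns coefficients c."""
    n = len(y)
    c = np.zeros(n)
    for _ in range(iters):
        f = K @ c
        p = 1 / (1 + np.exp(-f))   # current probabilities
        W = p * (1 - p)            # IRLS weights
        # Newton step (with K factored out of gradient and Hessian):
        # (diag(W) K / n + 2 lam I) delta = -((p - y)/n + 2 lam c)
        A = (W[:, None] * K) / n + 2 * lam * np.eye(n)
        c = c + np.linalg.solve(A, -((p - y) / n + 2 * lam * c))
    return c
```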
Machine Learning View on Kernels
◮ K(x, x′) = Φ(x)^⊤Φ(x′) through a feature mapping Φ from R^p to a feature space.
◮ Kernel trick: kernelize, that is, replace the Euclidean inner product x^⊤x′ in your linear method with K(x, x′)!
◮ This idea goes beyond supervised learning problems.
  ◮ nonlinear dimension reduction and data visualization: kernel PCA (see the sketch below)
  ◮ clustering: kernel k-means algorithm
  ◮ ...
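For instance, kernel PCA is available off the shelf; a minimal scikit-learn sketch with a Gaussian (RBF) kernel, where the data and parameters are illustrative:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 5))
# Nonlinear 2-D embedding for visualization via the kernel trick
Z = KernelPCA(n_components=2, kernel="rbf", gamma=0.5).fit_transform(X)
```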
Summary
◮ RKHS methods provide a unified framework for statistical model building.
◮ Kernels are now used as a versatile tool for flexible modeling and learning in various contexts.
◮ Feature selection for kernel methods will be discussed in Part III, based on the idea of kernel construction by sums and products.
Part III: Regularization Approach to Feature Selection
◮ Feature selection procedures
◮ LASSO, modeling with an l_1 constraint
◮ Generalization of the LASSO for kernel methods
◮ Application to finding biomarkers
Motivation for Feature Selection
◮ Key questions in many scientific investigations.
◮ Achieve parsimony (Occam’s razor): “Entities should not be multiplied beyond necessity.”
◮ Enhance interpretation.
◮ Often reduce variance, hence improve prediction accuracy.
Feature Selection Procedures
◮ Combinatorial approach: best subset selection, forward selection, backward elimination, stepwise regression
e.g. Guyon et al. (2002), Recursive feature selection
◮ l_1 penalty for simultaneous fitting and selection:
e.g. Bradley and Mangasarian (1998), Linear SVM with l_1 penalty
Tibshirani (1996), LASSO (Least Absolute Shrinkage and Selection Operator)
Chen and Donoho (1994), Basis Pursuit
◮ Other variants for groupings of variables, model hierarchy, adaptiveness, and efficiency.
LASSO
min_β ∑_{i=1}^{n} (y_i − ∑_{j=1}^{p} β_j x_{ij})^2 + λ‖β‖_1
⇔
min_β ∑_{i=1}^{n} (y_i − ∑_{j=1}^{p} β_j x_{ij})^2 s.t. ‖β‖_1 ≤ s.
Ridge Regression
min_β ∑_{i=1}^{n} (y_i − ∑_{j=1}^{p} β_j x_{ij})^2 + λ‖β‖_2^2.

(A sketch contrasting the two penalties follows below.)
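A side-by-side sketch of the two penalties with scikit-learn; note that the library scales the Lasso criterion as (1/(2n))‖y − Xβ‖^2 + α‖β‖_1, so its α is a rescaling of the λ above. The sparse truth and data are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 20))
beta = np.zeros(20)
beta[:3] = [3.0, -2.0, 1.5]              # sparse true coefficients
y = X @ beta + rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)       # l1: many coefficients exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)       # l2: all coefficients shrunk, none zero
print((lasso.coef_ == 0).sum(), (ridge.coef_ == 0).sum())
```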
LASSO Coefficient Paths
[Figure: LASSO coefficient paths: standardized coefficients plotted against |beta|/max|beta|]
Median Regression with l1 Penalty for Income Data
Figure: The coefficient paths of the main effects model. Positive: home ownership (in dark blue, relative to renting), education (in brown), dual income due to marriage (in purple, relative to ‘not married’), age (in sky blue), and male (in light green). Negative: single or divorced (in red, relative to ‘married’) and student, clerical worker, retired or unemployed (in green, relative to professionals/managers)
Median Regression with l1 Penalty for Income Data
Figure: The coefficient paths of a partial two-way interaction model. Positive: ‘dual income ∗ home ownership’, ‘home ownership ∗ education’, and ‘married but no dual income ∗ education’. Negative: ‘single ∗ education’ and ‘home ownership ∗ age’
Risk Path
Figure: The risks of the two-way fitted models are estimated by using a test data set with 4,876 observations.
Generalization of LASSO
◮ Kernel methods may be difficult to interpret when the embedding into feature space is implicit.
◮ Regression: Lin and Zhang (2003), COmponent Selection and Smoothing Operator; Gunn and Kandola (2002), Structural modelling with sparse kernels
◮ Classification: Zhang (2006) for the binary SVM; Lee et al. (2006) for the multiclass SVM
Strategy for Feature Selection
◮ Structured representation of f using the functional ANOVA decomposition
◮ A sparse solution approach with an l_1 penalty
◮ A unified treatment for regression and classification (both linear and nonlinear cases)
◮ Inexpensive additional computation
◮ Systematic elaboration of f with features
Functional ANOVA Decomposition
◮ For f defined on a product domain X = ∏_{j=1}^{p} X_j,

f = ∏_j [A_j + (I − A_j)] f
  = (∏_j A_j) f + ∑_i (∏_{j≠i} A_j)(I − A_i) f + ∑_{i<j} (∏_{r≠i,j} A_r)(I − A_i)(I − A_j) f + · · ·

◮ Functional “overall mean” + “main effects” + “two-way interactions” + · · ·
◮ Define A_j appropriately so that the decomposition into A_j and I − A_j corresponds to {1} ⊕ H_j.
ANOVA Spaces and Kernels
Wahba (1990), smoothing spline ANOVA models

◮ Function: f(x) = b + ∑_{α=1}^{p} f_α(x_α) + ∑_{α<β} f_{αβ}(x_α, x_β) + · · ·
◮ Functional space: f ∈ H = ⊗_{α=1}^{p} ({1} ⊕ H_α),
H = {1} ⊕ ∑_{α=1}^{p} H_α ⊕ ∑_{α<β} (H_α ⊗ H_β) ⊕ · · ·
◮ Reproducing kernel (r.k.): K(x, x′) = 1 + ∑_{α=1}^{p} K_α(x, x′) + ∑_{α<β} K_{αβ}(x, x′) + · · ·
Two-way ANOVA Decomposition of f ∈ W_2[0, 1] ⊗ W_2[0, 1]

◮ Two-way ANOVA decomposition of f:

f(x_1, x_2) = f_0 + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2)

with the side conditions that ∫_0^1 f_j(x_j) dx_j = 0 and ∫_0^1 f_{12}(x_1, x_2) dx_j = 0 for j = 1, 2.
◮ The corresponding functional components:

             | {1}                 | {k_1(x_2)}           | H_2^1
{1}          | mean                | p-main effect (x_2)  | n-main effect (x_2)
{k_1(x_1)}   | p-main effect (x_1) | p×p interaction      | p×n interaction
H_1^1        | n-main effect (x_1) | n×p interaction      | n×n interaction

where “p” and “n” mean parametric and nonparametric, respectively.
[Figure: perspective plots of the functional components over (x_1, x_2): the parametric main effects of x_1 and x_2, the nonparametric main effects of x_1 and x_2, and the p-p, n-p, p-n, and n-n interactions]
l1 Penalty on θ
◮ Modification of the r.k. by rescaling parameters θ ≥ 0:

K_θ(x, x′) = 1 + ∑_{α=1}^{p} θ_α K_α(x, x′) + ∑_{α<β} θ_{αβ} K_{αβ}(x, x′) + · · ·

◮ Truncating H to F = {1} ⊕_{ν=1}^{d} F_ν, find f(x) ∈ F minimizing

(1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λ ∑_ν θ_ν^{−1} ‖P_ν f‖^2.

Then f(x) = b + ∑_{i=1}^{n} c_i [∑_{ν=1}^{d} θ_ν K_ν(x_i, x)].
◮ For sparsity, minimize

(1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λ ∑_{ν=1}^{d} θ_ν^{−1} ‖P_ν f‖^2 + λ_θ ∑_{ν=1}^{d} θ_ν

subject to θ_ν ≥ 0, ∀ν.
Related to Kernel Learning
◮ Micchelli and Pontil (2005), Learning the kernel function via regularization
◮ K = {K_ν, ν ∈ N}: a compact and convex set of kernels
◮ A variational problem for optimal kernel configuration:

min_{K ∈ K} ( min_{f ∈ H_K} (1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λJ(f) )
One-Step Update for Structured Regression
◮ Given b and {c_j}, recalibrate θ to minimize

(1/n) ∑_{i=1}^{n} (y_i − b − ∑_{ν=1}^{d} θ_ν [∑_{j=1}^{n} c_j K_ν(x_j, x_i)])^2 + λ ∑_ν θ_ν ∑_{i,j=1}^{n} c_i c_j K_ν(x_i, x_j)

subject to θ_ν ≥ 0, ∀ν, and ∑_ν θ_ν ≤ s. (A sketch of this θ-step follows below.)
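A sketch of this θ-step as a generic constrained optimization; G and g collect the kernel quantities defined above, and the SciPy solver stands in for the parametric LP/QP used in practice.

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

def theta_step(y, b, G, g, lam, s):
    """G[i, nu] = sum_j c_j K_nu(x_j, x_i); g[nu] = sum_{i,j} c_i c_j K_nu(x_i, x_j)."""
    n, d = G.shape

    def obj(theta):
        r = y - b - G @ theta
        return r @ r / n + lam * g @ theta   # penalized least squares in theta

    res = minimize(obj, x0=np.full(d, 1.0 / d), method="trust-constr",
                   bounds=[(0, None)] * d,                                # theta_nu >= 0
                   constraints=[LinearConstraint(np.ones((1, d)), -np.inf, s)])  # sum <= s
    return res.x
```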
Nonnegative Garrote
Breiman, L. (1995), Better Subset Regression Using the Nonnegative Garrote

◮ Stepwise variable selection can be unstable with respect to small perturbations in the data.
◮ Starting with the full LSE, it both shrinks and zeroes coefficients.
◮ Given β^LS, take (c_1, . . . , c_p) to minimize

∑_{i=1}^{n} (y_i − ∑_{j=1}^{p} c_j β^LS_j x_{ij})^2

subject to c_j ≥ 0 and ∑_{j=1}^{p} c_j ≤ s.
◮ Generally lower prediction error than best subset selection
Structured SVM with ANOVA Decomposition

◮ The binary case (Zhang, 2006): Find f(x) = b + h(x) minimizing

(1/n) ∑_{i=1}^{n} (1 − y_i f(x_i))_+ + λ ∑_{ν=1}^{d} θ_ν^{−1} ‖P_ν f‖^2 + λ_θ ∑_{ν=1}^{d} θ_ν

subject to θ_ν ≥ 0, ∀ν.
◮ The multiclass case (Lee et al., 2006): Find f = (f^1, . . . , f^k) = (b_1 + h_1(x), . . . , b_k + h_k(x)) with the sum-to-zero constraint minimizing

(1/n) ∑_{i=1}^{n} ∑_{j≠y_i} (f^j(x_i) − y^j_i)_+ + (λ/2) ∑_{j=1}^{k} (∑_{ν=1}^{d} θ_ν^{−1} ‖P_ν h_j‖^2) + λ_θ ∑_{ν=1}^{d} θ_ν subject to θ_ν ≥ 0, ∀ν,

where (y^1, . . . , y^k) is a class code with y^j = 1 if y = j and −1/(k − 1) elsewhere, and φ(x) = argmax_j [f^j(x)].
Updating Algorithm
Letting C = (b, {c_j}) and denoting the objective function by Φ(θ, C),

◮ Initialize θ^(0) = (1, . . . , 1)^t and C^(0) = argmin_C Φ(θ^(0), C).
◮ At the m-th iteration (m = 1, 2, . . .):
(θ-step) find θ^(m) minimizing Φ(θ, C^(m−1)) with C fixed.
(c-step) find C^(m) minimizing Φ(θ^(m), C) with θ fixed.
◮ A one-step update can be used in practice. (A schematic skeleton follows below.)
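A schematic skeleton of the alternation; fit_C and fit_theta are placeholders for the actual c-step and θ-step solvers (e.g. the SVM quadratic program and the constrained least squares sketched earlier), so this fixes only the control flow.

```python
def alternate(y, Ks, lam, lam_theta, fit_C, fit_theta, max_iter=10):
    """Ks: list of component Gram matrices K_nu; fit_C / fit_theta are
    placeholder solvers minimizing Phi(theta, C) with the other block fixed."""
    d = len(Ks)
    theta = [1.0] * d                           # theta^(0) = (1, ..., 1)'
    C = fit_C(y, Ks, theta, lam)                # C^(0) = argmin_C Phi(theta^(0), C)
    for _ in range(max_iter):                   # or a single pass (one-step update)
        theta = fit_theta(y, Ks, C, lam_theta)  # theta-step: C fixed
        C = fit_C(y, Ks, theta, lam)            # c-step: theta fixed
    return theta, C
```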
Two-Way Regularization
◮ c-step solutions range from the simplest model (or majority rule) to a complete overfit of the data as λ decreases. (Standard regularization procedure)
◮ θ-step solutions range from the constant model to the full model with all the variables as λ_θ decreases. (Functional component pursuit)
Breast Cancer Data
◮ Sharma et al. (2005), Early detection of breast cancer based on gene-expression patterns in peripheral blood cells, Breast Cancer Research.
◮ Develop accurate and convenient methods for the detection of breast cancer using blood samples.
◮ 60 unique blood samples from 24 women with breast cancer and 32 women with no signs of the disease
◮ Mean normalized and cube-root transformed expression levels of 1,368 cDNAs
◮ The nearest shrunken centroid method (Tibshirani et al. 2002) was used in the original paper.
Searching for Gene Signatures
◮ The Nearest Shrunken Centroid (NSC) method
◮ The ℓ_1-norm SVM with
  ◮ 1,368 linear terms
  ◮ 1,368 linear and quadratic terms
◮ The structured kernel SVM with 1,368 nonparametric main effects terms
◮ ‘External’ 6-fold cross-validation (Ambroise and McLachlan, PNAS 2002)
  ◮ Split the 60 samples into a training set of 50 and a test set of 10.
  ◮ Internal 5-fold CV for selection of an optimal tuning parameter (a threshold for NSC, and penalty parameters for the SVMs).
Computation
◮ Both the ℓ_1-norm SVM and the θ-step of the structured kernel SVM require parametric linear programming.
◮ Solutions are piecewise constant in the tuning parameter (λ or λ_θ).
◮ The simplex algorithm can be used to generate the entire regularized solution path (Yao and Lee, 2007).
Figure: Recalibration parameter path for structured SVM.
[Figure: risk with respect to the 0-1 loss and the hinge loss plotted against s]
Figure: Error rate path for the θ-step of structured SVM.
Error Rates
◮ External 6-fold CV
◮ Comparison:

Method   NSC     L.SVM   LQ.SVM   StructSVM
Mean     0.186   0.197   0.279    0.170
SE       0.050   0.051   0.058    0.048
Summary
◮ Integrate feature selection with kernel methods using an l_1-type penalty.
◮ Enhance interpretation without compromising prediction accuracy.
◮ General approach for structured and sparse representation with kernels.
◮ RKHS methods can solve a wide range of statistical learning problems in a principled way.
Software
◮ Smoothing spline ANOVA models: ssanova for Gaussian response and gssanova for non-Gaussian response in the gss R library
◮ Support vector machines:
kernlab for SVM, spectral clustering, and kernel PCA
http://www.kernel-machines.org for other implementations in Matlab and Fortran
svmpath for the binary SVM solution path
msvmpath for the multicategory SVM solution path (in progress)
◮ Other path-finding algorithms:
lars for LASSO, LAR, and stagewise fitting
glmpath for l_1 regularized generalized linear models
lpRegPath for parametric linear programming with linear loss and l_1 penalty (in progress)
Acknowledgments
◮ Grace Wahba and Yi Lin (Statistics, University of Wisconsin)
◮ Cheol-Koo Lee (Genomics, Korea University)
◮ Zhenhuan Cui and Yonggang Yao (former students, now at SAS)
◮ For preprints or reprints of my work, visit http://www.stat.osu.edu/∼yklee
◮ E-mail: [email protected]
References
Peter J. Bickel and Bo Li. Regularization in statistics. Test, 15(2):271-344, Dec 2006.

Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199-215, Aug 2001.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Verlag, New York, 2001.

B. Scholkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.

V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.