Kernel Methods in a Regularization Framework
Yoonkyung Lee
Department of Statistics
The Ohio State University
October 31, 2008
Korean Statistical Society Meeting
Chung-Ang University, Seoul
Overview
◮ Part I: Introduction to Kernel Methods for Statistical Learning and Modeling
◮ Part II: Theory of Reproducing Kernel Hilbert Space Methods
◮ Part III: Regularization Approach to Feature Selection
Part I: Introduction to Kernel Methods for Statistical Learning and Modeling
◮ Statistical learning problems
◮ Methods of regularization
◮ Smoothing splines
◮ Support vector machines
◮ Statistical issues
Prediction of Annual Household Income
http://www-stat.stanford.edu/∼tibs/ElemStatLearn/
Figure: Boxplots of the annual household income with education, age, gender, marital status, occupation, and householder status out of 13 demographic attributes in the data
Cancer Diagnosis with Microarray Data
◮ Microarrays measure the relative amount of mRNAs of (tens of) thousands of genes.
◮ Golub et al. (Science, 1999): Acute leukemia data
ALL (acute lymphoblastic leukemia) with subtype B-cell and T-cell lineage, and AML (acute myeloid leukemia)
Acute Leukemia Gene Expression Profiles
Patient   gene1     gene2     · · ·   gene7129    class
1         x_{1,1}   x_{1,2}   · · ·   x_{1,7129}  ALL−B
2         x_{2,1}   x_{2,2}   · · ·   x_{2,7129}  ALL−B
...
20        x_{20,1}  x_{20,2}  · · ·   x_{20,7129} ALL−T
...
28        x_{28,1}  x_{28,2}  · · ·   x_{28,7129} AML
...
38        x_{38,1}  x_{38,2}  · · ·   x_{38,7129} AML
Figure: The heat map shows the expression levels of the 40 most important genes for the training samples when they are appropriately standardized. Each row corresponds to a sample, which is grouped into the three classes, and the columns represent genes. The 40 genes are clustered in a way that the similarity within each class and the dissimilarity between classes are easily recognized.
Handwritten Digit Recognition
Figure: 16 × 16 grayscale images
Statistical Learning
◮ Multivariate function estimation
◮ A training data set {(x_i, y_i), i = 1, . . . , n}
◮ Learn the functional relationship f between x = (x_1, . . . , x_p) and y from the training data, which can be generalized to novel cases.
e.g. f(x) = E(Y|X = x)
◮ Examples include
Regression: continuous y ∈ R, and
Classification: categorical y ∈ {1, . . . , k}.
Goodness of a Statistical Procedure for Learning
◮ Accurate prediction with respect to a given loss L(y, f(x))
e.g. L(y, f(x)) = (y − f(x))^2 for regression
◮ Flexible (nonparametric) and data-adaptive
◮ Interpretability
e.g. subset (variable/feature) selection
◮ Computational ease for large p (high dimensional input) and n (large sample)
Large Model Space
◮ Large number of variables for high dimensional data
◮ Large number of basis functions for nonparametric modeling
◮ Need to deal with a large parameter (or model) space.
◮ Classical maximum likelihood estimation (MLE) or empirical risk minimization (ERM) no longer works, as the solution may not be well-defined or there may be infinitely many solutions that overfit the data.
◮ How to explore the large model space for stable model fitting and prediction?
Regularization
◮ Tikhonov regularization (1943): solving ill-posed integral equations numerically
◮ Process of modifying ill-posed problems by introducing additional information about the solution
◮ Modification of the maximum likelihood principle or the empirical risk minimization principle (Bickel & Li 2006)
◮ Smoothness, sparsity, small norm, large margin, ...
◮ Bayesian connection
Methods of Regularization (Penalization)
Find f(x) ∈ F minimizing

(1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λJ(f).

◮ Empirical risk + penalty
◮ F: a class of candidate functions
◮ J(f): complexity of the model f
◮ λ > 0: a regularization parameter
◮ Without the penalty J(f), the problem is ill-posed (a numerical sketch of this template follows below)
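To make the template concrete, here is a minimal numerical sketch with squared-error loss and a squared l_2-norm penalty, i.e. ridge regression; the variable names and simulated data are illustrative, not from the talk.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n)||y - X beta||^2 + lam * ||beta||^2 in closed form."""
    n, p = X.shape
    # Penalized normal equations: (X'X/n + lam*I) beta = X'y/n
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X @ np.arange(10.0) + rng.standard_normal(50)
beta_hat = ridge_fit(X, y, lam=0.1)  # shrunken, stable coefficient estimates
```

Without the penalty term (lam = 0), the same linear system can be singular or badly conditioned, which is the ill-posedness the penalty removes.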
Examples of Regularization Methods
◮ Ridge regression (Hoerl and Kennard 1970)
◮ LASSO (Tibshirani 1996)
◮ Smoothing splines (Wahba 1990)
◮ Support vector machines (Vapnik 1998)
◮ Regularized neural networks, boosting, logistic regression, ...
Nonparametric Regression
y_i = f(x_i) + ε_i for i = 1, . . . , n, where ε_i ∼ N(0, σ^2)
[Figure: scatter plot of the data, y versus x]
Smoothing Splines
Wahba (1990), Spline Models for Observational Data.
Find f(x) ∈ W_2[0, 1] = {f : f, f′ absolutely continuous, and f″ ∈ L_2} minimizing

(1/n) ∑_{i=1}^{n} (y_i − f(x_i))^2 + λ ∫_0^1 (f″(x))^2 dx.

◮ J(f) = ∫_0^1 (f″(x))^2 dx = ‖P_1 f‖^2: curvature of f
◮ λ → 0: interpolation
◮ λ → ∞: linear fit
◮ 0 < λ < ∞: piecewise cubic polynomial with two continuous derivatives (these regimes are illustrated in the sketch and figures below)
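As a rough illustration of the smoothing parameter's effect, SciPy's smoothing splines can be used; note that UnivariateSpline controls smoothness through a residual bound s rather than λ itself, so this is an analogy to the criterion above, not the identical estimator.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * np.pi * x) + rng.normal(0, 0.5, 100)

interp = UnivariateSpline(x, y, s=0)       # s = 0: interpolation (overfit)
smooth = UnivariateSpline(x, y, s=len(x))  # larger s: smoother, stiffer fit
```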
Small λ: Overfit
Figure: λ → 0: interpolation
Large λ: Underfit
Figure: λ → ∞: linear fit
Moderate λ
Figure: 0 < λ < ∞: piecewise cubic polynomial with two continuous derivatives
Risk Estimate

◮ Risk: E[(1/n) ∑_{i=1}^{n} (f_λ(x_i) − f(x_i))^2]
◮ The Mallows-type criterion:

U(λ) = (1/n)‖(I − A(λ))y‖^2 + (2σ^2/n) tr[A(λ)].

◮ d.f.(λ_U) = 9. (A computational sketch of U(λ) follows the figure below.)
[Figure: average prediction error and the unbiased risk estimate plotted against degrees of freedom (df)]
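A small sketch of how U(λ) could be computed for a linear smoother ŷ = A(λ)y; the ridge-type hat matrix and a known σ^2 are illustrative assumptions, not the talk's exact setup.

```python
import numpy as np

def hat_matrix(X, lam):
    """A(lambda) for a ridge-type linear smoother (illustrative choice)."""
    n, p = X.shape
    return X @ np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T)

def U(lam, X, y, sigma2):
    """Mallows-type criterion: (1/n)||(I - A)y||^2 + (2 sigma2 / n) tr(A)."""
    n = len(y)
    A = hat_matrix(X, lam)
    resid = y - A @ y
    # tr[A(lambda)] plays the role of the effective degrees of freedom
    return resid @ resid / n + 2 * sigma2 * np.trace(A) / n
```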
Classification
◮ x = (x_1, . . . , x_p) ∈ R^p
◮ y ∈ Y = {1, . . . , k}
◮ Learn a rule φ : R^p → Y from the training data {(x_i, y_i), i = 1, . . . , n}.
◮ The 0-1 loss function:

L(y, φ(x)) = I(y ≠ φ(x))
Separating Hyperplane
[Figure: separable data with hyperplanes wx + b = −1, wx + b = 0, wx + b = 1, and margin = 2/‖w‖]
Figure: y = +1 in red and y = −1 in blue
Support Vector Machines
Boser, Guyon, & Vapnik (1992)
Vapnik (1995), The Nature of Statistical Learning Theory.

◮ y_i ∈ {−1, 1}, class labels in the binary case
◮ f(x) = w^⊤x + b (real-valued discriminant function)
◮ Separating hyperplane with the maximum margin: f minimizing ‖w‖^2 subject to y_i f(x_i) ≥ 1 for all i = 1, . . . , n
◮ Classification rule: φ(x) = sign(f(x))
Linear SVM in Non-Separable Case
◮ Relax the separability condition y_i f(x_i) ≥ 1.
◮ Hinge loss: L(y, f(x)) = (1 − yf(x))_+ where (t)_+ = max(t, 0).
◮ If yf(x) < 1, the loss is proportional to the distance of x from the soft margin yf(x) = 1.
◮ Find f ∈ F = {f(x) = w^⊤x + b | w ∈ R^p and b ∈ R} minimizing

(1/n) ∑_{i=1}^{n} (1 − y_i f(x_i))_+ + λ‖w‖^2,

where J(f) = J(w^⊤x + b) = ‖w‖^2. (A library-based sketch follows below.)
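A hedged library-based sketch of the soft-margin linear SVM; scikit-learn parameterizes the problem as (1/2)‖w‖^2 + C ∑ hinge, so its C corresponds roughly to 1/(2nλ) in the formulation above. The simulated data are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 2))
y = np.sign(X[:, 0] + X[:, 1] + 0.3 * rng.standard_normal(100))

# loss="hinge" matches (1 - y f(x))_+; large C ~ small lambda (less penalty)
clf = LinearSVC(loss="hinge", C=1.0).fit(X, y)
phi = clf.predict(X)  # classification rule sign(w'x + b)
```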
Hinge Loss
[Figure: the misclassification loss [−t]_* and the hinge loss (1 − t)_+ plotted against t = yf]
Figure: (1 − yf(x))_+ is an upper bound of the misclassification loss function I(y ≠ φ(x)) = [−yf(x)]_* ≤ (1 − yf(x))_+, where [t]_* = I(t ≥ 0) and (t)_+ = max{t, 0}.
Nonlinear SVM
◮ Linear SVM solution:

f(x) = ∑_{i=1}^{n} c_i x_i^⊤ x + b

◮ Replace the Euclidean inner product x^⊤x′ with K(x, x′) = Φ(x)^⊤Φ(x′) for a mapping Φ from R^p to a higher dimensional ‘feature space.’
◮ Nonlinear kernels: K(x, x′) = (1 + x^⊤x′)^d, exp(−‖x − x′‖^2/2σ^2), ...
e.g. For p = 2 and x = (x_1, x_2), Φ(x) = (1, √2 x_1, √2 x_2, x_1^2, x_2^2, √2 x_1 x_2) gives K(x, x′) = (1 + x^⊤x′)^2. (Verified numerically in the sketch below.)
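A quick numerical check (for p = 2, d = 2) that the explicit feature map Φ above reproduces the polynomial kernel; the test points are arbitrary.

```python
import numpy as np

def Phi(x):
    """Explicit feature map for the degree-2 polynomial kernel, p = 2."""
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
# Inner product in feature space equals the kernel evaluated on inputs
assert np.isclose(Phi(x) @ Phi(xp), (1 + x @ xp) ** 2)
```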
Classification
y_i ∈ {1: red, −1: blue}

[Figure: scatter plot of the two classes in the (x_1, x_2) plane]
Classification Boundary with a Large λ
[Figure: fitted classification boundary in the (x_1, x_2) plane]
Classification Boundary with a Small λ
[Figure: fitted classification boundary in the (x_1, x_2) plane]
Test Error Rates
Figure: Error rate over 1,000 test cases as a function of λ.
Statistical Issues
◮ Risk or generalization error estimation
◮ Model selection/averaging - choice of tuning parameter(s) (CV, GCV, resampling, risk bounds, ...)
◮ Variable or feature selection
◮ Computation
  ◮ e.g. least squares problem for smoothing splines and quadratic programming for the SVM
  ◮ Need to solve a family of optimization problems indexed by λ.
  ◮ Use the characteristics of regularized solutions for efficient algorithms.
Summary
◮ Many statistical learning methods can be cast in a regularization framework.
◮ Examples include smoothing splines and support vector machines.
◮ Regularization entails a model selection problem. Tuning parameters need to be chosen to optimize the “bias-variance tradeoff.”
◮ A more formal treatment of kernel methods is given in Part II.
Part II: Theory of Reproducing Kernel Hilbert Space Methods
◮ Regularization in RKHS
◮ Reproducing kernel Hilbert spaces
◮ Properties of kernels
◮ Examples of RKHS methods
◮ Representer Theorem
Regularization in RKHS
Find f = ∑_{ν=1}^{M} d_ν φ_ν + h with h ∈ H_K minimizing

(1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λ‖h‖^2_{H_K}.

◮ H_K: a reproducing kernel Hilbert space of functions defined on a domain, which can be arbitrary
◮ J(f) = ‖h‖^2_{H_K}: penalty
◮ The null space is spanned by {φ_ν}_{ν=1}^{M}
Why consider RKHS?
◮ Theoretical basis for some popular regularization methods
◮ Unified framework for function estimation and modeling various data
◮ Allow general domains for functions
◮ Permit geometric understanding
◮ Can do much more than estimation of function values (e.g. integrals and derivatives)
Reproducing Kernel Hilbert Spaces
◮ Consider a Hilbert space H of real valued functions on a domain X.
◮ A Hilbert space H is a complete inner product linear space.
◮ For example, the domain X could be
  ◮ {1, . . . , k}
  ◮ [0, 1]
  ◮ {A, C, G, T}
  ◮ R^p
  ◮ S: sphere.
◮ A reproducing kernel Hilbert space is a Hilbert space of real valued functions, where the evaluation functional L_x(f) = f(x) is bounded in H for each x ∈ X.
Riesz Representation Theorem
◮ For every bounded linear functional L on a Hilbert space H, there exists a unique g_L ∈ H such that L(f) = (g_L, f), ∀f ∈ H.
◮ g_L is called the representer of L.
Reproducing Kernel
Aronszajn (1950), Theory of Reproducing Kernels.

◮ By the Riesz representation theorem, there exists K_x ∈ H, the representer of L_x(·), such that (K_x, f) = f(x), ∀f ∈ H.
◮ K(x, t) = K_x(t) is called the reproducing kernel.
  ◮ K(x, ·) ∈ H for each x
  ◮ (K(x, ·), f(·)) = f(x) for all f ∈ H
◮ Note that K(x, t) = K_x(t) = (K_t(·), K_x(·)) = (K_x(·), K_t(·)) (the reproducing property)
Example of RKHS
◮ F = {f | f : {1, . . . , k} → R} = R^k with the Euclidean inner product (f, g) = f^t g = ∑_{j=1}^{k} f_j g_j
◮ Note that |f(j)| ≤ ‖f‖. That is, the evaluation functional L_j(f) = f(j) is bounded in F for each j ∈ X.
◮ L_j(f) = f(j) = (e_j, f) where e_j is the j-th column of I_k. Hence K(i, j) = δ_ij = I[i = j], or [K(i, j)] = I_k.
The Mercer-Hilbert-Schmidt Theorem
◮ If ∫_X ∫_X K^2(s, t) ds dt < ∞ for a continuous symmetric non-negative definite K, then there exists an orthonormal sequence of continuous eigenfunctions Φ_1, Φ_2, . . . in L_2[X] and eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ 0 with ∑_{i=1}^{∞} λ_i^2 < ∞, with

∫_X K(s, t) Φ_i(s) ds = λ_i Φ_i(t) and K(s, t) = ∑_{i=1}^{∞} λ_i Φ_i(s) Φ_i(t).

◮ The inner product in H of functions f with ∑_i (f_i^2/λ_i) < ∞ is

(f, g) = ∑_{i=1}^{∞} f_i g_i / λ_i, where f_i = ∫_X f(t) Φ_i(t) dt.

◮ Feature mapping: Φ(x) = (√λ_1 Φ_1(x), √λ_2 Φ_2(x), . . . )
Reproducing Kernel is Non-Negative Definite
◮ A bivariate function K(·, ·) : X × X → R is non-negative definite if for every n, every x_1, . . . , x_n ∈ X, and every a_1, . . . , a_n ∈ R,

∑_{i,j=1}^{n} a_i a_j K(x_i, x_j) ≥ 0.

In other words, letting a = (a_1, . . . , a_n)^t,

a^t [K(x_i, x_j)] a ≥ 0.

◮ For a reproducing kernel K,

∑_{i,j=1}^{n} a_i a_j K(x_i, x_j) = ‖∑_{i=1}^{n} a_i K(x_i, ·)‖^2 ≥ 0.

(A numerical check of this property appears below.)
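This property is easy to verify numerically: the Gram matrix [K(x_i, x_j)] of a valid kernel should have no negative eigenvalues (up to round-off). The Gaussian kernel and random inputs below are illustrative choices.

```python
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    """K(s, t) = exp(-||s - t||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((s - t) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(3)
xs = rng.standard_normal((20, 3))
G = np.array([[gaussian_kernel(s, t) for t in xs] for s in xs])
# A symmetric matrix is n.n.d. iff its smallest eigenvalue is >= 0
assert np.linalg.eigvalsh(G).min() > -1e-10
```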
The Moore-Aronszajn Theorem
◮ For every RKHS H of functions on X, there corresponds a unique reproducing kernel (RK) K(s, t), which is n.n.d.
◮ Conversely, for every n.n.d. function K(s, t) on X, there corresponds a unique RKHS H that has K(s, t) as its RK.
How to construct an RKHS given an n.n.d. K(s, t)

◮ For each fixed x ∈ X, define K_x(·) = K(x, ·).
◮ Taking all finite linear combinations of the form ∑_i a_i K_{x_i} for all choices of n, a_1, . . . , a_n, and x_1, . . . , x_n, construct a linear space M.
◮ Define an inner product via (K_{x_i}, K_{x_j}) = K(x_i, x_j) and extend it by linearity:

(∑_i a_i K_{x_i}, ∑_j b_j K_{t_j}) = ∑_{i,j} a_i b_j (K_{x_i}, K_{t_j}) = ∑_{i,j} a_i b_j K(x_i, t_j).

◮ For any f of the form ∑_i a_i K_{x_i}, (K_x, f) = f(x).
◮ Complete M by adjoining all the limits of Cauchy sequences of functions in M.
Sum of Reproducing Kernels
◮ The sum of two n.n.d. matrices is n.n.d.
◮ Hence, the sum of two n.n.d. functions defined on the same domain X is n.n.d.
◮ In particular, if H = H_1 ⊕ H_2 and K_j(s, t) is the RK of H_j for j = 1, 2, then K(s, t) = K_1(s, t) + K_2(s, t) is the RK for the tensor sum space H_1 ⊕ H_2.
Product of Reproducing Kernels
◮ Suppose that K_1(x_1, x_2) is n.n.d. on X_1 and K_2(t_1, t_2) is n.n.d. on X_2.
◮ Consider the tensor product of K_1 and K_2,

K((x_1, t_1), (x_2, t_2)) = K_1(x_1, x_2) K_2(t_1, t_2),

on X = X_1 × X_2. K(·, ·) is n.n.d. on X.
◮ The tensor product of two RKs K_1 and K_2 for H_1 and H_2 is an RK for the tensor product space H_1 ⊗ H_2 on X.
Constructing Kernels
◮ Use reproducing kernels on a univariate domain as building blocks.
◮ Tensor sums and products of reproducing kernels. (See the sketch below.)
◮ Systematic approach to estimating multivariate functions.
◮ Other tricks to expand and design kernels: Haussler (1999), Convolution Kernels on Discrete Structures.
◮ Learning kernels (n.n.d. matrices) from data: Lanckriet et al. (2004), Learning the Kernel Matrix with Semidefinite Programming, JMLR.
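A small sketch of the building-block idea: starting from univariate kernels, sums give RKs of tensor-sum spaces and products give RKs of tensor-product spaces. The particular univariate kernels here are illustrative choices.

```python
import numpy as np

def k_lin(s, t):
    """Univariate linear kernel."""
    return s * t

def k_rbf(s, t, sigma=1.0):
    """Univariate Gaussian kernel."""
    return np.exp(-(s - t) ** 2 / (2 * sigma ** 2))

def k_sum(x, xp):
    """Sum of kernels in separate coordinates: still n.n.d. on the product domain."""
    return k_lin(x[0], xp[0]) + k_rbf(x[1], xp[1])

def k_prod(x, xp):
    """Tensor product of kernels: RK of the tensor product space."""
    return k_lin(x[0], xp[0]) * k_rbf(x[1], xp[1])
```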
Cubic Smoothing Splines
Find f(x) ∈ W_2[0, 1] = {f : f, f′ absolutely continuous, and f″ ∈ L_2} minimizing

(1/n) ∑_{i=1}^{n} (y_i − f(x_i))^2 + λ ∫_0^1 (f″(x))^2 dx.

◮ The null space: M = 2, φ_1(x) = 1, and φ_2(x) = x.
◮ The penalized space: H_K = W_2^0[0, 1] = {f ∈ W_2[0, 1] : f(0) = 0, f′(0) = 0} is an RKHS with
i) (f, g) = ∫_0^1 f″(x) g″(x) dx
ii) ‖f‖^2 = ∫_0^1 (f″(x))^2 dx
iii) K(x, x′) = ∫_0^1 (x − u)_+ (x′ − u)_+ du.
SVM in General
Find f(x) = b + h(x) with h ∈ H_K minimizing

(1/n) ∑_{i=1}^{n} (1 − y_i f(x_i))_+ + λ‖h‖^2_{H_K}.

◮ The null space: M = 1 and φ_1(x) = 1
◮ Linear SVM: H_K = {h(x) = w^⊤x | w ∈ R^p} with
i) K(x, x′) = x^⊤x′
ii) ‖h‖^2_{H_K} = ‖w^⊤x‖^2_{H_K} = ‖w‖^2
◮ Nonlinear SVM: K(x, x′) = (1 + x^⊤x′)^d (polynomial kernel), exp(−‖x − x′‖^2/2σ^2) (Gaussian kernel), ...
Representer Theorem
Kimeldorf and Wahba (1971)

◮ The minimizer f = ∑_{ν=1}^{M} d_ν φ_ν + h with h ∈ H_K of

(1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λ‖h‖^2_{H_K}

has a representation of the form

f(x) = ∑_{ν=1}^{M} d_ν φ_ν(x) + ∑_{i=1}^{n} c_i K(x_i, x).

◮ ‖h‖^2_{H_K} = ∑_{i,j} c_i c_j K(x_i, x_j). (A worked instance with squared-error loss follows below.)
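A worked instance of the theorem with squared-error loss (kernel ridge regression): the possibly infinite-dimensional RKHS problem reduces to an n-dimensional linear system for the coefficients c. The null space term is omitted here to keep the sketch short.

```python
import numpy as np

def krr_fit(X, y, kernel, lam):
    """Solve (K + n*lam*I) c = y, the stationarity condition of
    (1/n)||y - Kc||^2 + lam * c'Kc for f(x) = sum_i c_i K(x_i, x)."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    n = len(y)
    c = np.linalg.solve(K + n * lam * np.eye(n), y)
    return c

def krr_predict(Xnew, X, c, kernel):
    """Evaluate f(x) = sum_i c_i K(x_i, x) at new points."""
    return np.array([sum(ci * kernel(xi, x) for ci, xi in zip(c, X))
                     for x in Xnew])
```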
Sketch of the Proof
◮ Write h ∈ H_K as

h(x) = ∑_{i=1}^{n} c_i K(x_i, x) + ρ(x),

where ρ(x) is some element in H_K perpendicular to K(x_i, x) for i = 1, . . . , n.
◮ Note that h(x_i) = (K(x_i, ·), h(·))_{H_K} does not depend on ρ, and ‖h‖^2_{H_K} = ∑_{i=1}^{n} ∑_{j=1}^{n} c_i c_j K(x_i, x_j) + ‖ρ‖^2_{H_K}.
◮ Then ‖ρ‖^2_{H_K} needs to be zero.
◮ Hence, the minimizer f is of the form

f(x) = ∑_{ν=1}^{M} d_ν φ_ν(x) + ∑_{i=1}^{n} c_i K(x_i, x).
Remarks on the Representer Theorem
◮ It holds for an arbitrary loss function L.
◮ The minimizer of an RKHS method resides in a finite dimensional space.
◮ So the solution is computable even if the RKHS has infinite dimension.
◮ The resulting optimization problems with the representation of f depend on L and the penalty J(f).
Loss Function and its Risk Minimizer f
L(y, f(x))                                  argmin E[L(Y, f(X))|X = x]

Regression
(y − f(x))^2                                E(Y|X = x)
|y − f(x)|                                  median(Y|X = x)

Classification with y = ±1
SVM: (1 − yf(x))_+                          sign{p(x) − 1/2}
Logistic regression: log{1 + exp(−yf(x))}   log p(x)/(1 − p(x))
Boosting: exp(−yf(x))                       (1/2) log p(x)/(1 − p(x))

where p(x) = P(Y = 1|X = x)
Hinge vs. Negative Log-Likelihood
Figure: Solid: 2p(x) − 1; dotted: 2p_LG(x) − 1; dashed: f_SVM(x)
Smoothing Splines for Modeling Non-Gaussian Response

◮ Generalized linear models for data with y from exponential families can be extended nonparametrically.
◮ The main idea of the extension of smoothing splines to the non-Gaussian case is to replace the squared error loss by the negative log likelihood function associated with y.
◮ Thus, the penalized least squares method becomes the method of penalized negative log likelihood.
◮ Non-Gaussian data can be treated in the framework of RKHS methods in a unified way, yet they require some computational modification.
Logistic Regression
◮ Suppose that we have independent data points (x_i, y_i), i = 1, . . . , n, and Y_i|x_i ∼ Bernoulli(p(x_i)), where p(x) = P(Y = 1|X = x). Note that y_i ∈ {0, 1}.
◮ The likelihood of the data conditional on x is

∏_{i=1}^{n} p(x_i)^{y_i} (1 − p(x_i))^{1−y_i}.

◮ Model the logit function of p(x), log{p(x)/(1 − p(x))}, by f ∈ H.
◮ In terms of the logit f(x), the likelihood is given by

∏_{i=1}^{n} exp(y_i f(x_i) − log(1 + e^{f(x_i)})) = exp(∑_{i=1}^{n} y_i f(x_i) − log(1 + e^{f(x_i)})).
Formulation of Penalized Logistic Regression
◮ Use the negative log likelihood as a loss function.
◮ Find f_λ ∈ H = H_0 ⊕ H_1 minimizing

(1/n) ∑_{i=1}^{n} {−y_i f(x_i) + log(1 + e^{f(x_i)})} + λ‖P_1 f‖^2.

◮ By the representer theorem,

f_λ(x) = ∑_{ν=1}^{M} d_ν φ_ν(x) + ∑_{i=1}^{n} c_i K(x_i, x).

◮ The optimization problem of finding d = (d_1, . . . , d_M)^t and c = (c_1, . . . , c_n)^t entails iterations.
◮ The solution can be approximated by iteratively reweighted least squares. (A sketch follows below.)
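A hedged sketch of that iteration for the kernelized problem, with f = Kc, penalty λ c^⊤Kc, and the intercept/null space omitted for brevity; each pass performs one Newton (IRLS) step on the penalized negative log likelihood.

```python
import numpy as np

def kernel_logistic(K, y, lam, iters=25):
    """y in {0,1}; K is the n x n Gram matrix; returns coefficients c."""
    n = len(y)
    c = np.zeros(n)
    for _ in range(iters):
        f = K @ c
        p = 1 / (1 + np.exp(-f))   # current probabilities
        W = p * (1 - p)            # IRLS weights
        # Newton step (with K factored out of gradient and Hessian):
        # (diag(W) K / n + 2 lam I) delta = -((p - y)/n + 2 lam c)
        A = (W[:, None] * K) / n + 2 * lam * np.eye(n)
        c = c + np.linalg.solve(A, -((p - y) / n + 2 * lam * c))
    return c
```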
Machine Learning View on Kernels
◮ K(x, x′) = Φ(x)^⊤Φ(x′) through a feature mapping Φ from R^p to a feature space.
◮ Kernel trick: kernelize, that is, replace the Euclidean inner product x^⊤x′ in your linear method with K(x, x′)!
◮ This idea goes beyond supervised learning problems.
  ◮ nonlinear dimension reduction and data visualization: kernel PCA (see the sketch below)
  ◮ clustering: kernel k-means algorithm
  ◮ ...
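For instance, kernel PCA is available off the shelf; a minimal scikit-learn sketch with a Gaussian (RBF) kernel, where the data and parameters are illustrative:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 5))
# Nonlinear 2-D embedding for visualization via the kernel trick
Z = KernelPCA(n_components=2, kernel="rbf", gamma=0.5).fit_transform(X)
```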
Summary
◮ RKHS methods provide a unified framework for statistical model building.
◮ Kernels are now used as a versatile tool for flexible modeling and learning in various contexts.
◮ Feature selection for kernel methods will be discussed in Part III, based on the idea of kernel construction by sums and products.
Part III: Regularization Approach to Feature Selection
◮ Feature selection procedures
◮ LASSO, modeling with an l_1 constraint
◮ Generalization of the LASSO for kernel methods
◮ Application to finding biomarkers
Motivation for Feature Selection
◮ Key questions in many scientific investigations.
◮ Achieve parsimony (Occam’s razor): “Entities should not be multiplied beyond necessity.”
◮ Enhance interpretation.
◮ Often reduce variance, hence improve prediction accuracy.
Feature Selection Procedures
◮ Combinatorial approach: best subset selection, forward selection, backward elimination, stepwise regression
e.g. Guyon et al. (2002), Recursive feature selection
◮ l_1 penalty for simultaneous fitting and selection:
e.g. Bradley and Mangasarian (1998), Linear SVM with l_1 penalty
Tibshirani (1996), LASSO (Least Absolute Shrinkage and Selection Operator)
Chen and Donoho (1994), Basis Pursuit
◮ Other variants for groupings of variables, model hierarchy, adaptiveness, and efficiency.
LASSO
min_β ∑_{i=1}^{n} (y_i − ∑_{j=1}^{p} β_j x_{ij})^2 + λ‖β‖_1
⇔
min_β ∑_{i=1}^{n} (y_i − ∑_{j=1}^{p} β_j x_{ij})^2 s.t. ‖β‖_1 ≤ s.
Ridge Regression
min_β ∑_{i=1}^{n} (y_i − ∑_{j=1}^{p} β_j x_{ij})^2 + λ‖β‖_2^2.

(A sketch contrasting the two penalties follows below.)
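A side-by-side sketch of the two penalties with scikit-learn; note that the library scales the Lasso criterion as (1/(2n))‖y − Xβ‖^2 + α‖β‖_1, so its α is a rescaling of the λ above. The sparse truth and data are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 20))
beta = np.zeros(20)
beta[:3] = [3.0, -2.0, 1.5]              # sparse true coefficients
y = X @ beta + rng.standard_normal(100)

lasso = Lasso(alpha=0.1).fit(X, y)       # l1: many coefficients exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)       # l2: all coefficients shrunk, none zero
print((lasso.coef_ == 0).sum(), (ridge.coef_ == 0).sum())
```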
LASSO Coefficient Paths
[Figure: LASSO coefficient paths: standardized coefficients plotted against |beta|/max|beta|]
Median Regression with l1 Penalty for Income Data
Figure: The coefficient paths of the main effects model. Positive: home ownership (in dark blue, relative to renting), education (in brown), dual income due to marriage (in purple, relative to ‘not married’), age (in sky blue), and male (in light green). Negative: single or divorced (in red, relative to ‘married’) and student, clerical worker, retired or unemployed (in green, relative to professionals/managers)
Median Regression with l1 Penalty for Income Data
Figure: The coefficient paths of a partial two-way interaction model. Positive: ‘dual income ∗ home ownership’, ‘home ownership ∗ education’, and ‘married but no dual income ∗ education’. Negative: ‘single ∗ education’ and ‘home ownership ∗ age’
Risk Path
Figure: The risks of the two-way fitted models are estimated by using a test data set with 4,876 observations.
Generalization of LASSO
◮ Kernel methods may be difficult to interpret when the embedding into feature space is implicit.
◮ Regression: Lin and Zhang (2003), COmponent Selection and Smoothing Operator; Gunn and Kandola (2002), Structural modelling with sparse kernels
◮ Classification: Zhang (2006) for the binary SVM; Lee et al. (2006) for the multiclass SVM
Strategy for Feature Selection
◮ Structured representation of f using the functional ANOVA decomposition
◮ A sparse solution approach with an l_1 penalty
◮ A unified treatment for regression and classification (both linear and nonlinear cases)
◮ Inexpensive additional computation
◮ Systematic elaboration of f with features
Functional ANOVA Decomposition
◮ For f defined on a product domain X = ∏_{j=1}^{p} X_j,

f = ∏_j [A_j + (I − A_j)] f
  = (∏_j A_j) f + ∑_i (∏_{j≠i} A_j)(I − A_i) f + ∑_{i<j} (∏_{r≠i,j} A_r)(I − A_i)(I − A_j) f + · · ·

◮ Functional “overall mean” + “main effects” + “two-way interactions” + · · ·
◮ Define A_j appropriately so that the decomposition into A_j and I − A_j corresponds to {1} ⊕ H_j.
ANOVA Spaces and Kernels
Wahba (1990), smoothing spline ANOVA models

◮ Function: f(x) = b + ∑_{α=1}^{p} f_α(x_α) + ∑_{α<β} f_{αβ}(x_α, x_β) + · · ·
◮ Functional space: f ∈ H = ⊗_{α=1}^{p} ({1} ⊕ H_α),
H = {1} ⊕ ∑_{α=1}^{p} H_α ⊕ ∑_{α<β} (H_α ⊗ H_β) ⊕ · · ·
◮ Reproducing kernel (r.k.): K(x, x′) = 1 + ∑_{α=1}^{p} K_α(x, x′) + ∑_{α<β} K_{αβ}(x, x′) + · · ·
Two-way ANOVA Decomposition of f ∈ W_2[0, 1] ⊗ W_2[0, 1]

◮ Two-way ANOVA decomposition of f:

f(x_1, x_2) = f_0 + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2)

with the side conditions that ∫_0^1 f_j(x_j) dx_j = 0 and ∫_0^1 f_{12}(x_1, x_2) dx_j = 0 for j = 1, 2.
◮ The corresponding functional components:

             | {1}                 | {k_1(x_2)}           | H_2^1
{1}          | mean                | p-main effect (x_2)  | n-main effect (x_2)
{k_1(x_1)}   | p-main effect (x_1) | p×p interaction      | p×n interaction
H_1^1        | n-main effect (x_1) | n×p interaction      | n×n interaction

where “p” and “n” mean parametric and nonparametric, respectively.
[Figure: perspective plots of the functional components over (x_1, x_2): the parametric main effects of x_1 and x_2, the nonparametric main effects of x_1 and x_2, and the p-p, n-p, p-n, and n-n interactions]
l1 Penalty on θ
◮ Modification of the r.k. by rescaling parameters θ ≥ 0:

K_θ(x, x′) = 1 + ∑_{α=1}^{p} θ_α K_α(x, x′) + ∑_{α<β} θ_{αβ} K_{αβ}(x, x′) + · · ·

◮ Truncating H to F = {1} ⊕_{ν=1}^{d} F_ν, find f(x) ∈ F minimizing

(1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λ ∑_ν θ_ν^{−1} ‖P_ν f‖^2.

Then f(x) = b + ∑_{i=1}^{n} c_i [∑_{ν=1}^{d} θ_ν K_ν(x_i, x)].
◮ For sparsity, minimize

(1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λ ∑_{ν=1}^{d} θ_ν^{−1} ‖P_ν f‖^2 + λ_θ ∑_{ν=1}^{d} θ_ν

subject to θ_ν ≥ 0, ∀ν.
Related to Kernel Learning
◮ Micchelli and Pontil (2005), Learning the kernel function via regularization
◮ K = {K_ν, ν ∈ N}: a compact and convex set of kernels
◮ A variational problem for optimal kernel configuration:

min_{K ∈ K} ( min_{f ∈ H_K} (1/n) ∑_{i=1}^{n} L(y_i, f(x_i)) + λJ(f) )
One-Step Update for Structured Regression
◮ Given b and {c_j}, recalibrate θ to minimize

(1/n) ∑_{i=1}^{n} (y_i − b − ∑_{ν=1}^{d} θ_ν [∑_{j=1}^{n} c_j K_ν(x_j, x_i)])^2 + λ ∑_ν θ_ν ∑_{i,j=1}^{n} c_i c_j K_ν(x_i, x_j)

subject to θ_ν ≥ 0, ∀ν, and ∑_ν θ_ν ≤ s. (A sketch of this θ-step follows below.)
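A sketch of this θ-step as a generic constrained optimization; G and g collect the kernel quantities defined above, and the SciPy solver stands in for the parametric LP/QP used in practice.

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

def theta_step(y, b, G, g, lam, s):
    """G[i, nu] = sum_j c_j K_nu(x_j, x_i); g[nu] = sum_{i,j} c_i c_j K_nu(x_i, x_j)."""
    n, d = G.shape

    def obj(theta):
        r = y - b - G @ theta
        return r @ r / n + lam * g @ theta   # penalized least squares in theta

    res = minimize(obj, x0=np.full(d, 1.0 / d), method="trust-constr",
                   bounds=[(0, None)] * d,                                # theta_nu >= 0
                   constraints=[LinearConstraint(np.ones((1, d)), -np.inf, s)])  # sum <= s
    return res.x
```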
Nonnegative Garrote
Breiman, L. (1995), Better Subset Regression Using the Nonnegative Garrote

◮ Stepwise variable selection can be unstable with respect to small perturbations in the data.
◮ Starting with the full LSE, it both shrinks and zeroes coefficients.
◮ Given β^LS, take (c_1, . . . , c_p) to minimize

∑_{i=1}^{n} (y_i − ∑_{j=1}^{p} c_j β^LS_j x_{ij})^2

subject to c_j ≥ 0 and ∑_{j=1}^{p} c_j ≤ s.
◮ Generally lower prediction error than best subset selection
Structured SVM with ANOVA Decomposition

◮ The binary case (Zhang, 2006): Find f(x) = b + h(x) minimizing

(1/n) ∑_{i=1}^{n} (1 − y_i f(x_i))_+ + λ ∑_{ν=1}^{d} θ_ν^{−1} ‖P_ν f‖^2 + λ_θ ∑_{ν=1}^{d} θ_ν

subject to θ_ν ≥ 0, ∀ν.
◮ The multiclass case (Lee et al., 2006): Find f = (f^1, . . . , f^k) = (b_1 + h_1(x), . . . , b_k + h_k(x)) with the sum-to-zero constraint minimizing

(1/n) ∑_{i=1}^{n} ∑_{j≠y_i} (f^j(x_i) − y^j_i)_+ + (λ/2) ∑_{j=1}^{k} (∑_{ν=1}^{d} θ_ν^{−1} ‖P_ν h_j‖^2) + λ_θ ∑_{ν=1}^{d} θ_ν subject to θ_ν ≥ 0, ∀ν,

where (y^1, . . . , y^k) is a class code with y^j = 1 if y = j and −1/(k − 1) elsewhere, and φ(x) = argmax_j [f^j(x)].
Updating Algorithm
Letting C = (b, {c_j}) and denoting the objective function by Φ(θ, C),

◮ Initialize θ^(0) = (1, . . . , 1)^t and C^(0) = argmin_C Φ(θ^(0), C).
◮ At the m-th iteration (m = 1, 2, . . .):
(θ-step) find θ^(m) minimizing Φ(θ, C^(m−1)) with C fixed.
(c-step) find C^(m) minimizing Φ(θ^(m), C) with θ fixed.
◮ A one-step update can be used in practice. (A schematic skeleton follows below.)
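A schematic skeleton of the alternation; fit_C and fit_theta are placeholders for the actual c-step and θ-step solvers (e.g. the SVM quadratic program and the constrained least squares sketched earlier), so this fixes only the control flow.

```python
def alternate(y, Ks, lam, lam_theta, fit_C, fit_theta, max_iter=10):
    """Ks: list of component Gram matrices K_nu; fit_C / fit_theta are
    placeholder solvers minimizing Phi(theta, C) with the other block fixed."""
    d = len(Ks)
    theta = [1.0] * d                           # theta^(0) = (1, ..., 1)'
    C = fit_C(y, Ks, theta, lam)                # C^(0) = argmin_C Phi(theta^(0), C)
    for _ in range(max_iter):                   # or a single pass (one-step update)
        theta = fit_theta(y, Ks, C, lam_theta)  # theta-step: C fixed
        C = fit_C(y, Ks, theta, lam)            # c-step: theta fixed
    return theta, C
```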
Two-Way Regularization
◮ c-step solutions range from the simplest model (or majority rule) to a complete overfit of the data as λ decreases. (Standard regularization procedure)
◮ θ-step solutions range from the constant model to the full model with all the variables as λ_θ decreases. (Functional component pursuit)
Breast Cancer Data
◮ Sharma et al. (2005), Early detection of breast cancer based on gene-expression patterns in peripheral blood cells, Breast Cancer Research.
◮ Develop accurate and convenient methods for the detection of breast cancer using blood samples.
◮ 60 unique blood samples from 24 women with breast cancer and 32 women with no signs of the disease
◮ Mean normalized and cube-root transformed expression levels of 1,368 cDNAs
◮ The nearest shrunken centroid method (Tibshirani et al. 2002) was used in the original paper.
Searching for Gene Signatures
◮ The Nearest Shrunken Centroid (NSC) method
◮ The ℓ_1-norm SVM with
  ◮ 1,368 linear terms
  ◮ 1,368 linear and quadratic terms
◮ The structured kernel SVM with 1,368 nonparametric main effects terms
◮ ‘External’ 6-fold cross-validation (Ambroise and McLachlan, PNAS 2002)
  ◮ Split the 60 samples into a training set of 50 and a test set of 10.
  ◮ Internal 5-fold CV for selection of an optimal tuning parameter (a threshold for NSC, and penalty parameters for the SVMs).
Computation
◮ Both the ℓ_1-norm SVM and the θ-step of the structured kernel SVM require parametric linear programming.
◮ Solutions are piecewise constant in the tuning parameter (λ or λ_θ).
◮ The simplex algorithm can be used to generate the entire regularized solution path (Yao and Lee, 2007).
Figure: Recalibration parameter path for structured SVM.
[Figure: risk with respect to the 0-1 loss and the hinge loss plotted against s]
Figure: Error rate path for the θ-step of structured SVM.
Error Rates
◮ External 6-fold CV
◮ Comparison:

Method   NSC     L.SVM   LQ.SVM   StructSVM
Mean     0.186   0.197   0.279    0.170
SE       0.050   0.051   0.058    0.048
Summary
◮ Integrate feature selection with kernel methods using an l_1-type penalty.
◮ Enhance interpretation without compromising prediction accuracy.
◮ General approach for structured and sparse representation with kernels.
◮ RKHS methods can solve a wide range of statistical learning problems in a principled way.
Software
◮ Smoothing spline ANOVA models: ssanova for Gaussian response and gssanova for non-Gaussian response in the gss R library
◮ Support vector machines:
kernlab for SVM, spectral clustering, and kernel PCA
http://www.kernel-machines.org for other implementations in Matlab and Fortran
svmpath for the binary SVM solution path
msvmpath for the multicategory SVM solution path (in progress)
◮ Other path-finding algorithms:
lars for LASSO, LAR, and stagewise fitting
glmpath for l_1 regularized generalized linear models
lpRegPath for parametric linear programming with linear loss and l_1 penalty (in progress)
Acknowledgments
◮ Grace Wahba and Yi Lin (Statistics, University of Wisconsin)
◮ Cheol-Koo Lee (Genomics, Korea University)
◮ Zhenhuan Cui and Yonggang Yao (former students, now at SAS)
◮ For preprints or reprints of my work, visit http://www.stat.osu.edu/∼yklee
◮ E-mail: [email protected]
References
Peter J. Bickel and Bo Li. Regularization in statistics. Test, 15(2):271-344, Dec 2006.

Leo Breiman. Statistical modeling: The two cultures. Statistical Science, 16(3):199-215, Aug 2001.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Verlag, New York, 2001.

B. Scholkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.

V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.

G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.