TRANSCRIPT
Learning the Kernel: Theory and Applications

Yiming Ying
Department of Engineering Mathematics, University of Bristol
February 2009
Outline
1 Motivation for Learning the Kernel
2 Regularization Framework for Kernel Learning
3 Statistical Generalization Analysis
4 Data Integration via KL-divergence
5 Experiments
6 Conclusion and Outlook
Yeast Protein Function Prediction (CYGD): Multi-label Outputs

Protein functional classes: metabolism, membrane, protein synthesis, etc.
Functional characteristics: protein-protein interaction, mRNA expression patterns, amino acid sequence, etc.
Protein Fold Recognition (SCOP): Multi-class Outputs

Protein fold: major secondary structures and the same topology
Fold characteristics: physicochemical and structural properties of amino acids, etc.
Gene Network Inference: Graph-structured Outputs

Protein-protein interaction or metabolic network (yeast)
Gene expression measurements; phylogenetic profiles; location of proteins/enzymes in the cell
Image from J.P. Vert
Practical Problems and Goals

Goal I: best prediction from a single data source
Goal II: integration of multiple data sources
Challenging issues: structured outputs (protein function prediction, network inference), a very large number of classes (protein fold recognition), multiple data sources
SVM: Linear Maximum Margin

Hyperplanes: $\langle w, x \rangle + b = \pm 1$
Decision plane: $\langle w, x \rangle + b = 0$
Margin: $\frac{2}{\|w\|}$

Primal problem:
$$\min_{w,b} \ \|w\|^2 \quad \text{s.t.} \quad \langle w, x_i \rangle + b \ge 1 \ \text{if } y_i = 1; \qquad \langle w, x_i \rangle + b \le -1 \ \text{if } y_i = -1$$

Dual problem:
$$\max_{\alpha} \ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad \alpha_i \ge 0 \ \forall i$$
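As a concrete illustration of this dual QP, here is a minimal numerical sketch. It assumes the numpy and cvxpy packages and synthetic toy data; the data, the support-vector tolerance, and all variable names are illustrative assumptions, not part of the original slides.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
n = len(y)

K = X @ X.T                                   # linear kernel Gram matrix <x_i, x_j>
alpha = cp.Variable(n)

# Dual objective as on the slide: sum_i alpha_i - sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>.
# psd_wrap skips cvxpy's numerical PSD check; K = X X^T is p.s.d. by construction.
obj = cp.Maximize(cp.sum(alpha) - cp.quad_form(cp.multiply(y, alpha), cp.psd_wrap(K)))
cons = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(obj, cons).solve()

# Recover the hyperplane from the KKT conditions; the factor 2 compensates for
# the missing 1/2 in this dual's quadratic term (here w = 2 * sum_i alpha_i y_i x_i).
a = alpha.value
w = 2.0 * (a * y) @ X
sv = a > 1e-5 * a.max()                       # support vectors: alpha_i > 0
b = float(np.mean(y[sv] - X[sv] @ w))
print("w =", w, " b =", b)
```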
SVM: Dual Transition to the Nonlinear Case

Feature map: $\phi : \mathcal{X} \to \mathcal{F}$, $x \mapsto \phi(x)$

Dual formulation of SVM (QP):
$$\max_{\alpha} \ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle \quad \text{s.t.} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \ \forall i.$$
SVM: Regularization

Kernel: $K(x,t) = \langle \phi(x), \phi(t) \rangle$, symmetric, continuous, p.s.d.
Gram matrix in the dual formulation of SVM: $\big(K(x_i, x_j)\big)_{i,j=1}^n$ is p.s.d.
Reproducing kernel Hilbert space $\mathcal{H}_K$

SVM with regularization:
$$\min_{f \in \mathcal{H}_K,\, b \in \mathbb{R},\, \xi} \ \|f\|_K^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i \big(f(x_i) + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall\, i = 1, 2, \ldots, n$$
Examples of Kernels

Linear kernel: $K(x,t) = x^\top t$
Gaussian-type kernel: $K(x,t) = e^{-(x-t)^\top A (x-t)}$ with $A$ p.s.d.
Diffusion kernel on a graph: $e^{\beta \Delta}$ with $\Delta$ the graph Laplacian
Sequence kernel (e.g. on the strings CHART, CAT, CART)
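To make two of these kernels concrete, the sketch below builds a Gaussian Gram matrix and a graph diffusion kernel $e^{\beta \Delta}$ via scipy's matrix exponential. The points, the toy graph, and the parameter values sigma and beta are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm
from scipy.spatial.distance import cdist

# Gaussian kernel K(x, t) = exp(-sigma * ||x - t||^2) on toy points
X = np.random.default_rng(0).normal(size=(10, 3))
sigma = 0.5
K_gauss = np.exp(-sigma * cdist(X, X, "sqeuclidean"))

# Diffusion kernel e^{beta * Delta} on a small toy graph; with the sign
# convention Delta = A - D (negative Laplacian), the result is p.s.d.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)    # adjacency matrix
Delta = A - np.diag(A.sum(axis=1))
beta = 1.0
K_diff = expm(beta * Delta)                   # matrix exponential
```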
Problem I: Classical Model Selection

For example, how can we automatically tune the hyper-parameter $\sigma$ of the Gaussian kernel $K_\sigma(x,t) = e^{-\sigma \|x-t\|^2}$?
The non-automatic way: cross-validation on the training data
Problem II: Data Integration

How can we integrate different sources of biological data?
Regularization Framework for Kernel Learning

Maximizing the margin over all kernel spaces [Micchelli and Pontil, 2005; Wu, Ying and Zhou, 2007]:
$$\min_{K \in \mathcal{K}} \min_{f \in \mathcal{H}_K} \ \|f\|_K^2 + C \sum_{i=1}^n \big(1 - y_i f(x_i)\big)_+$$

General dual problem:
$$\min_{K \in \mathcal{K}} \max_{\alpha} \ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C \ \forall i.$$
Specific Examples of Kernel Learning

Hyper-parameter tuning [Chapelle & Vapnik et al., 2002]:

Candidate kernels: $\mathcal{K} = \{e^{-\sigma \|x-t\|^2} : \sigma \in (0, \infty)\}$

Dual problem:
$$\min_{\sigma > 0} \max_{\alpha} \ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j e^{-\sigma \|x_i - x_j\|^2} \quad \text{s.t.} \quad 0 \le \alpha_i \le C \ \forall i.$$

Optimization: gradient descent
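One way to realize this gradient descent is to solve the inner SVM for a fixed sigma and then differentiate the dual objective with respect to sigma with alpha held fixed (an envelope-theorem argument). The sketch below, assuming scikit-learn, toy data, and an illustrative step size, is one such realization, not the original authors' code.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (30, 2)), rng.normal(1.0, 1.0, (30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise ||x_i - x_j||^2

sigma, lr, C = 1.0, 0.01, 1.0
for _ in range(50):
    K = np.exp(-sigma * D2)
    svm = SVC(C=C, kernel="precomputed").fit(K, y)
    ya = np.zeros(len(y))
    ya[svm.support_] = svm.dual_coef_.ravel()         # entries y_i * alpha_i
    # With alpha fixed, d/dsigma of the dual objective is (up to a constant factor)
    # sum_ij alpha_i alpha_j y_i y_j ||x_i - x_j||^2 exp(-sigma ||x_i - x_j||^2).
    grad = (np.outer(ya, ya) * D2 * K).sum()
    sigma = max(sigma - lr * grad, 1e-6)              # descend on min_sigma max_alpha
print("learned sigma:", sigma)
```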
Specific Examples of Kernel Learning (cont.)

Linear combination of candidate kernels [Lanckriet et al. 2004]:

Candidate kernels: $\mathcal{K} = \{\sum_{\ell=1}^m \lambda_\ell K_\ell : \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0\}$

Dual problem (mini-max):
$$\min_{\lambda} \max_{\alpha} \ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \sum_{\ell} \lambda_\ell K_\ell(x_i^\ell, x_j^\ell) \quad \text{s.t.} \quad \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0, \quad 0 \le \alpha_i \le C \ \forall i$$
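A simple way to approach this mini-max problem numerically is to alternate between the inner SVM (for fixed lambda) and a projected-gradient step on lambda over the simplex; this is a simplified variant in the spirit of SimpleMKL, not the SDP/QCQP formulation of Lanckriet et al. The sketch assumes scikit-learn and a list K_list of precomputed Gram matrices (names illustrative).

```python
import numpy as np
from sklearn.svm import SVC

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (Duchi et al. style)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def mkl_alternating(K_list, y, C=1.0, lr=0.1, iters=30):
    m, n = len(K_list), len(y)
    lam = np.full(m, 1.0 / m)
    for _ in range(iters):
        K = sum(l * Kl for l, Kl in zip(lam, K_list))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        ya = np.zeros(n)
        ya[svm.support_] = svm.dual_coef_.ravel()     # entries y_i * alpha_i
        # d/dlambda_l of the dual value is -(alpha y)^T K_l (alpha y); descend on lambda
        grad = np.array([-(ya @ Kl @ ya) for Kl in K_list])
        lam = project_simplex(lam - lr * grad)
    return lam
```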
Specific Examples of Kernel Learning (cont.)

Equivalent to nonparametric group lasso [Micchelli and Pontil, 2005; Bach, 2008]:
$$\min_{f_\ell \in \mathcal{H}_{K_\ell}, \forall \ell} \ C \sum_{i=1}^n \Big(1 - y_i \sum_{\ell} f_\ell(x_i^\ell)\Big)_+ + \Big(\sum_{\ell} \|f_\ell\|_{K_\ell}\Big)^2$$
Statistical Generalization Analysis

Challenging problem: how do we characterize complexity and overfitting?
Hypothesis space $\mathcal{H} = \{\mathcal{H}_K : K \in \mathcal{K}\}$: the larger $\mathcal{H}$, the more complex the model and the greater the risk of overfitting.
Statistical Generalization Analysis: Empirical Process

Solution space: $f_z \in \mathcal{B} := \{f \in \mathcal{H}_K : \|f\|_K \le 1, \ K \in \mathcal{K}\}$

Uniform Glivenko-Cantelli (uGC) class (uniform convergence): for any $\varepsilon > 0$, there holds
$$\sup_{P} \Pr\Big\{ \sup_{m \ge \ell} \sup_{f \in \mathcal{B}} \Big| \frac{1}{m} \sum_{i=1}^m f(x_i) - \int_{\mathcal{X}} f(x)\, dP(x) \Big| > \varepsilon \Big\} \to 0, \quad \text{as } \ell \to \infty$$

Characterization by candidate kernels [Ying and Zhou, 2004]: $\mathcal{B}$ is uGC iff the $V_\gamma$-dimension of $\mathcal{K}_{\mathcal{X}}$ is finite for every $\gamma > 0$, where $\mathcal{K}_{\mathcal{X}} := \{K(\cdot, x) : K \in \mathcal{K}, \ x \in \mathcal{X}\}$
Statistical Generalization Analysis: Shattering Dimension

$V_\gamma$ dimension [Alon et al. 1997; Anthony and Bartlett, 1999]

$A \subset \mathcal{X}$ is $V_\gamma$-shattered by $\mathcal{G}$ if there is a number $b \in \mathbb{R}$ such that, for every subset $E$ of $A$, there exists $f_E \in \mathcal{G}$ with $f_E(x) + b \ge \gamma$ for every $x \in E$ and $f_E(x) + b \le -\gamma$ for every $x \in A \setminus E$. The $V_\gamma$ dimension of $\mathcal{G}$ is the maximal cardinality of a set $A \subset \mathcal{X}$ that is $V_\gamma$-shattered by $\mathcal{G}$.

Example: 2-dimensional linear functions.
Statistical Generalization Analysis: Generalization Bounds

Rough summary of the generalization bound [Ying and Zhou, 2004; Ying and Campbell, 2008]:
$$\text{True Err} \le \text{Train Err} + \Big( \frac{d_{\mathcal{K}}}{\text{training sample number} \times \text{margin}^2} \Big)^{\frac{1}{2}}$$
where $d_{\mathcal{K}}$ is the pseudo-dimension of $\mathcal{K}$ ($\lim_{\gamma \to 0^+} V_\gamma$).

Examples:
Single kernel: $d_{\mathcal{K}} = 1 \Rightarrow$ VC theory for SVM
$\mathcal{K} = \{e^{-\sigma \|x-t\|^2} : \sigma > 0\} \Rightarrow d_{\mathcal{K}} \le 2$.
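To get a feel for the scale of the complexity term, take the Gaussian family above ($d_{\mathcal{K}} \le 2$) with, say, $n = 311$ training points (the size of the fold-recognition training set used later in the talk) and a normalized margin of $1$; the numbers are purely illustrative:
$$\Big( \frac{d_{\mathcal{K}}}{n \times \text{margin}^2} \Big)^{\frac{1}{2}} \le \Big( \frac{2}{311 \times 1^2} \Big)^{\frac{1}{2}} \approx 0.08.$$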
Related Work on Learning the Combination of Kernels

Lanckriet et al. (2004) for binary SVM classification: SDP, QCQP
Bach et al. (2004): SMO
Sonnenburg et al. (2006): SILP
Mini-max problem: other approaches from the optimization literature ...

Probabilistic Bayesian models:
Girolami and Rogers (2005): hierarchical Bayesian model
Girolami and Zhong (2007): Gaussian process prior
Damoulas and Girolami (2008): multinomial probit model (VBKC)
Damoulas, Ying, Girolami and Campbell (2008): RVM
Kernel Learning via KL-divergence

Learning the kernel matrix by minimizing the KL-divergence [Lawrence and Sanguinetti, 2004]:
$$\mathrm{KL}\big(\mathcal{N}(0, \mathbf{K}_y) \,\|\, \mathcal{N}(0, \mathbf{K}_x)\big) := \frac{1}{2} \mathrm{Tr}\big(\mathbf{K}_y \mathbf{K}_x^{-1}\big) + \frac{1}{2} \log |\mathbf{K}_x| - \frac{1}{2} \log |\mathbf{K}_y|$$
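As a quick numerical check, the snippet below evaluates this expression for two toy p.s.d. matrices; slogdet is used for numerical stability, and the additive constant $-n/2$ of the full Gaussian KL, which the slide also drops, is omitted to match.

```python
import numpy as np

def kl_gauss_zero_mean(Ky, Kx):
    """KL(N(0, Ky) || N(0, Kx)) up to the additive constant -n/2, as on the slide."""
    _, logdet_Kx = np.linalg.slogdet(Kx)
    _, logdet_Ky = np.linalg.slogdet(Ky)
    return 0.5 * np.trace(Ky @ np.linalg.inv(Kx)) + 0.5 * logdet_Kx - 0.5 * logdet_Ky

# Toy p.s.d. matrices built from random factors (illustrative only)
rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
Ky, Kx = A @ A.T + 0.1 * np.eye(5), B @ B.T + 0.1 * np.eye(5)
print(kl_gauss_zero_mean(Ky, Kx))
```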
Kernel Learning via KL-divergence (cont.)

Input kernel matrix $\mathbf{K}_x = \sum_\ell \lambda_\ell \mathbf{K}_\ell$, with the $\ell$-th data source encoded into $\mathbf{K}_\ell$
Output kernel matrix $\mathbf{K}_y$ derived from the label information

Data integration via KL-divergence [Ying, Huang and Campbell, 2009]:
$$\min_{\lambda} \ \mathrm{Tr}\Big( \mathbf{K}_y \big( \sum_{\ell \in \mathbb{N}_m} \lambda_\ell \mathbf{K}_\ell + \sigma I_n \big)^{-1} \Big) + \log \Big| \sum_{\ell \in \mathbb{N}_m} \lambda_\ell \mathbf{K}_\ell + \sigma I_n \Big| \quad \text{s.t.} \quad \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0,$$
where $\sigma > 0$ is a small number added to avoid singularity.
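Evaluating this objective for a given weight vector lambda is direct; the helper below assumes a list K_list of precomputed Gram matrices and an output matrix Ky (names are illustrative).

```python
import numpy as np

def mkldiv_objective(lam, K_list, Ky, sigma=1e-3):
    """Tr(Ky (sum_l lam_l K_l + sigma I)^{-1}) + log|sum_l lam_l K_l + sigma I|."""
    n = Ky.shape[0]
    Kx = sum(l * Kl for l, Kl in zip(lam, K_list)) + sigma * np.eye(n)
    _, logdet = np.linalg.slogdet(Kx)
    return np.trace(Ky @ np.linalg.inv(Kx)) + logdet
```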
Why KL-divergence?

Group lasso effectively removes noisy features and leads to sparse solutions; however, this sparsity assumption can be inappropriate in practice.
Easily adapted to different learning scenarios via the output kernel matrix $\mathbf{K}_y$:
Multi-label $\mathbf{y}$ ($n \times T$ matrix): $\mathbf{K}_y := \mathbf{y} \mathbf{y}^\top$
Structured outputs (e.g. network inference): diffusion kernel $\mathbf{K}_y := e^{\beta \Delta}$ with $\Delta$ the graph Laplacian
The KL-divergence based criterion is parameter-free.
Optimization Formulation

Let
$$f(\lambda) := \mathrm{Tr}\Big( \mathbf{K}_y \big( \sum_{\ell \in \mathbb{N}_m} \lambda_\ell \mathbf{K}_\ell + \sigma I_n \big)^{-1} \Big) \quad \text{and} \quad g(\lambda) := -\log \Big| \sum_{\ell \in \mathbb{N}_m} \lambda_\ell \mathbf{K}_\ell + \sigma I_n \Big|$$

Theorem. KL-divergence based kernel learning is a convex-concave (difference of convex) problem:
$$\min_{\sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0} \ f(\lambda) - g(\lambda).$$
Difference of Convex (DC) Programming

DC programming (concave-convex procedure) [Hoang 1995; Yuille et al. 2003]:

Given a stopping criterion $\varepsilon > 0$:
Initialize $\lambda$, e.g. $\lambda_\ell = \frac{1}{m}$
Iteration: $\lambda^{(t+1)} = \arg\min \big\{ f(\lambda) - g(\lambda^{(t)}) - \nabla g(\lambda^{(t)})(\lambda - \lambda^{(t)}) : \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0 \big\}$
Stop when $\big( f(\lambda^{(t)}) - g(\lambda^{(t)}) \big) - \big( f(\lambda^{(t+1)}) - g(\lambda^{(t+1)}) \big) \le \varepsilon$
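The sketch below is one minimal realization of this DC loop: it linearizes g at the current iterate and solves each convex subproblem over the simplex with scipy's general-purpose SLSQP solver, rather than the SILP/QCQP formulation of the next slide. Data, tolerances, and names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def dc_mkldiv(K_list, Ky, sigma=1e-3, eps=1e-6, max_iter=50):
    m, n = len(K_list), Ky.shape[0]

    def Kx(lam):
        return sum(l * Kl for l, Kl in zip(lam, K_list)) + sigma * np.eye(n)

    def f(lam):                  # convex part: Tr(Ky Kx^{-1})
        return np.trace(Ky @ np.linalg.inv(Kx(lam)))

    def g(lam):                  # g = -log|Kx|, so f - g is the KL objective
        return -np.linalg.slogdet(Kx(lam))[1]

    def grad_g(lam):             # d/dlam_l of -log|Kx| is -Tr(Kx^{-1} K_l)
        Kinv = np.linalg.inv(Kx(lam))
        return np.array([-np.trace(Kinv @ Kl) for Kl in K_list])

    lam = np.full(m, 1.0 / m)    # initialize lambda_l = 1/m
    cons = [{"type": "eq", "fun": lambda l: l.sum() - 1.0}]
    bnds = [(0.0, 1.0)] * m
    obj_old = f(lam) - g(lam)
    for _ in range(max_iter):
        gl, dgl = g(lam), grad_g(lam)
        # Convex subproblem: minimize f(lam') - [g(lam) + grad_g(lam) . (lam' - lam)]
        sub = lambda l: f(l) - (gl + dgl @ (l - lam))
        lam = minimize(sub, lam, bounds=bnds, constraints=cons, method="SLSQP").x
        obj_new = f(lam) - g(lam)
        if obj_old - obj_new <= eps:   # stopping criterion from the slide
            break
        obj_old = obj_new
    return lam
```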
Subproblem in Each Iteration

Consider $\lambda^{(t+1)} = \arg\min \big\{ f(\lambda) - g(\lambda^{(t)}) - \nabla g(\lambda^{(t)})(\lambda - \lambda^{(t)}) : \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0 \big\}$

SILP or QCQP formulation:
$$\arg\max_{\lambda, \gamma} \ \gamma \quad \text{s.t.} \quad \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0, \qquad \gamma - \sum_\ell \lambda_\ell \Big( \mathrm{Tr}(\alpha \alpha^\top \mathbf{K}_\ell) + \frac{\partial g(\lambda^{(t)})}{\partial \lambda_\ell} \Big) \le \mathrm{Tr}\big( -2\alpha^\top A + \sigma \alpha^\top \alpha \big) \quad \forall \alpha,$$
where $\mathbf{K}_y = A A^\top$ with an $n \times s$ matrix $A$ and $s = \mathrm{rank}(\mathbf{K}_y)$.
Protein Fold Recognition [Ding et al. 2001; Shen et al. 2006; Damoulas and Girolami, 2008]

27 SCOP folds (classes), 311 proteins for training and 381 proteins for testing, with 12 different data sources:

Data source                                                       Kernel
Amino-acid composition (C)                                        Second order polynomial
Predicted secondary structure (S)                                 Second order polynomial
Hydrophobicity (H)                                                Second order polynomial
Polarity (P)                                                      Second order polynomial
van der Waals volume (V)                                          Second order polynomial
Polarizability (Z)                                                Second order polynomial
Four pseudo-amino acid compositions (L1, L4, L14, L30)            Second order polynomials
Two local sequence alignments, Smith-Waterman scores (SW1, SW2)   Pairwise kernels
Comparative Kernel Learning Algorithms

MKLdiv [our method]: kernel weights λ learned by KL-divergence, then fed into a one-against-all multi-class SVM
SimpleMKL [Rakotomamonjy et al. 2007]: SVM-based
MKL-RKDA [Ye et al. 2008]: kernel learning based on discriminant analysis
VBKC [Damoulas and Girolami, 2008]: Bayesian model based on multinomial probit regression
Overall Features Performance

F.S.      MKLdiv   SimpleMKL   VBKC         MKL-RKDA   Ding et al.
C         51.69    51.83       51.2 ± 0.5   45.43      44.9
S         40.99    40.73       38.1 ± 0.3   38.64      35.6
H         36.55    36.55       32.5 ± 0.4   34.20      36.5
P         35.50    35.50       32.2 ± 0.3   30.54      32.9
V         37.07    37.85       32.8 ± 0.3   30.54      35
Z         37.33    36.81       33.2 ± 0.4   30.28      32.9
L1        44.64    45.16       41.5 ± 0.5   36.55      -
L4        44.90    44.90       41.5 ± 0.4   38.12      -
L14       43.34    43.34       38 ± 0.2     40.99      -
L30       31.59    31.59       32 ± 0.2     36.03      -
SW1       62.92    62.40       59.8 ± 1.9   61.87      -
SW2       63.96    63.44       49 ± 0.7     64.49      -
Overall   73.36    66.57       70           68.40      56
Average   68.40    68.14       -            66.06      -
Kernel Weights of Overall Features

[Figure: learned kernel weights λ for the overall feature combination]
Performance of Sequentially Added Features
Features                   MKLdiv   SimpleMKL   MKL-RKDA
SW1                        62.92    62.40       61.87
SW1S                       65.27    64.22       64.75
SW1SW2S                    67.10    64.75       64.49
SW1SW2CS                   73.36    65.01       67.62
SW1SW2CSH                  74.67    66.31       67.88
SW1SW2CSHP                 74.93    66.31       69.71
SW1SW2CSHPZ                75.19    68.92       66.05
SW1SW2CSHPZV               74.41    66.31       69.19
SW1SW2CSHPZVL1             73.10    66.84       68.66
SW1SW2CSHPZVL1L4           72.84    67.10       67.62
SW1SW2CSHPZVL1L4L14        72.58    66.84       69.19
SW1SW2CSHPZVL1L4L14L30     73.36    66.57       68.40
Kernel Weights of Dominant Features

[Figure: learned kernel weights λ for the dominant feature subsets]
Yeast Protein Functional Prediction (Lanckriet et al. 2004)
Data sources and kernels:
SW: protein sequences with Smith-Waterman Score
B: protein sequences with BLAST
Pfam: protein sequences with Pfam HMM
FFT: hydropathy profile with FFT
LI: protein interactions with linear kernel
Diff: protein interactions with diffusion kernel
E: gene expression with radial basis kernel
MKLdiv Performance on Membrane Protein Prediction (binary)

[Figure: MKLdiv results on the membrane protein prediction task]
Conclusion and Outlook

We saw:
Kernel learning is well motivated
Sound theoretical results: regularization theory and statistical learning theory
Novel data integration via KL-divergence, implemented by DC programming
State-of-the-art performance on protein fold recognition

Future:
More investigation into structured-output datasets
Systematic study of the relation between the KL-divergence based criterion
Relevant Work

Yiming Ying, Kaizhu Huang, and Colin Campbell. Enhanced protein fold recognition through a novel data integration approach. Preprint, 2009.

Yiming Ying and Colin Campbell. The Rademacher chaos complexity of learning the kernel. Submitted, 2008.

Theodoros Damoulas, Yiming Ying, Mark A. Girolami, and Colin Campbell. Inferring sparse kernel combinations and relevance vectors: an application to subcellular localization of proteins. In Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA 2008).

Yiming Ying and Ding-Xuan Zhou. Learnability of Gaussians with flexible variances. Journal of Machine Learning Research, 8: 249-276, 2007.

Qiang Wu, Yiming Ying, and Ding-Xuan Zhou. Multi-kernel regularized classifiers. Journal of Complexity, 23: 108-134, 2007.

Available at http://www.cs.ucl.ac.uk/staff/Y.Ying
Thank you for your attention!