TRANSCRIPT
Learning the Kernel: Theory and Applications

Yiming Ying
Department of Engineering Mathematics, University of Bristol
February 2009
Outline
1 Motivation for Learning the Kernel
2 Regularization Framework for Kernel Learning
3 Statistical Generalization Analysis
4 Data Integration via KL-divergence
5 Experiments
6 Conclusion and Outlook
Yeast Protein Function Prediction (CYGD): Multi-label Outputs

Protein functional classes: metabolism, membrane, protein synthesis, etc.
Functional characteristics: protein-protein interaction, mRNA expression patterns, amino acid sequence, etc.
Protein Fold Recognition (SCOP): Multi-class Outputs

Protein fold: major secondary structures and the same topology
Fold characteristics: physicochemical and structural properties of amino acids, etc.
Gene Network Inference: Graph-structured Outputs

Protein-protein interaction or metabolic network (yeast)
Gene expression measurements; phylogenetic profiles; location of proteins/enzymes in the cell
Image from J.P. Vert
Practical Problems and Goals

Goal I: best prediction from a single data source
Goal II: integration of multiple data sources
Challenging issues: structured outputs (protein function prediction, network inference), a very large number of classes (protein fold recognition), multiple data sources
SVM: Linear Maximum Margin

Hyperplanes: $\langle w, x \rangle + b = \pm 1$
Decision plane: $\langle w, x \rangle + b = 0$
Margin: $\frac{2}{\|w\|}$

Primal problem:
$$\min_{w,b} \ \|w\|^2 \quad \text{s.t.} \quad \langle w, x_i \rangle + b \ge 1 \ \text{if } y_i = 1; \qquad \langle w, x_i \rangle + b \le -1 \ \text{if } y_i = -1$$

Dual problem:
$$\max_{\alpha} \ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad \alpha_i \ge 0 \ \forall i$$
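As a concrete illustration of this dual QP, here is a minimal numerical sketch. It assumes the numpy and cvxpy packages and synthetic toy data; the data, the support-vector tolerance, and all variable names are illustrative assumptions, not part of the original slides.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
n = len(y)

K = X @ X.T                                   # linear kernel Gram matrix <x_i, x_j>
alpha = cp.Variable(n)

# Dual objective as on the slide: sum_i alpha_i - sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>.
# psd_wrap skips cvxpy's numerical PSD check; K = X X^T is p.s.d. by construction.
obj = cp.Maximize(cp.sum(alpha) - cp.quad_form(cp.multiply(y, alpha), cp.psd_wrap(K)))
cons = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
cp.Problem(obj, cons).solve()

# Recover the hyperplane from the KKT conditions; the factor 2 compensates for
# the missing 1/2 in this dual's quadratic term (here w = 2 * sum_i alpha_i y_i x_i).
a = alpha.value
w = 2.0 * (a * y) @ X
sv = a > 1e-5 * a.max()                       # support vectors: alpha_i > 0
b = float(np.mean(y[sv] - X[sv] @ w))
print("w =", w, " b =", b)
```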
SVM: Dual Transition to the Nonlinear Case

Feature map: $\phi : \mathcal{X} \to \mathcal{F}$, $x \mapsto \phi(x)$

Dual formulation of SVM (QP):
$$\max_{\alpha} \ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle \quad \text{s.t.} \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \le \alpha_i \ \forall i.$$
SVM: Regularization

Kernel: $K(x,t) = \langle \phi(x), \phi(t) \rangle$, symmetric, continuous, p.s.d.
Gram matrix in the dual formulation of SVM: $\big(K(x_i, x_j)\big)_{i,j=1}^n$ is p.s.d.
Reproducing kernel Hilbert space $\mathcal{H}_K$

SVM with regularization:
$$\min_{f \in \mathcal{H}_K,\, b \in \mathbb{R},\, \xi} \ \|f\|_K^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i \big(f(x_i) + b\big) \ge 1 - \xi_i, \quad \xi_i \ge 0 \quad \forall\, i = 1, 2, \ldots, n$$
Examples of Kernels

Linear kernel: $K(x,t) = x^\top t$
Gaussian-type kernel: $K(x,t) = e^{-(x-t)^\top A (x-t)}$ with $A$ p.s.d.
Diffusion kernel on a graph: $e^{\beta \Delta}$ with $\Delta$ the graph Laplacian
Sequence kernel (e.g. on the strings CHART, CAT, CART)
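To make two of these kernels concrete, the sketch below builds a Gaussian Gram matrix and a graph diffusion kernel $e^{\beta \Delta}$ via scipy's matrix exponential. The points, the toy graph, and the parameter values sigma and beta are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm
from scipy.spatial.distance import cdist

# Gaussian kernel K(x, t) = exp(-sigma * ||x - t||^2) on toy points
X = np.random.default_rng(0).normal(size=(10, 3))
sigma = 0.5
K_gauss = np.exp(-sigma * cdist(X, X, "sqeuclidean"))

# Diffusion kernel e^{beta * Delta} on a small toy graph; with the sign
# convention Delta = A - D (negative Laplacian), the result is p.s.d.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)    # adjacency matrix
Delta = A - np.diag(A.sum(axis=1))
beta = 1.0
K_diff = expm(beta * Delta)                   # matrix exponential
```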
Problem I: Classical Model Selection

For example, how can we automatically tune the hyper-parameter $\sigma$ of the Gaussian kernel $K_\sigma(x,t) = e^{-\sigma \|x-t\|^2}$?
The non-automatic way: cross-validation on the training data
Problem II: Data Integration

How can we integrate different sources of biological data?
Regularization Framework for Kernel Learning

Maximizing the margin over all kernel spaces [Micchelli and Pontil, 2005; Wu, Ying and Zhou, 2007]:
$$\min_{K \in \mathcal{K}} \min_{f \in \mathcal{H}_K} \ \|f\|_K^2 + C \sum_{i=1}^n \big(1 - y_i f(x_i)\big)_+$$

General dual problem:
$$\min_{K \in \mathcal{K}} \max_{\alpha} \ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C \ \forall i.$$
Specific Examples of Kernel Learning

Hyper-parameter tuning [Chapelle & Vapnik et al., 2002]:

Candidate kernels: $\mathcal{K} = \{e^{-\sigma \|x-t\|^2} : \sigma \in (0, \infty)\}$

Dual problem:
$$\min_{\sigma > 0} \max_{\alpha} \ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j e^{-\sigma \|x_i - x_j\|^2} \quad \text{s.t.} \quad 0 \le \alpha_i \le C \ \forall i.$$

Optimization: gradient descent
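One way to realize this gradient descent is to solve the inner SVM for a fixed sigma and then differentiate the dual objective with respect to sigma with alpha held fixed (an envelope-theorem argument). The sketch below, assuming scikit-learn, toy data, and an illustrative step size, is one such realization, not the original authors' code.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (30, 2)), rng.normal(1.0, 1.0, (30, 2))])
y = np.hstack([-np.ones(30), np.ones(30)])
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise ||x_i - x_j||^2

sigma, lr, C = 1.0, 0.01, 1.0
for _ in range(50):
    K = np.exp(-sigma * D2)
    svm = SVC(C=C, kernel="precomputed").fit(K, y)
    ya = np.zeros(len(y))
    ya[svm.support_] = svm.dual_coef_.ravel()         # entries y_i * alpha_i
    # With alpha fixed, d/dsigma of the dual objective is (up to a constant factor)
    # sum_ij alpha_i alpha_j y_i y_j ||x_i - x_j||^2 exp(-sigma ||x_i - x_j||^2).
    grad = (np.outer(ya, ya) * D2 * K).sum()
    sigma = max(sigma - lr * grad, 1e-6)              # descend on min_sigma max_alpha
print("learned sigma:", sigma)
```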
Specific Examples of Kernel Learning (cont.)

Linear combination of candidate kernels [Lanckriet et al. 2004]:

Candidate kernels: $\mathcal{K} = \{\sum_{\ell=1}^m \lambda_\ell K_\ell : \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0\}$

Dual problem (mini-max):
$$\min_{\lambda} \max_{\alpha} \ \sum_{i=1}^n \alpha_i - \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \sum_{\ell} \lambda_\ell K_\ell(x_i^\ell, x_j^\ell) \quad \text{s.t.} \quad \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0, \quad 0 \le \alpha_i \le C \ \forall i$$
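A simple way to approach this mini-max problem numerically is to alternate between the inner SVM (for fixed lambda) and a projected-gradient step on lambda over the simplex; this is a simplified variant in the spirit of SimpleMKL, not the SDP/QCQP formulation of Lanckriet et al. The sketch assumes scikit-learn and a list K_list of precomputed Gram matrices (names illustrative).

```python
import numpy as np
from sklearn.svm import SVC

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (Duchi et al. style)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def mkl_alternating(K_list, y, C=1.0, lr=0.1, iters=30):
    m, n = len(K_list), len(y)
    lam = np.full(m, 1.0 / m)
    for _ in range(iters):
        K = sum(l * Kl for l, Kl in zip(lam, K_list))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        ya = np.zeros(n)
        ya[svm.support_] = svm.dual_coef_.ravel()     # entries y_i * alpha_i
        # d/dlambda_l of the dual value is -(alpha y)^T K_l (alpha y); descend on lambda
        grad = np.array([-(ya @ Kl @ ya) for Kl in K_list])
        lam = project_simplex(lam - lr * grad)
    return lam
```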
Specific Examples of Kernel Learning (cont.)

Equivalent to nonparametric group lasso [Micchelli and Pontil, 2005; Bach, 2008]:
$$\min_{f_\ell \in \mathcal{H}_{K_\ell}, \forall \ell} \ C \sum_{i=1}^n \Big(1 - y_i \sum_{\ell} f_\ell(x_i^\ell)\Big)_+ + \Big(\sum_{\ell} \|f_\ell\|_{K_\ell}\Big)^2$$
Statistical Generalization Analysis

Challenging problem: how do we characterize complexity and overfitting?
Hypothesis space $\mathcal{H} = \{\mathcal{H}_K : K \in \mathcal{K}\}$: the larger $\mathcal{H}$, the more complex the model and the greater the risk of overfitting.
Statistical Generalization Analysis: Empirical Process

Solution space: $f_z \in \mathcal{B} := \{f \in \mathcal{H}_K : \|f\|_K \le 1, \ K \in \mathcal{K}\}$

Uniform Glivenko-Cantelli (uGC) class (uniform convergence): for any $\varepsilon > 0$, there holds
$$\sup_{P} \Pr\Big\{ \sup_{m \ge \ell} \sup_{f \in \mathcal{B}} \Big| \frac{1}{m} \sum_{i=1}^m f(x_i) - \int_{\mathcal{X}} f(x)\, dP(x) \Big| > \varepsilon \Big\} \to 0, \quad \text{as } \ell \to \infty$$

Characterization by candidate kernels [Ying and Zhou, 2004]: $\mathcal{B}$ is uGC iff the $V_\gamma$-dimension of $\mathcal{K}_{\mathcal{X}}$ is finite for every $\gamma > 0$, where $\mathcal{K}_{\mathcal{X}} := \{K(\cdot, x) : K \in \mathcal{K}, \ x \in \mathcal{X}\}$
Statistical Generalization Analysis: Shattering Dimension

$V_\gamma$ dimension [Alon et al. 1997; Anthony and Bartlett, 1999]

$A \subset \mathcal{X}$ is $V_\gamma$-shattered by $\mathcal{G}$ if there is a number $b \in \mathbb{R}$ such that, for every subset $E$ of $A$, there exists $f_E \in \mathcal{G}$ with $f_E(x) + b \ge \gamma$ for every $x \in E$ and $f_E(x) + b \le -\gamma$ for every $x \in A \setminus E$. The $V_\gamma$ dimension of $\mathcal{G}$ is the maximal cardinality of a set $A \subset \mathcal{X}$ that is $V_\gamma$-shattered by $\mathcal{G}$.

Example: 2-dimensional linear functions.
Statistical Generalization Analysis: Generalization Bounds

Rough summary of the generalization bound [Ying and Zhou, 2004; Ying and Campbell, 2008]:
$$\text{True Err} \le \text{Train Err} + \Big( \frac{d_{\mathcal{K}}}{\text{training sample number} \times \text{margin}^2} \Big)^{\frac{1}{2}}$$
where $d_{\mathcal{K}}$ is the pseudo-dimension of $\mathcal{K}$ ($\lim_{\gamma \to 0^+} V_\gamma$).

Examples:
Single kernel: $d_{\mathcal{K}} = 1 \Rightarrow$ VC theory for SVM
$\mathcal{K} = \{e^{-\sigma \|x-t\|^2} : \sigma > 0\} \Rightarrow d_{\mathcal{K}} \le 2$.
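To get a feel for the scale of the complexity term, take the Gaussian family above ($d_{\mathcal{K}} \le 2$) with, say, $n = 311$ training points (the size of the fold-recognition training set used later in the talk) and a normalized margin of $1$; the numbers are purely illustrative:
$$\Big( \frac{d_{\mathcal{K}}}{n \times \text{margin}^2} \Big)^{\frac{1}{2}} \le \Big( \frac{2}{311 \times 1^2} \Big)^{\frac{1}{2}} \approx 0.08.$$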
Related Work on Learning the Combination of Kernels

Lanckriet et al. (2004) for binary SVM classification: SDP, QCQP
Bach et al. (2004): SMO
Sonnenburg et al. (2006): SILP
Mini-max problem: other approaches from the optimization literature ...

Probabilistic Bayesian models:
Girolami and Rogers (2005): hierarchical Bayesian model
Girolami and Zhong (2007): Gaussian process prior
Damoulas and Girolami (2008): multinomial probit model (VBKC)
Damoulas, Ying, Girolami and Campbell (2008): RVM
Kernel Learning via KL-divergence

Learning the kernel matrix by minimizing the KL-divergence [Lawrence and Sanguinetti, 2004]:
$$\mathrm{KL}\big(\mathcal{N}(0, \mathbf{K}_y) \,\|\, \mathcal{N}(0, \mathbf{K}_x)\big) := \frac{1}{2} \mathrm{Tr}\big(\mathbf{K}_y \mathbf{K}_x^{-1}\big) + \frac{1}{2} \log |\mathbf{K}_x| - \frac{1}{2} \log |\mathbf{K}_y|$$
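As a quick numerical check, the snippet below evaluates this expression for two toy p.s.d. matrices; slogdet is used for numerical stability, and the additive constant $-n/2$ of the full Gaussian KL, which the slide also drops, is omitted to match.

```python
import numpy as np

def kl_gauss_zero_mean(Ky, Kx):
    """KL(N(0, Ky) || N(0, Kx)) up to the additive constant -n/2, as on the slide."""
    _, logdet_Kx = np.linalg.slogdet(Kx)
    _, logdet_Ky = np.linalg.slogdet(Ky)
    return 0.5 * np.trace(Ky @ np.linalg.inv(Kx)) + 0.5 * logdet_Kx - 0.5 * logdet_Ky

# Toy p.s.d. matrices built from random factors (illustrative only)
rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))
Ky, Kx = A @ A.T + 0.1 * np.eye(5), B @ B.T + 0.1 * np.eye(5)
print(kl_gauss_zero_mean(Ky, Kx))
```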
Kernel Learning via KL-divergence (cont.)

Input kernel matrix $\mathbf{K}_x = \sum_\ell \lambda_\ell \mathbf{K}_\ell$, with the $\ell$-th data source encoded into $\mathbf{K}_\ell$
Output kernel matrix $\mathbf{K}_y$ derived from the label information

Data integration via KL-divergence [Ying, Huang and Campbell, 2009]:
$$\min_{\lambda} \ \mathrm{Tr}\Big( \mathbf{K}_y \big( \sum_{\ell \in \mathbb{N}_m} \lambda_\ell \mathbf{K}_\ell + \sigma I_n \big)^{-1} \Big) + \log \Big| \sum_{\ell \in \mathbb{N}_m} \lambda_\ell \mathbf{K}_\ell + \sigma I_n \Big| \quad \text{s.t.} \quad \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0,$$
where $\sigma > 0$ is a small number added to avoid singularity.
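Evaluating this objective for a given weight vector lambda is direct; the helper below assumes a list K_list of precomputed Gram matrices and an output matrix Ky (names are illustrative).

```python
import numpy as np

def mkldiv_objective(lam, K_list, Ky, sigma=1e-3):
    """Tr(Ky (sum_l lam_l K_l + sigma I)^{-1}) + log|sum_l lam_l K_l + sigma I|."""
    n = Ky.shape[0]
    Kx = sum(l * Kl for l, Kl in zip(lam, K_list)) + sigma * np.eye(n)
    _, logdet = np.linalg.slogdet(Kx)
    return np.trace(Ky @ np.linalg.inv(Kx)) + logdet
```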
Why KL-divergence?

Group lasso effectively removes noisy features and leads to sparse solutions; however, this sparsity assumption can be inappropriate in practice.
Easily adapted to different learning scenarios via the output kernel matrix $\mathbf{K}_y$:
Multi-label $\mathbf{y}$ ($n \times T$ matrix): $\mathbf{K}_y := \mathbf{y} \mathbf{y}^\top$
Structured outputs (e.g. network inference): diffusion kernel $\mathbf{K}_y := e^{\beta \Delta}$ with $\Delta$ the graph Laplacian
The KL-divergence based criterion is parameter-free.
Optimization Formulation

Let
$$f(\lambda) := \mathrm{Tr}\Big( \mathbf{K}_y \big( \sum_{\ell \in \mathbb{N}_m} \lambda_\ell \mathbf{K}_\ell + \sigma I_n \big)^{-1} \Big) \quad \text{and} \quad g(\lambda) := -\log \Big| \sum_{\ell \in \mathbb{N}_m} \lambda_\ell \mathbf{K}_\ell + \sigma I_n \Big|$$

Theorem. KL-divergence based kernel learning is a convex-concave (difference of convex) problem:
$$\min_{\sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0} \ f(\lambda) - g(\lambda).$$
Difference of Convex (DC) Programming

DC programming (concave-convex procedure) [Hoang 1995; Yuille et al. 2003]:

Given a stopping criterion $\varepsilon > 0$:
Initialize $\lambda$, e.g. $\lambda_\ell = \frac{1}{m}$
Iteration: $\lambda^{(t+1)} = \arg\min \big\{ f(\lambda) - g(\lambda^{(t)}) - \nabla g(\lambda^{(t)})(\lambda - \lambda^{(t)}) : \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0 \big\}$
Stop when $\big( f(\lambda^{(t)}) - g(\lambda^{(t)}) \big) - \big( f(\lambda^{(t+1)}) - g(\lambda^{(t+1)}) \big) \le \varepsilon$
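The sketch below is one minimal realization of this DC loop: it linearizes g at the current iterate and solves each convex subproblem over the simplex with scipy's general-purpose SLSQP solver, rather than the SILP/QCQP formulation of the next slide. Data, tolerances, and names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def dc_mkldiv(K_list, Ky, sigma=1e-3, eps=1e-6, max_iter=50):
    m, n = len(K_list), Ky.shape[0]

    def Kx(lam):
        return sum(l * Kl for l, Kl in zip(lam, K_list)) + sigma * np.eye(n)

    def f(lam):                  # convex part: Tr(Ky Kx^{-1})
        return np.trace(Ky @ np.linalg.inv(Kx(lam)))

    def g(lam):                  # g = -log|Kx|, so f - g is the KL objective
        return -np.linalg.slogdet(Kx(lam))[1]

    def grad_g(lam):             # d/dlam_l of -log|Kx| is -Tr(Kx^{-1} K_l)
        Kinv = np.linalg.inv(Kx(lam))
        return np.array([-np.trace(Kinv @ Kl) for Kl in K_list])

    lam = np.full(m, 1.0 / m)    # initialize lambda_l = 1/m
    cons = [{"type": "eq", "fun": lambda l: l.sum() - 1.0}]
    bnds = [(0.0, 1.0)] * m
    obj_old = f(lam) - g(lam)
    for _ in range(max_iter):
        gl, dgl = g(lam), grad_g(lam)
        # Convex subproblem: minimize f(lam') - [g(lam) + grad_g(lam) . (lam' - lam)]
        sub = lambda l: f(l) - (gl + dgl @ (l - lam))
        lam = minimize(sub, lam, bounds=bnds, constraints=cons, method="SLSQP").x
        obj_new = f(lam) - g(lam)
        if obj_old - obj_new <= eps:   # stopping criterion from the slide
            break
        obj_old = obj_new
    return lam
```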
Subproblem in Each Iteration

Consider $\lambda^{(t+1)} = \arg\min \big\{ f(\lambda) - g(\lambda^{(t)}) - \nabla g(\lambda^{(t)})(\lambda - \lambda^{(t)}) : \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0 \big\}$

SILP or QCQP formulation:
$$\arg\max_{\lambda, \gamma} \ \gamma \quad \text{s.t.} \quad \sum_\ell \lambda_\ell = 1, \ \lambda_\ell \ge 0, \qquad \gamma - \sum_\ell \lambda_\ell \Big( \mathrm{Tr}(\alpha \alpha^\top \mathbf{K}_\ell) + \frac{\partial g(\lambda^{(t)})}{\partial \lambda_\ell} \Big) \le \mathrm{Tr}\big( -2\alpha^\top A + \sigma \alpha^\top \alpha \big) \quad \forall \alpha,$$
where $\mathbf{K}_y = A A^\top$ with an $n \times s$ matrix $A$ and $s = \mathrm{rank}(\mathbf{K}_y)$.
Protein Fold Recognition [Ding et al. 2001; Shen et al. 2006; Damoulas and Girolami, 2008]

27 SCOP folds (classes), 311 proteins for training and 381 proteins for testing, with 12 different data sources:

Data source                                                       Kernel
Amino-acid composition (C)                                        Second order polynomial
Predicted secondary structure (S)                                 Second order polynomial
Hydrophobicity (H)                                                Second order polynomial
Polarity (P)                                                      Second order polynomial
van der Waals volume (V)                                          Second order polynomial
Polarizability (Z)                                                Second order polynomial
Four pseudo-amino acid compositions (L1, L4, L14, L30)            Second order polynomials
Two local sequence alignments, Smith-Waterman scores (SW1, SW2)   Pairwise kernels
Comparative Kernel Learning Algorithms

MKLdiv [our method]: kernel weights λ learned by KL-divergence, then fed into a one-against-all multi-class SVM
SimpleMKL [Rakotomamonjy et al. 2007]: SVM-based
MKL-RKDA [Ye et al. 2008]: kernel learning based on discriminant analysis
VBKC [Damoulas and Girolami, 2008]: Bayesian model based on multinomial probit regression
Overall Features Performance

F.S.      MKLdiv   SimpleMKL   VBKC         MKL-RKDA   Ding et al.
C         51.69    51.83       51.2 ± 0.5   45.43      44.9
S         40.99    40.73       38.1 ± 0.3   38.64      35.6
H         36.55    36.55       32.5 ± 0.4   34.20      36.5
P         35.50    35.50       32.2 ± 0.3   30.54      32.9
V         37.07    37.85       32.8 ± 0.3   30.54      35
Z         37.33    36.81       33.2 ± 0.4   30.28      32.9
L1        44.64    45.16       41.5 ± 0.5   36.55      -
L4        44.90    44.90       41.5 ± 0.4   38.12      -
L14       43.34    43.34       38 ± 0.2     40.99      -
L30       31.59    31.59       32 ± 0.2     36.03      -
SW1       62.92    62.40       59.8 ± 1.9   61.87      -
SW2       63.96    63.44       49 ± 0.7     64.49      -
Overall   73.36    66.57       70           68.40      56
Average   68.40    68.14       -            66.06      -
Kernel Weights of Overall Features

[Figure: learned kernel weights λ for the overall feature combination]
Performance of Sequentially Added Features
Features                   MKLdiv   SimpleMKL   MKL-RKDA
SW1                        62.92    62.40       61.87
SW1S                       65.27    64.22       64.75
SW1SW2S                    67.10    64.75       64.49
SW1SW2CS                   73.36    65.01       67.62
SW1SW2CSH                  74.67    66.31       67.88
SW1SW2CSHP                 74.93    66.31       69.71
SW1SW2CSHPZ                75.19    68.92       66.05
SW1SW2CSHPZV               74.41    66.31       69.19
SW1SW2CSHPZVL1             73.10    66.84       68.66
SW1SW2CSHPZVL1L4           72.84    67.10       67.62
SW1SW2CSHPZVL1L4L14        72.58    66.84       69.19
SW1SW2CSHPZVL1L4L14L30     73.36    66.57       68.40
Kernel Weights of Dominant Features

[Figure: learned kernel weights λ for the dominant feature subsets]
Yeast Protein Functional Prediction (Lanckriet et al. 2004)
Data sources and kernels:
SW: protein sequences with Smith-Waterman Score
B: protein sequences with BLAST
Pfam: protein sequences with Pfam HMM
FFT: hydropathy profile with FFT
LI: protein interactions with linear kernel
Diff: protein interactions with diffusion kernel
E: gene expression with radial basis kernel
MKLdiv Performance on Membrane Protein Prediction (binary)

[Figure: MKLdiv results on the membrane protein prediction task]
Conclusion and Outlook

We saw:
Kernel learning is well motivated
Sound theoretical results: regularization theory and statistical learning theory
Novel data integration via KL-divergence, implemented by DC programming
State-of-the-art performance on protein fold recognition

Future:
More investigation into structured-output datasets
Systematic study of the relation between the KL-divergence based criterion
Relevant Work

Yiming Ying, Kaizhu Huang, and Colin Campbell. Enhanced protein fold recognition through a novel data integration approach. Preprint, 2009.

Yiming Ying and Colin Campbell. The Rademacher chaos complexity of learning the kernel. Submitted, 2008.

Theodoros Damoulas, Yiming Ying, Mark A. Girolami, and Colin Campbell. Inferring sparse kernel combinations and relevance vectors: an application to subcellular localization of proteins. In Proceedings of the 7th International Conference on Machine Learning and Applications (ICMLA 2008).

Yiming Ying and Ding-Xuan Zhou. Learnability of Gaussians with flexible variances. Journal of Machine Learning Research, 8: 249-276, 2007.

Qiang Wu, Yiming Ying, and Ding-Xuan Zhou. Multi-kernel regularized classifiers. Journal of Complexity, 23: 108-134, 2007.

Available at http://www.cs.ucl.ac.uk/staff/Y.Ying
Thank you for your attention!