Kernel Learning with Bregman
Matrix Divergences
Inderjit S. Dhillon
The University of Texas at Austin
Workshop on Algorithms for Modern Massive Data Sets
Stanford University and Yahoo! Research
June 22, 2006
Joint work with Brian Kulis, Matyas Sustik and Joel Tropp
Kernel Methods
Kernel methods are important for many problems in machine learning
Embed data into high-dimensional space via a mapping ψ
Kernel function gives the inner product in the feature space:
$$\kappa(a_i, a_j) = \psi(a_i) \cdot \psi(a_j)$$
Kernel algorithms use the kernel matrix for learning in feature space
Kernels have been defined for various discrete structures: trees, graphs, strings, images, ...
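As a small illustration of the identity $\kappa(a_i, a_j) = \psi(a_i) \cdot \psi(a_j)$, the sketch below uses the degree-2 homogeneous polynomial kernel on $\mathbb{R}^2$, whose feature map can be written out explicitly. The kernel choice, the data, and the names `psi` and `kappa` are assumptions made for the example, not part of the talk.

```python
import numpy as np

def psi(a):
    """Explicit feature map for the kernel (a . b)^2 on R^2."""
    return np.array([a[0] ** 2, np.sqrt(2.0) * a[0] * a[1], a[1] ** 2])

def kappa(a, b):
    """Kernel function: equals psi(a) . psi(b) without forming psi."""
    return float(np.dot(a, b)) ** 2

# Hypothetical data: 4 points in R^2.
A = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-2.0, 1.5]])

# Kernel matrix built from the kernel function...
K = np.array([[kappa(a, b) for b in A] for a in A])

# ...and the same matrix built from explicit feature vectors.
Psi = np.array([psi(a) for a in A])
K_explicit = Psi @ Psi.T
```

Both constructions give the same matrix; a kernel algorithm only ever needs `K`, never `Psi`.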
Low-Rank Kernels
A kernel matrix may not be of full rank, but rather of rank r < n
Can decompose the kernel as $K = GG^T$, where $K$ is $n \times n$, $G$ is $n \times r$, and $G^T$ is $r \times n$
Several advantages to such a decomposition
Storage reduces from $O(n^2)$ to $O(nr)$
Running time of most kernel algorithms improves to linear in $n$
Example: SVM training goes from $O(n^3)$ to $O(nr^2)$
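A minimal sketch of where the savings come from, assuming a random factor $G$ with illustrative sizes (not taken from the talk): a kernel-vector product can be carried out through the factor in $O(nr)$ work without ever forming the $n \times n$ matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 500, 8                      # illustrative sizes only
G = rng.standard_normal((n, r))    # factor of K = G G^T: n*r stored numbers
x = rng.standard_normal(n)

# Kernel-vector product through the factor: two thin products, O(nr) work.
y_fast = G @ (G.T @ x)

# Reference: form the full n x n kernel, O(n^2) storage and work.
K = G @ G.T
y_full = K @ x
```

The two results agree up to rounding, while the factored route touches only the $n \times r$ array.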
Learning Low-Rank Kernels
Sometimes need to learn the kernel matrix
No kernel may be given, but other information may be available
An existing kernel may be given, but may want to modify the kernel to satisfy additional constraints
May want to combine multiple sources of data into a single kernel
Existing methods for learning a kernel are $O(n^3)$ (or worse)
Semi-definite programming methods [Lanckriet et al., JMLR, 2004]
Projection-based methods [Tsuda et al., JMLR, 2005]
Hyperkernels and other methods [Ong et al., JMLR, 2005]
Bregman Divergences
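Let $\varphi : S \to \mathbb{R}$ be a differentiable, strictly convex function of “Legendre type” ($S \subseteq \mathbb{R}^d$)
The Bregman divergence $D_\varphi : S \times \mathrm{ri}(S) \to \mathbb{R}$ is defined as
$$D_\varphi(x, y) = \varphi(x) - \varphi(y) - (x - y)^T \nabla\varphi(y)$$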
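[Figure: plot of $\varphi(z)$ and $h(z)$, with the points $y$ and $x$ marked]
$\varphi(z) = \frac{1}{2} z^T z$ gives $D_\varphi(x, y) = \frac{1}{2}\|x - y\|^2$
Squared Euclidean distance is a Bregman divergence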
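[Figure: plot of $\varphi(z)$ and $h(z)$, with the points $y$ and $x$ marked]
$\varphi(z) = z \log z$ gives $D_\varphi(x, y) = x \log\frac{x}{y} - x + y$
Relative Entropy (also called KL-divergence)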
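[Figure: plot of $\varphi(z)$ and $h(z)$, with the points $y$ and $x$ marked]
$\varphi(z) = -\log z$ gives $D_\varphi(x, y) = \frac{x}{y} - \log\frac{x}{y} - 1$
Itakura-Saito Distance (or Burg divergence), used in signal processing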
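The three examples can be checked against the general definition $D_\varphi(x, y) = \varphi(x) - \varphi(y) - (x - y)^T \nabla\varphi(y)$ with a short sketch. The helper name `bregman`, the test vectors, and the choice to apply the scalar generators coordinate-wise and sum them are assumptions for illustration.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - (x - y)^T grad_phi(y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return phi(x) - phi(y) - np.dot(x - y, grad_phi(y))

# (generator, gradient) pairs; scalar generators are summed over coordinates.
sq_euclid = (lambda z: 0.5 * np.dot(z, z), lambda z: z)
neg_entropy = (lambda z: np.sum(z * np.log(z)), lambda z: np.log(z) + 1.0)
burg = (lambda z: -np.sum(np.log(z)), lambda z: -1.0 / z)

x = np.array([0.2, 0.5, 0.3])   # hypothetical positive vectors
y = np.array([0.4, 0.4, 0.2])

d_euclid = bregman(*sq_euclid, x, y)   # 0.5 * ||x - y||^2
d_kl = bregman(*neg_entropy, x, y)     # sum(x log(x/y) - x + y)
d_is = bregman(*burg, x, y)            # sum(x/y - log(x/y) - 1)
```

Each value matches the closed form listed in its comment, so one generic routine covers all three divergences once $\varphi$ and $\nabla\varphi$ are supplied.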
Properties of Bregman Divergences
$D_\varphi(x, y) \ge 0$, and equals 0 iff $x = y$
Not a metric (symmetry, triangle inequality do not hold)
Strictly convex in the first argument, but not convex (in general) in the second argument
Three-point property generalizes the “Law of cosines”:
$$D_\varphi(x, y) = D_\varphi(x, z) + D_\varphi(z, y) - (x - z)^T (\nabla\varphi(y) - \nabla\varphi(z))$$
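A quick numerical check of these properties, using relative entropy as the divergence. The vectors and the helper names `kl` and `grad_phi` are assumptions chosen for the example.

```python
import numpy as np

def kl(x, y):
    """Relative entropy in its Bregman form: sum(x log(x/y) - x + y)."""
    return float(np.sum(x * np.log(x / y) - x + y))

def grad_phi(z):
    """Gradient of the generator phi(z) = sum(z log z)."""
    return np.log(z) + 1.0

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.4, 0.4, 0.2])
z = np.array([0.1, 0.6, 0.3])

# Three-point property: both sides should agree up to rounding.
lhs = kl(x, y)
rhs = kl(x, z) + kl(z, y) - np.dot(x - z, grad_phi(y) - grad_phi(z))

# Lack of symmetry: D(x, y) and D(y, x) differ.
asymmetry_gap = abs(kl(x, y) - kl(y, x))
```

Here `lhs` and `rhs` coincide, while `asymmetry_gap` is clearly nonzero, which is why the divergence is not a metric.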
Bregman Projections
Nearness in Bregman divergence: the “Bregman” projection of $y$ onto a convex set $\Omega$,
$$P_\Omega(y) = \arg\min_{\omega \in \Omega} D_\varphi(\omega, y)$$
Generalized Pythagoras Theorem: for every $x \in \Omega$,
$$D_\varphi(x, y) \ge D_\varphi(x, P_\Omega(y)) + D_\varphi(P_\Omega(y), y)$$
When $\Omega$ is an affine set, the above holds with equality
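The equality case can be seen in a small example. For relative entropy and the affine set $\{w : \sum_i w_i = 1\}$, setting the gradient of the Lagrangian to zero gives $\log(w_i / y_i) = \text{const}$, so the projection is just a rescaling of $y$. The sketch below assumes that setting; the vectors and function names are hypothetical.

```python
import numpy as np

def kl(x, y):
    """Relative entropy in its Bregman form: sum(x log(x/y) - x + y)."""
    return float(np.sum(x * np.log(x / y) - x + y))

def project_sum_to_one(y):
    """Relative-entropy projection of y > 0 onto {w : sum(w) = 1}."""
    return y / np.sum(y)

y = np.array([2.0, 1.0, 0.5, 0.5])   # hypothetical point outside the set
x = np.array([0.1, 0.2, 0.3, 0.4])   # any point inside the affine set
p = project_sum_to_one(y)

# Generalized Pythagoras with equality, since the set is affine.
lhs = kl(x, y)
rhs = kl(x, p) + kl(p, y)
```

Because `rhs` contains the nonnegative term `kl(x, p)`, the identity also shows directly that `p` is at least as close to `y` as any other `x` in the set.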
Bregman Matrix Divergences
Extend Bregman divergences over vectors to divergences over real, symmetric matrices:
$$D_\varphi(X, Y) = \varphi(X) - \varphi(Y) - \mathrm{tr}\big((\nabla\varphi(Y))^T (X - Y)\big)$$
Squared Frobenius norm: $\varphi(X) = \frac{1}{2}\mathrm{tr}(X^T X)$. Then
$$D_{\mathrm{Frob}}(X, Y) = \frac{1}{2}\|X - Y\|_F^2$$
von Neumann Divergence: $\varphi(X) = \mathrm{tr}(X \log X - X)$ (negative entropy of the eigenvalues). Then
$$D_{\mathrm{vN}}(X, Y) = \mathrm{tr}(X \log X - X \log Y - X + Y)$$
Burg Matrix Divergence (LogDet divergence): $\varphi(X) = -\log\det X$ (Burg entropy of the eigenvalues). Then
$$D_{\mathrm{Burg}}(X, Y) = \mathrm{tr}(X Y^{-1}) - \log\det(X Y^{-1}) - n$$
Optimization Framework
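Optimization problem:
$$\begin{aligned}
\text{minimize} \quad & D_\varphi(K, K_0) \\
\text{subject to} \quad & \mathrm{tr}(K A_i) \le b_i, \quad 1 \le i \le c \\
& \mathrm{rank}(K) \le r \\
& K \succeq 0
\end{aligned}$$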
Problem is non-convex due to the rank constraint
Standard convex optimization techniques do not seem to apply here
Rank-Preserving Matrix Divergences
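Rewrite the von Neumann divergence and Burg divergence using the eigendecompositions
Given matrices $X$ and $Y$, let $X = V \Lambda V^T$ and $Y = U \Theta U^T$
von Neumann:
$$\begin{aligned}
D_{\mathrm{vN}}(X, Y) &= \mathrm{tr}(X \log X - X \log Y - X + Y) \\
&= \sum_i \lambda_i \log\lambda_i - \sum_{i,j} (v_i^T u_j)^2 \lambda_i \log\theta_j - \sum_i (\lambda_i - \theta_i)
\end{aligned}$$
Important Implication: $D_{\mathrm{vN}}(X, Y)$ is finite iff $\mathrm{range}(X) \subseteq \mathrm{range}(Y)$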
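The eigenvalue form can be checked numerically against the trace form. The sketch below assumes full-rank positive definite test matrices (so every logarithm is finite); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_spd(n):
    """Hypothetical positive definite test matrix."""
    B = rng.standard_normal((n, n))
    return B @ B.T + n * np.eye(n)

X, Y = random_spd(4), random_spd(4)
lam, V = np.linalg.eigh(X)        # X = V diag(lam) V^T
theta, U = np.linalg.eigh(Y)      # Y = U diag(theta) U^T

# Trace form, with matrix logs taken through the eigendecompositions.
logX = (V * np.log(lam)) @ V.T
logY = (U * np.log(theta)) @ U.T
d_trace = np.trace(X @ logX - X @ logY - X + Y)

# Eigenvalue form, with C[i, j] = (v_i^T u_j)^2.
C = (V.T @ U) ** 2
d_eig = (np.sum(lam * np.log(lam))
         - lam @ C @ np.log(theta)
         - np.sum(lam - theta))
```

The cross term explains the range condition: if some $\theta_j \to 0$ while $(v_i^T u_j)^2 \lambda_i > 0$, then $-\lambda_i \log\theta_j \to +\infty$, whereas a zero $\lambda_i$ contributes nothing under the convention $0 \log 0 = 0$.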
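Burg divergence:
$$\begin{aligned}
D_{\mathrm{Burg}}(X, Y) &= \mathrm{tr}(X Y^{-1}) - \log\det(X Y^{-1}) - n \\
&= \sum_{i,j} \frac{\lambda_i}{\theta_j} (v_i^T u_j)^2 - \sum_i \log\frac{\lambda_i}{\theta_i} - n
\end{aligned}$$
Important Implication: $D_{\mathrm{Burg}}(X, Y)$ is finite iff $\mathrm{range}(X) = \mathrm{range}(Y)$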
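The same kind of check works for the Burg divergence. Again this is a sketch on full-rank positive definite test matrices, with hypothetical variable names.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_spd(n):
    """Hypothetical positive definite test matrix."""
    B = rng.standard_normal((n, n))
    return B @ B.T + n * np.eye(n)

n = 4
X, Y = random_spd(n), random_spd(n)
lam, V = np.linalg.eigh(X)        # X = V diag(lam) V^T
theta, U = np.linalg.eigh(Y)      # Y = U diag(theta) U^T

# Trace form.
M = X @ np.linalg.inv(Y)
_, logdet = np.linalg.slogdet(M)
d_trace = np.trace(M) - logdet - n

# Eigenvalue form, with C[i, j] = (v_i^T u_j)^2.
C = (V.T @ U) ** 2
d_eig = lam @ C @ (1.0 / theta) - np.sum(np.log(lam / theta)) - n
```

In the eigenvalue form, a vanishing $\theta_j$ makes the first sum blow up and a vanishing $\lambda_i$ makes the logarithm blow up, which is consistent with the requirement that the two ranges coincide.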
Optimization problem:
Satisfies $\mathrm{rank}(K) \le \mathrm{rank}(K_0)$ in vN divergence
Satisfies $\mathrm{rank}(K) = \mathrm{rank}(K_0)$ in Burg divergence
Implicitly maintains rank and positive semi-definite constraints
Is convex when $\mathrm{rank}(K_0) \le r$
Method of Cyclic Projections
Use the “Bregman” projection of $y$ onto an affine set $H$,
$$P_H(y) = \arg\min_{a \in H} D_\varphi(a, y)$$
Method projects onto each constraint, one at a time, and applies an appropriate “correction”; it provably converges to the optimal solution
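To illustrate the cyclic pattern on a familiar case (not the kernel problem of the talk), the sketch below alternates relative-entropy projections of a positive matrix onto two affine sets: prescribed row sums and prescribed column sums. Each projection is a rescaling, and because the constraints are equalities no correction term is needed; inequality constraints, as in the talk, would require the correction step, which is not shown. The function name and data are assumptions.

```python
import numpy as np

def cyclic_kl_projections(Y, row_sums, col_sums, sweeps=200):
    """Alternate relative-entropy projections onto two affine sets."""
    X = Y.copy()
    for _ in range(sweeps):
        # Projection onto {X : X 1 = row_sums}: rescale each row.
        X *= (row_sums / X.sum(axis=1))[:, None]
        # Projection onto {X : X^T 1 = col_sums}: rescale each column.
        X *= (col_sums / X.sum(axis=0))[None, :]
    return X

rng = np.random.default_rng(4)
Y = rng.random((4, 5)) + 0.1            # hypothetical positive start point
row_sums = np.full(4, 1.0 / 4)
col_sums = np.full(5, 1.0 / 5)

X = cyclic_kl_projections(Y, row_sums, col_sums)
```

After enough sweeps both families of constraints hold simultaneously, even though each step only enforces one of them.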
Projections with rank-preserving Bregman divergences
Solve a smaller optimization problem to get the projection onto constraint $i$
$$\begin{aligned}
\text{minimize} \quad & D_\varphi(K_{t+1}, K_t) \\
\text{subject to} \quad & \mathrm{tr}(K_{t+1} A_i) \le b_i
\end{aligned}$$
For both divergences, projections can be calculated in $O(r^2)$ time
Requires all operations to be done on a suitable factored form of the kernel matrices
Burg divergence update:
$$K_{t+1} = (K_t^\dagger - \alpha A_i^T)^\dagger$$
such that $\mathrm{tr}(K_{t+1} A_i) \le b_i$
Distance constraints expressed as $A_i = z_i z_i^T$
Projection parameter $\alpha$ has a closed form solution
$O(r^2)$ projection achieved using factored form of $K_t$
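One way such a projection could look, as a sketch under stated assumptions rather than the authors' implementation: take an equality constraint $z^T K z = b$ and write $p = z^T K_t z$. In the full-rank case the Sherman-Morrison identity gives $\alpha = 1/p - 1/b$ and $K_{t+1} = K_t + \beta K_t z z^T K_t$ with $\beta = (b - p)/p^2$. If the kernel is held as $K_t = G_0 B B^T G_0^T$ with an $r \times r$ factor $B$, only $B$ needs updating. The factorization, the names `burg_project`, `B`, `g`, and the omission of the inequality correction are all assumptions of this example.

```python
import numpy as np

def burg_project(B, g, b):
    """One Burg projection onto {K : z^T K z = b}, K = G0 B B^T G0^T.

    B is the r x r factor, g = G0^T z is an r-vector. Cost is O(r^2).
    """
    w = B.T @ g                     # z^T K z = w^T w
    p = float(w @ w)
    c = np.sqrt(b / p) - 1.0        # chosen so that (1 + c)^2 = b / p
    # B (I + c w w^T / p): right-multiply by a square root of I + beta w w^T.
    return B + np.outer(B @ w, w) * (c / p)

rng = np.random.default_rng(5)
n, r = 50, 5
G0 = rng.standard_normal((n, r))    # hypothetical initial factor, K0 = G0 G0^T
B = np.eye(r)

i, j, b = 3, 7, 1.0                 # target squared distance between i and j
g = G0[i] - G0[j]                   # G0^T z for z = e_i - e_j, O(r) work
B_new = burg_project(B, g, b)

# Formed here only to inspect the result; the update itself never needs K.
K_new = G0 @ B_new @ B_new.T @ G0.T
dist_new = K_new[i, i] + K_new[j, j] - 2.0 * K_new[i, j]
```

After the update the squared distance between points `i` and `j` equals `b`, and the rank of the kernel is unchanged because only the $r \times r$ factor was modified.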
von Neumann divergence update:
$$K_{t+1} = \exp(\log(K_t) + \alpha z_i z_i^T)$$
such that $\mathrm{tr}(K_{t+1} A_i) \le b_i$
Requires the fast multipole method to achieve $O(r^2)$ projection
We have designed a custom non-linear solver to compute the projection parameter $\alpha$
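For intuition only, the update can be sketched naively for a full-rank kernel and an equality constraint $z^T K_{t+1} z = b$, finding $\alpha$ by bisection. This is not the fast multipole approach or the custom solver mentioned above: every function evaluation here costs a full eigendecomposition. The names `sym_fun` and `von_neumann_project` and the test data are assumptions.

```python
import numpy as np

def sym_fun(X, f):
    """Apply the scalar function f to the eigenvalues of symmetric X."""
    lam, V = np.linalg.eigh(X)
    return (V * f(lam)) @ V.T

def von_neumann_project(K, z, b, iters=200):
    """Find alpha with z^T exp(log K + alpha z z^T) z = b by bisection."""
    logK = sym_fun(K, np.log)
    zzT = np.outer(z, z)

    def constraint_value(alpha):
        return float(z @ sym_fun(logK + alpha * zzT, np.exp) @ z)

    # The constraint value increases with alpha, so bracket the root first.
    lo, hi = -1.0, 1.0
    while constraint_value(lo) > b:
        lo *= 2.0
    while constraint_value(hi) < b:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if constraint_value(mid) < b:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    return sym_fun(logK + alpha * zzT, np.exp), alpha

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 4))
K = A @ A.T + 4.0 * np.eye(4)       # hypothetical full-rank kernel
z = np.array([1.0, -1.0, 0.0, 0.0])
b = 0.5 * float(z @ K @ z)          # ask for half the current squared distance

K_new, alpha = von_neumann_project(K, z, b)
```

Because the result is a matrix exponential of a symmetric matrix, `K_new` stays positive definite without any explicit constraint.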
End Result: Algorithms are linear in $n + c$, quadratic in $r$
Special Cases and Related Work
Full-rank case with von Neumann divergence, $b_i = 0\ \forall i$: obtain the DefiniteBoost optimization problem [Tsuda et al., JMLR, 2005]
Improve the running time from $O(n^3)$ to $O(n^2)$ per projection
Allow arbitrary linear constraints, both equality and inequality
For constraints $K_{ii} = 1\ \forall i$, obtain the nearest correlation matrix problem [Higham, IMA J. Numerical Analysis, 2002]
Arises in financial applications
New efficient methods for finding low-rank correlation matrices
Can be used for nonlinear dimensionality reduction
Isometry constraints are linear
[Weinberger et al., ICML, 2004] maximize $\mathrm{trace}(X)$ (by semi-definite programming)
Experiments
Digits data: 317 digits, 3 classes
Given a rank-16 kernel for 317 digits
Randomly create constraints:
$$d(i_1, i_2) \le (1 - \epsilon) b_i$$
$$d(i_1, i_2) \ge (1 + \epsilon) b_i$$
Attempt to learn a “better” rank-16 kernel
[Figure: NMI (axis range 0.4 to 0.9) versus number of constraints (0 to 140), with curves for Burg divergence and von Neumann divergence]
Clustering: use kernel k-means with random initialization, compute accuracy using normalized mutual information
Experiments
GyrB protein data: 52 proteins, 3 classes
Given only constraints
Want to learn a kernel based on constraints
Constraints generated from target kernel matrix
Attempt to learn a full-rank kernel
[Figure: classification accuracy (axis range 0.65 to 1) versus number of constraints (100 to 1000), with curves for Burg divergence and von Neumann divergence]
Classification: use k-nearest neighbor, k = 5, 50/50 training/test split, 2-fold cross validation averaged over 20 runs
Relevant Papers
[Dhillon & Tropp] “Matrix Nearness Problems with Bregman Divergences”, submitted to the SIAM Journal on Matrix Analysis and Applications, 2006.
[Kulis, Sustik & Dhillon] “Low-Rank Kernel Learning”, International Conference on Machine Learning, 2006 (to appear).
Conclusions
Low-rank kernel matrices can be learnt efficiently using appropriate Bregman matrix divergences
Burg divergence and von Neumann divergence
Implicitly maintain rank and positive semi-definiteness constraints
Algorithms scale linearly with $n$ and quadratically with $r$
Future Work
Further investigate the gains of preserving range space
Apply to varied applications
Improvement over cyclic projection methods