
Page 1

Dependence Analysis with Reproducing Kernel Hilbert Spaces

Kenji Fukumizu
Institute of Statistical Mathematics

Graduate University for Advanced Studies

Based on collaborations with M. Jordan (UC Berkeley), F. Bach (Ecole Normale Supérieure), A. Gretton, X. Sun, and B. Schölkopf (Max-Planck Institute)

7th World Congress on Statistics and Probability July 14-19, 2008. Singapore

Page 2

Outline

Introduction

Independence and conditional independence with RKHS

Kernel dimension reduction for regression

Summary

Page 3

RKHS for statistical inference

"RKHS methods" for statistical inference
– Reproducing kernel Hilbert space (RKHS) / positive definite kernel: captures "nonlinearity" or "higher-order moments" of data, e.g. the support vector machine.
– Recent studies: RKHS applied to independence and conditional independence.

[Figure: the feature map Φ sends a data point X ∈ Ω to Φ(X) in the RKHS H_X; linear methods are then applied to the transformed data.]

Page 4

Positive definite kernel and RKHS

Positive definite kernel
Ω: set. k : Ω × Ω → R is positive definite if k(x, y) = k(y, x) and, for any n ∈ N and x_1, …, x_n ∈ Ω, the matrix (Gram matrix) ( k(x_i, x_j) )_{i,j} is positive semidefinite.
– Example: Gaussian RBF kernel  k(x, y) = exp( −‖x − y‖² / σ² ).

Reproducing kernel Hilbert space (RKHS)
k: positive definite kernel on Ω. There uniquely exists a Hilbert space H consisting of functions on Ω such that
1) k(·, x) ∈ H for all x ∈ Ω;
2) Span{ k(·, x) | x ∈ Ω } is dense in H;
3) ⟨k(·, x), f⟩_H = f(x) for all f ∈ H and x ∈ Ω (reproducing property).
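A minimal numerical sketch of these definitions (not part of the original slides; the names rbf_gram, X, K are illustrative): it builds the Gram matrix of the Gaussian RBF kernel for a small sample and checks positive semidefiniteness through the eigenvalues.

import numpy as np

def rbf_gram(X, sigma):
    # Gram matrix (k(x_i, x_j))_{i,j} for k(x, y) = exp(-||x - y||^2 / sigma^2)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 points in R^3
K = rbf_gram(X, sigma=1.0)

# Positive semidefiniteness: the eigenvalues of the symmetric matrix K
# are nonnegative (up to numerical round-off).
eigvals = np.linalg.eigvalsh(K)
print(K.shape, eigvals.min())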

Page 5

How to use RKHS for data analysis?
Transform the data into the RKHS:
Φ : Ω → H,  x ↦ k(·, x)   (i.e. Φ(x) = k(·, x)).
Data: X_1, …, X_N  ↦  Φ(X_1), …, Φ(X_N) : functional data.

[Figure: illustration of dependence analysis with RKHS — X ∈ Ω_X and Y ∈ Ω_Y are mapped by the feature maps Φ_X and Φ_Y to Φ_X(X) ∈ H_X and Φ_Y(Y) ∈ H_Y.]

Page 6

Why RKHS? Easy empirical computation
The inner product of H is efficiently computable, while the dimensionality may be infinite:
f = Σ_{i=1}^N a_i Φ(x_i),  g = Σ_{j=1}^N b_j Φ(x_j)  ⇒  ⟨f, g⟩ = Σ_{i,j=1}^N a_i b_j k(x_i, x_j),   ⟨Φ(x), Φ(y)⟩ = k(x, y).

– The computational cost essentially depends on the sample size N.
  c.f. L² inner product / explicit power expansion such as (X, Y, Z, W) ↦ (X, Y, Z, W, X², Y², Z², W², XY, XZ, XW, YZ, …).
– Advantageous for high-dimensional data of moderate sample size.
– Can be applied to non-Euclidean data (strings, graphs, etc.).
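As an illustration of this point (not from the slides): the sketch below evaluates ⟨f, g⟩ for f = Σ_i a_i Φ(x_i) and g = Σ_j b_j Φ(x_j) directly from the Gram matrix, so the cost depends on N rather than on the possibly infinite dimension of H. It reuses rbf_gram and the sample X from the sketch on page 4; all names are illustrative.

import numpy as np

a = np.random.default_rng(1).normal(size=50)   # coefficients of f
b = np.random.default_rng(2).normal(size=50)   # coefficients of g
K = rbf_gram(X, sigma=1.0)                     # Gram matrix of the same sample X

inner_fg = a @ K @ b                           # <f, g>_H = sum_ij a_i b_j k(x_i, x_j)
norm_f = np.sqrt(a @ K @ a)                    # ||f||_H
print(inner_fg, norm_f)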

Page 7

Outline

Introduction

Independence and conditional independence with RKHS

Kernel dimension reduction for regression

Summary

Page 8

Covariance on RKHS
(X, Y): random vector taking values on Ω_X × Ω_Y. (H_X, k_X), (H_Y, k_Y): RKHS on Ω_X and Ω_Y, resp.
Define random variables on the RKHS H_X and H_Y by
Φ_X(X) = k_X(·, X),  Φ_Y(Y) = k_Y(·, Y).

Def. Cross-covariance operator Σ_YX : H_X → H_Y
⟨g, Σ_YX f⟩ = E[g(Y) f(X)] − E[g(Y)] E[f(X)]  ( = Cov[f(X), g(Y)] )   for all f ∈ H_X, g ∈ H_Y,
equivalently  Σ_YX = E[Φ_Y(Y) ⊗ Φ_X(X)] − E[Φ_Y(Y)] ⊗ E[Φ_X(X)].

c.f. ordinary covariance matrix: V_YX = Cov[Y, X] = E[Y X^T] − E[Y] E[X]^T.

Page 9

Characterization of independence
Independence and cross-covariance operator
If the RKHSs are "rich enough" to express all the moments,
X ⫫ Y  ⇔  Σ_XY = O  ( ⇔  E[f(X) g(Y)] = E[f(X)] E[g(Y)] for all f ∈ H_X, g ∈ H_Y ).
f and g are test functions to compare the moments with respect to P_XY and P_X P_Y.

– Analog to Gaussian random vectors:  X ⫫ Y  ⇔  V_YX = O.
– c.f. characteristic functions:  X ⫫ Y  ⇔  E[ e^{√−1 ω^T X} e^{√−1 η^T Y} ] = E[ e^{√−1 ω^T X} ] E[ e^{√−1 η^T Y} ] for all ω and η.
– Applied to an independence test (Gretton et al. 2008).

Page 10

Characteristic kernels
A class for determining a probability
X: random variable taking values on Ω. (H, k): RKHS on Ω with a bounded measurable kernel k.

H (or k) is called characteristic if, for probabilities P and Q on Ω,
E_{X∼P}[f(X)] = E_{X∼Q}[f(X)] for all f ∈ H   means   P = Q.
(H works as a class of test functions to determine a probability.)

– If H_X ⊗ H_Y, given by the product kernel k_X k_Y, is characteristic, then
  Σ_XY = O  ⇔  X ⫫ Y.
  ( Σ_XY = O  ⇒  E_{P_XY}[f(X) g(Y)] = E_{P_X P_Y}[f(X) g(Y)]  ⇒  P_XY = P_X P_Y. )
– An example on R^m: Gaussian RBF kernel  exp( −‖x − y‖² / σ² ).
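A characteristic kernel makes the mean element E[k(·, X)] a one-to-one representation of the distribution, so two distributions can be compared through the RKHS distance between their empirical mean elements. The sketch below (added for illustration, not in the slides; names are hypothetical) computes ‖m̂_P − m̂_Q‖²_H with Gram matrices; for a characteristic kernel this population quantity is zero iff P = Q.

import numpy as np

def rbf_cross_gram(A, B, sigma):
    # cross Gram matrix k(a_i, b_j) for the Gaussian RBF kernel
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / sigma**2)

def mean_element_distance2(Xp, Xq, sigma=1.0):
    # ||m_P_hat - m_Q_hat||^2_H
    #   = mean k(Xp, Xp) - 2 mean k(Xp, Xq) + mean k(Xq, Xq)
    return (rbf_cross_gram(Xp, Xp, sigma).mean()
            - 2.0 * rbf_cross_gram(Xp, Xq, sigma).mean()
            + rbf_cross_gram(Xq, Xq, sigma).mean())

rng = np.random.default_rng(0)
P = rng.normal(size=(200, 2))                # sample from P
Q = rng.normal(size=(200, 2)) + [1.0, 0.0]   # sample from a shifted Q
print(mean_element_distance2(P, Q))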

Page 11

Estimation of the cross-covariance operator
(X_1, Y_1), …, (X_N, Y_N): i.i.d. sample on Ω_X × Ω_Y.

Σ̂_YX^{(N)} = (1/N) Σ_{i=1}^N k_Y(·, Y_i) ⊗ k_X(·, X_i) − ( (1/N) Σ_{i=1}^N k_Y(·, Y_i) ) ⊗ ( (1/N) Σ_{i=1}^N k_X(·, X_i) )    (rank ≤ N),

⟨g, Σ̂_YX^{(N)} f⟩ = (1/N) Σ_{i=1}^N g(Y_i) f(X_i) − ( (1/N) Σ_{i=1}^N g(Y_i) ) ( (1/N) Σ_{i=1}^N f(X_i) ),

which is represented by the Gram matrices.

Theorem
A uniform law of large numbers holds:
sup_{‖f‖_{H_X} ≤ 1, ‖g‖_{H_Y} ≤ 1} | Cov_emp[f(X), g(Y)] − Cov[f(X), g(Y)] | → 0 in probability (N → ∞),
that is, ‖Σ̂_YX^{(N)} − Σ_YX‖_HS = O_p(N^{−1/2}) (N → ∞).

– Weak convergence of √N ( Σ̂_YX^{(N)} − Σ_YX ) to a Gaussian process on H_X ⊗ H_Y is also known.
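The Gram-matrix representation is what makes the estimator practical. As an illustration (a sketch assuming Gaussian RBF kernels; the function names are illustrative), the squared Hilbert–Schmidt norm of Σ̂_YX^{(N)} — the statistic used, up to normalization, in the independence test of Gretton et al. (2008) — reduces to a trace of centered Gram matrices:

import numpy as np

def rbf_gram(Z, sigma):
    d2 = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * Z @ Z.T
    return np.exp(-d2 / sigma**2)

def hs_norm2_cross_cov(X, Y, sigma_x=1.0, sigma_y=1.0):
    # ||Sigma_hat_YX^(N)||_HS^2 = (1/N^2) Tr[ Gx_tilde Gy_tilde ],
    # where Gx_tilde, Gy_tilde are the centered Gram matrices.
    N = X.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    Gx = H @ rbf_gram(X, sigma_x) @ H
    Gy = H @ rbf_gram(Y, sigma_y) @ H
    return np.trace(Gx @ Gy) / N**2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(200, 1))   # Y depends on X
print(hs_norm2_cross_cov(X, Y))                    # markedly larger than for independent data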

Page 12

RKHS and conditional independence
Conditional covariance operator
X and Y: random variables. H_X, H_Y: RKHS with kernels k_X, k_Y, resp.

Def.  Σ_YY|X ≡ Σ_YY − Σ_YX Σ_XX^{−1} Σ_XY : conditional covariance operator on H_Y.
(Analogous to the conditional covariance matrix V_YY − V_YX V_XX^{−1} V_XY.)

– Relation to the conditional variance: if k_X is characteristic (e.g. Gaussian RBF kernel),
⟨g, Σ_YY|X g⟩ = E[ Var[g(Y) | X] ] = inf_{f ∈ H_X} E[ ( (g(Y) − E[g(Y)]) − (f(X) − E[f(X)]) )² ]   (∀ g ∈ H_Y).

– Empirical estimator:
Σ̂_YY|X^{(N)} = Σ̂_YY^{(N)} − Σ̂_YX^{(N)} ( Σ̂_XX^{(N)} + ε_N I )^{−1} Σ̂_XY^{(N)},   ε_N: regularization coefficient.
Can be represented by Gram matrices.
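For concreteness, one convenient Gram-matrix form of the trace of this estimator is Tr[Σ̂_YY|X^{(N)}] = ε_N Tr[ G̃_Y ( G̃_X + N ε_N I )^{−1} ], which (up to the constant factor ε_N) is the contrast function used for KDR on page 16. The sketch below is illustrative only; the helper names are not from the original work, and rbf_gram is as in the sketch on page 11.

import numpy as np

def center(G):
    # centered Gram matrix: (I - 1/N) G (I - 1/N)
    N = G.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ G @ H

def trace_cond_cov(Gx, Gy, eps):
    # Tr[ Sigma_hat_{YY|X}^(N) ] = eps * Tr[ Gy_tilde (Gx_tilde + N*eps*I)^{-1} ]
    # Gx, Gy: uncentered Gram matrices of X and Y; eps: regularization coefficient
    N = Gx.shape[0]
    Gx_t, Gy_t = center(Gx), center(Gy)
    return eps * np.trace(Gy_t @ np.linalg.inv(Gx_t + N * eps * np.eye(N)))

# usage: trace_cond_cov(rbf_gram(X, 1.0), rbf_gram(Y, 1.0), eps=1e-3)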

Page 13

Conditional independence

Theorem (FBJ 2004, 2006)
U, V, and Y: random variables on Ω_U, Ω_V, and Ω_Y, resp. H_U, H_V, H_Y: RKHS on Ω_U, Ω_V, Ω_Y with kernels k_U, k_V, k_Y, resp. X = (U, V). The RKHS on Ω_X = Ω_U × Ω_V is defined by k_X = k_U k_V. Assume H_X and H_U are characteristic. Then
Σ_YY|U ≥ Σ_YY|X,
where ≥ is the partial order of self-adjoint operators (A ≤ B means that B − A is positive semidefinite).
If further H_Y is characteristic, then
Σ_YY|U = Σ_YY|X  ⇔  Y ⫫ X | U.

Tr[ Σ_YY|U − Σ_YY|X ] works as a measure of conditional independence.
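Reusing trace_cond_cov and rbf_gram from the earlier sketches, the measure Tr[Σ_YY|U − Σ_YY|X] in the theorem can be estimated as below (illustrative only; in practice the kernels and ε_N must be chosen with care).

import numpy as np

def cond_indep_measure(U, V, Y, sigma=1.0, eps=1e-3):
    # estimate of Tr[Sigma_{YY|U}] - Tr[Sigma_{YY|X}] with X = (U, V);
    # by the theorem this is nonnegative, and values near 0 suggest Y ⫫ V | U
    # (under characteristic kernels). U, V, Y: arrays with one row per observation.
    X = np.hstack([U, V])
    Gy = rbf_gram(Y, sigma)
    Gu = rbf_gram(U, sigma)
    Gx = rbf_gram(X, sigma)
    return trace_cond_cov(Gu, Gy, eps) - trace_cond_cov(Gx, Gy, eps)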

Page 14

Outline

Introduction

Independence and conditional independence with RKHS

Kernel dimension reduction for regression

Summary

Page 15

Dimension reduction for regression
– Regression: Y: response variable, X = (X_1, …, X_m): m-dimensional explanatory variable.
– Goal of dimension reduction for regression = find an effective direction for regression (EDR space):
p(Y | X) = p̃(Y | b_1^T X, …, b_d^T X) = p̃(Y | B^T X),   i.e.   X ⫫ Y | B^T X,
where B = (b_1, …, b_d): m × d matrix, and d is fixed.
– Existing methods: sliced inverse regression (SIR, Li 1991), principal Hessian directions (pHd, Li 1992), SAVE (Cook & Weisberg 1991), MAVE (Xia et al. 2002), contour regression (Li et al. 2005), among others.

Page 16

Kernel Dimension Reduction
(Fukumizu, Bach, Jordan 2004, 2006)

Use characteristic kernels for B^T X and Y. Then
Σ_YY|B^T X ≥ Σ_YY|X,  and  Σ_YY|B^T X = Σ_YY|X  ⇔  X ⫫ Y | B^T X   (EDR space).

– KDR objective function:
min_{B : B^T B = I_d} Tr[ Σ_YY|B^T X ].
– KDR contrast function with a finite sample:
min_{B : B^T B = I_d} Tr[ G̃_Y ( G̃_{B^T X} + N ε_N I_N )^{−1} ],
where (K_{B^T X})_{ij} = k_d(B^T X_i, B^T X_j) and
G̃_{B^T X} = ( I_N − (1/N) 1_N 1_N^T ) K_{B^T X} ( I_N − (1/N) 1_N 1_N^T ) : centered Gram matrix
(and similarly for G̃_Y).
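A direct transcription of this finite-sample contrast into code (a sketch assuming Gaussian RBF kernels on both B^T X and Y; names are illustrative):

import numpy as np

def rbf_gram(Z, sigma):
    d2 = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2.0 * Z @ Z.T
    return np.exp(-d2 / sigma**2)

def kdr_contrast(B, X, Y, sigma=1.0, eps=1e-3):
    # Tr[ Gy_tilde ( G_{B^T X}_tilde + N*eps*I )^{-1} ]; smaller is better.
    # B: m x d matrix with orthonormal columns; X: N x m data; Y: N x 1 (or N x q) responses.
    N = X.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    Gz = H @ rbf_gram(X @ B, sigma) @ H        # centered Gram matrix of B^T X
    Gy = H @ rbf_gram(Y, sigma) @ H            # centered Gram matrix of Y
    return np.trace(Gy @ np.linalg.inv(Gz + N * eps * np.eye(N)))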

Page 17

Wide applicability of KDR
– The most general approach to dimension reduction:
  • no model is used for p(Y|X) or p(X);
  • no strong assumptions on the distribution of X and Y, or on the dimensionality/type of Y.
– Most conventional methods have some restrictions.

Computational issues of the KDR method
– Computational cost: the method works with matrices of size equal to the sample size. Low-rank approximation helps, e.g. incomplete Cholesky decomposition.
– The contrast function is non-convex, with possible local minima. Use a gradient method with an annealing technique, starting from a large σ in the Gaussian RBF kernel (a rough sketch follows below).
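One possible realization of the annealing strategy mentioned above (a rough sketch, not the authors' implementation; it uses kdr_contrast from the sketch on page 16, a generic optimizer, and QR re-orthonormalization instead of a constrained gradient method):

import numpy as np
from scipy.optimize import minimize

def fit_kdr(X, Y, d, sigmas=(10.0, 5.0, 2.0, 1.0), eps=1e-3, seed=0):
    # Estimate an m x d matrix B with B^T B = I_d by minimizing the KDR contrast,
    # annealing the kernel width sigma from large to small values.
    m = X.shape[1]
    B = np.linalg.qr(np.random.default_rng(seed).normal(size=(m, d)))[0]
    for sigma in sigmas:
        obj = lambda b: kdr_contrast(b.reshape(m, d), X, Y, sigma, eps)
        res = minimize(obj, B.ravel(), method="L-BFGS-B")
        B = np.linalg.qr(res.x.reshape(m, d))[0]   # project back to B^T B = I_d
    return B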

Page 18

Consistency of KDR

Theorem (FBJ 2006)
Suppose k_d is bounded and continuous, and
ε_N → 0,  N^{1/2} ε_N → ∞   (N → ∞).
Let S_0 be the set of the optimal parameters:
S_0 = { B | B^T B = I_d,  Tr[ Σ_YY|B^T X ] = min_{B'} Tr[ Σ_YY|B'^T X ] }.
Estimator:
B̂^{(N)} = arg min_{B : B^T B = I_d} Tr[ G̃_Y ( G̃_{B^T X} + N ε_N I_N )^{−1} ].
Then, under some conditions, for any open set U ⊃ S_0,
Pr( B̂^{(N)} ∈ U ) → 1   (N → ∞).

Page 19

Numerical results with KDR
Synthetic data (A)
Y = g(X_1, X_2) + W, where g is a fixed nonlinear function of the first two coordinates only (so the true EDR space is two-dimensional).
X ~ N(0, I_4): 4 dim.   W ~ N(0, τ²),  τ = 0.1, 0.4, 0.8.

Frobenius norms of the estimation errors of the projection matrices over 100 samples (means ± standard deviations). Sample size N = 100.

τ     KDR           SIR           SAVE          pHd
0.1   0.11 ± 0.07   0.55 ± 0.28   0.77 ± 0.35   1.04 ± 0.34
0.4   0.17 ± 0.09   0.60 ± 0.27   0.82 ± 0.34   1.03 ± 0.33
0.8   0.34 ± 0.22   0.69 ± 0.25   0.94 ± 0.35   1.06 ± 0.33

Page 20

Synthetic data (B)
Y = (1/2) (X_1 − a)² W,   W ~ N(0, 1),   a = 0, 0.5, 1.
X ~ N(0, I): 10 dim.   Sample size N = 500.

a     KDR           SIR           SAVE          pHd
0.0   0.17 ± 0.05   1.83 ± 0.22   0.30 ± 0.07   1.48 ± 0.27
0.5   0.17 ± 0.04   0.58 ± 0.19   0.35 ± 0.08   1.52 ± 0.28
1.0   0.18 ± 0.05   0.30 ± 0.08   0.57 ± 0.20   1.58 ± 0.28
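For reference, data set (B) can be generated as in the sketch below (a minimal illustration of the model as written above; the helper names, and fit_kdr from the sketch on page 17, are illustrative only).

import numpy as np

def make_data_B(N=500, m=10, a=0.5, seed=0):
    # X ~ N(0, I_m), W ~ N(0, 1), Y = 0.5 * (X_1 - a)^2 * W
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(N, m))
    W = rng.normal(size=N)
    Y = 0.5 * (X[:, 0] - a) ** 2 * W
    return X, Y.reshape(-1, 1)

X, Y = make_data_B()
# B_hat = fit_kdr(X, Y, d=1)   # the EDR space here is spanned by the first coordinate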

Page 21

KDR on real data: wine data
Data: 13 dim., 178 data points, 3 classes; 2-dim. projection.

[Figure: 2-dim. projections of the wine data by KDR (Gaussian kernel, σ = 30), CCA, Partial Least Squares, and Sliced Inverse Regression.]

Kernel: k(z_1, z_2) = exp( −‖z_1 − z_2‖² / σ² ).

Page 22

Swiss bank notes data
X: 6 dim. (measurements of each bank note). Y: binary (genuine/counterfeit). 100 counterfeit and 100 genuine notes.

[Figure: 2-dim. projections by KDR with Gaussian kernels (a = 10, 100, 10000), KDR with the linear kernel, and SAVE.]

Kernel: k(z_1, z_2) = exp( −‖z_1 − z_2‖² / a ).

Page 23

Summary
Positive definite kernels give a useful tool for dependence analysis.
– Covariance and conditional covariance operators on RKHS characterize independence and conditional independence.

Kernel dimension reduction for regression (KDR)
– The most general approach to dimension reduction.

Future/ongoing studies
– Choice of kernel: principled methods beyond heuristics.
– Choice of the dimensionality d for KDR.
– Further asymptotic properties of the KDR estimator.

Page 24

References
Fukumizu, K., F.R. Bach, and M.I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research 5(Jan):73–99, 2004.
Fukumizu, K., F.R. Bach, and M.I. Jordan. Kernel dimension reduction in regression. Technical Report 715, Dept. of Statistics, University of California, Berkeley, 2006.
Gretton, A., K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence. Advances in Neural Information Processing Systems 20:585–592, 2008.
Fukumizu, K., A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. Advances in Neural Information Processing Systems 20:489–496, 2008.
