Kernel Regression for Signals over Graphs

Arun Venkitaraman, Saikat Chatterjee, Peter Händel

School of Electrical Engineering and ACCESS Linnaeus Center
KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden

[email protected], [email protected], [email protected]

Abstract—We propose kernel regression for signals over graphs. The optimal regression coefficients are learnt using the assumption that the target vector is a smooth signal over an underlying graph. The condition is imposed using a graph-Laplacian based regularization. We discuss how the proposed kernel regression exhibits a smoothing effect, simultaneously achieving noise-reduction and graph-smoothness. We further extend the kernel regression to simultaneously learn the underlying graph and the regression coefficients. We validate our theory by application to various synthesized and real-world graph signals. Our experiments show that kernel regression over graphs outperforms conventional regression, particularly for small sized training data and under noisy training. We also observe that kernel regression reveals the structure of the underlying graph even with a small number of training samples.

Index Terms—Linear model, regression, kernels, machine learning, graph signal processing, graph-Laplacian.

EDICS: NEG-SPGR, NEG-ADLE, MLR-GRKN.

I. INTRODUCTION

Kernel regression constitutes one of the fundamental building blocks in supervised and semi-supervised learning strategies, be it for simple regression tasks or in more advanced settings such as support vector machines [1], Gaussian processes [2], deep neural networks [3], [4] and extreme learning machines [5], [6]. Kernel regression is a data-driven approach which works by projecting the input vector into a higher dimensional space using kernel functions and performs a linear regression in that space [7], [8]. The projection is achieved typically by using a nonlinear transformation. The target output y (for both scalar and vector cases) for a given test input x is then expressible as a linear combination of the kernels between the test input and the training inputs {x_i}:

y = \sum_i \rho_i \, k(x, x_i),

where k(·, ·) denotes the kernel function and the ρ_i's denote the coefficient vectors learnt from the training input-output pairs {(x_i, y_i)}. Given a sufficiently large number of reliable training samples, kernel regression leads to accurate target predictions. However, in many cases, such as medical experiments or epidemic studies, it might not be feasible to collect large amounts of clean data. Insufficient or noisy training data often leads to deterioration in the prediction performance of kernel regression, just as in the case of any data-driven approach. In such cases, incorporating additional structure or regularization during the training phase can boost performance. In this article, we propose kernel regression for signals over graphs, where it is assumed that target signals are smooth over graphs: we employ the graph signal structure to enhance the prediction performance of kernel regression under noisy and scarce data. Experiments with real-world data such as a temperature sensor network, electroencephalogram (EEG), atmospheric tracer diffusion, and functional MRI data demonstrate the potential of our approach.

The authors would like to acknowledge the support received from the Swedish Research Council.

Recently, graph signal processing has emerged as a promising area for incorporating graph-structural information in the analysis of vector signals [9], [10]. This is particularly relevant in contemporary applications that deal with data over networks or graphs. Several traditional signal processing methods have been extended to graphs [9]–[12], including many conventional spectral analysis concepts such as windowed Fourier transforms, filterbanks, multiresolution analysis, and wavelets [13]–[26]. Spectral clustering approaches based on graph signal filtering have been proposed recently [27], [28]. A number of graph-aware sampling schemes have also been developed [29]–[36]. Principal component analysis (PCA) techniques were proposed for graph signals by Shahid et al. for recovery of low-rank matrices [37], [38] and were shown to perform on par with nuclear-norm based approaches. Parametric dictionary learning approaches have demonstrated that graph-specific information can help to outperform standard approaches like the K-SVD at small sample sizes [39]–[43]. Statistical analysis of graph signals, particularly with respect to stationarity, has also been investigated [44]–[48]. Berger et al. [49] recently proposed a primal-dual based convex optimization approach for recovery of graph signals from a subset of nodes by minimizing the total variation while controlling the global empirical error. A number of approaches have also been proposed for learning the graph-Laplacian directly from data using different metrics [50]–[54].

Graph-based kernels have been investigated for labeling/coloring of graph nodes and graph clustering [55]–[62]. They deal with binary-valued signals or data over the nodes. Kernel-based reconstruction strategies for graph signals were proposed recently by Romero et al. in the framework of reproducing kernel Hilbert spaces [63], [64]. Using the notion of joint space-time graphs, Romero et al. have also proposed kernel-based reconstruction of graph signals and an extension of the Kalman filter for kernel-based learning [65], [66]. Recently, Shen et al. used kernels in structural equation models for identification of network topologies from graph signals [67]. Though these works employ kernel-based strategies, they


differ from our proposed approach by formulation. The prior works define kernels across nodes of the same graph, while considering a subset of the nodes to be the input. We consider that the target to be predicted is a vector signal lying over a graph, and the input signal is not necessarily associated with the underlying graph. Given a new input, we provide a prediction output which is a smooth graph signal.

A. Our contributions

We assume that the output of kernel regression and the target vector are smooth over a graph with a specified graph-Laplacian matrix. By noting that the input observations enter the regression model only in the form of inner-products, we develop kernel regression over graphs (KRG) using a dual representation or kernel trick. We also discuss how our approach may be applied to cases when the graph is not specified a priori. In this case, we jointly learn the graph-Laplacian and the regression coefficients by formulating an alternating minimization algorithm. In other words, we jointly perform the prediction for unseen inputs and the learning of the underlying graph topology. Experiments with both synthesized and real-world data confirm our hypothesis that a significant gain in prediction performance vis-à-vis conventional kernel regression can be obtained by regularizing the target vector to be a smooth signal over an underlying graph. Our experiments also include realistic noisy training scenarios with missing samples and large perturbations.

B. Signal processing over graphs

We next briefly review the concepts from graph signal processing. Let G = (V, E, A) denote a graph with M nodes indexed by the vertex set V = {1, · · · , M}. Let E and A denote the edge set containing pairs of nodes, and the weighted adjacency matrix, respectively. The (i, j)th entry of the adjacency matrix, A(i, j), denotes the strength of the edge between the ith and jth nodes. There exists an edge between the ith and jth nodes if A(i, j) > 0, and the edge pair (i, j) ∈ E if and only if A(i, j) ≠ 0. In our analysis, we consider only undirected graphs with symmetric edge-weights, or A = A^\top. The graph-Laplacian matrix L of G is defined as

L = D−A,

where D is the diagonal degree matrix whose ith diagonal element is given by the sum of the elements in the ith row of A. A vector x = [x(1) x(2) · · · x(M)]^\top ∈ R^M is said to be a graph signal if x(i) denotes the value of the signal at the ith node of G. The quadratic form of x with L is given by

x^\top L x = \sum_{(i,j) \in E} A(i, j)\,(x(i) - x(j))^2.

We observe that x^\top L x is minimized when the signal x takes the same value across all connected nodes, which agrees with the intuitive concept of a smooth signal. In general, a graph signal is said to be smooth or a low-frequency signal if it has similar values across connected nodes in a graph, and is said to be a high-frequency signal if it has dissimilar values across connected nodes, x^\top L x being the measure of

similarity. This motivates the use of x^\top L x as a constraint in applications where either the signal x or the graph-Laplacian L is to be estimated [51], [68]. The eigenvectors of L are referred to as the graph Fourier transform (GFT) basis for G, and the corresponding eigenvalues are referred to as the graph frequencies. The smaller eigenvalues (the smallest being zero by construction) are referred to as low frequencies, since the corresponding eigenvectors minimize the quadratic form of L and vary smoothly over the nodes. Similarly, the larger eigenvalues are referred to as the high frequencies. Then, a smooth graph signal is one which has GFT coefficients predominantly in the low graph frequencies.
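For concreteness, the following small sketch (Python with NumPy; the helper names are illustrative, not from the paper) builds the combinatorial Laplacian L = D - A and evaluates the smoothness measure x^\top L x on a toy graph.

```python
# Minimal sketch: graph Laplacian L = D - A and the smoothness measure x^T L x.
import numpy as np

def graph_laplacian(A):
    """Return L = D - A for a symmetric, non-negative adjacency matrix A."""
    D = np.diag(A.sum(axis=1))          # diagonal degree matrix
    return D - A

def smoothness(x, L):
    """Quadratic form x^T L x = sum over edges of A(i,j) (x_i - x_j)^2."""
    return float(x @ L @ x)

# Tiny 3-node chain graph: node 0 -- node 1 -- node 2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
L = graph_laplacian(A)
print(smoothness(np.array([1., 1., 1.]), L))   # 0.0, constant signal is smoothest
print(smoothness(np.array([1., -1., 1.]), L))  # 8.0, high-frequency signal
```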

II. KERNEL REGRESSION OVER GRAPHS

A. Linear basis model for regression over graphs

Let {x_n}_{n=1}^N denote the set of N input observations. Each x_n is paired with a target t_n ∈ R^M. Our goal is to model the target t_n with y_n given by:

y_n = W^\top \phi(x_n), \qquad (1)

where \phi(x_n) ∈ R^K is a function of x_n and W ∈ R^{K×M} denotes the regression coefficient matrix. Equation (1) is referred to as the linear basis model for regression, often shortened to linear regression (cf. Chapters 3 and 6 in [7]). For brevity, we hereafter follow this shortened nomenclature and refer to the outcome of using (1) as linear regression. Our central assumption is that the target y_n is a smooth signal over an underlying graph G with M nodes. We learn the optimal parameter matrix W by minimizing the following cost function:

C(W) = \sum_{n=1}^{N} \|t_n - y_n\|_2^2 + \alpha\,\mathrm{tr}(W^\top W) + \beta \sum_{n=1}^{N} y_n^\top L y_n, \qquad (2)

where the regularization coefficients α, β ≥ 0, tr(·) denotes the trace operator, and ‖x‖_2 the ℓ_2 norm of x. The cost in (2) is convex in W, since L is positive semidefinite by virtue of being the graph-Laplacian matrix. The choice of α and β depends on the problem, and in our analysis we compute these parameters through cross-validation. The penalty tr(W^\top W) = ‖W‖_F^2 ensures that W remains bounded. The penalty or regularization y_n^\top L y_n enforces y_n to be smooth over G. We define matrices T, Y and Φ as follows:

T = [t_1 \; t_2 \; \cdots \; t_N]^\top ∈ R^{N×M},
Y = [y_1 \; y_2 \; \cdots \; y_N]^\top ∈ R^{N×M},
Φ = [\phi(x_1) \; \phi(x_2) \; \cdots \; \phi(x_N)]^\top ∈ R^{N×K}. \qquad (3)

Using (1) and (3), the cost function (2) is expressible as (4), where we use properties of the matrix trace operation. Since the cost function is quadratic in W, we get the optimal and unique solution by setting the gradient of C with respect to W equal to zero. Using the following matrix derivative relations

\frac{\partial}{\partial W}\,\mathrm{tr}(M_1 W) = M_1^\top,
\frac{\partial}{\partial W}\,\mathrm{tr}(W^\top M_1 W M_2) = M_1^\top W M_2^\top + M_1 W M_2,

where M_1 and M_2 are matrices, and setting ∂C/∂W = 0, we get


C(W) = \sum_n \|t_n - W^\top \phi(x_n)\|_2^2 + \alpha\,\mathrm{tr}(W^\top W) + \beta \sum_n \phi(x_n)^\top W L W^\top \phi(x_n)
     = \sum_n \|t_n\|_2^2 - 2\,\mathrm{tr}\Big(\sum_n \phi(x_n)^\top W t_n\Big) + \mathrm{tr}\Big(\sum_n \phi(x_n)^\top W W^\top \phi(x_n)\Big) + \alpha\,\mathrm{tr}(W^\top W) + \beta\,\mathrm{tr}\Big(\sum_n \phi(x_n)^\top W L W^\top \phi(x_n)\Big)
     = \sum_n \|t_n\|_2^2 - 2\,\mathrm{tr}\Big(\sum_n t_n \phi(x_n)^\top W\Big) + \mathrm{tr}\Big(W W^\top \sum_n \phi(x_n)\phi(x_n)^\top\Big) + \alpha\,\mathrm{tr}(W^\top W) + \beta\,\mathrm{tr}\Big(W L W^\top \sum_n \phi(x_n)\phi(x_n)^\top\Big)
     = \sum_n \|t_n\|_2^2 - 2\,\mathrm{tr}(T^\top \Phi W) + \mathrm{tr}(W^\top \Phi^\top \Phi W) + \alpha\,\mathrm{tr}(W^\top W) + \beta\,\mathrm{tr}(W^\top \Phi^\top \Phi W L). \qquad (4)

that
-\Phi^\top T + \Phi^\top \Phi W + \alpha W + \beta\, \Phi^\top \Phi W L = 0,
or,
(\Phi^\top \Phi + \alpha I_K)\, W + \beta\, \Phi^\top \Phi W L = \Phi^\top T. \qquad (5)

On vectorizing both sides of (5), we have that

vec(\Phi^\top T) = \big[(I_M \otimes (\Phi^\top \Phi + \alpha I_K)) + (\beta L \otimes \Phi^\top \Phi)\big]\, vec(W),

where vec(·) denotes the standard vectorization operator and ⊗ denotes the Kronecker product operation [69]. Then, the optimal W, denoted by W^\star, follows the relation:

vec(W^\star) = \big[(I_M \otimes (\Phi^\top \Phi + \alpha I_K)) + (\beta L \otimes \Phi^\top \Phi)\big]^{-1} vec(\Phi^\top T).

The predicted target for a new input x is then given by
t^\star = W^{\star\top} \phi(x). \qquad (6)

From (6), it appears that the proposed target prediction approach requires explicit knowledge of the function φ(·). We next show that, using the 'kernel trick', this explicit requirement of φ(·) is circumvented and that the target prediction may be done using only the knowledge of the inner products φ(x_m)^\top φ(x_n), ∀m, n, in other words, using the matrix ΦΦ^\top. Towards this end, we next discuss a dual representation of the cost in (2). We hereafter refer to (6) as the output of linear regression over graphs (LRG).
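Before moving to the dual formulation, the closed-form solution (5)-(6) can be implemented directly. The following is a minimal NumPy sketch of the LRG estimate; the function names (fit_lrg, predict_lrg) and the generic feature matrix Phi are illustrative choices, not the authors' code.

```python
# Sketch of LRG: solve [(I_M ⊗ (Φ^T Φ + α I_K)) + β (L ⊗ Φ^T Φ)] vec(W) = vec(Φ^T T).
import numpy as np

def fit_lrg(Phi, T, L, alpha, beta):
    """Phi: N x K feature matrix, T: N x M targets, L: M x M graph Laplacian."""
    N, K = Phi.shape
    M = T.shape[1]
    G = Phi.T @ Phi                                        # Φ^T Φ  (K x K)
    A = np.kron(np.eye(M), G + alpha * np.eye(K)) \
        + beta * np.kron(L, G)                             # KM x KM system matrix
    rhs = (Phi.T @ T).reshape(-1, order='F')               # vec(Φ^T T), column stacking
    w = np.linalg.solve(A, rhs)
    return w.reshape(K, M, order='F')                      # W (K x M)

def predict_lrg(W, phi_x):
    """Predicted target t = W^T φ(x) for a new input's feature vector."""
    return W.T @ phi_x
```

Column-major (Fortran-order) reshaping is used so that NumPy's reshape matches the vec(·) operator assumed in the Kronecker identities above.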

B. Dual representation of cost using kernel trick

We now use the substitution W = Φ^\top Ψ and express the cost function in terms of the parameter Ψ ∈ R^{N×M}. This substitution is motivated by observing that, on rearranging the terms in (5), we get that

W = Φ^\top Ψ,

where Ψ = \frac{1}{\alpha}\,(T - \Phi W - \beta\, \Phi W L). On substituting W = Φ^\top Ψ in (4), where Ψ becomes the dual parameter matrix that we wish to learn, and omitting terms that do not depend on W, we get that

C(\Psi) = -2\,\mathrm{tr}(T^\top \Phi\Phi^\top \Psi) + \mathrm{tr}(\Psi^\top \Phi\Phi^\top \Phi\Phi^\top \Psi) + \alpha\,\mathrm{tr}(\Psi^\top \Phi\Phi^\top \Psi) + \beta\,\mathrm{tr}(\Psi^\top \Phi\Phi^\top \Phi\Phi^\top \Psi L)
       = -2\,\mathrm{tr}(T^\top K \Psi) + \mathrm{tr}(\Psi^\top K K \Psi) + \alpha\,\mathrm{tr}(\Psi^\top K \Psi) + \beta\,\mathrm{tr}(\Psi^\top K K \Psi L), \qquad (7)

where K = ΦΦ^\top ∈ R^{N×N} denotes the kernel matrix for the training samples such that its (m, n)th entry is given by

k_{m,n} = \phi(x_m)^\top \phi(x_n).

(7) is referred to as a dual representation of (4) in the kernel regression literature (cf. Chapter 6 of [7]). Taking the derivative of C(Ψ) with respect to Ψ and setting it to zero, we get that

(I_M \otimes (K + \alpha I_N))\,vec(\Psi) + (\beta L \otimes K)\,vec(\Psi) = vec(T), or
\big[(I_M \otimes (K + \alpha I_N)) + (\beta L \otimes K)\big]\, vec(\Psi) = vec(T), or
vec(\Psi) = \big[(I_M \otimes (K + \alpha I_N)) + (\beta L \otimes K)\big]^{-1} vec(T).

We define the matrices

B = (I_M \otimes (K + \alpha I_N)), \quad C = (\beta L \otimes K). \qquad (8)

Then, we have that

vec(\Psi) = (B + C)^{-1} vec(T). \qquad (9)

Once Ψ is computed, the kernel regression output for a new test input x is given by

y = W^\top \phi(x) = \Psi^\top \Phi\, \phi(x) = \Psi^\top k(x)
  = \big(\mathrm{mat}\big((B + C)^{-1} vec(T)\big)\big)^\top k(x), \qquad (10)

where k(x) = [k_1(x), k_2(x), · · · , k_N(x)]^\top and k_n(x) = φ(x_n)^\top φ(x). mat(·) denotes the reshaping operation of an argument vector into an appropriate matrix of size N × M by concatenating subsequent N-length sections as columns. In general, a variety of valid kernel functions may be employed in kernel regression. Any kernel function k(x, x′) is a valid kernel as long as it expresses an association between x and x′ and the kernel matrix K is positive semi-definite for all observation sizes [7]. The Gaussian kernel is one such popularly employed kernel, which we use in our experiments. We hereafter refer to (10) as the output of kernel regression over graphs (KRG). We note that KRG is a generalization of conventional kernel regression, which does not use any knowledge of the underlying graph structure. On setting


β = 0, the KRG output (10) reduces to the conventional KR output as follows:

y = \big(\mathrm{mat}\big((B + C)^{-1} vec(T)\big)\big)^\top k(x)
  = \big(\mathrm{mat}\big(B^{-1} vec(T)\big)\big)^\top k(x)
  = \big(\mathrm{mat}\big((I_M \otimes (K + \alpha I_N))^{-1} vec(T)\big)\big)^\top k(x)
  = \big(\mathrm{mat}\big((I_M^{-1} \otimes (K + \alpha I_N)^{-1})\, vec(T)\big)\big)^\top k(x)
  = \big((K + \alpha I_N)^{-1} T\big)^\top k(x)
  = T^\top (K + \alpha I_N)^{-1} k(x), \qquad (11)

where we have used the Kronecker product equality vec((K + \alpha I_N)^{-1} T) = (I_M^{-1} \otimes (K + \alpha I_N)^{-1})\, vec(T). Further, we note that KRG reduces to LRG on setting K = ΦΦ^\top.
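The KRG solution (9)-(10) and its β = 0 reduction (11) translate directly into code. The sketch below uses illustrative names (fit_krg, predict_krg, rbf_kernel), not the authors' implementation; with beta = 0 its output coincides with conventional kernel regression T^\top (K + α I_N)^{-1} k(x).

```python
# Sketch of KRG: vec(Ψ) = [(I_M ⊗ (K + α I_N)) + β (L ⊗ K)]^{-1} vec(T), y = Ψ^T k(x).
import numpy as np

def rbf_kernel(X, Y, sigma2):
    """Pairwise RBF kernel matrix, k(x, x') = exp(-||x - x'||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def fit_krg(K, T, L, alpha, beta):
    """K: N x N training kernel matrix, T: N x M targets, L: M x M Laplacian."""
    N, M = T.shape
    B = np.kron(np.eye(M), K + alpha * np.eye(N))
    C = beta * np.kron(L, K)
    psi = np.linalg.solve(B + C, T.reshape(-1, order='F'))  # vec(Ψ)
    return psi.reshape(N, M, order='F')                     # Ψ (N x M)

def predict_krg(Psi, k_x):
    """y = Ψ^T k(x), with k(x) = [k(x_1, x), ..., k(x_N, x)]^T."""
    return Psi.T @ k_x
```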

C. Interpretation of KRG – a smoothing effect

We next discuss how the output of KRG for a training input is smooth across the training samples {1, · · · , N} and over the M nodes of the graph. Before proceeding with KRG, we review a similar property exhibited by the KR output [8]. Using (11) and concatenating the KR outputs for the N training samples, we get that

[y_1 \; y_2 \; \ldots \; y_N] = T^\top (K + \alpha I_N)^{-1} [k(x_1) \; k(x_2) \; \ldots \; k(x_N)],
or, \; Y = K (K + \alpha I_N)^{-1} T,

where we use K = [k(x_1) \; k(x_2) \; \ldots \; k(x_N)]. Assuming K is diagonalizable, let K = U J_K U^\top, where J_K and U denote the eigenvalue and eigenvector matrices of K, respectively. Then, the vector of mth (m ∈ {1, · · · , M}) components of y collected over all training samples, y(m) = [y_1(m), y_2(m), · · · , y_N(m)]^\top, given by the mth column of Y, is equal to

y(m) = \sum_{i=1}^{N} \frac{\theta_i}{\theta_i + \alpha}\, u_i u_i^\top t(m),

where θ_i and u_i denote the ith eigenvalue and eigenvector of K, respectively, and t(m) denotes the vector containing the mth component of the target vector for all training samples (the mth column of T). Thus, we observe that the KR output performs a shrinkage of t(m) along the various eigenvector directions for each m. The contributions from the eigenvectors corresponding to the smaller eigenvalues θ_i < α are effectively suppressed, and only those corresponding to the larger eigenvalues θ_i > α are retained. Since the eigenvectors corresponding to the larger eigenvalues of K represent smooth variations across observations {1, · · · , N}, we observe that KR performs a smoothing of t_n. We next show that such is also the case with KRG: KRG acts as a smoothing filter across both observations and graph nodes. Using (10) and concatenating the KRG outputs, we have that

[y_1 \; y_2 \; \ldots \; y_N] = \Psi^\top [k(x_1) \; k(x_2) \; \ldots \; k(x_N)], \qquad (12)
or, \; Y = K \Psi.

On vectorizing both sides of (12), we get that

vec(Y) = (I_M \otimes K)\, vec(\Psi) \overset{(a)}{=} (I_M \otimes K)\,(B + C)^{-1} vec(T), \qquad (13)

where we have used (9) in (a). Let L be diagonalizable with the eigendecomposition:

L = V J_L V^\top,

where J_L and V denote the eigenvalue and eigenvector matrices of L, respectively. We also assume that K is diagonalizable as earlier. Let λ_i and v_i denote the ith eigenvalue and eigenvector of L, respectively. Then, we have that

V = [v_1 \; v_2 \; \cdots \; v_M] \;\text{and}\; J_L = \mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_M),
U = [u_1 \; u_2 \; \cdots \; u_N] \;\text{and}\; J_K = \mathrm{diag}(\theta_1, \theta_2, \cdots, \theta_N).

Now, using (8), we have

B + C = (I_M \otimes (K + \alpha I_N)) + (\beta L \otimes K)
      = [(V I_M V^\top) \otimes (U (J_K + \alpha I_N) U^\top)] + [\beta (V J_L V^\top) \otimes (U J_K U^\top)]
      \overset{(a)}{=} [(V \otimes U)(I_M \otimes (J_K + \alpha I_N))(V^\top \otimes U^\top)] + [\beta (V \otimes U)(J_L \otimes J_K)(V^\top \otimes U^\top)]
      = Z J Z^\top, \qquad (14)

where Z = V ⊗ U is the eigenvector matrix and J is the diagonal eigenvalue matrix given by

J = (I_M \otimes (J_K + \alpha I_N)) + \beta (J_L \otimes J_K). \qquad (15)

In (14)(a), we have used the mixed-product property of the Kronecker product: (M_1 ⊗ M_2)(M_3 ⊗ M_4) = M_1 M_3 ⊗ M_2 M_4, where {M_i}_{i=1}^4 are four matrices. We note that J is a diagonal matrix of size MN × MN. Let J = diag(η_1, η_2, . . . , η_{MN}). Then each η_i is a function of {λ_j}, {θ_j}, α, and β. On dropping the subscripts for simplicity, we observe that any eigenvalue η_i has the form

\eta = (\theta + \alpha) + \beta\,\lambda\theta,

where θ and λ are the appropriate eigenvalues of K and L, respectively. Similarly, we have that

(I_M \otimes K) = (V I_M V^\top) \otimes (U J_K U^\top) = (V \otimes U)(I_M \otimes J_K)(V^\top \otimes U^\top) = Z (I_M \otimes J_K) Z^\top, \qquad (16)

and note that (I_M ⊗ J_K) is also a diagonal matrix of size MN × MN. Then, on substituting (15) and (16) in (13), we get that

vec(Y) = (I_M \otimes K)\,(B + C)^{-1} vec(T)
       = (Z (I_M \otimes J_K) Z^\top)(Z J^{-1} Z^\top)\, vec(T)
       = (Z (I_M \otimes J_K) J^{-1} Z^\top)\, vec(T). \qquad (17)

We note again that (I_M ⊗ J_K) J^{-1} is a diagonal matrix of size MN × MN. Let (I_M ⊗ J_K) J^{-1} = diag(ζ_1, ζ_2, . . . , ζ_{MN}). Then, on dropping the subscripts, any ζ_i is of the form

\zeta = \frac{\theta}{\eta} = \frac{\theta}{(\theta + \alpha) + \beta\,\lambda\theta}.

From (17), we have that

vec(Y) = \sum_{i=1}^{MN} \zeta_i\, z_i z_i^\top\, vec(T),


where the z_i are the column vectors of Z. In the case when ζ_i ≪ 1, the component in vec(T) along z_i is effectively eliminated. For most covariance or kernel functions k(·, ·) used in practice, the eigenvectors corresponding to the largest eigenvalues of K correspond to the low-frequency components. Similarly, the eigenvectors corresponding to the smaller eigenvalues of L are smooth over the graph [9]. We observe that the condition ζ ≪ 1 is achieved when the corresponding θ is small and/or λ is large. This in turn corresponds to effectively retaining only the components in vec(T) which vary smoothly across the samples {1, · · · , N} as well as over the M nodes of the graph.
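The shrinkage factors ζ can be inspected numerically. The snippet below uses assumed, illustrative eigenvalues of K and L (not values from the paper) to show how a small θ and/or a large λ drives ζ towards zero.

```python
# Illustrative check of the KRG shrinkage factors ζ = θ / ((θ + α) + β λ θ).
import numpy as np

theta = np.array([5.0, 1.0, 0.1])   # assumed eigenvalues of K (sample domain)
lam = np.array([0.0, 0.5, 2.0])     # assumed eigenvalues of L (graph frequencies)
alpha, beta = 0.1, 1.0

# zeta[j, k] corresponds to graph frequency lam[j] and kernel eigenvalue theta[k]
zeta = theta[None, :] / ((theta[None, :] + alpha) + beta * lam[:, None] * theta[None, :])
print(np.round(zeta, 3))            # small theta or large lam gives zeta << 1
```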

D. Learning an underlying graph

In developing KRG, we have so far assumed that the underlying graph is known a priori in terms of the graph-Laplacian matrix L. Such an assumption may not hold for many practical applications, since there is not necessarily one best graph for given networked data. This motivates us to develop a joint learning approach where we learn L and the KRG parameter matrix W (or its dual representation parameter Ψ). Our goal in this section is to provide a simple means of estimating the graph that helps enhance the prediction performance if a graph is not known a priori. We note that a vast and expanding body of literature exists in the domain of estimating graphs from graph signals [50]–[54], [68], [70]–[73], and any of these approaches may be used as an alternative to the learning approach proposed in this section. Nevertheless, we pursue the approach taken in this section due to its ease of implementation and reasonable prediction performance, as demonstrated by experiments.

We propose the minimization of the following cost function to achieve our goal:

C(W, L) = \sum_n \|t_n - y_n\|_2^2 + \alpha\,\mathrm{tr}(W^\top W) + \beta \sum_n y_n^\top L y_n + \nu\,\mathrm{tr}(L^\top L),

where ν ≥ 0. Since our goal is to recover an undirected graph, we impose appropriate constraints [74]. Firstly, any non-trivial graph has a graph-Laplacian matrix L which is symmetric and positive semi-definite [74]. Secondly, the vector of all ones, 1, forms the eigenvector of the graph-Laplacian corresponding to the zero eigenvalue. Since L = D − A, the Laplacian being positive semi-definite is equivalent to the constraint that all the off-diagonal elements of L are non-positive, which is a simpler constraint than the direct positive semi-definiteness constraint. Then, the solution to the joint estimation of W and L is obtained by solving the following:

\min_{W, L}\; C(W, L) \;\; \text{such that} \;\; L(i, j) \le 0 \;\; \forall\, i \ne j, \;\; L = L^\top, \;\; L\mathbf{1} = 0. \qquad (18)

The optimization problem (18) is jointly non-convex over W and L, but convex in W given L and vice-versa. Hence, we adopt an alternating minimization approach and solve (18) in two steps, alternating as follows:

• For a given L, solve \min_W C(W) using the KRG approach of Section II.
• For a given W, solve \min_L C(L) such that L(i, j) ≤ 0 ∀ i ≠ j, L = L^\top, and L1 = 0. Here C(L) = \beta \sum_n y_n^\top L y_n + \nu\,\mathrm{tr}(L^\top L).

We start the alternating optimization using the initialization L = 0, which corresponds to KR. In order to keep the successive L estimates comparable, we scale L such that the largest eigenvalue modulus is unity at each iteration.
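A hedged sketch of the L-update step is given below, using CVXPY as one possible convex solver (the paper does not specify an implementation). Given the current KRG outputs Y, it minimizes β tr(Y L Y^\top) + ν ‖L‖_F^2 under the symmetry, non-positive off-diagonal, and L1 = 0 constraints, and then rescales L so that its largest eigenvalue modulus is one; the function name update_laplacian is ours.

```python
# Sketch of the Laplacian step of the alternating minimization (Sec. II-D),
# assuming nu > 0 so the problem is bounded. Not the authors' implementation.
import numpy as np
import cvxpy as cp

def update_laplacian(Y, beta, nu):
    """Y: N x M matrix of current KRG outputs (rows y_n^T)."""
    M = Y.shape[1]
    L = cp.Variable((M, M), symmetric=True)
    # beta * sum_n y_n^T L y_n = beta * tr(Y L Y^T); nu * tr(L^T L) = nu * ||L||_F^2
    objective = cp.Minimize(beta * cp.trace(Y @ L @ Y.T) + nu * cp.sum_squares(L))
    off_diag = L - cp.multiply(np.eye(M), L)          # off-diagonal part of L
    constraints = [off_diag <= 0, L @ np.ones(M) == 0]
    cp.Problem(objective, constraints).solve()
    L_hat = L.value
    # Scale so that the largest eigenvalue modulus is unity (assumes L_hat != 0).
    return L_hat / np.max(np.abs(np.linalg.eigvalsh(L_hat)))
```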

III. EXPERIMENTS

We consider the application of KRG to various synthesized and real-world signal examples. We use the normalized-mean-square-error (NMSE) as the performance measure:

\mathrm{NMSE} = 10 \log_{10} \left( \frac{E\,\|Y - T_0\|_F^2}{E\,\|T_0\|_F^2} \right),

where Y denotes the regression output matrix and T_0 the true value of the target matrix, that is, T_0 does not contain any noise. The expectation operation E(·) in the performance measure is realized as a sample average. In the case of real-world examples, we compare the performance of the following four approaches:

1) Linear regression (LR): k_{m,n} = φ(x_m)^\top φ(x_n), where φ(x) = x and β = 0,
2) Linear regression over graphs (LRG): k_{m,n} = φ(x_m)^\top φ(x_n) and β ≠ 0, where φ(x) = x,
3) Kernel regression (KR): using the radial basis function (RBF) kernel k_{m,n} = \exp\left(-\frac{\|x_m - x_n\|_2^2}{\sigma^2}\right) and β = 0, and
4) Kernel regression over graphs (KRG): using the RBF kernel k_{m,n} = \exp\left(-\frac{\|x_m - x_n\|_2^2}{\sigma^2}\right) and β ≠ 0.

The regularization parameters α, β, and ν for different training data sizes are found by five-fold cross-validation on the training set. While performing cross-validation, we assume that clean target vectors are available for the validation set. The RBF kernel parameter σ is set as \sigma^2 = \frac{1}{N} \sum_{m',n'} \|x_{m'} - x_{n'}\|_2^2 in our experiments, where the summation is made over the training inputs. This choice of σ is not necessarily optimized for the training datasets.
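For reference, the NMSE measure and the kernel-width heuristic above can be computed as follows. This is an illustrative helper (our naming); the expectation in the NMSE is replaced by the Frobenius norms of a single realization, to be averaged over realizations externally.

```python
# Helpers for the experimental setup: NMSE in dB and the RBF width heuristic.
import numpy as np

def nmse_db(Y, T0):
    """10 log10( ||Y - T0||_F^2 / ||T0||_F^2 ) for one realization."""
    return 10.0 * np.log10(np.sum((Y - T0) ** 2) / np.sum(T0 ** 2))

def rbf_sigma2(X):
    """sigma^2 = (1/N) * sum_{m',n'} ||x_m' - x_n'||^2 over the N training inputs,
    following the normalization stated in the text."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return d2.sum() / X.shape[0]
```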

A. Regression for synthesized data

We perform regression for synthesized data where the target vectors are known to be smooth over a specified graph, and the various target vectors are correlated with each other. A part of the data is used for training in the presence of additive noise, and our task is to predict the remaining part of the data, given the information about the correlations in the form of kernels. In order to generate synthesized data, we use random small-world graphs from the Erdős-Rényi and Barabási-Albert models [75] with the number of nodes M = 50. We generate a total of S target vector realizations. We adopt the following data generation strategy: We first pick M independent vector realizations from an S-dimensional Gaussian vector


Fig. 1. Results for synthesized data using Erdős-Rényi graphs. NMSE at different SNR levels with training data size of N = 50 for (a) training data, and (b) test data.

source N(0, C_S), where C_S is an S-dimensional covariance matrix drawn from an inverse Wishart distribution with an identity hyperparameter matrix. We use a highly correlated covariance matrix C_S such that each S-dimensional vector has strongly correlated components. Thus, we create a data matrix with M columns such that each column is an S-dimensional Gaussian vector. Each row vector of the data matrix has size M, and the row vectors of the data matrix are correlated with each other. We denote a row vector by r. We then select the row vectors {r} one-by-one and project them onto the specified graph to generate target vectors {t} that are smooth over the graph while maintaining the correlation between observations, by solving the following optimization problem:

t = \arg\min_{z} \left\{ \|r - z\|_2^2 + z^\top L z \right\}.
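Setting the gradient of the objective above to zero gives the closed-form projection t = (I + L)^{-1} r, which is one way the smooth targets can be generated in practice; the helper name below is ours, not from the paper.

```python
# Closed-form smooth projection of each row r of R onto the graph: t = (I + L)^{-1} r.
import numpy as np

def generate_smooth_targets(R, L):
    """R: S x M matrix of row vectors r; L: M x M graph Laplacian."""
    M = L.shape[0]
    # Solve (I + L) t = r for every row of R at once (columns of R.T).
    return np.linalg.solve(np.eye(M) + L, R.T).T   # S x M matrix of targets t
```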

We randomly divide the S data samples into training and test sets of equal size N_tr = N_ts = S/2. We define the kernel function between the ith and jth data samples z_i and z_j to be
k_{i,j} \triangleq C_S(i, j),

considering the same kernel for all the graph nodes. The choice of the kernel is motivated by the assumed generating model. Given the training set of size N_tr, we choose a subset of N data samples to make predictions for the N_ts test data samples using kernel regression over graphs. The training target vectors are corrupted with additive white Gaussian noise at varying levels of signal-to-noise ratio (SNR). We repeat our experiments over 100 realizations of the graphs and noise. We compare the performance of KRG with KR. We observe from Figure 1 that, for a fixed training data size of N = 50, KRG outperforms KR by a significant margin at low SNR levels (below 10 dB). As the SNR level increases, the NMSE of KRG and KR almost coincide. This is expected, since KR is able to make better predictions due to the availability of cleaner training observations. A similar trend is observed also for the case of Barabási-Albert graphs, which is not reported for brevity. In Figure 2, we show the NMSE obtained with KRG and KR on both graph models as a function of training data size at an SNR level of 5 dB. We observe that KRG consistently outperforms KR and that the gap between the NMSE of KRG and KR reduces as the training data size increases, as expected. The trend is similar for larger graphs as well and is not reported here to avoid repetition.


Fig. 2. Results for synthesized data using NMSE at 5 dB SNR level: (a) for training data, and (b) for test data for Erdős-Rényi graphs. NMSE at 5 dB SNR level: (c) for training data, and (d) for test data for Barabási-Albert graphs.

B. Temperature prediction for cities in Sweden

We collect temperature measurements from the 45 most populated cities in Sweden for a period of three months from September to November 2017. The data is publicly available from the Swedish Meteorological and Hydrological Institute [76]. We consider the target vector t_n to be the temperature measurements of a particular day, and x_n to be the temperature measurements (in degrees Celsius) from the previous day. We have 90 input-target data pairs in total, divided into a training set and a test set of sizes N_tr = 60 and N_ts = 30, respectively. Let d_{ij} denote the geodesic distance between cities i and j in kilometres, ∀ i, j ∈ {1, · · · , 45}. We construct the adjacency matrix by setting

A(i, j) = \exp\left( -\frac{d_{ij}^2}{\sum_{i,j} d_{ij}^2} \right).

In order to remove self-loops, the diagonal of A is set to zero. We simulate noisy training data by adding zero-mean white Gaussian noise to the targets. For each N, we compute the NMSE by averaging over 50 different random training subsets of size N drawn from the full training set of size N_tr. In Figure 3, we show the NMSE for the test set at SNR levels of 5 dB and 0 dB. We observe that KRG outperforms the other regression methods by a significant margin, particularly at low sample sizes N. In order to simulate a more realistic noisy training scenario, we further consider two different cases of noisy training: missing data and large perturbations. In the missing data experiment, we zero out the target values at five randomly chosen nodes for each training sample and apply the various regression approaches. In the perturbation experiment, we choose five of the nodes randomly for each training sample and perturb the signal values to five times their original values. The NMSE obtained for LR, LRG, KR, and KRG as a function of N is shown in Figures 3(d)-(e). Thus, we observe that the graph structure improves the robustness of regression in terms of prediction performance for LR and KR.
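An illustrative construction of this geodesic-distance adjacency (not the authors' code) is given below; it mirrors the formula above and zeroes the diagonal to remove self-loops.

```python
# Geodesic-distance adjacency for the temperature graph: A(i,j) = exp(-d_ij^2 / sum d^2).
import numpy as np

def geodesic_adjacency(D):
    """D: M x M matrix of pairwise geodesic distances between cities (km)."""
    A = np.exp(-D ** 2 / np.sum(D ** 2))
    np.fill_diagonal(A, 0.0)       # remove self-loops
    return A
```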


Fig. 3. Results for the temperature data of 45 cities in Sweden. (a) Map with cities, (b) true temperature measurement signal for a particular day in the dataset, shown in units of degrees Celsius (the graph is the same as that in subplot (a) but without the underlying map), (c) NMSE for test data with additive white Gaussian noise at 5 dB SNR level, (d) NMSE for test data with additive white Gaussian noise at 0 dB SNR level, (e) NMSE for test data with missing samples, and (f) NMSE for test data with large perturbations.


Fig. 4. Results with graph learning for the temperature data of cities in Sweden, for training data samples with additive white Gaussian noise at 5 dB SNR level for N = 45. (a) NMSE for test data, (b) geodesic Laplacian, (c) estimated Laplacian from LRG, (d) estimated Laplacian from KRG.

We also consider the joint learning of the graph from training data using the approach described in Section II-D. We perform the training at an SNR level of 5 dB. We find experimentally that the alternating optimization based graph learning algorithm typically converges after five to ten iterations. The NMSE for training and test data obtained using the geodesic L and the iteratively learnt L, for both LRG and KRG, is shown in Figure 4 for N = 45. The results show that our alternating optimization based graph learning algorithm is useful in the case when we do not know the underlying graph a priori. We also observe that the estimated Laplacian matrices from LRG and KRG are close to each other and are in reasonable agreement with that of the geodesic graph. This validates our intuition that the graph signal holds sufficient information to both infer the underlying graph structure and perform predictions.

C. Prediction on EEG signals

We repeat the prediction experiment on 64-channel EEG signal data taken from the Physionet database [77]. The signals are obtained by placing electrodes at various locations on a subject's head, as shown in the schematic in Figure 5(a). At each time instant, we consider a subset of 16 nodes as the input, and the target graph signal to be the M = 48-dimensional vector with components given by the signal values at the remaining electrodes. Our goal is then to predict the signal at the remaining M electrodes given the signal at the input electrodes. Since the graph is not given a priori, we construct the graph for the M electrodes from a part of the noiseless data using Gaussian kernels as follows:

A(k, l) = \exp\left( -\frac{\|r_k - r_l\|_2^2}{\sum_{k,l} \|r_k - r_l\|_2^2} \right) \;\; \forall\, k, l \in \{1, \cdots, M\},


Fig. 5. Results for EEG data. (a) Schematic showing positions of the 64 electrode channels, (b) NMSE for test data at 5 dB SNR level, (c) NMSE for test data at 0 dB SNR level.


Fig. 6. Results for ETEX data. (a) Schematic showing positions of the ground stations, (b) NMSE for test data at 5 dB SNR level, (c) NMSE for test data at 0 dB SNR level, (d) NMSE with estimated Laplacian matrices at 5 dB SNR.

where r_l denotes the vector with the first 1000 time samples of the lth electrode. Thus, we use the first 1000 time samples to generate the diffusion or Gaussian kernel graph for the electrodes. Since EEG signals are highly nonstationary [78], we consider the dataset corresponding to 20 seconds, consisting of approximately 256 input-target data samples, divided randomly into training and test data of 128 samples each. We then consider training with noisy targets corrupted with additive white Gaussian noise at SNR levels of 0 dB and 5 dB. The NMSE is computed by averaging over 50 different random training subsets of size N drawn from the full training set of size N_tr = 128. In Figure 5, we show the NMSE for LR, LRG, KR, and KRG for different training data sizes. We observe once again that LRG and KRG consistently outperform LR and KR, respectively, particularly at small training data sizes.
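A hedged sketch of this electrode-graph construction follows; the helper name is ours, and zeroing the diagonal (no self-loops) is assumed by analogy with the temperature graph of Section III-B.

```python
# Gaussian-kernel adjacency for the EEG electrode graph, built from the first
# 1000 time samples r_l of each electrode.
import numpy as np

def eeg_adjacency(R):
    """R: M x 1000 array, row l holds the first 1000 samples of electrode l."""
    d2 = ((R[:, None, :] - R[None, :, :]) ** 2).sum(-1)   # pairwise ||r_k - r_l||^2
    A = np.exp(-d2 / d2.sum())
    np.fill_diagonal(A, 0.0)                               # no self-loops (assumed)
    return A
```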

D. Prediction for atmospheric tracer diffusion data

We next perform prediction experiments on the European Tracer Experiment (ETEX) data, which consists of measurements from the diffusion of perfluorocarbon tracers released into the atmosphere starting from a fixed location (Rennes in France) [79]. This is related to environmental pollution. The observations are made at 168 ground stations over Europe and consist of two experiments carried out over 72 hours, giving 30 measurements each. We consider the problem of predicting the tracer concentrations at one half (84) of the total ground stations, given the concentrations at the remaining locations. The target signal t_n is then a graph signal over a graph of M = 84 nodes, the input vector x_n also being of the same dimension, where n denotes the measurement index. A schematic diagram of the ground station locations is shown in Figure 6(a). The red markers with edges correspond to the output nodes, whereas the rest of the markers denote the input.


Fig. 7. Results for cerebellum data. (a) NMSE for test data at 10 dB SNR level, (b) NMSE for test data at 0 dB SNR level, (c) NMSE for test data with learnt Laplacian at 10 dB SNR level.

We generate noisy training samples by adding white Gaussian noise at different SNR levels. As in the case of the temperature data considered in Section III-B, we consider the geodesic distance based graph. We randomly divide the total dataset of 60 samples equally into training and test datasets. We compute the NMSE for the different approaches by averaging over 50 different randomly drawn training subsets of size N from the full training set of size N_tr = 30. We report the NMSE as a function of N in Figures 6(b)-(c), which show that the graph structure enhances the prediction performance under noisy and low sample size conditions. We also perform LRG and KRG with graph learning, and the NMSE is shown in Figure 6(d). The results again show that our alternating optimization based graph learning algorithm is useful in the absence of a priori knowledge about the underlying graph.

E. Prediction for fMRI data of the cerebellum region

Finally, we consider the prediction of voxel intensities in the functional magnetic resonance imaging (fMRI) data obtained for the cerebellum region of the brain. The data is publicly available at https://openfmri.org/dataset/ds000102. Each voxel is considered as a node of the graph and its intensity as the signal. The original data graph is of dimension 4000, obtained by mapping the 4000 cerebellum voxels anatomically following the atlas template [80], [81]. In our analysis, we consider the first 100 of the 4000 voxels for ease of experimentation in terms of computational complexity. We use the intensity values at the first 10 voxels as the input x ∈ R^{10} to make predictions for the output t ∈ R^{90} comprising the voxel intensities at the remaining 90 voxels. The dataset consists of 290 input-target data points. We use one half of the data for training and the other half for testing. We construct noisy training data at signal-to-noise ratio (SNR) levels of 10 dB and 0 dB. For the test data, the NMSE averaged over 50 different random selections of training data of size N from the full training dataset is reported in Figures 7(a)-(c). We observe that LRG and KRG have superior performance over their conventional counterparts, particularly for small N and low SNR levels. We also see that the algorithm for estimating the underlying graph provides reasonable performance.

IV. CONCLUSIONS

We conclude that the extension of standard kernel regression with the inclusion of graph-Laplacian based regularization is analytically tractable. The development of kernel regression over graphs is useful for several practical signals and systems. Our experiments show that it is particularly beneficial for applications where the available training data is limited in size and noisy. We have also seen that it is possible to learn the underlying graph from data. A practical limitation of the proposed approach, shared with all kernel-based approaches, is that it involves the inversion of a large system matrix.

V. REPRODUCIBLE RESEARCH

In the spirit of reproducible research, all the codes relevant to the experiments in this article are made available at https://www.researchgate.net/profile/Arun_Venkitaraman and https://www.kth.se/ise/research/reproducibleresearch-1.433797.

REFERENCES

[1] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[2] M. Seeger, "Gaussian processes for machine learning," Int. J. Neural Syst., vol. 14, no. 02, pp. 69–106, 2004.
[3] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[4] Y. Cho and L. K. Saul, "Kernel methods for deep learning," in Adv. Neural Inform. Process. Syst. Curran Associates, Inc., 2009, pp. 342–350.
[5] G.-B. Huang, "An insight into extreme learning machines: Random neurons, random features and kernels," Cognitive Computation, vol. 6, pp. 376–390, 2014.
[6] A. Iosifidis, A. Tefas, and I. Pitas, "On the kernel extreme learning machine classifier," Patt. Recognition Lett., vol. 54, pp. 11–17, 2015.
[7] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[8] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[9] D. I. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Process. Mag., vol. 30, no. 3, pp. 83–98, 2013.
[10] A. Sandryhaila and J. M. F. Moura, "Discrete signal processing on graphs," IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644–1656, 2013.
[11] ——, "Big data analysis with signal processing on graphs: Representation and processing of massive data sets with irregular structure," IEEE Signal Process. Mag., vol. 31, no. 5, pp. 80–90, 2014.
[12] ——, "Discrete signal processing on graphs: Frequency analysis," IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3042–3054, 2014.
[13] D. I. Shuman, B. Ricaud, and P. Vandergheynst, "A windowed graph Fourier transform," IEEE Statist. Signal Process. Workshop (SSP), pp. 133–136, Aug 2012.
[14] S. K. Narang and A. Ortega, "Local two-channel critically sampled filter-banks on graphs," Proc. IEEE Int. Conf. Image Process. (ICIP), pp. 333–336, 2010.
[15] ——, "Perfect reconstruction two-channel wavelet filter banks for graph structured data," IEEE Trans. Signal Process., vol. 60, no. 6, pp. 2786–2799, 2012.
[16] ——, "Compact support biorthogonal wavelet filterbanks for arbitrary undirected graphs," IEEE Trans. Signal Process., vol. 61, no. 19, pp. 4673–4685, 2013.
[17] R. R. Coifman and M. Maggioni, "Diffusion wavelets," Appl. Computat. Harmonic Anal., vol. 21, no. 1, pp. 53–94, 2006.
[18] D. Ganesan, B. Greenstein, D. Estrin, J. Heidemann, and R. Govindan, "Multiresolution storage and search in sensor networks," ACM Trans. Storage, vol. 1, no. 3, pp. 277–315, 2005.
[19] D. K. Hammond, P. Vandergheynst, and R. Gribonval, "Wavelets on graphs via spectral graph theory," Appl. Computat. Harmonic Anal., vol. 30, no. 2, pp. 129–150, 2011.
[20] R. Wagner, V. Delouille, and R. Baraniuk, "Distributed wavelet de-noising for sensor networks," Proc. 45th IEEE Conf. Decision Control, pp. 373–379, 2006.
[21] D. I. Shuman, B. Ricaud, and P. Vandergheynst, "Vertex-frequency analysis on graphs," Applied and Computational Harmonic Analysis, vol. 40, no. 2, pp. 260–291, 2016.
[22] D. I. Shuman, M. J. Faraji, and P. Vandergheynst, "A multiscale pyramid transform for graph signals," IEEE Trans. Signal Process., vol. 64, no. 8, pp. 2119–2134, April 2016.
[23] O. Teke and P. P. Vaidyanathan, "Extending classical multirate signal processing theory to graphs – Part I: Fundamentals," IEEE Trans. Signal Process., vol. 65, no. 2, pp. 409–422, Jan 2017.
[24] ——, "Extending classical multirate signal processing theory to graphs – Part II: M-channel filter banks," IEEE Trans. Signal Process., vol. 65, no. 2, pp. 423–437, Jan 2017.
[25] A. Venkitaraman, S. Chatterjee, and P. Händel, "On Hilbert transform of signals on graphs," Proc. Sampling Theory Appl., 2015. [Online]. Available: http://w.american.edu/cas/sampta/papers/a13-venkitaraman.pdf
[26] A. Venkitaraman, S. Chatterjee, and P. Händel, "Hilbert transform, analytic signal, and modulation analysis for graph signal processing," CoRR, vol. abs/1611.05269, 2016. [Online]. Available: http://arxiv.org/abs/1611.05269
[27] N. Tremblay, G. Puy, P. Borgnat, R. Gribonval, and P. Vandergheynst, "Accelerated spectral clustering using graph filtering of random signals," IEEE Int. Conf. Acoust. Speech Signal Process., 2016.
[28] N. Tremblay and P. Borgnat, "Joint filtering of graph and graph-signals," Asilomar Conference on Signals, Systems and Computers, pp. 1824–1828, Nov 2015.
[29] S. Chen, R. Varma, A. Sandryhaila, and J. Kovacevic, "Discrete signal processing on graphs: Sampling theory," IEEE Transactions on Signal Processing, vol. 63, no. 24, pp. 6510–6523, Dec 2015.
[30] F. Gama, A. G. Marques, G. Mateos, and A. Ribeiro, "Rethinking sketching as sampling: Linear transforms of graph signals," Proc. Asilomar Conf. Signals, Systems Comput., pp. 522–526, Nov 2016.
[31] S. P. Chepuri and G. Leus, "Subsampling for graph power spectrum estimation," Proc. IEEE Sensor Array Multichannel Signal Process. Workshop (SAM), pp. 1–5, July 2016.
[32] S. Chen, R. Varma, A. Singh, and J. Kovacevic, "Signal recovery on graphs: Fundamental limits of sampling strategies," IEEE Trans. Signal Inform. Process. over Networks, vol. 2, no. 4, pp. 539–554, Dec 2016.
[33] A. G. Marques, S. Segarra, G. Leus, and A. Ribeiro, "Sampling of graph signals with successive local aggregations," IEEE Trans. Signal Process., vol. 64, no. 7, pp. 1832–1843, April 2016.
[34] H. Q. Nguyen and M. N. Do, "Downsampling of signals on graphs via maximum spanning trees," IEEE Trans. Signal Process., vol. 63, no. 1, pp. 182–191, Jan 2015.
[35] M. Tsitsvero, S. Barbarossa, and P. D. Lorenzo, "Signals on graphs: Uncertainty principle and sampling," IEEE Trans. Signal Processing, vol. 64, no. 18, pp. 4845–4860, 2016.
[36] A. Anis, A. Gadde, and A. Ortega, "Efficient sampling set selection for bandlimited graph signals using graph spectral proxies," IEEE Trans. Signal Process., vol. 64, no. 14, pp. 3775–3789, July 2016.
[37] N. Shahid, V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst, "Robust principal component analysis on graphs," IEEE Int. Conf. Comput. Vision (ICCV), pp. 2812–2820, 2015.
[38] N. Shahid, N. Perraudin, V. Kalofolias, G. Puy, and P. Vandergheynst, "Fast robust PCA on graphs," IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 4, pp. 740–756, June 2016.
[39] D. Thanou, D. I. Shuman, and P. Frossard, "Learning parametric dictionaries for signals on graphs," IEEE Trans. Signal Process., vol. 62, no. 15, pp. 3849–3862, 2014.
[40] D. Thanou and P. Frossard, "Multi-graph learning of spectral graph dictionaries," in IEEE Intl. Conf. Acoust. Speech Signal Processing, April 2015, pp. 3397–3401.
[41] ——, "Multi-graph learning of spectral graph dictionaries," IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 3397–3401, April 2015.
[42] D. Thanou, D. I. Shuman, and P. Frossard, "Learning parametric dictionaries for signals on graphs," IEEE Trans. Signal Process., vol. 62, no. 15, pp. 3849–3862, Aug 2014.
[43] Y. Yankelevsky and M. Elad, "Dual graph regularized dictionary learning," IEEE Trans. Signal Info. Process. Networks, vol. 2, no. 4, pp. 611–624, Dec 2016.
[44] B. Girault, "Stationary graph signals using an isometric graph translation," Proc. Eur. Signal Process. Conf. (EUSIPCO), pp. 1516–1520, Aug 2015.
[45] S. Segarra, A. G. Marques, G. Leus, and A. Ribeiro, "Stationary graph processes: Nonparametric spectral estimation," Proc. IEEE Sensor Array Multichannel Signal Process. Workshop (SAM), pp. 1–5, July 2016.
[46] N. Perraudin and P. Vandergheynst, "Stationary signal processing on graphs," CoRR, vol. abs/1601.02522, 2016. [Online]. Available: http://arxiv.org/abs/1601.02522
[47] B. Girault, P. Goncalves, and E. Fleury, "Translation on graphs: An isometric shift operator," IEEE Signal Process. Lett., vol. 22, no. 12, pp. 2416–2420, Dec. 2015.
[48] B. Girault, "Stationary graph signals using an isometric graph translation," Proc. Eur. Signal Process. Conf. (EUSIPCO), 2015.
[49] P. Berger, G. Hannak, and G. Matz, "Graph signal recovery via primal-dual algorithms for total variation minimization," IEEE J. Selected Topics Signal Process., vol. 11, no. 6, pp. 842–855, Sept 2017.
[50] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, "Learning Laplacian matrix in smooth graph signal representations," IEEE Trans. Signal Process., vol. 64, no. 23, pp. 6160–6173, Dec 2016.
[51] C. Hu, L. Cheng, J. Sepulcre, G. E. Fakhri, Y. M. Lu, and Q. Li, "A graph theoretical regression model for brain connectivity learning of Alzheimer's disease," in 2013 IEEE 10th International Symposium on Biomedical Imaging, April 2013, pp. 616–619.
[52] S. I. Daitch, J. A. Kelner, and D. A. Spielman, "Fitting a graph to vector data," in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML '09. New York, NY, USA: ACM, 2009, pp. 201–208.
[53] N. Leonardi and D. V. D. Ville, "Wavelet frames on graphs defined by fMRI functional connectivity," in IEEE International Symposium on Biomedical Imaging, March 2011, pp. 2136–2139.
[54] N. Leonardi, J. Richiardi, M. Gschwind, S. Simioni, J.-M. Annoni, M. Schluep, P. Vuilleumier, and D. V. D. Ville, "Principal components of functional connectivity: A new approach to study dynamic brain connectivity during rest," NeuroImage, vol. 83, pp. 937–950, 2013.
[55] R. I. Kondor and J. Lafferty, "Diffusion kernels on graphs and other discrete structures," Proc. of the ICML, pp. 315–322, 2002.
[56] A. J. Smola and R. Kondor, Kernels and Regularization on Graphs. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 144–158.
[57] M. Belkin, I. Matveeva, and P. Niyogi, "Tikhonov regularization and semi-supervised learning on large graphs," Proc. IEEE Intl. Conf. Acoust. Speech Signal Process., May 2004.
[58] ——, "Regularization and semi-supervised learning on large graphs," in Proc. 17th Annual Conf. Learning Theory, J. Shawe-Taylor and Y. Singer, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 624–638.
[59] A. Argyriou, M. Herbster, and M. Pontil, "Combining graph Laplacians for semi-supervised learning," NIPS'05, pp. 67–74, 2005.
[60] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, pp. 2399–2434, Dec. 2006.
[61] K. Zhang, L. Lan, J. T. Kwok, S. Vucetic, and B. Parvin, "Scaling up graph-based semisupervised learning via prototype vector machines," IEEE Trans. Neural Networks Learning Systems, vol. 26, no. 3, pp. 444–457, March 2015.
[62] K. I. Kim, J. Tompkin, H. Pfister, and C. Theobalt, "Context-guided diffusion for label propagation on graphs," in IEEE Intl. Conf. Computer Vision (ICCV), Dec 2015, pp. 2776–2784.
[63] D. Romero, M. Ma, and G. B. Giannakis, "Kernel-based reconstruction of graph signals," IEEE Trans. Signal Process., vol. 65, no. 3, pp. 764–778, Feb 2017.
[64] ——, "Estimating signals over graphs via multi-kernel learning," in IEEE Statist. Signal Process. Workshop (SSP), June 2016, pp. 1–5.
[65] V. N. Ioannidis, D. Romero, and G. B. Giannakis, "Kernel-based reconstruction of space-time functions via extended graphs," in Asilomar Conf. Signals Syst. Comput., Nov 2016, pp. 1829–1833.
[66] D. Romero, V. N. Ioannidis, and G. B. Giannakis, "Kernel-based reconstruction of space-time functions on dynamic graphs," CoRR, vol. abs/1612.03615, 2016. [Online]. Available: http://arxiv.org/abs/1612.03615
[67] Y. Shen, B. Baingana, and G. B. Giannakis, "Kernel-based structural equation models for topology identification of directed networks," IEEE Trans. Signal Process., vol. 65, no. 10, pp. 2503–2516, May 2017.
[68] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, "Learning graphs from signal observations under smoothness prior," CoRR, vol. abs/1406.7842, 2014. [Online]. Available: http://arxiv.org/abs/1406.7842
[69] C. F. V. Loan, "The ubiquitous Kronecker product," J. Computat. Appl. Mathematics, vol. 123, no. 1–2, pp. 85–100, 2000.
[70] P. K. Shivaswamy and T. Jebara, Laplacian Spectrum Learning. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 261–276.
[71] X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, "Laplacian matrix learning for smooth graph signal representation," in IEEE Intl. Conf. Acoust. Speech Signal Processing, April 2015, pp. 3736–3740.
[72] H. E. Egilmez, E. Pavez, and A. Ortega, "Graph learning from data under Laplacian and structural constraints," IEEE J. Selected Topics Signal Process., vol. 11, no. 6, pp. 825–841, Sept 2017.
[73] V. Kalofolias, "How to learn a graph from smooth signals," in Proc. Intl. Conf. Artificial Intell. Statistics, ser. Proc. Machine Learning Research, A. Gretton and C. C. Robert, Eds., vol. 51. PMLR, 09–11 May 2016, pp. 920–929. [Online]. Available: http://proceedings.mlr.press/v51/kalofolias16.html
[74] F. R. K. Chung, Spectral Graph Theory. AMS, 1996.
[75] M. E. J. Newman, Networks: An Introduction. Oxford University Press, 2010.
[76] Swedish Meteorological and Hydrological Institute (SMHI). [Online]. Available: http://opendata-download-metobs.smhi.se/
[77] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, "PhysioBank, PhysioToolkit, and PhysioNet," Circulation, vol. 101, no. 23, pp. e215–e220, 2000.
[78] S. Sanei and J. A. Chambers, EEG Signal Processing. Wiley, 2007.
[79] "European Tracer Experiment (ETEX)," 1995. [Online]. Available: https://rem.jrc.ec.europa.eu/RemWeb/etex/
[80] H. Behjat, N. Leonardi, L. Sörnmo, and D. V. D. Ville, "Anatomically-adapted graph wavelets for improved group-level fMRI activation mapping," NeuroImage, vol. 123, pp. 185–199, 2015.
[81] J. Diedrichsen, J. H. Balsters, J. Flavell, E. Cussans, and N. Ramnani, "A probabilistic MR atlas of the human cerebellum," NeuroImage, vol. 46, no. 1, pp. 39–46, 2009.