statistical foundation of spectral graph theory · 2016-05-18 · statistical foundation of...

Statistical Foundation of Spectral Graph Theory

Subhadeep Mukhopadhyay

Temple University, Department of Statistics

Philadelphia, Pennsylvania, 19122, U.S.A.

Dedicated to the memory of Manny Parzen (1929-2016), a pioneer in nonparametric

spectral domain time series analysis, from whom the author learned so much.

Abstract

Spectral graph theory is undoubtedly the most favored graph data analysis tech-

nique, both in theory and practice. It has emerged as a versatile tool for a wide variety

of applications including data mining, web search, quantum computing, computer vi-

sion, image segmentation, and among others. However, the way in which spectral

graph theory is currently taught and practiced is rather mechanical, consisting of a

series of matrix calculations that at first glance seem to have very little to do with

statistics, thus posing a serious limitation to our understanding of graph problems

from a statistical perspective. Our work is motivated by the following question: How

can we develop a general statistical foundation of “spectral heuristics” that avoids

the cookbook mechanical approach? A unified method is proposed that permits fre-

quency analysis of graphs from a nonparametric perspective by viewing it as function

estimation problem. We show that the proposed formalism incorporates seemingly

unrelated spectral modeling tools (e.g., Laplacian, modularity, regularized Laplacian,

diffusion map etc.) under a single general method, thus providing better fundamen-

tal understanding. It is the purpose of this paper to bridge the gap between two

spectral graph modeling cultures: Statistical theory (based on nonparametric func-

tion approximation and smoothing methods) and Algorithmic computing (based on

matrix theory and numerical linear algebra based techniques) to provide transparent

and complementary insight into graph problems.

Keywords and phrases: Nonparametric spectral graph analysis; Graph correlation den-

sity field; Spectral regularization; Orthogonal functions based spectral approximation; Trans-

form coding of graphs; Karhunen-Loeve representation of graphs; High-dimensional discrete

data smoothing.

Acknowledgments: The author benefited greatly by many fruitful discussions and ex-

change of ideas he had at the John W. Tukey 100th Birthday Celebration meeting, Prince-

ton University September 18, 2015. The author is particularly grateful to Professor Ronald

Coifman for suggesting the potential connection between our formulation and Diffusion map

(random walk on graph). We also thank Isaac Pesenson and Karl Rohe for helpful questions

and discussions.

1

Contents

1 Introduction 21.1 Graph: A Unified Data-Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Two Complementary Viewpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Towards a Statistical Theory of Spectral Graph Analysis . . . . . . . . . . . . . . . 4

1.4 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Fundamentals of Statistical Spectral Graph Analysis 6

2.1 Graph Correlation Density Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 Karhunen-Loeve Representation of Graph . . . . . . . . . . . . . . . . . . . . . . . 9

3 Nonparametric Spectral Approximation Theory 11

3.1 Orthogonal Functions Approach to Spectral Graph Analysis . . . . . . . . . . . . . 11

3.2 Graph Co-Moment Based Computational Mechanics . . . . . . . . . . . . . . . . . 13

4 A New Look At The Spectral Regularization 16

4.1 Smoothing High-dimensional Discrete Parameter Space . . . . . . . . . . . . . . . 16

4.2 Theory of Spectral Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

5 Algorithm and Applications 18

5.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

5.2 Graph Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

5.3 Graph Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

6 Further Extensions 26

7 Concluding Remarks 28

1 Introduction

1.1 Graph: A Unified Data-Structure

Graph models are powerful enablers towards developing unified statistical algorithms for

two reasons: First, availability of graph structured data from wide range of data-disciplines,

like computational biology (eg. protein-protein, gene-gene and gene-protein interaction net-

works), and the social sciences (eg., social networks like Facebook, MySpace, and LinkedIn)

etc. Second and more importantly, graph models offer a unified way of understanding dif-

ferent data types. Many complex (non-standard) data problems arise in various engineering

and scientific fields, such as segmentation of images in computer vision, mobile communica-

tions, electric power grids, cyber-security, scientific computation, functional brain dynam-

ics, 3d mesh processing in computer graphics, information retrieval, and natural language

processing–all of which can be reformulated as graph problems.

Because of the potential to be a powerful generalized data structure, it is of outstanding

interest to develop new ways of looking at graphs that might lead to faster and better

understanding of the essential characteristics of this high-dimensional discrete object.

2

1.2 Two Complementary Viewpoints

Vertex (or node) domain and frequency (or spectral) domain are two complementary rep-

resentation techniques for graphs.

A. Vertex domain analysis. Tremendous advances have been made over the past few

decades in building vertex domain graph models. The main focus has been to model “link

probabilities” between the nodes. Vertex domain analysis can be divided into two categories

based on its modeling approach.

(A1) Parametric modeling. These are the most extensively studied models in the statistics

and probability literature. Some key parametric statistical graph models include the Erdos-

Renyi model, p1-p2 model, latent space models, stochastic blockmodels and their numerous

extensions. For an excellent overview of different models, see the survey paper by Salter-

Townshend et al. (2012).

(A2) Nonparametric modeling. Lovasz and Szegedy (2006) introduced the concept of

Graphon, a functional modeling tool that compactly describes the edge probabilities or

connection probabilities of a random graph. Characterization and estimation based on

graphon is a fast-growing branch of nonparametric statistics and graph theory; see for

instance Mukhopadhyay (2015), Airoldi et al. (2013), Chatterjee (2014), and references

therein.

B. Frequency domain analysis. This provides a natural alternative to look into the

graphs. The practical significance and role of spectral analysis can be best summarized by

stating Tukey (1984) “Failure to use spectrum analysis can cost us lack of new phenomena,

lack of insight, lacking of gaining an understanding of where the current model seems to fail

most seriously.”

(B1) Algorithmic spectral analysis. The way spectral graph analysis is currently taught and

practiced can be summarized as follows (also known as spectral heuristics):

1. Convert Gn undirected graph to A ∈ Rn×n binary adjacency matrix, where A(x, y;G) =

1 if the nodes x and y are connected by an edge and 0 otherwise.

2. Define “suitable” spectral graph matrix. Most popular and successful ones are listed

below:

• L = D−1/2AD−1/2; Chung (1997)

• B = A − N−1ddT ; Newman (2006)

• Type-I Reg. L = D−1/2τ AD

−1/2τ ; Chaudhuri et al. (2012)

• Type-II Reg. L = D−1/2τ Aτ D

−1/2τ ; Amini et al. (2013)

3

D = diag(d1, . . . , dn) ∈ Rn×n, di denotes the degree of a node, τ > 0 tuning parameter,

and N = 2|E| =∑

x,y A(x, y).

3. Perform SVD on the matrix selected at step 2.

Spectral graph theory seeks to understand the interesting properties and structure of a

graph by using the dominant singular values and vectors, first recognized by Fiedler (1973).

A premier book on this topic is Chung (1997).

1.3 Towards a Statistical Theory of Spectral Graph Analysis

Nonparametric spectral analysis. To the best of our knowledge, there is no existing theory

for the nonparametric frequency analysis of graphs, despite that similar approaches for time

series analysis (pioneered by Wiener (Wiener, 1930), Tukey (Blackman and Tukey, 1958,

Cooley and Tukey, 1965), Parzen (Parzen, 1961), and many others) revolutionized 20th

century signal processing, leading to many engineering and scientific breakthroughs.

This paper arose out of the attempt to find a statistical explanation of the extraordinary

success of spectral heuristics in terms of both theory and application. Our goal is to un-

derstand at least some part of this puzzle from the statistical perspective that allows graph

analysts to adopt nonparametric approach. This would allow us to avoid the cookbook me-

chanical way of approaching frequency-domain (spectral) analysis of graphs that is solely

based on a series of matrix calculations (B1), which at first glance seems to have very lit-

tle to do with statistics. Driven by the maxim: more can be learned from graph data by

wise use of spectral analysis, this paper addresses the intellectual challenge of building a

firm statistical foundation from scratch that permits the science and art of frequency graph

analysis–a problem of exceptional interest at the present time.

1.4 Our Contributions

In this paper, we are interested in the statistical foundation of spectral graph analysis.

We present the mathematical foundation of nonparametric spectral approximation methods

for graph structured data. The mathematical treatment of this paper is guided by the

intention of providing an applicable and unified theoretical foundation based on new ideas

and concepts without resorting to unnecessary rigor that has little or no practical value for

algorithm design and applications.

There are four unique aspects of our research contributions in this paper:

• Unification: We develop a general statistical theory from the first principle that provides

genuine motivation and understanding of the spectral graph methods from a new perspec-

tive. As a taste of its power, we show how this new point of view incorporates seemingly

unrelated traditional ‘black-box’ spectral techniques as a particular instance.

4

• Cultural Bridge: An attempt is made to bridge the gap between the two cultures: Statis-

tical (based on nonparametric function approximation and smoothing methods) and Algo-

rithmic (based on matrix theory and numerical linear algebra based techniques), to provide

transparent and complementary insight into graph problems.

• Generalization: Our modeling perspective brings useful tools and crucial insights to de-

sign powerful statistical algorithms for random graph modelling. Our research inspires the

development of new innovative (fast and approximate) computational harmonic analysis

tools, which is probably necessary for large-scale spectral graph analysis.

• Interdisciplinarity : The interplay between statistics, approximation theory, and compu-

tational harmonic analysis is one of the central themes of this research.

We believe that similar to time series analysis, the proposed nonparametric spectral char-

acterization will usefully provide both computational and statistical advantages in terms

of better (compressed) representations and computationally effective simpler solutions to

graph modeling problems.

Section 2 is devoted to the fundamentals of statistical spectral graph analysis that will

provide the necessary background for a good understanding of the beauty and utility of

the proposed frequency domain viewpoint. We introduce the concept of graph correlation

density field (GraField). The Karhunen-Loeve representation of graph is defined via spec-

tral expansion of discrete graph kernel C , which provides universal spectral embedding

coordinates and graph Fourier basis. Unified nonparametric spectral approximation tech-

nique is discussed in Section 3. We describe a specialised class of smoothed orthogonal

series spectral estimation algorithm. Surprising connections between our approach and the

algorithmic spectral methods are derived. In Section 4, the open problem of obtaining a

formal interpretation of “spectral regularization” is addressed. A deep connection between

high-dimensional discrete data smoothing and spectral regularization is discovered. This

new perspective provides, for the first time, the theoretical motivation and fundamental

justification for using regularized spectral methods, which were previously considered to be

empirical guesswork-based ad hoc solutions. The algorithm of nonparametric estimation

of graph Fourier basis is given in Section 5. Important applications towards spatial graph

regression and graph clustering are also discussed. Section 6 presents further extensions

and generalizations to orthogonal functions based generalized Fourier graph analysis. We

introduce concepts like transform coding of graphs, orthogonal discrete graph transform,

and spectral sparsified graph matrices. Our research sheds light on the role of properly

designed basis functions to provide enhanced quality spectral representation and powerful

numerical algorithms for fast approximate learning–a topic of paramount importance for

analyzing big graph data at scale. In Section 7, we end with concluding remarks.

5

2 Fundamentals of Statistical Spectral Graph Analysis

2.1 Graph Correlation Density Field

At the outset, it is not even clear what the natural starting point for developing a coher-

ent statistical description of spectral graph analysis is. For reasons which will be clear

a little later, we introduce the concept of ‘Graph Correlation Density Field’ (or in short

GraField) – a crucial tool for our analysis that will allow cross-fertilization between algo-

rithmic and statistical modeling culture. GraField is a bivariate step kernel over the unit

square that provides a unified way to study the structure of graphs in the frequency domain

by completely characterizing and compactly representing the ‘affinity’ or ‘strength’ of ties

(or interaction) between every pair of vertices in the graph.

Curious readers might be wondering whether our approach has an analogous time series

counterpart where spectral analysis is related to the autocorrelation function (ACF) by a

Fourier transform (Wiener–Khinchin theorem). The answer is yes. In the similar spirit, we

will show (in the next section): classical Laplacian graph spectrum can be interpreted as

raw (not smoothed) nonparametric estimator of the Fourier transform of GraField.

Definition 1. For given discrete graph G of size n, the piecewise-constant bivariate kernel

function C : [0, 1]2 → R+ ∪ {0} is defined almost everywhere through

C (u, v;Gn) =p(Q(u;X), Q(v;Y );Gn

)p(Q(u;X)

)p(Q(v;Y )

) , 0 < u, v < 1, (2.1)

where u = F (x;X), v = F (y;Y ) for x, y ∈ {1, 2, . . . , n} and degree sequence induced graph

mass functions

p(x;X) =n∑y=1

A(x, y)/N, p(y;Y ) =n∑x=1

A(x, y)/N, and p(x, y;G) = A(x, y)/N

with Q(u;X) and Q(v;Y ) are the respective quantile functions.

Theorem 1. GraField defined in Eq. (2.1) is a positive piecewise-constant kernel satisfying∫∫[0,1]2

C (u, v;G) du dv =∑

(i,j)∈{1,...,n}2

∫∫Iij

C (u, v;G) du dv = 1,

where

Iij(u, v) =

1, if (u, v) ∈ (F (i;X), F (i+ 1;X)]× (F (j;Y ), F (j + 1;Y )]

0, elsewhere.

6

Note 1. The bivariate step-like shape of the GraField kernel is governed by the (piecewise-

constant left continuous) quantile functions Q(u;X) and Q(v;Y ) of the discrete measures

p(x;X) and p(y;Y ). As a result, in the continuum limit (as the dimension of the graph

n→∞), the shape of the piecewise-constant discrete C approaches to a “continuous field”

over unit interval.

Motivation 1. As a toy-example, consider the following adjacency matrix of a social net-

work representing 4 employees of an organization

A =

0 2 0 02 0 3 30 3 0 30 3 3 0

,

where the weights reflect number of communication (say email messages or coappear-

ances in social events etc.). Our interest lies in understanding the strength of associa-

tion between the employees i.e., Strength(x, y) for all pairs of vertices. Looking at the

matrix A (or equivalently based on histogram Graphon estimator p(x, y;G) = A/N with

N =∑

x,y A(x, y) = 22) one might be tempted to conclude that the link between employee

1 and 2 is the weakest one as they have communicated only twice, whereas employee 2, 3

and 4 constitute strong-ties as they have interacted more frequently. Now here is the sur-

prise. It turns out that (i) Strength(1, 2) is twice that of Strength(2, 3) and Strength(2, 4);

also (ii) Strength(1, 2) is 1.5 times of Strength(3, 4)! To understand the paradox com-

pute the vertex-domain empirical GraField kernel matrix (Definition 1) with (x, y)th entry

N ·A(x, y;G)/d(x)d(y)

Cn =

0 22/8 0 0

22/8 0 22/16 22/16

0 22/16 0 22/12

0 22/16 22/12 0

.

This toy example is in fact a small portion (with members 1, 9, 31 and 33) of the famous

Zachary’s karate club data, where the first two members were from Mr. Hi’s group and the

remaining two were from John’s group1. The purpose of this illustrative example is not to

completely dismiss the adjacency or empirical graphon based analysis but to caution the

practitioners so as not to confuse the terminology “strength of association” with “weights”

of the adjacency matrix – two are very different objects. Existing literature use them

interchangeably without paying much attention. As a first step towards exploratory graph

data analysis we recommend looking at both the traditional adjacency matrix (or Graphon

1For details about the Zachary’s karate club see https://en.wikipedia.org/wiki/Zachary\%27s_

karate_club.

7

● ●

●

●

●

●

● ●

●

●

●●

●●

●

●

●

1 2

3

4

5

6

7 8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

3334

Karate Graphon

● ●

●

●

●

●

● ●

●

●

●●

●●

●

●

●

1 2

3

4

5

6

7 8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

3334

Karate GraField

Figure 1: Graphon and GraField network display for the karate club data set. Nodes are

plotted in the same order in both the plots to match the degree of association and the

corresponding weight for each edge.

based) network plot and also the degree of association plot, as shown in the Fig.1 for the

Karate club data. It is important to emphasize that higher edge-weight do not necessarily

translates into stronger association. The crux of the matter is: Association does not depend

on the raw edge-density, it is a “comparison edge-density” that is captured by the GraField.

Motivation 2. As we have seen from the previous example, for finite dimensional graph

data analysis, empirical graphon captures exactly the same information that is contained

in adjacency matrix. Thus, it is not difficult to see that the graphon based spectral graph

analysis leads to the classical adjacency spectral embedding results. At the same time, it

has been long known that various normalized versions of adjacency spectral graph matrices

(such as Laplacian, modularity or diffusion matrix) perform far better in practice (Shen and

Cheng, 2010, Newman, 2006) as well as in theory (Von Luxburg et al., 2008, Sarkar et al.,

2015). One question naturally arises as to why we have such a narrow collection of ‘useful’

spectral graph matrices in our toolbox after almost half a century of active research by many

disciplines. The main reason for this is the highly non-trivial construction mechanisms for

each spectral matrices and there seem to be no general theory available that can unite

them under a single framework. How to overcome this long-standing barrier is an open

problem, which we address in this paper. A general strategy for spectral graph analysis

will be developed that includes most of the existing spectral methods as special cases. Our

formulation will be entirely (nonparametric) statistical where C plays a central role and an

elegant starting point.

8

Motivation 3. The other crucial point to note is that the “slices” of the GraField kernel

(2.1) can be expressed as p(y|x;G)/p(y;G) in the vertex domain. This alternative conditional

probability based viewpoint suggests a connection with the random walk on the graph.

Interpret p(y|x;G) as transition probability from vertex x to vertex y, and (when G is non-

bipartite) p(y;G) as the stationary distribution. This key observation along with Theorem

2 will allow us to integrate diffusion map (Coifman and Lafon, 2006) into our general

framework in Section 3.2.

Motivation 4. Graphon (Lovasz and Szegedy, 2006) captures edge probability density,

whereas GraField offers a way to measure and represent the ‘strength’ of connections (or

interaction) between pairs of vertices. GraField can also be viewed as properly “normalized

Graphon,” which is reminiscent of Wassily Hoeffding’s “standardized distributions” idea

(Hoeffding, 1940). Thus it can be interpreted as a discrete analogue of copula (the Latin

word copula means “a link, tie, bond”) density for random graphs that captures the un-

derlying correlation field. We study the structure of graphs in the spectral domain via this

fundamental graph kernel C that characterizes the implicit connectedness or tie-strength

between pairs of vertices.

Fourier-type spectral expansion result of the density matrix C is discussed in the ensuing

section, which is at the heart of our approach. We will demonstrate that this correlation

density operator based formalism provides a useful perspective for spectral analysis of graphs

that allows unification.

2.2 Karhunen-Loeve Representation of Graph

We define the Karhunen-Loeve (KL) representation of a graph G based on the spectral

expansion of its Graph Correlation Density function C (u, v;G). Schmidt decomposition

(Schmidt, 1907) of C yields the following spectral representation theorem of graph.

Theorem 2. The square integrable graph correlation density kernel C : [0, 1]2 → R+ ∪ {0}

of two-variables admits the following canonical representation

C (u, v;Gn) = 1 +n−1∑k=1

λkφk(u)φk(v), (2.2)

where the non-negative λ1 ≥ λ2 ≥ · · ·λn−1 ≥ 0 are singular values and {φk}k≥1 are the

orthonormal singular functions 〈φj, φk〉L2[0,1] = δjk, for j, k = 1, . . . , n − 1, which can be

evaluated as the solution of the following integral equation relation∫[0,1]

[C (u, v;G)− 1]φk(v) dv = λkφk(u), k = 1, 2, . . . , n− 1. (2.3)

9

Remark 1. What can we learn from the density kernel C ? We will show that this integral

eigenvalue equation-based formulation of correlation density operator provides a unified

view of spectral graph analysis. The spectral basis functions {φk} of the orthogonal se-

ries expansion of C are the central quantities of interest for graph learning and provide

the optimal representation in the spectral domain. The fundamental statistical modeling

problem hinges on finding approximate solutions to the optimal graph coordinate system

{φ1, . . . , φn−1} that satisfy the integral equation (2.3).

Remark 2. By virtue of the properties of Karhunen-Loeve (KL) expansion (Loeve, 1955),

the eigenfunction basis φk satisfying (2.3) provides the optimal low-rank representation of

a graph in the mean square error sense. In other words, {φk} bases capture the graph

topology in the smallest embedding dimension and thus carries practical significance for

graph compression. Hence, we can call those functions, the optimal coordinate functions or

Fourier representation basis.

Definition 2. Any function or signal y ∈ Rn defined on the vertices of the graph y : V 7→ R

such that ‖y‖2 =∑

x∈V (G) |y(x)|2p(x;G) <∞, can be represented as a linear combination of

the Schmidt bases of the correlation density matrix C . Define the generalized graph Fourier

transform of y

y(λk) := 〈y, φk〉 =n∑x=1

y(x)φk[F (x;G)].

This spectral or frequency domain representation of a signal, belonging to the square inte-

grable Hilbert space L2(G) equipped with the inner product

〈y, z〉L2(G) =∑

x∈V (G)

y(x)z(x)p(x;G),

allows us to construct efficient graph learning algorithms. As {φk}’s are KL spectral bases,

the vector of projections onto this basis function decay rapidly, hence may be truncated

aggressively to capture the structure in a small number of bits.

Definition 3. The entropy (or energy) of a discrete graph G, is defined using the Parseval

relation of the spectral representation

Entropy(G) =

∫∫[0,1]2

(C − 1)2 du dv =∑k

|λk|2.

This quantity, which captures the departure of uniformity of the C , can be interpreted as

a measure of ‘structure’ or the ‘compressibility’ of the graph. This entropy measure can

10

be used to (i) define graph homogeneity; (ii) design fast algorithms for graph isomorphism.

For homogenous graphs the shape of the correlation density field is flat uniform over unit

square. The power at each harmonic components, as a function of frequency, is called the

power spectrum of the graph.

The necessary theoretical background of a general method for solving the integral equation

(2.3) in order to produce good approximation of the optimal KL bases φk will be discussed

in Section 3 and 4. By doing so, we will reveal its connection with classical (algorithmic)

spectral analysis techniques.

3 Nonparametric Spectral Approximation Theory

We describe the nonparametric theory to approximate the canonical functions {φk}, which

play the role of Fourier basis for functions over graph G.

Definition 4. Define the spectral graph learning algorithm as method of approximating

(λk, φk)k≥1 that satisfies the integral equation (2.3) corresponding to the graph kernel

C (u, v;G). In practice, often the most important features of a graph can be well charac-

terized and approximated by few top dominating singular-pairs. The statistical estimation

problem can be summarize as follows:

An×n 7→ C 7→{(λ1, φ1

), . . . ,

(λn−1, φn−1

)}that satisfies Eq. (2.3).

This formalism and practical viewpoint of understanding the graph modeling problems

turns out to be very useful. To demonstrate its power, we will next show how this can

provide a general purpose unified framework for deriving traditional spectral graph analysis

techniques.

3.1 Orthogonal Functions Approach to Spectral Graph Analysis

Orthogonal Series Approximation. We will develop a specialized technique for approx-

imating spectra of integral kernel C . In particular, the theory and algorithm of Orthogonal

Series Spectral approximation (SOS) technique will be discussed. This general approxi-

mation scheme provides an effective and systematic way of discrete graph analysis in the

frequency domain.

Approximate the unknown function φk as a linear combination of elements from a complete

orthogonal system in L 2[0, 1]. Let {ξk} be a complete basis of Rn defined on the unit

interval [0, 1]. Accordingly, each singular function φk can be expressed as the expansion

11

1

2

3

4

1

2

3 4

0.0 0.2 0.4 0.6 0.8 1.0

0.0

1.0

2.0

0.0 0.2 0.4 0.6 0.8 1.0

0.0

1.0

2.0

0.0 0.2 0.4 0.6 0.8 1.0

0.0

1.0

2.0

0.0 0.2 0.4 0.6 0.8 1.0

0.0

1.0

2.0

0.0 0.2 0.4 0.6 0.8 1.00.

00.

61.

2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

1.0

2.0

0.0 0.2 0.4 0.6 0.8 1.0

0.0

1.0

2.0

0.0 0.2 0.4 0.6 0.8 1.0

0.0

1.0

2.0

Figure 2: Two graphs and their corresponding degree-adaptive block-pulse functions. The

amplitudes and the block length of the indicator basis functions (3.2) depends on the degree

distribution of that graph.

over this basis

φk(u) ≈n∑j=1

αjk ξj(u), u ∈ [0, 1]. (3.1)

where αjk are the unknown coefficients to be estimated.

Degree-Adaptive Block-pulse Basis Functions. The most fundamental yet universally

valid (for any graph) choice for {ξj}1≤j≤n is the indicator top hat functions (also known as

block-pulse basis functions, in short BPFs). Instead of defining the BPFs on a uniform

grid (which is the usual practice) here we define them on the non-uniform mesh 0 = u0 <

12

u1 · · · < un = 1 over [0,1], where uj =∑

x≤j p(x;X) with local support

ξj(u) =

{p−1/2(j) for uj−1 < u ≤ uj;

0 elsewhere.(3.2)

They are disjoint, orthogonal and complete set of functions satisfying

1∫0

ξj(u) du =√p(j),

1∫0

ξ2j (u) du = 1, and

1∫0

ξj(u)ξk(u) du = δjk.

Note 2. Because of how we have defined the BPFs, the shape (amplitudes and block

lengths) depend on the specific graph structure via p(x;G) as shown in Fig 1. The estimated

φk, by representing them as block pulse series, will be called raw-nonparametric estimates. A

computational procedure and algorithm for estimating the unknown expansion coefficients

{αjk} satisfying (2.3) will be discussed next.

3.2 Graph Co-Moment Based Computational Mechanics

We described a technique via block-pulse series approach to approximate the eigenpairs of

the integral correlation density kernel equation. In order to obtain the spectral domain rep-

resentation of the graph, it is required to estimate the block-pulse function coefficients (3.1).

The following result describes the required computational scheme by revealing surprising

connections between graph Laplacian and Modularity matrix based spectral approaches.

Theorem 3. Let φ1, . . . , φn the canonical Schmidt bases of L 2 graph kernel C (u, v;G),

satisfying the integral equation (2.3). Then the solution of (2.3) for block-pulse orthogonal

series approximated (3.2) Fourier coefficients {αjk} can equivalently be written down in

closed form as the following matrix eigen-value problem

L∗[α] = λα, (3.3)

where L∗ = L − uuT , L is the Laplacian matrix, u = D1/2p 1n, and Dp = diag(p1, . . . , pn).

To prove that define the residual of the governing equation (2.3) by expanding φk as series

expansion (3.2),

R(u) ≡∑j

αjk

[ 1∫0

(C (u, v;G)− 1

)ξj(v) dv − λkξj(u)

]= 0. (3.4)

Now if the set {ξj} is complete and orthonormal in L 2(0, 1), then requiring the error R(u) to

be zero is equivalent to the statement that R(u) is orthogonal to each of the basis functions⟨R(u), ξk(u)

⟩L 2[0,1]

= 0, k = 1, . . . , n. (3.5)

13

This leads to the following set of equations:

∑j

αjk

[ ∫∫[0,1]2

(C (u, v;G)−1

)ξj(v)ξk(u) dv du

]− λk

∑j

αjk[ 1∫

0

ξj(u)ξk(u) du]

= 0. (3.6)

Definition 5. For graph Gn, define the associated co-moment matrixM(G, ξ) ∈ Rn×n with

respect to an orthonormal system ξ as

M[j, k;G, ξ] =

∫∫[0,1]2

(C (u, v;G)− 1

)ξj(v)ξk(u) dv du j, k = 1, . . . , n. (3.7)

For {ξk} to be top hat indicator basis, straightforward calculation reduces (3.6) to the

following system of linear algebraic equations expressed in the vertex domain∑j

αjk

[p(j, k)√p(j)p(k)

−√p(j)

√p(k) − λkδjk

]= 0. (3.8)

By plug-in empirical estimators verify that the equation (3.8) can be rewritten in the fol-

lowing compact matrix form [L −N−1

√d√dT]α = λα. (3.9)

The matrix L∗ = L−N−1√d√dT

is the co-moments matrix of the correlation density kernel

C (u, v;G) under the indicator basis choice.

Note 3. The fundamental idea behind the Rietz-Galerkin (Ritz, 1908, Galerkin, 1915) style

approximation scheme for solving variational problems in Hilbert space played a pivotal

inspiring role to formalize the statistical basis of the proposed computational approach.

Significance 1. The nonparametric spectral approximation scheme described in Theorem

3 reveals a surprising connection with graph Laplacian. The Laplacian eigenfunctions can

be viewed as block pulse series approximated KL bases {φk}–raw-empirical spectral basis

of graphs. Our technique provides a completely nonparametric statistical derivation of an

algorithmic spectral graph analysis tool; we are not aware of any other method that can

achieve a similar feat of connecting these two cultures of statistics. The plot of λk versus k

can be considered as a “raw periodogram” analogue for graph data analysis.

Significance 2. Replacing the estimated coefficients from Theorem 3 into (3.1) yields φk =

D−1/2p uk, where uk is the kth eigenvector of the Laplacian matrix L. This immediately leads

to the following vertex-domain spectral decomposition result of the empirical GraField

p(y|x;G)

p(y;G)= 1 +

∑k

λkφk(x)φk(y), (3.10)

14

where (φk ◦ F )(·) is abbreviated as φk(·), p(y|x;G) = T (x, y), and T = D−1A is the tran-

sition matrix of a random walk on G with stationary distribution p(y;G) = dy/N . For

an alternative proof of the expansion (3.10) see Appendix section of Coifman and Lafon

(2006) by noting φk are the right eigenvectors of random walk Laplacian T . Since {φk}n−1k=1

approximate the ‘optimal’ Karhunen-Loeve representation basis, it is only natural to use

them for non-linear embedding of graphs. This procedure is known as diffusion map, which

has been extremely successful tool for manifold learning2 and nonparametric stochastic

dynamical systems modeling (Talmon and Coifman, 2015). Our approach provides an addi-

tional insight and justification for the diffusion coordinates by interpreting it as the strength

of connectivity profile for each vertex, thus establishing a close connection with empirical

GraField.

Theorem 4. To approximate KL graph basis φk =∑

j αjkξj, choose ξj(u) = I(uj−1 < u ≤

uj) to be characteristic function satisfying

1∫0

ξj(u) du =

1∫0

ξ2j (u) du = p(j;G).

Then the corresponding spectral estimating equation (3.6) can equivalently be reduced to the

following generalized eigenvalue equation in terms of the matrix B = A − N−1ddT

Bα = λDα. (3.11)

Significance 3. The matrix B, known as modularity matrix, was introduced by Newman

(2006) from an entirely different motivation. Our analysis reveals that the Laplacian and

Modularity based spectral graph analysis are equivalent in the sense that they inherently use

the same underlying basis expansion (one is a rescaled version of the other) to approximate

the optimal graph bases.

Note 4. Solutions (eigenfunctions) of the graph co-moment estimating equation based on

the correlation density kernel C under the proposed nonparametric SOS approximation

scheme provides a systematic and unified framework for spectral graph analysis. As an

application of this general formulation, we have shown how the block-pulse function based

nonparametric approximation methods synthesize the well-known Laplacian, modularity and

diffusion map spectral algorithms. It is one of those rare occasions where one can witness

the convergence of statistical, algorithmic and geometry-motivated computational models.

2Data-driven learning of the “appropriate” coordinates to identify the intrinsic nonlinear structure of

high-dimensional data. We claim the concept of GraField allows decoupling of the geometrical aspect from

the probability distribution on the manifold.

15

4 A New Look At The Spectral Regularization

4.1 Smoothing High-dimensional Discrete Parameter Space

As we have demonstrated in the preceding sections, the fundamental goal of spectral graph

theory is to find the approximate canonical bases of the graph density matrix C , by express-

ing them as a linear combination of trail basis functions ξ1, . . . , ξn. Recall that (3.2) the

shape (“amplitude”) of the top hat indicator basis functions {ξk} depends on the unknown

distribution p(x;G) for x = {1, 2, . . . , n}.

MLE in Sparse Regime. This leads us to the question of estimating the unknown

distribution P = (p1, p2, . . . , pn) (support size = size of the graph = n) based on N sample,

where N =∑n

i=1 di = 2|E|. We previously used the MLE estimate (Theorem 3), which is

the raw-empirical discrete probability estimate p(x;G) to construct the ξ-functions in our

spectral approximation algorithm. However, the empirical MLE is known to be strictly sub-

optimal (Witten and Bell, 1991) in the sparse-regime where N/n = O(1) i.e., bounded. This

situation can easily arise for modern day large sparse random graphs where both n and N are

of comparable order. The question remains: How to tackle this high-dimensional discrete

probability estimation problem, as this directly impacts the efficiency of our nonparametric

spectral approximation.

Laplace Smoothing. We seek a practical solution for countering this problem that lends

itself to fast computation. The solution that is both the simplest as well as remarkably

serviceable is the Laplace/Additive smoothing (Laplace, 1951) and its variants, which excels

in sparse regimes (Fienberg and Holland, 1973, Witten and Bell, 1991). The MLE and

Laplace estimates of the discrete distribution p(j;G) are respectively given by

Raw-empirical MLE estimates: p(j;G) =djN

;

Smooth Laplace estimates: pτ (j;G) =dj + τ

N + nτ(j = 1, . . . , n).

Note that the smoothed distribution pτ can be expressed as a convex combination of the

empirical distribution p and the discrete uninform distribution 1/n

pτ (j;G) =N

N + nτp(j;G) +

nτ

N + nτ

(1

n

), (4.1)

which provides a Stein-type shrinkage estimator of the unknown probability mass function

p. The shrinkage significantly reduces the variance, at the expense of slightly increasing the

bias.

Choice of τ . The next issue is how to select the “flattening constant” τ . Three choices of

16

τ are most popular in the literature:

τ =

1 Laplace estimator;

1/2 Krichevsky–Trofimov estimator;√N/n Minimax estimator (under L2 loss).

For more details on selection of τ see Fienberg and Holland (1973) and references therein. In

the next section, we will reveal a surprising connection between Laplace-smooth block-pulse

series approximation scheme and regularized spectral graph analysis.

4.2 Theory of Spectral Regularization

Construct τ -regularized top hat indicator basis ξj;τ by replacing the amplitude p−1/2(j) by

p−1/2τ (j) following (4.1). Incorporating this regularized trial basis, we have the following

modified graph co-moment based linear algebraic estimating equation:

∑j

αjk

[p(j, k)√pτ (j)pτ (k)

−√pτ (j)

√pτ (k) − λkδjk

]= 0. (4.2)

Theorem 5. τ -regularized block-pulse series based spectral approximation scheme is equiv-

alent to representing or embedding discrete graphs in the continuous eigenspace of

Type-I Regularized Laplacian = D−1/2τ AD−1/2

τ , (4.3)

where Dτ is a diagonal matrix with i-th entry di + τ .

Note 5. It is interesting to note that this exact regularized Laplacian formula was proposed

by Chaudhuri et al. (2012) and Qin and Rohe (2013), albeit from very different motivation.

Theorem 6. Estimate the joint probability p(j, k;G) by extending the formula given in (4.1)

for two-dimensional case as follows:

pτ (j, k;G) =N

N + nτp(j, k;G) +

nτ

N + nτ

(1

n2

), (4.4)

which is equivalent to replacing the original adjacency matrix by Aτ = A+ (τ/n)11T . This

modification via smoothing in the estimating equation (4.2) leads to the following spectral

graph matrix

Type-II Regularized Laplacian = D−1/2τ Aτ D

−1/2τ . (4.5)

Note 6. Exactly the same form of regularization of Laplacian graph matrix (4.5) was

proposed by Amini et al. (2013) as a fine-tuned empirical solution.

17

Significance 4. We have addressed the open problem of obtaining a rigorous interpretation

and understanding of spectral regularization. Our approach provides a more formal and

intuitive understanding of the spectral heuristics for regularization. We have shown how the

regularization naturally arises as a consequence of high-dimensional discrete data smoothing.

In addition, this point of view allows us to select appropriate regularization parameter τ with

no additional computation. To the best of our knowledge this is the first work that provides a

theoretical derivation and fundamental justification for regularized spectral methods, which

were previously considered empirical guesswork-based ad hoc solutions. This perspective

might also suggests how to construct other new types of regularization schemes3.

5 Algorithm and Applications

We describe how the core concepts and the general theory can be used to develop spectral

models of graph structured data for solving applied problems.

5.1 Algorithm

Our nonparametric theory provides an alternative perspective to the conventional algo-

rithmic spectral graph modeling allowing easy generalization. We start by estimating the

spectral domain graph decomposition basis functions for compact representation. In what

follows, we describe the unified algorithm that employs the proposed orthogonal series spec-

tral approximation technique to compute the Karhunen-Loeve basis of graph correlation

density kernel C , which yields the ‘optimal’ discrete graph Fourier transform.

Algorithm 1. Nonparametric Estimation of Graph Fourier Basis

1. Input: The adjacency matrixA ∈ Rn×n. The regularization parameter τ ∈ {0, 1, 1/2,√N/n}.

The number of spectral basis required is k.

2. Construct τ -regularized block-pulse trial basis functions ξj;τ = p−1/2j;τ I(uj−1 < u ≤ uj) for

j = 1, . . . , n.

3. Compute the co-moment matrix of the graph with respect to the τ -regularized block-pulse

trial basis functions {ξk;τ}1≤k≤n.

Mτ [j, k;G, ξ] =⟨ξj;τ ,

1∫0

(C − 1)ξk;τ

⟩L2[0,1]

for j, k = 1, . . . , n.

3Two such examples to construct smooth p(x;G) would be Stein estimate based on data-driven empirical

Bayes shrinkage parameter or Good-Turing estimate.

18

Zinc concentrations (ppm)

●● ●

●

●●

●●

●●

●●

●●●

●

●

●

●

● ●●●

●

●●●

●●

●

●●

●●●●

●

●●

●

●

●●

●

●

●

●

●

●

●●

●●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●●

● ●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●●●

●

●●

●

●●●

●●

●

●

●

●●

●

●

●

●●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

● ●

●

●●

●●●

●

●

●

●●●

●

200

400

600

800

1000

1200

1400

1600

1800

Flooding frequency classes

●● ●

●

●●

●●

●●

●●

●●●

●

●

●

●

● ●●●

●

●●●

●●

●

●●

●●●●

●

●●

●

●

●●

●

●

●

●

●

●

●●

●●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●●

● ●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●●●

●

●●

●

●●●

●●

●

●

●

●●

●

●

●

●●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

● ●

●

●●

●●●

●

●

●

●●●

●

●

●

●

123

Distance to river

●● ●

●

●●

●●

●●

●●

●●●

●

●

●

●

● ●●●

●

●●●

●●

●

●●

●●●●

●

●●

●

●

●●

●

●

●

●

●

●

●●

●●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●●

● ●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●●●

●

●●

●

●●●

●●

●

●

●

●●

●

●

●

●●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

● ●

●

●●

●●●

●

●

●

●●●

●

200

400

600

800

1000

Soil type classes

●● ●

●

●●

●●

●●

●●

●●●

●

●

●

●

● ●●●

●

●●●

●●

●

●●

●●●●

●

●●

●

●

●●

●

●

●

●

●

●

●●

●●

●●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●●●

● ●

●●

●

●

●

●●

●

●

●

●

●

●

●

●

●●●

●

●●

●

●●●

●●

●

●

●

●●

●

●

●

●●

●

●

●●

●

●

●

●

●

●●

●

●

●

●

●

●

●

●

●

●

●

●

●●

●

●●

●

● ●

●

●●

●●●

●

●

●

●●●

●

●

●

●

123

Figure 3: The Meuse dataset consists of n = 155 observations taken on a support of 15×15

m from the top 0 − 20 cm of alluvial soils in a 5 × 2 km part of the right bank of the

floodplain of the the river Meuse, near Stein in Limburg Province (NL). The dependent

variable Y is the zinc concentration in the soil (in mg kg−1), shown in the leftmost figure.

The other three variables, flooding frequency class (1 = once in two years; 2 = once in ten

years; 3 = one in 50 years), distance to river Meuse (in metres), and soil type (1= light

sandy clay; 2 = heavy sandy clay; 3 =silty light clay), are explanatory variables.

4. Perform the singular value decomposition (SVD) of M = UΛUT =∑

k ukµkuTk , where

where uij are the elements of the singular vector of moment matrix U = (u1, . . . , un), and

Λ = diag(µ1, . . . , µn), µ1 ≥ · · ·µn ≥ 0. Set λk = µk.

5. Estimate the optimal spectral connectivity profile for each node {φ1i;τ , . . . , φ(n−1)i;τ}, i =

1, . . . , n

φk;τ (u) =n∑j=1

ujkξj;τ , for k = 1, . . . , n− 1.

6. Return estimated spectral basis matrix Φ =[φ1;τ , . . . , φk;τ

]∈ Rn×k for the graph G.

5.2 Graph Regression

We study the problem of graph regression as an interesting application of the proposed

nonparametric spectral analysis algorithm. Unlike traditional regression settings, here one

is given n observations of the response and predictor variables over graph. The goal is

to estimate the regression function by properly taking into account the underlying graph

structured information along with the set of covariates.

19

●●●●●

●●●●

●●●

●●●

●●●●●●●● ●●●●●● ●●

●●●●

●●●●●●

●●●

●●●●●●

●

●●

●●

●●●

●

●●

●●● ●

●●

●

●

●●●●●●●●

●●●●

●

●

●

●●●● ●●

●

●●

●

●●●

●● ●

●●●

●

●

●

●

●

●●●

●●● ●●

●

●

●●●●

●

●●

●●●

●●

●

●

●

●

●

●●●●

● ●●●

●●●●

●

●●●●

●

●

●

12

3

4

5

6

78

9

101112

13

14

15

1617

181920

2122

2324

25262728

29

30

31

3233

3435

363738

3940

41

4243

44

45

46

47

48

49

50

51

52

53

54

55

5657

58

59

60

61

62

6364

65

66

67

68

69

707172

73

74

7576

77

787980

81

82

83

84

8586

87 8889

90

91

92

93

94

95

9697

98 99 100

101

102

103

104

105

106

107

108

109

110111

112113

114115

116

117

118

119

120121122

123

124

125

126

127128

129130

131

132

133

134

135

136137

138

139

140 141

142 143

144145

146

147

148

149150

151152

153

154

155

degree of the spatial Meuse data graph

Fre

quen

cy

5 10 15 20

05

1015

Figure 4: From spatial to graph representation of Meuse river data set. For each node we

observe {yi;x1i, x2i, x3i}ni=1. We address the problem of approximating the regression func-

tion incorporating simultaneously the effect of explanatory variables X and the underlying

spatial graph dependency.

Meuse Data Modeling. We apply our frequency domain graph analysis for the spatial

prediction problem. Fig 3 describes the Meuse data set, a well known geostatistical dataset.

There is a considerable spatial pattern one can see from Fig 3. We seek to estimate a

smooth regression function of the dependent variable Y (zinc concentration in the soil) via

generalized spectral regression that can exploit this spatial dependency. The graph was

formed according to the geographic distance between points based on the spatial locations

of the observations. We convert spatial data into signal supported on graph by connecting

two vertices if the distance between two stations is smaller than a given coverage radius. The

maximum of the first nearest neighbor distances is used as a coverage radius to ensure at

least one neighbor for each node. Fig 4(C) shows the sizes of neighbours for each node that

ranges from 1 to 22. Three nodes (# 82, 148 and 155) have only 1 neighbour; additionally

one can see a very weakly connected small cluster of three nodes, which is completely

detached from the bulk of the other nodes. The reason behind this heterogeneous degree

distribution (as shown in Fig 4) is the irregular spatial pattern of the Meuse data.

We model the relationship between Y and spatial graph topology by incorporating non-

parametrically learned spectral representation basis. Thus, we expand Y in the eigenbasis

of C and the covariates for the purpose of smoothing, which effortlessly integrates the tools

from harmonic analysis on graphs and conventional regression analysis. The model (5.1) de-

scribed in the following algorithm simultaneously promotes spatial smoothness and sparsity,

which is motivated by Donoho’s “de-noising by soft-thresholding” idea (Donoho, 1995).

20

Algorithm 2. Nonparametric Spectral Graph Regression

Step 1. Input: We observe {yi;xi1, . . . , xip} at vertex i of the graph G = (V,A) with size

|V | = n.

Step 2. Apply Algorithm 1 to construct the orthogonal series approximated spectral graph

basis and store it in Φ ∈ Rn×k.

Step 3. Construct combined graph regressor matrix XG =[Φ;X

], where X = [X1, . . . Xp] ∈

Rn×p is the matrix of predictor variables.

Step 4. Solve for βG = (βΦ, βX)T

βG = argminβG∈Rk+p

‖y −XGβG‖22 + λ‖βG‖1. (5.1)

Algorithm 2 extends traditional regression to data sets represented over graph. The pro-

posed frequency domain smoothing algorithm efficiently captures the spatial graph topology

via the spectral coefficients βΦ ∈ Rk–can be interpreted as covariate-adjusted discrete graph

Fourier transform of the response variable Y . The `1 sparsity penalty automatically selects

the coefficients with largest magnitudes thus provides compression.

The following table shows that incorporating the spatial correlation in the baseline unstruc-

tured regression model using spectral orthogonal basis functions (which are estimated from

the spatial graph with k = 25) boost the model fitting from 62.78% to 80.47%, which is an

improvement of approximately 18%.

X Φ0 +X ΦIτ=1 +X ΦI

τ=.5 +X ΦIτ=.28 +X ΦII

τ=1 +X ΦIτ=.5 +X ΦI

τ=.28 +X

62.78 74.80 78.16 80.47 80.43 80.37 80.45 80.43

Here Φ0, ΦIτ and ΦII

τ are the graph-adaptive piecewise constant orthogonal basis functions de-

rived respectively from the ordinary, Type-I regularized, and Type-II regularized Laplacian

matrix. Our spectral method can also be interpreted as a kernel smoother where the spatial

dependency is captured by the graph correlation density field C . We finally conclude that

extension from traditional regression to graph structured spectral regression significantly

improve the model accuracy.

5.3 Graph Clustering

To further demonstrate the potential application of our algorithm, we discuss the community

detection problem that seeks to divides nodes into into k groups (clusters), with larger

21

proportion of edges inside the group (homogeneous) and comparatively sparser connections

between groups to understand the large-scale structure of network. Discovering community

structure is of great importance in many fields such as LSI design, parallel computing,

computer vision, social networks, and image segmentation.

In mathematical terms, the goal is to recover the graph signals (class labels) y : V 7→{1, 2, . . . , k} based on the connectivity pattern or the relationship between the nodes.

Representation of graph in the spectral or frequency domain via the nonlinear mapping

Φ : G(V,E) 7→ Rm using discrete KL basis of density matrix C as the co-ordinate is the

most important learning step in the community detection. This automatically generates

spectral features {φ1i, . . . , φmi}1≤i≤n for each vertex that can be used for building the dis-

tance or similarity matrix to apply k-means or hierarchical clustering methods. In our

examples, we will apply k-means algorithm in the spectral domain, which seeks to minimiz-

ing the within-cluster sum of squares. In practice, often the most stable spectral clustering

algorithms determine k by spectral gap: k = argmaxj |λj − λj+1|+ 1.

Algorithm 3. Nonparametric Spectral Graph Partitioning

1. Input: The adjacency matrix A ∈ Rn×n. Number of clusters k. The regularization

parameter τ .

2. Estimate the top k − 1 spectral connectivity profile for each node {φ1i;τ , . . . , φ(k−1)i;τ}using Algorithm 1. Store it in Φ ∈ Rn×k−1.

3. Apply k-means clustering by treating each row of Φ as a point in Rk−1.

4. Output: The cluster assignments of n vertices of the graph C1, . . . , Ck.

In addition to graph partitioning, the spectral ensemble {λk, φk}1≤k≤m of C contain a wealth

of information on the graph structure. For example, the quantity 1− λ1(G; ξ) for the choice

of {ξk} to be normalized top hat basis (3.2), is referred to as the algebraic connectivity,

whose magnitude reflects how well connected the overall graph is. The kmeans cluster-

ing after spectral embedding {φj(G; ξ)}1≤j≤k−1 finds approximate solution to the NP-hard

combinatorial optimization problem based on the normalized cut (Shi and Malik, 2000)

by relaxing the discreteness constraints into one that is continuous (sometimes known as

spectral relaxation).

Data and Results. We investigate four well-studied real-world networks for community

structure detection based on 7 variants of spectral clustering methods.

Example A [Political Blog data, Adamic and Glance (2005)] The data, which contains

1222 nodes and 16, 714 edges, were collected over a period of two months preceding the U.S.

22

Table 1: We report % of misclassification error. We compare following seven different

Laplacian variants. K denotes the number of communities.

Type-I Reg. Laplacian Type-II Reg. Laplacian

Data K Laplacian τ = 1 τ = 1/2 τ =√N/n τ = 1 τ = 1/2 τ =

√N/n

PolBlogs 2 47.95% 4.9% 4.8% 5.4% 4.8% 4.7% 5.4%

Football 11 11.3% 7.83% 6.96% 6.96% 6.96% 7.83% 7.83%

MexicoPol 2 17.14% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%

Adjnoun 2 13.4% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5%

Presidential Election of 2004 to study how often the political blogs refer to one another.

The linking structure of the political blogosphere was constructed by identifying whether

a URL present on the page of one blog references another political blog (extracted from

blogrolls). Each blog was manually labeled as liberal or conservative by Adamic and Glance

(2005), which we take as ground truth. The goal is to discover the community structure

based on these blog citations, which will shed light on the polarization in political blogs.

Table 1 shows the result of applying the spectral graph clustering algorithm on this political

web-blog data. The un-regularized Laplacian performs very poorly, whereas as both type-

I/II regularized versions give significantly better results. The misclassification error drops

from 47.95% to 4.7% because of regularization. To better understand why regularization

plays a vital role, consider the degree distribution of the web-blog network as shown in

the bottom panel of Figure 2. It clearly shows the presence of a large number of low-

degree nodes, which necessitates the smoothing of high-dimensional discrete probability

p1, . . . , p1222. Thus, we perform the kmeans clustering after projecting the graph in the

Euclidean space spanned by Laplace smooth KL spectral basis {φk;τ}. Regularized spectral

methods correctly identify two dense clusters: liberal and conservative blogs, which rarely

links to a blog of a different political leaning, as shown in the middle panel of Fig 6.

Example B [US College Football, Grivan and Newman (2002)] The American football

network (with 115 vertices, 615 edges) depicts the schedule of football games between NCAA

Division IA colleges during the regular season of Fall 2000. Each node represents a college

team (identified by their college names) in the division, and two teams are linked if they have

played each other that season. The teams were divided into 11 “conferences” containing

around 8 to 12 teams each, which formed actual communities. The teams in the same

23

●●

●●

●

●

● ●

●

●

●

●

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

Figure 5: Mexican political network. Two different colors (golden and blue) denotes the

two communities (military and civilians).

conference played more often compared to the other conferences, as shown in the middle

panel of Fig 6. A team played on average 7 intra- and 4 inter-conference games in the season.

Inter-conference play is not uniformly distributed; teams that are geographically close to one

another but belong to different conferences are more likely to play one another than teams

separated by large geographic distances. As the communities are well defined, the American

football network provides an excellent real-world benchmark for testing community detection

algorithms.

Table 1 shows the performance of spectral community detection algorithms to identify the 11

clusters in the American football network data. The regularization boosts the performance

by 3-4%. In particular, τ = 1/2 and√N/n produces the best result for Type-I regularized

Laplacian, while τ = 1 exhibits the best performance for Type-II regularized Laplacian.

Example C [The Political Network in Mexico, Gil-Mendieta and Schmidt (1996)] The data

(with 35 vertices and 117 edges) represents the complex relationship between politicians in

Mexico (including presidents and their close associates). The edge between two politicians

indicates a significant tie, which can either be political, business, or friendship. A classifi-

cation of the politicians according to their professional background (1 - military force, 2 -

civilians: they fought each other for power) is given. We use this information to compare

our 7 spectral community detection algorithms.

Although this is a “small” network, challenges arise from the fact that the two commu-

nities cannot be separated easily due to the presence of a substantial number of between-

community edges, as depicted in Figs 5 and 6. The degree-sparsity is also evident from Fig 6

(bottom-panel). Table 1 compares seven spectral graph clustering methods. Regularization

yields 3% fewer misclassified nodes. Both the type-I and II regularized Laplacian methods

for all the choices of τ produce the same result.

24

degree of the web−blog network

Fre

quen

cy

0 50 100 150 200 250 300 350

010

020

030

040

050

0

degree of the US college football network

Fre

quen

cy

7 8 9 10 11 12 13

010

2030

4050

60

degree of the political network in Mexico

Fre

quen

cy

5 10 15

02

46

810

12

degree of the word adjacencies network

Fre

quen

cy

0 10 20 30 40 50

010

2030

40

Figure 6: The columns denote the 4 datasets corresponding to the political blog, US col-

lege football, politicians in Mexico, and word adjacencies networks (the description of the

datasets are given in Section 5.3). The first two rows display the un-ordered and ordered

adjacency matrix, and the final row depicts the degree distributions.

Example D [Word Adjacencies, Newman (2006)] This is a adjacency network (with 112

vertices and 425 edges) of common adjectives and nouns in the novel David Copperfield

by English 19th century writer Charles Dickens. The graph was constructed by Newman

(2006). Nodes represent the 60 most commonly occurring adjectives and nouns, and an

edge connects any two words that appear adjacent to one another at any point in the book.

Eight of the words never appear adjacent to any of the others and are excluded from the

network, leaving a total of 112 vertices. The goal is to identify which words are adjectives

and nouns from the given adjacency network.

Note that typically adjectives occur next to nouns in English. Although it is possible for

adjectives to occur next to other adjectives (e.g., “united nonparametric statistics”) or for

25

nouns to occur next to other nouns (e.g., “machine learning”), these juxtapositions are less

common. As expected, Fig 6 (middle panel) shows an approximately bipartite connection

pattern among the nouns and adjectives.

A degree-sparse (skewed) distribution is evident from the bottom right of Fig 6. We apply

the seven spectral methods and the result is shown in Table 1. The traditional Laplacian

yields a 13.4% misclassification error for this dataset. We get better performance (although

the margin is not that significant) after spectral regularization via Laplace smoothing.

6 Further Extensions

We have employed a nonparametric method for approximating the optimal Karhunen-

Loeve spectral graph basis (eigenfunction of graph correlation density matrix C ) via block-

pulse orthogonal series approach. The beauty of our general approach is in its ability to

unify seemingly disparate spectral graph algorithms while being simple and flexible. Our

theory and algorithm are readily applicable to other sophisticated orthogonal transforms

for obtaining more enhanced quality approximate solution of the functional basis.

Transform Coding of Graphs. Our general nonparametric spectral approximation theory

remains unchanged for any choice of orthogonal piecewise-constant trial basis functions (e.g.,

Walsh, Rademacher function, B-spline, or Haar functions etc.), which paves the way for the

generalized harmonic analysis of graphs.

Definition 6 (Orthogonal Discrete Graph Transform). We introduce a generalized concept

of matrices associated with graphs. Define orthogonal discrete graph transform as

M[j, k;G, ξ] =⟨ξj,

1∫0

(C − 1)ξk

⟩L2[0,1]

for j, k = 1, . . . , n. (6.1)

The family of Laplacian-like graph matrices can be thought of as block-pulse transform

coefficients matrix. Equivallently, we can define the discrete graph transform to be the

coefficient matrix of the orthogonal series expansion of the kernel C (u, v;G) with respect to

the product bases {ξjξk}1≤j,k≤n.

Significance 5 (Broadening the Class of Graph Matrices). As a practical significance, this

generalization provides a unified and systematic recipe for converting the graph problem

into a “suitable” matrix problem in the new transformed domain:

Gn(V,E) −→ An×n −→ C (u, v;Gn){ξ1,...,ξn}−−−−−→Eq. (6.1)

M(G) ∈ Rn×n.

The matrix M(G) allows us to broaden currently existing narrow classes of graph matri-

ces. Furthermore, verify that M = ΨTPΨ, where Ψjk = ξj(F (k;G)) for j, k = 1, . . . , n.

26

The Laplacian-like matrices (both normalized and regularized versions) and the Modularity

matrix are merely special instances of this general family.

Spectral Sparsification and Smoothing. Choice of the orthogonal piecewise-constant

trial bases {ξk}1≤k≤n can have a dramatic impact (in terms of theory and practice) on analyz-

ing graphs. Wisely constructed bases not only lead to better quality spectral representation,

but also powerful numerical algorithms for fast approximate learning. Our statistical view-

point sheds light on the scope and role of orthogonal piecewise-constant functions in discrete

graph processing.

Significance 6 (Spectral Smoothing). Orthogonal series approximation of φk =∑

j αjk ξj(u)

based on suitable and more sophisticated piecewise-constant basis functions (compared to

the block-pulse functions) provide increased flexibility that might produce enhanced quality

nonparametric “smooth” estimates instead of “raw” estimate of φk ( k = 1, . . . , n). We

emphasize that all of our theory remains completely valid and unchanged for any choice of

orthogonal transform (i.e., trial bases).

Significance 7 (An Accelerated Spectral Approximation Scheme). The conventional spec-

tral coding method based on singular value decomposition (SVD of partial order k) of dense

spectral matrix is roughly of the order O(kn2), which is costly and unrealistic to compute

for large graphs. For sparse matrices there exists iterative algorithms like Lanczos or ap-

proximate methods like Nystrom or Randomized SVD, whose computational cost scales as

O(ks), where s = number of non-zero elements of matrix M. Thus, it is logical to prefer

sparsifier trial basis (that produces sparse transform matrix) to speed-up the computation.

As a future research, we plan to investigate the role of Haar wavelet on large graph pro-

cessing. The key issues to be investigated include whether it provides a computational

advantage over standard approaches (which use indicator top-hat basis) for frequency anal-

ysis of graphs without sacrificing the accuracy, as well as whether localized compact support

of wavelet bases lead to sparsified graph transform. Such an approach will require efficient

techniques for sparse representation and recovery based on the pioneering ideas of Donoho

and Johnstone (Donoho and Johnstone, 1994).

All in all, the notion of generalized discrete graph transform inspire us to develop practical

and efficient compressible algorithms for large graphs via spectral sparsified graph matrices,

which could dramatically reduce memory and time requirements. However, more research

is required to examine this largely uncharted territory of graph data analysis.

27

7 Concluding Remarks

This paper introduces a new statistical way of thinking, teaching and practice of the spectral

analysis of graphs. In what follows, we highlight some of the key characteristics of this work:

• This paper does not offer a new spectral graph analysis tool, rather, it offers only a

new point of view from statistics perspective by appropriately transforming it into a non-

parametric approximation and smoothing problem. In particular, we have reformulated

the spectral graph theory as a method of obtaining approximate solution of the integral

eigenvalue equation of the correlation density matrix C – Correlation Density Functional

theory.

• A specialised orthogonal series spectral approximation technique is developed, which ap-

pears to unify and generalize the existing paradigm. We also reported striking finding that

clarifies, for the first time, the mystery of the spectral regularized algorithm by connecting

it with high-dimensional discrete parameter smoothing.

• Historically, the development of efficient computing algorithms for spectral graph analysis

has been inspired either by numerical linear algebra or combinatorics. The intention of this

paper was to develop a general theory of approximate spectral methods that

(i) offers a complete and autonomous description, totally free from the classical combinato-

rial or algorithmic linear algebra based languages;

(ii) can be fruitfully utilized to design statistically-motivated modern computational learning

techniques, which are of great practical interest since they might lead to fast algorithms for

large graphs.

• This work shows an exciting confluence of three different cultures : Nonparametric func-

tional statistical theory, Mathematical approximation theory (of integral equations), and

Computational harmonic analysis.

In essence, this paper presents a modern statistical view on spectral analysis of graphs that

contributes new insights for developing unified theory and algorithm from a nonparametric

perspective.

References

Adamic, L. A. and Glance, N. (2005). The political blogosphere and the 2004 us election:

divided they blog. In Proceedings of the 3rd international workshop on Link discovery.

ACM, 36–43.

Airoldi, E. M., Costa, T. B. and Chan, S. H. (2013). Stochastic blockmodel ap-

28

proximation of a graphon: Theory and consistent estimation. In Advances in Neural

Information Processing Systems. 692–700.

Amini, A. A., Chen, A., Bickel, P. J., Levina, E. et al. (2013). Pseudo-likelihood

methods for community detection in large sparse networks. The Annals of Statistics, 41

2097–2122.

Blackman, R. B. and Tukey, J. W. (1958). The measurement of power spectra from the

point of view of communications engineering—Part I-III. Bell System Technical Journal,

The, 37 485–569.

Chatterjee, S. (2014). Matrix estimation by universal singular value thresholding. The

Annals of Statistics, 43 177–214.

Chaudhuri, K., Graham, F. C. and Tsiatas, A. (2012). Spectral clustering of graphs

with general degrees in the extended planted partition model. Journal of Machine Learn-

ing Research Workshop and Conference Proceedings, 35 1–23.

Chung, F. R. (1997). Spectral graph theory, vol. 92. American Mathematical Soc.

Coifman, R. R. and Lafon, S. (2006). Diffusion maps. Applied and computational

harmonic analysis, 21 5–30.

Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine calculation of

complex fourier series. Mathematics of computation, 19 297–301.

Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transactions on Informa-

tion Theory, 41 613–627.

Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet

shrinkage . Biometrika., 81 425–455.

Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Mathematical Journal,

23 298–305.

Fienberg, S. E. and Holland, P. W. (1973). Simultaneous estimation of multinomial

cell probabilities. Journal of the American Statistical Association, 68 683–691.

Galerkin, B. (1915). Series development for some cases of equilibrium of plates and beams

(in Russian). Wjestnik Ingenerow Petrograd, 19 897–908.

Gil-Mendieta, J. and Schmidt, S. (1996). The political network in mexico. Social

Networks, 18 355–381.

Grivan, M. and Newman, M. E. J. (2002). Community structure in social and biological

networks. Proc. Natl. Acad. Sci 7821–7826.

29

Hoeffding, W. (1940). Massstabinvariante korrelationstheorie. Schriften des Mathema-

tischen Seminars und des Instituts fur Angewandte Mathematik der Universitat Berlin,

5 179–233.

Laplace, P. S. (1951). A philosophical essay on probabilities, translated from the 6th

French edition by Frederick Wilson Truscott and Frederick Lincoln Emory. Dover Publi-

cations (New York, 1951).

Loeve, M. (1955). Probability Theory; Foundations, Random Sequences. New York: D.

Van Nostrand Company.

Lovasz, L. and Szegedy, B. (2006). Limits of dense graph sequences. Journal of Com-

binatorial Theory, Series B, 96 933–957.

Mukhopadhyay, S. (2015). Strength of connections in a random graph: Definition, char-

acterization, and estimation. arXiv:1412.1530.

Newman, M. E. (2006). Finding community structure in networks using the eigenvectors

of matrices. Physical review E, 74 036104.

Parzen, E. (1961). An approach to time series analysis. The Annals of Mathematical

Statistics 951–989.

Qin, T. and Rohe, K. (2013). Regularized spectral clustering under the degree-corrected

stochastic blockmodel. In Advances in Neural Information Processing Systems. 3120–

3128.

Ritz, W. (1908). Uber eine neue methode zur losung gewisser randwertaufgaben.

Nachrichten von der Gesellschaft der Wissenschaften zu Gottingen, Mathematisch-

Physikalische Klasse 236–248.

Salter-Townshend, M., White, A., Gollini, I. and Murphy, T. B. (2012). Review

of statistical network analysis: models, algorithms, and software. Statistical Analysis and

Data Mining: The ASA Data Science Journal, 5 243–264.

Sarkar, P., Bickel, P. J. et al. (2015). Role of normalization in spectral clustering

for stochastic blockmodels. The Annals of Statistics, 43 962–990.

Schmidt, E. (1907). Zur Theorie der linearen und nicht linearen Integralgleichungen Zweite

Abhandlung. Mathematische Annalen, 64 433–476.

Shen, H.-W. and Cheng, X.-Q. (2010). Spectral methods for the detection of network

community structure: a comparative analysis. Journal of Statistical Mechanics: Theory

and Experiment, 2010 P10020.

30

Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. Pattern Analysis

and Machine Intelligence, IEEE Transactions on, 22 888–905.

Talmon, R. and Coifman, R. R. (2015). Intrinsic modeling of stochastic dynamical

systems using empirical geometry. Applied and Computational Harmonic Analysis, 39

138–160.

Tukey, J. W. (1984). Styles of spectrum analysis. In A Celebration in Geophysics and

Oceanography-1982, in Honor of Walter Munk, La Jolla, CA 1143–1153.

Von Luxburg, U., Belkin, M. and Bousquet, O. (2008). Consistency of spectral

clustering. The Annals of Statistics 555–586.

Wiener, N. (1930). Generalized harmonic analysis. Acta mathematica, 55 117–258.

Witten, I. H. and Bell, T. (1991). The zero-frequency problem. Information Theory,

IEEE Transactions on, 37 1085–1094.

31

statistical foundation of spectral graph theory · 2016-05-18 · statistical foundation of...

Documents