statistical foundation of spectral graph theory · 2016-05-18 · statistical foundation of...
TRANSCRIPT
Statistical Foundation of Spectral Graph Theory
Subhadeep Mukhopadhyay
Temple University, Department of Statistics
Philadelphia, Pennsylvania, 19122, U.S.A.
Dedicated to the memory of Manny Parzen (1929-2016), a pioneer in nonparametric
spectral domain time series analysis, from whom the author learned so much.
Abstract
Spectral graph theory is undoubtedly the most favored graph data analysis tech-
nique, both in theory and practice. It has emerged as a versatile tool for a wide variety
of applications including data mining, web search, quantum computing, computer vi-
sion, image segmentation, and among others. However, the way in which spectral
graph theory is currently taught and practiced is rather mechanical, consisting of a
series of matrix calculations that at first glance seem to have very little to do with
statistics, thus posing a serious limitation to our understanding of graph problems
from a statistical perspective. Our work is motivated by the following question: How
can we develop a general statistical foundation of “spectral heuristics” that avoids
the cookbook mechanical approach? A unified method is proposed that permits fre-
quency analysis of graphs from a nonparametric perspective by viewing it as function
estimation problem. We show that the proposed formalism incorporates seemingly
unrelated spectral modeling tools (e.g., Laplacian, modularity, regularized Laplacian,
diffusion map etc.) under a single general method, thus providing better fundamen-
tal understanding. It is the purpose of this paper to bridge the gap between two
spectral graph modeling cultures: Statistical theory (based on nonparametric func-
tion approximation and smoothing methods) and Algorithmic computing (based on
matrix theory and numerical linear algebra based techniques) to provide transparent
and complementary insight into graph problems.
Keywords and phrases: Nonparametric spectral graph analysis; Graph correlation den-
sity field; Spectral regularization; Orthogonal functions based spectral approximation; Trans-
form coding of graphs; Karhunen-Loeve representation of graphs; High-dimensional discrete
data smoothing.
Acknowledgments: The author benefited greatly by many fruitful discussions and ex-
change of ideas he had at the John W. Tukey 100th Birthday Celebration meeting, Prince-
ton University September 18, 2015. The author is particularly grateful to Professor Ronald
Coifman for suggesting the potential connection between our formulation and Diffusion map
(random walk on graph). We also thank Isaac Pesenson and Karl Rohe for helpful questions
and discussions.
1
Contents
1 Introduction 21.1 Graph: A Unified Data-Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Two Complementary Viewpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Towards a Statistical Theory of Spectral Graph Analysis . . . . . . . . . . . . . . . 4
1.4 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Fundamentals of Statistical Spectral Graph Analysis 6
2.1 Graph Correlation Density Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Karhunen-Loeve Representation of Graph . . . . . . . . . . . . . . . . . . . . . . . 9
3 Nonparametric Spectral Approximation Theory 11
3.1 Orthogonal Functions Approach to Spectral Graph Analysis . . . . . . . . . . . . . 11
3.2 Graph Co-Moment Based Computational Mechanics . . . . . . . . . . . . . . . . . 13
4 A New Look At The Spectral Regularization 16
4.1 Smoothing High-dimensional Discrete Parameter Space . . . . . . . . . . . . . . . 16
4.2 Theory of Spectral Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
5 Algorithm and Applications 18
5.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.2 Graph Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.3 Graph Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
6 Further Extensions 26
7 Concluding Remarks 28
1 Introduction
1.1 Graph: A Unified Data-Structure
Graph models are powerful enablers towards developing unified statistical algorithms for
two reasons: First, availability of graph structured data from wide range of data-disciplines,
like computational biology (eg. protein-protein, gene-gene and gene-protein interaction net-
works), and the social sciences (eg., social networks like Facebook, MySpace, and LinkedIn)
etc. Second and more importantly, graph models offer a unified way of understanding dif-
ferent data types. Many complex (non-standard) data problems arise in various engineering
and scientific fields, such as segmentation of images in computer vision, mobile communica-
tions, electric power grids, cyber-security, scientific computation, functional brain dynam-
ics, 3d mesh processing in computer graphics, information retrieval, and natural language
processing–all of which can be reformulated as graph problems.
Because of the potential to be a powerful generalized data structure, it is of outstanding
interest to develop new ways of looking at graphs that might lead to faster and better
understanding of the essential characteristics of this high-dimensional discrete object.
2
1.2 Two Complementary Viewpoints
Vertex (or node) domain and frequency (or spectral) domain are two complementary rep-
resentation techniques for graphs.
A. Vertex domain analysis. Tremendous advances have been made over the past few
decades in building vertex domain graph models. The main focus has been to model “link
probabilities” between the nodes. Vertex domain analysis can be divided into two categories
based on its modeling approach.
(A1) Parametric modeling. These are the most extensively studied models in the statistics
and probability literature. Some key parametric statistical graph models include the Erdos-
Renyi model, p1-p2 model, latent space models, stochastic blockmodels and their numerous
extensions. For an excellent overview of different models, see the survey paper by Salter-
Townshend et al. (2012).
(A2) Nonparametric modeling. Lovasz and Szegedy (2006) introduced the concept of
Graphon, a functional modeling tool that compactly describes the edge probabilities or
connection probabilities of a random graph. Characterization and estimation based on
graphon is a fast-growing branch of nonparametric statistics and graph theory; see for
instance Mukhopadhyay (2015), Airoldi et al. (2013), Chatterjee (2014), and references
therein.
B. Frequency domain analysis. This provides a natural alternative to look into the
graphs. The practical significance and role of spectral analysis can be best summarized by
stating Tukey (1984) “Failure to use spectrum analysis can cost us lack of new phenomena,
lack of insight, lacking of gaining an understanding of where the current model seems to fail
most seriously.”
(B1) Algorithmic spectral analysis. The way spectral graph analysis is currently taught and
practiced can be summarized as follows (also known as spectral heuristics):
1. Convert Gn undirected graph to A ∈ Rn×n binary adjacency matrix, where A(x, y;G) =
1 if the nodes x and y are connected by an edge and 0 otherwise.
2. Define “suitable” spectral graph matrix. Most popular and successful ones are listed
below:
• L = D−1/2AD−1/2; Chung (1997)
• B = A − N−1ddT ; Newman (2006)
• Type-I Reg. L = D−1/2τ AD
−1/2τ ; Chaudhuri et al. (2012)
• Type-II Reg. L = D−1/2τ Aτ D
−1/2τ ; Amini et al. (2013)
3
D = diag(d1, . . . , dn) ∈ Rn×n, di denotes the degree of a node, τ > 0 tuning parameter,
and N = 2|E| =∑
x,y A(x, y).
3. Perform SVD on the matrix selected at step 2.
Spectral graph theory seeks to understand the interesting properties and structure of a
graph by using the dominant singular values and vectors, first recognized by Fiedler (1973).
A premier book on this topic is Chung (1997).
1.3 Towards a Statistical Theory of Spectral Graph Analysis
Nonparametric spectral analysis. To the best of our knowledge, there is no existing theory
for the nonparametric frequency analysis of graphs, despite that similar approaches for time
series analysis (pioneered by Wiener (Wiener, 1930), Tukey (Blackman and Tukey, 1958,
Cooley and Tukey, 1965), Parzen (Parzen, 1961), and many others) revolutionized 20th
century signal processing, leading to many engineering and scientific breakthroughs.
This paper arose out of the attempt to find a statistical explanation of the extraordinary
success of spectral heuristics in terms of both theory and application. Our goal is to un-
derstand at least some part of this puzzle from the statistical perspective that allows graph
analysts to adopt nonparametric approach. This would allow us to avoid the cookbook me-
chanical way of approaching frequency-domain (spectral) analysis of graphs that is solely
based on a series of matrix calculations (B1), which at first glance seems to have very lit-
tle to do with statistics. Driven by the maxim: more can be learned from graph data by
wise use of spectral analysis, this paper addresses the intellectual challenge of building a
firm statistical foundation from scratch that permits the science and art of frequency graph
analysis–a problem of exceptional interest at the present time.
1.4 Our Contributions
In this paper, we are interested in the statistical foundation of spectral graph analysis.
We present the mathematical foundation of nonparametric spectral approximation methods
for graph structured data. The mathematical treatment of this paper is guided by the
intention of providing an applicable and unified theoretical foundation based on new ideas
and concepts without resorting to unnecessary rigor that has little or no practical value for
algorithm design and applications.
There are four unique aspects of our research contributions in this paper:
• Unification: We develop a general statistical theory from the first principle that provides
genuine motivation and understanding of the spectral graph methods from a new perspec-
tive. As a taste of its power, we show how this new point of view incorporates seemingly
unrelated traditional ‘black-box’ spectral techniques as a particular instance.
4
• Cultural Bridge: An attempt is made to bridge the gap between the two cultures: Statis-
tical (based on nonparametric function approximation and smoothing methods) and Algo-
rithmic (based on matrix theory and numerical linear algebra based techniques), to provide
transparent and complementary insight into graph problems.
• Generalization: Our modeling perspective brings useful tools and crucial insights to de-
sign powerful statistical algorithms for random graph modelling. Our research inspires the
development of new innovative (fast and approximate) computational harmonic analysis
tools, which is probably necessary for large-scale spectral graph analysis.
• Interdisciplinarity : The interplay between statistics, approximation theory, and compu-
tational harmonic analysis is one of the central themes of this research.
We believe that similar to time series analysis, the proposed nonparametric spectral char-
acterization will usefully provide both computational and statistical advantages in terms
of better (compressed) representations and computationally effective simpler solutions to
graph modeling problems.
Section 2 is devoted to the fundamentals of statistical spectral graph analysis that will
provide the necessary background for a good understanding of the beauty and utility of
the proposed frequency domain viewpoint. We introduce the concept of graph correlation
density field (GraField). The Karhunen-Loeve representation of graph is defined via spec-
tral expansion of discrete graph kernel C , which provides universal spectral embedding
coordinates and graph Fourier basis. Unified nonparametric spectral approximation tech-
nique is discussed in Section 3. We describe a specialised class of smoothed orthogonal
series spectral estimation algorithm. Surprising connections between our approach and the
algorithmic spectral methods are derived. In Section 4, the open problem of obtaining a
formal interpretation of “spectral regularization” is addressed. A deep connection between
high-dimensional discrete data smoothing and spectral regularization is discovered. This
new perspective provides, for the first time, the theoretical motivation and fundamental
justification for using regularized spectral methods, which were previously considered to be
empirical guesswork-based ad hoc solutions. The algorithm of nonparametric estimation
of graph Fourier basis is given in Section 5. Important applications towards spatial graph
regression and graph clustering are also discussed. Section 6 presents further extensions
and generalizations to orthogonal functions based generalized Fourier graph analysis. We
introduce concepts like transform coding of graphs, orthogonal discrete graph transform,
and spectral sparsified graph matrices. Our research sheds light on the role of properly
designed basis functions to provide enhanced quality spectral representation and powerful
numerical algorithms for fast approximate learning–a topic of paramount importance for
analyzing big graph data at scale. In Section 7, we end with concluding remarks.
5
2 Fundamentals of Statistical Spectral Graph Analysis
2.1 Graph Correlation Density Field
At the outset, it is not even clear what the natural starting point for developing a coher-
ent statistical description of spectral graph analysis is. For reasons which will be clear
a little later, we introduce the concept of ‘Graph Correlation Density Field’ (or in short
GraField) – a crucial tool for our analysis that will allow cross-fertilization between algo-
rithmic and statistical modeling culture. GraField is a bivariate step kernel over the unit
square that provides a unified way to study the structure of graphs in the frequency domain
by completely characterizing and compactly representing the ‘affinity’ or ‘strength’ of ties
(or interaction) between every pair of vertices in the graph.
Curious readers might be wondering whether our approach has an analogous time series
counterpart where spectral analysis is related to the autocorrelation function (ACF) by a
Fourier transform (Wiener–Khinchin theorem). The answer is yes. In the similar spirit, we
will show (in the next section): classical Laplacian graph spectrum can be interpreted as
raw (not smoothed) nonparametric estimator of the Fourier transform of GraField.
Definition 1. For given discrete graph G of size n, the piecewise-constant bivariate kernel
function C : [0, 1]2 → R+ ∪ {0} is defined almost everywhere through
C (u, v;Gn) =p(Q(u;X), Q(v;Y );Gn
)p(Q(u;X)
)p(Q(v;Y )
) , 0 < u, v < 1, (2.1)
where u = F (x;X), v = F (y;Y ) for x, y ∈ {1, 2, . . . , n} and degree sequence induced graph
mass functions
p(x;X) =n∑y=1
A(x, y)/N, p(y;Y ) =n∑x=1
A(x, y)/N, and p(x, y;G) = A(x, y)/N
with Q(u;X) and Q(v;Y ) are the respective quantile functions.
Theorem 1. GraField defined in Eq. (2.1) is a positive piecewise-constant kernel satisfying∫∫[0,1]2
C (u, v;G) du dv =∑
(i,j)∈{1,...,n}2
∫∫Iij
C (u, v;G) du dv = 1,
where
Iij(u, v) =
1, if (u, v) ∈ (F (i;X), F (i+ 1;X)]× (F (j;Y ), F (j + 1;Y )]
0, elsewhere.
6
Note 1. The bivariate step-like shape of the GraField kernel is governed by the (piecewise-
constant left continuous) quantile functions Q(u;X) and Q(v;Y ) of the discrete measures
p(x;X) and p(y;Y ). As a result, in the continuum limit (as the dimension of the graph
n→∞), the shape of the piecewise-constant discrete C approaches to a “continuous field”
over unit interval.
Motivation 1. As a toy-example, consider the following adjacency matrix of a social net-
work representing 4 employees of an organization
A =
0 2 0 02 0 3 30 3 0 30 3 3 0
,
where the weights reflect number of communication (say email messages or coappear-
ances in social events etc.). Our interest lies in understanding the strength of associa-
tion between the employees i.e., Strength(x, y) for all pairs of vertices. Looking at the
matrix A (or equivalently based on histogram Graphon estimator p(x, y;G) = A/N with
N =∑
x,y A(x, y) = 22) one might be tempted to conclude that the link between employee
1 and 2 is the weakest one as they have communicated only twice, whereas employee 2, 3
and 4 constitute strong-ties as they have interacted more frequently. Now here is the sur-
prise. It turns out that (i) Strength(1, 2) is twice that of Strength(2, 3) and Strength(2, 4);
also (ii) Strength(1, 2) is 1.5 times of Strength(3, 4)! To understand the paradox com-
pute the vertex-domain empirical GraField kernel matrix (Definition 1) with (x, y)th entry
N ·A(x, y;G)/d(x)d(y)
Cn =
0 22/8 0 0
22/8 0 22/16 22/16
0 22/16 0 22/12
0 22/16 22/12 0
.
This toy example is in fact a small portion (with members 1, 9, 31 and 33) of the famous
Zachary’s karate club data, where the first two members were from Mr. Hi’s group and the
remaining two were from John’s group1. The purpose of this illustrative example is not to
completely dismiss the adjacency or empirical graphon based analysis but to caution the
practitioners so as not to confuse the terminology “strength of association” with “weights”
of the adjacency matrix – two are very different objects. Existing literature use them
interchangeably without paying much attention. As a first step towards exploratory graph
data analysis we recommend looking at both the traditional adjacency matrix (or Graphon
1For details about the Zachary’s karate club see https://en.wikipedia.org/wiki/Zachary\%27s_
karate_club.
7
● ●
●
●
●
●
● ●
●
●
●●
●●
●
●
●
1 2
3
4
5
6
7 8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
3334
Karate Graphon
● ●
●
●
●
●
● ●
●
●
●●
●●
●
●
●
1 2
3
4
5
6
7 8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
3334
Karate GraField
Figure 1: Graphon and GraField network display for the karate club data set. Nodes are
plotted in the same order in both the plots to match the degree of association and the
corresponding weight for each edge.
based) network plot and also the degree of association plot, as shown in the Fig.1 for the
Karate club data. It is important to emphasize that higher edge-weight do not necessarily
translates into stronger association. The crux of the matter is: Association does not depend
on the raw edge-density, it is a “comparison edge-density” that is captured by the GraField.
Motivation 2. As we have seen from the previous example, for finite dimensional graph
data analysis, empirical graphon captures exactly the same information that is contained
in adjacency matrix. Thus, it is not difficult to see that the graphon based spectral graph
analysis leads to the classical adjacency spectral embedding results. At the same time, it
has been long known that various normalized versions of adjacency spectral graph matrices
(such as Laplacian, modularity or diffusion matrix) perform far better in practice (Shen and
Cheng, 2010, Newman, 2006) as well as in theory (Von Luxburg et al., 2008, Sarkar et al.,
2015). One question naturally arises as to why we have such a narrow collection of ‘useful’
spectral graph matrices in our toolbox after almost half a century of active research by many
disciplines. The main reason for this is the highly non-trivial construction mechanisms for
each spectral matrices and there seem to be no general theory available that can unite
them under a single framework. How to overcome this long-standing barrier is an open
problem, which we address in this paper. A general strategy for spectral graph analysis
will be developed that includes most of the existing spectral methods as special cases. Our
formulation will be entirely (nonparametric) statistical where C plays a central role and an
elegant starting point.
8
Motivation 3. The other crucial point to note is that the “slices” of the GraField kernel
(2.1) can be expressed as p(y|x;G)/p(y;G) in the vertex domain. This alternative conditional
probability based viewpoint suggests a connection with the random walk on the graph.
Interpret p(y|x;G) as transition probability from vertex x to vertex y, and (when G is non-
bipartite) p(y;G) as the stationary distribution. This key observation along with Theorem
2 will allow us to integrate diffusion map (Coifman and Lafon, 2006) into our general
framework in Section 3.2.
Motivation 4. Graphon (Lovasz and Szegedy, 2006) captures edge probability density,
whereas GraField offers a way to measure and represent the ‘strength’ of connections (or
interaction) between pairs of vertices. GraField can also be viewed as properly “normalized
Graphon,” which is reminiscent of Wassily Hoeffding’s “standardized distributions” idea
(Hoeffding, 1940). Thus it can be interpreted as a discrete analogue of copula (the Latin
word copula means “a link, tie, bond”) density for random graphs that captures the un-
derlying correlation field. We study the structure of graphs in the spectral domain via this
fundamental graph kernel C that characterizes the implicit connectedness or tie-strength
between pairs of vertices.
Fourier-type spectral expansion result of the density matrix C is discussed in the ensuing
section, which is at the heart of our approach. We will demonstrate that this correlation
density operator based formalism provides a useful perspective for spectral analysis of graphs
that allows unification.
2.2 Karhunen-Loeve Representation of Graph
We define the Karhunen-Loeve (KL) representation of a graph G based on the spectral
expansion of its Graph Correlation Density function C (u, v;G). Schmidt decomposition
(Schmidt, 1907) of C yields the following spectral representation theorem of graph.
Theorem 2. The square integrable graph correlation density kernel C : [0, 1]2 → R+ ∪ {0}
of two-variables admits the following canonical representation
C (u, v;Gn) = 1 +n−1∑k=1
λkφk(u)φk(v), (2.2)
where the non-negative λ1 ≥ λ2 ≥ · · ·λn−1 ≥ 0 are singular values and {φk}k≥1 are the
orthonormal singular functions 〈φj, φk〉L2[0,1] = δjk, for j, k = 1, . . . , n − 1, which can be
evaluated as the solution of the following integral equation relation∫[0,1]
[C (u, v;G)− 1]φk(v) dv = λkφk(u), k = 1, 2, . . . , n− 1. (2.3)
9
Remark 1. What can we learn from the density kernel C ? We will show that this integral
eigenvalue equation-based formulation of correlation density operator provides a unified
view of spectral graph analysis. The spectral basis functions {φk} of the orthogonal se-
ries expansion of C are the central quantities of interest for graph learning and provide
the optimal representation in the spectral domain. The fundamental statistical modeling
problem hinges on finding approximate solutions to the optimal graph coordinate system
{φ1, . . . , φn−1} that satisfy the integral equation (2.3).
Remark 2. By virtue of the properties of Karhunen-Loeve (KL) expansion (Loeve, 1955),
the eigenfunction basis φk satisfying (2.3) provides the optimal low-rank representation of
a graph in the mean square error sense. In other words, {φk} bases capture the graph
topology in the smallest embedding dimension and thus carries practical significance for
graph compression. Hence, we can call those functions, the optimal coordinate functions or
Fourier representation basis.
Definition 2. Any function or signal y ∈ Rn defined on the vertices of the graph y : V 7→ R
such that ‖y‖2 =∑
x∈V (G) |y(x)|2p(x;G) <∞, can be represented as a linear combination of
the Schmidt bases of the correlation density matrix C . Define the generalized graph Fourier
transform of y
y(λk) := 〈y, φk〉 =n∑x=1
y(x)φk[F (x;G)].
This spectral or frequency domain representation of a signal, belonging to the square inte-
grable Hilbert space L2(G) equipped with the inner product
〈y, z〉L2(G) =∑
x∈V (G)
y(x)z(x)p(x;G),
allows us to construct efficient graph learning algorithms. As {φk}’s are KL spectral bases,
the vector of projections onto this basis function decay rapidly, hence may be truncated
aggressively to capture the structure in a small number of bits.
Definition 3. The entropy (or energy) of a discrete graph G, is defined using the Parseval
relation of the spectral representation
Entropy(G) =
∫∫[0,1]2
(C − 1)2 du dv =∑k
|λk|2.
This quantity, which captures the departure of uniformity of the C , can be interpreted as
a measure of ‘structure’ or the ‘compressibility’ of the graph. This entropy measure can
10
be used to (i) define graph homogeneity; (ii) design fast algorithms for graph isomorphism.
For homogenous graphs the shape of the correlation density field is flat uniform over unit
square. The power at each harmonic components, as a function of frequency, is called the
power spectrum of the graph.
The necessary theoretical background of a general method for solving the integral equation
(2.3) in order to produce good approximation of the optimal KL bases φk will be discussed
in Section 3 and 4. By doing so, we will reveal its connection with classical (algorithmic)
spectral analysis techniques.
3 Nonparametric Spectral Approximation Theory
We describe the nonparametric theory to approximate the canonical functions {φk}, which
play the role of Fourier basis for functions over graph G.
Definition 4. Define the spectral graph learning algorithm as method of approximating
(λk, φk)k≥1 that satisfies the integral equation (2.3) corresponding to the graph kernel
C (u, v;G). In practice, often the most important features of a graph can be well charac-
terized and approximated by few top dominating singular-pairs. The statistical estimation
problem can be summarize as follows:
An×n 7→ C 7→{(λ1, φ1
), . . . ,
(λn−1, φn−1
)}that satisfies Eq. (2.3).
This formalism and practical viewpoint of understanding the graph modeling problems
turns out to be very useful. To demonstrate its power, we will next show how this can
provide a general purpose unified framework for deriving traditional spectral graph analysis
techniques.
3.1 Orthogonal Functions Approach to Spectral Graph Analysis
Orthogonal Series Approximation. We will develop a specialized technique for approx-
imating spectra of integral kernel C . In particular, the theory and algorithm of Orthogonal
Series Spectral approximation (SOS) technique will be discussed. This general approxi-
mation scheme provides an effective and systematic way of discrete graph analysis in the
frequency domain.
Approximate the unknown function φk as a linear combination of elements from a complete
orthogonal system in L 2[0, 1]. Let {ξk} be a complete basis of Rn defined on the unit
interval [0, 1]. Accordingly, each singular function φk can be expressed as the expansion
11
1
2
3
4
1
2
3 4
0.0 0.2 0.4 0.6 0.8 1.0
0.0
1.0
2.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
1.0
2.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
1.0
2.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
1.0
2.0
0.0 0.2 0.4 0.6 0.8 1.00.
00.
61.
2
0.0 0.2 0.4 0.6 0.8 1.0
0.0
1.0
2.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
1.0
2.0
0.0 0.2 0.4 0.6 0.8 1.0
0.0
1.0
2.0
Figure 2: Two graphs and their corresponding degree-adaptive block-pulse functions. The
amplitudes and the block length of the indicator basis functions (3.2) depends on the degree
distribution of that graph.
over this basis
φk(u) ≈n∑j=1
αjk ξj(u), u ∈ [0, 1]. (3.1)
where αjk are the unknown coefficients to be estimated.
Degree-Adaptive Block-pulse Basis Functions. The most fundamental yet universally
valid (for any graph) choice for {ξj}1≤j≤n is the indicator top hat functions (also known as
block-pulse basis functions, in short BPFs). Instead of defining the BPFs on a uniform
grid (which is the usual practice) here we define them on the non-uniform mesh 0 = u0 <
12
u1 · · · < un = 1 over [0,1], where uj =∑
x≤j p(x;X) with local support
ξj(u) =
{p−1/2(j) for uj−1 < u ≤ uj;
0 elsewhere.(3.2)
They are disjoint, orthogonal and complete set of functions satisfying
1∫0
ξj(u) du =√p(j),
1∫0
ξ2j (u) du = 1, and
1∫0
ξj(u)ξk(u) du = δjk.
Note 2. Because of how we have defined the BPFs, the shape (amplitudes and block
lengths) depend on the specific graph structure via p(x;G) as shown in Fig 1. The estimated
φk, by representing them as block pulse series, will be called raw-nonparametric estimates. A
computational procedure and algorithm for estimating the unknown expansion coefficients
{αjk} satisfying (2.3) will be discussed next.
3.2 Graph Co-Moment Based Computational Mechanics
We described a technique via block-pulse series approach to approximate the eigenpairs of
the integral correlation density kernel equation. In order to obtain the spectral domain rep-
resentation of the graph, it is required to estimate the block-pulse function coefficients (3.1).
The following result describes the required computational scheme by revealing surprising
connections between graph Laplacian and Modularity matrix based spectral approaches.
Theorem 3. Let φ1, . . . , φn the canonical Schmidt bases of L 2 graph kernel C (u, v;G),
satisfying the integral equation (2.3). Then the solution of (2.3) for block-pulse orthogonal
series approximated (3.2) Fourier coefficients {αjk} can equivalently be written down in
closed form as the following matrix eigen-value problem
L∗[α] = λα, (3.3)
where L∗ = L − uuT , L is the Laplacian matrix, u = D1/2p 1n, and Dp = diag(p1, . . . , pn).
To prove that define the residual of the governing equation (2.3) by expanding φk as series
expansion (3.2),
R(u) ≡∑j
αjk
[ 1∫0
(C (u, v;G)− 1
)ξj(v) dv − λkξj(u)
]= 0. (3.4)
Now if the set {ξj} is complete and orthonormal in L 2(0, 1), then requiring the error R(u) to
be zero is equivalent to the statement that R(u) is orthogonal to each of the basis functions⟨R(u), ξk(u)
⟩L 2[0,1]
= 0, k = 1, . . . , n. (3.5)
13
This leads to the following set of equations:
∑j
αjk
[ ∫∫[0,1]2
(C (u, v;G)−1
)ξj(v)ξk(u) dv du
]− λk
∑j
αjk[ 1∫
0
ξj(u)ξk(u) du]
= 0. (3.6)
Definition 5. For graph Gn, define the associated co-moment matrixM(G, ξ) ∈ Rn×n with
respect to an orthonormal system ξ as
M[j, k;G, ξ] =
∫∫[0,1]2
(C (u, v;G)− 1
)ξj(v)ξk(u) dv du j, k = 1, . . . , n. (3.7)
For {ξk} to be top hat indicator basis, straightforward calculation reduces (3.6) to the
following system of linear algebraic equations expressed in the vertex domain∑j
αjk
[p(j, k)√p(j)p(k)
−√p(j)
√p(k) − λkδjk
]= 0. (3.8)
By plug-in empirical estimators verify that the equation (3.8) can be rewritten in the fol-
lowing compact matrix form [L −N−1
√d√dT]α = λα. (3.9)
The matrix L∗ = L−N−1√d√dT
is the co-moments matrix of the correlation density kernel
C (u, v;G) under the indicator basis choice.
Note 3. The fundamental idea behind the Rietz-Galerkin (Ritz, 1908, Galerkin, 1915) style
approximation scheme for solving variational problems in Hilbert space played a pivotal
inspiring role to formalize the statistical basis of the proposed computational approach.
Significance 1. The nonparametric spectral approximation scheme described in Theorem
3 reveals a surprising connection with graph Laplacian. The Laplacian eigenfunctions can
be viewed as block pulse series approximated KL bases {φk}–raw-empirical spectral basis
of graphs. Our technique provides a completely nonparametric statistical derivation of an
algorithmic spectral graph analysis tool; we are not aware of any other method that can
achieve a similar feat of connecting these two cultures of statistics. The plot of λk versus k
can be considered as a “raw periodogram” analogue for graph data analysis.
Significance 2. Replacing the estimated coefficients from Theorem 3 into (3.1) yields φk =
D−1/2p uk, where uk is the kth eigenvector of the Laplacian matrix L. This immediately leads
to the following vertex-domain spectral decomposition result of the empirical GraField
p(y|x;G)
p(y;G)= 1 +
∑k
λkφk(x)φk(y), (3.10)
14
where (φk ◦ F )(·) is abbreviated as φk(·), p(y|x;G) = T (x, y), and T = D−1A is the tran-
sition matrix of a random walk on G with stationary distribution p(y;G) = dy/N . For
an alternative proof of the expansion (3.10) see Appendix section of Coifman and Lafon
(2006) by noting φk are the right eigenvectors of random walk Laplacian T . Since {φk}n−1k=1
approximate the ‘optimal’ Karhunen-Loeve representation basis, it is only natural to use
them for non-linear embedding of graphs. This procedure is known as diffusion map, which
has been extremely successful tool for manifold learning2 and nonparametric stochastic
dynamical systems modeling (Talmon and Coifman, 2015). Our approach provides an addi-
tional insight and justification for the diffusion coordinates by interpreting it as the strength
of connectivity profile for each vertex, thus establishing a close connection with empirical
GraField.
Theorem 4. To approximate KL graph basis φk =∑
j αjkξj, choose ξj(u) = I(uj−1 < u ≤
uj) to be characteristic function satisfying
1∫0
ξj(u) du =
1∫0
ξ2j (u) du = p(j;G).
Then the corresponding spectral estimating equation (3.6) can equivalently be reduced to the
following generalized eigenvalue equation in terms of the matrix B = A − N−1ddT
Bα = λDα. (3.11)
Significance 3. The matrix B, known as modularity matrix, was introduced by Newman
(2006) from an entirely different motivation. Our analysis reveals that the Laplacian and
Modularity based spectral graph analysis are equivalent in the sense that they inherently use
the same underlying basis expansion (one is a rescaled version of the other) to approximate
the optimal graph bases.
Note 4. Solutions (eigenfunctions) of the graph co-moment estimating equation based on
the correlation density kernel C under the proposed nonparametric SOS approximation
scheme provides a systematic and unified framework for spectral graph analysis. As an
application of this general formulation, we have shown how the block-pulse function based
nonparametric approximation methods synthesize the well-known Laplacian, modularity and
diffusion map spectral algorithms. It is one of those rare occasions where one can witness
the convergence of statistical, algorithmic and geometry-motivated computational models.
2Data-driven learning of the “appropriate” coordinates to identify the intrinsic nonlinear structure of
high-dimensional data. We claim the concept of GraField allows decoupling of the geometrical aspect from
the probability distribution on the manifold.
15
4 A New Look At The Spectral Regularization
4.1 Smoothing High-dimensional Discrete Parameter Space
As we have demonstrated in the preceding sections, the fundamental goal of spectral graph
theory is to find the approximate canonical bases of the graph density matrix C , by express-
ing them as a linear combination of trail basis functions ξ1, . . . , ξn. Recall that (3.2) the
shape (“amplitude”) of the top hat indicator basis functions {ξk} depends on the unknown
distribution p(x;G) for x = {1, 2, . . . , n}.
MLE in Sparse Regime. This leads us to the question of estimating the unknown
distribution P = (p1, p2, . . . , pn) (support size = size of the graph = n) based on N sample,
where N =∑n
i=1 di = 2|E|. We previously used the MLE estimate (Theorem 3), which is
the raw-empirical discrete probability estimate p(x;G) to construct the ξ-functions in our
spectral approximation algorithm. However, the empirical MLE is known to be strictly sub-
optimal (Witten and Bell, 1991) in the sparse-regime where N/n = O(1) i.e., bounded. This
situation can easily arise for modern day large sparse random graphs where both n and N are
of comparable order. The question remains: How to tackle this high-dimensional discrete
probability estimation problem, as this directly impacts the efficiency of our nonparametric
spectral approximation.
Laplace Smoothing. We seek a practical solution for countering this problem that lends
itself to fast computation. The solution that is both the simplest as well as remarkably
serviceable is the Laplace/Additive smoothing (Laplace, 1951) and its variants, which excels
in sparse regimes (Fienberg and Holland, 1973, Witten and Bell, 1991). The MLE and
Laplace estimates of the discrete distribution p(j;G) are respectively given by
Raw-empirical MLE estimates: p(j;G) =djN
;
Smooth Laplace estimates: pτ (j;G) =dj + τ
N + nτ(j = 1, . . . , n).
Note that the smoothed distribution pτ can be expressed as a convex combination of the
empirical distribution p and the discrete uninform distribution 1/n
pτ (j;G) =N
N + nτp(j;G) +
nτ
N + nτ
(1
n
), (4.1)
which provides a Stein-type shrinkage estimator of the unknown probability mass function
p. The shrinkage significantly reduces the variance, at the expense of slightly increasing the
bias.
Choice of τ . The next issue is how to select the “flattening constant” τ . Three choices of
16
τ are most popular in the literature:
τ =
1 Laplace estimator;
1/2 Krichevsky–Trofimov estimator;√N/n Minimax estimator (under L2 loss).
For more details on selection of τ see Fienberg and Holland (1973) and references therein. In
the next section, we will reveal a surprising connection between Laplace-smooth block-pulse
series approximation scheme and regularized spectral graph analysis.
4.2 Theory of Spectral Regularization
Construct τ -regularized top hat indicator basis ξj;τ by replacing the amplitude p−1/2(j) by
p−1/2τ (j) following (4.1). Incorporating this regularized trial basis, we have the following
modified graph co-moment based linear algebraic estimating equation:
∑j
αjk
[p(j, k)√pτ (j)pτ (k)
−√pτ (j)
√pτ (k) − λkδjk
]= 0. (4.2)
Theorem 5. τ -regularized block-pulse series based spectral approximation scheme is equiv-
alent to representing or embedding discrete graphs in the continuous eigenspace of
Type-I Regularized Laplacian = D−1/2τ AD−1/2
τ , (4.3)
where Dτ is a diagonal matrix with i-th entry di + τ .
Note 5. It is interesting to note that this exact regularized Laplacian formula was proposed
by Chaudhuri et al. (2012) and Qin and Rohe (2013), albeit from very different motivation.
Theorem 6. Estimate the joint probability p(j, k;G) by extending the formula given in (4.1)
for two-dimensional case as follows:
pτ (j, k;G) =N
N + nτp(j, k;G) +
nτ
N + nτ
(1
n2
), (4.4)
which is equivalent to replacing the original adjacency matrix by Aτ = A+ (τ/n)11T . This
modification via smoothing in the estimating equation (4.2) leads to the following spectral
graph matrix
Type-II Regularized Laplacian = D−1/2τ Aτ D
−1/2τ . (4.5)
Note 6. Exactly the same form of regularization of Laplacian graph matrix (4.5) was
proposed by Amini et al. (2013) as a fine-tuned empirical solution.
17
Significance 4. We have addressed the open problem of obtaining a rigorous interpretation
and understanding of spectral regularization. Our approach provides a more formal and
intuitive understanding of the spectral heuristics for regularization. We have shown how the
regularization naturally arises as a consequence of high-dimensional discrete data smoothing.
In addition, this point of view allows us to select appropriate regularization parameter τ with
no additional computation. To the best of our knowledge this is the first work that provides a
theoretical derivation and fundamental justification for regularized spectral methods, which
were previously considered empirical guesswork-based ad hoc solutions. This perspective
might also suggests how to construct other new types of regularization schemes3.
5 Algorithm and Applications
We describe how the core concepts and the general theory can be used to develop spectral
models of graph structured data for solving applied problems.
5.1 Algorithm
Our nonparametric theory provides an alternative perspective to the conventional algo-
rithmic spectral graph modeling allowing easy generalization. We start by estimating the
spectral domain graph decomposition basis functions for compact representation. In what
follows, we describe the unified algorithm that employs the proposed orthogonal series spec-
tral approximation technique to compute the Karhunen-Loeve basis of graph correlation
density kernel C , which yields the ‘optimal’ discrete graph Fourier transform.
Algorithm 1. Nonparametric Estimation of Graph Fourier Basis
1. Input: The adjacency matrixA ∈ Rn×n. The regularization parameter τ ∈ {0, 1, 1/2,√N/n}.
The number of spectral basis required is k.
2. Construct τ -regularized block-pulse trial basis functions ξj;τ = p−1/2j;τ I(uj−1 < u ≤ uj) for
j = 1, . . . , n.
3. Compute the co-moment matrix of the graph with respect to the τ -regularized block-pulse
trial basis functions {ξk;τ}1≤k≤n.
Mτ [j, k;G, ξ] =⟨ξj;τ ,
1∫0
(C − 1)ξk;τ
⟩L2[0,1]
for j, k = 1, . . . , n.
3Two such examples to construct smooth p(x;G) would be Stein estimate based on data-driven empirical
Bayes shrinkage parameter or Good-Turing estimate.
18
Zinc concentrations (ppm)
●● ●
●
●●
●●
●●
●●
●●●
●
●
●
●
● ●●●
●
●●●
●●
●
●●
●●●●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●
● ●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●●●
●●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
● ●
●
●●
●●●
●
●
●
●●●
●
200
400
600
800
1000
1200
1400
1600
1800
Flooding frequency classes
●● ●
●
●●
●●
●●
●●
●●●
●
●
●
●
● ●●●
●
●●●
●●
●
●●
●●●●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●
● ●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●●●
●●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
● ●
●
●●
●●●
●
●
●
●●●
●
●
●
●
123
Distance to river
●● ●
●
●●
●●
●●
●●
●●●
●
●
●
●
● ●●●
●
●●●
●●
●
●●
●●●●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●
● ●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●●●
●●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
● ●
●
●●
●●●
●
●
●
●●●
●
200
400
600
800
1000
Soil type classes
●● ●
●
●●
●●
●●
●●
●●●
●
●
●
●
● ●●●
●
●●●
●●
●
●●
●●●●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●
● ●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●●●
●●
●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
● ●
●
●●
●●●
●
●
●
●●●
●
●
●
●
123
Figure 3: The Meuse dataset consists of n = 155 observations taken on a support of 15×15
m from the top 0 − 20 cm of alluvial soils in a 5 × 2 km part of the right bank of the
floodplain of the the river Meuse, near Stein in Limburg Province (NL). The dependent
variable Y is the zinc concentration in the soil (in mg kg−1), shown in the leftmost figure.
The other three variables, flooding frequency class (1 = once in two years; 2 = once in ten
years; 3 = one in 50 years), distance to river Meuse (in metres), and soil type (1= light
sandy clay; 2 = heavy sandy clay; 3 =silty light clay), are explanatory variables.
4. Perform the singular value decomposition (SVD) of M = UΛUT =∑
k ukµkuTk , where
where uij are the elements of the singular vector of moment matrix U = (u1, . . . , un), and
Λ = diag(µ1, . . . , µn), µ1 ≥ · · ·µn ≥ 0. Set λk = µk.
5. Estimate the optimal spectral connectivity profile for each node {φ1i;τ , . . . , φ(n−1)i;τ}, i =
1, . . . , n
φk;τ (u) =n∑j=1
ujkξj;τ , for k = 1, . . . , n− 1.
6. Return estimated spectral basis matrix Φ =[φ1;τ , . . . , φk;τ
]∈ Rn×k for the graph G.
5.2 Graph Regression
We study the problem of graph regression as an interesting application of the proposed
nonparametric spectral analysis algorithm. Unlike traditional regression settings, here one
is given n observations of the response and predictor variables over graph. The goal is
to estimate the regression function by properly taking into account the underlying graph
structured information along with the set of covariates.
19
●●●●●
●●●●
●●●
●●●
●●●●●●●● ●●●●●● ●●
●●●●
●●●●●●
●●●
●●●●●●
●
●●
●●
●●●
●
●●
●●● ●
●●
●
●
●●●●●●●●
●●●●
●
●
●
●●●● ●●
●
●●
●
●●●
●● ●
●●●
●
●
●
●
●
●●●
●●● ●●
●
●
●●●●
●
●●
●●●
●●
●
●
●
●
●
●●●●
● ●●●
●●●●
●
●●●●
●
●
●
12
3
4
5
6
78
9
101112
13
14
15
1617
181920
2122
2324
25262728
29
30
31
3233
3435
363738
3940
41
4243
44
45
46
47
48
49
50
51
52
53
54
55
5657
58
59
60
61
62
6364
65
66
67
68
69
707172
73
74
7576
77
787980
81
82
83
84
8586
87 8889
90
91
92
93
94
95
9697
98 99 100
101
102
103
104
105
106
107
108
109
110111
112113
114115
116
117
118
119
120121122
123
124
125
126
127128
129130
131
132
133
134
135
136137
138
139
140 141
142 143
144145
146
147
148
149150
151152
153
154
155
degree of the spatial Meuse data graph
Fre
quen
cy
5 10 15 20
05
1015
Figure 4: From spatial to graph representation of Meuse river data set. For each node we
observe {yi;x1i, x2i, x3i}ni=1. We address the problem of approximating the regression func-
tion incorporating simultaneously the effect of explanatory variables X and the underlying
spatial graph dependency.
Meuse Data Modeling. We apply our frequency domain graph analysis for the spatial
prediction problem. Fig 3 describes the Meuse data set, a well known geostatistical dataset.
There is a considerable spatial pattern one can see from Fig 3. We seek to estimate a
smooth regression function of the dependent variable Y (zinc concentration in the soil) via
generalized spectral regression that can exploit this spatial dependency. The graph was
formed according to the geographic distance between points based on the spatial locations
of the observations. We convert spatial data into signal supported on graph by connecting
two vertices if the distance between two stations is smaller than a given coverage radius. The
maximum of the first nearest neighbor distances is used as a coverage radius to ensure at
least one neighbor for each node. Fig 4(C) shows the sizes of neighbours for each node that
ranges from 1 to 22. Three nodes (# 82, 148 and 155) have only 1 neighbour; additionally
one can see a very weakly connected small cluster of three nodes, which is completely
detached from the bulk of the other nodes. The reason behind this heterogeneous degree
distribution (as shown in Fig 4) is the irregular spatial pattern of the Meuse data.
We model the relationship between Y and spatial graph topology by incorporating non-
parametrically learned spectral representation basis. Thus, we expand Y in the eigenbasis
of C and the covariates for the purpose of smoothing, which effortlessly integrates the tools
from harmonic analysis on graphs and conventional regression analysis. The model (5.1) de-
scribed in the following algorithm simultaneously promotes spatial smoothness and sparsity,
which is motivated by Donoho’s “de-noising by soft-thresholding” idea (Donoho, 1995).
20
Algorithm 2. Nonparametric Spectral Graph Regression
Step 1. Input: We observe {yi;xi1, . . . , xip} at vertex i of the graph G = (V,A) with size
|V | = n.
Step 2. Apply Algorithm 1 to construct the orthogonal series approximated spectral graph
basis and store it in Φ ∈ Rn×k.
Step 3. Construct combined graph regressor matrix XG =[Φ;X
], where X = [X1, . . . Xp] ∈
Rn×p is the matrix of predictor variables.
Step 4. Solve for βG = (βΦ, βX)T
βG = argminβG∈Rk+p
‖y −XGβG‖22 + λ‖βG‖1. (5.1)
Algorithm 2 extends traditional regression to data sets represented over graph. The pro-
posed frequency domain smoothing algorithm efficiently captures the spatial graph topology
via the spectral coefficients βΦ ∈ Rk–can be interpreted as covariate-adjusted discrete graph
Fourier transform of the response variable Y . The `1 sparsity penalty automatically selects
the coefficients with largest magnitudes thus provides compression.
The following table shows that incorporating the spatial correlation in the baseline unstruc-
tured regression model using spectral orthogonal basis functions (which are estimated from
the spatial graph with k = 25) boost the model fitting from 62.78% to 80.47%, which is an
improvement of approximately 18%.
X Φ0 +X ΦIτ=1 +X ΦI
τ=.5 +X ΦIτ=.28 +X ΦII
τ=1 +X ΦIτ=.5 +X ΦI
τ=.28 +X
62.78 74.80 78.16 80.47 80.43 80.37 80.45 80.43
Here Φ0, ΦIτ and ΦII
τ are the graph-adaptive piecewise constant orthogonal basis functions de-
rived respectively from the ordinary, Type-I regularized, and Type-II regularized Laplacian
matrix. Our spectral method can also be interpreted as a kernel smoother where the spatial
dependency is captured by the graph correlation density field C . We finally conclude that
extension from traditional regression to graph structured spectral regression significantly
improve the model accuracy.
5.3 Graph Clustering
To further demonstrate the potential application of our algorithm, we discuss the community
detection problem that seeks to divides nodes into into k groups (clusters), with larger
21
proportion of edges inside the group (homogeneous) and comparatively sparser connections
between groups to understand the large-scale structure of network. Discovering community
structure is of great importance in many fields such as LSI design, parallel computing,
computer vision, social networks, and image segmentation.
In mathematical terms, the goal is to recover the graph signals (class labels) y : V 7→{1, 2, . . . , k} based on the connectivity pattern or the relationship between the nodes.
Representation of graph in the spectral or frequency domain via the nonlinear mapping
Φ : G(V,E) 7→ Rm using discrete KL basis of density matrix C as the co-ordinate is the
most important learning step in the community detection. This automatically generates
spectral features {φ1i, . . . , φmi}1≤i≤n for each vertex that can be used for building the dis-
tance or similarity matrix to apply k-means or hierarchical clustering methods. In our
examples, we will apply k-means algorithm in the spectral domain, which seeks to minimiz-
ing the within-cluster sum of squares. In practice, often the most stable spectral clustering
algorithms determine k by spectral gap: k = argmaxj |λj − λj+1|+ 1.
Algorithm 3. Nonparametric Spectral Graph Partitioning
1. Input: The adjacency matrix A ∈ Rn×n. Number of clusters k. The regularization
parameter τ .
2. Estimate the top k − 1 spectral connectivity profile for each node {φ1i;τ , . . . , φ(k−1)i;τ}using Algorithm 1. Store it in Φ ∈ Rn×k−1.
3. Apply k-means clustering by treating each row of Φ as a point in Rk−1.
4. Output: The cluster assignments of n vertices of the graph C1, . . . , Ck.
In addition to graph partitioning, the spectral ensemble {λk, φk}1≤k≤m of C contain a wealth
of information on the graph structure. For example, the quantity 1− λ1(G; ξ) for the choice
of {ξk} to be normalized top hat basis (3.2), is referred to as the algebraic connectivity,
whose magnitude reflects how well connected the overall graph is. The kmeans cluster-
ing after spectral embedding {φj(G; ξ)}1≤j≤k−1 finds approximate solution to the NP-hard
combinatorial optimization problem based on the normalized cut (Shi and Malik, 2000)
by relaxing the discreteness constraints into one that is continuous (sometimes known as
spectral relaxation).
Data and Results. We investigate four well-studied real-world networks for community
structure detection based on 7 variants of spectral clustering methods.
Example A [Political Blog data, Adamic and Glance (2005)] The data, which contains
1222 nodes and 16, 714 edges, were collected over a period of two months preceding the U.S.
22
Table 1: We report % of misclassification error. We compare following seven different
Laplacian variants. K denotes the number of communities.
Type-I Reg. Laplacian Type-II Reg. Laplacian
Data K Laplacian τ = 1 τ = 1/2 τ =√N/n τ = 1 τ = 1/2 τ =
√N/n
PolBlogs 2 47.95% 4.9% 4.8% 5.4% 4.8% 4.7% 5.4%
Football 11 11.3% 7.83% 6.96% 6.96% 6.96% 7.83% 7.83%
MexicoPol 2 17.14% 14.2% 14.2% 14.2% 14.2% 14.2% 14.2%
Adjnoun 2 13.4% 12.5% 12.5% 12.5% 12.5% 12.5% 12.5%
Presidential Election of 2004 to study how often the political blogs refer to one another.
The linking structure of the political blogosphere was constructed by identifying whether
a URL present on the page of one blog references another political blog (extracted from
blogrolls). Each blog was manually labeled as liberal or conservative by Adamic and Glance
(2005), which we take as ground truth. The goal is to discover the community structure
based on these blog citations, which will shed light on the polarization in political blogs.
Table 1 shows the result of applying the spectral graph clustering algorithm on this political
web-blog data. The un-regularized Laplacian performs very poorly, whereas as both type-
I/II regularized versions give significantly better results. The misclassification error drops
from 47.95% to 4.7% because of regularization. To better understand why regularization
plays a vital role, consider the degree distribution of the web-blog network as shown in
the bottom panel of Figure 2. It clearly shows the presence of a large number of low-
degree nodes, which necessitates the smoothing of high-dimensional discrete probability
p1, . . . , p1222. Thus, we perform the kmeans clustering after projecting the graph in the
Euclidean space spanned by Laplace smooth KL spectral basis {φk;τ}. Regularized spectral
methods correctly identify two dense clusters: liberal and conservative blogs, which rarely
links to a blog of a different political leaning, as shown in the middle panel of Fig 6.
Example B [US College Football, Grivan and Newman (2002)] The American football
network (with 115 vertices, 615 edges) depicts the schedule of football games between NCAA
Division IA colleges during the regular season of Fall 2000. Each node represents a college
team (identified by their college names) in the division, and two teams are linked if they have
played each other that season. The teams were divided into 11 “conferences” containing
around 8 to 12 teams each, which formed actual communities. The teams in the same
23
●●
●●
●
●
● ●
●
●
●
●
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
Figure 5: Mexican political network. Two different colors (golden and blue) denotes the
two communities (military and civilians).
conference played more often compared to the other conferences, as shown in the middle
panel of Fig 6. A team played on average 7 intra- and 4 inter-conference games in the season.
Inter-conference play is not uniformly distributed; teams that are geographically close to one
another but belong to different conferences are more likely to play one another than teams
separated by large geographic distances. As the communities are well defined, the American
football network provides an excellent real-world benchmark for testing community detection
algorithms.
Table 1 shows the performance of spectral community detection algorithms to identify the 11
clusters in the American football network data. The regularization boosts the performance
by 3-4%. In particular, τ = 1/2 and√N/n produces the best result for Type-I regularized
Laplacian, while τ = 1 exhibits the best performance for Type-II regularized Laplacian.
Example C [The Political Network in Mexico, Gil-Mendieta and Schmidt (1996)] The data
(with 35 vertices and 117 edges) represents the complex relationship between politicians in
Mexico (including presidents and their close associates). The edge between two politicians
indicates a significant tie, which can either be political, business, or friendship. A classifi-
cation of the politicians according to their professional background (1 - military force, 2 -
civilians: they fought each other for power) is given. We use this information to compare
our 7 spectral community detection algorithms.
Although this is a “small” network, challenges arise from the fact that the two commu-
nities cannot be separated easily due to the presence of a substantial number of between-
community edges, as depicted in Figs 5 and 6. The degree-sparsity is also evident from Fig 6
(bottom-panel). Table 1 compares seven spectral graph clustering methods. Regularization
yields 3% fewer misclassified nodes. Both the type-I and II regularized Laplacian methods
for all the choices of τ produce the same result.
24
degree of the web−blog network
Fre
quen
cy
0 50 100 150 200 250 300 350
010
020
030
040
050
0
degree of the US college football network
Fre
quen
cy
7 8 9 10 11 12 13
010
2030
4050
60
degree of the political network in Mexico
Fre
quen
cy
5 10 15
02
46
810
12
degree of the word adjacencies network
Fre
quen
cy
0 10 20 30 40 50
010
2030
40
Figure 6: The columns denote the 4 datasets corresponding to the political blog, US col-
lege football, politicians in Mexico, and word adjacencies networks (the description of the
datasets are given in Section 5.3). The first two rows display the un-ordered and ordered
adjacency matrix, and the final row depicts the degree distributions.
Example D [Word Adjacencies, Newman (2006)] This is a adjacency network (with 112
vertices and 425 edges) of common adjectives and nouns in the novel David Copperfield
by English 19th century writer Charles Dickens. The graph was constructed by Newman
(2006). Nodes represent the 60 most commonly occurring adjectives and nouns, and an
edge connects any two words that appear adjacent to one another at any point in the book.
Eight of the words never appear adjacent to any of the others and are excluded from the
network, leaving a total of 112 vertices. The goal is to identify which words are adjectives
and nouns from the given adjacency network.
Note that typically adjectives occur next to nouns in English. Although it is possible for
adjectives to occur next to other adjectives (e.g., “united nonparametric statistics”) or for
25
nouns to occur next to other nouns (e.g., “machine learning”), these juxtapositions are less
common. As expected, Fig 6 (middle panel) shows an approximately bipartite connection
pattern among the nouns and adjectives.
A degree-sparse (skewed) distribution is evident from the bottom right of Fig 6. We apply
the seven spectral methods and the result is shown in Table 1. The traditional Laplacian
yields a 13.4% misclassification error for this dataset. We get better performance (although
the margin is not that significant) after spectral regularization via Laplace smoothing.
6 Further Extensions
We have employed a nonparametric method for approximating the optimal Karhunen-
Loeve spectral graph basis (eigenfunction of graph correlation density matrix C ) via block-
pulse orthogonal series approach. The beauty of our general approach is in its ability to
unify seemingly disparate spectral graph algorithms while being simple and flexible. Our
theory and algorithm are readily applicable to other sophisticated orthogonal transforms
for obtaining more enhanced quality approximate solution of the functional basis.
Transform Coding of Graphs. Our general nonparametric spectral approximation theory
remains unchanged for any choice of orthogonal piecewise-constant trial basis functions (e.g.,
Walsh, Rademacher function, B-spline, or Haar functions etc.), which paves the way for the
generalized harmonic analysis of graphs.
Definition 6 (Orthogonal Discrete Graph Transform). We introduce a generalized concept
of matrices associated with graphs. Define orthogonal discrete graph transform as
M[j, k;G, ξ] =⟨ξj,
1∫0
(C − 1)ξk
⟩L2[0,1]
for j, k = 1, . . . , n. (6.1)
The family of Laplacian-like graph matrices can be thought of as block-pulse transform
coefficients matrix. Equivallently, we can define the discrete graph transform to be the
coefficient matrix of the orthogonal series expansion of the kernel C (u, v;G) with respect to
the product bases {ξjξk}1≤j,k≤n.
Significance 5 (Broadening the Class of Graph Matrices). As a practical significance, this
generalization provides a unified and systematic recipe for converting the graph problem
into a “suitable” matrix problem in the new transformed domain:
Gn(V,E) −→ An×n −→ C (u, v;Gn){ξ1,...,ξn}−−−−−→Eq. (6.1)
M(G) ∈ Rn×n.
The matrix M(G) allows us to broaden currently existing narrow classes of graph matri-
ces. Furthermore, verify that M = ΨTPΨ, where Ψjk = ξj(F (k;G)) for j, k = 1, . . . , n.
26
The Laplacian-like matrices (both normalized and regularized versions) and the Modularity
matrix are merely special instances of this general family.
Spectral Sparsification and Smoothing. Choice of the orthogonal piecewise-constant
trial bases {ξk}1≤k≤n can have a dramatic impact (in terms of theory and practice) on analyz-
ing graphs. Wisely constructed bases not only lead to better quality spectral representation,
but also powerful numerical algorithms for fast approximate learning. Our statistical view-
point sheds light on the scope and role of orthogonal piecewise-constant functions in discrete
graph processing.
Significance 6 (Spectral Smoothing). Orthogonal series approximation of φk =∑
j αjk ξj(u)
based on suitable and more sophisticated piecewise-constant basis functions (compared to
the block-pulse functions) provide increased flexibility that might produce enhanced quality
nonparametric “smooth” estimates instead of “raw” estimate of φk ( k = 1, . . . , n). We
emphasize that all of our theory remains completely valid and unchanged for any choice of
orthogonal transform (i.e., trial bases).
Significance 7 (An Accelerated Spectral Approximation Scheme). The conventional spec-
tral coding method based on singular value decomposition (SVD of partial order k) of dense
spectral matrix is roughly of the order O(kn2), which is costly and unrealistic to compute
for large graphs. For sparse matrices there exists iterative algorithms like Lanczos or ap-
proximate methods like Nystrom or Randomized SVD, whose computational cost scales as
O(ks), where s = number of non-zero elements of matrix M. Thus, it is logical to prefer
sparsifier trial basis (that produces sparse transform matrix) to speed-up the computation.
As a future research, we plan to investigate the role of Haar wavelet on large graph pro-
cessing. The key issues to be investigated include whether it provides a computational
advantage over standard approaches (which use indicator top-hat basis) for frequency anal-
ysis of graphs without sacrificing the accuracy, as well as whether localized compact support
of wavelet bases lead to sparsified graph transform. Such an approach will require efficient
techniques for sparse representation and recovery based on the pioneering ideas of Donoho
and Johnstone (Donoho and Johnstone, 1994).
All in all, the notion of generalized discrete graph transform inspire us to develop practical
and efficient compressible algorithms for large graphs via spectral sparsified graph matrices,
which could dramatically reduce memory and time requirements. However, more research
is required to examine this largely uncharted territory of graph data analysis.
27
7 Concluding Remarks
This paper introduces a new statistical way of thinking, teaching and practice of the spectral
analysis of graphs. In what follows, we highlight some of the key characteristics of this work:
• This paper does not offer a new spectral graph analysis tool, rather, it offers only a
new point of view from statistics perspective by appropriately transforming it into a non-
parametric approximation and smoothing problem. In particular, we have reformulated
the spectral graph theory as a method of obtaining approximate solution of the integral
eigenvalue equation of the correlation density matrix C – Correlation Density Functional
theory.
• A specialised orthogonal series spectral approximation technique is developed, which ap-
pears to unify and generalize the existing paradigm. We also reported striking finding that
clarifies, for the first time, the mystery of the spectral regularized algorithm by connecting
it with high-dimensional discrete parameter smoothing.
• Historically, the development of efficient computing algorithms for spectral graph analysis
has been inspired either by numerical linear algebra or combinatorics. The intention of this
paper was to develop a general theory of approximate spectral methods that
(i) offers a complete and autonomous description, totally free from the classical combinato-
rial or algorithmic linear algebra based languages;
(ii) can be fruitfully utilized to design statistically-motivated modern computational learning
techniques, which are of great practical interest since they might lead to fast algorithms for
large graphs.
• This work shows an exciting confluence of three different cultures : Nonparametric func-
tional statistical theory, Mathematical approximation theory (of integral equations), and
Computational harmonic analysis.
In essence, this paper presents a modern statistical view on spectral analysis of graphs that
contributes new insights for developing unified theory and algorithm from a nonparametric
perspective.
References
Adamic, L. A. and Glance, N. (2005). The political blogosphere and the 2004 us election:
divided they blog. In Proceedings of the 3rd international workshop on Link discovery.
ACM, 36–43.
Airoldi, E. M., Costa, T. B. and Chan, S. H. (2013). Stochastic blockmodel ap-
28
proximation of a graphon: Theory and consistent estimation. In Advances in Neural
Information Processing Systems. 692–700.
Amini, A. A., Chen, A., Bickel, P. J., Levina, E. et al. (2013). Pseudo-likelihood
methods for community detection in large sparse networks. The Annals of Statistics, 41
2097–2122.
Blackman, R. B. and Tukey, J. W. (1958). The measurement of power spectra from the
point of view of communications engineering—Part I-III. Bell System Technical Journal,
The, 37 485–569.
Chatterjee, S. (2014). Matrix estimation by universal singular value thresholding. The
Annals of Statistics, 43 177–214.
Chaudhuri, K., Graham, F. C. and Tsiatas, A. (2012). Spectral clustering of graphs
with general degrees in the extended planted partition model. Journal of Machine Learn-
ing Research Workshop and Conference Proceedings, 35 1–23.
Chung, F. R. (1997). Spectral graph theory, vol. 92. American Mathematical Soc.
Coifman, R. R. and Lafon, S. (2006). Diffusion maps. Applied and computational
harmonic analysis, 21 5–30.
Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine calculation of
complex fourier series. Mathematics of computation, 19 297–301.
Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transactions on Informa-
tion Theory, 41 613–627.
Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet
shrinkage . Biometrika., 81 425–455.
Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Mathematical Journal,
23 298–305.
Fienberg, S. E. and Holland, P. W. (1973). Simultaneous estimation of multinomial
cell probabilities. Journal of the American Statistical Association, 68 683–691.
Galerkin, B. (1915). Series development for some cases of equilibrium of plates and beams
(in Russian). Wjestnik Ingenerow Petrograd, 19 897–908.
Gil-Mendieta, J. and Schmidt, S. (1996). The political network in mexico. Social
Networks, 18 355–381.
Grivan, M. and Newman, M. E. J. (2002). Community structure in social and biological
networks. Proc. Natl. Acad. Sci 7821–7826.
29
Hoeffding, W. (1940). Massstabinvariante korrelationstheorie. Schriften des Mathema-
tischen Seminars und des Instituts fur Angewandte Mathematik der Universitat Berlin,
5 179–233.
Laplace, P. S. (1951). A philosophical essay on probabilities, translated from the 6th
French edition by Frederick Wilson Truscott and Frederick Lincoln Emory. Dover Publi-
cations (New York, 1951).
Loeve, M. (1955). Probability Theory; Foundations, Random Sequences. New York: D.
Van Nostrand Company.
Lovasz, L. and Szegedy, B. (2006). Limits of dense graph sequences. Journal of Com-
binatorial Theory, Series B, 96 933–957.
Mukhopadhyay, S. (2015). Strength of connections in a random graph: Definition, char-
acterization, and estimation. arXiv:1412.1530.
Newman, M. E. (2006). Finding community structure in networks using the eigenvectors
of matrices. Physical review E, 74 036104.
Parzen, E. (1961). An approach to time series analysis. The Annals of Mathematical
Statistics 951–989.
Qin, T. and Rohe, K. (2013). Regularized spectral clustering under the degree-corrected
stochastic blockmodel. In Advances in Neural Information Processing Systems. 3120–
3128.
Ritz, W. (1908). Uber eine neue methode zur losung gewisser randwertaufgaben.
Nachrichten von der Gesellschaft der Wissenschaften zu Gottingen, Mathematisch-
Physikalische Klasse 236–248.
Salter-Townshend, M., White, A., Gollini, I. and Murphy, T. B. (2012). Review
of statistical network analysis: models, algorithms, and software. Statistical Analysis and
Data Mining: The ASA Data Science Journal, 5 243–264.
Sarkar, P., Bickel, P. J. et al. (2015). Role of normalization in spectral clustering
for stochastic blockmodels. The Annals of Statistics, 43 962–990.
Schmidt, E. (1907). Zur Theorie der linearen und nicht linearen Integralgleichungen Zweite
Abhandlung. Mathematische Annalen, 64 433–476.
Shen, H.-W. and Cheng, X.-Q. (2010). Spectral methods for the detection of network
community structure: a comparative analysis. Journal of Statistical Mechanics: Theory
and Experiment, 2010 P10020.
30
Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. Pattern Analysis
and Machine Intelligence, IEEE Transactions on, 22 888–905.
Talmon, R. and Coifman, R. R. (2015). Intrinsic modeling of stochastic dynamical
systems using empirical geometry. Applied and Computational Harmonic Analysis, 39
138–160.
Tukey, J. W. (1984). Styles of spectrum analysis. In A Celebration in Geophysics and
Oceanography-1982, in Honor of Walter Munk, La Jolla, CA 1143–1153.
Von Luxburg, U., Belkin, M. and Bousquet, O. (2008). Consistency of spectral
clustering. The Annals of Statistics 555–586.
Wiener, N. (1930). Generalized harmonic analysis. Acta mathematica, 55 117–258.
Witten, I. H. and Bell, T. (1991). The zero-frequency problem. Information Theory,
IEEE Transactions on, 37 1085–1094.
31