
CMSE 890-001: Spectral Graph Theory and Related Topics, MSU, Spring 2021

Lecture 21: Random Walks on Graphs, Part II
April 1, 2021

Lecturer: Matthew Hirn

28.3 Stationary distribution and mixing time

It turns out that no matter what distribution $p_0$ we start with, the lazy random walk on a connected graph $G = (V,E,w)$ will always converge to the stationary distribution $\pi : V \to \mathbb{R}$, which is defined as
\[
\pi = \frac{d}{\|d\|_1}, \qquad \text{where} \quad \|d\|_1 = \sum_{a \in V} |d(a)| = \sum_{a \in V} d(a),
\]
since $d(a) \geq 0$ for all $a \in V$.
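For a concrete illustration (not an example from the notes): on the unweighted path graph with three vertices, $d = (1,2,1)$ and $\|d\|_1 = 4$, so $\pi = (1/4, 1/2, 1/4)$; the walk settles into spending twice as much time at the middle vertex as at either endpoint.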

First, observe that $d$ is an eigenvector of $\widetilde{W}$ with eigenvalue $\omega_1 = 1$:
\[
\widetilde{W} d = \left( I - \tfrac{1}{2} D^{1/2} N D^{-1/2} \right) d = d - \tfrac{1}{2} D^{1/2} N d^{1/2} = d,
\]
since $d^{1/2}$ is an eigenvector of $N$ with eigenvalue $\nu_1 = 0$. Thus $\pi$ is an eigenvector of $\widetilde{W}$ with eigenvalue $\omega_1 = 1$. Since all the other eigenvalues of $\widetilde{W}$ are non-negative and strictly less than one, we will see that $p_t = \widetilde{W}^t p_0$ converges to $\pi$ as $t \to \infty$. The following theorem gives a precise statement.

Theorem 54. Let $G = (V,E,w)$ be connected, let $p_0 : V \to \mathbb{R}$ be a probability distribution, and set
\[
p_t = \widetilde{W}^t p_0, \qquad t \geq 0.
\]
Then,
\[
\|p_t - \pi\| \leq \omega_2^t \left( \frac{d_{\max}}{d_{\min}} \right)^{1/2} \|p_0\|.
\]

Proof. The main difficulty of this proof is that the eigenvectors of $\widetilde{W}$ are not orthogonal. So let $\phi_1, \dots, \phi_n$ be an orthonormal basis of eigenvectors of $N$, which means they are an orthonormal basis of eigenvectors of $A$. By Theorem 53, we know that $D^{1/2}\phi_1, \dots, D^{1/2}\phi_n$ are eigenvectors of $\widetilde{W}$. While they are not orthonormal, they do form a basis for $\mathbb{R}^n$. Therefore we can write
\[
p_0 = \sum_{i=1}^n \alpha_i D^{1/2} \phi_i,
\]
for some unique coefficients $\alpha_i$. This in turn implies
\[
D^{-1/2} p_0 = \sum_{i=1}^n \alpha_i \phi_i.
\]
Now since $\phi_1, \dots, \phi_n$ does form an orthonormal basis, we know that
\[
\alpha_i = \langle D^{-1/2} p_0, \phi_i \rangle.
\]
In particular,
\[
\alpha_1 = \langle D^{-1/2} p_0, \phi_1 \rangle = (D^{-1/2} p_0)^T \frac{d^{1/2}}{\|d^{1/2}\|} = p_0^T D^{-1/2} \frac{d^{1/2}}{\|d^{1/2}\|} = \frac{p_0^T \mathbf{1}}{\|d^{1/2}\|} = \frac{1}{\|d^{1/2}\|}.
\]

Now we compute
\[
\begin{aligned}
p_t = \widetilde{W}^t p_0
&= \widetilde{W}^t \sum_{i=1}^n \alpha_i D^{1/2} \phi_i
 = \sum_{i=1}^n \alpha_i \widetilde{W}^t D^{1/2} \phi_i
 = \sum_{i=1}^n \omega_i^t \alpha_i D^{1/2} \phi_i \\
&= \alpha_1 D^{1/2} \phi_1 + \sum_{i=2}^n \omega_i^t \alpha_i D^{1/2} \phi_i \\
&= \frac{D^{1/2} \phi_1}{\|d^{1/2}\|} + \sum_{i=2}^n \omega_i^t \alpha_i D^{1/2} \phi_i \\
&= \frac{1}{\|d^{1/2}\|} D^{1/2} \frac{d^{1/2}}{\|d^{1/2}\|} + \sum_{i=2}^n \omega_i^t \alpha_i D^{1/2} \phi_i \\
&= \frac{d}{\|d\|_1} + \sum_{i=2}^n \omega_i^t \alpha_i D^{1/2} \phi_i
 = \pi + \sum_{i=2}^n \omega_i^t \alpha_i D^{1/2} \phi_i.
\end{aligned}
\]
It follows that
\[
p_t - \pi = \sum_{i=2}^n \omega_i^t \alpha_i D^{1/2} \phi_i
\quad \Longrightarrow \quad
D^{-1/2}(p_t - \pi) = \sum_{i=2}^n \omega_i^t \alpha_i \phi_i.
\]

Thus,
\[
\|D^{-1/2}(p_t - \pi)\|^2 = \sum_{i=2}^n \omega_i^{2t} |\alpha_i|^2
\leq \omega_2^{2t} \sum_{i=2}^n |\alpha_i|^2
\leq \omega_2^{2t} \sum_{i=1}^n |\alpha_i|^2
= \omega_2^{2t} \|D^{-1/2} p_0\|^2.
\]

To finish the proof, we observe that for any $x : V \to \mathbb{R}$,
\[
\|D^{-1/2} x\|^2 = \sum_{a \in V} \frac{1}{d(a)} x(a)^2 \geq \frac{1}{d_{\max}} \sum_{a \in V} x(a)^2 = \frac{\|x\|^2}{d_{\max}},
\]
and
\[
\|D^{-1/2} x\|^2 = \sum_{a \in V} \frac{1}{d(a)} x(a)^2 \leq \frac{1}{d_{\min}} \sum_{a \in V} x(a)^2 = \frac{\|x\|^2}{d_{\min}}.
\]

Therefore,
\[
\|p_t - \pi\|^2 \leq d_{\max} \, \|D^{-1/2}(p_t - \pi)\|^2 \leq \omega_2^{2t} \, d_{\max} \, \|D^{-1/2} p_0\|^2 \leq \omega_2^{2t} \, d_{\max} \, \frac{\|p_0\|^2}{d_{\min}},
\]
and so:
\[
\|p_t - \pi\| \leq \omega_2^t \left( \frac{d_{\max}}{d_{\min}} \right)^{1/2} \|p_0\|.
\]
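To make the bound concrete, here is a minimal numerical sketch of Theorem 54 (assuming NumPy and NetworkX are available; the graph, its size, and all variable names are illustrative choices, not part of the notes). It builds the lazy walk matrix $\widetilde{W} = \tfrac{1}{2}(I + M D^{-1})$, the stationary distribution $\pi = d/\|d\|_1$, and $\omega_2 = 1 - \nu_2/2$ from the normalized Laplacian, then checks the inequality along the trajectory $p_t = \widetilde{W}^t p_0$.

    # Minimal numerical check of Theorem 54 (illustrative sketch, not from the notes).
    import numpy as np
    import networkx as nx

    # A small connected, non-regular graph; any connected weighted graph would do.
    G = nx.connected_watts_strogatz_graph(25, 4, 0.3, seed=0)

    M = nx.to_numpy_array(G)                     # (weighted) adjacency matrix
    d = M.sum(axis=0)                            # degrees d(a)
    n = len(d)

    W = M @ np.diag(1.0 / d)                     # column-stochastic walk matrix W = M D^{-1}
    W_lazy = 0.5 * (np.eye(n) + W)               # lazy walk \widetilde{W}
    pi = d / d.sum()                             # stationary distribution \pi = d / ||d||_1

    # omega_2 = 1 - nu_2 / 2, where nu_2 is the second eigenvalue of N
    N = np.eye(n) - np.diag(d**-0.5) @ M @ np.diag(d**-0.5)
    nu2 = np.sort(np.linalg.eigvalsh(N))[1]
    omega2 = 1.0 - nu2 / 2.0

    p = np.zeros(n); p[0] = 1.0                  # start the walk at a single vertex
    C = np.sqrt(d.max() / d.min()) * np.linalg.norm(p)
    for t in range(1, 51):
        p = W_lazy @ p
        assert np.linalg.norm(p - pi) <= omega2**t * C + 1e-12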

It follows that the closer $\omega_2$ is to $0$, the faster $p_t$ converges to $\pi$. Recall that $\omega_2 = 1 - \nu_2/2$. Theorem 54 says:
\[
\|p_t - \pi\| \leq \left( 1 - \frac{\nu_2}{2} \right)^t \left( \frac{d_{\max}}{d_{\min}} \right)^{1/2} \|p_0\|.
\]

Now remember that $\nu_2$ measures the connectivity of $G$ since, by Theorems 45 and 51, we have
\[
\frac{\nu_2}{2} \leq \varphi_G \leq \sqrt{2 \nu_2},
\]
where $\varphi_G$ is the conductance of $G$. The upper bound on $\varphi_G$ (Cheeger's inequality, Theorem 51) can be used to rewrite Theorem 54 one more time as:
\[
\|p_t - \pi\| \leq \left( 1 - \frac{\varphi_G^2}{4} \right)^t \left( \frac{d_{\max}}{d_{\min}} \right)^{1/2} \|p_0\|.
\]


Unfortunately, since $\varphi_G \leq 1$, the bound obtained this way is a little too loose, and it is better to consider $\nu_2$ (equivalently $\omega_2$) directly as the measure of connectivity of $G$.

Thus graphs that are well connected (meaning that $\nu_2$ is large, or equivalently $\omega_2$ is small) will have $p_t$ converge quickly. Indeed, if $G$ is well connected then there will be many paths from any vertex $a$ to any vertex $b$, and many realizations of the lazy random walk will get from $a$ to $b$ in a small number of steps. Thus the lazy random walk will traverse the graph quickly (on average) and in turn converge to $\pi$ quickly. On the other hand, graphs that are not well connected will have small $\nu_2$ (equivalently, large $\omega_2$) and the bound for the convergence of $p_t$ to $\pi$ will be weak. In fact, for many such graphs convergence will indeed be slow. For example, if there exist pairs of vertices $a, b \in V$ for which there are only a few paths from $a$ to $b$, then only a few realizations of the lazy random walk will get from $a$ to $b$ in a small number of steps, and most realizations will take many steps to get from $a$ to $b$. As such, the convergence of $p_t$ to $\pi$ will be slow.

To quantify the number of steps it takes for a walk to mix through $G$, let us define the mixing time of the lazy random walk as the minimum value of $t$ for which
\[
\|p_t - \pi\| \leq \frac{\|\pi\|}{2}.
\]

Applying Theorem 54, we see the mixing time $t$ is no more than the $t$ for which
\[
\begin{aligned}
\|p_t - \pi\| \leq \omega_2^t \left( \frac{d_{\max}}{d_{\min}} \right)^{1/2} \|p_0\| &\leq \frac{\|\pi\|}{2} \\
\Longleftrightarrow \quad \left( 1 - \frac{\nu_2}{2} \right)^t \left( \frac{d_{\max}}{d_{\min}} \right)^{1/2} \|p_0\| &\leq \frac{\|\pi\|}{2} \\
\Longleftrightarrow \quad \left( 1 - \frac{\nu_2}{2} \right)^t &\leq \left( \frac{d_{\min}}{d_{\max}} \right)^{1/2} \frac{\|\pi\|}{2 \|p_0\|}.
\end{aligned}
\]

Since $1 - x \leq e^{-x}$, we have $(1-x)^t \leq e^{-xt}$. As such, we can again upper bound the mixing time $t$ via:
\[
\begin{aligned}
\left( 1 - \frac{\nu_2}{2} \right)^t \leq e^{-\nu_2 t / 2} &\leq \left( \frac{d_{\min}}{d_{\max}} \right)^{1/2} \frac{\|\pi\|}{2 \|p_0\|} \\
\Longleftrightarrow \quad -\frac{\nu_2 t}{2} &\leq \log \left[ \left( \frac{d_{\min}}{d_{\max}} \right)^{1/2} \frac{\|\pi\|}{2 \|p_0\|} \right] \\
\Longleftrightarrow \quad t &\geq -\frac{2}{\nu_2} \log \left[ \left( \frac{d_{\min}}{d_{\max}} \right)^{1/2} \frac{\|\pi\|}{2 \|p_0\|} \right] \\
\Longleftrightarrow \quad t &\geq \frac{2}{\nu_2} \log \left[ \left( \frac{d_{\max}}{d_{\min}} \right)^{1/2} \frac{2 \|p_0\|}{\|\pi\|} \right].
\end{aligned}
\]

We thus obtain an upper bound for the mixing time that is proportional to $1/\nu_2$. Thus if $G$ is well connected, and hence $\nu_2$ is large, then the mixing time will be small. On the other hand, if $G$ is not well connected (meaning $\nu_2$ is small), the mixing time could be large.


In the case when $G$ is $d$-regular, we can simplify the log term. Indeed, $d_{\max} = d_{\min} = d$ in this case and
\[
\|\pi\| = \frac{\|d\|}{\|d\|_1} = \frac{\sqrt{n}\, d}{n d} = \frac{1}{\sqrt{n}}.
\]
If in addition $p_0 = \delta_b$ for some $b$, then $\|p_0\| = 1$ and we obtain:
\[
t \geq \frac{2}{\nu_2} \log(2\sqrt{n}) = \frac{2}{\nu_2} \log\!\left( (4n)^{1/2} \right) = \frac{\log(4n)}{\nu_2}.
\]

In some graphs the $\log(4n)$ factor is too pessimistic and in fact the mixing time is $O(1/\nu_2)$.
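As a sanity check, the following sketch (assuming NumPy and NetworkX; the regular graph and its parameters are arbitrary illustrative choices) compares the empirical mixing time of the lazy walk started at a single vertex with the bound $\log(4n)/\nu_2$ derived above.

    # Empirical mixing time versus the bound log(4n)/nu_2 for a d-regular graph
    # (illustrative sketch; pick another seed if the sampled graph is disconnected).
    import numpy as np
    import networkx as nx

    n, d_reg = 100, 6
    G = nx.random_regular_graph(d_reg, n, seed=1)
    assert nx.is_connected(G)

    M = nx.to_numpy_array(G)
    deg = M.sum(axis=0)
    W_lazy = 0.5 * (np.eye(n) + M @ np.diag(1.0 / deg))   # lazy walk \widetilde{W}
    pi = deg / deg.sum()                                  # equals 1/n in every entry here

    N = np.eye(n) - np.diag(deg**-0.5) @ M @ np.diag(deg**-0.5)
    nu2 = np.sort(np.linalg.eigvalsh(N))[1]
    bound = np.log(4 * n) / nu2                           # upper bound from the notes

    p = np.zeros(n); p[0] = 1.0                           # p_0 = delta_b
    t = 0
    while np.linalg.norm(p - pi) > np.linalg.norm(pi) / 2:
        p = W_lazy @ p
        t += 1
    print(f"empirical mixing time: {t}, bound log(4n)/nu_2: {bound:.1f}")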

28.4 Diffusion

In this section we consider the operators
\[
P = W^T \qquad \text{and} \qquad \widetilde{P} := \widetilde{W}^T.
\]
Recall that $\widetilde{W}$ maps probability distributions to probability distributions, which is why it is sometimes referred to as a Markov operator. We also have $\widetilde{W}(a,b) = p(a \mid b)$, i.e., the probability of walking to $a$ given that you are located at $b$. On the other hand,
\[
\widetilde{P}(a,b) = \widetilde{W}^T(a,b) = p(b \mid a),
\]
that is, the probability of walking to $b$ given that you are at $a$. Since the column sums of $\widetilde{W}$ are equal to one, the row sums of $\widetilde{P}$ are equal to one. This means that $\widetilde{P}$ acts as a diffusion, or averaging, operator, since:

\[
\begin{aligned}
\widetilde{P} x(a) &= \widetilde{P}(a,a) x(a) + \sum_{b \in N(a)} \widetilde{P}(a,b) x(b) \\
&= \frac{1}{2} x(a) + \frac{1}{2} \sum_{b \in N(a)} P(a,b) x(b) \\
&= \frac{1}{2} x(a) + \frac{1}{2} \sum_{b \in N(a)} \frac{w(a,b)}{d(a)} x(b),
\end{aligned}
\]
and in particular if $G = (V,E)$ is not weighted, then
\[
\widetilde{P} x(a) = \frac{1}{2} \left[ x(a) + \frac{1}{d(a)} \sum_{b \in N(a)} x(b) \right] = \frac{1}{2} \left[ x(a) + \frac{1}{|N(a)|} \sum_{b \in N(a)} x(b) \right].
\]

Thus $\widetilde{P} x(a)$ replaces the value $x(a)$ at $a \in V$ with a weighted average of $x(a)$ and the values $x(b)$ for $b \in N(a)$.
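The following minimal sketch (assuming NumPy and NetworkX; the path graph and the spike signal are arbitrary illustrative choices) carries out one such averaging step on an unweighted graph.

    # One diffusion step \widetilde{P}x on an unweighted graph: replace x(a) by
    # (x(a) + average of x over the neighbours of a) / 2.
    import numpy as np
    import networkx as nx

    G = nx.path_graph(5)
    x = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # a spike at the middle vertex

    def diffuse(G, x):
        y = np.empty_like(x)
        for a in G.nodes:
            nbrs = list(G.neighbors(a))
            y[a] = 0.5 * x[a] + 0.5 * np.mean([x[b] for b in nbrs])
        return y

    print(diffuse(G, x))   # [0.  0.25 0.5  0.25 0. ]: the spike spreads to its neighbours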


Notice that $\widetilde{P}^t(a,b)$ is the probability of walking from $a$ to $b$ in exactly $t$ steps. We can use the family of matrices $\widetilde{P}^t$ to define the family of diffusion distances between $a$ and $b$:
\[
d_t(a,b) := \left( \sum_{c \in V} \left[ \widetilde{P}^t(a,c) - \widetilde{P}^t(b,c) \right]^2 \frac{1}{d(c)} \right)^{1/2}.
\]

Since
\[
\text{Row } a \text{ of } \widetilde{P}^t = \left( \widetilde{P}^t(a,c) \right)_{c \in V} = \delta_a^T \widetilde{P}^t = \delta_a^T (\widetilde{W}^T)^t = \delta_a^T (\widetilde{W}^t)^T = (\widetilde{W}^t \delta_a)^T,
\]
we see that
\[
d_t(a,b) = \left( \sum_{c \in V} \left[ \widetilde{W}^t \delta_a(c) - \widetilde{W}^t \delta_b(c) \right]^2 \frac{1}{d(c)} \right)^{1/2}.
\]

In other words, the diffusion distance measures the distance between $a$ and $b$ by measuring the overlap of the $t$-step random walk distribution for walks started at $a$ with the $t$-step random walk distribution for walks started at $b$. It provides an alternate distance to the shortest path (geodesic) distance between $a$ and $b$. In some cases the diffusion distance between $a$ and $b$ better reflects the clustering structure of the graph $G$; see for example Figure 34. It is also more robust to noise than the shortest path distance if the graph $G$ is generated from noisy data.
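Here is a minimal sketch of the diffusion distance computed straight from the definition (assuming NumPy and NetworkX; the barbell graph mirrors the setting of Figure 34, and all names are illustrative).

    # Diffusion distance d_t(a, b) from the rows of \widetilde{P}^t, weighted by 1/d(c).
    import numpy as np
    import networkx as nx

    G = nx.barbell_graph(6, 2)                 # two cliques joined by a short path
    M = nx.to_numpy_array(G)
    d = M.sum(axis=0)
    P_lazy = 0.5 * (np.eye(len(d)) + np.diag(1.0 / d) @ M)   # \widetilde{P} = (I + D^{-1} M)/2

    def diffusion_distance(a, b, t):
        Pt = np.linalg.matrix_power(P_lazy, t)
        return np.sqrt(np.sum((Pt[a] - Pt[b]) ** 2 / d))

    t = 8
    print(diffusion_distance(0, 1, t))              # same clique: small
    print(diffusion_distance(0, len(d) - 1, t))     # opposite cliques: much larger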

We can rewrite the diffusion distance in terms of the eigenvectors and eigenvalues of $\widetilde{P}$. Let us first observe that
\[
\widetilde{P} = \widetilde{W}^T = \frac{1}{2} (I + W^T) = \frac{1}{2} (I + D^{-1} M)
= \frac{1}{2} D^{-1/2} (I + D^{-1/2} M D^{-1/2}) D^{1/2} = D^{-1/2} (I/2 + A/2) D^{1/2}.
\]
On the other hand, since $I$ and $A$ are symmetric,
\[
\widetilde{W} = D^{1/2} (I/2 + A/2) D^{-1/2}.
\]

Thus the eigenvalues of $\widetilde{P}$ and $\widetilde{W}$ are the same, given by $1 = \omega_1 \geq \omega_2 \geq \cdots \geq \omega_n \geq 0$. Furthermore, if the eigenvectors of $A$ are $\phi_1, \dots, \phi_n$ (reminder, these are also the eigenvectors of $N$), then as we saw earlier the eigenvectors of $\widetilde{W}$ are $D^{1/2}\phi_1, \dots, D^{1/2}\phi_n$, but the eigenvectors of $\widetilde{P}$ are $D^{-1/2}\phi_1, \dots, D^{-1/2}\phi_n$. Since $\phi_1 = d^{1/2}$, it follows that the first eigenvector of $\widetilde{P}$ is $\mathbf{1}$ (although we could have verified this directly using that the row sums of $\widetilde{P}$ are equal to one):
\[
\widetilde{P} \mathbf{1} = \mathbf{1}.
\]
Denote the eigenvectors of $\widetilde{P}$ by
\[
\widetilde{\phi}_i := D^{-1/2} \phi_i, \qquad 1 \leq i \leq n.
\]


Figure 34: Illustration of the shortest path distance versus the diffusion distance. Here $G$ is a barbell type graph, with $a, b, c \in V$. The shortest path distance, as indicated by the red lines, is approximately the same between $a$ and $b$ as it is between $b$ and $c$. However, their diffusion distances, which are inversely proportional to the overlap of the shaded disks centered at each vertex, are very different. Indeed, due to the bottleneck in the barbell graph, the $t$-step distribution started from $a$ overlaps very little with the $t$-step distribution started at $b$. On the other hand, since $b$ and $c$ lie in the same cluster, their $t$-step distributions overlap significantly.

Define the diffusion map $\widetilde{\Phi}_t : V \to \mathbb{R}^{n-1}$ as
\[
\widetilde{\Phi}_t(a) := \left( \omega_2^t \widetilde{\phi}_2(a), \dots, \omega_n^t \widetilde{\phi}_n(a) \right).
\]
The diffusion map should remind you of the eigenvector graph embeddings we studied in Section 9, except that we use the eigenvectors of $\widetilde{P}$ instead of $L$ and we re-scale the eigenvector coordinates by $\omega_i^t$. The following theorem shows the diffusion distance between $a$ and $b$ is equal to the Euclidean distance between $\widetilde{\Phi}_t(a)$ and $\widetilde{\Phi}_t(b)$.

Theorem 55. Let $G = (V,E,w)$ be a connected graph. Then
\[
d_t(a,b) = \| \widetilde{\Phi}_t(a) - \widetilde{\Phi}_t(b) \|, \qquad \forall \, a, b \in V, \ \forall \, t \geq 0.
\]

I leave the proof to you as an upcoming homework exercise. Whereas the eigenvector embedding of the graph using the eigenvectors of $L$ preserved local relations between vertices in the graph, the diffusion map embedding preserves the diffusion distance. Additionally, since for a connected graph we have $1 > \omega_2 \geq \cdots \geq \omega_n \geq 0$, if the eigenvalues of $\widetilde{P}$ decay fast and/or if $t$ is large, we can approximate $d_t(a,b)$ by truncating $\widetilde{\Phi}_t(a)$ to only include the first $k$ entries (where $k$ depends on the desired upper bound on the error, the decay of the eigenvalues, and the size of $t$).
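While the proof is left as homework, the identity in Theorem 55 is easy to check numerically. The sketch below (assuming NumPy and NetworkX; the graph, the time $t$, and the vertex pair are illustrative) builds $\widetilde{\Phi}_t$ from the eigendecomposition of the symmetric matrix $I/2 + A/2$ and compares $\|\widetilde{\Phi}_t(a) - \widetilde{\Phi}_t(b)\|$ with $d_t(a,b)$ computed directly from $\widetilde{P}^t$.

    # Numerical check that d_t(a,b) = ||Phi_t(a) - Phi_t(b)|| (Theorem 55).
    import numpy as np
    import networkx as nx

    G = nx.barbell_graph(5, 3)
    M = nx.to_numpy_array(G)
    d = M.sum(axis=0)
    n = len(d)

    D_half = np.diag(np.sqrt(d))
    D_half_inv = np.diag(1.0 / np.sqrt(d))
    A = D_half_inv @ M @ D_half_inv               # normalized adjacency
    S = 0.5 * (np.eye(n) + A)                     # symmetric, same spectrum as \widetilde{P}

    omega, phi = np.linalg.eigh(S)                # eigenvalues in ascending order
    order = np.argsort(-omega)                    # reorder so omega_1 >= omega_2 >= ...
    omega, phi = omega[order], phi[:, order]
    phi_tilde = D_half_inv @ phi                  # eigenvectors of \widetilde{P}

    t = 4
    Phi_t = (omega[1:] ** t) * phi_tilde[:, 1:]   # row a holds \widetilde{\Phi}_t(a)

    P_lazy = D_half_inv @ S @ D_half              # \widetilde{P} = D^{-1/2} (I/2 + A/2) D^{1/2}
    Pt = np.linalg.matrix_power(P_lazy, t)

    a, b = 0, n - 1
    d_t = np.sqrt(np.sum((Pt[a] - Pt[b]) ** 2 / d))
    print(d_t, np.linalg.norm(Phi_t[a] - Phi_t[b]))   # these two values agree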

The diffusion distance and the diffusion map were introduced in [5]. In that paper they explore several different aspects of the diffusion map. We highlight here one part of that paper, related to clustering. The idea is that the diffusion distances provide a family of multiscale distances that reflect hierarchical clusters in a graph (e.g., a graph derived from data). Small times $t$ distinguish many of the clusters, while medium times $t$ merge together nearby clusters and large times $t$ collapse nearly all clusters together. See Figure 35 for an illustration.


Figure 35: Left: A 3-cluster data set in which two clusters are closer together than their distance from the third. Right: The matrix $\widetilde{P}^t$ for a graph $G = (V,E,w)$ derived from the data. (a) $t = 8$: The rows of $\widetilde{P}^t$, and hence the diffusion distance, distinguish the three clusters. (b) $t = 64$: The rows of $\widetilde{P}^t$ are nearly identical for the two closer clusters, but are still different from the rows of the vertices in the third cluster. Hence the diffusion distance effectively merges the closer two clusters together but keeps them separate from the third cluster. (c) $t = 1024$: All rows of $\widetilde{P}^t$ have nearly converged to the stationary distribution $\pi$ (see Theorem 54) and hence do not distinguish any of the clusters. Figure taken from [5].


References

[1] Daniel A. Spielman. Spectral and algebraic graph theory. Book draft, available at: http://cs-www.cs.yale.edu/homes/spielman/sagt/, 2019.

[2] Michael Perlmutter, Feng Gao, Guy Wolf, and Matthew Hirn. Geometric scattering networks on compact Riemannian manifolds. In Proceedings of The First Mathematical and Scientific Machine Learning Conference, Proceedings of Machine Learning Research, volume 107, pages 570–604, 2020.

[3] David I. Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.

[4] Stéphane Mallat. A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way. Academic Press, 3rd edition, 2008.

[5] Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21:5–30, 2006.
