
MATH-SHU 236
Data Representation and Diffusion Maps

Shuyang Ling

March 18, 2020

1 Distance

[Figure: data points sampled on two concentric circles, with a few marked points.]

Observation: in Euclidean distance, the marked point “∗” on one circle is closer to a point on the other circle than to points on its own circle.

Clusters: each circle forms one cluster.

What is wrong with it?

Use a diffusion-based distance and take advantage of the local information:

(a) If we place a heat source at a marked point, the heat is likely to diffuse faster to points on the same circle than to points on the other circle.

(b) Why? The effective resistance between two points on different circles could be higher than that between two points on the same circle.

Question: Can we derive a distance based on the heat diffusion process on a graph?

2 Random walks on networks

Given a weighted graph $G = (V, E, W)$, one can construct a random walk (a diffusion process) associated with this graph. The graph may be constructed from data points, or the dataset itself may already have a graph structure, e.g., a social network or a biological network.


We consider a random walk {Xt} on the vertices {1, · · · , n} as follows:

$$\mathbb{P}(X_{t+1} = j \mid X_t = i) = \frac{w_{ij}}{d_i}, \qquad d_i = \sum_{j=1}^{n} w_{ij}, \qquad \forall\, t \ge 0,\ 1 \le i, j \le n \tag{2.1}$$

where $t \in \mathbb{Z}_{\ge 0}$ is the time index and $X_t$ is the state of the random walk at time $t$. In other words, starting from a vertex, the random walker picks the next state with probability proportional to the edge weight.

Markov property: Note that the state $X_{t+1}$ depends only on $X_t$ and not on any information before time $t$. This is called the Markov property.

Question: Why does (2.1) define a probability distribution?

$$\sum_{j=1}^{n} \frac{w_{ij}}{d_i} = \frac{1}{d_i} \sum_{j=1}^{n} w_{ij} = 1, \qquad \forall\, 1 \le i \le n.$$

Definition 2.1. The Markov transition matrix P is defined by

$$P_{ij} = \mathbb{P}(X_{t+1} = j \mid X_t = i) = \frac{w_{ij}}{d_i}.$$

If written in matrix form, it holds that

$$P = D^{-1}W.$$
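As a quick sketch of this row normalization (in Python rather than the MATLAB used later in these notes; the small weight matrix below is made up for illustration):

```python
# Hypothetical weight matrix of a small weighted graph (illustration only).
W = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]

# Degrees d_i = sum_j w_ij, and P = D^{-1} W: divide each row by its row sum.
d = [sum(row) for row in W]
P = [[w / d[i] for w in row] for i, row in enumerate(W)]

# Each row of P is a probability distribution over the next state.
print(P[0])  # -> [0.0, 0.3333333333333333, 0.6666666666666666]
```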

2.1 Examples

Let’s simulate a random walk on the network.

0 10 20 30 40 50 60 70 80 90 100

1

2

3

4

5

6

7

8

Figure 1: Illustration of the network and first 100 states of the Markov chain. The randomwalker spends a long time before moving to the other complete graph.


Consider the following network: two complete graphs (with self-loops) glued by an edge.

$$A = \begin{bmatrix}
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1
\end{bmatrix}, \qquad
P = \begin{bmatrix}
1/4 & 1/4 & 1/4 & 1/4 & 0 & 0 & 0 & 0 \\
1/4 & 1/4 & 1/4 & 1/4 & 0 & 0 & 0 & 0 \\
1/4 & 1/4 & 1/4 & 1/4 & 0 & 0 & 0 & 0 \\
1/5 & 1/5 & 1/5 & 1/5 & 1/5 & 0 & 0 & 0 \\
0 & 0 & 0 & 1/5 & 1/5 & 1/5 & 1/5 & 1/5 \\
0 & 0 & 0 & 0 & 1/4 & 1/4 & 1/4 & 1/4 \\
0 & 0 & 0 & 0 & 1/4 & 1/4 & 1/4 & 1/4 \\
0 & 0 & 0 & 0 & 1/4 & 1/4 & 1/4 & 1/4
\end{bmatrix}$$

Starting from $X_0 = 1$, we simulate one trajectory of $\{X_t\}_{t \ge 0}$ based on the transition probability:

Pij = P(Xt+1 = j|Xt = i).
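A minimal Python sketch of this simulation (the notes use MATLAB; vertex labels here are 0-based, so $X_0 = 1$ becomes state 0):

```python
import random

# Adjacency matrix of the example: two complete graphs with self-loops,
# glued by the edge {4, 5} in the notes' 1-based labels ({3, 4} here).
A = [[1 if (i < 4 and j < 4) or (i >= 4 and j >= 4) or {i, j} == {3, 4} else 0
     for j in range(8)] for i in range(8)]
d = [sum(row) for row in A]
P = [[a / d[i] for a in row] for i, row in enumerate(A)]

random.seed(0)
X = [0]                                   # start the walk at vertex 0
for _ in range(100):
    # pick the next state with probability proportional to the edge weight
    X.append(random.choices(range(8), weights=P[X[-1]])[0])
print(X)
```

Plotting `X` against the step index reproduces the kind of trajectory shown in Figure 1: long stretches inside one clique, with rare crossings over the bridge.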

2.2 Spectra of transition matrix

Lemma 2.1. The leading eigenvalue and eigenvector of $P$ are $1$ and $\mathbf{1}_n$, respectively.

Proof: The proof is straightforward:

$$P\mathbf{1}_n = D^{-1}W\mathbf{1}_n = D^{-1}d = \mathbf{1}_n.$$

Question: Why is $1$ a leading eigenvalue, i.e., why does no eigenvalue of $P$ exceed $1$?

Question: When does the transition matrix have a unique leading eigenvalue, i.e., $\lambda_1(P) = 1$ with multiplicity 1?

The answer follows from our result for the Laplacian. Recall that

$$P = D^{-1}W \iff D^{1/2} P D^{-1/2} = D^{-1/2} W D^{-1/2} = I_n - L.$$

Let $\lambda_{\ell,\downarrow}(P)$ denote the eigenvalues of $P$ in descending order and $\lambda_{\ell,\uparrow}(L)$ the eigenvalues of $L$ in ascending order. Then

$$\lambda_{\ell,\downarrow}(P) + \lambda_{\ell,\uparrow}(L) = 1.$$

Lemma 2.2. The leading eigenvalue of P is unique if and only if the graph is connected.

2.3 Stationary distribution

Question: Suppose $\mathbb{P}(X_t = i) = \pi_i^t$. When does $X_{t+1}$ share the same distribution as $X_t$?

$$\mathbb{P}(X_{t+1} = i) = \sum_{j=1}^{n} \mathbb{P}(X_{t+1} = i, X_t = j) = \sum_{j=1}^{n} \mathbb{P}(X_{t+1} = i \mid X_t = j)\, \mathbb{P}(X_t = j) = \sum_{j=1}^{n} \pi_j^t P_{ji} = \pi_i^{t+1}.$$


If $\pi^t = \pi^{t+1}$, it holds that

$$\pi_i = \sum_{j=1}^{n} \pi_j P_{ji}, \qquad \forall\, 1 \le i \le n.$$

Definition 2.2. The row vector $\pi \in \mathbb{R}^{1 \times n}$ is called the stationary distribution if

$$\pi P = \pi, \qquad \sum_{i=1}^{n} \pi_i = 1, \qquad \pi \ge 0,$$

i.e., $\pi$ is the leading left eigenvector of $P$.

Question: does it exist? If so, is it unique?

In fact, the stationary distribution of a random walk on a finite network is proportional to the degree of each node.

Lemma 2.3. The stationary distribution is

$$\pi_i = \frac{d_i}{\mathrm{Vol}(G)}, \qquad \mathrm{Vol}(G) = \sum_{i=1}^{n} d_i = \sum_{i,j} w_{ij}$$

where Vol(G) is the volume of the graph G.

We leave the proof to the reader. This lemma directly implies that the stationary distribution of the example above is

$$\pi = \frac{1}{34}[4, 4, 4, 5, 5, 4, 4, 4].$$
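One can verify Lemma 2.3 numerically on this example; a small Python check (using the 0-based adjacency of Section 2.1) confirms that $\pi P = \pi$:

```python
# Stationary distribution pi_i = d_i / Vol(G) for the glued-clique example.
A = [[1 if (i < 4 and j < 4) or (i >= 4 and j >= 4) or {i, j} == {3, 4} else 0
     for j in range(8)] for i in range(8)]
deg = [sum(row) for row in A]             # [4, 4, 4, 5, 5, 4, 4, 4]
vol = sum(deg)                            # Vol(G) = 34
pi = [d / vol for d in deg]

P = [[a / deg[i] for a in row] for i, row in enumerate(A)]
# Left eigenvector check: (pi P)_j = sum_i pi_i P_{ij} should equal pi_j.
piP = [sum(pi[i] * P[i][j] for i in range(8)) for j in range(8)]
print(max(abs(piP[j] - pi[j]) for j in range(8)) < 1e-12)  # -> True
```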

Theorem 2.4 (“Space average equals time average.”). For a random walk on a finite connected graph with any initialization, it holds that

$$\lim_{N \to \infty} \frac{\#\{t : X_t = i,\ 1 \le t \le N\}}{N} = \pi_i.$$

We simulate the random walk in Figure 1 to illustrate this theorem. One trajectory is simulated, and we compute the frequency with which each state appears in the whole sequence. Figure 2 shows that the frequency $N_i(N)/N$ converges to $\pi_i$ as $N \to \infty$.
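The experiment can be reproduced with a short Monte Carlo run (a Python sketch of the MATLAB experiment; the exact frequencies depend on the random seed):

```python
import random

# Long trajectory on the glued-clique chain; empirical visit frequencies
# should approach pi = d / Vol(G) = [4,4,4,5,5,4,4,4] / 34.
A = [[1 if (i < 4 and j < 4) or (i >= 4 and j >= 4) or {i, j} == {3, 4} else 0
     for j in range(8)] for i in range(8)]
deg = [sum(row) for row in A]
P = [[a / deg[i] for a in row] for i, row in enumerate(A)]
pi = [d / sum(deg) for d in deg]

random.seed(1)
N = 50_000
x, counts = 0, [0] * 8
for _ in range(N):
    x = random.choices(range(8), weights=P[x])[0]
    counts[x] += 1
freq = [c / N for c in counts]
# Maximum deviation from pi is small for a long trajectory.
print(max(abs(f - p) for f, p in zip(freq, pi)))
```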

3 Diffusion on the graph

Consider a random walker starting from $X_0 = i$: what is the probability $\mathbb{P}(X_t = j \mid X_0 = i)$? Note that

$$\mathbb{P}(X_t = j \mid X_0 = i)$$

defines a probability mass function associated with the state $i$. This probability mass function characterizes the diffusion of one particle moving on the graph from a given initial position.

Question: How do we compute $\mathbb{P}(X_t = j \mid X_0 = i)$?


Figure 2: Frequency $N_i(N)/N$ of each state, where $N_i(N) := \#\{t : X_t = i,\ 1 \le t \le N\}$.

Theorem 3.1. The probability is given by

$$\mathbb{P}(X_t = j \mid X_0 = i) = (e_i^\top P^t)_j$$

where $e_i$ is the $i$th vector of the canonical basis in $\mathbb{R}^n$. This is exactly the $(i,j)$-entry of $P^t$.

Proof: We prove it by induction. If $t = 0$, then

$$\mathbb{P}(X_0 = j \mid X_0 = i) = \begin{cases} 1, & j = i, \\ 0, & j \ne i. \end{cases}$$

In other words, $\mathbb{P}(X_0 \mid X_0 = i) = e_i^\top$ holds. Now assume the result holds for $0, \cdots, t-1$, i.e.,

$$\mathbb{P}(X_s \mid X_0 = i) = e_i^\top P^s, \qquad 0 \le s \le t - 1.$$

Then

$$\begin{aligned}
\mathbb{P}(X_t = j \mid X_0 = i) &= \sum_{k=1}^{n} \mathbb{P}(X_t = j, X_{t-1} = k \mid X_0 = i) \\
&= \sum_{k=1}^{n} \frac{\mathbb{P}(X_t = j, X_{t-1} = k, X_0 = i)}{\mathbb{P}(X_{t-1} = k, X_0 = i)} \cdot \frac{\mathbb{P}(X_{t-1} = k, X_0 = i)}{\mathbb{P}(X_0 = i)} \\
&= \sum_{k=1}^{n} \mathbb{P}(X_t = j \mid X_{t-1} = k, X_0 = i)\, \mathbb{P}(X_{t-1} = k \mid X_0 = i) \\
&= \sum_{k=1}^{n} \mathbb{P}(X_t = j \mid X_{t-1} = k)\, \mathbb{P}(X_{t-1} = k \mid X_0 = i) \\
&= \sum_{k=1}^{n} (e_i^\top P^{t-1})_k P_{kj} \\
&= \sum_{k=1}^{n} (P^{t-1})_{ik} P_{kj} = (P^t)_{ij}
\end{aligned}$$

where the fourth equality follows from the Markov property.
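Concretely, $e_i^\top P^t$ can be computed by $t$ repeated vector–matrix products; a Python sketch on the glued-clique example of Section 2.1 also shows the row of $P^t$ approaching the stationary distribution $\pi$:

```python
# Compute P(X_t = . | X_0 = 0) = e_0^T P^t by iterating p <- p P.
A = [[1 if (i < 4 and j < 4) or (i >= 4 and j >= 4) or {i, j} == {3, 4} else 0
     for j in range(8)] for i in range(8)]
deg = [sum(row) for row in A]
P = [[a / deg[i] for a in row] for i, row in enumerate(A)]
pi = [d / sum(deg) for d in deg]

p = [1.0] + [0.0] * 7                     # e_0: the walk starts at vertex 0
for _ in range(100):                      # t = 100 steps
    p = [sum(p[i] * P[i][j] for i in range(8)) for j in range(8)]

# p is now the 0th row of P^100; it is already very close to pi.
print(max(abs(p[j] - pi[j]) for j in range(8)) < 1e-3)  # -> True
```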


3.1 Simulation of the diffusion process

Now we pick two points on the concentric circles and compute the diffusion starting from these two initializations by studying the conditional probability

$$p_t(i, k) := \mathbb{P}(X_t = k \mid X_0 = i) = (e_i^\top P^t)_k.$$

The colorbar in the following figures reflects the magnitude of $p_t(i, k)$ at time $t$. In this case, we construct the weight matrix using the Gaussian kernel function:

$$w_{ij} = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right).$$

We choose the Gaussian kernel because the actual diffusion is also characterized by a Gaussian function.
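For instance, a Gaussian-kernel weight matrix for a handful of made-up 2D points ($\sigma = 1$ is an arbitrary choice here) can be built as:

```python
import math

# Hypothetical 2D points; sigma is a bandwidth chosen for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
sigma = 1.0

def gaussian_weight(x, y):
    """w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))"""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * sigma ** 2))

W = [[gaussian_weight(p, q) for q in points] for p in points]
# Nearby points get weights near 1; far-away points get weights near 0.
print(round(W[0][1], 4), round(W[0][3], 6))  # -> 0.6065 0.000123
```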

Figure 3: Diffusion on graphs from two initializations.

Figure 4: Transition matrix at time t = 1.


Figure 5: Transition matrix at time t = 10.

Figure 6: Transition matrix at time t = 100.

Figure 7: Transition matrix at time t = 200.

4 Diffusion maps

How do we define a “notion” of distance between two nodes on a graph? Diffusion maps [1] provide a method to compute the distance between nodes based on the diffusion process on the network.


Figure 8: Transition matrix at time t = 1000.

The distance between states $i$ and $j$ is derived from the probability measures of this Markov chain at time $t$ starting from $i$ and $j$:

$$p_t(i, k) = \mathbb{P}(X_t = k \mid X_0 = i) = (e_i^\top P^t)_k$$

and the diffusion distance is obtained by

$$D_t(i, j) = \sqrt{\sum_{k=1}^{n} \frac{1}{\pi_k} \left| p_t(i, k) - p_t(j, k) \right|^2}, \qquad \pi_k = d_k / \mathrm{Vol}(G).$$

In matrix form, we have

$$D_t^2(i, j) = \mathrm{Vol}(G)\, (e_i - e_j)^\top P^t D^{-1} (P^t)^\top (e_i - e_j).$$

It is the weighted $\ell_2$ distance between the probability measures $\mathbb{P}(X_t \mid X_0 = i)$ and $\mathbb{P}(X_t \mid X_0 = j)$.

1. It reflects the connectivity of the data at any given time scale $t$. As $t \to \infty$, the diffusion distance between any two points vanishes.

2. The distance $D_t(i, j)$ sums over all paths of length $t$ connecting $i$ to $j$ as well as $j$ to $i$. Thus it is more robust than the shortest-path (geodesic) distance.
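A direct Python computation of $D_t$ from this definition, on the glued-clique example of Section 2.1 (labels 0-based), shows that vertices in the same clique are closer in diffusion distance than vertices in different cliques:

```python
# Diffusion distance D_t(i,j)^2 = sum_k (1/pi_k)(p_t(i,k) - p_t(j,k))^2.
A = [[1 if (i < 4 and j < 4) or (i >= 4 and j >= 4) or {i, j} == {3, 4} else 0
     for j in range(8)] for i in range(8)]
deg = [sum(row) for row in A]
vol = sum(deg)
P = [[a / deg[i] for a in row] for i, row in enumerate(A)]

def p_t(i, t):
    """Row i of P^t, i.e. p_t(i, .) = e_i^T P^t."""
    p = [1.0 if k == i else 0.0 for k in range(8)]
    for _ in range(t):
        p = [sum(p[a] * P[a][b] for a in range(8)) for b in range(8)]
    return p

def diffusion_distance(i, j, t):
    pi_row, pj_row = p_t(i, t), p_t(j, t)
    return sum((vol / deg[k]) * (pi_row[k] - pj_row[k]) ** 2
               for k in range(8)) ** 0.5

# Vertices 0 and 1 lie in the same clique; 0 and 7 in different cliques.
print(diffusion_distance(0, 1, 3) < diffusion_distance(0, 7, 3))  # -> True
```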

We present an example here by looking at how $P^t$ evolves with time $t$, as shown in Figures 9 and 10.

Theorem 4.1. The diffusion distance satisfies

$$D_t^2(i, j) = \mathrm{Vol}(G) \sum_{\ell=1}^{n} \lambda_\ell^{2t} \left( \psi_\ell(i) - \psi_\ell(j) \right)^2$$

where $\psi_\ell$ is the $\ell$th eigenvector of $P$, i.e.,

$$P\psi_\ell = \lambda_\ell \psi_\ell, \qquad 1 \le \ell \le n, \qquad 1 \ge \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge -1,$$

and $\psi_\ell(i)$ is the $i$th entry of $\psi_\ell$.


Figure 9: t-step transition matrix P^t.


Figure 10: t-step transition matrix P^t.

Proof: Note that

$$D_t^2(i, j) = \mathrm{Vol}(G)\, (e_i - e_j)^\top P^t D^{-1} (P^t)^\top (e_i - e_j)$$

since

$$e_i^\top P^t = [p_t(i, 1), \cdots, p_t(i, n)], \qquad \pi_k = \frac{d_k}{\mathrm{Vol}(G)}.$$

Let $\Psi = [\psi_1, \cdots, \psi_n]$ be the matrix consisting of all the eigenvectors of $P$. It holds that

$$P\Psi = \Psi\Lambda \iff P = \Psi\Lambda\Psi^{-1}$$


where Λ = diag(λ1, · · · , λn).

In fact, $\Psi$ satisfies $\Psi^\top D \Psi = I_n$. It follows from

$$P\Psi = \Psi\Lambda \iff D^{-1}W\Psi = \Psi\Lambda \iff D^{-1/2}WD^{-1/2}\, D^{1/2}\Psi = D^{1/2}\Psi\Lambda.$$

In other words, the columns of $D^{1/2}\Psi$ are the eigenvectors of the symmetric matrix $D^{-1/2}WD^{-1/2} = I_n - L$. Therefore,

$$\Psi^\top D \Psi = I_n, \qquad \Psi^{-1} D^{-1} \Psi^{-\top} = I_n,$$

since the eigenvectors of a symmetric matrix can be chosen orthonormal. Now we compute $D_t^2(i, j)$:

$$P^t D^{-1} (P^t)^\top = \Psi\Lambda^t\Psi^{-1} D^{-1} \Psi^{-\top}\Lambda^t\Psi^\top = \Psi\Lambda^{2t}\Psi^\top.$$

Therefore, the squared distance between $i$ and $j$ is given by the kernel $\Psi\Lambda^{2t}\Psi^\top$. Thus we have proven

$$D_t^2(i, j) = \mathrm{Vol}(G)\, (e_i - e_j)^\top \Psi\Lambda^{2t}\Psi^\top (e_i - e_j).$$

What does it mean? One can find an embedding of the data points via diffusion.

Definition 4.1 (Diffusion maps). The diffusion maps are defined as

$$\Psi_t(i) := \begin{bmatrix} \lambda_1^t \psi_1(i) \\ \lambda_2^t \psi_2(i) \\ \vdots \\ \lambda_n^t \psi_n(i) \end{bmatrix}, \qquad 1 \le i \le n$$

where $\psi_\ell$ is the $\ell$th eigenvector of $P$. Each component of $\Psi_t(i)$ is termed a diffusion coordinate.

Remark: It is important to note that the diffusion maps are time dependent, i.e., they depend on the time parameter $t$. This allows us to have a representation of the dataset at different time scales.

By construction of the diffusion maps, we have

$$D_t^2(i, j) = \mathrm{Vol}(G)\, \|\Psi_t(i) - \Psi_t(j)\|^2$$

where

$$\|\Psi_t(i) - \Psi_t(j)\|^2 = \sum_{\ell=2}^{n} \lambda_\ell^{2t} \left( \psi_\ell(i) - \psi_\ell(j) \right)^2$$

(the $\ell = 1$ term vanishes since $\psi_1$ is constant). In other words, we embed each vertex $i$ to $\Psi_t(i) \in \mathbb{R}^n$, whose Euclidean distance equals the diffusion distance up to the factor $\mathrm{Vol}(G)$.

Note that in most cases the eigenvalues $\lambda_\ell$ decay very fast, so only a few coordinates are needed. We therefore consider the truncated diffusion maps:

$$\Psi_{t,m}(i) = \begin{bmatrix} \lambda_1^t \psi_1(i) \\ \lambda_2^t \psi_2(i) \\ \vdots \\ \lambda_m^t \psi_m(i) \end{bmatrix}$$

where $m = m(t, \delta) = \max\{\ell : \lambda_\ell^t \ge \delta \lambda_2^t\}$ for some small $\delta > 0$.
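The truncation rule can be sketched in Python for a hypothetical spectrum (the eigenvalues below are made up; $\lambda_1 = 1$ is the trivial eigenvalue): as $t$ grows, fewer coordinates survive the threshold.

```python
# m(t, delta) = max{ l : lambda_l^t >= delta * lambda_2^t } for small delta.
lams = [1.0, 0.95, 0.90, 0.60, 0.30, 0.10]   # hypothetical spectrum
delta = 0.1

def m_trunc(t):
    thresh = delta * lams[1] ** t            # lambda_2 is lams[1] (0-based)
    return max(l + 1 for l, lam in enumerate(lams) if lam ** t >= thresh)

# The embedding dimension shrinks as the diffusion time t increases.
print(m_trunc(1), m_trunc(10), m_trunc(100))  # -> 6 3 2
```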


4.1 Examples

4.1.1 n-cycle graphs

Let us compute the diffusion maps of the $n$-cycle graph, whose transition matrix is

$$P = \begin{bmatrix}
0 & 1/2 & 0 & \cdots & 0 & 0 & 1/2 \\
1/2 & 0 & 1/2 & \cdots & 0 & 0 & 0 \\
0 & 1/2 & 0 & \cdots & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 0 & 1/2 & 0 \\
0 & 0 & 0 & \cdots & 1/2 & 0 & 1/2 \\
1/2 & 0 & 0 & \cdots & 0 & 1/2 & 0
\end{bmatrix}$$

The second and third eigenvectors $\psi_\ell$ of $P$ are

$$\psi_2 = \left[ \cos\left(\frac{2\pi}{n}\right), \cdots, \cos\left(\frac{2(n-1)\pi}{n}\right), 1 \right]^\top, \qquad \psi_3 = \left[ \sin\left(\frac{2\pi}{n}\right), \cdots, \sin\left(\frac{2(n-1)\pi}{n}\right), 0 \right]^\top$$

with $\lambda_2 = \lambda_3 = \cos(2\pi/n)$. As a result, the diffusion maps are given by

$$\Psi_{t,2}(i) = \cos^t\left(\frac{2\pi}{n}\right) \begin{bmatrix} \cos(2\pi i / n) \\ \sin(2\pi i / n) \end{bmatrix},$$

i.e., the vertices are embedded on a circle of radius $\cos^t(2\pi/n)$.
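This eigenpair claim is easy to check numerically; a Python sketch for $n = 12$ (an arbitrary choice):

```python
import math

# On the n-cycle, psi_2(k) = cos(2 pi k / n) should satisfy
# P psi_2 = cos(2 pi / n) psi_2.
n = 12
P = [[0.5 if (j - i) % n in (1, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]
psi2 = [math.cos(2 * math.pi * k / n) for k in range(1, n + 1)]
lam = math.cos(2 * math.pi / n)

Ppsi = [sum(P[i][j] * psi2[j] for j in range(n)) for i in range(n)]
print(all(math.isclose(Ppsi[i], lam * psi2[i], abs_tol=1e-12)
          for i in range(n)))  # -> True
```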

4.1.2 Swiss roll

Suppose the dataset has the shape of a “swiss roll”. We construct the weight matrix and the Markov transition matrix from the data, and then use the first two nontrivial eigenvectors of $P$ to embed the data points in 2D, as shown in Figure 11.

The toy example is generated by

n = 1000;                                 % number of samples (not specified in the notes)
t = 1;                                    % diffusion time (not specified in the notes)
X = sort(rand(n,1))*14;
Y = rand(n,1)*10;
data = [X.*cos(X) Y X.*sin(X)];           % swiss roll in 3D
sigma = 5;
pdist_data = squareform(pdist(data));     % pairwise Euclidean distances
W = exp(-pdist_data.^2/(2*sigma^2));      % Gaussian kernel weights
D = diag(sum(W));
P = D\W;                                  % transition matrix P = D^{-1}*W
[eigP_vec, eigP_val] = eig(P);
eigP_val = real(diag(eigP_val));
[~,id] = sort(abs(eigP_val),'descend');   % order the eigenpairs by |lambda|
Phi = real([eigP_val(id(1))^t*eigP_vec(:,id(1)), ...
    eigP_val(id(2))^t*eigP_vec(:,id(2)), eigP_val(id(3))^t*eigP_vec(:,id(3))]);

For the connection to commute-time distance and effective resistance, see [2] for more details.


Figure 11: Swiss roll and the embedded data in 2D.

References

[1] R. R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

[2] D. Spielman. Spectral Graph Theory. Lecture notes, 2019.
